GSoC 2016

Balázs Dukai edited this page Aug 23, 2016 · 11 revisions

The rpostgisLT package started as a Google Summer of Code (GSoC) project in 2016. This page summarizes my experience as a GSoC student and the main developer during this phase.

Mentoring organization of the Google Summer of Code 2016: R project for statistical computing

Mentors:

Developer: Balázs Dukai ([email protected])

As I wrote most of the package up to the end of GSoC 2016, the tagged version 0.3 represents my code contribution well. The complete list of my commits to rpostgisLT during the project (until 23.08.2016) can be found at: https://github.com/mablab/rpostgisLT/commits/master?author=balazsdukai

About me

My name is Balázs Dukai and I am currently a student in the MSc Geomatics programme at the Delft University of Technology. However, I took a bit of a detour before eventually ending up in the field of geomatics. First I studied and practiced landscape architecture, then I took several courses on data science and programming, with R being the first scripting language I learnt. After some exposure to geographical information systems (GIS) and land surveying, I decided to pursue a career in spatial data acquisition, management and analysis.

I only came across GSoC shortly before the application deadline in 2016, and I realized that I would not have enough time to work out more than one application thoroughly. As I am particularly interested in spatio-temporal data management, the rpostgisLT project was a natural choice for me.

The project plan and work flow

The initial project description defined three major goals:

  1. Define and implement a PostgreSQL data model to store the database equivalent of an ltraj, called a pgtraj;

  2. Write functions and compile them into an R package for seamless transitioning between pgtraj (PostgreSQL) and ltraj (R);

  3. Create a tool for visualizing and interactively processing pgtraj-es;

I planned the project accordingly, and my mentors and I agreed that the visualization part would be a stretch goal. At the initial stage we also agreed that the final goal of the project was to have a fully functional rpostgisLT package submitted to the Comprehensive R Archive Network (CRAN).

As two of my mentors reside in the USA, one in France, and I live in the Netherlands, there was unfortunately no possibility to meet in person. We therefore set up weekly online status meetings on Mondays, a GitLab repository (private, for internal development), a GitHub repository (public), a shared Google Drive folder for non-code files, and e-mail communication on demand. This gave us a relatively close collaboration, in which we used the status meetings to discuss design decisions and the current state of the work.

I strove to apply an agile development method, but I found that the size of the sprint targets usually did not match the length of the sprints. Furthermore, as the project evolved and I became more and more immersed in it, I managed the project less and less. As a result the project diverged from the original plan, which I never updated, but thanks to the clear project goals and the weekly meetings I could still succeed.

My contribution

As the foundation of the rpostgisLT package, I developed the PostgreSQL data model to store animal trajectories. A main requirement for the data model was that it had to align with the features of the existing ltraj R object class. Additionally, it had to take advantage of the features of PostgreSQL/PostGIS and allow the storage and visualization of a large number of trajectories in a database.

This was probably my favourite and, at the same time, the most difficult part of the project. Right at the beginning we spent more time on developing the database model than we had planned, but after seven revisions we ended up with a version that we considered appropriate. I then started writing the functions that operate on it and, unfortunately later rather than sooner, realized that the model had some major flaws. Although altering the model at a late stage of the project was intensive work, it was for the better, as the new model is more efficient and reliable (see the diagram below). For a more detailed description, see the package vignette The traj database model.

ER-diagram of the traj database model

After the database model was set, I immediately started working on the functions to load and extract data from it. Hence I wrote the three main functions of rpostgisLT:

  • as_pgtraj – load data from a database table;
  • ltraj2pgtraj – load data from an ltraj object;
  • pgtraj2ltraj – extract data into an ltraj object;

In the meantime we defined a set of use cases (see the package vignette Use cases) to identify and document a work flow that the package should be able to handle. Perhaps most importantly, the use cases also serve as test cases for the integration test of the package. This set of use cases practically became the backbone of the whole development process, so I strongly recommend defining them early on in a project.

The R object class ltraj contains several parameters that describe a trajectory and are computed from its location and time information. Because rpostgisLT is planned not only to store pgtraj-es in the database but also to perform operations on them (e.g. subset, interpolate), it was necessary to be able to recompute the ltraj parameters in the database. The criterion for success was a round trip: if an ltraj A1 is loaded into the database, its parameters are recomputed there, and the pgtraj is extracted to R as an ltraj A2, then comparing the two should result in A1 = A2. Fulfilling this requirement consumed a considerable amount of time, as I had to account for all the possible edge cases as well.
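The round-trip criterion can be illustrated with a minimal sketch. It is written in Python rather than the package's own R and SQL, and it is not the rpostgisLT implementation: the function names `step_parameters` and `same_trajectory` are hypothetical, the "database" is simulated by keeping only the coordinates and timestamps, and only a subset of the ltraj step parameters (dx, dy, dist, dt, R2n, abs.angle) is computed, without the edge-case handling mentioned above.

```python
import math
from datetime import datetime, timedelta


def step_parameters(locations):
    """Derive step parameters from (x, y, timestamp) tuples ordered by time.

    Hypothetical helper: the last relocation has no following step, so its
    step parameters are left as None (the analogue of NA in R).
    """
    x0, y0, _ = locations[0]
    rows = []
    for i, (x, y, t) in enumerate(locations):
        # Squared distance from the first relocation of the trajectory.
        row = {"x": x, "y": y, "date": t, "r2n": (x - x0) ** 2 + (y - y0) ** 2}
        if i + 1 < len(locations):
            nx, ny, nt = locations[i + 1]
            dx, dy = nx - x, ny - y
            row.update(
                dx=dx,
                dy=dy,
                dist=math.hypot(dx, dy),
                dt=(nt - t).total_seconds(),
                abs_angle=math.atan2(dy, dx),
            )
        else:
            row.update(dx=None, dy=None, dist=None, dt=None, abs_angle=None)
        rows.append(row)
    return rows


def same_trajectory(a, b, tol=1e-9):
    """Compare two parameter tables, allowing a small floating-point tolerance."""
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        for key, va in row_a.items():
            vb = row_b.get(key)
            if isinstance(va, float) and isinstance(vb, float):
                if not math.isclose(va, vb, abs_tol=tol):
                    return False
            elif va != vb:
                return False
    return True


t0 = datetime(2016, 8, 23, 12, 0, 0)
locations = [
    (0.0, 0.0, t0),
    (3.0, 4.0, t0 + timedelta(seconds=60)),
    (6.0, 8.0, t0 + timedelta(seconds=180)),
]

a1 = step_parameters(locations)                      # the original "ltraj"
stored = [(r["x"], r["y"], r["date"]) for r in a1]   # only raw data is kept
a2 = step_parameters(stored)                         # recomputed after the trip
print(same_trajectory(a1, a2))
```

The point of the sketch is the shape of the test: only locations and timestamps survive the trip, everything else is recomputed, and the comparison tolerates floating-point noise.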

Because the step parameters are only used for transferring a pgtraj to R and a pgtraj might also change in the database, the parameters are not stored in tables but calculated on demand with a view, <pgtraj_name>_parameters. The second view, <pgtraj_name>_step_geometry, can be loaded directly to QGIS to get a quick visual overview of the trajectory.
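The benefit of computing the parameters in a view rather than storing them can be shown with a small, self-contained sketch. It uses Python's built-in SQLite module instead of PostgreSQL/PostGIS, and the table `relocation`, its columns and the view `demo_parameters` are hypothetical names chosen for illustration; they are not the actual traj schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Hypothetical, simplified table: one row per relocation, t in seconds.
    CREATE TABLE relocation (id INTEGER PRIMARY KEY, x REAL, y REAL, t INTEGER);
    INSERT INTO relocation VALUES (1, 0, 0, 0), (2, 3, 4, 60), (3, 6, 8, 180);

    -- Step parameters are derived on demand by pairing each relocation
    -- with the next one; nothing derived is stored in a table.
    CREATE VIEW demo_parameters AS
    SELECT a.id, a.x, a.y, a.t,
           b.x - a.x AS dx,
           b.y - a.y AS dy,
           b.t - a.t AS dt
    FROM relocation AS a
    LEFT JOIN relocation AS b ON b.id = a.id + 1;
    """
)


def parameters(connection):
    """Read the step parameters from the view."""
    query = "SELECT id, dx, dy, dt FROM demo_parameters ORDER BY id"
    return connection.execute(query).fetchall()


before = parameters(con)

# Editing a relocation changes the derived parameters automatically,
# because the view is evaluated each time it is queried.
con.execute("UPDATE relocation SET x = 10 WHERE id = 2")
after = parameters(con)

print(before)
print(after)
```

Because the view is re-evaluated on every query, the parameters of the steps on both sides of the edited relocation change without any explicit recomputation step, which is what makes this design suitable for trajectories that are modified in the database.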

Finally I defined exported and non-exported functions that streamline the rpostgisLT work flow, namely:

  • pgTrajDB2TempT – helper function for as_pgtraj
  • pgTrajDrop – deletes a pgtraj from a traj schema
  • pgTrajSchema – creates a traj schema
  • pgTrajTempT – helper function for as_pgtraj
  • pgTrajVacuum – maintains a traj schema
  • pgTrajViewParams – creates the <pgtraj_name>_parameters view in a traj schema
  • pgTrajViewStepGeom – creates the <pgtraj_name>_step_geometry view in a traj schema

Unfortunately, I did not get to the point of publishing the package on CRAN, nor did I work on the visualization of trajectories. However, the final product is fully operational, and once the infolocs feature is implemented and the package is formally prepared, it will be ready to be published on CRAN. The trajectory visualization will likely be part of future development.

What I will take with me

GSoC as a whole was a very valuable experience for me. It was the first time that I carried out a software development project in a real(-istic) setting, and I learnt from every aspect of it. The product is definitely something that I am very proud of, and the experience gives me great confidence to venture further as a developer. I am very glad that I could see firsthand what open source development really means, from both the personal and the technical perspective.

At the end of the project I can say that I have gained advanced skills in data modeling, PostgreSQL/PostGIS and SQL, R scripting, package development, project management and communication, remote collaboration and open source development.