GSoC 2016
The `rpostgisLT` package started as a Google Summer of Code (GSoC) project in 2016. This page summarizes my experience as a GSoC student and main developer during this phase.
Mentoring organization of the Google Summer of Code 2016: R project for statistical computing
Mentors:
- Mathieu Basille
- David Bucklin
- Clément Calenge

Developer: Balázs Dukai
I wrote most of the package up to the end of GSoC 2016, so the tagged version 0.3 is a good summary of my code contribution. The complete list of my commits to `rpostgisLT` during the project (until 23.08.2016) can be found at: https://github.com/mablab/rpostgisLT/commits/master?author=balazsdukai
My name is Balázs Dukai and I am currently a student of the MSc Geomatics programme at the Delft University of Technology. However, I took a bit of a detour before eventually ending up in the field of geomatics. First I studied and practiced landscape architecture, then I took several courses on data science and programming, with R being the first scripting language I learnt. After some exposure to geographical information systems (GIS) and land surveying, I decided to pursue a career in spatial data acquisition, management and analysis.
I only came across GSoC shortly before the application deadline in 2016, and I realized that I would not have enough time to thoroughly work out more than one application. As I am particularly interested in spatio-temporal data management, the `rpostgisLT` project was a natural choice for me.
The initial project description defined three major goals:
- Define and implement a PostgreSQL data model to store the database equivalent of an `ltraj`, called a `pgtraj`;
- Write functions and compile them into an R package for seamless transitioning between `pgtraj` (PostgreSQL) and `ltraj` (R);
- Create a tool for visualizing and interactively processing `pgtraj`-es.
I planned the project accordingly, and my mentors and I agreed that the visualization part would be a stretch goal. At the initial stage we also agreed that the final goal of the project was to have a fully functional `rpostgisLT` package submitted to the Comprehensive R Archive Network (CRAN).
As two of my mentors reside in the USA, one in France, and I live in the Netherlands, there was unfortunately no possibility to meet in person. Therefore we set up weekly online status meetings on Mondays, a GitLab repository (private, for internal development), a GitHub repository (public), a shared Google Drive folder for non-code files, and e-mail communication on demand. This gave us a relatively close collaboration, in which we discussed design decisions and the current status of the work during the status meetings.
I strove to apply an agile development method, but I realized that in most cases the size of the sprint targets did not match the length of the sprints. Furthermore, as the project evolved and I became more and more immersed in it, I managed the project less and less. As a result the project deviated from the original plan, which was never updated, but thanks to the clear project goals and the weekly meetings I could still succeed.
As the foundation of the `rpostgisLT` package, I developed the PostgreSQL data model to store animal trajectories. A main requirement for the data model was that it had to align with the features of the existing `ltraj` R object class. Additionally, it should take advantage of the features of PostgreSQL/PostGIS and allow the storage and visualization of a large number of trajectories in a database.
This was probably my favourite and at the same time the most difficult part of the project. Right at the beginning we spent more time on developing the database model than we had planned, but after seven revisions we finally ended up with a version that we considered appropriate. I then started writing the functions that operate on it and, unfortunately later rather than sooner, I realized that the model had some major flaws. Although altering the model at a late stage of the project was intensive work, it was for the better, as the new model is more efficient and reliable (see the diagram below). For a more detailed description, see the package vignette "The traj database model".
After the database model was set, I immediately started working on the functions to load and extract data from it. Hence I wrote the three main functions of `rpostgisLT`:
- `as_pgtraj` – load data from a database table;
- `ltraj2pgtraj` – load data from an `ltraj` object;
- `pgtraj2ltraj` – extract data into an `ltraj` object.
In the meantime we defined a set of use cases (see the package vignette "Use cases") to identify and document a workflow that the package should be able to handle. Probably most importantly, the use cases also serve as test cases for the integration test of the package. This set of use cases practically became the backbone of the whole development process, so I strongly recommend defining them early on in a project.
The R object class `ltraj` contains several parameters that describe a trajectory and are computed from its location and time information. Because `rpostgisLT` is planned not only to store `pgtraj`-es in the database but also to perform operations on them (e.g. subset, interpolate), it was necessary to be able to recompute the `ltraj` parameters in the database. The criterion for success was that if an `ltraj` `A1` is loaded into the database, its parameters are recomputed, and the `pgtraj` is extracted to R as an `ltraj` `A2`, then comparing the two should result in `A1 = A2`. Fulfilling this requirement consumed a considerable amount of time, as I had to account for all the possible edge cases as well.
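To make the round-trip criterion more concrete, here is a minimal sketch in Python (used here only for illustration; the package itself works in R and SQL). It assumes the step parameters are the usual per-step quantities derived from coordinates and timestamps (displacement, distance, time lag, squared net displacement, absolute and relative angle). The function names `step_parameters` and `same_trajectory` are hypothetical and not part of `rpostgisLT`, and missing locations and zero-length steps are not handled.

```python
import math


def step_parameters(xs, ys, ts):
    """Derive per-relocation step parameters from coordinates and timestamps.

    Relocation i describes the step from i to i+1, so the last relocation
    has no step and gets None for all step-based values.
    """
    n = len(xs)
    rows = []
    prev_angle = None
    for i in range(n):
        row = {
            "x": xs[i],
            "y": ys[i],
            "t": ts[i],
            # squared net displacement from the first relocation
            "r2n": (xs[i] - xs[0]) ** 2 + (ys[i] - ys[0]) ** 2,
        }
        if i < n - 1:
            dx = xs[i + 1] - xs[i]
            dy = ys[i + 1] - ys[i]
            abs_angle = math.atan2(dy, dx)
            rel_angle = None
            if prev_angle is not None:
                # turning angle between consecutive steps, wrapped into (-pi, pi]
                rel_angle = abs_angle - prev_angle
                if rel_angle > math.pi:
                    rel_angle -= 2 * math.pi
                elif rel_angle <= -math.pi:
                    rel_angle += 2 * math.pi
            row.update(dx=dx, dy=dy, dist=math.hypot(dx, dy),
                       dt=ts[i + 1] - ts[i],
                       abs_angle=abs_angle, rel_angle=rel_angle)
            prev_angle = abs_angle
        else:
            row.update(dx=None, dy=None, dist=None, dt=None,
                       abs_angle=None, rel_angle=None)
        rows.append(row)
    return rows


def same_trajectory(a, b, tol=1e-9):
    """Round-trip check: every parameter must match within a tolerance."""
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        if row_a.keys() != row_b.keys():
            return False
        for key in row_a:
            va, vb = row_a[key], row_b[key]
            if va is None or vb is None:
                if va is not vb:
                    return False
            elif abs(va - vb) > tol:
                return False
    return True
```

In this sketch, `A1` would be the parameters computed on the R side and `A2` the parameters recomputed from the stored locations; the comparison uses a small tolerance because floating-point values rarely survive a database round trip bit for bit.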
Because the step parameters are only used for transferring a `pgtraj` to R, and a `pgtraj` might also change in the database, the parameters are not stored in tables but calculated on demand with a view, `<pgtraj_name>_parameters`. The second view, `<pgtraj_name>_step_geometry`, can be loaded directly into QGIS to get a quick visual overview of the trajectory.
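The idea of computing parameters on demand through a view can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in SQLite instead of PostgreSQL/PostGIS so that it runs without a server, and the table `relocation` and view `demo_parameters` are hypothetical names that do not reflect the actual `traj` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified storage: one row per relocation.
CREATE TABLE relocation (
    step_no INTEGER PRIMARY KEY,
    x REAL,
    y REAL,
    t INTEGER
);

-- Parameters are not stored; the view derives them from the next relocation.
CREATE VIEW demo_parameters AS
SELECT a.step_no,
       b.x - a.x AS dx,
       b.y - a.y AS dy,
       b.t - a.t AS dt
FROM relocation AS a
LEFT JOIN relocation AS b ON b.step_no = a.step_no + 1;
""")
conn.executemany(
    "INSERT INTO relocation VALUES (?, ?, ?, ?)",
    [(1, 0.0, 0.0, 0), (2, 3.0, 0.0, 10), (3, 3.0, 4.0, 30)],
)


def parameters(connection):
    """Read the derived parameters; they are recomputed on every query."""
    query = "SELECT step_no, dx, dy, dt FROM demo_parameters ORDER BY step_no"
    return connection.execute(query).fetchall()


before = parameters(conn)
# Changing a stored location needs no extra bookkeeping:
# the view reflects the edit the next time it is queried.
conn.execute("UPDATE relocation SET x = 6.0 WHERE step_no = 2")
after = parameters(conn)
```

Because nothing derived is stored, an edit to a relocation cannot leave stale parameters behind, which is the property the paragraph above relies on.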
Finally, I defined exported and non-exported functions that streamline the `rpostgisLT` workflow, namely:
- `pgTrajDB2TempT` – helper function for `as_pgtraj`
- `pgTrajDrop` – deletes a `pgtraj` from a `traj` schema
- `pgTrajSchema` – creates a `traj` schema
- `pgTrajTempT` – helper function for `as_pgtraj`
- `pgTrajVacuum` – maintains a `traj` schema
- `pgTrajViewParams` – creates the `<pgtraj_name>_parameters` view in a `traj` schema
- `pgTrajViewStepGeom` – creates the `<pgtraj_name>_step_geometry` view in a `traj` schema
Unfortunately, I did not get to the point of publishing the package on CRAN, nor did I work on the visualization of trajectories. However, the final product is fully operational, and after implementing the `infolocs` feature and formally preparing the package, it will be ready to be published on CRAN. The trajectory visualization will likely be part of future development.
GSoC as a whole was a very valuable experience for me. It was the first time that I carried out a software development project in a real(-istic) setting, and I learnt from every aspect of it. The product is definitely something that I am very proud of, and the experience gives me great confidence to venture further as a developer. I am very glad that I could see firsthand what open source development really means, from both the personal and the technical perspective.
At the end of the project I can say that I gained advanced skills in data modeling, PostgreSQL/PostGIS and SQL, R scripting, package development, project management and communication, remote collaboration and open source development.