The InTaVia Knowledge Graph (KB) combines data from four national biographies (Austria, Finland, the Netherlands and Slovenia) and links to reference resources such as Wikidata and Europeana. To make the data reproducible - in the sense of "check out the commit from YYYY-MM-DD and run script A to reproduce the state of the KB valid on that day" - and easier to handle - there is simply too much data for any manual curation and/or validity checks - we came up with a plugin system. This system is built around Prefect.io, a workflow orchestration system that allows building complex data processing pipelines and executing them on various events. While Prefect.io itself allows for various setups, our plugin system is deployed in the ACDH-CH Kubernetes cluster. Every job that gets submitted to the pipeline fires up a new container, which is torn down as soon as the job has finished. The workflows are stored in this repo and the up-to-date code is fetched on every run. This keeps the interaction of the InTaVia development team with the plugin system simple (every developer with access to the repo can update the plugins). This repo contains the various flows that have been developed so far, as well as a job template that allows us to pass Kubernetes secrets to the job itself.
In this section we briefly describe the purpose and structure of the flows.
We are currently working on a ShEx schema for validating the datasets before ingesting them. As soon as this is ready we will add a plugin that validates new (versions of) datasets against this schema and stops the ingestion process in case the file(s) do not validate.
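A validation step could look roughly like the following minimal sketch, assuming the pyshex library; the schema file, dataset file and focus node are placeholders until the actual ShEx schema and integration point are decided.

```python
# Minimal validation sketch, assuming pyshex; schema, data and focus node
# are placeholders until the actual ShEx schema is ready.
from pyshex import ShExEvaluator

with open("intavia.shex") as f:          # hypothetical schema file
    schema = f.read()
with open("dataset.ttl") as f:           # dataset to be ingested
    rdf_data = f.read()

results = ShExEvaluator(
    rdf=rdf_data,
    schema=schema,
    focus="http://www.intavia.eu/person/1",   # placeholder focus node
).evaluate()

if not all(r.result for r in results):
    raise ValueError("Dataset does not validate, stopping ingestion")
```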
The ingest workflows (only the mock-data one is currently published) download data from a given location (currently a GitHub repository) and upload it to a configurable triplestore in a configurable named graph.
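As a rough illustration, such an ingest flow could be sketched as follows, assuming Prefect 1.x and a triplestore that implements the SPARQL 1.1 Graph Store HTTP Protocol; all URLs and graph names are placeholders, not the actual plugin code.

```python
# Hypothetical sketch of an ingest flow (Prefect 1.x); URLs, endpoint and
# graph names are placeholders.
import requests
from prefect import task, Flow, Parameter

@task
def download_dataset(url: str) -> str:
    """Fetch the serialized RDF (e.g. Turtle) from the given location."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text

@task
def upload_to_named_graph(data: str, endpoint: str, named_graph: str) -> None:
    """POST the data into the configured named graph (SPARQL 1.1 Graph Store protocol)."""
    response = requests.post(
        endpoint,
        params={"graph": named_graph},
        data=data.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
    )
    response.raise_for_status()

with Flow("ingest-dataset") as flow:
    url = Parameter("url", default="https://raw.githubusercontent.com/InTaVia/example/main/dataset.ttl")
    endpoint = Parameter("endpoint", default="https://example.org/blazegraph/sparql")
    named_graph = Parameter("named_graph", default="http://intavia.eu/graphs/example")
    upload_to_named_graph(download_dataset(url), endpoint, named_graph)
```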
In the current setup the InTaVia Knowledge Graph uses a Blazegraph triplestore in quad mode (to allow for named graphs). Since Blazegraph does not support inference in quad mode, this plugin generates the inference triples and pushes them to the triplestore.
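The materialization step can be pictured as a SPARQL Update along the following lines; this is a minimal sketch that only covers RDFS subclass entailment and uses a placeholder endpoint, credentials and target graph, so the actual plugin may generate a different set of inferences.

```python
# Sketch of materializing inference triples via SPARQL Update; covers only
# rdfs:subClassOf entailment, all connection details are placeholders.
from SPARQLWrapper import SPARQLWrapper, POST, BASIC

UPDATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT {
  GRAPH <http://intavia.eu/graphs/inference> { ?s a ?superclass . }
}
WHERE {
  ?s a ?class .
  ?class rdfs:subClassOf+ ?superclass .
}
"""

sparql = SPARQLWrapper("https://example.org/blazegraph/sparql")  # placeholder endpoint
sparql.setMethod(POST)
sparql.setHTTPAuth(BASIC)
sparql.setCredentials("user", "password")
sparql.setQuery(UPDATE)
sparql.query()
```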
This plugin uses reference resource URIs (such as GND and Wikidata) to find the corresponding entity in wikidata.org. In a second step it uses the Wikidata object to retrieve missing reference resource URIs and adds them to the KB. This is an important step, as datasets very often use different reference resources to identify their entities: e.g. the Austrian data (ÖBL) uses GND identifiers, while BiographySampo uses Wikidata.
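The lookup against Wikidata can be sketched like this, assuming GND as the starting identifier; P227 is Wikidata's "GND ID" property and P245 its "ULAN ID" property, and the example GND ID is purely illustrative.

```python
# Illustrative lookup: find a Wikidata item via its GND ID and pick up further
# reference resource identifiers; the GND value is an example.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?item ?gnd ?ulan WHERE {
  VALUES ?gnd { "118540238" }        # GND ID taken from the proxy entity (example)
  ?item wdt:P227 ?gnd .              # find the Wikidata item via its GND ID
  OPTIONAL { ?item wdt:P245 ?ulan }  # retrieve additional identifiers (here: ULAN)
}
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="intavia-example")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["item"]["value"], binding.get("ulan", {}).get("value"))
```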
This workflow creates provided entity instances and reconciles proxy entities based on shared reference resource URIs, i.e. if two proxy entities are linked to the same reference resource URI (such as GND or Wikidata) they are connected to the same provided entity.
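Conceptually the reconciliation rule boils down to a grouping query like the following sketch; it assumes that proxies point to reference resources via owl:sameAs, while the actual predicate used in the IDM-RDF model may differ.

```python
# Conceptual sketch: find reference resource URIs shared by more than one
# proxy entity; each group should be attached to one common provided entity.
RECONCILE_QUERY = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?ref (GROUP_CONCAT(STR(?proxy); separator=" ") AS ?proxies)
WHERE {
  ?proxy owl:sameAs ?ref .
}
GROUP BY ?ref
HAVING (COUNT(DISTINCT ?proxy) > 1)
"""
```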
This plugin uses the Wikidata identifiers added by the "Person id linker" plugin to search Wikidata for persons and then downloads cultural heritage objects linked to these persons from Wikidata. Before ingesting the data into the InTaVia KB it converts it to the IDM-RDF data model. To avoid timeouts, the number of persons processed in parallel is configurable, as is the target named graph.
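The batching could be sketched as follows, assuming Prefect 1.x task mapping; parameter names and defaults are placeholders and the actual Wikidata query is omitted.

```python
# Sketch of the batching approach (Prefect 1.x); batch size and target graph
# are flow parameters, the CHO query itself is omitted.
from prefect import task, Flow, Parameter, unmapped

@task
def chunk(person_ids, batch_size):
    """Split the Wikidata QIDs into batches of the configured size."""
    return [person_ids[i:i + batch_size] for i in range(0, len(person_ids), batch_size)]

@task
def fetch_and_convert(batch, target_graph):
    """Fetch CHOs for one batch of persons from Wikidata, convert them to
    IDM-RDF and push them to the given named graph (omitted in this sketch)."""
    ...

with Flow("cho-ingest") as flow:
    person_ids = Parameter("person_ids", default=[])
    batch_size = Parameter("batch_size", default=20)
    target_graph = Parameter("target_graph", default="http://intavia.eu/graphs/cho")
    batches = chunk(person_ids, batch_size)
    fetch_and_convert.map(batches, unmapped(target_graph))  # batches run as mapped tasks in parallel
```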
This plugin uses the Wikidata identifiers added by the "Person id linker" plugin to search Wikidata for interpersonal relations connecting the people in the InTaVia dataset. The relations can be genealogical (parent, child, spouse, ...), educational (teacher, student, supervisor, ...), or career-related (co-worker, colleague, influencer, ...). The current implementation first queries the Wikidata identifiers from the InTaVia triplestore and then uses the found links to extract the relations from the Wikidata triplestore.
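An illustrative version of the second step is sketched below; the QIDs are examples and the property list is an assumption that only covers some of the relation types mentioned above.

```python
# Illustrative Wikidata query for interpersonal relations; the VALUES lists
# stand in for the QIDs found in the InTaVia triplestore and for whichever
# relation properties the plugin actually covers.
RELATIONS_QUERY = """
SELECT ?person ?relation ?related WHERE {
  VALUES ?person { wd:Q254 wd:Q255 }                 # example QIDs
  VALUES ?relation { wdt:P22 wdt:P25 wdt:P26         # father, mother, spouse
                     wdt:P40 wdt:P1066 wdt:P184 }    # child, student of, doctoral advisor
  ?person ?relation ?related .
}
"""
```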
This plugin uses the Getty Union List of Artist Names (ULAN) database to extract interpersonal relations of people in the InTaVia dataset. The relations can be genealogical (parent, child, spouse, ...), educational (teacher, student, supervisor, ...), or career-related (co-worker, colleague, influencer, ...). The current implementation first queries the Wikidata identifiers from the InTaVia triplestore, then queries Wikidata to obtain the ULAN IDs, and finally extracts the relations from the Getty triplestore.
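The intermediate step of resolving ULAN IDs via Wikidata could look like the following sketch (P245 is Wikidata's "ULAN ID" property, the QIDs are examples); the subsequent relation query against the Getty SPARQL endpoint at http://vocab.getty.edu/sparql is omitted here.

```python
# Sketch of step two: resolve ULAN IDs for known Wikidata QIDs and build the
# corresponding Getty ULAN URIs; the example QIDs are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

ULAN_QUERY = """
SELECT ?person ?ulan WHERE {
  VALUES ?person { wd:Q5598 wd:Q5582 }   # example QIDs from the InTaVia triplestore
  ?person wdt:P245 ?ulan .               # ULAN ID
}
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="intavia-example")
sparql.setQuery(ULAN_QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    ulan_uri = f"http://vocab.getty.edu/ulan/{row['ulan']['value']}"
    print(row["person"]["value"], ulan_uri)
```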
During work on the flows we came across several shortcomings of the structure we had in mind when designing the plugin system.
Every plugin (a flow in the Prefect sense) consists of several tasks that are triggered in a certain sequence and/or by certain events (such as the result of task A being B). However, a lot of these tasks are rather simple and generic, e.g. fetch a SPARQL query from location A and return it. Currently we copy those tasks between the flows (as our setup does not allow for a simple import), but we plan on packaging these generic tasks into a module which gets installed in every plugin; see the sketch below.
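Such a shared module could be as small as this sketch (module and function names are hypothetical); it would be installed into every plugin image instead of copying the task code between flows.

```python
# Hypothetical shared module (e.g. intavia_tasks.py) with generic tasks
# reused across the flows.
import requests
from prefect import task

@task
def fetch_sparql_query(location: str) -> str:
    """Fetch a SPARQL query from a location (e.g. a raw GitHub URL) and return its text."""
    response = requests.get(location)
    response.raise_for_status()
    return response.text
```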
Currently every plugin (flow) is triggered and executed on its own. However, we are working on a flow that controls all the other flows depending on the state of the Knowledge Graph and certain events. E.g.: if a new version of a dataset becomes available, it will start the ingestion plugin, after that the inference plugin, then the enrichment plugin etc. This will give us a better separation of concerns: the plugins themselves only need to care about changing the KB, not about when to run, while the "orchestration flow" only listens for events and triggers the plugins accordingly. It also ensures that dependencies between events need to be dealt with only in the orchestration flow.
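With Prefect 1.x such an orchestration flow could be sketched roughly as follows; flow and project names are placeholders, and the event handling is left out.

```python
# Rough sketch of the planned orchestration flow (Prefect 1.x); flow and
# project names are placeholders, event handling is omitted.
from prefect import Flow
from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

with Flow("orchestrate-kb-update") as flow:
    ingest = create_flow_run(flow_name="ingest-dataset", project_name="intavia")
    ingest_done = wait_for_flow_run(ingest, raise_final_state=True)

    inference = create_flow_run(
        flow_name="inference", project_name="intavia", upstream_tasks=[ingest_done]
    )
    inference_done = wait_for_flow_run(inference, raise_final_state=True)

    enrichment = create_flow_run(
        flow_name="person-id-linker", project_name="intavia", upstream_tasks=[inference_done]
    )
    wait_for_flow_run(enrichment, raise_final_state=True)
```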
You can add a custom job template either on the agent or on the job level. An agent-level template can be added to the agent start command:
prefect agent kubernetes start --job-template /intavia-job-template.yaml
Or you can add it to the run_config of a single flow:
flow.run_config = KubernetesRun(job_template_path="/intavia-job-template.yaml")
poetry install
poetry run python SCRIPT.PY
- Comment out the lines for configuring the flow on the Kubernetes cluster: flow.run_config & flow.storage
- Uncomment the line for running the flow locally: flow.run()
If you wish to serialize the data to a file instead of storing it in a named graph on the SPARQL server:
- Comment out the line for updating target graph.
- Uncomment the line for serializing graph into file.
RDFDB_USER=... RDFDB_PASSWORD=... poetry run python person_id_linker.py