A library that provides support for using the FFDQ framework for making data quality assertions.
The core of the library provides a set of model classes for serializing/deserializing Java Bean representations of FFDQ concepts to/from RDF in a number of formats. These model classes are located under the org.datakurator.ffdq.model packages.
Also provided is a set of utilities that make use of the model classes to generate rdf and run tests implemented in Java. These utilities, listed in order of intended use, are:
- test-util - when starting with a csv file that lists a set of standardized tests, run this utility as the first step to generate FFDQ RDF and/or Java classes with stub methods for implementing the tests. The output of this utility can be used with the test-runner described below.
- test-runner - takes the FFDQ RDF containing test metadata and a Java annotated class that implements the tests and runs each to produce an RDF report suitable for postprocessing.
- query-util - can be used with any of the RDF output from the above, along with the SPARQL queries found in competencyquestions/sparql, to query FFDQ instance documents and produce a tsv result file.
See the sections below for more information and usage.
Available from maven central.
<dependency>
<groupId>org.datakurator</groupId>
<artifactId>kurator-ffdq</artifactId>
<version>3.0.0</version>
</dependency>
mvn package
A command-line utility is provided at query-util.sh for running SPARQL queries on sample RDF data. Example SPARQL query files can be found in the competencyquestions/sparql directory relative to the project root, and example turtle and jsonld files can be found in competencyquestions/rdf.
To view competency questions and example rdf see the readme at: https://github.com/kurator-org/kurator-ffdq/blob/master/competencyquestions/README.md
Run the utility with the options specified below, or run it with no options to see usage.
- -q,--query <arg> - Input file containing sparql query.
- -t,--triples <arg> - Input rdf file containing triples (jsonld or turtle).
- -o,--out <arg> - Output tsv file for query result.
For example:
./query-util.sh -t competencyquestions/rdf/example.jsonld -q competencyquestions/sparql/results.sparql -o results.tsv
Example working with the BDQ Core tests (assuming checkout of tdwg/bdq in a directory next to kurator-ffdq):
./query-util.sh -t ../bdq/tg2/core/TG2_tests.ttl -q competencyquestions/sparql/tests_bdqcore.sparql -o results.tsv
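The tsv file written with the -o option can be read back in Java using only the standard library. The following is a minimal sketch, assuming the first line of the file is a header row and that no value contains an embedded tab; neither assumption is confirmed by this document. The class name TsvResultReader is hypothetical and not part of kurator-ffdq.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of kurator-ffdq.
class TsvResultReader {

    // Parses TSV lines into one map per data row, keyed by the header names.
    // Assumes the first line is a header and that no value contains a tab.
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        String[] header = lines.get(0).split("\t", -1);
        for (String line : lines.subList(1, lines.size())) {
            if (line.isEmpty()) {
                continue; // skip blank lines such as a trailing newline
            }
            String[] cells = line.split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                // Short rows are padded with empty strings.
                row.put(header[i], i < cells.length ? cells[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TsvResultReader.java results.tsv
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (Map<String, String> row : parse(lines)) {
            System.out.println(row);
        }
    }
}
```

Keeping the parsing in a method that takes a list of lines makes it easy to try on a few literal strings before pointing it at a real result file.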
This utility provides authors of actors a way to convert the spreadsheet of standardized tests into FFDQ RDF and/or Java classes containing stub methods for implementing tests.
The utility takes a csv file with each row representing a single test and a properties file with metadata about the mechanism as inputs. Examples for the Date Validator can be found at data/DwCEventDQ.csv and data/DwCEventDQ.properties.
The properties file must contain a guid that uniquely identifies the mechanism implementing the tests and a human-readable name for that mechanism. In order to use class generation, a Java package and class name must also be specified for the implementation.
ffdq.mechanism.guid=b844059f-87cf-4c31-b4d7-9a52003eef84
ffdq.mechanism.name=Kurator: Date Validator - DwCEventDQ
ffdq.mechanism.javaPackage=org.filteredpush.qc.date
ffdq.mechanism.javaClass=DwCEventDQ
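Before running the utility it can be convenient to check that a properties file has the keys described above. The following is a minimal sketch using java.util.Properties; the class MechanismConfigCheck and its methods are hypothetical helpers and not part of kurator-ffdq, and the utility itself may validate its configuration differently.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight check for illustration; not part of kurator-ffdq.
class MechanismConfigCheck {

    // Keys that identify and name the mechanism.
    static final String[] REQUIRED = {"ffdq.mechanism.guid", "ffdq.mechanism.name"};
    // Keys needed only when generating or appending to a Java class.
    static final String[] CLASS_GENERATION = {"ffdq.mechanism.javaPackage", "ffdq.mechanism.javaClass"};

    // Returns the keys that are absent or blank, in the order they were checked.
    static List<String> missingKeys(Properties props, boolean classGeneration) {
        List<String> missing = new ArrayList<>();
        collectMissing(props, REQUIRED, missing);
        if (classGeneration) {
            collectMissing(props, CLASS_GENERATION, missing);
        }
        return missing;
    }

    private static void collectMissing(Properties props, String[] keys, List<String> missing) {
        for (String key : keys) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MechanismConfigCheck.java data/DwCEventDQ.properties
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            props.load(reader);
        }
        System.out.println("Missing keys: " + missingKeys(props, true));
    }
}
```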
The csv file containing the test metadata defines the following metadata:
- term_localName (was GUID) - The test guid
- Label - Human readable name of the test
- Description - Describes the test conditions (pass/fail). This value is used to define the Criterion for a Validation or the Enhancement for an Amendment.
- Specification - Technical description of expected behavior when running the test
- Type - The assertion type. Must be one of the values Measure, Validation or Amendment
- Resource Type - Either SingleRecord or MultiRecord
- Dimension - Defines the data quality dimension of a test for Measures. Can be one of Value, Vocab Match, Completeness, Accuracy, Precision or Uniqueness
- Information Element - Term or list of terms from a controlled vocabulary that a test acts upon. Must contain the namespace prefix (e.g. "dwc:eventDate, dwc:verbatimEventDate")
- Source - Source of the tests
- Example Implementation - Link to the source code on GitHub, SourceForge, etc
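The constraints listed above on Type, Resource Type and Information Element can be checked for a row before it is passed to the utility. The following is a minimal sketch; the class TestRowCheck is a hypothetical helper that is not part of kurator-ffdq, and it assumes that multiple terms in the Information Element cell are separated by commas, as in the example given above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical validator for one row of the tests spreadsheet; not part of kurator-ffdq.
class TestRowCheck {

    static final List<String> TYPES = Arrays.asList("Measure", "Validation", "Amendment");
    static final List<String> RESOURCE_TYPES = Arrays.asList("SingleRecord", "MultiRecord");

    // Splits a cell such as "dwc:eventDate, dwc:verbatimEventDate" into trimmed terms.
    // Assumes a comma is the separator between terms.
    static List<String> splitTerms(String cell) {
        List<String> terms = new ArrayList<>();
        for (String term : cell.split(",")) {
            String trimmed = term.trim();
            if (!trimmed.isEmpty()) {
                terms.add(trimmed);
            }
        }
        return terms;
    }

    // Returns a description of each problem found; an empty list means no problems.
    static List<String> problems(String type, String resourceType, String informationElement) {
        List<String> problems = new ArrayList<>();
        if (!TYPES.contains(type)) {
            problems.add("Unexpected Type: " + type);
        }
        if (!RESOURCE_TYPES.contains(resourceType)) {
            problems.add("Unexpected Resource Type: " + resourceType);
        }
        for (String term : splitTerms(informationElement)) {
            int colon = term.indexOf(':');
            // A prefixed term needs a non-empty prefix and a non-empty local name.
            if (colon <= 0 || colon == term.length() - 1) {
                problems.add("Information Element without namespace prefix: " + term);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(problems("Validation", "SingleRecord", "dwc:eventDate, dwc:verbatimEventDate"));
        System.out.println(problems("Validation", "SingleRecord", "eventDate"));
    }
}
```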
To run the utility use the test-util.sh shell script with the following required options:
- config <arg> - Properties file defining the mechanism to use
- in <arg> - Input CSV file containing list of tests
- out <arg> - Output file for the rdf representation of the tests
Optionally, UseCase-relationships can be included and loaded from a file:
- useCaseFile <arg> - Input CSV file containing one UseCase per line, with a pipe delimited list of test labels for that use case.
The default format is turtle but this can be changed via the following option:
- format <arg> - Output format (RDFXML, TURTLE, JSON-LD)
By default the utility only generates the rdf. In order to generate a new Java class or append new tests to an existing one, you can also specify the following options:
- generateClass - Generate a new Java class with stub methods for each test
- appendClass - Append to an existing Java class stub methods for new tests
- srcDir <arg> - The Java sources root directory (e.g. src/main/java)
When using the class generation feature via the generateClass or appendClass options, the class is located by using the ffdq.mechanism.javaPackage property from the config file to determine the package directory relative to the Java sources root directory specified by the srcDir option.
For example, to run the utility on the example data provided in this project use the following command:
./test-util.sh -config data/DwCEventDQ.properties -in data/DwCEventDQ.csv -out data/DwCEventDQ.ttl -srcDir event_date_qc/src/main/java -appendClass
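To see which file the command above would be expected to touch, the package-to-directory mapping can be worked through by hand. The following sketch follows the standard Java source layout convention and combines it with the ffdq.mechanism.javaClass property; the class GeneratedClassLocation is a hypothetical helper, and the utility's actual resolution logic is not shown in this document.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of the package-to-directory convention; not part of kurator-ffdq.
class GeneratedClassLocation {

    // Resolves <srcDir>/<package as directories>/<class>.java
    static Path resolve(String srcDir, String javaPackage, String javaClass) {
        // Each dot-separated package segment becomes one directory level.
        Path packageDir = Paths.get(srcDir, javaPackage.split("\\."));
        return packageDir.resolve(javaClass + ".java");
    }

    public static void main(String[] args) {
        // Values taken from the example properties file and command line above.
        System.out.println(resolve("event_date_qc/src/main/java", "org.filteredpush.qc.date", "DwCEventDQ"));
    }
}
```

With the example values, the resolved path ends in org/filteredpush/qc/date/DwCEventDQ.java under the sources root.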
After generating FFDQ RDF from the spreadsheet of tests and implementing methods tied to test GUIDs in the DQClass, the test runner utility can be used to produce rdf containing report concepts for describing the results.
Using the options below, run the utility from the directory containing the jar file(s) that include the annotated DQ Classes (e.g. event_date_qc-3.0.5-SNAPSHOT.jar).
- cls <arg> - Fully qualified name of Java class on the classpath to run tests from
- rdf <arg> - Input file containing the rdf representation of the tests
- in <arg> - Input occurrence data file
- informat <arg> - Input occurrence data file format (csv or tdf)
- out <arg> - Output file for the rdf representation of the dq report
- format <arg> - Input/output rdf format (RDFXML, TURTLE, JSON-LD)
Classes from all jar files present in the working directory when running the utility are included on the classpath by default. In order to avoid class loader conflicts, ensure that the jar file that contains the DQClass specified by the cls option is the latest version.
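The reason stale jars matter can be illustrated with the standard java.net.URLClassLoader. The following is a sketch of one way to build a class loader over every jar in a directory; the class JarDirectoryClasspath is hypothetical, and whether the test runner uses this mechanism or this ordering is an assumption, not something this document states.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the test runner's own class loading may differ.
class JarDirectoryClasspath {

    // Lists the jar files directly inside dir, sorted by path for a stable order.
    static List<File> jarsIn(File dir) {
        File[] found = dir.listFiles((d, name) -> name.endsWith(".jar"));
        List<File> jars = new ArrayList<>();
        if (found != null) { // null when dir does not exist or is not a directory
            Arrays.sort(found);
            jars.addAll(Arrays.asList(found));
        }
        return jars;
    }

    // Builds a class loader that searches the given jars in list order.
    static URLClassLoader loaderFor(List<File> jars) {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < urls.length; i++) {
            try {
                urls[i] = jars.get(i).toURI().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Not a usable jar path: " + jars.get(i), e);
            }
        }
        return new URLClassLoader(urls, JarDirectoryClasspath.class.getClassLoader());
    }

    public static void main(String[] args) {
        // Print the jars a loader built from the working directory would search, in order.
        for (URL url : loaderFor(jarsIn(new File("."))).getURLs()) {
            System.out.println(url);
        }
    }
}
```

A URLClassLoader searches its URLs in order and uses the first match, so when two versions of the same jar sit in one directory, either version of a class may be the one loaded depending on that order. Keeping only the latest jar in the working directory avoids the ambiguity.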
For example, run from the command line via:
cd ~/event_date_qc/target
~/kurator-ffdq/test-runner.sh -cls org.filteredpush.qc.date.DwCEventDQ -rdf ../conf/DwCEventDQ.ttl -in ~/Downloads/occurrence.txt -out dq-report.ttl
Using the query utility mentioned above along with the postprocess.sparql query, you can create a tsv query result for previewing the report:
~/kurator-ffdq/query-util.sh -q ~/kurator-ffdq/competencyquestions/sparql/postprocess.sparql -t dq-report.ttl -o result.tsv
libreoffice result.tsv
TODO: Work in progress, need to match guids for tests (DataQualityNeed subclass instances) to Specifications in model
sh test-runner.sh -format RDFXML -in ../bdq/tg2/core/TG2_test_validation_data_synthetic_occurrence.csv -out output.rdf -rdf ../bdq/tg2/_review/dist/bdqcore.xml -cls org.filteredpush.qc.date.DwCEventDQ -informat csv &> temp.out
The higher level use of the framework makes use of Java annotations defined in the org.datakurator.ffdq.annotations package.
To use the annotations, add the dependency to the pom.xml file of the Maven project that defines the methods and classes corresponding to a set of assertion tests:
<dependency>
<groupId>org.datakurator</groupId>
<artifactId>kurator-ffdq</artifactId>
<version>3.0.0</version>
</dependency>
A class-level annotation is provided that identifies the FFDQ Mechanism implementing the test methods. Example usage of the @Mechanism annotation applied to a class:
@Mechanism("urn:uuid:b844059f-87cf-4c31-b4d7-9a52003eef84")
public class DwCEventDQ {
// ...
}
The value of the annotation is a guid that uniquely identifies the mechanism. This is tied to metadata in RDF about the mechanism.
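For readers unfamiliar with how a class-level annotation becomes available at run time, the following sketch declares a stand-in annotation and reads its value by reflection. The Mechanism annotation declared inside this example is a local, hypothetical copy so that the code is self-contained; the real annotation lives in org.datakurator.ffdq.annotations and its exact declaration is not shown in this document.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Self-contained illustration; the annotation here is a stand-in, not the library's class.
class MechanismLookup {

    // Stand-in for the library's class-level annotation (assumed shape).
    // RUNTIME retention is what makes the value visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Mechanism {
        String value();
    }

    @Mechanism("urn:uuid:b844059f-87cf-4c31-b4d7-9a52003eef84")
    static class DwCEventDQ {
        // test methods would go here
    }

    // Returns the mechanism guid declared on a class, or null if it has none.
    static String mechanismGuid(Class<?> cls) {
        Mechanism mechanism = cls.getAnnotation(Mechanism.class);
        return mechanism == null ? null : mechanism.value();
    }

    public static void main(String[] args) {
        System.out.println(mechanismGuid(DwCEventDQ.class));
    }
}
```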
Next is the method level annotation that maps Java code to FFDQ concepts by associating a method implementing a test with the specification GUID.
@Provides("urn:uuid:da63f836-1fc6-4e96-a612-fa76678cfd6a")
public static DQResponse<ComplianceValue> eventDateConsistentWithVerbatim(...) {
// ...
}
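A runner can use such method-level annotations to find the implementation for a given specification GUID. The following sketch builds a GUID-to-method-name index by reflection. The Provides annotation and the ExampleTests class declared here are hypothetical stand-ins so that the code compiles on its own; how the test-runner performs this lookup internally is not shown in this document.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Self-contained illustration; the annotation here is a stand-in, not the library's class.
class ProvidesIndex {

    // Stand-in for the library's method-level annotation (assumed shape).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Provides {
        String value();
    }

    static class ExampleTests {
        @Provides("urn:uuid:da63f836-1fc6-4e96-a612-fa76678cfd6a")
        public static String eventDateConsistentWithVerbatim() {
            return "stub"; // placeholder body
        }

        // Not annotated, so it is left out of the index.
        public static String helper() {
            return "ignored";
        }
    }

    // Maps each specification GUID to the name of the method annotated with it.
    static Map<String, String> index(Class<?> cls) {
        Map<String, String> byGuid = new TreeMap<>();
        for (Method method : cls.getDeclaredMethods()) {
            Provides provides = method.getAnnotation(Provides.class);
            if (provides != null) {
                byGuid.put(provides.value(), method.getName());
            }
        }
        return byGuid;
    }

    public static void main(String[] args) {
        System.out.println(index(ExampleTests.class));
    }
}
```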
Lastly, the method parameter level annotation is provided for defining how the parameters (the fields acted upon or fields consulted) map to information elements in ffdq defined in terms of a controlled vocabulary such as DWC. The value for this annotation must contain the namespace prefix (e.g. dwc:eventDate).
public static DQResponse<ComplianceValue> eventDateConsistentWithVerbatim(
@ActedUpon("dwc:eventDate") String eventDate,
@ActedUpon("dwc:verbatimEventDate") String verbatimEventDate) {
// ...
}
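Parameter annotations like these let a runner decide which field of a record to pass for each argument. The following sketch reads the annotation on each parameter and looks up the matching value in a map keyed by prefixed term. The ActedUpon annotation, the ExampleTests class and the map-based record are hypothetical stand-ins; how the test-runner maps input columns to terms is not specified in this document.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.HashMap;
import java.util.Map;

// Self-contained illustration; the annotation here is a stand-in, not the library's class.
class ActedUponBinding {

    // Stand-in for the library's parameter-level annotation (assumed shape).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.PARAMETER)
    @interface ActedUpon {
        String value();
    }

    static class ExampleTests {
        public static boolean eventDateConsistentWithVerbatim(
                @ActedUpon("dwc:eventDate") String eventDate,
                @ActedUpon("dwc:verbatimEventDate") String verbatimEventDate) {
            // Placeholder logic only; a real test would compare parsed dates.
            return eventDate != null && eventDate.equals(verbatimEventDate);
        }
    }

    // Builds the argument array by looking up each annotated parameter's term
    // in the record; unannotated parameters and missing terms become null.
    static Object[] bindArguments(Method method, Map<String, String> record) {
        Parameter[] parameters = method.getParameters();
        Object[] arguments = new Object[parameters.length];
        for (int i = 0; i < parameters.length; i++) {
            ActedUpon actedUpon = parameters[i].getAnnotation(ActedUpon.class);
            arguments[i] = actedUpon == null ? null : record.get(actedUpon.value());
        }
        return arguments;
    }

    // Invokes the first static method with the given name using bound arguments.
    static Object invoke(Class<?> cls, String methodName, Map<String, String> record) {
        for (Method method : cls.getDeclaredMethods()) {
            if (method.getName().equals(methodName)) {
                try {
                    return method.invoke(null, bindArguments(method, record));
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Could not invoke " + methodName, e);
                }
            }
        }
        throw new IllegalArgumentException("No such method: " + methodName);
    }

    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put("dwc:eventDate", "1880-05-08");
        record.put("dwc:verbatimEventDate", "8 May 1880");
        System.out.println(invoke(ExampleTests.class, "eventDateConsistentWithVerbatim", record));
    }
}
```

Reading the annotation rather than the parameter name means the binding keeps working even when the class is compiled without parameter names retained.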
Depending on the type of assertion, the generic return type DQResponse can be parameterized with one of the following value types. Example usage of each parameterized type is shown below.
For Measures, depending on the dimension, use DQResponse<NumericalValue>:
DQResponse<NumericalValue> result = new DQResponse<>();
long seconds = DateUtils.measureDurationSeconds(eventDate);
result.setValue(new NumericalValue(seconds));
result.setResultState(ResultState.RUN_HAS_RESULT);
or for the dimension of Completeness use DQResponse<CompletenessValue>:
DQResponse<CompletenessValue> result = new DQResponse<>();
result.setValue(CompletenessValue.COMPLETE);
result.addComment("Value provided for eventDate.");
result.setResultState(ResultState.RUN_HAS_RESULT);
For Validations use DQResponse<ComplianceValue>:
DQResponse<ComplianceValue> result = new DQResponse<>();
result.setValue(ComplianceValue.COMPLIANT);
result.addComment("Provided value for day '" + day + "' is an integer in the range 1 to 31.");
result.setResultState(ResultState.RUN_HAS_RESULT);
For Amendments use DQResponse<AmendmentValue> and add changed values to an instance of AmendmentValue:
DQResponse<AmendmentValue> result = new DQResponse<>();
AmendmentValue extractedValues = new AmendmentValue();
extractedValues.addResult("dwc:eventDate", DateUtils.createEventDateFromStartEnd(startDate, endDate));
result.setValue(extractedValues);
result.setResultState(ResultState.CHANGED);
See classes in the org.datakurator.ffdq.model package.
To deploy a snapshot to the snapshotRepository:
mvn clean deploy
To deploy a new release to maven central, set the version in pom.xml to a non-snapshot version, then deploy with the release profile (which adds package signing and deployment to release staging):
mvn clean deploy -P release
Certain Java classes dedicated to fundamental concepts in the Fitness For Use framework have Java comments that credit a paper by Veiga et al., available at: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0178731
An example of such a class is: https://github.com/kurator-org/kurator-ffdq/blob/master/src/main/java/org/datakurator/data/ffdq/model/report/DataResource.java
FFDQ is a sort of acronym for a conceptual Framework For Data Quality assessment and management of biodiversity data.
Export a copy of the tests with RDFBean class binding axioms in the RDF:
./test-util.sh -config data/tg2_tests.properties -format RDFXML -out TG2_tests.xml -in data/TG2_tests.csv -guidFile ../bdq/tg2/core/TG2_tests_additional_guids.csv -useCaseFile ../bdq/tg2/core/usecase_test_list.csv -ieGuidFile ../bdq/tg2/core/information_element_guids.csv -includeBindings
Run a test class using the RDF with those bindings.
sh test-runner.sh -format RDFXML -in ../bdq/tg2/core/TG2_test_validation_data_synthetic_occurrence.csv -out output.rdf -rdf TG2_tests.xml -cls org.filteredpush.qc.date.DwCEventDQ -informat csv &> temp.out
TODO: Not all needed now, can get additional guids from CSV input.
./test-util.sh -config data/tg2_tests.properties -format RDFXML -out ../bdq/tg2/core/TG2_tests.xml -in data/TG2_tests.csv -guidFile ../bdq/tg2/core/TG2_tests_additional_guids.csv -useCaseFile ../bdq/tg2/core/usecase_test_list.csv -ieGuidFile ../bdq/tg2/core/information_element_guids.csv