Data and Values.bak

Damion Dooley edited this page Oct 29, 2018 · 1 revision

Introduction

OBI has adopted the Information Artifact Ontology's Information Content Entity (ICE), very generally defined as a type of entity which bears information about something [1]. Any ICE class may have 'is about' object relations to other entities which define its aboutness, for example "minimal inhibitory concentration is about some dose response curve" or "age since planting measurement datum is about some Spermatophyta" (among other things). If the target of an 'is about' relation is a quality, then one may see the 'is about' sub-property 'is quality measurement of' used, for example "length measurement datum is quality measurement of some length".

Note that a figs/data_obi_draw.io.xml file holds the diagrams in this section. It is provided in a draw.io diagram format for reuse in ontology design work.

The Information Content Entity data item class holds singular entities or collections of entities that specifically record inputs or outputs (measurement / prediction / transformation datums) of processes / equipment / human interaction. OBI also has a measurement datum class intended to name and model the datum outputs of an assay, and their contextual semantics while avoiding the value specification level of detail. One can model processes, inputs, and outputs without having to reference a "data" layer of value specifications.

To establish constraints on what a singular data item (aka datum) can have for a value, OBI introduces a value specification (VS) class which can express those constraints in axioms (for example string length, pertinent units, or valid categorical choices). An instance of a value specification can have a has specified value data property that holds its literal value. A value specification details allowable values for a given purpose; in contrast, a data item references the process that generated it.

The connection between a measurement datum (or any ICE term) and a value specification is accomplished with the has value specification object property. Thus an age measurement datum would be linked to a numeric value specification which details the unit - year, month, day etc. of the measure. (For past discussion on this, see #870, #945 and #833).

Different assays may output the same datum and value specification combination. For example, an age since planting measurement datum and its age in years VS could be output from assays that calculate or estimate by tree ring count, carbon-14 analysis, planting date, height of species etc. A measurement datum is enriched by its 'output of' relations to the process(es) that generated it, which provide a context that has implications for what it is about.

A value specification's primary aboutness is expressed using the specifies value of object relation. For example "mass value specification specifies value of some mass", as shown below. An instance of the VS specifies value of an instance of the mass quality which inheres in John. The VS instance has a kilogram unit, and a decimal value of 70.0.

[diagram]
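
The instance-level pattern for John's mass can be sketched in Python with plain data classes. This is only an illustration under stated assumptions: the class and field names (Quality, ValueSpecification, specifies_value_of and so on) are hypothetical stand-ins for the OBI relations, not an OBI or OWL API.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Quality:
    """A quality instance, e.g. the mass that inheres in John (hypothetical model)."""
    kind: str        # type of quality, e.g. "mass"
    inheres_in: str  # bearer of the quality


@dataclass(frozen=True)
class ValueSpecification:
    """Stand-ins for 'specifies value of', 'has measurement unit label'
    and 'has specified value'."""
    specifies_value_of: Quality
    unit_label: str
    specified_value: Decimal


johns_mass = Quality(kind="mass", inheres_in="John")
vs = ValueSpecification(
    specifies_value_of=johns_mass,
    unit_label="kilogram",
    specified_value=Decimal("70.0"),
)

print(vs.specifies_value_of.inheres_in, vs.specified_value, vs.unit_label)
```

The value, its unit and the quality it describes travel together, which is the property the following sections rely on.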

In the case where a value specification is about the conjunction of a few different things, the aboutness target can be a precomposed term or instance of those components, with one component providing the primary type of the measure. For example, "eye color" is primarily about color, and so is limited to the ways color can be reported; it is secondarily about the body part being observed; and finally, in the instance, it references the particular organism being observed. An example value specification instance: "_vs1 specifies value of some color and is quality of eye and is about Patient123".

However, the defining characteristic of an Information Content Entity is that it is 'about' something. Thus, the scope of the OBI representation of data is to capture the details and characteristics of the information, rather than the thing that it describes. This is a crucial scoping step in developing representations in OBI.

Value specifications vs data properties

One advantage of having value specifications about qualities is that the duo reduces the need for a plethora of data properties. Rather than establish a has age data property, we express a value specification about age. Both hold a value, but the latter allows us to focus on defining the semantics of the quality 'age' and its subclasses - age since planting etc. In this view a data property is analogous to a kind of compressed and semantically opaque value specification because its semantic detail is limited to data property attributes.

Data property values don't support units directly. In the example diagram of John's data properties, is the has weight decimal value given in kilos or grams? Is height in meters or cm? Either assumptions are made about units associated with a data property, or a second set of data properties must be created to capture unit info. Value specifications allow units to be expressed explicitly.

A data property doesn't support a relevant time of measurement value. For example in the "_John" diagram, when was his has height observed, and was it the same time as has weight, such that an accurate BMI can be calculated? Time differentiated data is rather complex to model in OWL, so a pragmatic approach, if workable, is to only load data into an OWL reasoner platform that doesn't need to be differentiated by time. In the BMI example, we could assume height and weight data properties are adequately close in time that any derivative calculations are appropriate.
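
To make the unit and time issues concrete, here is a minimal sketch of the BMI calculation when each value carries an explicit unit and observation date. The Measured class, the conversion tables, the 30-day tolerance, and the height and dates are all assumptions made for illustration; only the 70.0 kilogram figure echoes the earlier example.

```python
from dataclasses import dataclass
from datetime import date

# Conversion factors to the units BMI expects (kilogram, meter).
TO_KILOGRAM = {"kilogram": 1.0, "gram": 0.001}
TO_METER = {"meter": 1.0, "centimeter": 0.01}


@dataclass(frozen=True)
class Measured:
    value: float
    unit: str
    observed_on: date  # hypothetical time stamp, not an OBI relation


def bmi(weight: Measured, height: Measured, max_gap_days: int = 30) -> float:
    """Body mass index in kg/m^2, refusing measurements taken too far apart."""
    gap = abs((weight.observed_on - height.observed_on).days)
    if gap > max_gap_days:
        raise ValueError(f"measurements are {gap} days apart")
    kilograms = weight.value * TO_KILOGRAM[weight.unit]
    meters = height.value * TO_METER[height.unit]
    return kilograms / (meters * meters)


weight = Measured(70.0, "kilogram", date(2018, 10, 1))     # illustrative values
height = Measured(175.0, "centimeter", date(2018, 10, 3))
print(round(bmi(weight, height), 1))
```

With bare data properties neither the conversion nor the date check is possible without outside assumptions.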

For these reasons, OBI has a very small number of data properties, and a large and growing number of measurement datums.

Time stamped datums

OBI's preferred measurement datum and value specification approach can be enhanced with time data points in a few ways. Assays and other processes can be marked with start and/or stop times, and therefore any specified output would have those time points to mark its relevance [using which relations?]. Currently a time stamped measurement datum exists to associate time directly with a datum rather than the process that led to it; however this older term is under discussion as it needs clarification.

Basic Implementation Issues: RDF, OWL, etc.

OWL inherits most of RDF's ability to specify XML string, numeric, datetime, and URI datatype values as data properties of an entity, and can compare data properties across entities (see here and here). OWL can also be used to specify constraints on string value length and content, and can specify numeric bounds on numbers. OBI currently focuses on reuse of RDF/XML datatypes to capture experimental data. Those who need further functionality may find other datatype representations useful (e.g. here).

In addition to reasoning prowess, using an OWL ontology to detail types of assay data - parameters, measurables, independent and dependent variables - will encourage standardization of their usage, enable experimental reproducibility, and facilitate data exchange and conversion.

Data Types

Here a handful of the primitive datatypes from RDF which are used in OWL are discussed. The possibility of user defined datatypes is avoided in favour of using enhanced value specifications to do the same work. Some examples are expressed directly in OBI, while others can be constructed in an application ontology that draws on OBI components. Recognizing that OWL isn't suitable for doing all types of validation, we have shown how value specifications can be enhanced with basic numeric range and string content restrictions.

String

An OWL data property can hold a string as a plain literal with an optional language tag (see here). This enables constraints on string length and its contents (by way of regular expressions).

For example a US Zip Code is a string of 5 digits (stored as a string to anticipate compatibility with its Zip+4 extension). One could construct the following representation:

Class: 'postal code specification'
    subClassOf 'value specification'
    subClassOf 'has specified value' only xsd:string[pattern "[0-9A-Za-z \-]{2,10}"]

Class: 'ZIP code specification'
    subClassOf 'postal code specification'
    subClassOf 'specifies value of' some ('postal code' and 'is about' some (site and 'located in' some 'United States of America'))
    subClassOf 'has specified value' only xsd:string[pattern "[0-9]{5}"]

[diagram]
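
The two pattern restrictions can be tried outside a reasoner with Python's re module. This is a sketch only: Python's regular expression dialect is not identical to the XML Schema one, although the constructs used here behave the same in both, and the function name is_valid_zip is made up for illustration.

```python
import re

# Patterns copied from the class axioms above. An XSD pattern facet must match
# the whole literal, so re.fullmatch is the closest Python equivalent.
POSTAL_CODE = re.compile(r"[0-9A-Za-z \-]{2,10}")
ZIP_CODE = re.compile(r"[0-9]{5}")


def is_valid_zip(value: str) -> bool:
    """True when value satisfies both the inherited postal code pattern
    and the narrower ZIP code pattern."""
    return bool(POSTAL_CODE.fullmatch(value)) and bool(ZIP_CODE.fullmatch(value))


for candidate in ["90210", "9021", "K1A 0B1"]:
    print(candidate, is_valid_zip(candidate))
```

A subclass inherits the superclass restriction, so a ZIP code value has to pass both patterns; the Canadian-style code passes only the general one.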

String length constraints can be set via the "length", "minLength" and "maxLength" facets, e.g. xsd:string[length "5"^^xsd:integer]. A "pattern" facet supports regular expression syntax to some extent, allowing "[0-9] [a-z] [A-Z] . ? * + {m,n}" components. Thus we can express fairly well-validated email addresses:

Class: 'email address specification'
    subClassOf 'value specification'
    subClassOf 'specifies value of' only 'email address' 
    subClassOf 'has specified value' only xsd:string[pattern "[A-Za-z0-9]+([_.\-][A-Za-z0-9]+)*\@[A-Za-z0-9]+([.\-][A-Za-z0-9]+){1,3}"]

Note one quirk: in pattern matching, the "@" character must be escaped, or else the remainder of the test string is ignored (i.e. "@" is interpreted as the start of a language tag on the string). Also, more work is required to cover validation of international / UTF-8 strings.
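
As a quick check of what the email pattern accepts, it can be run through Python's re module. This is a sketch under the assumption that the pattern behaves the same in Python as in an OWL reasoner's XSD regex engine; the helper name and the sample addresses are invented for illustration.

```python
import re

# Pattern copied from the axiom above, including the escaped "@".
EMAIL = re.compile(
    r"[A-Za-z0-9]+([_.\-][A-Za-z0-9]+)*\@[A-Za-z0-9]+([.\-][A-Za-z0-9]+){1,3}"
)


def is_valid_email(value: str) -> bool:
    """Whole-string match, mirroring how an XSD pattern facet is applied."""
    return EMAIL.fullmatch(value) is not None


for candidate in ["jane.doe@example.org", "jane..doe@example.org", "jane@localhost"]:
    print(candidate, is_valid_email(candidate))
```

The pattern requires between one and three dot- or hyphen-separated parts after the first domain label, so a bare host name is rejected.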

Categorical

A categorical value specification is a flat list or hierarchic tree structure containing a finite number of pre-determined choices. Here we provide for choices whose values are either xsd:string or xsd:anyURI references to ontology terms.

Categorical string choice

If a string must conform to a smaller set of choices, and nothing more needs to be axiomatized about each choice, then this can be accomplished with a value specification that is both string and categorical. The value specification has a 'has specified value' component which uses a regular expression to enumerate the permitted strings. Note that in this approach one cannot easily provide other information (label, description) about each choice in a user interface.

For example, an "E-coli K antigen value specification" can be represented as:

Class: 'E-coli K antigen value specification'
    subClassOf 'categorical value specification'
    subClassOf 'specifies value of' only 'K antigen'
    subClassOf 'has specified value' only xsd:string[pattern "K(1|2a|2ac|3|4|5|6|7|8|9|10|11|12|13|14|15|16|18a|18ab|19|20|22|23|24|26|27|28|29|30|31|34|37|39|40|41|42|43|44|45|46|47|49|50|51|52|53|54|56|96|55|74|82|84|85ab|85ac|87|92|93|95|97|98|100|101|102|103|X104|X105|X106)"]

[diagram]

This allows a reasoner to report an inconsistency when an instance of E-coli K antigen value specification has specified value 'K17a', which is not among the enumerated choices.

One can potentially leave the has specified value axiom out, in which case validation enforcement would need to occur outside the OWL reasoning context.
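
When validation is done outside the reasoner, the same enumeration can be checked in application code. The sketch below rebuilds the pattern from the list of choices in the axiom; the names K_CHOICES and is_k_antigen are hypothetical.

```python
import re

# Choices copied from the 'has specified value' pattern above, in the same order.
K_CHOICES = (
    "1 2a 2ac 3 4 5 6 7 8 9 10 11 12 13 14 15 16 18a 18ab 19 20 22 23 24 26 27 "
    "28 29 30 31 34 37 39 40 41 42 43 44 45 46 47 49 50 51 52 53 54 56 96 55 74 "
    "82 84 85ab 85ac 87 92 93 95 97 98 100 101 102 103 X104 X105 X106"
).split()

K_ANTIGEN = re.compile("K(" + "|".join(K_CHOICES) + ")")


def is_k_antigen(value: str) -> bool:
    """Whole-string match against the enumerated K antigen labels."""
    return K_ANTIGEN.fullmatch(value) is not None


for candidate in ["K18a", "K17a", "K2ac"]:
    print(candidate, is_k_antigen(candidate))
```

Keeping the choices in a list rather than a hand-written pattern also makes it easier to attach labels or descriptions to each choice in a user interface.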

Categorical ontology term choice

Categorical choice lists or trees of ontology terms (e.g. of organism taxonomy, of disease, etc.) essentially have an xsd:anyURI datatype since a selection is an ontology URI. The aim here is to point to existing ontology class or instance identifiers within one's application ontology and/or imported from 3rd party ontologies as selections for a categorical variable. However, some complications arise which the following example will explore. We could try to capture a handedness quality with:

Class: 'handedness value specification'
    subClassOf 'categorical value specification'
    subClassOf 'has specified value' only handedness 

However, this is not permitted in OWL, since the has specified value data property can only have a literal on the right side. The target could be expressed simply as "has specified value only xsd:anyURI", thus allowing the URI of a term such as right-handedness as a value, but this then requires some validation mechanism external to an OWL reasoner for limiting categorical values. The class itself doesn't indicate what the choices are.

In a different approach, an OBI example using categorical value specification focuses on describing a tumor grading standard histologic grade according to AJCC 7th edition. Here the value specification class has individuals which are each interpreted as grades, and which could potentially be augmented with data properties that detail their assessment differentiae. This approach is suited to cases where selections are not already established (and would not be in the future) as ontology classes situated within their own hierarchic context.

Alternately one could use the specifies value of relation to point to existing categorical choices (qualities, etc.):

Class: 'handedness value specification'
    subClassOf 'categorical value specification'
    subClassOf 'specifies value of' only handedness

Now an instance of handedness value specification can have a specifies value of axiom pointing to a handedness class instance. This involves some extra setup because all handedness selections need to be "punned" since they can't be referenced directly as classes. In other words an individual needs to be created to mirror each categorical choice, so for example classes for left handedness, right handedness, ambidextrous handedness all need mirrored individuals - and in this case these are not native to the PATO ontology that the classes originate from. (Punning is accomplished manually in Protege by copying an existing class URI into the "Create a new Named individual" form, with the "new entity options ..." set to expect a user supplied name. This preserves the same identifier for both class and individual).

Note that in the past OBI used/tried categorical measurement datum for enumerating categorical choices, with a has category label object property that linked to a set or class of permissible terms (as shown in OBI's existing handedness value specification example). This class and relation are being discouraged in favour of the categorical value specification approach.

Boolean

Under discussion is the formalization of a "boolean value specification" datatype that pertains to the presence or absence of a quality or categorical entity. Essentially any quality taken on its own can be treated as a boolean variable. The information that an animal is characterized as a neonate, for example, may be the focus of interest in a study even if a more comprehensive categorical value specification of its developmental stage could have been posed as a Likert scale.

Class: 'neonate value specification'
    subClassOf 'value specification'
    subClassOf 'has specified value' only xsd:boolean
    subClassOf 'specifies value of' only 'neonate' 

Any categorical value specification choice instance can potentially be interpreted as a boolean too.

Ordinal

OBI does not currently have a recommendation about how to define an ordered categorical variable. A ranking data property for each choice could be used; or potentially previous/next relations could be established between choices.

Numeric

Currently all numeric value specifications are handled under the scalar value specification term, which implies that each must have a unit as well.

OWL 2 introduced the owl:real datatype as the most generic numeric type, and owl:rational as its subordinate. Under owl:rational is xsd:decimal, from which xsd:integer and its more specific subtypes are derived. In practice numeric conversion appears to be smooth between these types, although xsd:float and xsd:double are formally separate primitive XML Schema datatypes rather than subtypes of xsd:decimal. Any number type can be paired with a unit as described below.

OBI currently does not provide functionality for dealing with numeric precision or error range.

Decimal

Here the pH acidity scale is effectively characterized as a decimal between 0.0 and 14.0:

Class: 'ph value specification'
    subClassOf 'scalar value specification'
    subClassOf 'has measurement unit label' only 'pH'
    subClassOf 'specifies value of' only 'pH measurement'
    subClassOf 'has specified value' only xsd:decimal[>= 0, <= 14]

Note that the Protege axiom editor can be very fussy about exactly how the >,>=,<,<= comparators are positioned with spaces with respect to brackets and numbers.
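
The same inclusive range can be checked in application code before data reaches a reasoner. A minimal sketch using Python's decimal module, with a hypothetical function name:

```python
from decimal import Decimal, InvalidOperation

PH_MIN, PH_MAX = Decimal("0"), Decimal("14")


def is_valid_ph(literal: str) -> bool:
    """Mirror xsd:decimal[>= 0, <= 14]: parse the literal, then test both
    inclusive bounds. Non-numeric and non-finite input is rejected."""
    try:
        value = Decimal(literal)
    except InvalidOperation:
        return False
    return value.is_finite() and PH_MIN <= value <= PH_MAX


for candidate in ["7.4", "14", "14.1", "acidic"]:
    print(candidate, is_valid_ph(candidate))
```

Decimal is used rather than float so that literals such as 14.0 are compared exactly, in keeping with xsd:decimal.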

Integer

Some variables are inherently integers: countable things that can't meaningfully have fractions except as intermediate calculations (quantities of water can be described in decimal to handle portions like 1.5 cups, while basepairs are not meaningful as fractions). Use xsd:integer where rounding during comparison won't be an issue.

Class: 'MIC diffusion measurement specification'
    subClassOf 'scalar value specification'
    subClassOf 'has measurement unit label' only 'millimeter'
    subClassOf 'specifies value of' only 'MIC value'
    subClassOf 'has specified value' only xsd:integer[> 5, < 100]

(OWL actually provides access to further subclasses of integer such as xsd:positiveInteger, but OBI does not have a matching granularity of value specification classes.)

Float

Class: 'MIC dilution measurement specification'
    subClassOf 'scalar value specification'
    subClassOf 'has measurement unit label' only ('milligram per liter' or 'microgram per milliliter')
    subClassOf 'specifies value of' only 'MIC value'
    subClassOf 'has specified value' only xsd:float[>= 0.01f, <= 2048.0f]
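
The integer and float restrictions of the two MIC classes differ in one detail worth noting: the integer bounds are exclusive while the float bounds are inclusive. The sketch below mirrors both in Python. The function names are hypothetical, and rounding through a 32-bit float is an assumption about how one might imitate xsd:float, which is a single-precision type.

```python
import struct

ALLOWED_UNITS = {"milligram per liter", "microgram per milliliter"}


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def is_valid_diffusion_mm(value: int) -> bool:
    """Mirror xsd:integer[> 5, < 100]: both bounds are exclusive."""
    return isinstance(value, int) and not isinstance(value, bool) and 5 < value < 100


def is_valid_dilution(value: float, unit: str) -> bool:
    """Mirror xsd:float[>= 0.01f, <= 2048.0f] plus the permitted unit labels."""
    if unit not in ALLOWED_UNITS:
        return False
    try:
        as_float32 = to_float32(value)
    except OverflowError:  # value is too large for a 32-bit float
        return False
    return to_float32(0.01) <= as_float32 <= to_float32(2048.0)


print(is_valid_diffusion_mm(5), is_valid_diffusion_mm(6))
print(is_valid_dilution(0.01, "milligram per liter"))
print(is_valid_dilution(4096.0, "microgram per milliliter"))
```

Comparing after rounding to 32 bits avoids rejecting a boundary value such as 0.01 merely because its 64-bit and 32-bit representations differ slightly.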

Units

OBI uses the has measurement unit label relation to pair numeric scalar parameters with related units. The Units of Measurement Ontology (UO) is the default unit ontology the OBO Foundry community uses, although there are other options [2, 3]. It is left to a unit ontology to express the base units of the International System of Units, as well as compound units that have numerators and denominators sufficient for a problem space.

A value specification can select, at a general level, all the permissible units to which its subclasses and their instances must conform.

Units extend to countable things like nucleotide 'basepairs' and potentially even 'oranges' or 'fruit' etc. In this respect they indicate the aboutness of the value specification.

Duration

A duration is a difference in time calculated from an interval of two time points. (Semantically the interval is about those points and the events they mark). Value specifications for date and time durations or intervals are generally handled by decimal value specifications with one or more time units attached to them. This allows for decimal fraction amounts, e.g. 2.5 days. An 'age since birth' value specification could be:

Class: 'age since birth value specification'
    subClassOf 'scalar value specification'
    subClassOf 'has specified value' only xsd:decimal
    subClassOf 'specifies value of' some 'age since birth'
    subClassOf 'has measurement unit label' only (year or month or day or hour)
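
A small sketch of how an application might handle such a duration: the unit is checked against the permitted set, and conversion is offered only for units of fixed length. The function name and the decision to refuse year and month conversion are assumptions made for illustration, since calendar years and months vary in length.

```python
from decimal import Decimal

ALLOWED_UNITS = {"year", "month", "day", "hour"}
HOURS_PER_UNIT = {"day": Decimal(24), "hour": Decimal(1)}  # fixed-length units only


def to_hours(value: Decimal, unit: str) -> Decimal:
    """Convert a duration to hours, rejecting units outside the permitted set
    and units whose length in hours is not fixed."""
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit not permitted: {unit}")
    if unit not in HOURS_PER_UNIT:
        raise ValueError(f"no fixed conversion to hours for: {unit}")
    return value * HOURS_PER_UNIT[unit]


print(to_hours(Decimal("2.5"), "day"))
```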

Datetime

Of XML's native date/time datatypes, the OWL 2 datatype map includes xsd:dateTime (format [-]CCYY-MM-DDThh:mm:ss.sss[Z|(+|-)hh:mm] according to the ISO 8601 standard) and xsd:dateTimeStamp (format CCYY-MM-DDThh:mm:ss.sss(Z|(+|-)hh:mm), i.e. time zone required). Other XML date/time types such as xsd:date are not part of that datatype map, so reasoner support for them varies. A Gregorian calendar 24 hour clock instant of time is used, and will be compared down to the second and timezone offset for the xsd:dateTime and xsd:dateTimeStamp formats.

Class: 'hospital admission date specification'
    subClassOf 'scalar value specification'
    subClassOf 'has specified value' only xsd:date
    subClassOf 'specifies value of' only 'hospital admission date'

Often a need for date obfuscation arises when dealing with confidential data points. Pairing a unit such as year, month, day, hour etc. can convey the semantic granularity of the given xsd:dateTime but won't have an effect on a reasoner's equality test, so the remaining datetime components need to be the same (zeroed out, for example) in order for an equality constraint to succeed. This could be done as a pre-processing step.
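
Such a pre-processing step might look like the following sketch, which resets every component finer than the chosen granularity. The function name, the granularity labels and the sample dates are assumptions; note that month and day have no zero value, so they are reset to 1.

```python
from datetime import datetime

# Components to reset for each granularity.
RESET = {
    "year": dict(month=1, day=1, hour=0, minute=0, second=0, microsecond=0),
    "month": dict(day=1, hour=0, minute=0, second=0, microsecond=0),
    "day": dict(hour=0, minute=0, second=0, microsecond=0),
    "hour": dict(minute=0, second=0, microsecond=0),
}


def truncate(moment: datetime, granularity: str) -> datetime:
    """Return a copy of moment with all finer-grained components reset."""
    return moment.replace(**RESET[granularity])


admitted = datetime(2018, 10, 29, 14, 35, 12)  # illustrative values
other = datetime(2018, 10, 3, 8, 0, 0)

print(truncate(admitted, "month").isoformat())
print(truncate(admitted, "month") == truncate(other, "month"))
```

After truncation, two values from the same month compare as equal, which is what an equality constraint in a reasoner would need.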

If a more complex model of date/time is required, the "Time Ontology in OWL" [4, 5] may suffice.

Missing values and other metadata

Data sources likely have a variety of ways to mark missing values. A food database example: “When the content of a food for a component is not known, a hyphen stands in place of the number. It is important for users to take into account these missing values and not to consider them as zero” [6]. Currently, a simple way to express this is to have an instance of a value specification, but no 'has specified value' data property for it.
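
When loading such a source, the hyphen convention can be mapped to "no value" rather than zero. A minimal sketch, assuming each cell arrives as a string and using a hypothetical function name:

```python
from decimal import Decimal
from typing import Optional


def parse_component(raw: str) -> Optional[Decimal]:
    """Return None for the hyphen that marks an unknown content, so that the
    caller can create a value specification without a 'has specified value'."""
    cell = raw.strip()
    if cell == "-":
        return None
    return Decimal(cell)


for cell in ["12.5", "0", "-"]:
    print(repr(cell), parse_component(cell))
```

Returning None keeps an unknown content distinct from a measured zero.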

Other metadata may need to be marked e.g. how to deal with: “In some cases, a component is detected in the food matrix, but it cannot be quantified precisely. The analytical result can therefore be considered as ‘trace’.” Another case is where a data item exists but has been obfuscated for privacy reasons. OBI does not currently have a metadata standard that addresses these cases.

"Other" values

Data Sets

A collection of datums of a given data type is called a data set. A numeric data set (like a numeric spreadsheet column) can have statistical calculations performed on it by using an RDF query language like SPARQL. The member of relation can connect datum or value specification instances to such a data set.
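
The text mentions SPARQL for such calculations; the sketch below swaps in Python's statistics module to show the same idea on an in-memory data set. The values are invented, and None stands for a member that has no specified value.

```python
from statistics import mean
from typing import List, Optional

# Hypothetical 'age in years' members of a data set; None marks a missing value.
age_data_set: List[Optional[float]] = [12.0, 15.5, None, 20.0]


def present_values(data_set: List[Optional[float]]) -> List[float]:
    """Keep only members that actually carry a specified value."""
    return [value for value in data_set if value is not None]


values = present_values(age_data_set)
print(len(values), mean(values))
```

Missing members are excluded rather than counted as zero, so they do not distort the mean.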

References:

[1] The ability of an ICE to bear information depends on the coding scheme and medium it inheres in, hence it is a generically dependent continuant.

[2] https://github.com/HajoRijgersberg/OM

[3] http://qudt.org/

[4] https://www.w3.org/2001/sw/BestPractices/OEP/Time-Ontology

[5] https://www.w3.org/TR/owl-time/

[6] https://ciqual.anses.fr/cms/sites/default/files/inline-files/TableCiqual2017_XML_docENG.pdf