Proposal: Extensible Artifact Model #318
Labels
layer-api
An issue involving the vizier API layer
layer-python
An issue involving the Python compatibility code
layer-scala
An issue involving Scala compatibility code
Milestone
Challenge
Vizier's current data model is:
Proposal Summary
Checklist
Artifact: drop the existing Type and MIME columns, and replace them with a new 'artifact_type' based on Java-style namespaces. Keep existing types labeled as `info.vizierdb.legacy.[TYPE].[MIME/].[TYPE]`. At this stage, the type and mime_type methods can be re-implemented by parsing out the above notation.
Concrete Proposal
The core idea is to decouple the physical representation of an artifact from the ways in which user code interacts with it. This breaks down into four concepts:
Encoding: The physical representation of the artifact.
Interface: A conceptual 'role' that an artifact may play (e.g., Dataset, Image, or Integer), defined as a set of methods.
Implementation: Implementations of the methods of an Interface for a specific Encoding (or for an Interface).
Conversion: Code that translates one Encoding into another Encoding (or an Interface into an Encoding).
Encoding
At present, Vizier's representation of artifacts consists of a small, opaque blob of text data (typically JSON). These blobs are interpreted based on the specific type of artifact, but the interpretation is entirely unstructured and performed on read. There is no common structure to the artifacts. In particular, this makes things like reachability checks hard, since inter-artifact dependencies (e.g., a SQL query over existing tables) always need to be implemented ad hoc.
The first major goal is to define a schema definition language for Artifacts. The schema definition needs to capture:
Then, we define encodings for all of the existing artifact types, perhaps strengthening them somewhat (e.g., explicitly typed primitives, instead of generic parameters).
To emphasize the point, an encoding simply gives a name to the physical manifestation of the artifact, and dictates how it is stored in the database. This should be the minimum required to reproduce the artifact (see Artifact Caching below), and should disregard any data that is only needed for efficiency (e.g., store the URL of a file, but not the contents).
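As a rough illustration of why structured encodings help with reachability, here is a minimal Python sketch. All names in it (`ArtifactRef`, `CsvFileEncoding`, `SqlQueryEncoding`, `dependencies`, `reachable`) are hypothetical and not part of Vizier; the only assumption is that an encoding declares its references to other artifacts as typed fields, so that dependencies can be discovered generically instead of per artifact type.

```python
from dataclasses import dataclass, fields
from typing import Dict, Iterable, Iterator, Set, Tuple


@dataclass(frozen=True)
class ArtifactRef:
    """Hypothetical typed reference to another artifact, by identifier."""
    artifact_id: int


@dataclass(frozen=True)
class CsvFileEncoding:
    """Hypothetical encoding: stores the file's URL, not its contents."""
    url: str
    has_header: bool


@dataclass(frozen=True)
class SqlQueryEncoding:
    """Hypothetical encoding: a query plus explicit references to its inputs."""
    query: str
    inputs: Tuple[ArtifactRef, ...]


def dependencies(encoding) -> Iterator[ArtifactRef]:
    """Yield every ArtifactRef found in the fields of an encoding."""
    for f in fields(encoding):
        value = getattr(encoding, f.name)
        if isinstance(value, ArtifactRef):
            yield value
        elif isinstance(value, (tuple, list)):
            for item in value:
                if isinstance(item, ArtifactRef):
                    yield item


def reachable(root_ids: Iterable[int], store: Dict[int, object]) -> Set[int]:
    """Return the ids of all artifacts reachable from the given roots."""
    seen: Set[int] = set()
    stack = list(root_ids)
    while stack:
        artifact_id = stack.pop()
        if artifact_id in seen:
            continue
        seen.add(artifact_id)
        stack.extend(ref.artifact_id for ref in dependencies(store[artifact_id]))
    return seen
```

Because `dependencies` inspects only the declared fields, a new encoding would take part in reachability checks without any type-specific code, provided it uses `ArtifactRef` for its inputs.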
Some TODOs:
Interface
At present, Vizier uses ArtifactType and MIME types to distinguish the roles that an artifact can play. The Interface plays a similar role, by dictating a specific API to which an artifact can conform (i.e., governing how Vizier, its subsystems, and the user interact with it). Some examples include:
Some TODOs:
Implementation
(An Encoding -> Interface, or Interface -> Interface edge)
In order to decouple Encoding and Interface, we need a binding between the two. Somewhere in the code, we need to be able to define code that implements a specific interface for a specific encoding (e.g., how do I get the Spark dataframe for a CSV file, how do I get the Arrow dataframe, etc.).
Some TODOs:
Conversion
(An Encoding -> Encoding edge)
This is more or less the same as an implementation, except that it generates a new encoding (and consequently additional data).
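To make the Implementation and Conversion edges concrete, the following is a minimal Python sketch under stated assumptions. It uses two hypothetical encodings (`InlineCsvEncoding`, `RowListEncoding`), one hypothetical interface (`Dataset`), and plain dictionaries as registries; none of these names come from Vizier. The `resolve` helper first looks for a direct Encoding -> Interface implementation and otherwise tries a single Encoding -> Encoding conversion.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class InlineCsvEncoding:
    """Hypothetical encoding: raw CSV text."""
    text: str


@dataclass(frozen=True)
class RowListEncoding:
    """Hypothetical encoding: already-parsed rows."""
    rows: Tuple[Tuple[str, ...], ...]


class Dataset:
    """Hypothetical Interface: a role defined as a set of methods."""

    def rows(self) -> List[Tuple[str, ...]]:
        raise NotImplementedError


class RowListDataset(Dataset):
    """Implementation edge: Dataset for RowListEncoding."""

    def __init__(self, encoding: RowListEncoding):
        self._encoding = encoding

    def rows(self) -> List[Tuple[str, ...]]:
        return list(self._encoding.rows)


def csv_to_row_list(encoding: InlineCsvEncoding) -> RowListEncoding:
    """Conversion edge: InlineCsvEncoding -> RowListEncoding."""
    reader = csv.reader(io.StringIO(encoding.text))
    return RowListEncoding(tuple(tuple(row) for row in reader))


# Registries keyed by type; an assumption of this sketch, not Vizier's design.
IMPLEMENTATIONS: Dict[Tuple[type, type], Callable] = {
    (RowListEncoding, Dataset): RowListDataset,
}
CONVERSIONS: Dict[type, List[Callable]] = {
    InlineCsvEncoding: [csv_to_row_list],
}


def resolve(encoding, interface: type):
    """Find an implementation directly, or after one conversion step."""
    impl = IMPLEMENTATIONS.get((type(encoding), interface))
    if impl is not None:
        return impl(encoding)
    for convert in CONVERSIONS.get(type(encoding), []):
        converted = convert(encoding)
        impl = IMPLEMENTATIONS.get((type(converted), interface))
        if impl is not None:
            return impl(converted)
    raise LookupError(
        f"no {interface.__name__} implementation for {type(encoding).__name__}"
    )
```

The sketch deliberately stops at one conversion step; a fuller design would presumably search the conversion graph and decide whether the converted encoding should be stored.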
Platform Interactions
Generic artifacts necessitate decoupling Vizier from its target platforms, including Spark (but also Scala and Python). This means that we need a code component to translate an Encoding of an artifact into the platform-native equivalent. The natural approach here is to define a set of tiered fallbacks:
Artifact Caching
[more to come]