Skip to content

Data model and DB schema

Justin Littman edited this page Jan 9, 2023 · 1 revision

Schema Explanation

Schema Entity-Relationship Diagram

Schema Entity-Relationship Diagram

Please keep this up to date as the schema changes!

To generate the updated ER diagram

  • Make sure the db container is running, and that the development database has all migrations applied (bin/rails db:create db:migrate).
  • Run docker compose run --rm gen_er_diagram from the project root.
  • Commit the updated copy of db/schema_er_diagram.svg.
Key:
How to read the ER diagram (expand for explanation)
  • The diagram is generated by querying PostgreSQL directly, so the tables are listed by their SQL names. Each SQL table has a corresponding ActiveRecord class, whose name is a camel case version of the snake case SQL name (with one exception, explained below). The explanation tries to use the SQL table name or the ActiveRecord classname according to which is most appropriate, and may be arbitrary when either would suffice.

  • The edges drawn between tables should align so that they connect a foreign key column to the primary key column in the referent table.

  • The cross side of the connection between tables indictates a required foreign key relationship (ActiveRecord's belongs_to with a null: false on the column definition and possibly a corresponding presence: true on the ActiveRecord class definition). In our current data model, none of our foreign key reference fields may be null. Each such ActiveRecord object must point to exactly one instance of its foreign key referent.

    • E.g., a row in moab_record (retrievable as a MoabRecord ActiveRecord object) must point to (belong_to) exactly one row in preserved_objects (a PreservedObject).
  • The fork side of the connection between tables indicates a one to many relationship, (ActiveRecord's has_many).

    • E.g., a row in zipped_moab_versions (a ZippedMoabVersion) may have many corresponding rows in zip_parts (retrievable as ZipPart objects).
  • The "o" at the end of a connection indicates there may be zero of the referent (e.g. there may not be any zipped_moab_versions yet for a new zip_endpoints entry).

What do these table rows (ActiveRecord objects) represent in the "real" world? (a list of the ActiveRecord subclasses, and a (non-exhaustive) list of their fields)

  • A PreservedObject represents the master record for a druid, tying together its moab_record and zipped_moab_versions. It also holds some high level summary info.
    • druid is the digital resource unique identifier
    • current_version is current latest version we've seen for the druid.
  • A MoabRecord represents a physical copy of a PreservedObject, stored on premises as a Moab (containing all versions for the druid) and accessed via Ceph mount. There is only one instance of MoabRecord for a PreservedObject (= druid) across the storage roots. (Note: Ceph can be configured to store the one logical copy with some internal redundancy, without the need to expose that at the POSIX file system level.)
    • size: is approximate, and given in bytes. It's intended to be used for things like allocating storage. It should not be treated as an exact value for fixity checking.
    • status: a high-level summary of the copy's current state. This is just a reflection of what we last saw on disk, and does not capture history, nor does it necessarily enumerate all errors for a copy that needs remediation.
    • version: should be the same as PreservedObject current version.
  • A MoabStorageRoot represents a physical storage location (on premises) on which MoabRecords reside, e.g., a single Ceph or NFS mounted storage root. A MoabRecord can only live on one MoabStorageRoot at a time, while a single MoabStorageRoot can store many MoabRecords.
  • ZipEndpoint represents an endpoint to which the zipped_moab_version is (or is being) replicated.
    • endpoint_name: the human readable name of the endpoint (e.g. aws_s3_us_east_1)
    • delivery_class: the name of the class that does the delivery (e.g S3WestDeliveryJob)
    • endpoint_node: the network location of the endpoint relative to the preservation catalog instance (e.g. localhost for a locally mounted NFS volume, s3.us-east-2.amazonaws.com for a S3 bucket, etc).
    • storage_location: the bucket name (e.g. sul-sdr-aws-us-east-1-test)
  • ZippedMoabVersion corresponds to a Moab-Version archived as a zip file on a ZipEndpoint.
    • version: the version from the Moab for which this record tracks cloud archiving.
    • preserved_object_id: references the parent preserved object on disk.
    • zip_endpoint_id: the endpoint to which the Moab-Version has been replicated
  • ZipParts: We chunk archives of Moab versions into segments. This represents metadata for one such part (we currently segment every 10 GB, so most archive zips are only a single part).
    • size represents the size of the actual zip_part
    • zipped_moab_version_id references the parent Moab-Version on a ZipEndpoint. 99% of the time, we will have 1 ZippedMoabVersion to 1 ZipPart.
    • md5 represents the checksum used for checksum validation
    • create_info is a hash containing the zip command invoked to create the parts and the version of the command.
    • parts_count displays how many total zip parts were created during replication (for this version on this endpoint).
    • suffix if there is 1 ZipPart suffix will always be .zip, if there are more than 1 ZipPart the suffix will be .z01 through .z(n-1) (e.g. 3 parts will be ['.z01', '.z02', '.zip'])
    • status: displays whether the ZipPart has been replicated or not, whether there's an error with the cloud copy of the replicated part, etc.
    • last_existence_check: the last time we confirmed whether the part was still on cloud storage.
    • last_checksum_validation: the last time our stored checksum for the part was compared against the stored checksum on the cloud provider. Eventually we'd like this (or a separate but similar field) to denote an actual fixity check, where the copy is retrieved from the cloud and the checksum on the retrieved copy is compared to our stored checksum.

Other Terminology

  • An "online" copy is an exploded Moab folder structure (on a POSIX file system), on which we can run structural verification or checksum verification for constituent files, and from which we can retrieve individual assets of the Moab.
  • An "archive" copy corresponds to a Moab-Version on a ZipEndpoint. The format is a multipart zip upload.
  • "TTL" is an acronym for "time-to-live", or an expiry age. In the case of our archive_ttl and fixity_ttl values, it's the age beyond which we consider the last archive or fixity check result to be stale (in which case those checks should be re-run at the next scheduled opportunity).
Clone this wiki locally