-
Notifications
You must be signed in to change notification settings - Fork 2
Data model and DB schema
Please keep this up to date as the schema changes!
To generate the updated ER diagram
- Make sure the
db
container is running, and that thedevelopment
database has all migrations applied (bin/rails db:create db:migrate
). - Run
docker compose run --rm gen_er_diagram
from the project root. - Commit the updated copy of
db/schema_er_diagram.svg
.
How to read the ER diagram (expand for explanation)
-
The diagram is generated by querying PostgreSQL directly, so the tables are listed by their SQL names. Each SQL table has a corresponding ActiveRecord class, whose name is a camel case version of the snake case SQL name (with one exception, explained below). The explanation tries to use the SQL table name or the ActiveRecord classname according to which is most appropriate, and may be arbitrary when either would suffice.
-
The edges drawn between tables should align so that they connect a foreign key column to the primary key column in the referent table.
-
The cross side of the connection between tables indictates a required foreign key relationship (ActiveRecord's
belongs_to
with anull: false
on the column definition and possibly a correspondingpresence: true
on the ActiveRecord class definition). In our current data model, none of our foreign key reference fields may be null. Each such ActiveRecord object must point to exactly one instance of its foreign key referent.- E.g., a row in
moab_record
(retrievable as aMoabRecord
ActiveRecord object) must point to (belong_to
) exactly one row inpreserved_objects
(aPreservedObject
).
- E.g., a row in
-
The fork side of the connection between tables indicates a one to many relationship, (ActiveRecord's
has_many
).- E.g., a row in
zipped_moab_versions
(aZippedMoabVersion
) may have many corresponding rows inzip_parts
(retrievable asZipPart
objects).
- E.g., a row in
-
The "o" at the end of a connection indicates there may be zero of the referent (e.g. there may not be any
zipped_moab_versions
yet for a newzip_endpoints
entry).
What do these table rows (ActiveRecord objects) represent in the "real" world? (a list of the ActiveRecord subclasses, and a (non-exhaustive) list of their fields)
- A
PreservedObject
represents the master record for a druid, tying together itsmoab_record
andzipped_moab_versions
. It also holds some high level summary info.-
druid
is the digital resource unique identifier -
current_version
is current latest version we've seen for the druid.
-
- A
MoabRecord
represents a physical copy of aPreservedObject
, stored on premises as a Moab (containing all versions for the druid) and accessed via Ceph mount. There is only one instance ofMoabRecord
for aPreservedObject
(= druid) across the storage roots. (Note: Ceph can be configured to store the one logical copy with some internal redundancy, without the need to expose that at the POSIX file system level.)-
size
: is approximate, and given in bytes. It's intended to be used for things like allocating storage. It should not be treated as an exact value for fixity checking. -
status
: a high-level summary of the copy's current state. This is just a reflection of what we last saw on disk, and does not capture history, nor does it necessarily enumerate all errors for a copy that needs remediation. -
version
: should be the same asPreservedObject
current version.
-
- A
MoabStorageRoot
represents a physical storage location (on premises) on whichMoabRecords
reside, e.g., a single Ceph or NFS mounted storage root. AMoabRecord
can only live on oneMoabStorageRoot
at a time, while a singleMoabStorageRoot
can store manyMoabRecords.
-
ZipEndpoint
represents an endpoint to which thezipped_moab_version
is (or is being) replicated.-
endpoint_name
: the human readable name of the endpoint (e.g.aws_s3_us_east_1
) -
delivery_class
: the name of the class that does the delivery (e.gS3WestDeliveryJob
) -
endpoint_node
: the network location of the endpoint relative to the preservation catalog instance (e.g. localhost for a locally mounted NFS volume, s3.us-east-2.amazonaws.com for a S3 bucket, etc). -
storage_location
: the bucket name (e.g.sul-sdr-aws-us-east-1-test
)
-
-
ZippedMoabVersion
corresponds to a Moab-Version archived as a zip file on aZipEndpoint
.-
version
: the version from the Moab for which this record tracks cloud archiving. -
preserved_object_id
: references the parent preserved object on disk. -
zip_endpoint_id
: the endpoint to which the Moab-Version has been replicated
-
-
ZipParts
: We chunk archives of Moab versions into segments. This represents metadata for one such part (we currently segment every 10 GB, so most archive zips are only a single part).-
size
represents the size of the actualzip_part
-
zipped_moab_version_id
references the parent Moab-Version on aZipEndpoint
. 99% of the time, we will have 1ZippedMoabVersion
to 1ZipPart
. -
md5
represents the checksum used for checksum validation -
create_info
is a hash containing the zip command invoked to create the parts and the version of the command. -
parts_count
displays how many total zip parts were created during replication (for this version on this endpoint). -
suffix
if there is 1ZipPart
suffix will always be.zip
, if there are more than 1ZipPart
the suffix will be.z01
through.z(n-1)
(e.g. 3 parts will be ['.z01', '.z02', '.zip']) -
status
: displays whether theZipPart
has been replicated or not, whether there's an error with the cloud copy of the replicated part, etc. -
last_existence_check
: the last time we confirmed whether the part was still on cloud storage. -
last_checksum_validation
: the last time our stored checksum for the part was compared against the stored checksum on the cloud provider. Eventually we'd like this (or a separate but similar field) to denote an actual fixity check, where the copy is retrieved from the cloud and the checksum on the retrieved copy is compared to our stored checksum.
-
- An "online" copy is an exploded Moab folder structure (on a POSIX file system), on which we can run structural verification or checksum verification for constituent files, and from which we can retrieve individual assets of the Moab.
- An "archive" copy corresponds to a Moab-Version on a
ZipEndpoint
. The format is a multipart zip upload. - "TTL" is an acronym for "time-to-live", or an expiry age. In the case of our
archive_ttl
andfixity_ttl
values, it's the age beyond which we consider the last archive or fixity check result to be stale (in which case those checks should be re-run at the next scheduled opportunity).
- Replication errors
- Validate moab step fails during preservationIngestWF
- ZipmakerJob failures
- Moab Audit Failures
- Ceph Errors
- Job queues
- Deposit bag was missing
- ActiveRecord and Replication intro
- 2018 Work Cycle Documentation
- Fixing a stuck Moab
- Adding a new cloud provider
- Audits (how to run as needed)
- Extracting segmented zipfiles
- AWS credentials, S3 configuration
- Zip Creation
- Storage Migration Additional Information
- Useful ActiveRecord queries
- IO against Ceph backed preservation storage is hanging indefinitely (steps to address IO problems, and follow on cleanup)