# DB Migration Checklist

_Naomi Dushay edited this page Dec 16, 2022_
Suggestion: "edit" this page and copy the raw text into an issue comment on YOUR migration issue or PR to create a checklist specific to your migration.
- have a clean rollback path with the migration (e.g. the database migration can be rolled back, plus any additional steps needed)
- dry run on QA
- pause google books
- watch Honeybadger for errors
- stop pres robots, pres cat workers, pres cat web services
  - pres cat workers (from preservation_catalog project): `bundle exec cap qa sidekiq_systemd:stop`
  - pres robots workers (from preservation_robots project): `bundle exec cap qa resque:pool:stop`
  - pres cat web servers (`ksu`'ed to root from each VM): `service apache2 stop` <-- this is "just in case", as only nagios okcomputer should be querying prescat at this point
- take backup snapshot
  - from DB server, `ksu`'ed to root: `su - postgres -c "pg_dump -Fc pres > ~/useful_title_premigration_snapshot_$(date --iso-8601=seconds).sql"`
  - will result in a file at `~postgres/'useful_title_premigration_snapshot_2022-11-30T17:51:57-08:00.sql'`
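If it helps, you can preview the timestamped filename that the `pg_dump` redirect above will produce (a minimal sketch; `useful_title` is a placeholder for a name describing your migration):

```shell
# Sketch: build the snapshot filename the same way the pg_dump redirect does.
# "useful_title" is a placeholder; substitute a name describing your migration.
name="useful_title_premigration_snapshot_$(date --iso-8601=seconds).sql"
echo "$name"
```

Note that the seconds-resolution ISO 8601 timestamp contains colons, which is why the restore step below single-quotes the filename.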
- start pres cat web servers (`ksu`'ed to root from each VM): `service apache2 start`
- deploy migration with cap (`bundle exec cap qa deploy` from preservation_catalog, your-migration-branch)
- start pres robots (deployment should've restarted pres cat workers automatically)
  - start pres robots workers (from preservation_robots project): `bundle exec cap qa resque:pool:hot_swap`
- test the migration
  - run `preassembly_image_accessioning_spec.rb` and `preassembly_hfs_accessioning_spec.rb` integration tests: `SDR_ENV=qa bundle exec rspec spec/features/preassembly_image_accessioning_spec.rb spec/features/preassembly_hfs_accessioning_spec.rb`
  - note the druids
  - on the test objects created by that, run CV and replication audit, and manually examine the Moabs in the storage roots (just a quick listing to make sure they look right)
- now revert and test the rollback, too!
- stop pres robots, pres cat workers, pres cat web services
  - pres cat workers (from preservation_catalog project): `bundle exec cap qa sidekiq_systemd:stop`
  - pres robots workers (from preservation_robots project): `bundle exec cap qa resque:pool:stop`
  - pres cat web servers (`ksu`'ed to root from each VM): `service apache2 stop`
- restore the DB from snapshot
  - from a pres qa application server, drop the database
  - from DB server, `ksu`'ed to root: re-apply the puppet `production` branch so that the `pres` DB is properly recreated (see here)
  - from DB server, `ksu`'ed to root: `su - postgres -c "pg_restore -d pres ~postgres/'useful_title_premigration_snapshot_2022-11-30T17:51:57-08:00.sql'"`
- re-deploy `main`. capistrano may complain about passenger not running, but the deploy should otherwise be fine, and the symlink to that latest deployment should remain in place.
- if the deployment didn't do so, re-start pres cat web servers (`ksu`'ed to root from each VM): `service apache2 start`
- if the deployment didn't do so, start pres cat workers: `bundle exec cap qa sidekiq_systemd:restart` from preservation_catalog project.
- start pres robots workers (`bundle exec cap qa resque:pool:hot_swap` from preservation_robots project)
- start google books again
- run `CatalogUtils.check_existence_for_druid` on each of the two test druids from above. both should be added to the catalog (yes, after the db:rollback).
- run the same two integration tests against QA and make sure they pass
- make sure the diff from the release tag to `main` (e.g. https://github.com/sul-dlss/preservation_catalog/compare/rel-2022-11-28...main) looks deployable to prod
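The same "what would ship to prod" check can be done locally with `git log` on the tag-to-main range. Below is a self-contained sketch using a throwaway repo (in the real check you'd run the final `git log` in your preservation_catalog clone against the actual weekly release tag; the tag name and commit messages here are illustrative):

```shell
# Sketch: list commits that would ship to prod, i.e. everything on main
# that is not reachable from the weekly release tag.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -qb main
git config user.email "you@example.com"
git config user.name "you"
echo a > f.txt
git add f.txt
git commit -qm "weekly release"
git tag rel-2022-11-28          # stand-in for the real weekly release tag
echo b >> f.txt
git commit -qam "migration: add checksum column"
# Everything listed here is what the prod deploy would ship:
git log --oneline rel-2022-11-28..main
```

This is the command-line equivalent of the GitHub compare URL above.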
- Deploy to Prod
- rebase branch as needed
- coordinate with ops (at a minimum to let them know; you may want to request their help for the db backup)
- merge PR to `main`
- pause google books
- watch Honeybadger for errors
- stop pres robots, pres cat workers, pres cat web services
  - pres cat workers (from preservation_catalog project): `bundle exec cap prod sidekiq_systemd:stop`
  - pres robots workers (from preservation_robots project): `bundle exec cap prod resque:pool:stop`
  - pres cat web servers (`ksu`'ed to root from each VM): `service apache2 stop`
- take backup snapshot
  - from DB server, `ksu`'ed to root: `su - postgres -c "pg_dump -Fc pres > ~/useful_title_premigration_snapshot_$(date --iso-8601=seconds).sql"`
  - the file should be at `~postgres/'useful_title_premigration_snapshot_2022-12-02T08:09:28-08:00.sql'` (the dump runs as the postgres user, so `~` resolves to postgres's home directory)
- start pres cat web servers (`ksu`'ed to root from each VM): `service apache2 start`
- deploy migration (`bundle exec cap prod deploy` from preservation_catalog, `your-migration` branch (or main)); if no error, proceed; if error, restore DB from snapshot and abort
- verify prescat workers restarted with deploy https://preservation-catalog-web-prod-01.stanford.edu/queues/busy
- start pres robots (from preservation_robots project): `bundle exec cap prod resque:pool:hot_swap`
- start google books again
- create a new version on a test object in prod (search on "test object" in argo).
- run checksum validation for this object from rails console
- run replication audit on this object from rails console. If you want to be super thorough about investigating replication success, you can use the code snippets in this wiki entry to check the cloud manually/synchronously and to get relevant query results.
- manually examine the Moabs in the storage roots (just a quick listing to make sure they look right).
- if no problems detected, all good, proceed.
- if problems
  - consider aborting: shutting down gbooks/presbots/prescat, restoring DB from snapshot, re-deploying the weekly release tag, and running `CatalogUtils.check_existence_for_druid` on the recently updated druid. if you do that, turn this into a real sub-checklist first to keep track of things.
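For reference, such an abort sub-checklist (raw markdown to paste into an issue comment) might look like the following sketch, built from the steps named above:

```markdown
- [ ] shut down gbooks/presbots/prescat
- [ ] restore DB from snapshot
- [ ] re-deploy the weekly release tag
- [ ] run `CatalogUtils.check_existence_for_druid` on the recently updated druid
```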
- If prod deploy succeeded, deploy `main` to QA and Stage
  - QA
  - stage