-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
♻️ REFACTOR: New archive format #5145
Conversation
Thanks a lot, this looks great! (Memory, time, disk usage, and also the possibility to query inside!) I didn't check the implementation, but the logic you describe above seems sound to me.
|
Thanks @giovannipizzi
Yes indeed, the export schema will be "identical" to the AiiDA schema (with a few differences in exactly how fields/columns are stored, i.e. JSONB/JSON and UUID/CHAR(36). I don't think this "degrades" the abstraction anymore that it already is for the archive/QueryBuilder (see e.g. the discussion in #5088 (comment)), i.e. we assume:
Note, the parity PR does not change anything regarding the above two assumptions; it just changes some indexes on the database, converts a STRING to a TEXT type, and adds more non-nullable constraints |
Hmm, couldn't one just use one of the Docker images, if direct installation is not possible? It wouldn't be the end of the world to create this extra python package, just extra work and I'm lazy lol |
Some quick question arising from writing the importer:
|
Maybe this is not what you meant, but just in case I wasn't clear, I'm not worried of the type change (e.g. UUID -> string). More that we might want to store in the DB things in a format that might not mirror 100% what's in the DB. But I think @sphuber raised a similar concern for the QueryBuilder in a different issue, I think - as long as we have the same abstraction as the QB, I think we're fine.
Unfortunately we only store images on DockerHub, that now deletes images not used for 6 months. So unfortunately this means one has to rebuild images from Dockerfiles in the future, most probably, and this has the same issues of dependencies (e.g. not pinned in a dependency of a dependency) that I was mentioning :-(
Yes - the idea is that a 'configured' computer means a computer with an authinfo entry (for the current user). So they just show up as unconfigured. This is the intended behaviour I had in mind when this was originally implemented; you want to have a reference to the computer, but in most cases you are giving it to someone else (or reimporting on a different machine, possibly with a different type of configuration, e.g. with a proxy, or once you install AiiDA on the same computer so it's a localhost/local transport, on others is via SSH). Or maybe it's the same computer but another user is importing it, so they want to use a different user name to login. And, indeed, there might be security issues - even if this is the case currently, as we shouldn't store passwords or other credentials, but in the future (or some transport plugins) might decide (incorrectly) to do so. So - I think by default we should continue not exporting/importing it. |
I would point you towards #5154, for this more abstract (pun intended) discussion
Well if that’s the case, then I would suggest we should start thinking about storing the images that relate to specific versions, in somewhere controlled by us, e.g. the CSCS openstack. Perhaps, for future proofing, they should even be stored in the OCI image format (although I think Docker is anyway compliant with this) |
In e8f337d, I've added now a good chunk of the import code. |
Oh I'm already thinking towards this 😉 |
Ok, there are numerous |
Codecov Report
@@ Coverage Diff @@
## develop #5145 +/- ##
===========================================
+ Coverage 81.27% 81.30% +0.04%
===========================================
Files 534 529 -5
Lines 37423 37031 -392
===========================================
- Hits 30410 30104 -306
+ Misses 7013 6927 -86
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
This is essentially an addition to aiidateam#5156 and required aiidateam#5145 Without the "optimised" use of `Container.get_objects_stream_and_meta` for the `DiskObjectStoreRepositoryBackend`, the profiled archive creation in aiidateam#5145 goes from 4 minutes to 9 minutes!
Wahoo, the archive now works (just about) as a full backend: https://aiida-archive-demo.readthedocs.io |
Also, if you look in the code in |
cd191af
to
84406d7
Compare
84406d7
to
b1274e8
Compare
Link queries return `LinkQuadruple`
This will allow the code to be re-used with the new `verdi archive` commands
f8f4be2
to
8b3715d
Compare
Implement the new archive format, as discussed in `aiidateam/AEP/005_exportformat`. To address shortcomings in cpu/memory performance for export/import, the archive format has been re-designed. In particular, 1. The `data.json` has been replaced with an sqlite database, using the saeme schema as the sqlabackend, meaning it is no longer required to be fully read into memory. 2. The archive utilises the repository redesign, with binary files stored by hashkeys (removing de-duplication) 3. The archive is only saved as zip (not tar), meaning internal files can be decompressed+streamed independantly, without the need to uncompress the entire archive file. 4. The archive is implemented as a full (read-only) backend, meaning it can be queried without the need to import to a profile. Additionally, the entire export/import code has been re-written to utilise these changes. These changes have reduced the export times by ~250%, export peak RAM by ~400%, import times by ~400%, and import peak RAM by ~500%. The changes also allow for future push/pull mechanisms.
Add option to not create import group of imported nodes
`ProcessNode` checkpoint attributes (for running processes) contain serialized code and so can cause security issues, therefore they are now stripped by default. `create_archive` anyhow only allows export of sealed `ProcessNode`, i.e. ones for finalised processes which no longer require the checkpoints.
The `sphinx-sqlalchemy` extension is used to dynamically generate documentation for the database schema.
`BackendEntityAttributesMixin` leaked implementation specific variables and functions. This has now been moved to the `DjangoNode`/`SqlaNode` classes.
`BackendEntityExtrasMixin` leaked implementation specific variables and functions. This has now been moved to the `DjangoNode`/`SqlaNode` classes.
067e338
to
b1a7c46
Compare
PR to implement the new archive format, as discussed in https://github.com/aiidateam/AEP/tree/master/005_exportformat
Note, all the new code is added to
aiida/tools/archive
, as opposed to the currentaiida/tools/importexport
,since I think this is logically the better structure.
Abstraction
The archive format is now fully abstracted into the
ArchiveFormatAbstract
inaiida/tools/archive/abstract.py
, with the sqlite implementation inaiida/tools/archive/implementations/sqlite
.tests/tools/importexport/test_abstract.py
provides a good overview of the abstraction capabilities, but essentially you can open the archive in any of 'r', 'x', 'w', 'a' modes:The abstraction is designed to closely resemble that of the "main"
Backend
, with methods such asbulk_insert
,put_object
,list_objects
,open_object
anddelete_object
, and aQueryBuilder
implementation.Eventually, it could be envisaged that the archive can then become a "proper" backend, and export/import functionality will be merged into simply a
transfer_data(in_backend, out_backend, ...)
type method.The 'a' (append) mode allows for modifying the archive and so, it is intended, will eventually allow for "pushes" to the archive identical to archive imports (but in reverse).
Structure for sqlite-zip format
JSONB
->JSON
, andUUID
->String
)QueryBuilder
implementation! (with some restrictions)(see also https://en.wikipedia.org/wiki/ZIP_(file_format)#/media/File:ZIP-64_Internal_Layout.svg)
Archive creation (export)
The archive creation now progresses in a much more "logical" manner:
--test-run
CLI option), break here and print the entity counts--batch-size
zipfile
. This can be controlled in the CLI with the--compression
optionmetadata.json
file.metadata.json
is not compressed, for quicker accessYou can now also import all entities, with
verdi archive create --all
For profiling, I have created https://github.com/chrisjsewell/process-plot, and tested against the 2d-database on Materials Cloud (TODO add URL)
You can see below that, compared to the existing archive, it is twice as fast (mainly because of disk-objectecstore) and ~5 times less memory usage (which will also scale less with more nodes)!
version 1.6.5:
(see also #4534)
current develop (seb's temporary format)
new format
Key factors for memory usage:
(group_id, node_id)
LinkQuadruple
itemsZipInfo
(which is written last to the zip file central directory on context exit)I've played around with using Python's
shelve
module to do this, but run into https://stackoverflow.com/questions/37548943/shelve-raises-out-of-overflow-pagesComparison of compression levels:
data.json
: 510 Mbtimes to run QB context (see below)
Archive Inspect (CLI)
Default (fast):
With database statistics (
-d/--database
):Archive querying (API)
Use a
QueryBuilder
implementation, without importing the archive!The contextmanager extracts the sqlite database to a temporary directory, then starts an SQLAlchemy session, and loads this in to a
BackendQueryBuilder
implementation.Note, at least currently:
NotImplementedError
, i.e. should be clear to user why it failedcontains
,has_key
,of_length
,longer
,shorter
,of_type
Archive migration
For migrations from "legacy" archives, it progresses as such:
repository_metadata
)Naturally the memory usage is the size of the
data.json
BUT importantly, if the old archive is in ZIP format (not tar), the repository files are never extracted to disk (they are all streamed between the old archive and new archive) so the disk usage will be much lower, and also the migration is a lot quicker.
(For TAR, it is a lot slower to randomly access files, so we need to first extract the entire archive)
IMPORTANT migrations of legacy versions '0.1', '0.2', and '0.3' are no longer supported.
This is because '0.3' -> '0.4' requires manipulation of the repository files, and so would require extracting the entire old archive to disk.
I think this is an acceptable compromise, and obviously you could still use aiida-core v1 to first migrate up from these very old versions.
For future migrations, we will use alembic to migrate the sqlite DB, with migrations that almost identically mirror those of the main database (and only the database will be extracted to disk).
Archive import
Importantly, the import is now "backend acnostic", i.e. there is no longer separate code for django and sqlalchemy (there is now
bulk_insert
/bulk_update
methods on the backend).The repository files are never extracted to a temp folder, they are directly streamed to the backend container.
The code has also been improved, so that it has a more of a "logical flow", and you can also use the
--test-run
flag, to bail out before repo files are imported and the transaction is committed.The time is down from ~19 minutes to ~5 minutes, and the memory usage down from >1.5 Gb to 300 Mb!
v1.65
New