-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
suggestion: consider using record markers in the packs #124
Comments
Pinning @giovannipizzi @chrisjsewell |
Summary of the dicussion: Problem with the existing PR #125
Suggestions for the way foward:
Some notes on ZIP format: It would be ideal if the sealed pack is actually a ZIP files - this would make it very robust and accessible to wide avalaible softwares that can read ZIP. The ZIP file have the following format: Note that there present a both a LOCAL HEADER and a central directory in the end. When writing the pack file we may include the LOCAL HEADER, and the data descriptor since size and checksums are not know at the time the header is written. The header size may be something of concern? The LOCAL and the centre directory header for each record can approach 80 bytes - we should have some idea of what the average file sizes stored in the AiiDA usually is. We can write the packs in a ZIP compatible format, without the central directory, which is only added at the end when the pack is "sealed". The latter opertions should be cheap, and there is no needed to update the offsets in the SQLite database. |
Just want to say that I am working on this.... |
Cool, thanks! :-) |
It would be also useful to check if one can avoid adding the local header for each file that will take a lot of space for many small objects. E.g. looking into alternative files? It's not clear to me e.g. if 7z requires them, see https://py7zr.readthedocs.io/en/latest/archive_format.html - if it does not, it might be a good idea to use that format instead of zip? -> we can always add the very first header when creating a new pack, that is easy, but the local headers are a bit more annoying |
I had a look of the specification and I think it is a bit complex than ZIP. 7z does not require the local header because its main advantage is to be able to have a single "packed stream". e.g. solid compression to achieve better compression ratios, instead of ZIP compressing each file individually. Anyhow I think having the local header is a good idea as it does add more robustness in case that the end of the file central directory is damaged. There are tools that can try to "repair" damaged zip files. At the moment, my rough estimate of the local header is aound 82 bytes (30 bytes + 20 bytes for ZIP64 + 32 bytes for filename (part of the hash, to be added when sealling) ) . The same information is repeated in the end of the file central directory. So if we switch to a format that does not have local headers, then the maximum space save is just about 50%. |
One thing with ZIP is that the CRC check sum needs to be in the header. Since CRC is only avalaible after the data is written to the pack, it has to be stored somewhere and reused to seal the pack "sealed" - making a fully functioning ZIP file. One option is to keep it in the data descritpor file, an optional 20 bytes tailing marker. But I think we can get rid it and save the space, since the actual marker needs to be updated anyway. I am wondering how difficult it would be to add a column to the SQLite table Now the header size is 30 bytes (mandatoray) + 20 bytes (ZIP64 support for files larger than 4GB) + 16 bytes (filename from part of the hash) = 66 bytes. Would this be someting acceptable in exchange to have pack files compatible as ZIP archives? @giovannipizzi @chrisjsewell |
I don't think we should compute CRC on the fly or change the SQLite table. |
Yes that would be a good way to do it. One thing to note is that the CRC is calculated for the uncompressed data, so if compression is used, the data needs to be uncompressed first. Assumming a thoughpout of 100MB per second, a 6GB file would take at least one minute to seal. |
Alternatively we can just store the CRC computed on-the-fly tailing the record, e.g. taking 4 bytes space. During sealing we write it back in the local header. The resulting file should still be ZIP compatible. |
I would be tempted to say that, while sealing, it wouldn't be a bad idea to validate the content recomputing the hashes (sha256), what do you think? (we already have a function for that). If we agree on this, then also computing the CRC is a little additional cost? (and we could modify the validate function to also compute the CRC on the fly?) |
good point, I agree. Let's have the sealing operation also validate the hash and compute CRC - both can be done in the same time. The sealling process does not bring any down time - since the actual contents of the file remain unchanged. Hence, other process may access it for reading. |
I have a suggestion for improving the long-temr resilience for the pack files - we should consider including record markers in the pack file (e.g. similar to the Fortran sequential file).
The main advantage is that this would allow extraction of the data without the SQLite database (the index) at all. For example, in the case of disk storage failure rendering the latter corrupted.
My understanding is that the current implementation simply concatenate the pack file with the stream, and record the offset and lengths in the SQLite database. Instead, a slight modification can be done when writing the stream, something like this:
In Fortran, N is actually compiler dependent, with N=4 a maximum record length of about 2GB but there is support for "sub-records", but this is mainly for backward compatibility with F77. One can also use a 8 byte marker so the maximum size can be greatly incrased, but this means some storage overheads would be involved. The sign of the integer may be used as a flag for compression or not, or one can include dedicicate bytes etc.
Now, I realised that a BIG difference here (compard to Fortran WRITE) is that the size of the object may be unkown before writing it, as it can just be stream when directly writting data to the pack. But we can just omit step one above, or make it a fixed value (for integrity check). and to reconstruct the index, we read the pack file backwards and find each object using the markers.
This would allow reconstruction of the SQLite index purely from the pack file. One first read the marker, then use the length to read the content (object), followed by the ending marker as a verification (optional). Then the hash can be computed for this object (uncompress it first if needed), and the offest, size, length, can be recorded into the new SQLite database.
I think the only part that would need to be updated in the code is to include the record mark(s) in the (
_write_data_to_packfile_
):disk-objectstore/disk_objectstore/container.py
Lines 1245 to 1251 in 7a09ea2
and also the
repack
methods.This approach should maintain full backward compatibility as the precise size, length and offsets are stored in the SQLite database, it is just that the old pack format would not be supported for reconstruction without the SQLite database.
Reference:
restic
backup software that allows index reconstruction (we don't need to be as complicated as this, as they also need the data to be encrypted): https://restic.readthedocs.io/en/stable/100_references.html#pack-formatThe text was updated successfully, but these errors were encountered: