Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

suggestion: consider using record markers in the packs #124

Open
zhubonan opened this issue Nov 7, 2021 · 12 comments
Open

suggestion: consider using record markers in the packs #124

zhubonan opened this issue Nov 7, 2021 · 12 comments

Comments

@zhubonan
Copy link
Collaborator

zhubonan commented Nov 7, 2021

I have a suggestion for improving the long-temr resilience for the pack files - we should consider including record markers in the pack file (e.g. similar to the Fortran sequential file).

The main advantage is that this would allow extraction of the data without the SQLite database (the index) at all. For example, in the case of disk storage failure rendering the latter corrupted.

My understanding is that the current implementation simply concatenate the pack file with the stream, and record the offset and lengths in the SQLite database. Instead, a slight modification can be done when writing the stream, something like this:

  1. Write a N bytes integer marker whose value is the record length
  2. Write the actual data
  3. Write a N bytes integer marker repeated - indicating the end of the record
  4. Record the offset and the length of the object taking account of the record sizes.

In Fortran, N is actually compiler dependent, with N=4 a maximum record length of about 2GB but there is support for "sub-records", but this is mainly for backward compatibility with F77. One can also use a 8 byte marker so the maximum size can be greatly incrased, but this means some storage overheads would be involved. The sign of the integer may be used as a flag for compression or not, or one can include dedicicate bytes etc.

Now, I realised that a BIG difference here (compard to Fortran WRITE) is that the size of the object may be unkown before writing it, as it can just be stream when directly writting data to the pack. But we can just omit step one above, or make it a fixed value (for integrity check). and to reconstruct the index, we read the pack file backwards and find each object using the markers.

This would allow reconstruction of the SQLite index purely from the pack file. One first read the marker, then use the length to read the content (object), followed by the ending marker as a verification (optional). Then the hash can be computed for this object (uncompress it first if needed), and the offest, size, length, can be recorded into the new SQLite database.

I think the only part that would need to be updated in the code is to include the record mark(s) in the (_write_data_to_packfile_):

def _write_data_to_packfile(
self,
pack_handle: StreamWriteBytesType,
read_handle: StreamReadBytesType,
compress: bool,
hash_type: Optional[str] = None,
) -> Union[Tuple[int, None], Tuple[int, str]]:

and also the repack methods.

This approach should maintain full backward compatibility as the precise size, length and offsets are stored in the SQLite database, it is just that the old pack format would not be supported for reconstruction without the SQLite database.

Reference:

@zhubonan
Copy link
Collaborator Author

Pinning @giovannipizzi @chrisjsewell

@zhubonan
Copy link
Collaborator Author

zhubonan commented Dec 7, 2021

Summary of the dicussion:

Problem with the existing PR #125

  • The current implementation in Include record marker in the pack file #125 is not robust enough. In particular, if the file is truncated the or the process dies while the pack file is been written, using the tailing markers will not be able to recover all file records.
  • Using preceding markers of each record may help but: 1. we do not usually know the file size in the first place; 2. updating the markers can be dangerous, and is possible with the appending mode ; 3. it is still prone to truncation, or "hole" in the pack file.

Suggestions for the way foward:

  • Another way is to add a collection of markers at the end of the pack file, when the file is "sealed", e.g. no longer accepting new inputs. The markers can just contain the offsets stored in the SQLite database.
  • Such "sealed" pack file would have a data structure looking very similar to ZIP file.
  • We should investigate whether it makes sense to just make the "sealed" packes actually the ZIP files.
  • This means that "unsealed" packs' indices cannot be reconstructed on their own, but there should be only a few of them in the data repository.

Some notes on ZIP format:

It would be ideal if the sealed pack is actually a ZIP files - this would make it very robust and accessible to wide avalaible softwares that can read ZIP.

The ZIP file have the following format:

image

Note that there present a both a LOCAL HEADER and a central directory in the end. When writing the pack file we may include the LOCAL HEADER, and the data descriptor since size and checksums are not know at the time the header is written. The header size may be something of concern? The LOCAL and the centre directory header for each record can approach 80 bytes - we should have some idea of what the average file sizes stored in the AiiDA usually is.

We can write the packs in a ZIP compatible format, without the central directory, which is only added at the end when the pack is "sealed". The latter opertions should be cheap, and there is no needed to update the offsets in the SQLite database.

@zhubonan
Copy link
Collaborator Author

zhubonan commented Dec 9, 2021

Just want to say that I am working on this....

@giovannipizzi
Copy link
Member

Cool, thanks! :-)
Probably it would be good to already introduce a table of packs, which contains the state of the pack (if missing from the table = current state, editable), and contains for this PR the state Sealed, and can also change to archived when working on #123). It might be good, after sealing, to also add a column with the full pack hash (same algorithm as the container itself?) that can be used for additional validation.

@giovannipizzi
Copy link
Member

It would be also useful to check if one can avoid adding the local header for each file that will take a lot of space for many small objects. E.g. looking into alternative files? It's not clear to me e.g. if 7z requires them, see https://py7zr.readthedocs.io/en/latest/archive_format.html - if it does not, it might be a good idea to use that format instead of zip? -> we can always add the very first header when creating a new pack, that is easy, but the local headers are a bit more annoying

@zhubonan
Copy link
Collaborator Author

zhubonan commented Dec 10, 2021

I had a look of the specification and I think it is a bit complex than ZIP. 7z does not require the local header because its main advantage is to be able to have a single "packed stream". e.g. solid compression to achieve better compression ratios, instead of ZIP compressing each file individually.

Anyhow I think having the local header is a good idea as it does add more robustness in case that the end of the file central directory is damaged. There are tools that can try to "repair" damaged zip files. At the moment, my rough estimate of the local header is aound 82 bytes (30 bytes + 20 bytes for ZIP64 + 32 bytes for filename (part of the hash, to be added when sealling) ) . The same information is repeated in the end of the file central directory. So if we switch to a format that does not have local headers, then the maximum space save is just about 50%.

@zhubonan
Copy link
Collaborator Author

One thing with ZIP is that the CRC check sum needs to be in the header. Since CRC is only avalaible after the data is written to the pack, it has to be stored somewhere and reused to seal the pack "sealed" - making a fully functioning ZIP file. One option is to keep it in the data descritpor file, an optional 20 bytes tailing marker. But I think we can get rid it and save the space, since the actual marker needs to be updated anyway.

I am wondering how difficult it would be to add a column to the SQLite table Obj. Ideally the CRC can be stored there. But I remember migrating SQLA might not be straightforward. If CRC is not avalaible, we can still get it by reading the data while sealing the pack file, but it can make the process much slower.

Now the header size is 30 bytes (mandatoray) + 20 bytes (ZIP64 support for files larger than 4GB) + 16 bytes (filename from part of the hash) = 66 bytes.

Would this be someting acceptable in exchange to have pack files compatible as ZIP archives? @giovannipizzi @chrisjsewell

@giovannipizzi
Copy link
Member

I don't think we should compute CRC on the fly or change the SQLite table.
I suggest we instead leave the correct number of bytes empty (e.g. set to 0x00 0x00 0x00...) before each section (or some other temporary marker) that is essentially ignored during normal usage; and during the sealing procedure we compute the CRC on the fly and other things that are needed, write it in the right place (so no need to rewrite the file), and seal the file (I tested on my laptop and CRC32 on ~6GB takes less than 3 seconds; I didn't check the cost of seeking and writing a lot of short byte sequences in an existing file though, but I imagine it's not impossibly slow, but I would guess slower than computing the CRC).
Then one just appends the ZIP final central directory and one is done?

@zhubonan
Copy link
Collaborator Author

Yes that would be a good way to do it. One thing to note is that the CRC is calculated for the uncompressed data, so if compression is used, the data needs to be uncompressed first. Assumming a thoughpout of 100MB per second, a 6GB file would take at least one minute to seal.

@zhubonan
Copy link
Collaborator Author

Alternatively we can just store the CRC computed on-the-fly tailing the record, e.g. taking 4 bytes space. During sealing we write it back in the local header. The resulting file should still be ZIP compatible.

@giovannipizzi
Copy link
Member

I would be tempted to say that, while sealing, it wouldn't be a bad idea to validate the content recomputing the hashes (sha256), what do you think? (we already have a function for that). If we agree on this, then also computing the CRC is a little additional cost? (and we could modify the validate function to also compute the CRC on the fly?)

@zhubonan
Copy link
Collaborator Author

good point, I agree. Let's have the sealing operation also validate the hash and compute CRC - both can be done in the same time. The sealling process does not bring any down time - since the actual contents of the file remain unchanged. Hence, other process may access it for reading.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants