Data Compression #20
Comments
LZ4 is actually a dependency of OHC, a modified version of which we use for managing the in-memory index. We don't actually need LZ4 and should probably remove it. Are you proposing compressing individual keys and values before they are written to disk? If you mean compressing bigger blocks of data within HaloDB, say at the block level, then that is probably a big change.
What about compressing the .index and tombstone files? If I understand correctly, these are not actively read by queries, only sequentially during compaction and initialization. Index data is a large space overhead versus other stores I'm testing, since the key data is in both the index and the data file. Compressing the index with, say, Zstandard at a low-ish level (which can compress in the 300MB/sec range and decompress in the 100MB/sec+ range) would likely be both a performance win and a space saving. If I'm wrong, and index / tombstone files are accessed randomly, then it would probably be too big a change to be worth it. As for compressing values implicitly, the problem with making the client responsible is that it is much harder to use something like zstd dictionary compression. HaloDB can sample values during compaction and generate a dictionary on the fly that is optimal for the data being compacted. The user has no way to know when compaction is happening.
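To make the dictionary idea concrete, here is a minimal sketch of dictionary-primed compression of a single small value. It uses the JDK's `Deflater.setDictionary` purely as a stand-in for zstd dictionary compression, since zstd is not in the standard library. The class name, method names, and the idea of building the dictionary from sampled values are illustrative assumptions, not HaloDB code.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper; DEFLATE preset dictionaries stand in for zstd dictionaries.
final class DictionaryCompressionSketch {

    // Compress one value; a non-null dictionary primes the compressor with shared content.
    static byte[] compress(byte[] value, byte[] dictionary) {
        Deflater deflater = new Deflater();
        if (dictionary != null) {
            deflater.setDictionary(dictionary);
        }
        deflater.setInput(value);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress; the same dictionary must be supplied if one was used to compress.
    static byte[] decompress(byte[] compressed, byte[] dictionary) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        if (dictionary == null) {
                            throw new IllegalArgumentException("dictionary required");
                        }
                        inflater.setDictionary(dictionary);
                    } else if (inflater.needsInput()) {
                        throw new IllegalArgumentException("truncated input");
                    }
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt input", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Small values that share structure with the dictionary should compress noticeably better than they do on their own. The catch is the one raised above: only the store knows when compaction runs, so only the store is well placed to sample values and rotate dictionaries.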
Compared to data files (1 GB per file with the default config), index and tombstone files are not a major factor in disk space consumption. To maximize the performance of opening the DB, we would recommend not compressing index and tombstone files.
Ok, well I've got one data set right here with 870MB of index files and 904MB of data files. I have several data sets where the index file is close to 1/3 the size of the data, e.g. mapping UUID keys to UUID values.
I would be surprised if it changed the open time by more than 10%, and not surprised if it was faster when the disk is busy. Roughly speaking, the compressed size will be between 25% and 65% of the original index size with Zstandard, with proportionately less I/O. CPU use when decompressing is low; a single CPU should decompress at close to 1GB/sec (output speed). I just tested Zstandard compression on two of my index files from different data sets. One compressed to 39% of the original size, the other to 29%. Decompressing ran at close to 1000MB/sec on an old processor. In some of my larger data sets, compressing indexes at a 3:1 ratio would save me between 10% and 33% of disk space.
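For anyone who wants to try a similar measurement without extra dependencies, here is a rough sketch that compresses index-like data as one sequential stream and checks the round trip. It uses the JDK's gzip streams as a stand-in for Zstandard, so ratios and speeds will differ from the figures quoted above. The synthetic entry layout (a mostly zero 22-byte header plus a random 16-byte key) is a made-up placeholder, not the real index file format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical measurement helper; gzip stands in for Zstandard.
final class IndexCompressionSketch {

    // Build synthetic entries: a placeholder 22-byte header followed by a random 16-byte key.
    static byte[] syntheticIndex(int entries, long seed) {
        Random random = new Random(seed);
        ByteBuffer buffer = ByteBuffer.allocate(entries * (22 + 16));
        byte[] key = new byte[16];
        for (int i = 0; i < entries; i++) {
            buffer.put(new byte[14]); // placeholder header fields, all zero
            buffer.putLong(i);        // placeholder sequence number
            random.nextBytes(key);    // UUID-like key, effectively incompressible
            buffer.put(key);
        }
        return buffer.array();
    }

    // Compress the whole buffer as a single sequential stream.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompress sequentially, as a store would when scanning the index at open time.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Comparing `gzip(raw).length` with `raw.length` gives the compression ratio for the synthetic data. Real index files would need to be read from disk and timed to say anything about open time.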
@scottcarey Thank you for the valuable feedback. Forgive me if my previous statement was not accurate. What I meant is the absolute disk space consumed by index and tombstone files. In the index file, each entry's length is the header length (22 bytes) plus the key length (which varies from 1 to 127 bytes). In our production environment, the key length is 8 bytes, and each DB instance stores around 400 million records. The index file size is therefore (22+8) * 400M = 12GB, which I think is not a big deal for servers nowadays. Specifically for your case, what I can say is that each DB engine has its tradeoffs. We made these tradeoffs based on our use case; in other words, it may not be suitable for yours. Thanks again for your input.
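The arithmetic above can be sketched as a small helper. The 22-byte header and the 8-byte, 400-million-record figures come from the comment; the class and method names are hypothetical.

```java
// Hypothetical estimator for total index file size, based on the figures quoted above.
final class IndexSizeEstimate {

    // Per-entry header length stated in the comment above.
    static final int INDEX_ENTRY_HEADER_BYTES = 22;

    // Total index bytes = (header + key length) * number of records.
    static long estimateIndexBytes(int keyLengthBytes, long recordCount) {
        return (long) (INDEX_ENTRY_HEADER_BYTES + keyLengthBytes) * recordCount;
    }
}
```

For example, `estimateIndexBytes(8, 400_000_000L)` gives 12,000,000,000 bytes, matching the 12GB figure. With 16-byte UUID keys the same record count gives 15,200,000,000 bytes.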
Would you accept a new feature that helps other use cases if it doesn't help yours that much? FWIW, HaloDB is killing every other DB engine except in:
The first two seem fairly solvable without radical changes.
You are welcome to contribute. Could you make the change to support compression of index and tombstone files? We can help with code review. For this change, please add an option that controls whether compression is used, defaulting to no compression. About key length: we limit it to 127 bytes because we want to minimize the memory usage of the index. Currently, 1 byte is reserved for the key size in IndexFileEntry.java. It is the client's responsibility to keep the key length within this limit.
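On the client side, the key length limit can be enforced with a small guard like the one below. This is a sketch: the class and method names are hypothetical, and it assumes the 127-byte ceiling comes from storing the size in a signed Java `byte`, which the comment above implies but does not state.

```java
// Hypothetical client-side guard; not part of HaloDB.
final class KeySizeCheck {

    // Assumption: the limit equals the largest value of a signed Java byte (127).
    static final int MAX_KEY_SIZE = Byte.MAX_VALUE;

    // Returns the key length as the single byte that would be stored, or rejects the key.
    static byte encodeKeySize(byte[] key) {
        if (key.length < 1 || key.length > MAX_KEY_SIZE) {
            throw new IllegalArgumentException(
                "key length must be 1.." + MAX_KEY_SIZE + " bytes, got " + key.length);
        }
        return (byte) key.length;
    }
}
```

Running this check before every put means an oversized key fails fast in the caller rather than inside the store.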
@wangtao724 After a more in-depth review of the code, I have several optimizations that I'll discuss in other issues.
LZ4 is included as a dependency, but it doesn't appear to be used. Is there a reason for this?
Would you be open to a PR that optionally enables LZ4 compression of keys and/or values?