Support bzip3 compression #110
Replies: 10 comments
-
Thanks, this looks really interesting! I'll take a closer look as soon as I have some time. |
Beta Was this translation helpful? Give feedback.
-
Testing bzip3 on a set of binary data:
LZMA
sizes of the files:
lzma has specific tricks that will always make it smaller in binary data and faster to decode than bzip3. the only time bzip3 might win is text data but even that could be beat by using lc8:lp0:pb0 . bzip3 is not a good compression algorithm for a fs since its not assymetrical i.e it will require the same time to decompress as to compress whereas lzma is 10x faster to decode than to encode. xz which is a derivative of lzma2 is a far simpler version from the lzma2 in igor's sdk lzma the one i used here (LZMA1) much ideal since it allows more of the settings to be used.The default llevel 9 of the xz is not optimal for all data.
|
Beta Was this translation helpful? Give feedback.
-
I beg to differ. I used your flags on a Wolfram Mathematica 12.0 install which is a fairly diverse corpus (a few Java runtimes, a bunch of executables/dlls, some Wolfram language code), and this is what I got:
LZMA with your flags took one and a half of an hour to compress 9GiB of data, so by an educated guess, it would probably take more than 3 hours to compress the dataset above. Secondly, bzip3 supports parallel decompression, just like bzip2, because all blocks are independent. Compared to the bzip3 timing of 7 minutes, LZMA is almost 12 times slower. I don't think it was worth to spend my entire morning on watching paint dry, but there's more - if we wanted a more fair comparison, we could apply some deduping to the initial data (which was presumably handled before by LZMA, as it's a LZ derivative):
Which proves my point that bzip3 performs pretty well with the aid of long range deduping - bzip3 isn't exactly a LZ derivative, so to have a fair benchmark on these kinds of data, we'd need to apply some sort of dictionary coding. Another thing (that feels painfully obvious to me) I would like to point out that no compression algorithm is the philosopher's stone of computing that "turns lead into gold" under all circumstances. Hence bzip3, as a very different conceptually compressor from everything dwarfs supports could be quite beneficial to include. And it's not like we have to compress all blocks using the same compressor in the end. Finally, regarding |
Beta Was this translation helpful? Give feedback.
-
@kspalaiologos, I would like to pipe in and note, however, that you aren't exactly using the most performant LZMA options with 7z. If I am reading your args correctly, you aren't using multithreading and could be using fast LZMA2 (although dwarfs currently doesn't use it so perhaps not ideal for comparison). Your xz example is better on this front. My preferred option is the following, which in my testing was very comparable to bzip3, albiet still slower but for a bit higher ratio (and I'm sure could be tweaked to be more similar with some effort):
I am curious to see how bzip3 would perform in a dwarfs scenario, regardless. |
Beta Was this translation helpful? Give feedback.
-
At this point I'm pretty sure that there are as many opinions about compression as people in this thread. Anyway, regarding flzma2 - it's not present in my 7zip build, nor the official build, neither in my distro's repositories, so I can't verify your results.
7zip seems to have consumed two of my cores. |
Beta Was this translation helpful? Give feedback.
-
Ah, I forgot that my distro of choice utilizes this fork: https://github.com/jinfeihan57/p7zip.
Well, even still, the |
Beta Was this translation helpful? Give feedback.
-
Results on the Mathematica installation:
They seem to be indeed much better than before, but it still seems just a little worse regarding the compression ratio than lrz+bzip3. And 10 times slower (10042s CPU time vs 1466s). |
Beta Was this translation helpful? Give feedback.
-
7z doesn't implement any of its own long-range functionality. You'd need to either use lrzip with both or neither for a fully fair comparison. I need to redo some of my own testing but with the data I tried (a large uncompressed pdf and the data from a game) bzip3 was typically faster and comparable in size. I can't remember whether 7z or bz3 won out. |
Beta Was this translation helpful? Give feedback.
-
so does lzma2 if you pass the mt flag(albeit with a bit less CR) i didnt use it since i wanted the max CR and decoding it would be much faster anyway with a single core. (a crucial requirement for a read only filesystem.)
there is no deduping step in 7z. how does it being a LZ derivative makes it a deduping algorithm?
your're already using a LZ derivative LZP in your codec?
BWT algorithms like in your case bzip3 excels in text and only in text, anything else you throw at it lzma will far outperform. But i do like the idea of compressing different blocks with different algorithms but then we are already pushing what is a read only FS to a archiver territory.
It will also use the same amount of decoding memory it required to encode which is ~5N iirc for libsais where as lzma require 32KB+dict size to decode and 11.5*dict size to encode. small test on silesia corpus:
bzip3 (compiled with AOCC for the best case scenario)
file sizes:
decode:
bzip3 here wins by a very tiny margin in CR with also 10x faster encode than lzma but its 8x slower to decode and to use 16 threads will require like 10x the memory at around 5GB and lzma would require around 205MB to decode. |
Beta Was this translation helpful? Give feedback.
-
I'm a bit reluctant to spend the effort of including I did a quick-and-dirty performance test, but I reckon it's somewhat representative. I took a DwarFS image with >1000 Perl installations (so mixed text and binary data) in it. This image uses lzma compression and is about 300 MiB in size (down from 50 GiB of input data). I created an "uncompressed" image from it:
This uncompressed image is 4.2 GiB. I used a script to extract the individual blocks:
This resulted in 69 blocks (data, metadata, index, ...), most of which were 64 MiB in size. A few were as small as 552 bytes. Compressing each of these blocks individually gives an exact estimate of how well a compression algorithm performs in the context of DwarFS. So I did the following:
I picked
With This is certainly not an exhaustive test, but I currently don't see I'm moving this issue to discussions for now; if there are significant improvements to |
Beta Was this translation helpful? Give feedback.
-
Over the last two weeks I've been working on bzip3, which is ultimately intended to replace bzip2. I believe that it'd be a good addition to
dwarfs
judging by the compression ratios it offers and its fairly good performance.To mimic your benchmark in a way, first, I have downloaded every version of Perl5 ever released and decompressed them.
Then, I put all the resulting
.tar
files in a single.tar
file and tried to compress it using various compressors:The results follow:
Then, I used
lrzip
to perform long-range deduplication on the original.tar
file:Finally, I compressed the resulting
none.tar.lrz
file using bzip3:The results follow:
I believe that the long-range deduping that dwarfs offers coupled together with block-level deduping that is implanted into bzip3 and a powerful BWT entropy post-coder would excel in particular on source code, text, various documents, etc...
I have also tested
dwarfs
on the directory with tarballs yielding the following result:In retrospect, I should have unpacked all the tarballs, but
lrzip
seems to have done a great job at deduping without this. Maybe this functionality could be implemented into dwarfs at some point in the future too.Beta Was this translation helpful? Give feedback.
All reactions