Performance improvement ideas #12

andrews05 · 2022-10-24T03:00:18Z

This SIMD accelerated crc library may be able to improve performance: https://github.com/srijs/rust-crc32fast
Not sure if this is related to the performance difference between this and zopfli-rs?

AlexTMjugador · 2022-10-24T08:36:53Z

zopfli-rs provides bindings for the original Zopfli C code, and that code does not use a particularly fast CRC32 implementation, at least for the Gzip format.

Besides, as admitted in the zopfli-rs benchmarks, the performance of both crates is non-linear with respect to the input data size, which makes it harder to reason about. Experimentally, I've also found substantial differences in runtime when trying to compress random data vs. data with a predictable distribution, even if they have the same size. It may happen that this Rust reimplementation is slower for that particular benchmark, but faster in others. And that benchmark tests input sizes of less than 1 KiB, which are so tiny that they don't even fill the DEFLATE window.

Anyway, I think that the performance of this crate can be improved! Using crc32fast is certainly an option, and so is multi-threaded compression.

andrews05 · 2022-10-24T23:50:42Z

Yeah, I'm not too worried about performance really, zopfli is always going to be horrendously slow 🤣. I just came across crc32fast and thought it might be useful here.

Although, ECT has demonstrated that fast zopfli is possible. I guess you could try and figure out how it was done. zopfli#119 and zopfli#110 (including the discussions) may be good reference.

AlexTMjugador · 2022-10-25T09:20:34Z

When dealing with near-optimal algorithms like Zopfli, one must keep the no free lunch theorem in mind. A patch may perform better on a given benchmark, in detriment of other possible benchmarks 😉

However, thank you for sharing the links! ECT certainly did some changes that are worth looking into, and multi-threading sounds like a relatively low-hanging fruit that could bring some significant performance improvements to the table.

kornelski · 2022-10-25T12:01:46Z

The current implementation looks very inefficient, handling io::Error for every byte:

zopfli/src/gzip.rs

Lines 31 to 50 in 41d8712

    
           let in_data = IterRead::new( 
        
               in_data 
        
                   .bytes() 
        
                   .filter_map(|byte_result| { 
        
                       read_error_kind = byte_result.as_ref().map_or_else( 
        
                           |error| Some(error.kind()), 
        
                           |byte| match insize.checked_add(1) { 
        
                               Some(new_insize) => { 
        
                                   insize = new_insize; 
        
                                   crc_digest.update(&[*byte]); 
        
                                   None 
        
                               } 
        
                               None => Some(io::ErrorKind::Other), 
        
                           }, 
        
                       ); 
        
                       byte_result.ok() 
        
                   }) 
        
                   .fuse(), 
        
           );

so I think there's plenty of room to improve.

AlexTMjugador · 2022-10-25T12:14:14Z

I was focusing my attention on the deflate function implementation, but that indeed looks slow! I use Zopfli to generate raw DEFLATE streams, so obviously any IterRead overhead here doesn't show in profiling graphs.

Edit: I've realized I had a much more efficient-looking hashing read wrapper type lying around from another Rust project of mine, so let's transplant it to Zopfli and see how it performs!

@kornelski

As pointed out by @kornelski, using IterRead for the task of updating a rolling hasher as bytes are pulled from a reader is likely to be very inefficient. Let's use a more specific wrapper type that does such a thing per block of data read, not per byte. While at it, let's replace the `crc32` dependency by `crc32fast`, and `adler32` by `simd-adler32`, which are faster equivalents. `iter-read` is no longer necessary, so it has been removed. The performance impact of these changes has not been properly analyzed yet, but some preliminary runs of `make zopfli && make test` show no noticeable performance regressions at least, which is good! Addresses #12.

AlexTMjugador · 2022-10-25T13:16:38Z

I've switched to crc32fast and got rid of IterRead in the latest commit. I'm not sure how much it helps performance yet, but at least the code is more elegant now 😄

kornelski · 2022-10-25T13:29:43Z

That looks great!

andrews05 · 2022-10-25T19:20:57Z

Nice! I measured this around 3-4% faster on one file I tested.

When dealing with near-optimal algorithms like Zopfli, one must keep the no free lunch theorem in mind. A patch may perform better on a given benchmark, in detriment of other possible benchmarks 😉

Ha, true! Though in my experience ECT can compress better than zopfli's default settings while still being about 100x faster (without multithreading).

On the topic of near-optimal algorithms, libdeflate has its own implementation which is also very fast, though falls just short of the ratios that ECT/zopfli can reach.

AlexTMjugador · 2022-10-26T09:26:49Z

Indeed, by the looks of it, ECT and other projects you've linked to did quite a few smart optimizations to the hot paths that work better in the practical scenarios I can imagine. Admittedly, this crate is more focused on correctness and being a faithful Rust port of the original Zopfli code than performance. Anyway, I'm glad that cleanup worked well for you, and I'd like this crate to become faster too, by implementing some of those ideas!

Truth be told, however, I didn't dive that deep into the matter, and getting these optimizations done right requires a lot of effort. I feel confident about implementing multi-threading at some point, but gaining further substantial speedups by accumulating known micro-optimizations is a task that flies over my head right now 😇

For the time being, I'd like to keep this issue open to track performance improvement ideas.

andrews05 · 2022-10-26T19:01:22Z

Yeah, I totally understand. Wish I knew more about compression myself 🙂

The modified code was part of the get_cost_stat function, which is pretty hot for bigger files (takes ~17% of the total CPU cycles for a sample file). Let's do some obvious improvements to use better numeric types here. Related issue: #12

AlexTMjugador · 2023-04-21T21:13:54Z

I got some time and will to run zopfli under perf. I noticed that squeeze::get_cost_stat was taking up a significant chunk of CPU cycles, so I decided to start by avoiding some of the unsigned-signed type mish-mash that goes on there. This simple change brought down the compression time for a test file from ~14.91 s to ~13.82 s (7.9% improvement) on an Intel Core i7-12700 CPU.

This is by no means a scientific benchmark or the end of the road when it comes to performance optimization, but I'm amazed that this relatively simple change had such outsized impact 😄

mqudsi · 2023-04-22T00:26:57Z

Nice find!

AlexTMjugador changed the title ~~Switch to crc32fast~~ Performance improvement ideas Oct 26, 2022

AlexTMjugador added the enhancement New feature or request label May 22, 2023

AlexTMjugador added this to the v0.8.0 milestone May 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvement ideas #12

Performance improvement ideas #12

andrews05 commented Oct 24, 2022

AlexTMjugador commented Oct 24, 2022 •

edited

Loading

andrews05 commented Oct 24, 2022 •

edited

Loading

AlexTMjugador commented Oct 25, 2022

kornelski commented Oct 25, 2022 •

edited

Loading

AlexTMjugador commented Oct 25, 2022 •

edited

Loading

AlexTMjugador commented Oct 25, 2022 •

edited

Loading

kornelski commented Oct 25, 2022

andrews05 commented Oct 25, 2022 •

edited

Loading

AlexTMjugador commented Oct 26, 2022 •

edited

Loading

andrews05 commented Oct 26, 2022

AlexTMjugador commented Apr 21, 2023

mqudsi commented Apr 22, 2023

Performance improvement ideas #12

Performance improvement ideas #12

Comments

andrews05 commented Oct 24, 2022

AlexTMjugador commented Oct 24, 2022 • edited Loading

andrews05 commented Oct 24, 2022 • edited Loading

AlexTMjugador commented Oct 25, 2022

kornelski commented Oct 25, 2022 • edited Loading

AlexTMjugador commented Oct 25, 2022 • edited Loading

AlexTMjugador commented Oct 25, 2022 • edited Loading

kornelski commented Oct 25, 2022

andrews05 commented Oct 25, 2022 • edited Loading

AlexTMjugador commented Oct 26, 2022 • edited Loading

andrews05 commented Oct 26, 2022

AlexTMjugador commented Apr 21, 2023

mqudsi commented Apr 22, 2023

AlexTMjugador commented Oct 24, 2022 •

edited

Loading

andrews05 commented Oct 24, 2022 •

edited

Loading

kornelski commented Oct 25, 2022 •

edited

Loading

AlexTMjugador commented Oct 25, 2022 •

edited

Loading

AlexTMjugador commented Oct 25, 2022 •

edited

Loading

andrews05 commented Oct 25, 2022 •

edited

Loading

AlexTMjugador commented Oct 26, 2022 •

edited

Loading