stb resize 2.08 #1649
I tested this change on my Raspberry Pi 4B running Raspberry Pi OS in 32-bit mode.

The color bug I noticed in stbir__simple_flip_3ch() is fixed, in both the scalar and SIMD paths. On my platform, stbir__simdf_swiz2 is not defined, so it selects the second SIMD code block, which uses stbir__simdf_swiz(). I have verified which code paths are executing using printfs.

The change in speed from enabling SIMD is slight (but consistent). With gcc, SIMD gave a 15% speedup.

GCC build options:

Clang build options:

The fastest version, Clang with -DSTBIR_NO_SIMD (lol), performs as follows:

I'm not sure why it's so slow, since it's only 150 MB of pixels. Maybe the long scanlines are thrashing the cache in a maximally bad way? The speed is fine for my application and matches the 2.07 non-SIMD speed, so I don't know if there's a problem. But if you expected a bigger difference on this platform, I can poke at it some more.

e: All times mentioned above are for the call to stbir_resize_extended(), which does much more work than just flip_3ch(). But even the core resizer math doesn't really speed up with SIMD? Maybe I am doing something wrong here.

e2: I just rechecked the 2.08 times against 2.07, both scalar and SIMD. They're the same. So this does not seem to be a regression, just something I noticed now because I am comparing SIMD and non-SIMD back-to-back to confirm that the color is fixed in both.
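For readers unfamiliar with the flip discussed above, here is a minimal scalar sketch of an RGB to BGR swap on packed 8-bit pixels. `flip_3ch_scalar` is a hypothetical name used only for illustration. It is not stb's `stbir__simple_flip_3ch()`, and the library's function may operate on a different data type or buffer layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT part of stb_image_resize2.h.
   Swaps channel 0 and channel 2 of every packed 3-byte pixel in place,
   turning RGB into BGR (or BGR back into RGB). The middle channel is
   left untouched. */
static void flip_3ch_scalar(uint8_t *pixels, size_t pixel_count)
{
    assert(pixels != NULL || pixel_count == 0);
    for (size_t i = 0; i < pixel_count; ++i) {
        uint8_t *p = pixels + 3 * i;  /* start of pixel i */
        uint8_t tmp = p[0];
        p[0] = p[2];
        p[2] = tmp;
    }
}
```

Applying the function twice restores the original buffer, which makes for a cheap sanity check when comparing a scalar path against a SIMD one.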
That's a reasonably big downsample (depending on your filter) - 1 second doesn't seem nuts for a 32-bit platform that is reading 150 MB of input with a sample window of 27x20 (each output pixel has to read 27x20 of the input). 32-bit vs 64-bit is a huge hit here, btw. There are a couple things you can do:
For option 5, you can also wait for 2.09 which will internally do the cache striping for you. But yeah, 32-bit arm is just pretty darn pokey in general.
Yep, I'm not complaining about the performance, I just wanted to check that the numbers seemed sensible. The application I'm testing is decode and display of 45MP iPhone 15 heic files on a Rasp Pi Zero 2 W 512MB. (It works!)
There's probably some more wins if you want to get fancy. Instead of decoding the HEIC into RGB and then resizing that, decode into YUV (where the U and V planes are smaller), resize those planes, and THEN convert to RGB in the smaller space.
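A rough sketch of that ordering, assuming 4:2:0 planar input (chroma planes at half width and half height) and full-range BT.601 coefficients. Both are assumptions, since the thread does not say what the HEIC decoder produces. All function names are hypothetical, and `box_downsample` is only a stand-in for a real resizer call such as stb's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp to [0, 255] and round to nearest. */
static uint8_t to_u8(float v)
{
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* Stand-in resizer: averages each f x f block of a w x h 8-bit plane.
   dst must hold (w/f) x (h/f) bytes; w and h must be multiples of f. */
static void box_downsample(const uint8_t *src, int w, int h, int f, uint8_t *dst)
{
    assert(f > 0 && w % f == 0 && h % f == 0);
    int ow = w / f, oh = h / f;
    unsigned area = (unsigned)(f * f);
    for (int oy = 0; oy < oh; ++oy)
        for (int ox = 0; ox < ow; ++ox) {
            unsigned sum = 0;
            for (int y = 0; y < f; ++y)
                for (int x = 0; x < f; ++x)
                    sum += src[(size_t)(oy * f + y) * (size_t)w + (size_t)(ox * f + x)];
            dst[(size_t)oy * (size_t)ow + (size_t)ox] = (uint8_t)((sum + area / 2) / area);
        }
}

/* Full-range BT.601 YCbCr -> packed RGB; all three planes hold n samples. */
static void ycbcr_to_rgb(const uint8_t *y, const uint8_t *cb, const uint8_t *cr,
                         int n, uint8_t *rgb)
{
    for (int i = 0; i < n; ++i) {
        float c = (float)y[i];
        float d = (float)cb[i] - 128.0f;
        float e = (float)cr[i] - 128.0f;
        rgb[3 * i + 0] = to_u8(c + 1.402f * e);
        rgb[3 * i + 1] = to_u8(c - 0.344136f * d - 0.714136f * e);
        rgb[3 * i + 2] = to_u8(c + 1.772f * d);
    }
}

/* Y is w x h, Cb and Cr are (w/2) x (h/2). Shrinks by an even factor f:
   luma by f, chroma by f/2, so all planes end at (w/f) x (h/f). Color
   conversion then runs only on the small image. Returns 0 on success. */
static int shrink_420_to_rgb(const uint8_t *y, const uint8_t *cb, const uint8_t *cr,
                             int w, int h, int f, uint8_t *rgb)
{
    assert(f >= 2 && f % 2 == 0 && w % f == 0 && h % f == 0);
    int n = (w / f) * (h / f);
    uint8_t *tmp = malloc((size_t)n * 3);
    if (!tmp) return -1;
    box_downsample(y, w, h, f, tmp);
    box_downsample(cb, w / 2, h / 2, f / 2, tmp + n);
    box_downsample(cr, w / 2, h / 2, f / 2, tmp + 2 * n);
    ycbcr_to_rgb(tmp, tmp + n, tmp + 2 * n, n, rgb);
    free(tmp);
    return 0;
}
```

With 4:2:0 input the resizer reads 1.5 bytes per source pixel instead of 3, and the color conversion touches only the output-sized image.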
This was already merged in later updates.
fix for RGB->BGR three channel flips and add SIMD (thanks to Ryan Salsbury)
fix for sub-rect resizes
use pragmas to control unrolling when they are available.