Using direct I/O on Linux #808
Comments
So with the changes from #809, direct I/O still ends up being slightly slower for me:
except for removals, which still takes almost twice as long. Any ideas where these remaining effects are coming from? |
Hmm, I haven't played with direct i/o a whole lot, but here are a few hypotheses:
|
Not necessarily, or rather only if But I agree the opportunities for coalescing might be reduced to what is queued in the I/O layer. I guess direct I/O would only really make sense when combined with asynchronous I/O, so that the program can issue all its I/O "immediately" and the block layer can do almost as much coalescing as the page cache could have done before the
So this sounds like something that could be optimized? For a free'd page, its contents should not matter, should it? So it could just be declared clean and put into the read cache? Maybe after zeroing out the in-memory representation? We'd just need to be sure that when the supposedly clean page is evicted and recreated from disk, that leads to the same result. I think this is also the main benefit of running with direct I/O: it shines a light on hidden inefficiencies which the kernel's page cache currently smooths over. But on a fully laden production system without cache memory to spare, not depending on the page cache's help is presumably rather helpful. |
It's less trivial than it seems. Because pages can be allocated at different orders, the size of the cached page might change. If you want to hack it into the benchmark to see if it has an effect though, it's probably safe to just assume the page size doesn't change since I think the benchmark only writes small values. |
I looked into this, and actually I don't think this would contribute that much. The code already avoids reading the free'd page from disk. |
So if this re-creation of free'd pages is inconsequential, where is the I/O overhead in the "removals" benchmark coming from? Updating the tree metadata on disk? |
I'm not sure. Have you tried profiling with and without direct_io? The |
I tried your diff, but it panics with alignment errors in |
Sorry for not getting back to this earlier, I have not had time yet to rerun the benchmarks. Just two remarks:
|
For example, I am getting
in my home directory, which is why I chose |
Not sure if this is already helpful, but I used Hotspot as it has checkbox-level support for off-CPU profiling and limited the benchmarks to

Looking at the flame graph for off-CPU time (after filtering out the Ctrl+C handler thread, which is blocked reading on its pipe and totally skews the off-CPU time), I get the following flame graph for buffered I/O

and using direct I/O

So instead of waiting on

This could end up coming back to direct I/O only really making sense together with asynchronous I/O: with direct I/O, we write one page after the other if a transaction touched multiple pages, whereas with buffered I/O we just push those pages into the page cache, which can issue the writes in parallel/batched when |
Unfortunately, I hit a panic in read() because the allocated vector had the wrong alignment.
I see, ya it seems like async io might be required, if all the individual writes with direct io are more expensive than the OS performing them during the fsync. The only thing I can think of is that you could try to coalesce consecutive writes into calls to |
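As an illustration only (not from the thread), here is a minimal sketch of what coalescing adjacent page writes could look like, assuming dirty pages are tracked as an offset-to-buffer map and all pages have the same size; the function and its names are hypothetical, not redb's actual write path:

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Write dirty pages, merging runs of consecutive pages into a single
/// `write_all_at` call so the kernel sees fewer, larger requests.
/// `dirty` maps the file offset of each page to its contents; every page
/// is assumed to be exactly `page_size` bytes long.
fn flush_coalesced(file: &File, dirty: &BTreeMap<u64, Vec<u8>>, page_size: u64) -> io::Result<()> {
    // BTreeMap iterates in ascending offset order, so consecutive pages
    // are adjacent in this vector.
    let pages: Vec<(u64, &Vec<u8>)> = dirty.iter().map(|(&off, buf)| (off, buf)).collect();
    let mut i = 0;
    while i < pages.len() {
        let (start, first) = pages[i];
        let mut buf = first.clone();
        let mut j = i + 1;
        // Extend the run while the next page starts exactly where the
        // previous one ended.
        while j < pages.len() && pages[j].0 == start + (j - i) as u64 * page_size {
            buf.extend_from_slice(pages[j].1);
            j += 1;
        }
        // One write per run instead of one per page. Note that for
        // O_DIRECT this buffer would additionally have to satisfy the
        // alignment requirements discussed further down in this thread.
        file.write_all_at(&buf, start)?;
        i = j;
    }
    Ok(())
}
```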
Ah ok, this would suggest you are using an older kernel version where direct I/O still requires page-aligned buffers? With recent kernels, this requirement has been dropped/reduced and the general alignment guaranteed by the heap allocator suffices. Otherwise, this gets tricky as |
I'm on the 6.8 kernel, but it was this assert in your diff: |
I think the assert is correct and necessary, i.e. direct I/O usually has alignment requirements, but they were significantly reduced in "later" kernels down from page alignment to lower values like e.g. the reported four on my system (ext4 on an NVMe drive on x86-64). It would be interesting to see which value stx_dio_mem_align reports on your system? (I think almost anything in the I/O stack can influence this, e.g. filesystem, block device driver, volume block size, etc.) |
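As an illustration only (not part of the original comment), a small sketch of querying those limits from Rust via statx(2); it assumes a kernel new enough to fill in these fields (6.1 or later) and a libc crate version that exposes STATX_DIOALIGN and the stx_dio_* members:

```rust
use std::{ffi::CString, io, mem};

/// Query the direct I/O alignment requirements (memory and file offset)
/// that the kernel reports for the file system/device behind `path`.
/// Both values are zero when the file system does not support O_DIRECT
/// or the kernel predates STATX_DIOALIGN.
fn dio_alignment(path: &str) -> io::Result<(u32, u32)> {
    let c_path = CString::new(path).expect("path contains a NUL byte");
    let mut stx: libc::statx = unsafe { mem::zeroed() };
    let ret = unsafe {
        libc::statx(
            libc::AT_FDCWD,
            c_path.as_ptr(),
            0,
            libc::STATX_DIOALIGN,
            &mut stx,
        )
    };
    if ret != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok((stx.stx_dio_mem_align, stx.stx_dio_offset_align))
}

fn main() -> io::Result<()> {
    // Placeholder path, substitute the benchmark's database file.
    let (mem_align, offset_align) = dio_alignment("lmdb_benchmark.redb")?;
    println!("stx_dio_mem_align = {mem_align}, stx_dio_offset_align = {offset_align}");
    Ok(())
}
```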
It required alignment of 512
|
This sounds like the device block size and is certainly higher than the natural alignment (sufficient alignment for all primitive types, provided by the dynamic allocator by default). So I think for such a setup, we would need to implement our own memory management here, which I think is out of scope for redb (and requires significant amounts of unsafe code). I think compared to that, using memory maps is the better trade-off of unsafe code versus performance. I do not want to re-open the memory maps debate; I think the position that redb takes is eminently reasonable. I just think it shows that direct I/O does not really fit into that position. I also feel like we have learned everything there is to learn here for now. So maybe we just close this having found one optimization opportunity? |
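Just to illustrate the kind of unsafe code this would entail (a sketch, not a proposal for redb), an allocation with a fixed 512-byte alignment as reported above; a real implementation would query the required alignment at runtime:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Alignment required for O_DIRECT buffers on the system discussed above
/// (512 bytes, i.e. the device block size).
const ALIGN: usize = 512;

/// A heap buffer whose start address and length are multiples of `ALIGN`.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(len: usize) -> Self {
        assert!(len > 0);
        // O_DIRECT typically constrains the transfer length as well,
        // so round it up to a multiple of the alignment.
        let len = len.next_multiple_of(ALIGN);
        let layout = Layout::from_size_align(len, ALIGN).expect("invalid layout");
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        Self { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```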
Yep, thanks for contributing that optimization! |
Understanding that redb manages its own user space page cache, I used a small diff (changes to make the lmdb_benchmark use direct I/O) to run the lmdb_benchmark using direct I/O. This appears to work but significantly regresses performance, especially for the "batch writes" scenario, which became three times slower (7s versus 24s) on my system.

While this is obviously not a supported use case, and hence this is not meant as a bug report but rather as a question, this result did surprise me. Maybe someone else has an idea why bypassing the kernel's page cache actually slows redb down?
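The diff itself is not reproduced here, but as an illustration, opening a file with O_DIRECT from Rust typically boils down to something like the following (the path is a placeholder):

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;

/// Open a file for reading and writing with the kernel page cache
/// bypassed. All subsequent reads and writes must respect the device's
/// alignment requirements (see the discussion above).
fn open_direct(path: &str) -> io::Result<File> {
    OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)
}

fn main() -> io::Result<()> {
    // Placeholder path, not the benchmark's actual file name.
    let _file = open_direct("lmdb_benchmark.redb")?;
    Ok(())
}
```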