Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve freelist serialization #799

Open
tjungblu opened this issue Jul 25, 2024 · 0 comments
Open

Improve freelist serialization #799

tjungblu opened this issue Jul 25, 2024 · 0 comments

Comments

@tjungblu
Copy link
Contributor

Background

Related to #789, many of the corruption issues stem from reading freelists with bogus values. Most of the time we're catching the corruption, e.g. if the page id is suddenly out of the bounds of the memory mapped file, or it's allocated and free at the same time. We currently cannot detect silent bit flips where we suddenly turn pageid 9 into pageid 8, which then later find their way into a corrupted freelist.

The current serialization mechanism is a very simple run-length encoding of a list of sorted uint64 that denote the page identifier being free/pending. The data is directly written to the memory address of the freelist page(s) as a slice, which is backed by the memory mapped file.

While we assume that we have a bug somewhere that corrupts certain pages in some (virtualized?) environments and crash cases, we should be proactive here and start adding additional safeguards to detect and potentially even correct errors.

Efficiency

The amount of free pages and the size of the individual page ids rarely require the full 64 bit width of the page id type.

Which is great, as this opens up the possibility for varint/vint encoding, which is also done by protobuf [1].
This could save significant amount of space and deserialization time, as the current encoding can only store 512 ids per page, with vint compression we could easily store twice that.

Alternatively, we can encode the free page id as a bit in a sparse bitset like [2], which allows us even further compress this up to about 2048 ids per page.

[1] https://protobuf.dev/programming-guides/encoding/#varints
[2] https://github.com/RoaringBitmap/roaring

Error Detection

To start off very simple, we should add an error detection code (e.g. a CRC checksum) to the serialization process, if not even to the Page struct itself to also safeguard the data stored in the btree.

A separate topic is how we're reacting to a page/freelist being invalid based on their checksum, I'm happy to hear your suggestions on this.

Error Correction

This is probably a longer project, yet we should evaluate error correcting codes like Reed-Solomon [3] here as well. This would allow us to reconstruct faulty pages with only a little more space allocated.

[3] https://github.com/klauspost/reedsolomon

Compatibility

Given that we already have a serialization protocol in place, we need to be mindful about backward compatibility.

I propose to reserve the first integer with value math.MaxUint64 to denote the new "schema", which we should start to version properly for any future changes we may want to make. It's unlikely to be used as the real length of a freelist, given current physical memory and disk constraints.

One example encoding I could think of would be:

(PageCount uint16 = 0xFFFF)
(ids[0] aka. Length = math.MaxUint64)

version uint8
< encoded list of ids, varint/bitsets  >
checksum uint32/uint64

I would prefer to have a more battle tested and upgradable schema like protobuf, even at the expense of parsing overhead, but I wouldn't want to introduce a dependency that might be conflicting with etcd.

Thanks for reading that far, happy to hear your thoughts.

@github-actions github-actions bot added the stale label Dec 10, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 1, 2025
@ahrtr ahrtr reopened this Jan 1, 2025
@github-actions github-actions bot removed the stale label Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants