WAL snapshotting should leave behind some WAL files #25788

Closed
pauldix opened this issue Jan 11, 2025 · 0 comments · Fixed by #25801

pauldix commented Jan 11, 2025

Currently, when the WAL snapshots, it deletes all of the files that were snapshotted. For downstream Enterprise replicas, we'll want to keep some WAL files around so the replicas can still pick them up even if they are lagging behind a bit.

Add a configuration option for the number of snapshotted WAL files to leave behind. The default should be 300 (at least 5m of data with default settings).

With #25787 merged in, the restart/replay process will now ignore previously snapshotted files.

We need to update replay and the snapshot process so they no longer automatically delete all snapshotted files. Instead, the WAL should keep track of the file number of the oldest WAL file. When a snapshot comes through, you can subtract the oldest number from the snapshot's WAL file number to determine how many snapshotted files exist.

So we have:

  • keep-snapshotted-wal-count
  • oldest_wal_number
  • latest_wal_number (the most recent file written)
  • last_snapshot_number

Where oldest < last < latest always. The number of snapshotted WAL files we have kept on object store is last - oldest. We want to delete files starting at oldest and going up to some N so that only the keep count of snapshotted files remains.
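
The arithmetic above can be sketched roughly as follows. This is only an illustration, not the code from the fix: the `WalRetention` struct, its method names, and the choice to count retained files as exactly `last - oldest` are assumptions made here to mirror the four numbers listed in this issue.

```rust
use std::ops::Range;

/// Hypothetical bookkeeping for the four numbers named in the issue.
/// The struct and field names are illustrative, not the real influxdb3 types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WalRetention {
    pub keep_snapshotted_wal_count: u64,
    pub oldest_wal_number: u64,
    pub last_snapshot_number: u64,
    pub latest_wal_number: u64,
}

impl WalRetention {
    /// Snapshotted files still retained, counted as `last - oldest`
    /// (the formula given in the issue).
    pub fn snapshotted_kept(&self) -> u64 {
        self.last_snapshot_number
            .saturating_sub(self.oldest_wal_number)
    }

    /// Half-open range of WAL file numbers to delete so that only
    /// `keep_snapshotted_wal_count` snapshotted files remain.
    pub fn files_to_delete(&self) -> Range<u64> {
        let excess = self
            .snapshotted_kept()
            .saturating_sub(self.keep_snapshotted_wal_count);
        self.oldest_wal_number..self.oldest_wal_number + excess
    }

    /// Record a new snapshot, return the file numbers to delete, and
    /// advance `oldest_wal_number` past the deleted files.
    pub fn on_snapshot(&mut self, snapshot_number: u64) -> Range<u64> {
        self.last_snapshot_number = snapshot_number;
        let to_delete = self.files_to_delete();
        self.oldest_wal_number = to_delete.end;
        to_delete
    }
}
```

For example, with a keep count of 300, oldest at 1, and a snapshot arriving at file 401, this sketch would delete files 1 through 100 and move oldest to 101, leaving last - oldest equal to 300.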

During replay you can ignore deletions completely, as long as the WAL is initialized with the 4 numbers we need.

One important bit about the restart is that we don't want to actually load all the WAL files between oldest and last_snapshot, since we don't need that data. That means the startup process should first look for the latest persisted snapshot to get the last_snapshot number.

Then we'll need to do as many object store LIST operations on the WAL directory as it takes to get the full range of files there. We only need to know oldest and latest. Now we have our 4 numbers, and we can load all the WAL files from last_snapshot to latest into the QueryableBuffer.
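
A minimal sketch of that startup planning, working on the paths returned by the LIST calls rather than on a real object store client. Everything here is hypothetical: the `parse_wal_number` and `plan_startup` helpers, the `WalStartupState` struct, and the zero-padded `NNN.wal` file naming are assumptions for illustration, as is treating the file at last_snapshot itself as already persisted and therefore not replayed.

```rust
/// Parse a WAL file number from a path such as "host/wal/00000000042.wal".
/// The naming scheme is an assumption made for this sketch.
pub fn parse_wal_number(path: &str) -> Option<u64> {
    let file_name = path.rsplit('/').next()?;
    let stem = file_name.strip_suffix(".wal")?;
    stem.parse::<u64>().ok()
}

/// The numbers needed at startup, plus the files that must be replayed.
#[derive(Debug, PartialEq, Eq)]
pub struct WalStartupState {
    pub oldest_wal_number: u64,
    pub last_snapshot_number: u64,
    pub latest_wal_number: u64,
    pub to_replay: Vec<u64>,
}

/// Given every path found by listing the WAL directory and the number taken
/// from the latest persisted snapshot, work out oldest, latest, and which
/// files to load. Returns `None` when no WAL files were listed.
pub fn plan_startup<'a>(
    listed_paths: impl IntoIterator<Item = &'a str>,
    last_snapshot_number: u64,
) -> Option<WalStartupState> {
    let mut numbers: Vec<u64> = listed_paths
        .into_iter()
        .filter_map(parse_wal_number) // skip anything that is not a WAL file
        .collect();
    numbers.sort_unstable();

    let oldest_wal_number = *numbers.first()?;
    let latest_wal_number = *numbers.last()?;

    // Only files newer than the last snapshot are replayed; older files are
    // kept on object store for replicas but never loaded.
    let to_replay = numbers
        .into_iter()
        .filter(|n| *n > last_snapshot_number)
        .collect();

    Some(WalStartupState {
        oldest_wal_number,
        last_snapshot_number,
        latest_wal_number,
        to_replay,
    })
}
```

The point of the sketch is that the files between oldest and last_snapshot only contribute their numbers; their contents are never read during startup.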

pauldix added the v3 label Jan 11, 2025
praveen-influx added a commit that referenced this issue Jan 11, 2025
This commit allows a configurable number of wal files to be left behind
in OS. This is necessary as enterprise replicas rely on these files.

closes: #25788
praveen-influx added a commit that referenced this issue Jan 12, 2025
* feat: introduce num wal files to keep

This commit allows a configurable number of wal files to be left behind
in OS. This is necessary as enterprise replicas rely on these files.

closes: #25788

* refactor: address PR feedback

* refactor: address PR comment