Still return continuous WAL entries when running into ErrSliceOutOfRange #19095
Signed-off-by: Benjamin Wang <[email protected]>
Confirmed that this PR can fix the error in #19038 (comment). @siyuanfoundation please let me know if you can still reproduce it in your environment.
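As a rough illustration of the behavior named in the PR title, here is a minimal Go sketch. It is not the actual etcd change: `entry`, `errSliceOutOfRange`, and `readContinuous` are hypothetical names defined only for this example, and the only thing taken from the discussion is the `ErrSliceOutOfRange` error name. The idea is that when a record would leave a hole in the entry slice, the reader reports the error but still hands back the contiguous entries collected so far.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical stand-in for a raft log entry.
type entry struct {
	Index uint64
	Term  uint64
}

// errSliceOutOfRange borrows the name of the error discussed in this PR;
// it is defined locally for the sketch.
var errSliceOutOfRange = errors.New("slice bounds out of range")

// readContinuous walks records in file order, starting after startIndex.
// A record whose index would leave a hole stops the scan, but the
// contiguous entries collected so far are returned along with the error.
func readContinuous(startIndex uint64, records []entry) ([]entry, error) {
	var ents []entry
	for _, e := range records {
		if e.Index <= startIndex {
			continue // already covered by the starting snapshot
		}
		offset := e.Index - startIndex - 1
		if offset > uint64(len(ents)) {
			return ents, fmt.Errorf("%w: entry index %d, have %d entries after index %d",
				errSliceOutOfRange, e.Index, len(ents), startIndex)
		}
		// A repeated index truncates the tail and replaces it.
		ents = append(ents[:offset], e)
	}
	return ents, nil
}

func main() {
	// Indexes 4 to 6 are missing, so the scan stops at index 7.
	records := []entry{{1, 1}, {2, 1}, {3, 1}, {7, 2}}
	ents, err := readContinuous(0, records)
	fmt.Println(len(ents), errors.Is(err, errSliceOutOfRange))
}
```

With the sample records, the caller gets the three contiguous entries plus an error it can match with `errors.Is`, instead of losing everything read before the gap.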
Codecov Report
All modified and coverable lines are covered by tests ✅
... and 24 files with indirect coverage changes
@@            Coverage Diff             @@
##             main   #19095      +/-   ##
==========================================
- Coverage   68.77%   68.71%   -0.06%
==========================================
  Files         420      420
  Lines       35642    35642
==========================================
- Hits        24513    24492      -21
- Misses       9703     9719      +16
- Partials     1426     1431       +5
Thanks for the confirmation. Can we get this merged first? PTAL. cc @serathius
After syncing my repo, I just found the robustness test still fails even with this fix. Because
@siyuanfoundation how often did you see this error? In other words, is it easy to reproduce?
If I understand correctly, the robustness test error means that the last client write already got a successful response but wasn't persisted in the WAL file. Please let me know if I misunderstood.
Each time we see an issue, the first thing is to figure out whether it's a real issue from the end user's perspective. Can you manually double check whether the last successful client write was persisted in the WAL files of a majority of members, and also in the bbolt db?
Also, I see that the robustness test might not process the WAL records correctly: the longest one might not be the correct one. As long as WAL records are not committed yet, they may be overwritten by subsequent WAL records. etcd/tests/robustness/report/wal.go Lines 78 to 79 in fce823a
I regard it as a test issue for now; please raise a separate issue to track it. Thanks.
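The point about uncommitted records being overwritten can be sketched in Go as follows. This is an illustrative model only: `entry` and `replay` are hypothetical names, and the sample records are made up. It assumes that a record reusing an already-present index truncates the log at that position and replaces the tail.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a raft log entry.
type entry struct {
	Index uint64
	Term  uint64
}

// replay applies WAL records in file order. A record whose index is
// already present truncates the log at that position and replaces the tail.
func replay(records []entry) []entry {
	var ents []entry
	for _, e := range records {
		if e.Index == 0 || e.Index > uint64(len(ents))+1 {
			break // invalid index or gap; gap handling is out of scope here
		}
		ents = append(ents[:e.Index-1], e)
	}
	return ents
}

func main() {
	// An old leader (term 1) wrote indexes 3 and 4 but never committed them.
	// A new leader (term 2) then rewrote index 3.
	records := []entry{{1, 1}, {2, 1}, {3, 1}, {4, 1}, {3, 2}}
	ents := replay(records)
	fmt.Println(len(ents), ents[len(ents)-1])
}
```

In this made-up case five records collapse to three entries, and a member that stopped before the term 2 record would hold a longer log (four entries) whose tail was never committed. That is why the longest WAL is not automatically the correct one.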
You are right that normally the longest WAL does not necessarily include the longest commit sequence. However, in the robustness test we explicitly make a single additional transaction after the test is finished, which should ensure that there are no other uncommitted transactions. We require the transaction to succeed and later use it to assert that the WAL is complete.
Change looks good, however I haven't validated how it works with repair.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahrtr, serathius. The full list of commands accepted by this bot can be found here. The pull request process is described here.
This is the same error as below before this PR
I can reproduce the error for
@siyuanfoundation I am a little confused; I probably did not explain it clearly. There are two errors. The first is #19038 (comment), and it's already confirmed that this PR can fix it. Please let me know if you can still see the error with the patch in this PR included. The second error is #19095 (comment). A successful client write must have been persisted in the WAL files of at least a majority of the members, and probably also in the bbolt DB. This is exactly what I was asking you to confirm manually, as mentioned in #19095 (comment). Also
@serathius thx for the clarification. But theoretically it's still possible that the longest one isn't the correct one. The single additional successful transaction you mentioned only guarantees that a majority of members have the correct WAL data. Also note that the current robustness test always reads WAL data starting from snapshot {0, 0}, as mentioned in #19038 (comment), so if there is a gap in the WAL file of each member, as mentioned in #19038 (comment) and #19038 (comment), then you will definitely see the error. FYI, the last successful client write: etcd/tests/robustness/traffic/traffic.go Lines 119 to 121 in 70a1726
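The manual check requested above (is the last successful write present in the WAL of a majority of members?) could be modeled roughly like this in Go. Everything here is an assumption for illustration: `entry` and `persistedOnMajority` are hypothetical names, and the per-member entries are hard-coded rather than read from real WAL directories.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a raft log entry.
type entry struct {
	Index uint64
	Term  uint64
}

// persistedOnMajority reports whether want appears in the WAL entries
// of more than half of the members.
func persistedOnMajority(members [][]entry, want entry) bool {
	count := 0
	for _, ents := range members {
		for _, e := range ents {
			if e == want {
				count++
				break // count each member at most once
			}
		}
	}
	return count > len(members)/2
}

func main() {
	last := entry{Index: 3, Term: 2} // the last successful client write
	members := [][]entry{
		{{1, 1}, {2, 1}, {3, 2}},         // member A has it
		{{1, 1}, {2, 1}, {3, 2}},         // member B has it
		{{1, 1}, {2, 1}, {3, 1}, {4, 1}}, // member C: longer, but stale tail
	}
	fmt.Println(persistedOnMajority(members, last))
}
```

Note that in this sample the longest log belongs to the one member that does not contain the write, so picking the longest WAL would miss it even though a majority persisted it.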
Let me merge this PR first, since it's already confirmed that it resolves the first error. Regarding the second error, it should be a test issue.
The statement
They are not the same error. Even without this PR, the robustness test still has the second error. The reason you did not see it before is that it was hidden by the first error. If you really understood my previous comment, the robustness test's way of reading the WAL is wrong.
Just raised #19147
Please read #19038 (comment) and #19038 (comment)
cc @serathius @siyuanfoundation
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.