
Still return continuous WAL entries when running into ErrSliceOutOfRange #19095

Merged: 1 commit merged into etcd-io:main from the wal_20241221 branch on Jan 8, 2025

Conversation

@ahrtr (Member) commented Dec 21, 2024

@ahrtr (Member, Author) commented Dec 21, 2024

Confirmed that this PR can fix the error in #19038 (comment). @siyuanfoundation please let me know if you can still reproduce it in your environment.
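
For context, the PR title summarizes the behavior change in server/storage/wal/wal.go: when reading entry records, a gap in entry indexes used to surface only as ErrSliceOutOfRange. The sketch below is a rough, hypothetical illustration of the idea only, not the actual diff; the Entry type and appendContinuous helper are stand-ins for the real wal.ReadAll logic.

package walsketch

import "errors"

// Entry is a minimal stand-in for raftpb.Entry; only the index matters here.
type Entry struct {
	Index uint64
	Data  []byte
}

var ErrSliceOutOfRange = errors.New("wal: slice bounds out of range")

// appendContinuous mirrors the gap check when appending a decoded entry record:
// an entry whose index jumps past the end of what has been read so far means
// entries are missing. Instead of discarding everything, the continuous prefix
// read so far is returned together with the error.
func appendContinuous(ents []Entry, startIndex uint64, e Entry) ([]Entry, error) {
	if e.Index <= startIndex {
		return ents, nil // entry predates the reader's start point; skip it
	}
	up := e.Index - startIndex - 1
	if up > uint64(len(ents)) {
		return ents, ErrSliceOutOfRange // gap: keep the continuous entries read so far
	}
	return append(ents[:up], e), nil // overwrite any conflicting suffix, then append
}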

codecov bot commented Dec 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.71%. Comparing base (40b856e) to head (152de1f).
Report is 24 commits behind head on main.

Additional details and impacted files

Files with missing lines       Coverage           Δ
server/storage/wal/wal.go      57.88% <100.00%>   (ø)

... and 24 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19095      +/-   ##
==========================================
- Coverage   68.77%   68.71%   -0.06%     
==========================================
  Files         420      420              
  Lines       35642    35642              
==========================================
- Hits        24513    24492      -21     
- Misses       9703     9719      +16     
- Partials     1426     1431       +5     

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@siyuanfoundation (Contributor)

I can confirm this fixes the failure in #19038. Thank you @ahrtr !

@ahrtr (Member, Author) commented Dec 24, 2024

> I can confirm this fixes the failure in #19038. Thank you @ahrtr !

Thanks for the confirmation.

Can we get this merged first? PTAL, cc @serathius

@ahrtr (Member, Author) commented Dec 24, 2024

cc @fuweid @ivanvc @jmhbnz

@siyuanfoundation (Contributor) commented Jan 3, 2025

After syncing my repo, I found that the robustness test still fails even with this fix. Because validatePersistedRequestMatchClientRequests requires the lastOp to be persisted, partial WAL entries do not satisfy this check.
I got the following error:

last succesful client write {"Type":"txn","LeaseGrant":null,"LeaseRevoke":null,"Range":null,"Txn":{"Conditions":null,"OperationsOnSuccess":[{"Type":"put-operation","Range":{"Start":"","End":"","Limit":0},"Put":{"Key":"tombstone","Value":{"Value":"true","Hash":0},"LeaseID":0},"Delete":{"Key":""}}],"OperationsOnFailure":null},"Defragment":null,"Compact":null} was not persisted, required to validate
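
For readers unfamiliar with the check being referenced: below is a simplified, hypothetical sketch of the requirement described above (the Request type and helper name are placeholders, not the real robustness-test API). The last successful client write must be found among the requests recovered from the WAL, so a WAL with missing entries can never satisfy it.

package validatesketch

import "fmt"

// Request is a placeholder for the robustness test's request model.
type Request struct {
	Type string
	Key  string
}

// validateLastOpPersisted fails if the last successful client operation does
// not appear among the requests recovered from the persisted WAL, mirroring
// the "was not persisted, required to validate" error above.
func validateLastOpPersisted(persisted []Request, lastOp Request) error {
	for _, r := range persisted {
		if r == lastOp {
			return nil
		}
	}
	return fmt.Errorf("last successful client write %+v was not persisted, required to validate", lastOp)
}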

@ahrtr (Member, Author) commented Jan 4, 2025

> After syncing my repo, I found that the robustness test still fails even with this fix. Because validatePersistedRequestMatchClientRequests requires the lastOp to be persisted, partial WAL entries do not satisfy this check. I got the following error:
>
> last succesful client write {"Type":"txn","LeaseGrant":null,"LeaseRevoke":null,"Range":null,"Txn":{"Conditions":null,"OperationsOnSuccess":[{"Type":"put-operation","Range":{"Start":"","End":"","Limit":0},"Put":{"Key":"tombstone","Value":{"Value":"true","Hash":0},"LeaseID":0},"Delete":{"Key":""}}],"OperationsOnFailure":null},"Defragment":null,"Compact":null} was not persisted, required to validate

@siyuanfoundation how often did you see this error? Or in other words, is it easy to reproduce this error?

If I understood it correctly, the robustness test error means that the last client write, which already got a successful response, was not persisted in the WAL file. Please let me know if I misunderstood it.

Each time we see an issue, the first thing is to figure out whether it's a real issue from the end user's perspective. Can you manually double-check whether the last successful client write was persisted in the WAL files of a majority of the members, and also in the bbolt db?

Also, I see that the robustness test might not process the WAL records correctly; the longest one might not be the correct one. As long as the WAL records were not committed yet, they may be overwritten by subsequent WAL records.

if len(memberRequests) > len(persistedRequests) {
	persistedRequests = memberRequests
}

I regard it as a test issue for now; please raise a separate issue to track it. Thanks.
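
To illustrate the overwrite point above (a hypothetical sketch with a placeholder Entry type, not robustness-test code): when WAL entry records are replayed in order, a record whose index is not exactly one past the current end replaces the conflicting suffix, so the number of raw records read can exceed the length of the effective log, and comparing raw lengths can overcount uncommitted entries.

package replaysketch

// Entry is a minimal stand-in for a decoded WAL entry record.
type Entry struct {
	Index uint64
	Data  []byte
}

// replayEffective applies entry records in order. Records that conflict with
// already-applied ones overwrite the suffix; a record that jumps past the end
// leaves a gap, so only the continuous prefix is kept.
func replayEffective(records []Entry, startIndex uint64) []Entry {
	var ents []Entry
	for _, e := range records {
		if e.Index <= startIndex {
			continue // predates the start point
		}
		up := e.Index - startIndex - 1
		if up > uint64(len(ents)) {
			break // gap: stop at the continuous prefix
		}
		ents = append(ents[:up], e) // truncate conflicting suffix, then append
	}
	return ents
}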

@serathius (Member)

> Also, I see that the robustness test might not process the WAL records correctly; the longest one might not be the correct one. As long as the WAL records were not committed yet, they may be overwritten by subsequent WAL records.

You are right that the longest WAL does not necessarily include the longest committed sequence; however, in the robustness test we explicitly make a single additional transaction after the test is finished, which should ensure that there are no other uncommitted transactions. We require that transaction to succeed and later use it to assert that the WAL is complete.

@serathius (Member) left a review comment

The change looks good; however, I haven't validated how it works with repair.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@siyuanfoundation (Contributor)

> After syncing my repo, I found that the robustness test still fails even with this fix. Because validatePersistedRequestMatchClientRequests requires the lastOp to be persisted, partial WAL entries do not satisfy this check. I got the following error:
>
> last succesful client write {"Type":"txn","LeaseGrant":null,"LeaseRevoke":null,"Range":null,"Txn":{"Conditions":null,"OperationsOnSuccess":[{"Type":"put-operation","Range":{"Start":"","End":"","Limit":0},"Put":{"Key":"tombstone","Value":{"Value":"true","Hash":0},"LeaseID":0},"Delete":{"Key":""}}],"OperationsOnFailure":null},"Defragment":null,"Compact":null} was not persisted, required to validate
>
> @siyuanfoundation how often did you see this error? Or in other words, is it easy to reproduce this error?
>
> If I understood it correctly, the robustness test error means that the last client write, which already got a successful response, was not persisted in the WAL file. Please let me know if I misunderstood it.
>
> Each time we see an issue, the first thing is to figure out whether it's a real issue from the end user's perspective. Can you manually double-check whether the last successful client write was persisted in the WAL files of a majority of the members, and also in the bbolt db?
>
> Also, I see that the robustness test might not process the WAL records correctly; the longest one might not be the correct one. As long as the WAL records were not committed yet, they may be overwritten by subsequent WAL records.
>
> if len(memberRequests) > len(persistedRequests) {
> 	persistedRequests = memberRequests
> }
>
> I regard it as a test issue for now; please raise a separate issue to track it. Thanks.

This is the same error as the one below, which I saw before this PR:

failed to read WAL, cannot be repaired, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 7931, Term: 4], len(ents): 7189

I can reproduce the error for the MemberDowngrade failpoint at least 10% of the time.
There does not seem to be any problem with the member data, although in my local tests I cannot find all the WAL files even with --max-wals=0 --max-snapshots=0.
The top of the log dump looks like:

Snapshot:
term=4 index=14302 nodes=[ac4ec652f10e5b49 bf19ae4419db00dc eabdbb777cf498cb] confstate={"voters":[12416079282240904009,13770228943176794332,16914881897345358027],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=eabdbb777cf498cb clusterID=b3bc0c1919fe5d7e term=4 commitIndex=14526 vote=eabdbb777cf498cb
WAL entries: 225
lastIndex=14527
term         index      type    data
   4         14303      norm    header:<
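
To make the dump above concrete: this member's snapshot is at index 14302 and its WAL entries start at index 14303, so the member's own data is continuous; the slice-bounds error only appears because the test walks the WAL from a synthetic snapshot {Index: 0, Term: 0}. A small illustrative sketch (hypothetical helper, not etcd or robustness-test code):

package main

import "fmt"

// checkContinuity assumes entries must start right after the snapshot the
// reader walks from; any jump past that point is a gap from the reader's view.
func checkContinuity(snapshotIndex uint64, entryIndexes []uint64) error {
	next := snapshotIndex + 1
	for _, idx := range entryIndexes {
		if idx > next {
			return fmt.Errorf("gap: expected index <= %d, got %d", next, idx)
		}
		if idx == next {
			next++
		}
	}
	return nil
}

func main() {
	fmt.Println(checkContinuity(0, []uint64{14303}))     // gap: the test reads from snapshot {0, 0}
	fmt.Println(checkContinuity(14302, []uint64{14303})) // <nil>: continuous after the member's real snapshot
}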

@ahrtr (Member, Author) commented Jan 8, 2025

@siyuanfoundation I am a little confused; probably I did not explain it clearly.

There are two errors. One is #19038 (comment), and it's already confirmed that this PR can fix it. Please let me know if you can still see the error with the patch included in this PR.

The second error is #19095 (comment). A successful client write must have been persisted in the WAL files of at least a majority of the members, and probably also in the bbolt DB. This is exactly what I was requesting you to manually double-check, as mentioned in #19095 (comment).

Also

> You are right that the longest WAL does not necessarily include the longest committed sequence; however, in the robustness test we explicitly make a single additional transaction after the test is finished, which should ensure that there are no other uncommitted transactions. We require that transaction to succeed and later use it to assert that the WAL is complete.

@serathius thanks for the clarification. But theoretically it's still possible that the longest one isn't the correct one. The single additional successful transaction you mentioned only guarantees that a majority of members have the correct WAL data.

Also note that since the current robustness test always reads WAL data starting from a snapshot {0, 0}, as mentioned in #19038 (comment), if there is a gap in a member's WAL file (as mentioned in #19038 (comment) and #19038 (comment)), then you will definitely see the error "last succesful client write .... was not persisted, required to validate". It's a test issue which needs to be resolved.

FYI. the last successful client write:

// Ensure that last operation succeeds
_, err = cc.Put(ctx, "tombstone", "true")
require.NoErrorf(t, err, "Last operation failed, validation requires last operation to succeed")

{
  "Type": "txn",
  "LeaseGrant": null,
  "LeaseRevoke": null,
  "Range": null,
  "Txn": {
    "Conditions": null,
    "OperationsOnSuccess": [
      {
        "Type": "put-operation",
        "Range": {
          "Start": "",
          "End": "",
          "Limit": 0
        },
        "Put": {
          "Key": "tombstone",
          "Value": {
            "Value": "true",
            "Hash": 0
          },
          "LeaseID": 0
        },
        "Delete": {
          "Key": ""
        }
      }
    ],
    "OperationsOnFailure": null
  },
  "Defragment": null,
  "Compact": null
}
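
One hypothetical way to act on the "majority of members" point above (a sketch only, with placeholder types; this is not how the robustness test currently selects WAL data): instead of simply taking the longest member request list, prefer a member whose persisted requests actually contain the final tombstone write, and require a majority of members to contain it.

package selectsketch

// Request is a placeholder for the robustness test's request model.
type Request struct {
	Type string
	Key  string
}

// containsTombstone reports whether the final tombstone put appears in the
// requests recovered from a member's WAL.
func containsTombstone(reqs []Request) bool {
	for _, r := range reqs {
		if r.Type == "txn" && r.Key == "tombstone" {
			return true
		}
	}
	return false
}

// pickPersistedRequests returns the longest member request list that still
// contains the tombstone, and whether a majority of members contain it.
func pickPersistedRequests(members [][]Request) (picked []Request, majority bool) {
	withTombstone := 0
	for _, reqs := range members {
		if containsTombstone(reqs) {
			withTombstone++
			if len(reqs) > len(picked) {
				picked = reqs
			}
		}
	}
	return picked, withTombstone*2 > len(members)
}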

@ahrtr (Member, Author) commented Jan 8, 2025

Let me merge this PR first, since it's already confirmed that it can resolve the first error. The second error should be a test issue.

ahrtr merged commit 00e5b65 into etcd-io:main on Jan 8, 2025
34 checks passed
ahrtr deleted the wal_20241221 branch on January 8, 2025, 14:15
@siyuanfoundation (Contributor)

> @siyuanfoundation I am a little confused; probably I did not explain it clearly.
>
> There are two errors. One is #19038 (comment), and it's already confirmed that this PR can fix it. Please let me know if you can still see the error with the patch included in this PR.

The statement "it's already confirmed that this PR can fix it" is no longer true. Before applying this PR, I was seeing the WAL error; with this PR, the same downgrade robustness test no longer passes as it did when I tested before, but the error message changed to not finding the last commit. The two errors are the same under the hood, both caused by missing WAL entries in the persisted file.

@ahrtr (Member, Author) commented Jan 8, 2025

> but the error message changed to not finding the last commit. The two errors are the same under the hood, both caused by missing WAL entries in the persisted file.

They are not the same error. Even without this PR, the robustness test still has the second error. The reason you did not see it before is that it was hidden by the first error. As explained in my previous comment, the robustness test's way of reading the WAL is wrong.

@ahrtr (Member, Author) commented Jan 8, 2025

Just raised #19147
