Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Consensus] test to confirm correct threshold clock advancement after GC #20492

Merged
merged 1 commit into from
Jan 24, 2025

Conversation

akichidis
Copy link
Contributor

@akichidis akichidis commented Dec 3, 2024

Description

This is a fix for a behaviour that has been observed during the private-testnet test where some nodes started crashing when setting gc_depth = 5. The goal was to test GC with very low gc values to reveal possible issues, such as this one.

This PR is fixing the block acceptance path after commits happening. As we try to unsuspend blocks after commits have happened, which naturally move the gc_round , we should make sure that those blocks are been processed via Core as well and move the threshold clock. Otherwise it's possible - and this mostly becomes visible during bulk processing of blocks - for the following scenario to happen:

This PR is now including the test only that confirm the issue as described below when we bulk processing blocks under GC conditions. The #20906 has moved the threshold clock in DagState eliminating the issue:

  • a bunch of blocks are received for processing
  • threshold clocks advances to round R according to the so far accepted blocks via the block manager. Some blocks though are not accepted yet as they do have some early dependencies higher than the current gc_round, so they get suspended.
  • we attempt to trigger a commit. A commit happens where multiple leaders get committed. This moves the gc_round. Blocks are unsuspended because of the new gc_round. More commits are now happening as we can process the suspended blocks.
  • Once the commits finished, gc_round has advanced so much that our threshold clock is well behind. Now our node tries to create a new block but the threshold clock is at round R although gc_round has advanced to R + X. This will make our own block not getting accepted as we produce a block for a quite old round. We see panics from
    assert_eq!(accepted_blocks.len(), 1);

Test plan

CI/PT


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

Copy link

vercel bot commented Dec 3, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 23, 2025 0:12am
3 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Jan 23, 2025 0:12am
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Jan 23, 2025 0:12am
sui-typescript-docs ⬜️ Ignored (Inspect) Visit Preview Jan 23, 2025 0:12am

/// commits can trigger an advancement in gc round. Suspended blocks that have dependency in their causal history to any gc'ed blocks, will get unsuspended
/// and accepted.
#[instrument(level = "debug", skip_all)]
fn try_commit(&mut self) -> ConsensusResult<(Vec<CommittedSubDag>, Vec<VerifiedBlock>)> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative I thought was try_commit calling directly the add_accepted_blocks it self for stronger guarantees if we refactor the code etc. Chose to return the accepted blocks and handle independently as I didn't want the try_commit do too much on its own

@mwtian
Copy link
Member

mwtian commented Dec 4, 2024

I think this issue seems to be a consequence of the threshold clock implementation, which is essentially a separate data structure that needs to be synced with DagState. Keeping track of the flow of accepted blocks seem a bit messy. What about revisiting a previous discussion / PR, where during proposing, the quorum round is read from DagState?

@akichidis akichidis force-pushed the akichidis/consensus-commit-rule-modify branch from c3c0502 to a9b2f85 Compare December 19, 2024 15:58
@akichidis akichidis requested review from a team and mystenmark as code owners December 19, 2024 15:58
@akichidis akichidis force-pushed the akichidis/consensus-commit-rule-modify branch from a9b2f85 to 1424e0c Compare January 2, 2025 16:33
Base automatically changed from akichidis/consensus-commit-rule-modify to main January 9, 2025 16:28
@akichidis akichidis force-pushed the akichidis/fix-advance-threshold-clock branch from d4024c5 to 545b664 Compare January 9, 2025 17:24
@akichidis akichidis temporarily deployed to sui-typescript-aws-kms-test-env January 9, 2025 17:24 — with GitHub Actions Inactive
@akichidis
Copy link
Contributor Author

I think this issue seems to be a consequence of the threshold clock implementation, which is essentially a separate data structure that needs to be synced with DagState. Keeping track of the flow of accepted blocks seem a bit messy. What about revisiting a previous discussion / PR, where during proposing, the quorum round is read from DagState?

@mwtian @arun-koshy could you please re-review? I will try to refactor this on separate PR and find a better fit/solution probably using DagState, but I would like to merge this so I can at least enable GC for simtests/devnet and start getting feedback earlier and then refine further. Let me know if that's ok.

@akichidis akichidis force-pushed the akichidis/fix-advance-threshold-clock branch from 545b664 to 64b2bf6 Compare January 23, 2025 12:10
@akichidis akichidis temporarily deployed to sui-typescript-aws-kms-test-env January 23, 2025 12:10 — with GitHub Actions Inactive
@akichidis akichidis changed the title [Consensus] process accepted blocks after commits [Consensus] test to confirm correct threshold clock advancement after GC Jan 23, 2025
@akichidis akichidis merged commit 4be879f into main Jan 24, 2025
52 of 55 checks passed
@akichidis akichidis deleted the akichidis/fix-advance-threshold-clock branch January 24, 2025 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants