Restart flat db healing pipeline in case of error #8081

Open
macfarla opened this issue Jan 6, 2025 · 3 comments
Labels
bonsai bug Something isn't working P2 High (ex: Degrading performance issues, unexpected behavior of core features (DevP2P, syncing, etc)) syncing ux

Comments

@macfarla
Contributor

macfarla commented Jan 6, 2025

During testing of 24.12.0-RC2, one node failed to sync during the flat db healing step with a MerkleTrieException. There are no other exceptions in the log.

{"@timestamp":"2024-12-09T09:43:38,974","level":"INFO","thread":"EthScheduler-Services-34 (batchHealAndPersistFlatStorageData)","class":"Pipeline","message":"Unexpected exception in pipeline. Aborting.
Throwable summary: org.hyperledger.besu.ethereum.trie.MerkleTrieException: Unable to load trie node value for hash 0x7c645ffe7048d604cc92a50f9e033b98e05ff48fe07e6d0223abf58fe274d8ce location 0x040005
	at: org.hyperledger.besu.ethereum.trie.StoredNode.lambda$load$0(StoredNode.java:135)","throwable":""}

Leaving #8015 for the investigation of the root cause of this exception.

This issue tracks the mitigation, which is to restart the pipeline if such an error occurs. Currently, a restart is required to recover (likely because the node gets a new pivot block).

@macfarla macfarla added bonsai bug Something isn't working P2 High (ex: Degrading performance issues, unexpected behavior of core features (DevP2P, syncing, etc)) syncing ux labels Jan 6, 2025
@matkt
Contributor

matkt commented Jan 7, 2025

Do you have all of the logs since the node started? I want to be sure we are not missing an exception that would explain this problem.

Also, a restart will not fix the problem; you need to restart the sync from scratch when you hit this issue.

@jframe
Contributor

jframe commented Jan 9, 2025

I have the logs up until it failed, but there are no errors in them. dev-elc-bu-nb-sepolia-burn-snap-24.12.0-RC2_besu.log

Restarting the node fixed the issue. It started the flat db healing again and then continued with the full sync as normal. Unfortunately, I couldn't find those logs.

@iryoung
Contributor

iryoung commented Jan 22, 2025

I investigated this issue a bit and identified a few potential options:

  • Restarting the batchHealAndPersistFlatStorageData stage of org.hyperledger.besu.ethereum.eth.sync.snapsync.SnapWorldStateDownloadProcess.storageFlatDatabaseHealingPipeline
    • This doesn't seem suitable, as it doesn't fetch new pivot block info.
  • Restarting org.hyperledger.besu.ethereum.eth.sync.snapsync.SnapWorldStateDownloadProcess.storageFlatDatabaseHealingPipeline
    • This also doesn't seem suitable, for the same reason.
  • Restarting the whole pipeline in SnapWorldStateDownloadProcess
    • This might work, but it could be tricky to implement.
  • Restarting the whole node
    • This seems simple and clean, but I'm not sure if it's a viable option.
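To make the "restart with a new pivot" idea concrete, here is a minimal Java sketch. It is not Besu code: every name in it (HealingRestarter, healWithRestart, MissingTrieNodeException) is hypothetical, and it assumes the healing step can be modeled as a function of the pivot. On a missing-trie-node failure it discards the run, selects a fresh pivot, and heals again, up to a fixed number of attempts; any other exception propagates unchanged.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch, not Besu code: restart healing with a fresh pivot on a trie-node error. */
public final class HealingRestarter {

  /** Hypothetical stand-in for an "unable to load trie node" failure (not Besu's MerkleTrieException). */
  public static final class MissingTrieNodeException extends RuntimeException {
    public MissingTrieNodeException(String message) {
      super(message);
    }
  }

  private HealingRestarter() {}

  /**
   * Selects a pivot, runs healing against it, and on MissingTrieNodeException
   * discards the run and starts over with a newly selected pivot.
   * Returns the number of attempts used; other exceptions propagate unchanged.
   */
  public static <P> int healWithRestart(
      Supplier<P> selectPivot, Consumer<P> heal, int maxAttempts) {
    MissingTrieNodeException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      P pivot = selectPivot.get(); // ask for a (possibly newer) pivot on every attempt
      try {
        heal.accept(pivot);
        return attempt; // healing finished without a missing node
      } catch (MissingTrieNodeException e) {
        lastFailure = e; // abort this run; the next iteration restarts from a new pivot
      }
    }
    throw new IllegalStateException(
        "flat db healing failed after " + maxAttempts + " attempts", lastFailure);
  }

  public static void main(String[] args) {
    int[] nextPivot = {100};
    // Simulated healer: pivots below 102 hit a missing trie node, later ones succeed.
    int attempts =
        HealingRestarter.<Integer>healWithRestart(
            () -> nextPivot[0]++,
            pivot -> {
              if (pivot < 102) {
                throw new MissingTrieNodeException("missing node for pivot " + pivot);
              }
            },
            5);
    System.out.println("healed after " + attempts + " attempts"); // prints "healed after 3 attempts"
  }
}
```

Capping the number of attempts is a deliberate choice: if the corruption persists across pivots (as suggested above, where a resync from scratch may be needed), the sketch fails loudly instead of looping forever.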

Additionally, could you provide guidance on how to reproduce this issue?

@jframe I wasn't able to access the log you uploaded (permission denied). Should I register for a specific service, or could you provide another way to download it?

Projects
None yet
Development

No branches or pull requests

4 participants