Heimdall 1.1.0-beta Amoy in combination with Erigon v2.60.10 not able to begin without remote bor #1218

Open
MrFreezeDZ opened this issue Dec 18, 2024 · 12 comments

Comments

@MrFreezeDZ

MrFreezeDZ commented Dec 18, 2024

Heimdall version
We run the image from here: https://hub.docker.com/layers/0xpolygon/heimdall/1.1.0-beta/images/sha256-1f8acf1364c29c388869b7701dc9870f58267ba7989221e289596b3a3473d7b4

Environment:

  • OS: We run the image inside a Kubernetes cluster
  • Install tools: Helm with a custom Helm chart
  • Others:

What happened:
After upgrading to version 1.1.0-beta we hit an error and opened a support ticket: https://support.polygon.technology/support/tickets/140239
We were advised to resync with snapshots from here: https://publicnode.com/snapshots#polygon
We were also advised to point Heimdall at the remote bor_rpc_url: https://rpc-amoy.polygon.technology
We also had to unwind our Erigon to get it running again.
In addition, we had to enable the bor API on the Erigon side, which was not necessary before Heimdall 1.1.0-beta.
Once Heimdall and Erigon were in sync again, we were able to set the Heimdall bor_rpc_url back to our local Erigon. Since then Heimdall only complains every now and then, with messages like this:

ERROR[2024-12-17|14:35:32.885] Unable to fetch block by number from child chain block=null err="Post \"http://localhost:8545\": dial tcp 127.0.0.1:8545: connect: connection refused"
ERROR[2024-12-17|14:35:32.885] Error validating milestone module=Milestone startBlock=15718120 endBlock=15718143 hash=0x16a2937e593095f88a4c0fd83d964dbef249e53cd57a32b355b255997f4fbfc8 milestoneId="2fdf1f1e-5afc-449d-823b-8a5727d697be - 0x3d964dbef249e53cd57a32b355b255997f4fbfc8" error="End block number with confirmation is not available in the Bor chainEndBlock15718143Confirmation16"
ERROR[2024-12-17|14:35:32.885] Hash is not valid module=Milestone startBlock=15718120 endBlock=15718143 hash=0x16a2937e593095f88a4c0fd83d964dbef249e53cd57a32b355b255997f4fbfc8 milestoneId="2fdf1f1e-5afc-449d-823b-8a5727d697be - 0x3d964dbef249e53cd57a32b355b255997f4fbfc8"
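
A note for readers decoding the second message: the trailing "chainEndBlock15718143Confirmation16" looks like unseparated fields, i.e. end block 15718143 and 16 confirmations. Assuming Heimdall needs the child chain to have reached the end block plus the confirmations before it validates a milestone (an assumption based on the wording of the message, not something confirmed in this thread), the block it is waiting for can be worked out as in the sketch below. The function names are illustrative only.

```python
def required_child_block(end_block: int, confirmations: int) -> int:
    """Block the child chain is assumed to need before the milestone is validated."""
    return end_block + confirmations


def milestone_checkable(local_head: int, end_block: int, confirmations: int) -> bool:
    # Assumption: the check passes once the local head has reached the required block.
    return local_head >= required_child_block(end_block, confirmations)


# Values taken from the log line above.
print(required_child_block(15718143, 16))  # 15718159
```

If the local Erigon head is below that number, or the RPC endpoint is unreachable as in the first message, the validation would fail in this way.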

Also, when Erigon is running and I stop Heimdall, delete its data folder, then download and extract the Heimdall snapshot again, Heimdall logs an error message if bor_rpc_url is set to Erigon (localhost). The only way I can get Heimdall to run with the downloaded snapshots is to use the bor_rpc_url: https://rpc-amoy.polygon.technology

What you expected to happen:
I expect to be able to download the snapshots for Heimdall and Erigon from https://publicnode.com/snapshots#polygon and run our node without any remote bor rpc url set. Having to point our Heimdall at a bor_rpc_url other than our own node feels wrong.

Have you tried the latest version: yes/no
yes

How to reproduce it (as minimally and precisely as possible):

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):

ERROR: failed to create new node: error during handshake: error on replay: Wrong Block.Header.AppHash. Expected 47FE8530728F28365D236037AC55243299866FBABF2D937C871D3F4221FC8861, got FAE7FDF9DA93BCEB584E8DC34F7498FD5AD7CD0206D346560D5B77B53068CBCA

Config (you can paste only the changes you've made):
Nothing changed since Heimdall version 1.0.7.

node command runtime flags:
/usr/bin/heimdalld start --home=/heimdall-home

/dump_consensus_state output for consensus bugs

Anything else we need to know:

@avalkov

avalkov commented Dec 18, 2024

ERROR[2024-12-17|14:35:32.885] Unable to fetch block by number from child chain block=null err="Post \"http://localhost:8545\": dial tcp 127.0.0.1:8545: connect: connection refused"

It looks like the local Erigon node is not running?

@MrFreezeDZ
Author

@avalkov The local Erigon instance is running. This message does not appear continuously. In my opinion it appears because Erigon is not fast enough for Heimdall. The block number also keeps rising across these messages. To be clear, this message is the only thing I find suspicious, and I am not sure whether it is related to the main problem. The main problem here is:

I expect to be able to download the snapshots for Heimdall and Erigon from https://publicnode.com/snapshots#polygon and run our node without any remote bor rpc url set.

@avalkov

avalkov commented Dec 19, 2024

@MrFreezeDZ From what I understand, you run Heimdall in a container and Erigon either in a container or on the host machine? From the error it looks like Heimdall is trying to connect to Erigon on localhost, which cannot be correct: it has to be the Erigon container IP or the host machine IP.

Please verify manually that you can connect to the Erigon RPC (port 8545) from inside the Heimdall container.

Thanks.
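
One way to run the suggested check from inside the Heimdall container, as a minimal sketch: it only tests whether a TCP connection to the RPC port can be opened and says nothing about whether the bor API is enabled. The helper name and the host in the usage comment are placeholders, not values confirmed for this setup.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, DNS failures and timeouts.
        return False


# Usage (placeholder host; substitute the address Heimdall is configured with):
#   can_reach("localhost", 8545)
```

A "connection refused" on localhost, as in the log above, would show up here as False, since inside a container localhost refers to the container itself.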

@MrFreezeDZ
Author

@avalkov oh I think you are right. That is definitely a problem. Thank you very much for pointing me in the right direction. Right now I am trying to test it. I will write here if it works or not.

@MrFreezeDZ
Author

@avalkov Hi, I tested it. When I spin up Heimdall and Erigon from scratch with snapshots from https://publicnode.com/snapshots#polygon and Heimdall pointing to bor_rpc_url = "https://rpc-amoy.polygon.technology", it works. Once Heimdall and Erigon are both in sync, I am able to set Heimdall's bor_rpc_url = "http://erigon-svc:8545", restart Heimdall, and it keeps working. "erigon-svc" is the Kubernetes service that our local Erigon is listening on.
But when I spin up from scratch again with snapshots and initially let Heimdall point to bor_rpc_url = "http://erigon-svc:8545", it won't work and Heimdall will find no peers. This is my seeds configuration for this Heimdall amoy instance:

I get the following errors when I then switch to bor_rpc_url = "https://rpc-amoy.polygon.technology" after initially using bor_rpc_url = "http://erigon-svc:8545":

ERROR[2024-12-20|13:52:35.907] Error dialing seed                           module=p2p err="auth failure: secret conn failed: EOF" [email protected]:26656
ERROR[2024-12-20|13:52:36.225] Error dialing seed                           module=p2p err="auth failure: conn.ID (3b4c9788a51e212fa1d0bba8fd1f8bd2083cf2a4) dialed ID (080dcdffcc453367684b61d8f3ce032f357b0f73) mismatch" [email protected]:26656
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.419] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
ERROR[2024-12-20|13:52:36.418] Peer Info                                    module=pex numPeers=0
panic: +2/3 committed an invalid block: Wrong Block.Header.AppHash.  Expected 54CCCBD23882C578A77DABCF5D0D20FA59554F756C78167795CE661A6D6A92F9, got 28E2F727486853BA333AEEDD31DF24354CCC2E73BC0ADF73CD04E04561326A1D

goroutine 124 [running]:
github.com/tendermint/tendermint/consensus.(*ConsensusState).finalizeCommit(0xc0011c8a88, 0x5a8146)
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/consensus/state.go:1304 +0xc67
github.com/tendermint/tendermint/consensus.(*ConsensusState).tryFinalizeCommit(0xc0011c8a88, 0x5a8146)
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/consensus/state.go:1281 +0x2da
github.com/tendermint/tendermint/consensus.(*ConsensusState).addProposalBlockPart(0xc0011c8a88, 0x0?, {0xc0178c7020?, 0x1b36820?})
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/consensus/state.go:1523 +0x711
github.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc0011c8a88, {{0x27d6180, 0xc0179e47b0}, {0xc0178c7020, 0x28}})
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/consensus/state.go:679 +0x191
github.com/tendermint/tendermint/consensus.(*ConsensusState).readReplayMessage(0xc0011c8a88, 0x1ce3820?, {0x0?, 0x0?})
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/consensus/replay.go:88 +0xf05
github.com/tendermint/tendermint/consensus.(*ConsensusState).catchupReplay(0xc0011c8a88, 0x5a8146)
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/consensus/replay.go:160 +0x5ed
github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart(0xc0011c8a88)
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/consensus/state.go:309 +0x1ea
github.com/tendermint/tendermint/libs/common.(*BaseService).Start(0xc0011c8a88)
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/libs/common/service.go:139 +0x1f2
github.com/tendermint/tendermint/consensus.(*ConsensusReactor).SwitchToConsensus(_, {{{0xa, 0x0}, {0xc01146ddc0, 0x6}}, {0xc01146dde0, 0xe}, 0x5a8145, 0x1b6807, {{0xc0112c7160, ...}, ...}, ...}, ...)
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/consensus/reactor.go:117 +0x166
github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc000683340)
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/blockchain/v0/reactor.go:271 +0xffd
created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).OnStart in goroutine 1
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/blockchain/v0/reactor.go:118 +0x6e
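
A note on the "dialed ID mismatch" lines above: Tendermint peer addresses have the form nodeID@host:port, and this error reports that the node answering at the address presented a different ID (conn.ID) than the one that was dialed. A minimal sketch for pulling both IDs out of such a line, so that the configured seed entry can be compared against them; the regular expression and function name are illustrative and not part of Heimdall or Tendermint.

```python
import re
from typing import Optional, Tuple

# Matches the mismatch text as it appears in the log lines above.
MISMATCH_RE = re.compile(
    r"conn\.ID \((?P<presented>[0-9a-f]{40})\) "
    r"dialed ID \((?P<dialed>[0-9a-f]{40})\) mismatch"
)


def parse_id_mismatch(line: str) -> Optional[Tuple[str, str]]:
    """Return (presented_id, dialed_id) or None if the line is not a mismatch error."""
    match = MISMATCH_RE.search(line)
    if match is None:
        return None
    return match.group("presented"), match.group("dialed")
```

Whether the node at that address is the intended seed cannot be determined from the log alone.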

Another panic is:

ERROR[2024-12-20|14:12:30.469] Error dialing seed                           module=p2p err="auth failure: secret conn failed: read tcp 100.64.145.18:54246->54.217.171.196:26656: read: connection reset by peer" [email protected]:26656
ERROR[2024-12-20|14:12:30.792] Error dialing seed                           module=p2p err="auth failure: conn.ID (3b4c9788a51e212fa1d0bba8fd1f8bd2083cf2a4) dialed ID (080dcdffcc453367684b61d8f3ce032f357b0f73) mismatch" [email protected]:26656
panic: Failed to process committed block (5931334:D25A2D0C2B96D0999D9AE60491A993FDE3FB44BE3324222A177FA5266FA60F80): Wrong Block.Header.AppHash.  Expected 54CCCBD23882C578A77DABCF5D0D20FA59554F756C78167795CE661A6D6A92F9, got 28E2F727486853BA333AEEDD31DF24354CCC2E73BC0ADF73CD04E04561326A1D

goroutine 84 [running]:
github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc00051b880)
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/blockchain/v0/reactor.go:344 +0xfa5
created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).OnStart in goroutine 1
        /root/go/pkg/mod/github.com/maticnetwork/[email protected]/blockchain/v0/reactor.go:118 +0x6e
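
Both panics report the same pair of AppHash values, i.e. the application state this node computed does not match the one recorded in the block header. A small sketch for extracting the height and the two hashes from such lines, so they can be compared across restarts or between nodes; the patterns are written against the log lines in this comment and are not a Heimdall or Tendermint API.

```python
import re
from typing import Optional

HASHES_RE = re.compile(
    r"Wrong Block\.Header\.AppHash\.\s+"
    r"Expected (?P<expected>[0-9A-Fa-f]+), got (?P<got>[0-9A-Fa-f]+)"
)
HEIGHT_RE = re.compile(r"committed block \((?P<height>\d+):")


def parse_app_hash_mismatch(line: str) -> Optional[dict]:
    """Return height (if present) and both hashes, or None for unrelated lines."""
    hashes = HASHES_RE.search(line)
    if hashes is None:
        return None
    height = HEIGHT_RE.search(line)
    return {
        "height": int(height.group("height")) if height else None,
        "expected": hashes.group("expected"),
        "got": hashes.group("got"),
    }
```

The height is only present in the second panic message; for the first one the sketch returns None for it.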

@avalkov

avalkov commented Dec 23, 2024

Regarding the seeds not being found: are you sure that all required Erigon ports are forwarded properly? https://github.com/erigontech/erigon?tab=readme-ov-file#default-ports-and-firewalls

Also, when you say you spin up from scratch with snapshots, do you mean that both Erigon and Heimdall start from the same snapshot?
This really may not work - #1209 (comment)

@Raneet10
Member

But when I spin up from scratch again with snapshots and initially let Heimdall point to bor_rpc_url = "http://erigon-svc:8545"

@MrFreezeDZ Can you confirm that the bor_ RPC endpoint is enabled on your Erigon container? For instance, are you able to curl a bor_getAuthor request against http://erigon-svc:8545?

@MrFreezeDZ
Author

@avalkov I can say that before Heimdall 1.1.0-beta everything worked, and we did not even have the bor API enabled on Erigon. As you pointed out, we never had a correct bor endpoint configured for our Heimdall instance until now.
Do you mean that with Heimdall 1.1.0-beta more than the bor api on the Erigon side needs to be enabled?

The comment from icculp here describes exactly what I am wondering about too. So as long as this is a known issue and it is not planned to have this circular dependency in the future I will look forward to the next Heimdall release.

@Raneet10
Member

Do you mean that with Heimdall 1.1.0-beta more than the bor api on the Erigon side needs to be enabled?

Yes, there are some cases where Heimdall makes a bor RPC call, specifically bor_getAuthor. This change was introduced while implementing PIP-52.

The #1209 (comment) describes exactly what I am wondering about too. So as long as this is a known issue and it is not planned to have this circular dependency in the future I will look forward to the next Heimdall release.

Well, Heimdall won't have a dependency on bor until the Jorvik HF hits. Overall, we realised that PIP-52 ended up introducing slightly non-deterministic behaviour, for which we have a fix ready. It should be rolled out on Amoy in early Jan and on mainnet in Feb'25.
Thanks!

@MrFreezeDZ
Author

@Raneet10 To answer your question: to verify that the bor API is enabled, I started a pod in the same namespace so that I could use curl against erigon-svc. I can use the bor_ methods; for example, this works:

curl http://erigon-svc:8545 -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"bor_getAuthor","params":["latest"], "id":1}'
{"jsonrpc":"2.0","id":1,"result":"0x6dc2dd54f24979ec26212794c71afefed722280c"}
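
The same check can be scripted. A minimal Python sketch that builds the bor_getAuthor request body and validates a reply such as the one above; it only handles the JSON and leaves sending the request to curl or an HTTP client of your choice. The helper names are illustrative, and the address-shape check (0x plus 40 hex characters) is an assumption about what a valid reply looks like.

```python
import json


def build_bor_get_author(block: str = "latest", request_id: int = 1) -> str:
    """JSON-RPC 2.0 request body for bor_getAuthor."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": "bor_getAuthor", "params": [block], "id": request_id}
    )


def parse_author(response_body: str) -> str:
    """Return the author address, or raise if the reply is an error or malformed."""
    reply = json.loads(response_body)
    if "error" in reply:
        # For example, a "method not found" error if the bor namespace is not enabled.
        raise RuntimeError(f"JSON-RPC error: {reply['error']}")
    author = reply.get("result")
    if not (isinstance(author, str) and author.startswith("0x") and len(author) == 42):
        raise ValueError(f"unexpected result: {author!r}")
    int(author, 16)  # raises ValueError if the address is not valid hex
    return author
```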


This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 11, 2025
@MrFreezeDZ
Author

In my opinion this issue should not be closed automatically as long as there is no Heimdall version after 1.1.0-beta that lets us spin up a node from snapshots again, even after the Jorvik hard fork.

@avalkov avalkov removed the Stale label Jan 16, 2025