[BUG] - missing resource isolation to prefer critical tasks over query processing #5984

Open
andrejpodzimek opened this issue Sep 13, 2024 · 4 comments
Labels
needs triage, Stale

andrejpodzimek commented Sep 13, 2024

Internal/External
External

Area
Other

Summary
Leader log queries impede critical validator processing and cause extreme numbers of missed slot leader checks.

Steps to reproduce

  1. Watch the frequency of missed slot leader checks over time.
  2. Run a demanding cardano-cli query in a loop against the validator (example below).
  3. Watch the disaster unfold: in my case, 7% of slot leader checks were missed because of a repeated query.
  4. Repeat the test with a regular relay node. Tip differences will run sky high (>100) when queries are processed.
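
To put a number on step 1, the share of missed checks can be computed from two counters scraped from the node's Prometheus endpoint. This is only a sketch: `metric_value` and `missed_pct` are hypothetical helpers, and the metric names and port in the usage comment are assumptions that should be checked against what your node actually exports.

```shell
#!/bin/sh
# Sketch: percentage of missed slot leader checks from two counters.

# Print the value of the first unlabelled sample of metric "$1"
# found in Prometheus text format on stdin.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Print missed / (missed + performed) as a percentage with two decimals.
# $1 = missed checks, $2 = checks that were performed.
missed_pct() {
  awk -v m="$1" -v p="$2" 'BEGIN {
    total = m + p
    if (total == 0) { print "0.00"; exit }
    printf "%.2f\n", 100 * m / total
  }'
}

# Usage (endpoint and metric names are assumptions, adjust to your setup):
#   metrics=$(curl -s http://127.0.0.1:12798/metrics)
#   missed=$(printf '%s\n' "$metrics" | metric_value cardano_node_metrics_slotsMissedNum_int)
#   performed=$(printf '%s\n' "$metrics" | metric_value cardano_node_metrics_Forge_forge_about_to_lead_int)
#   missed_pct "$missed" "$performed"
```

Sampling this before and during the query loop from step 2 shows how much the figure moves while queries are being served.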

Expected behavior
Proper resource isolation.

  • Ongoing query processing never delays inward tip propagation (“height”).
  • Ongoing query processing on a validator never ever causes missed slot leader checks!
    The fact that one should not run queries against a validator is orthogonal; a validator should either process such queries gracefully, without impediment to critical operations, or outright reject them.
  • Timing-critical tasks must take precedence. (They should not be timing-critical, but sadly are.)

System info (please complete the following information):

  • OS Name: ArchLinux
  • OS Version: Only bad distros have this.
  • Node version (output of cardano-node --version):
    cardano-node 9.1.1 - linux-x86_64 - ghc-8.10
    git rev 66dc08944479792b2823c9e1356914820c9ea059
    
  • CLI version (output of cardano-cli --version):
    cardano-cli 9.2.1.0 - linux-x86_64 - ghc-8.10
    git rev 66dc08944479792b2823c9e1356914820c9ea059
    

Screenshots and attachments
An example query to expose resource isolation problems:

cardano-cli query leadership-schedule \
  --socket-path /run/cardano-validator/socket \
  --genesis config/mainnet-shelley-genesis.json \
  --mainnet \
  --vrf-signing-key-file keys/mainnet/vrf.skey \
  --stake-pool-id ... \
  --next
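
Running the query in a loop, as in step 2, could look like the sketch below. `run_in_loop` is a hypothetical helper, not part of cardano-cli; it repeats whatever command it is given and prints how long each run took, so slow runs are easy to line up with missed slot leader checks.

```shell
#!/bin/sh
# Sketch: run a command repeatedly and report each run's wall-clock time.
# The function body runs in a subshell, so its variables do not leak.
# $1 = number of runs, remaining arguments = the command to repeat.
run_in_loop() (
  count="$1"
  shift
  i=1
  while [ "$i" -le "$count" ]; do
    start=$(date +%s)
    status=ok
    # Discard the command's output; only record whether it succeeded.
    "$@" >/dev/null 2>&1 || status=failed
    end=$(date +%s)
    echo "run $i: $status, $((end - start))s"
    i=$((i + 1))
  done
)
```

Invoke it as `run_in_loop 100 cardano-cli query leadership-schedule` followed by the same arguments as in the query above; the count of 100 is arbitrary.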

RTS options:

... +RTS -N -A64m -H -Iw59 --nonmoving-gc -RTS ...

Additional context
This case could be dismissed with “use a workaround”, i.e. “run a separate relay node that serves slot leader queries only and does not route to a validator”. However, that is suboptimal: it increases the resources a pool operator must set aside by up to 50% (a third node on top of two), compared to the simplest relay + validator setup.

The lack of proper resource isolation may have been a contributing factor to my problem of never successfully validating a block, described in this post and above.

andrejpodzimek added the needs triage label Sep 13, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions bot added the Stale label Oct 14, 2024
erikd removed the Stale label Oct 14, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions bot added the Stale label Nov 14, 2024

karknu (Contributor) commented Nov 14, 2024

@andrejpodzimek I've been working on something that may alleviate your problem. It was done for relays serving hundreds of clients but perhaps it could work here too.

The branch is https://github.com/IntersectMBO/cardano-node/tree/karknu/thread_isolation. It is based on 10.1.2, so it will require a chain replay if you're still on 9.2.1. It is experimental, so it's best to test it on your backup BP or on a testnet.

github-actions bot removed the Stale label Nov 15, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions bot added the Stale label Dec 15, 2024
3 participants