
[FR] - Log reason when node switches tip to a fork #6014

Open
TerminadaPool opened this issue Oct 15, 2024 · 17 comments
Labels
needs triage · type: enhancement

Comments

@TerminadaPool

Internal/External
External: Stake pool operator

Area
Other: Logging

Describe the feature you'd like
Please log the reason why the node preferred another fork.

Describe alternatives you've considered
The independent tool cncli logs block data, including the block VRF value, into a SQLite database. This database can be queried for insight into whether the node likely preferred another fork because that fork's terminal block had a lower block VRF and the tie-break rule was applied.

Additional context / screenshots
The node knows the reason for switching tip to a competing fork, but this reason isn't logged; the logs only confirm that the tip was switched. Was the reason that the competing fork was longer? Was it that the other fork's terminal block had a lower VRF, causing the tie-break rule to be applied? Or did the node prefer the other fork for some other reason?

This information would be really helpful when trying to identify why your own block was not adopted.

@TerminadaPool added the needs triage and type: enhancement labels Oct 15, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

@github-actions github-actions bot added the Stale label Nov 15, 2024
@TerminadaPool
Author

Please keep this open as a feature request.

@rphair

rphair commented Nov 15, 2024

Yes, please: if this is considered impractical to add, I would like to hear why. Our SPO operation believes this is important for quality-assurance purposes.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

@github-actions github-actions bot added the Stale label Dec 16, 2024
@rphair

rphair commented Dec 16, 2024

@carbolymer what kind of engagement do the developers expect for this issue not to be Stale? There won't be anything new to report until someone in the relevant team finally responds.

This log info has long been requested for diagnostic purposes, and the need for it is never going to go away. Those who need it most are small stake pools with infrequent blocks... and it will be needed even more in the near future if & when the k PoS parameter increases (as is constantly suggested in governance conversations).

If this issue keeps being dismissed, the implication is that QA isn't very important for stake pools and/or that chain density isn't that important for Cardano itself. I don't think either of these things is true, and I can state from SPO experience dating back to the beginning of the Shelley period that implementing this feature (which we arguably should have had from day 1) would be helpful for SPOs and node infrastructure operators at all levels.

@github-actions github-actions bot removed the Stale label Dec 17, 2024
@carbolymer
Contributor

There was a Slack discussion about this issue some time ago.
@jasagredo @amesgen would you mind sharing the main takeaways here?

@amesgen
Member

amesgen commented Jan 6, 2025

The trace messages already contain enough detail to deduce the reason why a node changes its chain. Concretely, whenever we change our selection, we trace the so-called "SelectView" of the tip of both the old and the new chain. This SelectView contains exactly the information needed to determine whether one chain is preferable to another. See PraosChainSelectView for more details and assorted documentation.

In Conway, when deciding whether to switch from the current chain to a candidate chain, the chain order works as a lexicographical combination of the following criteria:

  1. compare chain length, preferring larger values
  2. when (same slot and same issuer) compare operational certificate issue number, preferring larger values
  3. when (slots differ by at most 5) compare VRF, preferring smaller values
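To make the combined order concrete, here is a minimal Haskell sketch of it. This is not the actual ouroboros-consensus code (see PraosChainSelectView for that); the record and field names below simply mirror the SelectView JSON fields shown in the example further down.

import Data.ByteString (ByteString)
import Data.Word (Word64)

-- Field names mirror the PraosChainSelectView JSON in the example below.
data SelectView = SelectView
  { chainLength :: Word64
  , slotNo      :: Word64
  , issuerHash  :: ByteString
  , issueNo     :: Word64
  , tieBreakVRF :: ByteString
  }

-- GT means the candidate chain is preferred over the current one;
-- EQ means we stick with the current chain.
preferCandidate :: SelectView -> SelectView -> Ordering
preferCandidate cur cand =
     compare (chainLength cand) (chainLength cur)   -- rule 1: longer chain wins
  <> opCertRule
  <> vrfRule
  where
    -- rule 2: only applies when slot and issuer coincide;
    -- then the higher opcert issue number wins
    opCertRule
      | slotNo cur == slotNo cand && issuerHash cur == issuerHash cand
      = compare (issueNo cand) (issueNo cur)
      | otherwise = EQ
    -- rule 3: only applies when the slots are at most 5 apart;
    -- then the *lower* VRF wins, hence the flipped comparison
    vrfRule
      | slotDistance <= 5 = compare (tieBreakVRF cur) (tieBreakVRF cand)
      | otherwise         = EQ
    slotDistance = max (slotNo cur) (slotNo cand) - min (slotNo cur) (slotNo cand)

The (<>) on Ordering gives exactly the lexicographical combination: a later rule is only consulted when all earlier rules return EQ.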

By comparing the old and new SelectView, one can see which of these three rules was responsible for the chain switch:


Let's look at an example:

{
  "at": "2024-12-13T13:50:58.295305594Z",
  "ns": "ChainDB.AddBlockEvent.AddedToCurrentChain",
  "data": {
    "kind": "AddedToCurrentChain",
    "newTipSelectView": {
      "chainLength": 11214658,
      "issueNo": 16,
      "issuerHash": "8e0bb9b126acd8e65a023602377d9a7a6f15af394d1f7bdd852182b8",
      "kind": "PraosChainSelectView",
      "slotNo": 142531567,
      "tieBreakVRF": "f8f31aabb555124392485cbd7ba02bded4e4eac8980fc36fde0c9a21d7cf2d51c6fd4ddda8113153639db862781dafacbdd77da4e072452a330c32509506762e"
    },
    "newtip": "17d5b0f21e86503b41851640e9558610ebe9b7a62c96a08873eb4d0c7961a460@142531567",
    "oldTipSelectView": {
      "chainLength": 11214657,
      "issueNo": 20,
      "issuerHash": "c63dab6d780a74cbae2a27696c9723f55b3092b2bd001256df03827f",
      "kind": "PraosChainSelectView",
      "slotNo": 142531554,
      "tieBreakVRF": "905ff37db665e86d60d93dabc92ec86124b1a22165e4968facca226e790d9b7dd3f60502fa84dcd30d95ac28a9278e5e72b8b2eb451759ab59c92fc65909a2b4"
    }
  },
  "sev": "Notice",
  "thread": "76",
  "host": "ramify"
}

Here, we can see that the chainLength of newTipSelectView is larger than that of oldTipSelectView, so the chain switch happened due to rule 1.

If these were equal, one would look at rule 2 next (and then rule 3) to see which is responsible.
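For example, assuming the JSON trace output is collected one object per line in a file (node.json here is a hypothetical name), a jq filter along these lines extracts just the fields needed for that comparison:

jq -c 'select(.data.kind == "AddedToCurrentChain" or .data.kind == "SwitchedToAFork")
       | { at,
           oldLen: .data.oldTipSelectView.chainLength,
           newLen: .data.newTipSelectView.chainLength,
           oldVRF: .data.oldTipSelectView.tieBreakVRF,
           newVRF: .data.newTipSelectView.tieBreakVRF }' node.json

(The kind string "SwitchedToAFork" is assumed by analogy with "AddedToCurrentChain"; check a live message to confirm.)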


For convenience/simplicity, we could further enrich the trace message to directly include, e.g., a reason field that says something like "LongerChain" or "BetterTieBreaker".

cc @mgmeier as you might have already started with this.

@TerminadaPool
Author

That sort of log information would absolutely suit my needs. I wasn't aware we could get the "tieBreakVRF" values logged ordinarily.

What do I need to configure in my node config.json file so that such information is logged in my systemd journal?

@amesgen
Member

amesgen commented Jan 7, 2025

In both the old and the new tracing system, you should see this when you enable JSON output and use a severity of at least Notice for ChainDB.AddBlockEvent.AddedToCurrentChain / ChainDB.AddBlockEvent.SwitchedToAFork.
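For the new tracing system, a config.json fragment along these lines should be enough (the exact TraceOptions schema can vary between node versions, so treat this as a sketch to check against your node's config documentation):

{
  "UseTraceDispatcher": true,
  "TraceOptions": {
    "": { "severity": "Notice", "detail": "DNormal" },
    "ChainDB.AddBlockEvent.AddedToCurrentChain": { "severity": "Notice" },
    "ChainDB.AddBlockEvent.SwitchedToAFork": { "severity": "Notice" }
  }
}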

@mgmeier
Contributor

mgmeier commented Jan 7, 2025

Thanks @amesgen for summarizing the discussion we had (BTW: I needed to edit your post to change my GH handle to @mgmeier - the correct one). I'll add some more details:

  1. As mentioned, the above trace messages / tracers need to be enabled, and severity filtering should be configured to include messages of severity Notice.
  2. In the new system, you'd additionally need to configure the trace messages to be more verbose, by setting their detail level to at least DDetailed.
  3. We have it on our list to add a convenience field to the trace message, stating a reason for switching to a fork (should that happen). Would this make sense to provide at normal detail level, without knowing the underlying detailed criteria (like chain length, opcert no and VRF)?
  4. Point 3 will likely happen for the new tracing system only, as the old system will be maintenance-only starting with Node release 10.2 - and not receive additional features.

@amesgen
Member

amesgen commented Jan 7, 2025

BTW: needed to edit your post to change my GH handle to @mgmeier - the correct one

Thanks for catching that!

In the new system, you'd additionally need to configure the trace messages to be more verbose, by setting their detail level to at least DDetailed.

In the code, it looks like {new,old}TipSelectView is traced even with the default level of detail:

forMachine dtal (ChainDB.AddedToCurrentChain events selChangedInfo _base extended) =
  mconcat $
    [ "kind" .= String "AddedToCurrentChain"
    , "newtip" .= renderPointForDetails dtal (AF.headPoint extended)
    , "newTipSelectView" .= forMachine dtal (ChainDB.newTipSelectView selChangedInfo)
    ]
    ++ [ "oldTipSelectView" .= forMachine dtal oldTipSelectView
       | Just oldTipSelectView <- [ChainDB.oldTipSelectView selChangedInfo]
       ]

@mgmeier
Contributor

mgmeier commented Jan 7, 2025

In the code, it looks like {new,old}TipSelectView is traced even with the default level of detail

Apologies, yes, the info will be traced at the default detail level.
I was looking at a local branch where I was toying around with the settings :-|

@rphair

rphair commented Jan 7, 2025

In both the old and new tracing system, you should see this when you enable JSON output

@amesgen @mgmeier our pool has 4+ years of homebrewed automation driven from text-based output ("ViewMode": "SimpleView").

Does it follow from #6014 (comment) that AddedToCurrentChain and SwitchedToAFork are not visible from SimpleView, and that therefore we would have to change to JSON logging to see these events?

@mgmeier
Contributor

mgmeier commented Jan 7, 2025

Which tracing system are you using (i.e., what is your config value for UseTraceDispatcher)?

@rphair

rphair commented Jan 7, 2025

Whatever the default would be for the node: we don't have this setting in any of our config files.

@mgmeier
Contributor

mgmeier commented Jan 7, 2025

Then it is the old system, as currently it defaults to false.
As far as I can see, the info for {new,old}TipSelectView is indeed only available for JSON logging.

However, here's an idea for scraping that info without breaking existing automation. In the node config, you could set up an additional scribe to a file, with "scFormat": "ScJson". You would then need a script, possibly based on grep and jq, that monitors the trace messages (= JSON objects) going to that file. When it encounters a message indicating a switch to a fork, the script could inspect its data and send an event to the systemd journal, merging with the rest of the logging data in text-based format. If you configure log rotation or somesuch for that file scribe, the script would need to take that into account as well.
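To sketch that idea (file paths and the journal tag below are hypothetical, and the exact "kind" string should be checked against a live message): first, the additional file scribe in the old system's config,

"setupScribes": [
  { "scKind": "StdoutSK", "scName": "stdout", "scFormat": "ScText" },
  { "scKind": "FileSK", "scName": "/var/log/cardano/node.json", "scFormat": "ScJson" }
],
"defaultScribes": [
  [ "StdoutSK", "stdout" ],
  [ "FileSK", "/var/log/cardano/node.json" ]
]

and then a minimal watcher that forwards fork switches to the journal:

# Follow the JSON scribe (tail -F survives log rotation) and forward
# any fork-switch message to the systemd journal under its own tag.
tail -F /var/log/cardano/node.json \
  | jq -c --unbuffered 'select((.data.kind? // "") | test("SwitchedToAFork"))' \
  | systemd-cat -t cardano-fork-watch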

Would that work for you somehow?

@rphair

rphair commented Jan 7, 2025

Yes, thanks @mgmeier - to log these events we'll fork off another log as you suggest. 🙏 If whatever changes are contemplated here produce effects in SimpleView, I trust we will be able to find out about them here or in a linked thread.
