
[FR] - Log reason when node switches tip to a fork #6014

Open
TerminadaPool opened this issue Oct 15, 2024 · 17 comments
Labels
needs triage · type: enhancement

Comments

@TerminadaPool

Internal/External
External: Stake pool operator

Area
Other: Logging

Describe the feature you'd like
Please log the reason why the node preferred another fork.

Describe alternatives you've considered
The independent tool cncli logs block data, including the block VRF value, into a SQLite database. This database can be queried for insight into whether the node likely preferred another fork because that fork's terminal block had a lower block VRF and the tie-break rule was applied.

Additional context / screenshots
The node knows the reason for switching tip to a competing fork, but this reason isn't logged; the logs only confirm that the tip was switched. Was the reason that the competing fork was longer? Was it that the other fork's terminal block had a lower VRF, causing the tie-break rule to be applied? Or did the node prefer the other fork for some other reason?

This information would be really helpful when trying to identify why your own block was not adopted.

@TerminadaPool added the needs triage and type: enhancement labels Oct 15, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

@github-actions github-actions bot added the Stale label Nov 15, 2024
@TerminadaPool
Author

Please keep this open as a feature request.

@rphair

rphair commented Nov 15, 2024

Yes, please: if this is considered impractical to add, I would like to hear why. Our SPO operation believes this is important for quality-assurance purposes.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

@github-actions github-actions bot added the Stale label Dec 16, 2024
@rphair

rphair commented Dec 16, 2024

@carbolymer what kind of engagement do the developers expect for this issue not to be Stale? There won't be anything new to report until someone in the relevant team finally responds.

This log info has long been requested for diagnostic purposes, and the need for it is never going to go away. Those who need it most are small stake pools with infrequent blocks... and it will be needed even more in the near future if & when the k PoS parameter increases (as is constantly suggested in governance conversations).

If this issue keeps being dismissed, the implication is that QA isn't very important for stake pools and/or that chain density isn't that important for Cardano itself. I don't think either of these things is true, and I can state from SPO experience dating back to the beginning of the Shelley period that implementing this feature (which we arguably should have had from day 1) would be helpful for SPOs and node infrastructure operators at all levels.

@github-actions github-actions bot removed the Stale label Dec 17, 2024
@carbolymer
Contributor

There was a Slack discussion about this issue some time ago.
@jasagredo @amesgen would you mind sharing the main takeaways here?

@amesgen
Member

amesgen commented Jan 6, 2025

The trace messages already contain enough detail to deduce the reason why a node changes its chain. Concretely, whenever we change our selection, we trace the so-called "SelectView" of the tip of both the old and the new chain. This SelectView contains exactly the information needed to determine whether one chain is preferable to another. See PraosChainSelectView for more details and assorted documentation.

In Conway, when deciding whether to switch from the current chain to a candidate chain, the chain order works as a lexicographical combination of the following criteria:

  1. compare chain length, preferring larger values
  2. when (same slot and same issuer) compare operational certificate issue number, preferring larger values
  3. when (slots differ by at most 5) compare VRF, preferring smaller values
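To make the combined order concrete, here is a minimal Haskell sketch of it. This is not the actual ouroboros-consensus code (see PraosChainSelectView for that); the record and field names below simply mirror the SelectView JSON fields shown in the example further down.

import Data.ByteString (ByteString)
import Data.Word (Word64)

-- Field names mirror the PraosChainSelectView JSON in the example below.
data SelectView = SelectView
  { chainLength :: Word64
  , slotNo      :: Word64
  , issuerHash  :: ByteString
  , issueNo     :: Word64
  , tieBreakVRF :: ByteString
  }

-- GT means the candidate chain is preferred over the current one;
-- EQ means we stick with the current chain.
preferCandidate :: SelectView -> SelectView -> Ordering
preferCandidate cur cand =
     compare (chainLength cand) (chainLength cur)   -- rule 1: longer chain wins
  <> opCertRule
  <> vrfRule
  where
    -- rule 2: only applies when slot and issuer coincide;
    -- then the higher opcert issue number wins
    opCertRule
      | slotNo cur == slotNo cand && issuerHash cur == issuerHash cand
      = compare (issueNo cand) (issueNo cur)
      | otherwise = EQ
    -- rule 3: only applies when the slots are at most 5 apart;
    -- then the *lower* VRF wins, hence the flipped comparison
    vrfRule
      | slotDistance <= 5 = compare (tieBreakVRF cur) (tieBreakVRF cand)
      | otherwise         = EQ
    slotDistance = max (slotNo cur) (slotNo cand) - min (slotNo cur) (slotNo cand)

The (<>) on Ordering gives exactly the lexicographical combination: a later rule is only consulted when all earlier rules return EQ.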

By comparing the old and new SelectView, one can see which of these three rules was responsible for the chain switch:


Let's look at an example:

{
  "at": "2024-12-13T13:50:58.295305594Z",
  "ns": "ChainDB.AddBlockEvent.AddedToCurrentChain",
  "data": {
    "kind": "AddedToCurrentChain",
    "newTipSelectView": {
      "chainLength": 11214658,
      "issueNo": 16,
      "issuerHash": "8e0bb9b126acd8e65a023602377d9a7a6f15af394d1f7bdd852182b8",
      "kind": "PraosChainSelectView",
      "slotNo": 142531567,
      "tieBreakVRF": "f8f31aabb555124392485cbd7ba02bded4e4eac8980fc36fde0c9a21d7cf2d51c6fd4ddda8113153639db862781dafacbdd77da4e072452a330c32509506762e"
    },
    "newtip": "17d5b0f21e86503b41851640e9558610ebe9b7a62c96a08873eb4d0c7961a460@142531567",
    "oldTipSelectView": {
      "chainLength": 11214657,
      "issueNo": 20,
      "issuerHash": "c63dab6d780a74cbae2a27696c9723f55b3092b2bd001256df03827f",
      "kind": "PraosChainSelectView",
      "slotNo": 142531554,
      "tieBreakVRF": "905ff37db665e86d60d93dabc92ec86124b1a22165e4968facca226e790d9b7dd3f60502fa84dcd30d95ac28a9278e5e72b8b2eb451759ab59c92fc65909a2b4"
    }
  },
  "sev": "Notice",
  "thread": "76",
  "host": "ramify"
}

Here, we can see that the chainLength of newTipSelectView is larger than that of oldTipSelectView, so the chain switch happened due to rule 1.

If these were equal, one would look at rule 2 next (and then rule 3) to see which is responsible.
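For example, assuming the JSON trace output is collected one object per line in a file (node.json here is a hypothetical name), a jq filter along these lines extracts just the fields needed for that comparison:

jq -c 'select(.data.kind == "AddedToCurrentChain" or .data.kind == "SwitchedToAFork")
       | { at,
           oldLen: .data.oldTipSelectView.chainLength,
           newLen: .data.newTipSelectView.chainLength,
           oldVRF: .data.oldTipSelectView.tieBreakVRF,
           newVRF: .data.newTipSelectView.tieBreakVRF }' node.json

(The kind string "SwitchedToAFork" is assumed by analogy with "AddedToCurrentChain"; check a live message to confirm.)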


For convenience/simplicity, we could further enrich the trace message to directly include, e.g., a reason field that says something like "LongerChain" or "BetterTieBreaker".

cc @mgmeier as you might have already started with this.

@TerminadaPool
Author

That sort of log information would absolutely suit my needs. I wasn't aware we could get the "tieBreakVRF" values logged ordinarily.

What do I need to configure in my node config.json file so that such information is logged in my systemd journal?

@amesgen
Member

amesgen commented Jan 7, 2025

In both the old and the new tracing system, you should see this when you enable JSON output and use a severity of at least Notice for ChainDB.AddBlockEvent.AddedToCurrentChain / ChainDB.AddBlockEvent.SwitchedToAFork.
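For the new tracing system, a config.json fragment along these lines should be enough (the exact TraceOptions schema can vary between node versions, so treat this as a sketch to check against your node's config documentation):

{
  "UseTraceDispatcher": true,
  "TraceOptions": {
    "": { "severity": "Notice", "detail": "DNormal" },
    "ChainDB.AddBlockEvent.AddedToCurrentChain": { "severity": "Notice" },
    "ChainDB.AddBlockEvent.SwitchedToAFork": { "severity": "Notice" }
  }
}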

@mgmeier
Contributor

mgmeier commented Jan 7, 2025

Thanks @amesgen for summarizing the discussion we had (BTW: I needed to edit your post to change my GH handle to @mgmeier - the correct one). I'll add some more details:

  1. As mentioned, the above trace messages / tracers need to be enabled, and severity filtering should be configured to include messages of severity Notice.
  2. In the new system, you'd additionally need to configure the trace messages to be more verbose, by setting their detail level to at least DDetailed.
  3. We have it on our list to add a convenience field to the trace message, stating a reason for switching to a fork (should that happen). Would this make sense to provide at normal detail level, without knowing the underlying detailed criteria (like chain length, opcert no and VRF)?
  4. Point 3 will likely happen for the new tracing system only, as the old system will be maintenance-only starting with Node release 10.2 - and not receive additional features.

@amesgen
Member

amesgen commented Jan 7, 2025

BTW: needed to edit your post to change my GH handle to @mgmeier - the correct one

Thanks for catching that!

In the new system, you'd additionally need to configure the trace messages to be more verbose, by setting their detail level to at least DDetailed.

In the code, it looks like {new,old}TipSelectView is traced even with the default level of detail:

forMachine dtal (ChainDB.AddedToCurrentChain events selChangedInfo _base extended) =
  mconcat $
    [ "kind" .= String "AddedToCurrentChain"
    , "newtip" .= renderPointForDetails dtal (AF.headPoint extended)
    , "newTipSelectView" .= forMachine dtal (ChainDB.newTipSelectView selChangedInfo)
    ]
    ++ [ "oldTipSelectView" .= forMachine dtal oldTipSelectView
       | Just oldTipSelectView <- [ChainDB.oldTipSelectView selChangedInfo]
       ]

@mgmeier
Contributor

mgmeier commented Jan 7, 2025

In the code, it looks like {new,old}TipSelectView is traced even with the default level of detail

Apologies, yes, the info will be traced at the default detail level.
I was looking at a local branch where I was toying around with the settings :-|

@rphair

rphair commented Jan 7, 2025

In both the old and new tracing system, you should see this when you enable JSON output

@amesgen @mgmeier our pool has 4+ years of homebrewed automation driven from text-based output ("ViewMode": "SimpleView").

Does it follow from #6014 (comment) that AddedToCurrentChain and SwitchedToAFork are not visible from SimpleView, and that therefore we would have to change to JSON logging to see these events?

@mgmeier
Contributor

mgmeier commented Jan 7, 2025

Which tracing system are you using (i.e., what is your config value for UseTraceDispatcher)?

@rphair

rphair commented Jan 7, 2025

Whatever the default would be for the node: we don't have this setting in any of our config files.

@mgmeier
Contributor

mgmeier commented Jan 7, 2025

Then it is the old system, as currently it defaults to false.
As far as I can see, the info for {new,old}TipSelectView is indeed only available for JSON logging.

However, here's an idea for scraping that info without breaking existing automation. In the node config, you could set up an additional scribe to a file, with "scFormat": "ScJson". You would then need a script, possibly based on grep and jq, that monitors the trace messages (= JSON objects) going to that file. When it encounters a message indicating a switch to a fork, the script could inspect its data and send an event to the systemd journal, merging with the rest of the logging data in text-based format. If you configure log rotation or somesuch for that file scribe, the script would need to take that into account as well.
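To sketch that idea (file paths and the journal tag below are hypothetical, and the exact "kind" string should be checked against a live message): first, the additional file scribe in the old system's config,

"setupScribes": [
  { "scKind": "StdoutSK", "scName": "stdout", "scFormat": "ScText" },
  { "scKind": "FileSK", "scName": "/var/log/cardano/node.json", "scFormat": "ScJson" }
],
"defaultScribes": [
  [ "StdoutSK", "stdout" ],
  [ "FileSK", "/var/log/cardano/node.json" ]
]

and then a minimal watcher that forwards fork switches to the journal:

# Follow the JSON scribe (tail -F survives log rotation) and forward
# any fork-switch message to the systemd journal under its own tag.
tail -F /var/log/cardano/node.json \
  | jq -c --unbuffered 'select((.data.kind? // "") | test("SwitchedToAFork"))' \
  | systemd-cat -t cardano-fork-watch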

Would that work for you somehow?

@rphair

rphair commented Jan 7, 2025

Yes, thanks @mgmeier - to log these events we'll fork off another log as you suggest. 🙏 If whatever changes are contemplated here produce effects in SimpleView, I trust we will be able to find out about them here or in a linked thread.
