Include healthcheck logic for helper scripts running as sidecars #1842
This is the part I was mentioning earlier about "fail early". The reason is that we have compounded retries occurring, which reduces observability. I want to apologize (profusely) in advance: this comment is going to be pretty verbose in order to break down what I mean by compounding retries and reduced observability.
First, I can definitely see where my earlier statements about using the 60-second query tip and relying on the 3 "retries" were confusing. What I was referring to is this portion of the current healthcheck node query tip:
guild-operators/files/docker/node/addons/healthcheck.sh, lines 13 to 22 (at 349dffd)
It sleeps 60 seconds between the first block check and the second, and has zero retries. It relies only on the `--healthcheck-retries` of the container configuration. This results in a minimum of 15 minutes and a maximum of 24 minutes before the container actually shows an unhealthy status. It does not compound retries, which would make the status change from healthy -> unhealthy take even longer.
I'll clarify what I mean by compounding retries:
The container's own HEALTHCHECK defined in the Dockerfile automatically sets the container to use `--health-start-period 5m --healthcheck-interval 5m`; without being defined it also gets `--healthcheck-retries 3` from the default number of retries. Here is the container inspect output with time in nanoseconds. The check for ptsendtip does a loop with `HEALTHCHECK_RETRIES` (default 20) and `log_entry_timeout` (set to 60); if it fails we wait `HEALTHCHECK_RETRY_WAIT` (default 3) and start again. That is a minimum of 15 and a maximum of almost 27 minutes (including runtime).
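To make the compounding concrete, here is a rough sketch of the loop shape being described (the log path and success pattern are placeholders I've made up for illustration; this is not the actual `check_cncli_send_tip()` implementation):

```bash
# Sketch only - illustrates the retry shape, not the real check.
HEALTHCHECK_RETRIES=${HEALTHCHECK_RETRIES:-20}
HEALTHCHECK_RETRY_WAIT=${HEALTHCHECK_RETRY_WAIT:-3}
log_entry_timeout=60

check_cncli_send_tip_sketch() {
  local attempt elapsed
  for ((attempt = 1; attempt <= HEALTHCHECK_RETRIES; attempt++)); do
    elapsed=0
    while ((elapsed < log_entry_timeout)); do
      # Placeholder predicate: "has ptsendtip logged a successfully sent tip?"
      grep -q "${success_pattern}" "${log_file}" 2>/dev/null && return 0
      sleep 5; elapsed=$((elapsed + 5))
    done
    sleep "${HEALTHCHECK_RETRY_WAIT}"
  done
  return 1
}
# Worst case this loop alone runs ~20 * (60 + 3) seconds, roughly 21 minutes, before it
# ever returns 1. The container's own --healthcheck-interval and --healthcheck-retries
# then multiply on top of that, which is the compounding described above.
```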
I see multiple options as a solution:

1. Drop the retry logic from `check_cncli_send_tip()`. Let it run once for 60 seconds and exit 1 if it doesn't see a block. This makes it similar to the query tip in its `timeout` and to all other checks in regards to no retries (except the KOIOS version of check_node). A rough sketch of this is included at the end of this comment.
2. Keep the retry logic in `check_cncli_send_tip()` and set HEALTHCHECK_RETRIES to default to 0.
3. Add retry logic to every check function in the healthcheck script and set HEALTHCHECK_RETRIES to default to 0.
Each option results in a 15-21 minute window before knowing the container is unhealthy when the container configuration for healthcheck intervals & retries and all script variables are left at current/default values.
I'm torn between 3 and 1 as my favorite option. Improving observability even further requires either the operator to adjust the startup, interval, timeout and retries when creating the container OR for us to come to a consensus about adjusting the Dockerfile HEALTHCHECK statement when building the container image.
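For reference, a minimal sketch of what option 1 could look like (again with placeholder log path and pattern; not the actual implementation):

```bash
# Option 1 sketch: single 60-second window, no in-script retries.
check_cncli_send_tip() {
  local timeout=60 waited=0
  while ((waited < timeout)); do
    # Placeholder predicate for "a tip was sent recently".
    grep -q "${success_pattern}" "${log_file}" 2>/dev/null && return 0
    sleep 5; waited=$((waited + 5))
  done
  return 1  # leave repetition to --healthcheck-retries and the HEALTHCHECK interval
}
```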
Great, thank you for the thorough explanation.
My gut feeling is that this would be the preferred approach, minimising the code to add/maintain.
To test how often it times out, I'm in the process of running the function 100 times with varying timeout values...
(Results TBD)
After removing retries from the function, if it times out, it then must wait another 5 minutes (`--healthcheck-interval 5m`) before retrying again... I propose to change the timeout within the function to something larger, maybe 90s-120s (depending on the results of my testing). E.g. timeout=90 (rather than timeout=60) increases the container time to failure by 30s while decreasing the container time to success by a relatively larger 4.5 minutes (by avoiding unnecessary timeouts and retries).
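The arithmetic behind that trade-off, spelled out under the assumption of a 5-minute interval and no in-function retries (illustrative only):

```bash
interval=300       # --healthcheck-interval 5m, in seconds
old_timeout=60
new_timeout=90

# Node genuinely stuck: each failing attempt simply runs 30s longer before reporting.
echo "extra time to report a real failure, per attempt: $((new_timeout - old_timeout))s"

# Tip arrives between 60s and 90s: timeout=60 misses it and waits a full interval for
# the next (successful) check, while timeout=90 catches it on the first attempt.
echo "time saved reporting healthy: ~$((interval - (new_timeout - old_timeout)))s (~4.5 min)"
```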
What would be the maximum time you would expect between tips?
Anecdotally I have seen Tip (diff) reach 60 seconds quite often, and occasionally a bit higher at 100-120 seconds. The KOIOS addition was because the original query tip had, twice over a few months, hit 60 seconds on repeated retries and marked the node unhealthy, causing some investigations. After adding KOIOS, or even disabling KOIOS and using the original query tip, I don't remember seeing unhealthy on the node in the last year.
In rare, extreme events, it has exceeded 5 minutes. Node 1.35, or 1.34, had one event long ago that exceeded 5 minutes. Not sure if this linked issue/comment showing an 8 minute delay between blocks is the one, or something else, as I didn't read the whole thread. The time frame seems like it could be the one I'm thinking of, and I don't recall at the moment another lasting longer than 5.
IntersectMBO/cardano-node#4421 (comment)
Again, these are rare "extreme" events, and with the current configuration the healthcheck in theory wouldn't even surface them anyway when the event lasts less than `--healthcheck-retries` multiplied by the interval. In theory we would only see this in Grafana/Prometheus, and the container runtime would still show a healthy container even with timeout=60.
Here are the results:
timeout=60 - Result: 9% timeout frequency (9-fail/91-success).
timeout=90 - Result: 4% timeout frequency (4-fail/96-success).
timeout=120 - Result: 2% timeout frequency (2-fail/98-success).
In this limited testing sample, it has already failed successively (twice in a row) with the lower timeout of 60s... So I feel using a higher timeout would mean less chance of failing three times in a row and causing unnecessary investigations.
...How would you feel about making timeout=120?
Do you happen to have details about consecutive failures/timeouts? For example, based on the 300 second interval, if we never observe consecutive failures over 100 attempts, then it still results in 0 status changes to unhealthy.
To bump the timeout even higher I'd probably want to see 2, possibly 3, consecutive timeouts reached based on the interval being tested. If it reaches 50-66% of the consecutive failures required to mark the node unhealthy then I would definitely consider bumping it up.
However, if the results were similar to KOIOS, where there was never more than 1 failure and 0 consecutive failures, I would prefer leaving the internal timeout for the check at 60. Raising it to 120 would also require an increase of the Dockerfile HEALTHCHECK timeout setting (aka `--healthcheck-timeout`), currently at 100.
If you don't have the details on how many consecutive failures were observed, this gist contains the POC I did to gather the details for KOIOS. I won't get time to set up a ptsendtip until sometime this weekend, which would push my ability to run this test out to Sunday or sometime next week.
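Not the POC from that gist, but a rough sketch of how consecutive failures could be tracked while repeating a check on an interval (check command, attempt count and interval are placeholders):

```bash
#!/usr/bin/env bash
# Sketch: run a check repeatedly and record the longest consecutive-failure streak.
attempts=100 interval=300 check_cmd="./healthcheck.sh"

fails=0 streak=0 max_streak=0
for ((i = 1; i <= attempts; i++)); do
  if ${check_cmd}; then
    streak=0
  else
    fails=$((fails + 1)); streak=$((streak + 1))
    ((streak > max_streak)) && max_streak=${streak}
  fi
  sleep "${interval}"
done
echo "failures: ${fails}/${attempts}, longest consecutive-failure streak: ${max_streak}"
```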
These tests had no interval... I didn't feel adding an interval would affect the results in any meaningful way.
timeout=60 - Result: 9% timeout frequency (9-fail/91-success).
timeout=90 - Result: 4% timeout frequency (4-fail/96-success).
timeout=120 - Result: 2% timeout frequency (2-fail/98-success).
I'm in the process of re-running it with timeout=60, interval=300s... I'll keep you posted.
Oh yes, thanks for pointing that out. Let's see how the re-test goes.
This looks great. Nice use of sqlite!
Here's my test, it's not as elegant but it gets the job done:
https://gist.github.com/adamsthws/dce0263bea2302047660b9c8ea458cbd
For docker/podman, there aren't currently instructions - we possibly need to attach a sample reference compose file in which PGPASSFILE can be added (or add instructions for making it part of the priv folder).
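Purely to illustrate what such instructions might cover (the image reference, paths and mount points below are assumptions for the sketch, not the project's documented layout), the PGPASSFILE wiring for a containerised dbsync could look roughly like this:

```bash
# Keep the pgpass file in the priv folder on the host and mount it into the container.
echo "/var/run/postgresql:5432:cexplorer:*:*" > "${CNODE_HOME}/priv/.pgpass"
chmod 0600 "${CNODE_HOME}/priv/.pgpass"

docker run -d --name dbsync \
  -e PGPASSFILE=/opt/cardano/cnode/priv/.pgpass \
  -v "${CNODE_HOME}/priv":/opt/cardano/cnode/priv \
  -v "${CNODE_HOME}/sockets":/opt/cardano/cnode/sockets \
  example/dbsync:latest   # placeholder image reference
```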
Is the current Guild Ops DBSync documentation which mentions this considered insufficient in this case? It documents it at the top of the page, and lower down shows examples of using the priv folder as the path.
Also, this healthcheck won't impose any new requirements for using `dbsync.sh` in a container that don't already exist when using it on bare metal. Line 120 comes from line 42 of the `set_defaults()` function:
guild-operators/scripts/cnode-helper-scripts/dbsync.sh, lines 38 to 47 (at 349dffd)
Lines 121-122 come from lines 52-53 of the `check_defaults()` function:
guild-operators/scripts/cnode-helper-scripts/dbsync.sh, lines 49 to 60 (at 349dffd)
A compose file example is definitely an option. However, I find it can also be a crutch by allowing users to skip reading documentation. If you consider it a "must have" then I think it should be part of another PR and have its own separate testing and review.
I look at (at least a big % of) docker users as spoon-fed users - not intended in any demeaning way, but their primary reason for using docker [besides virtual env isolation] would be simplicity. Since we're not talking about slightly more experienced/advanced users already taking the next step and deploying in professional environments, their expectation will be not having to go through documentation pages beyond the docker reference ones.
If someone running a manual setup is moving to docker, I don't expect them to have as many queries. It does kinda beg the question of whether docker users would even read the pre-reqs on the home page - in my experience of reading GitHub issues on IO/IntersectMBO repositories, a very large section does not.
Yes - I agree. That - if added - indeed does not need to be part of this PR.
Talking about dbsync in particular, I would think this would likely be a separate docker image altogether (which is why I mentioned earlier there is some work to be done).
In theory, the same can be done for smaller components, but there is little value as their setups are mostly the same. For instance, for something like ogmios we don't need a custom image; the official image is perfectly usable.
Once the above is available, a [docker|podman]-compose file will essentially just give a reference on sharing volumes between the pods|containers.
Sounds good. After this and some other documentation issues/PRs are resolved, I'll open one and start a compose example.
Nice. So far I've been running the upstream db-sync image, as well as ogmios, and guild ops for node, cncli, and mithril. I'm hereby offering to assist with a new container if any help is needed.
Yep. The socket, configs (files) and just about every other subdir of CNODE_HOME lend themselves to volume usage. Not scripts, though. The excessive work of trying to make scripts a shared volume wasn't worth the time I initially spent getting it working. I've found a container environment file with all vars, plus a completely stock scripts directory and env file inside the container, to be the simplest way to run separate containers per process with shared vars.
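As a (hypothetical) illustration of that pattern - one shared env file on the host, stock scripts inside each container - with file names, variables and image references assumed for the sketch:

```bash
# container.env - shared variables passed to every sidecar container (illustrative values)
cat > container.env <<'EOF'
CNODE_HOME=/opt/cardano/cnode
CNODE_PORT=6000
HEALTHCHECK_RETRIES=0
EOF

# One container per process, same env file, shared volume for the node socket.
docker run -d --name cnode --env-file container.env \
  -v cnode-sockets:/opt/cardano/cnode/sockets example/cardano-node:latest
docker run -d --name cncli --env-file container.env \
  -v cnode-sockets:/opt/cardano/cnode/sockets example/cncli:latest
```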