Overdue state doesn't honor set time periods #10082
Comments
ref/NC/773667 |
ref/IP/53721 |
I was wrong (#9984 (comment)). #10070 won't fix this one: checks re-scheduled due to time periods do happen and do get propagated to the DB backends (tested), even if active checks are disabled (according to the code). So my next check -> next update -> overdue? is always in the future, as it's already moved every 5m. |
So the checking node's Redis and its Icinga DB database get the overdue update. That's it. Checkable::OnNextCheckUpdated doesn't propagate itself over the cluster. @julianbrost I suggest changing the latter. |
@yhabteab, on the other hand, prefers just an additional flag in the existing SetNextCheck event to call Checkable::OnNextCheckUpdated. That would require touching all its callers. |
Why do you have to do something like that? Can't you just recreate the conditions in which cases |
You mean, to duplicate them across the code and require keeping them in sync? Anyway, what I meant: If you look for Checkable::OnNextCheckUpdated usages, most/all should also call SetNextCheck/UpdateNextCheck which already fires the SetNextCheck event. To include your flag you'd have to do something else, for each such caller. I, in contrast, suggest the more natural cluster event approach we already use: one signal – one cluster event. One triggers another, while respecting MessageOrigin. |
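The "one signal, one cluster event" idea can be sketched roughly as follows. This is a simplified model, not Icinga 2's actual classes: `Origin` (standing in for MessageOrigin), `NextCheckUpdatedSignal`, `Node` and the counters are all hypothetical names chosen for illustration. It only shows the principle that a locally raised signal is relayed to the peer once, while a signal that arrived from the peer updates the backend but is not echoed back.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for MessageOrigin: an empty name means "raised locally".
struct Origin {
    std::string fromNode;
    bool IsLocal() const { return fromNode.empty(); }
};

// Minimal signal: every connected handler is called with the origin.
struct NextCheckUpdatedSignal {
    std::vector<std::function<void(const Origin&)>> handlers;

    void Connect(std::function<void(const Origin&)> handler) {
        handlers.push_back(std::move(handler));
    }

    void operator()(const Origin& origin) const {
        for (const auto& handler : handlers) {
            handler(origin);
        }
    }
};

// Hypothetical cluster node with one peer (think: two HA endpoints).
struct Node {
    std::string name;
    Node* peer = nullptr;
    int backendUpdates = 0; // how often the imaginary overdue entry was refreshed
    int messagesSent = 0;   // how many cluster messages this node sent
    NextCheckUpdatedSignal OnNextCheckUpdated;

    explicit Node(std::string nodeName) : name(std::move(nodeName)) {
        // Backend handler: runs for local and remote origins alike.
        OnNextCheckUpdated.Connect([this](const Origin&) { ++backendUpdates; });

        // Cluster relay: forward only locally raised signals, so that a
        // message received from the peer is not sent back to it.
        OnNextCheckUpdated.Connect([this](const Origin& origin) {
            if (origin.IsLocal() && peer != nullptr) {
                ++messagesSent;
                peer->OnNextCheckUpdated(Origin{name});
            }
        });
    }

    // The handlers capture `this`, so the node must not be copied.
    Node(const Node&) = delete;
    Node& operator=(const Node&) = delete;
};
```

With two nodes wired to each other, raising `a.OnNextCheckUpdated(Origin{})` refreshes the backend entry on both nodes and sends exactly one message from `a` to `b`.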
It's not across the code, you'd just have to do it in one place ( |
@julianbrost came up with: Why should a SetNextCheck not always set icinga:nextupdate:*? SetNextCheck is called about 3x per actual check. But one icinga:nextupdate:* setting is completely enough. So, after the final SetNextCheck we say via OnNextCheckUpdated(): OK, dear backend, do your thing. Precisely speaking:
But the latter Checkable::OnNextCheckUpdated call wasn't cluster-synced either. |
My preferred solutions
|
Any ETA of when this can be fixed? |
The colleagues' preferred solution
Attach Redis overdue update to the existing NextCheck cluster event.
Thing to consider (aka why I vote against)
We already did exactly that, which resulted in this problem:
Hence, we had to update Redis overdue (next update) only OnNextCheckUpdated:
We SetNextCheck, gently speaking, more than once:
grep -rnFwe SetNextCheck -e UpdateNextCheck lib (01d3a1d)
Various
Don't matter IMAO, for different reasons. Non-code, method declarations, ...
End user
IMAO to be broadcasted everywhere, so already not problematic. However, that's likely already done somewhere, e.g. via Checkable::OnNextCheckUpdated.
Checkable::Start
IMAO not needed for the cluster as run on both HA nodes, nor in Icinga DB which already writes (enough stuff, including) next-updates reflecting the above SetNextCheck during startup. Icinga DB has a higher activation priority.
Our bool suppress_events should be ideally an enum:
enum class SuppressEvents : uint_fast8_t {
    None = 0, // default, replacement for false
    Cluster = 1,
    Backend = 2 | Cluster, // If sent over the cluster w/o any flags, the other side won't SuppressEvents::Backend, so Backend implies Cluster
    Local = 4 | Backend, // If even local handlers shall be suppressed... implies All
    All = Cluster | Backend | Local // replacement for true
};
@julianbrost Ok? The Checkable::Start case would use SuppressEvents::Backend.
Before check start, not to start it twice
Seems it can be suppressed completely, which is done here:
Reschedule on new check result
Obviously wanted everywhere. However, that's likely already done somewhere, e.g. via Checkable::OnNextCheckUpdated.
Active checks disabled
Just needed for the check scheduler itself, I think.
Time period/reachability says no
That's the issue here, this is what we want to be written in Icinga DB. But, again, watch out for existing Checkable::OnNextCheckUpdated.
Parent/child re-scheduling
Doesn't have to be sent in the cluster according to #10093 (comment), so SuppressEvents::Cluster; maybe even Icinga DB updates aren't needed here. (Already not performed.) |
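To make the "implies" relations of the proposed SuppressEvents enum concrete, here is a minimal sketch. The enum values are taken from the comment above; the `Suppresses` helper is a hypothetical addition for illustration and does not exist in Icinga 2.

```cpp
#include <cstdint>

// Enum as proposed in the comment above.
enum class SuppressEvents : std::uint_fast8_t {
    None = 0,                       // default, replacement for false
    Cluster = 1,
    Backend = 2 | Cluster,          // Backend implies Cluster
    Local = 4 | Backend,            // Local implies Backend and Cluster
    All = Cluster | Backend | Local // replacement for true
};

// Hypothetical helper: true if `value` suppresses everything that `scope`
// stands for, i.e. all bits of `scope` are set in `value`.
constexpr bool Suppresses(SuppressEvents value, SuppressEvents scope)
{
    return static_cast<std::uint_fast8_t>(scope) != 0
        && (static_cast<std::uint_fast8_t>(value) & static_cast<std::uint_fast8_t>(scope))
            == static_cast<std::uint_fast8_t>(scope);
}

// Backend implies Cluster, but not the other way around.
static_assert(Suppresses(SuppressEvents::Backend, SuppressEvents::Cluster), "Backend implies Cluster");
static_assert(!Suppresses(SuppressEvents::Cluster, SuppressEvents::Backend), "Cluster does not imply Backend");

// With these values, Local and All are the same bit pattern.
static_assert(SuppressEvents::Local == SuppressEvents::All, "Local implies All");
```

A subscriber could then ask, for example, `Suppresses(flags, SuppressEvents::Backend)` before writing to Icinga DB. Whether each subscriber checks the flags itself or separate signals exist per scope is left open here, as in the discussion.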
That suggestion just raises more questions for me. As a special case for this attribute? In general? Would you duplicate the signals for the different scopes? Would it stay a single signal and each subscriber would check if it should ignore it? And it feels like something is very wrong with next check if that was necessary to handle it properly. It may sound like a stupid question, but what does the value of next check even mean at the moment? Like one has an intuition what it should roughly mean, but there are multiple instances in the code that set it, that look more like workarounds than places where the intention is actually to reschedule a check.
Now the question: can we simplify this so that one check just performs one update of next check? That should then mean that if that value updates, it's also a relevant change for all consumers. Or do we really need to add complexity to assign different scopes for different updates of the same value?
|
First of all, I was wrong regarding Parent/child re-scheduling. Their SetNextCheck calls have to stay because the checkable in question could be in another zone. So, according to #10093 (comment), those SetNextCheck calls are performed twice (HA). But, I guess, this duplication can be solved like here:
Now, back to our sheep...
Yes, definitely.
Yes, I think so. Ex. for SuppressEvents::All, of course.
Absolutely. One workaround for the checker index by next-check, another timeout+30s to eventually re-try agent checks, ...
Well, there's the check start and the CR arrival.
|
I guess it's fair to say by now that this turned out to be more complex than we thought initially (the actual problem is also not what I initially expected). Reducing
|
event::SetNextCheck += flag (for Checkable::OnNextCheckUpdated)
A separate one isn't necessary. I'm also open to extending event::SetNextCheck with a flag (#10082 (comment)),
exactly because there is an overlap with that message. |
Describe the bug
We got services with checks configured to run only at certain times, e.g. each morning between 0600 and 0630. Outside of this timeframe, IcingaDB-Web starts showing those services as Overdue, even though it is expected behaviour for those checks not to return any results nor to be run at that moment.
To Reproduce
Overdue because IcingaDB (not Icinga2 itself, I think) expects it to be executed.
Expected behavior
IcingaDB should honor time periods and not just account for check intervals for inferring Overdue state.
Your Environment
Include as many relevant details about the environment you experienced the problem in.
icinga2 --version: v2.14.0
icingadb --version: v1.1.1
php --version: v7.4.33
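The expected behavior from the report can be illustrated with a small sketch that contrasts interval-only overdue inference with one that also respects a daily check window. This is not Icinga DB's implementation: `DailyPeriod`, `NextTimeInPeriod`, `IsOverdueNaive` and `IsOverduePeriodAware` are hypothetical names. Times are plain non-negative seconds counted from some midnight, and time zones, DST and any grace time are ignored.

```cpp
#include <cstdint>

// Seconds per day; all timestamps are non-negative seconds since some midnight.
constexpr std::int64_t kDay = 24 * 60 * 60;

// Hypothetical daily check window [start, end), in seconds after midnight.
struct DailyPeriod {
    std::int64_t start;
    std::int64_t end;
};

// Earliest point in time >= candidate that lies inside the daily window.
inline std::int64_t NextTimeInPeriod(std::int64_t candidate, const DailyPeriod& period)
{
    const std::int64_t day = candidate / kDay;
    const std::int64_t secondOfDay = candidate % kDay;

    if (secondOfDay < period.start) {
        return day * kDay + period.start; // window opens later today
    }
    if (secondOfDay < period.end) {
        return candidate;                 // already inside the window
    }
    return (day + 1) * kDay + period.start; // window opens tomorrow
}

// Interval-only inference: overdue as soon as one interval has passed.
inline bool IsOverdueNaive(std::int64_t now, std::int64_t lastCheck, std::int64_t interval)
{
    return now > lastCheck + interval;
}

// Period-aware inference: the next result is only expected once the
// check window is open again.
inline bool IsOverduePeriodAware(std::int64_t now, std::int64_t lastCheck,
                                 std::int64_t interval, const DailyPeriod& period)
{
    return now > NextTimeInPeriod(lastCheck + interval, period);
}
```

For a window of 06:00 to 06:30, a 5 minute interval and a last check at 06:25, the interval-only variant reports the service as overdue at noon, while the period-aware variant expects the next result only at 06:00 the following day.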