-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dynamic Notification Policy Routes #1800
base: master
Are you sure you want to change the base?
Dynamic Notification Policy Routes #1800
Conversation
Adapted PROJECT to new go module version path and ran: ``` ./bin/operator-sdk create api --group grafana --version v1beta1 --kind GrafanaNotificationPolicyRoute --controller false ```
- during the reconcile loop in notificationpolicy_controller.go, we have to fetch all matching GrafanaNotificationPolicyRoutes for the currently reconciled GrafanaNotificationPolicy - this can be very easily achieved with a routeSelector, which will be a Kubernetes LabelSelector - if we would go with instanceSelector, we would have to fetch all available GrafanaNotificationPolicyRoutes and then do some filtering afterwards, to see if the instanceSelector matches, which would be both more inefficient and more complex
The GrafanaNotificationPolicy Controller now watches GrafanaNoticationPolicyRoutes instead of using ownerReferences, as ownerReferences do not support cross-namespace references. We now also emit a event on the GrafanaNotificationPolicyRoute to indicate that it has been merged into a specific policy.
This PR hasn't been updated for a while, marking as stale |
I have updated the draft PR with the latest discussions from #1789:
There are no checks for Another idea would be to solve this by implementing a ValidationWebhook. @theSuess do you have any thoughts on this? Results of updates SamplesAssembled ❮ kubectl get grafananotificationpolicy grafananotificationpolicy-sample -o jsonpath="{.status}" |jq {
"conditions": [
{
"lastTransitionTime": "2025-01-21T09:48:19Z",
"message": "Notification Policy was successfully applied to 1 instances",
"observedGeneration": 1,
"reason": "ApplySuccessful",
"status": "True",
"type": "NotificationPolicySynchronized"
}
],
"discoveredRoutes": [
"default/dynamic-e",
"grafana-crds/dynamic-c",
"default/dynamic-d"
]
} ❯ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
20m Normal Merged grafananotificationpolicyroute/dynamic-d Route merged into NotificationPolicy default/grafananotificationpolicy-sample
20m Normal Merged grafananotificationpolicyroute/dynamic-d Route merged into NotificationPolicy default/grafananotificationpolicy-sample
20m Normal Merged grafananotificationpolicyroute/dynamic-e Route merged into NotificationPolicy default/grafananotificationpolicy-sample
20m Normal Merged grafananotificationpolicyroute/dynamic-e Route merged into NotificationPolicy default/grafananotificationpolicy-sample |
3da8a88
to
95a1fd8
Compare
I rebased on the latest master branch. Additionally I looked into the ValidationWebhook for ensuring that While the implementation of the logic is quite straight-forward, setting up webhooks is a bit more complex, especially bringing this to the Helm chart. I can certainly do this if you think this is the way forward, we can find another way to do the validation, or maybe skip this validation for now. Any thoughts? |
Sorry for the deleted comment, missed the part about api level validation rules in your previous note. We had some discussions around Validation Webhooks before and came to the conclusion that these are too finicky to get right. If the validation rules don't work, I'd just go for a status condition in the respective object when both values are set which tells the user that the dynamic matching takes priority and the |
Unfortunately, as far as I can see, the validation expressions will not work due to the Alternatively, I will go with the hint in the status condition as you suggested 👍 |
The reason why I added the schemaless validation is to support recursive objects. If there is a cleaner way, I'm all for it! |
- adds validation for ensuring routes and routeSelector are mutual exclusive - updates both GrafanaNotificationPolicy and GrafanaNotificationPolicyRoute status conditions accordingly
I updated the PR and implemented the validation by updating the status condition on both
|
} | ||
assembledNotificationPolicy, mergedRoutes, err = assembleNotificationPolicyRoutes(ctx, r.Client, namespace, assembledNotificationPolicy) | ||
r.Log.Info("assembled notification policy routes", "mergedRoutes", mergedRoutes) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should we handle errors here e.g. due to a loop in matched routes? Is it fine as is, by simply returning with an error? Alternatively we could also update the status condition similar to applyErrors := make(map[string]string)
below and stop reconciling.
Looking at the code below, it looks like the status condition is never updated when errors occurred, not sure if I am misreading the code or this is on purpose:
if len(applyErrors) > 0 {
return ctrl.Result{}, fmt.Errorf("failed to apply to all instances: %v", applyErrors)
}
// ...
condition := buildSynchronizedCondition("Notification Policy", conditionNotificationPolicySynchronized, notificationPolicy.Generation, applyErrors, len(instances))
meta.SetStatusCondition(¬ificationPolicy.Status.Conditions, condition)
Looks like this has been adapted here: #1815
Any thoughts @theSuess ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning an error doesn't make sense here, as it would cause the operator to retry the same resource. Assuming nothing changed, it'll just fail again. My preferred outcome is to note the error in the conditions.
Status conditions are applied through the defer
function here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something, but the errors are never set on the condition, because the controller always returns when there are errors and never reaches the code to set the error condition:
see:
if len(applyErrors) > 0 { |
if len(applyErrors) > 0 {
return ctrl.Result{}, fmt.Errorf("failed to apply to all instances: %v", applyErrors)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some modifications here, hopefully that resolves the issue: f924763
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msvechla Good catch it looks like I caused a regression preventing bad synchronization results from being added to the status..
I will fix that ASAP everywhere else with some regression tests.
If you revert these lines it should fix it and you can then undo the changes to the defer
and move applyErrors
and such back to the instances apply loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very good! I've tested it locally and it works as expected. I left some minor comments, but nothing that should require large scale refactoring.
Thanks for also taking care of the documentation right away 💯
if notificationPolicy.Spec.Route.IsRouteSelectorMutuallyExclusive() { | ||
meta.RemoveStatusCondition(¬ificationPolicy.Status.Conditions, conditionRoutesIgnoredDueToRouteSelector) | ||
} else { | ||
setRoutesIgnoredDueToRouteSelectorCondition(¬ificationPolicy.Status.Conditions, notificationPolicy.Generation) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything preventing this check from occurring before applying the notification policy? I'd like to avoid half-applied notification policies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We said that in case both are specified, we simply overwrite what is specified in Routes
, so I think it's not an issue if we update the condition at the end. There should be no half-applied notification policies as far as I can see.
Co-authored-by: Dominik Süß <[email protected]>
Co-authored-by: Dominik Süß <[email protected]>
related GrafanaNotificationPolicies can now no longer easily retrieved by comparing labels, as routes can be referenced transitively. Therefore we simply sync all related policies now.
733f358
to
942a0b5
Compare
Co-authored-by: Dominik Süß <[email protected]>
From a user perspective, I would find an error to be more easy to deal with than partially ignoring parts of the spec I set. From that point, would it make sense to instead set the This is not perfect as some routes might be missing, but I would see it as less confusing overall. It would mean we can return earlier before attempting the apply as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good and I learned quite a few things going through this. 😄
// IsCrossNamespaceImportAllowed returns true when cross namespace imports are allowed | ||
func (np *GrafanaNotificationPolicy) IsCrossNamespaceImportAllowed() bool { | ||
return np.Spec.AllowCrossNamespaceImport | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate of AllowCrossNamespace
~20 lines down
assembledNotificationPolicy, mergedRoutes, err = assembleNotificationPolicyRoutes(ctx, r.Client, namespace, assembledNotificationPolicy) | ||
if err != nil { | ||
applyErrors[globalApplyError] = err.Error() | ||
return ctrl.Result{Requeue: false}, fmt.Errorf("failed to assemble GrafanaNotificationPolicy using routeSelectors: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requeue
and requeueDelay
are ignored when you return an Error
The manager will retry the reconcile on it's own with an exponential backoff delay, we rely on this in most other reconcilers.
return ctrl.Result{Requeue: false}, fmt.Errorf("failed to assemble GrafanaNotificationPolicy using routeSelectors: %w", err) | |
return ctrl.Result{}, fmt.Errorf("failed to assemble GrafanaNotificationPolicy using routeSelectors: %w", err) |
func hasRouteSelector(route *grafanav1beta1.Route) bool { | ||
if route == nil { | ||
return false | ||
} | ||
|
||
if route.RouteSelector != nil { | ||
return true | ||
} | ||
|
||
for _, nestedRoute := range route.Routes { | ||
if hasRouteSelector(nestedRoute) { | ||
return true | ||
} | ||
} | ||
|
||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a struct function on the Route
object in api/v1beta1/grafananotificationpolicy_types.go
as it's not relying on anything specific to the reconciler (ctx, log, client).
var namespace *string | ||
if !notificationPolicy.IsCrossNamespaceImportAllowed() { | ||
ns := notificationPolicy.GetObjectMeta().GetNamespace() | ||
namespace = &ns | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could just as well be moved into assembleNotificationPolicyRoutes
as it's not used elsewhere after this.
Then change the function arguments to take the notificationPolicy object itself.
@@ -115,7 +130,22 @@ func (r *GrafanaNotificationPolicyReconciler) Reconcile(ctx context.Context, req | |||
removeNoMatchingInstance(¬ificationPolicy.Status.Conditions) | |||
r.Log.Info("found matching Grafana instances for notificationPolicy", "count", len(instances)) | |||
|
|||
applyErrors := make(map[string]string) | |||
var mergedRoutes []*v1beta1.GrafanaNotificationPolicyRoute | |||
assembledNotificationPolicy := notificationPolicy.DeepCopy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we never intend to modify the supplied notificationPolicy
, there's no reason to DeepCopy
unless you plan to call notificationPolicy.Update()
later on but avoid the assembled routes being applied.
} | ||
|
||
for _, route := range routes { | ||
r.Recorder.Event(route, corev1.EventTypeNormal, "Merged", fmt.Sprintf("Route merged into NotificationPolicy %s/%s", notificationPolicy.GetNamespace(), notificationPolicy.GetName())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@theSuess Are events planned as a replacement instead of status updates or an addition to conditions?
Or is there a reason it has not been used yet?
// it ensures that there are no reference loops when discovering routes via labelSelectors | ||
|
||
func assembleNotificationPolicyRoutes(ctx context.Context, k8sClient client.Client, namespace *string, notificationPolicy *grafanav1beta1.GrafanaNotificationPolicy) (*grafanav1beta1.GrafanaNotificationPolicy, []*v1beta1.GrafanaNotificationPolicyRoute, error) { | ||
assembledPolicy := notificationPolicy.DeepCopy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, the DeepCopy
is currently unnecessary.
See #1789 for the related feature proposal.
This implementation is still work in progress for now and should support the feature proposal.
Updates on
GrafanaNotificationPolicyRoutes
are tracked via watches. I first tried to accomplish this via ownerReferences (see commits), but this is not supported cross-namespace.Example of how the sample policies in
hack/kind/resources/default/grafana-notification-policy.yaml
are rendered:Status updates on the
GrafanaNotificationPolicy
Events emitted for the merged routes: