Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic Notification Policy Routes #1800

Open
wants to merge 33 commits into
base: master
Choose a base branch
from

Conversation

msvechla
Copy link

See #1789 for the related feature proposal.

This implementation is still work in progress for now and should support the feature proposal.

Updates on GrafanaNotificationPolicyRoutes are tracked via watches. I first tried to accomplish this via ownerReferences (see commits), but this is not supported cross-namespace.


Example of how the sample policies in hack/kind/resources/default/grafana-notification-policy.yaml are rendered:

CleanShot 2024-12-18 at 11 58 53

Status updates on the GrafanaNotificationPolicy

❯ kubectl get grafananotificationpolicy grafananotificationpolicy-sample -o jsonpath="{.status}" |jq
{
  "conditions": [
    {
      "lastTransitionTime": "2024-12-17T15:52:31Z",
      "message": "Notification Policy was successfully applied to 1 instances",
      "observedGeneration": 9,
      "reason": "ApplySuccessful",
      "status": "True",
      "type": "NotificationPolicySynchronized"
    }
  ],
  "discoveredRoutes": [
    "grafana-crds/dynamic-c (priority: 1)",
    "default/dynamic-d (priority: 2)",
    "default/dynamic-e (priority: nil)"
  ]
}

Events emitted for the merged routes:

❯ kubectl get events

LAST SEEN   TYPE     REASON   OBJECT                                     MESSAGE
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-c   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
3s          Normal   Merged   grafananotificationpolicyroute/dynamic-d   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
3s          Normal   Merged   grafananotificationpolicyroute/dynamic-e   Route merged into NotificationPolicy default/grafananotificationpolicy-sample

Adapted PROJECT to new go module version path and ran:

```
./bin/operator-sdk create api --group grafana --version v1beta1 --kind GrafanaNotificationPolicyRoute --controller false
```
- during the reconcile loop in notificationpolicy_controller.go, we have
  to fetch all matching GrafanaNotificationPolicyRoutes for the currently reconciled GrafanaNotificationPolicy
- this can be very easily achieved with a routeSelector, which will be a Kubernetes LabelSelector
- if we would go with instanceSelector, we would have to fetch all available
  GrafanaNotificationPolicyRoutes and then do some filtering afterwards,
  to see if the instanceSelector matches, which would be both more inefficient and more complex
The GrafanaNotificationPolicy Controller now watches
GrafanaNoticationPolicyRoutes instead of using ownerReferences, as
ownerReferences do not support cross-namespace references.

We now also emit a event on the GrafanaNotificationPolicyRoute to
indicate that it has been merged into a specific policy.
@msvechla msvechla changed the title Notification policy routes impl Dynamic Notification Policy Routes Dec 18, 2024
Copy link

This PR hasn't been updated for a while, marking as stale

@github-actions github-actions bot added the stale label Jan 20, 2025
@github-actions github-actions bot added the documentation Issues relating to documentation, missing, non-clear etc. label Jan 21, 2025
@msvechla
Copy link
Author

I have updated the draft PR with the latest discussions from #1789:

  • RouteSelector has been moved to the Route object, allowing to dynamically inject routes on all levels now
  • Logic to detect reference loops has been implemented
  • Tests have been added accordingly
  • Priority logic has been removed

There are no checks for RouteSelector and Routes being mutually exclusive so far.
Unfortunately I think we can not implement this validation on the OpenAPI level, as the Routes attribute uses // +kubebuilder:validation:Schemaless, which disables all validation.

Another idea would be to solve this by implementing a ValidationWebhook. @theSuess do you have any thoughts on this?

Results of updates Samples

Assembled hack/kind/resources/default/grafana-notification-policy.yaml:

CleanShot 2025-01-21 at 11 06 27

❮ kubectl get grafananotificationpolicy grafananotificationpolicy-sample -o jsonpath="{.status}" |jq
{
  "conditions": [
    {
      "lastTransitionTime": "2025-01-21T09:48:19Z",
      "message": "Notification Policy was successfully applied to 1 instances",
      "observedGeneration": 1,
      "reason": "ApplySuccessful",
      "status": "True",
      "type": "NotificationPolicySynchronized"
    }
  ],
  "discoveredRoutes": [
    "default/dynamic-e",
    "grafana-crds/dynamic-c",
    "default/dynamic-d"
  ]
}
❯ kubectl get events

LAST SEEN   TYPE     REASON   OBJECT                                     MESSAGE
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-d   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-d   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-e   Route merged into NotificationPolicy default/grafananotificationpolicy-sample
20m         Normal   Merged   grafananotificationpolicyroute/dynamic-e   Route merged into NotificationPolicy default/grafananotificationpolicy-sample

@msvechla msvechla force-pushed the notification_policy_routes_impl branch from 3da8a88 to 95a1fd8 Compare January 21, 2025 10:11
@msvechla
Copy link
Author

I rebased on the latest master branch.

Additionally I looked into the ValidationWebhook for ensuring that routeSelector and routes are mutually exlusive.

While the implementation of the logic is quite straight-forward, setting up webhooks is a bit more complex, especially bringing this to the Helm chart.

I can certainly do this if you think this is the way forward, we can find another way to do the validation, or maybe skip this validation for now.

Any thoughts?

@github-actions github-actions bot removed the stale label Jan 22, 2025
@theSuess
Copy link
Member

Sorry for the deleted comment, missed the part about api level validation rules in your previous note.

We had some discussions around Validation Webhooks before and came to the conclusion that these are too finicky to get right.

If the validation rules don't work, I'd just go for a status condition in the respective object when both values are set which tells the user that the dynamic matching takes priority and the routes field is ignored

@msvechla
Copy link
Author

Unfortunately, as far as I can see, the validation expressions will not work due to the +kubebuilder:validation:Schemaless on the Routes object. I can see if I can find a workaround.

Alternatively, I will go with the hint in the status condition as you suggested 👍

@theSuess
Copy link
Member

The reason why I added the schemaless validation is to support recursive objects. If there is a cleaner way, I'm all for it!

- adds validation for ensuring routes and routeSelector are mutual
  exclusive
- updates both GrafanaNotificationPolicy and
  GrafanaNotificationPolicyRoute status conditions accordingly
@msvechla
Copy link
Author

I updated the PR and implemented the validation by updating the status condition on both GrafanaNotificationPolicy and GrafanaNotificationPolicyRoute.

Status:
  Conditions:
    Last Transition Time:  2025-01-22T10:08:13Z
    Message:               Dynamically matched definitions from routeSelector will take precedence and the routes field is ignored
    Observed Generation:   8
    Reason:                BothRoutesAndRouteSelectorSpecified
    Status:                True
    Type:                  RoutesIgnoredDueToRouteSelector

}
assembledNotificationPolicy, mergedRoutes, err = assembleNotificationPolicyRoutes(ctx, r.Client, namespace, assembledNotificationPolicy)
r.Log.Info("assembled notification policy routes", "mergedRoutes", mergedRoutes)
if err != nil {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should we handle errors here e.g. due to a loop in matched routes? Is it fine as is, by simply returning with an error? Alternatively we could also update the status condition similar to applyErrors := make(map[string]string) below and stop reconciling.

Looking at the code below, it looks like the status condition is never updated when errors occurred, not sure if I am misreading the code or this is on purpose:

        if len(applyErrors) > 0 {
		return ctrl.Result{}, fmt.Errorf("failed to apply to all instances: %v", applyErrors)
	}

// ...
	condition := buildSynchronizedCondition("Notification Policy", conditionNotificationPolicySynchronized, notificationPolicy.Generation, applyErrors, len(instances))
	meta.SetStatusCondition(&notificationPolicy.Status.Conditions, condition)

Looks like this has been adapted here: #1815

Any thoughts @theSuess ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an error doesn't make sense here, as it would cause the operator to retry the same resource. Assuming nothing changed, it'll just fail again. My preferred outcome is to note the error in the conditions.

Status conditions are applied through the defer function here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, but the errors are never set on the condition, because the controller always returns when there are errors and never reaches the code to set the error condition:

see:

if len(applyErrors) > 0 {
    return ctrl.Result{}, fmt.Errorf("failed to apply to all instances: %v", applyErrors)
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some modifications here, hopefully that resolves the issue: f924763

Copy link
Contributor

@Baarsgaard Baarsgaard Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msvechla Good catch it looks like I caused a regression preventing bad synchronization results from being added to the status..
I will fix that ASAP everywhere else with some regression tests.
If you revert these lines it should fix it and you can then undo the changes to the defer and move applyErrors and such back to the instances apply loop

Copy link
Member

@theSuess theSuess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good! I've tested it locally and it works as expected. I left some minor comments, but nothing that should require large scale refactoring.

Thanks for also taking care of the documentation right away 💯

api/v1beta1/grafananotificationpolicy_types.go Outdated Show resolved Hide resolved
api/v1beta1/grafananotificationpolicy_types.go Outdated Show resolved Hide resolved
api/v1beta1/grafananotificationpolicyroute_types.go Outdated Show resolved Hide resolved
controllers/grafananotificationpolicyroute_controller.go Outdated Show resolved Hide resolved
Comment on lines +169 to +173
if notificationPolicy.Spec.Route.IsRouteSelectorMutuallyExclusive() {
meta.RemoveStatusCondition(&notificationPolicy.Status.Conditions, conditionRoutesIgnoredDueToRouteSelector)
} else {
setRoutesIgnoredDueToRouteSelectorCondition(&notificationPolicy.Status.Conditions, notificationPolicy.Generation)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything preventing this check from occurring before applying the notification policy? I'd like to avoid half-applied notification policies

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We said that in case both are specified, we simply overwrite what is specified in Routes, so I think it's not an issue if we update the condition at the end. There should be no half-applied notification policies as far as I can see.

controllers/notificationpolicy_controller.go Outdated Show resolved Hide resolved
docs/docs/alerting/notification-policies.md Outdated Show resolved Hide resolved
docs/docs/alerting/notification-policies.md Outdated Show resolved Hide resolved
docs/docs/alerting/notification-policies.md Outdated Show resolved Hide resolved
docs/docs/alerting/notification-policies.md Outdated Show resolved Hide resolved
@msvechla msvechla force-pushed the notification_policy_routes_impl branch from 733f358 to 942a0b5 Compare January 24, 2025 14:47
@msvechla msvechla marked this pull request as ready for review January 24, 2025 14:51
@msvechla msvechla requested a review from theSuess January 24, 2025 14:51
@Baarsgaard
Copy link
Contributor

Baarsgaard commented Jan 25, 2025

From a user perspective, I would find an error to be more easy to deal with than partially ignoring parts of the spec I set.
If I was not careful, I could end up seeing half the Contact Points, but not realize the other half is missing.

From that point, would it make sense to instead set the InvalidSpec condition on NotificationPolicies or PolicyRoutes where both fields are set?
This would result in NotificationPolicies that refuse to be updated and/or PolicyRoutes that are omitted from assembly depending on where both are configured.

This is not perfect as some routes might be missing, but I would see it as less confusing overall.

It would mean we can return earlier before attempting the apply as well.

Copy link
Contributor

@Baarsgaard Baarsgaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good and I learned quite a few things going through this. 😄

Comment on lines +181 to +185
// IsCrossNamespaceImportAllowed returns true when cross namespace imports are allowed
func (np *GrafanaNotificationPolicy) IsCrossNamespaceImportAllowed() bool {
return np.Spec.AllowCrossNamespaceImport
}

Copy link
Contributor

@Baarsgaard Baarsgaard Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate of AllowCrossNamespace ~20 lines down

assembledNotificationPolicy, mergedRoutes, err = assembleNotificationPolicyRoutes(ctx, r.Client, namespace, assembledNotificationPolicy)
if err != nil {
applyErrors[globalApplyError] = err.Error()
return ctrl.Result{Requeue: false}, fmt.Errorf("failed to assemble GrafanaNotificationPolicy using routeSelectors: %w", err)
Copy link
Contributor

@Baarsgaard Baarsgaard Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requeue and requeueDelay are ignored when you return an Error
The manager will retry the reconcile on it's own with an exponential backoff delay, we rely on this in most other reconcilers.

Suggested change
return ctrl.Result{Requeue: false}, fmt.Errorf("failed to assemble GrafanaNotificationPolicy using routeSelectors: %w", err)
return ctrl.Result{}, fmt.Errorf("failed to assemble GrafanaNotificationPolicy using routeSelectors: %w", err)

Comment on lines +416 to +432
func hasRouteSelector(route *grafanav1beta1.Route) bool {
if route == nil {
return false
}

if route.RouteSelector != nil {
return true
}

for _, nestedRoute := range route.Routes {
if hasRouteSelector(nestedRoute) {
return true
}
}

return false
}
Copy link
Contributor

@Baarsgaard Baarsgaard Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a struct function on the Route object in api/v1beta1/grafananotificationpolicy_types.go as it's not relying on anything specific to the reconciler (ctx, log, client).

Comment on lines +137 to +141
var namespace *string
if !notificationPolicy.IsCrossNamespaceImportAllowed() {
ns := notificationPolicy.GetObjectMeta().GetNamespace()
namespace = &ns
}
Copy link
Contributor

@Baarsgaard Baarsgaard Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just as well be moved into assembleNotificationPolicyRoutes as it's not used elsewhere after this.
Then change the function arguments to take the notificationPolicy object itself.

@@ -115,7 +130,22 @@ func (r *GrafanaNotificationPolicyReconciler) Reconcile(ctx context.Context, req
removeNoMatchingInstance(&notificationPolicy.Status.Conditions)
r.Log.Info("found matching Grafana instances for notificationPolicy", "count", len(instances))

applyErrors := make(map[string]string)
var mergedRoutes []*v1beta1.GrafanaNotificationPolicyRoute
assembledNotificationPolicy := notificationPolicy.DeepCopy()
Copy link
Contributor

@Baarsgaard Baarsgaard Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we never intend to modify the supplied notificationPolicy, there's no reason to DeepCopy unless you plan to call notificationPolicy.Update() later on but avoid the assembled routes being applied.

}

for _, route := range routes {
r.Recorder.Event(route, corev1.EventTypeNormal, "Merged", fmt.Sprintf("Route merged into NotificationPolicy %s/%s", notificationPolicy.GetNamespace(), notificationPolicy.GetName()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theSuess Are events planned as a replacement instead of status updates or an addition to conditions?
Or is there a reason it has not been used yet?

// it ensures that there are no reference loops when discovering routes via labelSelectors

func assembleNotificationPolicyRoutes(ctx context.Context, k8sClient client.Client, namespace *string, notificationPolicy *grafanav1beta1.GrafanaNotificationPolicy) (*grafanav1beta1.GrafanaNotificationPolicy, []*v1beta1.GrafanaNotificationPolicyRoute, error) {
assembledPolicy := notificationPolicy.DeepCopy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, the DeepCopy is currently unnecessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Issues relating to documentation, missing, non-clear etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants