What happened:
I noticed that the policy failure counter metrics report inconsistent values. If you query the same counter metric repeatedly, the reported value fluctuates in a way that does not reflect actual policy evaluations.
Snippet from running curl (every 2 seconds) filtered for a single metric:
It's almost like there are multiple counters running in the background and the metrics route handler sometimes returns values from one and sometimes from another.
What you expected to happen:
Counter metric values should be consistent between successive scrapes of the same pod.
How to reproduce it (as minimally and precisely as possible):
1. Deploy MagTape
2. Run make test-functional and/or manually apply some resources that force policy failures, so the counters increment
3. Port forward to a specific MagTape pod on port 5000
4. Run curl against the metrics endpoint in a loop and record the values for a specific metric; the value fluctuates between scrapes:
$ for i in {1..100}; do curl -ks https://localhost:5000/metrics | grep "magtape_policy_total" | grep "test1" | grep "fail" | grep "privileged" >> /tmp/magtape-pod1-metrics.out; done
Anything else we need to know?:
MagTape was running with 3 replicas
Environment:
Kubernetes version (use kubectl version): v1.17
Cloud provider or hardware configuration:
Others:
MagTape v2.3.2
Ok, this seems to be related to the multi-process configuration with Gunicorn. There are multiple processes per pod, each running a full copy of MagTape, and therefore multiple copies of each metrics counter per pod.
This is a known issue with using the Prometheus client with Gunicorn. Solution and caveats are noted here:
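For reference, here is a minimal sketch of the Prometheus Python client's documented multiprocess mode, which aggregates counter values across Gunicorn workers so every scrape returns one merged value. This is not MagTape's actual handler: the Flask route, counter label names, and metric description below are illustrative, and the environment variable name (prometheus_multiproc_dir vs. PROMETHEUS_MULTIPROC_DIR) depends on the client version.

```python
# Sketch of prometheus_client multiprocess mode for a Flask app served by Gunicorn.
# Assumes prometheus_multiproc_dir (PROMETHEUS_MULTIPROC_DIR in newer client
# versions) is set to a writable directory shared by all worker processes.

from flask import Flask, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Counter,
    generate_latest,
    multiprocess,
)

app = Flask(__name__)

# Counters are declared as usual; in multiprocess mode their values are backed
# by files in the shared directory instead of per-process memory.
# Label names here are illustrative, not MagTape's actual definition.
magtape_policy_total = Counter(
    "magtape_policy_total",
    "Total policy evaluations",
    ["count_type", "policy", "ns"],
)

@app.route("/metrics")
def metrics():
    # Build a fresh registry that merges the per-worker metric files so a scrape
    # sees one aggregated value instead of whichever worker handled the request.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)
```

The client documentation also recommends a Gunicorn child_exit hook that calls multiprocess.mark_process_dead(worker.pid) so stale per-worker files are cleaned up, and notes additional caveats for gauges, which need an explicit aggregation mode.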