Metric counters report inconsistent values for policy metrics #115

Open
phenixblue opened this issue Aug 19, 2021 · 1 comment
Labels: bug (Something isn't working), help wanted (Extra attention is needed), observability, python

phenixblue commented Aug 19, 2021

What happened:

I noticed that the policy failure counter metrics report inconsistent values.

If you poll a particular counter metric repeatedly, the reported value jumps back and forth between different numbers in a way that does not reflect the actual policy evaluations.

Snippet from running curl (every 2 seconds) filtered for a single metric:

magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0
magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0

It's almost like there are multiple counters running in the background and the metrics route handler sometimes displays values from one, and sometimes displays values from the other.

What you expected to happen:

Metric counter values should be consistent from one request to the next.

How to reproduce it (as minimally and precisely as possible):

  • Deploy MagTape
  • Run make test-functional and/or manually apply some resources to force policy failures to increment the counters
  • Port forward to a specific MagTape pod on port 5000
  • Run curl against the metrics endpoint in a loop and record the values for a specific metric; you should see the value change between requests:
$ for i in {1..100}; do curl -ks https://localhost:5000/metrics | grep "magtape_policy_total" | grep "test1" | grep "fail" | grep "privileged" >> /tmp/magtape-pod1-metrics.out; done
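
To summarize the captured lines rather than eyeballing them, a small script can tally how often each distinct value appears. This is a minimal sketch, not part of MagTape: the `tally_values` helper and the regular expression are hypothetical, and the regex only handles simple label sets like the one in this issue (no `}` inside label values).

```python
import re
from collections import Counter

# Matches one sample line of the Prometheus text exposition format, e.g.
#   magtape_policy_total{count_type="fail",ns="test1",...} 35.0
# Assumption: label values never contain a closing brace.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)


def tally_values(lines):
    """Count how many times each distinct sample value was observed."""
    counts = Counter()
    for line in lines:
        match = SAMPLE_RE.match(line.strip())
        if match:
            counts[float(match.group("value"))] += 1
    return counts


# Three lines copied from the output above; in practice, pass the open
# file handle for /tmp/magtape-pod1-metrics.out instead.
sample = [
    'magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 35.0',
    'magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0',
    'magtape_policy_total{count_type="fail",ns="test1",policy="policy-privileged-pod"} 1.0',
]

for value, seen in sorted(tally_values(sample).items()):
    print(f"{value}: seen {seen} time(s)")
```

A counter served by a single process should only ever stay the same or grow between requests, so seeing more than one distinct value that alternates is the signal worth reporting.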

Anything else we need to know?:

MagTape was running with 3 replicas

Environment:

  • Kubernetes version (use kubectl version): v1.17
  • Cloud provider or hardware configuration:
  • Others:
    • MagTape v2.3.2
phenixblue added the bug, help wanted, python, needs-triage, and observability labels on Aug 19, 2021

phenixblue commented Aug 20, 2021

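OK, this seems to be related to the multi-process configuration with Gunicorn. Each pod runs multiple Gunicorn worker processes, each with a full copy of MagTape and therefore its own set of metric counters. A request to the metrics endpoint is answered by whichever worker picks it up, so the reported value depends on which process responds.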

This is a known issue with using the Prometheus client with Gunicorn. Solution and caveats are noted here:

https://github.com/prometheus/client_python#multiprocess-mode-eg-gunicorn
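
The effect can be reproduced with a toy model that uses only the standard library. This is a sketch under stated assumptions, not MagTape code and not the Prometheus client API: `Worker` and `simulate` are hypothetical names, the worker count and failure count are arbitrary, and random selection stands in for however requests actually get distributed across workers.

```python
import random


class Worker:
    """Stands in for one worker process holding its own in-memory counter."""

    def __init__(self):
        self.policy_failures = 0


def simulate(num_workers=2, failures=36, scrapes=10, seed=0):
    """Return (values seen by successive scrapes, sum across all workers)."""
    rng = random.Random(seed)
    workers = [Worker() for _ in range(num_workers)]

    # Each failed policy evaluation increments the counter of whichever
    # worker happened to handle that request.
    for _ in range(failures):
        rng.choice(workers).policy_failures += 1

    # Each scrape is also answered by a single worker, which only knows
    # about the increments it performed itself.
    per_scrape = [rng.choice(workers).policy_failures for _ in range(scrapes)]

    # Summing across workers gives the one value that is the same no
    # matter which worker answers.
    aggregated = sum(worker.policy_failures for worker in workers)
    return per_scrape, aggregated


per_scrape, aggregated = simulate()
print("values seen by successive scrapes:", per_scrape)
print("sum across all workers:", aggregated)
```

The multiprocess mode described at the link above addresses this by having every process record its samples in a shared directory and aggregating them when the metrics endpoint is served, with the caveats listed there.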
