From 8244e13ec8bc13f149db61dba9775d6710a2fc89 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Charles-Edouard=20Br=C3=A9t=C3=A9ch=C3=A9?=
Date: Mon, 12 Sep 2022 14:17:05 +0200
Subject: [PATCH] Reports aggregation in a separate process
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Charles-Edouard Brétéché
---
 proposals/reports.md | 106 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 proposals/reports.md

diff --git a/proposals/reports.md b/proposals/reports.md
new file mode 100644
index 0000000..2f3588e
--- /dev/null
+++ b/proposals/reports.md
@@ -0,0 +1,106 @@
+# Meta
+
+- Name: Clean-up
+- Start Date: 2022-09-12
+- Author(s): eddycharly
+- Supersedes: N/A
+
+# Table of Contents
+
+- [Meta](#meta)
+- [Table of Contents](#table-of-contents)
+- [Overview](#overview)
+- [Definitions](#definitions)
+- [Motivation](#motivation)
+- [Proposal](#proposal)
+- [Implementation](#implementation)
+- [Migration (OPTIONAL)](#migration-optional)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Prior Art](#prior-art)
+- [Unresolved Questions](#unresolved-questions)
+- [CRD Changes (OPTIONAL)](#crd-changes-optional)
+
+# Overview
+
+Support per-resource reports, automatic report cleanup, and a separate report aggregation controller.
+
+# Definitions
+
+- TTL: Time-to-live. The amount of time a resource may exist until it is cleaned up.
+
+# Motivation
+
+The current implementation to generate and maintain policy reports is causing memory issues.
+Moreover, report lifecycles are all managed by hand. Matching reports and their corresponding resource/policy is cumbersome.
+
+Kubernetes has built-in mechanisms to clean up resources when parent resources are deleted; we should leverage these native capabilities when possible.
+
+Processing reports should not impact Kyverno admission request processing, and we should be able to scale report processing independently for large clusters when necessary.
+
+# Proposal
+
+In this proposal, we study the possibility of changing the way reports are generated by:
+- creating one report per resource
+- binding the report lifecycle to the resource lifecycle
+- allowing reports to be reconciled in an external process
+
+There are three ways of generating reports in Kyverno:
+1. At admission time, all policies running in audit mode are run against the admission request and produce report results.
+1. When a policy is created/updated/deleted, if the policy can run in background mode, reports are updated according to the policy changes.
+1. Periodically, policies running in background mode are re-evaluated against resources present in the cluster and reports are updated accordingly.
+
+By creating one report per resource, generating higher-level reports (i.e. per namespace) boils down to aggregating the reports living in the namespace.
+
+The controller responsible for aggregating reports can be isolated in its own process, separated from the Kyverno main controller.
+
+Finally, managing the one-to-one relationship between a resource and its corresponding report is much simpler than what we have today.
+
+# Implementation
+
+The implementation should be straightforward. We need to generate single-resource reports at admission time:
+- the report name can be derived from the resource `uid`
+- if the resource is updated, we update the report (the `uid` remains the same)
+- if the resource is deleted, Kubernetes will garbage collect the orphaned reports
+
+Background scans should follow the same logic as above.
+
+A new controller implemented in a separate process will be in charge of watching per-resource reports and creating/updating/deleting higher-level reports.
+
+As a bonus, for very large clusters, we can add options to run multiple controllers responsible for aggregating reports, each watching only a subset of per-resource reports, hence letting the end user shard report aggregation (we could have one controller per namespace, for example).
+
+## Custom Resources
+
+1. We have all necessary resources in place to implement this new design.
+
+## Existing Kubernetes Constructs
+
+N/A
+
+## Link to the Implementation PR
+
+N/A
+
+# Migration (OPTIONAL)
+
+N/A
+
+# Drawbacks
+
+N/A
+
+# Alternatives
+
+* Various alternatives have been tested in the past but without much success. Throttling is hard to implement in distributed systems, and all alternatives ran in-process and showed high memory and/or CPU consumption.
+
+# Prior Art
+
+* [kube-janitor](https://codeberg.org/hjacobs/kube-janitor)
+
+# Unresolved Questions
+
+N/A
+
+# CRD Changes (OPTIONAL)
+
+N/A
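
The Implementation section of the proposed document can be illustrated with a minimal Go sketch. It is not part of the patch and is not Kyverno code. It assumes a per-resource report is named after the resource `uid` and carries an owner reference pointing back at the resource, so that Kubernetes garbage collection can remove it when the resource is deleted. The types `Resource`, `OwnerRef`, and `Report` and the functions `ReportFor` and `AggregateByNamespace` are hypothetical stand-ins that use only the standard library; a real implementation would use the Kubernetes API types instead.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a hypothetical, trimmed-down stand-in for a Kubernetes object.
type Resource struct {
	APIVersion string
	Kind       string
	Namespace  string
	Name       string
	UID        string
}

// OwnerRef is a hypothetical stand-in for a Kubernetes owner reference
// (apiVersion, kind, name, uid of the owning object).
type OwnerRef struct {
	APIVersion string
	Kind       string
	Name       string
	UID        string
}

// Report is a hypothetical per-resource report.
type Report struct {
	Namespace string
	Name      string
	Owners    []OwnerRef
	Results   []string
}

// ReportFor builds the report for a single resource. The report name is
// derived from the resource uid, so updating the resource maps to the same
// report, and the owner reference ties the report lifecycle to the resource.
func ReportFor(r Resource, results []string) Report {
	return Report{
		Namespace: r.Namespace,
		Name:      r.UID,
		Owners: []OwnerRef{{
			APIVersion: r.APIVersion,
			Kind:       r.Kind,
			Name:       r.Name,
			UID:        r.UID,
		}},
		Results: results,
	}
}

// AggregateByNamespace counts results per namespace, which is the kind of
// higher-level view a separate aggregation controller could maintain.
func AggregateByNamespace(reports []Report) map[string]int {
	counts := map[string]int{}
	for _, rep := range reports {
		counts[rep.Namespace] += len(rep.Results)
	}
	return counts
}

func main() {
	pod := Resource{APIVersion: "v1", Kind: "Pod", Namespace: "default", Name: "nginx", UID: "uid-1"}
	svc := Resource{APIVersion: "v1", Kind: "Service", Namespace: "default", Name: "web", UID: "uid-2"}

	reports := []Report{
		ReportFor(pod, []string{"pass", "fail"}),
		ReportFor(svc, []string{"pass"}),
	}

	counts := AggregateByNamespace(reports)
	namespaces := make([]string, 0, len(counts))
	for ns := range counts {
		namespaces = append(namespaces, ns)
	}
	sort.Strings(namespaces) // deterministic output order
	for _, ns := range namespaces {
		fmt.Printf("%s: %d results\n", ns, counts[ns])
	}
}
```

Because the aggregation step only reads per-resource reports, it could run in a separate process and be restricted to a subset of namespaces, which is the sharding option the proposal mentions.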