Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace GRPC state of world updates with ADS Delta #6806

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

saley89
Copy link

@saley89 saley89 commented Dec 12, 2024

For: #6266

This PR replaces the default mechanism of GRPC for a combination of Aggregated Discovery Service and Delta GRPC updates.

Findings

From a starting point of a running on a Kubernetes cluster using over 25000+ HttpProxy resources we saw the following with standard GRPC:

  • 3 x Contour instances at 64GB and over 10cpu cores peak
  • 18 x Envoy instances (6 per zone, a/b/c) running 8.5GB and 2cpu peak
  • Endpoint updates taking multiple seconds or longer. Delayed updates in the logs of over 30 seconds.
  • Due to slowness in keeping up with the state of the world updates we would get frequent 503 responses as the endpoints were out of date for so long.

With the switch to ADS Delta_GRPC we now see the following in the same cluster:

  • Contour memory reduced to ~2GB sustained and flat lining. CPU peaks down to 2cpu during processing of new proxies only. Average use is now around 500m.
  • Envoy memory usage reduced to around ~6GB and CPU down at ~500m.
  • Endpoint updates are now almost instant and any infrequent delayed updates logs now state the time as low 100's of milliseconds.

In short, this update mechanism is incredibly fast and resource efficient when compared with standard GRPC, "state of the world" communication. Without moving to this setup, a large scale cluster running many HttpProxy is simply not viable and has been noted in other issues (see #6743 - we believe this to be a side effect when running many proxies and we don't see this behaviour in our ADS Delta_GRPC version).

Changes

The bootstrap code for setting the default dynamic_resources configuration has been setup to specify ADS as per the documentation as follows:

{
    "dynamic_resources": {
        "ads_config": {
            "api_type": "DELTA_GRPC",
            "transport_api_version": "V3",
            "grpc_services": [
                {
                    "envoy_grpc": {
                        "cluster_name": "contour",
                        "authority": "contour"
                    }
                }
            ]
        },
        "lds_config": {
            "ads": {}
        },
        "cds_config": {
            "ads": {}
        }
    }
}

And then each endpoint created for EDS is configured as follows:

...
"eds_cluster_config": {
       "eds_config": {"ads": {}
       },
...       

@saley89 saley89 requested a review from a team as a code owner December 12, 2024 13:51
@saley89 saley89 requested review from tsaarni and sunjayBhatia and removed request for a team December 12, 2024 13:51
@sunjayBhatia sunjayBhatia requested review from a team, davinci26 and clayton-gonsalves and removed request for a team December 12, 2024 13:51
@saley89 saley89 force-pushed the ads-configuration-option branch 2 times, most recently from 5a12a94 to fab3db3 Compare December 12, 2024 13:58
Copy link

codecov bot commented Dec 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.05%. Comparing base (8a60b0d) to head (e6d2d26).
Report is 15 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6806      +/-   ##
==========================================
- Coverage   81.05%   81.05%   -0.01%     
==========================================
  Files         133      133              
  Lines       20026    20024       -2     
==========================================
- Hits        16232    16230       -2     
  Misses       3500     3500              
  Partials      294      294              
Files with missing lines Coverage Δ
internal/envoy/v3/auth.go 100.00% <100.00%> (ø)
internal/envoy/v3/bootstrap.go 92.43% <100.00%> (+0.15%) ⬆️
internal/envoy/v3/cluster.go 96.36% <100.00%> (-0.12%) ⬇️
internal/envoy/v3/listener.go 98.49% <100.00%> (ø)
internal/envoy/v3/stats.go 100.00% <100.00%> (ø)
internal/featuretests/v3/envoy.go 99.13% <100.00%> (ø)

Signed-off-by: David Sale <[email protected]>
@saley89 saley89 force-pushed the ads-configuration-option branch from 16b8020 to e6d2d26 Compare December 16, 2024 16:24
Copy link
Member

@sunjayBhatia sunjayBhatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the investigation and work on this @saley89

I agree moving to Delta ADS is a very valuable improvement to Contour, however unfortunately we cannot immediately switch over to a new gRPC variant for a few reasons:

  • Currently in SOTW xDS, Contour treats all Envoy instances as the same "node" (see here:
    defaultCache = envoy_cache_v3.NewSnapshotCache(false, &contour_xds_v3.Hash, log.WithField("context", "defaultCache"))
    edsCache = envoy_cache_v3.NewSnapshotCache(false, &contour_xds_v3.Hash, log.WithField("context", "edsCache"))
    ), this would need to change and we would need to generate snapshot caches per-Envoy node, this is especially important in the case of Envoy nodes disconnecting and reconnecting (with cluster resource changes that may have happened in the interim)
  • To protect our OSS users with running clusters that upgrade to a release with Delta ADS support, in the case of bugs etc. this needs to be a more gradual rollout, i.e. add as a feature flagged or opt-in feature, add docs on how to enable, etc. it cannot become the default immediately
    • we'll need to have some e2e upgrade testing to make sure upgrading to Delta ADS works as expected

I started some of the background refactor work to enable us to feature flag the various config sources here but other priorities took over at the time: #5523

@saley89
Copy link
Author

saley89 commented Dec 17, 2024

Hi @sunjayBhatia that's great to hear. Originally I went down the line of what your PR achieves. I wanted to flag it such that GRPC remained the default but you could toggle ADS with Delta. I was really struggling to see how to pass any type of configuration flags down to the ConfigSource function as it wasn't a part of any structure.

Your PR looks ideal, and sets the foundations I need to achieve what I wanted at the start. I'm thinking if we could get your PR into the main branch with just the default GRPC behavior, I could then rebase what I have done above and setup some options in there to toggle on the ADS Delta behavior.

Do you think something like would be possible in the very near future? As described above our setup means without ADS Delta it simply can't support the scale of our cluster.

Copy link

github-actions bot commented Jan 9, 2025

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 9, 2025
davinci26 added a commit to davinci26/contour that referenced this pull request Jan 20, 2025
Take two of projectcontour#5523

That introduces a struct `NewEnvoysGen` that allows us to modify the
behaviour of Envoy config generation. This will help with

projectcontour#6806

On the debate on `singleton` vs `struct` approach. I thought this again
and I realized that `struct` is probably worth the big diff here. For the
following reason:

* It allows us to test the behaviour much easier. With a singleton it would be
impossible for higher level tests like tests in `internal/featuretests` to modify
the behaviour of the Envoy config generation. With a struct, we can just pass the
config options to the struct and test the behaviour.

Signed-off-by: Sotiris Nanopoulos <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants