Support paginating by key-value size #14809

Open
linxiulei opened this issue Nov 19, 2022 · 18 comments · May be fixed by #16300

Comments

@linxiulei

What would you like to be added?

Allow RangeRequest to specify a size limit to paginate the result if the size of the result exceeds the limit.

Why is this needed?

This would allow clients to constrain the response size and avoid overwhelming the etcd server with expensive requests.
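
For context, the closest existing mechanism is count-based pagination via WithLimit in clientv3; the proposal would add a byte-based analogue of the bound in the loop below. This is only a sketch against the current client API (the endpoint and key prefix are placeholders), not part of the proposed change:

```go
package main

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; adjust for a real cluster.
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Count-based pagination as it exists today: at most 100 keys per Range,
	// continuing from just after the last returned key. A size limit would
	// bound each page in bytes instead of (or in addition to) key count.
	key := "foo/"
	end := clientv3.GetPrefixRangeEnd(key)
	for {
		resp, err := cli.Get(context.Background(), key,
			clientv3.WithRange(end),
			clientv3.WithLimit(100),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend))
		if err != nil {
			panic(err)
		}
		for _, kv := range resp.Kvs {
			fmt.Printf("%s\n", kv.Key)
		}
		if !resp.More || len(resp.Kvs) == 0 {
			break
		}
		// Continue from just after the last returned key.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}
```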

@linxiulei
Author

cc @serathius

@lavacat

lavacat commented Nov 19, 2022

Is this for Kubernetes use case or general case?

FYI, Kubernetes recently added maxLimit = 10000

@fuweid
Member

fuweid commented Nov 20, 2022

Is this for Kubernetes use case or general case?

FYI, Kubernetes recently added maxLimit = 10000

Thanks for the link! By default, the object size is limited to 1.5 MiB. The pageSize can be used to reduce memory spikes.
In Kubernetes, it seems this can resolve the issue.

@linxiulei
Author

Is this for Kubernetes use case or general case?

I am thinking of Kubernetes use case, but it definitely benefits general cases as well

FYI, Kubernetes recently added maxLimit = 10000

It seems to somewhat mitigate a similar problem. AIUI, its actual intention is to increase the response size in order to reduce the number of calls to etcd.

@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 18, 2023
@stale stale bot closed this as completed May 22, 2023
@linxiulei linxiulei linked a pull request Jul 25, 2023 that will close this issue
@linxiulei
Author

@ahrtr can you please also try re-opening this?

@ahrtr ahrtr reopened this Jul 25, 2023
@stale stale bot removed the stale label Jul 25, 2023
@serathius
Member

serathius commented Jul 26, 2023

Note: I'm not sure large objects are as much of a concern compared to large numbers of small objects. Response size can be controlled on the user side as long as clients limit the maximum object size and the number of objects per response.

Context kubernetes/enhancements#2340 (comment)

Small objects are much more disruptive for clients. See the difference in CPU usage between "Disabled Large Objects" and "Disabled Small Objects" for the apiserver: the difference is between 3 CPUs and 72 CPUs.

Overall I still think the better long-term solution is to introduce a watch cache in the etcd client to get the same benefits (2-10x CPU reduction, 20-50x latency reduction) instead of creating a workaround.
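
As a rough worked example of that bound, using numbers already mentioned in this thread (so an illustration, not a measurement): with the Kubernetes maxLimit of 10000 objects per page and objects capped at the 1.5 MiB default, a single page is bounded by roughly 10000 × 1.5 MiB ≈ 14.6 GiB, and tightening either factor shrinks the worst case proportionally.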

@linxiulei
Author

As far as the problem this issue addresses is concerned, the total size is most relevant, as it results in proportional memory usage in etcd. Fixing this is a measure to avoid query-of-death (i.e. OOM) more than a performance improvement.

Besides that, I don't see a major conflict in having both a long-term and a short-term solution at the same time.

@serathius
Member

Besides that, I don't see a major conflict in having both a long-term and a short-term solution at the same time.

I expect Kubernetes will have consistent reads from cache before etcd releases v3.6. Getting the watch cache out will not be far off. The short-term solution would be immediately deprecated.

@linxiulei
Author

IMO, this pagination by size is still helpful after we have consistent reads in k8s, because k8s still needs to issue non-streaming requests to etcd, which might be too large and blow up etcd's memory usage, right?

@serathius
Member

No, the goal is to serve non-streaming requests from the watch cache.

@linxiulei
Author

linxiulei commented Sep 14, 2023

You mean the apiserver's watch cache, right? What about when the apiserver starts and gets all objects from etcd into its own cache?

@serathius
Member

A single get-all-objects request is not a big problem; the problem is frequent fetch requests made by badly written Kubernetes controllers. Such requests require a lot of allocations and waste time on proto marshalling/unmarshalling, causing etcd memory to grow in an uncontrolled way.

Note that all Kubernetes requests are sharded by resource, and Kubernetes limits the response size to 2 GB, meaning an initial request for all resources allocates maybe a couple to tens of GB. That's big, but not horrible. Compare it to multiple controllers that each fetch a single 1 GB resource every second.

@wenjiaswe
Contributor

wenjiaswe commented Sep 14, 2023

In #16300, the newly implemented maxbytes defaults to 0, which would not break any existing use case, and it could be useful for cases where users do want to control the max size. Note that the kube-apiserver maxLimit limits the object count, but objects with a large size and more revisions would still cause problems.
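
For illustration only, the byte-budget idea can be sketched as below; the helper name and the rule of always returning at least one KV are assumptions of this sketch, not necessarily what #16300 implements:

```go
package sketch

import "go.etcd.io/etcd/api/v3/mvccpb"

// truncateByBytes is a hypothetical helper: it cuts a range result once the
// accumulated key+value size exceeds maxBytes. maxBytes == 0 keeps today's
// behaviour (no size limit). At least one KV is always returned so a single
// oversized object cannot stall pagination.
func truncateByBytes(kvs []*mvccpb.KeyValue, maxBytes int64) ([]*mvccpb.KeyValue, bool) {
	if maxBytes == 0 {
		return kvs, false
	}
	var total int64
	for i, kv := range kvs {
		total += int64(len(kv.Key) + len(kv.Value))
		if total > maxBytes && i > 0 {
			return kvs[:i], true // more pages remain
		}
	}
	return kvs, false
}
```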

I don't see a strong reason not to merge. @serathius Do you have any specific concerns?

@serathius
Member

I don't have any concerns about compatibility, more about feature bloat. Based on my understanding of the SIG api-machinery roadmap, this feature will never be used by Kubernetes.

I don't think etcd should implement a feature for Kubernetes without a prior design and approval from Kubernetes stakeholders. Before we add work for ourselves, let's get LGTMs from SIG api-machinery.

cc @jpbetz @deads @logicalhan @wojtek-t

@ahrtr
Member

ahrtr commented Sep 18, 2023

@linxiulei I share the same understanding as #14809 (comment). Please raise a KEP in Kubernetes and get K8s's approval first. I don't think the etcd community will reject a feature that is useful and approved by K8s.

@tjungblu
Contributor

Turns out we have a case where we ran into the size aspect of the objects. We get these logs from the apiserver:

I0821 23:03:24.662977      17 trace.go:205] Trace[419549052]: "List(recursive=true) etcd3" key:/secrets,resourceVersion:,resourceVersionMatch:, limit:10000,continue: (21-Aug-2023 23:03:23.767) (total time: 895 ms): Trace[419549052]: [895.681318ms] [895.681318ms] END
W0821 23:03:24.662996      17 reflector.go:324] storage/cacher.go:/secrets: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs.  2147483647)
E0821 23:03:24.663007      17 cacher.go:425] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647); reinitializing...

{"level":"warn","ts":"2023-08-17T23:03:24.662Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00357ae00/10.1.1.3:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647)"}

Imagine a case where there's just a load of big secrets. I'm still figuring out whether this is caused by badly written operators :)

The gRPC limit we can't really change, because it's already int32.Max:
https://github.com/grpc/grpc-go/blob/8cb98464e5999aa2fd57bbf5b23bd5a634f4b2f5/server.go#L59
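
(For reference, the rejected message in the logs above is 2,169,698,338 bytes ≈ 2.02 GiB, just over grpc-go's int32 ceiling of 2,147,483,647 bytes.)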

We should enable some kind of paging for range responses, but I agree that this needs to go the KEP route first.

tjungblu added a commit to tjungblu/etcd that referenced this issue Nov 7, 2023
This adds a metric to allow us to alert on large range responses
described in etcd-io#14809.

Signed-off-by: Thomas Jungblut <[email protected]>
@tjungblu
Contributor

tjungblu commented Nov 7, 2023

As a start, I've quickly added a metric that allows us to alert on common large range requests (secrets / configmaps / image streams) with #16881.
