A second health GRPC endpoint to differentiate Kubernetes liveness/readiness checks (e.g.) #53
Comments
You don't need branch access to create PRs. Fork the repository on GitHub, push your branch to your fork, and then open the PR.
Okay, just going by the definitions: gRPC health is k8s readiness. gRPC doesn't have a liveness service, although honestly I tend to expect liveness to be a call-out from the service to a watchdog. But I can still understand the call going in the inverted direction.
Liveness needs to check that the service is actually functional. How is it expected to work? Each service in a server registers a callback with the readiness service under a particular name. When the readiness service is called, it invokes the callback for the requested name. Once the callback completes, the readiness service completes the RPC; if the callback fails, it fails the RPC?

Since a failure of any service should cause a restart, it seems the "name" part of the health service isn't useful here. Instead, the liveness service should maybe loop through all of its services and issue their callbacks. That does sound bad, though, as it could take a while to run. Maybe have services register themselves and have the liveness service be a watchdog? Each registered service would be responsible for "pinging" the liveness service at its registered frequency; if it fails to do so, the liveness service starts responding "dead".

The k8s documentation keeps using /healthz for liveness. In that case it sort of seems like using the same Health service we have now, but maybe services just register a…
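For illustration, a minimal sketch of that watchdog idea in Go, with hypothetical names (`LivenessWatchdog`, `Register`, `Ping`, `Alive`) that are not part of any existing API:

```go
package watchdog

import (
	"sync"
	"time"
)

// LivenessWatchdog is a hypothetical registry: each component registers a
// maximum interval and must call Ping within that interval, otherwise the
// whole process is reported as not alive.
type LivenessWatchdog struct {
	mu      sync.Mutex
	entries map[string]entry
}

type entry struct {
	maxInterval time.Duration
	lastPing    time.Time
}

func New() *LivenessWatchdog {
	return &LivenessWatchdog{entries: make(map[string]entry)}
}

// Register adds a component that promises to ping at least every maxInterval.
func (w *LivenessWatchdog) Register(name string, maxInterval time.Duration) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.entries[name] = entry{maxInterval: maxInterval, lastPing: time.Now()}
}

// Ping records a heartbeat from a registered component.
func (w *LivenessWatchdog) Ping(name string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if e, ok := w.entries[name]; ok {
		e.lastPing = time.Now()
		w.entries[name] = e
	}
}

// Alive reports false as soon as any component has missed its deadline;
// a liveness RPC handler would fail the check (and k8s would restart the
// container) in that case.
func (w *LivenessWatchdog) Alive() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now()
	for _, e := range w.entries {
		if now.Sub(e.lastPing) > e.maxInterval {
			return false
		}
	}
	return true
}
```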
Yes, the HTTP calls are *from* k8s to the service on, say, /health and /ready.
Yes, if the liveness check doesn't return a 200 in a timely fashion, the instance is considered dead and will be culled by k8s. If the readiness check doesn't return a 200 in a timely fashion, the instance is considered "stalled" and will be removed from the load balancer. The only criterion k8s uses to judge these calls is a 2xx status code; the body isn't utilized.
I'm not sure I'm grokking what you mean by callbacks. The liveness/readiness checks are simple inbound HTTP requests, which a pure GRPC service requires to be transcoded to GRPC, and each of these RPCs returns 2xx (or not) based on the business logic of liveness and (primarily) readiness (e.g., MySQL is down, so I can't function correctly right now).
This is an interesting approach, but it's not what k8s supports AFAIK.
Again, I think this could work, but I'm not sure it's workable in the context of pure transcoding, à la grpc-gateway, Envoy, et al. Given a GRPC service using reflection in the context of Go, how would one register the same GRPC service as two different "endpoints"? That would probably be a viable option, but if it's possible we haven't seen exemplars.
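One possible way to do this in Go, sketched under the assumption of a recent `protoc-gen-go-grpc` that exports `Health_ServiceDesc` (the alias name `grpc.health.v1.Readiness` is made up for illustration), is to register the same health implementation under a copied `ServiceDesc`. Note that server reflection would list the alias but has no proto descriptor to serve for it:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	s := grpc.NewServer()
	hs := health.NewServer()

	// Normal registration: /grpc.health.v1.Health/Check and /Watch.
	healthpb.RegisterHealthServer(s, hs)

	// Copy the generated ServiceDesc and give it a second name, so the same
	// implementation also answers /grpc.health.v1.Readiness/Check.
	// "grpc.health.v1.Readiness" is a hypothetical name for illustration.
	readinessDesc := healthpb.Health_ServiceDesc
	readinessDesc.ServiceName = "grpc.health.v1.Readiness"
	s.RegisterService(&readinessDesc, hs)

	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(s.Serve(lis))
}
```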
I also wanted to point out that the alternative "everyone" will need to adopt is to duplicate the health GRPC as a separate liveness/readiness service in each project, or in a shared repo for every engineering org; see the example below. That much duplication of code suggests to me that perhaps it belongs in one place.
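Roughly, the duplication would look like every project (or every org's shared repo) re-implementing the same per-service status map that `grpc.health.v1.Health` already provides, just exposed under its own readiness name. A sketch in Go, reusing the existing `grpc_health_v1` request/response types for brevity (a real duplicate would copy those into a per-org proto package too):

```go
package readiness

import (
	"context"
	"sync"

	"google.golang.org/grpc/codes"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"
)

// Server is what each project ends up re-writing: a per-service status map
// mirroring grpc.health.v1.Health, kept separate so readiness can be
// transcoded independently of liveness.
type Server struct {
	mu       sync.RWMutex
	statuses map[string]healthpb.HealthCheckResponse_ServingStatus
}

func NewServer() *Server {
	return &Server{statuses: map[string]healthpb.HealthCheckResponse_ServingStatus{
		"": healthpb.HealthCheckResponse_SERVING,
	}}
}

// SetReady flips the readiness status for a named service.
func (s *Server) SetReady(service string, ready bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st := healthpb.HealthCheckResponse_NOT_SERVING
	if ready {
		st = healthpb.HealthCheckResponse_SERVING
	}
	s.statuses[service] = st
}

// Check mirrors grpc.health.v1.Health/Check.
func (s *Server) Check(ctx context.Context, req *healthpb.HealthCheckRequest) (*healthpb.HealthCheckResponse, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if st, ok := s.statuses[req.GetService()]; ok {
		return &healthpb.HealthCheckResponse{Status: st}, nil
	}
	return nil, status.Error(codes.NotFound, "unknown service")
}
```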
@ejona86 Any additional comments or feedback? Thanks.
@ejona86 It would be great to make progress here. Thanks.
NOTE: I have a change I can submit for review, given branch access.
TL;DR: This issue proposes a second GRPC health endpoint for use with GRPC transcoders running within Kubernetes. Kubernetes supports both liveness and readiness probes, but for HTTP->GRPC transcoding use cases, two separate GRPC endpoints need to be registered to differentiate their functionality. (Kubernetes probe definitions)
For grpc-gateway and Envoy, for example, transcoding is 1:1 from HTTP endpoint to GRPC method, and it can't multiplex two HTTP endpoints onto a single GRPC endpoint.
One alternative is to create a stub HTTP service within an otherwise pure GRPC service; however, the business logic of liveness vs. readiness would then have to be duplicated in the HTTP logic layer rather than within the parallel GRPC layer.
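A minimal sketch of that stub-HTTP alternative, assuming a hypothetical `startProbeServer` helper and a MySQL ping as the readiness criterion; the point is that the probe decisions end up living in HTTP handler code rather than next to the GRPC health service:

```go
package probes

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"time"
)

// startProbeServer runs a plain HTTP sidecar inside an otherwise pure GRPC
// binary. The readiness decision (can we reach MySQL?) lives here, in the
// HTTP layer, duplicating logic that conceptually belongs with the GRPC
// health service.
func startProbeServer(db *sql.DB) {
	mux := http.NewServeMux()

	// Liveness: the process is up and able to answer HTTP at all.
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: upstream dependencies (e.g., MySQL) are reachable.
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	go func() {
		log.Fatal(http.ListenAndServe(":8080", mux))
	}()
}
```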
The proposed change essentially duplicates the Health GRPC service under the new "Readiness" name. This was chosen for its simplicity and small change footprint.
An alternative approach could be to add two additional endpoints, one for Liveness and one for Readiness, each perhaps with its own (duplicate) Request and Response structures. There's a certain naming cleanliness here, but it seems overly complex.
Feedback welcome!
Quick summary of liveness vs readiness use cases:

- Liveness: the service is up and serving GRPC.
  IMPORTANT: Kubernetes will restart containers that fail a liveness check.
- Readiness: the service has completed loading any prerequisite data, and the service's upstream dependencies (e.g., MySQL, Kafka) are available and working.
  IMPORTANT: Kubernetes will remove the container from the service load balancers, but will not restart the container.

In other words, liveness represents a permanent failure, while readiness represents a transient failure, and the application-side implementation of these checks should differ accordingly.