Better concurrent request handling for model host address #38
Thanks for the PR! This is a tricky part of the codebase. I think we need a solution for waiting until an endpoint is available, but also respecting the context. I think there might be a leak in here as implemented - love to get your thoughts.
Thank you very much for the feedback! I applied some changes, but I think this needs better testing before it can be merged. It is a much more complex beast than I initially thought.
Agree on the additional testing. Looks like the larger-scale system test (300 concurrent requests) is catching some kind of issue: https://github.com/substratusai/lingo/actions/runs/7286882573/job/19856443938?pr=38#step:5:1098 I can help with additions to the system tests if you have specific scenarios that should be tested. The original system tests were written after concurrent request handling broke; they ensure that scale up and scale down work as expected and that requests and responses are returned for a realistic backend. Edit: I've triggered a re-run of the system tests to ensure it wasn't just a flaky test.
manager.getEndpoints(myService).
	setIPs(map[string]struct{}{myService: {}}, map[string]int32{myPort: 1})

testCases := map[string]struct {
I think these would be better written as individual tests instead of a table of test cases, since they each test different things. For example, in the timeout case it would be good to assert that the returned error is due to context cancellation, and that assertion would only be used for that one test case.
I agree that the error check was a bit vague, but IMHO it makes sense to have a spec for the method that defines all cases. I find it more readable.
To be fair, I use table tests as my default structure for unit tests and may be biased. If this is very important to you, I can refactor. The error type is checked now.
Not a strong opinion, good with this.
@@ -56,7 +56,13 @@ func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
 	defer complete()

 	log.Println("Waiting for IPs", id)
-	host := h.Endpoints.GetHost(r.Context(), deploy, "http")
+	host, err := h.Endpoints.AwaitHostAddress(r.Context(), deploy, "http")
It might be more robust to check this error instead of assuming it is a timeout error. In the future, the error logic in the invoked function might be updated to return different error types without this call site being reconsidered. Also, it is not always a timeout today: if the caller cancels the request, the context is cancelled, which is not technically a timeout.
	log.Printf("error while finding the host address %v", err)
	switch {
	case errors.Is(err, context.Canceled):
		w.WriteHeader(http.StatusInternalServerError)
I did some research online into which HTTP status code makes sense here, but it looks like this case is not often handled explicitly. 499 was suggested as an alternative, but it is not part of the Go stdlib.
As in #36, the reconcile may be affected by external requests. This refactoring helps by reducing lock contention.
I have also added a benchmark that shows the new rwlock is ~30% faster than before on my box. But this is all within nanoseconds and does not really matter.
The key benefit of this PR is handling request timeouts.
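The benchmark mentioned above is not reproduced here. As a rough sketch of how a read-path comparison between `sync.Mutex` and `sync.RWMutex` could be set up, the following uses `testing.Benchmark` so that it runs as a plain program. `mutexStore`, `rwStore`, and `benchReads` are hypothetical, and the numbers it prints depend entirely on the machine, so they say nothing about the ~30% figure quoted in this PR.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// mutexStore guards reads with an exclusive lock.
type mutexStore struct {
	mu    sync.Mutex
	hosts map[string]struct{}
}

func (s *mutexStore) len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.hosts)
}

// rwStore guards reads with a shared (read) lock.
type rwStore struct {
	mu    sync.RWMutex
	hosts map[string]struct{}
}

func (s *rwStore) len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.hosts)
}

// benchReads measures read calling from many goroutines in parallel.
func benchReads(read func() int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				_ = read()
			}
		})
	})
}

func main() {
	// The map is only read during the benchmark, so sharing it is safe.
	hosts := map[string]struct{}{"10.0.0.1": {}}
	m := &mutexStore{hosts: hosts}
	rw := &rwStore{hosts: hosts}

	fmt.Println("sync.Mutex:  ", benchReads(m.len))
	fmt.Println("sync.RWMutex:", benchReads(rw.len))
}
```

A read-only workload like this is the most favourable case for a read lock; a benchmark that mixes in writes from the reconcile path would be closer to the scenario described in #36.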