dynamic host volumes: serialize ops per volume #24852
Conversation
So that DHV plugins can assume Nomad will not run concurrent operations on the same volume, and so that we avoid interleaving client RPCs with Raft writes.
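For context, here is a minimal sketch of one way to serialize operations per volume ID without a coarse-grained server-wide mutex or a hot loop; the volLocker type and its method names are illustrative assumptions, not the implementation in this PR:

package volumes // illustrative package name

import "sync"

// volLocker hands out a per-volume lock so that create/register/delete for
// the same volume ID never run concurrently, while operations on different
// volumes proceed in parallel.
type volLocker struct {
    mu    sync.Mutex
    locks map[string]chan struct{}
}

// lock blocks until no other operation holds the lock for this volume ID.
func (l *volLocker) lock(volID string) {
    for {
        l.mu.Lock()
        if l.locks == nil {
            l.locks = map[string]chan struct{}{}
        }
        ch, busy := l.locks[volID]
        if !busy {
            l.locks[volID] = make(chan struct{})
            l.mu.Unlock()
            return
        }
        l.mu.Unlock()
        <-ch // block (no hot loop) until the current holder finishes, then retry
    }
}

// unlock releases the lock and wakes any waiters so one of them can retry.
func (l *volLocker) unlock(volID string) {
    l.mu.Lock()
    ch := l.locks[volID]
    delete(l.locks, volID)
    l.mu.Unlock()
    close(ch)
}

Each RPC handler would then take the lock for its volume ID before calling out to the client and writing to Raft, and release it when the operation completes.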
I was a little worried this was going to be tough to do without either a coarse-grained mutex or a hot loop, but your implementation elegantly avoids both problems. Nice work on this so far.
LGTM!
feelin pretty good about this one
(force-pushed from 54367db to 2a771c1)
nomad/host_volume_endpoint_test.go (outdated)
// block until something runs unblockCurrent()
if bc := v.getBlockChan(); bc != nil {
    bc <- "create"
}
I refactored it (again), and while it's still convoluted, it does actually test the serialization.
Instead of these mock client methods blocking on a receive, they block on a send. Then unblockCurrent() unblocks them by receiving, and it returns what it got, i.e. whichever client RPC was being called at the time.
The test code then makes sure that matches which server RPC completed, and we can check the state as appropriate for the operation type.
Aside from me thinking it works, I can tell it does because if I don't use the same volume ID as the initial create, then operations happen in random orders and there are chaotic test errors.
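In other words, the shape of the test helper is roughly this; a hedged sketch with assumed names, since the real mockVolumeClient in the test differs in detail:

package volumes

import "sync"

// mockVolumeClient stands in for the client-side volume RPCs. Each mocked
// operation blocks on a send of its own name until the test receives it.
type mockVolumeClient struct {
    mu        sync.Mutex
    blockChan chan string
}

func (v *mockVolumeClient) getBlockChan() chan string {
    v.mu.Lock()
    defer v.mu.Unlock()
    return v.blockChan
}

// Create blocks until unblockCurrent receives from the channel.
func (v *mockVolumeClient) Create() error {
    if bc := v.getBlockChan(); bc != nil {
        bc <- "create"
    }
    return nil
}

// unblockCurrent lets exactly one blocked client RPC proceed and returns
// which operation it was, so the test can assert it matches the server
// RPC that just completed.
func (v *mockVolumeClient) unblockCurrent() string {
    return <-v.getBlockChan()
}

With that shape, unblockCurrent() doubles as both the synchronization point and the probe for which operation reached the client first.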
LGTM!
We can probably reuse this logic for tightening up the serialization of CSI operations too, so it's great that we're spending the effort to make this solid.
    }
case <-time.After(time.Second):
    t.Error("timeout waiting for an RPC to finish")
    break LOOP
This will deadlock the test if we break while there are any pending funcs (i.e. if the first or second times out), because we'll never unblock the channel for the remaining functions but still call funcs.Wait() after the loop breaks. A deadlocked test manifests as the whole test run blowing up in a panic after the 20min timeout, which is pretty awful to debug on GHA because it's not obvious which test is panicking without a lot of rummaging in the logs.
We could fix this by having a shutdown context on the mockVolumeClient. Just before we break the loop we can cancel that context, and then the send looks something like this:
// block until something runs unblockCurrent()
if bc := v.getBlockChan(); bc != nil {
    select {
    case bc <- "delete":
    case <-v.shutdownCtx.Done():
        v.setBlockChan(nil)
    }
}
That would allow funcs.Wait() to run to completion and close out the test. We'd still end up failing the test because of the t.Error here.
All that being said, I'll 👍 this PR as-is, as it'd be a bit of effort to fix and only matters in the case we break the test. Up to you.
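For completeness, the wiring inside the test function could look roughly like this. This is a sketch under assumptions: mockClient, done, and expectedOps are stand-in names, while shutdownCtx, funcs, and LOOP come from the discussion above.

// create a context the test can cancel when it bails out early
shutdownCtx, cancel := context.WithCancel(context.Background())
defer cancel()
mockClient.shutdownCtx = shutdownCtx

// kick off the RPCs under test via funcs, then drain them one at a time
LOOP:
for i := 0; i < expectedOps; i++ {
    select {
    case op := <-done:
        // assert op matches the server RPC that just completed ...
        _ = op
    case <-time.After(time.Second):
        t.Error("timeout waiting for an RPC to finish")
        cancel() // unblock any mock client RPC still waiting to send
        break LOOP
    }
}
funcs.Wait() // now guaranteed to return even on the timeout path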
I thought I had caught all the deadlocks! I caused enough of them while writing this 😋 but you're totally right. I pushed another commit to cover it.
so it's great that we're spending the effort to make this solid.
thanks for saying so; I was worried I was deadlocked myself, spending too much time on it 😅
LGTM!
Let only one create/register/delete operation run at a time per volume ID.