Live Migration. SSH session to migrating VM can be lost. #18
@Lukasa Can you have a look at this to double check whether this is something we should be worrying about or prioritising?
EPIPE is usually raised because the connection has been closed on the far end, but I wouldn't be at all surprised to find it gets returned if an ICMP Destination Host Unreachable message is received. This can happen when moving VMs: there is a small window during which there is no route available for that VM, and the window is larger if the VM is moving away from the host that has the SSH session open. I think this is relatively non-urgent: I'd only consider it problematic if we can reproduce this with traffic moving between VMs or from outside the data center.
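A minimal sketch of how EPIPE typically surfaces to an application writing to a TCP socket, assuming a hypothetical client; the host, port, and retry policy are illustrative, not anything from this issue:

```python
import errno
import socket

def send_with_retry(host, port, payload, retries=3):
    """Send payload, reconnecting if the kernel reports EPIPE."""
    for _ in range(retries):
        sock = socket.create_connection((host, port), timeout=5)
        try:
            sock.sendall(payload)
            return True
        except OSError as e:
            # EPIPE (BrokenPipeError): the far end closed the connection.
            # EHOSTUNREACH can appear if an ICMP Destination Host
            # Unreachable was received, e.g. during the migration window.
            if e.errno in (errno.EPIPE, errno.EHOSTUNREACH):
                continue
            raise
        finally:
            sock.close()
    return False
```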
I don't think we do anything special for migrations right now; we treat them as a removal and then an add of a port. I feel like, to handle migration robustly without dropping connections, we need to keep a route to the VM in place on one host or the other throughout the transition.
It's probably good enough to have a 60s overlap rather than trying to detect convergence. Without that, there's a window where there's no route to the guest on either node, so downstream routers will reply with no-route-to-host.
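A minimal sketch of that fixed-overlap idea, assuming hypothetical `route_table.withdraw()` and `on_port_removed()` hooks (neither is a real Felix/networking-calico API):

```python
import threading

OVERLAP_SECONDS = 60  # fixed grace period instead of detecting convergence

def on_port_removed(route_table, workload_ip):
    # Keep the old host's route advertised for the grace period; the new
    # host advertises the same /32 as soon as its port is added, so there
    # is never a window with no route to the guest on either node.
    timer = threading.Timer(OVERLAP_SECONDS,
                            lambda: route_table.withdraw(workload_ip))
    timer.daemon = True
    timer.start()
```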
Boiling this down to the customer perspective, it sounds like there is a window during which live migration will fail (i.e. if the definition of "live" is that my SSH session doesn't drop). @fasaxc @DaveLangridge is this a reasonable summary?
Yes. The migration works, but the SSH sometimes drops and you need to reconnect. |
We could use BGP to make this go very smoothly, assuming we can spot when something is being migrated: there would always be a route to the address of the workload, and the best route would be the new one as soon as it comes up.
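A hedged illustration of that idea in Python; the route attributes and helper function are invented for the example and don't correspond to actual Calico or BIRD configuration:

```python
def routes_during_migration(workload_ip, old_host, new_host):
    # Both hosts advertise the workload /32 while the migration is in
    # flight; the destination host uses a higher local preference, so BGP
    # best-path selection flips traffic over as soon as its route exists.
    return [
        {"prefix": f"{workload_ip}/32", "next_hop": old_host, "local_pref": 100},
        {"prefix": f"{workload_ip}/32", "next_hop": new_host, "local_pref": 200},
    ]
```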
Note: this is also being tracked as a (possible) networking-calico issue at https://bugs.launchpad.net/networking-calico/+bug/1628960.
Although this issue seems to predate Newton, it's possible that change #38 may help in this situation.
For the record, we have just made another networking-calico fix (#59) that is relevant to connectivity across a live migration. The scenario for that fix is a double migration, i.e. migrating from host A to host B and then back again to host A, so again it's not directly pertinent to the original report here. However, given our recent testing, investigation and fixes around #38 and #59, it seems likely that the original issue reported here is no longer reproducible with up-to-date OpenStack and Calico (our recent work has all been with Ussuri on Ubuntu). In particular, we now know that the timing of routes is controlled by when TAP interfaces and WorkloadEndpoints exist, as follows.

[Table not preserved: it showed, for each stage of a migration, whether the TAP interface, WorkloadEndpoint and route existed on the source and destination hosts.]
[1] indicates the bug that was fixed by #59. Now that cell would read "No", but previously the old WEP was hanging around until the next resync; that didn't affect the existence of a route to the old host, though, because Nova had removed the TAP interface. In any case, the table shows us that there is now no significant time when the VM route is being advertised from the wrong node, or when there is no route at all. We may still be slightly vulnerable to timing deltas in control plane (OpenStack + BIRD) processing on the two relevant nodes, but recent testing has not revealed such cases, and I wonder if those timing windows are now small enough to be covered by network retries.
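For anyone wanting to check this on their own deployment, a rough sketch of a watcher that polls the kernel routing table on a compute node while a migration runs (the VM IP is a placeholder; `ip route get` is standard iproute2):

```python
import subprocess
import time

def has_route(vm_ip: str) -> bool:
    # `ip route get` exits non-zero (or reports "unreachable") when there
    # is no usable route to the address.
    result = subprocess.run(["ip", "route", "get", vm_ip],
                            capture_output=True, text=True)
    return result.returncode == 0 and "unreachable" not in result.stdout

def watch_route(vm_ip: str, duration: float = 30.0, interval: float = 0.1):
    # Log route presence at ~100ms granularity while the migration runs.
    start = time.monotonic()
    while time.monotonic() - start < duration:
        print(f"{time.monotonic() - start:6.2f}s route={has_route(vm_ip)}")
        time.sleep(interval)
```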
@neiljerram - as of the most recent fixes, I can confirm that there are no remaining issues. I'd recommend this issue be closed as well.
I had an SSH session to a VM from host 2, then live migrated the VM to host 1 and back. Migrating from host 1 to host 2, the SSH session stayed up, with a gap of ~2s in pings to another VM. Migrating from host 2 to host 1, I get 'broken pipe' on the SSH connection. The VM migrated OK; pings to the migrating VM show a gap of ~2s.
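A rough sketch of how the ~2s gap could be measured, assuming a placeholder VM IP; `ping -c 1 -W 1` is standard iputils:

```python
import subprocess
import time

def max_ping_gap(vm_ip: str, duration: float = 60.0) -> float:
    # Ping roughly every 100ms and report the longest interval between
    # successful replies, i.e. the connectivity gap across the migration.
    start = time.monotonic()
    last_ok = start
    worst = 0.0
    while time.monotonic() - start < duration:
        ok = subprocess.run(["ping", "-c", "1", "-W", "1", vm_ip],
                            capture_output=True).returncode == 0
        now = time.monotonic()
        if ok:
            worst = max(worst, now - last_ok)
            last_ok = now
        time.sleep(0.1)
    return worst
```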