
Live Migration. SSH session to migrating VM can be lost. #18

Open
DaveLangridge opened this issue Feb 12, 2015 · 10 comments

@DaveLangridge

I had an SSH session to a VM from host 2, then live-migrated the VM to host 1 and back. Migrating from host 1 to host 2, the SSH session stayed up, with a ~2s gap in pings to another VM. Migrating from host 2 to host 1, I got 'broken pipe' on the SSH connection. The VM migrated OK, and pings to the migrating VM showed a gap of ~2s.

@eepyaich

@Lukasa Can you have a look at this to double check whether this is something we should be worrying about or prioritising?

@Lukasa

Lukasa commented Feb 13, 2015

EPIPE is usually raised because the connection has been closed on the far end, but I wouldn't be at all surprised to find it also gets returned if an ICMP Destination Host Unreachable message is received. When a VM is moved there is a small window during which no route is available for it, and that window is larger if the VM is moving away from the host that has the SSH session open.

I think this is relatively non-urgent: I'd only consider it problematic if we can reproduce this with traffic moving between VMs or from outside the data center.
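For illustration only (nothing to do with Calico's code paths), here is a minimal Python sketch of how EPIPE surfaces: keep writing to a TCP connection whose far end has already gone away and the second write fails with BrokenPipeError, which is roughly what the SSH client sees when its connection is torn down during the migration window.

```python
import socket
import time

# Toy illustration: BrokenPipeError is Python's wrapper around EPIPE, raised
# when we keep writing to a TCP socket whose peer has already closed.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
conn, _ = server.accept()
conn.close()          # far end goes away (like the peer becoming unreachable)
time.sleep(0.1)       # give the close a moment to propagate

try:
    client.sendall(b"hello")   # first write typically elicits a TCP RST
    time.sleep(0.1)
    client.sendall(b"hello")   # second write then fails with EPIPE
except BrokenPipeError as exc:
    print("got EPIPE:", exc)
finally:
    client.close()
    server.close()
```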

@fasaxc
Member

fasaxc commented May 20, 2015

I don't think we do anything special for migrations right now; we treat them as a removal and then an add of a port. I feel like, to handle migration robustly without dropping connections, we need to:

  • keep the route on the old host until the new host publishes its route to the guest
  • blackhole traffic for the guest at the old host until all the traffic has moved to the new host
  • remove the route on the old host
  • stop blackholing traffic at the old host.

It's probably good enough to have a 60s overlap rather than trying to detect convergence.

Without that, there's a window where there's no route to the guest on either node so downstream routers will reply with no-route-to-host.
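To make that sequencing concrete, here is a rough Python sketch of the old host's side using plain `ip route` commands. It is purely illustrative: Felix does not do this today, `guest_ip` is a placeholder, and the fixed 60s overlap is the approximation suggested above rather than real convergence detection.

```python
import subprocess
import time

OVERLAP_SECONDS = 60  # fixed overlap instead of trying to detect convergence


def ip(*args):
    """Run an `ip` command, echoing it so the sequencing is easy to follow."""
    cmd = ("ip",) + args
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def old_host_migration_steps(guest_ip: str) -> None:
    """Hypothetical sequence on the *old* host for a migrating guest.

    Assumes a route to guest_ip via its TAP device already exists when
    this starts; requires root to actually modify the routing table.
    """
    # 1. Keep the existing route to the guest until the new host has had
    #    time to publish its own route (approximated by a fixed overlap).
    time.sleep(OVERLAP_SECONDS)

    # 2+3. Blackhole traffic for the guest on the old host; `replace` swaps
    #      out the old per-TAP route at the same time, so packets still
    #      arriving here are dropped rather than left unroutable.
    ip("route", "replace", "blackhole", f"{guest_ip}/32")

    # 4. Once the traffic has moved to the new host, stop blackholing.
    time.sleep(OVERLAP_SECONDS)
    ip("route", "del", "blackhole", f"{guest_ip}/32")
```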

@eepyaich

eepyaich commented Aug 3, 2015

Boiling this down to the customer perspective, it sounds like there is a window condition where live migration will fail (i.e. if the definition of "live" is that my SSH session didn't drop). @fasaxc @DaveLangridge is this a reasonable summary?

@DaveLangridge
Author

Yes. The migration works, but the SSH session sometimes drops and you need to reconnect.

@fasaxc
Member

fasaxc commented Nov 2, 2015

We could use BGP to make this go very smoothly, assuming we can spot when something is being migrated:

  • host that initially contains the workload starts advertising the route with a community that indicates that the VM is being migrated (we time out this route after, say, 60s)
  • new host starts advertising route to the VM without community (i.e. just normal advertisement)
  • network is configured to prefer normal routes to routes with the migration community attached.

That way there is always a route to the address of the workload, and the best route is the new one as soon as it comes up.
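As a sketch of the selection policy only (not an actual BIRD/BGP configuration), the preference between the two advertisements could be expressed like this; the community value and addresses are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Hypothetical community meaning "this workload is migrating away";
# a real deployment would pick a value from its own ASN's space.
MIGRATING: Tuple[int, int] = (64512, 999)


@dataclass
class Advertisement:
    prefix: str
    next_hop: str
    communities: Set[Tuple[int, int]] = field(default_factory=set)


def best_route(candidates: List[Advertisement]) -> Optional[Advertisement]:
    """Prefer any route *without* the migration community over any with it.

    This encodes the policy above: the old host's tagged advertisement keeps
    the workload reachable, but the new host's plain advertisement wins as
    soon as it appears.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda adv: MIGRATING in adv.communities)


# During migration both hosts advertise the workload's /32.
old = Advertisement("10.65.0.5/32", "172.18.0.1", {MIGRATING})
new = Advertisement("10.65.0.5/32", "172.18.0.2")
assert best_route([old]) is old          # before the new host advertises
assert best_route([old, new]) is new     # new host preferred once it appears
```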

@nelljerram
Member

Note: this is also being tracked as a (possible) networking-calico issue at https://bugs.launchpad.net/networking-calico/+bug/1628960.

@fasaxc fasaxc transferred this issue from projectcalico/felix May 1, 2020
@caseydavenport caseydavenport added the kind/enhancement New feature or request label May 13, 2020
@comay

comay commented Jan 12, 2021

Although this issue seems to predate Newton, it's possible that change #38 may help in this situation.

@nelljerram
Member

For the record we have just made another networking-calico fix - #59 - that is relevant to connectivity across a live migration. The scenario for that fix is a double migration, i.e. migrating from host A to host B and then back again to host A, so again it's not directly pertinent to the original report here.

However, given our recent testing, investigation and fixes around #38 and #59, it seems likely that the original issue reported here is no longer reproducible with up-to-date OpenStack and Calico (our recent OpenStack work has all been with Ussuri on Ubuntu). In particular, we now know that the timing of routes is controlled by when TAP interfaces and WorkloadEndpoints (WEPs) exist, as follows.

| Step                        | TAP on old | WEP on old | TAP on new | WEP on new | Route on old | Route on new |
|-----------------------------|------------|------------|------------|------------|--------------|--------------|
| Start                       | Yes        | Yes        | No         | No         | Yes          | No           |
| Pre-migration               | Yes        | Yes        | Yes        | No         | Yes          | No           |
| Live migration in progress  | Yes        | Yes        | Yes        | No         | Yes          | No           |
| Migration done              | No         | Yes [1]    | Yes        | Yes        | No           | Yes          |
[1] marks the bug that was fixed by #59. Now that cell would read "No", but previously the old WEP was hanging around until the next resync. Either way, it didn't affect the existence of a route on the old host, because Nova had removed the TAP interface.

In any case, the table shows us that there is now no significant time when the VM route is being advertised from the wrong node, or when there is no route at all. We may still be slightly vulnerable to timing deltas in control plane (OpenStack + BIRD) processing on the two relevant nodes, but recent testing has not revealed such cases, and I wonder whether those timing windows are now small enough to be covered by network retries.

@tj90241
Contributor

tj90241 commented Oct 7, 2021

@neiljerram - as of the most recent fixes, I can confirm that there are no remaining issues. I'd recommend this issue be closed as well.
