
Live Migration. SSH session to migrating VM can be lost. #18

Open
DaveLangridge opened this issue Feb 12, 2015 · 10 comments

@DaveLangridge

I had an SSH session to a VM from host 2, then live-migrated the VM to host 1 and back. Migrating from host 1 to host 2, the SSH session stayed up, with a ~2s gap in pings to another VM. Migrating from host 2 to host 1, I got 'broken pipe' on the SSH connection. The VM migrated OK, and pings to the migrating VM showed a gap of ~2s.

@eepyaich

@Lukasa Can you have a look at this to double check whether this is something we should be worrying about or prioritising?

@Lukasa

Lukasa commented Feb 13, 2015

EPIPE is usually raised because the connection has been closed on the far end, but I wouldn't be at all surprised to find it also gets returned if an ICMP Destination Host Unreachable message is received. When a VM is moved there is a small window during which no route is available for it, and that window is larger if the VM is moving away from the host that has the SSH session open.

I think this is relatively non-urgent: I'd only consider it problematic if we can reproduce this with traffic moving between VMs or from outside the data center.
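For illustration only (nothing to do with Calico's code paths), here is a minimal Python sketch of how EPIPE surfaces: keep writing to a TCP connection whose far end has already gone away and the second write fails with BrokenPipeError, which is roughly what the SSH client sees when its connection is torn down during the migration window.

```python
import socket
import time

# Toy illustration: BrokenPipeError is Python's wrapper around EPIPE, raised
# when we keep writing to a TCP socket whose peer has already closed.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
conn, _ = server.accept()
conn.close()          # far end goes away (like the peer becoming unreachable)
time.sleep(0.1)       # give the close a moment to propagate

try:
    client.sendall(b"hello")   # first write typically elicits a TCP RST
    time.sleep(0.1)
    client.sendall(b"hello")   # second write then fails with EPIPE
except BrokenPipeError as exc:
    print("got EPIPE:", exc)
finally:
    client.close()
    server.close()
```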

@fasaxc
Member

fasaxc commented May 20, 2015

I don't think we do anything special for migrations right now; we treat them as a removal and then an add of a port. I feel like, to handle migration robustly without dropping connections, we need to:

  • keep the route on the old host until the new host publishes its route to the guest
  • blackhole traffic for the guest at the old host until all the traffic has moved to the new host
  • remove the route on the old host
  • stop blackholing traffic at the old host.

It's probably good enough to have a 60s overlap rather than trying to detect convergence.

Without that, there's a window where there's no route to the guest on either node so downstream routers will reply with no-route-to-host.
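To make that sequencing concrete, here is a rough Python sketch of the old host's side using plain `ip route` commands. It is purely illustrative: Felix does not do this today, `guest_ip` is a placeholder, and the fixed 60s overlap is the approximation suggested above rather than real convergence detection.

```python
import subprocess
import time

OVERLAP_SECONDS = 60  # fixed overlap instead of trying to detect convergence


def ip(*args):
    """Run an `ip` command, echoing it so the sequencing is easy to follow."""
    cmd = ("ip",) + args
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def old_host_migration_steps(guest_ip: str) -> None:
    """Hypothetical sequence on the *old* host for a migrating guest.

    Assumes a route to guest_ip via its TAP device already exists when
    this starts; requires root to actually modify the routing table.
    """
    # 1. Keep the existing route to the guest until the new host has had
    #    time to publish its own route (approximated by a fixed overlap).
    time.sleep(OVERLAP_SECONDS)

    # 2+3. Blackhole traffic for the guest on the old host; `replace` swaps
    #      out the old per-TAP route at the same time, so packets still
    #      arriving here are dropped rather than left unroutable.
    ip("route", "replace", "blackhole", f"{guest_ip}/32")

    # 4. Once the traffic has moved to the new host, stop blackholing.
    time.sleep(OVERLAP_SECONDS)
    ip("route", "del", "blackhole", f"{guest_ip}/32")
```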

@eepyaich

eepyaich commented Aug 3, 2015

Boiling this down to the customer perspective, it sounds like there is a window condition where live migration will fail (i.e. if the definition of "live" is that my SSH session didn't drop). @fasaxc @DaveLangridge is this a reasonable summary?

@DaveLangridge
Author

Yes. The migration works, but the SSH session sometimes drops and you need to reconnect.

@fasaxc
Member

fasaxc commented Nov 2, 2015

We could use BGP to make this go very smoothly, assuming we can spot when something is being migrated:

  • host that initially contains the workload starts advertising the route with a community that indicates that the VM is being migrated (we time out this route after, say, 60s)
  • new host starts advertising route to the VM without community (i.e. just normal advertisement)
  • network is configured to prefer normal routes to routes with the migration community attached.

That way there is always a route to the address of the workload, and the best route is the new one as soon as it comes up.
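As a sketch of the selection policy only (not an actual BIRD/BGP configuration), the preference between the two advertisements could be expressed like this; the community value and addresses are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Hypothetical community meaning "this workload is migrating away";
# a real deployment would pick a value from its own ASN's space.
MIGRATING: Tuple[int, int] = (64512, 999)


@dataclass
class Advertisement:
    prefix: str
    next_hop: str
    communities: Set[Tuple[int, int]] = field(default_factory=set)


def best_route(candidates: List[Advertisement]) -> Optional[Advertisement]:
    """Prefer any route *without* the migration community over any with it.

    This encodes the policy above: the old host's tagged advertisement keeps
    the workload reachable, but the new host's plain advertisement wins as
    soon as it appears.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda adv: MIGRATING in adv.communities)


# During migration both hosts advertise the workload's /32.
old = Advertisement("10.65.0.5/32", "172.18.0.1", {MIGRATING})
new = Advertisement("10.65.0.5/32", "172.18.0.2")
assert best_route([old]) is old          # before the new host advertises
assert best_route([old, new]) is new     # new host preferred once it appears
```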

@nelljerram
Member

Note: this is also being tracked as a (possible) networking-calico issue at https://bugs.launchpad.net/networking-calico/+bug/1628960.

@fasaxc fasaxc transferred this issue from projectcalico/felix May 1, 2020
@caseydavenport caseydavenport added the kind/enhancement New feature or request label May 13, 2020
@comay

comay commented Jan 12, 2021

Although this issue seems to predate Newton, it's possible that change #38 may help in this situation.

@nelljerram
Member

For the record we have just made another networking-calico fix - #59 - that is relevant to connectivity across a live migration. The scenario for that fix is a double migration, i.e. migrating from host A to host B and then back again to host A, so again it's not directly pertinent to the original report here.

However, given our recent testing, investigation and fixes around #38 and #59, it seems likely that the original issue reported here is no longer reproducible with up-to-date OpenStack and Calico (our recent OpenStack work has all been with Ussuri on Ubuntu). In particular, we now know that the timing of routes is controlled by when TAP interfaces and WorkloadEndpoints (WEPs) exist, as follows.

| Step                        | TAP on old | WEP on old | TAP on new | WEP on new | Route on old | Route on new |
|-----------------------------|------------|------------|------------|------------|--------------|--------------|
| Start                       | Yes        | Yes        | No         | No         | Yes          | No           |
| Pre-migration               | Yes        | Yes        | Yes        | No         | Yes          | No           |
| Live migration in progress  | Yes        | Yes        | Yes        | No         | Yes          | No           |
| Migration done              | No         | Yes [1]    | Yes        | Yes        | No           | Yes          |
[1] marks the bug that was fixed by #59. Now that cell would read "No", but previously the old WEP was hanging around until the next resync. Either way, it didn't affect the existence of a route on the old host, because Nova had removed the TAP interface.

In any case, the table shows us that there is now no significant time when the VM route is being advertised from the wrong node, or when there is no route at all. We may still be slightly vulnerable to timing deltas in control plane (OpenStack + BIRD) processing on the two relevant nodes, but recent testing has not revealed such cases, and I wonder whether those timing windows are now small enough to be covered by network retries.

@tj90241
Contributor

tj90241 commented Oct 7, 2021

@neiljerram - as of the most recent fixes, I can confirm that there are no remaining issues. I'd recommend this issue be closed as well.
