Liqo does not work with Cilium with eBPF Host Routing or conntrack disabled #2166

Open
yoctozepto opened this issue Nov 24, 2023 · 7 comments
Labels
fix Fixes a bug in the codebase.

Comments

@yoctozepto

What happened:

Peering Liqo clusters fails when either cluster runs Cilium with eBPF Host Routing [1] (which requires, and is enabled by default after enabling, kube-proxy replacement and eBPF masquerading) or with iptables (netfilter) connection tracking (conntrack) bypassed [2]: the Liqo WireGuard VPN tunnel drops packets along the way. For example, in-band peering fails at authentication because the two control planes cannot actually reach each other, despite the "successful" tunnel establishment.

[1] https://docs.cilium.io/en/stable/operations/performance/tuning/#ebpf-host-routing
[2] https://docs.cilium.io/en/stable/operations/performance/tuning/#bypass-iptables-connection-tracking

What you expected to happen:

I expect Liqo to work in this situation.

How to reproduce it (as minimally and precisely as possible):

Deploy Cilium on a modern kernel (see the referenced docs) with the following minimal values.yaml file contents:

kubeProxyReplacement: true
bpf:
  masquerade: true
# adjust the following for your cluster; they are required by the kube-proxy replacement
k8sServiceHost: some.ip.address
k8sServicePort: 6443
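
To confirm that a reproduction cluster really runs with eBPF host routing, the Cilium tuning guide linked in [1] suggests checking the "Host Routing" line of `cilium status` in an agent pod. A minimal sketch of that check follows. The helper name `host_routing_mode`, the `kube-system` namespace and the `ds/cilium` target are assumptions about a default Helm install, not something stated in this report.

```shell
#!/bin/sh
# Hypothetical helper: read `cilium status` output on stdin and print the
# value of its "Host Routing" line ("BPF" when eBPF host routing is active,
# "Legacy" when the regular host network stack is used).
host_routing_mode() {
  awk -F': *' '/^Host Routing:/ { print $2 }'
}

# Example invocation (assumes Cilium runs as the "cilium" DaemonSet in
# kube-system; newer releases may name the in-pod binary cilium-dbg):
#   kubectl -n kube-system exec ds/cilium -- cilium status | host_routing_mode
```

With the values above on a supported kernel, the expectation is that this prints BPF; if it prints Legacy, the cluster is not exercising the code path this issue is about.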

Anything else we need to know?:

Environment:

  • Liqo version: v0.10.1
  • Liqoctl version: v0.10.1
  • Kubernetes version (use kubectl version):
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.2
  • Cloud provider or hardware configuration: Talos Linux 1.4.8
  • Network plugin and version: Cilium 1.14.2
@stelucz

stelucz commented Dec 20, 2023

There's another issue, too: even without eBPF host routing enabled, liqo-auth is spammed by EOF errors:

auth 2023/12/20 08:04:00 http: TLS handshake error from 10.0.0.199:3205: EOF
auth 2023/12/20 08:04:02 http: TLS handshake error from 10.0.1.56:56183: EOF
auth 2023/12/20 08:04:02 http: TLS handshake error from 10.0.1.251:56569: EOF
auth 2023/12/20 08:04:04 http: TLS handshake error from 10.0.0.199:43529: EOF
auth 2023/12/20 08:04:05 http: TLS handshake error from 10.0.1.56:25211: EOF
auth 2023/12/20 08:04:05 http: TLS handshake error from 10.0.1.251:45286: EOF
auth 2023/12/20 08:04:07 http: TLS handshake error from 10.0.0.199:54163: EOF
auth 2023/12/20 08:04:09 http: TLS handshake error from 10.0.1.56:10733: EOF
auth 2023/12/20 08:04:09 http: TLS handshake error from 10.0.1.251:19602: EOF

The source addresses above belong to the Cilium "routers" on the nodes.
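
To check which clients produce these errors, the log lines can be grouped by source address. A minimal sketch, assuming the log format shown above (IPv4 sources only). The function name `handshake_errors_by_source` is hypothetical, and the `liqo` namespace and `deploy/liqo-auth` target in the usage comment are assumptions about a default install.

```shell
#!/bin/sh
# Hypothetical helper: read liqo-auth logs on stdin and print
# "<source address> <count>" for every client that caused a
# "TLS handshake error ... EOF" line, sorted lexically by address.
handshake_errors_by_source() {
  sed -n 's/.*TLS handshake error from \([0-9.]*\):[0-9]*: EOF.*/\1/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example invocation (namespace and deployment name are assumptions):
#   kubectl -n liqo logs deploy/liqo-auth | handshake_errors_by_source
```

If every address in the result is a node-local Cilium address, the errors most likely come from plain TCP probes rather than from real peering clients.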

@yoctozepto
Author

auth 2023/12/20 08:04:00 http: TLS handshake error from 10.0.0.199:3205: EOF
<snip>

source addresses above are Cilium "routers" at nodes.

That's because they open and close TCP connections to the service.

@cheina97
Member

Hi, sorry for the late reply. We are starting to investigate your issues. @yoctozepto and @stelucz, have you encountered these problems only with in-band peering or even with out-of-band?

@stelucz

stelucz commented Dec 22, 2023

Hi @cheina97 my "problem" with errors in logs is just after Liqo deployment, no peering established so far.

@cheina97
Member

Hi @cheina97 my "problem" with errors in logs is just after Liqo deployment, no peering established so far.

Thanks

@EladDolev
Contributor

We're trying to peer two GKE clusters, where the destination cluster has Dataplane V2 (Cilium-based), and we also encounter those TLS handshake errors in liqo-auth.

Peering in-band fails with a timeout, and we see the following errors in the controller manager logs:

failed to send identity request: Post "https://10.131.0.3:443/identity/certificate": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

After peering out-of-band and then unpeering without deleting the created namespaces, in-band peering becomes possible.

@danvaida

danvaida commented May 15, 2024

Hi @cheina97 my "problem" with errors in logs is just after Liqo deployment, no peering established so far.

Thanks

Hey folks. FWIW, I'm also experiencing this with Cilium (chart version 1.15.4). Cilium chart vars are as follows:

---
eni:
  enabled: true
  awsEnablePrefixDelegation: true
  awsReleaseExcessIPs: true
ipam:
  mode: eni
egressMasqueradeInterfaces: eth+
tunnel: disabled
hubble:
  relay:
    enabled: true
  ui:
    enabled: false

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: liqo.io/type
              operator: DoesNotExist

Liqo "consumer" cluster is EKS with 1.29.4 and "producer" cluster is GKE with 1.29.3.
Cilium is running only on the EKS cluster.
The TLS handshake error shows up in the Liqo logs right after installation on the EKS cluster. Upon a peering attempt, the auth step fails with: ERRO Authentication to the remote cluster "eks" failed: timed out waiting for the condition.
I tried it with a vanilla cluster w/o Cilium on it and I was able to establish a bi-directional out-of-band peering and tested it successfully with some namespace offloading.
liqoctl is v0.10.3.

Is it reasonable to expect that this will work any time soon?

Update (07.06.24):
It turns out that on EKS, where the AWS Load Balancer Controller is commonly used, you need to be aware that starting with v2.5.0 the controller creates an internal Network Load Balancer by default:

[...]
This controller creates an internal NLB by default. You need to specify the annotation service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing on your service if you want to create an internet-facing NLB for your service.
[...]

As such, keep that in mind when installing Liqo directly with Helm or with liqoctl (which also uses the Helm chart in the background).

Using service.beta.kubernetes.io/aws-load-balancer-internal: "false" does the trick, too, but it might give you some headaches because of the boolean-looking value: liqoctl install only supports --set and not the handy --set-string that helm supports. It's fine if you use a YAML file containing the values, though.

One-liner example:

$ liqoctl --context=some-cluster install eks \
  --eks-cluster-region=${EKS_CLUSTER_REGION} \
  --eks-cluster-name=${EKS_CLUSTER_NAME} \
  --user-name liqo-cluster-user \
  --set auth.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-scheme"=internet-facing \
  --set gateway.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-scheme"=internet-facing
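
The values-file route mentioned above avoids the --set type coercion, because the annotation value can be quoted in YAML. A minimal sketch, assuming the same auth.service.annotations and gateway.service.annotations keys used in the one-liner. The file name is arbitrary, and how the file is passed to the installer (for example helm's -f/--values) depends on your setup.

```shell
#!/bin/sh
# Write a Helm values file in which the annotation value is the *string*
# "false" rather than a YAML boolean. The file name is hypothetical.
values_file="liqo-eks-values.yaml"

cat > "$values_file" <<'EOF'
auth:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: "false"
gateway:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: "false"
EOF
```

The quotes around "false" are the important part: without them YAML parses the value as a boolean, and Kubernetes annotations must be strings.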

cheina97 removed the kind/bug label Dec 20, 2024
aleoli added the fix label Dec 23, 2024
6 participants