Trouble with Discovery service - Talos 1.9.2 - Kubernetes - 1.32.1 #10222

Open

savagemindz opened this issue Jan 24, 2025 · 1 comment

Bug Report

2025-01-24T18:04:21.623Z ERROR hello failed {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"", "endpoint": "discovery.talos.dev:443"}

Description

Hi all,

I am sure I am doing something stupid but figured I would open this in case it is a bug (or you can educate me).

I have been playing with Talos in a lab, using Terraform to provision it on a small Proxmox cluster. All the VMs get created fine and I can bootstrap the first control plane node, but all nodes then fail to connect to discovery.talos.dev. The logs fill with:

2025-01-24T18:04:21.623Z ERROR hello failed {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"", "endpoint": "discovery.talos.dev:443"}
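
As a sanity check on the handshake itself (rather than plain reachability), something like the following can be run from a machine on the same network; treat it as a rough sketch:

# Test the TLS handshake to the discovery endpoint directly from a LAN host.
# If this completes cleanly, the path to discovery.talos.dev:443 itself is fine
# and the failure is more likely node-side (clock skew, MTU, or an intercepting proxy).
openssl s_client -connect discovery.talos.dev:443 -servername discovery.talos.dev </dev/null

# Same idea with curl, which also shows the certificate chain that was negotiated.
curl -v https://discovery.talos.dev/ -o /dev/null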

At this point the cluster sort of builds, but my call to talosctl health shows that each node can only see itself.

talosctl health -n k8s-cp-1
discovered nodes: ["192.168.2.81"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: etcd member ips ["192.168.2.82" "192.168.2.83" "192.168.2.81"] are not subset of control plane node ips ["192.168.2.81" "2001:4d48:ad5e:e02:b8d0:ff:fe01:1"]
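
Presumably the health check builds its control plane IP list from discovery data, and each node only discovers itself (the IPv6 address looks like the node's own SLAAC address), hence the mismatch. A rough sketch for comparing what each node has actually discovered (node IPs taken from the outputs above):

# "members" is the merged discovery view; "affiliates" in the raw namespace
# shows the per-registry data before merging.
for node in 192.168.2.81 192.168.2.82 192.168.2.83; do
  echo "--- $node ---"
  talosctl --nodes "$node" get members
  talosctl --nodes "$node" get affiliates --namespace=cluster-raw
done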

Interestingly, Kubernetes does seem to form the cluster, though:

kg nodes
NAME       STATUS   ROLES           AGE   VERSION
k8s-cp-1   Ready    control-plane   19m   v1.32.1
k8s-cp-2   Ready    control-plane   17m   v1.32.1
k8s-cp-3   Ready    control-plane   18m   v1.32.1
k8s-wk-1   Ready    <none>          18m   v1.32.1
k8s-wk-2   Ready    <none>          19m   v1.32.1

There are no firewall rules preventing outbound connectivity to discovery.talos.dev:443, and a curl from a machine on the same network works fine. I have checked all the obvious things I can think of, such as NTP, network reachability, DNS, etc.
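
For completeness, the node-side equivalents of those checks can be pulled from Talos directly; a sketch, assuming the usual network resource names:

# Time sync status as the node sees it (a TLS "context deadline exceeded" is often clock skew).
talosctl --nodes 192.168.2.81 time

# DNS resolvers and NTP servers the node is actually configured with.
talosctl --nodes 192.168.2.81 get resolvers
talosctl --nodes 192.168.2.81 get timeservers

# Addresses and routes, to rule out an unexpected egress path.
talosctl --nodes 192.168.2.81 get addresses
talosctl --nodes 192.168.2.81 get routes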

If I downgrade to Kubernetes 1.31 and use the Kubernetes registry, I can get the cluster to build.

In case this matters, I am trying to build with Cilium as an inline manifest.

Anyway, let me know if you need any more info or want me to run something, but I am at a bit of a loss.

Thanks
iain

Logs

support.zip

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]

talosctl version --nodes 192.168.2.81
Client:
Tag: v1.8.3
SHA: 6494ace
Built:
Go version: go1.22.9
OS/Arch: linux/amd64
Server:
NODE: 192.168.2.81
Tag: v1.9.2
SHA: 09758b3
Built:
Go version: go1.23.4
OS/Arch: linux/amd64
Enabled: RBAC

  • Kubernetes version: [kubectl version --short]
    kubectl version
    Client Version: v1.32.0
    Kustomize Version: v5.5.0
    Server Version: v1.32.1

  • Platform: Proxmox (QEMU VMs provisioned with Terraform)

@savagemindz (Author)

Just to add to this: I booted the cluster using the Kubernetes registry and Kubernetes version 1.31.5, then edited the machine config for each node and rebooted one of them (so far). The node that rebooted still prints the "transport: authentication handshake failed: context deadline exceeded" error, and talosctl get affiliates --namespace=cluster-raw also only has entries from the Kubernetes registry; nothing appears from the discovery service. My machine config looks like this anyway:

discovery:
  enabled: true
  registries:
    kubernetes:
      disabled: false
    service:
      disabled: false
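
For what it's worth, this is roughly how I would confirm the change is actually in the running config on each node; a sketch, assuming the discovery settings sit under cluster.discovery in the machine config:

# Check the live machine config on each node for the discovery section.
for node in 192.168.2.81 192.168.2.82 192.168.2.83; do
  echo "--- $node ---"
  talosctl --nodes "$node" get machineconfig -o yaml | grep -A 8 'discovery:'
done

# Or edit the live config in place rather than editing files and rebooting
# (whether the discovery change applies without a reboot I have not confirmed).
talosctl --nodes 192.168.2.81 edit machineconfig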

One other thing that I thought might be relevant: the cluster is using an image generated by https://factory.talos.dev/, with these additional extensions added to it (a rough schematic sketch follows the list).

  • "iscsi-tools",
  • "qemu-guest-agent",
  • "util-linux-tools"

Thanks
iain
