Trouble with Discovery service - Talos 1.9.2 - Kubernetes - 1.32.1 #10222

Open

savagemindz opened this issue Jan 24, 2025 · 1 comment

Bug Report

2025-01-24T18:04:21.623Z ERROR hello failed {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"", "endpoint": "discovery.talos.dev:443"}

Description

Hi all,

I am sure I am doing something stupid but figured I would open this in case it is a bug (or you can educate me).

I have been playing with Talos in a lab, using Terraform to provision it on a small Proxmox cluster. All the VMs get created fine and I can bootstrap the first control plane node, but all nodes then fail to connect to discovery.talos.dev. The logs fill with:

2025-01-24T18:04:21.623Z ERROR hello failed {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"", "endpoint": "discovery.talos.dev:443"}
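
As a sanity check on the handshake itself (rather than plain reachability), something like the following can be run from a machine on the same network; treat it as a rough sketch:

# Test the TLS handshake to the discovery endpoint directly from a LAN host.
# If this completes cleanly, the path to discovery.talos.dev:443 itself is fine
# and the failure is more likely node-side (clock skew, MTU, or an intercepting proxy).
openssl s_client -connect discovery.talos.dev:443 -servername discovery.talos.dev </dev/null

# Same idea with curl, which also shows the certificate chain that was negotiated.
curl -v https://discovery.talos.dev/ -o /dev/null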

At this point the cluster sort of builds, but my call to talosctl health shows that each node can only see itself.

talosctl health -n k8s-cp-1
discovered nodes: ["192.168.2.81"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: etcd member ips ["192.168.2.82" "192.168.2.83" "192.168.2.81"] are not subset of control plane node ips ["192.168.2.81" "2001:4d48:ad5e:e02:b8d0:ff:fe01:1"]
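
Presumably the health check builds its control plane IP list from discovery data, and each node only discovers itself (the IPv6 address looks like the node's own SLAAC address), hence the mismatch. A rough sketch for comparing what each node has actually discovered (node IPs taken from the outputs above):

# "members" is the merged discovery view; "affiliates" in the raw namespace
# shows the per-registry data before merging.
for node in 192.168.2.81 192.168.2.82 192.168.2.83; do
  echo "--- $node ---"
  talosctl --nodes "$node" get members
  talosctl --nodes "$node" get affiliates --namespace=cluster-raw
done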

Interestingly, Kubernetes does seem to form the cluster, though:

kg nodes
NAME       STATUS   ROLES           AGE   VERSION
k8s-cp-1   Ready    control-plane   19m   v1.32.1
k8s-cp-2   Ready    control-plane   17m   v1.32.1
k8s-cp-3   Ready    control-plane   18m   v1.32.1
k8s-wk-1   Ready    <none>          18m   v1.32.1
k8s-wk-2   Ready    <none>          19m   v1.32.1

There are no firewall rules preventing outbound connectivity to discovery.talos.dev:443, and a curl from a machine on the same network works fine. I have checked all the obvious things I can think of, such as NTP, network reachability, DNS, etc.
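
For completeness, the node-side equivalents of those checks can be pulled from Talos directly; a sketch, assuming the usual network resource names:

# Time sync status as the node sees it (a TLS "context deadline exceeded" is often clock skew).
talosctl --nodes 192.168.2.81 time

# DNS resolvers and NTP servers the node is actually configured with.
talosctl --nodes 192.168.2.81 get resolvers
talosctl --nodes 192.168.2.81 get timeservers

# Addresses and routes, to rule out an unexpected egress path.
talosctl --nodes 192.168.2.81 get addresses
talosctl --nodes 192.168.2.81 get routes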

If I downgrade to Kubernetes 1.31 and use the Kubernetes registry, I can get the cluster to build.

In case this matters, I am trying to build with Cilium as an inline manifest.

Anyway, let me know if you need any more info or want me to run something, but I am at a bit of a loss.

Thanks
iain

Logs

support.zip

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]

talosctl version --nodes 192.168.2.81
Client:
Tag: v1.8.3
SHA: 6494ace
Built:
Go version: go1.22.9
OS/Arch: linux/amd64
Server:
NODE: 192.168.2.81
Tag: v1.9.2
SHA: 09758b3
Built:
Go version: go1.23.4
OS/Arch: linux/amd64
Enabled: RBAC

  • Kubernetes version: [kubectl version --short]
    kubectl version
    Client Version: v1.32.0
    Kustomize Version: v5.5.0
    Server Version: v1.32.1

  • Platform: Proxmox (QEMU VMs provisioned with Terraform)

@savagemindz (Author)

Just to add to this: I booted the cluster using the Kubernetes registry and Kubernetes version 1.31.5, then edited the machine config for each node and rebooted one of them (so far). The node that rebooted still prints the "transport: authentication handshake failed: context deadline exceeded" error, and talosctl get affiliates --namespace=cluster-raw also only has entries from the Kubernetes registry; nothing appears from the discovery service. My machine config looks like this anyway:

discovery:
  enabled: true
  registries:
    kubernetes:
      disabled: false
    service:
      disabled: false
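
For what it's worth, this is roughly how I would confirm the change is actually in the running config on each node; a sketch, assuming the discovery settings sit under cluster.discovery in the machine config:

# Check the live machine config on each node for the discovery section.
for node in 192.168.2.81 192.168.2.82 192.168.2.83; do
  echo "--- $node ---"
  talosctl --nodes "$node" get machineconfig -o yaml | grep -A 8 'discovery:'
done

# Or edit the live config in place rather than editing files and rebooting
# (whether the discovery change applies without a reboot I have not confirmed).
talosctl --nodes 192.168.2.81 edit machineconfig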

One other thing that I thought might be relevant: the cluster is using an image generated by https://factory.talos.dev/, with these additional extensions added to it (a rough schematic sketch follows the list).

  • "iscsi-tools",
  • "qemu-guest-agent",
  • "util-linux-tools"

Thanks
iain
