Worker joins external discovery service but not control plane (pre K8S bootstrap) #10184

Open
JarleB opened this issue Jan 21, 2025 · 7 comments


JarleB commented Jan 21, 2025

Bug Report

Worker is unable to register via the Talos API on the control plane node after installing from ISO.

Description

Using metal-arm64.iso from https://github.com/siderolabs/talos/releases (1.9.2) to boot UTM [1] VMs on the Mac. UTM is configured to use a shared network, so all VMs attached to bridge100 (created when the first VM boots) have full access to each other within the CIDR 192.168.64.0/24.

  • generating a new config with talosctl gen config foobar https://192.168.64.3:6443
  • replacing disk sda with vda in both the control plane and worker config files
  • booting control plane VM with ISO, having one disk (/dev/vda) attached (3GB).
  • applying control plane config to control plane node using talosctl from macOS/homebrew (v1.9.2)
    • node pulls image from registry and reboots
    • node comes up as expected booting from vda with Talos API available from macOS
  • booting worker node with same iso and disk setup
  • applying worker config to worker nodes having the same disk config as the control node (vda 3GB)
    • node pulls image from registry and reboots
    • node comes up booting from vda and registering with the discovery service but is unable to join the control plane node
    • node is no longer reachable directly via the Talos API, nor via the control plane node as the install guide describes
  • added an additional Ubuntu VM on the same network to verify that port 50000 is reachable from other VMs on the same subnet (both worker and control plane, when booted from ISO), and not only from the macOS host
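
The disk change in the second step can be scripted rather than edited by hand. A minimal sketch, assuming the generated files still contain the default `disk: /dev/sda` line that `talosctl gen config` writes; the function name `set_install_disk` is hypothetical:

```shell
# Hypothetical helper: point the installer at a different disk in a
# generated Talos machine config. Assumes the file contains a line of the
# form "disk: /dev/sda", as produced by `talosctl gen config`.
set_install_disk() {
  file="$1"      # e.g. controlplane.yaml
  new_disk="$2"  # e.g. /dev/vda
  # Write to a temp file and move it into place; this avoids `sed -i`,
  # whose syntax differs between macOS and GNU sed.
  sed "s#disk: /dev/sda#disk: ${new_disk}#" "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}

# Usage, from the directory holding the generated configs:
#   set_install_disk controlplane.yaml /dev/vda
#   set_install_disk worker.yaml /dev/vda
#   grep vda controlplane.yaml worker.yaml
```

The trailing comment on the `disk:` line is left untouched, so the result should match what a manual edit produces.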

Logs

~ talosctl gen config foobar https://192.168.64.3:6443
generating PKI and tokens
Created /Users/t992596/git/lab/foobar/controlplane.yaml
Created /Users/t992596/git/lab/foobar/worker.yaml
Created /Users/t992596/git/lab/foobar/talosconfig

~ vi controlplane.yaml

~ vi worker.yaml

~ grep vda *
controlplane.yaml:        disk: /dev/vda # The disk used for installations.
worker.yaml:        disk: /dev/vda # The disk used for installations.

~ # booting control plane

~ talosctl apply-config --insecure -n 192.168.64.3 --file controlplane.yaml

~ talosctl --talosconfig=./talosconfig  -e 192.168.64.3 --nodes 192.168.64.3 version
Client:
	Tag:         v1.9.2
	SHA:         undefined
	Built:       2025-01-16T15:13:01Z
	Go version:  go1.23.4
	OS/Arch:     darwin/arm64
Server:
	NODE:        192.168.64.3
	Tag:         v1.9.2
	SHA:         09758b3f
	Built:
	Go version:  go1.23.4
	OS/Arch:     linux/arm64
	Enabled:     RBAC

~ talosctl --talosconfig=./talosconfig  -e 192.168.64.3 --nodes 192.168.64.3 get members
NODE           NAMESPACE   TYPE     ID              VERSION   HOSTNAME        MACHINE TYPE   OS               ADDRESSES
192.168.64.3   cluster     Member   talos-whn-zul   1         talos-whn-zul   controlplane   Talos (v1.9.2)   ["192.168.64.3"]

~  # booting worker

~  talosctl apply-config --insecure -n 192.168.64.5 --file worker.yaml

~ talosctl --talosconfig=./talosconfig  -e 192.168.64.3 --nodes 192.168.64.3 get members
NODE           NAMESPACE   TYPE     ID              VERSION   HOSTNAME        MACHINE TYPE   OS               ADDRESSES
192.168.64.3   cluster     Member   talos-2xf-is7   2         talos-2xf-is7   worker         Talos (v1.9.2)   ["192.168.64.5"]
192.168.64.3   cluster     Member   talos-whn-zul   1         talos-whn-zul   controlplane   Talos (v1.9.2)   ["192.168.64.3"]

~  talosctl --talosconfig=./talosconfig  -e 192.168.64.3 --nodes 192.168.64.5 version
Client:
	Tag:         v1.9.2
	SHA:         undefined
	Built:       2025-01-16T15:13:01Z
	Go version:  go1.23.4
	OS/Arch:     darwin/arm64
Server:
error getting version: 1 error occurred:
	* 192.168.64.5: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 192.168.64.5:50000: connect: connection refused"

~ talosctl --talosconfig=./talosconfig  -e 192.168.64.5 --nodes 192.168.64.5 version
Client:
	Tag:         v1.9.2
	SHA:         undefined
	Built:       2025-01-16T15:13:01Z
	Go version:  go1.23.4
	OS/Arch:     darwin/arm64
Server:
error getting version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 192.168.64.5:50000: connect: connection refused"

~ talosctl bootstrap --nodes 192.168.64.3 --endpoints 192.168.64.3 \
  --talosconfig=./talosconfig
error executing bootstrap: 1 error occurred:
	* 192.168.64.3: rpc error: code = FailedPrecondition desc = bootstrap is not available yet
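
A `connection refused` on port 50000, as in the output above, usually means the host answered but nothing was listening on that port, which points at the API service on the node rather than at the network. A minimal sketch of a probe that can be run from the macOS host or from another VM on the subnet, assuming `nc` is installed; the function name `port_open` is hypothetical:

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within 3 seconds. Relies on `nc` supporting -z (connect without
# sending data) and -w (timeout), which the common netcat variants do.
port_open() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Usage: probe the Talos API port (50000) on both nodes.
#   for host in 192.168.64.3 192.168.64.5; do
#     if port_open "$host" 50000; then
#       echo "$host:50000 open"
#     else
#       echo "$host:50000 closed or unreachable"
#     fi
#   done
```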

Screenshot of worker console (unable to cut/paste text):
Image

Environment

See above output and description

[1] https://mac.getutm.app

Member

smira commented Jan 21, 2025

The worker can't get to the controlplane to establish certificates, so apid can't start.

You can pull controlplane logs with talosctl -n <CP> dmesg, also you should be able to get the support bundle from the controlplane with talosctl -n <CP> support.
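
The two commands can be wrapped so both artifacts are collected in one go. A minimal sketch using only the flags that appear in this thread; the function name `collect_cp_diagnostics` and the `TALOSCTL` override variable are hypothetical conveniences, not talosctl features:

```shell
# Command to run; overridable so the function can be dry-run with e.g.
# TALOSCTL=echo. This variable is an illustration, not part of talosctl.
TALOSCTL="${TALOSCTL:-talosctl}"

# Hypothetical helper: save kernel logs from a control plane node to
# dmesg.log in the current directory, then request a support bundle.
collect_cp_diagnostics() {
  cp="$1"                     # control plane IP, e.g. 192.168.64.3
  conf="${2:-./talosconfig}"  # path to the generated talosconfig
  "$TALOSCTL" --talosconfig="$conf" -e "$cp" -n "$cp" dmesg > dmesg.log
  "$TALOSCTL" --talosconfig="$conf" -e "$cp" -n "$cp" support
}

# Usage:
#   collect_cp_diagnostics 192.168.64.3
```

The dmesg step runs first so that a log file exists even if the support bundle step fails.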

Author

JarleB commented Jan 21, 2025

thx @smira :

> You can pull controlplane logs with talosctl -n <CP> dmesg, also you should be able to get the support bundle from the controlplane with talosctl -n <CP> support.

True:
~ talosctl --talosconfig=./talosconfig -e 192.168.64.3 -n 192.168.64.3 dmesg > dmesg.log
dmesg.log

However:

talosctl --talosconfig=./talosconfig  -e 192.168.64.3 -n 192.168.64.3 support
Failed to create kubernetes client Get "https://192.168.64.3:6443/api/v1/namespaces/kube-system": dial tcp 192.168.64.3:6443: connect: connection refused
Support bundle is written to support.zip
rpc error: code = Unavailable desc = ListPodSandbox with filter &PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory"

support.zip is 0B

Member

smira commented Jan 21, 2025

Ok, your issue is that the disk is 3 GiB which is too small to run Kubernetes, but there's an issue with Talos mis-reporting this.
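
A quick pre-flight check can catch an undersized disk before installing. A minimal sketch, assuming a 10 GiB minimum (roughly the size that worked later in this thread; the Talos system requirements page has the authoritative figure) and that the disk size in bytes is already known, for example from the VM settings; the function name `disk_big_enough` is hypothetical:

```shell
# Hypothetical helper: succeed if a disk of size_bytes meets a minimum
# given in GiB. The 10 GiB default is an assumption based on this thread.
disk_big_enough() {
  size_bytes="$1"
  min_gib="${2:-10}"
  min_bytes=$((min_gib * 1024 * 1024 * 1024))
  [ "$size_bytes" -ge "$min_bytes" ]
}

# Usage:
#   disk_big_enough 3221225472  || echo "3 GiB: too small"
#   disk_big_enough 10737418240 && echo "10 GiB: ok"
```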

Author

JarleB commented Jan 21, 2025

> Ok, your issue is that the disk is 3 GiB which is too small to run Kubernetes, but there's an issue with Talos mis-reporting this.

Ah. Thx again! What is the minimum recommended size, and perhaps the docs should state it? (Or perhaps it is there and I missed it?)

Member

smira commented Jan 21, 2025

Author

JarleB commented Jan 21, 2025

https://www.talos.dev/v1.9/introduction/troubleshooting/#system-requirements

https://www.talos.dev/v1.9/introduction/system-requirements/

Works like a charm after attaching disks of 10 GB size.

Sorry for the noise, and thanks for the prompt help. Very much appreciated.

smira reopened this Jan 21, 2025
Member

smira commented Jan 21, 2025

There is still a small issue with reporting.
