-
I've got a K3S cluster with 3 masters and 3 workers that's been running just fine for the past several months. However, my first etcd master node (node1) is spamming my journalctl log multiple times a second with this pair of similar entries:
The same error does not occur on the other two masters. Other than the spamming, everything appears to be fine. I can use etcdctl to change the leader to/from any of the 3 nodes. Bringing down any one node has no ill effect on the overall cluster.

I've tried deleting and rejoining the node from the cluster multiple times. I've re-imaged the node (but kept the same computer name). I've tried compacting and defragging the database multiple times (a typical command sequence is sketched after the environmental info below). I've removed two of the three masters and re-joined them. I've tried backing up and restoring the etcd database while initiating a cluster-reset. But no matter what I do, as soon as I bring node1 back online, the errors start spamming on that node.

The only thing that changes in the error message is the local-member-id. The interesting thing is that the IDs for remote-peer-cluster-id, remote-peer-server-name, and local-member-cluster-id never change, even though those nodes have been removed/rejoined several times and have new IDs. It seems as though there is some stale info in the database that I have no idea how to get rid of.

Again, everything else seems fine, except for the log spam on node1. How can I clean up the etcd database to get rid of these old entries (assuming that's the issue here)?

Node endpoint status:
Environmental Info:
- k3s version: v1.27.5+k3s1 (k3s-io/k3s@8d074ec)
- Node(s) CPU architecture, OS, and Version: Linux node1 5.15.0-84-generic #93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Cluster Configuration: 3 servers
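For reference, here is a minimal sketch of what the compaction/defrag steps mentioned above typically look like. The endpoint and certificate paths are the usual k3s defaults and are assumptions on my part; adjust them for your install, and note this is illustrative rather than the exact commands run here.

```bash
# Sketch: compact etcd to its current revision, then defragment all members.
# Cert paths below are the usual k3s defaults (an assumption); requires
# etcdctl and jq on the server node.
export ETCDCTL_API=3
ETCD=(etcdctl
  --endpoints=https://127.0.0.1:2379
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key)

# Current revision of the local endpoint.
rev=$("${ETCD[@]}" endpoint status --write-out=json | jq -r '.[0].Status.header.revision')

# Compact the keyspace up to that revision, then defragment every member.
"${ETCD[@]}" compaction "$rev"
"${ETCD[@]}" defrag --cluster
```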
-
Please open this in the k3s repo!
-
Could you generate a report using etcd-diagnosis? Example command:
The new warning log makes a lot of sense. It means an unknown etcd instance is trying to connect to a member (89a79bc17f234c01) of the cluster (398dad8ab81b9249). You need to find the unknown etcd instance and either shut it down or correct its configuration.
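If it helps, here is a rough sketch of how to compare the IDs in the warning against what the cluster actually knows about. The endpoint and cert paths are the usual k3s defaults and are assumptions; adjust them for your setup.

```bash
# Sketch: list the member IDs etcd currently knows about and the live
# endpoint status, then compare them with the local-member-id and
# remote-peer-* IDs printed in node1's warning. A peer ID that appears
# in the log but not in this output points at the stale/unknown instance.
ETCD=(etcdctl
  --endpoints=https://127.0.0.1:2379
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key)

"${ETCD[@]}" member list -w table                 # member IDs, names, peer URLs
"${ETCD[@]}" endpoint status --cluster -w table   # per-member IDs, leader, DB size
```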