coodirnator: optimize the detectiong timeout for ip conflict and gateway detection #4504

cyclinder · 2025-01-02T03:33:39Z

Thanks for contributing!

Notice:

unit test or E2E test
do not forget essential code comment and log
document for the PR
release note label
"release/none"
"release/bug"
"release/feature"
read about Contribution notice: https://spidernet-io.github.io/spiderpool/latest/develop/contributing/

What issue(s) does this PR fix:

Fixes #

Special notes for your reviewer:

 Event occurred message is Warning/Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b5997950cb8d4b2111f2f6bca9aa1c2628d0bdc6a457be264ab04b5ae5ebabf6": plugin type="multus" name="multus-cni-network" failed (add): [ns-974abd94d9/dep-name-32a9f8b13b-66d55cddff-s94q8/4ca679f5-082d-4d70-b106-f22f735c2c61:test-multus-265db25b27]: error adding container to network "test-multus-265db25b27": plugin type="coordinator" failed (add): failed to checking ip 172.200.0.68 if it's conflicting: context canceled

Fix the timeout not taking effect for IP conflict and gateway detection.
Increase the timeout to 100 ms.

Detection options:

retries: the number of retries, i.e. the number of ARP/NDP packets sent
interval: the interval between transmissions
timeout: the timeout for performing detection, default 100 ms; e.g. IPv4 IP conflict detection times out after 100 ms.

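Fix the coordinator gateway detection / IP conflict detection timeout not taking effect
Adjust the timeout from 100 ms to 200 ms

Detection configuration:

retries: the number of retries, i.e. the number of ARP/NDP packets sent; default 3
interval: the interval between sends; deprecated
timeout: the timeout for sending each packet and receiving its reply; default 100 ms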
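
As a rough illustration of the options described above, the sketch below fills in the stated defaults (3 retries, 100 ms timeout) and parses the timeout string. The type and function names (`DetectOptions` fields, `resolved`, `resolveDetectOptions`) are assumptions made for this example and are not taken from the PR's code; the deprecated `Interval` is accepted but ignored.

```go
package main

import (
	"fmt"
	"time"
)

// DetectOptions is a hypothetical mirror of the string-typed user config.
type DetectOptions struct {
	Retries  int
	Interval string // deprecated: accepted but ignored below
	TimeOut  string
}

// resolved holds the parsed values a detector would actually use.
type resolved struct {
	Retries int
	Timeout time.Duration
}

const (
	defaultRetries = 3
	defaultTimeout = 100 * time.Millisecond
)

// resolveDetectOptions applies defaults and parses the timeout string.
func resolveDetectOptions(cfg *DetectOptions) (resolved, error) {
	out := resolved{Retries: defaultRetries, Timeout: defaultTimeout}
	if cfg == nil {
		return out, nil
	}
	if cfg.Retries > 0 {
		out.Retries = cfg.Retries
	}
	if cfg.TimeOut != "" {
		d, err := time.ParseDuration(cfg.TimeOut)
		if err != nil {
			return resolved{}, fmt.Errorf("invalid timeout %q: %w", cfg.TimeOut, err)
		}
		out.Timeout = d
	}
	return out, nil
}

func main() {
	r, err := resolveDetectOptions(&DetectOptions{TimeOut: "200ms"})
	fmt.Println(r.Retries, r.Timeout, err) // 3 200ms <nil>
}
```

Passing `nil` yields the defaults, so callers that never set detection options get 3 packets and a 100 ms wait.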

codecov · 2025-01-02T03:36:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.22%. Comparing base (aea3599) to head (34cead5).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4504      +/-   ##
==========================================
+ Coverage   79.08%   79.22%   +0.14%     
==========================================
  Files          55       55              
  Lines        6389     6389              
==========================================
+ Hits         5053     5062       +9     
+ Misses       1132     1122      -10     
- Partials      204      205       +1

Flag	Coverage Δ
unittests	`79.22% <ø> (+0.14%)`	⬆️

see 1 file with indirect coverage changes

weizhoublue · 2025-01-02T07:57:42Z

pkg/networking/gwconnection/connection.go

+		case <-ctx.Done():
+			return ctx.Err()
+		case <-ticker.C:
+			err = client.SetReadDeadline(time.Now().Add(dg.interval))


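I think the logic here can be optimized.
(1) Why is there a time.NewTicker on top of SetReadDeadline? That is two layers of timeout control.
(2) Could the retries run in parallel rather than serially? Retries exist to guard against packet loss; running them in parallel still meets that need and also gives each reply a longer window to arrive.
So only the total ARP probe time and the number of concurrent probes are needed, and the time is fixed. There is no need to add time cost as retries * interval.
(3) If interval (10 ms) bounds each ARP attempt and there are 3 attempts, why is an extra 500 ms context timeout needed? Either the 500 ms is too long to ever take effect, or the 10 ms ARP check is not actually guaranteed and Go scheduling is slow enough that the 500 ms has to step in. So I consider the 500 ms a redundant timeout.

So there are 3 timeout controls here.
Should the end result be 3 ARP probes in parallel, each with a 100 ms timeout? That would solve the problem.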

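Sending probe packets concurrently may not be a good fit here. A dropped packet indicates the network is unstable at that moment, and N packets sent concurrently at that moment may all be dropped. Concurrency also adds code complexity (especially switching netns inside goroutines), so going concurrent just to save time is not worthwhile.

This probably does need some optimization. The detection options have three parameters:

retry: the number of retries; if a packet sent at some moment is lost, I can wait a while and retry

interval: the interval between retries

timeout: the timeout that bounds the maximum acceptable time for the whole detection process. It should be somewhat more generous than the theoretical value, because Go goroutine scheduling and the like also take time.

So the optimized code should be:

Retries stay serial, individual probes have no timeout of their own, and the overall timeout bounds everything.

Or:

Retries stay serial, each probe has its own timeout, and there is no overall timeout, so the Go runtime's own overhead does not matter; the total timeout is retry * interval. One possible problem: a single probe could block forever, so a timeout is still needed, meaning either a per-probe timeout or a timeout for the whole detection process.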
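
The first option in this comment (serial retries, no per-probe timer, one overall timeout) could be sketched roughly as follows. This is a simplified stand-in, not the PR's code: `probeFunc`, `detectSerial`, and the two `simulate` helpers are hypothetical names, and the probe is a stub rather than a real ARP/NDP exchange.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errNoReply = errors.New("no reply")

// probeFunc stands in for one send-and-wait; it must honor ctx cancellation.
type probeFunc func(ctx context.Context) error

// detectSerial runs up to retries probes one after another. The only timer is
// the overall timeout on ctx. It returns the number of probes started.
func detectSerial(parent context.Context, retries int, timeout time.Duration, probe probeFunc) (int, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	var lastErr error
	for attempt := 1; attempt <= retries; attempt++ {
		// Stop before starting a probe once the overall budget is spent.
		if err := ctx.Err(); err != nil {
			return attempt - 1, err
		}
		if lastErr = probe(ctx); lastErr == nil {
			return attempt, nil
		}
	}
	return retries, lastErr
}

// simulate fails the first `failures` probes, as if those packets were lost.
func simulate(failures int) (int, error) {
	calls := 0
	probe := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errNoReply
		}
		return nil
	}
	return detectSerial(context.Background(), 3, 100*time.Millisecond, probe)
}

// simulateBlocked uses a probe that never gets a reply and waits on ctx.
func simulateBlocked() (int, error) {
	probe := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	return detectSerial(context.Background(), 3, 20*time.Millisecond, probe)
}

func main() {
	fmt.Println(simulate(2))       // 3 <nil>
	fmt.Println(simulateBlocked()) // 1 context deadline exceeded
}
```

The blocked case shows the concern raised in the second option: without some deadline, a single probe that never returns would hold up the whole detection.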

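If the network is unstable, a serial interval of a few ms does not seem to matter much; this is mainly solved by the number of packets sent. I trust you can handle the code; I am discussing this more from a design-optimization angle.

Unless the interval timeout has a bug, retry + interval already determine the ARP timeout, so the timeout is essentially redundant, or something users do not care about. Users cannot quantify Go scheduling and do not care about it; they do not even care how long one add-IP call takes, nor about any time spent outside interval.
In theory there should be no timeout. We do not care about the total run time of the whole binary, so why break out a total timeout specifically for ARP?

So 2 ARP probes with a 50 ms to 100 ms timeout each is a setting users can reasonably relate to.
Serial versus parallel is a separate, lower-priority question. If it stays serial, I suggest 2 ARP probes are enough.

Serial sends with some interval between them are still better than parallel. But if this is mainly solved by the number of packets sent, no interval is needed at all: just send a number of packets and check whether a reply arrives within the specified timeout.

Serial sends with some interval between them are still better than parallel. But if this is mainly solved by the number of packets sent, no interval is needed at all: sending a number of packets plus a specified timeout is enough.

That is exactly what I mean: multiple packets plus a timeout, with no need for multiple rounds.

So the user-facing parameters can be adjusted:

Interval: "10ms", // deprecated
TimeOut: "100ms", // timeout for one round, applied via SetReadDeadline
retries: 3, // optional; the number of requests in one round

Why would retries be optional? It still seems necessary; keep the default at 3?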

weizhoublue · 2025-01-02T07:59:32Z

pkg/networking/ipchecking/ipchecking.go

-			// Was the error caused by a read timeout, and should the loop continue?
-			if neterr, ok := err.(net.Error); ok && neterr.Timeout() {
-				ipc.logger.Error(err.Error())
+		select {


Likewise, could the probes run in parallel here?

weizhoublue · 2025-01-02T08:02:32Z

docs/concepts/coordinator-zh_CN.md

@@ -41,7 +41,7 @@ EOF
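 | podRPFilter | Sets the Pod's sysctl parameter rp_filter | integer | optional | 0 |
 | hostRPFilter | (Deprecated) Sets the node's sysctl parameter rp_filter | integer | optional | 0 |
 | txQueueLen | Sets the transmit queue of the Pod's NIC | integer | optional | 0 |
-| detectOptions | Advanced options for detecting address conflicts and gateway reachability: retries (default 3), probe interval (default 10ms), and timeout (default 100ms) | object | optional | empty |
+| detectOptions | Advanced options for detecting address conflicts and gateway reachability: retries (default 3), probe interval (default 10ms), and timeout (default 500ms) | object | optional | empty |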


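(1) The description here is ambiguous. With 3 probes and a 10 ms interval per probe (i.e. each probe times out after 10 ms), the total cost should be 3 * 10 ms. So where does the timeout (default 500 ms) come from, and does that 500 ms still mean anything?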

weizhoublue · 2025-01-02T11:58:20Z

pkg/networking/gwconnection/connection.go

@@ -74,27 +74,21 @@ func (dg *DetectGateway) ArpingOverIface() error {
 	gwNetIP := netip.MustParseAddr(dg.V4Gw.String())
 	var gwHwAddr net.HardwareAddr
 	for i := 0; i < dg.retries; i++ {


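Going by the source code, it looks like it can be called this way to send multiple requests:

client.SetDeadline
for i := 0; i < dg.retries; i++ {
	err := c.Request(ip)
	if err != nil {
		return nil, err
	}
}

// Loop and wait for replies
for {
	arp, _, err := c.Read()
	if err != nil {
		return nil, err
	}

	if arp.Operation != OperationReply || arp.SenderIP != ip {
		continue
	}

	return arp.SenderHardwareAddr, nil
}

good job
This snippet seems to be called from many places; wrapping it in a function would make it easier to maintain.

Hmm, the IP conflict side is a bit different: it needs to build the packet itself.

This unified API, which performs the probe, could take the source MAC as a parameter: whether to use all zeros or the NIC's own address.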
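
A shared helper along the lines suggested in this thread might look like the sketch below: set one deadline, send all requests back to back, then read until a matching reply arrives or the deadline expires. The `arpClient` interface, `resolveOnce`, and the in-memory `fakeClient` are assumptions made so the example runs without a network; they are not the API of any real ARP library, and the source-MAC parameter discussed above is left out.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/netip"
	"time"
)

var errTimeout = errors.New("read deadline exceeded")

// arpClient is a hypothetical interface covering the calls the snippet needs.
type arpClient interface {
	SetDeadline(t time.Time) error
	Request(ip netip.Addr) error
	ReadReply() (sender netip.Addr, mac net.HardwareAddr, err error)
}

// resolveOnce sends `requests` probes under a single deadline and returns the
// hardware address from the first reply whose sender is ip.
func resolveOnce(c arpClient, ip netip.Addr, requests int, timeout time.Duration) (net.HardwareAddr, error) {
	if err := c.SetDeadline(time.Now().Add(timeout)); err != nil {
		return nil, err
	}
	for i := 0; i < requests; i++ {
		if err := c.Request(ip); err != nil {
			return nil, err
		}
	}
	for {
		sender, mac, err := c.ReadReply()
		if err != nil {
			return nil, err // includes the deadline expiring
		}
		if sender != ip {
			continue // reply from some other host; keep waiting
		}
		return mac, nil
	}
}

type reply struct {
	ip  netip.Addr
	mac net.HardwareAddr
}

// fakeClient replays queued replies and reports a timeout once they run out.
type fakeClient struct {
	sent    int
	replies []reply
}

func (f *fakeClient) SetDeadline(time.Time) error { return nil }
func (f *fakeClient) Request(netip.Addr) error    { f.sent++; return nil }
func (f *fakeClient) ReadReply() (netip.Addr, net.HardwareAddr, error) {
	if len(f.replies) == 0 {
		return netip.Addr{}, nil, errTimeout
	}
	r := f.replies[0]
	f.replies = f.replies[1:]
	return r.ip, r.mac, nil
}

// demo queues one reply per address in replyFrom, then resolves target.
func demo(target string, replyFrom ...string) (string, int, error) {
	f := &fakeClient{}
	for i, s := range replyFrom {
		mac := net.HardwareAddr{2, 0, 0, 0, 0, byte(i + 1)}
		f.replies = append(f.replies, reply{netip.MustParseAddr(s), mac})
	}
	hw, err := resolveOnce(f, netip.MustParseAddr(target), 3, 100*time.Millisecond)
	return hw.String(), f.sent, err
}

func main() {
	mac, sent, err := demo("172.200.0.1", "172.200.0.9", "172.200.0.1")
	fmt.Println(mac, sent, err) // 02:00:00:00:00:02 3 <nil>
}
```

For gateway detection a matching reply means the gateway is reachable; for IP conflict detection the same reply would mean the address is already in use, which is one reason the two callers may need different wrappers around a helper like this.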

weizhoublue · 2025-01-02T12:01:37Z

cmd/coordinator/cmd/cni_types.go

@@ -274,8 +274,8 @@ func validateRPFilterConfig(rpfilter *int32, coordinatorConfig int64) (*int32, e
 func ValidateDelectOptions(config *DetectOptions) (*DetectOptions, error) {
 	if config == nil {
 		return &DetectOptions{
-			Interval: "1s",
-			TimeOut:  "3s",
+			Interval: "10ms",


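In a minor release, the interval-related code can be removed; the docs can mark it as deprecated.

This PR marks it as deprecated in the docs; the code will be removed in a later release.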

Signed-off-by: Cyclinder Kuo <[email protected]>

coodirnator: optimize the detectiong timeout for ip conflict and gateway detection Signed-off-by: robot <[email protected]>

cyclinder added the release/bug, cherrypick-release-v0.8, cherrypick-release-v0.9, and cherrypick-release-v1.0 labels Jan 2, 2025

cyclinder requested review from weizhoublue and lou-lan as code owners January 2, 2025 03:33

cyclinder force-pushed the coordinator/detecte branch from 4d7487a to 0da4740 Compare January 2, 2025 06:37

cyclinder requested a review from windsonsea as a code owner January 2, 2025 06:37

cyclinder force-pushed the coordinator/detecte branch from 0da4740 to cad0a69 Compare January 2, 2025 07:17

weizhoublue reviewed Jan 2, 2025

cyclinder force-pushed the coordinator/detecte branch 3 times, most recently from ec4a945 to 6531af5 Compare January 2, 2025 11:00

weizhoublue reviewed Jan 2, 2025

cyclinder force-pushed the coordinator/detecte branch from 6531af5 to a5ee454 Compare January 2, 2025 11:59

weizhoublue reviewed Jan 2, 2025

cyclinder force-pushed the coordinator/detecte branch from a5ee454 to 99eb660 Compare January 2, 2025 12:07

coodirnator: set timeout for ip and gateway detection

34cead5

Signed-off-by: Cyclinder Kuo <[email protected]>

cyclinder force-pushed the coordinator/detecte branch from 99eb660 to 34cead5 Compare January 2, 2025 12:24

weizhoublue approved these changes Jan 3, 2025

weizhoublue changed the title ~~coodirnator: fix timeout doesn't works for ip and gateway detection~~ coodirnator: optimize the detectiong timeout for ip conflict and gateway detection Jan 3, 2025

weizhoublue merged commit 5d25b7d into spidernet-io:main Jan 3, 2025
58 checks passed

This was referenced Jan 3, 2025

failed to cherry pick PR 4504 from cyclinder, to branch release-v0.8 #4508

Closed

failed to cherry pick PR 4504 from cyclinder, to branch release-v0.9 #4509

Closed

github-actions bot pushed a commit that referenced this pull request Jan 3, 2025

Merge pull request #4504 from cyclinder/coordinator/detecte

76cfddc

coodirnator: optimize the detectiong timeout for ip conflict and gateway detection Signed-off-by: robot <[email protected]>

weizhoublue mentioned this pull request Jan 3, 2025

coodirnator: optimize the detectiong timeout for ip conflict and gateway detection #4510

Merged
