Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coodirnator: optimize the detectiong timeout for ip conflict and gateway detection #4504

Merged
merged 1 commit into from
Jan 3, 2025

Conversation

cyclinder
Copy link
Collaborator

@cyclinder cyclinder commented Jan 2, 2025

Thanks for contributing!

Notice:

What issue(s) does this PR fix:

Fixes #

Special notes for your reviewer:

 Event occurred message is Warning/Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b5997950cb8d4b2111f2f6bca9aa1c2628d0bdc6a457be264ab04b5ae5ebabf6": plugin type="multus" name="multus-cni-network" failed (add): [ns-974abd94d9/dep-name-32a9f8b13b-66d55cddff-s94q8/4ca679f5-082d-4d70-b106-f22f735c2c61:test-multus-265db25b27]: error adding container to network "test-multus-265db25b27": plugin type="coordinator" failed (add): failed to checking ip 172.200.0.68 if it's conflicting: context canceled 
  • fix timeout doesn't works for ip and gateway detection
  • Increase the timeout to 100 ms.

detection options:

  1. retries: the number of retries, i.e., the number of times arp/ndp was sent
  2. interval: interval between each transmission
  3. timeout: timeout for performing detection, default 100ms, e.g. detecting IP conflicts for ipv4 is 100ms.
  • 修复 coordinator 网关检测/ IP 冲突的超时时间 timeout 不生效
  • 调整 timeout 从 100 ms 到 200 ms

检测配置:

  1. retries: 表示重试次数,即发送 arp/ndp 报文的个数,默认为 3
  2. interval: 每次发送间隔, 遗弃
  3. timeout: 用于每个报文的发送及响应的超时时间,默认 100ms

@cyclinder cyclinder added release/bug cherrypick-release-v0.8 Cherry-pick the PR to branch release-v0.8. cherrypick-release-v0.9 cherrypick-release-v1.0 Cherry-pick the PR to branch release-v1.0. labels Jan 2, 2025
Copy link

codecov bot commented Jan 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.22%. Comparing base (aea3599) to head (34cead5).
Report is 2 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4504      +/-   ##
==========================================
+ Coverage   79.08%   79.22%   +0.14%     
==========================================
  Files          55       55              
  Lines        6389     6389              
==========================================
+ Hits         5053     5062       +9     
+ Misses       1132     1122      -10     
- Partials      204      205       +1     
Flag Coverage Δ
unittests 79.22% <ø> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 1 file with indirect coverage changes

@cyclinder cyclinder force-pushed the coordinator/detecte branch from 4d7487a to 0da4740 Compare January 2, 2025 06:37
@cyclinder cyclinder requested a review from windsonsea as a code owner January 2, 2025 06:37
@cyclinder cyclinder force-pushed the coordinator/detecte branch from 0da4740 to cad0a69 Compare January 2, 2025 07:17
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
err = client.SetReadDeadline(time.Now().Add(dg.interval))
Copy link
Collaborator

@weizhoublue weizhoublue Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里逻辑 我觉得是可以优化的
(1)为什么有了 SetReadDeadline , 又来 time.NewTicker , 两重超时控制。
(2)retries 之间可以优化为 并行?而不是 串行。 retries 是解决 避免 丢包问题,并行 , 在满足该需求下,还能让每个 响应报文的等待时间 扩大。
因此,只需要总的 arp 探测时间 和 并发多少个 探测 即可,时间是固定的。没必要 是 retries * interval 来增加 时间成本
(3)如果 interval 10ms 控制了每次 arp 时间, 3 次 arp,为什么还要额外有一个 context 500ms 超时 控制 ?似乎这 500ms 要么时间太长 起不到作用,要么 真正的 arp 10ms 检测 得不到保障, golang 自己 调度太慢 导致 500ms 需要介入 。因此,我认为 500ms 是一个 多余的 超时控制

所以,这里有了 3 个 超时 控制
是否最终应该是 优化为 3 个 arp 并行 + 每个 arp 100ms, 就解决问题了

Copy link
Collaborator Author

@cyclinder cyclinder Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. 这里可能 并不适合 并发发送报文探测,某一次丢包说明这个时刻网络有些不稳定,这时候并发 N 个 包可能都会丢,而且并发增加代码复杂度(特别是协程内切换 netns),单为了省时并发并没有意义
  2. 这里可能确实需要优化下,detection options 三个参数:
  • retry: 重试次数,假设某个时刻发送的包丢了,我可以再等待若干时间重试
  • interval: 重试间隔
  • timeout: 超时时间,控制整个探测过程可接受的最大时间,但这里应该比理论上再宽裕点,因为 golang 代码 协程调度等也是需要一定时间的

所以优化后的代码应该是:

  1. 不同重试依然是串行,单次探测不设置超时时间,通过总的超时时间控制

或者:

  1. 不同重试依然是串行,单次探测设置超时时间,不设置总的超时时间控制,也不用关心 golang runtime 本身的耗时,总的超时时间为: retry * interval。但这样可能有个问题:某次的 探测会一直 block 住,所以还是需要 timeout, 表示某次探测的 timeout 或 整个探测过程的 timeout

Copy link
Collaborator

@weizhoublue weizhoublue Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果网络不稳定,串行间隔 多少ms 的意义貌似不大,这里 主要靠 发送的数量来解决。我相信 这个 代码 你能 解决,更多的是 从 方案优化 角度 讨论

Copy link
Collaborator

@weizhoublue weizhoublue Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

除非 interval 超时控制有 bug , 否则 retry + interval 已经决定了 arp 超时, 本质 timeout 超时是多此一举 ,或者 是 用户 不关心的,用户 没法 量化 或者 也不关心 golang 的调度 ,用户甚至 不关心 一次 add ip 的 调度耗时,用户也不关心 interval 以外的 耗时
理论上,不应有 timeout ,我们都不关心 整个 二进制的 总耗时,为什么在 arp 这里 需要特别 细化 它的 总超时

Copy link
Collaborator

@weizhoublue weizhoublue Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因此, 2 次 arp + 单次 50ms-100ms 超时,是 用户合理的感知 设置
至于 串行 和 并行,是另一个 优先级低的问题,如果是串行 ,我建议 2 次arp 即可

Copy link
Collaborator Author

@cyclinder cyclinder Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

间隔串行间隔 多少ms 也要比并行好些,但如果 主要靠 发送的数量来解决,也不需要什么 间隔了,直接 send 若干数量的 报文 + 和在指定的 timeout 内能接收到 reply 就足够了

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

间隔串行间隔 多少ms 也要比并行好些,但如果 主要靠 发送的数量来解决,也不需要什么 间隔了,直接 send 若干数量的 报文 + 指定的 timeout 就足够了

就是这个意思, 多个报文数 + 超时 即可 ,并不需要多轮

Copy link
Collaborator

@weizhoublue weizhoublue Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因此,用户输入参数 可以优化下

			Interval: "10ms", // 废弃
 			TimeOut:  "100ms",  // 一次超时,SetReadDeadline
 			reties:    3,      // 可有可无,一次的请求数

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reties 怎么是可有可无? 感觉还是需要的,默认还是 3?

// Was the error caused by a read timeout, and should the loop continue?
if neterr, ok := err.(net.Error); ok && neterr.Timeout() {
ipc.logger.Error(err.Error())
select {
Copy link
Collaborator

@weizhoublue weizhoublue Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同理, 探测 是可以并行的 ?

@@ -41,7 +41,7 @@ EOF
| podRPFilter | 设置 Pod 的 sysctl 参数 rp_filter | 整数型 | optional | 0 |
| hostRPFilter | (遗弃)设置节点 的 sysctl 参数 rp_filter | 整数型 | optional | 0 |
| txQueueLen | 设置 Pod 的网卡传输队列 | 整数型 | optional | 0 |
| detectOptions | 检测地址冲突和网关可达性的高级配置项: 包括重试次数(默认为 3 次), 探测间隔(默认为 10ms) 和 超时时间(默认为 100ms) | 对象类型 | optional | 空 |
| detectOptions | 检测地址冲突和网关可达性的高级配置项: 包括重试次数(默认为 3 次), 探测间隔(默认为 10ms) 和 超时时间(默认为 500ms) | 对象类型 | optional | 空 |
Copy link
Collaborator

@weizhoublue weizhoublue Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1)这里描述有歧义, 3 次 探测, 每次 探测间隔 10ms (是一次探测的超时 10ms),那么 应该共开销 3*10ms 。那么,超时时间(默认为 500ms) 又是如何出来的,这个 500ms 是否还有 意义么

@cyclinder cyclinder force-pushed the coordinator/detecte branch 3 times, most recently from ec4a945 to 6531af5 Compare January 2, 2025 11:00
@@ -74,27 +74,21 @@ func (dg *DetectGateway) ArpingOverIface() error {
gwNetIP := netip.MustParseAddr(dg.V4Gw.String())
var gwHwAddr net.HardwareAddr
for i := 0; i < dg.retries; i++ {
Copy link
Collaborator

@weizhoublue weizhoublue Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

参考源码,似乎是可以这么调用,完成 多次发送

client.SetDeadline

for  i := 0; i < dg.retries; i++  {
	err := c.Request(ip)
	if err != nil {
		return nil, err
	}
}

// Loop and wait for replies
for {
		arp, _, err := c.Read()
		if err != nil {
			return nil, err
		}

		if arp.Operation != OperationReply || arp.SenderIP != ip {
			continue
		}

		return arp.SenderHardwareAddr, nil
}

Copy link
Collaborator Author

@cyclinder cyclinder Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

@weizhoublue weizhoublue Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good job
貌似很多地方调用 这一段,可以 封装个 函数,会 好管理

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emm,ip 冲突那边还不太一样,那边需要构建 packet

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个统一的 api ,完成探测, 它可以有入参数,源 MAC, 是 用 4 个 0 ,还是 用 网卡自身的

@cyclinder cyclinder force-pushed the coordinator/detecte branch from 6531af5 to a5ee454 Compare January 2, 2025 11:59
@@ -274,8 +274,8 @@ func validateRPFilterConfig(rpfilter *int32, coordinatorConfig int64) (*int32, e
func ValidateDelectOptions(config *DetectOptions) (*DetectOptions, error) {
if config == nil {
return &DetectOptions{
Interval: "1s",
TimeOut: "3s",
Interval: "10ms",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

次版本中, interval 的 相关代码 可以 去除了,doc 中 可以标记 deprecated

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

本 pr docs 标记 deprecated,后续版本移除 代码

@cyclinder cyclinder force-pushed the coordinator/detecte branch from a5ee454 to 99eb660 Compare January 2, 2025 12:07
@cyclinder cyclinder force-pushed the coordinator/detecte branch from 99eb660 to 34cead5 Compare January 2, 2025 12:24
@weizhoublue weizhoublue changed the title coodirnator: fix timeout doesn't works for ip and gateway detection coodirnator: optimize the detectiong timeout for ip conflict and gateway detection Jan 3, 2025
@weizhoublue weizhoublue merged commit 5d25b7d into spidernet-io:main Jan 3, 2025
58 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 3, 2025
coodirnator: optimize the detectiong timeout  for ip conflict and gateway detection
Signed-off-by: robot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherrypick-release-v0.8 Cherry-pick the PR to branch release-v0.8. cherrypick-release-v0.9 cherrypick-release-v1.0 Cherry-pick the PR to branch release-v1.0. release/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants