-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coodirnator: optimize the detectiong timeout for ip conflict and gateway detection #4504
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #4504 +/- ##
==========================================
+ Coverage 79.08% 79.22% +0.14%
==========================================
Files 55 55
Lines 6389 6389
==========================================
+ Hits 5053 5062 +9
+ Misses 1132 1122 -10
- Partials 204 205 +1
Flags with carried forward coverage won't be shown. Click here to find out more. |
4d7487a
to
0da4740
Compare
0da4740
to
cad0a69
Compare
case <-ctx.Done(): | ||
return ctx.Err() | ||
case <-ticker.C: | ||
err = client.SetReadDeadline(time.Now().Add(dg.interval)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里逻辑 我觉得是可以优化的
(1)为什么有了 SetReadDeadline , 又来 time.NewTicker , 两重超时控制。
(2)retries 之间可以优化为 并行?而不是 串行。 retries 是解决 避免 丢包问题,并行 , 在满足该需求下,还能让每个 响应报文的等待时间 扩大。
因此,只需要总的 arp 探测时间 和 并发多少个 探测 即可,时间是固定的。没必要 是 retries * interval 来增加 时间成本
(3)如果 interval 10ms 控制了每次 arp 时间, 3 次 arp,为什么还要额外有一个 context 500ms 超时 控制 ?似乎这 500ms 要么时间太长 起不到作用,要么 真正的 arp 10ms 检测 得不到保障, golang 自己 调度太慢 导致 500ms 需要介入 。因此,我认为 500ms 是一个 多余的 超时控制
所以,这里有了 3 个 超时 控制
是否最终应该是 优化为 3 个 arp 并行 + 每个 arp 100ms, 就解决问题了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 这里可能 并不适合 并发发送报文探测,某一次丢包说明这个时刻网络有些不稳定,这时候并发 N 个 包可能都会丢,而且并发增加代码复杂度(特别是协程内切换 netns),单为了省时并发并没有意义
- 这里可能确实需要优化下,detection options 三个参数:
- retry: 重试次数,假设某个时刻发送的包丢了,我可以再等待若干时间重试
- interval: 重试间隔
- timeout: 超时时间,控制整个探测过程可接受的最大时间,但这里应该比理论上再宽裕点,因为 golang 代码 协程调度等也是需要一定时间的
所以优化后的代码应该是:
- 不同重试依然是串行,单次探测不设置超时时间,通过总的超时时间控制
或者:
- 不同重试依然是串行,单次探测设置超时时间,不设置总的超时时间控制,也不用关心 golang runtime 本身的耗时,总的超时时间为: retry * interval。但这样可能有个问题:某次的 探测会一直 block 住,所以还是需要 timeout, 表示某次探测的 timeout 或 整个探测过程的 timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
如果网络不稳定,串行间隔 多少ms 的意义貌似不大,这里 主要靠 发送的数量来解决。我相信 这个 代码 你能 解决,更多的是 从 方案优化 角度 讨论
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
除非 interval 超时控制有 bug , 否则 retry + interval 已经决定了 arp 超时, 本质 timeout 超时是多此一举 ,或者 是 用户 不关心的,用户 没法 量化 或者 也不关心 golang 的调度 ,用户甚至 不关心 一次 add ip 的 调度耗时,用户也不关心 interval 以外的 耗时
理论上,不应有 timeout ,我们都不关心 整个 二进制的 总耗时,为什么在 arp 这里 需要特别 细化 它的 总超时
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
因此, 2 次 arp + 单次 50ms-100ms 超时,是 用户合理的感知 设置
至于 串行 和 并行,是另一个 优先级低的问题,如果是串行 ,我建议 2 次arp 即可
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
间隔串行间隔 多少ms 也要比并行好些,但如果 主要靠 发送的数量来解决,也不需要什么 间隔了,直接 send 若干数量的 报文 + 和在指定的 timeout 内能接收到 reply 就足够了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
间隔串行间隔 多少ms 也要比并行好些,但如果 主要靠 发送的数量来解决,也不需要什么 间隔了,直接 send 若干数量的 报文 + 指定的 timeout 就足够了
就是这个意思, 多个报文数 + 超时 即可 ,并不需要多轮
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
因此,用户输入参数 可以优化下
Interval: "10ms", // 废弃
TimeOut: "100ms", // 一次超时,SetReadDeadline
reties: 3, // 可有可无,一次的请求数
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reties 怎么是可有可无? 感觉还是需要的,默认还是 3?
// Was the error caused by a read timeout, and should the loop continue? | ||
if neterr, ok := err.(net.Error); ok && neterr.Timeout() { | ||
ipc.logger.Error(err.Error()) | ||
select { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
同理, 探测 是可以并行的 ?
docs/concepts/coordinator-zh_CN.md
Outdated
@@ -41,7 +41,7 @@ EOF | |||
| podRPFilter | 设置 Pod 的 sysctl 参数 rp_filter | 整数型 | optional | 0 | | |||
| hostRPFilter | (遗弃)设置节点 的 sysctl 参数 rp_filter | 整数型 | optional | 0 | | |||
| txQueueLen | 设置 Pod 的网卡传输队列 | 整数型 | optional | 0 | | |||
| detectOptions | 检测地址冲突和网关可达性的高级配置项: 包括重试次数(默认为 3 次), 探测间隔(默认为 10ms) 和 超时时间(默认为 100ms) | 对象类型 | optional | 空 | | |||
| detectOptions | 检测地址冲突和网关可达性的高级配置项: 包括重试次数(默认为 3 次), 探测间隔(默认为 10ms) 和 超时时间(默认为 500ms) | 对象类型 | optional | 空 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1)这里描述有歧义, 3 次 探测, 每次 探测间隔 10ms (是一次探测的超时 10ms),那么 应该共开销 3*10ms 。那么,超时时间(默认为 500ms) 又是如何出来的,这个 500ms 是否还有 意义么
ec4a945
to
6531af5
Compare
@@ -74,27 +74,21 @@ func (dg *DetectGateway) ArpingOverIface() error { | |||
gwNetIP := netip.MustParseAddr(dg.V4Gw.String()) | |||
var gwHwAddr net.HardwareAddr | |||
for i := 0; i < dg.retries; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
参考源码,似乎是可以这么调用,完成 多次发送
client.SetDeadline
for i := 0; i < dg.retries; i++ {
err := c.Request(ip)
if err != nil {
return nil, err
}
}
// Loop and wait for replies
for {
arp, _, err := c.Read()
if err != nil {
return nil, err
}
if arp.Operation != OperationReply || arp.SenderIP != ip {
continue
}
return arp.SenderHardwareAddr, nil
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good job
貌似很多地方调用 这一段,可以 封装个 函数,会 好管理
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emm,ip 冲突那边还不太一样,那边需要构建 packet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个统一的 api ,完成探测, 它可以有入参数,源 MAC, 是 用 4 个 0 ,还是 用 网卡自身的
6531af5
to
a5ee454
Compare
@@ -274,8 +274,8 @@ func validateRPFilterConfig(rpfilter *int32, coordinatorConfig int64) (*int32, e | |||
func ValidateDelectOptions(config *DetectOptions) (*DetectOptions, error) { | |||
if config == nil { | |||
return &DetectOptions{ | |||
Interval: "1s", | |||
TimeOut: "3s", | |||
Interval: "10ms", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
次版本中, interval 的 相关代码 可以 去除了,doc 中 可以标记 deprecated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
本 pr docs 标记 deprecated,后续版本移除 代码
a5ee454
to
99eb660
Compare
Signed-off-by: Cyclinder Kuo <[email protected]>
99eb660
to
34cead5
Compare
coodirnator: optimize the detectiong timeout for ip conflict and gateway detection Signed-off-by: robot <[email protected]>
Thanks for contributing!
Notice:
"release/none"
"release/bug"
"release/feature"
What issue(s) does this PR fix:
Fixes #
Special notes for your reviewer:
detection options:
检测配置: