vrrp heartbeat blocked when harddisk hangout #2494
Comments
Maybe we could move the /tmp/keepalived.stats file into a tmpfs directory? I don't think it needs to be persisted to the hard disk.
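Before moving the file, it may help to confirm which filesystem actually backs the path. The following is a minimal sketch, not part of keepalived: the helper name fs_type_for and the sample mount table are made up for illustration, and it assumes the usual /proc/mounts layout of "device mountpoint fstype options ...".

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        mount_point, fstype = fields[1], fields[2]
        # A mount matches if it is the path itself or one of its parents.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest mount point; on ties the later entry wins.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type


# Made-up mount table for illustration only.
SAMPLE_MOUNTS = """\
/dev/vda1 / ext4 rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""
```

On a real Linux host the second argument would be the contents of /proc/mounts. If the result for /tmp/keepalived.stats is not tmpfs, writes to that file can end up waiting on the block device.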
I can certainly add a configurable path for the files to be written to; of course, on most systems nowadays /tmp is a tmpfs. A better way, and the way keepalived was designed to work, is to obtain stats via SNMP.
A further question is what other files are opened and read from or written to in the main vrrp or checker processes (there are none in the bfd process). Reading the configuration for a reload is handled by the parent process, and it doesn't matter if that is blocked for a while; the configuration is passed to the child process via a memfd, so that shouldn't block. What comes to mind are
This has triggered some further, and completely unrelated, thinking. The
Another thought is that when reading and writing files in the critical timing path, it might be that we should use io_uring (liburing). That leaves the problem of open(), and the stat() family if we use it (I think close() should not block). https://nullprogram.com/blog/2020/09/04/ is very interesting in this respect, and I will explore it further.
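To illustrate the general idea of keeping file I/O out of the timing-critical path, here is a rough sketch that hands writes to a worker thread. This is not io_uring and not how keepalived is implemented; a plain thread and queue are swapped in as a simpler stand-in, and the BackgroundWriter class is hypothetical.

```python
import queue
import threading


class BackgroundWriter:
    """Hypothetical helper: performs file writes on a worker thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        # Daemon thread: a hung write cannot keep the process from exiting.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, path, text):
        # Only enqueues the job; the caller never calls open() or write().
        self._jobs.put((path, text))

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel from close()
                break
            path, text = job
            try:
                # open() and write() may block here if storage hangs,
                # but only this worker thread is affected.
                with open(path, "w") as f:
                    f.write(text)
            except OSError:
                pass  # a real implementation would report the error

    def close(self):
        # Waits for queued writes to finish; this call itself can block
        # if the storage is hung, so keep it off the timing path too.
        self._jobs.put(None)
        self._thread.join()
```

A timer loop that sends adverts would call submit() and carry on, so a stalled disk delays only the stats file. The trade-off is that queued jobs pile up while storage is hung, so a real design would probably cap the queue or drop stale dumps.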
Hi all.
Describe the bug
We have a set of environments built on KVM virtual machines that use Ceph block storage as the back end. When a Ceph failover occurs, VM I/O pauses, and the keepalived VRRP heartbeat pauses for the same length of time. The slave nodes then stop receiving the master node's heartbeat, which leads to an incorrect failover.
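For a sense of how long a pause is enough to trigger this, the VRRPv2 specification (RFC 3768) defines the interval a backup waits without adverts before taking over. The sketch below just evaluates that formula; the advert interval of 1 second and priority of 100 are assumed example values, not taken from this report.

```python
def master_down_interval(advert_int, priority):
    """Seconds a VRRPv2 backup waits without adverts before becoming master.

    RFC 3768: Skew_Time = (256 - Priority) / 256
              Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time
    """
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


# Assumed example values: 1 s advert interval, priority 100.
print(master_down_interval(1, 100))  # 3.609375
```

With these example settings, an I/O stall that silences the master for roughly 3.6 seconds or more would be long enough for a backup to take over.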
To Reproduce
Hang disk I/O but keep the network running; all the slave nodes will switch themselves to master.
Expected behavior
The VRRP heartbeat should continue while disk I/O is blocked, because the network is still running.
Keepalived version
1.3.5
Root cause
We use Prometheus-keepalived-exporter to monitor the keepalived status. The exporter tells keepalived to generate the /tmp/keepalived.stats file every minute and collects the keepalived information by reading that file.
This seems to be fine.
But when hard disk I/O blocks, strace shows that keepalived is stuck in open/read/write on /tmp/keepalived.stats, and the VRRP heartbeat stops.
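A process stuck in a disk syscall like this usually shows up in uninterruptible sleep (state D), which can be checked without strace. The sketch below pulls the state letter out of a /proc/<pid>/stat line; the helper name and the sample lines are made up, and the assumption that keepalived was in state D during the hang is mine, not something confirmed in this report.

```python
def process_state(stat_line):
    """Extract the one-letter process state from a /proc/<pid>/stat line.

    The line starts with "pid (comm) state ...". comm may itself contain
    spaces or parentheses, so split after the LAST ')' rather than on
    whitespace from the start.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


# Made-up sample lines for illustration only.
blocked = "1234 (keepalived) D 1 1234 1234 0 -1 4194560 100 0 0 0"
running = "1235 (odd name) x) S 1 1235 1235 0 -1 4194560 100 0 0 0"

print(process_state(blocked))  # D
print(process_state(running))  # S
```

On a live system the input would be the contents of /proc/<pid>/stat for the keepalived vrrp child process.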