IC_queue is getting huge #1542
Comments
@esplinr -- can your team take a look at this? Thanks!
@esplinr I added the info on the software version we are using. If more logs or information are needed, please ask.
It'll take me a while to find someone with the right expertise to triage this and provide some direction.
Small update:
After a brief investigation of the logs and validator-info, I have the following results:
The main recommendation from my point of view is:
Thanks @anikitinDSR for looking into this.
I struggle to understand your second point.
My interpretation of @anikitinDSR's comment is that the logs suggest the network is partitioned, which can happen if more than f nodes are rebooted at the same time. That could be why some nodes are seeing an older view change than the others. My suggestion is that you rebuild only the nodes that are seeing the old view change, and let them catch up so they have a view consistent with the rest of the network. If that isn't the problem, you'll have to trace the system to figure out exactly what is going on.
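For reference, here is a minimal sketch of what "more than f nodes" means for the node counts reported in this issue (13 in March, 15 today). It uses the standard BFT bound n >= 3f + 1. Treating n - f as the number of matching votes needed for a view change is an assumption about the protocol, not something confirmed in this thread, and the function names are made up for illustration.

```python
def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1 (standard BFT bound)."""
    return (n - 1) // 3


def strong_quorum(n: int) -> int:
    """n - f votes; assumed here to be the view-change quorum."""
    return n - max_faulty(n)


# Node counts taken from the issue description.
for n in (13, 15):
    f = max_faulty(n)
    print(f"n={n}: tolerates f={f} faulty or offline nodes, "
          f"strong quorum={strong_quorum(n)}")
```

Under these assumptions, taking more than f validators offline at the same time leaves fewer than n - f nodes reachable, so the remaining nodes cannot agree on a view change until enough of them are back.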
The final results and suggestions after investigations and discussion.
We are experiencing an issue with a huge IC_queue: our validator_info response is around 7 MB. The issue started back in March, when many view changes were triggered before the ongoing view change could complete. The view_no being voted for increased from 8044 to 10931. Then a node voted for a view change to 8045 again, which succeeded. However, every node still carries the huge IC_queue. Restarting does not help, since the IC_queue is persisted.
IC - Instance Change
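To see how much of the validator-info response the IC_queue accounts for, a rough sketch like the following could help. It assumes the response has been parsed as JSON and that the queue sits under a key named `IC_queue` somewhere in the document; the search is recursive so no particular path is assumed. The sample data, its field layout, and the helper names are hypothetical.

```python
import json
from typing import Any, Iterator


def find_key(obj: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            else:
                yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


def ic_queue_report(info: dict) -> dict:
    """Compare the serialized size of IC_queue with the whole response."""
    queues = list(find_key(info, "IC_queue"))
    return {
        "total_chars": len(json.dumps(info)),
        "ic_queue_chars": sum(len(json.dumps(q)) for q in queues),
        "entries": sum(len(q) for q in queues if isinstance(q, (dict, list))),
    }


# Hypothetical, heavily shortened response; real output will differ.
sample = {
    "Node_info": {
        "View_change_status": {
            "View_No": 8045,
            "IC_queue": {"8046": {"Voters": {}}, "8047": {"Voters": {}}},
        }
    }
}
report = ic_queue_report(sample)
print(report)
```

For a real node, `sample` would be replaced by the parsed validator-info output, for example via `json.load` on a saved file.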
I've attached parts of the log from view_change_trigger_service.py.
extract.txt
Looking at the code, it seems that this is the only point where instance change messages get removed.
Is there a way to flush the IC queue without deleting the indy node's data directory? About 10 stewards would need to do this in our case.
Can we do something to prevent such a buildup of (unsuccessful) view changes?
I've also reported this issue and asked these questions on Rocketchat.
Thanks for taking the time to look into this!
Network Details
13 validator nodes in March. Today 15.
Software
There is some variation on the exact os_version among the validator nodes.