Kubo stopped listing pin after some times #10596
Comments
I can confirm we are seeing the same behaviour while testing. Logs from ipfs-cluster:
No errors or other logs observed in Kubo. Also, calling the API endpoint or running |
Hello, per the details you provided, I think leveldb exploded in some way. How many files are there in the Do you have monitoring for disk reads? Is it trying to read/write a lot from disk even when nothing is being added? I would recommend switching from leveldb to pebble (so, flatfs + pebble). You will lose the pinset (not the blocks, just the list of things you have pinned), but cluster will add the pins again, in time. |
Yes, when switching you will need to edit Do you use MFS for anything? Also, regarding the ipfs-pins graph, it goes to 0 because of ipfs-cluster/ipfs-cluster#2122. From now on it will stay at the last reported amount when Even if it won't need to download data, it will need to add 16M items to the datastore, and pinning will make it traverse everything it has. |
Thank you, I'm going to try that today on one of our nodes.
No, we don't use MFS; we only add new pins via the cluster API, and when needed we access our data via Kubo's gateway using the CIDs. As far as I understand, this doesn't involve the MFS subsystem.
Good to know, thank you 👍 |
I have switched one of our nodes to the pebble datastore; right now it is slowly adding the whole pinset back to pebble. |
hi |
@ehsan6sha still too soon to tell. Our node is still adding the data back into the pebble store. It has only caught up on 50% of the previous pins right now. |
Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days. |
@Mayeu how does it look now? (assuming it finished re-pinning) |
Sadly, I can't find anything because the amount of logs produced meant that the logs from before the spike were purged 🤦🏻♀️ I'm updating our log retention config for that machine to keep much (much) more logs, and "hope" to see that drop again. On the bright side, this node is now fully caught up with the cluster state, so we'll see if this issue shows up again. But since the 4th of December, our first node (which is still using LevelDB) hasn't experienced that issue. |
As mentioned, you cannot trust the graph much due to the bug I pointed out above... you are better off tracking the "pending" items (queued, pinning, error) and comparing that to the total items in the cluster pinset, rather than using this metric right now. |
(and happy new year!) |
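The suggestion above (track pending items against the total pinset) could be sketched roughly as follows. This is only an illustration: the function name and the way statuses are collected are hypothetical, and the status strings (`pin_queued`, `pinning`, `pin_error`) are assumed to match what your ipfs-cluster version reports, so check them before relying on this.

```python
from collections import Counter
from typing import Iterable

# Assumed status names for "pending" items; verify them against the
# statuses your ipfs-cluster version actually reports.
PENDING_STATUSES = {"pin_queued", "pinning", "pin_error"}


def summarize_pin_statuses(statuses: Iterable[str]) -> dict:
    """Count pending items and compare them with the total pinset size."""
    counts = Counter(statuses)
    total = sum(counts.values())
    # Counter returns 0 for statuses that never appeared.
    pending = sum(counts[s] for s in PENDING_STATUSES)
    return {
        "total": total,
        "pending": pending,
        "by_status": {s: counts[s] for s in sorted(PENDING_STATUSES)},
        "pending_ratio": pending / total if total else 0.0,
    }
```

Feeding it one status string per pin (however you export them from the cluster) gives a pending/total ratio that does not depend on the affected metric.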
@Mayeu We will wait for another week, and assume the issue is resolved if we do not hear from you. |
@hsanjuan right, I forgot about that. We do gather those as well. Here they are for the past 60 days: Pin queued: Error: Pinning: Just a reminder of the timeline:
For comparison, here are the graphs for our first node (still using LevelDB), which didn't experience as many issues (it still encountered that listing issue, but for some reason it stabilized pretty quickly after we realized there was an issue): Error and Pinning: Queued: |
So my understanding is that the new node still hangs on |
@hsanjuan I can't go back in the logs before the 5th of January so I'm not sure why the new node was stuck between the 22nd of December and the 5th of January. Between the 5th and today there are 4184 |
and the errors are |
Also, assume you call it manually and it is not streaming anything at all for a few minutes... can you |
Yes, the errors are the
Previously, when the issue arose with LevelDB, it "never" finished ("never" as in I stopped curl after 24h). But I haven't yet been able to catch it myself with Pebble. I don't think it will "never" finish anymore, since it seems to happen regularly and then resolve itself, as this number of errors per day shows:
I'll script something to trigger a diagnostic if a curl doesn't finish after a few minutes, because I'm not sure I can react to those events myself. FYI (and for me in 6 months), I'm getting those numbers with angle-grinder:
|
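A watchdog like the one described above might look like this minimal sketch. It is not the script that was actually used. It assumes Kubo's RPC API listens on the default local address, that the `ipfs` binary is on `PATH`, and that `ipfs diag profile` is the diagnostic you want; the function names and the 5-minute timeout are made up for illustration.

```python
import subprocess
import urllib.request
from typing import Callable, Optional

# Assumption: default local RPC address; adjust for your setup.
PIN_LS_URL = "http://127.0.0.1:5001/api/v0/pin/ls?type=recursive&stream=true"


def list_pins(url: str = PIN_LS_URL, timeout_s: float = 300.0) -> int:
    """Stream the pin listing and return the number of lines received.

    The timeout applies to each blocking socket operation, so a listing
    that stops streaming for longer than `timeout_s` raises an exception.
    """
    request = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return sum(1 for _ in response)


def collect_profile() -> None:
    """Ask Kubo to write a diagnostic profile archive."""
    subprocess.run(["ipfs", "diag", "profile"], check=False)


def check_once(
    lister: Callable[[], int] = list_pins,
    on_stall: Callable[[], None] = collect_profile,
) -> Optional[int]:
    """Run one listing; trigger the diagnostic if it stalls or fails."""
    try:
        return lister()
    except OSError:
        # Socket timeouts and connection failures are OSError subclasses,
        # so a refused connection also triggers the diagnostic here.
        on_stall()
        return None
```

Running `check_once()` from a cron job or systemd timer would cover the case where nobody is around to react when the listing hangs.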
Hello, @Mayeu. The logs should print info messages of the sort: "Full pinset listing finished". How long does it take to list the pinset when it works? The ipfs-cluster config has a If it is not that, and you can reproduce this by calling |
Sorry for the delay here, I finally got to check the data that was gathered last week.
We have set this timeout to 20 minutes in our case. Here is one profile that was triggered while a At the longest, some requests took 14 minutes (after the log entry) to respond. But at that point the partition was full of profiles, so I don't have a profile matching those requests. I'll try again if you want.
There isn't any log at the start of the listing process, right? The earliest I can find is when the process is around 500k pins. On our node using pebble it seems to take around 1m:
On our node still using leveldb, this takes around 45s:
|
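Since the logs only start reporting partway through the listing, one way to measure the full duration is to time a streamed listing directly. The helper below is a rough sketch with hypothetical names; it assumes the listing arrives as one non-empty line per pin (for example a streamed `/pin/ls` response body) and does not parse the lines.

```python
import time
from typing import Iterable, Tuple


def time_listing(lines: Iterable[bytes], report_every: int = 500_000) -> Tuple[int, float]:
    """Count non-empty lines and return (count, elapsed seconds)."""
    start = time.monotonic()
    count = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines between entries
        count += 1
        if count % report_every == 0:
            print(f"{count} pins after {time.monotonic() - start:.1f}s")
    return count, time.monotonic() - start
```

Passing an open HTTP response object as `lines` works because it iterates line by line; the measured time then includes the wait before the first pin is streamed.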
Checklist
Installation method
built from source
Version
Config
Description
Hello,
We started to experience an issue with Kubo in our 2-node cluster where Kubo doesn't list pins anymore.
We have 2 nodes that both pin the whole pinset we keep track of, which is around 16.39 million pins right now.
Last week (while we were still using 0.29), Kubo stopped responding to the /pin/ls queries sent by the cluster; those requests were hanging "indefinitely" (as in, when using curl I stopped the command after ~16h without response). Our ipfs-cluster process is returning the following in the log when this happens:
This started out of the blue; there was no change on the server. The issue remained after upgrading to 0.32.1.
At that time, we had the bloom filter activated; deactivating it did improve the situation for a while (maybe 24h), and then the issue started to show up again. In retrospect, I think it may not be related to the bloom filter at all.
These are the typical metrics reported by ipfs-cluster, which show when Kubo stops responding to /pin/ls:
The graph on top is the number of pins the cluster is keeping track of, and the one on the bottom is the number of pins reported by Kubo. When restarting Kubo, it generally jumps to the expected amount, and after a while it drops to 0. At that point any attempt to list pins from Kubo fails.
We only have the metrics reported by ipfs-cluster because of this Kubo bug.
The server CPU, RAM, and disk utilization are fairly low when this issue shows up, so it doesn't look like a performance issue. The only metric that started to go out of bounds is the number of open file descriptors, which grew and reached the 128k limit that was set. I bumped it to 1.28 million, but it still reaches it (with or without the bloom filter):
The FD limit is set both at the systemd unit level and via IPFS_FD_MAX.
Restarting Kubo makes it work again most of the time, but sometimes it doesn't change anything and it instantly starts to fail.
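For anyone wanting to watch the open file descriptor count independently of existing dashboards, here is a small Linux-only sketch that reads it from `/proc`. The function names are hypothetical, and it assumes the script runs as a user allowed to read the Kubo process's `/proc/<pid>/fd` directory.

```python
import os
from typing import Optional


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Number of entries in /proc/<pid>/fd, i.e. currently open descriptors."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def read_fd_soft_limit(pid: int, proc_root: str = "/proc") -> Optional[int]:
    """Soft 'Max open files' limit from /proc/<pid>/limits, or None if unlimited/missing."""
    with open(os.path.join(proc_root, str(pid), "limits")) as limits_file:
        for line in limits_file:
            if line.startswith("Max open files"):
                # Columns: "Max open files <soft> <hard> files"
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None


def fd_usage_ratio(pid: int, proc_root: str = "/proc") -> Optional[float]:
    """Fraction of the soft limit in use, or None when no numeric limit applies."""
    limit = read_fd_soft_limit(pid, proc_root)
    if not limit:
        return None
    return count_open_fds(pid, proc_root) / limit
```

Logging `fd_usage_ratio(kubo_pid)` every minute would show whether the descriptor count climbs steadily or jumps right before the listing starts to hang.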
Here is some profiling data from one of our nodes:
More info about the system:
logs and cache for ZFS
Kubo also emits a lot of:
But ipfs swarm resources doesn't return anything above 5-15%, so I think this error is actually on the remote node side and not related to our issue, right?
Anything else we could gather to help solve this issue?
Right now I'm out of ideas to get our cluster back into a working state (besides restarting Kubo every 2h, but that's not a solution since it will prevent us from reproviding the pins to the rest of the network).
Edit with additional info:
--enable-gc flag, as prescribed by the ipfs-cluster doc.