On failure, and what to do about it #208

pvh · 2023-10-19T07:07:57Z

In a distributed system, it's possible that you are trying to load data that isn't available. The server you're requesting it from might be offline, your network might be interrupted, the data could be corrupted, or the hosts you can access might decide not to give it to you.

Our current approach to handling these problems comes in two forms: the UNAVAILABLE and FAILED states.

When we know a document to be definitively unavailable (every peer you have has responded to say they won't give it to you), we go to the UNAVAILABLE state. From there, introducing a new peer will result in a new request and possibly a more favorable response.

What's less clear is how to handle documents that have timed out. Right now, documents that don't load before #timeoutDelay go into the FAILED state. FAILED is marked as final, so from there our state machine refuses all further progress. This is obviously a relatively minor issue, but fixing this bug seems to be more difficult than just transitioning to UNAVAILABLE. (Believe me, I tried.)
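To make the difference between the two states concrete, here is a minimal sketch of the transitions described above. It is not the actual state machine: only UNAVAILABLE and FAILED come from this discussion, while the REQUESTING and READY states, the event names, and the `next` helper are hypothetical names made up for illustration.

```typescript
// Hypothetical, simplified handle states. Only UNAVAILABLE and FAILED are
// named in the discussion; REQUESTING and READY are assumed for illustration.
type State = "REQUESTING" | "READY" | "UNAVAILABLE" | "FAILED";

// Hypothetical event names.
type Event = "DOC_RECEIVED" | "ALL_PEERS_UNAVAILABLE" | "PEER_ADDED" | "TIMEOUT";

const transitions: Record<State, Partial<Record<Event, State>>> = {
  REQUESTING: {
    DOC_RECEIVED: "READY",
    ALL_PEERS_UNAVAILABLE: "UNAVAILABLE",
    TIMEOUT: "FAILED",
  },
  // UNAVAILABLE is recoverable: a new peer triggers a new request.
  UNAVAILABLE: { PEER_ADDED: "REQUESTING" },
  READY: {},
  // FAILED is final: no event leads anywhere else.
  FAILED: {},
};

// Returns the next state, or the current state if the event is not handled.
function next(state: State, event: Event): State {
  return transitions[state][event] ?? state;
}
```

In this sketch, `next("UNAVAILABLE", "PEER_ADDED")` goes back to REQUESTING, while `next("FAILED", "PEER_ADDED")` stays in FAILED, which is the dead end described above.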

The problem I'm studying is as follows:

1. I try to load a large collection of documents not found in local storage.
2. Requesting these documents enqueues a flood of follow-on document requests via the service worker.
3. The service worker doc handles time out after 60s and enter a "failed" state.
4. From here there is no recovery short of killing the service worker and restarting it.

The question I have, probably mostly for @alexjg and @acurrieclark, is what the behaviour should be. I suspect we may want to do away with the timeout entirely, for starters. But would that actually fix anything? How do we handle slow-to-load documents? Can we identify the difference between "there's lots to load over a slow channel" and "this just isn't ever going to happen"?

HerbCaudill · 2023-10-19T13:16:11Z

Agreed that a timeout and a final error state is probably the wrong solution.

Seems like if you've asked a peer for a document and they're still online and they haven't responded with the document or with UNAVAILABLE, you want to keep trying until you hear back from them one way or another, or they go offline.

pvh · 2023-10-19T15:14:53Z

I think that's about right, probably with some kind of dynamic delay.
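As a rough sketch of what "keep trying with a dynamic delay" could look like, assuming a capped exponential backoff (one possible choice of dynamic delay, not something settled in this thread). The `PeerStatus` shape, the `retryDelay` and `decide` functions, and the default timings are all hypothetical and not part of any existing API.

```typescript
// What we have heard from a peer so far; null means no response yet.
type PeerReply = "document" | "unavailable" | null;

// Hypothetical view of a single peer we have asked for a document.
interface PeerStatus {
  online: boolean;
  reply: PeerReply;
}

type Decision =
  | { kind: "ready" }
  | { kind: "unavailable" }
  | { kind: "retry"; delayMs: number };

// Capped exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ... up to maxMs.
// The default values are arbitrary placeholders.
function retryDelay(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Decide what to do after the given number of attempts (0-based).
// There is no final error state: we either finish, give up on this
// peer because it said no or went offline, or schedule another try.
function decide(peer: PeerStatus, attempt: number): Decision {
  if (peer.reply === "document") return { kind: "ready" };
  if (peer.reply === "unavailable" || !peer.online) return { kind: "unavailable" };
  return { kind: "retry", delayMs: retryDelay(attempt) };
}
```

The point of the sketch is that a slow peer never pushes the document into a terminal state: as long as the peer is online and silent, `decide` keeps returning a `retry` with a growing but bounded delay.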
