On failure, and what to do about it #208

Open
pvh opened this issue Oct 19, 2023 · 2 comments

Comments

@pvh
Member

pvh commented Oct 19, 2023

In a distributed system, it's possible that you are trying to load data that isn't available. The server you're requesting it from might be offline, your network might be interrupted, the data could be corrupted, or the hosts you can access might decide not to give it to you.

Our current solution to handling these problems comes in two forms, the UNAVAILABLE and the FAILED states.

When we know a document to be definitively unavailable, meaning every peer you have has responded to say they won't give it to you, we go to the UNAVAILABLE state. From there, introducing a new peer will result in a new request and possibly a more favorable response.

What's less clear is how to handle documents that have timed out. Right now, documents that don't load before #timeoutDelay go into the FAILED state. FAILED is marked as final, so from there our state machine refuses all progress. This is obviously a relatively minor issue, but fixing it seems to be more difficult than just transitioning to UNAVAILABLE. (Believe me, I tried.)
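To make the difference between the two states concrete, here is a minimal sketch of the transitions described above. The state names UNAVAILABLE and FAILED come from this issue; the other state and event names (`REQUESTING`, `READY`, `PEER_ADDED`, and so on) and the `next` function are hypothetical, chosen for illustration, and are not the library's actual state machine.

```typescript
// Illustrative sketch only: names other than UNAVAILABLE and FAILED are assumptions.
type State = "REQUESTING" | "READY" | "UNAVAILABLE" | "FAILED";
type Event = "LOADED" | "ALL_PEERS_DECLINED" | "TIMEOUT" | "PEER_ADDED";

// Transition table. FAILED has no outgoing transitions, which is what makes it final.
const transitions: Record<State, Partial<Record<Event, State>>> = {
  REQUESTING: {
    LOADED: "READY",
    ALL_PEERS_DECLINED: "UNAVAILABLE",
    TIMEOUT: "FAILED",
  },
  READY: {},
  // A new peer triggers a fresh request, so UNAVAILABLE is recoverable.
  UNAVAILABLE: { PEER_ADDED: "REQUESTING" },
  FAILED: {},
};

// Returns the next state, or the current state if the event is not handled.
function next(state: State, event: Event): State {
  return transitions[state][event] ?? state;
}
```

In this sketch, `next("UNAVAILABLE", "PEER_ADDED")` moves back to requesting, while `next("FAILED", "PEER_ADDED")` stays in FAILED, which mirrors the stuck behaviour described here.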

The problem I'm studying is as follows:

  • I try to load a large collection of documents not found in local storage.
  • Requesting these documents enqueues a flood of follow-on document requests via the service worker.
  • The service worker's doc handles time out after 60s and enter a "failed" state.
  • From here there is no recovery short of killing the service worker and restarting it.

The question I have, probably mostly for @alexjg and @acurrieclark, is: what should the behaviour be? I suspect we may want to do away with the timeout entirely, for starters. But would that actually fix anything? How do we handle slow-to-load documents? Can we tell the difference between "there's lots to load over a slow channel" and "this just isn't ever going to happen"?

@HerbCaudill
Collaborator

Agreed that a timeout and a final error state is probably the wrong solution.

Seems like if you've asked a peer for a document and they're still online and they haven't responded with the document or with UNAVAILABLE, you want to keep trying until you hear back from them one way or another, or they go offline.

@pvh
Member Author

pvh commented Oct 19, 2023

I think that's about right, probably with some kind of dynamic delay.
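One way to read "keep trying until you hear back, with some kind of dynamic delay" is a retry loop with a growing wait between attempts. The sketch below assumes exponential backoff as the dynamic delay, which is my assumption and not something decided in this thread. All names (`backoffDelay`, `requestUntilAnswered`, `PeerReply`, the `ask` and `peerIsOnline` callbacks) are hypothetical and do not correspond to an existing API.

```typescript
// Illustrative sketch only: every name here is hypothetical.

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

type PeerReply = "DOCUMENT" | "UNAVAILABLE" | "NO_REPLY";

// Keeps asking while the peer is online and has not answered either way.
// There is no final timeout: the loop ends only on a definite reply or when
// the peer goes offline.
async function requestUntilAnswered(
  ask: () => Promise<PeerReply>,
  peerIsOnline: () => boolean,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<"DOCUMENT" | "UNAVAILABLE" | "PEER_OFFLINE"> {
  for (let attempt = 0; peerIsOnline(); attempt++) {
    const reply = await ask();
    if (reply !== "NO_REPLY") return reply;
    await sleep(backoffDelay(attempt));
  }
  return "PEER_OFFLINE";
}
```

With the defaults, the waits grow from 1s up to a 60s cap, so a slow channel gets more room over time without the request ever landing in a terminal state.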


2 participants