Data durability question #264

Open · sroze opened this issue Dec 31, 2023 · 8 comments
@sroze

sroze commented Dec 31, 2023

Looking at the code, I'm trying to figure out how much I can trust automerge-repo from a durability standpoint, and I'd love a few pointers on the pieces I don't understand.

  1. When peers send messages to each other, there doesn't seem to be any sort of acknowledgment. As such, is it fair to say that if a server fails to process a given message (e.g. storage fails, or there's a power outage at that moment), other peers won't know about it? If so, wouldn't they continue to send further messages and fail to do so, given the de-synced heads?
  2. The Repo's use of the storage subsystem is out-of-band (in order to debounce). As such, does that mean that once a message has been sent and ack'ed by the Repo, there is a risk of data loss, because it might not yet have been persisted by the server?

Thank you so much 🙏

@alexjg
Contributor

alexjg commented Jan 16, 2024

Good questions.

  1. There isn't an acknowledgement as such, although there is logic to make sure we don't send messages unless either we have new changes or the other end has responded to the last message we sent, so we don't send messages indefinitely. More importantly, the transport is assumed to be reliable, and NetworkAdapters should emit "peer-disconnected" when a connection is lost, which notifies the Repo to stop sending messages on that connection. Obviously, detecting a disconnection often requires waiting for a timeout in the underlying transport.
  2. True, currently it is possible for a peer to acknowledge changes but then fall over before the change makes it to storage. In this case, when the peer comes back up, the sender would reconnect and resend its changes (this is handled by the sync protocol).

To the second point, I have played with various implementations of "flush before ack" type things, but I'm still uncertain whether it's worth the performance hit and implementation complexity. One of the great things about local-first architectures is that you can be less dependent on the reliability of a single machine. Say the receiving peer is a sync server: if that peer falls over before saving to storage, then as long as some peer still has the data, they will resend it to the sync server when it comes back up. This is not to say that we shouldn't implement some kind of enhanced durability mode, but I would like to have a motivating use case for it; so far it's not been needed, so I don't know what the API should look like. In that vein, would you be able to describe your architecture and what kind of data loss you are concerned about?
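
For concreteness, "flush before ack" would mean reordering the receive path so persistence completes before the acknowledgement is sent, roughly like the sketch below. All of the names here are hypothetical and invented for illustration; this is not automerge-repo's API.

// Illustrative only: hypothetical types and functions, not automerge-repo's API.
type IncomingSyncMessage = { documentId: string; data: Uint8Array }

interface DurableStorage {
  save(key: string, data: Uint8Array): Promise<void>
}

// "Flush before ack": persist the incoming change before telling the sender we
// have it, so an acknowledged change can never be lost to a crash between
// receipt and flush. The cost is one storage round-trip per message.
async function handleSyncMessage(
  msg: IncomingSyncMessage,
  storage: DurableStorage,
  sendAck: (documentId: string) => void
): Promise<void> {
  await storage.save(msg.documentId, msg.data) // wait for the write to complete
  sendAck(msg.documentId)                      // only acknowledge after the flush succeeds
}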

@sroze
Author

sroze commented Jan 16, 2024

Very clear, thank you @alexjg. Just so I understand your point though, what would happen in the following scenario:

  1. "client peer" sends an update, "server peer" receives it and stores it (successfully) in its storage.
  2. "client peer" sends another update based on 1, "server peer" receives it but the out of band storage fails (currently, the error is simply ignored).
  3. "client peer" sends another update based on 2, "server peer" receives it and stores it.

To prevent data loss, the changes from both 2 & 3 need to be persisted during step 3. How would the "client peer" know that, during step 3, it needs to send both the changes made in 2 & 3? If it doesn't, how are we preventing data loss? Sorry if these seem like basic questions; I'm not fully grasping the exchanges between peers.

@alexjg
Contributor

alexjg commented Jan 16, 2024

I'm not quite sure your scenario is correct. If there is an error in storage, the whole sync server should fall over, and then whatever service orchestration you're using would restart it. The error may still surface asynchronously though, because we throttle flushes, so the scenario would be more like this:

  1. client sends update, server receives and stores it
  2. client sends another update, server receives it
  3. sometime later server flushes, crashes, and restarts
  4. the client loses the original connection due to the restart and reconnects, which resets the sync state on the client
  5. the sync protocol negotiates what information each side has, causing the client to retransmit the second change, which was never saved
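
To make step 5 concrete: resetting the sync state and re-running the normal message loop is all that's needed, because the protocol works out what the server is missing and the client retransmits exactly those changes. A rough sketch from the server's side, using the low-level Automerge sync API (initSyncState, receiveSyncMessage, and generateSyncMessage are real; the surrounding wiring is invented for illustration):

import * as Automerge from "@automerge/automerge"

type Doc = { [key: string]: unknown }

function restartSyncAfterCrash(
  bytesFromStorage: Uint8Array,            // whatever the server managed to flush before crashing
  sendToClient: (msg: Uint8Array) => void  // transport, assumed reliable (hypothetical)
) {
  let doc = Automerge.load<Doc>(bytesFromStorage)
  let syncState = Automerge.initSyncState() // fresh state: assume nothing about the peer

  // Called for each incoming message after the client reconnects.
  return function onClientMessage(message: Uint8Array) {
    ;[doc, syncState] = Automerge.receiveSyncMessage(doc, syncState, message)
    const [nextState, reply] = Automerge.generateSyncMessage(doc, syncState)
    syncState = nextState
    if (reply) sendToClient(reply) // eventually requests, then acknowledges, the missing change
  }
}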

@alexjg
Contributor

alexjg commented Jan 16, 2024

This will work even if saves end up failing out of order, e.g.

  1. client sends change 1, server receives and stores
  2. client sends change 2
  3. server begins flush of change 2
  4. client sends change 3
  5. server begins flush of change 3
  6. change 3 flush succeeds
  7. flush of change 2 fails

Now we're in a situation where the server has change 1 and change 3 on disk.

  1. server restarts
  2. client restarts sync negotiation, determines that the server is missing change 2 and retransmits it

@sroze
Author

sroze commented Jan 25, 2024

If there is an error in storage the whole sync server should fall over and then whatever service orchestration you're using would restart it

Okay, so this means that we rely on unhandled exceptions crashing the server, which is not NodeJS' default behaviour. I think it's very risky for the library to expect this to happen. In any reasonably high-throughput scenario, when running NodeJS APIs as Docker containers, I wouldn't expect this to be the case either: a server crash would cause a large number of requests to fail, so I would expect errors to be handled on a per-request basis.

@alexjg
Contributor

alexjg commented Jan 25, 2024

Interesting, my understanding was that the default behavior for uncaught exceptions in Node is to crash the process, which is implied by this documentation. Have I misunderstood, or is there perhaps a standard practice of disabling this default behavior in some environments (e.g. Docker containers)?

Regardless, I agree: crashing the whole server is undesirable if you expect transient storage failures. Loading a document into memory can be a CPU-bound task, which means that starting the server up again with all the documents clients were connected to might take a while (we are working hard on making loading much, much faster and, further down the line, on modifying the sync and storage protocols so that sync servers don't need to hold documents in memory at all).

I don't think we can handle storage failure purely on a per-request or per-connection basis. If we catch a storage exception, all we know is that some data may have failed to save for a particular document, which means that every connection to that document may think the server has saved data which it has not. So I think we would at least need to restart sync for all the clients connected to the given document. Fortunately, we do have provisions in the sync protocol for restarting sync on an existing connection by resetting the sync state on the restarting peer, so I think what we would need to do is this:

  1. Catch the storage exception and identify the documents which have been affected
  2. Stop processing incoming messages for the documents in question (actually, discard them)
  3. Reload the documents from storage
  4. Reset (encode and decode) the sync states for every peer connected to each document
  5. Generate new outgoing messages for every (peer, document) pair and resume processing sync messages for the documents

This doesn't help if the storage failure is not transient though. In that case we would just stop processing and then sit there waiting for the documents to load whilst storage continues to fail. Here it seems to me that it would be preferable to crash the server, so that clients and orchestration tools can at least know that something is wrong. This makes me think that if you expect transient storage errors, a better place to handle them might be in the storage adapter, where you can do some kind of backoff on failures?
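
For example, a storage adapter could wrap its writes in a simple retry-with-backoff so that transient failures never surface to the Repo at all. The sketch below is a generic helper, not part of automerge-repo; where exactly the save call lives depends on the actual StorageAdapter interface in the automerge-repo docs.

// Generic retry-with-exponential-backoff wrapper (hypothetical helper).
async function withBackoff<T>(
  op: () => Promise<T>,
  attempts = 5,
  baseDelayMs = 100
): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      const delayMs = baseDelayMs * 2 ** i // 100ms, 200ms, 400ms, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
  throw lastError // not transient after all: let the caller decide whether to crash
}

// Usage inside a storage adapter's save path (underlyingSave is hypothetical):
// await withBackoff(() => underlyingSave(key, bytes))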

@kid-icarus
Contributor

Uncaught exceptions do crash the process by default; here's a little example:

setInterval(() => {
  if (Math.random() > 0.5) {
    throw new Error('Random error')
  }
  console.log('Hello')
}, 1000)

maybe there's a standard practice of disabling this default behavior in some environments (e.g. docker containers)?

In production apps, you'll often find something like:

process.on('uncaughtException', (err) => {
  // log the error
  // pipe the error to error tracking, e.g. Sentry
})

@mykola-vrmchk
Contributor

mykola-vrmchk commented Mar 26, 2024

I don't think we can handle storage failure purely on a per-request or connection basis. If we catch a storage exception then all we know is that some data may have failed to save for a particular document, which means that every connection to that document may think that the server has saved data which it has not saved and so I think we would at least need to restart sync for all the clients connected to the given document.

@alexjg, could synchronization of an update to other peers begin only after that update has been preserved on disk? DocHandle could accept some storage callback which it would wait on before emitting the events that trigger network sync.

In such a scenario, sync events would simply not be emitted on storage failure, so each server connection would never learn about unsaved data, and we could recover gracefully from storage failures.
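
A rough sketch of what that could look like, purely to illustrate the proposal (DurableDocHandle, applyLocalChange, and the event names are all invented, not existing automerge-repo behaviour):

import { EventEmitter } from "node:events"

// Persist first, and only emit the event that triggers network sync if the
// write succeeded. On failure, peers never hear about the unsaved change.
class DurableDocHandle extends EventEmitter {
  constructor(
    private documentId: string,
    private persist: (id: string, data: Uint8Array) => Promise<void> // the storage callback
  ) {
    super()
  }

  async applyLocalChange(change: Uint8Array): Promise<void> {
    try {
      await this.persist(this.documentId, change) // wait for storage first
      this.emit("change-stored", change)          // only now let the Repo sync it to peers
    } catch (err) {
      this.emit("storage-error", err)             // recover without desyncing any peer
    }
  }
}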
