
[core] Fix FlowWindowSize management when sender drops packets #2815

Merged: 1 commit, Nov 3, 2023

Conversation

yomnes0
Collaborator

@yomnes0 yomnes0 commented Oct 26, 2023

No description provided.

@maxsharabayko maxsharabayko added this to the v1.6.0 milestone Oct 26, 2023
@maxsharabayko maxsharabayko added the [core] Area: Changes in SRT library core label Oct 26, 2023
@yomnes0
Collaborator Author

yomnes0 commented Oct 27, 2023

When the sender drops packets, it generates a fake ACK. In some cases this could end up in a situation where its ACK sequence number was always higher than the one coming from the receiver. Because of that, incoming ACKs were ignored and the FlowWindowSize was never updated.

Without this fix, a 50 Mbps transmission at 1200 ms latency over a 21% loss network with a 200 ms RTT would end up with 50% unrecovered packets. With this fix, this goes down to below 1%, which is in line with other cases where the problem didn't appear.
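The mechanism can be illustrated with a simplified sketch (this is not the actual SRT code; `seqcmp` only mimics the spirit of `CSeqNo::seqcmp`, and the struct and field names are invented for illustration). An ACK whose sequence number does not exceed the locally recorded last-ACK position is discarded, so once a sender-side drop advances that position past the receiver's real ACK sequence, every genuine ACK, together with the flow window update it carries, is ignored:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Simplified circular sequence comparison in the spirit of SRT's
// CSeqNo::seqcmp (illustrative only; SRT's threshold handling differs).
static const int32_t SEQ_TH = 0x3FFFFFFF;
int seqcmp(int32_t a, int32_t b)
{
    return (std::abs(a - b) < SEQ_TH) ? (a - b) : (b - a);
}

struct Sender
{
    int32_t lastAckSeq;     // highest ACK sequence processed so far
    int     flowWindowSize; // peer's advertised free buffer, in packets

    // Process an incoming ACK carrying the peer's available buffer size.
    // Returns true if the ACK was accepted.
    bool onAck(int32_t ackSeq, int peerFreeBuf)
    {
        if (seqcmp(ackSeq, lastAckSeq) <= 0)
            return false;              // stale ACK: ignored entirely
        lastAckSeq = ackSeq;
        flowWindowSize = peerFreeBuf;  // the update that was being lost
        return true;
    }

    // Sender-side drop: before the fix, advancing lastAckSeq past the
    // receiver's actual ACK position made all later real ACKs look stale.
    void dropUpTo(int32_t seq)
    {
        if (seqcmp(seq, lastAckSeq) > 0)
            lastAckSeq = seq;
    }
};
```

With this model, after `dropUpTo(1000)` a genuine ACK for sequence 900 is rejected and `flowWindowSize` stays frozen, which matches the behavior described above.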

@yomnes0
Collaborator Author

yomnes0 commented Oct 30, 2023

Here are some test results:

With the fix, default buffer size, 100 ms RTT network, 25% packet loss:

Latency   Lost packets   Dropped packets   Late packets   Unrecovered packets (%)
1000 ms          70702                 5             86                      0.13
1200 ms          71290                 2             66                      0.09
1300 ms          71661                64             58                      0.17
1400 ms          81443               688             85                      0.94

With the fix, a receive buffer size of 1000000000 bytes, 100 ms RTT network, 25% packet loss:

Latency   Lost packets   Dropped packets   Late packets   Unrecovered packets (%)
1000 ms          71199                 2             77                      0.11
1200 ms          70753                 0             69                      0.10
1300 ms          71165                 0             55                      0.05
1400 ms          71235                 0             13                      0.02

With the fix, 200 ms RTT network, 25% packet loss. Increasing the receive buffer improves the results the same way it did with the 100 ms RTT network.

Latency   Lost packets   Dropped packets   Late packets   Unrecovered packets (%)
1000 ms          71065              1075            233                      1.84
1200 ms          70968               366             46                      0.58
1300 ms          72389               341             55                      0.55
1400 ms          75285               890            182                      1.42

For reference, here are the results I had without the fix, in a network with a 200 ms RTT and 21% loss (I didn't go up to 25%, but the results still speak for themselves). With latency greater than 1200 ms, results were the same as at 1200 ms (>= 50% unrecovered packets).

Latency   Lost packets   Dropped packets   Late packets   Unrecovered packets (%)
1000 ms          59238               349            134                      0.82
1200 ms          60574             30980             41                     51.21

@maxsharabayko
Collaborator

Results with the fix are collected in a 100 ms RTT network with 25% packet loss and a 200 ms RTT network with 25% packet loss.
Results without the fix are collected in a 200 ms RTT network with 21% packet loss.

Comparing results with the fix (200 ms RTT, 25% packet loss) and without the fix (200 ms RTT, 21% packet loss), there are a lot more dropped packets with the fix at 1 s latency (1075 vs 349), but far fewer at 1.2 s latency (366 vs 30980).

Although the results do not provide a solid base for comparison, it looks like the fix uses the available size of the receiver buffer more efficiently.

@ethouris
Collaborator

ethouris commented Nov 1, 2023

From what I can see of what the m_iFlowWindowSize field designates, it looks like it's the size of the peer's receiver buffer in packets.

How, then, can dropping from the sender buffer virtually increase the free space in the peer's receiver buffer?

I understand that it's just a "statement" that the same packets that are dropped on the sender side will also be dropped (independently) on the receiver side when the time comes to play a packet following the loss. The problem is, we don't know exactly when this is going to happen. The perfect solution would be to define the exact time, namely the "peer's play time minus RTT" of the packet that closes the earliest gap, as the time up to which packets on the sender side should be dismissed. This could at least replicate the phenomenon happening on the receiver side at that moment, or very close to it.
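The proposed timing rule could be sketched like this (purely illustrative; the function name and the clock model are assumptions, not SRT API):

```cpp
#include <cassert>
#include <cstdint>

// All times in microseconds on the sender's clock.
// A packet scheduled at origin time T is played by the peer at roughly
// T + latency; the suggestion above is to dismiss sender-side packets no
// earlier than that play time minus RTT, i.e. the latest moment at which
// a retransmission could still arrive in time to be played.
int64_t sndDropDeadline(int64_t pktOriginTimeUs, int64_t latencyUs, int64_t rttUs)
{
    return pktOriginTimeUs + latencyUs - rttUs;
}
```

For example, with 1200 ms latency and 200 ms RTT, a packet originated at t = 0 would not be dismissed from the sender buffer before the 1000 ms mark.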

Don't forget that people have been complaining here that they get many more drops when sender dropping is enabled. I added the option to configure the sender dropping time mainly for development needs, to see whether, by adjusting this setting, you can get fewer drops on the same link.

@yomnes0
Collaborator Author

yomnes0 commented Nov 1, 2023

> Results with the fix are collected in a 100 ms RTT network with 25% packet loss and a 200 ms RTT network with 25% packet loss. Results without the fix are collected in a 200 ms RTT network with 21% packet loss.
>
> Comparing results with the fix (200 ms RTT, 25% packet loss) and without the fix (200 ms RTT, 21% packet loss), there are a lot more dropped packets with the fix at 1 s latency (1075 vs 349), but far fewer at 1.2 s latency (366 vs 30980).
>
> Although the results do not provide a solid base for comparison, it looks like the fix uses the available size of the receiver buffer more efficiently.

I'll add some test results with the fix at 21% loss.

@maxsharabayko
Collaborator

@ethouris

> I understand that it's just a "statement" that these same packets that are dropped on the sender side will also be dropped (independently) on the receiver side when the time to play a packet following the loss comes. The problem is, we don't exactly know when this is going to happen.

The assumption is that the TL packet drop on the sender side happens later than the corresponding TL packet drop on the receiver, given some later packet has been received. Even assuming no later packets have been received, sending the following packets would eventually trigger TL packet drop on the receiver anyway, freeing up space for further packets.
If the receiver fails to receive any packet for a decent time, resulting in the SND seqno being much further ahead of the RCV seqno, the sequence discrepancy would eventually be detected. The connection would then be broken either by the sequence discrepancy or by the idle (no response) timeout.

@maxsharabayko
Collaborator

maxsharabayko commented Nov 1, 2023

Ran tests to determine the RCV buffer size leading to a 0% packet drop.

50 Mbps, 100 ms RTT, 25% packet loss, SRT latency 1.3 seconds.

The target RCV buffer size recommended by the Configuration Guidelines is 9436992 bytes.
For 130 ms RTT it would be 9541504 bytes.

The value to recommend based on the results below is 11936992 (+26%).
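The guideline figures quoted here can be reproduced with the following sketch. The constants are my reading of the Configuration Guidelines (1316 bytes of live-mode payload per packet, 1472 bytes accounted per buffered packet), so treat them as assumptions; the arithmetic does match all three target values above:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Approximate target receiver buffer size: enough full packets to cover
// (latency + RTT/2) seconds of payload at the given bitrate.
// Assumed constants: 1316 B live payload per packet, 1472 B counted per
// buffered packet (payload plus per-packet overhead).
int64_t rcvBufferSizeBytes(double bitrate_bps, double rtt_s, double latency_s)
{
    const double payloadPerPkt = 1316.0;
    const double bytesPerPkt   = 1472.0;
    double pktsPerSec = bitrate_bps / 8.0 / payloadPerPkt;
    int64_t pkts = static_cast<int64_t>(
        std::floor(pktsPerSec * (latency_s + rtt_s / 2.0)));
    return pkts * static_cast<int64_t>(bytesPerPkt);
}
```

This yields 9436992 bytes for (50 Mbps, 100 ms RTT, 1.3 s latency), 9541504 for 130 ms RTT, and 7339392 for 1.0 s latency, matching the figures quoted in this thread.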

SRT master.

RCVBUF:  9436992 pktRcvUnique 72750, pktRcvDrop: 20131 (27.67%)
RCVBUF:  9936992 pktRcvUnique 70255, pktRcvDrop: 20075 (28.57%)
RCVBUF: 10436992 pktRcvUnique 74912, pktRcvDrop: 15810 (21.1%)
RCVBUF: 10936992 pktRcvUnique 82331, pktRcvDrop: 10879 (13.21%)
RCVBUF: 11436992 pktRcvUnique 72003, pktRcvDrop: 19024 (26.42%)
RCVBUF: 11936992 pktRcvUnique 94710, pktRcvDrop:     4 (0%)
RCVBUF: 12436992 pktRcvUnique 94602, pktRcvDrop:     1 (0%)
RCVBUF: 12936992 pktRcvUnique 94651, pktRcvDrop:     0 (0%)
RCVBUF: 13436992 pktRcvUnique 94560, pktRcvDrop:     0 (0%)

SRT with the FIX (PR 2815)

RCVBUF:  9436992 pktRcvUnique 89856, pktRcvDrop: 3393 (3.78%)
RCVBUF:  9936992 pktRcvUnique 93241, pktRcvDrop: 1268 (1.36%)
RCVBUF: 10436992 pktRcvUnique 91598, pktRcvDrop: 1556 (1.7%)
RCVBUF: 10936992 pktRcvUnique 91747, pktRcvDrop:  727 (0.79%)
RCVBUF: 11436992 pktRcvUnique 94330, pktRcvDrop:   17 (0.02%)
RCVBUF: 11936992 pktRcvUnique 94687, pktRcvDrop:    1 (0%)
RCVBUF: 12436992 pktRcvUnique 94616, pktRcvDrop:    4 (0%)
RCVBUF: 12936992 pktRcvUnique 94605, pktRcvDrop:    2 (0%)
RCVBUF: 13436992 pktRcvUnique 94391, pktRcvDrop:    0 (0%)

50 Mbps, 100 ms RTT, 25% packet loss, SRT latency 1.0 seconds.

The target RCV buffer size recommended by the Configuration Guidelines is 7339392 bytes.

The value to recommend based on the results below is 10339392 (+40%).

SRT master

RCVBUF:  7339392 pktRcvUnique 70568, pktRcvDrop: 20988 (29.74%)
RCVBUF:  7839392 pktRcvUnique 72596, pktRcvDrop: 20726 (28.55%)
RCVBUF:  8339392 pktRcvUnique 71079, pktRcvDrop: 20844 (29.33%)
RCVBUF:  8839392 pktRcvUnique 94360, pktRcvDrop:    92 (0.1%)
RCVBUF:  9339392 pktRcvUnique 94674, pktRcvDrop:    33 (0.03%)
RCVBUF:  9839392 pktRcvUnique 94463, pktRcvDrop:     3 (0%)
RCVBUF: 10339392 pktRcvUnique 94710, pktRcvDrop:     2 (0%)
RCVBUF: 10839392 pktRcvUnique 94752, pktRcvDrop:     2 (0%)
RCVBUF: 11339392 pktRcvUnique 94522, pktRcvDrop:     1 (0%)
RCVBUF: 11839392 pktRcvUnique 94652, pktRcvDrop:     3 (0%)

SRT with the fix

RCVBUF:  7339392 pktRcvUnique 91678, pktRcvDrop: 2682 (2.93%)
RCVBUF:  7839392 pktRcvUnique 92193, pktRcvDrop: 1925 (2.09%)
RCVBUF:  8339392 pktRcvUnique 94336, pktRcvDrop:  312 (0.33%)
RCVBUF:  8839392 pktRcvUnique 94530, pktRcvDrop:   69 (0.07%)
RCVBUF:  9339392 pktRcvUnique 94437, pktRcvDrop:    8 (0.01%)
RCVBUF:  9839392 pktRcvUnique 94499, pktRcvDrop:   13 (0.01%)
RCVBUF: 10339392 pktRcvUnique 94560, pktRcvDrop:    1 (0%)
RCVBUF: 10839392 pktRcvUnique 94552, pktRcvDrop:    1 (0%)
RCVBUF: 11339392 pktRcvUnique 94737, pktRcvDrop:    0 (0%)
RCVBUF: 11839392 pktRcvUnique 94500, pktRcvDrop:    3 (0%)

SRT with the fix and SRTO_SNDDROPDELAY=-1

RCVBUF:  7339392 pktRcvUnique 72603, pktRcvDrop: 20784 (28.63%)
RCVBUF:  7839392 pktRcvUnique 86328, pktRcvDrop:  7383 (8.55%)
RCVBUF:  8339392 pktRcvUnique 93663, pktRcvDrop:   128 (0.14%)
RCVBUF:  8839392 pktRcvUnique 94510, pktRcvDrop:   174 (0.18%)
RCVBUF:  9339392 pktRcvUnique 94515, pktRcvDrop:    29 (0.03%)
RCVBUF:  9839392 pktRcvUnique 94607, pktRcvDrop:     6 (0.01%)
RCVBUF: 10339392 pktRcvUnique 94403, pktRcvDrop:     0 (0%)
RCVBUF: 10839392 pktRcvUnique 94771, pktRcvDrop:     2 (0%)
RCVBUF: 11339392 pktRcvUnique 94710, pktRcvDrop:     2 (0%)
RCVBUF: 11839392 pktRcvUnique 94790, pktRcvDrop:     0 (0%)

@maxsharabayko maxsharabayko added the Type: Bug Indicates an unexpected problem or unintended behavior label Nov 1, 2023
@ethouris
Collaborator

ethouris commented Nov 1, 2023

> The assumption is that the TL packet drop on the sender side happens later than the corresponding TL packet drop on the receiver, given some later packet has been received. Even assuming no later packets have been received, sending the following packets would eventually trigger TL packet drop on the receiver anyway, freeing up space for further packets.

My point is that on the receiver side the dropping is triggered at a very specific packet: the packet that follows the loss. Dropped are only the lost packets still unrecovered at the moment of the drop check. So on the sender side, at the moment of receiving the loss report, you know only one thing for sure: that these cells in the receiver buffer will have been freed exactly at the moment when the "loss coverer" packet is ready to play (actually when retrieved by the application, but we believe that it will be extracted immediately). Whether any packets were recovered up to this moment is unimportant, because we believe that every packet preceding the play-ready packet will be removed from the buffer at play time anyway.

The problem is: packets removed as "too late" from the sender buffer still have their "corresponding cells" active in the peer's receiver buffer after dropping. When these same cells will be removed from the peer's receiver buffer is another story.

This simply means that the sender's prediction, that by removing some range of packets from the sender buffer it gains more free space in the peer's receiver buffer at the moment the dropping happens, only increases the risk that the newest packets will be dropped on reception when the buffer is almost full and the sender has "tricked itself" into believing the peer will have space to accommodate the packet it is about to send.

The problem isn't whether the cells in the peer's receiver buffer will be freed "eventually", but whether they will be really freed at the very moment when the new packet is being sent and the free tokens in the flight window are checked for green light.
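The "green light" check described here can be sketched as follows (simplified: circular sequence arithmetic is omitted, and the names are illustrative rather than SRT's actual ones). Sending is gated on the number of packets in flight staying below both the flow window and the congestion window:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Packets in flight = distance from the last ACKed sequence to the next
// sequence to send (circular wraparound omitted for brevity). A new
// packet may be sent only while that distance stays below the smaller
// of the flow window (peer's free receiver buffer, in packets) and the
// congestion window.
bool canSend(int32_t lastAckSeq, int32_t nextSeq, int flowWindow, int congWindow)
{
    int inFlight = nextSeq - lastAckSeq;
    return inFlight < std::min(flowWindow, congWindow);
}
```

If a local drop artificially advances `lastAckSeq`, `inFlight` shrinks and `canSend` turns green even though the corresponding cells in the peer's receiver buffer may still be occupied, which is exactly the risk described above.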

If the receiver fails to receive any packet for a descent time resulting in the SND seqno being much further from the RCV seqno, the sequence discrepancy would be eventually detected. The connection is to be broken by the sequence discrepancy, or by the idle (no response) timeout.

This is happening from a completely different reason, although actually tricking the flight window size is only increasing the probability for this to happen. If there's a drop-on-reception once, the only way how it can be avoided is to quickly make space in the receiver buffer.

BTW. if there was at least one problem that has been solved by increasing the flight window tricky way and temporarily (only up to the next incoming ACK), wouldn't increasing the receiver buffer size do exactly the same thing, just without dangerous tricking?

@maxsharabayko
Collaborator

@ethouris

> The problem is: packets removed as "too late" from the sender buffer still have their "corresponding cells" active in the peer's receiver buffer after dropping. When these same cells will be removed from the peer's receiver buffer is another story.

The reality is that the sender does not know whether those cells are free or not, that's true.
At the same time, the sender should drop later than the receiver (this is still to be improved within #1910). Thus it can be assumed that if it decides to drop those packets, they will have been dropped or read by the receiver by the time of this event. The only caveat I see is that if the receiving application is not reading further, those packets would still be in the buffer. But we should optimize for the conventional case, where the receiving app reads every time there is something available to be read.

@maxsharabayko maxsharabayko merged commit cb362ee into Haivision:master Nov 3, 2023
9 checks passed