-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make all polling updaters wait for graph update finish #6262
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev-2.x #6262 +/- ##
=============================================
+ Coverage 69.81% 69.82% +0.01%
- Complexity 17939 17940 +1
=============================================
Files 2042 2042
Lines 76645 76632 -13
Branches 7830 7831 +1
=============================================
+ Hits 53506 53507 +1
+ Misses 20398 20384 -14
Partials 2741 2741 ☔ View full report in Codecov by Sentry. |
5c98899
to
12b4995
Compare
// Handle update in graph writer runnable | ||
saveResultOnGraph | ||
.execute(context -> updateHandler.update(feed, context.gtfsRealtimeFuzzyTripMatcher())) | ||
.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have a timeout here - what happens if a task take too much time or does not return ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The task is to write the update into the graph. I am not sure what will happen if the task to update the graph is forcibly terminated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't want to use the asynchronicity of the Future, why do we even need a Future at all? I think in such a case the method could just return void.
In any case, we need to discuss this in the dev meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After further discussion I finally realized what's going on here. In fact @leonardehrenfried, this is using the Future to delay completion of the polling graph updater until the subtask it enqueues on a different ExecutorService completes. I will comment further on the main discussion thread.
This was shifted to a draft pending profiling and code audit to determine why exactly applying tens of thousands of add updates is so slow (as reported by @miklcct). |
# Conflicts: # application/src/main/java/org/opentripplanner/updater/alert/GtfsRealtimeAlertsUpdater.java
Unfortunately I can do nothing on it: The same slowness should also appear when you have tens of thousands of trip updates because most of the running time is spent on updating the timetables of a trip pattern, as you need to sort the trip times when building the timetable, and the The other problem which caused the original bug was that, the server I used was a slow server which was heavily loaded. Not only it runs OTP, but also it runs the GTFS-RT generator to be consumed locally, and also a few CI agents to build various components as well such as the app. The server was a virtual machine which shared a physical host with other virtual hosts running the production OTP servers as well. With this fix, it is safe to run all polling updaters with a very small interval, e.g. |
You could try to see if this change helps ( /** A wrapper for a primitive int array. This is insane but necessary in Java. */
private static final class IntArray implements Serializable {
private final int[] array;
private final int hash;
public IntArray(int[] array) {
this.array = array;
this.hash = Arrays.hashCode(array);
}
@Override
public int hashCode() {
return hash;
}
@Override
public boolean equals(Object other) {
if (other instanceof IntArray that) {
return hash == that.hash && Arrays.equals(array, that.array);
}
return false;
}
} I am not to optimistic, but it is wort trying. |
You say you get many trip updates for the same pattern, can you provide some numbers?
The trips should be sorted, so another thing to investigate is:
static void sort(int[] a) {
for(int i=1; i<a.length; ++i) {
var temp = a[i];
int j = i-1;
if(temp < a[j]) {
do {
a[j+1] = a[j];
--j;
} while (j >= 0 && temp < a[j]);
a[j+1] = temp;
}
}
} Is it so that you ge a lot of updates to trips that does not run at the moment? |
Our upstream system will provide live times up to 24 hours in the future. It is common for trips to be pre-cancelled, or delays posted in advance if there are known disruptions on the network. The number of trips updated is about 30k. The number of stops is probably about 20k (a rough approximation only - about 4k stations in the whole network so 4 to 5 platforms each on average will be a good guess). The number of patterns is a bit hard to guess: on one extreme, a service using different platforms in a station is in a different pattern so most intercity services are in small patterns, while at the other extreme, services such as Elizabeth line are effectively metro services that use the exact same platforms every few minutes, that hundreds of trips are in the same pattern, and all of them have live times for every single stop. |
I think you can ignore my code suggestions above - looking at the flamegraph again I see that the time spent doing deduplication and sorting is minimal - the problem is that it is done for each update. To move forward on this I suggest the following:
|
Isn't this supposed to be |
The hash being equal doesn't guarantee the contents are equal, I think that the code is correct. If the hash isn't equal it returns false directly with no need to check the content. If they are equal the content must be checked. |
Yes, I am applying the same updates every time. It is the nature of polling updaters with |
Do you know the percentage of the update that are modified/replacement trips? (I guess these are to handle platform changes/variations from the planned timetable) |
The vast majority of them. |
We discussed this PR at length, during and after today's developer meeting. The problem this PR is solving and the mechanism by which it solves the problem were not obvious to everyone, and we agreed it needs some documentation in the PR comments, and then in the code as both short line comments and Javadoc at the head of GraphUpdaterManager. The context is as follows: GraphUpdaterManager has one ExecutorService with N threads on which the polling graph updaters run in parallel ( The central change here at Without this PR, the code block in question immediately returns after it enqueues the task on the single-threaded executor. The timing mechanism that provides the recurring schedule of the polling graph updater has the characteristic that it waits at least a specified number of seconds after the task completes before firing the next instance of the task. But this delay is measured from the time the fetch/parse/enqueue task completes, not from the time that the secondary task it enqueues in the single-threaded In the event that that secondary task takes longer to complete than the inter-polling duration, either because the secondary task is large/slow or because it's enqueued behind some other slow task, the polling executor will enqueue another task in the single-threaded executor while the first one has not yet completed. Over time, this leads to a pile-up of large update tasks, with the one currently executing being farther and farther behind the most recent data. With this PR, the code block cited above is modified to call get() on the Future returned by Without this PR we are just ignoring the fact that Potential down sides of this include blocking entire threads for many seconds at a time while waiting for the sub-tasks they enqueue. This does not appear to be a problem as the size of the multi-threaded ExecutorService thread pool is unlimited, with the parameters at its creation specifying only a minimum size, not a maximum. Even in cases with many polling updaters (Entur uses ~50) the number of idle threads should be within reason and have no serious performance impact. And in practice we only expect a few of them to block for perceptible lengths of time. Another potential concern is that this approach will apply backpressure on any one polling updater, but several updaters producing slow sub-tasks could still cause a pile-up in the single-threaded queue. But on further inspection, apparently this is not the case. By virtue of all polling updaters waiting on completion of tasks that are all in the same single-threaded queue, the poll intervals of all the polling updaters are effectively increased in response to crowding conditions in the single-threaded queue. One common reaction to these queue pile-ups is that it's a configuration issue, and one should simply increase the polling interval to leave enough time for these large secondary tasks to complete before the next polling operation. The problem with this idea is that the necessary time to process each task can be highly variable depending on operational conditions, of both transit and the server itself. A snowstorm could greatly increase the size of the realtime data file. Other realtime updaters could encounter larger batches of updates even when this one doesn't, delaying its tasks in the single-threaded queue. Heavier routing load on the server might slow down application of realtime updates, etc. We want the polling to slow down dynamically in response to varying levels of contention for the single-threaded graph update executor. In this PR, the same change (calling get() on the Future) is made to every kind of polling graph updater. It is debatable whether this is necessary for all updaters and/or whether consistency is a good thing, independent of whether a particular updater has a tendency to produce large/slow tasks. An important factor to consider is which polling updaters can produce very large batch updates (taking longer than one second, and sometimes upwards of 15 seconds to apply) and under what circumstances. Another thing to consider is whether there are any downsides to uniformly applying the blocking approach, even when polling updaters finish quickly. Our tentative conclusion in today's meetings was that uniformly applying the blocking approach in this PR has only benefits and will create nice uniform backpressure across the whole system. Other details are:
|
@t2gran has suggested that the On the other points I left unfinished above: @vpaturet pointed out today that the periodic scheduling of snapshot tasks on the single graph updater thread "cuts in line" and lands at the head of the queue rather than its tail. It must wait for any other task that's already running on that single thread, but will be the very next task to run, making the update visible to end users. The delay until it's next triggered is measured from when it finishes, so if a snapshot task is stuck waiting for a large update block to finish, multiple snapshot tasks will not pile up in the queue. They also experience backpressure from congestion on this single thread. Note that the snapshot tasks and the polling graph updaters are both using this kind of periodic scheduling, but they're running on different executors. The snapshots are periodically scheduled directly on the single-threaded executor; the polling updaters are periodically scheduled on the multi-threaded |
And finally, the underlying question still remains: even with 30k updates, why does it take 15 seconds? There's probably something that can be optimized away there, and switching to streaming updates would probably completely eliminate the problem. |
We will need to do that if we further expand our live time coverage to every single train, metro, bus, tram, ferry, etc., in the whole country but this can definitely help for a medium-sized deployment without setting up another piece of dedicated infrastructure to convert polling to streaming GTFS-RT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you add the feature toggle this should be ok. The WaitOnPullingUpdatersBeforeResheduling
should be on(true
) by default.
Feel free to find a better name and adding an appropriate documentation for the feature.
saveResultOnGraph | ||
.execute(context -> updateHandler.update(feed, context.gtfsRealtimeFuzzyTripMatcher())) | ||
.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good change, but we need to be able to turn this off in case it have some side-effects we have missed. This is critical code, and it is possible that we have not understood all implications. So, it would be nice to introduce a OTPFeature
to be able to turn this the synchronisation off. This need to be done for all calls to Feature#get()
introduced in this PR.
saveResultOnGraph | |
.execute(context -> updateHandler.update(feed, context.gtfsRealtimeFuzzyTripMatcher())) | |
.get(); | |
var f = saveResultOnGraph.execute(context -> updateHandler.update(feed, context.gtfsRealtimeFuzzyTripMatcher())); | |
if(OTPFeature.WaitOnPullingUpdatersBeforeResheduling.isOn()) { | |
f.get(); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a method to the updater and a unit test to do that, such that the behaviour is testable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good change, but we need to be able to turn this off in case it have some side-effects we have missed.
In the alternative I suggested, this could be accomplished by an additional condition
if (OTPFeature.AwaitPollingUpdaterCompletion.isOff() || status == null || status.isDone()) {
status = runPolling();
}
Though I think safety measures (and maybe even the OTPOption) are less necessary in an implementation where there are no blocking calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(@t2gran sorry, I accidentally edited your comment instead of hitting "quote reply". I think I restored your comment to its original form.)
A slightly different approach would be to save the future and test it at the beginning of the next polling round.
|
This is done in multiple places, mabe we should encapsulae this so this logic is done in one place and with error handling. I am ok with doing that in a separate PR - what do you think @vpaturet ? |
This seems like a good idea. Although we deduced it would work, I was a little uncomfortable with blocking whole threads for long periods of time. The proposed solution would first function in a nonblocking manner, then switch to blocking after one cycle. Since this still includes the blocking behavior in addition to the nonblocking it could be difficult to reason about in certain circumstances. What about a purely nonblocking approach? For example:
Whatever solution is applied, the logic will change from previous discussions (especially the backpressure aspect) so I'd like to wait to finalize the documentation until after you've added your commits. |
With this kind of approach, I'm not sure you'd want to push the Future management down into the WriteToGraphCallback API, because the wait-and-retry behavior is on a per-polling-updater basis. Each of those polling updaters would need to receive some sort of handle it uses to decide when to proceed, and the needed functionality of that handle is basically It seems simpler to me if any new So: class PollingGraphUpdater {
private GraphWriteCompletionStatus status;
void run() {
if (status == null || status.isDone()) {
status = runPolling();
} else {
LOG.info("Previous polling operation not yet finished. Waiting one more polling cycle.);
}
}
}
class PollingTripUpdater extends PollingGraphUpdater {
GraphWriteCompletionStatus runPolling() {
// ...
return saveResultOnGraph.execute(runnable);
}
}
class GraphUpdaterManager {
public GraphWriteCompletionStatus execute(GraphWriterRunnable runnable) {
return new GraphWriteCompletionStatus(scheduler.submit(() -> { }));
}
} |
The alternative approach that @abyrd mentioned will have a different effect on timing. If the poll is scheduled every 30 seconds, and the graph update takes 31 seconds on one occasion, a whole update at the 30 seconds will be skipped, while if it takes 29 seconds, a new processing will be done 1 second afterwards. Meanwhile, my original approach always starts a new poll at a certain time after the processing is complete. |
Yes, it would have a very different effect on timing. And the other solution proposed by Vincent would have another different effect on timing. Most solutions that try to be non-blocking would end up with different timing behaviors, it's just a trade-off. My sense was that people are looking at this as a special failsafe to handle situations where the polling interval is configured too short, not something that should be triggered constantly in regular operation. So I wasn't concerned about occasionally skipping one update cycle when load was too heavy. But really I have no strong preference on this, I was just carrying ideas presented by @vpaturet further. It seemed like he had a preference for nonblocking waits, and I was pointing out that it could be entirely non-blocking, since I'm not totally sure mixing blocking and nonblocking would be an improvement. |
# Conflicts: # doc/user/Configuration.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed in today's development meeting: the logic won't yet be changed in ways presented in the last few comments. The get() logic will remain in place, but gated behind a feature flag. We will then follow up with improvements later which may make it universally applicable enough to not need the flag.
I will follow up with Javadoc / code comments clarifying how this works, to facilitate any future round of changes.
@t2gran although we've got two reviews now, GH is waiting on your approval since your review requested the feature flag. |
# Conflicts: # application/src/main/java/org/opentripplanner/updater/siri/updater/SiriSXUpdater.java
Summary
This makes all polling updaters wait for graph update finish before returning.
Issue
Fixes #6252
Unit tests
None. This is a performance issue.
Documentation
None.