
RTO Task Overhaul (BugFix and Support to run multiple subtasks) #14623

Open
wants to merge 73 commits into base: master

Conversation

Contributor

@noob-se7en noob-se7en commented Dec 9, 2024

Solves

  1. Issue: Support maxNumRowsPerTask in RealtimeToOfflineSegmentsTask #12857
    Issue description: Currently RealtimeToOfflineSegmentsTask, which is used to move real-time segments to offline segments, does not provide a way to tune maxNumRowsPerTask, the parameter that determines the input to a task. Without this configuration we end up creating a single minion task that takes in all of the input (i.e. all segments that meet the criteria to be converted to offline segments), which prevents us from using the other minions.
  2. Bugs: [Bug] Data Inconsistency in RealtimeToOffline Minion tasks #14659
    2.1. Currently, for the RTO task, the watermark is updated in the executor after the segments are uploaded. Hence there can be scenarios where segments were uploaded to the offline table but the RTO metadata watermark was not updated, AND the RTO task generator does not validate whether a segment has already been processed in a previous minion run.

Proposed Solution:

  1. Divide a task into multiple subtasks based on the maximum number of rows per subtask (see the sketch after this list).
  2. Since Segment Lineage is only kept for replaced segments (which does not apply to RTO tasks), this PR uses similar logic to add a segment-lineage-like data structure called ExpectedRealtimeToOfflineTaskResultInfo to the RTO task metadata.
  3. Once subtasks are generated, during execution each subtask atomically updates the ExpectedRealtimeToOfflineTaskResultInfo present in the task metadata before uploading offline segments.
  4. In the next iteration of generating RTO subtasks, use the task metadata and the current state of the offline table to detect failed subtasks and re-schedule them. In case of no failures, pick new eligible segments.
  5. Update the watermark based on the startTime of the earliest selected segment only if new eligible segments are picked (i.e. no failures).
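
A minimal sketch of the row-based split in item 1, under assumed names (splitIntoSubtasks is illustrative, not the generator's actual method): eligible segments are bucketed greedily until the per-subtask row budget is reached.

import java.util.ArrayList;
import java.util.List;
import org.apache.pinot.common.metadata.segment.SegmentZKMetadata;

// Hypothetical sketch: group eligible realtime segments into subtasks so that each subtask
// stays around maxNumRowsPerTask rows (the last segment added may push a bucket over).
public class SubtaskSplitSketch {
  static List<List<SegmentZKMetadata>> splitIntoSubtasks(List<SegmentZKMetadata> eligibleSegments,
      long maxNumRowsPerTask) {
    List<List<SegmentZKMetadata>> subtasks = new ArrayList<>();
    List<SegmentZKMetadata> current = new ArrayList<>();
    long currentRows = 0;
    for (SegmentZKMetadata segment : eligibleSegments) {
      current.add(segment);
      currentRows += segment.getTotalDocs();
      if (currentRows >= maxNumRowsPerTask) {
        subtasks.add(current);
        current = new ArrayList<>();
        currentRows = 0;
      }
    }
    if (!current.isEmpty()) {
      subtasks.add(current);
    }
    return subtasks;
  }
}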

Alternate solution (dropped) - #14578

TODO - Need to add more tests for edge cases (pending).


codecov-commenter commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 76.36364% with 78 lines in your changes missing coverage. Please review.

Project coverage is 63.62%. Comparing base (59551e4) to head (279a339).
Report is 1625 commits behind head on master.

Files with missing lines Patch % Lines
...egments/RealtimeToOfflineSegmentsTaskExecutor.java 4.65% 41 Missing ⚠️
.../minion/RealtimeToOfflineSegmentsTaskMetadata.java 66.07% 17 Missing and 2 partials ⚠️
...gments/RealtimeToOfflineSegmentsTaskGenerator.java 92.38% 7 Missing and 9 partials ⚠️
.../minion/RealtimeToOfflineCheckpointCheckPoint.java 90.47% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14623      +/-   ##
============================================
+ Coverage     61.75%   63.62%   +1.86%     
- Complexity      207     1411    +1204     
============================================
  Files          2436     2709     +273     
  Lines        133233   151767   +18534     
  Branches      20636    23431    +2795     
============================================
+ Hits          82274    96556   +14282     
- Misses        44911    47941    +3030     
- Partials       6048     7270    +1222     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 ?
java-11 34.03% <59.39%> (-27.68%) ⬇️
java-21 63.61% <76.36%> (+1.99%) ⬆️
skip-bytebuffers-false 63.61% <76.36%> (+1.86%) ⬆️
skip-bytebuffers-true 63.60% <76.36%> (+35.87%) ⬆️
temurin 63.62% <76.36%> (+1.86%) ⬆️
unittests 63.61% <76.36%> (+1.86%) ⬆️
unittests1 56.13% <72.72%> (+9.24%) ⬆️
unittests2 34.03% <59.39%> (+6.30%) ⬆️

Flags with carried forward coverage won't be shown.


private final String _tableNameWithType;
private final long _watermarkMs;
private long _watermarkMs;
private final Map<String, List<String>> _realtimeSegmentVsCorrespondingOfflineSegmentMap;
Contributor
This approach looks more promising.
I think we might have to track this input->output segment mapping on a per-task basis and then undo everything if there's a task failure (a task's execution should be treated as all or nothing). You could maintain 2 maps where the key of the map is taskId and the value is a list of segments (inputSegments, outputSegments).
If there's a task failure, you need to undo everything the task has done, i.e. remove all outputSegments (if they exist in the offline table) and redo all inputSegments that the task picked.
The reason for the above is that a single input segment can map to multiple output segments and multiple input segments can map to a single output segment. The cleanest approach is to undo what the task has done (either in the minion itself, with the generator as fallback) if there's a failure and retry the input segments.
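
A minimal sketch of that suggestion, purely illustrative (the class and method names below are hypothetical, not part of this PR): the bookkeeping is keyed by minion task id so a failed task can be rolled back as a unit.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-task bookkeeping: a failed task is undone as a unit by deleting its
// output segments from the offline table and re-scheduling its input segments.
public class PerTaskSegmentTracker {
  private final Map<String, List<String>> _taskIdToInputSegments = new HashMap<>();
  private final Map<String, List<String>> _taskIdToOutputSegments = new HashMap<>();

  public void record(String taskId, List<String> inputSegments, List<String> outputSegments) {
    _taskIdToInputSegments.put(taskId, new ArrayList<>(inputSegments));
    _taskIdToOutputSegments.put(taskId, new ArrayList<>(outputSegments));
  }

  // Segments to delete from the offline table when taskId is detected as failed.
  public List<String> outputsToDelete(String taskId) {
    return _taskIdToOutputSegments.getOrDefault(taskId, List.of());
  }

  // Realtime segments to re-schedule in the next generator run for a failed taskId.
  public List<String> inputsToRetry(String taskId) {
    return _taskIdToInputSegments.getOrDefault(taskId, List.of());
  }
}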

Contributor Author
This code handles M -> N segment conversion. This edge case is handled currently as well, no?

@noob-se7en noob-se7en changed the title from "Adds Support of maxNumRowsPerTask in RTO Generator" to "Adds Support of maxNumRowsPerTask in RTO Generator and Fixes Bugs" Dec 10, 2024
@noob-se7en noob-se7en marked this pull request as ready for review December 10, 2024 09:45

private final String _tableNameWithType;
private final long _watermarkMs;
private long _windowStartMs;
Contributor
would this change cause backward incompatibility issues during deployments?

Contributor Author
Fixed. _windowStartMs will be read from znRecord.getLongField("watermarkMs", 0) like before.
So there shouldn't be any backward incompatibility issue.
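
A sketch of that backward-compatible read (simplified, not the full method in this PR): the new _windowStartMs field is seeded from the legacy "watermarkMs" ZNRecord field, so metadata written before the upgrade is still interpreted correctly.

public static RealtimeToOfflineSegmentsTaskMetadata fromZNRecord(ZNRecord znRecord) {
  // Read the legacy key so a pre-upgrade watermark is carried over as the new window start.
  long windowStartMs = znRecord.getLongField("watermarkMs", 0);
  RealtimeToOfflineSegmentsTaskMetadata metadata =
      new RealtimeToOfflineSegmentsTaskMetadata(znRecord.getId(), windowStartMs);
  // ... restore the window end and expected subtask results from the remaining fields
  return metadata;
}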

* when a prev minion task is failed.
*
*/
public class ExpectedRealtimeToOfflineTaskResultInfo {
Contributor
nit: needs a better and shorter name

Contributor
Use RTOTaskResult instead?

Contributor Author
Would like to include Expected. Maybe ExpectedRTOSubtaskResult?

Contributor Author
have refactored var names.

* The <code>_segmentsTo</code> denotes the expected offline segments.
* The <code>_id</code> denotes the unique identifier of object.
* The <code>_taskID</code> denotes the minion taskId.
* The <code>_taskFailure</code> denotes the status of minion task handling the
Contributor
_taskStatus would be more apt.

Contributor Author
Have refactored the code doc. We only want to track whether the task failed or not; the rest of the statuses are irrelevant here.

*/
public class RealtimeToOfflineSegmentsTaskMetadata extends BaseTaskMetadata {

private static final String WATERMARK_KEY = "watermarkMs";
private static final String WINDOW_START_KEY = "watermarkMs";
Contributor
Value should be windowStartMs?

Contributor Author
For backward compatibility the value should remain the same, so that when we deploy, the watermark that was previously set is still picked up.

* when a prev minion task is failed.
*
*/
public class ExpectedRealtimeToOfflineTaskResultInfo {
Contributor
Use RTOTaskResult instead?

private long _windowStartMs;
private long _windowEndMs;
private final Map<String, ExpectedRealtimeToOfflineTaskResultInfo> _idVsExpectedRealtimeToOfflineTaskResultInfo;
private final Map<String, String> _segmentNameVsExpectedRealtimeToOfflineTaskResultInfoId;
Contributor
rename as _taskResultsMap instead? Consider simplifying the variable names throughout.

Contributor Author
Simplified var names.

_tableNameWithType = tableNameWithType;
_watermarkMs = watermarkMs;
_idVsExpectedRealtimeToOfflineTaskResultInfo = new HashMap<>();
Contributor
Variable name of the form taskToResults is easier to follow than taskVsResults

Contributor Author
renamed.

@@ -93,7 +93,7 @@ public class RealtimeToOfflineSegmentsTaskGenerator extends BaseTaskGenerator {

private static final String DEFAULT_BUCKET_PERIOD = "1d";
private static final String DEFAULT_BUFFER_PERIOD = "2d";
private static final int DEFAULT_MAX_NUM_RECORDS_PER_TASK = 50_000_000;
private static final int DEFAULT_MAX_NUM_RECORDS_PER_TASK = Integer.MAX_VALUE;
Contributor Author
This is important for the case where this PR is deployed to the controller first.

_taskFailure = taskFailure;
}

public String getTaskID() {
Contributor
Can this itself be a uniqueId instead of new _id field?

Contributor Author
We want to separate out the logic of setting _taskId from this _id. The _taskId is only needed for logging the failed minion task.
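
For context, an illustrative (simplified) field layout of the checkpoint object, based on the fields discussed in this thread; the exact declarations in the PR may differ.

// _id uniquely identifies this checkpoint entry (stable across retries), while _taskID only
// records which minion subtask produced it, for logging and debugging of failures.
private final String _id = java.util.UUID.randomUUID().toString();
private final String _taskID;
private final List<String> _segmentsFrom;  // realtime segments consumed by the subtask
private final List<String> _segmentsTo;    // offline segments the subtask is expected to upload
private boolean _taskFailure;              // set once the producing subtask is detected as failed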

* when a prev minion task is failed.
*
*/
public class ExpectedSubtaskResult {
Contributor
Is this applicable to other tasks? If not, I'd suggest moving this data structure to an RTO-specific module.

Contributor Author
Actually, we have already kept TaskMetadata in the pinot-common module. We would have to move both RTOTaskMetadata and this class to an RTO-specific module, because pinot-common does not import the RTO-specific module, i.e. pinot-minion-builtin-tasks.

* when a prev minion task is failed.
*
*/
public class ExpectedSubtaskResult {
Contributor
Can we name this something like RealtimeToOfflineCheckpoint?

Contributor Author
Renamed

@@ -99,12 +107,10 @@ public void preProcess(PinotTaskConfig pinotTaskConfig) {
RealtimeToOfflineSegmentsTaskMetadata realtimeToOfflineSegmentsTaskMetadata =
RealtimeToOfflineSegmentsTaskMetadata.fromZNRecord(realtimeToOfflineSegmentsTaskZNRecord);
long windowStartMs = Long.parseLong(configs.get(RealtimeToOfflineSegmentsTask.WINDOW_START_MS_KEY));
Preconditions.checkState(realtimeToOfflineSegmentsTaskMetadata.getWatermarkMs() <= windowStartMs,
Preconditions.checkState(realtimeToOfflineSegmentsTaskMetadata.getWindowStartMs() == windowStartMs,
Contributor
Nice, stricter check! Please confirm this.

Contributor Author
yes this is fine

@@ -156,6 +162,11 @@ protected List<SegmentConversionResult> convert(PinotTaskConfig pinotTaskConfig,
// Segment config
segmentProcessorConfigBuilder.setSegmentConfig(MergeTaskUtils.getSegmentConfig(configs));

// Since multiple subtasks run in parallel, there shouldn't be a name conflict.
// Append uuid
segmentProcessorConfigBuilder.setSegmentNameGenerator(
Contributor
I feel setting up the naming scheme should be left to the user, and we should pick a default/generic naming scheme instead of using a specific one. Can we not use the default name generator in SegmentProcessorFramework? Isn't the time-based normalized name generator used?

Contributor Author
Hmm, then we would have to add notes in the documentation that this edge case can occur and that the user should keep a UUID-based segmentNameGenerator. I think the user should not have to bother about this, and the Executor should handle this edge case.
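
A sketch of the idea (hypothetical helper, not the PR's actual name generator): appending a UUID makes segment names from parallel subtasks collision-free even when they cover the same time bucket.

// Hypothetical: suffix the generated name with a UUID so two subtasks building segments for
// the same table and time bucket can never produce the same segment name.
private static String buildSegmentName(String tableName, String timeBucketSuffix) {
  return tableName + "_" + timeBucketSuffix + "_" + java.util.UUID.randomUUID();
}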

ExpectedSubtaskResult expectedSubtaskResult =
expectedSubtaskResultMap.get(id);
// if already marked as failure, no need to delete again.
if (expectedSubtaskResult.isTaskFailure()) {
Contributor
Do we need this state? Can we not use the offline table as the source of truth for whether a segment exists or not?

Contributor Author
This is needed and referenced in Executor.
Preconditions.checkState(prevExpectedSubtaskResult.isTaskFailure(), "ExpectedSubtaskResult can only be replaced if it's of a failed task");

Let's say there were 2 consecutive failures in RTO for a realtime segment. The Executor will try to update the expectedSubtaskResult for a segment in the map only if the existing entry is marked as failed.

private long _windowStartMs;
private long _windowEndMs;
private final Map<String, ExpectedSubtaskResult> _expectedSubtaskResultMap;
private final Map<String, String> _segmentNameToExpectedSubtaskResultID;
Contributor
Why do we need this Map? Can we not use the previous map itself to get the list of segmentsFrom and segmentsTo for a specific taskId? We can then construct this map in memory from that, right?

Contributor Author
We need this map. Consider the edge case where there is more than one successive failure of a subtask:
for a given realtime segment we don't know which is the latest ExpectedSubtaskResult.
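
A sketch of the lookup being described, using the field names from the snippet above (the helper method itself is illustrative): with successive failures a realtime segment can appear in several ExpectedSubtaskResult entries, and the second map records which one is current.

// segment name -> id of its latest ExpectedSubtaskResult; id -> the result itself.
private ExpectedSubtaskResult getLatestResultForSegment(String realtimeSegmentName) {
  String latestResultId = _segmentNameToExpectedSubtaskResultID.get(realtimeSegmentName);
  return latestResultId == null ? null : _expectedSubtaskResultMap.get(latestResultId);
}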

Contributor Author
@noob-se7en noob-se7en Jan 22, 2025

Let me see if I can refactor to improve the readability of the code.

Contributor Author
I have refactored the code, please re-check.

// we can clear the state of prev minion tasks as now it's useless.
if (!realtimeToOfflineSegmentsTaskMetadata.getSegmentNameToExpectedSubtaskResultID().
isEmpty()) {
realtimeToOfflineSegmentsTaskMetadata.getSegmentNameToExpectedSubtaskResultID().clear();
Contributor
This is idempotent and should be fine upon re-execution? (i.e. if there's a failure prior to updating this state in ZK)

Contributor Author
Yes.
Since we reach here only when no previous minion failures are found, in the next generator run we will evaluate the same and reach here again.

Map<String, String> configs = MinionTaskUtils.getPushTaskConfig(realtimeTableName, taskConfigs,
_clusterInfoAccessor);
configs.putAll(getBaseTaskConfigs(tableConfig, segmentNameList));
configs.put(MinionConstants.DOWNLOAD_URL_KEY, StringUtils.join(downloadURLList, MinionConstants.URL_SEPARATOR));
Contributor
Should the order of the download URL list match segmentNameList, so that the minion correctly uses the download URLs?

Contributor Author
Yes order matches.
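
An illustrative fragment of how the two lists stay aligned (variable names are assumptions, not the exact generator code): building them in the same loop keeps the i-th download URL paired with the i-th segment name.

List<String> segmentNameList = new ArrayList<>();
List<String> downloadURLList = new ArrayList<>();
for (SegmentZKMetadata segmentZKMetadata : segmentsForSubtask) {
  // Same iteration order for both lists, so index i refers to the same segment in each.
  segmentNameList.add(segmentZKMetadata.getSegmentName());
  downloadURLList.add(segmentZKMetadata.getDownloadUrl());
}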

Contributor Author
Unit tests are pending, which will validate all of these things.

skipGenerate = true;
break;

// Get all offline table segments.
Contributor
Please add sufficient log.info calls (without flooding :)) that can help someone debug what's going on, specifically differentiating between the happy and failure paths, and, when there's a failure, what action took place / which segments had to be reprocessed or deleted.
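
As an illustration of the kind of logging being requested (slf4j placeholders; the messages and variable names here are hypothetical, not from the PR):

LOGGER.info("RTO generator for table: {} picked window [{}, {}] with {} eligible segments",
    realtimeTableName, windowStartMs, windowEndMs, eligibleSegments.size());
LOGGER.warn("RTO generator for table: {} found {} failed subtasks; deleting offline segments: {} "
    + "and re-scheduling realtime segments: {}", realtimeTableName, failedSubtaskIds.size(),
    segmentsToDelete, segmentsToReprocess);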
