
Spark: Adding simple custom partition sort order option to RewriteManifests Spark Action #9731

Status: Open. Wants to merge 1,019 commits into base: main.

Conversation

@zachdisc commented Feb 15, 2024

What

This adds a simple sort method to the RewriteManifests Spark action which lets users specify the partition column order to consider when grouping manifests.

Illustration:

```java
RewriteManifests.Result result =
    actions
        .rewriteManifests(table)
        .sort("c", "b", "a")  // <-- this is the new API piece
        .execute();
```

Closes #9615

Why

Iceberg's metadata is organized into a forest of manifest_files which point to data files sharing common partitions. By default, and during RewriteManifests, the partition grouping is determined by the default Spec partition order. If the primary query pattern is more aligned with the last partition in the table's spec, manifests are poorly suited to quickly plan and prune around those partitions.

E.g.

```sql
CREATE TABLE
...
PARTITIONED BY (region, storeId, bucket(ipAddress, 100), days(event_time))
```

Will create manifests that first group by region, whose manifest_file contents may span a wide range of event_time values. For a primary query pattern that doesn't care about region, storeId, etc, this leads to inefficient queries.
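To make the effect concrete, here is a minimal, self-contained Java sketch (illustrative only, not Iceberg code; the field names are hypothetical) of how re-ordering the clustering key changes which data files end up adjacent, and therefore which files share a manifest:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class PartitionOrdering {
  // Order data-file partition tuples by a chosen field priority
  // instead of the spec's declared order.
  static Comparator<Map<String, String>> byFields(List<String> fields) {
    return (a, b) -> {
      for (String f : fields) {
        int cmp = a.get(f).compareTo(b.get(f));
        if (cmp != 0) {
          return cmp;
        }
      }
      return 0;
    };
  }

  public static void main(String[] args) {
    List<Map<String, String>> files = new ArrayList<>();
    files.add(Map.of("region", "us", "event_day", "2024-02-02"));
    files.add(Map.of("region", "eu", "event_day", "2024-02-01"));
    files.add(Map.of("region", "us", "event_day", "2024-02-01"));

    // Spec order: region leads, so each region's files cluster together
    // while event_day ranges interleave across groups.
    files.sort(byFields(List.of("region", "event_day")));
    System.out.println(files.get(0).get("region"));     // eu

    // Query-aligned order: all files for a given day become adjacent,
    // regardless of region, so time-based pruning touches fewer groups.
    files.sort(byFields(List.of("event_day", "region")));
    System.out.println(files.get(0).get("event_day"));  // 2024-02-01
  }
}
```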

Requested Feedback and decisions

  • I chose to make the input to sort the raw column names used in partitioning, not the internal hidden ones: event_time instead of event_time_day, foo instead of foo_bucket_1234. Thoughts? We could readily allow both, or only the hidden partition field names if people prefer.
  • I would next like to have a more capable functional interface, sort(row -> {... return groupingString}), but was struggling to express a Java Function-like input that could be used on a Dataset<Row>'s data_file struct - any pointers on how to parse a Row's struct into a native POJO and supply a function that can be used like a UDF here would be appreciated!

```java
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

// Check if these partitions are included in the spec
Preconditions.checkArgument(
```
@zachdisc (Author):

Noting: if the user changes/specifies the spec() after this check has been done, we may wish to re-run this validation.

A member replied:

I'm not sure I understand the comment here?

@zachdisc (Author) replied:

Say there are 2 partition specs for this table: Spec1 partitions by (a_bucket, b), Spec2 partitions by (a_bucket_100, c_days). If you originally pass a valid sort(b, a) and then change the .specId to 2, the validated custom column order will now be inaccurate.
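For illustration, a minimal standalone sketch (hypothetical names, not the PR's actual code) of re-checking the requested order against whatever spec is current at execute() time, so a spec switched after sort(...) still fails fast:

```java
import java.util.List;
import java.util.Set;

public class SortOrderValidation {
  // Validate a user-supplied partition sort order against the
  // *current* spec's partition field names.
  static void validateAgainstSpec(List<String> sortFields, Set<String> specFields) {
    for (String f : sortFields) {
      if (!specFields.contains(f)) {
        throw new IllegalArgumentException(
            "Cannot sort by '" + f + "': not a field of the current partition spec " + specFields);
      }
    }
  }

  public static void main(String[] args) {
    Set<String> spec1 = Set.of("a_bucket", "b");
    Set<String> spec2 = Set.of("a_bucket_100", "c_days");

    // Passes against the spec it was written for.
    validateAgainstSpec(List.of("b", "a_bucket"), spec1);

    // Re-running the check against the newly selected spec catches
    // the now-stale sort order instead of silently misclustering.
    try {
      validateAgainstSpec(List.of("b", "a_bucket"), spec2);
      System.out.println("unexpected: stale order passed");
    } catch (IllegalArgumentException e) {
      System.out.println("re-check caught stale sort order");
    }
  }
}
```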

@RussellSpitzer (Member) left a comment:

Suggest that we make the API based on partition transform names; I think this is probably the only thing we can do reasonably with multi-arg transforms.

@zachdisc (Author):

Rev 2

  • Default implementation in RewriteManifests interface
  • Switched to using hidden transform names to specify sort order, not raw column names
  • Addressed other PR feedback

Going to work on the Function-based API update next. Open to doing so in a new PR or this one.

@jackye1995 (Contributor):

Looks like the CI test failed, @zachdisc could you take a look and fix?

Overall, we have the 2 options based on this thread: #9731 (comment)

option 1: we merge this and have a subsequent PR for the functional API and refactor the current change into it
option 2: we directly update the current PR to go with that approach

Is there any preference for which way we go?

@zachdisc (Author):

> Looks like the CI test failed, @zachdisc could you take a look and fix?
>
> Overall, we have the 2 options based on this thread: #9731 (comment)
>
> option 1: we merge this and have a subsequent PR for the functional API and refactor the current change into it
> option 2: we directly update the current PR to go with that approach
>
> Is there any preference for which way we go?

I'm not quite sure why that failed, I ran spotlessApply and saw a successful build locally before the last push. Regardless, I have two changes inbound, so I'll fix along with the new updates.

@zachdisc (Author):

Rev 3

  • Fixed styling which caused CI/CD to fail
  • Changed the clustering column to be a struct referencing the actual columns instead of a concatenated string p1::p2::p3.

Thanks to @Nandhakumar Saravana for pointing this out: when clustering by a string a::b::c, Spark doesn't know about relative priority/hierarchy. AKA a::b::c should be close to a::b::d, which holds when clustering by the actual columns rather than a String literal.
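A small standalone Java sketch (illustrative only, not Iceberg/Spark code) of one way a single concatenated key can misbehave: once values interact with the :: delimiter, the concatenated string's sort order can disagree with column-wise ordering, which is why clustering by a struct of the real columns preserves the intended hierarchy:

```java
public class ClusteringKeyOrdering {
  // Compare two 2-field partition tuples column by column.
  static int byColumns(String[] a, String[] b) {
    int cmp = a[0].compareTo(b[0]);
    return cmp != 0 ? cmp : a[1].compareTo(b[1]);
  }

  // Compare the same tuples after flattening into "p1::p2".
  static int byConcat(String[] a, String[] b) {
    return String.join("::", a).compareTo(String.join("::", b));
  }

  public static void main(String[] args) {
    String[] x = {"a:", "b"};  // flattens to "a:::b"
    String[] y = {"a", "zb"};  // flattens to "a::zb"

    // Column-wise: "a:" sorts after "a", so x > y.
    System.out.println(byColumns(x, y) > 0);  // true

    // Concatenated: ':' sorts before 'z', so "a:::b" < "a::zb" and x < y.
    System.out.println(byConcat(x, y) > 0);   // false
  }
}
```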

Change imminent for the functional interface addition too; just running last tests.

```diff
@@ -90,13 +94,13 @@ public class RewriteManifestsSparkAction
   public static final String USE_CACHING = "use-caching";
   public static final boolean USE_CACHING_DEFAULT = false;

   private List<String> partitionSortColumns = null;
```
A contributor commented:

nit: we follow public static final -> private static final -> private final -> private to order variables

@zachdisc (Author) replied on Feb 23, 2024:

Is this a problem? I have these attributes set to null but able to be redefined, so I think private is correct here; it's in line with the other attributes in this class. I do have this attribute in the wrong place though; moving it in the next rev.

```diff
-    Dataset<Row> transformedDF = repartitionAndSort(df, partitionColumn, numManifests);
+    Column partitionColumn = df.col(clusteringColumnName);
+    Dataset<Row> transformedDF =
+        repartitionAndSort(df, partitionColumn, numManifests).drop(clusteringColumnName);
```
A contributor commented:

The column here is used more for repartitioning than for sorting; sorting only happens among rows with the same clustering column value. I am starting to wonder whether we should call the API repartition(List<String> partitionFields) rather than sort(List<String> partitionFields).

@zachdisc (Author) replied on Feb 23, 2024:

Happy to make that change, but IMHO the API is indicating how the entire forest of manifests is sorted, not necessarily how each tree is sorted. That is, it's sorting the manifest_list entries in the end, while the manifest_files are sorted further within that. Thoughts?

Though, I suppose it isn't really sorting the whole forest, it is clustering.

@zachdisc (Author):

@nastra @jackye1995 thoughts on a general rename to cluster() for this API action?

github-actions bot commented:

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

The github-actions bot added the stale label on Dec 22, 2024.
@zachdisc (Author):

I rebased locally, resolved merge conflicts, and addressed Russell's ask to remove the function-based rewrite sorting.

@zachdisc (Author):

Please let me know the best way to deal with merge conflicts. I thought to rebase and get everything back in sync with the main branch, but that looks like the wrong flow here. I can close this PR and reopen a fresh one if that's preferred.

@zachdisc (Author):

Created a fresh PR with feedback here incorporated - #11881

This got messy with the accumulated changes and conflicts (paired with my likely messing things up with a rebase). Suggest closing this out in favor of the fresh start above.


Successfully merging this pull request may close these issues.

RewriteManifest with more options