Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Adding simple custom partition sort order option to RewriteManifests Spark Action #9731

Open
wants to merge 1,019 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
1019 commits
Select commit Hold shift + click to select a range
10f8e2a
Build: Bump software.amazon.awssdk:bom from 2.28.16 to 2.28.21 (#11311)
dependabot[bot] Oct 14, 2024
383b74b
Build: Bump org.apache.datasketches:datasketches-java (#11307)
dependabot[bot] Oct 14, 2024
11fd224
Core: Switch usage to DataFileSet / DeleteFileSet (#11158)
nastra Oct 14, 2024
8af6a2a
[AWS] S3FileIO - Add Cross-Region Bucket Access (#11259)
munendrasn Oct 14, 2024
cf4959f
Core: Deprecate ContentCache.invalidateAll (#10494)
findepi Oct 14, 2024
6281aad
Spark 3.3, 3.4, 3.5: Remove unnecessary copying of FileScanTask (#11319)
huaxingao Oct 15, 2024
765d71f
Core: Rename DeleteFileHolder to PendingDeleteFile / Optimize duplica…
nastra Oct 15, 2024
46ed80d
Core: Fix version number in deprecation note for invalidateAll (#11325)
findepi Oct 16, 2024
9e849f2
Build, Spark, Flink: Bump junit from 5.10.1 to 5.11.1 (#11262)
tomtongue Oct 16, 2024
3c7f0fd
Core, Azure: Support wasb[s] paths in ADLSFileIO (#11294)
mrcnc Oct 16, 2024
4a11561
Make connect compatable with kafka plugin.discovery (#10536)
joswlv Oct 16, 2024
5905bc2
Spark 3.5: Spark Scan should ignore statistics not of type Apache Dat…
jeesou Oct 16, 2024
dd8efd0
Kafka Connect: Add regex for property file match (#11303)
ryanjclark Oct 16, 2024
5698814
OpenAPI: Remove repeated 'for' (#11338)
ebyhr Oct 17, 2024
cc019cc
AWS: Fix S3InputStream retry policy (#11335)
edgarRd Oct 17, 2024
792b930
API: (Test Only) Small fix to TestSerializableTypes.java (#11342)
aihuaxu Oct 17, 2024
4c8c8b7
Core: lazily load default Hadoop Configuration to avoid NPE with Hado…
stevenzwu Oct 17, 2024
b83e337
Revert "Core, Azure: Support wasb[s] paths in ADLSFileIO (#11294)" (#…
RussellSpitzer Oct 18, 2024
f72dce4
Flink: make FLIP-27 default in SQL and mark the old FlinkSource as de…
stevenzwu Oct 18, 2024
f75a824
Flink: disable the flaky range distribution bucketing tests for now (…
stevenzwu Oct 18, 2024
8f26282
OpenAPI: Standardize credentials in loadTable/loadView responses (#10…
nastra Oct 18, 2024
73b537d
Core: Add credentials to loadTable / loadView responses (#11173)
nastra Oct 18, 2024
7945cc1
API: Add RewriteTablePath action interface (#10920)
laithalzyoud Oct 18, 2024
16a484e
AWS: Switch to base2 entropy in ObjectStoreLocationProvider for optim…
ookumuso Oct 18, 2024
b9e4f68
Flink: Add IcebergSinkBuilder interface allowed unification of most o…
arkadius Oct 21, 2024
62def97
Build: Bump parquet from 1.13.1 to 1.14.3 (#11264)
dependabot[bot] Oct 21, 2024
bf13ae6
Build: Bump com.palantir.baseline:gradle-baseline-java (#11362)
dependabot[bot] Oct 21, 2024
2a7aa6f
Build: Bump com.google.errorprone:error_prone_annotations (#11360)
dependabot[bot] Oct 21, 2024
e35dbdf
Build: Bump software.amazon.awssdk:bom from 2.28.21 to 2.28.26 (#11359)
dependabot[bot] Oct 21, 2024
c72c076
Build: Bump com.google.cloud:libraries-bom from 26.48.0 to 26.49.0 (#…
dependabot[bot] Oct 21, 2024
c135c99
Core: Move deleteRemovedMetadataFiles(..) to CatalogUtil (#11352)
leesf Oct 21, 2024
fc8175c
Arrow: Fix indexing in Parquet dictionary encoded values readers (#11…
wypoon Oct 21, 2024
39ae45b
Spark 3.5: Update Spark to use planned Avro reads (#11299)
rdblue Oct 22, 2024
a8cf77e
Core, Spark 3.5: Remove dangling deletes as part of RewriteDataFilesA…
dramaticlly Oct 22, 2024
e6e01f0
Spark: Randomize view/function names in testing (#11381)
nastra Oct 23, 2024
1f4d2d7
Spark 3.4: Randomize view/function names in testing (#11382)
nastra Oct 23, 2024
19e6183
Spark 3.4: Action to remove dangling deletes (#11377)
dramaticlly Oct 23, 2024
d04e26f
Spark 3.5: Reset Spark Conf for each test in TestCompressionSettings …
huaxingao Oct 23, 2024
eb0357c
Spec: Adds Row Lineage (#11130)
RussellSpitzer Oct 23, 2024
08ff755
AWS: Support S3 directory bucket listing (#11021)
stubz151 Oct 24, 2024
5ef7b98
Core: Add LoadCredentialsResponse class/parser (#11339)
nastra Oct 24, 2024
ccb61cc
OpenAPI: Add endpoint for refreshing vended credentials (#11281)
nastra Oct 24, 2024
ebe0bd3
AWS: Use testcontainers-minio instead of S3Mock (#11349)
sullis Oct 24, 2024
d9bbfc1
Kafka Connect: Include third party licenses and notices in distributi…
bryanck Oct 24, 2024
d5696ae
Deprecate iceberg-pig (#11379)
jbonofre Oct 24, 2024
88a627a
Core: Track data files by spec id instead of full PartitionSpec (#11323)
nastra Oct 25, 2024
6174869
Spark 3.5: Don't change table distribution when only altering local o…
manuzhang Oct 25, 2024
e86b4da
Spec: Fix table of content generation (#11067)
ajantha-bhat Oct 25, 2024
7f17587
[KafkaConnect] Fix RecordConverter for UUID and Fixed Types (#11346)
singhpk234 Oct 25, 2024
e25760f
Core: Snapshot `summary` map must have `operation` key (#11354)
kevinjqliu Oct 25, 2024
3db0af3
Core: Update TableMetadataParser to ensure all streams closed (#11220)
erik-grepr Oct 26, 2024
d6f7998
Build: Bump Spark 3.4 to 3.4.4 (#11366)
manuzhang Oct 28, 2024
124cc47
Build: Bump junit from 5.11.1 to 5.11.3 (#11401)
dependabot[bot] Oct 28, 2024
cc0d173
Build: Bump software.amazon.awssdk:bom from 2.28.26 to 2.29.1 (#11400)
dependabot[bot] Oct 28, 2024
bbeee79
Core: Move Javadoc about commit retries to SnapshotProducer (#10995)
gaborkaszab Oct 28, 2024
9f16d34
Build: Bump junit-platform from 1.11.2 to 1.11.3 (#11402)
dependabot[bot] Oct 28, 2024
b28e095
Build: Bump org.xerial:sqlite-jdbc from 3.46.1.3 to 3.47.0.0 (#11407)
dependabot[bot] Oct 28, 2024
1b22346
Build: Bump software.amazon.s3.accessgrants:aws-s3-accessgrants-java-…
dependabot[bot] Oct 28, 2024
fc615f6
Build: Bump net.snowflake:snowflake-jdbc from 3.19.0 to 3.19.1 (#11406)
dependabot[bot] Oct 28, 2024
24c2f93
Build: Bump testcontainers from 1.20.2 to 1.20.3 (#11404)
dependabot[bot] Oct 28, 2024
a115297
Build: Bump com.google.errorprone:error_prone_annotations (#11403)
dependabot[bot] Oct 28, 2024
ad35017
Build: Bump mkdocs-macros-plugin from 1.2.0 to 1.3.7 (#11399)
dependabot[bot] Oct 28, 2024
7ef6e26
Build: Bump mkdocs-material from 9.5.39 to 9.5.42 (#11398)
dependabot[bot] Oct 28, 2024
bedb897
Flink: Fix disabling flaky range distribution bucketing tests (#11410)
manuzhang Oct 28, 2024
7ad2243
Bump Azurite to the latest version (#11411)
Fokko Oct 28, 2024
0711a2a
Build: Bump datamodel-code-generator from 0.26.1 to 0.26.2 (#11356)
dependabot[bot] Oct 28, 2024
28c4c50
Revert "Core: Snapshot `summary` map must have `operation` key (#1135…
kevinjqliu Oct 28, 2024
e0cad0d
Aliyun: Remove spring-boot dependency (#11291)
jbonofre Oct 28, 2024
0a88b15
Build: Bump com.azure:azure-sdk-bom from 1.2.25 to 1.2.28 (#11267)
dependabot[bot] Oct 28, 2024
34a7002
Spark: Flaky test due temp directory (#10811)
manuzhang Oct 28, 2024
b3723a6
Core: Add portable Roaring bitmap for row positions (#11372)
aokolnychyi Oct 28, 2024
2ddb804
Flink 1.20: Update Flink to use planned Avro reads (#11386)
jbonofre Oct 29, 2024
4003248
open-api: Fix `testFixtures` dependencies (#11422)
ajantha-bhat Oct 29, 2024
6330040
Core: use ManifestFiles.open when possible (#11414)
dramaticlly Oct 29, 2024
90bc87e
GCS: Refresh vended credentials (#11282)
nastra Oct 30, 2024
0dd8015
Docs: Add 21 blogs / fix one broken link (#11424)
AlexMercedCoder Oct 30, 2024
2010347
AWS: Refresh vended credentials (#11389)
nastra Oct 30, 2024
d153c54
Build: Bump Hadoop to 3.4.1 (#11428)
Fokko Oct 30, 2024
fc85962
Core: Remove credentials from LoadViewResponse (#11432)
nastra Oct 30, 2024
caa6ee9
OpenAPI: Remove credentials from LoadViewResult (#11433)
nastra Oct 30, 2024
f586b96
API: Add compatibility checks for Schemas with default values (#11434)
rdblue Oct 30, 2024
75dc410
Doc: Update rewrite data files spark procedure (#11396)
dramaticlly Oct 31, 2024
9cd378d
Docs: warn `parallelism > 1` doesn't work for migration procedures (#…
manuzhang Oct 31, 2024
3cafd79
Core: Log retry sleep time (#11413)
sullis Oct 31, 2024
b10b709
Core: Use RoaringPositionBitmap in position index (#11441)
aokolnychyi Nov 1, 2024
24f1b1c
Core: Add validation for table commit properties (#11437)
dramaticlly Nov 2, 2024
2c3a1d6
Core: Add cardinality to PositionDeleteIndex (#11442)
aokolnychyi Nov 2, 2024
0f1fb70
Puffin: Add deletion-vector-v1 blob type (#11238)
rdblue Nov 2, 2024
7217d8c
Spec: Add deletion vectors to the table spec (#11240)
rdblue Nov 2, 2024
906e0e5
API, Core: Add data file reference to DeleteFile (#11443)
aokolnychyi Nov 2, 2024
2ff5361
Build: Bump software.amazon.awssdk:bom from 2.29.1 to 2.29.6 (#11454)
dependabot[bot] Nov 4, 2024
7e6ce6a
Build: Bump com.gradleup.shadow:shadow-gradle-plugin from 8.3.3 to 8.…
dependabot[bot] Nov 4, 2024
d33155b
Build: Bump com.google.cloud:libraries-bom from 26.49.0 to 26.50.0 (#…
dependabot[bot] Nov 4, 2024
12b9c70
Build: Bump org.apache.httpcomponents.client5:httpclient5 (#11450)
dependabot[bot] Nov 4, 2024
04ed219
Build: Bump com.azure:azure-sdk-bom from 1.2.28 to 1.2.29 (#11453)
dependabot[bot] Nov 4, 2024
ec81acf
Flink: Maintenance - TableManager + ExpireSnapshots (#11144)
pvary Nov 4, 2024
f356cf4
Build: Bump mkdocs-material from 9.5.42 to 9.5.43 (#11455)
dependabot[bot] Nov 4, 2024
f5764be
Build: Bump net.snowflake:snowflake-jdbc from 3.19.1 to 3.20.0 (#11447)
dependabot[bot] Nov 4, 2024
ffaf39a
Build: Bump kafka from 3.8.0 to 3.8.1 (#11449)
dependabot[bot] Nov 4, 2024
d6aa472
Build: Bump jackson-bom from 2.18.0 to 2.18.1 (#11448)
dependabot[bot] Nov 4, 2024
b7bfed1
Core: Fix generated position delete file spec (#11458)
aokolnychyi Nov 4, 2024
30123ed
API, Core: Add content offset and size to DeleteFile (#11446)
aokolnychyi Nov 4, 2024
65c18f5
Revert "Build: Bump parquet from 1.13.1 to 1.14.3 (#11264)" (#11462)
RussellSpitzer Nov 4, 2024
268689f
Spark 3.5: Preserve data file reference during manifest rewrites (#11…
aokolnychyi Nov 4, 2024
fa9a6e1
Core: Make PositionDeleteIndex serializable (#11463)
aokolnychyi Nov 5, 2024
00ceb37
Spark 3.5: Preserve content offset and size during manifest rewrites …
aokolnychyi Nov 5, 2024
bba0b16
Spark 3.5: Fix flaky test due to temp directory not empty during dele…
manuzhang Nov 5, 2024
246c505
Core, Data, Flink, Spark: Improve tableDir initialization for tests (…
nastra Nov 5, 2024
eb3d4e0
Core: Support DVs in DeleteFileIndex (#11467)
aokolnychyi Nov 5, 2024
f9c0832
Core: Adapt commit, scan, and snapshot stats for DVs (#11464)
aokolnychyi Nov 5, 2024
877cd69
Spark: Synchronously merge new position deletes with old deletes (#11…
amogh-jahagirdar Nov 5, 2024
8341dae
Fix ADLSLocation file parsing (#11395)
mrcnc Nov 5, 2024
85c789e
open-api: Build runtime jar for test fixture (#11279)
ajantha-bhat Nov 6, 2024
7bea331
Core, Puffin: Add DV file writer (#11476)
aokolnychyi Nov 6, 2024
e345525
Flink: Fix config key typo in error message of SplitComparators (#11482)
liuml07 Nov 6, 2024
ce560b6
API: Removes Explicit Parameterization of Schema Tests (#11444)
RussellSpitzer Nov 7, 2024
72cddd9
Docs: Fix verifying release candidate with Spark and Flink (#11461)
manuzhang Nov 7, 2024
a8b5d57
Flink: Port #11144 to v1.19 (#11473)
pvary Nov 8, 2024
eab800d
Docs: Fix format of verifying release candidate with Flink (#11487)
manuzhang Nov 8, 2024
d18c751
Core: Support DVs in DeleteLoader (#11481)
aokolnychyi Nov 8, 2024
7394d2a
Infra: Update DOAP.RDF for Apache Iceberg 1.7.0 (#11492)
RussellSpitzer Nov 8, 2024
7d33358
Docs: Site Update for 1.7.0 Release (#11494)
RussellSpitzer Nov 8, 2024
0701c59
Infra: Add 1.7.0 to issue template (#11491)
RussellSpitzer Nov 8, 2024
4f414d8
Build: Let revapi compare against 1.7.0 (#11490)
RussellSpitzer Nov 8, 2024
4cd801e
Docs: Adds Release notes for 1.6.1 (#11500)
RussellSpitzer Nov 8, 2024
7bac716
Docs: Fixes Release Formatting for 1.7.0 Release Notes (#11499)
RussellSpitzer Nov 8, 2024
01aac10
DOCS: Explicitly specify `operation` as a _required_ field of `summar…
sungwy Nov 8, 2024
261f892
Spark: Exclude reading _pos column if it's not in the scan list (#11390)
huaxingao Nov 8, 2024
3b8a130
Build: Bump mkdocs-redirects from 1.2.1 to 1.2.2 (#11511)
dependabot[bot] Nov 11, 2024
fa58d48
Build: Bump mkdocs-material from 9.5.43 to 9.5.44 (#11510)
dependabot[bot] Nov 11, 2024
9f1f99b
Build: Bump software.amazon.awssdk:bom from 2.29.6 to 2.29.9 (#11509)
dependabot[bot] Nov 11, 2024
68c2eb1
Docs: Update multi-engine support after 1.7.0 release (#11503)
manuzhang Nov 11, 2024
823492b
Spark: Fix typo in Spark ddl comment (#11517)
hantangwangd Nov 11, 2024
7e84055
Core: Support commits with DVs (#11495)
aokolnychyi Nov 11, 2024
9fd170c
Kafka Connect: fix Hadoop dependency exclusion (#11516)
bryanck Nov 11, 2024
0c0dce9
Build: Upgrade to Gradle 8.11.0 (#11521)
jbonofre Nov 12, 2024
3dc56a3
Spark 3.5: Iceberg parser should passthrough unsupported procedure to…
pan3793 Nov 12, 2024
abf9732
Release: Use `dist/release` KEYS (#11526)
kevinjqliu Nov 12, 2024
28eb1b9
Pig: Remove iceberg-pig (#11380)
jbonofre Nov 13, 2024
90864fe
Core, Flink, Spark: Test DVs with format-version=3 (#11485)
nastra Nov 13, 2024
6578428
Spark: Update tests which assume file format to use enum instead of s…
huaxingao Nov 13, 2024
da1ae35
Spark 3.4: Support Spark Column Stats (#11532)
saitharun15 Nov 14, 2024
aee6154
Docs: Fix rendering lists (#11546)
manuzhang Nov 14, 2024
276ab4d
Build: Bump kafka from 3.8.1 to 3.9.0 (#11508)
dependabot[bot] Nov 14, 2024
5edf143
Support WASB scheme in ADLSFileIO (#11504)
mrcnc Nov 14, 2024
121df4a
Docs: 4 Spaces are Requried for Sublists (#11549)
RussellSpitzer Nov 14, 2024
1442e7f
API, Core, Spark: Ignore schema merge updates from long -> int (#11419)
rocco408 Nov 15, 2024
8dbbac1
Docs: Fix level of Deletion Vectors (#11547)
manuzhang Nov 15, 2024
620f3da
API, Core: Replace deprecated ContentFile#path usage with location (#…
amogh-jahagirdar Nov 15, 2024
9d95e62
API: Add Variant data type (#11324)
aihuaxu Nov 15, 2024
a0eec4e
Spark 3.5: Adapt DeleteFileIndexBenchmark for DVs (#11529)
aokolnychyi Nov 15, 2024
317f2ce
Spark 3.5: Adapt PlanningBenchmark for DVs (#11531)
aokolnychyi Nov 15, 2024
892251f
Spark 3.5: Add DVReaderBenchmark (#11537)
aokolnychyi Nov 15, 2024
0b98372
Build: Bump datamodel-code-generator from 0.26.2 to 0.26.3 (#11572)
dependabot[bot] Nov 17, 2024
6810d00
Build: Bump software.amazon.awssdk:bom from 2.29.9 to 2.29.15 (#11568)
dependabot[bot] Nov 17, 2024
bab521e
Build: Bump io.netty:netty-buffer from 4.1.114.Final to 4.1.115.Final…
dependabot[bot] Nov 17, 2024
a1ad2e0
Data, Flink, MR, Spark: Test deletes with format-version=3 (#11538)
nastra Nov 17, 2024
50e02a4
Build: Bump nessie from 0.99.0 to 0.100.0 (#11567)
dependabot[bot] Nov 18, 2024
c288f93
Build: Bump orc from 1.9.4 to 1.9.5 (#11571)
dependabot[bot] Nov 18, 2024
cbd59b1
API, Arrow, Core, Data, Spark: Replace usage of deprecated ContentFil…
amogh-jahagirdar Nov 18, 2024
7ebaa2a
Core: Serialize `null` when there is no current snapshot (#11560)
Fokko Nov 18, 2024
8bb129d
Spark 3.4: Iceberg parser should passthrough unsupported procedure to…
pan3793 Nov 18, 2024
9d61251
Spark 3.3: Iceberg parser should passthrough unsupported procedure to…
pan3793 Nov 18, 2024
a5837b6
Core: Inherited classes from SnapshotProducer has TableOperations red…
gaborkaszab Nov 19, 2024
fd83be4
Revert "Core: Use encoding/decoding methods for namespaces and deprec…
nastra Nov 19, 2024
4d11d5b
Core: Delete temp metadata file when version already exists (#11350)
leesf Nov 20, 2024
1fdf846
Build: Bump Apache Parquet 1.14.4 (#11502)
Fokko Nov 20, 2024
613227e
Core: Filter on live entries when reading the manifest (#9996)
Fokko Nov 20, 2024
2702a27
Core: Fix CCE when retrieving TableOps (#11585)
nastra Nov 20, 2024
33586ec
API, Core: Remove unnecessary casts to Iterable<T> (#11601)
nastra Nov 20, 2024
70473a4
Spark 3.5: Fix NotSerializableException when migrating Spark tables (…
manuzhang Nov 20, 2024
3bc8f4a
Spark 3.3: Deprecate support (#11596)
aokolnychyi Nov 20, 2024
e222727
Hive: Bugfix for incorrect Deletion of Snapshot Metadata Due to OutOf…
ZhendongBai Nov 20, 2024
cfc3b63
docs: Add `iceberg-go` to doc site (#11607)
zeroshade Nov 20, 2024
7c25b27
Spark 3.5: Procedure to compute table stats (#10986)
karuppayya Nov 20, 2024
29fb251
Parquet: Use native getRowIndexOffset support instead of calculating …
wypoon Nov 21, 2024
d91dfae
Spark: Fix changelog table bug for start time older than current snap…
Acehaidrey Nov 21, 2024
354ef17
Spark 3.5: Fix flaky TestRemoveOrphanFilesAction3 (#11616)
manuzhang Nov 21, 2024
e09ce27
Build: Upgrade to Gradle 8.11.1 (#11619)
jbonofre Nov 21, 2024
977d2ec
Core: Optimize MergingSnapshotProducer to use referenced manifests to…
amogh-jahagirdar Nov 21, 2024
c186ff7
Revert "Core: Update TableMetadataParser to ensure all streams closed…
hussein-awala Nov 21, 2024
8b85536
Add REST Catalog tests to Spark 3.5 integration test (#11093)
haizhou-zhao Nov 21, 2024
721faae
Spark 3.5: Correct the two-stage parsing strategy of antlr parser (#1…
pan3793 Nov 22, 2024
03ec03f
Spark 3.4: Correct the two-stage parsing strategy of antlr parser (#7…
pan3793 Nov 22, 2024
815e384
Docs: Add new blog post to Iceberg Blogs (#11627)
ismailsimsek Nov 22, 2024
4d2452c
Docs: Mention HIVE-28121 for MySQL/MariaDB-based HMS users (#11631)
pan3793 Nov 23, 2024
d55dceb
Spark 3.4: IcebergSource extends SessionConfigSupport (#7732)
pan3793 Nov 23, 2024
e9f0ba1
Spark 3.5: IcebergSource extends SessionConfigSupport (#11624)
pan3793 Nov 23, 2024
7583238
Spark 3.3: IcebergSource extends SessionConfigSupport (#11625)
pan3793 Nov 23, 2024
ff06e72
Build: Bump testcontainers from 1.20.3 to 1.20.4 (#11640)
dependabot[bot] Nov 25, 2024
dc83b48
Build: Bump mkdocs-material from 9.5.44 to 9.5.45 (#11641)
dependabot[bot] Nov 25, 2024
e52a2a1
Spark 3.3: Correct the two-stage parsing strategy of antlr parser (#1…
pan3793 Nov 25, 2024
1db36dc
Build: Bump software.amazon.awssdk:bom from 2.29.15 to 2.29.20 (#11639)
dependabot[bot] Nov 25, 2024
8ec88db
Core,Open-API: Don't expose the `last-column-id` (#11514)
Fokko Nov 25, 2024
a7e51b4
Build: Bump com.google.errorprone:error_prone_annotations (#11638)
dependabot[bot] Nov 25, 2024
c1a432d
Flink: Add table.exec.iceberg.use-v2-sink option (#11244)
arkadius Nov 25, 2024
d150f95
Docs: Use DataFrameWriterV2 in example (#11647)
pan3793 Nov 25, 2024
3870e5c
Docs: Add `WHEN NOT MATCHED BY SOURCE` to Spark doc (#11636)
hussein-awala Nov 25, 2024
a139d6c
Build: Bump nessie from 0.100.0 to 0.100.2 (#11637)
dependabot[bot] Nov 25, 2024
aa97326
Build: Delete branch automatically on PR merge (#11635)
manuzhang Nov 26, 2024
45ec035
Flink: Test both "new" Flink Avro planned reader and "deprecated" Avr…
jbonofre Nov 26, 2024
6f73270
Spark 3.4: Add procedure to compute table stats (#11652)
jeesou Nov 27, 2024
4134bb6
Flink: Backport #11244 to Flink 1.19 (Add table.exec.iceberg.use-v2-s…
arkadius Nov 27, 2024
67950d9
Doc: Fix some Javadoc URLs. (#11666)
zhongyujiang Nov 27, 2024
ab574ea
Docs: Add blog post showing Nussknacker with Iceberg integration (#11…
arkadius Nov 27, 2024
c228d16
Kafka Connect: Add config to prefix the control consumer group (#11599)
hugofriant Nov 27, 2024
2698e01
Flink: Backport Avro planned reader (and corresponding tests) on Flin…
jbonofre Nov 27, 2024
5d1c32a
Docs: Add RisingWave (#11642)
hengm3467 Nov 28, 2024
53603ec
REST: Docker file for REST Catalog Fixture (#11283)
ajantha-bhat Nov 28, 2024
6b29e5a
Core: Propagate custom metrics reporter when table is created/replace…
nastra Nov 28, 2024
05cee15
Spark: remove ROW_POSITION from project schema (#11610)
huaxingao Nov 28, 2024
28c4ae5
Default to `overwrite` when operation is missing (#11421)
Fokko Nov 28, 2024
57be178
Core,API: Set `503: added_snapshot_id` as required (#11626)
Fokko Nov 28, 2024
91dd581
Add GitHub Action to publish the `docker-rest-fixture` container (#11…
sungwy Nov 29, 2024
0d90172
Core, GCS, Spark: Replace wrong order of assertion (#11677)
ebyhr Nov 29, 2024
beb590f
REST: Clean up `iceberg-rest-fixture` docker image naming (#11676)
ajantha-bhat Dec 1, 2024
11f2ea0
Build: Bump software.amazon.awssdk:bom from 2.29.20 to 2.29.23 (#11683)
dependabot[bot] Dec 2, 2024
ff95425
Build: Bump mkdocs-material from 9.5.45 to 9.5.46 (#11680)
dependabot[bot] Dec 2, 2024
cdb3413
REST: Use `HEAD` request to check table existence (#10999)
ebyhr Dec 2, 2024
a197495
Build: Bump jackson-bom from 2.18.1 to 2.18.2 (#11681)
dependabot[bot] Dec 2, 2024
2c78118
Build: Bump org.xerial:sqlite-jdbc from 3.47.0.0 to 3.47.1.0 (#11682)
dependabot[bot] Dec 2, 2024
d7ef1f0
Spark 3.5: Make where clause case sensitive in rewrite data files (#1…
ludlows Dec 2, 2024
45cf8b7
Spark: Remove extra columns for ColumnarBatch (#11551)
huaxingao Dec 3, 2024
d406604
Spark: Add view support to SparkSessionCatalog (#11388)
nastra Dec 3, 2024
78daca8
Core: Fix warning message for deprecated OAuth2 server URI (#11694)
ebyhr Dec 4, 2024
3300d3c
Build: Bump Parquet to 1.15.0 (#11656)
Fokko Dec 4, 2024
e68434c
Spark 3.3, 3.4: Make where clause case sensitive in rewrite data file…
ludlows Dec 4, 2024
e18c55d
Core: Generalize Util.blockLocations (#11053)
okumin Dec 4, 2024
97ceb53
Docs: Spark procedure for stats collection (#11606)
karuppayya Dec 4, 2024
8a0c301
Spark 3.5: Align RewritePositionDeleteFilesSparkAction filter with Sp…
huaxingao Dec 6, 2024
47004f0
Spark 3.5: Write DVs in Spark for V3 tables (#11561)
amogh-jahagirdar Dec 6, 2024
0403c4b
Infra: Add 1.7.1 to issue template (#11711)
bryanck Dec 6, 2024
07f3d03
Update ASF doap.rdf to Release 1.7.1 (#11712)
bryanck Dec 6, 2024
f70af7c
Spark 3.3,3.4: Align RewritePositionDeleteFilesSparkAction filter wit…
huaxingao Dec 7, 2024
3391d3d
Build: Bump software.amazon.awssdk:bom from 2.29.23 to 2.29.29 (#11723)
dependabot[bot] Dec 9, 2024
2d49363
Build: Bump com.google.cloud:libraries-bom from 26.50.0 to 26.51.0 (#…
dependabot[bot] Dec 9, 2024
cea11cc
Build: Bump mkdocs-material from 9.5.46 to 9.5.47 (#11726)
dependabot[bot] Dec 9, 2024
ace1ce1
Add C++ to the list of languages in `doap.rdf` (#11714)
Fokko Dec 9, 2024
8dc8996
Build: Bump com.azure:azure-sdk-bom from 1.2.29 to 1.2.30 (#11725)
dependabot[bot] Dec 9, 2024
d040408
Add `curl` to the `iceberg-rest-fixture` Docker image (#11705)
dominikhei Dec 9, 2024
cffdb7e
Build: Bump nessie from 0.100.2 to 0.101.0 (#11722)
dependabot[bot] Dec 9, 2024
5d7363c
docs: 1.7.1 Release notes (#11717)
bryanck Dec 9, 2024
2d54279
Docs: Add guidelines for contributors to become committers (#11670)
rdblue Dec 9, 2024
b8f164f
Core, Flink, Spark: Drop deprecated APIs scheduled for removal in 1.8…
findepi Dec 10, 2024
6dac99b
Flink: Fix range distribution npe when value is null (#11662)
Guosmilesmile Dec 10, 2024
bd6049d
AWS: Enable RetryMode for AWS KMS client (#11420)
hsiang-c Dec 11, 2024
4548bd7
Core, Flink, Spark, KafkaConnect: Remove usage of deprecated path API…
amogh-jahagirdar Dec 11, 2024
534f99e
Infra: Build Iceberg REST fixture docker image for `arm64` architectu…
Fokko Dec 11, 2024
37e4ed2
Added custom sorting with partition column order or user supplied fun…
zachdisc Feb 15, 2024
1659e49
Removed function-based manifest rewrite
zachdisc Dec 23, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
*/
package org.apache.iceberg.actions;

import java.util.List;
import java.util.function.Predicate;
import org.apache.iceberg.ManifestFile;

Expand All @@ -44,6 +45,20 @@ public interface RewriteManifests
*/
RewriteManifests rewriteIf(Predicate<ManifestFile> predicate);

/**
* Rewrite manifests in a given order, based on partition transforms
*
* <p>If not set, manifests will be rewritten in the order of the transforms in the table's
* current partition spec.
*
* @param partitionSortOrder a list of partition field transform names
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
* @return this for method chaining
*/
default RewriteManifests sort(List<String> partitionSortOrder) {
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
throw new UnsupportedOperationException(
this.getClass().getName() + " doesn't implement sort(List<String>)");
}

/**
* Passes a location where the staged manifests should be written.
*
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,10 +19,12 @@
package org.apache.iceberg.spark.actions;

import static org.apache.iceberg.MetadataTableType.ENTRIES;
import static org.apache.spark.sql.functions.col;

import java.io.Serializable;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.UUID;
import java.util.function.Function;
import java.util.function.Predicate;
Expand All @@ -37,6 +39,7 @@
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ManifestWriter;
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Partitioning;
import org.apache.iceberg.RollingManifestWriter;
Expand Down Expand Up @@ -70,6 +73,7 @@
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.StructType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
Expand All @@ -90,13 +94,13 @@ public class RewriteManifestsSparkAction
public static final String USE_CACHING = "use-caching";
public static final boolean USE_CACHING_DEFAULT = false;

private List<String> partitionSortColumns = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we follow public static final -> private static final -> private final -> private to order variables

Copy link
Author

@zachdisc zachdisc Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a problem? I have these attributes as being set to null but able to be redefined. So I think private is correct here, its inline with the other attributes in this class. I do have this attribute in the wrong place though, moving in next rev.

private static final Logger LOG = LoggerFactory.getLogger(RewriteManifestsSparkAction.class);
private static final RewriteManifests.Result EMPTY_RESULT =
ImmutableRewriteManifests.Result.builder()
.rewrittenManifests(ImmutableList.of())
.addedManifests(ImmutableList.of())
.build();

zachdisc marked this conversation as resolved.
Show resolved Hide resolved
private final Table table;
private final int formatVersion;
private final long targetManifestSizeBytes;
Expand Down Expand Up @@ -160,6 +164,34 @@ public RewriteManifestsSparkAction stagingLocation(String newStagingLocation) {
return this;
}

/**
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
* Supply an optional set of partition transform column names to sort the rewritten manifests by.
* Expects exact transformed column names used for partitioning; not the raw column names used for
* partitioning. E.G. supply 'data_bucket' and not 'data' for a partition defined as bucket(N,
* data)
*
* @param partitionSortOrder - Order of partitions to sort manifests by
* @return this action
*/
@Override
public RewriteManifestsSparkAction sort(List<String> partitionSortOrder) {
// Collect set of allowable partition columns to sort on
Set<String> availablePartitionNames =
spec.fields().stream().map(PartitionField::name).collect(Collectors.toSet());

// Check if these partitions are included in the spec
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
Preconditions.checkArgument(
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
partitionSortOrder.stream().allMatch(availablePartitionNames::contains),
"Cannot use custom sort order to rewrite manifests '%s'. All partition columns must be "
+ "defined in the current partition spec: %s. Choose from the available partitionable columns: %s",
partitionSortColumns,
this.spec.specId(),
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
availablePartitionNames);
this.partitionSortColumns = partitionSortOrder;

return this;
}

@Override
public RewriteManifests.Result execute() {
String desc = String.format("Rewriting manifests in %s", table.name());
Expand Down Expand Up @@ -250,12 +282,40 @@ private List<ManifestFile> writeUnpartitionedManifests(
private List<ManifestFile> writePartitionedManifests(
ManifestContent content, Dataset<Row> manifestEntryDF, int numManifests) {

// Extract desired clustering/sorting criteria into a dedicated column
Dataset<Row> clusteredManifestEntryDF;
String clusteringColumnName = "__clustering_column__";
zachdisc marked this conversation as resolved.
Show resolved Hide resolved

if (partitionSortColumns != null) {
LOG.info(
"Sorting manifests for specId {} by partition columns in order of {} ",
spec.specId(),
partitionSortColumns);

// Map the top level partition column names to the column name referenced within the manifest
// entry dataframe
Column[] actualPartitionColumns =
partitionSortColumns.stream()
.map(p -> col("data_file.partition." + p))
.toArray(Column[]::new);

// Form a new temporary column to sort/cluster manifests on, based on the custom sort
// order provided
clusteredManifestEntryDF =
manifestEntryDF.withColumn(
clusteringColumnName, functions.struct(actualPartitionColumns));
} else {
clusteredManifestEntryDF =
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
manifestEntryDF.withColumn(clusteringColumnName, col("data_file.partition"));
}

return withReusableDS(
manifestEntryDF,
clusteredManifestEntryDF,
df -> {
WriteManifests<?> writeFunc = newWriteManifestsFunc(content, df.schema());
Column partitionColumn = df.col("data_file.partition");
Dataset<Row> transformedDF = repartitionAndSort(df, partitionColumn, numManifests);
Column partitionColumn = df.col(clusteringColumnName);
Dataset<Row> transformedDF =
repartitionAndSort(df, partitionColumn, numManifests).drop(clusteringColumnName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The column here is more used for repartitioning, rather than sorting. Sorting only happens for rows with the same clustering column value. I am starting to think if we should call the API repartition(List<String> partitionFields), rather than sort(List<String> partitionFields).

Copy link
Author

@zachdisc zachdisc Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to make that change, but IMHO the API is indicating how the entire forest of manifests is sorted, not necessarily how each tree is sorted. Like, its sorting the manifest_list entries in the end, while the manifest_files are sorted further within that. Thoughts?

Though, I suppose it isn't really sorting the whole forest, it is clustering.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra @jackye1995 thoughts on a general rename to cluster() for this api action?

return writeFunc.apply(transformedDF).collectAsList();
});
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -34,9 +34,12 @@
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.UUID;
import java.util.stream.Collectors;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
Expand Down Expand Up @@ -65,6 +68,7 @@
import org.apache.iceberg.exceptions.CommitStateUnknownException;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
Expand All @@ -73,6 +77,7 @@
import org.apache.iceberg.spark.SparkWriteOptions;
import org.apache.iceberg.spark.TestBase;
import org.apache.iceberg.spark.source.ThreeColumnRecord;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.CharSequenceSet;
import org.apache.iceberg.util.Pair;
Expand Down Expand Up @@ -466,6 +471,163 @@ public void testRewriteLargeManifestsPartitionedTable() throws IOException {
assertThat(newManifests).hasSizeGreaterThanOrEqualTo(2);
}

@TestTemplate
public void testRewriteManifestsPartitionedTableWithInvalidSortColumnsThowsException()
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
throws IOException {
PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).identity("c1").bucket("c3", 10).build();
Map<String, String> options = Maps.newHashMap();
options.put(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
options.put(TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED, snapshotIdInheritanceEnabled);
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
Table table = TABLES.create(SCHEMA, spec, options, tableLocation);

SparkActions actions = SparkActions.get();

// c2 is not a partition column, cannot use for sorting
List<String> badSortKeys1 = ImmutableList.of("c1", "c2");
assertThatThrownBy(
() ->
actions
.rewriteManifests(table)
.rewriteIf(manifest -> true)
.sort(badSortKeys1)
.execute())
.isInstanceOf(IllegalArgumentException.class)
.message()
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
.contains("Cannot use custom sort order");

// c3_bucket is the correct internal partition name to use, c3 is the untransformed colum name,
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
// sort() expects
// the hidden partition column names
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
List<String> badSortKeys2 = ImmutableList.of("c1", "c3");
assertThatThrownBy(
() ->
actions
.rewriteManifests(table)
.rewriteIf(manifest -> true)
.sort(badSortKeys2)
.execute())
.isInstanceOf(IllegalArgumentException.class)
.message()
.contains("Cannot use custom sort order");
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
}

@TestTemplate
public void testRewriteManifestsPartitionedTableWithCustomSorting() throws IOException {
Random random = new Random();

PartitionSpec spec =
PartitionSpec.builderFor(SCHEMA).identity("c1").truncate("c2", 3).bucket("c3", 10).build();
Map<String, String> options = Maps.newHashMap();
options.put(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
options.put(TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED, snapshotIdInheritanceEnabled);
Table table = TABLES.create(SCHEMA, spec, options, tableLocation);

List<DataFile> dataFiles = Lists.newArrayList();
for (int fileOrdinal = 0; fileOrdinal < 1000; fileOrdinal++) {
dataFiles.add(
newDataFile(
table,
TestHelpers.Row.of(
new Object[] {
fileOrdinal, String.valueOf(random.nextInt() * 100), random.nextInt(10)
})));
}
ManifestFile appendManifest = writeManifest(table, dataFiles);
table.newFastAppend().appendManifest(appendManifest).commit();

List<ManifestFile> manifests = table.currentSnapshot().allManifests(table.io());
assertThat(manifests).as("Should have 1 manifests before rewrite").hasSize(1);

// Capture the c3 partition's lower and upper bounds - used for later test assertions
Integer c3PartitionMin =
Conversions.fromByteBuffer(
Types.IntegerType.get(), manifests.get(0).partitions().get(2).lowerBound());
Integer c3PartitionMax =
Conversions.fromByteBuffer(
Types.IntegerType.get(), manifests.get(0).partitions().get(2).upperBound());

// Set the target manifest size to a small value to force splitting records into multiple files
table
.updateProperties()
.set(
TableProperties.MANIFEST_TARGET_SIZE_BYTES,
String.valueOf(manifests.get(0).length() / 2))
.commit();

SparkActions actions = SparkActions.get();

List<String> manifestSortKeys = ImmutableList.of("c3_bucket", "c2_trunc", "c1");
RewriteManifests.Result result =
actions
.rewriteManifests(table)
.rewriteIf(manifest -> true)
.sort(manifestSortKeys)
.option(RewriteManifestsSparkAction.USE_CACHING, useCaching)
.execute();

table.refresh();
List<ManifestFile> newManifests = table.currentSnapshot().allManifests(table.io());

assertThat(result.rewrittenManifests()).hasSize(1);
assertThat(result.addedManifests()).hasSizeGreaterThanOrEqualTo(2);

assertThat(newManifests).hasSizeGreaterThanOrEqualTo(2);

// Rewritten manifests are clustered by c3_bucket - each should contain only a subset of the
// lower and upper bounds
// of the partition 'c3'.
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
List<Pair<Integer, Integer>> c3Boundaries =
newManifests.stream()
.map(manifest -> manifest.partitions().get(2))
.sorted(
Comparator.comparing(
ptn -> Conversions.fromByteBuffer(Types.IntegerType.get(), ptn.lowerBound())))
.map(
p ->
Pair.of(
(Integer)
Conversions.fromByteBuffer(Types.IntegerType.get(), p.lowerBound()),
(Integer)
Conversions.fromByteBuffer(Types.IntegerType.get(), p.upperBound())))
.collect(Collectors.toList());

List<Integer> lowers = c3Boundaries.stream().map(t -> t.first()).collect(Collectors.toList());
List<Integer> uppers = c3Boundaries.stream().map(t -> t.second()).collect(Collectors.toList());

// With custom sorting, this looks like
// - manifest 1 -> [lower bound = 0, upper bound = 4]
// - manifest 2 -> [lower bound = 4, upper bound = 9]
// Without the custom sorting, each manifest tracks the full range of c3 upper/lower bounds.
// AKA they look like
// - manifest 1 -> [lower bound = 0, upper bound = 9]
// - manifest 2 -> [lower bound = 0, upper bound = 9]
// So the upper bound of the partitions tracked by the first file should be LEQ the lower bounds
// of the second. Etc
assertThat(uppers.get(0))
.withFailMessage(
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
"Upper bound of first manifest partition should be LEQ lower bound of second")
.isLessThanOrEqualTo(lowers.get(1));

// Each file should contain less than the full c3 partition span
c3Boundaries.forEach(
boundary -> {
assertThat(boundary.second() - boundary.first())
.withFailMessage(
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
"Manifest should contain less than the full range of c3 bucket partitions")
.isLessThanOrEqualTo(c3PartitionMax - c3PartitionMin);
});

// c3's Bucket(10) partition means our true lower bound = 0 and true upper bound is 9. The first
// manifest should
// include the lower bound of 0, and the last should have the upper bound of 9
assertThat(lowers.get(0))
.withFailMessage("Lower bound of first manifest partition should be 0")
zachdisc marked this conversation as resolved.
Show resolved Hide resolved
.isEqualTo(c3PartitionMin);
assertThat(uppers.get(uppers.size() - 1))
.withFailMessage("Lower bound of first manifest partition should be 0")
.isEqualTo(c3PartitionMax);
}

@TestTemplate
public void testRewriteManifestsWithPredicate() throws IOException {
PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).identity("c1").truncate("c2", 2).build();
Expand Down