Core: Support aggregated basic stats in partition summary #11669

Closed
wants to merge 1 commit into from


deniskuzZ
Member

@deniskuzZ deniskuzZ commented Nov 27, 2024

@github-actions github-actions bot added the core label Nov 27, 2024
@pvary
Contributor

pvary commented Nov 27, 2024

@deniskuzZ: Could you please provide a short description of what data is stored in the summary and in what format?

I think it is important to understand the cost of keeping this stat up to date. How costly is it to calculate, and how much does this change increase the data size?

@findepi: Could this be useful for Trino? Does Trino have an optimization like this?

@pvary
Contributor

pvary commented Nov 27, 2024

This discussion could be relevant here too: https://lists.apache.org/thread/0q1csnkfg8jc11zo1dlssjkr4v8s8zz0

@deniskuzZ
Member Author

@pvary, unfortunately, that won't work. I was looking for an easy way to get basic partition stats, but I missed that Iceberg only keeps the changed partitions in a SnapshotSummary. Aggregating with just the previous snapshot's values is not enough; it requires looping through all the snapshots.

table.newFastAppend().appendFile(FILE_A).commit();
partitions.data_bucket=0 -> added-data-files=1,added-records=1,added-files-size=10,total-records=3,total-files-size=30,total-data-files=3,total-delete-files=0,total-position-deletes=0,total-equality-deletes=0

table.newFastAppend().appendFile(FILE_B).commit();
partitions.data_bucket=1 -> added-data-files=1,added-records=1,added-files-size=10,total-records=2,total-files-size=20,total-data-files=2,total-delete-files=0,total-position-deletes=0,total-equality-deletes=0

table.newFastAppend().appendFile(FILE_A).commit();
partitions.data_bucket=0 -> added-data-files=1,added-records=1,added-files-size=10,total-records=3,total-files-size=30,total-data-files=3,total-delete-files=0,total-position-deletes=0,total-equality-deletes=0

Do you think it's worth doing this in SnapshotSummary, or is there a simpler/better way, such as creating or updating the partition stats Puffin file right after the commit?
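
To illustrate why the previous snapshot alone is not enough, here is a minimal sketch of the loop over all snapshots. It assumes each snapshot summary is available as a plain `Map<String, String>` and that per-partition entries use the `partitions.` key prefix with comma-separated `key=value` values, as in the output above. `PartitionSummaryWalk`, `parseStats`, and `latestPartitionStats` are hypothetical names for illustration, not Iceberg API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Iceberg): rebuilds per-partition stats
// from a list of snapshot summaries ordered oldest to newest.
final class PartitionSummaryWalk {
    // Assumed key prefix for per-partition entries, as in the output above.
    static final String PARTITION_PREFIX = "partitions.";

    // Parses "k1=v1,k2=v2" into an insertion-ordered map.
    static Map<String, String> parseStats(String value) {
        Map<String, String> stats = new LinkedHashMap<>();
        for (String pair : value.split(",")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                stats.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return stats;
    }

    // Each summary lists only the partitions changed by that snapshot, so
    // every snapshot must be visited; a later entry for the same partition
    // overwrites the earlier one.
    static Map<String, Map<String, String>> latestPartitionStats(
            List<Map<String, String>> summariesOldestFirst) {
        Map<String, Map<String, String>> latest = new LinkedHashMap<>();
        for (Map<String, String> summary : summariesOldestFirst) {
            for (Map.Entry<String, String> entry : summary.entrySet()) {
                if (entry.getKey().startsWith(PARTITION_PREFIX)) {
                    String partition = entry.getKey().substring(PARTITION_PREFIX.length());
                    latest.put(partition, parseStats(entry.getValue()));
                }
            }
        }
        return latest;
    }
}
```

With the three commits above, the last summary only mentions `data_bucket=0`, so the stats for `data_bucket=1` can only be recovered from the second snapshot. Against a real table the input would presumably be built from `table.snapshots()` and `Snapshot.summary()`, and any partition whose last change sits in an expired snapshot would be missing from the result.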

@deniskuzZ deniskuzZ closed this Nov 28, 2024
@deniskuzZ
Member Author

I found the partition stats tracker issue #8450, with the following design doc: https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk
But it doesn't seem to be completed yet: #11216

@pvary
Contributor

pvary commented Nov 28, 2024

And here is the relevant mailing list thread: https://lists.apache.org/thread/knl1ol7s1o2p7rglgl2mm8c5mc2pk0sx

@ajantha-bhat: Are you still working on the proposal?

@ajantha-bhat
Member

Yes, it is still active, but it is not getting enough reviews.
I am finding it very hard to get reviews.

#11216 is the last PR that is needed for the functionality to work.

@ajantha-bhat
Member

@deniskuzZ: Could you please comment on my last PR that this feature would be helpful for Hive and that you are looking for it?
It might help get more attention for review.

@jbonofre
Member

Should we reopen this PR, or is it superseded by another one?

4 participants