Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement own df_n_col() to avoid calling ncol() (and dim()) #7049

Merged
merged 2 commits into from
Aug 5, 2024

Conversation

krlmlr
Copy link
Member

@krlmlr krlmlr commented Jul 8, 2024

This leads to premature materialization in union_all() and others with duckplyr 0.4.0 . I'll have to work around by vendoring everything that uses ncol() directly or indirectly into duckdb.

@DavisVaughan
Copy link
Member

DavisVaughan commented Jul 10, 2024

@krlmlr I've introduced our own version of df_n_col() and blocked usage of base ncol() to try and prevent this from happening in the future.

We do still need to call base::ncol() on a matrix in a single case (that's why your commit had test failures), so I also added in mat_n_col() as well to allow us to use that on matrices as required.


For future us:

ncol() was problematic because it is implemented as dim(x)[1], but calling dim() requires knowing the number of rows (retrieved through row names info, see below), which duckplyr does not know until it fully materializes the query:

> dim.data.frame
function (x) 
c(.row_names_info(x, 2L), length(x))

duckplyr does know the length of the data frame though, so we can utilize that.

df_n_col() asserts that if x inherits from "data.frame", then it must be built on a VECSXP where the length of that VECSXP corresponds to the number of columns in the data frame. This is the same assertion/assumption that vctrs makes.

Note that dim() is generic so in theory data frame subclasses could do weird things with it, but I think the assertion above is strong enough that that should not matter. (and if a subclass did make the number of columns returned by dim() inconsistent with the underlying VECSXP length, it would likely result in breaking all of the vctrs algorithms that work on data frames, so I feel good about this)

@DavisVaughan DavisVaughan requested a review from lionel- July 10, 2024 08:15
@DavisVaughan DavisVaughan changed the title Implement own ncol() to avoid calling dim() Implement own df_n_col() to avoid calling ncol() (and dim()) Jul 10, 2024
@krlmlr
Copy link
Member Author

krlmlr commented Jul 10, 2024

Thanks for taking this on, Davis. I love the "for future us" bit!

Copy link
Member Author

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor omission.

R/data-storms.R Show resolved Hide resolved
Copy link
Member

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG!

@DavisVaughan DavisVaughan merged commit bffdfb1 into main Aug 5, 2024
13 checks passed
@DavisVaughan DavisVaughan deleted the b-ncol branch August 5, 2024 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants