-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: Comparison expressions are translated in a way that allows them to be pushed down to Parquet #270
Conversation
duckdb 1.1.1 is on CRAN now, and has the necessary additions. Would you like to take another stab?
|
Thanks! I'll take another stab with these comments in mind. |
ec5b2af
to
7799d7c
Compare
Thanks. At some point, I can take over here, or we can do another round. Does the current implementation achieve the desired pushdown to Parquet? |
I can continue working on it if you can provide pointers? But I don't mind you making commits either. Thanks. The implementation pushes down filters to parquet. Below is reprex of the code given in the issue. A comparison expression is currently created only for object that share the same class or if it's a case of integer vs. numeric. I believe class is a better point of comparison than types, but I had some issues with objects that have multiple classes (i.e library(duckdb)
#> Loading required package: DBI
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(duckplyr)
#> The duckplyr package is configured to fall back to dplyr when it encounters an
#> incompatibility. Fallback events can be collected and uploaded for analysis to
#> guide future development. By default, no data will be collected or uploaded.
#> → Run `duckplyr::fallback_sitrep()` to review the current settings.
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.
#>
#> Attaching package: 'duckplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
con_dp <- duckplyr:::get_default_duckdb_connection()
dbSendQuery(con_dp, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result ed430 connection=a8ab0 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(con_dp, "SET s3_region='auto';SET s3_endpoint='';")
#> <duckdb_result 25a40 connection=a8ab0 statement='SET s3_region='auto';SET s3_endpoint='';'>
f_duckplyr <- function() {
duckplyr::duckplyr_df_from_file(
"s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet",
"read_parquet",
options = list(hive_partitioning = TRUE),
class = class(tibble())
) |>
filter(total_amount > 0) |>
filter(!is.na(passenger_count)) |>
mutate(tip_pct = 100 * tip_amount / total_amount) |>
summarise(
avg_tip_pct = median(tip_pct),
n = n(),
.by = passenger_count
) |>
arrange(desc(passenger_count))
}
explain(f_duckplyr())
#> ┌───────────────────────────┐
#> │ ORDER_BY │
#> │ ──────────────────── │
#> │ read_parquet │
#> │ .passenger_count DESC │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ passenger_count │
#> │ avg_tip_pct │
#> │ n │
#> │ │
#> │ ~1762348 Rows │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ HASH_GROUP_BY │
#> │ ──────────────────── │
#> │ Groups: #0 │
#> │ │
#> │ Aggregates: │
#> │ median(#1) │
#> │ count_star() │
#> │ │
#> │ ~1762348 Rows │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ passenger_count │
#> │ tip_pct │
#> │ │
#> │ ~3524697 Rows │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ passenger_count │
#> │ tip_pct │
#> │ │
#> │ ~3524697 Rows │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ──────────────────── │
#> │ passenger_count │
#> │ tip_amount │
#> │ total_amount │
#> │ │
#> │ ~3524697 Rows │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ FILTER │
#> │ ──────────────────── │
#> │ (NOT ((passenger_count IS │
#> │ NULL) OR isnan(CAST │
#> │ (passenger_count AS DOUBLE│
#> │ )))) │
#> │ │
#> │ ~3524697 Rows │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ READ_PARQUET │
#> │ ──────────────────── │
#> │ Function: │
#> │ READ_PARQUET │
#> │ │
#> │ Projections: │
#> │ total_amount │
#> │ passenger_count │
#> │ tip_amount │
#> │ │
#> │ Filters: │
#> │ total_amount>0.0 AND │
#> │ total_amount IS NOT NULL │
#> │ │
#> │ ~17623488 Rows │
#> └───────────────────────────┘ |
9c74416
to
bfd1055
Compare
Let's see if this has an effect on the continuous benchmarks. |
This is how benchmark results would change (along with a 95% confidence interval in relative change) if bfd1055 is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation. |
bfd1055
to
e8a5627
Compare
I'm somewhat surprised to see that the timings aren't improving for the sf=1 runs (100_*). I'll work on making the plan available as JSON so that we can compare before-after at the level of the plan. |
This is how benchmark results would change (along with a 95% confidence interval in relative change) if e8a5627 is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation. |
Any filters using equality operators were not pushed down due to a rather obvious bug I had missed. I pushed a commit fixing it. The plans as JSON might come in handy anyway. |
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 9c7d533 is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation. |
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 6afb054 is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation. |
6afb054
to
f0ba24c
Compare
5c3d091
to
7509e9f
Compare
This is how benchmark results would change (along with a 95% confidence interval in relative change) if beb517b is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation. |
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 6ae7c03 is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation. |
With 8b9de6e, the code could be simplified a bit. Ready to merge otherwise, thanks a lot! |
37d1ea1
to
8d5b06f
Compare
8d5b06f
to
28aa684
Compare
Thanks! |
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 28aa684 is merged into main:
Further explanation regarding interpretation and methodology can be found in the documentation. |
Related to #172
Sketch of using introducing the use comparison expressions in duckplyr. See related PR in duckdb-r duckdb/duckdb-r#457