Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need non-aggregating versions of sum(), min(), and max() #90

Closed
krlmlr opened this issue May 5, 2024 · 11 comments
Closed

Need non-aggregating versions of sum(), min(), and max() #90

krlmlr opened this issue May 5, 2024 · 11 comments
Assignees

Comments

@krlmlr
Copy link
Collaborator

krlmlr commented May 5, 2024

With duckdb v0.10.2:

duckdb <- asNamespace("duckdb")
drv <- duckdb::duckdb()
con <- DBI::dbConnect(drv)
experimental <- FALSE
invisible(duckdb$rapi_load_rfuns(drv@database_ref))
df1 <- data.frame(a = 1)

rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
  rel1,
  list(
    {
      tmp_expr <- duckdb$expr_reference("a")
      duckdb$expr_set_alias(tmp_expr, "a")
      tmp_expr
    },
    {
      tmp_expr <- duckdb$expr_function("r_base::max", list(duckdb$expr_reference("a")))
      duckdb$expr_set_alias(tmp_expr, "b")
      tmp_expr
    }
  )
)
#> Error: {"exception_type":"Binder","exception_message":"Aggregates cannot be present in a Project relation!"}

Created on 2024-05-05 with reprex v2.1.0

@romainfrancois
Copy link
Collaborator

I don't understand what they should be doing then ?

@krlmlr
Copy link
Collaborator Author

krlmlr commented May 11, 2024

The same as in dplyr, as usual 🙃

data.frame(a = 1:3) |>
  dplyr::mutate(b = sum(a))
#>   a b
#> 1 1 6
#> 2 2 6
#> 3 3 6

Created on 2024-05-11 with reprex v2.1.0

@krlmlr
Copy link
Collaborator Author

krlmlr commented May 11, 2024

Since this error happens at the binding stage, it's not critical for correctness, but might be for performance.

@romainfrancois
Copy link
Collaborator

I see, so do the aggregate and recycle the value. I'll look into it.

@krlmlr
Copy link
Collaborator Author

krlmlr commented May 13, 2024

Thanks, Romain. #89 and the associated PR look more important at this stage.

@romainfrancois
Copy link
Collaborator

Having looked at how duckplyr handles e.g. the regular sum, this feels more like a translation problem and we need to squeeze in a duckdb$expr_window so that the duckdb expression gains a OVER ()

library(duckdbrfuns)

con <- duckdbrfuns:::duckdb_con()
duckdb <- asNamespace("duckdb")

experimental <- FALSE
df1 <- data.frame(a = 1:10)

rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
  rel1,
  list(
    {
      tmp_expr <- duckdb$expr_reference("a")
      duckdb$expr_set_alias(tmp_expr, "a")
      tmp_expr
    },
    {
      tmp_expr <- duckdb$expr_window(
        duckdb$expr_function("r_base::sum", list(duckdb$expr_reference("a")))
      )
      duckdb$expr_set_alias(tmp_expr, "b")
      tmp_expr
    }
  )
)
print(rel2)
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [a as a, r_base::sum(a) OVER () as b]
#>   r_dataframe_scan(0x10de98358)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)
#> - b (INTEGER)
duckdb$rel_to_altrep(rel2)
#>     a  b
#> 1   1 55
#> 2   2 55
#> 3   3 55
#> 4   4 55
#> 5   5 55
#> 6   6 55
#> 7   7 55
#> 8   8 55
#> 9   9 55
#> 10 10 55

Created on 2024-05-21 with reprex v2.1.0

IIUC duckplyr bases the need for wrapping in a expr_window based on an allow list:

https://github.com/romainfrancois/duckplyr/blob/7bc1ccb925ef742cf7805c9391fec19b1233123c/R/relational.R#L242-L252

        known_window <- c(
          # Window functions
          "rank", "rank_dense", "dense_rank", "percent_rank",
          "row_number", "first_value", "last_value", "nth_value",
          "cume_dist", "lead", "lag", "ntile",

          # Aggregates
          "sum", "mean", "stddev", "min", "max", "median",
          #
          NULL
        )

Perhaps instead of (in addition to?) that the r_base:: functions could mark that they are aggregates ? i.e. use r_base::aggregate::sum instead of r_base::sum so that then duckplyr can take advantage of that naming convention to squeeze in a expr_window.

@krlmlr
Copy link
Collaborator Author

krlmlr commented May 21, 2024

Do we even need the "project" versions of sum() then? Does your reprex work with "r_base::aggregate::sum" instead of "r_base::sum" ?

Less code in the extension means less complexity. Sorry about the extra work.

@romainfrancois
Copy link
Collaborator

Yeah exactly, we only need the function that aggregate in the extension, as in #95

This needs then a little hand holding in duckplyr: tidyverse/duckplyr#171

@romainfrancois
Copy link
Collaborator

Perhaps we don't even need the naming convention, and we can somehow interrogate the extension to find out if a function is a ScalarFunctionSet or a AggregateFunctionset but I'm not sure it's worth it.

@romainfrancois
Copy link
Collaborator

Indeed the reprex needs r_base::aggregate::sum. Here's a simplified as we don't need the alias:

library(duckdbrfuns)

con <- duckdbrfuns:::duckdb_con()
duckdb <- asNamespace("duckdb")

experimental <- FALSE
df1 <- data.frame(a = 1:10)

rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
  rel1,
  list(
    duckdb$expr_reference("a"),
    duckdb$expr_window(
      duckdb$expr_function("r_base::aggregate::sum", list(duckdb$expr_reference("a")))
    )
  )
)
duckdb$rel_to_altrep(rel2)
#>     a r_base::aggregate::sum(a) OVER ()
#> 1   1                                55
#> 2   2                                55
#> 3   3                                55
#> 4   4                                55
#> 5   5                                55
#> 6   6                                55
#> 7   7                                55
#> 8   8                                55
#> 9   9                                55
#> 10 10                                55

Created on 2024-05-21 with reprex v2.1.0

@krlmlr
Copy link
Collaborator Author

krlmlr commented May 22, 2024

Thanks, confirmed with most recent duckdb, without having to change the names:

library(duckdb)
#> Loading required package: DBI
drv <- duckdb()
con <- dbConnect(drv)
duckdb <- asNamespace("duckdb")

duckdb$rapi_load_rfuns(drv@database_ref)

experimental <- FALSE
df1 <- data.frame(a = 1:3)

rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
  rel1,
  list(
    duckdb$expr_reference("a"),
    duckdb$expr_window(
      duckdb$expr_function("r_base::sum", list(duckdb$expr_reference("a")))
    )
  )
)
duckdb$rel_to_altrep(rel2)
#>   a r_base::sum(a) OVER ()
#> 1 1                      6
#> 2 2                      6
#> 3 3                      6

Created on 2024-05-22 with reprex v2.1.0

Looks like this issue is moot. I need to trace it back to the original duckplyr problem.

@krlmlr krlmlr closed this as completed May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants