Possible regression from joins refactor in 2.3.0 #1346

MichaelChirico · 2023-08-07T16:49:10Z

I'm having a bear of a time trying to debug this issue that's arising when updating 2.2.1 -> 2.3.3. Genuinely hard to disentangle where the issue's coming from since there's so many layers where things may have gone wrong.

The query in the test being broken is pretty simple: inner_join() on some 3-row input tables:

personal_data <- tibble(
  name = c("Alice", "Nick", "Bob"),
  age = c(32, 25, 44)
)

employee_data <- tibble(
  name = c("Alice", "Bob", "Michael"),
  employee_id = c(1, 2, 3)
)

# Copy these tables to DB backend
tables <- list(personal_data, employee_data)
table_names <- paste0("r_tmp_", seq_along(tables))

db_conn <- withr::local_db_connection(MakeTestDBIConnection())
db_names <- lapply(
  table_names,
  \(table_name) DBI::Id(namespace = "datascape", table_name = table_name)
)
db_tables <- mapply(dplyr::copy_to,
  df = tables, name = db_names,
  MoreArgs = list(dest = db_conn, temporary = FALSE),
  SIMPLIFY = FALSE
)

dplyr::inner_join(db_tables[[1]], db_tables[[2]], by = "name")

This test (which compares this join's output to the local dplyr equivalent) works as expected on 2.2.1 but breaks on 2.3.3:

Error in `dplyr::collect(x)`: Failed to collect lazy table.
Caused by error in `doTryCatch()`:
! INVALID_ARGUMENT: SQL_ANALYSIS_ERROR: Syntax error: Expected end of input but got "." [at 2:46]
FROM `datascape`.`r_tmp_1` AS `\`datascape\``.`\`r_tmp_1\``

The issue is the already-escaped name `datascape`.`r_tmp_1` is re-escaped unsuccessfully.

Poking around in debugging I'm not able to tell what went wrong. It's possible our own connection methods are doing something unexpected, for example.

Just one observation:

Here, IIUC, we should respect the pre-escaped nature of the input when constructing by$x_as:

dbplyr/R/lazy-join-query.R

Lines 171 to 178 in 5fa4410

    
           op$joins$by <- purrr::map2( 
        
             op$joins$by, seq_along(op$joins$by), 
        
             function(by, i) { 
        
               by$x_as <- table_names_out[op$joins$by_x_table_id[[i]]] 
        
               by$y_as <- table_names_out[i + 1L] 
        
               by 
        
             } 
        
           )

Debugging, I see this around that step:

dput(op$joins$by)
# list(list(
#   x = structure("name", class = c("ident", "character")),
#   y = structure("name", class = c("ident", "character")),
#   condition = "==", 
#   on = structure(character(0), class = c("sql", "character")),
#   na_matches = "never"
# ))
dput(table_names_out)
# c("`datascape`.`r_tmp_1`", "`datascape`.`r_tmp_2`")
dput(op$joins$by_x_table_id)
# list(1L)

# but also
dput(op$x$x)
# structure("`datascape`.`r_tmp_1`", class = c("ident_q", "ident", "character"))
dput(op$joins$table)
# list(structure("`datascape`.`r_tmp_2`", class = c("ident_q", "ident", "character")))

Perhaps table_names_out should be ident_q at this step, but even if so, it would have the same result:

identical(
  ident(table_names_out[1L]),
  ident(ident_q(table_names_out[1L]))
)
# [1] TRUE

Should ident() have an escape for "ident" input? And then we should make sure table_names_out reflects the same ident_q class as the input x$x and joins$table?

Maybe I'm barking up the wrong tree.

The text was updated successfully, but these errors were encountered:

mgirlich · 2023-08-08T05:42:34Z

I tried this with SQLite and the current dev version and it works (with some minor changes: DBI::Id(namespace = "datascape", table_name = table_name) needs to be replaced by DBI::Id(schema = "datascape", table = table_name)).

There are a couple of difficulties regarding different ways to specify a table identifier (ident(), ident_q(), in_schema(), in_catalog(), DBI::Id()). This got quite complicated and messy, so they are now (internally) all converted to one common class. This is also where the requirement to change the names in DBI::Id() to schema and name comes from.

Can you try this out with the current dev version?

MichaelChirico · 2023-08-08T07:27:36Z

Thanks for the tips. Working on using the updated version now.

Initial observations:

Just using the changes to DBI::Id() in 2.3.3 were not enough. Now trying at 5fa4410.
This test fails for us. Looks like a timezone thing:

dbplyr/tests/testthat/test-backend-mysql.R

Line 28 in 6a43c63

    
           expect_equal(escape(as.POSIXct("2020-01-01"), con = con), sql("'2020-01-01 00:00:00'"))

escape(as.POSIXct("2020-01-01"), con = con) (`actual`) not equal to sql("'2020-01-01 00:00:00'") (`expected`).

`actual`:   "<SQL> '2020-01-01 08:00:00'"
`expected`: "<SQL> '2020-01-01 00:00:00'"

mgirlich · 2023-08-08T07:30:19Z

Yes, the 2nd observation is missing to explicitly set a timezone. This was done in other tests but missed there...

mgirlich · 2023-08-08T07:41:26Z

The 2nd issue should be fixed now.

MichaelChirico · 2023-08-08T07:43:07Z

Next up:

Error in `UseMethod("sql_render")`: no applicable method for 'sql_render' applied to an object of class "c('sql', 'character')"

I see sql_render.sql was deleted from R/sql-build.R, but I don't see any mention of this in the NEWS, so not sure how to migrate.

NVM, I see that was just a placeholder method. I can simply remove the call to sql_render().

MichaelChirico · 2023-08-08T07:49:17Z

Mild breakage worth noting in the NEWS: we were using:

sql_not_supported("median()")

Which now produces the ugly message "`median()()` is not available in this SQL variant". A simple change to remove our own () but worth a nod in the NEWS. Or, the dbplyr logic could internally check for () and skip if present.

mgirlich · 2023-08-08T13:31:58Z

Thanks, I added the breaking change for sql_not_supported() to the NEWS.

MichaelChirico · 2023-08-08T17:31:23Z

I am stuck trying to update to dev. It breaks dozens of existing tests and I don't have any more bandwidth to investigate.

There's also an interaction with our local patch to fix #1016 that's being further broken by the update.

Looks like we will have to live with dbplyr 2.2.1 for the foreseeable future.

mgirlich · 2023-08-09T05:19:03Z

Is this a public package? It would be interesting to see what tests are broken now and whether dbplyr can help to avoid this for now.

MichaelChirico · 2023-08-09T17:19:48Z

Unfortunately it's an internal package. If you're OK to work slowly through shareable tidbits, I can try to share failures / relevant bits of our source. But that's likely to be somewhat painful. WDYT, is there any more productive way to debug here?

mgirlich · 2023-08-10T06:04:17Z

I would expect that many of the failing tests boil down to only a couple of changes. Unfortunately, I couldn't see that coming as your package is internal and therefore didn't pop up in the revdep checks.
Obviously, you should first go through the breaking changes and search for the functions mentioned in there. In case you already did this, I'm happy to answer question or help in finding the place where issues occur. Though you have to decide whether it is faster to provide bits of source you can share or search for the issue yourself.

MichaelChirico · 2023-08-11T20:04:34Z

I would expect that many of the failing tests boil down to only a couple of changes.

That's almost certainly right. Let's see what kind of progress we can make. Our package is {f1} (see paper for the SQL backend).

One crucial thing here is we're maintaining a patch in light of #1016:

dbplyr/R/verb-copy-to.R

Line 91 in 244ea87

out <- tbl_src_dbi(dest, name, vars = names(df))

-out <- tbl_src_dbi(dest, name, vars = names(df))
+out <- tbl(dest, name)

The dest object used to pass rlang::string(), but now it appears to be a dbplyr_table_ident.

I updated our .TableNameToIdentifier() helper in light of this:

-  if (rlang::inherits_any(table_name, c("SQL", "sql", "Id", "ident_q"))) {
+  if (rlang::inherits_any(table_name, c("SQL", "sql", "Id", "dbplyr_table_ident", "ident_q"))) {
    return(table_name)
  }

But now the next call fails:

Error in `UseMethod("as.sql")`: no applicable method for 'as.sql' applied to an object of class "c('dbplyr_table_ident', 'vctrs_rcrd', 'vctrs_vctr')" ('test-cache.R:13:3')
--------------------------------------------------------------------------------
Error in `UseMethod("as.sql")`: no applicable method for 'as.sql' applied to an object of class "c('dbplyr_table_ident', 'vctrs_rcrd', 'vctrs_vctr')"
Backtrace:
  1. dplyr::copy_to(...)
       at test-cache.R:13:2
  4. dplyr:::copy_to.DBIConnection(...)
  6. dbplyr:::copy_to.src_sql(...)
  8. f1:::tbl.src_F1DBIConnection(dest, name)
 10. f1:::tbl.F1DBIConnection(src$con, from, ...)
 11. dbplyr::as.sql(.TableNameToIdentifier(from), con = src)

I see some as.sql methods related, but none for dbplyr_table_ident:

dbplyr/R/schema.R

Lines 71 to 81 in 244ea87

    
           #' @export 
        
           as.sql.dbplyr_schema <- function(x, con) { 
        
             ident_q(paste0(escape(x$schema, con = con), ".", escape(x$table, con = con))) 
        
           } 
        
           #' @export 
        
           as.sql.dbplyr_catalog <- function(x, con) { 
        
             ident_q(paste0( 
        
               escape(x$catalog, con = con), ".", escape(x$schema, con = con), ".", escape(x$table, con = con) 
        
             )) 
        
           }

Could you recommend what such a method might look like?

~~maybe~~ $ doesn't work on the dbplyr_table_ident object, so as.sql.dbplyr_schema is not an exact match.

Maybe we should work around this on our end because it's our patch mucking things up. Instead I tried:

  if (rlang::inherits_any(table_name, c("SQL", "sql", "Id", "ident_q"))) {
    return(table_name)
  }
+  if (inherits(table_name, "dbplyr_table_ident")) {
+    table_name <- unclass(table_name)
+    return(DBI::Id(schema = table_name$schema, table = table_name$table))
+  }

This made progress, but a new error I'm not sure how to deal with:

Error in `h(simpleError(msg, call))`: error in evaluating the argument 'name' in selecting a method for function 'dbRemoveTable': Corrupt rcrd: not a list

MichaelChirico · 2023-08-11T23:30:13Z

Another thing I tried:

dbplyr/R/table-ident.R

Lines 210 to 218 in 244ea87

    
           table_ident_name <- function(x) { 
        
             table <- vctrs::field(x, "table") 
        
             quoted <- vctrs::field(x, "quoted") 
        
             if (quoted) { 
        
               NULL 
        
             } else { 
        
               table 
        
             } 
        
           }

table_ident_name <- function(x) {
+  if (inherits(x, "ident_q")) return(x)
  table <- vctrs::field(x, "table")
  quoted <- vctrs::field(x, "quoted")
  if (quoted) {
    NULL
  } else {
    table
  }
}

But that (or maybe some other change) leads to the wrong SQL being generated:

── Error ('test-cache.R:58:3'): Caches are written and read. ───────────────────
Error in `doTryCatch(return(expr), name, parentenv, handler)`: INVALID_ARGUMENT: f1::SQL_ANALYSIS_ERROR: Syntax error: Expected keyword JOIN but got ")" [at 2:29]
FROM (`datascape`.`mtcars_x`) AS `datascape`.`mtcars_x`

So the table name has been wrapped in () and given an AS alias, but really the query should be FROM `datascape`.`mtcars_x` . Not sure where to look now.

mgirlich · 2023-08-16T08:47:53Z

To me the first question is whether you really need tbl.src_F1DBIConnection() and tbl.F1DBIConnection()? If so, it would be interesting to know why. If you can get rid of it this should probably simplify your code and make it more compatible to the dev version.
Instead of as.sql() you can use escape() on a <dblyr_table_ident> object. But ideally, you would keep the <dblyr_table_ident> object. In particular in the lazy table object.

Regarding wrapping: this probably comes from sql_query_wrap.DBIConnection(). It checks whether from is a table identifier or SQL. If it is SQL it needs to be wrapped.

MichaelChirico · 2023-08-18T06:56:51Z

whether you really need

I think there's some history there; certainly a lot of tests fail without them. Here are their current implementations:

.IdentifierNames <- function(len) {
  if (len == 0) {
    return(NULL)
  }
  if (len == 1) {
    return("table")
  }
  if (len == 2) {
    return(c("schema", "table"))
  }
  c("schema", "table", paste0("suffix_", seq_len(len - 2)))
}

.TableNameToIdentifier <- function(table_name) {
  if (rlang::inherits_any(table_name, c("SQL", "sql", "Id", "ident_q"))) {
    return(table_name)
  }
  if (inherits(table_name, "dbplyr_table_ident")) {
    table_name <- unclass(table_name)
    return(DBI::Id(schema = table_name$schema, table = table_name$table))
  }
  stopifnot(rlang::is_string(table_name))

  parts <- strsplit(table_name, ".", fixed = TRUE)[[1]]
  names(parts) <- .IdentifierNames(length(parts))

  do.call(DBI::Id, as.list(parts))
}

tbl.F1DBIConnection <- function(src, from, ..., vars = NULL) {
  from <- dbplyr::as.sql(.TableNameToIdentifier(from), con = src)

  vars <- vars %||% dplyr::db_query_fields(src, from, ...)
  dbi_src <- dbplyr::src_dbi(src, auto_disconnect = FALSE)
  tbl <- dplyr::make_tbl(
    c("F1DBIConnection", "googlesql", "dbi", "sql", "lazy"),
    src = dbi_src,
    lazy_query = dbplyr::lazy_base_query(from, vars, class = "remote")
  )
  tbl
}

tbl.src_F1DBIConnection <- function(src, from, ...) {
  tbl(src$con, from, ...)
}

this probably comes from sql_query_wrap.DBIConnection()

Oh, nice lead. We have sql_query_wrap.GoogleSQLDBIConnection:

sql_query_wrap.GoogleSQLDBIConnection <- function(
    con, from,
    name = .UniqueName("subquery"),
    ...,
    lvl = 0) {
  if (is.ident(from)) {
    return(setNames(from, name))
  }
  if (is.null(name)) {
    return(build_sql("(", from, ")", con = con))
  }
  if (!is.ident(name)) {
    name <- ident(name)
  }
  build_sql("(", from, ") AS ", name, con = con)
}

I added a case for dbplyr_table_ident:

sql_query_wrap.GoogleSQLDBIConnection <- function(
    con, from,
    name = .UniqueName("subquery"),
    ...,
    lvl = 0) {
+  if (inherits(from, "dbplyr_table_ident")) {
+    return(from)
+  }
  if (is.ident(from)) {
    return(setNames(from, name))
  }
  if (is.null(name)) {
    return(build_sql("(", from, ")", con = con))
  }
  if (!is.ident(name)) {
    name <- ident(name)
  }
  build_sql("(", from, ") AS ", name, con = con)
}

That gets us to a different SQL generation error:

SELECT `datascape`.`mtcars_x`.*
FROM `datascape`.`mtcars_x`
LIMIT 6

The preamble to that looks like:

f1.mtcars <- copy_to(dbi_con,
  mtcars2,
  name = "datascape.mtcars_x",
  temporary = FALSE
)
tbl_mtcars <- tbl(dbi_con, "datascape.mtcars_x")

head(tbl_mtcars)

Looks like sql_render.select_query() is what's generating that botched SELECT... is that expecting something like

SELECT ALIAS.*
FROM `datascape`.`mtcars_x` AS ALIAS
LIMIT 6

MichaelChirico · 2023-08-18T07:07:17Z

I'm also noticing that inside sql_query_wrap.GoogleSQLDBIConnection() we have:

name <-
structure("`datascape`.`mtcars_x`", class = c("ident_q", "ident", 
"character"))

from <-
structure(list(table = "`datascape`.`mtcars_x`", schema = NA_character_, 
    catalog = NA_character_, quoted = TRUE, alias = NA_character_), class = c("dbplyr_table_ident", 
"vctrs_rcrd", "vctrs_vctr"))

That looks a bit off as I'd expect datascape to maybe be the catalog or schema?

mgirlich · 2023-08-18T08:35:40Z

I think you can simplify the whole setup a lot

# remove `tbl.F1DBIConnection()`

tbl.src_F1DBIConnection <- function(src, from, ...) {
  # maybe you don't really want to try to extract the schema automatically.
  # the dev version of dbplyr checks if there is a `.` in the name and informs the user that they
  # probably wanted to use `in_schema()` or `in_catalog()`
  from <- .TableNameToIdentifier(from)
  
  tbl_sql(
    c("F1DBIConnection", "googlesql", "dbi"),
    src = src,
    from = from,
    ...
  )
}

Either remove your sql_query_wrap() method or simplify it

sql_query_wrap.GoogleSQLDBIConnection <- function(
    con, from,
    name = NULL,
    ...,
    lvl = 0) {
  if (is.null(name)) {
    out <- sql_query_wrap(
      con = con,
      from = from,
      name = .UniqueName("subquery"),
      ...,
      lvl = lvl
    )
  }
  
  NextMethod("sql_query_wrap")
}

mgirlich · 2023-10-27T09:29:20Z

@MichaelChirico I'm closing the issue for now. Let me know if you need more help.

MichaelChirico · 2023-10-29T19:24:16Z

Our custom backend is still deeply broken and nobody has any bandwidth to invest in writing the package from scratch. We are stuck with an old version of {dbplyr} and will be for the foreseeable future. This is already causing issues as the cutting-edge {dplyr} implementation gets further from what's available in our old {dbplyr}. Alas.

MichaelChirico mentioned this issue Aug 8, 2023

Fix typos #1348

Merged

mgirlich closed this as completed Oct 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible regression from joins refactor in 2.3.0 #1346

Possible regression from joins refactor in 2.3.0 #1346

MichaelChirico commented Aug 7, 2023

mgirlich commented Aug 8, 2023

MichaelChirico commented Aug 8, 2023

mgirlich commented Aug 8, 2023

mgirlich commented Aug 8, 2023

MichaelChirico commented Aug 8, 2023 •

edited

Loading

MichaelChirico commented Aug 8, 2023 •

edited

Loading

mgirlich commented Aug 8, 2023

MichaelChirico commented Aug 8, 2023

mgirlich commented Aug 9, 2023

MichaelChirico commented Aug 9, 2023

mgirlich commented Aug 10, 2023

MichaelChirico commented Aug 11, 2023 •

edited

Loading

MichaelChirico commented Aug 11, 2023

mgirlich commented Aug 16, 2023

MichaelChirico commented Aug 18, 2023

MichaelChirico commented Aug 18, 2023 •

edited

Loading

mgirlich commented Aug 18, 2023

mgirlich commented Oct 27, 2023

MichaelChirico commented Oct 29, 2023

Possible regression from joins refactor in 2.3.0 #1346

Possible regression from joins refactor in 2.3.0 #1346

Comments

MichaelChirico commented Aug 7, 2023

mgirlich commented Aug 8, 2023

MichaelChirico commented Aug 8, 2023

mgirlich commented Aug 8, 2023

mgirlich commented Aug 8, 2023

MichaelChirico commented Aug 8, 2023 • edited Loading

MichaelChirico commented Aug 8, 2023 • edited Loading

mgirlich commented Aug 8, 2023

MichaelChirico commented Aug 8, 2023

mgirlich commented Aug 9, 2023

MichaelChirico commented Aug 9, 2023

mgirlich commented Aug 10, 2023

MichaelChirico commented Aug 11, 2023 • edited Loading

MichaelChirico commented Aug 11, 2023

mgirlich commented Aug 16, 2023

MichaelChirico commented Aug 18, 2023

MichaelChirico commented Aug 18, 2023 • edited Loading

mgirlich commented Aug 18, 2023

mgirlich commented Oct 27, 2023

MichaelChirico commented Oct 29, 2023

MichaelChirico commented Aug 8, 2023 •

edited

Loading

MichaelChirico commented Aug 8, 2023 •

edited

Loading

MichaelChirico commented Aug 11, 2023 •

edited

Loading

MichaelChirico commented Aug 18, 2023 •

edited

Loading