Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible regression from joins refactor in 2.3.0 #1346

Closed
MichaelChirico opened this issue Aug 7, 2023 · 19 comments
Closed

Possible regression from joins refactor in 2.3.0 #1346

MichaelChirico opened this issue Aug 7, 2023 · 19 comments

Comments

@MichaelChirico
Copy link
Contributor

I'm having a bear of a time trying to debug this issue that's arising when updating 2.2.1 -> 2.3.3. Genuinely hard to disentangle where the issue's coming from since there's so many layers where things may have gone wrong.

The query in the test being broken is pretty simple: inner_join() on some 3-row input tables:

personal_data <- tibble(
  name = c("Alice", "Nick", "Bob"),
  age = c(32, 25, 44)
)

employee_data <- tibble(
  name = c("Alice", "Bob", "Michael"),
  employee_id = c(1, 2, 3)
)

# Copy these tables to DB backend
tables <- list(personal_data, employee_data)
table_names <- paste0("r_tmp_", seq_along(tables))

db_conn <- withr::local_db_connection(MakeTestDBIConnection())
db_names <- lapply(
  table_names,
  \(table_name) DBI::Id(namespace = "datascape", table_name = table_name)
)
db_tables <- mapply(dplyr::copy_to,
  df = tables, name = db_names,
  MoreArgs = list(dest = db_conn, temporary = FALSE),
  SIMPLIFY = FALSE
)

dplyr::inner_join(db_tables[[1]], db_tables[[2]], by = "name")

This test (which compares this join's output to the local dplyr equivalent) works as expected on 2.2.1 but breaks on 2.3.3:

Error in `dplyr::collect(x)`: Failed to collect lazy table.
Caused by error in `doTryCatch()`:
! INVALID_ARGUMENT: SQL_ANALYSIS_ERROR: Syntax error: Expected end of input but got "." [at 2:46]
FROM `datascape`.`r_tmp_1` AS `\`datascape\``.`\`r_tmp_1\``

The issue is the already-escaped name `datascape`.`r_tmp_1` is re-escaped unsuccessfully.

Poking around in debugging I'm not able to tell what went wrong. It's possible our own connection methods are doing something unexpected, for example.

Just one observation:

Here, IIUC, we should respect the pre-escaped nature of the input when constructing by$x_as:

dbplyr/R/lazy-join-query.R

Lines 171 to 178 in 5fa4410

op$joins$by <- purrr::map2(
op$joins$by, seq_along(op$joins$by),
function(by, i) {
by$x_as <- table_names_out[op$joins$by_x_table_id[[i]]]
by$y_as <- table_names_out[i + 1L]
by
}
)

Debugging, I see this around that step:

dput(op$joins$by)
# list(list(
#   x = structure("name", class = c("ident", "character")),
#   y = structure("name", class = c("ident", "character")),
#   condition = "==", 
#   on = structure(character(0), class = c("sql", "character")),
#   na_matches = "never"
# ))
dput(table_names_out)
# c("`datascape`.`r_tmp_1`", "`datascape`.`r_tmp_2`")
dput(op$joins$by_x_table_id)
# list(1L)

# but also
dput(op$x$x)
# structure("`datascape`.`r_tmp_1`", class = c("ident_q", "ident", "character"))
dput(op$joins$table)
# list(structure("`datascape`.`r_tmp_2`", class = c("ident_q", "ident", "character")))

Perhaps table_names_out should be ident_q at this step, but even if so, it would have the same result:

identical(
  ident(table_names_out[1L]),
  ident(ident_q(table_names_out[1L]))
)
# [1] TRUE

Should ident() have an escape for "ident" input? And then we should make sure table_names_out reflects the same ident_q class as the input x$x and joins$table?

Maybe I'm barking up the wrong tree.

@mgirlich
Copy link
Collaborator

mgirlich commented Aug 8, 2023

I tried this with SQLite and the current dev version and it works (with some minor changes: DBI::Id(namespace = "datascape", table_name = table_name) needs to be replaced by DBI::Id(schema = "datascape", table = table_name)).

There are a couple of difficulties regarding different ways to specify a table identifier (ident(), ident_q(), in_schema(), in_catalog(), DBI::Id()). This got quite complicated and messy, so they are now (internally) all converted to one common class. This is also where the requirement to change the names in DBI::Id() to schema and name comes from.

Can you try this out with the current dev version?

@MichaelChirico
Copy link
Contributor Author

Thanks for the tips. Working on using the updated version now.

Initial observations:

  1. Just using the changes to DBI::Id() in 2.3.3 were not enough. Now trying at 5fa4410.
  2. This test fails for us. Looks like a timezone thing:

expect_equal(escape(as.POSIXct("2020-01-01"), con = con), sql("'2020-01-01 00:00:00'"))

escape(as.POSIXct("2020-01-01"), con = con) (`actual`) not equal to sql("'2020-01-01 00:00:00'") (`expected`).

`actual`:   "<SQL> '2020-01-01 08:00:00'"
`expected`: "<SQL> '2020-01-01 00:00:00'"

@mgirlich
Copy link
Collaborator

mgirlich commented Aug 8, 2023

Yes, the 2nd observation is missing to explicitly set a timezone. This was done in other tests but missed there...

@mgirlich
Copy link
Collaborator

mgirlich commented Aug 8, 2023

The 2nd issue should be fixed now.

@MichaelChirico
Copy link
Contributor Author

MichaelChirico commented Aug 8, 2023

Next up:

Error in `UseMethod("sql_render")`: no applicable method for 'sql_render' applied to an object of class "c('sql', 'character')"

I see sql_render.sql was deleted from R/sql-build.R, but I don't see any mention of this in the NEWS, so not sure how to migrate.


NVM, I see that was just a placeholder method. I can simply remove the call to sql_render().

@MichaelChirico
Copy link
Contributor Author

MichaelChirico commented Aug 8, 2023

Mild breakage worth noting in the NEWS: we were using:

sql_not_supported("median()")

Which now produces the ugly message "`median()()` is not available in this SQL variant". A simple change to remove our own () but worth a nod in the NEWS. Or, the dbplyr logic could internally check for () and skip if present.

@mgirlich
Copy link
Collaborator

mgirlich commented Aug 8, 2023

Thanks, I added the breaking change for sql_not_supported() to the NEWS.

@MichaelChirico
Copy link
Contributor Author

I am stuck trying to update to dev. It breaks dozens of existing tests and I don't have any more bandwidth to investigate.

There's also an interaction with our local patch to fix #1016 that's being further broken by the update.

Looks like we will have to live with dbplyr 2.2.1 for the foreseeable future.

@mgirlich
Copy link
Collaborator

mgirlich commented Aug 9, 2023

Is this a public package? It would be interesting to see what tests are broken now and whether dbplyr can help to avoid this for now.

@MichaelChirico
Copy link
Contributor Author

Unfortunately it's an internal package. If you're OK to work slowly through shareable tidbits, I can try to share failures / relevant bits of our source. But that's likely to be somewhat painful. WDYT, is there any more productive way to debug here?

@mgirlich
Copy link
Collaborator

I would expect that many of the failing tests boil down to only a couple of changes. Unfortunately, I couldn't see that coming as your package is internal and therefore didn't pop up in the revdep checks.
Obviously, you should first go through the breaking changes and search for the functions mentioned in there. In case you already did this, I'm happy to answer question or help in finding the place where issues occur. Though you have to decide whether it is faster to provide bits of source you can share or search for the issue yourself.

@MichaelChirico
Copy link
Contributor Author

MichaelChirico commented Aug 11, 2023

I would expect that many of the failing tests boil down to only a couple of changes.

That's almost certainly right. Let's see what kind of progress we can make. Our package is {f1} (see paper for the SQL backend).

One crucial thing here is we're maintaining a patch in light of #1016:

out <- tbl_src_dbi(dest, name, vars = names(df))

-out <- tbl_src_dbi(dest, name, vars = names(df))
+out <- tbl(dest, name)

The dest object used to pass rlang::string(), but now it appears to be a dbplyr_table_ident.

I updated our .TableNameToIdentifier() helper in light of this:

-  if (rlang::inherits_any(table_name, c("SQL", "sql", "Id", "ident_q"))) {
+  if (rlang::inherits_any(table_name, c("SQL", "sql", "Id", "dbplyr_table_ident", "ident_q"))) {
    return(table_name)
  }

But now the next call fails:

Error in `UseMethod("as.sql")`: no applicable method for 'as.sql' applied to an object of class "c('dbplyr_table_ident', 'vctrs_rcrd', 'vctrs_vctr')" ('test-cache.R:13:3')
--------------------------------------------------------------------------------
Error in `UseMethod("as.sql")`: no applicable method for 'as.sql' applied to an object of class "c('dbplyr_table_ident', 'vctrs_rcrd', 'vctrs_vctr')"
Backtrace:
  1. dplyr::copy_to(...)
       at test-cache.R:13:2
  4. dplyr:::copy_to.DBIConnection(...)
  6. dbplyr:::copy_to.src_sql(...)
  8. f1:::tbl.src_F1DBIConnection(dest, name)
 10. f1:::tbl.F1DBIConnection(src$con, from, ...)
 11. dbplyr::as.sql(.TableNameToIdentifier(from), con = src)

I see some as.sql methods related, but none for dbplyr_table_ident:

dbplyr/R/schema.R

Lines 71 to 81 in 244ea87

#' @export
as.sql.dbplyr_schema <- function(x, con) {
ident_q(paste0(escape(x$schema, con = con), ".", escape(x$table, con = con)))
}
#' @export
as.sql.dbplyr_catalog <- function(x, con) {
ident_q(paste0(
escape(x$catalog, con = con), ".", escape(x$schema, con = con), ".", escape(x$table, con = con)
))
}

Could you recommend what such a method might look like?

maybe $ doesn't work on the dbplyr_table_ident object, so as.sql.dbplyr_schema is not an exact match.

Maybe we should work around this on our end because it's our patch mucking things up. Instead I tried:

  if (rlang::inherits_any(table_name, c("SQL", "sql", "Id", "ident_q"))) {
    return(table_name)
  }
+  if (inherits(table_name, "dbplyr_table_ident")) {
+    table_name <- unclass(table_name)
+    return(DBI::Id(schema = table_name$schema, table = table_name$table))
+  }

This made progress, but a new error I'm not sure how to deal with:

Error in `h(simpleError(msg, call))`: error in evaluating the argument 'name' in selecting a method for function 'dbRemoveTable': Corrupt rcrd: not a list

@MichaelChirico
Copy link
Contributor Author

Another thing I tried:

dbplyr/R/table-ident.R

Lines 210 to 218 in 244ea87

table_ident_name <- function(x) {
table <- vctrs::field(x, "table")
quoted <- vctrs::field(x, "quoted")
if (quoted) {
NULL
} else {
table
}
}

table_ident_name <- function(x) {
+  if (inherits(x, "ident_q")) return(x)
  table <- vctrs::field(x, "table")
  quoted <- vctrs::field(x, "quoted")
  if (quoted) {
    NULL
  } else {
    table
  }
}

But that (or maybe some other change) leads to the wrong SQL being generated:

── Error ('test-cache.R:58:3'): Caches are written and read. ───────────────────
Error in `doTryCatch(return(expr), name, parentenv, handler)`: INVALID_ARGUMENT: f1::SQL_ANALYSIS_ERROR: Syntax error: Expected keyword JOIN but got ")" [at 2:29]
FROM (`datascape`.`mtcars_x`) AS `datascape`.`mtcars_x`

So the table name has been wrapped in () and given an AS alias, but really the query should be FROM `datascape`.`mtcars_x` . Not sure where to look now.

@mgirlich
Copy link
Collaborator

To me the first question is whether you really need tbl.src_F1DBIConnection() and tbl.F1DBIConnection()? If so, it would be interesting to know why. If you can get rid of it this should probably simplify your code and make it more compatible to the dev version.
Instead of as.sql() you can use escape() on a <dblyr_table_ident> object. But ideally, you would keep the <dblyr_table_ident> object. In particular in the lazy table object.

Regarding wrapping: this probably comes from sql_query_wrap.DBIConnection(). It checks whether from is a table identifier or SQL. If it is SQL it needs to be wrapped.

@MichaelChirico
Copy link
Contributor Author

whether you really need

I think there's some history there; certainly a lot of tests fail without them. Here are their current implementations:

.IdentifierNames <- function(len) {
  if (len == 0) {
    return(NULL)
  }
  if (len == 1) {
    return("table")
  }
  if (len == 2) {
    return(c("schema", "table"))
  }
  c("schema", "table", paste0("suffix_", seq_len(len - 2)))
}

.TableNameToIdentifier <- function(table_name) {
  if (rlang::inherits_any(table_name, c("SQL", "sql", "Id", "ident_q"))) {
    return(table_name)
  }
  if (inherits(table_name, "dbplyr_table_ident")) {
    table_name <- unclass(table_name)
    return(DBI::Id(schema = table_name$schema, table = table_name$table))
  }
  stopifnot(rlang::is_string(table_name))

  parts <- strsplit(table_name, ".", fixed = TRUE)[[1]]
  names(parts) <- .IdentifierNames(length(parts))

  do.call(DBI::Id, as.list(parts))
}

tbl.F1DBIConnection <- function(src, from, ..., vars = NULL) {
  from <- dbplyr::as.sql(.TableNameToIdentifier(from), con = src)

  vars <- vars %||% dplyr::db_query_fields(src, from, ...)
  dbi_src <- dbplyr::src_dbi(src, auto_disconnect = FALSE)
  tbl <- dplyr::make_tbl(
    c("F1DBIConnection", "googlesql", "dbi", "sql", "lazy"),
    src = dbi_src,
    lazy_query = dbplyr::lazy_base_query(from, vars, class = "remote")
  )
  tbl
}

tbl.src_F1DBIConnection <- function(src, from, ...) {
  tbl(src$con, from, ...)
}

this probably comes from sql_query_wrap.DBIConnection()

Oh, nice lead. We have sql_query_wrap.GoogleSQLDBIConnection:

sql_query_wrap.GoogleSQLDBIConnection <- function(
    con, from,
    name = .UniqueName("subquery"),
    ...,
    lvl = 0) {
  if (is.ident(from)) {
    return(setNames(from, name))
  }
  if (is.null(name)) {
    return(build_sql("(", from, ")", con = con))
  }
  if (!is.ident(name)) {
    name <- ident(name)
  }
  build_sql("(", from, ") AS ", name, con = con)
}

I added a case for dbplyr_table_ident:

sql_query_wrap.GoogleSQLDBIConnection <- function(
    con, from,
    name = .UniqueName("subquery"),
    ...,
    lvl = 0) {
+  if (inherits(from, "dbplyr_table_ident")) {
+    return(from)
+  }
  if (is.ident(from)) {
    return(setNames(from, name))
  }
  if (is.null(name)) {
    return(build_sql("(", from, ")", con = con))
  }
  if (!is.ident(name)) {
    name <- ident(name)
  }
  build_sql("(", from, ") AS ", name, con = con)
}

That gets us to a different SQL generation error:

SELECT `datascape`.`mtcars_x`.*
FROM `datascape`.`mtcars_x`
LIMIT 6

The preamble to that looks like:

f1.mtcars <- copy_to(dbi_con,
  mtcars2,
  name = "datascape.mtcars_x",
  temporary = FALSE
)
tbl_mtcars <- tbl(dbi_con, "datascape.mtcars_x")

head(tbl_mtcars)

Looks like sql_render.select_query() is what's generating that botched SELECT... is that expecting something like

SELECT ALIAS.*
FROM `datascape`.`mtcars_x` AS ALIAS
LIMIT 6

@MichaelChirico
Copy link
Contributor Author

MichaelChirico commented Aug 18, 2023

I'm also noticing that inside sql_query_wrap.GoogleSQLDBIConnection() we have:

name <-
structure("`datascape`.`mtcars_x`", class = c("ident_q", "ident", 
"character"))

from <-
structure(list(table = "`datascape`.`mtcars_x`", schema = NA_character_, 
    catalog = NA_character_, quoted = TRUE, alias = NA_character_), class = c("dbplyr_table_ident", 
"vctrs_rcrd", "vctrs_vctr"))

That looks a bit off as I'd expect datascape to maybe be the catalog or schema?

@mgirlich
Copy link
Collaborator

I think you can simplify the whole setup a lot

# remove `tbl.F1DBIConnection()`

tbl.src_F1DBIConnection <- function(src, from, ...) {
  # maybe you don't really want to try to extract the schema automatically.
  # the dev version of dbplyr checks if there is a `.` in the name and informs the user that they
  # probably wanted to use `in_schema()` or `in_catalog()`
  from <- .TableNameToIdentifier(from)
  
  tbl_sql(
    c("F1DBIConnection", "googlesql", "dbi"),
    src = src,
    from = from,
    ...
  )
}

Either remove your sql_query_wrap() method or simplify it

sql_query_wrap.GoogleSQLDBIConnection <- function(
    con, from,
    name = NULL,
    ...,
    lvl = 0) {
  if (is.null(name)) {
    out <- sql_query_wrap(
      con = con,
      from = from,
      name = .UniqueName("subquery"),
      ...,
      lvl = lvl
    )
  }
  
  NextMethod("sql_query_wrap")
}

@mgirlich
Copy link
Collaborator

@MichaelChirico I'm closing the issue for now. Let me know if you need more help.

@MichaelChirico
Copy link
Contributor Author

Our custom backend is still deeply broken and nobody has any bandwidth to invest in writing the package from scratch. We are stuck with an old version of {dbplyr} and will be for the foreseeable future. This is already causing issues as the cutting-edge {dplyr} implementation gets further from what's available in our old {dbplyr}. Alas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants