Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

copy_to creates tables with invalid column names #1016

Closed
jensmassberg opened this issue Sep 29, 2022 · 6 comments
Closed

copy_to creates tables with invalid column names #1016

jensmassberg opened this issue Sep 29, 2022 · 6 comments
Labels
reprex needs a minimal reproducible example

Comments

@jensmassberg
Copy link

I use dbplyr with a database where some column names are invalid.
If a table with invalid column name is copied to the data base, invalid
names are changed to valid ones. Thus when using copy_to, the table in the
database might have column names than are different from the original data frame.

This was no problem in dbplyr 2.1.1, but in dbplyr 2.2.1 the returned tbl object
contains the old (invalid) column names instead of the new correct ones. This leads
to bugs when working with the table.

A closer look into the code showed that the following line led to the problem:
https://github.com/tidyverse/dbplyr/blame/main/R/verb-copy-to.R#L86

Before the line was changes, the column names were read from the table. Now it is
assumed that they are the same as the ones in the data frame. This assumption isn't
true in my case and I think that there are also other data bases where this isn't
necessarily true (random example: https://www.techtalkcorner.com/parquet-files-column-names-with-spaces/).

Can we either

  • get back to the old behaviour in copy_to to read the column names from the table or
  • add a parameter to switch the behavior (use column names from data frame or read them from table)?
@mgirlich
Copy link
Collaborator

mgirlich commented Oct 6, 2022

It feels quite unexpected to me that the database changes the column names. So, I'm not a big fan of the old behaviour. At least, this should be a warning. So it also does not make sense to me to add such a parameter. Instead you should rather rename the columns before copying.
But the current behaviour is not that good either. Though, it feels like quite an edge case.

@hadley What's your opinion on that?

@hadley
Copy link
Member

hadley commented Oct 6, 2022

Hmmmmmm, given that some databases do this, and we can't really protect it, it seems like it would be safest to ask the database for the column names, and then maybe warn the user if they're changed from what they supplied?

@krlmlr
Copy link
Member

krlmlr commented Nov 19, 2022

Can we:

  • add an argument query_column_names = NA
  • if NA, column names are only queried if they don't match [a-z_]+
  • otherwise, we respect the user's preference

@mgirlich
Copy link
Collaborator

This seems reasonable to me. It avoids the roundtrip in most cases and should still be safe.

@hadley
Copy link
Member

hadley commented Dec 5, 2022

This seems like a lot of work for a situation that seems to occur rarely. @jensmassberg can you give us more information about which database you're using and how the names are transformed?

@hadley hadley added the reprex needs a minimal reproducible example label Dec 5, 2022
@mgirlich
Copy link
Collaborator

Closing as we are missing more details here. Happy to reopen if we get more information or this comes up more frequently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reprex needs a minimal reproducible example
Projects
None yet
Development

No branches or pull requests

4 participants