
Feature request: in read_delim() and associated functions, make trim_ws = TRUE also work on non-breaking spaces (U+00A0, code point 160) #1570

Open
gklarenberg opened this issue Jan 9, 2025 · 0 comments


If strings contain non-breaking spaces (U+00A0, code point 160, which lies outside the 7-bit ASCII range), the argument trim_ws = TRUE in read_csv() (or read_delim()) does not remove them. This was unexpected to me, because str_trim() from stringr, which is also part of the tidyverse, does remove them.

Reprex:

library(tidyverse)

###### Example with regular spaces ###### 
# Create a vector with strings, spaces represented by ASCII code 32:
x <- c(intToUtf8(c(32, 65, 32, 119, 111, 114, 100)), # leading space
       intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 32))) # trailing space
x
#> [1] " A word"     "A sentence "

# Save as a csv
write_csv(data.frame(x), "reg_spaces")
# Read back in as csv
x2 <- read_csv("reg_spaces")
#> Rows: 2 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): x
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check: no leading or trailing spaces! :D
x2$x
#> [1] "A word"     "A sentence"

# Check other functions:
trimws(x) # Works
#> [1] "A word"     "A sentence" 
str_trim(x) # Works
#> [1] "A word"     "A sentence" 

###### Example with non-breaking spaces ###### 
# Create a vector with strings, spaces represented by Unicode code point 160 (U+00A0, non-breaking space):
y <- c(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)), 
       intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 160)))
y
#> [1] " A word"     "A sentence "

# Write out as a csv and read back in:
write_csv(data.frame(y), "nonbreak_spaces")
y2 <- read_csv("nonbreak_spaces")
#> Rows: 2 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): y
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check: still has leading and trailing spaces... :(
y2$y
#> [1] " A word"     "A sentence "

# Check other functions:
trimws(y) # Does not work
#> [1] " A word"     "A sentence " 
str_trim(y) # Works!
#> [1] "A word"     "A sentence" 

IRL situation: I copied text from a website (a list of countries separated by commas) into an Excel (csv) spreadsheet and applied "Text to table" with a comma as the separator, so that each country name landed in a separate column. I ignored the leading white spaces, assuming read_csv() would take care of them, but it did not. After some research, it appears that the csv kept the non-breaking spaces from the website (?), and read_csv() does not remove these.
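Until trim_ws handles this, a possible workaround is to trim character columns after reading. The following is a minimal sketch, not a definitive fix: the temporary file and the object name `cleaned` are made up for illustration, and it assumes readr, dplyr and stringr are installed. It relies on the behaviour shown in the reprex (str_trim() removes U+00A0) and on the `whitespace` argument of base trimws(), whose help page suggests "[\\h\\v]" for Unicode white space.

```r
library(readr)
library(dplyr)
library(stringr)

# Hypothetical file for illustration: one column whose values carry
# a leading or trailing non-breaking space (U+00A0).
path <- tempfile(fileext = ".csv")
y <- c(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
       intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 160)))
write_csv(data.frame(y), path)

# Read as usual, then trim every character column with str_trim(),
# which treats U+00A0 as white space.
cleaned <- read_csv(path, show_col_types = FALSE) %>%
  mutate(across(where(is.character), str_trim))
cleaned$y

# Base R alternative: widen the set of characters trimws() strips.
trimmed_base <- trimws(y, whitespace = "[\\h\\v]")
trimmed_base
```

This only addresses leading and trailing non-breaking spaces; ones in the middle of a string are left alone, which matches what trim_ws does for regular spaces.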

Looking at the underlying code, I think parse.cpp (in the function parse_vector_) could be adjusted to explicitly remove leading and trailing non-breaking spaces. Alternatively, the change could go in the header file Token.h: on lines 119 and 121, add \u00A0 to the white space characters to remove, in addition to ' ' and '\t'.
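One caveat for either change, shown here from the R side: in UTF-8 a non-breaking space is a two-byte sequence, whereas ' ' and '\t' are single bytes. I have not verified how Token.h walks the field, so this is only a sketch under the assumption that it compares one byte at a time; if so, the check would need to match the byte pair rather than a single character.

```r
# Non-breaking space, built the same way as in the reprex.
nbsp <- intToUtf8(160)

# Regular space and tab are one byte each.
charToRaw(" ")
#> [1] 20
charToRaw("\t")
#> [1] 09

# The non-breaking space is one character but two bytes in UTF-8.
charToRaw(nbsp)
#> [1] c2 a0
nchar(nbsp, type = "chars")
#> [1] 1
nchar(nbsp, type = "bytes")
#> [1] 2
```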
