Feature request: in `read_delim()` and associated functions, make `trim_ws = TRUE` also work on ASCII code 160, non-breaking spaces #1570

gklarenberg · 2025-01-09T04:46:43Z

If strings contain non-breaking spaces (U+00A0, code point 160, which lies outside the 7-bit ASCII range), the argument `trim_ws = TRUE` in `read_csv()` (or `read_delim()`) does not remove them. This was unexpected to me, as `str_trim()` from stringr (part of the tidyverse) does.

Reprex:

library(tidyverse)

###### Example with regular spaces ###### 
# Create a vector with strings, spaces represented by ASCII code 32:
x <- c(intToUtf8(c(32, 65, 32, 119, 111, 114, 100)), # leading space
       intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 32))) # trailing space
x
#> [1] " A word"     "A sentence "

# Save as a csv
write_csv(data.frame(x), "reg_spaces")
# Read back in as csv
x2 <- read_csv("reg_spaces")
#> Rows: 2 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): x
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check: no leading or trailing spaces! :D
x2$x
#> [1] "A word"     "A sentence"

# Check other functions:
trimws(x) # Works
#> [1] "A word"     "A sentence" 
str_trim(x) # Works
#> [1] "A word"     "A sentence" 

###### Example with non-breaking spaces ###### 
# Create a vector with strings, leading/trailing spaces represented by code point 160 (U+00A0, non-breaking space):
y <- c(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)), 
       intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 160)))
y
#> [1] " A word"     "A sentence "

# Write out as a csv and read back in:
write_csv(data.frame(y), "nonbreak_spaces")
y2 <- read_csv("nonbreak_spaces")
#> Rows: 2 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): y
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check: still has leading and trailing spaces... :(
y2$y
#> [1] " A word"     "A sentence "

# Check other functions:
trimws(y) # Does not work
#> [1] " A word"     "A sentence " 
str_trim(y) # Works!
#> [1] "A word"     "A sentence"
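As an interim workaround in base R, `trimws()` accepts a `whitespace` regular expression (available since R 3.6.0, as far as I know), and its documentation suggests `"[\\h\\v]"` for matching Unicode horizontal and vertical white space. A minimal sketch, rebuilding the same strings as `y` above with `\u00a0` escapes:

```r
# Same content as `y` in the reprex: U+00A0 at the start of the first
# element and at the end of the second; the interior spaces are regular (32).
y <- c("\u00a0A word", "A sentence\u00a0")

# Widen the character class from the default "[ \t\r\n]" to all
# horizontal and vertical white space, which includes U+00A0.
trimmed <- trimws(y, whitespace = "[\\h\\v]")
trimmed
#> [1] "A word"     "A sentence"
```

This only helps after the data has been read in; it does not change what `trim_ws = TRUE` does during parsing.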

IRL situation: I copied text from a website (a list of countries separated by commas) into an Excel (csv) spreadsheet and applied "Text to table", using a comma as the separator, to place each country name in a separate column. I ignored the leading white space, assuming `read_csv()` would take care of it, but it did not. After some research, it appears that the csv kept the non-breaking spaces from the website (?), and `read_csv()` does not remove these.
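For anyone trying to confirm the same cause in their own data, inspecting the code points is one way to tell the two kinds of space apart. A small sketch in base R; the country names are made-up stand-ins for the real spreadsheet cells:

```r
# Hypothetical cell values: the first starts with U+00A0, the second
# with a regular space (code point 32), the third has no leading space.
cells <- c("\u00a0Belgium", " France", "Spain")

# Code point of the first character of the first cell.
first_cp <- utf8ToInt(cells[1])[1]
first_cp
#> [1] 160

# Flag every cell that contains a non-breaking space anywhere.
has_nbsp <- grepl("\u00a0", cells, fixed = TRUE)
has_nbsp
#> [1]  TRUE FALSE FALSE
```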

Looking at the underlying code, I think the `parse.cpp` code (in the function `parse_vector_`) could be adjusted to explicitly remove leading and trailing non-breaking spaces. Alternatively, the change could go in the header file `Token.h`: in lines 119 and 121, add `\u00A0` as a white space character to remove, in addition to `' '` and `'\t'`.
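To illustrate the behaviour being requested (not how readr implements trimming internally), here is a rough R emulation that strips exactly the proposed set of characters (space, tab and U+00A0) from both ends of every character column. The helper name `trim_ws_nbsp()` and the example data frame are hypothetical:

```r
# Hypothetical helper, not part of readr: remove leading and trailing
# space, tab and U+00A0, leaving interior white space untouched.
trim_ws_nbsp <- function(x) {
  gsub("^[ \\t\\x{00A0}]+|[ \\t\\x{00A0}]+$", "", x, perl = TRUE)
}

# Stand-in for the result of read_csv() on the affected file.
df <- data.frame(
  y = c("\u00a0A word", "A sentence\u00a0"),
  n = 1:2,
  stringsAsFactors = FALSE
)

# Apply the helper to character columns only.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], trim_ws_nbsp)
df$y
#> [1] "A word"     "A sentence"
```

Doing this inside the tokenizer would presumably be cheaper than a second pass over the data, which is why I am suggesting the change in readr itself.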
