-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
323 lines (240 loc) · 9.11 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
tidy.opts = list(blank = FALSE)
)
options(tidyverse.quiet = TRUE)
```
# powerjoin <img src='man/figures/logo.png' align="right" height="139" />
{powerjoin} extends {dplyr}'s join functions.
* Make your joins safer with the `check` argument and the `check_specs()`function
* Deal with conflicting column names by combining, coalescing them etc using the `conflict` argument
* Preprocess input, for instance to select columns to join without having to repeat
key columns in the selection
* Do painless fuzzy joins thanks to a generalized `by` argument accepting formulas
* Fill unmatched values using the `fill` argument
* Operate recursive joins by providing lists of data frames to `x` and `y`
* Keep or drop key columns with more flexibility thanks to an enhanced `keep`argument
## Installation
Install CRAN version with:
``` r
install.packages("powerjoin")
```
Or development version with:
``` r
remotes::install_github("moodymudskipper/powerjoin")
```
## Now let's match penguins
```{r}
library(powerjoin)
library(tidyverse)
# toy dataset built from Allison Horst's {palmerpenguins} package and
# Hadley Wickham's {babynames}
male_penguins <- tribble(
~name, ~species, ~island, ~flipper_length_mm, ~body_mass_g,
"Giordan", "Gentoo", "Biscoe", 222L, 5250L,
"Lynden", "Adelie", "Torgersen", 190L, 3900L,
"Reiner", "Adelie", "Dream", 185L, 3650L
)
female_penguins <- tribble(
~name, ~species, ~island, ~flipper_length_mm, ~body_mass_g,
"Alonda", "Gentoo", "Biscoe", 211, 4500L,
"Ola", "Adelie", "Dream", 190, 3600L,
"Mishayla", "Gentoo", "Biscoe", 215, 4750L,
)
```
## Safer joins
The `check` argument receives an object created by the `check_specs()` function,
which provides ways to handle specific input properties, its arguments
can be :
* `"ignore"` : stay silent (default except for `implicit_keys`)
* `"inform"`
* `"warn"`
* `"abort"`
We can print these defaults :
```{r}
check_specs()
```
By default it works like {dplyr}, informing in case of implicit keys, and no
further checks :
```{r, error = TRUE}
power_inner_join(
male_penguins[c("species", "island")],
female_penguins[c("species", "island")]
)
```
We can silence the implicit key detection and check that we have unique keys in
the right table
```{r}
check_specs(implicit_keys = "ignore", duplicate_keys_right = "abort")
```
```{r, error = TRUE}
power_inner_join(
male_penguins[c("species", "island")],
female_penguins[c("species", "island")],
check = check_specs(implicit_keys = "ignore", duplicate_keys_right = "abort")
)
```
The `column_conflict` argument guarantees that you won't have columns renamed without you
knowing, you might need it most of the time, we could setup some development and
production specs for our most common joins:
```{r}
dev_specs <- check_specs(
column_conflict = "abort",
inconsistent_factor_levels = "inform",
inconsistent_type = "inform"
)
prod_specs <- check_specs(
column_conflict = "abort",
implicit_keys = "abort"
)
```
This will save some typing :
<!-- For some reason this chunk makes markdown bug, so dirty fix -->
```{r, error = TRUE, eval = FALSE}
power_inner_join(
male_penguins,
female_penguins,
by = c("species", "island"),
check = dev_specs
)
#> Error: The following columns are conflicted and their conflicts are not handled:
#> 'name', 'flipper_length_mm', 'body_mass_g'
```
## Handle column conflict
We saw above how to fail when encountering column conflict, here we show how to
handle it.
To resolve conflicts between identically named join columns, set the `conflict`
argument to a 2 argument function (or formula) that will take as arguments the 2 conflicting
joined columns after the join.
```{r}
df1 <- tibble(id = 1:3, value = c(10, NA, 30))
df2 <- tibble(id = 2:4, value = c(22, 32, 42))
power_left_join(df1, df2, by = "id", conflict = `+`)
```
Coalescing is the most common use case and we provide the functions `coalesce_xy()`
and `coalesce_yx()` to ease this task (both wrapped around `dplyr::coalesce()`).
```{r}
power_left_join(df1, df2, by = "id", conflict = coalesce_xy)
power_left_join(df1, df2, by = "id", conflict = coalesce_yx)
```
Note that the function is operating on vectors by default, not rowwise, however
we can make it work rowwise by using `rw` in the lhs of the formula.
```{r}
power_left_join(df1, df2, by = "id", conflict = ~ sum(.x, .y, na.rm = TRUE))
power_left_join(df1, df2, by = "id", conflict = rw ~ sum(.x, .y, na.rm = TRUE))
```
If you need finer control, `conflict` can also be a named list of such functions,
formulas or special values, each to be applied on the relevant pair of conflicted
columns.
## Preprocess inputs
Traditionally key columns need to be repeated when preprocessing inputs
before a join, which is an annoyance and an opportunity for mistakes.
With {powerjoin} we can do :
```{r}
power_inner_join(
male_penguins %>% select_keys_and(name),
female_penguins %>% select_keys_and(female_name = name),
by = c("species", "island")
)
```
For semi joins, just omit arguments to `select_keys_and()`:
```{r}
power_inner_join(
male_penguins,
female_penguins %>% select_keys_and(),
by = c("species", "island")
)
```
We could also aggregate on keys before the join, without the need for any
`group_by()`/`ungroup()` gymnastics :
```{r}
power_left_join(
male_penguins %>% summarize_by_keys(male_weight = mean(body_mass_g)),
female_penguins %>% summarize_by_keys(female_weight = mean(body_mass_g)),
by = c("species", "island")
)
```
`pack_along_keys()` packs given columns, or all non key columns by default, into
a data frame column named by the `name` argument, it's useful to namespace the
data and avoid conflicts
```{r}
power_left_join(
male_penguins %>% pack_along_keys(name = "m"),
female_penguins %>% pack_along_keys(name = "f"),
by = c("species", "island")
)
```
We have more of these, all variants of tidyverse functions :
* `nest_by_keys()` nests given columns, or all by default, if `name` is given
a single list column of data frames is created
* `complete_keys()` expands the key columns, so all combinations are present,
filling the rest of the new rows with `NA`s. Absent factor levels are expanded
as well.
<!-- * `pivot_wider_by_keys()` and `pivot_longer_by_keys()` assume the "id" columns are the keys -->
These functions do not modify the data but add an attribute that will be processed
by the join function later on, so no function should be used on top of them.
## Fuzzy joins
To do fuzzy joins we use formulas in the `by` argument, in this formula we use,
`.x` and `.y` to describe the left and right tables. This is very flexible
but can be costly since a cartesian product is computed.
```{r}
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = c(~.x$flipper_length_mm < .y$flipper_length_mm, ~.x$body_mass_g > .y$body_mass_g)
)
```
We might also mix fuzzy joins with regular joins :
```{r}
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = c("island", ~.x$flipper_length_mm > .y$flipper_length_mm)
)
```
Finally we might want to create a column with a value used in the comparison,
in that case we will use `<-` in the formula (several times if needed)`:
```{r}
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = ~ (mass_ratio <- .y$body_mass_g / .x$body_mass_g) > 1.2
)
```
## Fill unmatched values
The `fill` argument is used to specify what to fill unmatched values with,
note that missing values resulting from matches are not replaced.
```{r}
df1 <- tibble(id = 1:3)
df2 <- tibble(id = 1:2, value2 = c(2, NA), value3 = c(NA, 3))
power_left_join(df1, df2, by = "id", fill = 0)
power_left_join(df1, df2, by = "id", fill = list(value2 = 0))
```
## Join recursively
The `x` and `y` arguments accept lists of data frames so one can do :
```{r}
df1 <- tibble(id = 1, a = "foo")
df2 <- tibble(id = 1, b = "bar")
df3 <- tibble(id = 1, c = "baz")
power_left_join(list(df1, df2, df3), by = "id")
power_left_join(df1, list(df2, df3), by = "id")
```
## Enhanced `keep` argument
By default, as in *{dplyr}*, key columns are merged and given names from the
left table. In case of a fuzzy join columns that participate in a fuzzy join are
kept from both sides.
We provide additional values `"left"`, `"right"`, `"both"` and `"none"` to choose
which keys to keep or drop.
## Notes
This package supersedes the {safejoin} package which had an unfortunate homonym on CRAN and
had a suboptimal interface and implementation.
Hadley Wickham, Romain François and David Robinson are credited for their work
in {dplyr} and {fuzzyjoin} since this package contains some code copied from these packages.