-
Notifications
You must be signed in to change notification settings - Fork 41
/
Copy path19-joins.Rmd
466 lines (308 loc) · 12.2 KB
/
19-joins.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
# Joins
```{r 19-01, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
oldopt <- options(pillar.print_max = 4, pillar.print_min = 4)
library(nycflights13)
library(tidyverse)
```
**Learning objectives:**
- Identify keys to connect a pair of data frames
- Use mutating and filtering joins to combine data
- Understand how joins work and understand the output
- Understand how various key matching conditions work
## What? {-}
- Joining two data frames `x` and `y`: combining the information from both data frames to create a new data frame, by **matching rows in `x` to rows in `y` based on one or more common variables** (keys)
## What? {-}
- A join is the operation of joining.
Joins can be classified by several criteria:
- is the information of data frame `y` included in the result? (mutating vs filtering joins)
- what happens with non-matching rows? (inner vs outer joins / semi-join vs anti-join)
- how are key matching conditions defined? (equality vs inequality vs no restrictions)
## Keys {-}
Keys = the variables used to connect a pair of data frames in a join.
- `x$key` must match `y$key` for a row in `x` to be matched to a row in `y`
- **compound key**: key that consists of > 1 variable
## Keys {-}
Every join involves a **pair of keys**: one key in each data frame.
They typically play a different role depending on the data frame they belong to:
- one is a data frame's **primary key**: the _variable_ or the _set of variables_ that **_uniquely identifies_** each observation
- the other is called a **foreign key**:
- it _corresponds_ to the primary key (same meaning, same number of variables)
- its values can be repeated
![](images/19_equality_match.png)
## Keys {-}
Primary & foreign key relationships in the nycflights13 package:
![](images/19_relational.png)
## Keys {-}
Tips:
- joining is easiest if primary and foreign key have the same name
- check the primary keys!
- each value must occur only once
- there must be no missing values
## Keys {-}
**Surrogate key**: a single variable added to reflect a compound primary key; makes life easier
```{r}
flights |>
count(time_hour, carrier, flight) |>
filter(n > 1)
flights2 <- flights |>
mutate(id = row_number(), .before = 1)
flights2
```
## Mutating joins {-}
**Mutating joins** add columns from data frame `y`.
- Inner join: only keep matching rows.
- Outer join: also keep non-matching rows from `x` (left join), `y` (right join) or both (full join).
## Mutating joins {-}
![](images/19_inner.png)
## Mutating joins {-}
![](images/19_left.png)
## Mutating joins {-}
![](images/19_right.png)
## Mutating joins {-}
![](images/19_full.png)
## Mutating joins: examples {-}
```{r}
flights2 <- flights |>
select(year, time_hour, origin, dest, tailnum, carrier)
flights2
```
## Mutating joins: examples {-}
```{r}
flights2 |>
left_join(airlines)
```
## Mutating joins: examples {-}
```{r}
flights2 |>
inner_join(airlines)
```
## Mutating joins: extras {-}
`left_join()`, `right_join()` and `inner_join()` have an argument `unmatched =`
- defaults to `"drop"`: drop non-matching records from (respectively) `y`, `x` or both `x` and `y`
- can be set to `"error"` to verify that all records from `y`, `x` or both `x` and `y` are kept, if that is what you expect
## Mutating joins: extras {-}
All mutating join functions have an argument `keep =`
- defaults to `NULL`: equi joins retain only the key from `x`, while non-equi joins retain both keys
- can be set to `TRUE` to force both keys to be retained in the output
```{r}
flights2 |>
inner_join(airlines, keep = TRUE)
```
## Relationships in mutating joins {-}
The _relationship_ describes how many rows in `x` each value of `y$key` is _expected_ to match (**one** or **many**), _and vice-versa_, giving rise to 4 possible combinations:
- one-to-many
- many-to-one
- one-to-one
- many-to-many
Which relationship applies, follows from the keys being primary or foreign in `x` and `y`.
## Relationships in mutating joins {-}
**One-to-many**: primary key is in `x`.
Example with left join:
![](images/19_one-to-many.png)
## Relationships in mutating joins {-}
**Many-to-one**: primary key is in `y`.
Example with left join or inner join:
![](images/19_many-to-one.png)
## Relationships in mutating joins {-}
**One-to-one**: each row in `x` matches at most one row in `y`, and vice-versa.
![](images/19_right.png)
## Relationships in mutating joins {-}
**Many-to-many**: no primary keys involved!
![](images/19_many-to-many.png)
This gives a warning by default, since it can be unintended.
To avoid the warning (an intentional many-to-many), explicitly set `relationship = "many-to-many"`.
## Relationships in mutating joins {-}
The `relationship` argument in mutating joins allows you to take control over the expected relationship.
From `help("mutate-joins")`:
> In production code, it is best to preemptively set `relationship` to whatever relationship you expect to exist between the keys of `x` and `y`, as this forces an error to occur immediately if the data doesn't align with your expectations.
## Filtering joins {-}
**Filtering joins** filter `x` based on (non-)matching rows in `y`.
They never duplicate rows!
## Filtering joins {-}
![](images/19_semi.png)
## Filtering joins {-}
![](images/19_anti.png)
## Filtering joins: examples {-}
```{r}
flights2 |>
semi_join(weather)
```
## Filtering joins: examples {-}
```{r}
flights2 |>
anti_join(weather)
```
## Specifying join keys and their matching conditions {-}
Without specifying a join key: the 'natural join' is applied (messages in previous slides).
BUT:
- this will miss key pairs with different names
- this will assume any pair of identical variable names to be part of the key pair
## Specifying join keys and their matching conditions {-}
E.g. this is NOT wat we intend (`year` has a different meaning in `flights` vs `planes`):
```{r}
flights2 |>
left_join(planes)
```
## Specifying join keys and their matching conditions {-}
So it's much recommended to explicitly define the join key using `join_by()`:
- `inner_join(x, y, by = join_by(...))`
- `left_join(x, y, by = join_by(...))`
- `right_join(x, y, by = join_by(...))`
- `full_join(x, y, by = join_by(...))`
- `semi_join(x, y, by = join_by(...))`
- `anti_join(x, y, by = join_by(...))`
## Specifying join keys and their matching conditions {-}
When both keys have the same name(s): just provide the key name(s).
```{r}
flights2 |>
left_join(planes, join_by(tailnum))
```
## Specifying join keys and their matching conditions {-}
When both keys have the same name(s): just provide the key name(s).
```{r}
flights2 |>
semi_join(weather, join_by(origin, time_hour))
```
## Specifying join keys and their matching conditions {-}
`join_by(var1)` is shorthand for `join_by(var1 == var1)`.
`join_by(var1, var2)` is shorthand for `join_by(var1 == var1, var2 == var2)`.
The expression form in `join_by()` defines:
- _each_ key's name -- these can differ between the `x` and `y` data frame
- the matching condition that defines the match between `x$key` and `y$key`: equality (`==`) or inequality (`<`, `<=` etc, and helpers)
## Some `join_by()` examples {-}
Join airport attributes without losing flights:
```{r}
flights2 |>
left_join(airports, join_by(origin == faa))
```
## Some `join_by()` examples {-}
Join airport attributes without losing flights:
```{r}
flights2 |>
left_join(airports, join_by(dest == faa))
```
## Some `join_by()` examples {-}
```{r}
df <- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))
df
```
Doing a self-join with an inequality matching condition to get all name combinations:
```{r}
df |> inner_join(df, join_by(id < id)) |> print(n = Inf)
```
## Key matching conditions {-}
- Equality condition: so called '**equi joins**'.
- You will use this most of the time.
- All others are sometimes called 'non-equi joins'
- Inequality conditions:
- **inequality joins**: `join_by(id < id)`
- **rolling joins** (only closest): `join_by(closest(id < id))`
- **overlap joins**: to set interval conditions
- `between(x, y_lower, y_upper)` is short for `x >= y_lower, x <= y_upper`
- `within(x_lower, x_upper, y_lower, y_upper)` is short for `x_lower >= y_lower, x_upper <= y_upper`
- `overlaps(x_lower, x_upper, y_lower, y_upper)` is short for `x_lower <= y_upper, x_upper >= y_lower`
- No conditions: **cross joins**.
Use separate function `cross_join()`.
## Key matching conditions {-}
![](images/19_lt.png)
## Key matching conditions {-}
**`cross_join(x, y)`**
![](images/19_cross.png)
## Key matching conditions: examples {-}
```{r include=FALSE}
set.seed(123)
employees <- tibble(
name = sample(babynames::babynames$name, 100),
birthday = ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
)
parties <- tibble(
q = 1:4,
party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
)
```
```{r}
employees
parties
```
## Key matching conditions: examples {-}
When is the birthday party for each employee?
```{r}
employees |>
inner_join(
parties,
join_by(between(birthday, start, end)),
unmatched = "error"
)
```
## Key matching conditions: examples {-}
```{r}
df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
df
```
Generate permutations in a self-join with `cross_join()`:
```{r}
df |> cross_join(df)
```
```{r include=FALSE}
options(oldopt)
```
## Meeting Videos
### Cohort 5
`r knitr::include_url("https://www.youtube.com/embed/4Ju9nj82Ksk")`
<details>
<summary> Meeting chat log </summary>
```
00:08:43 Ryan Metcalf: So sorry for my delay. Aerating the lawn! Fun!
00:10:36 Becki R. (she/her): Hello!
00:22:56 Becki R. (she/her): That's a good idea! It would be a nice reference.
00:24:30 Jon Harmon (jonthegeek): https://cran.r-project.org/web/packages/DiagrammeR/index.html
00:24:33 Sandra Muroy: diagramR?
00:25:12 Jon Harmon (jonthegeek): I guess the sharable link is supposed to be: https://cran.r-project.org/package=DiagrammeR
00:26:06 Becki R. (she/her): That's a good reminder, thanks.
00:56:06 Ryan Metcalf: Another way to notate Venn diagrams and/or discrete mathematics is to use LaTeX. This is not intended to deter you learning of R but may be useful when working with a larger team (especially DB admins!) https://www.purplemath.com/modules/venndiag2.htm
01:04:20 Jon Harmon (jonthegeek): ggplot2::diamonds %>% dplyr::count(carat, cut, color, clarity, depth, table, price, x, y, z, sort = TRUE)
01:08:49 Ryan Metcalf: Great Job Federica! Looking forward to next week!
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/BDckA3OdsiU")`
<details>
<summary> Meeting chat log </summary>
```
00:22:57 Jon Harmon (jonthegeek): Try this to see a quick example. You're right that the order of the columns is left then right either way, but the rows to keep changes.
library(dplyr)
band_members %>% left_join(band_instruments)
band_members %>% right_join(band_instruments)
00:29:17 Jon Harmon (jonthegeek): band_instruments %>% left_join(band_members)
00:39:07 lucus w: Seems like setdiff() is similar to dplyr::anti_join()… they maybe differ by how they prints?
00:40:45 Jon Harmon (jonthegeek): anti_join only needs to match on one key column to exclude; setdiff is looking for the entire observations to be repeated.
00:41:29 lucus w: Interesting.. thanks!
00:52:28 Njoki Njuki Lucy: I don't understand what's happening exactly, if age is greater than 25, have it as 25L(int) otherwise its age?
00:53:56 Njoki Njuki Lucy: okay, thanks
00:54:28 Federica Gazzelloni: sorry my connection
01:05:28 Njoki Njuki Lucy: thank you :)
01:05:33 lucus w: Thanks y'all
```
</details>
### Cohort 6
`r knitr::include_url("https://www.youtube.com/embed/P5FlxEhWE74")`
<details>
<summary> Meeting chat log </summary>
```
LOG
```
</details>
### Cohort 7
`r knitr::include_url("https://www.youtube.com/embed/BWk9H1yrB7o")`
### Cohort 8
`r knitr::include_url("https://www.youtube.com/embed/O57XwU3Erus")`
<details>
<summary> Meeting chat log </summary>
```
00:09:40 shamsuddeen: start
01:04:02 shamsuddeen: https://dplyr.tidyverse.org/reference/setops.html
01:05:16 Abdou: Reacted to "https://dplyr.tidyve..." with 👍
01:06:07 shamsuddeen: end
```
</details>