-
Notifications
You must be signed in to change notification settings - Fork 176
/
Copy pathdata-applications.qmd
457 lines (392 loc) · 20.2 KB
/
data-applications.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
# Applications: Data {#sec-data-applications}
```{r}
#| include: false
source("_common.R")
```
## Case study: Olympic 1500m {#sec-case-study-paralympics}
While many of you may be glued to the Olympic Games every four years (or every two years if you fancy both summer and winter sports), the Paralympic Games are less popular than the Olympic Games, even if they hold the same competitive thrills.
The Paralympic Games began as a way to support soldiers who had been wounded in World War II as a way to help them rehabilitate.
The first Paralympic Games were held in Rome, Italy in 1960.
Since 1988 (Seoul, South Korea), the Paralympic Games have been held a few weeks later than the Olympic Games in the same city, in both the summer and winter.
In this case study we introduce a dataset comparing Olympic and Paralympic gold medal finishers in the 1500m running competition (the Olympic "mile", if a bit shorter than a full mile).
The goal of the case study is to walk you through what a data scientist does when they first get a hold of a dataset.
We also provide some "foreshadowing" of concepts and techniques we'll introduce in the next few chapters on exploratory data analysis.
Last, we introduce [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) and discuss the importance of understanding the impact of multiple variables in an analysis.
::: {.data data-latex=""}
The [`paralympic_1500`](http://openintrostat.github.io/openintro/reference/paralympic_1500.html) data can be found in the [openintro](http://openintrostat.github.io/openintro/) R package.
:::
@tbl-paralympic-df-tail shows the last five rows from the dataset, which are the five most recent 1500m races.
Notice that there are racers from both the Men's and Women's divisions as well as those of varying visual impairment (T11, T12, T13, and Olympic).
The T11 athletes have almost complete visual impairment, run with a black-out blindfold, and are allowed to run with a guide-runner.
T12 and T13 athletes have some visual impairment, and the visual acuity of Olympic runners is not determined.
When you encounter a new dataset, taking a peek at the last few rows as we did in @tbl-paralympic-df-tail should be instinctual.
It can be helpful to look at the first few rows of the data as well to get a sense of other aspects of the data which may not be apparent in the last few rows.
@tbl-paralympic-df-head shows the top five rows of the `paralympic_1500` dataset, which reveals that for at least the first five Olympiads, there were no runners in the Women's division or in the Paralympics.
```{r}
#| label: tbl-paralympic-df-tail
#| tbl-cap: Last five rows of the `paralympic_1500` dataset.
#| tbl-pos: H
paralympic_1500 |>
rowid_to_column() |>
slice_tail(n = 5) |>
kbl(
col.names = c(
"", "year", "city", "country_of_games", "division", "type",
"name", "country_of_athlete", "time", "time_min"
),
linesep = "", booktabs = TRUE,
row.names = FALSE
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(3, width = "3em") |>
column_spec(4, width = "4em") |>
column_spec(7, width = "5em") |>
column_spec(8, width = "5em") |>
column_spec(10, width = "4em")
```
```{r}
#| label: tbl-paralympic-df-head
#| tbl-cap: First five rows of the `paralympic_1500` dataset.
#| tbl-pos: H
paralympic_1500 |>
rowid_to_column() |>
slice_head(n = 5) |>
kbl(
col.names = c(
"", "year", "city", "country_of_games", "division", "type",
"name", "country_of_athlete", "time", "time_min"
),
linesep = "", booktabs = TRUE,
row.names = FALSE
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(3, width = "3em") |>
column_spec(4, width = "4em") |>
column_spec(7, width = "5em") |>
column_spec(8, width = "5em") |>
column_spec(10, width = "4em")
```
At this stage it's also useful to think about how the data were collected, as that will inform the scope of any inference you can make based on your analysis of the data.
::: {.guidedpractice data-latex=""}
Do these data come from an observational study or an experiment?[^03-data-applications-1]
:::
[^03-data-applications-1]: This is an observational study.
Researchers collected data on past gold medal race times in both Olympic and Paralympic Games.
::: {.guidedpractice data-latex=""}
There are `r nrow(paralympic_1500)` rows and `r ncol(paralympic_1500)` columns in the dataset.
What does each row and each column represent?[^03-data-applications-2]
:::
[^03-data-applications-2]: Each row represents a 1500m gold medal race and each column represents a variable containing information on each race.
Once you've identified the rows and columns, it's useful to review the data dictionary to learn about what each column in the dataset represents.
The data dictionary is provided in @tbl-paralympic-var-def.
```{r}
#| label: tbl-paralympic-var-def
#| tbl-cap: Variables and their descriptions for the `paralympic_1500` dataset.
#| tbl-pos: H
paralympic_var_def <- tribble(
~variable, ~description,
"year", "Year the Games took place.",
"city", "City of the Games.",
"country_of_games", "Country of the Games.",
"division", "Division: `Men` or `Women`.",
"type", "Type: `Olympic`, `T11`, `T12`, or `T13`.",
"name", "Name of the athlete.",
"country_of_athlete", "Country of athlete.",
"time", "Time of gold medal race, in m:s.",
"time_min", "Time of gold medal race, in decimal minutes (min + sec/60)."
)
paralympic_var_def |>
kbl(
linesep = "", booktabs = TRUE,
col.names = c("Variable", "Description")
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped"), full_width = TRUE
) |>
column_spec(1, monospace = TRUE) |>
column_spec(2, width = "30em")
```
We now have a better sense of what each column represents, but we do not yet know much about the characteristics of each of the variables.
::: {.workedexample data-latex=""}
Determine whether each variable in the `paralympic_1500` dataset is numerical or categorical.
For numerical variables, further classify them as continuous or discrete.
For categorical variables, determine if the variable is ordinal.
------------------------------------------------------------------------
The numerical variables in the dataset are `year` (discrete), and `time_min` (continuous).
The categorical variables are `city`, `country_of_games`, `division`, `type`, `name`, and `country_of_athlete`.
The `time` variable is trickier to classify -- we can think of it as numerical, but it is classified as categorical.
The categorical classification is due to the colon `:` which separates the hours from the seconds.
Sometimes the data dictionary (presented in @tbl-paralympic-var-def) isn't sufficient for a complete analysis, and we need to go back to the data source and try to understand the data better before we can proceed with the analysis meaningfully.
:::
Next, let's try to get to know each variable a little bit better.
For categorical variables, this involves figuring out what their levels are and how commonly represented they are in the data.
@fig-paralympic-cat shows the distributions of two of the categorical variables in this dataset.
We can see that the United States has hosted the Games most often, but runners from Great Britain and Kenya have won the 1500m most often.
There are a large number of countries who have had a single gold medal winner of the 1500m.
Similarly, there are a large number of countries who have hosted the Games only once.
Over the last century, the name describing the country for athletes from one particular region has changed and includes Russian Federation, Unified Team, and Russian Paralympic Committee.
Both of the visualizations are bar plots, which you will learn more about in @sec-explore-categorical.
Similarly, we can examine the distributions of the numerical variables as well.
We already know that the 1500m times are mostly between 3.5min and 4.5min, based on @tbl-paralympic-df-tail and @tbl-paralympic-df-head.
We can break down the 1500m time by division and type of race.
@tbl-paralympic-summary shows the mean, minimum, and maximum 1500m times broken down by division and race type.
Recall that the Men's Olympic division has taken place since 1896, whereas the Men's Paralympic division has happened only since 1960.
The maximum race time, therefore, should be taken into context in terms of the year of the Games.
```{r}
#| label: fig-paralympic-cat
#| fig-cap: |
#| Distributions of categorical variables in the `paralympic_1500` dataset.
#| fig-subcap:
#| - Country of origin of the athlete
#| - Country in which the Games gook place
#| fig-alt: |
#| Two separate bar plots. The left panel shows a bar plot counting the number
#| of gold medal athletes from each country. Great Britain has had 8 top
#| finishers, Kenya has had 7 top finishers, and Tunisia and Algeria have both
#| had 5. The right panel shows a bar plot counting the number of Games which
#| have happened in each country. The USA has hosted 4 Games, the UK has hosted
#| 3 Games, and each of Japan, Greece, Germany, France, and Australia have
#| hosted the Games twice.
#| fig-asp: 2
#| fig-width: 4
#| layout-ncol: 2
paralympic_1500 |>
group_by(country_of_games, year) |>
sample_n(size = 1) |>
ungroup() |>
group_by(country_of_games) |>
count(country_of_games, sort = TRUE) |>
ungroup() |>
mutate(country_of_games = fct_reorder(country_of_games, n)) |>
ggplot(aes(
y = country_of_games, x = n,
fill = fct_rev(country_of_games)
)) +
geom_col(show.legend = FALSE) +
scale_fill_openintro() +
labs(
x = "Count",
y = NULL,
title = "Country of Games"
) +
theme(plot.title.position = "plot")
paralympic_1500 |>
group_by(country_of_athlete, year) |>
sample_n(size = 1) |>
ungroup() |>
group_by(country_of_athlete) |>
count(country_of_athlete, sort = TRUE) |>
ungroup() |>
mutate(country_of_athlete = fct_reorder(country_of_athlete, n)) |>
ggplot(aes(y = country_of_athlete, x = n)) +
geom_col(show.legend = FALSE) +
labs(
x = "Count",
y = NULL,
title = "Country of athlete"
) +
theme(plot.title.position = "plot")
```
```{r}
#| label: tbl-paralympic-summary
#| tbl-cap: |
#| Mean, minimum, and maximum of the gold medal times for the 1500m race
#| broken down by division and type of race.
#| tbl-pos: H
paralympic_1500 |>
group_by(division, type) |>
summarise(
mean = round(mean(time_min), 3),
min = round(min(time_min), 3),
max = round(max(time_min), 3)
) |>
kbl(linesep = "", booktabs = TRUE) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
)
```
**Fun fact!**
Sometimes playing around with the dataset will uncover interesting elements about the context in which the data were collected.
A scatterplot of the Men's 1500m broken down by race type shows that, in each given year, the Olympic runner is substantially faster than the Paralympic runners, with one exception.
In the Rio de Janeiro 2016 Games, the [T13 gold medal athlete ran faster (3:48.29) than the Olympic gold medal athlete (3:50.00)](https://www.paralympic.org/news/remarkable-finish-1500m-rio-2016) (see @fig-paralympic-rio).
In fact, some internet sleuthing tells you that the *top four* T13 finishers all finished the 1500m under 3:50.00!
```{r}
#| label: fig-paralympic-rio
#| fig-cap: |
#| 1500m race time for Men's Olympic and Paralympic athletes. Dashed grey line
#| represents the Rio Games in 2016.
#| fig-alt: |
#| A scatterplot with year on the x-axis and gold medal 1500m time on the y-axis.
#| The points are colored by which group the athlete is in - T11, T12, T13, or
#| Olympic. A vertical line at 2016 show that in the Rio Games the T13 gold medal
#| athlete was faster than the Olympic gold medal athlete.
#| fig-asp: 0.5
paralympic_1500 |>
filter(division == "Men") |>
filter(year > 1950) |>
ggplot(aes(x = year, y = time_min, group = type, color = type, shape = type)) +
geom_vline(xintercept = 2016, color = "darkgrey", lty = 2, lwd = 0.5) +
geom_point(size = 2) +
scale_color_openintro() +
labs(
x = "Year",
y = NULL,
color = "Race type",
shape = "Race type",
title = "1500m race time, in minutes"
) +
theme(
legend.position = c(0.1, 0.75),
legend.background = element_rect(fill = "white", color = "gray", linewidth = 0.1)
)
```
So far we examined aspects of some of the individual variables, and we have broken down the 1500m race times in terms of division and race type.
You might have already wondered how the race times vary across year.
The `paralymic_1500` dataset will provide us with an ability to explore an important statistical concept, Simpson's paradox.
## Simpson's paradox
Simpson's paradox \index{Simpson's paradox} is a description of three (or more) variables.
The paradox happens when a third variable reverses the relationship between the first two variables.
Let's start by considering how the 1500m gold medal race times have changed over year.
@fig-paralympic-ungrouped shows a scatterplot describing 1500m race times and year for Men's Olympic and Paralympic (T11) athletes with a line of best fit (to the entire dataset) superimposed (see @sec-model-slr where we will present fitting a line to a scatterplot).
Notice that the line of best fit shows a *positive* relationship between race time and year.
That is, for later years, the predicted gold medal time is higher than in earlier years.
```{r}
#| label: fig-paralympic-ungrouped
#| fig-cap: |
#| 1500m race time for Men's Olympic and Paralympic (T11) athletes. The line
#| represents a line of best fit to the entire dataset.
#| fig-alt: |
#| A scatterplot with year on the x-axis and gold medal 1500m time on the
#| y-axis. A line of best fit is drawn over the points.
#| fig-asp: 0.5
paralympic_1500 |>
filter(division == "Men", type == "Olympic" | type == "T11") |>
filter(year > 1950) |>
ggplot(aes(x = year, y = time_min)) +
geom_smooth(method = "lm", se = FALSE) +
geom_point(size = 2) +
labs(
x = "Year",
y = NULL,
title = "1500m race time, in minutes"
)
```
Of course, both your eye and your intuition are likely telling you that it wouldn't make any sense to try to model all of the athletes together.
Instead, a separate model should be run for each of the two types of Games: Olympic and Paralympic (T11).
@fig-paralympic-grouped shows a scatterplot describing 1500m race times and year for Men's Olympic and Paralympic (T11) athletes with a line of best fit superimposed separately for each of the two types of races.
Notice that within each type of race, the relationship between 1500m race time and year is now *negative*.
```{r}
#| label: fig-paralympic-grouped
#| fig-cap: |
#| 1500m race time for Men's Olympic and Paralympic (T11) athletes. The best fit
#| line is now fit separately to the Olympic and Paralympic athletes.
#| fig-alt: |
#| A scatterplot with year on the x-axis and gold medal 1500m time on the y-axis.
#| The points are colored by the type of athlete - T11 or Olympic. Lines of best fit
#| are drawn separately for the two groups (T11 and Olympic).
#| fig-asp: 0.5
paralympic_1500 |>
filter(division == "Men", type == "Olympic" | type == "T11") |>
filter(year > 1950) |>
ggplot(aes(
x = year, y = time_min, group = type,
color = type, shape = type
)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE) +
scale_color_openintro() +
labs(
x = "Year",
y = NULL,
color = "Race type",
shape = "Race type",
title = "1500m race time, in minutes"
) +
theme(
legend.position = c(0.1, 0.8),
legend.background = element_rect(fill = "white", color = "gray", linewidth = 0.1)
)
```
::: {.important data-latex=""}
**Simpson's paradox.**
Simpson's paradox happens when an association or relationship between two variables in one direction (e.g., positive) reverses (e.g., becomes negative) when a third variable is considered.
:::
Simpson's paradox was seen in the 1500m race data because the aggregate data showed a positive relationship (positive slope) between year and race time but a negative relationship (negative slope) between year and race time when broken down by the type of race.
Simpson's paradox is observed with categorical data and with numeric data.
Often the paradox happens because the third variable (here, race type) is imbalanced.
There are either more observations in one group or the observations happen at different intervals across the two groups.
In the 1500m data, we saw that the T11 runners had fewer observations and their times were both generally slower and more recent than the Olympic runners.
In the 1500m analysis, it would be most prudent to report the trends separately for the Olympic and the T11 athletes.
However, in other situations, it might be better to aggregate the data and report the overall trend.
Many additional examples of Simpson's paradox and a further exploration is given in @Witmer:2021.
In this case study, we introduced you to the very first steps a data scientist takes when they start working with a new dataset.
In the next few chapters, we will introduce exploratory data analysis, and you'll learn more about the various types of data visualizations and summary statistics you can make to get to know your data better.
Before you move on, we encourage you to think about whether the following questions can be answered with this dataset, and if yes, how you might go about answering them?
It's okay if your answer is "I'm not sure", we simply want to get your exploratory juices flowing to prime you for what's to come!
1. Has there every been a year when a visually impaired paralympic gold medal athlete beat the Olympic gold medal athlete?
2. When comparing the paralympic and Olympic 1500m gold medal athletes, does Simpson's paradox hold in the Women's division?
3. Is there a biological boundary which establishes a time under which no human could run 1500m?
\clearpage
## Interactive R tutorials {#sec-data-tutorials}
Navigate the concepts you've learned in this part in R using the following self-paced tutorials.
All you need is your browser to get started!
::: {.alltutorials data-latex=""}
[Tutorial 1: Introduction to data](https://openintrostat.github.io/ims-tutorials/01-data/)
::: {.content-hidden unless-format="pdf"}
<https://openintrostat.github.io/ims-tutorials/01-data>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 1: Language of data](https://openintro.shinyapps.io/ims-01-data-01/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-01-data-01>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 2: Types of studies](https://openintro.shinyapps.io/ims-01-data-02/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-01-data-02>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 3: Sampling strategies and experimental design](https://openintro.shinyapps.io/ims-01-data-03/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-01-data-03>
:::
:::
::: {.singletutorial data-latex=""}
[Tutorial 1 - Lesson 4: Case study](https://openintro.shinyapps.io/ims-01-data-04/)
::: {.content-hidden unless-format="pdf"}
<https://openintro.shinyapps.io/ims-01-data-04>
:::
:::
::: {.content-hidden unless-format="pdf"}
You can also access the full list of tutorials supporting this book at\
<https://openintrostat.github.io/ims-tutorials>.
:::
::: {.content-visible when-format="html"}
You can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).
:::
## R labs {#sec-data-labs}
Further apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.
::: {.singlelab data-latex=""}
[Intro to R - Birth rates](https://www.openintro.org/go?id=ims-r-lab-intro-to-r)
::: {.content-hidden unless-format="pdf"}
<https://www.openintro.org/go?id=ims-r-lab-intro-to-r>
:::
:::
::: {.content-hidden unless-format="pdf"}
You can also access the full list of labs supporting this book at\
<https://www.openintro.org/go?id=ims-r-labs>.
:::
::: {.content-visible when-format="html"}
You can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).
:::