---
title: 'Causal Inference: Chapter 11'
output: html_document
---
# Chapter 11: Why model?
This is the code for Chapter 11. Throughout this chapter and the rest of the book, we'll use the tidyverse metapackage to load the core tidyverse packages, as well as the broom package. In this chapter, the main packages we'll be using are ggplot2 for data visualization, dplyr for data manipulation, and broom for [tidying](https://r4ds.had.co.nz/tidy-data.html) regression model results.
## Program 11.1
To replicate Program 11.1, we first need to create the data set, `binary_a_df`.
```{r, fig.cap="Figure 11.1"}
library(tidyverse)
library(broom)
binary_a_df <- data.frame(
  a = c(rep(1, 8), rep(0, 8)),
  y = c(200, 150, 220, 110, 50, 180, 90, 170,
        170, 30, 70, 110, 80, 50, 10, 20)
)
ggplot(binary_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "#E69F00", shape = 21) +
  # `expand_scale()` is deprecated; `expansion()` is its replacement
  scale_x_continuous(breaks = c(0, 1), expand = expansion(mult = .5)) +
  theme_minimal(base_size = 20)
```
To summarize the data, we'll use dplyr to group the dataset by `a`, then get the sample size, mean, standard deviation, and range. `knitr::kable()` is used to print the results nicely to a table. (For more information on the `%>%` pipe operator, see [R for Data Science](https://r4ds.had.co.nz/pipes.html).)
```{r skim_data}
binary_a_df %>%
  group_by(a) %>%
  summarize(
    n = n(),
    mean = mean(y),
    sd = sd(y),
    minimum = min(y),
    maximum = max(y)
  ) %>%
  knitr::kable(digits = 2)
```
Similarly, we'll plot and summarize `categorical_a_df`.
```{r, fig.cap="Figure 11.2"}
categorical_a_df <- data.frame(
  a = sort(rep(1:4, 4)),
  y = c(110, 80, 50, 40, 170, 30, 70, 50,
        110, 50, 180, 130, 200, 150, 220, 210)
)
ggplot(categorical_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "#E69F00", shape = 21) +
  scale_x_continuous(breaks = 1:4, expand = expansion(mult = .25)) +
  theme_minimal(base_size = 20)
```
```{r}
categorical_a_df %>%
  group_by(a) %>%
  summarize(
    n = n(),
    mean = mean(y),
    sd = sd(y),
    minimum = min(y),
    maximum = max(y)
  ) %>%
  knitr::kable(digits = 2)
```
## Program 11.2
Program 11.2 uses data with a continuous exposure and outcome, `continuous_a_df`. Plotting the points works much as it did above.
```{r, fig.cap="Figure 11.3"}
continuous_a_df <- data.frame(
  a = c(3, 11, 17, 23, 29, 37, 41, 53,
        67, 79, 83, 97, 60, 71, 15, 45),
  y = c(21, 54, 33, 101, 85, 65, 157, 120,
        111, 200, 140, 220, 230, 217, 11, 190)
)
ggplot(continuous_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "#E69F00", shape = 21) +
  theme_minimal(base_size = 20)
```
We'll also add a regression line using `geom_smooth()` with `method = "lm"`.
```{r, fig.cap="Figure 11.4"}
ggplot(continuous_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "grey85", shape = 21) +
  # `linewidth` replaces the deprecated `size` aesthetic for lines
  geom_smooth(method = "lm", se = FALSE, col = "#E69F00", linewidth = 1.2) +
  theme_minimal(base_size = 20)
```
To fit an OLS regression model of `y` on `a`, we'll use `lm()` and then tidy the results with `tidy()` from the broom package.
```{r}
linear_regression <- lm(y ~ a, data = continuous_a_df)
linear_regression %>%
  # get the confidence intervals using `conf.int = TRUE`
  tidy(conf.int = TRUE) %>%
  # drop the test statistic and P-value
  select(-statistic, -p.value) %>%
  knitr::kable(digits = 2)
```
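If you also want model-level summaries, broom's `glance()` returns them as a one-row data frame. (This is a brief aside, not part of the book's program.)
```{r}
# one-row data frame of model-level statistics
glance(linear_regression) %>%
  select(r.squared, adj.r.squared, sigma, nobs) %>%
  knitr::kable(digits = 2)
```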
To predict the value of `y` when `a = 90`, we pass the fitted model object `linear_regression` to the `predict()` function, along with a new data frame containing a single value, `a = 90`.
```{r}
linear_regression %>%
  predict(newdata = data.frame(a = 90))
```
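As a quick sanity check, the same prediction can be computed by hand from the fitted coefficients: it's just the intercept plus the slope times 90.
```{r}
# intercept + slope * 90, pulled directly from the coefficient vector
coef(linear_regression)[["(Intercept)"]] + coef(linear_regression)[["a"]] * 90
```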
Similarly, using `binary_a_df`, which has a binary treatment `a`:
```{r}
lm(y ~ a, data = binary_a_df) %>%
  tidy(conf.int = TRUE) %>%
  select(-statistic, -p.value) %>%
  knitr::kable(digits = 2)
```
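Because `a` is binary, this model is saturated, so its two fitted values are exactly the group means we computed with `summarize()` earlier. A quick check (using a throwaway `binary_fit` object):
```{r}
# the fitted values at a = 0 and a = 1 reproduce the sample means by group
binary_fit <- lm(y ~ a, data = binary_a_df)
predict(binary_fit, newdata = data.frame(a = c(0, 1)))
```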
## Program 11.3
Fitting and tidying a model with a quadratic term for `a` is similar, but we need `I()` to include `a^2` in the model formula.
```{r}
smoothed_regression <- lm(y ~ a + I(a^2), data = continuous_a_df)
smoothed_regression %>%
  tidy(conf.int = TRUE) %>%
  select(-statistic, -p.value) %>%
  # remove `I()` from the term name
  mutate(term = ifelse(term == "I(a^2)", "a^2", term)) %>%
  knitr::kable(digits = 2)
```
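As an aside (not part of Program 11.3), one informal way to compare the linear and quadratic fits is base R's `AIC()`, where a lower value suggests a better balance of fit and complexity:
```{r}
# compare both fitted models; the one with the lower AIC is preferred
AIC(linear_regression, smoothed_regression)
```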
To plot the quadratic function, we give `geom_smooth()` a `formula` argument: `formula = y ~ x + I(x^2)`.
```{r, fig.cap="Figure 11.5"}
ggplot(continuous_a_df, aes(a, y)) +
  geom_point(size = 4, col = "white", fill = "grey85", shape = 21) +
  geom_smooth(method = "lm", se = FALSE, col = "#E69F00",
              formula = y ~ x + I(x^2), linewidth = 1.2) +
  theme_minimal(base_size = 20)
```
Prediction, however, works just the same.
```{r}
smoothed_regression %>%
  predict(newdata = data.frame(a = 90))
```
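If we also want uncertainty around this prediction, `predict()` for `lm` objects can return a 95% confidence interval for the conditional mean via `interval = "confidence"`:
```{r}
# point estimate plus lower/upper bounds of the 95% CI for the mean of y at a = 90
smoothed_regression %>%
  predict(newdata = data.frame(a = 90), interval = "confidence")
```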