---
title: "R for Data Science notes and exercises"
output:
  html_document:
    df_print: paged
---
R Markdown notebook for following along with the book "R for Data Science" by Hadley Wickham and Garrett Grolemund, available from https://r4ds.had.co.nz/.
```{r}
library(tidyverse)
# Get a feel for the columns in the data.frame
str(mpg)
# First visualization, trying to understand correlation between mileage and displacement
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()
```
### 3.2.4 Exercises
#### 1. Run ggplot(data = mpg). What do you see?
```{r}
ggplot(data = mpg)
```
#### 2. How many rows are in mpg? How many columns?
```{r}
nrow(mpg)
ncol(mpg)
```
#### 3. What does the drv variable describe? Read the help for ?mpg to find out.
```{r}
table(mpg$drv)
ggplot(data = mpg, aes(x = drv)) + geom_bar()
# Looks like 4WD, RWD, and FWD
ggplot(data = mpg, aes(x = drv, y = hwy)) +
  geom_point()
# Better mileage in FWD (wish there was a better way to compute these)
summary(mpg[which(mpg$drv == '4'),]$hwy) # mean 19.17
summary(mpg[which(mpg$drv == 'f'),]$hwy) # mean 28.16
summary(mpg[which(mpg$drv == 'r'),]$hwy) # mean 21
# Found a better way to compute the means in "Análise e Exploração de Dados com R" by Miguel Rocha and Pedro G. Ferreira
tapply(mpg$hwy, mpg$drv, mean)
# Found a better way to find out what drv means: the help page
?mpg # "the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd"
```
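The per-group means above can also be computed with dplyr, which `library(tidyverse)` already attaches. This is only a sketch of one alternative; the name `hwy_by_drv` is made up for illustration.

```{r}
library(dplyr)
library(ggplot2) # provides the mpg dataset

# Mean highway mileage and number of cars for each drive train type
hwy_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(mean_hwy = mean(hwy), n = n())
hwy_by_drv
```

Compared with `tapply()`, this returns a tibble, so the result can be piped straight into `ggplot()` or joined with other summaries.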
#### 4. Make a scatterplot of hwy vs cyl.
```{r}
# hwy on the x-axis, cyl on the y-axis
ggplot(data = mpg) +
  geom_point(aes(x = hwy, y = cyl))
# Looks much better with cyl on the x-axis
ggplot(data = mpg) +
  geom_point(aes(x = cyl, y = hwy))
```
#### 5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
```{r}
ggplot(data = mpg) +
  geom_point(aes(x = class, y = drv))
ggplot(data = mpg) +
  geom_point(aes(y = class, x = drv))
# I actually consider this useful: you can see that some classes of cars come with several types of drive (there are all kinds of subcompacts), while others do not (minivans are FWD only)
```
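The exercise's premise is probably about overplotting: both variables are categorical, so many cars land on the same point and the scatterplot hides how many observations each combination has. A hedged sketch of two ways to surface those counts, using `count()` from dplyr and `geom_count()` from ggplot2 (the object names are illustrative only):

```{r}
library(dplyr)
library(ggplot2)

# Number of cars for each class/drv combination
class_drv_counts <- count(mpg, class, drv)
class_drv_counts

# Same scatterplot, with point size proportional to the number of cars
p_counts <- ggplot(data = mpg) +
  geom_count(aes(x = drv, y = class))
p_counts
```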
## 3.3 Aesthetic mappings
(...) the filled shapes (21–24) have a border of colour and are filled with fill.
```{r}
# 4 variables in same data viz
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, fill = class, color = drv), shape = 21)
```
### 3.3.1 Exercises
#### 1. What’s gone wrong with this code? Why are the points not blue?
```{r}
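ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))
```
The color aesthetic is being passed to geom_point inside the mapping, so ggplot2 interprets "blue" not as the color blue but as a constant string, i.e. a variable with the single value "blue". The points are then drawn in the first color of the default discrete palette, with a legend entry labelled "blue".
To get blue points, set color directly as an argument of geom_point, outside aes().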
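A minimal sketch of that fix, with `color` moved out of `aes()` so it is treated as a fixed setting rather than a mapped variable (`p_blue` is just an illustrative name):

```{r}
library(ggplot2)

# color is set outside aes(), so it is a constant applied to every point
p_blue <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")
p_blue
```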
#### 2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
```{r}
str(mpg)
```
I would submit that:

- manufacturer (15 levels), model (38 levels), trans (10 levels), drv (3 levels), fl (5 levels), class (7 levels) are categorical and
- displ, year, cyl, cty, hwy are continuous.

Here I consider every variable with a finite set of discrete, unordered values as categorical. When you run mpg, the tibble prints a type abbreviation under each column name (`<chr>`, `<int>`, `<dbl>`), which gives the same hint: the `<chr>` columns are the categorical ones.
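The level counts listed above can be checked in code. This is a rough sketch in base R that assumes the character columns are the categorical ones; `chr_cols` and `n_levels` are illustrative names.

```{r}
library(ggplot2) # provides the mpg dataset

# Column types: character columns are the candidates for categorical variables
sapply(mpg, class)

# Number of distinct values in each character column
chr_cols <- mpg[sapply(mpg, is.character)]
n_levels <- sapply(chr_cols, function(col) length(unique(col)))
n_levels
```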
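#### 3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
```{r}
ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, size = manufacturer))
```
Mapping the categorical variable manufacturer to size triggers a warning from ggplot2: using size for a discrete variable is not advised. Size implies an ordering, which unordered categories do not have.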
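The exercise itself asks for a continuous variable, so here is a sketch that maps `cty` instead (the choice of `cty` is an assumption for illustration). Color gives a gradient, size gives graduated points, and shape is expected to fail because ggplot2 refuses to map a continuous variable to shape; `tryCatch()` is used so the chunk still runs.

```{r}
library(ggplot2)

# Continuous variable mapped to color: a gradient legend
p_color <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, color = cty))
p_color

# Continuous variable mapped to size: points grow with the value
p_size <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, size = cty))
p_size

# Continuous variable mapped to shape: building the plot raises an error
p_shape <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, shape = cty))
shape_result <- tryCatch(ggplot_build(p_shape), error = function(e) e)
inherits(shape_result, "error")
```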
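#### 4. What happens if you map the same variable to multiple aesthetics?
```{r}
ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, size = cyl, color = cyl))
```
Not as informative: both aesthetics encode the same variable, so the second one is redundant and adds no new information.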
#### 5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
```{r}
ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, stroke = cyl, color = class))
```
It modifies the width of the point border. According to ?geom_point it applies to shapes that have a border, such as the filled shapes 21–24.
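The effect is easier to see on a filled shape with a fixed stroke. A small sketch, where shape 21, the black border, and the stroke value of 2 are arbitrary choices for illustration:

```{r}
library(ggplot2)

# shape = 21 is a filled circle with a border; stroke sets the border width
p_stroke <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, fill = class),
             shape = 21, color = "black", stroke = 2)
p_stroke
```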
#### 6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
```{r}
ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, color = displ < 5))
```
Works as expected: ggplot2 evaluates the expression displ < 5 for every row and maps the resulting logical to color, giving two colors (one for FALSE, another for TRUE).
### 3.5.1 Exercises
#### 1. What happens if you facet on a continuous variable?
```{r}
ggplot(data = mpg) +
  geom_point(aes(x = cyl, y = hwy)) +
  facet_wrap(~ displ)
```
It treats the continuous variable as categorical: every distinct value of displ gets its own facet, so the plot ends up with many small panels.
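To see why the plot ends up with so many panels, it helps to count the distinct values of `displ`. A small check, with `cut_width()` from ggplot2 shown as one possible way to bin the variable before faceting (the bin width of 1 is an arbitrary choice):

```{r}
library(ggplot2)

# facet_wrap(~ displ) draws one panel per distinct value of displ
n_panels <- length(unique(mpg$displ))
n_panels

# Binning displ first gives a handful of panels instead
p_binned <- ggplot(data = mpg) +
  geom_point(aes(x = cyl, y = hwy)) +
  facet_wrap(~ cut_width(displ, 1))
p_binned
```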
#### 2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
```{r}
ggplot(mpg) +
  geom_point(aes(displ, hwy)) +
  facet_grid(drv ~ cyl)
```
They represent combinations of drv and cyl with no observations, e.g. there are no 4-wheel-drive cars with 5 cylinders in the data.
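The empty cells can be confirmed with a cross-tabulation. A minimal sketch in base R (`drv_cyl` is an illustrative name):

```{r}
library(ggplot2) # provides the mpg dataset

# Cross-tabulate drive train against number of cylinders;
# a zero marks a combination with no cars, i.e. an empty facet
drv_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_cyl
```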
#### 3. What plots does the following code make? What does . do?
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ drv)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~drv)
```
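In a `facet_grid()` formula, `.` is a placeholder meaning "do not facet on this dimension". A sketch contrasting the two positions, reusing `drv` from the chunk above (`base_plot`, `p_rows` and `p_cols` are illustrative names):

```{r}
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# drv on the rows, nothing on the columns: panels stacked vertically
p_rows <- base_plot + facet_grid(drv ~ .)
p_rows

# nothing on the rows, drv on the columns: panels side by side
p_cols <- base_plot + facet_grid(. ~ drv)
p_cols
```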