# R based data organization and visualization
Tengteng Tao
```{r}
library(tidyverse)
library(ggridges)
library(ggplot2)
library(scales)
```
## Introduction
This cheat sheet summarizes the most important functions we used in the past two graded homework assignments. It has two sections: the first covers plotting functions, and the second covers functions that help with organizing data.
## Plots and their corresponding code
### Box plots
To create box plots, we use the function geom_boxplot() from ggplot2. The basic function is
ggplot(dataframe, aes(y = TheColumnForYaxis , x = TheColumnForXaxis ))+
geom_boxplot()
The example uses the fastfood data from the openintro package, plotting calories by restaurant. To make the plot easier to read, the restaurants are ordered by median calories from high to low.
```{r}
df <- openintro::fastfood
ggplot(df, aes(x = reorder(restaurant, calories, median), y = calories))+
geom_boxplot()+
coord_flip()+
xlab("")+
theme_grey(14)
```
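The ordering in the plot above comes from reorder(), which rearranges the levels of a factor according to a summary of a second variable. As a small sketch of what it does (using the built-in iris data rather than fastfood, purely for illustration):

```{r}
# reorder() sorts the factor levels by the median of the second argument, ascending
species_by_width <- reorder(iris$Species, iris$Sepal.Width, median)
levels(iris$Species)       # original (alphabetical) level order
levels(species_by_width)   # level order after sorting by median Sepal.Width
```

Because the levels are sorted in ascending order, the group with the largest median is drawn last; after coord_flip() it appears at the top of the plot.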
### Histograms
This section introduces two ways to make histograms. The examples use the mtl data from the openintro package.
The first way is plotting with base R. We can simply use the function hist(), like
hist(dataframe$TheColumnYouWantToPlot)
```{r}
df <- openintro::mtl
hist(df$asubic, col = "lightblue", xlab = "asubic", main = "Histogram of asubic based on R")
```
The second way is using ggplot2. Note that geom_histogram() uses 30 bins by default for any data set, which is rarely the best choice, so we usually need to set the number of bins ourselves with the bins argument.
The basic function is:
ggplot(dataframe, aes(TheColumnYouWantToPlot))+
geom_histogram()
```{r}
ggplot(df, aes(asubic))+
  # set the bin count in the histogram layer itself; adding stat_bin() would draw a second layer
  geom_histogram(bins = 9)+
  ggtitle("Histogram of asubic based on ggplot")
```
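Besides bins, geom_histogram() also accepts binwidth, which fixes the width of each bar instead of the number of bars. A minimal sketch comparing the two, assuming the built-in faithful data as a stand-in for mtl:

```{r}
library(ggplot2)

# fix the number of bins
p_bins <- ggplot(faithful, aes(waiting))+
  geom_histogram(bins = 9, color = "white")+
  ggtitle("bins = 9")

# fix the width of each bin (5 minutes here, chosen only for illustration)
p_width <- ggplot(faithful, aes(waiting))+
  geom_histogram(binwidth = 5, color = "white")+
  ggtitle("binwidth = 5")

p_bins
p_width
```

Which of the two is more convenient depends on the data: binwidth is easier to interpret when the variable has natural units, while bins is quicker for a first look.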
### Density curves
To create density curves, we use geom_density() from ggplot2. The basic function is:
ggplot(dataframe, aes(x=TheColumnYouWantToPlot))+
geom_density()
The example here uses the australia.soybean data from the agridat package.
```{r}
df <- agridat::australia.soybean
ggplot(df, aes(x = yield))+
geom_density(alpha = .2, color = "blue")+
theme_grey(14)
```
### Normal curves
To create normal curves, we need to use stat_function(fun = dnorm, args = list(mean, sd)), which draws the normal density with the given mean and standard deviation. The basic function is:
ggplot(dataframe, aes(x=TheColumnYouWantToPlot))+
stat_function(fun = dnorm, args = list(mean = mean(dataframe$TheColumnYouWantToPlot), sd = sd(dataframe$TheColumnYouWantToPlot)))
The example here uses the australia.soybean data from the agridat package.
```{r}
df <- agridat::australia.soybean
ggplot(df, aes(x = yield))+
stat_function(fun = dnorm, args = list(mean = mean(df$yield), sd = sd(df$yield)), color = "red")
```
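A normal curve on its own is most useful when compared with the data. The following sketch overlays it on a histogram drawn on the density scale; it assumes the built-in faithful data for illustration, and the bin count is arbitrary:

```{r}
library(ggplot2)

w <- faithful$waiting
p_norm <- ggplot(faithful, aes(x = waiting))+
  # y = after_stat(density) puts the histogram on the same scale as dnorm
  geom_histogram(aes(y = after_stat(density)), bins = 15,
                 fill = "lightblue", color = "white")+
  stat_function(fun = dnorm, args = list(mean = mean(w), sd = sd(w)), color = "red")
p_norm
```

If the histogram were left on the count scale, the normal curve would be squashed against the x-axis, which is why the density scale is used here.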
### Ridgeline plot
To create a ridgeline plot, we need to use geom_density_ridges() from the package ggridges. The basic function is:
ggplot(df, aes(y = TheColumnForYaxis , x = TheColumnForXaxis ))+
geom_density_ridges()
The example uses the loans_full_schema data from the openintro package. To make the plot easier to read, the loan purposes are ordered by median loan amount from high to low.
```{r}
df <- openintro::loans_full_schema
ggplot(df, aes(y = reorder(loan_purpose, loan_amount, median),
x = loan_amount))+
geom_density_ridges(fill = "blue", alpha = .5, scale = 1)+
theme_ridges()+
theme(legend.position = "none")
```
### Frequency Bar Chart
To create a frequency bar chart, we need to use geom_bar() from ggplot2. Here is the basic function (after_stat(count) is the current form of the older ..count.. notation):
ggplot(dataframe, aes(x = TheColumnYouWantToPlot))+
geom_bar(aes(y = after_stat(count)))
The data used here is the Roof.Style column of ames from the openintro package. To make the plot easier to read, the bars are sorted by frequency in ascending order.
```{r}
df <- openintro::ames
ggplot(df, aes(x = fct_rev(fct_infreq(Roof.Style)))) +
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_bar(aes(y = after_stat(count)), color = "blue", fill = "lightblue") +
xlab("") +
ylab("Frequency") +
ggtitle("Frequency bar chart for the roof styles of the properties")
```
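The bar order above is controlled by two forcats functions (loaded with tidyverse): fct_infreq() sorts the levels from most to least frequent, and fct_rev() reverses that order. A toy sketch with made-up values:

```{r}
library(forcats)

style <- factor(c("Hip", "Gable", "Gable", "Flat", "Gable", "Hip"))  # made-up values
levels(fct_infreq(style))            # "Gable" "Hip" "Flat"
levels(fct_rev(fct_infreq(style)))   # "Flat" "Hip" "Gable"
```

Dropping fct_rev() would therefore give a chart sorted in descending order instead.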
### Cleveland dot plots
To create a Cleveland dot plot, we need to use geom_point() from ggplot2. The basic function is
ggplot(dataframe, aes(y =reorder(TheColumnForYaxis, TheColumnForXaxis) , x = TheColumnForXaxis))+
geom_point()
The data used here is seattlepets from the openintro package. I plotted the 30 most popular names.
```{r}
df <- openintro::seattlepets %>% dplyr::count(animal_name, sort = TRUE) %>% drop_na()
ggplot(df[1:30,], aes(x = n, y = reorder(animal_name, n)))+
geom_point()
```
### Scatter plot
To create a scatter plot, we also use geom_point(). The basic function is
ggplot(dataframe, aes(y = TheColumnForYaxis , x = TheColumnForXaxis))+
geom_point()
The data used here is ames from the openintro package.
```{r}
df <- openintro::ames
ggplot(df, aes(area, price))+
geom_point( alpha = .15, stroke = 0,size = 1.5)+
ggtitle("Scatter plot of price vs. area")
```
### Density contour lines
To create density contour lines, we need to use geom_density_2d(). Here is the basic function:
ggplot(dataframe, aes(y = TheColumnForYaxis , x = TheColumnForXaxis))+
geom_density_2d()
Data used here is ames from the package openintro.
```{r}
ggplot(df,aes(area, price)) +
  # after_stat(level) replaces the deprecated ..level.. notation (ggplot2 >= 3.3.0)
  geom_density_2d(aes(colour = after_stat(level))) +
scale_colour_gradient(low="green",high="red") +
ggtitle("Density contour lines of price vs. area")
```
### Hexagonal heatmap
To create a hexagonal heatmap, we need to use geom_hex(), which requires the hexbin package to be installed. Here is the basic function:
ggplot(dataframe, aes(y = TheColumnForYaxis , x = TheColumnForXaxis))+
geom_hex()
Data used here is ames from the package openintro.
```{r}
ggplot(df,aes(area, price)) +
geom_hex(bins = 30) +
scale_fill_gradient(low = "#F2F0F7", high = "#08519C" ) +
theme_bw() +
ggtitle("Hexagonal heatmap of price vs. area")
```
### Square heatmap
To create a square heatmap, we need to use geom_bin_2d(). Here is the basic function:
ggplot(dataframe, aes(y = TheColumnForYaxis , x = TheColumnForXaxis))+
geom_bin_2d()
Data used here is ames from the package openintro.
```{r}
ggplot(df, aes(area, price)) +
geom_bin_2d(bins = 20) +
scale_fill_gradient(low = "#F2F0F7", high = "#08519C" ) +
theme_bw() +
ggtitle("Square heatmap of price vs. area")
```
## Data organization functions
### Pipe (%>%) Operator
The pipe operator can be used to simplify your code. It can be read simply as "and then". For example:
filter(data, variable == numeric_value) and
data %>% filter(variable == numeric_value) will yield the same result.
By using this operator properly, we can keep our code clean and brief.
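As a runnable version of the comparison above, the sketch below uses the built-in mtcars data (an assumption for illustration) and checks that the nested call and the piped call return the same result:

```{r}
library(dplyr)

nested <- dplyr::filter(mtcars, cyl == 4)
piped  <- mtcars %>% dplyr::filter(cyl == 4)
identical(nested, piped)   # TRUE

# pipes pay off when several steps are chained: read each %>% as "and then"
four_cyl_mpg <- mtcars %>%
  dplyr::filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  select(mpg, hp)
head(four_cyl_mpg, 3)
```

Written without the pipe, the second example would need three nested calls that have to be read from the inside out.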
### facet_wrap()
facet_wrap() splits one plot into multiple panels, one for each level of a variable, which gives us a more direct comparison between the groups.
The example here uses the australia.soybean data from the agridat package.
```{r}
df <- agridat::australia.soybean
ggplot(df, aes(x = yield, color = loc, fill = loc))+
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), fill = NA) +
facet_wrap(~loc, nrow = 2, strip.position = "right")+
theme_grey(14)
```
### filter()
The filter() function allows us to keep only the rows whose variables have specific values. For example, in the seattlepets data from the openintro package, we can use filter() to keep only the dogs and look at their names:
```{r}
df <- openintro::seattlepets %>% filter(species == "Dog")
df
```
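filter() also accepts several conditions at once. A short sketch, assuming the built-in mtcars data for illustration:

```{r}
library(dplyr)

# conditions separated by commas are combined with AND
small_efficient <- mtcars %>% dplyr::filter(cyl == 4, mpg > 30)
small_efficient

# use | for OR and %in% to match any of several values
mtcars %>% dplyr::filter(cyl %in% c(4, 6) | hp > 200)
```

Calling dplyr::filter() explicitly avoids confusion with stats::filter(), which has the same name but does something unrelated.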
### group_by()
The group_by() function allows us to group the data by one or more variables.
For example, to group the 2020 crime data by county and region, we can do this:
```{r}
df <- read.csv("https://data.ny.gov/api/views/ca8h-8gjq/rows.csv")
df %>% filter(Year == 2020) %>% group_by(County, Region)
```
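On its own, group_by() does not change the rows; it only records the grouping, which later verbs such as summarise() then use. A small sketch, assuming the built-in mtcars data in place of the crime data:

```{r}
library(dplyr)

grouped <- mtcars %>% group_by(cyl, gear)
group_vars(grouped)            # the grouping columns: "cyl" "gear"
nrow(grouped) == nrow(mtcars)  # TRUE: no rows were added or removed

# count the rows in each group, then drop the grouping
group_counts <- grouped %>% summarise(n = n(), .groups = "drop")
group_counts
```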
### summarise()
We can use summarise() to compute summary values for each group. For example, in the ames data set from openintro, to find the average price and area for each Neighborhood, we can do this:
```{r}
df <-openintro::ames %>%
group_by(Neighborhood) %>%
summarise(price = mean(price), area = mean(area))
df
```
### summarise_at()
In some situations, we need to summarise many variables. Writing them out one by one is time-consuming, so we can use summarise_at() instead.
For example, suppose we want to know the total number of each type of crime that happened in every county in 2020. We can write something like this:
```{r}
df2020 <- read.csv("https://data.ny.gov/api/views/ca8h-8gjq/rows.csv") %>%
filter(Year == 2020)
df2020$Property.Total = NULL
df2020 %>%
group_by(County) %>%
summarise_at(.vars = names(.)[7:13], .funs = c(sum = "sum"))
```
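In recent versions of dplyr, summarise_at() is superseded by across() used inside summarise(). The sketch below shows the equivalent pattern on the built-in mtcars data; the selected columns are chosen only for illustration:

```{r}
library(dplyr)

totals <- mtcars %>%
  group_by(cyl) %>%
  # sum the chosen columns within each group and append "_sum" to the names
  summarise(across(c(mpg, hp, wt), sum, .names = "{.col}_sum"))
totals
```

For the crime data above, the same columns could probably be selected with across(all_of(names(df2020)[7:13]), sum), assuming the column layout is unchanged.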