# Comparison among base R, tidyverse, and data.table
Siyue Han
```{r}
# Load the tidyverse package
library(tidyverse)
# Load the data.table package
library(data.table)
```
## Introduction
There are many ways to read and analyze data in R. The `data.frame` provided in base R can handle most situations, so I have used it most of the time, occasionally switching to `tibble` from the `tidyverse`. But once, when I was dealing with a large csv file, I found it was very slow with `data.frame`. With the help of Google, I used `data.table` for the first time, and it was amazing. Therefore, I'd like to share this way of reading and analyzing data with more people.
## Reading Data
First, let's compare how fast each of the three environments reads a csv file.
### Data.Frame
This is an example of reading a csv file into a `data.frame`.
```{r}
start <- Sys.time()
df <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
end <- Sys.time()
print(end - start)
```
### Tibble
This is an example of reading a csv file into a `tibble`.
```{r}
start <- Sys.time()
tb <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
end <- Sys.time()
print(end - start)
```
### Data.Table
This is an example of reading a csv file into a `data.table`.
```{r}
start <- Sys.time()
dt <- fread("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
end <- Sys.time()
print(end - start)
```
As we can see, `data.table` reads a csv file very fast, especially when the file is large. `read_csv()` from the `tidyverse` is somewhat faster than base R's `read.csv()`, but still much slower than `data.table`'s `fread()`.
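A single `Sys.time()` difference is noisy, especially over a network download. As a side note, the `microbenchmark` package (an assumption here: it is not loaded above and would need to be installed) can average over several runs. A minimal sketch, not evaluated to avoid repeatedly downloading the same file:
```{r, eval=FALSE}
# Not run: requires install.packages("microbenchmark") and downloads
# the same remote csv several times.
library(microbenchmark)
url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
microbenchmark(
  base       = read.csv(url),
  tidyverse  = read_csv(url),
  data.table = fread(url),
  times = 3
)
```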
## Processing Data
Next, let's look at the differences in processing data across the three environments.
### Selecting column(s) and row(s)
```{r}
start <- Sys.time()
x1 <- df[101:110, c('Lat', 'Long_')]
end <- Sys.time()
print(end - start)
start <- Sys.time()
x2 <- select(tb, Lat, Long_) %>% slice(101:110)
end <- Sys.time()
print(end - start)
start <- Sys.time()
x3 <- dt[101:110, .(Lat, Long_)]
end <- Sys.time()
print(end - start)
```
### Filtering row(s)
```{r}
start <- Sys.time()
x1 <- df[df$`X10.31.21` > 500000,]
end <- Sys.time()
print(end - start)
start <- Sys.time()
x2 <- filter(tb, `10/31/21` > 500000)
end <- Sys.time()
print(end - start)
start <- Sys.time()
x3 <- dt[`10/31/21` > 500000,]
end <- Sys.time()
print(end - start)
```
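Note the different column names used above: `read.csv()` converts non-syntactic headers such as `10/31/21` into `X10.31.21` (its default `check.names = TRUE` applies `make.names()`), while `read_csv()` and `fread()` keep the original headers, which then have to be wrapped in backticks:
```{r}
# The first 11 columns are metadata; the date columns start at column 12.
# read.csv() mangles the date headers; read_csv() and fread() keep them.
head(names(df), 13)
head(names(tb), 13)
```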
### Sorting the table
```{r}
start <- Sys.time()
x1 <- df[order(-df$`X10.31.21`), ]
end <- Sys.time()
print(end - start)
start <- Sys.time()
x2 <- arrange(tb, -`10/31/21`)
end <- Sys.time()
print(end - start)
start <- Sys.time()
x3 <- dt[order(-`10/31/21`), ]
end <- Sys.time()
print(end - start)
```
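As a side note, `data.table` can also sort in place with `setorder()`/`setorderv()`, which avoids copying the whole table. A minimal sketch, run on a copy so that `dt` stays unchanged for the later sections:
```{r}
# setorderv() takes the column name as a string, which is convenient
# for non-syntactic names like "10/31/21"; order = -1L means descending.
dt_sorted <- copy(dt)
setorderv(dt_sorted, "10/31/21", order = -1L)
```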
### Summarizing columns by group
```{r}
start <- Sys.time()
x1 <- aggregate(df$`X10.31.21`, list(df$Province_State), sum)
end <- Sys.time()
print(end - start)
start <- Sys.time()
x2 <- group_by(tb, Province_State) %>% summarise(sum(`10/31/21`))
end <- Sys.time()
print(end - start)
start <- Sys.time()
x3 <- dt[ , lapply(.(`10/31/21`), sum), by = Province_State]
end <- Sys.time()
print(end - start)
```
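The `lapply(.(...), sum)` spelling above works, but it leaves the result column unnamed (`V1`). A more common `data.table` spelling names the summary column directly:
```{r}
# Same aggregation, with a named output column.
x3b <- dt[, .(total = sum(`10/31/21`)), by = Province_State]
head(x3b)
```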
### Pivoting longer
```{r}
start <- Sys.time()
x1 <- reshape(df,
              varying = 12:ncol(df),
              timevar = "Date", v.names = "Cases",
              direction = "long")
x1 <- x1[, c('Combined_Key', 'Date', 'Cases')]
end <- Sys.time()
print(end - start)
start <- Sys.time()
x2 <- pivot_longer(tb,
                   cols = -(1:11),
                   names_to = "Date",
                   values_to = "Cases") %>%
  select(Combined_Key, Date, Cases)
end <- Sys.time()
print(end - start)
start <- Sys.time()
x3 <- melt(dt,
           id.vars = 1:11,
           variable.name = "Date",
           value.name = "Cases")[
             , .(Combined_Key, Date, Cases)]
end <- Sys.time()
print(end - start)
```
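For completeness, the reverse operation (long back to wide) is `reshape(direction = "wide")` in base R, `pivot_wider()` in the `tidyverse`, and `dcast()` in `data.table`. A minimal `data.table` sketch using the long table just created:
```{r}
# Pivot x3 back to one column per date.
x3_wide <- dcast(x3, Combined_Key ~ Date, value.var = "Cases")
dim(x3_wide)
```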
### Joining tables
```{r}
start <- Sys.time()
x1 <- merge(df[, 1:11], x1)
end <- Sys.time()
print(end - start)
start <- Sys.time()
x2 <- left_join(select(tb, 1:11), x2, by = "Combined_Key")
end <- Sys.time()
print(end - start)
start <- Sys.time()
x3 <- dt[, 1:11][x3, on = "Combined_Key"]
end <- Sys.time()
print(end - start)
```
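Note that an `X[Y, on = ...]` join in `data.table` keeps every row of `Y`, so the bracket form above attaches the 11 metadata columns to every row of the long table. `data.table` also provides a `merge()` method with the base R spelling; a sketch, not evaluated since `x3` has already been overwritten by the join above (`x3_long` is a hypothetical name for the long table from the previous section):
```{r, eval=FALSE}
# Not run: x3_long stands for the long table produced by melt() above.
x3 <- merge(dt[, 1:11], x3_long, by = "Combined_Key")
```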
### Chaining structures
Base R does not have a chaining structure like the `tidyverse` or `data.table`, so here we compare only those two.
The `tidyverse` uses the pipe `%>%` to connect operations.
`data.table` chains bracketed operations back to back, as in `[...][...]`.
```{r}
start <- Sys.time()
x2 <- tb %>%
mutate(year = `10/31/21` - `10/31/20`) %>%
group_by(Province_State) %>%
summarise(year = sum(year)) %>%
arrange(-year)
end <- Sys.time()
print(end - start)
start <- Sys.time()
x3 <- dt[, year := `10/31/21` - `10/31/20`][
  , .(year = sum(year)), by = Province_State][
    order(-year)]
end <- Sys.time()
print(end - start)
```
From all the above, we can see that for simple operations such as selecting, filtering, and sorting, base R finishes very quickly. However, for complex operations such as pivoting and joining, base R takes a huge amount of time. Comparing `tidyverse` and `data.table`, we can see that `data.table` is slightly faster than `tidyverse` in almost every task. In particular, when chaining, `data.table` finishes much faster than `tidyverse`. This is probably because `data.table` combines several operations in a single bracketed expression: in the example above, one bracketed operation performs what takes `group_by`, `summarise`, and `arrange` in the `tidyverse`. On the other hand, since each `tidyverse` function does only one task, and its name makes that task easy to understand, `tidyverse` code is more readable than `data.table` code.
## Summary of key functions
The `data.table` cells below follow the `dt[i, j, by]` template, so empty slots are shown with commas.

|Environment | base | tidyverse | data.table |
| ----------- | ----------- | ----------- | ----------- |
|Supported data class | data.frame | tibble | data.table|
|Reading data | read.csv() | read_csv() | fread()|
|Subset by column(s) | df[ , ...] | select() | dt[ , ... , ]|
|Subset by row(s) | df[... , ] | filter() | dt[... , , ]|
|Create new column | df\$y <- ... | mutate(tb, y = ...) | dt[ , y := ... , ]|
|Delete a column | df\$y <- NULL | select(tb, -y) | dt[ , y := NULL , ]|
|Summarize | apply(df[ , y], 2, ...) | summarise() | dt[ , f(y) , ]|
|Grouping | aggregate() | group_by() | dt[ , , by = ...]|
|Pivot to long | reshape() | pivot_longer() | melt()|
|Pivot to wide | reshape() | pivot_wider() | dcast()|
|Joining tables | merge() | left_join() | dt1[dt2, on = ...]|
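A quick sketch of the "create / delete a column" rows, run on throwaway copies so the original tables stay untouched (the column `y` is just an illustrative name):
```{r}
# base R: assign and remove with $
df2 <- df
df2$y <- df2$Lat * 2   # create
df2$y <- NULL          # delete

# tidyverse: mutate() and select()
tb2 <- mutate(tb, y = Lat * 2)
tb2 <- select(tb2, -y)

# data.table: := modifies by reference, so work on an explicit copy
dt2 <- copy(dt)
dt2[, y := Lat * 2]
dt2[, y := NULL]
```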
## Conclusion
The motivation for this project was that `data.frame` proved very slow when I was dealing with a large csv file, while `data.table`, which I then tried for the first time, was amazing. Therefore, I wanted to compare its performance with that of base R and the `tidyverse` for reading and analyzing data.
From this project, I learned how to use `data.table` to analyze data, as well as the advantages of each of the three approaches. `data.frame` in base R is the most convenient way to handle everyday data analysis tasks, but it takes too much time when the data is large or the operation is complex. In such cases, it is better to use `data.table` or the `tidyverse`. When handling very large datasets, `data.table` is a good choice since it runs extremely fast. When speed matters less, especially when collaborating with others, the `tidyverse` is a good choice since its code is more readable.