# Using Python in R
Xingyu Lu
```{r, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
**1. Background**
R and Python are two of the most popular programming languages among data scientists. It can be hard to decide which one to use, especially for beginners.
However, the two languages are easy to compare, and neither is better than the other in all cases. For example, R is among the strongest languages for exploratory data analysis, while Python is more readable and integrates easily into any part of your workflow. **The following figure is a comparison screenshot from resource 1.**
![](resources/use_python_in_r/pythonvsr.png)
As a beginner in R, I enjoyed the beautiful plots R can make, but also suffered from obscure data processing functions and limited online resources and documentation.
The *reticulate* package, which allows using Python in R, then becomes a lifesaver for R beginners who are already familiar with Python.
**2. *reticulate* package**
The *reticulate* package allows users to use Python modules, classes, and functions in R. Although it CANNOT guarantee all the benefits of Python, it can at least give you some convenience.
*(a) Setup*
Surprisingly, it's quite easy to set up Python in R Markdown. The only thing you need is to include the following block:
```{r}
path <- '/usr/local/opt/[email protected]/bin/python3'
library(reticulate)
use_python(path)
```
The `path` variable points to the Python interpreter you are using. To find it, you can do the following:
* Windows: run the following command in cmd (it prints the directory that contains the Python executable)
```
python -c "import os, sys; print(os.path.dirname(sys.executable))"
```
* macOS: run the following command in the terminal
```
which python
```
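As a platform-independent alternative to the two commands above, Python itself can report where its interpreter lives. The following is a minimal sketch using only the standard library; the variable names `interpreter_path` and `interpreter_dir` are chosen here for illustration and are not part of *reticulate*.

```python
import os
import sys

# Absolute path of the interpreter that is running this code.
# This is the kind of value stored in `path` in the setup block.
interpreter_path = sys.executable

# Directory containing the interpreter (what the Windows one-liner prints).
interpreter_dir = os.path.dirname(interpreter_path)

print(interpreter_path)
print(interpreter_dir)
```

Run this with the Python installation you intend to use from R; a different installation on the same machine would print a different path.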
*(b) Usage*
After the setup, you can use Python blocks just like R blocks in R Markdown files: wrap your Python code with \`\`\`\{python\} \`\`\`, just as you would for your R code.
For example, if you want to load the wait list data in HW3 Q3, you could write
```{python}
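import pandas as pd

# Read the wait list data (same path as the later chunks in this chapter)
df = pd.read_csv('resources/use_python_in_r/stats_wl.csv')
# Show the first five rows
df.head()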
```
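You can now manipulate the data in Python, which may feel easier than R if you are already familiar with pandas.
**Note: a useful function is `py_install()`, which installs a Python package from R, e.g. `py_install("seaborn")`.**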
*(c) Plot using Python packages*
It's also possible to make plots using python packages like *matplotlib* and *seaborn*.
For example, I want to see the distribution of the final status of all the students.
```{python, fig.width=15, fig.height=15}
import seaborn as sns
df_status = df[['Name', 'Status']].drop_duplicates()
print(df_status.shape)
sns.set(font_scale=0.5)
sns.countplot(y='Status', data=df_status).set_title('Countplot for Final Status of all Students')
```
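To see what the count plot is counting, the same two steps (drop duplicate name and status pairs, then count each status) can be tried on a tiny table. This is only a sketch: the rows in `toy` are made up for illustration and are not taken from the wait list file, although the column names follow the ones used above.

```python
import pandas as pd

# Hypothetical toy rows in the same shape as the wait list data:
# one row per student per date, so each student appears several times.
toy = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'B', 'C'],
    'Date': ['day1', 'day2', 'day1', 'day2', 'day2'],
    'Priority': [1, 1, 2, 2, 3],
    'Status': ['Registered', 'Registered', 'Left List', 'Left List', 'Registered'],
})

# Keep one (Name, Status) pair per student before counting.
toy_status = toy[['Name', 'Status']].drop_duplicates()

# value_counts() gives the bar lengths that a count plot would draw.
status_counts = toy_status['Status'].value_counts()
print(status_counts)
```

Without `drop_duplicates()`, a student who appears on many dates would be counted once per date, which would inflate the bars.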
*(d) Integrate into R*
Up to now, we have only been using Python functions, as if RStudio were just another editor for Python. How can we integrate Python into R, to enjoy the coding experience of Python and the nice plots of R at the same time?
Well, it's quite simple.
*(d-1) What's happening to the Python block in R Markdown?*
After running the above blocks in RStudio, you should observe the following messages in your Console:
![](resources/use_python_in_r/console.png)
This means your console is currently in Python mode, and you can run Python code in the console just as in any other IDE. Of course, you can't use R code here unless you exit the REPL and return to R, just as the message states.
In the Environment window, you should also observe the following data and variables:
![](resources/use_python_in_r/env.png)
Notice that you would no longer observe those variables in the Environment window once you exit the REPL mode in the console.
![](resources/use_python_in_r/envr.png)
*(d-2) Load variables from REPL mode into R mode*
Then how do we pass the preprocessed variable, `df_status`, back to R mode? There is a variable called `py` in your R environment, which can access all the variables in the REPL mode via the `$` operator. So
```{r}
data <- py$df_status
head(data)
```
which loads the preprocessed data to R mode.
*(d-3) Specific example*
Recall Q3 in HW3. To make the plot, I struggled a lot with preprocessing the data in R, and the code I used in my homework is as follows:
```{r fig.height=15,fig.width=15}
library(ggalluvial)
library(data.table)
library(dplyr)
library(tidyr)
df <- read.csv('resources/use_python_in_r/stats_wl.csv')
df_W <-
  df %>%
  pivot_wider(names_from = Date, values_from = Priority)
col_names <- colnames(df_W)
df_W <- data.frame(lapply(df_W, as.character), stringsAsFactors=FALSE)
names(df_W) <- col_names
for (r in 1:nrow(df_W)){
  present <- FALSE
  for (c in 3:ncol(df_W)) {
    if (!present && !is.na(df_W[r,c])){
      present <- TRUE
    }
    if (present){
      if (is.na(df_W[r,c])){
        status <- df_W[r, 2]
        df_W[r,c] <- ifelse(status=='Registered', 'R', ifelse(status=='Dropped Class', 'D', 'L'))
      }
      if (!(df_W[r, c] %in% c('R', 'D', 'L')) && nchar(df_W[r,c])==1) {
        df_W[r, c] <- paste('0', df_W[r, c], sep='')
      }
    }
  }
}
df_W %>%
  to_lodes_form(axis = 2:15) %>%
  drop_na() %>%
  subset(x != 'Status') %>%
  mutate(status=ifelse(stratum=='R', 'Registered', ifelse(stratum=='L', 'Left List', ifelse(stratum=='D', 'Dropped Class', 'Joined')))) %>%
  ggplot(aes(alluvium = alluvium, x = x, stratum = stratum, y = 1, fill=status)) +
  geom_alluvium(color = "black") +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = paste(after_stat(stratum)))) +
  labs(x='Data', y='Student')
```
So the data used to plot the figure is actually `df_W`. There are a lot of preprocessing steps, like handling the data types and filling in the missing values. There must be some fancy and powerful functions in R that would let me accomplish the task easily. However, it would greatly increase my productivity if I could take advantage of my experience in Python. So, if I could use Python to redo the process, it would be
```{python}
df = pd.read_csv('resources/use_python_in_r/stats_wl.csv')
students = set(df['Name'])
dates = list(set(df['Date']))
dates.sort()

# Map each student to {date: zero-padded priority}
name_priority_map = {}
for e in df.values:
    name, date, priority = e[0], e[1], e[2]
    if name not in name_priority_map:
        name_priority_map[name] = {}
    name_priority_map[name][date] = str(priority).zfill(2)

# Map each student to the first letter of the final status ('R', 'D' or 'L')
name_status_map = {e[0]: e[1][0] for e in df_status.values}

# Build one row per student with one label per date
processed = []
for s, m in name_priority_map.items():
    entry = []
    appear = False
    for d in dates:
        if d in m:
            appear = True
        if not appear:
            # Student has not joined the wait list yet
            entry.append('Unseen')
        elif d in m:
            entry.append(m[d])
        else:
            # Student joined earlier but is no longer listed
            entry.append(name_status_map[s])
    processed.append([s] + entry)
df_processed = pd.DataFrame(processed, columns=['Name'] + dates)
```
This finishes the preprocessing. The plot can then be made in R:
```{r fig.height=15,fig.width=15}
library(naniar)
df_p <- as.data.frame(py$df_processed)
df_p %>%
  replace_with_na_all(condition = ~. == 'Unseen') %>%
  to_lodes_form(axis = 2:14) %>%
  drop_na() %>%
  mutate(status=ifelse(stratum=='R', 'Registered', ifelse(stratum=='L', 'Left List', ifelse(stratum=='D', 'Dropped Class', 'Joined')))) %>%
  ggplot(aes(alluvium = alluvium, x = x, stratum = stratum, y = 1, fill=status)) +
  geom_alluvium(color = "black") +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = paste(after_stat(stratum)))) +
  labs(x='Data', y='Student')
```
This yields the same plot. The code is not much simpler. However, my productivity increased: it took me around 5 hours to preprocess the data in R, but only 20 minutes in Python.
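To make the fill rule in the Python preprocessing concrete, here is a minimal sketch that applies the same three cases ("not seen yet", "on the list", "left the list") to a single student. The helper `fill_timeline` and the date labels `d1` to `d4` are hypothetical and only for illustration; they do not appear in the homework code or data.

```python
def fill_timeline(dates, priorities, status_code):
    """Return one label per date for a single student.

    dates: sorted list of all dates in the data.
    priorities: dict mapping date -> wait list priority for this student.
    status_code: first letter of the final status ('R', 'D' or 'L').
    """
    entry = []
    appeared = False
    for d in dates:
        if d in priorities:
            appeared = True
        if not appeared:
            # Before the student first joins the wait list.
            entry.append('Unseen')
        elif d in priorities:
            # On the list: zero-padded priority, e.g. 3 -> '03'.
            entry.append(str(priorities[d]).zfill(2))
        else:
            # Joined earlier but no longer listed: use the final status.
            entry.append(status_code)
    return entry


# Hypothetical student who joins on d2, is listed on d2 and d3, then registers.
dates = ['d1', 'd2', 'd3', 'd4']
timeline = fill_timeline(dates, {'d2': 12, 'd3': 3}, 'R')
print(timeline)  # ['Unseen', '12', '03', 'R']
```

The `'Unseen'` entries are the ones that are later turned into missing values in R and dropped before plotting.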
**3. Conclusion**
Although it might be unnecessary for an R expert to use Python in R Markdown, it is really useful for R beginners to do a little bit of setup and take advantage of Python in R.
**4. Resources**
1. https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis
2. https://cran.r-project.org/web/packages/reticulate/vignettes/calling_python.html
3. https://stackoverflow.com/questions/647515/how-can-i-find-where-python-is-installed-on-windows
4. https://cran.r-project.org/web/packages/reticulate/vignettes/r_markdown.html