forked from jtr13/cc21fall2
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpostgresql_r.Rmd
135 lines (92 loc) · 6.8 KB
/
postgresql_r.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
# How to Integrate R with PostgreSQL
Julia Wang
```{r, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(RPostgres)
library(DBI)
```
## Motivation
This guide will teach you how to connect RStudio to your local instance of PostgreSQL, a popular open source object-relational database server management system, that is free to download. Once we have become familiar with connecting to our PostgreSQL instance, we will perform a quick Linear Regression on some sample data, store the data in your PostgreSQL database, and then reuse that LR model to make a prediction.
Note: instance and database will be used interchangeably in this guide, but mean the same thing.
## How to Integrate R with PostgreSQL
If you haven't already, you will have to [download PostgreSQL]((https://www.postgresql.org/download/){target="_blank"} and setup pgAdmin 4. You should note down is the user and password that you create in this process, as well as make a note of what port and host you used to set up PostgreSQL (the defaults for this are localhost and 5432). You will need this information later to establish a connection between RStudio and your PostgreSQL instance.
Once you have set up PostgreSQL and were able to connect to pgAdmin 4, now in RStudio, we want to install and use the relevant RPostgreSQL packages. You can do so by running the following commands.
``` {r, eval = FALSE}
install.packages("RPostgres")
install.packages("devtools")
install.packages("DBI")
install.packages("remotes")
```
### Create your connection
In the case you want to use a separate dummy/admin user, you can create a user by going to Servers > PostgreSQL > Databases > <database name> > Login > right click Create > Login/Group Role. The name will be the user name and set the password in Definition to whatever you want. It should have the priviledge to login at the bare minimum. In the following R block, we are going to setup a connection to our PostgreSQL instance. One question you might have reading the following block is what the drv (driver) does. A driver in this context is a tool that allows applications to access data in your database in other systems, similar to how printer drivers allow you to talk to your printer and print things. We pass a driver and relevant information about your database and a database user to create a connection, which we are able to use to then run R commands, such as dbListTables, which list all of the tables in your database.
``` {r, eval=FALSE}
db <- '<your db name here>' # provide the name of your db
host_db <- '<your db server here>' # i.e. 'ec2-54-83-201-96.compute-1.amazonaws.com or localhost if you are connecting to an instance on your own laptop'
db_port <- '5432' # or any other port specified by the DBA
db_user <- '<your db user here>' # user that has access to your SQL instance
db_password <- '<your db user pwd here>' # password of the user who has access to your SQL instance.
drv <- RPostgres::Postgres()
con <- dbConnect(drv, dbname = db, host=host_db, port=db_port, user=db_user, password=db_password)
dbListTables(con) # returns a list of tables in your database
dbExistsTable(con, "<table name>") # checks if the table exists in your database
```
### Basic SQL commands
In addition to the two commands that I showed above, there are a few more basic commands that are built into the RPostgres library, which will allow you to read and write from your database. The following block will show you how to write a dataframe into a table in your instance, as well as query it afterwards.
``` {r, eval=FALSE}
# Create a table
library(tidyverse)
cars <- mtcars %>% rownames_to_column("carname")
dbWriteTable(con, 'cars', cars) # add the mtcars data set to your database as a table called "cars"
cars <- cars %>% mutate(id = row_number()) # if you want to change your data, such as adding an id column
dbWriteTable(con, 'cars', cars, overwrite=TRUE, append=FALSE) # you will need to pass in additional parameters. Overwrite will drop and recreate the table with your new data. Append just appends your data, your df needs to be in the same shape with the same variable names in order to work.
```
If you don't notice your table immediately, make sure to right click the tables dropdown in the sidebar and hit refresh.
Read your table:
``` {r, eval=FALSE}
dbReadTable(con, "cars") # read your newly created table
result <- dbReadTable(con, "cars") # can also be stored as variable
```
In case you prefer writing raw sql, you can use the dbGetQuery method. There is also a dbSendQuery method which is a bit more involved, but for our purposes, dbGetQuery will be sufficient.
``` {r, eval=FALSE}
library(ggplot2)
dbGetQuery(con, 'ALTER TABLE cars ADD CONSTRAINT cars_pk PRIMARY KEY (id)') # add primary key to the id column
db_cars <- dbGetQuery(con, 'SELECT * FROM cars WHERE id <= 20') # filter down our data set
ggplot(db_cars, aes(x=disp, y=mpg)) +
geom_point() +
ggtitle("Miles Per Gallon vs. Displacement (cu.in.)") +
ylab("Miles Per Gallon") +
xlab("Displacement (cu.in.)")
```
The dbSendQuery method only submits and synchronously executes the SQL query to the database engine. It does not extract any records, for that you need to use the dbFetch method, and then you must call dbClearResult when you finish fetching the records you need. If you are going to run SQL queries through RStudio, then most likely you should be using dbGetQuery.
Now that we are able to connect to and talk to our PostgreSQL instance, we can now run analyses using the data stored there as well as store our results there. We can also pull that saved model to make predictions as well, which can save us time on retraining our models.
``` {r, eval=FALSE}
model <- lm(mpg ~ disp, db_cars)
serialized_model <- rawToChar(serialize(model, NULL, ascii=TRUE)) # serialize (convert string) model so it can be stored in database
# create table for model to be stored in
dbGetQuery(con, 'CREATE TABLE models (
id SERIAL PRIMARY KEY,
model TEXT NOT NULL
);')
# insert model into models table
insert_query <-'INSERT INTO models (model) VALUES ($1)'
rs <- dbSendQuery(con, insert_query, list(serialized_model))
dbClearResult(rs)
# read the model from postgreSQL
result <- dbGetQuery(con, "SELECT model FROM models WHERE id = 1")
# revert serialization, and we have a working model again!
db_model <- unserialize(charToRaw(as.character(result[,c('model')])))
summary(db_model)
# use the model to make a prediction
disp <- 128
X_test <- data.frame(disp)
predict(db_model, X_test)
```
Make sure to clean up your connections once you are finished working with your database.
``` {r, cleanup, eval=FALSE}
# disconnect once you are finished
dbDisconnect(con)
dbUnloadDriver(drv)
```
## Sources
- https://www.datacareer.de/blog/connect-to-postgresql-with-r-a-step-by-step-example/
- https://stackoverflow.com/questions/1395115/storing-r-objects-in-a-relational-database