-
Notifications
You must be signed in to change notification settings - Fork 41
/
Copy path24-web_scraping.Rmd
162 lines (115 loc) · 4.3 KB
/
24-web_scraping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
# Web scraping
**Learning objectives**
- Decide whether to scrape data from a web page.
- Recognize enough HTML to find your way around a web page.
- Extract tables from web pages.
- Extract other data from web pages.
```{r web_scraping-packages, eval=TRUE, message=FALSE, warning=FALSE}
library(rvest)
library(tidyverse)
```
## Ethics & Legalities {-}
> [If the data isn’t public, non-personal, or factual or you’re scraping the data specifically to make money with it, you’ll need to talk to a lawyer.](https://r4ds.hadley.nz/webscraping#scraping-ethics-and-legalities)
- Be polite (and {[polite](https://dmi3kno.github.io/polite/)})
- Check Terms of Service
- Beware PII
- Facts usually aren't copyrightable
## Typical HTML structure {-}
HTML = **H**yper**T**ext **M**arkup **L**anguage
- Hierarchical structure
- Element = `<tagname attribute="a" other="b">content</tagname>`
- Start tag: `<tagname>`
- Attributes: `attribute="a" other="b"`
- Content: `content`
- End tag: `</tagname>`
- Elements nest inside elements (as content)
- Nested elements = "children"
## Use {rvest} to scrape web pages {-}
[{rvest}](https://rvest.tidyverse.org/) ("harvest") = tidyverse web-scraping package
- Load html to scrape: `read_html()`
- Shortcut for tables: `html_table()`
## Example: Table {-}
[IMDB data from the Internet Archive](https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/)
(currently times out, disabled for now; next cohort should revise!)
```{r web_scraping-tables, eval=FALSE}
url <- "https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/"
html <- read_html(url)
html |>
html_table()
```
## Select a specific element {-}
`html_element()` returns same # outputs as inputs (1 thing in, 1 thing out)
- `"thing"` = `<thing>` tag
- `".thing"` = something with attribute `class="thing"`
- `"#thing"` = something with attribute `id="thing"`
## Example: Specific table {-}
(See "Example: Table" slide above; fix that then this)
```{r web_scraping-html_element, eval=FALSE}
html |>
html_element("table") |>
html_table()
```
## Select finer-grained elements {-}
`html_elements()` finds *all* matches
👍 Rule of thumb:
- `html_elements()` to get observations (rows)
- `html_element()` to get variables for each observation (columns)
## Extract data {-}
- `html_text()` for raw text (you probably don't want this)
- `html_text2()` for clean text
- `html_attr()` for attribute value (eg url `href`)
## Example: Star Wars Rows {-}
[Star Wars films (1-7)](https://rvest.tidyverse.org/articles/starwars.html)
```{r web_scraping-star_wars-section}
url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)
section <- html |> html_elements("section")
section
```
## Example: Star Wars Directors {-}
```{r web_scraping-star_wars-directors}
section |> html_element(".director") |> html_text2()
```
## Example: Star Wars Tibble {-}
```{r web_scraping-star_wars-tibble}
tibble(
title = section |>
html_element("h2") |>
html_text2(),
released = section |>
html_element("p") |>
html_text2() |>
stringr::str_remove("Released: ") |>
readr::parse_date(),
director = section |>
html_element(".director") |>
html_text2(),
intro = section |>
html_element(".crawl") |>
html_text2()
)
```
## Learn more {-}
- [SelectorGadget](https://rvest.tidyverse.org/articles/selectorgadget.html)
- [CSS Diner](https://flukeout.github.io/)
- [MDN CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors)
- [*Web APIs with R* book club](https://DSLC.io/wapir)
## Meeting Videos {-}
### Cohort 7 {-}
`r knitr::include_url("https://www.youtube.com/embed/G5_pr9HxbT4")`
<details>
<summary> Meeting chat log </summary>
```
00:04:47 Oluwafemi Oyedele: Hi Tim!!!
00:05:27 Tim Newby: Hi Oluwafemi - can you hear me?
00:05:37 Oluwafemi Oyedele: Yes
00:11:53 Oluwafemi Oyedele: start
00:33:49 Oluwafemi Oyedele: https://rvest.tidyverse.org/articles/selectorgadget.html
00:40:24 Oluwafemi Oyedele: stop
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/HnJ3ZY1seY4")`
### Cohort 8 {-}
`r knitr::include_url("https://www.youtube.com/embed/dOVWSSqUvt0")`
### Cohort 9 {-}
`r knitr::include_url("https://www.youtube.com/embed/Hs928CH-_E4")`