24-web_scraping.Rmd

# Web scraping

**Learning objectives**

-   Decide whether to scrape data from a web page.
-   Recognize enough HTML to find your way around a web page.
-   Extract tables from web pages.
-   Extract other data from web pages.

```{r web_scraping-packages, eval=TRUE, message=FALSE, warning=FALSE}
library(rvest)
library(tidyverse)
```

## Ethics & Legalities {-}

> [If the data isn’t public, non-personal, or factual or you’re scraping the data specifically to make money with it, you’ll need to talk to a lawyer.](https://r4ds.hadley.nz/webscraping#scraping-ethics-and-legalities)

-   Be polite (and {[polite](https://dmi3kno.github.io/polite/)})
-   Check Terms of Service
-   Beware PII
-   Facts usually aren't copyrightable

## Typical HTML structure {-}

HTML = **H**yper**T**ext **M**arkup **L**anguage

-   Hierarchical structure
-   Element = `<tagname attribute="a" other="b">content</tagname>`
    -   Start tag: `<tagname>`
    -   Attributes: `attribute="a" other="b"`
    -   Content: `content`
    -   End tag: `</tagname>`
-   Elements nest inside elements (as content) 
    -   Nested elements = "children"

## Use {rvest} to scrape web pages {-}

[{rvest}](https://rvest.tidyverse.org/) ("harvest") = tidyverse web-scraping package

-   Load html to scrape: `read_html()`
-   Shortcut for tables: `html_table()`

## Example: Table {-}

[IMDB data from the Internet Archive](https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/)

(currently times out, disabled for now; next cohort should revise!)

```{r web_scraping-tables, eval=FALSE}
url <- "https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/"
html <- read_html(url)
html |> 
  html_table()
```

## Select a specific element {-}

`html_element()` returns same # outputs as inputs (1 thing in, 1 thing out)

-   `"thing"` = `<thing>` tag
-   `".thing"` = something with attribute `class="thing"`
-   `"#thing"` = something with attribute `id="thing"`

## Example: Specific table {-}

(See "Example: Table" slide above; fix that then this)

```{r web_scraping-html_element, eval=FALSE}
html |> 
  html_element("table") |> 
  html_table()
```

## Select finer-grained elements {-}

`html_elements()` finds *all* matches

👍 Rule of thumb:

-   `html_elements()` to get observations (rows)
-   `html_element()` to get variables for each observation (columns)

## Extract data {-}

-   `html_text()` for raw text (you probably don't want this)
-   `html_text2()` for clean text
-   `html_attr()` for attribute value (eg url `href`)

## Example: Star Wars Rows {-}

[Star Wars films (1-7)](https://rvest.tidyverse.org/articles/starwars.html)

```{r web_scraping-star_wars-section}
url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)

section <- html |> html_elements("section")
section
```

## Example: Star Wars Directors {-}

```{r web_scraping-star_wars-directors}
section |> html_element(".director") |> html_text2()
```

## Example: Star Wars Tibble {-}

```{r web_scraping-star_wars-tibble}
tibble(
  title = section |> 
    html_element("h2") |> 
    html_text2(),
  released = section |> 
    html_element("p") |> 
    html_text2() |> 
    stringr::str_remove("Released: ") |> 
    readr::parse_date(),
  director = section |> 
    html_element(".director") |> 
    html_text2(),
  intro = section |> 
    html_element(".crawl") |> 
    html_text2()
)
```

## Learn more {-}

-   [SelectorGadget](https://rvest.tidyverse.org/articles/selectorgadget.html)
-   [CSS Diner](https://flukeout.github.io/)
-   [MDN CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors)
-   [*Web APIs with R* book club](https://DSLC.io/wapir)

## Meeting Videos {-}

### Cohort 7 {-}

`r knitr::include_url("https://www.youtube.com/embed/G5_pr9HxbT4")`

<details>
  <summary> Meeting chat log </summary>
```
00:04:47	Oluwafemi Oyedele:	Hi Tim!!!
00:05:27	Tim Newby:	Hi Oluwafemi - can you hear me?
00:05:37	Oluwafemi Oyedele:	Yes
00:11:53	Oluwafemi Oyedele:	start
00:33:49	Oluwafemi Oyedele:	https://rvest.tidyverse.org/articles/selectorgadget.html
00:40:24	Oluwafemi Oyedele:	stop
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/HnJ3ZY1seY4")`

### Cohort 8 {-}

`r knitr::include_url("https://www.youtube.com/embed/dOVWSSqUvt0")`

### Cohort 9 {-}

`r knitr::include_url("https://www.youtube.com/embed/Hs928CH-_E4")`