---
title: "R for Data Science notes and exercises"
output:
  html_document:
    df_print: paged
---

An R Markdown notebook for following along with the book "R for Data Science" by Hadley Wickham and Garrett Grolemund, available at https://r4ds.had.co.nz/.

```{r}
library(tidyverse)

# Get a feel for the columns in the data.frame
str(mpg)

# First visualization, trying to understand correlation between mileage and displacement
ggplot(data = mpg, aes(x = displ, y = hwy)) + 
  geom_point()
```

### 3.2.4 Exercises
#### 1. Run ggplot(data = mpg). What do you see?

```{r}
ggplot(data = mpg)
```

#### 2. How many rows are in mpg? How many columns?

```{r}
nrow(mpg)
ncol(mpg)
```

#### 3. What does the drv variable describe? Read the help for ?mpg to find out.

```{r}
table(mpg$drv)
ggplot(data = mpg, aes(x = drv)) + geom_bar()
# Looks like 4WD, RWD, and FWD
ggplot(data = mpg, aes(x = drv, y = hwy)) + 
  geom_point()
# Better mileage in FWD (wish there was a better way to compute these)
summary(mpg$hwy[mpg$drv == "4"]) # mean 19.17
summary(mpg$hwy[mpg$drv == "f"]) # mean 28.16
summary(mpg$hwy[mpg$drv == "r"]) # mean 21
# Found a better way to compute the means in "Análise e Exploração de Dados com R" by Miguel Rocha and Pedro G. Ferreira
tapply(mpg$hwy, mpg$drv, mean)
# The documentation answers the original question directly
?mpg # "the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd"
```
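The per-drive-train means above can probably also be computed the tidyverse way. A minimal sketch using `group_by()` and `summarise()` from dplyr; the names `hwy_by_drv`, `mean_hwy`, and `n` are my own choices, not from the book.

```{r}
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Mean highway mileage and number of cars for each drive train type
hwy_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(mean_hwy = mean(hwy), n = n())
hwy_by_drv
```

This returns one row per drive train, so the three `summary()` calls collapse into a single table.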

#### 4. Make a scatterplot of hwy vs cyl.

```{r}
ggplot(data = mpg) +
  geom_point(aes(x = hwy, y = cyl))

# "hwy vs cyl" puts hwy on the y axis, which looks much better
ggplot(data = mpg) +
  geom_point(aes(x = cyl, y = hwy))
```

#### 5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

```{r}
ggplot(data = mpg) +
  geom_point(aes(x = class, y = drv))
ggplot(data = mpg) +
  geom_point(aes(y = class, x = drv))
# I actually consider this useful: you can see that some classes come with several drive types
# (there are all kinds of subcompacts), while others have only one (minivans are FWD only).
# The drawback is that overlapping points hide how many cars each dot represents.
```
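Both variables are categorical, so many cars land on exactly the same point and the scatterplot cannot show how many observations sit at each combination. One possible fix, sketched here on the assumption that the counts are what we want to see, is `geom_count()`, which sizes each point by the number of overlapping observations. The name `p_count` is my own.

```{r}
library(ggplot2)

# Point size encodes how many cars share each class/drv combination
p_count <- ggplot(data = mpg) +
  geom_count(aes(x = class, y = drv))
p_count
```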

## 3.3 Aesthetic mappings

(...) the filled shapes (21–24) have a border of colour and are filled with fill.

```{r}
# Four variables in the same plot
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, fill = class, color = drv), shape = 21)
```

### 3.3.1 Exercises

#### 1. What’s gone wrong with this code? Why are the points not blue?

```{r}
ggplot(data = mpg) + 
  geom_point(aes(x = displ, y = hwy, color = "blue"))
```

The color aesthetic is being passed to geom_point inside the aes() mapping, so ggplot2 interprets "blue" not as the color blue but as a constant string to map, and picks a default color for it.
Providing color = "blue" directly to geom_point, outside aes(), fixes this.
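A minimal sketch of the corrected call, with `color` moved outside `aes()` so it is a fixed property of the layer rather than a mapping. The name `p_blue` is my own.

```{r}
library(ggplot2)

# color is set outside aes(), so every point is drawn in blue
p_blue <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")
p_blue
```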

#### 2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

```{r}
str(mpg)
```

I would submit that:

- manufacturer (15 levels), model (38 levels), trans (10 levels), drv (3 levels), fl (5 levels), class (7 levels) are categorical and

- displ, year, cyl, cty, hwy are continuous.

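This treats any variable with a finite set of discrete, unordered values as categorical. When you run mpg, the type abbreviation printed under each column name shows this: `<chr>` for the categorical columns, `<int>` or `<dbl>` for the numeric ones.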
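The level counts listed above can be checked with a few lines of base R. This is only a sketch; `is_chr` and `n_levels` are names I made up for it.

```{r}
library(ggplot2)  # provides the mpg dataset

# Find the character columns, then count the distinct values in each
is_chr <- vapply(mpg, is.character, logical(1))
n_levels <- vapply(mpg[is_chr], function(col) length(unique(col)), integer(1))
n_levels
```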

#### 3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

```{r}
ggplot(data = mpg) + 
  geom_point(aes(x = displ, y = hwy, size = manufacturer))
```

ggplot2 warns that using size for a discrete variable is not advised. (Here I mapped manufacturer, which is categorical rather than continuous.)
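To cover the continuous case the exercise asks about, here is a sketch that maps `cty` to each aesthetic in turn. As far as I know, ggplot2 refuses to map a continuous variable to shape and raises an error when the plot is built, so that attempt is wrapped in `tryCatch()`. The names `p_color`, `p_size`, `p_shape`, and `shape_result` are my own.

```{r}
library(ggplot2)

# Continuous variable on color: drawn as a gradient
p_color <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, color = cty))

# Continuous variable on size: larger values get larger points
p_size <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, size = cty))

# Continuous variable on shape: expected to fail when the plot is built
p_shape <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, shape = cty))
shape_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)

p_color
p_size
shape_result
```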

#### 4. What happens if you map the same variable to multiple aesthetics?

```{r}
ggplot(data = mpg) + 
  geom_point(aes(x = displ, y = hwy, size = cyl, color = cyl))
```

It works, but the plot is not as informative: size and color both encode cyl, so the second aesthetic is redundant.

#### 5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

```{r}
ggplot(data = mpg) + 
  geom_point(aes(x = displ, y = hwy, stroke = cyl, color = class))
```

It modifies the width of the border, for shapes that have a border (such as shape 21).
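The effect of stroke is easier to see on a filled shape with a fixed border width. A small sketch, assuming shape 21 and a stroke of 2 as arbitrary illustration values; `p_stroke` is my own name.

```{r}
library(ggplot2)

# Shape 21 has a separate border (colour) and interior (fill);
# stroke controls how thick the border is drawn
p_stroke <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, fill = class),
             shape = 21, colour = "black", stroke = 2)
p_stroke
```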

#### 6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

```{r}
ggplot(data = mpg) + 
  geom_point(aes(x = displ, y = hwy, color = displ < 5))
```

Works as expected: the expression is evaluated for every row and the result is mapped to two colors (one for FALSE, another for TRUE).

#### 3.5.1 Exercises

##### 1. What happens if you facet on a continuous variable?

```{r}
ggplot(data = mpg) + 
  geom_point(aes(x = cyl, y = hwy)) +
  facet_wrap(~ displ)
```

It treats the continuous variable as categorical: every distinct value of displ gets its own panel, without any binning.
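If fewer panels are wanted, one option is to bin the variable before faceting. A sketch using `cut_width()` from ggplot2; the bin width of 1 and the names `mpg_binned`, `displ_bin`, and `p_binned` are my own assumptions.

```{r}
library(ggplot2)
library(dplyr)

# Bin displ into intervals of width 1, then facet on the bins
mpg_binned <- mpg %>%
  mutate(displ_bin = cut_width(displ, width = 1))

p_binned <- ggplot(data = mpg_binned) +
  geom_point(aes(x = cyl, y = hwy)) +
  facet_wrap(~ displ_bin)
p_binned
```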

##### 2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

```{r}
ggplot(mpg) + 
  geom_point(aes(displ, hwy)) +
  facet_grid(drv ~ cyl)
```

They represent combinations of drv and cyl with no observations, e.g. we don't have data for 4-wheel-drive cars with 5 cylinders.
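A cross-tabulation makes the empty combinations explicit. This is a quick sketch; `drv_by_cyl` is my own name.

```{r}
library(ggplot2)  # provides the mpg dataset

# Count cars for every drv/cyl combination;
# cells with a count of zero correspond to the empty facets
drv_by_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_by_cyl
```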


##### 3. What plots does the following code make? What does . do?

```{r}
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ drv)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~drv)
```