forked from csgillespie/efficientR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05-input-output.Rmd
379 lines (270 loc) · 30.5 KB
/
05-input-output.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
---
knit: "bookdown::preview_chapter"
---
```{r, echo=FALSE}
rm(list=ls())
```
# Efficient input/output {#input-output}
This chapter explains how to efficiently read and write data in R. Input/output (I/O) is the technical term for reading and writing data: the process of getting information into a particular computer system (in this case R) and then exporting it to the 'outside world' again (in this case as a file format that other software can read). Data I/O will be needed on projects where data comes from, or goes to, external sources. However, the majority of R resources and documentation start with the optimistic assumption that your data has already been loaded, ignoring the fact that importing datasets into R, and exporting them to the world outside the R ecosystem, can be a time-consuming and frustrating process. Tricky, slow or ultimately unsuccessful data I/O can cripple efficiency right at the outset of a project. Conversely, reading and writing your data efficiently will make your R projects more likely to succeed in the outside world.
The first section introduces **rio**, a 'meta package' for efficiently reading and writing data in a range of file formats. **rio** requires only two intuitive functions for data I/O, making it efficient to learn and use. Next we explore in more detail efficient functions for reading in files stored in common *plain text* file formats from the **readr** and **data.table** packages. Binary formats, which can dramatically reduce file sizes and read/write times, are covered next.
With the accelerating digital revolution and growth in open data, an increasing proportion of the world's data can be downloaded from the internet. This trend is set to continue, making section \@ref(download), on downloading and importing data from the web, important for 'future-proofing' your I/O skills. The benchmarks in this chapter demonstrate that choice of file format and packages for data I/O can have a huge impact on computational efficiency.
Before reading in a single line of data, it is worth considering a general principle for reproducible data management: never modify raw data files. Raw data should be seen as read-only, and contain information about its provenance. Keeping the original file name and commenting on its origin are a couple of ways to improve reproducibility, even when the data are not publicly available.
### Prerequisites {-}
R can read data from a variety of sources. We begin by discussing the generic package **rio** that
handles a wide variety of data types. Special attention is paid to CSV files, which leads to
the **readr** and **data.table** packages. The relatively new package **feather** is
introduced as a binary file format, that has cross-language support.
```{r eval=FALSE}
library("rio")
library("readr")
library("data.table")
library("feather")
```
We also use the **WDI** package to illustrate accessing online data sets
```{r eval=FALSE}
library("WDI")
```
## Top 5 tips for efficient data I/O
1. If possible, keep the names of local files downloaded from the internet or copied onto your computer unchanged. This will help you trace the provenance of the data in the future.
1. R's native file format is `.Rds`. These files can be imported and exported using `readRDS()` and `saveRDS()` for fast and space efficient data storage.
1. Use `import()` from the **rio** package to efficiently import data from a wide range of formats, avoiding the hassle of loading format-specific libraries.
1. Use the **readr** or **data.table** equivalents of `read.table()` to efficiently import large text files.
1. Use `file.size()` and `object.size()` to keep track of the size of files and R objects and take action if they get too big.
## Versatile data import with rio
**rio** is a 'A Swiss-Army Knife for Data I/O'. **rio** provides easy-to-use and computationally efficient functions for importing and exporting tabular data in a range of file formats. As stated in the package's [vignette](https://cran.r-project.org/web/packages/rio/vignettes/rio.html), **rio** aims to "simplify the process of importing data into R and exporting data from R." The vignette goes on to explain how many of the functions for data I/O described in R's [Data Import/Export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) are out of date (for example referring to **WriteXLS** but not the more recent **readxl** package) and difficult to learn.
This is why **rio** is covered at the outset of this chapter: if you just want to get data into R, with a minimum of time learning new functions, there is a fair chance that **rio** can help, for many common file formats. At the time of writing, these include `.csv`, `.feather`, `.json`, `.dta`, `.xls`, `.xlsx` and Google Sheets (see the package's [github page](https://github.com/leeper/rio) for up-to-date information). Below we illustrate the key **rio** functions of `import()` and `export()`:
```{r, eval=FALSE}
library("rio")
# Specify a file
fname = system.file("extdata/voc_voyages.tsv", package = "efficient")
# Import the file (uses the fread function from data.table)
voyages = import(fname)
# Export the file as an Excel spreadsheet
export(voyages, "voc_voyages.xlsx")
```
There was no need to specify the optional `format` argument for data import and export functions because this is inferred by the *suffix*, in the above example `.tsv` and `.xlsx` respectively. You can override the inferred file format for both functions with the `format` argument. You could, for example, create a comma-delimited file called `voc_voyages.xlsx` with `export(voyages, "voc_voyages.xlsx", format = "csv")`. However, this would **not** be a good idea: it is important to ensure that a file's suffix matches its format.
To provide another example, the code chunk below downloads and imports as a data frame information about the countries of the world stored in `.json` (downloading data from the internet is covered in more detail in Section \@ref(download)):
```{r, eval=FALSE}
capitals = import("https://github.com/mledoze/countries/raw/master/countries.json")
```
```{block, json, type='rmdtip'}
The ability to import and use `.json` data is becoming increasingly common as it a standard output format for many APIs. The **jsonlite** and **geojsonio** packages have been developed to make this as easy as possible.
```
### Exercises {-}
1. The final line in the code chunk above shows a neat feature of **rio** and some other packages: the output format is determined by the suffix of the file-name, which make for concise code. Try opening the `voc_voyages.xlsx` file with an editor such as LibreOffice Calc or Microsoft Excel to ensure that the export worked, before removing this rather inefficient file format from your system:
```{r, results="hide", warning=FALSE}
file.remove("voc_voyages.xlsx")
```
2. Try saving the `voyages` data frames into 3 other file formats of your choosing (see `vignette("rio")` for supported formats). Try opening these in external programs. Which file formats are more portable?
3. As a bonus exercise, create a simple benchmark to compare the write times for the different file formats used to complete the previous exercise. Which is fastest? Which is the most space efficient?
## Plain text formats {#fread}
'Plain text' data files are encoded in a format (typically UTF-8) that can be read by humans and computers alike. The great thing about plain text is their simplicity and their ease of use: any programming language can read a plain text file. The most common plain text format is `.csv`, comma-separated values, in which columns are separated by commas and rows are separated by line breaks. This is illustrated in the simple example below:
```
Person, Nationality, Country of Birth
Robin, British, England
Colin, British, Scotland
```
There is often more than one way to read data into R and `.csv` files are no exception. The method you choose has implications for computational efficiency. This section investigates methods for getting plain text files into R, with a focus on three approaches: base R's plain text reading functions such as `read.csv()`; the **data.table** approach, which uses the function `fread()`; and the newer **readr** package which provides `read_csv()` and other `read_*()` functions such as `read_tsv()`. Although these functions perform differently, they are largely cross-compatible, as illustrated in the below chunk, which loads data on the concentration of CO^2^ in the atmosphere over time:
```{block, type="rmdwarning"}
In general, you should never "hand-write" a CSV file. Instead, you should use `write.csv()` or an
equivalent function. The Internet Engineering Task Force has the [CSV definition](https://www.ietf.org/rfc/rfc4180.txt)
that facilities sharing CSV files between tools and operating systems.
```
```{r, echo=-1:-2}
library("readr") # for read_ functions
library("data.table") # for fread function
df_co2 = read.csv("extdata/co2.csv")
df_co2_dt = readr::read_csv("extdata/co2.csv")
df_co2_readr = data.table::fread("extdata/co2.csv")
```
```{block, read-table-csv, type='rmdnote'}
Note that a function 'derived from' another in this context means that it calls another function. The functions such as `read.csv()` and `read.delim()` in fact are *wrappers* around `read.table()`. This can be seen in the source code of `read.csv()`, for example, which shows that the function is roughly the equivalent of `read.table(file, header = TRUE, sep = ",")`.
```
Although this section is focussed on reading text files, it demonstrates the wider principle that the speed and flexibility advantages of additional read functions can be offset by the disadvantage of additional package dependencies (in terms of complexity and maintaining the code) for small datasets. The real benefits kick in on large datasets. Of course, there are some data types that *require* a certain package to load in R: the **readstata13** package, for example, was developed solely to read in `.dta` files generated by versions of Stata 13 and above.
Figure \@ref(fig:5-1) demonstrates that the relative performance gains of the **data.table** and **readr** approaches increase with data size, especially for data with many rows. Below around $1$ MB `read.csv()` is actually *faster* than `read_csv()` while `fread` is much faster than both, although these savings are likely to be inconsequential for such small datasets.
For files beyond $100$ MB in size `fread()` and `read_csv()` can be expected to be around *5 times faster* than `read.csv()`. This efficiency gain may be inconsequential for a one-off file of $100$ MB running on a fast computer (which still takes less than a minute with `read.csv()`), but could represent an important speed-up if you frequently load large text files.
(ref:5-1) Benchmarks of base, data.table and readr approches for reading csv files, using the functions read.csv(), fread() and read_csv(), respectively. The facets ranging from $2$ to $200$ represent the number of columns in the csv file.
```{r 5-1, fig.cap="(ref:5-1)", echo=FALSE, warning=FALSE, message=FALSE, out.width="90%", fig.align="center"}
local(source("code/05-io_f1.R", local=TRUE))
```
When tested on a large ($4$GB) `.csv` file it was found that `fread()` and `read_csv()` were almost identical in load
times and that `read.csv()` took around $5$ times longer. This consumed more than $10$GB of RAM, making it unsuitable to
run on many computers (see Section \@ref(ram) for more on memory). Note that both **readr** and base methods can be
made significantly faster by pre-specifying the column types at the outset (see below). Further details are provided by the help in `?read.table`.
```{r eval=FALSE}
read.csv(file_name, colClasses = c("numeric", "numeric"))
```
In some cases with R programming there is a trade-off between speed and robustness. This is illustrated below with reference to differences in how the **readr**, **data.table** and base R approaches handle unexpected values. Figure \@ref(fig:5-1) highlights the benefit of switching to `fread()` and (eventually) to `read_csv()` as the dataset size increases. For a small ($1$MB) dataset:
`fread()` is around $5$ times faster than base R.
### Differences between `fread()` and `read_csv()`
The file `voc_voyages` was taken from a dataset on Dutch naval expeditions used with permission from the CWI Database Architectures Group. The data is described more fully at [monetdb.org](https://www.monetdb.org/Documentation/UserGuide/MonetDB-R). From this dataset we primarily use the 'voyages' table which lists Dutch shipping expeditions by their date of departure.
```{r, cache=TRUE}
fname = system.file("extdata/voc_voyages.tsv", package = "efficient")
voyages_base = read.delim(fname)
```
When we run the equivalent operation using **readr**,
```{r, cache=TRUE}
voyages_readr = readr::read_tsv(fname)
```
a warning is raised regarding row 2841 in the `built` variable. This is because `read_*()` decides what class each variable is based on the first $1000$ rows, rather than all rows, as base `read.*()` functions do. Printing the offending element
```{r}
voyages_base$built[2841] # a factor.
voyages_readr$built[2841] # an NA: text cannot be converted to numeric
```
Reading the file using **data.table**
```{r, warning=FALSE}
# Verbose warnings not shown
voyages_dt = data.table::fread(fname)
```
generates 5 warning messages stating that columns 2, 4, 9, 10 and 11 were `Bumped to type character on data row ...`, with the offending rows printed in place of `...`. Instead of changing the offending values to `NA`, as **readr** does for the `built` column (9), `fread()` automatically converts any columns it thought of as numeric into characters. An additional feature of `fread` is that it can read-in a selection of the columns, either by their index or name, using the `select` argument. This is illustrated below by reading in only half of the columns (the first 11) from the voyages dataset and comparing the result with `fread()`'ing all the columns in.
```{r, warning=FALSE}
microbenchmark(times = 5,
with_select = data.table::fread(fname, select = 1:11),
without_select = data.table::fread(fname)
)
```
To summarise, the differences between base, **readr** and **data.table** functions for reading in data go beyond code execution times. The functions `read_csv()` and `fread()` boost speed partially at the expense of robustness because they decide column classes based on a small sample of available data. The similarities and differences between the approaches are summarised for the Dutch shipping data in Table \@ref(tab:colclasses).
```{r colclasses, echo=FALSE}
vcols = as.data.frame(rbind(
sapply(voyages_base, class),
sapply(voyages_readr, class),
sapply(voyages_dt, class)))
vcols = dplyr::select(vcols, number, boatname, built, departure_date)
vcols$Function = c("base", "readr", "data.table")
knitr::kable(vcols, caption = "Comparison of base, **readr** and **data.table** reading in the voyages data set.")
```
Table \@ref(tab:colclasses) shows 4 main similarities and differences between the three read types of read function:
- For uniform data such as the 'number' variable in Table \@ref(tab:colclasses), all reading methods yield the same result (integer in this case).
- For columns that are obviously characters such as 'boatname', the base method results in factors (unless `stringsAsFactors` is set to `FALSE`) whereas `fread()` and `read_csv()` functions return characters.
- For columns in which the first 1000 rows are of one type but which contain anomalies, such as 'built' and 'departure_data' in the shipping example, `fread()` coerces the result to characters.
`read_csv()` and siblings, by contrast, keep the class that is correct for the first 1000 rows and sets the anomalous records to `NA`. This is illustrated in \@ref(tab:colclasses), where `read_tsv()` produces a `numeric` class for the 'built' variable, ignoring the non-numeric text in row 2841.
- `read_*()` functions generate objects of class `tbl_df`, an extension of the `data.frame` class, as discussed in Section \@ref(dplyr). `fread()` generates objects of class `data.table()`. These can be used as standard data frames but differ subtly in their behaviour.
An additional difference is that `read_csv()` creates data frames of class `tbl_df`, *and* the `data.frame`. This makes no practical difference, unless the **tibble** package is loaded, as described in section \@ref(efficient-data-frames-with-tibble) in the next chapter.
The wider point associated with these tests is that functions that save time can also lead to additional considerations or complexities for your workflow. Taking a look at what is going on 'under the hood' of fast functions to increase speed, as we have done in this section, can help understand the knock-on consequences of choosing fast functions over slower functions from base R.
```{r, eval=FALSE, tidy=FALSE, echo=FALSE}
# # This was a final sentence from the previous paragraph that I've removed for now:
# In some cases there will be no knock-on consequences of using faster functions provided by packages but you should be aware that it is a possibility.
# I've removed this for now as it's such a large and unwieldy dataset
url = "http://download.cms.gov/nppes/NPPES_Data_Dissemination_Aug_2015.zip"
download.file(url, "largefile.zip") # takes many minutes
unzip("largefile.zip", exdir = "data") # many minutes
bigfile = "npidata_20050523-20150809.csv"
file.info(bigfile) # file info (4 GB+)
# split -b1000m npidata_20050523-20150809.csv # original command commented out
```
### Preprocessing text outside R
There are circumstances when datasets become too large to read directly into R.
Reading in a $4$ GB text file using the functions tested above, for example, consumes all available RAM on a $16$ GB machine. To overcome this limitation, external *stream processing* tools can be used to preprocess large text files.
The following command, using the Linux command line 'shell' (or Windows based Linux shell emulator [Cygwin](https://cygwin.com/install.html)) command `split`, for example, will break a large multi GB file into many chunks, each of which is more manageable for R:
```{r, engine='bash', eval=FALSE}
split -b100m bigfile.csv
```
The result is a series of files, set to 100 MB each with the `-b100m` argument in the above code. By default these will be called `xaa`, `xab` and can be read in *one chunk at a time* (e.g. using `read.csv()`, `fread()` or `read_csv()`, described in the previous section) without crashing most modern computers.
Splitting a large file into individual chunks may allow it to be read into R.
This is not an efficient way to import large datasets, however, because it results in a non-random sample of the data this way.
A more efficient, robust and scalable way to work with large datasets is via databases, covered in Section \@ref(working-with-databases) in the next chapter.
## Binary file formats
There are limitations to plain text files. Even the trusty CSV format is "restricted to tabular data, lacks type-safety, and has limited precision for numeric values" [@JSSv071i02].
Once you have read-in the raw data (e.g. from a plain text file) and tidied it (covered in the next chapter), it is common to want to save it for future use. Saving it after tidying is recommended, to reduce the chance of having to run all the data cleaning code again. We recommend saving tidied versions of large datasets in one of the binary formats covered in this section: this will decrease read/write times and file sizes, making your data more
portable.^[Geographical data, for example, can be slow to read in external formats. A large `.shp` or `.geojson` file can take more than $100$ times longer to load than an equivalent `.Rds` or `.Rdata` file.]
Unlike plain text files data stored in binary formats cannot be read by humans. This allows space-efficient data compression but means that the files will be less language agnostic. R's native file format, `.Rds`, for example may be difficult to read and write using external programs such as Python or LibreOffice Calc. This section provides an overview of binary file formats in R, with benchmarks to show how they compare with the plain text format `.csv` covered in the previous section.
### Native binary formats: Rdata or Rds?
`.Rds` and `.RData` are R's native binary file formats. These formats are optimised for speed and compression ratios. But what is the difference between them? The following code chunk demonstrates the key difference between them:
```{r}
save(df_co2, file = "extdata/co2.RData")
saveRDS(df_co2, "extdata/co2.Rds")
load("extdata/co2.RData")
df_co2_rds = readRDS("extdata/co2.Rds")
identical(df_co2, df_co2_rds)
```
The first method is the most widely used. It uses the `save()` function which takes any number of R objects and writes them to a file, which must be specified by the `file =` argument. `save()` is like `save.image()`, which saves *all* the objects currently loaded in R.
The second method is slightly less used but we recommend it. Apart from being slightly more concise for saving single R objects, the `readRDS()` function is more flexible: as shown in the subsequent line, the resulting object can be assigned to any name. In this case we called it `df_co2_rds` (which we show to be identical to `df_co2`, loaded with the `load()` command) but we could have called it anything or simply printed it to the console.
Using `saveRDS()` is good practice because it forces you to specify object names. If you use `save()` without care, you could forget the names of the objects you saved and accidentally overwrite objects that already existed.
### The feather file format
Feather was developed as a collaboration between R and Python developers to create a fast, light and language agnostic format for storing data frames. The code chunk below shows how it can be used to save and then re-load the `df_co2` dataset loaded previously in both R and Python:
```{r, eval=FALSE}
library("feather")
write_feather(df_co2, "extdata/co2.feather")
df_co2_feather = read_feather("extdata/co2.feather")
```
```{r, engine='python', eval=FALSE}
import feather
path = 'data/co2.feather'
df_co2_feather = feather.read_dataframe(path)
```
### Benchmarking binary file formats
We know that binary formats are advantageous from space and read/write time perspectives, but how much so? The benchmarks in this section, based on large matrices containing random numbers, are designed to help answer this question. Figure \@ref(fig:5-2) shows the *relative* efficiency gains of the feather and Rds formats, compared with base CSV. From left to right, figure \@ref(fig:5-2) shows benefits in terms of file size, read times, and write times.
In terms of file size, Rds files perform the best, occupying just over a quarter of the hard disc space compared with the equivalent CSV files. The equivalent feather format also outperformed the CSV format, occupying around half the disc space.
(ref:5-2) Comparison of the performance of binary formats for reading and writing datasets with 20 column with the plain text format CSV. The functions used to read the files were read.csv(), readRDS() and feather::read_feather() respectively. The functions used to write the files were write.csv(), saveRDS() and feather::write_feather().
```{r 5-2, echo=FALSE, fig.cap="(ref:5-2)", out.width="90%", fig.align="center", message=FALSE}
local(source("code/05-io_f2.R", local=TRUE))
```
The results of this simple disk usage benchmark show that saving data in a compressed binary format can save space and if your data will be shared on-line, reduce data download time and bandwidth usage. But how does each method compare from a computational efficiency perspective? The read and write times for each file format are illustrated in the middle and right hand panels of \@ref(fig:5-2).
The results show that file size is not a reliable predictor of data read and write times. This is due to the computational overheads of compression. Although feather files occupied more disc space, they were roughly equivalent in terms of read times: the functions `read_feather()` and `readRDS()` were consistently around 10 times faster than `read.csv()`. In terms of write times, feather excels: `write_feather()` was around 10 times faster than `write.csv()`, whereas `saveRDS()` was only around 1.2 times faster.
```{block, content-dependent-performance, type='rmdnote'}
Note that the performance of different file formats depends on the content of the data being saved. The benchmarks above showed savings for matrices of random numbers. For real life data, the results would be quite different. The `voyages` dataset, saved as an Rds file, occupied less than a quarter the disc space as the original TSV file, whereas the file size was *larger* than the original when saved as a feather file!
```
```{r size-voyages, eval=FALSE, echo=FALSE}
saveRDS(voyages_readr, "voyages.Rds")
feather::write_feather(voyages_readr, "voyages.feather")
size_csv = file.size("extdata/voc_voyages.tsv")
file.size("voyages.Rds") / size_csv
file.size("voyages.feather") / size_csv
vfiles = list.files(pattern = "voyages")
file.remove(vfiles)
```
### Protocol Buffers
Google's [Protocol Buffers](https://developers.google.com/protocol-buffers/) offers a portable, efficient and scalable solution to binary data storage. A recent package, **RProtoBuf**, provides an R interface. This approach is not covered in this book, as it is new, advanced and not (at the time of writing) widely used in the R community. The approach is discussed in detail in a [paper](https://www.jstatsoft.org/article/view/v071i02) on the subject, which also provides an excellent overview of the issues associated with different file formats [@JSSv071i02].
## Getting data from the internet {#download}
The code chunk below shows how the functions
`download.file()` and `unzip()` can be used to download and unzip a dataset from the internet.
(Since R 3.2.3 the base function `download.file()` can be used to download from secure (`https://`) connections on any operating system.)
R can automate processes that are often performed manually, e.g. through the graphical user interface of a web browser, with potential advantages for reproducibility and programmer efficiency. The result is data stored neatly in the `data` directory ready to be imported. Note we deliberately kept the file name intact, enhancing understanding of the data's *provenance* so future users can quickly find out where the data came from. Note also that part of the dataset is stored in the **efficient** package. Using R for basic file management can help create a reproducible workflow, as illustrated below.
```{r, eval=FALSE}
url = "https://www.monetdb.org/sites/default/files/voc_tsvs.zip"
download.file(url, "voc_tsvs.zip") # download file
unzip("voc_tsvs.zip", exdir = "data") # unzip files
file.remove("voc_tsvs.zip") # tidy up by removing the zip file
```
This workflow applies equally to downloading and loading single files. Note that one could make the code more concise by replacing the second line with `df = read.csv(url)`. However, we recommend downloading the file to disk so that if for some reason it fails (e.g. if you would like to skip the first few lines), you don't have to keep downloading the file over and over again. The code below downloads and loads data on atmospheric concentrations of CO^2^. Note that this dataset is also available from the **datasets** package.
```{r, eval=FALSE}
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv"
download.file(url, "extdata/co2.csv")
df_co2 = read_csv("extdata/co2.csv")
```
There are now many R packages to assist with the download and import of data. The organisation [rOpenSci](https://ropensci.org/) supports a number of these.
The example below illustrates this using the WDI package (not supported by rOpenSci) to access World Bank data on CO2 emissions in the transport sector:
```{r, eval=FALSE}
library("WDI")
WDIsearch("CO2") # search for data on a topic
co2_transport = WDI(indicator = "EN.CO2.TRAN.ZS") # import data
```
There will be situations where you cannot download the data directly or when the data cannot be made available. In this case, simply providing a comment relating to the data's origin (e.g. `# Downloaded from http://example.com`) before referring to the dataset can greatly improve the utility of the code to yourself and others.
There are a number of R packages that provide more advanced functionality than simply downloading files.
The CRAN task view on [Web technologies](https://cran.r-project.org/web/views/WebTechnologies.html) provides
a comprehensive list. The two packages for interacting with web pages are **httr** and **RCurl**. The former package
provides (a relatively) user-friendly interface for executing standard HTTP methods, such as `GET` and `POST`.
It also provides support for web authentication protocols and returns HTTP status codes that are essential for debugging.
The **RCurl** package focuses on lower-level support and is particularly useful for web-based XML support or FTP operations.
```{r, eval=FALSE, echo=FALSE}
# Not shown as a distraction (RL)
# download data from nasa, as described here:
# http://disc.sci.gsfc.nasa.gov/recipes/?q=recipes/How-to-Read-Data-in-netCDF-Format-with-R
library("raster") # requires the ncdf4 package to be installed
r = raster("extdata/nc_3B42.20060101.03.7A.HDF.Z.ncml.nc")
```
## Accessing data stored in packages
Most well documented packages provide some example data for you to play with. This can help demonstrate use cases in specific domains, that uses a particular data format. The command `data(package = "package_name")` will show the datasets in a package. Datasets provided by **dplyr**, for example, can be viewed with `data(package = "dplyr")`.
Raw data (i.e. data which has not been converted into R's native `.Rds` format) is usually located within the sub-folder `extdata` in R (which corresponds to `inst/extdata` when developing packages. The function `system.file()` outputs file paths associated with specific packages. To see all the external files within the **readr** package, for example, one could use the following command:
```{r}
list.files(system.file("extdata", package = "readr"))
```
Further, to 'look around' to see what files are stored in a particular package, one could type the following, taking advantage of RStudio's intellisense file completion capabilities (using copy and paste to enter the file path):
```{r, results="hide"}
system.file(package = "readr")
#> [1] "/home/robin/R/x86_64-pc-linux-gnu-library/3.3/readr"
```
Hitting `Tab` after the second command should trigger RStudio to create a miniature pop-up box listing the files within the folder, as illustrated in figure \@ref(fig:5-3).
```{r 5-3, echo=FALSE, fig.cap="Discovering files in R packages using RStudio's 'intellisense'.", out.width="50%", fig.align='center'}
knitr::include_graphics("figures/f5_3_rstudio-package-filepath-intellisense.png")
```