forked from csgillespie/efficientR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-programming.Rmd
814 lines (677 loc) · 32.8 KB
/
03-programming.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
---
knit: "bookdown::preview_chapter"
---
# Efficient programming {#programming}
Many people who use R would not describe themselves as "programmers". Instead they tend to
have advanced domain level knowledge, understand standard R data
structures, such as vectors and data frames, but have little formal training in computing.
Sound familiar? In that case this chapter is for you.
In this
chapter we will discuss "big picture" programming techniques. We cover general concepts and R programming
techniques about code optimisation, before describing idiomatic programming
structures. We conclude the chapter by examining relatively easy ways of speeding up
code using the **compiler** package and parallel processing, using multiple CPUs.
### Prerequisites {-}
In this chapter we introduce two new packages, **compiler** and **memoise**. The **compiler**
package comes with R, so it will already be installed.
```{r}
library("compiler")
library("memoise")
```
We also use the **pryr** and **microbenchmark** packages in the exercises.
## Top 5 tips for efficient programming
1. Be careful never to grow vectors.
1. Vectorise code whenever possible.
1. Use factors when appropriate.
1. Avoid unnecessary computation by caching variables.
1. Byte compile packages for an easy performance boost.
## General advice {#general}
Low level languages like C and Fortran demand more from the programmer. They force you to declare the type of
every variable used, give you the burdensome responsibility of memory management and have to be compiled. The
advantage of such languages, compared with R, is that they are faster to run.
The disadvantage is that they take longer to learn and can not be run interactively.
```{block, type="rmdnote"}
The wikipedia page on compiler optimisations gives a nice overview of standard
optimisation techniques (https://en.wikipedia.org/wiki/Optimizing_compiler).
```
R users don't tend to worry about data types.
This is advantageous in terms of creating concise code, but can result in R programs that are slow. While optimisations such as going
parallel can double speed, poor code can easily run 100's of times slower, so it's important to understand the causes of slow code. These are covered in @Burns2011, which should be considered essential
reading for any aspiring R programmers.
Ultimately calling an R function always ends up calling some underlying C/Fortran
code. For example the base R function `runif()` only contains a single line that
consists of a call to `C_runif()`.
```{r eval=TRUE, results="hide"}
function (n, min = 0, max = 1)
.Call(C_runif, n, min, max)
```
A **golden rule** in R programming is to access the underlying C/Fortran routines as
quickly as possible; the fewer functions calls required to achieve this, the better.
For example, suppose `x` is a standard vector of length `n`. Then
```{r echo=3}
n = 2
x = runif(n)
x = x + 1
```
involves a single function call to the `+` function. Whereas the `for` loop
```{r bad_loop}
for(i in seq_len(n))
x[i] = x[i] + 1
```
has
* `n` function calls to `+`;
* `n` function calls to the `[` function;
* `n` function calls to the `[<-` function (used in the assignment operation);
* Two function calls: one to `for` and another to `seq_len()`.
It isn't that the `for` loop is slow, rather it is because we have many more function
calls. Each individual function call is quick, but the total combination is slow.
```{block, type="rmdnote"}
Everything in R is a function call. When we execute `1 + 1`, we are actually
executing `'+'(1, 1)`.
```
#### Exercise {-}
Use the **microbenchmark** package to compare the vectorised construct `x = x + 1`, to the
`for` loop version. Try varying the size of the input vector.
### Memory allocation
Another general technique is to be careful with memory allocation. If possible
pre-allocate your vector then fill in the values.
```{block, type="rmdtip"}
You should also consider pre-allocating memory for data frames and lists. Never grow
an object. A good rule of thumb is to compare your objects before and after a `for`
loop; have they increased in length?
```
Let's consider three methods of creating a sequence of numbers. __Method 1__ creates
an empty vector and gradually increases (or grows) the length of the vector
```{r echo=TRUE, tidy=FALSE}
method1 = function(n) {
vec = NULL # Or vec = c()
for(i in seq_len(n))
vec = c(vec, i)
vec
}
```
__Method 2__ creates an object of the final length and then changes the values in the
object by subscripting:
```{r echo=TRUE, tidy=FALSE}
method2 = function(n) {
vec = numeric(n)
for(i in seq_len(n))
vec[i] = i
vec
}
```
__Method 3__ directly creates the final object
```{r eval=TRUE, echo=TRUE}
method3 = function(n) seq_len(n)
```
To compare the three methods we use the `microbenchmark()` function from the previous chapter
```{r tidy=FALSE,eval=FALSE}
microbenchmark(times = 100, unit = "s",
method1(n), method2(n), method3(n))
```
The table below shows the timing in seconds on my machine for these three methods for
a selection of values of `n`. The relationships for varying `n` are all roughly linear
on a log-log scale, but the timings between methods are drastically different. Notice
that the timings are no longer trivial. When $n=10^7$, method $1$ takes around an hour
whilst method $2$ takes $2$ seconds and method $3$ is almost instantaneous. Remember
the golden rule; access the underlying C/Fortran code as quickly as possible.
$n$ | Method 1 | Method 2 | Method 3
----|----------|----------|---------
$10^5$ | $\phantom{000}0.21$ | $0.02$ | $0.00$
$10^6$ | $\phantom{00}25.50$ | $0.22$ | $0.00$
$10^7$ | $3827.00$ | $2.21$ | $0.00$
Table: Time in seconds to create sequences. When $n=10^7$, method $1$ takes around an
hour while the other methods take less than $3$ seconds.
### Vectorised code
```{block, type="rmdnote"}
Technically `x = 1` creates a vector of length $1$. In this section, we use _vectorised_
to indicate that functions work with vectors of all lengths.
```
Recall the __golden rule__ in R programming, access the underlying C/Fortran routines
as quickly as possible; the fewer functions calls required to achieve this, the
better. With this mind, many R functions are _vectorised_, that is
the function's inputs and/or outputs naturally work with vectors, reducing the
number of function calls required. For example, the code
```{r, echo=2}
n = 10
x = runif(n) + 1
```
performs two vectorised operations. First `runif()` returns `n` random numbers. Second
we add `1` to each element of the vector. In general it is a good idea to exploit
vectorised functions. Consider this piece of R code that calculates the sum of
$\log(x)$
```{r eval=FALSE, echo=TRUE, tidy=FALSE}
log_sum = 0
for(i in 1:length(x))
log_sum = log_sum + log(x[i])
```
```{block, type="rmdwarning"}
Using `1:length(x)` can lead to hard-to-find bugs when `x` has length zero. Instead
use `seq_along(x)` or `seq_len(length(x))`.
```
This code could easily be vectorised via
```{r eval=TRUE}
log_sum = sum(log(x))
```
Writing code this way has a number of benefits.
* It's faster. When $n = 10^7$ the _R way_ is about forty times faster.
* It's neater.
* It doesn't contain a bug when `x` is of length $0$.
As with the general example in section \@ref(general), the slowdown isn't due to
the `for` loop. Instead, it's because there are many more functions calls.
#### Exercises {-}
1. Time the two methods for calculating the log sum.
1. What happens when the `length(x) = 0`, i.e. we have an empty vector?
#### Example: Monte-Carlo integration {-}
It's also important to make full use of R functions that use vectors. For example,
suppose we wish to estimate the integral
\[
\int_0^1 x^2 dx
\]
using a Monte-Carlo method. Essentially, we throw darts at the curve and count
the number of darts that fall below the curve (as in \@ref(fig:3-1)).
_Monte Carlo Integration_
1. Initialise: `hits = 0`
1. __for i in 1:N__
1. $~~~$ Generate two random numbers, $U_1, U_2$, between 0 and 1
1. $~~~$ If $U_2 < U_1^2$, then `hits = hits + 1`
1. __end for__
1. Area estimate = `hits/N`
Implementing this Monte-Carlo algorithm in R would typically lead to something like:
```{r tidy=FALSE}
monte_carlo = function(N) {
hits = 0
for (i in seq_len(N)) {
u1 = runif(1)
u2 = runif(1)
if (u1 ^ 2 > u2)
hits = hits + 1
}
return(hits / N)
}
```
In R this takes a few seconds
```{r cache=TRUE}
N = 500000
system.time(monte_carlo(N))
```
In contrast a more R-centric approach would be
```{r echo=TRUE}
monte_carlo_vec = function(N) sum(runif(N)^2 > runif(N))/N
```
The `monte_carlo_vec()` function contains (at least) four aspects of vectorisation
* The `runif()` function call is now fully vectorised;
* We raise entire vectors to a power via `^`;
* Comparisons using `>` are vectorised;
* Using `sum()` is quicker than an equivalent for loop.
The function `monte_carlo_vec()` is around $30$ times faster than `monte_carlo()`.
```{r 3-1, fig.cap="Example of Monte-Carlo integration. To estimate the area under the curve throw random points at the graph and count the number of points that lie under the curve.", echo=FALSE,fig.width=6, fig.height=4, fig.align="center", out.width="70%"}
local(source("code/03-programming_f1.R", local=TRUE))
```
### Exercise {-}
Verify that `monte_carlo_vec()` is faster than `monte_carlo()`. How does this relate to
the number of darts, i.e. the size of `N`, that is used
## Communicating with the user
When we create a function we often want the function to give efficient feedback on the
current state. For example, are there missing arguments or has a numerical calculation
failed. There are three main techniques for communicating with the user.
### Fatal errors: `stop()` {-}
Fatal errors are raised by calling the `stop()`, i.e. execution is terminated. When
`stop()` is called, there is no way for a function to continue. For instance, when we
generate random numbers using `rnorm()` the first argument is the sample size,`n`. If
the number of observations to return is less than $1$, an error is raised. When we need
to raise an error, we should do so as quickly as possible; otherwise it's a waste of
resources. Hence, the first few lines of a function typically perform argument checking.
Suppose we call a function that raises an error. What then? Efficient, robust code
_catches_ the error and handles it appropriately. Errors can be caught using `try()`
and `tryCatch()`. For example,
```{r}
# Suppress the error message
good = try(1 + 1, silent = TRUE)
bad = try(1 + "1", silent = TRUE)
```
When we inspect the objects, the variable `good` just contains the number `2`
```{r}
good
```
However, the `bad` object is a character string with class `try-error` and a `condition`
attribute that contains the error message
```{r}
bad
```
We can use this information in a standard conditional statement
```{r eval=FALSE}
if(class(bad) == "try-error")
# Do something
```
Further details on error handling, as well as some excellent advice on general
debugging techniques, are given in [@Wickham2014].
### Warnings: `warning()` {-}
Warnings are generated using the `warning()` function. When a warning is raised, it
indicates potential problems. For example, `mean(NULL)` returns `NA` and also raises a
warning.
When we come across a warning in our code, it is important to solve the problem and
not just ignore the issue. While ignoring warnings saves time in the short-term, warnings
can often mask deeper issues that have crept into our code.
```{block, type="rmdnote"}
Warnings can be hidden using `suppressWarnings()`.
```
### Informative output: `message()` and `cat()` {-}
To give informative output, use the `message()` function. For example, in the
**poweRlaw** package, the `message()` function is used to give the user an estimate of
expected run time. Providing a rough estimate of how long the function takes, allows the user to
optimise their time. Similar to warnings, messages can be suppressed with
`suppressMessages()`.
Another function used for printing messages is `cat()`. In general `cat()` should only be
used in `print()`/`show()` methods, e.g. look at the function definition of the
S3 print method for `difftime` objects, `getS3method("print", "difftime")`.
### Exercises {-}
The `stop()` function has an argument `call.` that indicates if the function call
should be part of the error message. Create a function and experiment with this option.
### Invisible returns
The `invisible()` function allows you to return a temporarily invisible copy of an
object. This is particularly useful for functions that return values which can be
assigned, but are not printed when they are not assigned. For example suppose we have
a function that plots the data and fits a straight line
```{r}
regression_plot = function(x, y, ...) {
# Plot and pass additional arguments to default plot method
plot(x, y, ...)
# Fit regression model
model = lm(y ~ x)
# Add line of best fit to the plot
abline(model)
invisible(model)
}
```
When the function is called, a scatter graph is plotted with the line of best fit, but
the output is invisible. However when we assign the function to an object, i.e.
`out = regression_plot(x, y)` the variable `out` contains the output of the `lm()` call.
Another example is `hist()`. Typically we don't want anything displayed in the console
when we call the function
```{r fig.keep="none", echo=2}
x = rnorm(x)
hist(x)
```
However if we assign the output to an object, `out = hist(x)`, the object `out` is
actually a list containing, _inter alia_, information on the mid-points, breaks and
counts.
## Factors
Factors are much maligned objects. While at times they are awkward, they do have their
uses. A factor is used to store categorical variables. This data type is unique to R
(or at least not common among programming languages). The difference between factors
and strings is important because R treats factors and strings differently. Although
factors look similar to character vectors, they are actually integers. This leads to
initially surprising behaviour
```{r}
x = 4:6
c(x)
c(factor(x))
```
In this case the `c()` function is using the underlying integer representation of the
factor. Dealing with the wrong case of behaviour is a common source of inefficiency for
R users.
Often categorical variables get stored as $1$, $2$, $3$, $4$, and $5$, with associated
documentation elsewhere that explains what each number means. This is clearly a pain.
Alternatively we store the data as a character vector. While this is fine, the
semantics are wrong because it doesn't convey that this is a categorical variable.
It's not sensible to say that you should **always** or **never** use factors, since
factors have both positive and negative features. Instead we need to examine each case
individually.
As a general rule, if your variable has an inherent order, e.g. small vs large, or
you have a fixed set of categories, then you should consider using a factor.
### Inherent order
Factors can be used for ordering in graphics. For instance, suppose we have a data set
where the variable `type`, takes one of three values, `small`, `medium` and `large`.
Clearly there is an ordering. Using a standard `boxplot()` call,
```{r fig.keep="none", echo=6}
set.seed(1)
level = c("Small", "Medium", "Large")
type = rep(level, each=30)
y = rnorm(90)
type_factor = factor(type, levels = level)
boxplot(y ~ type)
```
would create a boxplot where the $x$-axis was alphabetically ordered. By converting
`type` into factor, we can easily specify the correct ordering.
```{r, boxplot_factor, eval=TRUE, fig.keep="none"}
boxplot(y ~ factor(type, levels = c("Small", "Medium", "Large")))
```
```{block, type="rmdwarning"}
Most users interact with factors via the `read.csv()` function where character columns
are automatically converted to factors. This feature can be irritating if our data is
messy and we want to clean and recode variables. Typically when reading in data via
`read.csv()`, we use the `stringsAsFactors = FALSE` argument. Although this argument can
add in the global `options()` list and placed in the `.Rprofile`, this leads to
non-portable code, so should be avoided.
```
### Fixed set of categories
Suppose our data set relates to months of the year
```{r}
m = c("January", "December", "March")
```
If we sort `m` in the usual way, `sort(m)`, we perform standard alpha-numeric
ordering; placing `December` first. This is technically correct, but not that helpful.
We can use factors to remedy this problem by specifying the admissible levels
```{r}
# month.name contains the 12 months
fac_m = factor(m, levels = month.name)
sort(fac_m)
```
#### Exercise {-}
Factors are slightly more space efficient than characters. Create a character vector
and corresponding factor and use `pryr::object_size()` to calculate the space needed for
each object.
```{r echo=FALSE, eval=FALSE}
ch = sample(month.name, 1e6, replace = TRUE)
fac = factor(ch, levels = month.name)
pryr::object_size(ch)
pryr::object_size(fac)
```
## The apply family
The apply functions can be an alternative to writing for loops. The general idea is to apply (or map) a function to
each element of an object. For example, you can apply a function to each row or column of a matrix.
A list of available functions is given in \@ref(tab:apply-family), with a short description. In
general, all the apply functions have similar properties:
* Each function takes at least two arguments: an object and another function. The function is passed as an argument.
* Every apply function has the dots, `...`, argument that is used to pass on arguments to the function that is
given as an argument.
Using apply functions when possible, can lead to more succinct and idiomatic R code. In this section,
we will cover the three main functions, `apply()`, `lapply()`, and `sapply()`. Since the apply functions are covered in
most R textbooks, we just give a brief introduction to the topic and provide pointers to other resources
at the end of this section.
```{block, type="rmdnote"}
Most people rarely use the other apply functions. For example, I have only used `eapply()`
once. Students in my class uploaded R scripts. Using `source()`, I was able to read
in the scripts to a separate environment. I then applied a marking scheme to each
environment using `eapply()`. Using separate environments, avoided object name clashes.
```
```{r apply-family, echo=FALSE}
dd = tibble::frame_data(
~Function, ~Description,
"`apply`", "Apply functions over array margins",
"`by`", "Apply a function to a data frame split by factors",
"`eapply`", "Apply a function over values in an environment",
"`lapply`", "Apply a function over a list or vector",
"`mapply`", "Apply a function to multiple list or vector arguments",
"`rapply`", "Recursively apply a function to a list",
"`tapply`", "Apply a function over a ragged array")
knitr::kable(dd, caption="The apply family of functions from base R.", row.names = FALSE)
```
The `apply()` function is used to apply a function to each row or column of a matrix. In many data science
problems, this is a common task. For example, to calculate the standard deviation of the rows we have
```{r, results="hide"}
data("ex_mat", package="efficient")
# MARGIN=1: corresponds to rows
row_sd = apply(ex_mat, 1, sd)
```
The first argument of `apply()` is the object of interest. The second argument is the `MARGIN`.
This is a vector giving the subscripts which the function (the third argument)
will be applied over. When the object is a matrix, a margin of `1` indicates rows and `2` indicates columns.
So to calculate the column standard deviations, the second argument is changed to `2`
```{r, results="hide"}
col_med = apply(ex_mat, 2, sd)
```
Additional arguments can be passed to the function that is to be applied to the data. For example,
to pass the `na.rm` argument to the `sd` function, we have
```{r}
row_sd = apply(ex_mat, 1, sd, na.rm = TRUE)
```
The `apply()` function also works on higher dimensional arrays; a one dimensional array is a vector,
a two dimensional array is a matrix.
The `lapply()` function is similar to `apply()`; with the key difference being that the input type is a vector or list and the return type is a list. Essentially, we apply a function to each element of a list or vector. The functions `sapply()` and `vapply()` are similar to `lapply()`, but the return type is not necessary a list.
### Example: the movies data set
The internet movie [database](http://imdb.com/) is a website that collects movie data supplied
by studios and fans. It is one of the largest movies databases on the web and is maintained by Amazon.
The **ggplot2movies** package contains about sixty thousand movies stored as a data frame
```{r}
data(movies, package = "ggplot2movies")
```
Movies are rated between $1$ and $10$ by fans. Columns $7$ to $16$ of the `movies` data set
gives the percentage of voters for a particular rating.
```{r}
ratings = movies[, 7:16]
```
For example, 4.5% of voters, rated the first movie a rating of $1$
```{r}
ratings[1, ]
```
We can use the `apply()` function to investigate voting patterns. The function `nnet::which.is.max()`
finds the maximum position in a vector, but breaks ties at random; `which.max()` just returns the first value.
Using `apply()`, we can easily determine the most popular rating for each movie and plot the results
```{r, 3-3, fig.keep="last", echo=1:2, fig.height=4, fig.width=6, fig.cap="Movie voting preferences.", fig.align="center", out.width="70%"}
popular = apply(ratings, 1, nnet::which.is.max)
plot(table(popular))
local(source("code/03-programming_f3.R", local=TRUE))
```
Figure \@(fig:3-3) highlights that voting patterns are clearly not uniform between $1$ and $10$. The
most popular vote is the highest rating, $10$. Clearly if you went to the trouble
of voting for a movie, it was either very good, or very bad (there is also a peak at $1$).
Rating a movie $7$ is also a popular choice (search the web for "most popular number" and $7$ dominates the rankings.)
### Type consistency
When programming it is helpful if the return value from a function always takes the
same form. Unfortunately, not all base R functions follow this idiom. For example
the functions `sapply()` and `[.data.frame()` aren't type consistent
```{r, results="hide"}
two_cols = data.frame(x = 1:5, y = letters[1:5])
zero_cols = data.frame()
sapply(two_cols, class) # a character vector
sapply(zero_cols, class) # a list
two_cols[, 1:2] # a data.frame
two_cols[, 1] # an integer vector
```
This can cause unexpected problems. The functions `lapply()` and `vapply()` are type
consistent. Likewise `dplyr::select()` and `dplyr:filter()`. The **purrr** package has
some type consistent alternatives to base R functions. For example, `map_dbl()` etc. to
replace `Map()` and `flatten_df()` to replace `unlist()`.
#### Other resources {-}
Almost every R book has a section on the apply function. Below, we've given the resources we feel are
most helpful.
* Each function has a number of examples in the associated help page. You can directly
access the examples using the `example()` function, e.g. to run the `apply()` examples, use
`example("apply")`.
* There is a very detailed StackOverflow [answer](http://stackoverflow.com/q/3505701/203420)
which describes when, where and how to use each of the functions.
* In a similar vein, Neil Saunders has a nice blog [post](https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/)
giving an overview of the functions.
* The apply functions are an example of functional programming. Chapter 16 of _R for data Science_ describes
the interplay between loops and functional programming in more detail [@grolemund_r_2016],
while @Wickham2014 gives a more in-depth description of the topic.
#### Exercises {-}
1. Rewrite the `sapply()` function calls above using `vapply()` to ensure type consistency.
1. How would you make subsetting data frames with `[` type consistent? Hint: look at
the `drop` argument.
## Caching variables
A straightforward method for speeding up code is to calculate objects once and reuse
the value when necessary. This could be as simple as replacing `sd(x)` in multiple
function calls with the object `sd_x` that is defined once and reused. For example,
suppose we wish to normalise each column of a matrix. However, instead of using the
standard deviation of each column, we will use the standard deviation of the
entire data set
```{r, echo=-1, results="hide"}
x = matrix(rnorm(100), ncol=10)
apply(x, 2, function(i) mean(i) / sd(x))
```
This is inefficient since the value of `sd(x)` is constant and thus recalculating the standard
deviation for every column is unnecessary. Instead we should evaluate once and
store the result
```{r, results="hide"}
sd_x = sd(x)
apply(x, 2, function(i) mean(i) / sd_x)
```
If we compare the two methods on a $100$ row by $1000$ column matrix, the cached
version is around $100$ times faster (figure \@ref(fig:3-5)).
```{r, 3-5, fig.keep="last", echo=FALSE, fig.height=4, fig.width=6, fig.cap="Performance gains obtained from caching the standard deviation in a $100$ by $1000$ matrix.", fig.align="center", out.width="70%"}
local(source("code/03-programming_f5.R", local=TRUE))
```
A more advanced form of caching is to use the **memoise** package. If a function is
called multiple times with the same input, it may be possible to speed things up by
keeping a cache of known answers that it can retrieve. The **memoise** package allows
us to easily store the value of function call and returns the cached result when the
function is called again with the same arguments. This package trades off memory
versus speed, since the memoised function stores all previous inputs and outputs. To
cache a function, we simply pass the function to the **memoise** function.
The classic memoise example is the factorial function. Another example is to limit use
to a web resource. For example, suppose we are developing a shiny (an interactive
graphic) application where the user can fit a regression line to data. The user can
remove points and refit the line. An example function would be
```{r}
# Argument indicates row to remove
plot_mpg = function(row_to_remove) {
data(mpg, package = "ggplot2")
mpg = mpg[-row_to_remove, ]
plot(mpg$cty, mpg$hwy)
lines(lowess(mpg$cty, mpg$hwy), col = 2)
}
```
We can use **memoise** speed up by caching results. A quick benchmark
```{r benchmark_memoise, fig.keep="none", cache=TRUE, results="hide"}
m_plot_mpg = memoise(plot_mpg)
microbenchmark(times = 10, unit = "ms", m_plot_mpg(10), plot_mpg(10))
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> m_plot_mpg(10) 0.04 4e-02 0.07 8e-02 8e-02 0.1 10 a
#> plot_mpg(10) 40.20 1e+02 95.52 1e+02 1e+02 107.1 10 b
```
suggests that we can obtain a $100$-fold speed-up.
#### Exercise {-}
Construct a box plot of timings for the standard plotting function and the memoised
version.
### Function closures
```{block, type="rmdwarning"}
The following section is meant to provide an introduction to function closures with
example use cases. See [@Wickham2014] for a detailed introduction.
```
More advanced caching is available using _function closures_. A closure in R is an
object that contains functions bound to the environment the closure was created in.
Technically all functions in R have this property, but we use the term function
closure to denote functions where the environment is not in `.GlobalEnv`. One of the
environments associated with a function is known as the enclosing environment, that
is, where the function was created. This allows us to store values between function calls.
Suppose we want to create a stop-watch type function. This is easily achieved with a function
closure
```{r}
# <<- assigns values to the parent environment
stop_watch = function() {
start_time = stop_time = NULL
start = function() start_time <<- Sys.time()
stop = function() {
stop_time <<- Sys.time()
difftime(stop_time, start_time)
}
list(start = start, stop = stop)
}
watch = stop_watch()
```
The object `watch` is a list, that contains two functions. One function for starting
the timer
```{r}
watch$start()
```
the other for stopping the timer
```{r results="hide"}
watch$stop()
```
Without using function closures, the stop-watch function would be longer, more complex and
therefore more inefficient. When used properly function closures are very useful
programming tools for writing concise code.
#### Exercise {-}
1. Write a stop-watch function __without__ using function closures.
1. Many stop-watches have the ability to measure not only your overall time but also your
individual laps. Add a `lap()` function to the `stop_watch()` function that will record
individual times, while still keeping track of the overall time.
```{block, type="rmdnote"}
A related idea to function closures, is non-standard evaluation (NSE), or programming on the language.
NSE crops up all the time in R. For example, when we execute, `plot(height, weight)`
R automatically labels the x- and y-axis of the plot with `height` and
`weight`. This is powerful concept that enables us to simplify code. More detail
is given in the "Non-standard evaluation" of [@Wickham2014].
```
## The byte compiler
The **compiler** package, written by R Core member Luke Tierney has been part of R
since version 2.13.0. The **compiler** package allows R functions to be compiled,
resulting in a byte code version that may run faster^[The authors have yet to find a
situation where byte compiled code runs significantly slower.]. The compilation
process eliminates a number of costly operations the interpreter has to perform, such
as variable lookup.
Since R 2.14.0, all of the standard functions and packages in base R are pre-compiled
into byte-code. This is illustrated by the base function `mean()`:
```{r}
getFunction("mean")
```
The third line contains the `bytecode` of the function. This means that the
**compiler** package has translated the R function into another language that can be
interpreted by a very fast interpreter. Amazingly the **compiler** package is almost
entirely pure R, with just a few C support routines.
### Example: the mean function
The **compiler** package comes with R, so we just need to load the package in the
usual way
```{r}
library("compiler")
```
Next we create an inefficient function for calculating the mean. This function takes
in a vector, calculates the length and then updates the `m` variable.
```{r}
mean_r = function(x) {
m = 0
n = length(x)
for(i in seq_len(n))
m = m + x[i] / n
m
}
```
This is clearly a bad function and we should just use the `mean()` function, but it's a useful
comparison. Compiling the function is straightforward
```{r}
cmp_mean_r = cmpfun(mean_r)
```
Then we use the `microbenchmark()` function to compare the three variants
```{r results="hide", eval=FALSE}
# Generate some data
x = rnorm(1000)
microbenchmark(times = 10, unit = "ms", # milliseconds
mean_r(x), cmp_mean_r(x), mean(x))
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mean_r(x) 0.358 0.361 0.370 0.363 0.367 0.43 10 c
#> cmp_mean_r(x) 0.050 0.051 0.052 0.051 0.051 0.07 10 b
#> mean(x) 0.005 0.005 0.008 0.007 0.008 0.03 10 a
```
The compiled function is around seven times faster than the uncompiled function. Of
course the native `mean()` function is faster, but compiling does make a significant
difference (figure \@ref(fig:3-4)).
```{r 3-4, echo=FALSE, fig.height=4, fig.width=6, fig.cap="Comparison of mean functions.", eval=TRUE, fig.align="center", out.width="70%"}
local(source("code/03-programming_f4.R", local=TRUE))
```
### Compiling code
There are a number of ways to compile code. The easiest is to compile individual
functions using `cmpfun()`, but this obviously doesn't scale. If you create a package,
you can automatically compile the package on installation by adding
```
ByteCompile: true
```
to the `DESCRIPTION` file. Most R packages installed using `install.packages()` are not
compiled. We can enable (or force) packages to be compiled by starting R with the
environment variable `R_COMPILE_PKGS` set to a positive integer value and specify
that we install the package from `source`, i.e.
```{r eval=FALSE}
## Windows users will need Rtools
install.packages("ggplot2", type = "source")
```
Or if we want to avoid altering the `.Renviron` file, we can specify an additional
argument
```{r eval=FALSE}
install.packages("ggplot2", type = "source", INSTALL_opts = "--byte-compile")
```
A final option is to use just-in-time (JIT) compilation. The `enableJIT()` function
disables JIT compilation if the argument is `0`. Arguments `1`, `2`, or `3` implement
different levels of optimisation. JIT can also be enabled by setting the environment
variable `R_ENABLE_JIT`, to one of these values.
```{block, type="rmdtip"}
We recommend setting the compile level to the maximum value of 3.
```
The impact of compiling on install will vary from package to package: for packages
that already have lots of pre-compiled code speed gains will be small [@team2016installation].
```{block type="rmdwarning"}
Not all packages work if compiled on installation.
```