<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: R for reproducible scientific analysis</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">R for reproducible scientific analysis</h1></a>
<h2 class="subtitle">Creating functions</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Be able to define a function that takes arguments.</li>
<li>Be able to return a value from a function.</li>
<li>Be able to write conditional statements with <code>if</code> and <code>else</code>.</li>
<li>Know when and how to set default values for function arguments.</li>
<li>Understand why we should divide programs into small, single-purpose functions.</li>
</ul>
</div>
</section>
<p>There are many reasons to make it easy to rerun our analyses. The gapminder data is updated periodically, and we may want to pull in that new information later and re-run our analysis. We may also obtain similar data from a different source in the future.</p>
<p>In this lesson, we’ll learn how to write a function so that we can repeat a set of operations with a single command. Once we have a function that is known to work, we can use it repeatedly without worrying about how it works, just as we have used functions like <code>min</code> and <code>max</code>.</p>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="what-is-a-function"><span class="glyphicon glyphicon-pushpin"></span>What is a function?</h2>
</div>
<div class="panel-body">
<p>Functions gather a sequence of operations into a whole, preserving it for ongoing use. Functions provide:</p>
<ul>
<li>a name we can remember and invoke it by</li>
<li>relief from the need to remember the individual operations</li>
<li>a defined set of inputs and expected outputs</li>
<li>rich connections to the larger programming environment</li>
</ul>
<p>As the basic building block of most programming languages, user-defined functions constitute “programming” as much as any single abstraction can. If you have written a function, you are a computer programmer.</p>
</div>
</aside>
<h2 id="defining-a-function">Defining a function</h2>
<p>We define a function by assigning the output of <code>function</code> to a variable. The list of argument names is contained within parentheses. Next, the <a href="reference.html#function-body">body</a> of the function (the statements that are executed when it runs) is contained within curly braces (<code>{}</code>). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.</p>
<p>When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a <a href="reference.html#return-statement">return statement</a> to send a result back to whoever asked for it.</p>
<p>Calling our own function is no different from calling any other function.</p>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip"><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>In R, the return statement is not required: a function automatically returns the value of the last expression evaluated in its body. Since we are just learning, we will explicitly define the return statement.</p>
</div>
</aside>
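<p>As a minimal sketch of the tip above, the two functions below should behave identically. The names <code>double_explicit</code> and <code>double_implicit</code> are made up for this illustration and are not part of the lesson’s code.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper functions, used only for illustration.

# Explicit return: the result is handed back with return().
double_explicit <- function(x) {
  result <- x * 2
  return(result)
}

# Implicit return: the value of the last evaluated expression is returned.
double_implicit <- function(x) {
  x * 2
}

double_explicit(4)
double_implicit(4)</code></pre></div>
<p>Both calls give the same value. The explicit form simply makes it easier to see what the function hands back.</p>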
<p>Previously we calculated the GDP by multiplying the population and GDP per capita. Rather than writing out this calculation for every dataset we want GDP for, let’s turn it into a function.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Takes a dataset and multiplies the pop column with the gdpPercap column.</span>
calcGDP <-<span class="st"> </span>function(dat) {
  gdp <-<span class="st"> </span>dat$gdpPercap *<span class="st"> </span>dat$pop
  <span class="kw">return</span>(gdp)
}</code></pre></div>
<p>Here <code>calcGDP</code> takes a single argument, <code>dat</code>. When we call the function, the value we pass to it is assigned to <code>dat</code>, which becomes a variable inside the body of the function.</p>
<p>Inside the body, we multiply the two columns and use the <code>return</code> function to send back the result. This call to <code>return</code> is optional: R will automatically return the result of whatever command is executed on the last line of the function.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># First, reload the original data</span>
gapminder <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file=</span><span class="st">"gapminder-FiveYearData.csv"</span>)
<span class="kw">calcGDP</span>(<span class="kw">head</span>(gapminder))</code></pre></div>
<pre class="output"><code>[1] 6567086330 7585448670 8758855797 9648014150 9678553274 11697659231
</code></pre>
<p>That’s not very informative. Let’s also output the information from the other columns.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Takes a dataset and multiplies the population column with the GDP per capita column.</span>
calcGDP <-<span class="st"> </span>function(dat) {
  gdp <-<span class="st"> </span>dat$pop *<span class="st"> </span>dat$gdpPercap
  dat <-<span class="st"> </span><span class="kw">cbind</span>(dat, gdp)
  <span class="kw">return</span>(dat)
}
<span class="kw">calcGDP</span>(<span class="kw">head</span>(gapminder))</code></pre></div>
<p>Note we can specify any dataset or subset of our data.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">calcGDP</span>(gapminder[<span class="dv">20</span>:<span class="dv">30</span>,])</code></pre></div>
<p>We can use <code>==</code> to subset data by a particular value.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(<span class="kw">calcGDP</span>(gapminder[gapminder$year ==<span class="st"> </span><span class="dv">2007</span>, ]))</code></pre></div>
<p>We can get values for two different years by specifying one year OR another using <code>|</code> (its counterpart, logical AND, is <code>&</code>).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(<span class="kw">calcGDP</span>(gapminder[gapminder$year ==<span class="st"> </span><span class="dv">2007</span> |<span class="st"> </span>gapminder$year ==<span class="st"> </span><span class="dv">1952</span>, ]))</code></pre></div>
<p>Since this is getting unwieldy to read, let’s put all this subsetting into our function. When we call the function, we want to be able to specify the dataset, year(s), and country(ies), like this:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">calcGDP</span>(gapminder, <span class="dv">1952</span>, <span class="st">"Afghanistan"</span>)</code></pre></div>
<p>To do that, we need to add some more arguments to our function so that we can extract year and country. Inside the function we will use the matching operator <code>%in%</code>, which subsets data by a set of values rather than by a single value.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Takes a dataset and multiplies the population column with the GDP per capita column.</span>
calcGDP <-<span class="st"> </span>function(dat, year, country) {
  dat <-<span class="st"> </span>dat[dat$year %in%<span class="st"> </span>year, ]
  dat <-<span class="st"> </span>dat[dat$country %in%<span class="st"> </span>country, ]
  gdp <-<span class="st"> </span>dat$pop *<span class="st"> </span>dat$gdpPercap
  dat <-<span class="st"> </span><span class="kw">cbind</span>(dat, gdp)
  <span class="kw">return</span>(dat)
}</code></pre></div>
<p>The function now takes a subset of the rows for all columns by year. It then subsets this subset by country. Then it calculates the GDP for the subset of the previous two steps. The function then adds the GDP as a new column to the subsetted data and returns this as the final result. Because we have defined all of these pieces of code in one function we can now repeat this process for any dataset.</p>
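<p>The row selection in the first two steps relies on <code>%in%</code>. As a rough sketch of how that operator behaves on its own, consider the small vector below. The vector <code>years</code> is made up for illustration and is not read from the gapminder file.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># A small made-up vector of years, not taken from the gapminder data.
years <- c(1952, 1957, 1962, 1967)

# %in% tests each element on the left for membership in the set on the right,
# giving one TRUE or FALSE per element of `years`.
years %in% c(1952, 1962)

# It also works with a single value, so one argument can hold one year or many.
years %in% 1957

# The logical vector can then select the matching elements (or rows).
years[years %in% 1952:1962]</code></pre></div>
<p>The last line keeps 1952, 1957 and 1962, because those are the only elements of <code>years</code> that appear in the sequence <code>1952:1962</code>.</p>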
<p>We can now calculate the GDP for a single combination of year and country.</p>
<p>By using <code>%in%</code> we can also give multiple years or countries to those arguments.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">calcGDP</span>(gapminder, <span class="dv">1952</span>:<span class="dv">1962</span>, <span class="dt">country=</span><span class="st">"Afghanistan"</span>)</code></pre></div>
<pre class="output"><code> country year pop continent lifeExp gdpPercap gdp
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 6567086330
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7585448670
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 8758855797</code></pre>
<p>Note that we haven’t changed our original dataset. The subsetting is applied only to the copy of the data inside the function.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dim</span>(gapminder)</code></pre></div>
<p>Now let’s expand this function to check whether the year and country are specified. If they aren’t, we use all years and all countries. We can use conditional statements so that an action occurs only if a condition, or a set of conditions, is met.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># if</span>
if (condition is true) {
  perform action
}
<span class="co"># if ... else</span>
if (condition is true) {
  perform action
} else {  <span class="co"># that is, if the condition is false,</span>
  perform alternative action
}</code></pre></div>
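<p>A common use of an <code>if</code> statement is to compare values. For example:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="dv">1001</span>
if (x ==<span class="st"> </span><span class="dv">1001</span>) {
  <span class="kw">print</span>(<span class="st">'x is 1001'</span>)
} else {
  <span class="kw">print</span>(<span class="st">'x is not 1001'</span>)
}</code></pre></div>
<p>Other comparison operators, such as <code>></code>, work the same way:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">x <-<span class="st"> </span><span class="dv">1001</span>
if (x ><span class="st"> </span><span class="dv">1000</span>) {
  <span class="kw">print</span>(<span class="st">'x is greater than 1000'</span>)
} else {
  <span class="kw">print</span>(<span class="st">'x is not greater than 1000'</span>)
}</code></pre></div>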
<p>Better still, we can put this check in a function so that it can be reused.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">check1000 <-<span class="st"> </span>function(x) {
  if (x ><span class="st"> </span><span class="dv">1000</span>) {
    <span class="kw">print</span>(x)
    <span class="kw">print</span>(<span class="st">'is greater than 1000'</span>)
  } else {
    <span class="kw">print</span>(x)
    <span class="kw">print</span>(<span class="st">'is not greater than 1000'</span>)
  }
}
<span class="kw">check1000</span>(<span class="dv">1001</span>)</code></pre></div>
<p>To calculate GDP, we first set the default value of <code>year</code> and <code>country</code> to <code>NULL</code>. Then, using an <code>if</code> statement and the <code>is.null</code> function, we check whether each argument was specified when the function was called or was left at its default.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Takes a dataset and multiplies the population column with the GDP per capita column.</span>
calcGDP <-<span class="st"> </span>function(dat, <span class="dt">year=</span><span class="ot">NULL</span>, <span class="dt">country=</span><span class="ot">NULL</span>) {
  if (!<span class="kw">is.null</span>(year)) {
    dat <-<span class="st"> </span>dat[dat$year %in%<span class="st"> </span>year, ]
  }
  if (!<span class="kw">is.null</span>(country)) {
    dat <-<span class="st"> </span>dat[dat$country %in%<span class="st"> </span>country, ]
  }
  gdp <-<span class="st"> </span>dat$pop *<span class="st"> </span>dat$gdpPercap
  dat <-<span class="st"> </span><span class="kw">cbind</span>(dat, <span class="dt">gdp=</span>gdp)
  <span class="kw">return</span>(dat)
}</code></pre></div>
<p>The function now subsets the provided data by year if the year argument isn’t empty, then subsets the result by country if the country argument isn’t empty. Then it calculates the GDP for whatever subset emerges from the previous two steps. The function then adds the GDP as a new column to the subsetted data and returns this as the final result. You can see that the output is much more informative than just getting a vector of numbers.</p>
<p>Let’s take a look at what happens when we specify the year:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(<span class="kw">calcGDP</span>(gapminder, <span class="dt">year=</span><span class="dv">2007</span>))</code></pre></div>
<pre class="output"><code> country year pop continent lifeExp gdpPercap gdp
12 Afghanistan 2007 31889923 Asia 43.828 974.5803 31079291949
24 Albania 2007 3600523 Europe 76.423 5937.0295 21376411360
36 Algeria 2007 33333216 Africa 72.301 6223.3675 207444851958
48 Angola 2007 12420476 Africa 42.731 4797.2313 59583895818
60 Argentina 2007 40301927 Americas 75.320 12779.3796 515033625357
72 Australia 2007 20434176 Oceania 81.235 34435.3674 703658358894
</code></pre>
<p>Or for a specific country:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">calcGDP</span>(gapminder, <span class="dt">country=</span><span class="st">"Australia"</span>)</code></pre></div>
<pre class="output"><code> country year pop continent lifeExp gdpPercap gdp
61 Australia 1952 8691212 Oceania 69.120 10039.60 87256254102
62 Australia 1957 9712569 Oceania 70.330 10949.65 106349227169
63 Australia 1962 10794968 Oceania 70.930 12217.23 131884573002
64 Australia 1967 11872264 Oceania 71.100 14526.12 172457986742
65 Australia 1972 13177000 Oceania 71.930 16788.63 221223770658
66 Australia 1977 14074100 Oceania 73.490 18334.20 258037329175
67 Australia 1982 15184200 Oceania 74.740 19477.01 295742804309
68 Australia 1987 16257249 Oceania 76.320 21888.89 355853119294
69 Australia 1992 17481977 Oceania 77.560 23424.77 409511234952
70 Australia 1997 18565243 Oceania 78.830 26997.94 501223252921
71 Australia 2002 19546792 Oceania 80.370 30687.75 599847158654
72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894
</code></pre>
<p>Or both:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">calcGDP</span>(gapminder, <span class="dt">year=</span><span class="dv">2007</span>, <span class="dt">country=</span><span class="st">"Australia"</span>)</code></pre></div>
<pre class="output"><code> country year pop continent lifeExp gdpPercap gdp
72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894
</code></pre>
<p>Let’s walk through the body of the function:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">calcGDP <-<span class="st"> </span>function(dat, <span class="dt">year=</span><span class="ot">NULL</span>, <span class="dt">country=</span><span class="ot">NULL</span>) {</code></pre></div>
<p>Here we’ve added two arguments, <code>year</code> and <code>country</code>. We’ve set <em>default values</em> for both to <code>NULL</code> using the <code>=</code> operator in the function definition. This means that those arguments will take on those values unless the user specifies otherwise.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">  if (!<span class="kw">is.null</span>(year)) {
    dat <-<span class="st"> </span>dat[dat$year %in%<span class="st"> </span>year, ]
  }</code></pre></div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">  if (!<span class="kw">is.null</span>(country)) {
    dat <-<span class="st"> </span>dat[dat$country %in%<span class="st"> </span>country, ]
  }</code></pre></div>
<p>Here, we check whether each additional argument is set to <code>NULL</code>. Whenever an argument is not <code>NULL</code>, we overwrite the dataset stored in <code>dat</code> with the subset selected by that argument.</p>
<p>We can now ask the function to calculate the GDP for:</p>
<ul>
<li>The whole dataset;</li>
<li>A single year;</li>
<li>A single country;</li>
<li>A single combination of year and country.</li>
</ul>
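<p>The four kinds of call listed above can be sketched on a tiny data frame. The data frame <code>toy</code> and its values are invented for this illustration and only borrow the gapminder column names. The function definition is repeated so that the sketch runs on its own.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># A tiny made-up data frame with the same column names as gapminder;
# the values are invented for illustration only.
toy <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(2002, 2007, 2002, 2007),
  pop       = c(10, 20, 30, 40),
  gdpPercap = c(1, 2, 3, 4)
)

# Same definition as in the lesson, repeated so this sketch is self-contained.
calcGDP <- function(dat, year = NULL, country = NULL) {
  if (!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country, ]
  }
  gdp <- dat$pop * dat$gdpPercap
  dat <- cbind(dat, gdp = gdp)
  return(dat)
}

nrow(calcGDP(toy))                            # whole dataset: 4 rows
nrow(calcGDP(toy, year = 2007))               # single year: 2 rows
nrow(calcGDP(toy, country = "B"))             # single country: 2 rows
calcGDP(toy, year = 2007, country = "B")$gdp  # one combination: 40 * 4 = 160</code></pre></div>
<p>Leaving an argument out is the same as passing <code>NULL</code>, so the corresponding subsetting step is skipped.</p>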
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip-pass-by-value"><span class="glyphicon glyphicon-pushpin"></span>Tip: Pass by value</h2>
</div>
<div class="panel-body">
<p>Functions in R almost always make copies of the data to operate on inside of a function body. When we modify <code>dat</code> inside the function we are modifying the copy of the gapminder dataset stored in <code>dat</code>, not the original variable we gave as the first argument.</p>
<p>This is called “pass-by-value” and it makes writing code much safer: you can always be sure that whatever changes you make within the body of the function stay inside the body of the function.</p>
</div>
</aside>
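<p>A small sketch of this behaviour, using a hypothetical function <code>add_column</code> and a made-up data frame (neither is part of the lesson’s code):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical function and data, for illustration only.
add_column <- function(dat) {
  dat$doubled <- dat$value * 2  # modifies the function's own copy of dat
  return(dat)
}

original <- data.frame(value = c(1, 2, 3))
modified <- add_column(original)

names(original)  # still only "value"
names(modified)  # "value" and "doubled"</code></pre></div>
<p>The new column appears only in the returned data frame. To keep the change, we have to assign the result to a variable, as done here with <code>modified</code>.</p>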
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip-function-scope"><span class="glyphicon glyphicon-pushpin"></span>Tip: Function scope</h2>
</div>
<div class="panel-body">
<p>Another important concept is scoping: any variables (or functions!) you create or modify inside the body of a function only exist for the lifetime of the function’s execution. When we call <code>calcGDP</code>, the variables <code>dat</code> and <code>gdp</code> only exist inside the body of the function. Even if we have variables of the same name in our interactive R session, they are not modified in any way when executing a function.</p>
</div>
</aside>
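<p>Scoping can be sketched in the same spirit. The function <code>sum_to</code> and the variable names below are hypothetical and chosen only for this illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical example: this `total` is created inside the function body only.
sum_to <- function(n) {
  total <- sum(seq_len(n))
  return(total)
}

total <- 0           # a variable of the same name in the calling session
result <- sum_to(4)  # 1 + 2 + 3 + 4

result               # 10
total                # still 0: the function's `total` was a separate variable</code></pre></div>
<p>The assignment to <code>total</code> inside <code>sum_to</code> creates a new local variable, so the one defined in the session keeps its value.</p>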
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-1"><span class="glyphicon glyphicon-pencil"></span>Challenge 1</h2>
</div>
<div class="panel-body">
<p>What is the expected result from the following script?</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">add3 <-<span class="st"> </span>function(y){
  y +<span class="st"> </span><span class="dv">3</span>
}
x <-<span class="st"> </span><span class="dv">10</span>
y <-<span class="st"> </span><span class="kw">add3</span>(x)
<span class="kw">print</span>(x)
<span class="kw">print</span>(y)</code></pre></div>
</div>
</section>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
<script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>
</body>
</html>