forked from swcarpentry/r-novice-inflammation
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path01-supp-factors.html
249 lines (246 loc) · 17.7 KB
/
01-supp-factors.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Programming with R</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Programming with R</h1></a>
<h2 class="subtitle">Understanding factors</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Understand how to represent categorical data in R</li>
<li>Know the difference between ordered and unordered factors</li>
<li>Be aware of some of the problems encountered when using factors</li>
</ul>
</div>
</section>
<p>This section is modeled after the <a href="http://datacarpentry.org">datacarpentry lessons</a>.</p>
<p>Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.</p>
<p>Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.</p>
<p>Once created, factors can only contain a pre-defined set values, known as <em>levels</em>. By default, R always sorts <em>levels</em> in alphabetical order. For instance, if you have a factor with 2 levels:</p>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip"><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>The <code>factor()</code> command is used to create and modify factors in R</p>
</div>
</aside>
<pre class="sourceCode r"><code class="sourceCode r">sex <-<span class="st"> </span><span class="kw">factor</span>(<span class="kw">c</span>(<span class="st">"male"</span>, <span class="st">"female"</span>, <span class="st">"female"</span>, <span class="st">"male"</span>))</code></pre>
<p>R will assign <code>1</code> to the level <code>"female"</code> and <code>2</code> to the level <code>"male"</code> (because <code>f</code> comes before <code>m</code>, even though the first element in this vector is <code>"male"</code>). You can check this by using the function <code>levels()</code>, and check the number of levels using <code>nlevels()</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">levels</span>(sex)</code></pre>
<pre class="output"><code>[1] "female" "male"
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">nlevels</span>(sex)</code></pre>
<pre class="output"><code>[1] 2
</code></pre>
<p>Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or it is required by particular type of analysis. Additionally, specifying the order of the levels allows us to compare levels:</p>
<pre class="sourceCode r"><code class="sourceCode r">food <-<span class="st"> </span><span class="kw">factor</span>(<span class="kw">c</span>(<span class="st">"low"</span>, <span class="st">"high"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>, <span class="st">"low"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>))
<span class="kw">levels</span>(food)</code></pre>
<pre class="output"><code>[1] "high" "low" "medium"
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">food <-<span class="st"> </span><span class="kw">factor</span>(food, <span class="dt">levels =</span> <span class="kw">c</span>(<span class="st">"low"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>))
<span class="kw">levels</span>(food)</code></pre>
<pre class="output"><code>[1] "low" "medium" "high"
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">min</span>(food) ## doesn't work</code></pre>
<pre class="error"><code>Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), .Label = c("low", : 'min' not meaningful for factors
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">food <-<span class="st"> </span><span class="kw">factor</span>(food, <span class="dt">levels =</span> <span class="kw">c</span>(<span class="st">"low"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>), <span class="dt">ordered=</span><span class="ot">TRUE</span>)
<span class="kw">levels</span>(food)</code></pre>
<pre class="output"><code>[1] "low" "medium" "high"
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">min</span>(food) ## works!</code></pre>
<pre class="output"><code>[1] low
Levels: low < medium < high
</code></pre>
<p>In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than using simple integer labels because factors are self describing: <code>"low"</code>, <code>"medium"</code>, and <code>"high"</code>" is more descriptive than <code>1</code>, <code>2</code>, <code>3</code>. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the subjects in our example data set).</p>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge---representing-data-in-r"><span class="glyphicon glyphicon-pencil"></span>Challenge - Representing data in R</h2>
</div>
<div class="panel-body">
<p>You have a vector representing levels of exercise undertaken by 5 subjects</p>
<p><strong>“l”,“n”,“n”,“i”,“l”</strong> ; n=none, l=light, i=intense</p>
<p>What is the best way to represent this in R?</p>
<ol style="list-style-type: lower-alpha">
<li><p>exercise <- c(“l”, “n”, “n”, “i”, “l”)</p></li>
<li><p>exercise <- factor(c(“l”, “n”, “n”, “i”, “l”), ordered = TRUE)</p></li>
<li><p>exercise < -factor(c(“l”, “n”, “n”, “i”, “l”), levels = c(“n”, “l”, “i”), ordered = FALSE)</p></li>
<li><p>exercise <- factor(c(“l”, “n”, “n”, “i”, “l”), levels = c(“n”, “l”, “i”), ordered = TRUE)</p></li>
</ol>
</div>
</section>
<h3 id="converting-factors">Converting Factors</h3>
<p>Converting from a factor to a number can cause problems:</p>
<pre class="sourceCode r"><code class="sourceCode r">f <-<span class="st"> </span><span class="kw">factor</span>(<span class="kw">c</span>(<span class="fl">3.4</span>, <span class="fl">1.2</span>, <span class="dv">5</span>))
<span class="kw">as.numeric</span>(f)</code></pre>
<pre class="output"><code>[1] 2 1 3
</code></pre>
<p>This does not behave as expected (and there is no warning).</p>
<p>The recommended way is to use the integer vector to index the factor levels:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">levels</span>(f)[f]</code></pre>
<pre class="output"><code>[1] "3.4" "1.2" "5"
</code></pre>
<p>This returns a character vector, the <code>as.numeric()</code> function is still required to convert the values to the proper type (numeric).</p>
<pre class="sourceCode r"><code class="sourceCode r">f <-<span class="st"> </span><span class="kw">levels</span>(f)[f]
f <-<span class="st"> </span><span class="kw">as.numeric</span>(f)</code></pre>
<h3 id="using-factors">Using Factors</h3>
<p>Lets load our example data to see the use of factors:</p>
<pre class="sourceCode r"><code class="sourceCode r">dat <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file =</span> <span class="st">'data/sample.csv'</span>, <span class="dt">stringsAsFactors =</span> <span class="ot">TRUE</span>)</code></pre>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip-1"><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p><code>stringsAsFactors=TRUE</code> is the default behaviour for R. We could leave this argument out. It is included here for clarity.</p>
</div>
</aside>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str</span>(dat)</code></pre>
<pre class="output"><code>'data.frame': 100 obs. of 9 variables:
$ ID : Factor w/ 100 levels "Sub001","Sub002",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : Factor w/ 4 levels "f","F","m","M": 3 3 3 1 3 4 1 3 3 1 ...
$ Group : Factor w/ 3 levels "Control","Treatment1",..: 1 3 3 2 2 3 1 3 3 1 ...
$ BloodPressure: int 132 139 130 105 125 112 173 108 131 129 ...
$ Age : num 16 17.2 19.5 15.7 19.9 14.3 17.7 19.8 19.4 18.8 ...
$ Aneurisms_q1 : int 114 148 196 199 188 260 135 216 117 188 ...
$ Aneurisms_q2 : int 140 209 251 140 120 266 98 238 215 144 ...
$ Aneurisms_q3 : int 202 248 122 233 222 320 154 279 181 192 ...
$ Aneurisms_q4 : int 237 248 177 220 228 294 245 251 272 185 ...
</code></pre>
<p>Notice the first 3 columns have been converted to factors. These values were text in the data file so R automatically interpreted them as catagorical variables.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(dat)</code></pre>
<pre class="output"><code> ID Gender Group BloodPressure Age
Sub001 : 1 f:35 Control :30 Min. : 62.0 Min. :12.10
Sub002 : 1 F: 4 Treatment1:35 1st Qu.:107.5 1st Qu.:14.78
Sub003 : 1 m:46 Treatment2:35 Median :117.5 Median :16.65
Sub004 : 1 M:15 Mean :118.6 Mean :16.42
Sub005 : 1 3rd Qu.:133.0 3rd Qu.:18.30
Sub006 : 1 Max. :173.0 Max. :20.00
(Other):94
Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4
Min. : 65.0 Min. : 80.0 Min. :105.0 Min. :116.0
1st Qu.:118.0 1st Qu.:131.5 1st Qu.:182.5 1st Qu.:186.8
Median :158.0 Median :162.5 Median :217.0 Median :219.0
Mean :158.8 Mean :168.0 Mean :219.8 Mean :217.9
3rd Qu.:188.0 3rd Qu.:196.8 3rd Qu.:248.2 3rd Qu.:244.2
Max. :260.0 Max. :283.0 Max. :323.0 Max. :315.0
</code></pre>
<p>Notice the <code>summary()</code> function handles factors differently to numbers (and strings), the occurence counts for each value is often more useful information.</p>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip-2"><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>The <code>summary()</code> function is a great way of spotting errors in your data, look at the <em>dat$Gender</em> column. It’s also a great way for spotting missing data.</p>
</div>
</aside>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge---reordering-factors"><span class="glyphicon glyphicon-pencil"></span>Challenge - Reordering factors</h2>
</div>
<div class="panel-body">
<p>The function <code>table()</code> tabulates observations and can be used to create bar plots quickly. For instance:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">table</span>(dat$Group)</code></pre>
<pre class="output"><code>
Control Treatment1 Treatment2
30 35 35
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">barplot</span>(<span class="kw">table</span>(dat$Group))</code></pre>
<p><img src="fig/01-supp-factors-reordering-factors-1.png" title="plot of chunk reordering-factors" alt="plot of chunk reordering-factors" style="display: block; margin: auto;" /> Use the <code>factor()</code> command to modify the column dat$Group so that the <em>control</em> group is plotted last</p>
</div>
</section>
<h3 id="removing-levels-from-a-factor">Removing Levels from a Factor</h3>
<p>Some of the Gender values in our dataset have been coded incorrectly. Let’s remove factors.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">barplot</span>(<span class="kw">table</span>(dat$Gender))</code></pre>
<p><img src="fig/01-supp-factors-gender-counts-1.png" title="plot of chunk gender-counts" alt="plot of chunk gender-counts" style="display: block; margin: auto;" /></p>
<p>Values should have been recorded as lowercase ‘m’ & ‘f’. We should correct this.</p>
<pre class="sourceCode r"><code class="sourceCode r">dat$Gender[dat$Gender ==<span class="st"> 'M'</span>] <-<span class="st"> 'm'</span></code></pre>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge---updating-factors"><span class="glyphicon glyphicon-pencil"></span>Challenge - Updating factors</h2>
</div>
<div class="panel-body">
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">plot</span>(<span class="dt">x =</span> dat$Gender, <span class="dt">y =</span> dat$BloodPressure)</code></pre>
<p><img src="fig/01-supp-factors-updating-factors-1.png" title="plot of chunk updating-factors" alt="plot of chunk updating-factors" style="display: block; margin: auto;" /></p>
<p>Why does this plot show 4 levels?</p>
<p><em>Hint</em> how many levels does dat$Gender have?</p>
</div>
</section>
<p>We need to tell R that “M” is no longer a valid value for this column. We use the <code>droplevels()</code> function to remove extra levels.</p>
<pre class="sourceCode r"><code class="sourceCode r">dat$Gender <-<span class="st"> </span><span class="kw">droplevels</span>(dat$Gender)
<span class="kw">plot</span>(<span class="dt">x =</span> dat$Gender, <span class="dt">y =</span> dat$BloodPressure)</code></pre>
<p><img src="fig/01-supp-factors-dropping-levels-1.png" title="plot of chunk dropping-levels" alt="plot of chunk dropping-levels" style="display: block; margin: auto;" /></p>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip-3"><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>Adjusting the <code>levels()</code> of a factor provides a useful shortcut for reassigning values in this case.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">levels</span>(dat$Gender)[<span class="dv">2</span>] <-<span class="st"> 'f'</span>
<span class="kw">plot</span>(<span class="dt">x =</span> dat$Gender, <span class="dt">y =</span> dat$BloodPressure)</code></pre>
<p><img src="fig/01-supp-factors-adjusting-levels-1.png" title="plot of chunk adjusting-levels" alt="plot of chunk adjusting-levels" style="display: block; margin: auto;" /></p>
</div>
</aside>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="key-points"><span class="glyphicon glyphicon-pushpin"></span>Key Points</h2>
</div>
<div class="panel-body">
<ul>
<li>Factors are used to represent categorical data</li>
<li>Factors can be <em>ordered</em> or <em>unordered</em></li>
<li>Some R functions have special methods for handling factors</li>
</ul>
</div>
</aside>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/r-novice-inflammation">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
<script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-37305346-2', 'auto');
ga('send', 'pageview');
</script>
</body>
</html>