---
title: "Programming for Analysis with R Part II"
author: "Fred LaPolla"
date: "May 10, 2021"
output: slidy_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
***
## Objectives
Students will be able to:
>- Identify sources for troubleshooting
>- Identify data types that R works with
>- Recode the data type of an object
***
## Pulling in the data
```{r}
# RCurl downloads the raw CSV text from GitHub
library(RCurl)
url <- getURL("https://raw.githubusercontent.com/fredlapolla/RScience2021_libr/main/NYC_HANES_DIAB.csv")
# Parse the downloaded text into a data frame
nyc <- read.csv(text = url)
```
***
## Problem Solving
</br>
</br>
</br>
There are too many commands in R to memorize them all, and there is no need to: literally no one memorizes every command they use.
***
## Problem Solving
</br>
</br>
</br>
R is case sensitive!
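
A minimal sketch of what case sensitivity means in practice. The object name `lab_size` is hypothetical and used only for illustration:

```{r}
# Hypothetical object name, used only for illustration
lab_size <- 12
exists("lab_size")   # TRUE: this exact name was defined
exists("Lab_Size")   # FALSE: different capitalization is a different name
```

Typing `Lab_Size` at the console would therefore give an "object not found" error, even though `lab_size` exists.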
***
## Problem Solving
</br>
</br>
</br>
R is a machine! It won't know what you mean!
***
## Problem Solving
</br>
</br>
</br>
Look for **typos!**
Typos include extra commas, stray periods, and parentheses that don't close. Sometimes the error message will clue you in; other times (most times?) it will not.
***
## Problem Solving
</br>
</br>
</br>
If the console shows a plus sign (`+`) instead of the usual `>` prompt after you run a command, hit **ESC.**
The plus sign means R is waiting for more input, usually because a set of parentheses or quotes was not closed.
***
## Problem Solving
</br>
</br>
</br>
When in doubt, Google it.
Literally no one memorizes more than a handful of commands, and most of the time you will need to look up the specific syntax.
***
## Problem Solving
</br>
</br>
</br>
My two favorite sites:
>- Stack Overflow: a site where people pose questions and others try to answer them. https://stackoverflow.com/questions/
>- Quick-R by DataCamp: Descriptions of how to do the regular things you may attempt in R. https://www.statmethods.net/stats/frequencies.html
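
Alongside these sites, R's built-in help can be searched from the console. The sketch below is one possible starting point, using `read.csv` only as an example function:

```{r}
# Find functions whose names contain "read.csv"
matches <- apropos("read.csv")
matches
# Show the arguments a function accepts without leaving the console
args(read.csv)
```

Typing `?read.csv` at the console opens the full help page for that function.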
***
## Packages
</br>
</br>
</br>
Often the thing you want to do does not come with "out of the box" R (or it does, but it is difficult to do).
***
## Packages
Typically you will learn about packages from articles like those on Stack Exchange, but we will talk about some here. The Tidyverse, a set of packages created by Hadley Wickham, is commonly used for data cleaning, analysis and visualization.
***
## Packages
Try installing it!
```{r}
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
library(tidyverse)
```
***
## Packages
This gives you lots of nice tools for data cleaning and wrangling.
```{r echo=TRUE}
hasDiab <- filter(nyc, DX_DBTS == 1)
head(hasDiab)
```
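
For comparison, base R's `subset()` can select the same kind of rows without any extra packages. This is a sketch on a hypothetical toy data frame (the values are made up; only the column name `DX_DBTS` comes from the data above):

```{r}
# Hypothetical toy data; the values are made up for illustration
toy <- data.frame(ID = 1:4, DX_DBTS = c(1, 2, 1, NA))
# Base R counterpart of filter(toy, DX_DBTS == 1)
has_diab_toy <- subset(toy, DX_DBTS == 1)
has_diab_toy$ID      # 1 3
nrow(has_diab_toy)   # 2
```

Like `filter()`, `subset()` drops rows where the condition is `NA`.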
***
## Data Types in R
</br>
</br>
</br>
R works with several types of data. Think of these as the sorts of things that appear in a given cell in a spreadsheet.
>- Character
>- Numeric
>- Integer
>- Factor (i.e. categorical or ordinal variable)
>- Logical (T/F)
>- Complex (think the imaginary number i)
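
Each of the types listed above can be checked with `class()`. The values below are illustrative only:

```{r}
class("Fred")               # "character"
class(13.5)                 # "numeric"
class(12L)                  # "integer"
class(factor("Brooklyn"))   # "factor"
class(TRUE)                 # "logical"
class(2 + 3i)               # "complex"
```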
***
## Character
</br>
</br>
</br>
Character data means that R is treating the value as text.
Imagine we created a spreadsheet of members of your lab. Everyone's name would be a character value, e.g. "Fred".
**Note of caution:** sometimes R reads in data in ways that we do not want it to. If it sees even one stray letter (say, a typo) in a column of numbers, it will treat the whole column as character. Sometimes it will instead read in character strings as a factor with a great many levels (this was the default behavior of read.csv() before R 4.0).
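
A minimal sketch of the typo problem described above, using a hypothetical column of ages in which one entry has the letter "o" in place of a zero:

```{r}
# Hypothetical values; "5o" contains a letter o instead of a zero
ages_raw <- c("34", "41", "5o", "29")
class(ages_raw)                  # "character"
# Converting turns the unparseable entry into NA (with a warning)
ages_num <- suppressWarnings(as.numeric(ages_raw))
ages_num                         # 34 41 NA 29
which(is.na(ages_num))           # 3: the position of the typo
```

Checking for `NA` after the conversion is one way to locate the offending entries before fixing them in the source data.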
***
## Numeric and Integer
</br>
>- Numeric data is any number e.g. 12, 13.5 etc.
>- Integer data is a whole number with no decimal part, written with an L suffix: e.g. 12L, 13L etc.
>- **Important** Sometimes what appears to us as a number is being read as a character. It will appear with quotes as "12" or "13.5" to let us know R is not reading it as a number.
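
The difference between these three ways of writing "12" can be sketched as follows (illustrative values only):

```{r}
class(12)                 # "numeric"
class(12L)                # "integer"
class("12")               # "character"
is.integer(12)            # FALSE: without the L, 12 is stored as numeric
# A quoted number must be converted before doing arithmetic on it
as.numeric("13.5") + 1    # 14.5
```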
***
## Factor
</br>
</br>
</br>
A factor variable takes one of a fixed set of values, e.g. Place of residence: Manhattan, Brooklyn, the Bronx, Queens, Staten Island, Other.
In R, factor variables can be ordered, e.g. First Year, Second Year, Third Year students. These are often called ordinal variables. Factors can also be unordered, like the NYC boroughs above.
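
A small sketch of an ordered and an unordered factor, built from made-up values:

```{r}
# Ordered factor: the levels argument fixes the order explicitly
year <- factor(c("Second Year", "First Year", "Third Year"),
               levels = c("First Year", "Second Year", "Third Year"),
               ordered = TRUE)
is.ordered(year)      # TRUE
year[1] > year[2]     # TRUE: Second Year ranks above First Year

# Unordered factor: levels default to alphabetical order
boro <- factor(c("Queens", "Brooklyn", "Queens"))
is.ordered(boro)      # FALSE
levels(boro)          # "Brooklyn" "Queens"
```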
***
## Factor Levels and Labels
</br>
</br>
</br>
When you work with factor variables, the different choices are called "levels." You can rename levels with the levels() command, but the new names must be given in the same order as the existing levels, or the data will be mislabeled.
```{r}
nyc$GENDER <- factor(nyc$GENDER, levels = 1:2)
summary(nyc$GENDER)
```
```{r}
levels(nyc$GENDER) <- c("Male", "Female")
summary(nyc$GENDER)
```
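
To see why the order matters, here is the same renaming on a hypothetical toy vector, assuming (as in the code above) that 1 stands for Male and 2 for Female:

```{r}
# Hypothetical codes: 1 = Male, 2 = Female
codes <- c(2, 1, 2, 2)
g <- factor(codes, levels = 1:2)
levels(g)                         # "1" "2"
# New names are matched to the existing levels by position
levels(g) <- c("Male", "Female")
as.character(g)                   # "Female" "Male" "Female" "Female"
```

Supplying `c("Female", "Male")` instead would silently swap the labels on every row.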
***
## Alternate approach with Labels
You can also do this with labels for the levels.
An "in the weeds" explanation: levels and labels here do not mean the same thing as in SPSS. This code looks at the AGEGROUP column, declares that its groups (or levels) are the integers 1, 2 and 3, and then assigns the names Youth, Middle and Aged to those levels in that order.
```{r}
nyc$AGEGROUP <- factor(nyc$AGEGROUP, levels = 1:3, labels = c("Youth", "Middle", "Aged"))
summary(nyc$AGEGROUP)
```
**Note:** As in the examples above, sometimes R reads in character strings as factors. If your cell/gene IDs print with "levels," you may have to change the data type.
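
A minimal sketch of changing a factor back to another type. The IDs and numbers below are hypothetical:

```{r}
# Hypothetical IDs that were read in as a factor
ids <- factor(c("GENE_A", "GENE_B", "GENE_A"))
ids_chr <- as.character(ids)
class(ids_chr)                    # "character"

# Caution with numbers stored as factors:
nums <- factor(c("10", "20", "10"))
as.numeric(nums)                  # 1 2 1: the internal level codes
as.numeric(as.character(nums))    # 10 20 10: the actual values
```

Converting through `as.character()` first avoids getting the internal level codes by mistake.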
***
## Logical
</br>
</br>
</br>
R understands TRUE and FALSE as concepts.
TRUE is equivalent to 1 and FALSE to 0.
This means you can sum a logical test to count how many values meet a criterion, or take its mean to find a proportion. For example, this finds the proportion of people under 50:
```{r}
mean(nyc$SPAGE < 50)
```
This finds the proportion of people under 50. It works because `nyc$SPAGE < 50` checks every value in the column and returns TRUE, i.e. 1, if the person is younger than 50, and FALSE, i.e. 0, if they are 50 or older. Taking the mean is then the same as summing all those ones and dividing by the total number of values, which is a proportion.
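
The same idea on a small made-up vector, including a missing value. Whether SPAGE itself contains missing values is not assumed here; the `na.rm = TRUE` argument is shown only as a precaution:

```{r}
# Hypothetical ages, with one missing value
ages <- c(23, 67, 45, 51, NA)
ages < 50                        # TRUE FALSE TRUE FALSE NA
sum(ages < 50, na.rm = TRUE)     # 2 people under 50
mean(ages < 50, na.rm = TRUE)    # 0.5, i.e. 2 of the 4 known ages
```

Without `na.rm = TRUE`, a single `NA` in the column makes both `sum()` and `mean()` return `NA`.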
***
## Determining the data type
</br>
</br>
</br>
If you run class() on a column of data, you will get the data type of that column:
```{r}
class(nyc$AGEGROUP)
```
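
To check every column at once, `sapply()` can apply `class()` across a whole data frame. This sketch uses a hypothetical two-column data frame rather than the nyc data:

```{r}
# Hypothetical lab roster, for illustration only
lab <- data.frame(name = c("Fred", "Ana"),
                  age = c(40, 35),
                  stringsAsFactors = FALSE)
col_types <- sapply(lab, class)
col_types          # name: "character", age: "numeric"
str(lab)           # compact overview of types and first values
```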