-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathweek3.html
218 lines (214 loc) · 11.9 KB
/
week3.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Week 3: Pivot Tables</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="css/bootstrap.min.css" rel="stylesheet">
<link href="css/custom.css" rel="stylesheet">
</head>
<body class="markdown github">
<header class="navbar-inverse navbar-fixed-top">
<div class="container">
<nav role="navigation">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a href="index.html" class="navbar-brand">J298 Data Journalism</a>
</div> <!-- /.navbar-header -->
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Class notes<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="week1.html">What is data?</a></li>
<li><a href="week2.html">Types of stories</a></li>
<li><a href="week3.html">Working with spreadsheets</a></li>
<li><a href="week4.html">Acquiring, cleaning, and formatting data</a></li>
<li><a href="week5.html">R, RStudio, and the tidyverse</a></li>
<li><a href="week6.html">Data journalism in the tidyverse</a></li>
<li><a href="week7.html">Don't let the data lie to you</a></li>
<li><a href="week8.html">Databases and SQL</a></li>
<li><a href="week9.html">Finding stories using maps</a></li>
<li><a href="week10.html">Maps meet databases</a></li>
<li><a href="week11.html">More PostGIS</a></li>
<li><a href="week12.html">R practice</a></li>
<li><a href="week13.html">PostGIS practice</a></li>
<li><a href="week14.html">More fun with R</a></li>
</ul>
</li>
<li><a href="software.html">Software</a></li>
<li><a href="datasets.html">Data</a></li>
<li><a href="questions.html">If you get stuck</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Email instructors<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="mailto:[email protected]">Peter Aldhous</a></li>
<li><a href="mailto:[email protected]">Amanda Hickman</a></li>
</ul>
</li>
</ul>
</div><!-- /.navbar-collapse -->
</nav>
</div> <!-- /.navbar-header -->
</header>
<div class="container all">
<h2 id="week-3">Week 3</h2>
<p><em>Instructor: Amanda Hickman</em></p>
<h3 id="presentations">Presentations</h3>
<p>Chenwei Tian & Chloe Lessard</p>
<h3 id="quiz-review">Quiz Review</h3>
<p>How did we do? Peter is going to grade them, but we can use Google Spreadsheets to check our answers to some questions.</p>
<blockquote>
<ol>
<li>An agency’s budget was $21 million in 2017. It is $24.5 million in 2018. What was the percentage increase, rounded to one decimal place?</li>
</ol>
</blockquote>
<p>Can we use a spreadsheet formula to calculate this? (Yes. So open Google Sheets ...)</p>
<blockquote>
<ol>
<li>There were 15,000 violent crimes in 2017 in city A, which had a population of 350,000. There were 9,000 violent crimes in City B, which had a population of 500,000. Complete the following sentence with a number rounded to one decimal place: “City A’s violent crime rate was ... times that of city B.”</li>
</ol>
</blockquote>
<p>What we need to do is normalize the data to the population.</p>
<table>
<thead>
<tr>
<th>City</th>
<th>Crimes</th>
<th>Population</th>
<th>Rate </th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>15000</td>
<td>350000</td>
<td></td>
</tr>
<tr>
<td>B</td>
<td>9000</td>
<td>500000</td>
</tr>
</tbody>
</table>
<h2 id="spreadsheets">Spreadsheets</h2>
<p>Spreadsheets are the simplest tool in your data toolbox. There are some real limitations that you'll encounter when you try to use spreadsheets for large datasets or complex, reproducible analysis, but some facility with core spreadsheet functions will go a long, long way towards improving your data skills.</p>
<p>Sort, filter, summarize, group and join.</p>
<p>We're going to use Google Sheets in class, but <a href="https://www.libreoffice.org/download/download/">LibreOffice Calc</a> is a great, open source option. Excel is also perfectly good, though it's been a long time since anyone showed me a reason to choose it over LibreOffice.</p>
<h3 id="filtering-">Filtering:</h3>
<p>On a data set the size of ProPublica's Medicaid data, you aren't going to be able to use Google Sheets (which has a max of ~200K cells -- meaning that a four column spreadsheet can only accommodate 50K rows). So as the semester progresses and we move into tools that are way harder to use than spreadsheets are, it's because we can't use a spreadsheet.</p>
<p>Open <a href="data/week3/311_Cases_Dec2017.csv"><code>311_Cases_Dec2017.csv</code></a> . We're going to start by just sniffing around. The best way to get it into a Google Drive spreadsheet is to use <code>File > Import ...</code></p>
<p>The data is San Francisco's 311 call records, from <a href="https://data.sfgov.org/City-Infrastructure/311-Cases/vw6y-z8j6">SF's Open Data Portal</a> -- I used their service to filter out only the cases opened between 12/01/2017 12:00:00 AM and 01/01/2018 12:00:00 AM.</p>
<p>Click in any cell in the header row and look for the 'filter' icon. Click it. It will take a minute but we'll get a little filter icon in each header cell, and then we can start to skim a bit. Try filtering by status, and source.</p>
<p>Notice that it takes a minute to get through this data.</p>
<p>Good reading on 311 data:</p>
<ul>
<li><p><a href="http://kycir.org/2018/02/01/behind-the-data-how-we-analyzed-louisville-trash-complaints/">http://kycir.org/2018/02/01/behind-the-data-how-we-analyzed-louisville-trash-complaints/</a></p>
</li>
<li><p><a href="http://kycir.org/2018/01/30/the-next-louisville-what-trash-cans-tell-us-about-poverty/">http://kycir.org/2018/01/30/the-next-louisville-what-trash-cans-tell-us-about-poverty/</a></p>
</li>
</ul>
<h3 id="pitches">Pitches</h3>
<p>In groups of 3, pull up your pitches and go over them. Questions to ask:</p>
<ul>
<li>Is this enough? too much?</li>
<li>"Compared to What?" -- what are the points of comparison you'd want to see as a reader?</li>
<li>What would make it more interesting? Feel free to riff, speculate, and brainstorm but also try to dial in what you think this is going to need as a story.</li>
</ul>
<p><em>Notes:</em> Most of you need to...
a) get your hands on the actual data and make sure there's something there,
b) go do some reading of other reporting on this, or make a few phone calls to figure out what the context is. </p>
<h2 id="formulas">Formulas</h2>
<p>I keep a running list of my favorite spreadsheet commands on a <a href="https://github.com/amandabee/cunyjdata/wiki/Tip-Sheet:-Spreadsheets">course wiki</a> -- it's a great resource.</p>
<p>You went over the basics week 1: <code>=SUM</code>, <code>=AVERAGE</code>, <code>=MAX</code>, <code>=MIN</code></p>
<table>
<thead>
<tr>
<th>City</th>
<th>Crimes</th>
<th>Population</th>
<th>Formula</th>
<th>Rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>15000</td>
<td>350000</td>
<td>=(B2/C2)*100000</td>
<td>4286</td>
</tr>
<tr>
<td>B</td>
<td>9000</td>
<td>500000</td>
<td>=(B3/C3)*100000</td>
<td>1800</td>
</tr>
</tbody>
</table>
<h2 id="pivot-tables">Pivot Tables</h2>
<p>Later this semester, we'll use some much more powerful tools, especially R. And the honest truth is that spreadsheets can be super dangerous. It's hard to keep track of the transformations you're applying when they're hidden in spreadsheet functions. But: you can do a lot in spreadsheets and sometimes they're exactly the right tool.</p>
<ul>
<li>Turn off the auto-filter.</li>
<li>Make sure you don't have any data selected.</li>
<li><code>Data > Pivot Table...</code></li>
<li>Add <code>Responsible Agency</code> and then <code>Category</code> as rows</li>
<li>Add <code>CaseID</code> as value. </li>
<li>Take a look at the default operation: is the sum of all case ids a useful value?</li>
<li>Add <code>Supervisor District</code> as columns.</li>
<li>Use "Sort by" to sort within the Pivot Table.</li>
</ul>
<h3 id="try-it-">Try it:</h3>
<p>If you wanted to quickly see the opened vs closed case count for each Supervisor's district, how would you do that?</p>
<h3 id="more-data-earthquakes">More Data: Earthquakes</h3>
<p>The USGS publishes <a href="https://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php">real time earthquake data</a>, but the locations they include are not something we can sort by.</p>
<p>Start by using <code>Data > Text to Columns ...</code> to separate out the broad location from the specifics.</p>
<ul>
<li><p>Filter for blank cells. We can copy and paste to fill in the blank cells.</p>
</li>
<li><p>Create a pivot table -- use "COUNT of Mag"; sort it by count.</p>
</li>
<li><p>Go back to the spreadsheet and Categorize by magnitude. I made the slightly arbitrary decision that we're going to call everything below 4.5 "Minor", and everything above 6.0 "Strong" -- the USGS actually has a few more grades in their system, but these are good buckets.</p>
</li>
</ul>
<h4 id="-lookup-e2-0-4-5-6-minor-moderate-strong-"><code>=LOOKUP(E2,{0,4.5,6},{"Minor","Moderate","Strong"})</code></h4>
<ul>
<li><p>Fill Down: Select F2 through F6 and do <kbd>ctrl</kbd><kbd>d</kbd> -- that's one "fill down" option. Or grab the blue box and pull down. Or, hover over the little blue square and double click. They're all good options.</p>
</li>
<li><p>Go back to your pivot table and add the "categories" as columns.</p>
</li>
</ul>
<h2 id="more-data-graduation-outcomes">More Data: Graduation Outcomes</h2>
<p>The State of California publishes quite a bit of <a href="https://www.cde.ca.gov/ds/sd/sd/filescohort.asp">high school graduation</a> data statewide. Open <a href="raw-data/week3/calpads_cohort16_alameda.csv">calpad_cohort16_alameda.csv</a> in Google Sheets.</p>
<p>Note that the statewide data comes with a 14 digit code. Take a look at the <a href="https://www.cde.ca.gov/ds/sd/sd/fscohort.asp">dictionary</a> and let's figure out how to break out the CDS into county, district, school.</p>
<p>If we want to know how Oakland Technical compares to the rest of Oakland's High Schools, how would you approach that?</p>
<h1 id="homework">Homework</h1>
<p><strong>Installs:</strong> make sure that you have <a href="software.html">Chrome and Open Refine installed</a>.</p>
<p><strong>Pivot Table Exercise:</strong> The complete assignment is in bCourses.</p>
<p>Go back to the Berkeley Collision Data that Peter shared with you <a href="https://ucb-dataj.github.io/2018/data/week1/berkeley_collisions.csv">Week 1</a>. Use a pivot table to get the total number of fatalities by day of the week for each year. You will probably want to consult the <a href="https://ucb-dataj.github.io/2018/TIMS.html">data dictionary</a>.</p>
<ul>
<li>What day of the week had the most fatalities in Berkeley, between 2006 and 2014?</li>
</ul>
<p>Then, revise your pivot table to answer the following two questions keeping in mind that a single fatal crash may include multiple fatalities:</p>
<ul>
<li>Has the number of severe or fatal crashes gone up, gone down, or stayed about the same? (You can eyeball this.)</li>
<li>What kind of weather was noted at the time of the most serious and fatal collisions? What conclusion can you reasonably draw from that?</li>
</ul>
<p>Share your spreadsheet with me ([email protected]) so I can see your pivot table.</p>
<p><strong>Data Presentation <em>Feb. 8</em></strong>
Cat Schuknecht & James Steinbauer</p>
</div> <!-- /.container all -->
<script src="https://code.jquery.com/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
</body>
</html>