-
Notifications
You must be signed in to change notification settings - Fork 701
/
Copy pathvisualizing_proportions.Rmd
563 lines (474 loc) · 30.5 KB
/
visualizing_proportions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(tidyr)
library(forcats)
library(ggforce)
library(ggridges)
main_size = 14 / .pt
small_rel <- 12/14
small_size <- small_rel * main_size
```
# Visualizing proportions {#visualizing-proportions}
We often want to show how some group, entity, or amount breaks down into individual pieces that each represent a *proportion* of the whole. Common examples include the proportions of men and women in a group of people, the percentages of people voting for different political parties in an election, or the market shares of companies. The archetypal such visualization is the pie chart, omnipresent in any business presentation and much maligned among data scientists. As we will see, visualizing proportions can be challenging, in particular when the whole is broken into many different pieces or when we want to see changes in proportions over time or across conditions. There is no single ideal visualization that always works. To illustrate this issue, I discuss a few different scenarios that each call for a different type of visualization.
```{block type='rmdtip', echo=TRUE}
Remember: You always need to pick the visualization that best fits your specific dataset and that highlights the key data features you want to show.
```
## A case for pie charts
From 1961 to 1983, the German parliament (called the *Bundestag*) was composed of members of three different parties, CDU/CSU, SPD, and FDP. During most of this time, CDU/CSU and SPD had approximately comparable numbers of seats, while the FDP typically held only a small fraction of seats. For example, in the 8th Bundestag, from 1976--1980, the CDU/CSU held 243 seats, SPD 214, and FDP 39, for a total of 496. Such parliamentary data is most commonly visualized as a pie chart (Figure \@ref(fig:bundestag-pie)).
(ref:bundestag-pie) Party composition of the 8th German Bundestag, 1976--1980, visualized as a pie chart. This visualization shows clearly that the ruling coalition of SPD and FDP had a small majority over the opposition CDU/CSU.
```{r, bundestag-pie, fig.cap='(ref:bundestag-pie)'}
# calculate the start and end angles for each pie
bund_pie <- bundestag %>%
arrange(seats) %>%
mutate(
seat_total = sum(seats),
end_angle = 2*pi*cumsum(seats)/seat_total, # ending angle for each pie slice
start_angle = lag(end_angle, default = 0), # starting angle for each pie slice
mid_angle = 0.5*(start_angle + end_angle), # middle of each pie slice, for the text label
hjust = ifelse(mid_angle>pi, 1, 0),
vjust = ifelse(mid_angle<pi/2 | mid_angle>3*pi/2, 0, 1)
)
rpie = 1
rlabel = 1.05 * rpie
ggplot(bund_pie) +
geom_arc_bar(
aes(
x0 = 0, y0 = 0, r0 = 0, r = rpie,
start = start_angle, end = end_angle, fill = party
),
color = "white", size = 0.5
) +
geom_text(
aes(
x = rlabel*sin(mid_angle),
y = rlabel*cos(mid_angle),
label = party,
hjust = hjust, vjust = vjust
),
family = dviz_font_family, size = main_size
) +
geom_text(
aes(
x = 0.6*sin(mid_angle),
y = 0.6*cos(mid_angle),
label = seats
),
family = dviz_font_family, size = main_size,
color = c("black", "white", "white")
) +
coord_fixed(clip = "off") +
scale_x_continuous(
limits = c(-1.5, 1.5),
expand = c(0, 0),
name = "",
breaks = NULL,
labels = NULL
) +
scale_y_continuous(
limits = c(-1.01, 1.15),
expand = c(0, 0),
name = "",
breaks = NULL,
labels = NULL
) +
scale_fill_manual(
values = bund_pie$colors[order(bund_pie$party)]
) +
theme_dviz_map() +
theme(
legend.position = "none",
plot.margin = margin(3.5, 1.5, 3.5, 1.5)
)
```
A pie chart breaks a circle into slices such that the area of each slice is proportional to the fraction of the total it represents. The same procedure can be performed on a rectangle, and the result is a stacked bar chart (Figure \@ref(fig:bundestag-stacked-bars)). Depending on whether we slice the bar vertically or horizontally, we obtain vertically stacked bars (Figure \@ref(fig:bundestag-stacked-bars)a) or horizontally stacked bars (Figure \@ref(fig:bundestag-stacked-bars)b).
(ref:bundestag-stacked-bars) Party composition of the 8th German Bundestag, 1976--1980, visualized as stacked bars. (a) Bars stacked vertically. (b) Bars stacked horizontally. It is not immediately obvious that SPD and FDP jointly had more seats than CDU/CSU.
```{r, bundestag-stacked-bars, fig.width = 5.5*6/4.2, fig.cap='(ref:bundestag-stacked-bars)'}
bundestag <- mutate(bundestag,
label_y = cumsum(seats) - seats/2)
bt_bars_stacked_base <- ggplot(bundestag, aes(x = 1, y = seats, fill = factor(party, levels = rev(party)))) +
geom_col(position = "stack", color = "white") +
geom_text(
aes(x = 1., y = label_y, label = seats),
size = main_size, family = dviz_font_family,
color = c("white", "white", "black")
) +
scale_y_continuous(expand = c(0, 0)) +
scale_x_discrete(expand = c(0, 0), breaks = NULL, name = NULL) +
scale_fill_manual(values = rev(bundestag$colors), guide = "none")
bt_bars_yax <- axis_canvas(bt_bars_stacked_base, axis = "y") +
geom_text(
data = bundestag,
aes(x = 0.06, y = label_y, label = party),
hjust = 0, vjust = 0.5, size = small_size,
family = dviz_font_family
) +
scale_x_continuous(expand = c(0, 0), limits = c(0, 1))
bt_bars_stacked <- insert_yaxis_grob(
bt_bars_stacked_base +
theme_dviz_hgrid() +
theme(
axis.ticks = element_line(color = "gray70"),
plot.margin = margin(7, 1.5, 7, 1.5)
),
bt_bars_yax, grid::unit(.5, "null"))
bt_bars_xax <- axis_canvas(bt_bars_stacked_base, axis = "y") +
geom_text(
data = bundestag,
aes(x = 0., y = label_y, label = party, hjust = 0.5, vjust = 0, size = small_size),
family = dviz_font_family
) +
scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
coord_flip()
bt_bars_hstacked <- insert_xaxis_grob(
bt_bars_stacked_base + coord_flip() +
scale_y_continuous(expand = c(0, 0), position = "right") +
theme_dviz_vgrid() +
theme(
axis.ticks = element_line(color = "gray70"),
plot.margin = margin(3, 1.5, 3, 3)
),
bt_bars_xax, grid::unit(14, "pt"), position = "bottom")
plot_grid(
bt_bars_stacked,
plot_grid(
NULL, bt_bars_hstacked, NULL,
ncol = 1, rel_heights = c(1, 6, 7.5)
),
rel_widths = c(4, 7), labels = "auto"
)
```
We can also take the bars from Figure \@ref(fig:bundestag-stacked-bars)a and place them side-by-side rather than stacking them on top of each other. This visualization makes it easier to perform a direct comparison of the three groups, though it obscures other aspects of the data (Figure \@ref(fig:bundestag-bars-side-by-side)). Most importantly, in a side-by-side bar plot the relationship of each bar to the total is not visually obvious.
(ref:bundestag-bars-side-by-side) Party composition of the 8th German Bundestag, 1976--1980, visualized as side-by-side bars. As in Figure \@ref(fig:bundestag-stacked-bars), it is not immediately obvious that SPD and FDP jointly had more seats than CDU/CSU.
```{r bundestag-bars-side-by-side, fig.width = 5, fig.asp = 3/4, fig.cap='(ref:bundestag-bars-side-by-side)'}
bt_bars <- ggplot(bundestag, aes(x = factor(party, levels = bundestag$party), y = seats, fill = party)) +
geom_col() +
geom_text(aes(label = seats), size = main_size, vjust = 2, color = c("white", "white", "black")) +
scale_x_discrete(name = NULL) +
scale_y_continuous(expand = c(0, 0)) +
scale_fill_manual(values = bundestag$colors[order(bundestag$party)], guide = "none") +
#geom_hline(yintercept = c(50, 100, 150, 200), color = "#ffffff70", size = .5) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
axis.ticks.x = element_blank()
)
bt_bars
```
Many authors categorically reject pie charts and argue in favor of side-by-side or stacked bars. Others defend the use of pie charts in some applications. My own opinion is that none of these visualizations is consistently superior over any other. Depending on the features of the dataset and the specific story you want to tell, you may want to favor one or the other approach. In the case of the 8th German Bundestag, I think that a pie chart is the best option. It shows clearly that the ruling coalition of SPD and FDP jointly had a small majority over the CDU/CSU (Figure \@ref(fig:bundestag-pie)). This fact is not visually obvious in any of the other plots (Figures \@ref(fig:bundestag-stacked-bars) and \@ref(fig:bundestag-bars-side-by-side)).
In general, pie charts work well when the goal is to emphasize simple fractions, such as one-half, one-third, or one-quarter. They also work well when we have very small datasets. A single pie chart, as in Figure \@ref(fig:bundestag-pie), looks just fine, but a single column of stacked bars, as in Figure \@ref(fig:bundestag-stacked-bars)a, looks awkward. Stacked bars, on the other hand, can work for side-by-side comparisons of multiple conditions or in a time series, and side-by-side bars are preferred when we want to directly compare the individual fractions to each other. A summary of the various pros and cons of pie charts, stacked bars, and side-by-side bars is provided in Table \@ref(tab:pros-cons-pie-bar).
Table: (\#tab:pros-cons-pie-bar) Pros and cons of common approaches to visualizing proportions: pie charts, stacked bars, and side-by-side bars.
----------------------------------------------------------------------------------------
Pie chart Stacked bars Side-by-side bars
----------------------------- ------------------- ------------------- -------------------
Clearly visualizes the data ✔ ✔ ✖
as proportions of a whole
Allows easy visual comparison ✖ ✖ ✔
of the relative proportions
Visually emphasizes simple ✔ ✖ ✖
fractions, such as 1/2, 1/3,
1/4
Looks visually appealing ✔ ✖ ✔
even for very small datasets
Works well when the whole is ✖ ✖ ✔
broken into many pieces
Works well for the ✖ ✔ ✖
visualization of many sets of
proportions or time series
of proportions
----------------------------------------------------------------------------------------
## A case for side-by-side bars {#side-by-side-bars}
I will now demonstrate a case where pie charts fail. This example is modeled after a critique of pie charts originally posted on Wikipedia [@Schutz-piecharts]. Consider the hypothetical scenario of five companies, A, B, C, D, and E, who all have roughly comparable market share of approximately 20%. Our hypothetical dataset lists the market share of each company for three consecutive years. When we visualize this dataset with pie charts, it is difficult to see what exactly is going on (Figure \@ref(fig:marketshare-pies)). It appears that the market share of company A is growing and the one of company E is shrinking, but beyond this one observation we can't tell what's going on. In particular, it is unclear how exactly the market shares of the different companies compare within each year.
(ref:marketshare-pies) Market share of five hypothetical companies, A--E, for the years 2015--2017, visualized as pie charts. This visualization has two major problems: 1. A comparison of relative market share within years is nearly impossible. 2. Changes in market share across years are difficult to see.
```{r marketshare-pies, fig.width = 5*6/4.2, fig.asp = .35, fig.cap='(ref:marketshare-pies)'}
# calculate the start and end angles for each pie
market_pies_df <- marketshare %>%
group_by(year) %>%
mutate(total = sum(percent),
end_angle = 2*pi*cumsum(percent)/total, # ending angle for each pie slice
start_angle = lag(end_angle, default = 0), # starting angle for each pie slice
mid_angle = 0.5*(start_angle + end_angle), # middle of each pie slice, for the text label
hjust = ifelse(mid_angle>pi, 1, 0),
vjust = ifelse(mid_angle<pi/2 | mid_angle>3*pi/2, 0, 1))
rpie = 1
rlabel = 1.05 * rpie
market_pies <- ggplot(market_pies_df) +
geom_arc_bar(
aes(
x0 = 0, y0 = 0, r0 = 0, r = rpie,
start = start_angle, end = end_angle, fill = company
),
color = NA
) +
geom_text(
aes(x = rlabel*sin(mid_angle), y = rlabel*cos(mid_angle), label = company, hjust = hjust, vjust = vjust),
family = dviz_font_family,
size = small_size
) +
coord_fixed() +
facet_wrap(~year) +
scale_x_continuous(limits = c(-1.2, 1.2), expand = c(0, 0), name = NULL, breaks = NULL, labels = NULL) +
scale_y_continuous(limits = c(-1.2, 1.2), expand = c(0, 0), name = NULL, breaks = NULL, labels = NULL) +
scale_fill_OkabeIto(order = c(1:3, 5, 4)) +
guides(fill = "none") +
theme_dviz_open() +
theme(
axis.line = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.ticks.length = grid::unit(0, "pt"),
plot.margin = margin(7, 7, 0, 7),
legend.position = "none",
strip.background = element_blank(),
strip.text.x = element_text(size= 14, margin = margin(0, 0, 0.1, 0))
)
stamp_bad(market_pies)
```
The picture becomes a little clearer when we switch to stacked bars (Figure \@ref(fig:marketshare-stacked)). Now the trends of a growing market share for company A and a shrinking market share for company E are clearly visible. However, the relative market shares of the five companies within each year are still hard to compare. And it is difficult to compare the market shares of companies B, C, and D across years, because the bars are shifted relative to each other across years. This is a general problem of stacked-bar plots, and the main reason why I normally do not recommend this type of visualization.
(ref:marketshare-stacked) Market share of five hypothetical companies for the years 2015--2017, visualized as stacked bars. This visualization has two major problems: 1. A comparison of relative market shares within years is difficult. 2. Changes in market share across years are difficult to see for the middle companies B, C, and D, because the location of the bars changes across years.
```{r marketshare-stacked, fig.cap='(ref:marketshare-stacked)'}
stacked_bars <- ggplot(marketshare, aes(x = year, y = percent, fill = company)) +
geom_col(position = "stack") +
scale_y_continuous(
name = "market share",
labels = scales::percent_format(accuracy = 1, scale = 1),
expand = c(0, 0)
) +
scale_fill_OkabeIto(order = c(1:3, 5, 4)) +
theme_dviz_open() +
theme(plot.margin = margin(14, 7, 3, 1.5))
stamp_bad(stacked_bars)
```
For this hypothetical data set, side-by-side bars are the best choice (Figure \@ref(fig:marketshare-side-by-side)). This visualization highlights that both companies A and B have increased their market share from 2015 to 2017 while both companies D and E have reduced theirs. It also shows that market shares increase sequentially from company A to E in 2015 and similarly decrease in 2017.
(ref:marketshare-side-by-side) Market share of five hypothetical companies for the years 2015--2017, visualized as side-by-side bars.
```{r marketshare-side-by-side, fig.cap='(ref:marketshare-side-by-side)'}
ggplot(marketshare, aes(x = company, y = percent, fill = company)) +
geom_col() +
facet_wrap(~year) +
scale_y_continuous(
name = "market share",
labels = scales::percent_format(accuracy = 1, scale = 1),
expand = c(0, 0)
) +
scale_fill_OkabeIto(order = c(1:3, 5, 4), guide = "none") +
theme_dviz_open() +
theme(strip.background = element_blank())
```
## A case for stacked bars and stacked densities {#stacked-densities}
In Section \@ref(side-by-side-bars), I wrote that I don't normally recommend sequences of stacked bars, because the location of the internal bars shifts along the sequence. However, the problem of shifting internal bars disappears if there are only two bars in each stack, and in those cases the resulting visualization can be quite clear. As an example, consider the proportion of women in a country's national parliament. We will specifically look at the African country Rwanda, which as of 2016 tops the list of countries with the highest proportion of female parliament members. Rwanda has had a majority female parliament since 2008, and since 2013 nearly two-thirds of its members of parliament are female. To visualize how the proportion of women in the Rwandan parliament has changed over time, we can draw a sequence of stacked bar graphs (Figure \@ref(fig:women-parliament)). This figure provides an immediate visual representation of the changing proportions over time. To help the reader see exactly when the majority turned female, I have added a dashed horizontal line at 50%. Without this line, it would be near impossible to determine whether from 2003 to 2007 the majority was male or female. I have not added similar lines at 25% and 75%, to avoid making the figure too cluttered.
(ref:women-parliament) Change in the gender composition of the Rwandan parliament over time, 1997 to 2016. Data source: Inter-Parliamentary Union (IPU), ipu.org.
```{r women-parliament, fig.width = 6, fig.asp = .55, fig.cap = '(ref:women-parliament)'}
ccode = "RWA" # Rwanda
#ccode = "BEL" # Belgium
#ccode = "ARB" # Arab world
#ccode = "BOL" # Bolivia
#ccode = "EUU" # European Union
women_parliaments %>% filter(country_code == ccode & year > 1990) %>%
mutate(women = perc_women, men = 100 - perc_women) %>%
select(-perc_women) %>%
gather(gender, percent, women, men) %>%
mutate(gender = factor(gender, levels = c("women", "men"))) -> women_rwanda
plot_base <- ggplot(women_rwanda, aes(x = year, y = percent, fill = gender)) +
#geom_col(position = "stack", width = .9, color = "white") +
geom_col(position = "stack", width = 1, color = "#FFFFFF", size = .75, alpha = 0.8) +
geom_hline(
yintercept = c(50),
color = "#000000FF", size = 0.4, linetype = 2
#color = "#FFFFFFA0"
) +
geom_hline(yintercept = 100, color = "black") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(
name = "relative proportion",
labels = scales::percent_format(accuracy = 1, scale = 1),
expand = c(0, 0)
) +
scale_fill_manual(values = c("#D55E00E0", "#0072B2E0"), guide = "none") +
coord_cartesian(clip = "off") +
theme_dviz_open() +
theme(
#axis.ticks.y = element_blank(),
#axis.ticks.x = element_blank(),
#axis.line.x = element_blank(),
axis.line.y = element_blank(),
plot.margin = margin(14, 1.5, 3, 1.5)
)
# calculate label position
labels <- filter(women_rwanda, year == max(year)) %>%
mutate(pos = 100 - cumsum(percent) + 0.5*percent)
yax <- axis_canvas(plot_base, axis = "y") +
geom_text(data = labels, aes(y = pos, label = paste0(" ", gender)),
family = dviz_font_family,
x = 0, hjust = 0, size = 14/.pt)
ggdraw(insert_yaxis_grob(plot_base, yax, grid::unit(.15, "null")))
```
If we want to visualize how proportions change in response to a continuous variable, we can switch from stacked bars to stacked densities. Stacked densities can be thought of as the limiting case of infinitely many infinitely small stacked bars arranged side-by-side. The densities in stacked-density plots are typically obtained from kernel density estimation, as described in Chapter \@ref(histograms-density-plots), and I refer you to that chapter for a general discussion of the strengths and weaknesses of this method.
To give an example where stacked densities may be appropriate, consider the health status of people as a function of age. Age can be considered a continuous variable, and visualizing the data in this way works reasonably well (Figure \@ref(fig:health-vs-age)). Even though we have four health categories here, and I'm generally not a fan of stacking multiple conditions, as discussed above, I think in this case the figure is acceptable. We can see clearly that overall health declines as people age, and we can also see that despite this trend, over half of the population remain in good or excellent health until very old age.
(ref:health-vs-age) Health status by age, as reported by the general social survey (GSS).
```{r health-vs-age, fig.asp = .5, fig.cap='(ref:health-vs-age)'}
df_health <- select(happy, age, health) %>%
na.omit()
# color brewer 5-class PuBu
colors = c('#f1eef6', '#bdc9e1', '#74a9cf', '#2b8cbe', '#045a8d')[5:1]
p_health <- ggplot(df_health, aes(x = age, y = ..count.., fill = health, color = health)) +
geom_density(position = "fill") +
#geom_hline(yintercept = c(.25, .50, .75), color = "#FFFFFF60") +
scale_x_continuous(name = "age (years)", expand = c(0, 0)) +
scale_y_continuous(
expand = c(0, 0), name = "relative proportion",
labels = scales::percent
) +
scale_color_manual(values = colors) +
scale_fill_manual(values = colors) +
theme_dviz_open() +
theme(
#axis.ticks.y = element_blank(),
#axis.ticks.x = element_blank(),
axis.line.x = element_blank(),
axis.line.y = element_blank(),
plot.margin = margin(7, 7, 3, 1.5)
)
df_marital <- select(happy, age, marital) %>%
na.omit() %>%
filter(marital != "separated") %>% # remove separated to make plot simpler
mutate(marital = factor(marital, levels = c("widowed", "divorced", "married", "never married")))
p_marital <- ggplot(df_marital, aes(x = age, y = ..count.., fill = marital, color = marital)) +
geom_density(position = "fill") +
scale_x_continuous(name = "age (years)", expand = c(0, 0)) +
scale_y_continuous(
expand = c(0, 0), name = "relative proportion",
labels = scales::percent
) +
scale_color_manual(values = colors, name = "marital status") +
scale_fill_manual(values = colors, name = "marital status") +
theme_dviz_open() +
theme(
#axis.ticks.y = element_blank(),
axis.line.x = element_blank(),
axis.line.y = element_blank(),
plot.margin = margin(7, 7, 3, 1.5)
)
p_aligned <- align_plots(p_health, p_marital, align = 'v')
ggdraw(p_aligned[[1]])
```
Nevertheless, this figure has a major limitation: By visualizing the proportions of the four health conditions as percent of the total, the figure obscures that there are many more young people than old people in the dataset. Thus, even though the *percentage* of people reporting to be in good health remains approximately unchanged across ages spanning seven decades, the *absolute number* of people in good health declines as the total number of people at a given age declines. I will present a potential solution to this problem in the next section.
## Visualizing proportions separately as parts of the total
Side-by-side bars have the problem that they don't clearly visualize the size of the individual parts relative to the whole and stacked bars have the problem that the different bars cannot be compared easily because they have different baselines. We can resolve these two issues by making a separate plot for each part and in each plot showing the respective part relative to the whole. For the health dataset of Figure \@ref(fig:health-vs-age), this procedure results in Figure \@ref(fig:health-vs-age-facets). The overall age distribution in the dataset is shown as the shaded gray areas, and the age distributions for each health status are shown in blue. This figure highlights that in absolute terms, the number people with excellent or good health declines past ages 30--40, while the number of people with fair health remains approximately constant across all ages.
(ref:health-vs-age-facets) Health status by age, shown as proportion of the total number of people in the survey. The colored areas show the density estimates of the ages of people with the respective health status and the gray areas show the overall age distribution.
```{r health-vs-age-facets, fig.width = 5.5*6/4.2, fig.asp = 0.35, fig.cap='(ref:health-vs-age-facets)'}
ggplot(mutate(df_health, health = fct_rev(health)), aes(x = age, y = ..count..)) +
geom_density_line(data = select(df_health, -health), aes(fill = "all people surveyed "), color = "transparent") +
geom_density_line(aes(fill = "highlighted group"), color = "transparent") +
facet_wrap(~health, nrow = 1) +
scale_x_continuous(name = "age (years)", limits = c(15, 98), expand = c(0, 0)) +
scale_y_continuous(name = "count", expand = c(0, 0)) +
scale_fill_manual(
values = c("#b3b3b3a0", "#2b8cbed0"),
name = NULL,
guide = guide_legend(direction = "horizontal")
) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
strip.text = element_text(size = 14, margin = margin(0, 0, 0.2, 0, "cm")),
legend.position = "bottom",
legend.justification = "right",
legend.margin = margin(4.5, 0, 1.5, 0, "pt"),
legend.spacing.x = grid::unit(4.5, "pt"),
legend.spacing.y = grid::unit(0, "pt"),
legend.box.spacing = grid::unit(0, "cm")
)
```
To provide a second example, let's consider a different variable from the same survey: marital status. Marital status changes much more drastically with age than does health status, and a stacked densities plot of marital status vs age is not very illuminating (Figure \@ref(fig:marital-vs-age)).
(ref:marital-vs-age) Marital status by age, as reported by the general social survey (GSS). To simplify the figure, I have removed a small number of cases that report as separated. I have labeled this figure as "bad" because the frequency of people who have never been married or are widowed changes so drastically with age that the age distributions of married and divorced people are highly distorted and difficult to interpret.
```{r marital-vs-age, fig.asp = 0.5, fig.cap='(ref:marital-vs-age)'}
stamp_bad(p_aligned[[2]])
```
The same dataset visualized as partial densities is much clearer (Figure \@ref(fig:marital-vs-age-facets)). In particular, we see that the proportion of married people peaks around the late 30s, the proportion of divorced people peaks around the early 40s, and the proportion of widowed people peaks around the mid 70s.
(ref:marital-vs-age-facets) Marital status by age, shown as proportion of the total number of people in the survey. The colored areas show the density estimates of the ages of people with the respective marital status, and the gray areas show the overall age distribution.
```{r marital-vs-age-facets, fig.width = 5.5*6/4.2, fig.asp = 0.35, fig.cap='(ref:marital-vs-age-facets)'}
ggplot(mutate(df_marital, marital = fct_rev(marital)), aes(x = age, y = ..count..)) +
geom_density_line(data = select(df_marital, -marital), aes(fill = "all people surveyed "), color = "transparent") +
geom_density_line(aes(fill = "highlighted group"), color = "transparent") +
facet_wrap(~marital, nrow = 1) +
scale_x_continuous(name = "age (years)", limits = c(15, 98), expand = c(0, 0)) +
scale_y_continuous(name = "count", expand = c(0, 0)) +
scale_fill_manual(
values = c("#b3b3b3a0", "#2b8cbed0"),
name = NULL,
guide = guide_legend(direction = "horizontal")
) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
strip.text = element_text(size = 14, margin = margin(0, 0, 0.2, 0, "cm")),
legend.position = "bottom",
legend.justification = "right",
legend.margin = margin(4.5, 0, 1.5, 0, "pt"),
legend.spacing.x = grid::unit(4.5, "pt"),
legend.spacing.y = grid::unit(0, "pt"),
legend.box.spacing = grid::unit(0, "cm")
)
```
However, one downside of Figure \@ref(fig:marital-vs-age-facets) is that this representation doesn't make it easy to determine relative proportions at any given point in time. For example, if we wanted to know at what age more than 50% of all people surveyed are married, we could not easily tell from Figure \@ref(fig:marital-vs-age-facets). To answer this question, we can instead use the same type of display but show relative proportions instead of absolute counts along the *y* axis (Figure \@ref(fig:marital-vs-age-proportions)). Now we see that married people are in the majority starting in their late 20s, and widowed people are in the majority starting in their mid 70s.
(ref:marital-vs-age-proportions) Marital status by age, shown as proportion of the total number of people in the survey. The areas colored in blue show the percent of people at the given age with the respective status, and the areas colored in gray show the percent of people with all other marital statuses.
```{r marital-vs-age-proportions, fig.width = 5.5*6/4.2, fig.asp = 0.35, fig.cap='(ref:marital-vs-age-proportions)'}
df_marital2 <- rbind(
mutate(df_marital,
marital = as.character(fct_collapse(marital, `never married` = "never married", aother = c("married", "divorced", "widowed"))),
highlight = "never married"
),
mutate(df_marital,
marital = as.character(fct_collapse(marital, married = "married", aother = c("never married", "divorced", "widowed"))),
highlight = "married"
),
mutate(df_marital,
marital = as.character(fct_collapse(marital, divorced = "divorced", aother = c("never married", "married", "widowed"))),
highlight = "divorced"
),
mutate(df_marital,
marital = as.character(fct_collapse(marital, widowed = "widowed", aother = c("never married", "married", "divorced"))),
highlight = "widowed"
)
) %>%
mutate(
highlight = factor(highlight, levels = c("never married", "married", "divorced", "widowed"))
)
ggplot(df_marital2, aes(age)) +
annotate(geom = "rect", xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf, fill = "#b3b3b3a0", color = NA) +
geom_density_line(
aes(y = stat(count), fill = marital), color = "transparent", position = "fill"
) +
facet_wrap(~highlight, nrow = 1) +
scale_x_continuous(
name = "age (years)",
#limits = c(15, 98),
expand = c(0, 0)
) +
scale_y_continuous(name = "relative proportion", expand = c(0, 0), labels = scales::percent) +
scale_fill_manual(
values = c("transparent", "#2b8cbed0", "#2b8cbed0", "#2b8cbed0", "#2b8cbed0"),
name = NULL,
breaks = c("aother", "divorced"),
labels = c("all people surveyed ", "highlighted group"),
guide = guide_legend(
direction = "horizontal",
override.aes = list(fill = c("#bebebe", "#3590c0"))
)
) +
coord_cartesian(clip = "off") +
theme_dviz_hgrid() +
theme(
axis.line.x = element_blank(),
strip.text = element_text(size = 14, margin = margin(0, 0, 0.2, 0, "cm")),
legend.position = "bottom",
legend.justification = "right",
legend.margin = margin(4.5, 0, 1.5, 0, "pt"),
legend.spacing.x = grid::unit(4.5, "pt"),
legend.spacing.y = grid::unit(0, "pt"),
legend.box.spacing = grid::unit(0, "cm")
)
```