Google Ngrams are a popular way to show how the usage of specific terms has changed over time.
This package calls Google Ngrams from within R. It has two functions: ngram()
, which fetches the underlying data for a single Google Ngram, and ngram_group()
, which fetches and combines the underlying data for multiple Google Ngrams.
Significantly, ngram_group
makes it possible to chart joint frequencies in a way that is robust to subtle linguistic shifts over time.
The example below uses the base ngram()
function. It replicates the plot you get when you visit https://books.google.com/ngrams/.
library(ggplot2)
library(tidyr)
library(dplyr)
library(gbNgram)
# set query terms
main.terms <- c("Frankenstein", "Albert Einstein", "Sherlock Holmes")
# plot by term
df <- ngram(main.terms, smoothing=3)
df.plot <- df %>%
gather(term, frequency, -year)
p <- ggplot(df.plot, aes(year, frequency, colour=term)) + geom_line() +
ylab("Frequency") +
xlab("Year") +
ggtitle("Terms in Google Books, 1800-2000") +
theme_bw()
p
The real value of gbNgram is in ngram_group
.
Suppose you're a researcher interested in how discourse around Muslim authors has changed over time. Here's one graph you could create:
main.terms <- c("poet", "historian", "writer")
adjs <- c("Muslim", "Moslem")
# plot by term
df <- ngram_group(main.terms, adjs,
include.all.cases = TRUE,
include.plurals = TRUE,
yr.start = 1908, yr.end = 2008)
df.plot <- df %>%
gather(term, frequency, -year)
p <- ggplot(df.plot, aes(year, frequency, colour=term)) + geom_line() +
ylab("Frequency") +
xlab("Year") +
ggtitle("Terms in Google Books, 1908-2008") +
theme_bw()
p
The graph shows a steady rise and fall in descriptions of Muslim authors. Significantly, what the line for "Muslim poet" shows is actually a joint frequency rather than a single one. More specifically, it shows how often "Muslim poet", "muslim poet", "Moslem poet" and "moslem poet" appear in a given year, as well as the plural of each term.
What the above doesn't show though is how the word used to describe Muslims has changed over time. For that, try this:
# plot by qualifer
df <- ngram_group(main.terms, adjs, group.by=2,
include.plurals = TRUE,
include.all.cases = TRUE,
yr.start = 1958, yr.end = 2008)
df.plot <- df %>%
gather(term, frequency, -year)
p <- ggplot(df.plot, aes(year, frequency, colour=term)) + geom_line() +
ylab("Frequency") +
xlab("Year") +
ggtitle("Terms in Google Books, 1908-2008") +
theme_bw()
p
Here the line for "Muslim" shows how often "Muslim poet", "muslim poet", "Muslim historian", "muslim historian", "Muslim writer" or "muslim writer" appears in a given year -- as well, again, as the plural of each term.
As you can see, when describing literary figures, "Moslem" was used slightly more often than "Muslim" early in the 20th century.
Currently the only way to install gbNgram is via devtools
:
library(devtools)
devtools::install_github(chrismeserole/gbNgram)