Color-issues in hydrological publications

Authors: Michael Stoelzle, University of Freiburg, Germany and Lina Stein, University of Bristol, UK

Summary: The rainbow color map is scientifically incorrect and prevents people with color vision deficiency from reading visualizations correctly. Because the color gradients within the rainbow color map are perceptually non-uniform, the data representation is distorted, which can lead to misinterpretation of results and flaws in science communication. Here we present the data of a paper survey of 797 scientific publications in the journal Hydrology and Earth System Sciences. In the survey, all papers were classified according to color issues. Details about the data follow below.

Kaggle: A Kaggle notebook (https://www.kaggle.com/modche/rainbow-papersurvey-hydrology) is also available for loading and exploring the survey data.

Load data frame

#install.packages("tidyverse")
library(tidyverse)

#read data remotely from github
file <- 'https://raw.githubusercontent.com/modche/rainbow_hydrology/main/hess_papers_rainbow.txt'

df <- read_tsv(file)

# read file with base R
#df_alternative <- read.delim(file, sep = "\t")
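
Because col_code mixes digits ("0", "1", "2") with the text code "bw", it can help to fix the column types explicitly instead of relying on type guessing. The following is a minimal sketch, assuming readr 2.0 or later (which accepts literal data wrapped in I()); the two inline rows are a hypothetical stand-in for the remote file so that the example runs offline.

```r
library(readr)

# Hypothetical two-row stand-in for hess_papers_rainbow.txt (tab-separated)
toy_tsv <- I("year\tcol_code\tn_authors\n2005\t0\t2\n2010\t2\t11\n")

# Declare col_code as character; an all-digit column would otherwise be guessed as numeric
toy <- read_tsv(
    toy_tsv,
    col_types = cols(
        year      = col_double(),
        col_code  = col_character(),
        n_authors = col_double()
    )
)

str(toy$col_code)
```

The same col_types argument can be passed when reading the remote file, listing the remaining columns in the same way.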

1. Overview data variables of paper survey

year = year of publication (YYYY)
date = date (YYYY-MM-DD) of publication
title = full paper title from journal website
authors = list of authors comma-separated
n_authors = number of authors (integer between 1 and 27)
col_code = color-issue classification (see below)
volume = Journal volume
start_page = first page of paper (consecutive)
end_page = last page of paper (consecutive)
base_url = base url to access the PDF of the paper with /volume/start_page/year/
filename = specific file name of the paper PDF (e.g. hess-9-111-2005.pdf)
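
Judging from the example values, base_url and filename appear to be derived from volume, start_page and year. A small base R sketch can check this on three rows copied from the str(df) output; the name meta and the two check columns are illustrative, and whether the pattern holds for all 797 rows is an assumption to verify on the full data.

```r
# Three rows copied from the str(df) output; `meta` is an illustrative name
meta <- data.frame(
    year       = c(2005, 2005, 2005),
    volume     = c(9, 9, 9),
    start_page = c(3, 15, 29),
    base_url   = c("https://hess.copernicus.org/articles/9/3/2005/",
                   "https://hess.copernicus.org/articles/9/15/2005/",
                   "https://hess.copernicus.org/articles/9/29/2005/"),
    filename   = c("hess-9-3-2005.pdf", "hess-9-15-2005.pdf", "hess-9-29-2005.pdf")
)

# Rebuild both fields from volume, start_page and year
meta$url_check <- paste0("https://hess.copernicus.org/articles/",
                         meta$volume, "/", meta$start_page, "/", meta$year, "/")
meta$file_check <- paste0("hess-", meta$volume, "-", meta$start_page, "-",
                          meta$year, ".pdf")

# TRUE if the rebuilt values match the stored ones
all(meta$url_check == meta$base_url)
all(meta$file_check == meta$filename)
```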

str(df)

## spec_tbl_df [797 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year      : num [1:797] 2005 2005 2005 2005 2005 ...
##  $ date      : Date[1:797], format: "2005-05-09" "2005-06-09" ...
##  $ title     : chr [1:797] "Bringing it all together" "Consumptive water use to feed humanity - curing a blind spot" "Significance of spatial variability in precipitation for process-oriented modelling: results from two nested ca"| __truncated__ "Impact of phosphorus control measures on in-river phosphorus retention associated with point source pollution" ...
##  $ authors   : chr [1:797] "J. C. I. Dooge" "M. Falkenmark and M. Lannerstad" "D. Tetzlaff and S. Uhlenbrook" "B. O. L. Demars, D. M. Harper, J.-A. Pitt, and R. Slaughter" ...
##  $ n_authors : num [1:797] 1 2 2 4 12 7 2 4 2 2 ...
##  $ col_code  : chr [1:797] "bw" "0" "bw" "bw" ...
##  $ volume    : num [1:797] 9 9 9 9 9 9 9 9 9 9 ...
##  $ start_page: num [1:797] 3 15 29 43 57 67 81 95 111 127 ...
##  $ end_page  : num [1:797] 14 28 41 55 66 80 94 109 126 137 ...
##  $ base_url  : chr [1:797] "https://hess.copernicus.org/articles/9/3/2005/" "https://hess.copernicus.org/articles/9/15/2005/" "https://hess.copernicus.org/articles/9/29/2005/" "https://hess.copernicus.org/articles/9/43/2005/" ...
##  $ filename  : chr [1:797] "hess-9-3-2005.pdf" "hess-9-15-2005.pdf" "hess-9-29-2005.pdf" "hess-9-43-2005.pdf" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   date = col_date(format = ""),
##   ..   title = col_character(),
##   ..   authors = col_character(),
##   ..   n_authors = col_double(),
##   ..   col_code = col_character(),
##   ..   volume = col_double(),
##   ..   start_page = col_double(),
##   ..   end_page = col_double(),
##   ..   base_url = col_character(),
##   ..   filename = col_character()
##   .. )

head(df)

## # A tibble: 6 x 11
##    year date       title           authors         n_authors col_code volume
##   <dbl> <date>     <chr>           <chr>               <dbl> <chr>     <dbl>
## 1  2005 2005-05-09 Bringing it al… J. C. I. Dooge          1 bw            9
## 2  2005 2005-06-09 Consumptive wa… M. Falkenmark …         2 0             9
## 3  2005 2005-06-09 Significance o… D. Tetzlaff an…         2 bw            9
## 4  2005 2005-06-14 Impact of phos… B. O. L. Demar…         4 bw            9
## 5  2005 2005-06-14 Biogeochemistr… V. R. Shevchen…        12 bw            9
## 6  2005 2005-06-14 Factors influe… J. Pempkowiak,…         7 0             9
## # … with 4 more variables: start_page <dbl>, end_page <dbl>, base_url <chr>,
## #   filename <chr>

tail(df)

## # A tibble: 6 x 11
##    year date       title           authors         n_authors col_code volume
##   <dbl> <date>     <chr>           <chr>               <dbl> <chr>     <dbl>
## 1  2020 2020-10-23 Hierarchical s… Haifan Liu, He…        11 2            24
## 2  2020 2020-10-26 3D multiple-po… Valentin Dall'…         6 2            24
## 3  2020 2020-10-28 Averaging over… Elham Rouholah…         3 0            24
## 4  2020 2020-10-28 Anthropogenic … Alex Zavarsky …         2 0            24
## 5  2020 2020-10-29 Dynamic mechan… Jianrong Zhu, …         6 2            24
## 6  2020 2020-10-30 Hydrodynamic a… Xintong Li, Bi…         9 2            24
## # … with 4 more variables: start_page <dbl>, end_page <dbl>, base_url <chr>,
## #   filename <chr>

skimr::skim(df)


Data summary

Name	df
Number of rows	797
Number of columns	11
Column type frequency:
character	5
Date	1
numeric	5
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
title	1	24	255	797
authors	1	6	534	790
col_code	1	1	2	4
base_url	1	46	50	797
filename	1	17	21	797

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2005-05-09	2020-10-30	2015-06-08	455

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2014.77	4.58	2005	2010	2015	2020	2020	▂▅▁▇▇
n_authors	1	4.65	2.73	1	3	4	6	27	▇▂▁▁▁
volume	1	18.77	4.58	9	14	19	24	24	▂▅▁▇▇
start_page	1	2077.35	1422.75	1	815	1895	3189	5057	▇▆▆▃▃
end_page	1	2092.64	1423.50	14	829	1908	3209	5076	▇▆▆▃▃

2. Exploratory data analysis

Distribution of surveyed papers in 2005, 2010, 2015 and 2020.

df %>% count(year)

## # A tibble: 4 x 2
##    year     n
## * <dbl> <int>
## 1  2005    54
## 2  2010   191
## 3  2015   289
## 4  2020   263

Color classification is stored in the col_code variable with:

0 = chromatic and issue-free,
1 = red-green issues,
2 = rainbow issues and
bw = black and white paper.
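
For tables and plots it may be convenient to attach readable labels to the codes. The sketch below uses a named lookup vector; code_labels and col_label are illustrative names that are not part of the data set, and the five-row tibble is a toy stand-in for df.

```r
library(dplyr)

# Labels taken from the classification list; `code_labels` is an illustrative name
code_labels <- c(
    "0"  = "chromatic and issue-free",
    "1"  = "red-green issues",
    "2"  = "rainbow issues",
    "bw" = "black and white paper"
)

# Toy stand-in for df with only the col_code column
toy <- tibble(col_code = c("bw", "0", "2", "1", "2"))

# Look up the label for each code and keep the order of the list as factor levels
toy_labeled <- toy %>%
    mutate(col_label = factor(unname(code_labels[col_code]),
                              levels = unname(code_labels)))

toy_labeled %>% count(col_label)
```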

df %>% 
    count(col_code) %>% 
    mutate(pct = n / sum(n))

## # A tibble: 4 x 3
##   col_code     n    pct
## * <chr>    <int>  <dbl>
## 1 0          377 0.473 
## 2 1          159 0.199 
## 3 2          190 0.238 
## 4 bw          71 0.0891

Focus on color classification in 2020:

df %>% 
    group_by(year) %>% 
    count(col_code) %>% 
    mutate(pct = n / sum(n)) %>% 
    filter(year == 2020) %>% 
    ungroup()

## # A tibble: 4 x 4
##    year col_code     n     pct
##   <dbl> <chr>    <int>   <dbl>
## 1  2020 0          139 0.529  
## 2  2020 1           60 0.228  
## 3  2020 2           62 0.236  
## 4  2020 bw           2 0.00760

df %>% 
    group_by(year) %>% 
    count(col_code) %>% 
    mutate(pct = n / sum(n)) %>% 
    ungroup() %>% 
    emphatic::hl('purple', rows = col_code == 2)

     year col_code   n         pct
1    2005        0  19 0.351851852
2    2005        1   5 0.092592593
3    2005        2   6 0.111111111
4    2005       bw  24 0.444444444
5    2010        0  79 0.413612565
6    2010        1  31 0.162303665
7    2010        2  47 0.246073298
8    2010       bw  34 0.178010471
9    2015        0 140 0.484429066
10   2015        1  63 0.217993080
11   2015        2  75 0.259515571
12   2015       bw  11 0.038062284
13   2020        0 139 0.528517110
14   2020        1  60 0.228136882
15   2020        2  62 0.235741445
16   2020       bw   2 0.007604563
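
Since the share of black and white papers drops sharply over the years, one possible follow-up is to express the rainbow share relative to papers with color figures only. The sketch below is one way to do this and is not part of the original analysis; it re-enters the counts from the table above, and the names counts, rainbow_share and pct_rainbow_color are illustrative.

```r
library(dplyr)

# Counts copied from the per-year table above
counts <- tibble(
    year     = rep(c(2005, 2010, 2015, 2020), each = 4),
    col_code = rep(c("0", "1", "2", "bw"), times = 4),
    n        = c(19, 5, 6, 24,
                 79, 31, 47, 34,
                 140, 63, 75, 11,
                 139, 60, 62, 2)
)

# Drop black and white papers, then compute the rainbow share per year
rainbow_share <- counts %>%
    filter(col_code != "bw") %>%
    group_by(year) %>%
    summarise(pct_rainbow_color = sum(n[col_code == "2"]) / sum(n)) %>%
    ungroup()

rainbow_share
```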

Figure showing number of authors across color classification.

3. Access PDF papers with download links

The data frame can be filtered to extract a vector of download links for specific papers.

Example: Access a specific paper from 2005:

df %>% filter(year == 2005, start_page == 111) %>% 
    select(base_url, filename) %>% 
    mutate(download_link = paste0(base_url, filename)) %>% 
    pull(download_link)

## [1] "https://hess.copernicus.org/articles/9/111/2005/hess-9-111-2005.pdf"

Example: Single-author papers from 2005 that are pure black and white papers:

df %>% filter(year == 2005, col_code == "bw", n_authors == 1) %>% 
    select(base_url, filename) %>% 
    mutate(download_link = paste0(base_url, filename)) %>% 
    pull(download_link)

## [1] "https://hess.copernicus.org/articles/9/3/2005/hess-9-3-2005.pdf"    
## [2] "https://hess.copernicus.org/articles/9/481/2005/hess-9-481-2005.pdf"
## [3] "https://hess.copernicus.org/articles/9/645/2005/hess-9-645-2005.pdf"
## [4] "https://hess.copernicus.org/articles/9/675/2005/hess-9-675-2005.pdf"

Example: Rainbow papers from 2020 with more than 10 authors:

df %>% filter(year == 2020, col_code == 2, n_authors > 10) %>% 
    select(base_url, filename) %>% 
    mutate(download_link = paste0(base_url, filename)) %>% 
    pull(download_link)

## [1] "https://hess.copernicus.org/articles/24/633/2020/hess-24-633-2020.pdf"  
## [2] "https://hess.copernicus.org/articles/24/697/2020/hess-24-697-2020.pdf"  
## [3] "https://hess.copernicus.org/articles/24/1485/2020/hess-24-1485-2020.pdf"
## [4] "https://hess.copernicus.org/articles/24/3361/2020/hess-24-3361-2020.pdf"
## [5] "https://hess.copernicus.org/articles/24/4291/2020/hess-24-4291-2020.pdf"
## [6] "https://hess.copernicus.org/articles/24/4971/2020/hess-24-4971-2020.pdf"
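
The three examples repeat the same select, mutate and pull steps, so they could be wrapped in a small helper. The function paper_links below is a hypothetical convenience wrapper, not part of the repository; the toy tibble holds two rows copied from the outputs above.

```r
library(dplyr)

# Hypothetical helper: build the PDF link for every row of a (filtered) data frame
paper_links <- function(data) {
    data %>%
        mutate(download_link = paste0(base_url, filename)) %>%
        pull(download_link)
}

# Two rows copied from the outputs above
toy <- tibble(
    year       = c(2005, 2005),
    start_page = c(3, 111),
    base_url   = c("https://hess.copernicus.org/articles/9/3/2005/",
                   "https://hess.copernicus.org/articles/9/111/2005/"),
    filename   = c("hess-9-3-2005.pdf", "hess-9-111-2005.pdf")
)

links <- toy %>%
    filter(year == 2005, start_page == 111) %>%
    paper_links()
links

# Optional download (needs network access), one file per link:
# for (u in links) download.file(u, destfile = basename(u), mode = "wb")
```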

4. Potential analyses with paper survey data:

Some example code snippets:

df %>% filter(str_detect(string = authors, pattern = "Weiler"))

## # A tibble: 8 x 11
##    year date       title           authors         n_authors col_code volume
##   <dbl> <date>     <chr>           <chr>               <dbl> <chr>     <dbl>
## 1  2010 2010-07-02 Effect of the … C. Gascuel-Odo…         3 1            14
## 2  2010 2010-08-04 Explicit simul… S. Stoll and M…         2 0            14
## 3  2010 2010-08-13 Integrated res… M. C. Roa-Garc…         2 bw           14
## 4  2015 2015-03-12 Quantifying se… M. Staudinger,…         3 0            19
## 5  2015 2015-06-03 Estimating flo… M. Sprenger, T…         4 0            19
## 6  2020 2020-02-25 Beyond binary … Michael Stoelz…         5 0            24
## 7  2020 2020-05-25 Soil moisture:… Mirko Mälicke,…         5 0            24
## 8  2020 2020-06-25 Field observat… Anne Hartmann,…         4 0            24
## # … with 4 more variables: start_page <dbl>, end_page <dbl>, base_url <chr>,
## #   filename <chr>

df %>% filter(str_detect(string = title, pattern = "radar"))

## # A tibble: 12 x 11
##     year date       title           authors         n_authors col_code volume
##    <dbl> <date>     <chr>           <chr>               <dbl> <chr>     <dbl>
##  1  2005 2005-06-09 Significance o… D. Tetzlaff an…         2 bw            9
##  2  2010 2010-01-21 Characteristic… M. Barnolas, T…         3 1            14
##  3  2010 2010-02-05 Relating surfa… H. Stephen, S.…         4 2            14
##  4  2010 2010-02-05 Performance of… C. Z. van de B…         5 2            14
##  5  2015 2015-01-19 Satellite rada… Y. B. Sulistio…         8 1            19
##  6  2015 2015-03-02 Quantitative h… P. Klenk, S. J…         3 0            19
##  7  2015 2015-03-02 Polarimetric r… M. Frech and J…         2 2            19
##  8  2015 2015-03-25 Scoping a fiel… Y. Duan, A. M.…         3 2            19
##  9  2015 2015-04-29 Evaluation of … O. P. Prat and…         2 2            19
## 10  2015 2015-09-29 Singularity-se… L.-P. Wang, S.…         4 2            19
## 11  2020 2020-03-24 Reconstructing… Nicolás Velásq…         4 0            24
## 12  2020 2020-06-19 The accuracy o… Marc Schleiss,…        10 2            24
## # … with 4 more variables: start_page <dbl>, end_page <dbl>, base_url <chr>,
## #   filename <chr>

df %>% filter(n_authors >= 7, col_code == 2)

## # A tibble: 43 x 11
##     year date       title           authors         n_authors col_code volume
##    <dbl> <date>     <chr>           <chr>               <dbl> <chr>     <dbl>
##  1  2010 2010-01-22 Soil moisture … C. Gruhier, P.…        11 2            14
##  2  2010 2010-02-22 A contribution… E. Alcântara, …         7 2            14
##  3  2010 2010-05-28 Modelling soil… S. Juglea, Y. …        10 2            14
##  4  2010 2010-06-24 A quality asse… T. Graeff, E. …         8 2            14
##  5  2010 2010-08-24 A past dischar… G. Thirel, E. …         7 2            14
##  6  2010 2010-09-09 Combined use o… D. Courault, R…         7 2            14
##  7  2010 2010-10-11 A multi basin … Z. M. Easton, …         9 2            14
##  8  2010 2010-10-21 Bayesian appro… H. Murakami, X…         8 2            14
##  9  2010 2010-12-06 Interannual va… F. Frappart, F…         8 2            14
## 10  2010 2010-12-16 Error characte… W. A. Dorigo, …         7 2            14
## # … with 33 more rows, and 4 more variables: start_page <dbl>, end_page <dbl>,
## #   base_url <chr>, filename <chr>

df %>% filter(end_page > start_page + 30)

## # A tibble: 4 x 11
##    year date       title           authors         n_authors col_code volume
##   <dbl> <date>     <chr>           <chr>               <dbl> <chr>     <dbl>
## 1  2015 2015-01-15 Hydrometeorolo… R. G. Knox, M.…         7 1            19
## 2  2020 2020-06-19 The accuracy o… Marc Schleiss,…        10 2            24
## 3  2020 2020-08-07 Revisiting the… Demetris Kouts…         1 1            24
## 4  2020 2020-08-25 Predicting dis… Adam Kiczko, K…         6 0            24
## # … with 4 more variables: start_page <dbl>, end_page <dbl>, base_url <chr>,
## #   filename <chr>
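
The last filter compares end_page with start_page + 30. If both the first and the last page are counted, this selects papers with more than 31 pages. An explicit page count can make such filters easier to read; the sketch below uses three rows copied from the str(df) output, and n_pages is an illustrative column name.

```r
library(dplyr)

# Three rows copied from the str(df) output
toy <- tibble(
    start_page = c(3, 15, 29),
    end_page   = c(14, 28, 41)
)

# Page count including both the first and the last page
toy_pages <- toy %>%
    mutate(n_pages = end_page - start_page + 1)

toy_pages

# On the full data, filter(n_pages > 31) is equivalent to end_page > start_page + 30
```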

5. Text mining

A code example to get started with text mining, e.g. extracting common words from paper titles.

library(tidytext)

df %>% 
    unnest_tokens(word, title) %>% 
    select(col_code, word, n_authors) %>% 
    mutate(word_len = str_length(word)) %>% 
    filter(word_len >= 5) %>% 
    group_by(word) %>% 
    add_count() %>% 
    ungroup()

## # A tibble: 7,621 x 5
##    col_code word         n_authors word_len     n
##    <chr>    <chr>            <dbl>    <int> <int>
##  1 bw       bringing             1        8     1
##  2 bw       together             1        8     1
##  3 0        consumptive          2       11     1
##  4 0        water                2        5   161
##  5 0        humanity             2        8     1
##  6 0        curing               2        6     1
##  7 0        blind                2        5     1
##  8 bw       significance         2       12     2
##  9 bw       spatial              2        7    31
## 10 bw       variability          2       11    37
## # … with 7,611 more rows
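
As an alternative to filtering by word length, common function words can be removed with the stop_words data set that ships with tidytext, followed by a count per color code. This is a minimal sketch on two titles copied from the data; on the full data the same pipeline would start from df.

```r
library(dplyr)
library(tidytext)

# Two titles copied from head(df)
toy <- tibble(
    col_code = c("bw", "0"),
    title    = c("Bringing it all together",
                 "Consumptive water use to feed humanity - curing a blind spot")
)

word_counts <- toy %>%
    unnest_tokens(word, title) %>%             # one lower-case word per row
    anti_join(stop_words, by = "word") %>%     # drop words listed in stop_words
    count(col_code, word, sort = TRUE)         # word frequency per color code

word_counts
```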
