Store pipeline output in compressed CSVs and read/write them using `data.table` #308

nmdefries · 2023-09-22T21:22:17Z

Closes #262

Switch from RDS to csv.gz. CSV is more flexible both in which R packages we can use to read/write the files and in which languages we can use (e.g. if we want to rewrite the pipeline in Python but keep the dashboard in R).
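
As a rough illustration of the cross-language point, here is a minimal Python sketch that round-trips a gzip-compressed CSV using only the standard library. The column names, values, and file name are hypothetical and are not taken from the pipeline's actual score files.

```python
import csv
import gzip
import os
import tempfile


def write_csv_gz(path, fieldnames, rows):
    """Write a list of dicts to a gzip-compressed CSV file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_csv_gz(path):
    """Read a gzip-compressed CSV file into a list of dicts (all values are strings)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Hypothetical columns and rows, for illustration only.
FIELDS = ["forecaster", "target_end_date", "wis"]
ROWS = [
    {"forecaster": "example-model-a", "target_end_date": "2023-09-23", "wis": "12.5"},
    {"forecaster": "example-model-b", "target_end_date": "2023-09-23", "wis": "9.75"},
]

# Hypothetical file name in a throwaway temp directory.
path = os.path.join(tempfile.mkdtemp(), "example_scores.csv.gz")
write_csv_gz(path, FIELDS, ROWS)
round_tripped = read_csv_gz(path)
```

Note that the `csv` module returns every value as a string, so dates and numeric columns would need explicit casting on the Python side.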

`fread` speeds up the dashboard only slightly, but it opens the door to faster `data.table` processing in the future.

Reading and writing uncompressed CSVs is ~2x faster than using `data.table` with compressed CSVs, so files that the dashboard needs to read are saved as uncompressed CSV, and predictions cards cache files are saved as compressed CSV.

`fread` can read from URLs directly (it first downloads the object to a temp file), but in testing `s3read_using(fread...)` was faster than calling `fread(...)` on the URL.

dsweber2 · 2023-09-25T18:15:03Z

So if I'm reading this right, to test it I should run `make score_forecast` and `make build_dashboard_dev`?

nmdefries · 2023-09-25T18:33:06Z

`make score_forecast` to test the pipeline and `make start_dashboard` to test the dashboard (`build_dashboard_dev` is run as a dependency).

`make score_forecast` depends on downloading an image from the image repository; a workaround is given in this comment:

# `docker_build/Dockerfile` is based on `ghcr.io/cmu-delphi/covidcast:latest`.
# Docker will try to fetch it from the image repository, which requires
# authentication. As a workaround, locally build a docker image
# from https://github.com/cmu-delphi/covidcast-docker/ using the `make build`
# target, and set `--pull=false` below.

That said, given the difficulties of testing this, I'll run the pipeline in GitHub Actions so that the score files are available in the S3 bucket (won't impact the public dashboard since the new extension makes the score file names different), and you can test the dashboard only. I'll let you know when that's done.

nmdefries added 8 commits September 22, 2023 13:14

write output from pipeline with fwrite

663ac5c

use fread to read scores from aws in app

2155900

update list of files available and example download to use csv.gz

4f4caf3

update forecasts+actuals generation script to use csv

fe6dee1

change uploaded/downloaded files from .rds to csv.gz

2f7f163

install R.utils as dependency of data.table

4a724ec

cast auto IDate to base Date

fc2a781

styler

0f8886f

nmdefries requested a review from dsweber2 September 22, 2023 21:22

for checking for missing forecasters, read old scores with fread

a4e9e7e

dsweber2 added this to the website update milestone Dec 18, 2023
