Store pipeline output in compressed CSVs and read/write them using data.table #308

Open
wants to merge 9 commits into
base: dev


nmdefries
Collaborator

Closes #262

Switch to csv.gz format instead of RDS. CSV gives us more flexibility in which R packages can read and write the files, and in which languages can work with them (e.g. if we want to rewrite the pipeline in Python but keep the dashboard in R).

`fread` only speeds up the dashboard slightly, but it opens the door to faster data.table processing in the future.

Reading and writing uncompressed CSVs is about 2x faster than using
data.table with compressed CSVs. So save the files the dashboard needs
to read as uncompressed CSV, and the predictions cards cache files as
compressed CSV.

`fread` can read from URLs directly by first downloading the object to
a temp file, but in testing, `s3read_using(fread...)` was faster than
`fread(...)`.
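
For reference, the two storage formats discussed above can be sketched as a local round trip. This is a minimal illustration, not the code in this PR: the `scores` table, its column names, and the temp file paths are hypothetical. It assumes a data.table version where `fwrite` supports the `compress` argument, and that the R.utils package is installed for `fread` to read `.gz` files.

```r
library(data.table)

# Hypothetical score table; the column names are illustrative only.
scores <- data.table(
  forecaster = c("a", "b"),
  ahead = c(1L, 2L),
  wis = c(0.5, 1.25)
)

plain_path <- tempfile(fileext = ".csv")
gz_path <- tempfile(fileext = ".csv.gz")

# Uncompressed CSV: the format used here for files the dashboard reads.
fwrite(scores, plain_path)

# Compressed CSV: with the default compress = "auto", fwrite gzips the
# output when the file name ends in ".gz". It is spelled out for clarity.
fwrite(scores, gz_path, compress = "gzip")

from_plain <- fread(plain_path)

# fread needs the R.utils package to decompress .gz files.
if (requireNamespace("R.utils", quietly = TRUE)) {
  from_gz <- fread(gz_path)
  # Both formats should round-trip to the same table.
  stopifnot(isTRUE(all.equal(from_plain, from_gz)))
}
```

Wrapping each `fread` call in `system.time()` is one way to compare the two read paths on real score files.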
@nmdefries nmdefries requested a review from dsweber2 September 22, 2023 21:22
@dsweber2
Collaborator

So if I'm reading this right, to test it I should run `make score_forecast` and `make build_dashboard_dev`?

@nmdefries
Collaborator Author

Run `make score_forecast` to test the pipeline and `make start_dashboard` to test the dashboard (`build_dashboard_dev` is run as a dependency).

`make score_forecast` depends on an image repo download; a workaround is given as:

# `docker_build/Dockerfile` is based on `ghcr.io/cmu-delphi/covidcast:latest`.
# Docker will try to fetch it from the image repository, which requires
# authentication. As a workaround, locally build a docker image
# from https://github.com/cmu-delphi/covidcast-docker/ using the `make build`
# target, and set `--pull=false` below.

That said, given the difficulties of testing this, I'll run the pipeline in GitHub Actions so that the score files are available in the S3 bucket (won't impact the public dashboard since the new extension makes the score file names different), and you can test the dashboard only. I'll let you know when that's done.

@dsweber2 dsweber2 added this to the website update milestone Dec 18, 2023
Development

Successfully merging this pull request may close these issues.

Save output as CSV, and read and write using data.table