Store pipeline output in compressed CSVs and read/write them using data.table
#308
base: dev
Conversation
Reading and writing uncompressed CSVs with data.table is roughly 2x faster than reading and writing compressed CSVs. So the files the dashboard needs to read are saved as uncompressed CSV, while the predictions cards cache files are saved as compressed CSV.
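The split between the two formats can be sketched as follows; the file paths and toy data here are hypothetical, not the actual pipeline output:

```r
library(data.table)

# Toy data standing in for pipeline output
dt <- data.table(x = 1:3, y = c("a", "b", "c"))

plain <- tempfile(fileext = ".csv")
gz    <- tempfile(fileext = ".csv.gz")

# Dashboard-facing files: uncompressed CSV for the fastest read/write
fwrite(dt, plain)

# Predictions cards cache: compressed CSV; fwrite gzips automatically
# when the path ends in .gz (data.table >= 1.12.4)
fwrite(dt, gz)

# fread handles both transparently (.gz reading requires the R.utils package)
stopifnot(identical(fread(plain), fread(gz)))
```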
`fread` can read from URLs directly by first downloading the object to a temp file, but in testing, `s3read_using(fread, ...)` was faster than `fread(...)` on the URL.
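A sketch of the two S3 read paths compared above, assuming the `aws.s3` package; the bucket and object names are placeholders, not the real pipeline locations:

```r
library(data.table)
library(aws.s3)

# Faster in testing: fetch the object via s3read_using and parse with fread
scores <- s3read_using(fread,
                       object = "score_cards.csv.gz",
                       bucket = "example-bucket")

# Slower alternative: fread downloads the URL to a temp file itself
scores2 <- fread("https://example-bucket.s3.amazonaws.com/score_cards.csv.gz")
```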
so if I'm reading this right, to test it, I should run
That said, given the difficulties of testing this, I'll run the pipeline in GitHub Actions so that the score files are available in the S3 bucket (this won't impact the public dashboard, since the new extension changes the score file names), and you can test just the dashboard. I'll let you know when that's done.
Closes #262
Switch to `csv.gz` format instead of using RDS. CSV is more flexible in which R packages we can use to read/write the files, and in which languages we can work with them (e.g. if we want to rewrite the pipeline in Python but keep the dashboard in R). `fread` only speeds up the dashboard slightly, but offers the opportunity to use faster `data.table` processing in the future.
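As a sketch of the kind of `data.table` processing that becomes available once `fread` is in place: `fread` returns a `data.table`, so grouped aggregations run without a conversion step. The column names below (`forecaster`, `score`) are hypothetical, not the actual pipeline schema:

```r
library(data.table)

# Hypothetical score data; the real files come from the pipeline's S3 bucket
scores <- data.table(
  forecaster = c("a", "a", "b", "b"),
  score      = c(1, 3, 2, 4)
)

# Fast by-group aggregation, e.g. mean score per forecaster
summary_dt <- scores[, .(mean_score = mean(score)), by = forecaster]
```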