Download only recently added or changed forecaster files #233
base: dev
Conversation
@krivard Please take a look at this when you have a chance for initial/broad feedback. API authentication works in a test workflow run. I believe output from this branch matches output from the current pipeline, but that comparison was a while ago, so I'll probably repeat it. We'd previously discussed setting up a manual option to have the pipeline download the full repo history (in case we need to regenerate); I'm not sure of the best way to do that. Currently the pipeline doesn't have a params file; the only option it takes is a (static) command-line arg. I can set up a couple extra arguments -- I was thinking
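For context on the authenticated API call mentioned above, a minimal sketch of fetching recent commit history from the forecast-hub repository via the GitHub REST API is below. The function name, the `data-processed` path filter, and the `GITHUB_TOKEN` variable name are assumptions for illustration, not the branch's actual implementation:

```r
library(httr)

# Hypothetical sketch: list commits touching the submissions directory since
# a given timestamp, authenticating with a token read from the environment.
list_recent_commits <- function(since) {
  resp <- GET(
    "https://api.github.com/repos/reichlab/covid19-forecast-hub/commits",
    query = list(path = "data-processed", since = since, per_page = 100),
    add_headers(Authorization = paste("token", Sys.getenv("GITHUB_TOKEN")))
  )
  stop_for_status(resp)
  # Parsed JSON: one list element per commit, each carrying its SHA and date
  content(resp, as = "parsed")
}
```

Authenticating raises the API rate limit from 60 to 5,000 requests per hour, which matters when paging through a large commit history.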
# is added that backfills forecast dates, we will end up requesting all those
# dates for forecasters we've already seen before. To prevent that, make a new
# call to `get_covidhub_predictions` for each forecaster with its own dates.
predictions_cards <- lapply(
Could easily parallelize this.
I think this is what you were asking but lmk if I'm confused: what you probably want to do is make two functions:
- `fetch_predictions_cards_updates()`, which uses the lapply per-forecaster `get_covidhub_predictions` call
- `fetch_predictions_cards_all()`, which uses the old all-forecasters all-dates `get_covidhub_predictions` call

then use an environment variable to figure out which one to use when setting `predictions_cards`.
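The dispatch suggested above could be sketched roughly as follows. The function bodies are placeholders, and the `EXHAUSTIVE_DOWNLOAD` variable name and the inputs (`forecaster_dates`, `all_forecasters`, `all_dates`) are hypothetical:

```r
# Incremental path: one get_covidhub_predictions() call per forecaster,
# each restricted to that forecaster's own new dates.
fetch_predictions_cards_updates <- function(forecaster_dates) {
  lapply(names(forecaster_dates), function(forecaster) {
    get_covidhub_predictions(forecaster, forecaster_dates[[forecaster]])
  })
}

# Exhaustive path: the old all-forecasters, all-dates call.
fetch_predictions_cards_all <- function(forecasters, dates) {
  get_covidhub_predictions(forecasters, dates)
}

# Choose a path via an environment variable (name hypothetical); the
# second argument to Sys.getenv() is the default when the variable is unset.
predictions_cards <- if (Sys.getenv("EXHAUSTIVE_DOWNLOAD", "FALSE") == "TRUE") {
  fetch_predictions_cards_all(all_forecasters, all_dates)
} else {
  fetch_predictions_cards_updates(forecaster_dates)
}
```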
I'd also recommend pulling all the github parsing stuff out into a different file if you can; there's a lot there.
& for invoking the different behaviors, you have options:
eg for (2)
Description

Instead of recreating `predictions_cards` from scratch every time the pipeline is run, download and process only files that have been added or modified since the last pipeline run. Added and modified files are selected based on commit history pulled from the Reich Lab repository via the GitHub REST API. The new files are joined onto the predictions card object from the previous run, which is downloaded from the S3 bucket, and deduplicated in case any past predictions were modified.
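The join-and-deduplicate step described above could look roughly like this; the column names in `distinct()` and the helper name are assumptions for illustration, not the branch's actual code:

```r
library(dplyr)

# Hypothetical sketch of the incremental update:
#   previous_cards - predictions card object downloaded from the S3 bucket
#   new_cards      - cards parsed from files added/changed since the last run
update_predictions_cards <- function(previous_cards, new_cards) {
  bind_rows(new_cards, previous_cards) %>%
    # distinct() keeps the first row within each duplicate group; listing
    # new_cards first means a modified past prediction wins over the old copy.
    distinct(forecaster, forecast_date, geo_value, target, quantile,
             .keep_all = TRUE)
}
```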
The pipeline maintains the ability to regenerate the entire submission history via a manual override. The script can take two command-line arguments, `exhaustive-download` and `exhaustive-scoring` (the latter does not currently change scoring behavior). Defaults are set in the `Makefile`.

Addresses this task.
Bonus: refactor of `Report/create_reports.R`.

Changes
- `Makefile`
- `s3_upload_ec2.yml` (self-hosted workflow)
- `Report/create_reports.R`
- `Report/fetch_data.R` (new)
- `Report/process_data.R` (new)