Goal
Produce a pipeline to transform model outputs into datasets formatted for public consumption (website, data.cdc.gov, DCIPHER, others?).
Pipeline structure
Production outputs write to a dedicated container. Non-production ("experiment") model runs write to their own blob storage container.
Each model run's metadata has a run_at timestamp and a release_date datestamp. For downstream processing, later run_at timestamps supersede earlier ones for the same release_tag_date (the metadata is sketched below).
All versions of the outputs continue to exist in the container
Later jobs can run a subset of the jurisdictions and only supersede estimates for that subset
I don't love release_tag_date as a name, so please help.
This repository has a function that scans over metadata, extracts the final task run for each state, and saves the aggregated file as parquet. This file is the reference file for the production date and is used to generate the production files (map_data...., timeseries_data...).
We generate this file after we are done with runs for the production date (loosely analogous to a silver-layer file in an ETL pipeline?).
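As a point of reference for the discussion below, here is a minimal sketch of the metadata each run is assumed to carry; the field names (run_at, release_date, jurisdiction, blob_path) are assumptions for illustration, not the actual schema:

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class RunMetadata:
    """Assumed per-run metadata record; real field names may differ."""
    run_at: datetime     # when the model run was executed
    release_date: date   # release the run targets (the "release_tag_date")
    jurisdiction: str    # state/jurisdiction the outputs cover
    blob_path: str       # where the run's outputs live in the container
```

Under this framing, the superseding rule reduces to: for each (release_date, jurisdiction) pair, keep the record with the latest run_at.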
Requirements
A loose start...
Production outputs
All in Python
Reads from blob storage, pulling the last task run for a given production date by scanning over metadata (see the sketch after this list)
No being clever until we have a problem with run time
Don't require a complete set of jurisdictions at the task level -- implicit missingness is ok when scanning from model outputs. We don't want to be in a situation where we can't aggregate an incomplete set of outputs.
However, we probably want asserts on the final outputs to check that implicit missingness has been made explicit.
Writes condensed output file to blob
Reads condensed output file to generate production files
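A minimal sketch of the scan-and-condense step described in this list, assuming the per-run metadata has already been listed from blob storage into plain dicts with the fields sketched above; condense_runs and ALL_JURISDICTIONS are hypothetical names:

```python
import pandas as pd

# Hypothetical canonical jurisdiction list used to make implicit missingness
# explicit in the condensed file (placeholder values only).
ALL_JURISDICTIONS = {"AK", "AL", "AZ"}


def condense_runs(metadata_records: list[dict], release_date: str) -> pd.DataFrame:
    """Keep the latest run_at per jurisdiction for one release date.

    An incomplete set of jurisdictions is fine at the task level; missing
    jurisdictions are added back as explicit all-null rows before writing.
    """
    runs = pd.DataFrame(metadata_records)
    runs = runs[runs["release_date"] == release_date]

    latest = (
        runs.sort_values("run_at")
            .drop_duplicates(subset="jurisdiction", keep="last")
            .set_index("jurisdiction")
    )

    # Make implicit missingness explicit, then assert it really is explicit.
    latest = latest.reindex(sorted(ALL_JURISDICTIONS))
    assert set(latest.index) == ALL_JURISDICTIONS

    return latest.reset_index()


# Example: write the condensed reference file as parquet.
# condense_runs(records, "2025-01-15").to_parquet("condensed_2025-01-15.parquet")
```

The production files (map_data...., timeseries_data...) would then be generated from this condensed parquet rather than by re-scanning raw model outputs.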
Model eval
Write the metadata scan function that takes a date and/or a release tag (see the sketch after this list)
We could have a staging environment where we prepare for production without actually going to production, and where we could point to all versions of a model run
Either way, we want this to be a distinct service that we can run on an as-needed basis, pointed at an arbitrary.....something -- is the unique identifier for the run the run_at, the production_date, or something else?
Nate has some tree magic he will diagram for us involving "persistent something something"
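One possible shape for that metadata scan: it takes a date and/or a release tag (both optional) and filters already-listed metadata records. The release_tag field and the function name are assumptions tied to the tag-system idea discussed further down:

```python
from datetime import date
from typing import Optional


def scan_metadata(
    records: list[dict],
    release_date: Optional[date] = None,
    release_tag: Optional[str] = None,
) -> list[dict]:
    """Return metadata records matching a date and/or a release tag.

    Passing neither filter returns everything, which is what an as-needed
    eval service pointed at "all versions of a model run" would want.
    """
    matched = records
    if release_date is not None:
        matched = [r for r in matched if r["release_date"] == release_date]
    if release_tag is not None:
        matched = [r for r in matched if r.get("release_tag") == release_tag]
    return matched
```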
Open questions
Where do model reports and plots live? Are they in this repo or a different repo?
If in this repo, are they a dashboard? Or PDF reports?
What kind of environment do they run in?
What's an exclusion? Can we break exclusions down into composite pieces?
For example, when we are doing backtesting, is an exclusion from reporting also an exclusion from evaluation in backtesting?
What types of exclusions are generated where in the process? What can we save in a centralized place?
Patrick: Add them to the Excel file from the meeting, move to CSV, store.....somewhere
KG: This should be a result of the postprocessing pipeline for each unique run ID -- add a fixed set of categories for exclusion reason. This is something we set up in the config, but it will be difficult to get right (see the sketch after this list).
Does this postprocessing piece run every time or only once? -- we need to answer this a little further along, making the decision output by output. E.g., the dashboard should probably show the to-be-released version; for flat files or plots we may want it to run every time, but it depends...
Website update workflow: Inform wants to move away from the current workflow where we hand them CSVs. Instead, they want our db -> DCIPHER -> data.cdc.gov -> website
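To make KG's suggestion above concrete, here is a minimal sketch of a fixed exclusion-reason vocabulary that the postprocessing pipeline could validate against; the category names and the validate_exclusions helper are placeholders, not an agreed list:

```python
from enum import Enum


class ExclusionReason(str, Enum):
    """Hypothetical fixed set of exclusion categories pinned in the config."""
    REPORTING = "reporting"          # excluded from public reporting
    BACKTEST_EVAL = "backtest_eval"  # excluded from backtest evaluation
    OUTLIER = "outlier"              # flagged as an outlying observation
    DATA_QUALITY = "data_quality"    # known upstream data problem


def validate_exclusions(exclusions: list[dict]) -> list[dict]:
    """Reject any exclusion whose reason is not in the fixed vocabulary."""
    allowed = {reason.value for reason in ExclusionReason}
    for row in exclusions:
        if row["reason"] not in allowed:
            raise ValueError(f"Unknown exclusion reason: {row['reason']!r}")
    return exclusions
```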
Rather than using a combination of a timestamp and a date, you could use a single datetime. Postprocessing could then search across this to create a release (filtering for the latest available run per location within the release window). The logic would look something like the following (sketched in code after this list):
Is greater than a start datetime
Is smaller than a final release datetime
Is the latest datetime found for a given location
This only works if you are able to identify production runs (which I think you are).
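A sketch of that release-window selection, assuming runs have already been narrowed to production runs and loaded into a DataFrame; the run_datetime and location column names are assumptions:

```python
import pandas as pd


def select_release(runs: pd.DataFrame, window_start, window_end) -> pd.DataFrame:
    """Per location, keep the latest run whose single datetime falls inside
    the release window (greater than the start, smaller than the final
    release datetime)."""
    in_window = runs[
        (runs["run_datetime"] > window_start) & (runs["run_datetime"] < window_end)
    ]
    return (
        in_window.sort_values("run_datetime")
                 .drop_duplicates(subset="location", keep="last")
    )
```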
If you keep production_date, I'd suggest renaming it, as I think it is actually a release tag rather than a "production date" (it isn't actually linked to the original production date but to some target release window or version).
I'm not sure why you are treating production as special here. Why not just have a tag system where production runs are assigned some kind of unique tag like "prod--"? Then, if you want, you can metadata-scan by any tag and time/date stamp.
Where do model reports and plots live? Are they in this repo or a different repo?
Not in this repo - maybe tools for postprocessing
are they a dashboard?
Yes - connected to blob storage and ideally allowing for looking through run IDs
Does this postprocessing piece run every time
Every production run?
For example, when we are doing backtesting is an exclusion from reporting also an exclusion from evaluation in backtesting?
No. Ideally an exclusion hierarchy? Use a tag system again and enforce a dictionary of accepted tags (outlier, etc.). Then we can filter these out as we wish (a minimal sketch is below).
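A minimal sketch of the accepted-tag dictionary idea, so reporting and backtest evaluation can each filter out a different subset of exclusions; the tag names and helper functions are placeholders:

```python
# Hypothetical accepted-tag vocabulary; each downstream consumer decides
# which tags it filters out, so a reporting exclusion need not also be a
# backtest-evaluation exclusion.
ACCEPTED_EXCLUSION_TAGS = {"outlier", "reporting_exclusion", "eval_exclusion"}


def validate_tags(tags: set[str]) -> set[str]:
    """Enforce the accepted-tag dictionary."""
    unknown = tags - ACCEPTED_EXCLUSION_TAGS
    if unknown:
        raise ValueError(f"Unknown exclusion tags: {unknown}")
    return tags


def drop_excluded(rows: list[dict], drop_tags: set[str]) -> list[dict]:
    """Keep rows carrying none of the tags this consumer filters out."""
    return [r for r in rows if not set(r.get("exclusion_tags", [])) & drop_tags]


# Reporting might drop {"reporting_exclusion", "outlier"}, while backtest
# evaluation might drop only {"eval_exclusion"}.
```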