Output refactor plan #2

zsusswein opened this issue Oct 23, 2024 · 1 comment

zsusswein commented Oct 23, 2024

Goal

Produce a pipeline to transform model outputs into datasets formatted for public consumption (website, data.cdc.gov, DCIPHER, others?).

Pipeline structure

  • Modeling outputs write to Azure Blob Storage (ABS) with the schema:
az://rt-output/
├── job_1/
│   ├── tasks/
│   │   ├── task_1/
│   │   │   ├── model.rds + others
│   │   ├── task_3/
│   │   │   ├── model.rds + others
│   │   ├── task_5/
│   │   │   ├── model.rds + others
│   │   ├── task_6/
│   │   │   ├── model.rds + others
│   ├── job_metadata.json
  • Production outputs write to a dedicated container. Non-production ("experiment") model runs write to their own blob storage container.
  • Each model run's metadata has a run_at timestamp and a release_date datestamp. For downstream processing, later run_at timestamps supersede earlier ones for the same release_tag_date
    • All versions of the outputs continue to exist in the container
    • Later jobs can run a subset of the jurisdictions and only supersede estimates for that subset
    • I don't love release_tag_date so please help
  • This repository has a function that scans over the metadata, extracts the final task run for each state, and saves the aggregated file as parquet. This file is the reference file for the production date (a sketch of this step follows this list)
    • This file is used to generate the production files (map_data...., timeseries_data...)
    • We generate this file after we are done with runs for the production date (loosely analogous to the silver file in the ETL?)
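
A minimal sketch of that scan-and-condense step, assuming job_metadata.json carries run_at, release_date, and a per-task list of jurisdictions; the container URL, field names, and task layout below are placeholders rather than the real schema:

```python
import json

import pandas as pd
from azure.storage.blob import ContainerClient

# Placeholder URL -- swap in the real account / container.
CONTAINER_URL = "https://<account>.blob.core.windows.net/rt-output"


def condense_release(release_date: str, credential) -> pd.DataFrame:
    """Pick the latest run_at per jurisdiction for one release date."""
    container = ContainerClient.from_container_url(CONTAINER_URL, credential=credential)

    # 1. Scan every job's metadata file in the container.
    records = []
    for blob in container.list_blobs():
        if not blob.name.endswith("job_metadata.json"):
            continue
        meta = json.loads(container.download_blob(blob.name).readall())
        if meta["release_date"] != release_date:
            continue
        job_dir = blob.name.rsplit("/", 1)[0]
        for task in meta["tasks"]:  # assumed: metadata lists task ids + jurisdictions
            records.append(
                {
                    "jurisdiction": task["jurisdiction"],
                    "run_at": meta["run_at"],
                    "task_path": f"{job_dir}/tasks/{task['id']}",
                }
            )

    # 2. Later run_at supersedes earlier ones, per jurisdiction.
    runs = pd.DataFrame(records)
    latest = runs.sort_values("run_at").drop_duplicates("jurisdiction", keep="last")

    # 3. Save as the reference ("silver") file for this release date.
    latest.to_parquet(f"condensed_{release_date}.parquet", index=False)
    return latest
```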


Requirements

A loose start...

Production outputs

  • All in Python
  • Reads from Blob, pulling the last task for a given production date by scanning over metadata
    • No being clever until we have a problem with run time
    • Don't require a complete set of jurisdictions at the task level -- implicit missingness is ok when scanning from model outputs. We don't want to be in a situation where we can't aggregate an incomplete set of outputs.
    • However, we probably want asserts on the final outputs to check that implicit missingness has been made explicit (see the sketch after this list).
  • Writes condensed output file to blob
  • Reads condensed output file to generate production files
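
A rough sketch of that final assert, assuming the condensed file loads into a pandas frame with a jurisdiction column; the column name and jurisdiction list are placeholders:

```python
import pandas as pd

# Placeholder: the full jurisdiction list would come from config, not be hard-coded.
EXPECTED_JURISDICTIONS = frozenset({"AK", "AL", "WY"})


def make_missingness_explicit(condensed: pd.DataFrame) -> pd.DataFrame:
    """Add explicit all-NA rows for any jurisdiction missing from the condensed output."""
    missing = EXPECTED_JURISDICTIONS - set(condensed["jurisdiction"])
    if missing:
        filler = pd.DataFrame({"jurisdiction": sorted(missing)})
        condensed = pd.concat([condensed, filler], ignore_index=True)

    # Final check: every expected jurisdiction is present, explicitly (even if all-NA).
    assert set(condensed["jurisdiction"]) >= EXPECTED_JURISDICTIONS
    return condensed
```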

Model eval

  • Write the metadata scan function that takes a date and/or a release tag (one possible interface is sketched after this list)
  • We could have a staging environment where we prepare for production without actually going to production, and where we could point to all versions of a model run
  • Either way we want this to be a distinct service that we can run on an as-needed basis pointed to an arbitrary.....something -- is the unique identifier for the run the run_at or the production_date or something else?
    • Nate has some tree magic he will diagram for us involving "persistent something something"
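
One possible shape for that scan function's interface; the class, field, and argument names are illustrative only, and the blob-listing part would look like the condensed-output sketch above:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RunMetadata:
    """Assumed fields pulled from each job_metadata.json."""
    job_id: str
    run_at: datetime
    release_date: str   # or whatever release_tag_date ends up being called
    release_tag: str    # e.g. a "prod-..." style tag, if we adopt tags


def filter_runs(
    runs: list[RunMetadata],
    release_date: Optional[str] = None,
    release_tag: Optional[str] = None,
) -> list[RunMetadata]:
    """Keep runs matching the given date and/or tag; both filters are optional."""
    keep = runs
    if release_date is not None:
        keep = [r for r in keep if r.release_date == release_date]
    if release_tag is not None:
        keep = [r for r in keep if r.release_tag == release_tag]
    return keep
```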

Open questions

  • Where do model reports and plots live? Are they in this repo or a different repo?
    • If in this repo, are they a dashboard? Or pdf reports?
    • What kind of environment do they run in?
  • What's an exclusion? Can we break exclusions down into composite pieces?
    • For example, when we are doing backtesting is an exclusion from reporting also an exclusion from evaluation in backtesting?
    • What types of exclusions are generated where in the process? What can we save in a centralized place?
    • Patrick: Add them to the Excel file from the meeting, move to CSV, store.....somewhere
    • KG: This should be a result of the postprocessing pipeline for each unique run ID -- add a fixed set of categories for exclusion reason. This is something that we set up in the config, but it will be difficult to get right
  • Does this postprocessing piece run every time or only once? -- we need to answer this a little further along, making the decision output by output. E.g. the dashboard should probably show the version to be released; for flat files or plots we may want it to run every time, but it depends...
  • Website update workflow: Inform wants to move away from the current workflow where we hand them CSVs. Instead, they want our db -> DCIPHER -> data.cdc.gov -> website
    • But this is a to-be-resolved-later
  • @zsusswein put Patrick in this repo
  • Patrick, I can't tag you, but can you leave a comment here about the outlier handling and Ben's code and scope?

Out of scope for v1

  • Additional modeling
  • New evaluation metrics or plots
  • Dashboard

seabbs commented Oct 23, 2024

My two cents

Rather than using a combination of a timestamp and a date, you could use a single datetime. This could then be searched across by postprocessing to create a release (filtering for the latest available run per location within the release window). The logic would look something like:

  1. Is greater than a datetime
  2. Is smaller than a final release datetime
  3. Is the latest datetime found for a given location

This only works if you are able to identify production runs (which I think you are).
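
In pandas terms (column names assumed, not anything that exists yet), that filter might look like:

```python
import pandas as pd


def select_release(runs: pd.DataFrame, window_start, release_datetime) -> pd.DataFrame:
    """runs: one row per (location, run) with a single run_at datetime column."""
    in_window = runs[
        (runs["run_at"] > window_start)          # 1. after the window opens
        & (runs["run_at"] < release_datetime)    # 2. before the final release datetime
    ]
    # 3. the latest datetime found per location wins
    # (plus whatever filter identifies production runs, per the caveat above)
    return in_window.sort_values("run_at").drop_duplicates("location", keep="last")
```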

If you keep production_date I'd suggest renaming it, as I think it is actually a release tag rather than a "production date" (it isn't actually linked to the original production date but to some target release window or version).

I'm not sure why you are treating production as special here. Why not just have a tag system where production runs assign some kind of unique tag like "prod--"? Then, if you want, you can metadata scan by any tag and time/date stamp.

> Where do model reports and plots live? Are they in this repo or a different repo?

Not in this repo - maybe tools for postprocessing

> are they a dashboard?

Yes - connected to blob and ideally allowing for looking through IDs

> Does this postprocessing piece run every time

Every production run?

> For example, when we are doing backtesting is an exclusion from reporting also an exclusion from evaluation in backtesting?

No. Ideally an exception hierarchy? Use a tag system again and enforce a dictionary of accepted tags (outlier, etc.). Then we can filter these out as we wish.
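
A minimal sketch of enforcing that dictionary of accepted tags; the tag names below are invented examples, not an agreed list:

```python
from enum import Enum


class ExclusionTag(str, Enum):
    """Accepted exclusion reasons; anything outside this set should fail loudly."""
    OUTLIER = "outlier"                # example category only
    REPORTING_ONLY = "reporting_only"  # dropped from reporting, kept for backtest evaluation
    EVALUATION = "evaluation"          # dropped from evaluation as well


def validate_exclusions(tags: list[str]) -> list[ExclusionTag]:
    """Reject any tag outside the accepted dictionary."""
    return [ExclusionTag(t) for t in tags]  # raises ValueError on an unknown tag
```

Distinguishing reporting-only from evaluation exclusions this way would let backtesting filter them independently.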
