Output refactor plan #2

zsusswein opened this issue Oct 23, 2024 · 1 comment

zsusswein commented Oct 23, 2024

Goal

Produce a pipeline to transform model outputs into datasets formatted for public consumption (website, data.cdc.gov, DCIPHER, others?).

Pipeline structure

  • Modeling outputs write to Azure Blob Storage (ABS) with the schema:
az://rt-output/
├── job_1/
│   ├── tasks/
│   │   ├── task_1/
│   │   │   ├── model.rds + others
│   │   ├── task_3/
│   │   │   ├── model.rds + others
│   │   ├── task_5/
│   │   │   ├── model.rds + others
│   │   ├── task_6/
│   │   │   ├── model.rds + others
│   ├── job_metadata.json
  • Production outputs write to a dedicated container. Non-production ("experiment") model runs write to their own blob storage container.
  • Each model run's metadata has a run_at timestamp and a release_date datestamp. For downstream processing, later run_at timestamps supersede earlier ones for the same release_tag_date
    • All versions of the outputs continue to exist in the container
    • Later jobs can run a subset of the jurisdictions and only supersede estimates for that subset
    • I don't love release_tag_date so please help
  • This repository has a function that scans over the metadata, extracts the final task run for each state, and saves the aggregated file as parquet. This file is the reference file for the production date (a sketch of this step follows this list)
    • This file is used to generate the production files (map_data...., timeseries_data...)
    • We generate this file after we are done with runs for the production date (loosely analogous to the silver file in the ETL?)
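
A minimal sketch of that scan-and-condense step, assuming job_metadata.json carries run_at, release_date, and a per-task list of jurisdictions; the container URL, field names, and task layout below are placeholders rather than the real schema:

```python
import json

import pandas as pd
from azure.storage.blob import ContainerClient

# Placeholder URL -- swap in the real account / container.
CONTAINER_URL = "https://<account>.blob.core.windows.net/rt-output"


def condense_release(release_date: str, credential) -> pd.DataFrame:
    """Pick the latest run_at per jurisdiction for one release date."""
    container = ContainerClient.from_container_url(CONTAINER_URL, credential=credential)

    # 1. Scan every job's metadata file in the container.
    records = []
    for blob in container.list_blobs():
        if not blob.name.endswith("job_metadata.json"):
            continue
        meta = json.loads(container.download_blob(blob.name).readall())
        if meta["release_date"] != release_date:
            continue
        job_dir = blob.name.rsplit("/", 1)[0]
        for task in meta["tasks"]:  # assumed: metadata lists task ids + jurisdictions
            records.append(
                {
                    "jurisdiction": task["jurisdiction"],
                    "run_at": meta["run_at"],
                    "task_path": f"{job_dir}/tasks/{task['id']}",
                }
            )

    # 2. Later run_at supersedes earlier ones, per jurisdiction.
    runs = pd.DataFrame(records)
    latest = runs.sort_values("run_at").drop_duplicates("jurisdiction", keep="last")

    # 3. Save as the reference ("silver") file for this release date.
    latest.to_parquet(f"condensed_{release_date}.parquet", index=False)
    return latest
```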


Requirements

A loose start...

Production outputs

  • All in Python
  • Reads from Blob, pulling the last task for a given production date by scanning over metadata
    • No being clever until we have a problem with run time
    • Don't require a complete set of jurisdictions at the task level -- implicit missingness is ok when scanning from model outputs. We don't want to be in a situation where we can't aggregate an incomplete set of outputs.
    • However, we probably want asserts on the final outputs to check that implicit missingness has been made explicit (see the sketch after this list).
  • Writes condensed output file to blob
  • Reads condensed output file to generate production files
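
A rough sketch of that final assert, assuming the condensed file loads into a pandas frame with a jurisdiction column; the column name and jurisdiction list are placeholders:

```python
import pandas as pd

# Placeholder: the full jurisdiction list would come from config, not be hard-coded.
EXPECTED_JURISDICTIONS = frozenset({"AK", "AL", "WY"})


def make_missingness_explicit(condensed: pd.DataFrame) -> pd.DataFrame:
    """Add explicit all-NA rows for any jurisdiction missing from the condensed output."""
    missing = EXPECTED_JURISDICTIONS - set(condensed["jurisdiction"])
    if missing:
        filler = pd.DataFrame({"jurisdiction": sorted(missing)})
        condensed = pd.concat([condensed, filler], ignore_index=True)

    # Final check: every expected jurisdiction is present, explicitly (even if all-NA).
    assert set(condensed["jurisdiction"]) >= EXPECTED_JURISDICTIONS
    return condensed
```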

Model eval

  • Write the metadata scan function that takes a date and/or a release tag (one possible interface is sketched after this list)
  • We could have a staging environment where we prepare for production without actually going to production, and where we could point to all versions of a model run
  • Either way we want this to be a distinct service that we can run on an as-needed basis pointed to an arbitrary.....something -- is the unique identifier for the run the run_at or the production_date or something else?
    • Nate has some tree magic he will diagram for us involving "persistent something something"
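
One possible shape for that scan function's interface; the class, field, and argument names are illustrative only, and the blob-listing part would look like the condensed-output sketch above:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RunMetadata:
    """Assumed fields pulled from each job_metadata.json."""
    job_id: str
    run_at: datetime
    release_date: str   # or whatever release_tag_date ends up being called
    release_tag: str    # e.g. a "prod-..." style tag, if we adopt tags


def filter_runs(
    runs: list[RunMetadata],
    release_date: Optional[str] = None,
    release_tag: Optional[str] = None,
) -> list[RunMetadata]:
    """Keep runs matching the given date and/or tag; both filters are optional."""
    keep = runs
    if release_date is not None:
        keep = [r for r in keep if r.release_date == release_date]
    if release_tag is not None:
        keep = [r for r in keep if r.release_tag == release_tag]
    return keep
```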

Open questions

  • Where do model reports and plots live? Are they in this repo or a different repo?
    • If in this repo, are they a dashboard? Or pdf reports?
    • What kind of environment do they run in?
  • What's an exclusion? Can we break exclusions down into composite pieces?
    • For example, when we are doing backtesting is an exclusion from reporting also an exclusion from evaluation in backtesting?
    • What types of exclusions are generated where in the process? What can we save in a centralized place?
    • Patrick: Add them to the Excel file from the meeting, move to CSV, store.....somewhere
    • KG: This should be a result of the postprocessing pipeline for each unique run ID -- add a fixed set of categories for exclusion reason. This is something that we set up in the config, but it will be difficult to get right
  • Does this postprocessing piece run every time or only once? -- we need to answer this a little further along, making the decision output by output. E.g. the dashboard should probably show the version to be released; for flat files or plots we may want it to run every time, but it depends...
  • Website update workflow: Inform wants to move away from the current workflow where we hand them CSVs. Instead, they want our db -> DCIPHER -> data.cdc.gov -> website
    • But this is a to-be-resolved-later
  • @zsusswein put Patrick in this repo
  • Patrick, I can't tag you, but can you leave a comment here about the outlier handling and Ben's code and scope?

Out of scope for v1

  • Additional modeling
  • New evaluation metrics or plots
  • Dashboard

seabbs commented Oct 23, 2024

My two cents

Rather than using a combination of a timestamp and a date, you could use a single datetime. This could then be searched across by postprocessing to create a release (filtering for the latest available run per location within the release window). The logic would look something like:

  1. Is greater than a datetime
  2. Is smaller than a final release datetime
  3. Is the latest datetime found for a given location

This only works if you are able to identify production runs (which I think you are).
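
In pandas terms (column names assumed, not anything that exists yet), that filter might look like:

```python
import pandas as pd


def select_release(runs: pd.DataFrame, window_start, release_datetime) -> pd.DataFrame:
    """runs: one row per (location, run) with a single run_at datetime column."""
    in_window = runs[
        (runs["run_at"] > window_start)          # 1. after the window opens
        & (runs["run_at"] < release_datetime)    # 2. before the final release datetime
    ]
    # 3. the latest datetime found per location wins
    # (plus whatever filter identifies production runs, per the caveat above)
    return in_window.sort_values("run_at").drop_duplicates("location", keep="last")
```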

If you keep production_date I'd suggest renaming it, as I think it is actually a release tag rather than a "production date" (it isn't actually linked to the original production date but to some target release window or version).

I'm not sure why you are treating production as special here. Why not just have a tag system where production runs assign some kind of unique tag like "prod--"? Then, if you want, you can metadata scan by any tag and time/date stamp.

> Where do model reports and plots live? Are they in this repo or a different repo?

Not in this repo - maybe tools for postprocessing

> are they a dashboard?

Yes - connected to blob and ideally allowing for looking through IDs

> Does this postprocessing piece run every time

Every production run?

> For example, when we are doing backtesting is an exclusion from reporting also an exclusion from evaluation in backtesting?

No. Ideally an exception hierarchy? Use a tag system again and enforce a dictionary of accepted tags (outlier, etc.). Then we can filter these out as we wish.
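
A minimal sketch of enforcing that dictionary of accepted tags; the tag names below are invented examples, not an agreed list:

```python
from enum import Enum


class ExclusionTag(str, Enum):
    """Accepted exclusion reasons; anything outside this set should fail loudly."""
    OUTLIER = "outlier"                # example category only
    REPORTING_ONLY = "reporting_only"  # dropped from reporting, kept for backtest evaluation
    EVALUATION = "evaluation"          # dropped from evaluation as well


def validate_exclusions(tags: list[str]) -> list[ExclusionTag]:
    """Reject any tag outside the accepted dictionary."""
    return [ExclusionTag(t) for t in tags]  # raises ValueError on an unknown tag
```

Distinguishing reporting-only from evaluation exclusions this way would let backtesting filter them independently.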
