
First draft of output schema for the Rt postprocessing #1

Draft · wants to merge 4 commits into main

Conversation


@kgostic commented Oct 10, 2024

Here is the rendered readme with a proposed output structure. Please review and comment! https://github.com/CDCgov/cfa-rt-postprocessing/tree/output-structure?tab=readme-ov-file#cfa-r_t-postprocesing

Things we need to decide to move forward:

  • What language will the code be in? (R or Python) - leaning toward Python, which has a good Azure SDK and support for Delta tables.
  • Where will the code run? (Proposal: run in the VAP, but read from and write to blob storage.)
  • Can we use a pipeline manager like make or airflow?
  • Is this a package?
  • Currently, the job_id and disease are conflated in the proposed output structure. How to solve?
  • What is the definition of run_date, release_date, rt_date, asof_date, etc. and what names/concepts do we want in the merged_release.csv file?
  • Proposal: keep the release module separate from the for-internal-review module. Pair the release module with tooling to automatically insert outputs into the production database.
  • How do we reprocess if we have to re-run just one or two states? Track runs with an approach similar to the parameters table (SCD2): log metadata for each state in the run and mark old versions as out of date. Or just use a Delta table.
  • How do we manage the review and exclusion process?
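For the SCD2-style reprocessing idea above, here is a minimal sketch in plain Python. The field names (`state`, `job_id`, `is_current`, `valid_from`, `valid_to`) and the list-of-dicts "table" are illustrative assumptions, standing in for whatever metadata table or Delta table we actually adopt:

```python
from datetime import date


def log_rerun(metadata, state, job_id, run_date):
    """Mark any existing current row for `state` as out of date,
    then append the new run as the current version (SCD2 style)."""
    for row in metadata:
        if row["state"] == state and row["is_current"]:
            row["is_current"] = False
            row["valid_to"] = run_date
    metadata.append({
        "state": state,
        "job_id": job_id,
        "valid_from": run_date,
        "valid_to": None,
        "is_current": True,
    })


metadata = []
log_rerun(metadata, "CA", "job-001", date(2024, 10, 9))
log_rerun(metadata, "CA", "job-002", date(2024, 10, 16))  # rerun of CA only
current = [r for r in metadata if r["is_current"]]
```

After the rerun, only the second CA row is current; the first is retained with its validity window closed, so we keep full history without deleting anything.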

In scope

v1: Proposed scope of this project: refactor to generate basic files as outlined here

Out of scope (for now)

v2: Add a dashboard
v3: Re-write the JS so that the website can also read from merged_release.csv

@kgostic kgostic requested a review from natemcintosh October 10, 2024 04:08
README.md
[data.cdc.gov](), and to [DCIPHER](). By convention, the pipeline always
generates outputs for publication, even if the input data is from an experiment,
test run, or backfill exercise that is not intended for release. The publication
outputs are not costly to generate and our person-driven releae process ensures
Suggested change:
- outputs are not costly to generate and our person-driven releae process ensures
+ outputs are not costly to generate and our person-driven release process ensures

@PatrickTCorbett commented Oct 10, 2024

What language will the code be in? (R or python)

I feel like R would be easier, since a lot of the postprocessing is currently in R, but I have no preference.

Where will the code run? (proposal: run in the vap but read and write from and to blob)

I like the idea of running in the VAP and reading/writing from blob. When Kingsley or I recreate the output figures (to process state exclusions, for example), ~75% of the time is spent waiting for nodes to spin up in Azure Batch.

Is this a package?

My initial thought is that this should start off as a repo or two and then perhaps a package later in its life cycle if a clear path for its use in other pipelines/teams has been established

What is the definition of run_date, release_date, rt_date, asof_date, etc. and what names/concepts do we want in the merged_release.csv file?

Here's my attempt; let's see how close I am to the consensus:

  • run_date: Day the pipeline was run (currently Wednesdays unless there's a surprise delay)
  • release_date: Day the Rt outputs are published by CFA (currently Friday after the run_date)
  • rt_date: Day that data collection halted (for NSSP it's the day before the run_date; for NHSN it was the previous Saturday)
  • asof_date: Indicates that the NSSP data shown are as reported on that date, not retroactively backfilled from a later date
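If those definitions hold, the date relationships can be sketched as follows. This assumes a Wednesday run_date; the offsets come from the definitions above, not from a confirmed spec:

```python
from datetime import date, timedelta

run_date = date(2024, 10, 9)                   # Wednesday the pipeline ran
release_date = run_date + timedelta(days=2)    # Friday after run_date
rt_date_nssp = run_date - timedelta(days=1)    # NSSP: day before the run
# NHSN: the Saturday preceding run_date
rt_date_nhsn = run_date - timedelta(days=(run_date.weekday() - 5) % 7)
```

For this example run_date, release_date lands on Friday 2024-10-11, rt_date_nssp on Tuesday 2024-10-08, and rt_date_nhsn on Saturday 2024-10-05.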

Regarding a potential dashboard, I'm wondering if we might want to create some of the anomaly figures, or RDS files encoding those figures, within the postprocessing pipeline. The current anomaly report takes a while to generate because of the computational demand of reading and processing the input files (model RDS files, gold parquet files, latent.csv files). Any dynamic dashboard would benefit from reading in already-processed figures/data so we don't have to wait.
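A minimal sketch of that pre-computation idea: summarize the heavy inputs once during postprocessing and cache the small result, so a dashboard re-reads the cache instead of the raw model outputs. The paths and column names (`state`, `date`, `value`) are hypothetical, and pickle stands in for whatever cache format the pipeline would actually use (e.g. parquet):

```python
import pandas as pd


def cache_anomaly_summary(latent_csv: str, out_path: str) -> None:
    """Reduce a heavy input file to one summary row per state/date
    and cache it for fast dashboard reads."""
    df = pd.read_csv(latent_csv)
    summary = df.groupby(["state", "date"], as_index=False)["value"].mean()
    summary.to_pickle(out_path)
```

The dashboard would then load the cached summary directly (`pd.read_pickle(out_path)`) rather than re-reading and aggregating the latent.csv files on every view.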

@kgostic (author) commented Nov 12, 2024

@natemcintosh, I think that the more up-to-date version is in @zsusswein's issue: #2.

@zsusswein could you update this readme to reflect the final plan and submit for review?

@kgostic kgostic requested a review from zsusswein November 12, 2024 14:24
@zsusswein zsusswein removed their request for review November 12, 2024 14:25