
First draft of output schema for the Rt postprocessing #1

Draft · wants to merge 4 commits into main

Conversation


@kgostic commented Oct 10, 2024

Here is the rendered readme with a proposed output structure. Please review and comment! https://github.com/CDCgov/cfa-rt-postprocessing/tree/output-structure?tab=readme-ov-file#cfa-r_t-postprocesing

Things we need to decide to move forward:

  • What language will the code be in? (R or Python) - leaning toward Python, which has a good Azure SDK and support for Delta tables.
  • Where will the code run? (Proposal: run in the VAP, but read from and write to blob storage.)
  • Can we use a pipeline manager like make or airflow?
  • Is this a package?
  • Currently, the job_id and disease are conflated in the proposed output structure. How to solve?
  • What is the definition of run_date, release_date, rt_date, asof_date, etc. and what names/concepts do we want in the merged_release.csv file?
  • Proposal: keep the release module separate from the for-internal-review module. Pair the release module with tooling to automatically insert outputs into the production database.
  • How do we reprocess if we have to re-run just one or two states? Track runs with an approach similar to the parameters table (SCD2): log metadata for each state in the run and mark old versions as out of date. Or just use a Delta table.
  • How do we manage the review and exclusion process?
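For the SCD2-style reprocessing idea above, here is a minimal sketch in plain Python. The field names (`state`, `job_id`, `is_current`, `valid_from`, `valid_to`) and the list-of-dicts "table" are illustrative assumptions, standing in for whatever metadata table or Delta table we actually adopt:

```python
from datetime import date


def log_rerun(metadata, state, job_id, run_date):
    """Mark any existing current row for `state` as out of date,
    then append the new run as the current version (SCD2 style)."""
    for row in metadata:
        if row["state"] == state and row["is_current"]:
            row["is_current"] = False
            row["valid_to"] = run_date
    metadata.append({
        "state": state,
        "job_id": job_id,
        "valid_from": run_date,
        "valid_to": None,
        "is_current": True,
    })


metadata = []
log_rerun(metadata, "CA", "job-001", date(2024, 10, 9))
log_rerun(metadata, "CA", "job-002", date(2024, 10, 16))  # rerun of CA only
current = [r for r in metadata if r["is_current"]]
```

After the rerun, only the second CA row is current; the first is retained with its validity window closed, so we keep full history without deleting anything.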

In scope

v1: Proposed scope of this project: refactor to generate basic files as outlined here

Out of scope (for now)

v2: Add a dashboard
v3: Re-write the JS so that the website can also read from merged_release.csv

@kgostic kgostic requested a review from natemcintosh October 10, 2024 04:08
README.md
[data.cdc.gov](), and to [DCIPHER](). By convention, the pipeline always
generates outputs for publication, even if the input data is from an experiment,
test run, or backfill exercise that is not intended for release. The publication
outputs are not costly to generate and our person-driven releae process ensures
Suggested change:
- outputs are not costly to generate and our person-driven releae process ensures
+ outputs are not costly to generate and our person-driven release process ensures

@PatrickTCorbett commented Oct 10, 2024

What language will the code be in? (R or python)

I feel like R would be easier, since a lot of the postprocessing is currently in R, but I have no preference.

Where will the code run? (proposal: run in the vap but read and write from and to blob)

I like the idea of running in the VAP and reading/writing from blob. When Kingsley or I recreate the output figures (to process state exclusions, for example), ~75% of the time is spent waiting for nodes to spin up in Azure Batch.

Is this a package?

My initial thought is that this should start off as a repo or two and then perhaps a package later in its life cycle if a clear path for its use in other pipelines/teams has been established

What is the definition of run_date, release_date, rt_date, asof_date, etc. and what names/concepts do we want in the merged_release.csv file?

Here's my attempt; let's see how close I am to the consensus:

  • run_date: Day the pipeline was run (currently Wednesdays unless there's a surprise delay)
  • release_date: Day the Rt outputs are published by CFA (currently Friday after the run_date)
  • rt_date: Day that data collection halted (for NSSP it's the day before the run_date; for NHSN it was the previous Saturday)
  • asof_date: Indicates that the NSSP data shown are as reported on that date, not retroactively backfilled from a later date
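If those definitions hold, the date relationships can be sketched as follows. This assumes a Wednesday run_date; the offsets come from the definitions above, not from a confirmed spec:

```python
from datetime import date, timedelta

run_date = date(2024, 10, 9)                   # Wednesday the pipeline ran
release_date = run_date + timedelta(days=2)    # Friday after run_date
rt_date_nssp = run_date - timedelta(days=1)    # NSSP: day before the run
# NHSN: the Saturday preceding run_date
rt_date_nhsn = run_date - timedelta(days=(run_date.weekday() - 5) % 7)
```

For this example run_date, release_date lands on Friday 2024-10-11, rt_date_nssp on Tuesday 2024-10-08, and rt_date_nhsn on Saturday 2024-10-05.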

Regarding a potential dashboard, I'm wondering if we might want to create some of the anomaly figures, or RDS files encoding those figures, within the postprocessing pipeline. The current anomaly report takes a while to generate because of the computational demand of reading and processing the input files (model RDS files, gold parquet files, latent.csv files). Any dynamic dashboard would benefit from reading in already-processed figures/data so we don't have to wait.
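A minimal sketch of that pre-computation idea: summarize the heavy inputs once during postprocessing and cache the small result, so a dashboard re-reads the cache instead of the raw model outputs. The paths and column names (`state`, `date`, `value`) are hypothetical, and pickle stands in for whatever cache format the pipeline would actually use (e.g. parquet):

```python
import pandas as pd


def cache_anomaly_summary(latent_csv: str, out_path: str) -> None:
    """Reduce a heavy input file to one summary row per state/date
    and cache it for fast dashboard reads."""
    df = pd.read_csv(latent_csv)
    summary = df.groupby(["state", "date"], as_index=False)["value"].mean()
    summary.to_pickle(out_path)
```

The dashboard would then load the cached summary directly (`pd.read_pickle(out_path)`) rather than re-reading and aggregating the latent.csv files on every view.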

@kgostic (author) commented Nov 12, 2024

@natemcintosh, I think that the more up-to-date version is in @zsusswein's issue: #2.

@zsusswein could you update this readme to reflect the final plan and submit for review?

@kgostic kgostic requested a review from zsusswein November 12, 2024 14:24
@zsusswein zsusswein removed their request for review November 12, 2024 14:25