[Lake][ETL] Add ETL checkpoint to enforce SLAs, and process data incrementally. #694
Comments
I have a plan about it:
1. Checkpoint Period: We will add one more configuration to `ppss.yaml` for the checkpoint period.
2. Create a Checkpoint System: We will create a new table in our database to store the checkpoint timestamps for each ETL run. This table will have columns for the checkpoint timestamp and the run Status.
3. Update ETL Process: We will modify the ETL process to use the checkpoint timestamp. Before starting an ETL run, we will fetch the latest successful checkpoint.
4. Handle User Modification of Checkpoint: If the `st_ts` in `ppss.yaml` is older than the latest checkpoint, we will need to handle rebuilding from that earlier point.
5. Logging and Monitoring: We will log the success or failure of each ETL run. This can be done by writing a log message at the end of each ETL run and updating the Status in the checkpoint table. We will load these logs into DuckDB so they can be reported in a dashboard.
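A minimal sketch of what steps 2 and 3 could look like with DuckDB. The table name `etl_checkpoints` and its columns are assumptions for illustration, not a final schema:

```python
import duckdb

db = duckdb.connect("lake.duckdb")  # hypothetical lake database file

# Step 2: a checkpoint table; name and columns are illustrative only.
db.execute("""
    CREATE TABLE IF NOT EXISTS etl_checkpoints (
        run_id INTEGER,
        checkpoint_ts TIMESTAMP,   -- last timestamp successfully processed
        status VARCHAR             -- e.g. 'success' or 'failed'
    )
""")

# Step 3: before starting an ETL run, fetch the latest successful checkpoint.
last_checkpoint = db.execute("""
    SELECT max(checkpoint_ts) FROM etl_checkpoints WHERE status = 'success'
""").fetchone()[0]  # None on the very first run
```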
We're going to close this ticket in favor of implementing a SQL build strategy. By using temp tables and some simple logic, we can avoid writing a checkpoint while improving our overall ETL and SLAs.
Please see epic #685 for more details.
Motivation
Right now, ETL expects you to modify `st_ts` and `end_ts` in `ppss.yaml` to manage ETL incrementally. This is a lot of effort and is not handled very cleanly.
Using the example above, in the first run we will have something like:
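The example block wasn't preserved here; hypothetically, it looked something like this (keys follow the names used above; the dates are made up):

```yaml
# First run: backfill all historical data (values are hypothetical)
st_ts: 2023-01-01
end_ts: now
```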
In the second run we want to have something like this, because we've already processed all historical data
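Again hypothetically, the second run would advance `st_ts` past the data already processed:

```yaml
# Second run: only fetch/build what's new since the first run (hypothetical)
st_ts: 2024-03-01
end_ts: now
```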
However, the UX for the lake tries to be a bit more end-to-end by simply stating...
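Presumably the user states the full range once and leaves it alone, with the lake resuming from its own checkpoint internally (again, hypothetical values):

```yaml
# Stated once and never hand-edited; the lake tracks progress itself
st_ts: 2023-01-01
end_ts: now
```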
This is a much better/cleaner way of doing things.
[What's the current limit]
Raw Tables: `pdr_slots` and `pdr_predictions` can be at different timestamps.
Build Tables: `bronze_predictions` and `bronze_slots` can be at different timestamps.
Because of this, we currently recalculate more records than needed.
Instead, we can (1) maintain a checkpoint across all tables, (2) only calculate new values, (3) append to our prod table, and (4) update the checkpoint.
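A rough sketch of that loop against DuckDB, reusing the hypothetical `etl_checkpoints` table from the plan above; the build query and table wiring are simplified placeholders, not the real schema:

```python
import duckdb

db = duckdb.connect("lake.duckdb")  # hypothetical lake database file

# (1) Read the single checkpoint shared across all tables.
last_ts = db.execute(
    "SELECT max(checkpoint_ts) FROM etl_checkpoints WHERE status = 'success'"
).fetchone()[0]
if last_ts is None:
    last_ts = "1970-01-01 00:00:00"  # first run: process everything

# (2) Calculate only the new values and (3) append them to the prod table.
db.execute(
    """
    INSERT INTO bronze_predictions
    SELECT * FROM pdr_predictions    -- simplified stand-in for the real build query
    WHERE timestamp > ?
    """,
    [last_ts],
)

# (4) Advance the checkpoint to the newest row just built.
new_ts = db.execute("SELECT max(timestamp) FROM bronze_predictions").fetchone()[0]
db.execute("INSERT INTO etl_checkpoints VALUES (NULL, ?, 'success')", [new_ts])
```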
Trying to look across tables is problematic because they can be at different timestamps. As I mentioned in another ticket, we may want to implement a checkpoint that settles where we currently are in time. This helps `ppss.yaml`, the lake, ETL, and DuckDB play well with each other because the checkpoint can be enforced.
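One simple way to "settle" a single point in time across tables at different timestamps: take the minimum of each table's max timestamp, so the shared checkpoint never runs ahead of the slowest table. A sketch, where the helper and the `timestamp` column are assumptions:

```python
import duckdb

def common_checkpoint(db, tables):
    """Hypothetical helper: min of each table's max(timestamp)."""
    maxima = []
    for table in tables:
        ts = db.execute(f"SELECT max(timestamp) FROM {table}").fetchone()[0]
        if ts is None:
            return None  # an empty table means no common checkpoint yet
        maxima.append(ts)
    return min(maxima)

db = duckdb.connect("lake.duckdb")
ckpt = common_checkpoint(db, ["pdr_slots", "pdr_predictions"])
```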
Candidate Solutions
[Checkpoint]
It's objectively easier, and a cleaner SLA, to manage all tables when there is a common checkpoint applied across them.
Other pipelines like AWS Kinesis are also known to maintain checkpoints; they help us understand where we are pointing. We know where the last successful build happened and what's considered new data, and if there are errors, we can clean up all data from that point onward and rebuild.
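If a build fails, recovery can be just as mechanical: delete everything newer than the last good checkpoint and rebuild from there. A sketch under the same assumed schema:

```python
import duckdb

db = duckdb.connect("lake.duckdb")

# Roll back to the last successful checkpoint: drop anything newer,
# so the next run rebuilds cleanly from that point onward.
last_ts = db.execute(
    "SELECT max(checkpoint_ts) FROM etl_checkpoints WHERE status = 'success'"
).fetchone()[0]
for table in ["bronze_predictions", "bronze_slots"]:
    db.execute(f"DELETE FROM {table} WHERE timestamp > ?", [last_ts])
```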
[Recap of Joins]
DBs & ETL
[Improving ETL builds with DuckDB]
Besides maintaining a checkpoint, DuckDB and SQL also provide a great way for us to (1) build/ETL and then (2) update the table. This means that if an error happens, we don't change any of the data.
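This pairs naturally with the temp-table strategy from the closing comment: build into a temporary table first, and only touch the prod table once the build succeeds. A sketch with illustrative names:

```python
import duckdb

db = duckdb.connect("lake.duckdb")

# Build into a temp table; if anything fails here, prod data is untouched.
db.execute("""
    CREATE TEMP TABLE _bronze_predictions_new AS
    SELECT * FROM pdr_predictions    -- simplified stand-in for the real build query
    WHERE timestamp > (
        SELECT coalesce(max(timestamp), TIMESTAMP '1970-01-01')
        FROM bronze_predictions
    )
""")

# Only after a successful build, append to the prod table in one step.
db.execute("INSERT INTO bronze_predictions SELECT * FROM _bronze_predictions_new")
db.execute("DROP TABLE _bronze_predictions_new")
```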
[Improving Logging and Monitoring]
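The body of this section wasn't preserved, but following item 5 of the plan above, each run's outcome could be written next to the checkpoint so a dashboard can query run history straight from DuckDB (a hypothetical helper):

```python
import duckdb

def record_run(db: duckdb.DuckDBPyConnection, checkpoint_ts, ok: bool) -> None:
    # Hypothetical helper: one status row per ETL run, reportable
    # from a dashboard via a simple query over etl_checkpoints.
    status = "success" if ok else "failed"
    db.execute(
        "INSERT INTO etl_checkpoints VALUES (NULL, ?, ?)", [checkpoint_ts, status]
    )
```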
[Evolving past checkpoints]
KISS for now. Use a single checkpoint across all ETL tables.
There are other tools that take these concepts into far more detail (e.g. Airflow). For now, I believe this offers a simple, clean SLA between raw data and ETL. I believe this could be reused such that all checkpoints for gql, ohlcv, and others are in sync.
DoD