[Lake][ETL] Add ETL checkpoint to enforce SLAs, and process data incrementally. #694

Closed
idiom-bytes opened this issue Feb 28, 2024 · 2 comments
Labels: Type: Enhancement (New feature or request)

idiom-bytes (Member) commented Feb 28, 2024

Motivation

Right now, the ETL expects you to modify st_ts and end_ts in ppss.yaml to manage ETL runs incrementally.
This takes a lot of effort and is not handled very cleanly.
Using the example below, the first run would look like:

lake:
  st_ts: '02-14-2024'
  end_ts: '02-15-2024'

In the second run we want something like this, because we've already processed all historical data:

lake:
  st_ts: '02-15-2024'
  end_ts: '02-16-2024'

However, the UX for the lake tries to be a bit more end-to-end by simply stating the following, which is a much better/cleaner way of doing things:

lake:
  st_ts: '01-06-2023'
  end_ts: 'now'

[What's the current limit]
Raw tables: pdr_slots and pdr_predictions can be at different timestamps.
Built tables: bronze_predictions and bronze_slots can be at different timestamps.

Because of this, we currently recalculate more records than needed.
Instead, we can (1) maintain a checkpoint across all tables, (2) only calculate new values, (3) append to our prod table, and (4) update the checkpoint (sketched under [Improving ETL builds with DuckDB] below).

Trying to look across tables is problematic because they can be at different timestamps. As mentioned in another ticket, we may want to implement a checkpoint that records where we currently are in time. This helps ppss.yaml, the lake, the ETL, and DuckDB play well with each other, because the checkpoint can be enforced.

Candidate Solutions

[Checkpoint]
Managing all tables becomes objectively simpler, with a cleaner SLA, when there is a common checkpoint applied across all of them.

Other pipelines, such as AWS Kinesis, are also known to maintain checkpoints; they help us understand where we are pointing. We know where the last successful build happened and what is considered new data, and if there are errors, we can clean up all data from that point onward and rebuild.

[Recap of Joins]

  • Right now, the raw data (that we fetch) into different pdr_ tables can be at different timestamps. We should enforce that they have all been updated successfully and share a common window to process, [from_ts, to_ts].
  • Right now, the ETL tables (that we build) can be at different timestamps. We should enforce that they have all been updated successfully and share the same [to_ts], so new rows can be appended.
  • Even when our new ETL runs on DuckDB, we need to make sure the ETL completes and builds all data successfully before appending results to the live_lake. A sketch of deriving the common window follows this list.
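
Here is a minimal sketch of deriving a common [from_ts, to_ts] across the raw tables, assuming a DuckDB connection; the table names come from above, but the timestamp column and the helper itself are illustrative assumptions:

import duckdb

RAW_TABLES = ["pdr_slots", "pdr_predictions"]

def common_window(con: duckdb.DuckDBPyConnection, from_ts: int) -> tuple[int, int]:
    # Each raw table may have advanced to a different max timestamp;
    # the common to_ts is the minimum of those, so the window is
    # guaranteed to be covered by every table.
    max_ts_per_table = [
        con.execute(f"SELECT max(timestamp) FROM {table}").fetchone()[0]
        for table in RAW_TABLES
    ]
    return (from_ts, min(max_ts_per_table))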

DBs & ETL

[Improving ETL builds with DuckDB]
Besides maintaining a checkpoint, DuckDB and SQL also provide a great way for us to (1) build/ETL into temp tables and then (2) update the prod table. This means that if an error happens, we don't change any of the data (see the sketch after these steps).

  1. Ensure that all raw tables are up-to-date.
  2. Process all ETL calculations and yield results to temp tables.
  3. When the ETL completes, the prod tables are updated with the results from the temp tables in a single transaction.
  4. Update the checkpoint.
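
A minimal sketch of steps 2-4, assuming a single-row etl_checkpoint table; all table and column names here are illustrative, not the actual lake schema:

import duckdb

con = duckdb.connect("lake.duckdb")

# Step 2: build into a temp table; prod is untouched if this fails.
con.execute("""
    CREATE OR REPLACE TEMP TABLE _temp_bronze_predictions AS
    SELECT * FROM pdr_predictions
    WHERE timestamp > (SELECT last_timestamp FROM etl_checkpoint)
""")

# Steps 3-4: append to prod and advance the checkpoint atomically,
# so a failure midway leaves both untouched.
con.execute("BEGIN TRANSACTION")
con.execute("""
    INSERT INTO bronze_pdr_predictions
    SELECT * FROM _temp_bronze_predictions
""")
con.execute("""
    UPDATE etl_checkpoint
    SET last_timestamp = (SELECT max(timestamp) FROM _temp_bronze_predictions)
""")
con.execute("COMMIT")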

[Improving Logging and Monitoring]

  1. Log ETL success/failure in code => load into DuckDB so it can be reported in a dashboard (a sketch follows).
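
A hypothetical sketch of that logging step; the etl_runs table and its columns are assumptions for illustration:

import time
import duckdb

con = duckdb.connect("lake.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS etl_runs (
        run_ts BIGINT,
        step VARCHAR,
        status VARCHAR
    )
""")

def log_etl_status(step: str, status: str) -> None:
    # One row per ETL step, queryable later by a dashboard.
    con.execute(
        "INSERT INTO etl_runs VALUES (?, ?, ?)",
        [int(time.time()), step, status],
    )

log_etl_status("bronze_pdr_predictions", "success")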

[Evolving past checkpoints]
KISS for now: use a single checkpoint across all ETL tables.

There are other tools that take these concepts into far more detail (e.g. Airflow). For now, I believe this offers a simple, clean SLA between raw data and ETL. I believe this could be reused such that all checkpoints for gql, ohlcv, and others stay in sync.

DoD

  • Checkpoint is used to maintain the last_timestamp processed for all etl/built tables
  • Checkpoint is updated at the end of a successful etl_update loop
  • Raw tables can continue to update on their own and be at different timestamps.
  • ETL/Bronze tables verify that raw_tables have been updated successfully, before running an ETL update.
  • At the end of a successful ETL update, the checkpoint is updated.
  • ppss.yaml updates dynamically to track ETL checkpoint.
  • If the user moves the checkpoint into the past, their ETL tables will be (1) wiped from that timestamp onwards and (2) rebuilt, (3) without having to re-fetch any raw data from GQL or OHLCV. Please note: it doesn't make sense to move the checkpoint into the future.
  • The ETL pipeline successfully uses [st_ts, end_ts, and checkpoint] from the yaml to update incrementally (a hypothetical yaml shape is sketched below).
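
For illustration, the yaml could end up looking something like this; the checkpoint key is an assumption, not the current ppss.yaml schema:

lake:
  st_ts: '01-06-2023'
  end_ts: 'now'
  checkpoint: '02-15-2024'  # written back by the ETL after each successful run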
idiom-bytes added the Type: Enhancement (New feature or request) label Feb 28, 2024
idiom-bytes changed the title [Lake][ETL] Add checkpoint for ETL such that ppss.yaml, lake, ETL, and DuckDB play well with each other. → [Lake][ETL] Add ETL checkpoint to track and progress the pipeline more effectively. Feb 28, 2024
idiom-bytes changed the title [Lake][ETL] Add ETL checkpoint to track and progress the pipeline more effectively. → [Lake][ETL] Add ETL checkpoint to more objectively track and enforce SLAs. Feb 28, 2024
idiom-bytes changed the title [Lake][ETL] Add ETL checkpoint to more objectively track and enforce SLAs. → [Lake][ETL] Add ETL checkpoint to enforce SLAs, and process data incrementally. Feb 28, 2024
kdetry (Contributor) commented Mar 7, 2024

I have a plan for it:

Checkpoint Period: We will add one more configuration option to the ppss.yaml file, called checkpoint_period. It will be our fetching period for every ETL process.

Create a Checkpoint System: We will create a new table in our database to store the checkpoint timestamps for each ETL run. This table will have columns for etl_run_id, start_timestamp, end_timestamp, and status. This will allow us to track each ETL run separately and maintain a history of all runs.

Update ETL Process: We will modify the ETL process to use the checkpoint timestamp. Before starting an ETL run, we will fetch the latest successful end_timestamp from the checkpoint table and only process data that is newer than this timestamp. At the end of a successful ETL run, we will insert a new row in the checkpoint table with the start_timestamp, end_timestamp, and status of the run.
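
A minimal sketch of that checkpoint table and the resume query, assuming DuckDB; the ETL_Runs name and columns follow the plan above, while the literal values are illustrative:

import duckdb

con = duckdb.connect("lake.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS ETL_Runs (
        etl_run_id BIGINT,
        start_timestamp BIGINT,
        end_timestamp BIGINT,
        status VARCHAR
    )
""")

# Before a run: resume from the latest successful end_timestamp.
last_end_ts = con.execute(
    "SELECT max(end_timestamp) FROM ETL_Runs WHERE status = 'success'"
).fetchone()[0]  # None on the very first run

# After a successful run: record the window that was processed.
con.execute(
    "INSERT INTO ETL_Runs VALUES (?, ?, ?, ?)",
    [1, last_end_ts or 0, 1709769600, "success"],
)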

Handle User Modification of Checkpoint: If the st_ts in ppss.yaml is older than the start_timestamp of the first record in the ETL_Runs table (or if the end_ts is newer than its end_timestamp), we will wipe the data completely and start a new ETL process. This ensures that we always have consistent and up-to-date data in our ETL tables.

Logging and Monitoring: We will log the success or failure of each ETL run. This can be done by writing a log message at the end of each ETL run and updating the Status in the checkpoint table. We will load these logs into DuckDB so they can be reported in a dashboard.

idiom-bytes (Member, Author) commented Mar 11, 2024

We're going to close this ticket in favor of implementing a SQL build strategy. By using temp tables and some simple logic, we can avoid writing a checkpoint while improving our overall ETL and SLAs.

DuckDB and SQL also provide a great way for us to (1) build/ETL and then (2) update the table. This means that if an error happens, we don't change any of the prod data.

Please see epic #685 for more details.
