[Lake][ETL] Update raw save-to-disk #681

Closed
1 task done
idiom-bytes opened this issue Feb 26, 2024 · 1 comment
Labels: Type: Enhancement (New feature or request)


idiom-bytes commented Feb 26, 2024

Motivation

Parquet can append efficiently with a batch of records, but isn't good for single rows/records. We want to make the process as efficient as possible for now, without adding any additional overhead (i.e. live processing + message queue => dump).

Rather than re-creating a parquet file, simply append rows to the csv.

Outline

This was already partially implemented in the past (check git history).
I do not believe that it was saving out different csvs.
1 - Read the last record from the last csv
2 - Fetch all new records from the subgraph
3 - Validate every value
4 - Write *1000 lines per csv (please review feature)
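
The four steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual `table.py` or CSVDataStore code. The file naming (`part_000000.csv`), the `timestamp` column, the `fetch_new_records` callable standing in for the subgraph query, and the "no empty values" validation rule are all assumptions made for the example.

```python
from __future__ import annotations

import csv
from pathlib import Path

MAX_ROWS_PER_FILE = 1000  # the "1000 lines per csv" from the outline


def _csv_files(folder: Path) -> list[Path]:
    # Zero-padded names (hypothetical convention) sort chronologically.
    if not folder.exists():
        return []
    return sorted(folder.glob("part_*.csv"))


def read_last_record(folder: Path) -> dict | None:
    """Step 1: return the last row of the newest csv, or None if there is none."""
    files = _csv_files(folder)
    if not files:
        return None
    with files[-1].open(newline="") as f:
        rows = list(csv.DictReader(f))
    return rows[-1] if rows else None


def validate(record: dict, fields: list[str]) -> None:
    """Step 3: assumed rule, every expected field must be present and non-empty."""
    for name in fields:
        if record.get(name) in (None, ""):
            raise ValueError(f"missing value for {name!r} in {record!r}")


def append_records(folder: Path, records: list[dict], fields: list[str],
                   max_rows: int = MAX_ROWS_PER_FILE) -> None:
    """Step 4: append rows, rolling over to a new csv once max_rows is reached."""
    folder.mkdir(parents=True, exist_ok=True)
    files = _csv_files(folder)
    if files:
        current = files[-1]
        with current.open(newline="") as f:
            n_rows = sum(1 for _ in csv.reader(f)) - 1  # exclude the header
    else:
        current, n_rows = None, max_rows  # forces creation of the first file
    next_index = len(files)
    for record in records:
        validate(record, fields)
        if current is None or n_rows >= max_rows:
            current = folder / f"part_{next_index:06d}.csv"
            next_index += 1
            n_rows = 0
            with current.open("w", newline="") as f:
                csv.DictWriter(f, fieldnames=fields).writeheader()
        # Reopening per row keeps the sketch short; batch the writes in practice.
        with current.open("a", newline="") as f:
            csv.DictWriter(f, fieldnames=fields).writerow(record)
        n_rows += 1


def sync(folder: Path, fetch_new_records, fields: list[str],
         max_rows: int = MAX_ROWS_PER_FILE) -> int:
    """Run steps 1-4 once and return the number of rows appended."""
    last = read_last_record(folder)
    since = int(last["timestamp"]) if last else 0
    records = fetch_new_records(since)  # Step 2: hypothetical subgraph fetch
    append_records(folder, records, fields, max_rows)
    return len(records)
```

Calling `sync` repeatedly only ever appends to the newest file or starts a new one, so earlier csvs are never rewritten.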

DoD

  • Update table.py (and its consumers) so that save/load becomes append to / load from csv

Final Comments

We reviewed updating ohlcv_factory to integrate CSVDataStore, however the CSV writing + OHLCV part is tightly integrated so it's going to require a bit more time. For now, we left ohlcv_factory as-is, and will create a separate ticket for updating it, such that we can continue separating concerns.

@idiom-bytes idiom-bytes added the Type: Enhancement New feature or request label Feb 26, 2024
@idiom-bytes idiom-bytes changed the title [Lake][ETL] - Step 1 - Save To Local [Lake][ETL] - Adjust save-to-local Feb 26, 2024
@idiom-bytes idiom-bytes changed the title [Lake][ETL] - Adjust save-to-local [Lake][ETL] - Update raw save-to-disk Feb 26, 2024
@kdetry kdetry self-assigned this Feb 27, 2024
@idiom-bytes idiom-bytes changed the title [Lake][ETL] - Update raw save-to-disk [Lake][ETL] Update raw save-to-disk Feb 27, 2024
@idiom-bytes

Code merged into #734 - DuckDB E2E PR.
Closing Ticket.
