** This is the original README for LoadKit. Its purpose has changed significantly since then. **
LoadKit is a simple Python-based ETL framework inspired by a discussion about the OpenSpending data warehouse platform.
It is intended to accept tabular input files, such as CSV files, Excel spreadsheets and other formats. The data is kept in a managed file structure locally or uploaded to an S3 bucket together with a JSON metadata file.
Once data has been ingested, it can be processed and turned into a series of Artifacts
, which are transformed versions of the initial resource.
Finally, an Artifact
can be loaded into an automatically generated SQL database table in order to be queried for analytical purposes.
See demo.py
in the project root.
- Decide which bits of the
datapackage
specification this needs to adhere to. - Allow passing in some metadata to aid interpretation of the table.
- Include much more data quality assessment tooling and data validation options.
- Does metadata (e.g. on fields) need to be per-resource instead of package-wide?
- Set up custom exceptions and error handling (invalid URLs and file names, too large, parsing failures, loading failures).
- Think about whether the resulting DB must be denormalized.
- Create a Postgres FTS index when loading the data with sqlalchemy-searchable.