This project aims to build a comprehensive data pipeline for an e-commerce platform, providing hands-on experience with various data engineering tools and technologies. The pipeline will ingest data from multiple sources, process it, and store it in different formats for analysis and reporting.
- The dataset used in this project is the Brazilian E-Commerce Public Dataset by Olist on Kaggle.
- Tables are first downloaded as CSV files using the Kaggle API and stored in MinIO (storage.io), orchestrated by Dagster.
- Tables that have timestamp columns (such as orders, orders_reviews, ...) are then partitioned by month with Dagster Partitions.
- The monthly data is then inserted into and updated in two destinations, PostgreSQL (database.io) and MongoDB (document.io), based on the timestamps of the corresponding events. This simulates real-time data ingestion and emits changes that are captured via CDC with Kafka in later steps (see the Dagster sketch after this list).
- Tables without timestamp columns are ingested directly into the PostgreSQL database.
- This process is orchestrated by Dagster with asyncio.
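
A minimal sketch of what one of these monthly-partitioned Dagster assets could look like is shown below. The bucket name, credentials, connection string, and the plain-append write are illustrative assumptions, not the project's actual code.

```python
# Hypothetical sketch of a monthly-partitioned Dagster asset: bucket names,
# credentials, and the write strategy are illustrative placeholders.
import io

import boto3
import polars as pl
from dagster import AssetExecutionContext, MonthlyPartitionsDefinition, asset

monthly_partitions = MonthlyPartitionsDefinition(start_date="2016-09-01")


@asset(partitions_def=monthly_partitions)
def orders_monthly(context: AssetExecutionContext) -> None:
    """Load one month of the orders CSV from MinIO and write it into PostgreSQL."""
    # Fetch the raw CSV previously staged in MinIO (storage.io).
    s3 = boto3.client(
        "s3",
        endpoint_url="http://storage.io:9000",
        aws_access_key_id="minio",
        aws_secret_access_key="minio123",
    )
    raw = s3.get_object(Bucket="ecommerce", Key="raw/orders.csv")["Body"].read()
    df = pl.read_csv(io.BytesIO(raw))

    # Keep only rows whose purchase timestamp falls inside this partition's month,
    # e.g. partition_key "2017-03-01" -> prefix "2017-03".
    month_prefix = context.partition_key[:7]
    monthly = df.filter(pl.col("order_purchase_timestamp").str.starts_with(month_prefix))

    # Append the slice into PostgreSQL (database.io); a real upsert would use
    # INSERT ... ON CONFLICT instead of a plain append.
    monthly.write_database(
        table_name="orders",
        connection="postgresql://user:password@database.io:5432/ecommerce",
        if_table_exists="append",
    )
    context.log.info(f"Loaded {monthly.height} rows for partition {context.partition_key}")
```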
- Airflow ETL DAGs on a 5-minute cron schedule (to mimic daily jobs) are set up to:
  - Extract daily data from the PostgreSQL and MongoDB instances mentioned in the previous step using Polars, then upload it to another MinIO instance (lakehouse.io) under the `/ecommerce/landing/` prefix.
  - Spin up Spark jobs using the Airflow SparkSubmitOperator to read the extracted CSV files and merge them into the corresponding Iceberg tables in an idempotent manner.
  - Lastly, submit Spark jobs that retrieve records from the Iceberg tables and merge them into another PostgreSQL instance (warehouse.io), so that BI applications such as Apache Superset or Power BI can easily integrate with it and build analytic dashboards. (See the sketches after this list.)
- The process is orchestrated by Airflow.
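
A rough sketch of one such Airflow 2-style DAG is shown below. The connection strings, bucket layout, extract query, and the Spark script path are hypothetical placeholders.

```python
# Hypothetical Airflow DAG sketch: extract with Polars, land in MinIO, then spark-submit.
# Connection URIs, bucket names, and file paths are illustrative placeholders.
import io

import boto3
import polars as pl
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from pendulum import datetime


def extract_orders_to_landing(ds: str, **_) -> None:
    """Pull one day of orders from PostgreSQL and upload it as CSV to lakehouse.io."""
    df = pl.read_database_uri(
        query=f"SELECT * FROM orders WHERE order_purchase_timestamp::date = '{ds}'",
        uri="postgresql://user:password@database.io:5432/ecommerce",
    )
    buf = io.BytesIO()
    df.write_csv(buf)
    s3 = boto3.client(
        "s3",
        endpoint_url="http://lakehouse.io:9000",
        aws_access_key_id="minio",
        aws_secret_access_key="minio123",
    )
    s3.put_object(Bucket="ecommerce", Key=f"landing/orders/{ds}.csv", Body=buf.getvalue())


with DAG(
    dag_id="etl_orders",
    schedule="*/5 * * * *",  # 5-minute cron to mimic a daily job
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_to_landing",
        python_callable=extract_orders_to_landing,
    )
    merge_to_iceberg = SparkSubmitOperator(
        task_id="merge_to_iceberg",
        application="/opt/spark-jobs/merge_orders_iceberg.py",  # hypothetical script path
        conn_id="spark_default",
        conf={"spark.sql.catalog.lakehouse": "org.apache.iceberg.spark.SparkCatalog"},
    )
    extract >> merge_to_iceberg
```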
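
The Spark side could look roughly like the sketch below, using Spark SQL's `MERGE INTO` support for Iceberg so reruns stay idempotent, followed by a JDBC write into warehouse.io. Catalog, table, column names, and endpoints are assumptions.

```python
# Hypothetical PySpark sketch: idempotent merge of landed CSVs into Iceberg,
# then publishing the Iceberg table to the warehouse PostgreSQL over JDBC.
# Catalog/table/column names and endpoints are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge_orders_iceberg").getOrCreate()

# Read the CSVs landed by the Airflow extract task.
landing = spark.read.option("header", True).csv("s3a://ecommerce/landing/orders/")
landing.createOrReplaceTempView("orders_landing")

# MERGE INTO keeps the job idempotent: re-processing the same files
# updates existing rows instead of duplicating them.
spark.sql("""
    MERGE INTO lakehouse.ecommerce.orders AS t
    USING orders_landing AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Publish the Iceberg table to the warehouse PostgreSQL for BI tools.
# Overwrite is used here for simplicity; a true merge would stage the data
# and apply INSERT ... ON CONFLICT on the warehouse side.
(
    spark.table("lakehouse.ecommerce.orders")
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse.io:5432/warehouse")
    .option("dbtable", "public.orders")
    .option("user", "user")
    .option("password", "password")
    .mode("overwrite")
    .save()
)

spark.stop()
```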
- Changes made in the Data Generation sources (database.io and document.io) are captured by Kafka Connect with Debezium connectors; the CDC messages are then delivered to a Strimzi Kafka cluster.
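
For illustration, a Debezium PostgreSQL source connector could be registered against the Kafka Connect REST API roughly as below (Strimzi alternatively supports declaring it as a `KafkaConnector` custom resource). The hostnames, credentials, topic prefix, and table list are assumptions.

```python
# Hypothetical sketch: registering a Debezium PostgreSQL connector via the
# Kafka Connect REST API. Hosts, credentials, and table list are placeholders.
import requests

connector = {
    "name": "ecommerce-postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "database.io",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "debezium",
        "database.dbname": "ecommerce",
        "topic.prefix": "ecommerce",              # Debezium 2.x topic naming
        "table.include.list": "public.orders,public.order_reviews",
        "plugin.name": "pgoutput",                # logical decoding plugin
    },
}

resp = requests.post(
    "http://kafka-connect.kafka.svc:8083/connectors",  # hypothetical Connect endpoint
    json=connector,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```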
- From there, messages in the Kafka topics are consumed with Flink (Table API) and ingested into an OpenSearch instance under the corresponding indices.
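
A rough PyFlink Table API sketch of that Kafka-to-OpenSearch path is shown below. It assumes the Kafka and OpenSearch SQL connector JARs are on the Flink classpath; the topic, hosts, index, and field names are placeholders.

```python
# Hypothetical PyFlink Table API sketch: read Debezium CDC messages from Kafka
# and upsert them into an OpenSearch index. Topics, hosts, and fields are
# placeholders; the Kafka and OpenSearch SQL connectors must be on the classpath.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        order_id STRING,
        order_status STRING,
        order_purchase_timestamp STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ecommerce.public.orders',
        'properties.bootstrap.servers' = 'kafka-bootstrap.kafka.svc:9092',
        'properties.group.id' = 'flink-opensearch-sink',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE orders_index (
        order_id STRING,
        order_status STRING,
        order_purchase_timestamp STRING,
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'opensearch',
        'hosts' = 'http://opensearch.svc:9200',
        'index' = 'orders'
    )
""")

# Continuously upsert CDC changes into the OpenSearch index.
t_env.execute_sql("INSERT INTO orders_index SELECT * FROM orders_cdc").wait()
```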
- AWS DMS is set up to migrate tables from PostgreSQL to an RDS PostgreSQL instance, while documents in MongoDB are migrated to an S3 bucket.
- The Glue Data Catalog and Glue ETL jobs are used to extract data from the above step and load it into a Redshift data warehouse instance.
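
A Glue ETL job for that load could be sketched roughly as below; the catalog database, table, Glue connection, and S3 temp directory names are assumptions.

```python
# Hypothetical AWS Glue ETL job sketch: read a table registered in the Glue Data
# Catalog and load it into Redshift. Database, table, connection, and bucket
# names are illustrative placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: table crawled into the Glue Data Catalog from the DMS output.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="ecommerce",
    table_name="orders",
)

# Sink: Redshift via a Glue connection, staging through S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=orders,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.orders", "database": "warehouse"},
    redshift_tmp_dir="s3://ecommerce-glue-temp/orders/",
)

job.commit()
```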
- With data in Redshift, larger datasets can be queried and processed without the limitations of using a local PostgreSQL instance as the data warehouse.
- Superset is used to create analytic dashboards on top of PostgreSQL (warehouse.io) as the data warehouse.
- Services are deployed on a local Kubernetes cluster of 2 physical machines, with the help of various tools such as Docker, Kubeadm, K9s, Helm, Helmfile, Taskfile, and Bash scripts.
- Terraform with a Terragrunt wrapper is used to deploy resources on AWS as well as Cloudflare.
- Jupyter Notebooks are used to test implementations and develop data pipelines, as the interactive environment provides faster feedback loops and therefore faster development.
- DataHub is deployed as a proof of concept to manage, discover, and govern all datasets in the data stack, as well as their data lineage.
- Screenshots of the project can be found under the `screenshots` directory.