The final project of the course involves building a data lake from educational data stored in a MongoDB database and data obtained from the IBGE (Brazilian Institute of Geography and Statistics) API. The processed and filtered data is then used to build a data warehouse for answering business questions.
To better divide the stages, two DAGs were created: `get_ibge_data.py` and `get_pnadc_data.py`.
- Table Construction: For the ETL stages, two databases are used: RDS Postgres and Redshift. Therefore, before starting the DAGs, it is necessary to create the tables that will receive the data in these two databases.
  - Tables `ibge_data` and `people_info`: these are in this script; they are the tables used to save the treated data in the RDS instance.
  - Tables `filtered_people_info_mulheres_adultas`, `ibge_data_cities`, and `ibge_data_states`: these are in this script; they are the tables used to save the treated and filtered data in Redshift.
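As a point of reference only, the snippet below is a minimal sketch of how one of the RDS tables could be created from Python. The real DDL lives in the scripts mentioned above; the column names and types here are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical sketch: creating one of the RDS tables with psycopg2.
# The real DDL is in the project's queries folder; the columns below are assumptions.
import psycopg2

CREATE_IBGE_DATA = """
CREATE TABLE IF NOT EXISTS ibge_data (
    id SERIAL PRIMARY KEY,
    district_name TEXT,       -- assumed column
    city_name TEXT,           -- assumed column
    state_abbreviation TEXT,  -- assumed column
    region TEXT               -- assumed column
);
"""

conn = psycopg2.connect(
    host="<rds-endpoint>", dbname="<database>",
    user="<user>", password="<password>",
)
with conn, conn.cursor() as cur:  # commits on success
    cur.execute(CREATE_IBGE_DATA)
conn.close()
```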
- `get_ibge_data.py`
  - `request_data` and `get_ibge_data`: First, the IBGE API is requested at the `/localidades/distritos` endpoint; then `get_data` converts the JSON response into a CSV file (a minimal sketch of this step is shown after this list).
  - `upload_to_s3`: With the raw data in CSV, it is uploaded to the `ibgebucket-data-pnadc` bucket, in the `ibge/raw` folder. This is the bronze layer, where the data is still untreated.
  - `clean_data`: Here the data treatment starts. Since the data consists of strings, everything is lowercased, accents are removed, and leading and trailing spaces are stripped. `none` values are replaced with `''`. Finally, a new CSV is created with the treated data (see the cleaning sketch after this list).
  - `upload_clean_data_to_s3` and `transfer_s3_to_rds`: The treated data then goes to two locations:
    - S3: the `ibgebucket-data-pnadc` bucket, in the `ibge/processed` folder. This is the silver layer; the data is treated and can be used for future queries.
    - RDS Postgres instance: to facilitate future queries and filtering, the data is transferred from S3 into the `ibge_data` table (see the S3-to-RDS sketch after this list).
  - `transfer_rds_cities_data_to_s3` and `transfer_rds_states_data_to_s3`: This is where the construction of the gold layer begins, with the filtered data. Location data is split into cities and states. City data contains: `city name`, `state abbreviation`, `main region`, and `micro-region`. State data contains: `state`, `state abbreviation`, `region`, and `region abbreviation`. In S3, these files are saved as CSV in the `ibge/filter` folder.
  - `transfer_s3_to_redshift_cities` and `transfer_s3_to_redshift_states`: With the data treated and filtered, it can be transferred to the data warehouse in Redshift. City data goes into the `ibge_data_cities` table and state data into the `ibge_data_states` table (see the Redshift transfer sketch after this list).
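For reference, here is a minimal sketch of the `request_data`/`get_data` idea, assuming the flattening is done with `pandas.json_normalize`; the real task code may differ.

```python
# Sketch: call the IBGE districts endpoint and flatten the JSON response to CSV.
# pandas.json_normalize flattens the nested objects generically, so no field
# names need to be hard-coded here.
import pandas as pd
import requests

BASE_URL = "http://servicodados.ibge.gov.br/api/v1"

def ibge_districts_to_csv(output_path: str = "/tmp/ibge_raw.csv") -> str:
    response = requests.get(f"{BASE_URL}/localidades/distritos", timeout=60)
    response.raise_for_status()
    df = pd.json_normalize(response.json(), sep="_")
    df.to_csv(output_path, index=False)
    return output_path
```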
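The string treatment described in `clean_data` (and reused by `clean_data_mod_rows` in the other DAG) can be sketched roughly as below; the function names are illustrative.

```python
# Sketch of the cleaning rules: lowercase, strip accents, trim whitespace,
# and replace missing values with ''.
import unicodedata

import pandas as pd

def strip_accents(text: str) -> str:
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))

def clean_strings(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.select_dtypes(include="object").columns:
        df[col] = (
            df[col]
            .fillna("")          # none values become ''
            .astype(str)
            .str.lower()         # lowercase everything
            .str.strip()         # trim leading/trailing spaces
            .map(strip_accents)  # remove accents
        )
    return df
```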
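A sketch of the `transfer_s3_to_rds` idea, assuming the Amazon and Postgres Airflow providers are installed and the connection IDs match the ones configured later in this README; the S3 key is an illustrative placeholder.

```python
# Sketch: download the processed CSV from S3 and COPY it into the ibge_data
# table on RDS. Assumes the aws_conn and postgres_conn connections exist.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook

def s3_csv_to_rds(
    bucket: str = "ibgebucket-data-pnadc",
    key: str = "ibge/processed/ibge_data.csv",  # assumed file name
    table: str = "ibge_data",
) -> None:
    s3 = S3Hook(aws_conn_id="aws_conn")
    local_path = s3.download_file(key=key, bucket_name=bucket)
    pg = PostgresHook(postgres_conn_id="postgres_conn")
    pg.copy_expert(f"COPY {table} FROM STDIN WITH CSV HEADER", local_path)
```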
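And a sketch of how one of the Redshift loads could be wired with the provider's `S3ToRedshiftOperator`; the DAG id, schema, S3 key, and copy options are assumptions for illustration.

```python
# Sketch: load the filtered cities CSV from S3 into the ibge_data_cities table.
import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="example_s3_to_redshift",  # hypothetical DAG id, for illustration only
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
):
    transfer_s3_to_redshift_cities = S3ToRedshiftOperator(
        task_id="transfer_s3_to_redshift_cities",
        schema="public",                         # assumed schema
        table="ibge_data_cities",
        s3_bucket="ibgebucket-data-pnadc",
        s3_key="ibge/filter/cities.csv",         # assumed file name
        redshift_conn_id="redshift_conn",
        aws_conn_id="aws_conn",
        copy_options=["CSV", "IGNOREHEADER 1"],
    )
```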
- `get_pnadc_data.py`
  - `get_mongodb_data`: Here we connect to MongoDB and collect the stored data (see the MongoDB sketch after this list).
  - `upload_raw_data_to_s3`: The data collected from MongoDB is then uploaded to S3 in the `ibgebucket-data-pnadc` bucket, in the `pnadc_data_2023/raw` folder (bronze layer).
  - `get_s3_raw_data` and the `cleans` group: The data from the raw layer is downloaded and treated. The `cleans` group contains two tasks:
    - `clean_data_rename_cols`: The original column names are renamed to `occupation`, `work`, `hours_worked`, and `schooling_years`, and the data type of each column is defined (see the renaming sketch after this list):
      - strings: state, sex, color, graduation, work, and occupation
      - int: year, quarter, and age
      - float: schooling_years, hours_worked, and income
    - `clean_data_mod_rows`: Here the string data is treated: all letters are lowercased, accents are removed, and leading and trailing spaces are stripped. `none` values are replaced with `''`. Finally, a new CSV is created with the treated data.
  - `upload_clean_data_to_s3`: The treated data is then uploaded to S3 in the `pnadc_data_2023/processed` folder (silver layer).
  - `transfer_s3_to_rds`: The treated data is then transferred to a Postgres database, into the `people_info` table.
  - `transfer_rds_data_to_s3`: Here the gold layer is built by filtering the data present in RDS. For the data warehouse, it is necessary to collect data on women aged between 20 and 40. The filtered data is saved as CSV in the `pnadc_data_2023/filter` folder (see the filtering sketch after this list).
  - `transfer_s3_to_redshift`: Finally, the treated and filtered data is transferred to the `filtered_people_info_mulheres_adultas` table in Redshift for querying.
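For reference, the `get_mongodb_data` step could be sketched with `pymongo` as below; the URI, database, and collection names are placeholders, not the project's actual values.

```python
# Sketch: read all documents from the PNADC collection and save them as CSV.
import pandas as pd
from pymongo import MongoClient

def mongodb_to_csv(output_path: str = "/tmp/pnadc_raw.csv") -> str:
    client = MongoClient("mongodb+srv://<user>:<password>@<host>/")  # placeholder URI
    collection = client["<database>"]["<collection>"]                # placeholder names
    df = pd.DataFrame(list(collection.find({}, {"_id": 0})))         # drop Mongo's _id
    df.to_csv(output_path, index=False)
    client.close()
    return output_path
```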
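The `clean_data_rename_cols` step can be sketched as below; the keys of the rename map are placeholders, since the original source column names are not listed here.

```python
# Sketch: rename the source columns and enforce the dtypes described above.
import pandas as pd

RENAME_MAP = {
    "<original_occupation_col>": "occupation",          # placeholder source names
    "<original_work_col>": "work",
    "<original_hours_worked_col>": "hours_worked",
    "<original_schooling_years_col>": "schooling_years",
}

DTYPES = {
    "state": "string", "sex": "string", "color": "string",
    "graduation": "string", "work": "string", "occupation": "string",
    "year": "int64", "quarter": "int64", "age": "int64",
    "schooling_years": "float64", "hours_worked": "float64", "income": "float64",
}

def rename_and_cast(df: pd.DataFrame) -> pd.DataFrame:
    return df.rename(columns=RENAME_MAP).astype(DTYPES)
```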
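And a sketch of the `transfer_rds_data_to_s3` filter. It assumes the cleaned `sex` column stores the lowercase value `'mulher'` and uses an illustrative S3 key; both are assumptions, not confirmed project values.

```python
# Sketch: query women aged 20-40 from people_info on RDS, save the result as
# CSV, and upload it to the gold-layer folder in S3.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook

FILTER_QUERY = """
    SELECT *
    FROM people_info
    WHERE sex = 'mulher'          -- assumed value stored after cleaning
      AND age BETWEEN 20 AND 40
"""

def filter_people_info_to_s3(bucket: str = "ibgebucket-data-pnadc") -> None:
    pg = PostgresHook(postgres_conn_id="postgres_conn")
    df = pg.get_pandas_df(FILTER_QUERY)
    local_path = "/tmp/filtered_people_info_mulheres_adultas.csv"
    df.to_csv(local_path, index=False)
    S3Hook(aws_conn_id="aws_conn").load_file(
        filename=local_path,
        key="pnadc_data_2023/filter/mulheres_adultas.csv",  # assumed file name
        bucket_name=bucket,
        replace=True,
    )
```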
- To start the Airflow instances, just run `docker compose up -d`.
- In Airflow, you need to configure the following connections (a scripted example is sketched after this list):
  - `ibge_api`: Defines the HTTP connection used to access the IBGE API. The URL used is http://servicodados.ibge.gov.br/api/v1/
  - `aws_conn`: Here it is necessary to define your AWS access credentials (`aws_access_key_id` and `aws_secret_access_key`). Remember that your AWS user needs access to the services used here: S3, Redshift, and RDS.
  - `postgres_conn`: Here it is necessary to define the `host`, `login`, `password`, and `schema` of the Postgres database.
  - `mongodb_conn`: Here it is necessary to define the `host`, `login`, and `password`. As the MongoDB server uses an SRV connection string, it is also necessary to add `{"srv": true}` in `Extra`.
  - `redshift_conn`: Here it is necessary to define the `host`, `login`, `password`, and `schema`.
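If you prefer to script the connections instead of creating them in the Airflow UI, one possible approach is sketched below for `mongodb_conn`; host, login, and password are placeholders.

```python
# Sketch: register the mongodb_conn connection programmatically.
import json

from airflow import settings
from airflow.models import Connection

mongodb_conn = Connection(
    conn_id="mongodb_conn",
    conn_type="mongo",
    host="<cluster-host>",   # placeholder values
    login="<user>",
    password="<password>",
    extra=json.dumps({"srv": True}),  # needed because the server uses an SRV URI
)

session = settings.Session()
if not session.query(Connection).filter_by(conn_id=mongodb_conn.conn_id).first():
    session.add(mongodb_conn)
    session.commit()
session.close()
```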
- Remember to run the scripts that create the tables in Redshift and Postgres; they are available in the `queries` folder.
- Now just run the DAGs! Their order does not matter, since the two DAGs pull from different data sources.