The objective of this project is to transform and model World Development Indicators (WDI) datasets using PySpark, ensuring data quality and readiness for analysis. The project involves:
- Data Exploration and Transformation: Analyzing and transforming the datasets (WDICountry.csv, WDISeries.csv, WDIData.csv).
- Data Quality Checks: Performing data quality checks and resolving any issues.
- Data Output: Writing the cleaned data to CSV files.
- Time Series Analysis: Pivoting time series data and analyzing metrics for insights.
The World Development Indicators can be accessed through the World Bank's Data Catalog: https://datacatalog.worldbank.org/search/dataset/0037712/World-Development-Indicators
The World Development Indicators (WDI) from the World Bank provide comprehensive data on economic, social, and environmental metrics for more than 200 countries. They include indicators such as GDP, health, education, and infrastructure, sourced from international and national agencies. This data supports global development analysis and policymaking.
Cellular and Broadband Penetration Analysis:
We aim to measure cellular and broadband penetration in comparison to population demographics for each country. Additionally, we seek insights on annual global aggregates.
Regional Metrics Exploration:
Becky finds the regional metrics interesting and wants to explore these metrics at a country level for each year. Can you adapt the regional pivot computed earlier to get the metrics for each country by year?
Business Environment Analysis:
Kat wants to identify the countries that are conducive to starting a business. She is interested in the most recent metrics for the following indicators:
- Gross National Income (GNI)
- Cost of business start-up procedures
- Number of days required to start a business
- Number of start-up procedures to register a business
- GDP
- GDP per capita
- Business Regulatory Environment
- Ease of doing business index (available only for 2017)
The data should be written to a CSV file.