This project is a complete data pipeline for scraping coffee reviews from CoffeeReview.com, followed by data cleaning, transformation, and analysis. The data is augmented with additional external datasets (e.g., consumer price index, exchange rates) and analyzed to explore trends in coffee quality.
The project involves:
- Web Scraping: Using Python to scrape coffee reviews and associated metadata and save to CSV.
- Data Cleaning: Cleaning and normalizing the scraped data, handling missing values, and augmenting it with external information.
- Analysis: Generating visualizations and insights into coffee characteristics, quality scores, and tasting notes.
The goal is to provide insights into the boutique coffee market, with a focus on origin, price, flavor notes and quality metrics.
.
├── LICENSE
├── README.md
├── coffee
│ ├── __init__.py
│ ├── async_parser.py
│ ├── async_review_scraper.py
│ ├── async_url_scraper.py
│ ├── config.py
│ ├── data_cleaning.py
│ ├── test_html
│ └── utils.py
├── data
│ ├── external
│ ├── intermediate
│ ├── processed
│ └── raw
├── docs
├── imgs
├── main.py
├── notebooks
│ ├── 1-data-cleaning.ipynb
│ ├── 2-data-EDA.ipynb
│ ├── 3-text-features.ipynb
│ └── wordcloud.png
├── notes
├── pyproject.toml
├── requirement-dev.txt
├── requirements.txt
├── scripts
│ ├── archive
│ └── openex.py
├── tests
- Python 3.11+
- Virtual environment (e.g. venv or pyenv)
- Git (optional)
- Clone this repository:
  git clone https://github.com/tynardone/coffee-review-scraper.git
  cd coffee-review-scraper
  Or just download the files from the repository.
- Create and activate a virtual environment:
  python -m venv venv
  source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
- Obtain free API keys:
  If you want to run the data cleaning step, you will need two API keys, both available with free tiers.
  Add the API keys to your environment or to a .env file:
  OPENEXCHANGERATES_API_KEY =
  GEOCODE_API_KEY =
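As a minimal sketch of how the two keys above might be read and validated before running the cleaning step: the function name `load_api_keys` and the `env` parameter are hypothetical and not taken from the repository. It reads only from the process environment; if the keys live in a .env file, something has to load that file into the environment first (for example a package such as python-dotenv, which is an assumption about the setup, not something this README confirms).

```python
import os

# Key names as listed in this README.
REQUIRED_KEYS = ("OPENEXCHANGERATES_API_KEY", "GEOCODE_API_KEY")


def load_api_keys(env=None):
    """Return the required API keys, raising if any is missing or empty.

    `env` is any mapping of variable names to values; it defaults to
    os.environ. (Hypothetical helper, for illustration only.)
    """
    env = os.environ if env is None else env
    # Treat unset and whitespace-only values the same way.
    keys = {name: (env.get(name) or "").strip() for name in REQUIRED_KEYS}
    missing = [name for name, value in keys.items() if not value]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return keys
```

Failing early with the names of the missing keys is usually easier to debug than an authentication error partway through a long cleaning run.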
- CoffeeReview.com
  Source of the coffee roast review dataset and the target of the web scraper. The site has operated since 1997 and has amassed thousands of blind-tasting reviews of coffee roasts from around the world. The raw scraped data requires significant cleanup.
- OpenExchangeRates
  Provider of historical and up-to-date currency exchange rates. Used to convert price data to a single currency. The free API tier is limited to 1000 requests per month.
- Geocoding API
  A free geocoding API from Map Maker. Geocoding is the process of converting addresses into latitude and longitude coordinates. It is used here to obtain coordinates for roasters and origin locations for potential future spatial analysis or visualization.
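To illustrate the currency conversion step, here is a small sketch that converts a price to US dollars from a table of exchange rates. It assumes the rates are quoted as units of foreign currency per 1 USD, which is how OpenExchangeRates reports them with its default USD base. The function name `to_usd` and the sample rates are made up for illustration and are not taken from the repository; no network request is made.

```python
def to_usd(amount, currency, rates):
    """Convert `amount` in `currency` to USD.

    `rates` maps currency codes to units of that currency per 1 USD
    (assumed USD-based rates). Hypothetical helper for illustration.
    """
    if currency == "USD":
        return round(amount, 2)
    try:
        rate = rates[currency]
    except KeyError:
        raise ValueError(f"No exchange rate for {currency!r}") from None
    # Units per USD, so divide to get dollars.
    return round(amount / rate, 2)


# Made-up rates, not real market data.
sample_rates = {"CAD": 1.25, "TWD": 30.0}
print(to_usd(25.0, "CAD", sample_rates))   # 20.0
print(to_usd(600.0, "TWD", sample_rates))  # 20.0
```

Given the 1000 requests per month limit, one plausible design is to fetch a single rate table per distinct review date and reuse it for every review from that date, rather than requesting rates per review.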
- async_scrape_roast_reviews.py
- async_scrape_roast_urls.py
- json_to_csv.py
- openex.py
- review_parse.py
https://docs.openexchangerates.org/reference/api-introduction
Consumer Price Index for All Urban Consumers (CPI-U), not seasonally adjusted: all items in U.S. city average.
consumer_price_index.csv
https://www.bls.gov/cpi/data.htm
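This README does not spell out how the CPI series is applied, but a common use is to express prices from different years in constant dollars. The sketch below shows that standard adjustment, scaling a price by the ratio of the target-period CPI to the CPI at the time of the review. The function name and the index values are hypothetical and are not read from consumer_price_index.csv.

```python
def adjust_for_inflation(price, cpi_then, cpi_target):
    """Express `price` from an earlier period in target-period dollars.

    Uses the standard ratio: price * (CPI in target period / CPI then).
    Hypothetical helper for illustration.
    """
    if cpi_then <= 0:
        raise ValueError("cpi_then must be positive")
    return round(price * cpi_target / cpi_then, 2)


# Made-up index values, not actual BLS figures.
print(adjust_for_inflation(10.0, cpi_then=200.0, cpi_target=300.0))  # 15.0
```

If prices are first converted to USD, applying the U.S. CPI-U afterwards keeps both adjustments in the same currency.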
- Scrape and parse
- Data cleaning
- Cleaning and reconciliation in OpenRefine
- Feature engineering from text