Coffee Review Scraper and Analysis


This project is a complete data pipeline for scraping coffee reviews from CoffeeReview.com, followed by data cleaning, transformation, and analysis. The data is augmented with additional external datasets (e.g., consumer price index, exchange rates) and analyzed to explore trends in coffee quality.

Table of Contents

  • Project Overview
  • Directory Structure
  • Installation
  • Data Sources
  • Scripts

Project Overview

The project involves:

  1. Web Scraping: Using Python to scrape coffee reviews and their associated metadata, saving the results to CSV.
  2. Data Cleaning: Processing the scraped data: handling missing values, performing extensive cleaning and normalization, and augmenting the reviews with external data.
  3. Analysis: Generating visualizations and insights into coffee characteristics, quality scores, and tasting notes.

The goal is to provide insights into the boutique coffee market, with a focus on origin, price, flavor notes, and quality metrics.
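As a minimal sketch of the scrape-and-save step, parsed reviews can be written to CSV with the standard library (the field names below are illustrative, not the project's actual schema):

```python
import csv

def write_reviews_csv(reviews: list[dict], path: str) -> None:
    """Write a list of parsed review dicts to a CSV file."""
    if not reviews:
        return
    fieldnames = list(reviews[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(reviews)

# Two hypothetical review records for illustration
reviews = [
    {"roaster": "Example Roasters", "rating": 94, "price": "$18.00/12 oz"},
    {"roaster": "Another Roaster", "rating": 91, "price": "$15.50/12 oz"},
]
write_reviews_csv(reviews, "reviews.csv")
```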

Directory Structure

.
├── LICENSE
├── README.md
├── coffee
│   ├── __init__.py
│   ├── async_parser.py
│   ├── async_review_scraper.py
│   ├── async_url_scraper.py
│   ├── config.py
│   ├── data_cleaning.py
│   ├── test_html
│   └── utils.py
├── data
│   ├── external
│   ├── intermediate
│   ├── processed
│   └── raw
├── docs
├── imgs
├── main.py
├── notebooks
│   ├── 1-data-cleaning.ipynb
│   ├── 2-data-EDA.ipynb
│   ├── 3-text-features.ipynb
│   └── wordcloud.png
├── notes
├── pyproject.toml
├── requirement-dev.txt
├── requirements.txt
├── scripts
│   ├── archive
│   └── openex.py
└── tests

Installation

Prerequisites

  • Python 3.11+
  • Virtual environment (e.g. venv or pyenv)
  • Git (optional)

Setup

  1. Clone this repository:

    git clone https://github.com/tynardone/coffee-review-scraper.git
    cd coffee-review-scraper

    Alternatively, download the files directly from the repository.

  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Obtain free API keys:

    Running the data cleaning steps requires two API keys, one for OpenExchangeRates and one for the geocoding API (see Data Sources below). Both are available with free tiers.

    Add the keys to your environment or to a .env file:

    OPENEXCHANGERATES_API_KEY=<your key>
    GEOCODE_API_KEY=<your key>
    
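A small sketch of how the keys might be read at runtime. The variable names match the .env entries above; the minimal .env parsing is a simplified illustration, not the project's actual loader:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: set KEY=VALUE lines into os.environ if unset."""
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env is fine; keys may already be in the environment

def require_key(name: str) -> str:
    """Fetch a required key from the environment, failing loudly if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required API key: {name}")
    return value
```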

Data Sources

  • CoffeeReview.com

    The source of the coffee roast review dataset and the target of the web scraper. Operating since 1997, the site has amassed thousands of blind-taste reviews of coffee roasts from around the world. The raw scraped data requires significant cleanup.

  • OpenExchangeRates

    Provider of historical and up-to-date currency exchange rates, used to convert price data to a single currency. The free tier is limited to 1,000 API requests per month.
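The historical endpoint returns USD-based rates for a given date, so converting a local price to USD divides by that currency's rate. A sketch (the endpoint path follows the OpenExchangeRates docs; the conversion helper is illustrative):

```python
def historical_url(date: str, app_id: str) -> str:
    """Build the OpenExchangeRates historical endpoint URL (date as YYYY-MM-DD)."""
    return f"https://openexchangerates.org/api/historical/{date}.json?app_id={app_id}"

def to_usd(amount: float, currency: str, usd_rates: dict[str, float]) -> float:
    """Convert an amount in `currency` to USD given USD-based rates
    (e.g. rates["EUR"] == 0.9 means 1 USD buys 0.9 EUR)."""
    if currency == "USD":
        return amount
    return amount / usd_rates[currency]
```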

  • Geocoding API

    A free geocoding API from Map Maker. Geocoding is the process of converting addresses into latitude and longitude coordinates. This is done to provide coordinates of roasters and origin locations for potential future spatial analysis or visualization.

Scripts

  • async_scrape_roast_reviews.py
  • async_scrape_roast_urls.py
  • json_to_csv.py
  • openex.py
  • review_parse.py

Historical Exchange Rates

https://docs.openexchangerates.org/reference/api-introduction

US Consumer Price Index

Consumer Price Index for All Urban Consumers (CPI-U): all items in U.S. city average, not seasonally adjusted.

consumer_price_index.csv

https://www.bls.gov/cpi/data.htm
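CPI values let nominal prices from different years be restated in a common base year's dollars (the CPI numbers below are made up for illustration):

```python
def adjust_for_inflation(price: float, cpi_then: float, cpi_base: float) -> float:
    """Re-express a nominal price in base-year dollars using the CPI ratio."""
    return price * (cpi_base / cpi_then)

# e.g. a $10.00 price when CPI was 200, restated at a base-year CPI of 250:
# 10.00 * (250 / 200) = 12.50
```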

Workflow

  1. Scrape and parse
  2. Data cleaning
  3. Cleaning and reconciliation in OpenRefine
  4. Feature engineering from text
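The text feature step can be sketched with a toy bag-of-words count over a tasting-note string (the tokenizer and example note are illustrative, not the project's actual feature pipeline):

```python
import re
from collections import Counter

def token_counts(text: str) -> Counter:
    """Lowercase word counts from a tasting-note string (toy tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

notes = "Dark chocolate, cherry, and dark caramel in aroma and cup."
counts = token_counts(notes)
```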

t-SNE

https://distill.pub/2016/misread-tsne/
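A minimal scikit-learn sketch of the t-SNE projection on a random placeholder matrix (the real input would be the text-derived feature matrix; note that perplexity must be smaller than the sample count):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((20, 10))  # placeholder: 20 samples, 10 features

# Project to 2-D for plotting; perplexity must be < n_samples
embedding = TSNE(
    n_components=2, perplexity=5.0, random_state=0, init="pca"
).fit_transform(X)
```

As the linked Distill article cautions, t-SNE cluster sizes and distances can mislead, so results should be read qualitatively.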
