Platform: https://comment-detector.com
- Aim 1: Running the web crawler every 10 minutes, orchestrated by Apache Airflow on Docker
- Aim 2: Cleaning and extracting data
- On: GCP Compute Engine
- Aim: Deploying Redis as a cache system that stores data updated by Python scripts scheduled with APScheduler
- On: GCP Compute Engine
- Aim: Storing cleaned data and serving it to the platform
- On: MongoDB Atlas
- Aim: Retrieving data from MongoDB and presenting it to users in an organized form
- On: a container image managed by Cloud Run
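
The cache component described above can be pictured as a scheduled job that precomputes a payload and overwrites a key, while the dashboard only reads that key. The following is a minimal sketch under stated assumptions: `DictCache`, `refresh_cache`, `read_cache`, `compute_trend`, and the key `trend:summary` are hypothetical names, and the in-memory class merely stands in for a Redis client so the example runs without a server.

```python
import json


class DictCache:
    """In-memory stand-in exposing only the `set`/`get` subset used below."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Store bytes, mimicking a cache client that returns raw bytes on read.
        self._data[key] = value.encode() if isinstance(value, str) else value

    def get(self, key):
        return self._data.get(key)  # None on a miss


def refresh_cache(cache, key, compute):
    """Job body: recompute a dashboard payload and overwrite the cached copy."""
    cache.set(key, json.dumps(compute()))


def read_cache(cache, key, default=None):
    """Dashboard side: read the precomputed payload without querying the database."""
    raw = cache.get(key)
    return default if raw is None else json.loads(raw)


def compute_trend():
    # Hypothetical placeholder for an aggregation over the stored data.
    return {"articles": 120, "comments": 3400}


cache = DictCache()
refresh_cache(cache, "trend:summary", compute_trend)  # a scheduler would call this periodically
summary = read_cache(cache, "trend:summary")
print(summary)
```

In a deployment, a `redis.Redis` client could presumably replace `DictCache`, since it also offers `set` and `get`, and an APScheduler job could call `refresh_cache` on an interval. The actual refresh interval and key layout are not specified here.
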
- PTT - Gossiping, which has the largest number of users.
- PTT - HatePolitics, which is closely tied to politics.
- Extract: web crawling (requests + Beautiful Soup)
- Transform: Python (data cleaning and extraction)
- Load: MongoDB
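
To illustrate the Transform step, here is a minimal sketch of cleaning one crawled comment into a document ready for loading. It assumes the crawler has already extracted four raw strings per comment; the field names (`tag`, `userid`, `content`, `ip_datetime`) and the functions `clean_comment` and `transform` are hypothetical, not the project's actual schema.

```python
import re
from typing import Optional

# Patterns for an optional IPv4 address and an "MM/DD HH:MM" timestamp.
IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")
TIME_RE = re.compile(r"(\d{2}/\d{2})\s+(\d{2}:\d{2})")


def clean_comment(raw: dict) -> Optional[dict]:
    """Normalize one raw comment; return None if it has no commenter."""
    commenter = raw.get("userid", "").strip()
    if not commenter:
        return None
    content = raw.get("content", "").strip()
    # Comment bodies are assumed to arrive with a leading ":" separator.
    if content.startswith(":"):
        content = content[1:].strip()
    meta = raw.get("ip_datetime", "")
    ip_match = IP_RE.search(meta)
    time_match = TIME_RE.search(meta)
    return {
        "tag": raw.get("tag", "").strip(),
        "commenter": commenter,
        "content": content,
        "ip": ip_match.group(1) if ip_match else None,
        "date": time_match.group(1) if time_match else None,
        "time": time_match.group(2) if time_match else None,
    }


def transform(raw_comments: list) -> list:
    """Clean every comment and drop the ones that could not be parsed."""
    cleaned = (clean_comment(raw) for raw in raw_comments)
    return [doc for doc in cleaned if doc is not None]


raw_comments = [
    {"tag": "推 ", "userid": "alice ", "content": ": good point",
     "ip_datetime": " 1.2.3.4 05/01 12:34\n"},
    {"tag": "→ ", "userid": "", "content": ": orphan row", "ip_datetime": ""},
]
documents = transform(raw_comments)
print(documents)
```

The resulting list of dictionaries could then be handed to the Load step, for example through pymongo's `insert_many` on a collection.
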
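- Trend (trend analysis):
  - Shows data volume statistics
  - Shows trending keywords and articles
- Keywords (keyword analysis):
  - After the user enters a keyword of interest, the dashboard shows popular articles related to it
  - Identifies the top 20 commenters by comment count and the relationships among them (Concurrency Analysis)
- Commenter (commenter analysis):
  - After the user enters a commenter of interest, the dashboard shows that commenter's active hours
  - Collects all of that commenter's comments into a word cloud
- Open data API:
  - Provides additional information, such as IP and author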
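
The commenter statistics listed above can be sketched with the standard library alone. This is an illustration under assumptions, not the platform's implementation: the record fields (`article`, `commenter`, `hour`) and function names are hypothetical, and the "relationship" between two commenters is interpreted here as the number of articles on which both commented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one dict per comment with its article, the commenter ID,
# and the hour of day (0-23) at which it was posted.
comments = [
    {"article": "a1", "commenter": "alice", "hour": 9},
    {"article": "a1", "commenter": "bob", "hour": 9},
    {"article": "a1", "commenter": "alice", "hour": 10},
    {"article": "a2", "commenter": "alice", "hour": 22},
    {"article": "a2", "commenter": "bob", "hour": 23},
    {"article": "a2", "commenter": "carol", "hour": 23},
]


def top_commenters(comments, n=20):
    """Return the n commenters with the most comments as (id, count) pairs."""
    return Counter(c["commenter"] for c in comments).most_common(n)


def commenter_pairs(comments, keep):
    """Count, for each pair of kept commenters, the articles both commented on."""
    by_article = {}
    for c in comments:
        if c["commenter"] in keep:
            by_article.setdefault(c["article"], set()).add(c["commenter"])
    pairs = Counter()
    for commenters in by_article.values():
        # Sorting makes each pair key order-independent, e.g. ("alice", "bob").
        pairs.update(combinations(sorted(commenters), 2))
    return pairs


def active_hours(comments, commenter):
    """Return a 24-slot list of comment counts per hour for one commenter."""
    hours = [0] * 24
    for c in comments:
        if c["commenter"] == commenter:
            hours[c["hour"]] += 1
    return hours


top = top_commenters(comments)
pairs = commenter_pairs(comments, keep={name for name, _ in top})
alice_hours = active_hours(comments, "alice")
print(top[0], pairs[("alice", "bob")], alice_hours[9])
```

The pair counts can serve as edge weights in a commenter network graph, and the 24-slot list maps directly onto a bar chart of active hours.
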
Category | Tool/Technique |
---|---|
Database | MongoDB |
Data Pipeline | Airflow |
Dashboard | Plotly Dash |
Backend | FastAPI |
Cache system | Redis |
Autoscaling | Cloud Run |
Load Balancing | Cloud Load Balancing |
Monitoring | Cloud Monitoring, Cloud Logging |
Others | GCP Compute Engine |
Raymond Hung [email protected]