Merge pull request #47 from CuteChuanChuan/develop
Provide README
Showing 13 changed files with 218 additions and 93 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
# This workflow builds and pushes a Docker container to Google Artifact Registry and deploys it to Cloud Run when a commit is pushed to the "develop" branch | ||
# | ||
# Overview: | ||
# | ||
# 1. Authenticate to Google Cloud | ||
# 2. Authenticate Docker to Artifact Registry | ||
# 3. Build a docker container | ||
# 4. Publish it to Google Artifact Registry | ||
# 5. Deploy it to Cloud Run | ||
# | ||
# To configure this workflow: | ||
# | ||
|
||
# 2. Create and configure Workload Identity Federation for GitHub (https://github.com/google-github-actions/auth#setting-up-workload-identity-federation) | ||
# | ||
# 3. Ensure the required IAM permissions are granted | ||
# | ||
# Cloud Run | ||
# roles/run.admin | ||
# roles/iam.serviceAccountUser (to act as the Cloud Run runtime service account) | ||
# | ||
# Artifact Registry | ||
# roles/artifactregistry.admin (project or repository level) | ||
# | ||
# NOTE: You should always follow the principle of least privilege when assigning IAM roles | ||
# | ||
# 4. Create GitHub secrets for WIF_PROVIDER and WIF_SERVICE_ACCOUNT | ||
# | ||
# 5. Change the values for the GAR_LOCATION, SERVICE and REGION environment variables (below). | ||
# | ||
|
||
# | ||
# For more support on how to run this workflow, please visit https://github.com/marketplace/actions/deploy-to-cloud-run | ||
# | ||
# Further reading: | ||
# Cloud Run IAM permissions - https://cloud.google.com/run/docs/deploying | ||
# Artifact Registry IAM permissions - https://cloud.google.com/artifact-registry/docs/access-control#roles | ||
# Container Registry vs Artifact Registry - https://cloud.google.com/blog/products/application-development/understanding-artifact-registry-vs-container-registry | ||
# Principle of least privilege - https://cloud.google.com/blog/products/identity-security/dont-get-pwned-practicing-the-principle-of-least-privilege | ||
|
||
name: Build and Deploy to Cloud Run | ||
|
||
on: | ||
push: | ||
branches: [ "develop" ] | ||
|
||
env: | ||
PROJECT_ID: comment-detector-400115 | ||
GAR_LOCATION: asia-east1 | ||
REPOSITORY: comment-detector | ||
SERVICE: server | ||
REGION: asia-east1 | ||
|
||
jobs: | ||
deploy: | ||
# Add 'id-token' with the intended permissions for workload identity federation | ||
permissions: | ||
contents: 'read' | ||
id-token: 'write' | ||
|
||
runs-on: ubuntu-latest | ||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@v3 | ||
|
||
- name: Google Auth | ||
id: auth | ||
uses: 'google-github-actions/auth@v0' | ||
with: | ||
token_format: 'access_token' | ||
workload_identity_provider: '${{ secrets.WIF_PROVIDER }}' # e.g. - projects/123456789/locations/global/workloadIdentityPools/my-pool/providers/my-provider | ||
service_account: '${{ secrets.WIF_SERVICE_ACCOUNT }}' # e.g. - my-service-account@my-project.iam.gserviceaccount.com | ||
|
||
# NOTE: Alternative option - authentication via credentials json | ||
# - name: Google Auth | ||
# id: auth | ||
# uses: 'google-github-actions/auth@v0' | ||
# with: | ||
# credentials_json: '${{ secrets.GCP_CREDENTIALS }}' | ||
|
||
# BEGIN - Docker auth and build (NOTE: If you already have a container image, these Docker steps can be omitted) | ||
|
||
# Authenticate Docker to Google Cloud Artifact Registry | ||
- name: Docker Auth | ||
id: docker-auth | ||
uses: 'docker/login-action@v1' | ||
with: | ||
username: 'oauth2accesstoken' | ||
password: '${{ steps.auth.outputs.access_token }}' | ||
registry: '${{ env.GAR_LOCATION }}-docker.pkg.dev' | ||
|
||
- name: Build and Push Container | ||
run: |- | ||
cd src/server | ||
SHORT_SHA=$(echo "${{ github.sha }}" | cut -c 1-6) | ||
# Export the tag for the deploy step: shell substitution is not evaluated inside `with:` inputs | ||
echo "SHORT_SHA=$SHORT_SHA" >> "$GITHUB_ENV" | ||
docker build -t server:$SHORT_SHA --platform linux/amd64 -f DockerfileDashboard . | ||
docker tag server:$SHORT_SHA "${{ env.GAR_LOCATION }}-docker.pkg.dev/${{ env.PROJECT_ID }}/${{ env.REPOSITORY }}/${{ env.SERVICE }}:$SHORT_SHA" | ||
# asia-east1-docker.pkg.dev/comment-detector-400115/comment-detector | ||
docker push "${{ env.GAR_LOCATION }}-docker.pkg.dev/${{ env.PROJECT_ID }}/${{ env.REPOSITORY }}/${{ env.SERVICE }}:$SHORT_SHA" | ||
# END - Docker auth and build | ||
|
||
- name: Deploy to Cloud Run | ||
id: deploy | ||
uses: google-github-actions/deploy-cloudrun@v0 | ||
with: | ||
service: ${{ env.SERVICE }} | ||
region: ${{ env.REGION }} | ||
image: ${{ env.GAR_LOCATION }}-docker.pkg.dev/${{ env.PROJECT_ID }}/${{ env.REPOSITORY }}/${{ env.SERVICE }}:${{ env.SHORT_SHA }} | ||
|
||
# If required, use the Cloud Run url output in later steps | ||
- name: Show Output | ||
run: echo ${{ steps.deploy.outputs.url }} |
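The workflow tags the image with the first six characters of the commit SHA and pushes it to a path built from `GAR_LOCATION`, `PROJECT_ID`, `REPOSITORY` and `SERVICE`. As a rough illustration of how that reference is composed, here is a minimal Python sketch. The function name `image_reference` and the `tag_length` parameter are hypothetical and not part of the workflow.

```python
def image_reference(gar_location: str, project_id: str, repository: str,
                    service: str, commit_sha: str, tag_length: int = 6) -> str:
    """Compose an Artifact Registry image reference tagged with a short commit SHA."""
    # Equivalent to `echo "$sha" | cut -c 1-6` in the build step.
    short_sha = commit_sha[:tag_length]
    registry_host = f"{gar_location}-docker.pkg.dev"
    return f"{registry_host}/{project_id}/{repository}/{service}:{short_sha}"


if __name__ == "__main__":
    # The SHA below is a made-up example value.
    print(image_reference("asia-east1", "comment-detector-400115",
                          "comment-detector", "server",
                          "0123456789abcdef0123456789abcdef01234567"))
```

The build step and the deploy step must resolve to the same string, which is why the tag is best computed once and reused.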
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,6 +8,7 @@ | |
*.log | ||
.idea/ | ||
trying/ | ||
tests/testing_html/* | ||
|
||
*/.DS_Store | ||
.env | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,91 +1,102 @@ | ||
# Comment-Detector | ||
Personal Project (AppWorks School #21 Data Engineering) | ||
|
||
# Data | ||
## Source | ||
- PTT - Gossiping | ||
- PTT - HatePolitics | ||
|
||
## Extraction | ||
- requests + beautifulsoup | ||
|
||
|
||
# Tools and Skills | ||
## <u>Database: MongoDB (NoSQL)</u> | ||
### Objective: | ||
### Why use this? | ||
|
||
## <u>Schedule: Airflow</u> | ||
### Objective: | ||
### Why use this? | ||
|
||
## <u>CDC: Kafka</u> | ||
### Objective: | ||
### Why use this? | ||
### Steps: | ||
1. downloading from [Kafka edu](https://github.com/mongodb-university/kafka-edu.git) | ||
2. running docker: ```docker-compose -p mongo-kafka up -d --force-recreate``` | ||
3. adding connectors: ```docker exec -it mongo1 /bin/bash``` | ||
4. creating connector.json: ```nano simplesource.json``` | ||
```json | ||
{ | ||
"name": "mongo-simple-source", | ||
"config": { | ||
"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector", | ||
"connection.uri": "mongodb://mongo1", | ||
"database": "Tutorial1", | ||
"collection": "orders" | ||
} | ||
} | ||
``` | ||
5. connecting: ```cx simplesource.json``` | ||
|
||
### Reference: | ||
1. [MongoDB Quickstart](https://www.mongodb.com/docs/kafka-connector/current/quick-start/) | ||
2. [Kafka Connector Tutorial Setup](https://www.mongodb.com/docs/kafka-connector/current/tutorials/tutorial-setup/#std-label-kafka-tutorials-docker-setup) | ||
3. [Getting Started with the MongoDB Kafka Source Connector](https://www.mongodb.com/docs/kafka-connector/master/tutorials/source-connector/) | ||
|
||
|
||
## <u>Dashboard: Plotly Dash</u> | ||
### Objective: | ||
1. Create interactive interface for users to explore this product | ||
2. Create dashboards to demonstrate the product | ||
### URL: http://3.106.78.149:8000/ | ||
### Why use this? | ||
|
||
## <u>Middleware - WSGI (Web Server Gateway Interface) server: gunicorn</u> | ||
### Objective: | ||
### Why use this? | ||
|
||
## <u>Middleware - ASGI (Asynchronous Server Gateway Interface) server: uvicorn</u> | ||
### Objective: | ||
### Why use this? | ||
|
||
## <u>Process manager: pm2</u> | ||
### Objective: managing the process of streamlit | ||
### Why use this? | ||
1. Dashboards need to be available for users. | ||
2. Currently, streamlit does not support running with gunicorn. | ||
### Steps: | ||
1. installation: ```sudo apt install npm``` | ||
2. installation: ```sudo npm install pm2 -g``` | ||
3. creating .sh: ```vim start_streamlit.sh``` | ||
```shell | ||
#!/bin/bash | ||
gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker | ||
``` | ||
4. changing .sh permission: ```chmod +x start_streamlit.sh``` | ||
5. starting virtual env: ```source ./crawler/bin/activate``` | ||
6. running script: ```pm2 start start_streamlit.sh``` | ||
|
||
## <u>API: FastAPI</u> | ||
### Objective: | ||
### Why use this? | ||
|
||
## <u>Cache: Redis</u> | ||
### Objective: | ||
### Why use this? | ||
|
||
## <u>Rate limiter: Redis</u> | ||
### Objective: | ||
### Why use this? | ||
|
||
## Table of Contents | ||
* [Introduction](#Introduction) | ||
* [Architecture](#Architecture) | ||
* [Data](#Data) | ||
* [Feature](#Feature) | ||
* [Tools](#Tools) | ||
* [Monitoring](#Monitoring) | ||
* [Clip](#Clip) | ||
* [Contact](#Contact) | ||
|
||
|
||
## Introduction | ||
#### A dashboard offering users comprehensive and insightful data about PTT (Taiwan's largest forum) | ||
#### Users can form their own judgment about cyber warriors (網軍) and people manipulating public opinion (帶風向) | ||
#### Platform: [https://comment-detector.com](https://comment-detector.com) | ||
![Homepage](readme-img/Homepage.png) | ||
|
||
|
||
## Architecture | ||
![Architecture](readme-img/Architecture.png) | ||
|
||
### Compute Engine #1: | ||
- Aim 1: Executing web crawling every 10 minutes, orchestrated by Apache Airflow on Docker | ||
- Aim 2: Cleaning and extracting data | ||
- On: GCP Compute Engine | ||
|
||
### Compute Engine #2: | ||
- Aim: Deploying Redis as a cache whose data is refreshed by Python scripts scheduled with APScheduler | ||
- On: GCP Compute Engine | ||
|
||
### Database | ||
- Aim: Storing cleaned data and providing data for platform | ||
- On: MongoDB Atlas | ||
|
||
### Dashboard (Application) | ||
- Aim: Retrieving data from MongoDB and presenting organized data to users | ||
- On: container image deployed on Cloud Run | ||
|
||
## Data | ||
### Source | ||
- PTT - Gossiping, the board with the largest number of users. | ||
- PTT - HatePolitics, a board closely tied to politics. | ||
|
||
### ETL | ||
- Extract: web crawling (requests + Beautiful Soup) | ||
- Transform: Python (data cleaning and extraction) | ||
- Load: MongoDB | ||
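To make the Transform step more concrete, the sketch below turns the raw text fields of one scraped comment into a clean record ready to be loaded. It is only an illustration: the field names, the `clean_comment` helper and the assumed raw format (a content string prefixed with `: ` and a combined "IP date time" string) are assumptions, not the project's actual schema.

```python
import re
from typing import Optional, TypedDict


class Comment(TypedDict):
    commenter_id: str
    comment_type: str
    content: str
    ip: Optional[str]
    posted_at: str


# Assumed raw format: an optional IPv4 address followed by "MM/DD HH:MM".
_IP_AND_TIME = re.compile(
    r"^(?:(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+)?(?P<time>\d{2}/\d{2}\s+\d{2}:\d{2})$"
)


def clean_comment(tag: str, user_id: str, content: str, ip_datetime: str) -> Comment:
    """Normalize the raw text fields of a single comment (hypothetical schema)."""
    match = _IP_AND_TIME.match(ip_datetime.strip())
    return {
        "commenter_id": user_id.strip(),
        "comment_type": tag.strip(),
        # Drop the leading ": " separator in front of the comment text.
        "content": content.strip().lstrip(":").strip(),
        "ip": match.group("ip") if match else None,
        "posted_at": match.group("time") if match else "",
    }


if __name__ == "__main__":
    print(clean_comment("push ", "user01", ": good post", " 1.2.3.4 10/12 14:03\n"))
```

Keeping the cleaning logic in a pure function like this makes it easy to unit test separately from the crawler and the database writes.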
|
||
## Feature | ||
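- Trend analysis: | ||
- Provides statistics on data volume | ||
- Shows trending keywords and articles | ||
- Keyword analysis: | ||
- After the user enters a keyword of interest, the dashboard shows popular articles related to that keyword | ||
- Identifies the top 20 commenters by comment count and the relationships among them (Concurrency Analysis) | ||
- Commenter analysis: | ||
- After the user enters a commenter of interest, the dashboard shows that commenter's active hours | ||
- Aggregates all of that commenter's comments into a word cloud | ||
- Open data API: | ||
- Provides additional information, such as IP addresses and authors | ||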
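The keyword feature above ranks the top 20 commenters and relates them to one another. One plausible way to compute this is sketched below, using only the Python standard library. It assumes comments arrive as `(article_id, commenter_id)` pairs and that two commenters are "related" when they comment under the same article. Both the input shape and that definition are assumptions for illustration, not the project's confirmed method.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

CommentRow = Tuple[str, str]  # (article_id, commenter_id), hypothetical shape


def top_commenters(comments: Iterable[CommentRow], n: int = 20) -> List[str]:
    """Return the n commenters with the most comments, most active first."""
    counts = Counter(commenter for _, commenter in comments)
    return [commenter for commenter, _ in counts.most_common(n)]


def co_occurrence(comments: Iterable[CommentRow],
                  commenters: Iterable[str]) -> Counter:
    """Count, per pair of selected commenters, the articles both commented on."""
    keep = set(commenters)
    by_article: Dict[str, Set[str]] = {}
    for article_id, commenter in comments:
        if commenter in keep:
            by_article.setdefault(article_id, set()).add(commenter)

    pairs: Counter = Counter()
    for members in by_article.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(members), 2):
            pairs[(a, b)] += 1
    return pairs


if __name__ == "__main__":
    rows = [("a1", "amy"), ("a1", "bob"), ("a2", "amy"),
            ("a2", "bob"), ("a2", "cat"), ("a3", "amy")]
    top = top_commenters(rows, n=2)
    print(top, dict(co_occurrence(rows, top)))
```

The resulting pair counts can feed a network graph or heatmap on the dashboard, with edge weight equal to the number of shared articles.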
|
||
|
||
|
||
## Tools | ||
| Category | Tool/Technique | | ||
|----------------|---------------------------------| | ||
| Database | MongoDB | | ||
| Data Pipeline | Airflow | | ||
| Dashboard | Plotly Dash | | ||
| Backend | FastAPI | | ||
| Cache system | Redis | | ||
| Autoscaling | Cloud Run | | ||
| Load Balancing | Cloud Load Balancing | | ||
| Monitoring | Cloud Monitoring, Cloud Logging | | ||
| Others | GCP Compute Engine | | ||
|
||
|
||
## Monitoring | ||
#### Overall | ||
- ![overall](readme-img/monitoring-overall.png) | ||
#### Airflow | ||
- ![airflow](readme-img/monitoring-airflow.png) | ||
#### Dashboard | ||
- ![dashboard](readme-img/monitoring-dashboard.png) | ||
|
||
## Clip | ||
#### Trend analysis | ||
- ![Trend](readme-img/demo-trend.gif) | ||
#### Keyword analysis | ||
- ![Keyword](readme-img/demo-keyword.gif) | ||
#### Commenter analysis | ||
- ![Commenter](readme-img/demo-commenter.gif) | ||
#### APIs | ||
- ![ipaddress](readme-img/demo-api.gif) | ||
|
||
|
||
|
||
|
||
## Contact | ||
Raymond Hung [email protected] |