Machine5

Project Structure

Machine5

📦Machine5
┣ 📂Closed
┃ ┣ 📜Closed_RF.ipynb
┃ ┣ 📜Closed_RF_old .ipynb
┃ ┣ 📜Closed_RF_selected10.ipynb
┃ ┣ 📜Closed_SVM.ipynb
┃ ┣ 📜Closed_SVM_old.ipynb
┃ ┗ 📜baseline.ipynb
┣ 📂Open_Binary
┃ ┣ 📂baseline
┃ ┃ ┗ 📜Open_Binary_KNN.ipynb
┃ ┣ 📜Open2_Binary_RF_selected12.ipynb
┃ ┣ 📜Open2_Binary_RF_selected6.ipynb
┃ ┣ 📜Open_Binary_RF.ipynb
┃ ┗ 📜Open_Binary_SVM.ipynb
┣ 📂Open_Multi
┃ ┣ 📜Open2_Multi_RF.ipynb
┃ ┣ 📜Open2_Multi_SVM.ipynb
┃ ┣ 📜Open_Multi_RF.ipynb
┃ ┗ 📜Open_Multi_SVM.ipynb
┣ 📂datasets
┃ ┣ 📜binary_labels.csv
┃ ┣ 📜final_labels.csv
┃ ┣ 📜mon_features.csv
┃ ┣ 📜mon_features_old.csv
┃ ┣ 📜mon_labels.csv
┃ ┣ 📜unmon3000_features.csv
┃ ┣ 📜unmon3000_features_old.csv
┃ ┣ 📜unmon_features.csv
┃ ┗ 📜unmon_features_old.csv
┣ 📂features
┃ ┣ 📂feature_information
┃ ┃ ┣ 📜combined_feature_information.ipynb
┃ ┃ ┣ 📜comimage.png
┃ ┃ ┣ 📜comimage2.png
┃ ┃ ┣ 📜mon_feature_information.ipynb
┃ ┃ ┣ 📜monimage.png
┃ ┃ ┣ 📜monimage2.png
┃ ┃ ┣ 📜unmon_feature_information.ipynb
┃ ┃ ┗ 📜unmonimage.png
┃ ┣ 📂original_datasets
┃ ┃ ┣ 📜mon_standard.pkl
┃ ┃ ┣ 📜unmon_standard10.pkl
┃ ┃ ┗ 📜unmon_standard10_3000.pkl
┃ ┣ 📜README.md
┃ ┗ 📜feature_generator.ipynb
┣ 📜README.md
┣ 📜Scenario1_SVM.png
┣ 📜Scenario2_RF.png
┣ 📜Scenario2_RF_2.png
┗ 📜how_to_run.ipynb

Overview

Project Description

Given data: mon_standard.pkl (data from monitored websites), unmon_standard10.pkl (data from unmonitored websites).
Project Purpose: Based on the given data, create a model that makes the following predictions; label the monitored website instances with {0, 1, 2, ..., 94} and the unmonitored website instances with the label '-1'.
Constraints
- The model must be selected from LR, NB, SVM, DT, GB, k-NN, Clustering, and NN.
- Metric must use the following: Accuracy (Closed-World), True positive rate, False positive rate, precision, PR curve, and ROC (Opern-World).

Hypothesis, Scenario

Candidate Features
- Details: ./features/README.md
Closed-World Multi, Open-World Binary Classification
- Use a baseline model to select the best model that performs well on the given data.
- Only features with high importance and correlation coefficients are selected for training to avoid overfitting and speed up training.
- We use data preprocessing and hyperparameter tuning to improve the accuracy of the selected model.
- Open-World Binary Scenario
  - Combine monitored and unmonitored data into a single dataset.
  - Train the entire dataset using binary classification (-1 vs 1), where the label -1 represents unmonitored data and 1 represents monitored data.
  - Additional Step: Extract labels based on the binary classification prediction results to implement a multi-class model, and save these labels in a CSV file (used in Open-World Multi Scenario 2).
The open-world multi classification followed two scenarios.
Open-World Multi Scenario 1
- The model is selected by considering the baseline of multi-classification in the closed world and the baseline of binary classification in the open world.
- Combine data for monitored and unmonitored instances.
- Using the selected model, predict the label{-1, 0, 1, ..., 94} for the combined data.
Open-World Multi Scenario 2
- Preprocess the data without classification, taking into account the different feature importance and 2. correlation coefficients between monitored and unmonitored data.
- Train a closed multi-classification model and an open binary classification model for each data separately.
- Perform prediction of open binary classification model > Extract the prediction result > Perform multi-classification based on this prediction result.

Results

Open-World Binary Scenario
- The SVM achieved a baseline accuracy of 80.00%, which improved to 83.29% after hyperparameter optimization.
- details(SVM)
  - The optimized model showed excellent balance with a PR curve of 90.99% and an ROC curve of 80.32%, minimizing false negatives with a recall of 89.81%.
- Random Forest demonstrated the highest performance and improved generalization through hyperparameter tuning.
- details(RF)
  - The baseline accuracy of Random Forest was 84.72%, which was adjusted to 82.12% after tuning.
  - The tuning focused on reducing overfitting and addressing data imbalance by limiting max_depth and max_leaf_nodes and applying class_weight='balanced'.
  - Although accuracy and ROC-AUC decreased to 82.12% and 81.74%, respectively, the model's generalization performance improved.
  - PR-AUC remained high at 91.91%, and precision was maintained at 89.53%, effectively minimizing false positives and achieving balanced performance.
- Future Improvement Direction: Optimize class weights, tune hyperparameters, and address data imbalance using resampling techniques.
Open-World Multi Scenario 1
- Multi-classification and binary classification do not consider the importance of the features used, resulting in relatively low accuracy.
- details(SVM)
  
  Accuracy (Tuned Model): 0.6993
  Precision: 0.6993
  Recall: 0.6254
  Confusion Matrix (Tuned Model):
  [[1686 3 3 ... 1 0 4]
  [ 7 16 0 ... 0 0 2]
  [ 10 0 31 ... 0 0 0]
  ...
  [ 14 0 1 ... 17 0 0]
  [ 2 0 0 ... 0 35 0]
  [ 6 0 0 ... 1 0 25]]
  
  ROC AUC: 0.4105
  PR AUC: 0.0071
Open-World Multi Scenario 2
- The criteria not considered in Scenario 1 were applied, resulting in a relatively high accuracy.
- details(RF)
  
  Accuracy: 0.8136
  Precision: 0.8657
  Recall: 0.7885
  Confusion Matrix:
  
  ROC AUC: 0.3905
  Model PR AUC: 0.0054
Both open world multi classifications resulted in very low ROC and PR scores because the dataset was highly imbalanced (-1:remaining 95 classes = 10000:19000).
Solution: Techniques such as oversampling/undersampling can be used to balance the classes. Additionally, it is possible to consider weighting samples with smaller clusters.

Strong Point

Calculate features based on findings from prior research papers for a given classification objective and dataset
Calculate feature importance, correlation coefficients, and selection considering the purpose of the model
After selecting a model based on our understanding of the dataset, we implement a baseline model to experimentally verify its practical performance.
Optimal model achieves accuracy of 80 or higher for open world(final) classifications
Identify the cause of ROC and PR score decline due to dataset imbalance issues and propose solutions

How to Run

Highly recommend to skip 2. Feature generation and 3. Scenario 1: if you want to reproduce only the final prediction experiment.
- In this case, you only need to download the ./datasets folder and the ./Open_Multi/{Open2_Multi_RF | Open2_Multi_SVM}.ipynb file (which requires a dataset load path conversion) and run the file.

1. Open and run `how_to_run.ipynb` to clone this repository

2. Feature generation

Run feature_generator.ipynb(/content/drive/MyDrive/Machine5/featuers) and get {mon_features | unmon_features | unmon3000_features}.ipynb(/content/drive/MyDrive/Machine5/datasets)

3. Scenario 1

Closed: Run {baseline | Closed_RF | Closed_SVM}.ipynb(/content/drive/MyDrive/Machine5/Closed)
Open_Binary: Run {Open_Binary_RF | Open_Binary_SVM}.ipynb(/content/drive/MyDrive/Machine5/Open_Binary)
Open_Multi: Run {Open_Multi_RF | Open_Multi_SVM}.ipynb(/content/drive/MyDrive/Machine5/Open_Multi)

4. Scenario 2

Open_Binary model should be executed before Open_Multi model
Closed: Run {baseline | Closed_RF | Closed_SVM}.ipynb(/content/drive/MyDrive/Machine5/Closed)
Open_Binary: Run Open2_Binary_RF_selected12.ipynb(/content/drive/MyDrive/Machine5/Open_Binary) and get binary_labels.csv(/content/drive/MyDrive/Machine5/datasets)
Open_Multi: Run {Open2_Multi_RF | Open2_Multi_SVM}.ipynb(/content/drive/MyDrive/Machine5/Open_Multi) and get final_labels.csv(/content/drive/MyDrive/Machine5/datasets)

Contributors

Minseo Kim	Chaewon Kim	Minkyung Song	Seungyeon Kim	Yeonsu Kim
Project Management	Open_Binary Model Training & Optimization	Closed Model Training & Optimization	Open_Multi Model Training & Optimization	Result Analysis
Feature Engineering	Model Evaluation	Model Evaluation	Model Evaluation	Project Presentation

Created the following files; If something goes wrong, contact us!

All	Open_Binary	Closed	Open_Multi
features	Open_Binary/baseline	Closed_RF_old.ipynb	Open_Multi_RF.ipynb
Closed/{baseline, Closed_RF, Closed_RF_selected10}.ipynb	Open_Binary_RF.ipynb	Closed_SVM_old.ipynb	Open_Multi_SVM.ipynb
Open_Binary/{Open2_Binary_RF_selected6, Open2_Binary_RF_selected12}.ipynb	Open_Binary_SVM.ipynb	Closed_RF.ipynb	Open2_Multi_SVM.ipynb
Open2_Multi_RF.ipynb	Open_Binary_KNN.ipynb	Closed_SVM.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Machine5

Contents

Project Structure

Overview

Project Description

Hypothesis, Scenario

Results

Strong Point

How to Run

1. Open and run `how_to_run.ipynb` to clone this repository

2. Feature generation

3. Scenario 1

4. Scenario 2

Contributors

About

Releases

Packages

Contributors 4

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
Closed		Closed
Open_Binary		Open_Binary
Open_Multi		Open_Multi
datasets		datasets
features		features
.DS_Store		.DS_Store
.gitattributes		.gitattributes
README.md		README.md
Scenario1_SVM.png		Scenario1_SVM.png
Scenario2_RF.png		Scenario2_RF.png
Scenario2_RF_2.png		Scenario2_RF_2.png
how_to_run.ipynb		how_to_run.ipynb

440g/Machine5

Folders and files

Latest commit

History

Repository files navigation

Machine5

Contents

Project Structure

Overview

Project Description

Hypothesis, Scenario

Results

Strong Point

How to Run

1. Open and run how_to_run.ipynb to clone this repository

2. Feature generation

3. Scenario 1

4. Scenario 2

Contributors

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

1. Open and run `how_to_run.ipynb` to clone this repository

Packages