
Team repository for the machine learning projects; Fingerprinting websites based on the network traffic pattern analysis


440g/Machine5


Machine5


Project Structure

Machine5
📦Machine5
┣ 📂Closed
┃ ┣ 📜Closed_RF.ipynb
┃ ┣ 📜Closed_RF_old .ipynb
┃ ┣ 📜Closed_RF_selected10.ipynb
┃ ┣ 📜Closed_SVM.ipynb
┃ ┣ 📜Closed_SVM_old.ipynb
┃ ┗ 📜baseline.ipynb
┣ 📂Open_Binary
┃ ┣ 📂baseline
┃ ┃ ┗ 📜Open_Binary_KNN.ipynb
┃ ┣ 📜Open2_Binary_RF_selected12.ipynb
┃ ┣ 📜Open2_Binary_RF_selected6.ipynb
┃ ┣ 📜Open_Binary_RF.ipynb
┃ ┗ 📜Open_Binary_SVM.ipynb
┣ 📂Open_Multi
┃ ┣ 📜Open2_Multi_RF.ipynb
┃ ┣ 📜Open2_Multi_SVM.ipynb
┃ ┣ 📜Open_Multi_RF.ipynb
┃ ┗ 📜Open_Multi_SVM.ipynb
┣ 📂datasets
┃ ┣ 📜binary_labels.csv
┃ ┣ 📜final_labels.csv
┃ ┣ 📜mon_features.csv
┃ ┣ 📜mon_features_old.csv
┃ ┣ 📜mon_labels.csv
┃ ┣ 📜unmon3000_features.csv
┃ ┣ 📜unmon3000_features_old.csv
┃ ┣ 📜unmon_features.csv
┃ ┗ 📜unmon_features_old.csv
┣ 📂features
┃ ┣ 📂feature_information
┃ ┃ ┣ 📜combined_feature_information.ipynb
┃ ┃ ┣ 📜comimage.png
┃ ┃ ┣ 📜comimage2.png
┃ ┃ ┣ 📜mon_feature_information.ipynb
┃ ┃ ┣ 📜monimage.png
┃ ┃ ┣ 📜monimage2.png
┃ ┃ ┣ 📜unmon_feature_information.ipynb
┃ ┃ ┗ 📜unmonimage.png
┃ ┣ 📂original_datasets
┃ ┃ ┣ 📜mon_standard.pkl
┃ ┃ ┣ 📜unmon_standard10.pkl
┃ ┃ ┗ 📜unmon_standard10_3000.pkl
┃ ┣ 📜README.md
┃ ┗ 📜feature_generator.ipynb
┣ 📜README.md
┣ 📜Scenario1_SVM.png
┣ 📜Scenario2_RF.png
┣ 📜Scenario2_RF_2.png
┗ 📜how_to_run.ipynb

Overview

Project Description

  • Given data: mon_standard.pkl (data from monitored websites), unmon_standard10.pkl (data from unmonitored websites).
  • Project Purpose: Based on the given data, build a model that labels monitored website instances with {0, 1, 2, ..., 94} and unmonitored website instances with -1.
  • Constraints
    • The model must be selected from LR, NB, SVM, DT, GB, k-NN, Clustering, and NN.
    • The evaluation metrics must include the following: accuracy (Closed-World), and true positive rate, false positive rate, precision, PR curve, and ROC curve (Open-World).
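
As a rough illustration of the Open-World metrics listed above, the sketch below computes accuracy, true positive rate, false positive rate, and precision from binary labels. It assumes the convention used later in this README (1 = monitored, -1 = unmonitored); the function name `binary_metrics` and the toy label lists are made up for this example and are not taken from the notebooks.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, TPR, FPR, and precision for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,        # recall on monitored traffic
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,        # unmonitored flagged as monitored
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Toy labels (hypothetical): 1 = monitored, -1 = unmonitored
y_true = [1, 1, 1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

PR and ROC curves additionally need predicted scores rather than hard labels, so they are not covered by this sketch.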

Hypothesis, Scenario

  • Candidate Features

  • Closed-World Multi, Open-World Binary Classification

    • Use a baseline model to select the best model that performs well on the given data.

    • Only features with high importance and correlation coefficients are selected for training to avoid overfitting and speed up training.

    • We use data preprocessing and hyperparameter tuning to improve the accuracy of the selected model.

    • Open-World Binary Scenario

      • Combine monitored and unmonitored data into a single dataset.
      • Train a binary classifier (-1 vs 1) on the entire dataset, where the label -1 represents unmonitored data and 1 represents monitored data.
      • Additional Step: Extract labels based on the binary classification prediction results to implement a multi-class model, and save these labels in a CSV file (used in Open-World Multi Scenario 2).
  • Open-World multi-class classification was carried out under two scenarios.

  • Open-World Multi Scenario 1

    • The model is selected by considering the baseline of multi-classification in the closed world and the baseline of binary classification in the open world.
    • Combine data for monitored and unmonitored instances.
    • Using the selected model, predict the label in {-1, 0, 1, ..., 94} for the combined data.
  • Open-World Multi Scenario 2

    • Preprocess the data without classification, taking into account the different feature importance and correlation coefficients between monitored and unmonitored data.
    • Train a closed multi-classification model and an open binary classification model for each data separately.
    • Run the open binary classification model, extract its prediction results, and then perform multi-classification based on those results.
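
The two-stage flow of Scenario 2 can be sketched as follows. This is only an illustration of the control flow, assuming that instances predicted as unmonitored keep the label -1 and that only the remaining instances are passed to the multi-class model. The functions `toy_binary` and `toy_multi` and the `n_packets` field are hypothetical stand-ins, not the trained models or features from the notebooks.

```python
def two_stage_predict(samples, binary_predict, multi_predict):
    """Stage 1 decides monitored (1) vs unmonitored (-1);
    stage 2 assigns a site label (0..94) to monitored samples only."""
    final_labels = []
    for sample in samples:
        if binary_predict(sample) == -1:
            final_labels.append(-1)           # stays unmonitored
        else:
            final_labels.append(multi_predict(sample))
    return final_labels


# Hypothetical stand-ins for the trained binary and multi-class models
def toy_binary(sample):
    return 1 if sample["n_packets"] > 100 else -1


def toy_multi(sample):
    return sample["n_packets"] % 95


samples = [{"n_packets": 50}, {"n_packets": 195}, {"n_packets": 300}]
final_labels = two_stage_predict(samples, toy_binary, toy_multi)
print(final_labels)  # [-1, 5, 15]
```

One consequence of this design is that an error in the binary stage cannot be corrected by the multi-class stage, so the binary model's false negatives bound the final recall.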

Results

  • Open-World Binary Scenario

    • The SVM achieved a baseline accuracy of 80.00%, which improved to 83.29% after hyperparameter optimization.

    • details(SVM)
      • The optimized model was well balanced, with a PR-AUC of 90.99% and an ROC-AUC of 80.32%, and it kept false negatives low with a recall of 89.81%.
    • Random Forest demonstrated the highest performance and improved generalization through hyperparameter tuning.

    • details(RF)
      • The baseline accuracy of Random Forest was 84.72%, which decreased to 82.12% after tuning.
      • The tuning focused on reducing overfitting and addressing data imbalance by limiting max_depth and max_leaf_nodes and applying class_weight='balanced'.
      • Although accuracy and ROC-AUC decreased to 82.12% and 81.74%, respectively, the model's generalization performance improved.
      • PR-AUC remained high at 91.91%, and precision was maintained at 89.53%, effectively minimizing false positives and achieving balanced performance.
    • Future Improvement Direction: Optimize class weights, tune hyperparameters, and address data imbalance using resampling techniques.
  • Open-World Multi Scenario 1

    • In this scenario, neither the multi-class nor the binary classification takes the importance of the features into account, which results in relatively low accuracy.

    • details(SVM)

      Accuracy (Tuned Model): 0.6993
      Precision: 0.6993
      Recall: 0.6254
      Confusion Matrix (Tuned Model):
      [[1686 3 3 ... 1 0 4]
      [ 7 16 0 ... 0 0 2]
      [ 10 0 31 ... 0 0 0]
      ...
      [ 14 0 1 ... 17 0 0]
      [ 2 0 0 ... 0 35 0]
      [ 6 0 0 ... 1 0 25]]

      ROC AUC: 0.4105
      PR AUC: 0.0071

  • Open-World Multi Scenario 2

    • The criteria that Scenario 1 did not consider (feature importance and correlation coefficients) were applied, resulting in relatively high accuracy.

    • details(RF)

      Accuracy: 0.8136
      Precision: 0.8657
      Recall: 0.7885
      Confusion Matrix: (provided as an image in the repository)

      ROC AUC: 0.3905
      Model PR AUC: 0.0054

  • Both Open-World multi-class scenarios produced very low ROC and PR scores because the dataset was highly imbalanced (10,000 instances of class -1 versus 19,000 instances spread across the remaining 95 classes).

  • Solution: Techniques such as oversampling or undersampling can be used to balance the classes. Giving larger weights to samples from smaller classes is another option.
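
To make the weighting idea concrete, here is a minimal sketch of the "balanced" heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_of_class)`. It assumes the 19,000 monitored instances are split evenly across the 95 sites (200 each), which the README does not state; the function name `balanced_class_weights` is made up for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumption: 10,000 unmonitored instances and 200 instances per monitored site
labels = [-1] * 10000 + [site for site in range(95) for _ in range(200)]
weights = balanced_class_weights(labels)

print(round(weights[-1], 4))  # weight of the large unmonitored class
print(round(weights[0], 4))   # weight of one monitored site
```

Under this assumption each monitored site receives a weight about 50 times larger than the unmonitored class, which is the ratio of their instance counts.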

Strong Points

  • Calculate features based on findings from prior research papers for a given classification objective and dataset
  • Calculate feature importance, correlation coefficients, and selection considering the purpose of the model
  • After selecting a model based on our understanding of the dataset, we implement a baseline model to experimentally verify its practical performance.
  • The optimal model achieves an accuracy of 80% or higher for the open-world (final) classification
  • Identify the cause of ROC and PR score decline due to dataset imbalance issues and propose solutions
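
The correlation-based part of the feature selection mentioned in this list might look like the sketch below. It ranks features by the absolute Pearson correlation with a binary target and keeps the top k. This is an assumption about how such a selection could be done, not the method used in the notebooks; the feature names and values are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant inputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def top_k_by_correlation(features, target, k):
    """Return the k feature names with the largest |correlation| to the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented toy features; target is a binary label (0 or 1)
features = {
    "n_packets": [10, 20, 30, 40],
    "constant": [1, 1, 1, 1],
    "noise": [3, 1, 4, 1],
}
target = [0, 0, 1, 1]
selected = top_k_by_correlation(features, target, 2)
print(selected)  # ['n_packets', 'noise']
```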

How to Run

  • If you only want to reproduce the final prediction experiment, we highly recommend skipping step 2 (Feature generation) and step 3 (Scenario 1).
    • In this case, you only need to download the ./datasets folder and the ./Open_Multi/{Open2_Multi_RF | Open2_Multi_SVM}.ipynb file, update the dataset load path in the notebook, and run it.
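
The dataset load path conversion mentioned above could be handled with a small helper like the one below. It assumes the notebooks load files from the Google Drive location shown in the steps that follow (/content/drive/MyDrive/Machine5); the helper name `to_local_path` is hypothetical and does not exist in the repository.

```python
from pathlib import Path, PurePosixPath

# Root used by the Colab paths in this README (assumed to be what the notebooks load from)
COLAB_ROOT = PurePosixPath("/content/drive/MyDrive/Machine5")


def to_local_path(colab_path, local_root="."):
    """Map a path under COLAB_ROOT to the same relative location under local_root."""
    relative = PurePosixPath(colab_path).relative_to(COLAB_ROOT)
    return Path(local_root).joinpath(*relative.parts)


local = to_local_path("/content/drive/MyDrive/Machine5/datasets/mon_features.csv")
print(local.as_posix())  # datasets/mon_features.csv
```

A path outside `COLAB_ROOT` raises `ValueError` from `relative_to`, which makes a wrong path fail early instead of silently pointing at a missing file.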

1. Open and run how_to_run.ipynb to clone this repository

  • Open In Colab

2. Feature generation

  • Run feature_generator.ipynb(/content/drive/MyDrive/Machine5/features) and get {mon_features | unmon_features | unmon3000_features}.csv(/content/drive/MyDrive/Machine5/datasets)

3. Scenario 1

  • Closed: Run {baseline | Closed_RF | Closed_SVM}.ipynb(/content/drive/MyDrive/Machine5/Closed)
  • Open_Binary: Run {Open_Binary_RF | Open_Binary_SVM}.ipynb(/content/drive/MyDrive/Machine5/Open_Binary)
  • Open_Multi: Run {Open_Multi_RF | Open_Multi_SVM}.ipynb(/content/drive/MyDrive/Machine5/Open_Multi)

4. Scenario 2

  • The Open_Binary model must be run before the Open_Multi model.
  • Closed: Run {baseline | Closed_RF | Closed_SVM}.ipynb(/content/drive/MyDrive/Machine5/Closed)
  • Open_Binary: Run Open2_Binary_RF_selected12.ipynb(/content/drive/MyDrive/Machine5/Open_Binary) and get binary_labels.csv(/content/drive/MyDrive/Machine5/datasets)
  • Open_Multi: Run {Open2_Multi_RF | Open2_Multi_SVM}.ipynb(/content/drive/MyDrive/Machine5/Open_Multi) and get final_labels.csv(/content/drive/MyDrive/Machine5/datasets)

Contributors

| Minseo Kim | Chaewon Kim | Minkyung Song | Seungyeon Kim | Yeonsu Kim |
| --- | --- | --- | --- | --- |
| Project Management | Open_Binary Model Training & Optimization | Closed Model Training & Optimization | Open_Multi Model Training & Optimization | Result Analysis |
| Feature Engineering | Model Evaluation | Model Evaluation | Model Evaluation | Project Presentation |

Created the following files. If something goes wrong, contact us!

| All | Open_Binary | Closed | Open_Multi | - |
| --- | --- | --- | --- | --- |
| features | Open_Binary/baseline | Closed_RF_old.ipynb | Open_Multi_RF.ipynb | |
| Closed/{baseline, Closed_RF, Closed_RF_selected10}.ipynb | Open_Binary_RF.ipynb | Closed_SVM_old.ipynb | Open_Multi_SVM.ipynb | |
| Open_Binary/{Open2_Binary_RF_selected6, Open2_Binary_RF_selected12}.ipynb | Open_Binary_SVM.ipynb | Closed_RF.ipynb | Open2_Multi_SVM.ipynb | |
| Open2_Multi_RF.ipynb | Open_Binary_KNN.ipynb | Closed_SVM.ipynb | | |
