# Machine5

```
📦Machine5
┣ 📂Closed
┃ ┣ 📜Closed_RF.ipynb
┃ ┣ 📜Closed_RF_old.ipynb
┃ ┣ 📜Closed_RF_selected10.ipynb
┃ ┣ 📜Closed_SVM.ipynb
┃ ┣ 📜Closed_SVM_old.ipynb
┃ ┗ 📜baseline.ipynb
┣ 📂Open_Binary
┃ ┣ 📂baseline
┃ ┃ ┗ 📜Open_Binary_KNN.ipynb
┃ ┣ 📜Open2_Binary_RF_selected12.ipynb
┃ ┣ 📜Open2_Binary_RF_selected6.ipynb
┃ ┣ 📜Open_Binary_RF.ipynb
┃ ┗ 📜Open_Binary_SVM.ipynb
┣ 📂Open_Multi
┃ ┣ 📜Open2_Multi_RF.ipynb
┃ ┣ 📜Open2_Multi_SVM.ipynb
┃ ┣ 📜Open_Multi_RF.ipynb
┃ ┗ 📜Open_Multi_SVM.ipynb
┣ 📂datasets
┃ ┣ 📜binary_labels.csv
┃ ┣ 📜final_labels.csv
┃ ┣ 📜mon_features.csv
┃ ┣ 📜mon_features_old.csv
┃ ┣ 📜mon_labels.csv
┃ ┣ 📜unmon3000_features.csv
┃ ┣ 📜unmon3000_features_old.csv
┃ ┣ 📜unmon_features.csv
┃ ┗ 📜unmon_features_old.csv
┣ 📂features
┃ ┣ 📂feature_information
┃ ┃ ┣ 📜combined_feature_information.ipynb
┃ ┃ ┣ 📜comimage.png
┃ ┃ ┣ 📜comimage2.png
┃ ┃ ┣ 📜mon_feature_information.ipynb
┃ ┃ ┣ 📜monimage.png
┃ ┃ ┣ 📜monimage2.png
┃ ┃ ┣ 📜unmon_feature_information.ipynb
┃ ┃ ┗ 📜unmonimage.png
┃ ┣ 📂original_datasets
┃ ┃ ┣ 📜mon_standard.pkl
┃ ┃ ┣ 📜unmon_standard10.pkl
┃ ┃ ┗ 📜unmon_standard10_3000.pkl
┃ ┣ 📜README.md
┃ ┗ 📜feature_generator.ipynb
┣ 📜README.md
┣ 📜Scenario1_SVM.png
┣ 📜Scenario2_RF.png
┣ 📜Scenario2_RF_2.png
┗ 📜how_to_run.ipynb
```
- Given data: `mon_standard.pkl` (data from monitored websites) and `unmon_standard10.pkl` (data from unmonitored websites).
- Project purpose: based on the given data, build a model that makes the following predictions: label monitored website instances with {0, 1, 2, ..., 94} and unmonitored website instances with the label -1.
- Constraints:
  - The model must be selected from LR, NB, SVM, DT, GB, k-NN, Clustering, and NN.
  - Metrics must include: accuracy (closed-world), and true positive rate, false positive rate, precision, PR curve, and ROC (open-world).
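As a hedged illustration of how the required open-world metrics could be computed with scikit-learn (the helper name `open_world_metrics` and all data below are invented for this sketch, not taken from the project):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             roc_auc_score, average_precision_score)

def open_world_metrics(y_true, y_pred, scores):
    """Open-world binary metrics; -1 = unmonitored, 1 = monitored.
    scores are the model's estimated probabilities of 'monitored'."""
    is_mon_true = np.asarray(y_true) == 1
    is_mon_pred = np.asarray(y_pred) == 1
    tp = np.sum(is_mon_true & is_mon_pred)   # monitored correctly flagged
    fp = np.sum(~is_mon_true & is_mon_pred)  # unmonitored wrongly flagged
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "tpr": tp / max(is_mon_true.sum(), 1),
        "fpr": fp / max((~is_mon_true).sum(), 1),
        "precision": precision_score(y_true, y_pred, pos_label=1),
        "roc_auc": roc_auc_score(is_mon_true, scores),          # ROC summary
        "pr_auc": average_precision_score(is_mon_true, scores), # PR summary
    }

# Toy example, not project data:
m = open_world_metrics(y_true=[1, 1, 1, -1, -1],
                       y_pred=[1, 1, -1, -1, 1],
                       scores=[0.9, 0.8, 0.4, 0.2, 0.6])
```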
## Candidate Features

- Details: `./features/README.md`

## Closed-World Multi, Open-World Binary Classification

- Use a baseline model to select the model that performs best on the given data.
- To avoid overfitting and speed up training, only features with high importance and strong correlation coefficients are selected for training.
- Data preprocessing and hyperparameter tuning are applied to improve the accuracy of the selected model.
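The feature-selection step above might look like the following sketch, which ranks features by Random Forest importance and keeps the top k. The synthetic data and the choice k=10 are illustrative, not the project's actual settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the project's feature matrix.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
k = 10
top_idx = np.argsort(rf.feature_importances_)[::-1][:k]  # indices of top-k features
X_selected = X[:, top_idx]  # reduced matrix used for the real training run
```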
## Open-World Binary Scenario

- Combine monitored and unmonitored data into a single dataset.
- Train on the entire dataset using binary classification (-1 vs. 1), where the label -1 represents unmonitored data and 1 represents monitored data.
- Additional step: extract labels from the binary classification predictions to feed into a multi-class model, and save these labels in a CSV file (used in Open-World Multi Scenario 2).
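A minimal sketch of the combine-and-relabel step (the array shapes are placeholders for the real `mon_features.csv` / `unmon_features.csv` contents):

```python
import numpy as np

rng = np.random.default_rng(0)
mon_features = rng.random((19000, 24))    # placeholder for monitored features
unmon_features = rng.random((10000, 24))  # placeholder for unmonitored features

X = np.vstack([mon_features, unmon_features])
y = np.concatenate([np.ones(len(mon_features)),      # 1  = monitored
                    -np.ones(len(unmon_features))])  # -1 = unmonitored
```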
The open-world multi-classification followed two scenarios.
## Open-World Multi Scenario 1

- The model is selected by considering the baseline of multi-classification in the closed world and the baseline of binary classification in the open world.
- Combine the data for monitored and unmonitored instances.
- Using the selected model, predict the label {-1, 0, 1, ..., 94} for the combined data.
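Scenario 1 can be sketched as a single multi-class model over the full label set {-1, 0, ..., 94}; the data, model choice, and sizes below are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((2000, 8))            # placeholder combined feature matrix
y = rng.integers(-1, 95, size=2000)  # labels -1 (unmonitored) through 94

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)             # one model emits all 96 labels directly
```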
## Open-World Multi Scenario 2

- Preprocess the data without classification, taking into account (1) the different feature importances and (2) the correlation coefficients between monitored and unmonitored data.
- Train a closed-world multi-classification model and an open-world binary classification model on each dataset separately.
- Run the open-world binary classification model, extract its prediction results, and then perform multi-classification based on those results.
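The two-stage pipeline described above can be sketched as follows (models, sizes, and data are illustrative; the real notebooks use the project's tuned models):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_mon = rng.random((500, 8))           # placeholder monitored features
y_mon = rng.integers(0, 95, size=500)  # monitored classes 0..94
X_unmon = rng.random((300, 8))         # placeholder unmonitored features

# Stage 1: open-world binary model (1 = monitored, -1 = unmonitored).
X_all = np.vstack([X_mon, X_unmon])
y_bin = np.concatenate([np.ones(500), -np.ones(300)])
binary_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_all, y_bin)

# Stage 2: closed-world multi-class model trained on monitored data only.
multi_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_mon, y_mon)

# Final labels: -1 where stage 1 says unmonitored, otherwise stage 2's class.
bin_pred = binary_model.predict(X_all)
final = np.where(bin_pred == -1, -1, multi_model.predict(X_all))
```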
## Open-World Binary Scenario

- The SVM achieved a baseline accuracy of 80.00%, which improved to 83.29% after hyperparameter optimization.
- Random Forest demonstrated the highest performance and improved generalization through hyperparameter tuning.
- Details (RF):
  - The baseline accuracy of Random Forest was 84.72%, which settled at 82.12% after tuning.
  - The tuning focused on reducing overfitting and addressing data imbalance by limiting `max_depth` and `max_leaf_nodes` and applying `class_weight='balanced'`.
  - Although accuracy and ROC-AUC decreased to 82.12% and 81.74%, respectively, the model's generalization performance improved.
  - PR-AUC remained high at 91.91%, and precision was maintained at 89.53%, effectively minimizing false positives and achieving balanced performance.
  - Future improvement direction: optimize class weights, tune hyperparameters further, and address data imbalance using resampling techniques.
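The tuning described above (limiting `max_depth` and `max_leaf_nodes`, applying `class_weight='balanced'`) might be set up like this; the exact values the notebooks used are not stated here, so these numbers and the synthetic data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for the project's dataset.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,             # cap tree depth to curb overfitting
    max_leaf_nodes=500,       # cap leaf count for the same reason
    class_weight="balanced",  # reweight classes to counter imbalance
    random_state=0,
).fit(X, y)
```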
## Open-World Multi Scenario 1

- Multi-classification and binary classification did not consider the importance of the features used, resulting in relatively low accuracy.

## Open-World Multi Scenario 2

- The criteria not considered in Scenario 1 were applied, resulting in relatively high accuracy.

Both open-world multi-classifications produced very low ROC and PR scores because the dataset is highly imbalanced (-1 : remaining 95 classes = 10000 : 19000).

- Solution: techniques such as oversampling/undersampling can be used to balance the classes. Additionally, giving larger weights to samples from smaller classes can be considered.
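One simple version of the oversampling idea, using `sklearn.utils.resample` to upsample the minority class until the classes are balanced (the class sizes and data below are illustrative):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.random((1300, 4))                           # placeholder features
y = np.concatenate([-np.ones(1000), np.ones(300)])  # imbalanced: 1000 vs 300

X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=1000, random_state=0)  # oversample minority
X_bal = np.vstack([X[y == -1], X_up])                  # now 1000 / 1000
y_bal = np.concatenate([y[y == -1], y_up])
```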
- Calculated features for the given classification objective and dataset based on findings from prior research papers.
- Computed feature importance and correlation coefficients, and selected features considering the purpose of the model.
- After selecting a model based on our understanding of the dataset, implemented a baseline model to experimentally verify its practical performance.
- The optimal model achieves an accuracy of 80% or higher for the final open-world classifications.
- Identified the cause of the ROC and PR score decline (dataset imbalance) and proposed solutions.
## How to run

1. Note
   - We highly recommend skipping steps 2 (Feature generation) and 3 (Scenario 1) if you want to reproduce only the final prediction experiment.
   - In this case, you only need to download the `./datasets` folder and the `./Open_Multi/{Open2_Multi_RF | Open2_Multi_SVM}.ipynb` file (which requires a dataset load path conversion) and run the file.
2. Feature generation
   - Run `feature_generator.ipynb` (`/content/drive/MyDrive/Machine5/features`) and get `{mon_features | unmon_features | unmon3000_features}.csv` (`/content/drive/MyDrive/Machine5/datasets`).
3. Scenario 1
   - Closed: run `{baseline | Closed_RF | Closed_SVM}.ipynb` (`/content/drive/MyDrive/Machine5/Closed`).
   - Open_Binary: run `{Open_Binary_RF | Open_Binary_SVM}.ipynb` (`/content/drive/MyDrive/Machine5/Open_Binary`).
   - Open_Multi: run `{Open_Multi_RF | Open_Multi_SVM}.ipynb` (`/content/drive/MyDrive/Machine5/Open_Multi`).
   - The Open_Binary model must be executed before the Open_Multi model.
4. Scenario 2
   - Closed: run `{baseline | Closed_RF | Closed_SVM}.ipynb` (`/content/drive/MyDrive/Machine5/Closed`).
   - Open_Binary: run `Open2_Binary_RF_selected12.ipynb` (`/content/drive/MyDrive/Machine5/Open_Binary`) and get `binary_labels.csv` (`/content/drive/MyDrive/Machine5/datasets`).
   - Open_Multi: run `{Open2_Multi_RF | Open2_Multi_SVM}.ipynb` (`/content/drive/MyDrive/Machine5/Open_Multi`) and get `final_labels.csv` (`/content/drive/MyDrive/Machine5/datasets`).
| Minseo Kim | Chaewon Kim | Minkyung Song | Seungyeon Kim | Yeonsu Kim |
|---|---|---|---|---|
| Project Management | Open_Binary Model Training & Optimization | Closed Model Training & Optimization | Open_Multi Model Training & Optimization | Result Analysis |
| Feature Engineering | Model Evaluation | Model Evaluation | Model Evaluation | Project Presentation |
Created the following files; if something goes wrong, contact us!

| All | Open_Binary | Closed | Open_Multi | - |
|---|---|---|---|---|
| features | Open_Binary/baseline | Closed_RF_old.ipynb | Open_Multi_RF.ipynb | |
| Closed/{baseline, Closed_RF, Closed_RF_selected10}.ipynb | Open_Binary_RF.ipynb | Closed_SVM_old.ipynb | Open_Multi_SVM.ipynb | |
| Open_Binary/{Open2_Binary_RF_selected6, Open2_Binary_RF_selected12}.ipynb | Open_Binary_SVM.ipynb | Closed_RF.ipynb | Open2_Multi_SVM.ipynb | |
| Open2_Multi_RF.ipynb | Open_Binary_KNN.ipynb | Closed_SVM.ipynb | | |