┣ 📂Closed
┃ ┣ 📜Closed_RF.ipynb
┃ ┣ 📜Closed_RF_old .ipynb
┃ ┣ 📜Closed_RF_selected10.ipynb
┃ ┣ 📜Closed_SVM.ipynb
┃ ┣ 📜Closed_SVM_old.ipynb
┃ ┗ 📜baseline.ipynb
┣ 📂Open_Binary
┃ ┣ 📂baseline
┃ ┃ ┗ 📜Open_Binary_KNN.ipynb
┃ ┣ 📜Open2_Binary_RF_selected12.ipynb
┃ ┣ 📜Open2_Binary_RF_selected6.ipynb
┃ ┣ 📜Open_Binary_RF.ipynb
┃ ┗ 📜Open_Binary_SVM.ipynb
┣ 📂Open_Multi
┃ ┣ 📜Open2_Multi_RF.ipynb
┃ ┣ 📜Open2_Multi_SVM.ipynb
┃ ┣ 📜Open_Multi_RF.ipynb
┃ ┗ 📜Open_Multi_SVM.ipynb
┣ 📂datasets
┃ ┣ 📜binary_labels.csv
┃ ┣ 📜final_labels.csv
┃ ┣ 📜mon_features.csv
┃ ┣ 📜mon_features_old.csv
┃ ┣ 📜mon_labels.csv
┃ ┣ 📜unmon3000_features.csv
┃ ┣ 📜unmon3000_features_old.csv
┃ ┣ 📜unmon_features.csv
┃ ┗ 📜unmon_features_old.csv
┣ 📂features
┃ ┣ 📂feature_information
┃ ┃ ┣ 📜combined_feature_information.ipynb
┃ ┃ ┣ 📜comimage.png
┃ ┃ ┣ 📜comimage2.png
┃ ┃ ┣ 📜mon_feature_information.ipynb
┃ ┃ ┣ 📜monimage.png
┃ ┃ ┣ 📜monimage2.png
┃ ┃ ┣ 📜unmon_feature_information.ipynb
┃ ┃ ┗ 📜unmonimage.png
┃ ┣ 📂original_datasets
┃ ┃ ┣ 📜mon_standard.pkl
┃ ┃ ┣ 📜unmon_standard10.pkl
┃ ┃ ┗ 📜unmon_standard10_3000.pkl
┃ ┣ 📜
┃ ┗ 📜feature_generator.ipynb
┣ 📜
┣ 📜Scenario1_SVM.png
┣ 📜Scenario2_RF.png
┣ 📜Scenario2_RF_2.png
┗ 📜how_to_run.ipynb
- Given data: mon_standard.pkl (data from monitored websites), unmon_standard10.pkl (data from unmonitored websites).
- Project Purpose: Based on the given data, create a model that makes the following predictions; label the monitored website instances with {0, 1, 2, ..., 94} and the unmonitored website instances with the label '-1'.
- Constraints
- The model must be selected from LR, NB, SVM, DT, GB, k-NN, Clustering, and NN.
- The following metrics must be used: accuracy (Closed-World); true positive rate, false positive rate, precision, PR curve, and ROC (Open-World).
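The open-world rates listed above can be sketched in a few lines. This is a minimal illustration, not the project's evaluation code: the function name `open_world_rates` and the toy labels are assumptions, with 1 standing for monitored and -1 for unmonitored.

```python
def open_world_rates(y_true, y_pred, positive=1):
    """Compute TPR, FPR and precision, treating `positive` as the monitored class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "tpr": tp / (tp + fn) if tp + fn else 0.0,        # share of monitored traces detected
        "fpr": fp / (fp + tn) if fp + tn else 0.0,        # share of unmonitored traces flagged
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # share of flagged traces that are monitored
    }


# Toy labels (hypothetical): 1 = monitored, -1 = unmonitored
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, 1, -1, -1, 1]
rates = open_world_rates(y_true, y_pred)
print(rates)
```

For the PR curve and ROC, a score per instance (for example a predicted probability) is needed rather than hard labels.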
Candidate Features
- Details: ./features/
Closed-World Multi, Open-World Binary Classification
Compare baseline models and select the one that performs best on the given data.
Only features with high importance and correlation coefficients are selected for training to avoid overfitting and speed up training.
We use data preprocessing and hyperparameter tuning to improve the accuracy of the selected model.
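One way to select features by importance and correlation is sketched below. The data is synthetic, and the ranking rule (sum of normalized Random Forest importance and normalized absolute correlation with the label, top `k` kept) is an assumption for illustration; the notebooks may combine the two criteria differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the feature CSVs (assumption: 12 candidate features)
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=2, random_state=0
)

# Criterion 1: impurity-based feature importance from a Random Forest
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importance = rf.feature_importances_

# Criterion 2: absolute Pearson correlation between each feature and the label
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Hypothetical combination rule: normalize both criteria and add them
score = importance / importance.max() + corr / corr.max()

k = 6  # number of features to keep (assumption)
selected = np.argsort(score)[::-1][:k]
X_selected = X[:, selected]
print("selected feature indices:", sorted(selected.tolist()))
```

Training on `X_selected` instead of `X` reduces the input dimension, which is what shortens training time.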
Open-World Binary Scenario
- Combine monitored and unmonitored data into a single dataset.
- Train a binary classifier (-1 vs 1) on the entire dataset, where the label -1 represents unmonitored data and 1 represents monitored data.
- Additional Step: Extract labels based on the binary classification prediction results to implement a multi-class model, and save these labels in a CSV file (used in Open-World Multi Scenario 2).
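The steps above can be sketched as follows. The feature values are randomly generated stand-ins, and the output file name `binary_labels_example.csv` and its single column name are assumptions; the real notebooks read the feature CSVs from `./datasets` instead.

```python
import csv
import os
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the monitored and unmonitored feature tables
mon_X = rng.normal(1.0, 1.0, size=(200, 5))
unmon_X = rng.normal(-1.0, 1.0, size=(100, 5))

# Combine both sets into one dataset: 1 = monitored, -1 = unmonitored
X = np.vstack([mon_X, unmon_X])
y = np.concatenate([np.ones(len(mon_X), dtype=int), -np.ones(len(unmon_X), dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
binary_pred = clf.predict(X_test)

# Save the predicted binary labels so a later multi-class stage can reuse them
out_path = os.path.join(tempfile.gettempdir(), "binary_labels_example.csv")
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["binary_label"])
    writer.writerows([[int(p)] for p in binary_pred])
```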
The open-world multi classification followed two scenarios.
Open-World Multi Scenario 1
- The model is selected by considering the baseline of multi-classification in the closed world and the baseline of binary classification in the open world.
- Combine data for monitored and unmonitored instances.
- Using the selected model, predict the label in {-1, 0, 1, ..., 94} for the combined data.
Open-World Multi Scenario 2
- Preprocess the data without classification, taking into account the different feature importances and correlation coefficients between monitored and unmonitored data.
- Train a closed multi-classification model and an open binary classification model for each data separately.
- Perform prediction of open binary classification model > Extract the prediction result > Perform multi-classification based on this prediction result.
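A minimal sketch of this two-stage flow is shown below. It uses synthetic, well-separated data with 3 monitored classes standing in for the 95 real ones, and the helper name `predict_two_stage` is hypothetical; it only illustrates how the binary prediction gates the multi-class prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_classes = 3  # stand-in for the 95 monitored sites (assumption)

# Synthetic monitored traces (one cluster per site) and unmonitored traces
mon_X = np.vstack(
    [rng.normal(loc=3.0 * (c + 1), scale=0.5, size=(60, 4)) for c in range(n_classes)]
)
mon_y = np.repeat(np.arange(n_classes), 60)
unmon_X = rng.normal(loc=-3.0, scale=0.5, size=(90, 4))

# Stage 1 model: open-world binary classifier trained on all data (-1 vs 1)
X_all = np.vstack([mon_X, unmon_X])
y_binary = np.concatenate(
    [np.ones(len(mon_X), dtype=int), -np.ones(len(unmon_X), dtype=int)]
)
binary_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_all, y_binary)

# Stage 2 model: closed-world multi-class classifier trained on monitored data only
multi_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(mon_X, mon_y)


def predict_two_stage(X):
    """Predict -1 for traces judged unmonitored, otherwise the monitored site label."""
    stage1 = binary_clf.predict(X)
    final = np.full(len(X), -1, dtype=int)
    monitored = stage1 == 1
    if monitored.any():
        final[monitored] = multi_clf.predict(X[monitored])
    return final


final_labels = predict_two_stage(X_all)
```

Because the second stage never sees instances rejected by the first, errors of the binary model carry over directly to the final labels.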
Open-World Binary Scenario
The SVM achieved a baseline accuracy of 80.00%, which improved to 83.29% after hyperparameter optimization.
Random Forest demonstrated the highest performance and improved generalization through hyperparameter tuning.
- The baseline accuracy of Random Forest was 84.72%, which was adjusted to 82.12% after tuning.
- The tuning focused on reducing overfitting and addressing data imbalance by limiting model complexity and applying class_weight='balanced'.
- Although accuracy and ROC-AUC decreased to 82.12% and 81.74%, respectively, the model's generalization performance improved.
- PR-AUC remained high at 91.91%, and precision was maintained at 89.53%, effectively minimizing false positives and achieving balanced performance.
- Future Improvement Direction: Optimize class weights, tune hyperparameters, and address data imbalance using resampling techniques.
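The kind of tuning described above might look like the sketch below. The data is synthetic, and `max_depth=8` is an assumption chosen only to show a complexity limit; the text does not state which hyperparameter was restricted or to what value. Only `class_weight='balanced'` comes from the description.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data; labels mapped to -1 (unmonitored) and 1 (monitored)
X, y01 = make_classification(
    n_samples=600, n_features=10, weights=[0.35, 0.65], random_state=0
)
y = 2 * y01 - 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,              # hypothetical complexity limit to curb overfitting
    class_weight="balanced",  # reweights classes inversely to their frequency
    random_state=0,
)
clf.fit(X_train, y_train)

# Probability of the monitored class (label 1) for threshold-free metrics
proba_monitored = clf.predict_proba(X_test)[:, list(clf.classes_).index(1)]

roc_auc = roc_auc_score(y_test, proba_monitored)
pr_auc = average_precision_score(y_test, proba_monitored, pos_label=1)
precision = precision_score(y_test, clf.predict(X_test), pos_label=1)
print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}  precision={precision:.3f}")
```

Comparing these scores between training and held-out data is one way to check whether a given limit actually reduces overfitting.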
Open-World Multi Scenario 1
In this scenario, neither the multi-classification nor the binary classification considers the importance of the features used, resulting in relatively low accuracy.
Open-World Multi Scenario 2
The criteria not considered in Scenario 1 were applied, resulting in a relatively high accuracy.
Both open-world multi classifications resulted in very low ROC and PR scores because the dataset was highly imbalanced (-1 : remaining 95 classes = 10000 : 19000).
Solution: Techniques such as oversampling/undersampling can be used to balance the classes. Additionally, samples from the smaller classes can be given larger weights.
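As an illustration of the undersampling idea, the sketch below caps every class at a fixed number of samples. The helper `undersample_indices`, the cap, and the toy label counts are all assumptions; with the real data, the 10000 unmonitored instances compare against about 200 instances per monitored class on average (19000 / 95).

```python
import random
from collections import Counter, defaultdict


def undersample_indices(labels, max_per_class, seed=0):
    """Return sorted indices that keep at most `max_per_class` samples per label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    kept = []
    for indices in by_class.values():
        if len(indices) > max_per_class:
            # Randomly drop samples from over-represented classes only
            indices = rng.sample(indices, max_per_class)
        kept.extend(indices)
    return sorted(kept)


# Toy labels (hypothetical): one large unmonitored class and three small monitored classes
labels = [-1] * 50 + [0] * 10 + [1] * 10 + [2] * 10
kept = undersample_indices(labels, max_per_class=10)
balanced_counts = Counter(labels[i] for i in kept)
print(balanced_counts)
```

The returned indices can be used to slice both the feature matrix and the label vector before training. Undersampling discards data, so it is usually applied to the training split only.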
- Calculate features based on findings from prior research papers for a given classification objective and dataset
- Calculate feature importance, correlation coefficients, and selection considering the purpose of the model
- After selecting a model based on our understanding of the dataset, we implement a baseline model to experimentally verify its practical performance.
- Optimal model achieves accuracy of 80% or higher for open-world (final) classifications
- Identify the cause of ROC and PR score decline due to dataset imbalance issues and propose solutions
- If you only want to reproduce the final prediction experiment, we highly recommend skipping 2. Feature generation and 3. Scenario 1.
- In this case, you only need to download the folder and the ./Open_Multi/{Open2_Multi_RF | Open2_Multi_SVM}.ipynb file (which requires a dataset load path conversion) and run the file.
- Run the notebook in /content/drive/MyDrive/Machine5/featuers and get {mon_features | unmon_features | unmon3000_features}.ipynb
- Closed: Run {baseline | Closed_RF | Closed_SVM}.ipynb (/content/drive/MyDrive/Machine5/Closed)
- Open_Binary: Run {Open_Binary_RF | Open_Binary_SVM}.ipynb (/content/drive/MyDrive/Machine5/Open_Binary)
- Open_Multi: Run {Open_Multi_RF | Open_Multi_SVM}.ipynb
- Open_Binary model should be executed before Open_Multi model
- Closed: Run {baseline | Closed_RF | Closed_SVM}.ipynb (/content/drive/MyDrive/Machine5/Closed)
- Open_Binary: Run the notebook in /content/drive/MyDrive/Machine5/Open_Binary and get binary_labels.csv (/content/drive/MyDrive/Machine5/datasets)
- Open_Multi: Run {Open2_Multi_RF | Open2_Multi_SVM}.ipynb (/content/drive/MyDrive/Machine5/Open_Multi) and get final_labels.csv
| Minseo Kim | Chaewon Kim | Minkyung Song | Seungyeon Kim | Yeonsu Kim |
| --- | --- | --- | --- | --- |
| Project Management | Open_Binary Model Training & Optimization | Closed Model Training & Optimization | Open_Multi Model Training & Optimization | Result Analysis |
| Feature Engineering | Model Evaluation | Model Evaluation | Model Evaluation | Project Presentation |
Created the following files; if something goes wrong, contact us!

| All | Open_Binary | Closed | Open_Multi | - |
| --- | --- | --- | --- | --- |
| features | Open_Binary/baseline | Closed_RF_old.ipynb | Open_Multi_RF.ipynb | |
| Closed/{baseline, Closed_RF, Closed_RF_selected10}.ipynb | Open_Binary_RF.ipynb | Closed_SVM_old.ipynb | Open_Multi_SVM.ipynb | |
| Open_Binary/{Open2_Binary_RF_selected6, Open2_Binary_RF_selected12}.ipynb | Open_Binary_SVM.ipynb | Closed_RF.ipynb | Open2_Multi_SVM.ipynb | |
| Open2_Multi_RF.ipynb | Open_Binary_KNN.ipynb | Closed_SVM.ipynb | | |