Replication package for Automated Identification of Libraries from Vulnerability Data: Can We Do Better?
This folder contains the original dataset that we use in our experiments. The zip file contains four files:
- dataset.csv: the original CSV file of the dataset, which has not been cleaned and whose labels have not yet been merged
- dataset_merged_cleaned.csv: the processed version of dataset.csv, which has been cleaned and in which co-occurring labels have been merged
- cve_labels.csv: the CSV file pairing each CVE id with its labels, for the dataset.csv file that has not yet been cleaned and merged
- cve_labels_merged_cleaned.csv: the CSV file pairing each CVE id with its labels, for the cleaned and merged dataset
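The exact rule used to merge co-occurring labels is not spelled out here. As a rough illustration only, the sketch below assumes that "co-occurring" means labels that appear on exactly the same set of CVEs, and collapses each such group into one joined label. The function name, the `;` separator, and the sample CVE ids and labels are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def merge_cooccurring_labels(cve_labels, sep=";"):
    """Merge labels that appear on exactly the same set of CVEs (assumed rule)."""
    # Map each label to the set of CVE ids that carry it.
    label_to_cves = defaultdict(set)
    for cve_id, labels in cve_labels.items():
        for label in labels:
            label_to_cves[label].add(cve_id)

    # Group labels whose CVE sets are identical.
    groups = defaultdict(list)
    for label, cves in label_to_cves.items():
        groups[frozenset(cves)].append(label)

    # Every label in a group is renamed to the joined name of that group.
    merged_name = {}
    for labels in groups.values():
        name = sep.join(sorted(labels))
        for label in labels:
            merged_name[label] = name

    return {
        cve_id: sorted({merged_name[label] for label in labels})
        for cve_id, labels in cve_labels.items()
    }


# Hypothetical toy input, not real dataset rows.
example = {
    "CVE-0000-0001": ["lib-a", "lib-b"],
    "CVE-0000-0002": ["lib-a", "lib-b", "lib-c"],
    "CVE-0000-0003": ["lib-c"],
}
merged = merge_cooccurring_labels(example)
```

Here `lib-a` and `lib-b` always appear together, so they collapse into a single label, while `lib-c` stays separate because it also appears on its own.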
Each folder in this repository contains the implementation of a different XML (extreme multi-label learning) approach that can be applied to the CVE data to identify the libraries. The Utility folder contains utility functions that may make this repository easier to use.
data_preparation.py: contains functions that prepare the dataset for the different XML algorithms. Most of these functions take as input the pre-split dataset available in the dataset/splitted folder. Please make sure that the folder exists in the Utility/dataset/ directory before using the functions in data_preparation.py. Alternatively, modify the functions to create the folder themselves.
dataset folder: contains the CVE dataset, both in split form and in the original CSV form. All output from the data_preparation.py functions is written here.
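The folder check mentioned above can be automated. A minimal sketch, assuming the target is the dataset/splitted folder under Utility/ as described; the helper name `ensure_output_dir` is hypothetical and is not part of data_preparation.py:

```python
import os


def ensure_output_dir(path):
    """Create the directory (and any missing parents) if it does not exist."""
    os.makedirs(path, exist_ok=True)
    return path


# Call this before any function that reads from or writes into the folder.
output_dir = ensure_output_dir(os.path.join("Utility", "dataset", "splitted"))
```

Because `exist_ok=True` is set, calling the helper repeatedly is harmless, so it could be placed at the top of each preparation function.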
- Please refer to the LightXML repository: https://github.com/kongds/LightXML/
- For easier virtual environment management, I recommend using a conda environment
- Then, install the requirements listed in requirements.txt
- Keep in mind when installing the requirements that LightXML uses NVIDIA Apex (https://github.com/NVIDIA/apex). The entry listed in requirements.txt is often linked with the wrong library
- To run the training and evaluation script, use the run.sh script
- ./run.sh cve_data
- Refer to lines 76-86 of the run.sh script
- To get the trained LightXML model, download it to your server with the following command:
wget "https://smu-my.sharepoint.com/:u:/g/personal/yunbolyu_smu_edu_sg/Ed2D4pnUfsRMl4OJuXcNqtkBHmqj_v5jNLx54OqUJ1NqUw?download=1" -O LightXML.zip
- I use a Python 3.6 virtual environment for FastXML
- After creating the virtual environment, install the libraries listed in requirements.txt
- For FastXML, I use the JSON file structure for the train and test data suggested in the FastXML repo README
- A data preparation utility is available as the prepare_fastxml_dataset() function in Utilities/data_preparation.py.
- This function makes use of splitted_train_x.npy and splitted_test_x.npy (the pre-split NumPy dataset) and cve_labels_merged_cleaned.csv (the CSV file containing all the entries)
- To keep the dataset consistent, use the dataset_train.csv and dataset_test.csv files available in utilities/dataset/splitted/splitted_dataset_csv.zip and change the merged column to the text that you want to use as the feature.
- Then, to convert these two CSV files into NumPy arrays, use the save_splitted_dataset_as_numpy() function in data_preparation.py.
- After you have created train.json and test.json for FastXML, copy the two files into the FastXML/dataset folder
- To start the training process, run FastXML/baseline.py. The run parameters need to be defined. As a starting point, you can use the following parameters, which produce results similar to Veracode's implementation:
model/model_name.model dataset/path_to_train.json --verbose train --iters 200 --gamma 30 --trees 64 --min-label-count 1 --blend-factor 0.5 --re_split 0 --leaf-probs
- After the training process is completed, the model will be created in the FastXML/model folder
- Then, run FastXML/baseline.py again to test the model, with the following run parameters:
model/path_to_model_folder dataset/path_to_test.json inference --score
- Running the above test command will produce FastXML/inference_result.json which contains the inference result of the model.
- To calculate the precision, recall, and F1 metrics, run FastXML/util.py, which calculates the metrics from the inference_result.json file.
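For reference, precision, recall, and F1 for multi-label predictions can be computed as in the sketch below. This is not the code in FastXML/util.py. It assumes that predictions and ground truth are already available as one collection of labels per CVE, and it averages per-sample scores, which may differ from the averaging used in the experiments. Nothing is assumed about the structure of inference_result.json, and the function names and sample labels are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Return (precision, recall, F1) for one sample's label sets."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_metrics(all_predicted, all_actual):
    """Average the per-sample metrics over the whole test set."""
    scores = [precision_recall_f1(p, a) for p, a in zip(all_predicted, all_actual)]
    if not scores:
        return 0.0, 0.0, 0.0
    count = len(scores)
    return tuple(sum(score[i] for score in scores) / count for i in range(3))


# Hypothetical labels for two test entries.
predicted = [["lib-a", "lib-b"], ["lib-c"]]
actual = [["lib-a"], ["lib-c", "lib-d"]]
precision, recall, f1 = average_metrics(predicted, actual)
```

To evaluate only the k highest-scoring labels, truncate each prediction list to its first k entries before calling `average_metrics`.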
Omikuji is the name of the library that provides the implementation of both Bonsai and Parabel. It is fairly straightforward to set up Omikuji, as it is readily available as a library.
- Omikuji takes training and test data as svmlight files containing the TF-IDF features of the data
- A data preparation utility is available as the prepare_omikuji_dataset function in Utilities/data_preparation.py
- This utility function makes use of the pre-split NumPy array dataset
- Install the Python binding of Omikuji as specified in its repository README (https://github.com/tomtung/omikuji/).
pip install omikuji
- For the Omikuji library, I use a Python 3.8 environment and omikuji version 0.3.2.
- If there is an error with the Omikuji installation, please consider manually installing Omikuji from the repository.
- The Python binding installed above is used for model prediction.
- For model training, I use the Rust implementation of Omikuji that is available through Cargo (refer to the Build & Install section of the Omikuji repository).
- After Omikuji is successfully installed from Cargo, we can use the following commands to train a model:
Parabel Model
omikuji train --model_path model_output_path --min_branch_size 2 --n_trees 3 path_to_dataset
Bonsai Model
omikuji train --cluster.unbalanced --model_path model_output_path --n_trees 3 dataset/train.txt
- Then, use the trained models to predict on the test data by running Omikuji/omikuji_predict.py with the model_path and test_data_path run parameters.
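For orientation, prediction with the Python binding can be sketched as follows. This is not the content of Omikuji/omikuji_predict.py. It assumes the test file follows the Extreme Classification Repository layout that Omikuji reads (one header line, then `labels feature:value ...` per sample). The helper names and the `top_k` parameter are hypothetical. `omikuji.Model.load` and `Model.predict` are the calls shown in the Omikuji README.

```python
def parse_sample(line):
    """Split one data line into (labels, [(feature_index, value), ...])."""
    tokens = line.split()
    labels = []
    # The first token holds comma-separated labels unless it is already a feature.
    if tokens and ":" not in tokens[0]:
        labels = [int(label) for label in tokens[0].split(",")]
        tokens = tokens[1:]
    features = []
    for token in tokens:
        index, value = token.split(":")
        features.append((int(index), float(value)))
    return labels, features


def predict_file(model_path, test_data_path, top_k=3):
    """Load a trained model and return the top-k (label, score) pairs per sample."""
    import omikuji  # imported here so parse_sample works without the binding

    model = omikuji.Model.load(model_path)
    results = []
    with open(test_data_path) as handle:
        next(handle, None)  # skip the assumed header line
        for line in handle:
            if not line.strip():
                continue
            _, features = parse_sample(line)
            scored = sorted(model.predict(features), key=lambda pair: -pair[1])
            results.append(scored[:top_k])
    return results


# Parsing one hypothetical data line; no trained model is needed for this step.
labels, features = parse_sample("3,7 0:0.25 12:0.5")
```

With a trained model on disk, `predict_file("model_output_path", "path_to_test_file")` would return one ranked list per test sample; both arguments are placeholders for your own paths.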