Commit

Merge pull request #1 from m-zakeri/r0.3.0
R0.3.0
m-zakeri authored Oct 13, 2019
2 parents 0f9d112 + 911af4a commit 4393fe1
Showing 90 changed files with 2,474 additions and 15 deletions.
39 changes: 32 additions & 7 deletions README.md
@@ -11,6 +11,38 @@ Fuzz testing (Fuzzing) is a dynamic software testing technique. In this techniqu
In this thesis, we proposed an automated method for hybrid test data generation. To this end, we apply neural language models (NLMs) built with recurrent neural networks (RNNs). Using deep learning techniques, the proposed models learn the statistical structure of complex files and then generate new textual test data based on the grammar and binary test data based on mutations. The generated data is fuzzed by two newly introduced algorithms, called neural fuzz algorithms, that use these models. We use the proposed method to generate test data and then fuzz test MuPDF, a complex application that takes portable document format (PDF) files as input. To train our generative models, we gathered a large corpus of PDF files. Our experiments demonstrate that the data generated by this method increases code coverage by more than 7% compared to state-of-the-art file format fuzzers such as American fuzzy lop (AFL). The experiments also indicate that the simpler NLMs learn more accurately than the more complicated encoder-decoder model, and confirm that our proposed models can outperform the encoder-decoder model in code coverage when fuzzing the SUT.


## Getting Started
In the current release (0.3.0), you can use IUST-DeepFuzz to generate test data and then fuzz any application.

### Install
You need Python 3.6.x and up-to-date TensorFlow and Keras frameworks on your computer.
* Install [Python 3.6.x](https://www.python.org/)
* Install [TensorFlow](https://www.tensorflow.org/)
* Install [Keras](https://keras.io/)
* Clone the IUST-DeepFuzz repository: `git clone https://github.com/m-zakeri/iust_deep_fuzz.git` or download the latest version from https://github.com/m-zakeri/iust_deep_fuzz.git
* IUST-DeepFuzz is almost ready for test data generation!
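
Before running any script, it can help to confirm that the interpreter and the two frameworks are visible to Python. The following is a minimal sketch and not part of the repository: the function name `check_environment` is hypothetical, and it only checks that the packages can be found, not that their versions are compatible with each other.

```python
import importlib.util
import sys


def check_environment(required=("tensorflow", "keras"), min_python=(3, 6)):
    """Report whether the interpreter and the required packages are usable."""
    report = {"python": sys.version_info[:2] >= min_python}
    for name in required:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```

If any entry is reported as missing, install it with the links above before moving on to the Running steps.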

### Running
* Configure `config.py` to work with your dataset and set the other path settings.
* Find the script for the specific algorithm that you need.
* Run the script from the command line: `python script_name.py`
* Wait until the file format is learned and your test data is generated!

#### Available Pre-trained Models
A pre-trained model is a model that was trained on a large benchmark dataset to solve a problem similar to the one that we want to solve. For the time being, we provide some pre-trained models for the PDF file format. Our best trained model is available at [model_checkpoint/best_models](model_checkpoint/best_models)

#### Available Fuzzing Scripts
IUST-DeepFuzz implements four new deep models and two new fuzz algorithms, DataNeuralFuzz and MetadataNeuralFuzz, as our contribution in the mentioned thesis. The following scripts for generating and fuzzing test data are available in the current release (r0.3.0):
* `data_neural_fuzz.py`: Implements the DataNeuralFuzz algorithm for fuzzing data in the files.
* `metadata_neural_fuzz.py`: Implements the MetadataNeuralFuzz algorithm for fuzzing metadata in the files.
* `learn_and_fuzz_3_sample_fuzz.py`: Implements the SampleFuzz algorithm introduced in https://arxiv.org/abs/1701.07232.
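
To give a feel for what a sampling-based fuzz algorithm does, here is a rough sketch of the SampleFuzz idea: sample the next character from a learned distribution, and where the model is very confident, occasionally swap in the least likely character instead. This is not the code in `learn_and_fuzz_3_sample_fuzz.py`. A toy bigram counter stands in for the neural language model, and every name and threshold value here (`train_bigram_model`, `sample_fuzz`, `t_fuzz`, `p_t`, `max_len`) is an assumption chosen for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Toy stand-in for a neural language model: next-char counts per char."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def next_char_distribution(model, prefix):
    """Return {char: probability} for the character that follows prefix."""
    counter = model.get(prefix[-1])
    if not counter:
        # Unseen context: fall back to a uniform choice over known chars.
        chars = sorted(model)
        return {c: 1.0 / len(chars) for c in chars}
    total = sum(counter.values())
    return {c: n / total for c, n in counter.items()}


def sample_fuzz(model, start="obj ", end="endobj", t_fuzz=0.9, p_t=0.8,
                max_len=200, rng=None):
    """Generate one object, injecting unlikely chars at confident positions."""
    rng = rng or random.Random()
    seq = start
    while not seq.endswith(end) and len(seq) < max_len:
        dist = next_char_distribution(model, seq)
        chars = list(dist)
        c = rng.choices(chars, weights=[dist[ch] for ch in chars])[0]
        # Fuzz only where the model is confident, and only some of the time.
        if rng.random() > t_fuzz and dist[c] > p_t:
            c = min(dist, key=dist.get)
        seq += c
    return seq


if __name__ == "__main__":
    corpus = "obj << /Type /Page >> endobj obj << /Length 5 >> endobj "
    print(sample_fuzz(train_bigram_model(corpus), rng=random.Random(0)))
```

Restricting the substitution to high-confidence positions keeps most of the output well-formed, so the generated file is likely to get past the parser before it hits the anomalous character.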

#### Available Dataset
Various file formats for learning with IUST-DeepFuzz and then fuzz testing are available in the [dataset directory](dataset).


## How It Works?

### The PDF File Generation Process
![amazing_test_data_generation_process](docs/figs/amazing_test_data_generation_process.gif)

@@ -20,13 +52,6 @@ In this thesis, we proposed an automated method for hybrid test data generation.



## About
### Version 0.1
The main purpose of this version is to implement a free version of the Learn&Fuzz paper and to improve the **learn\&fuzz algorithm**.

### Version 0.2
This version implements four new deep models and two new fuzz algorithms, DataNeuralFuzz and MetadataNeuralFuzz, as our contribution in the mentioned thesis.

### FAQs
This repository is under *active development* and is not yet well documented. If you have downloaded or forked the source code and have any questions, feel free to email me (*[email protected]*) for more information. You may also see the main [references](REFERENCES.md) or look at our large [test corpus](dataset).

19 changes: 11 additions & 8 deletions dataset/README.md
@@ -1,24 +1,27 @@
# IUST Neural Software Testing (NST) Dataset
# IUST Neural Software Testing (IUST-NST) Dataset

Neural software testing (NST) is about applying machine learning techniques, especially deep learning and neural networks, in the field of software testing. We began with fuzz testing, but the approach can extend to other types of software testing. Data is an unavoidable part of every machine learning task. The goal of this section is to provide suitable public datasets that other researchers can use.


For now, we are gathering large corpora for different file formats such as portable document format (PDF), extensible markup language (XML), and hypertext markup language (HTML) to fuzz test real-world applications that take these formats as their major inputs.
At this time, IUST PDF Corpus is ready to view and download.

## IUST PDF Corpus

![IUSTPDFCorpusDemo Image](pdfs/IUSTPDFCorpusDemo.PNG)
### News
**2019-10-13:** IUST-PDFCorpus version 1.0.0 is publicly available at [https://zenodo.org/record/3484013](https://zenodo.org/record/3484013) with DOI **10.5281/zenodo.3484013**.


## IUST-PDFCorpus
**Download:** [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3484013.svg)](https://doi.org/10.5281/zenodo.3484013)

![IUSTPDFCorpusDemo Image](pdfs/IUST-PDFCorpusDemo.PNG)

We are happy to introduce **IUST PDF Corpus**, a large set of various PDF files, aimed at manipulating,
testing and improving the quality of real-world PDF readers such as [MuPDF](https://mupdf.com/).
IUST PDF Corpus (version 1.0) contains about **6,000** PDF files. We extracted more than **500,000** PDF data objects from this corpus to evaluate IUST DeepFuzz, our new file format fuzzer.
IUST PDF Corpus (version 1.0) contains **6,141** PDF files. We extracted more than **500,000** PDF data objects from this corpus to evaluate IUST DeepFuzz, our new file format fuzzer.
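
As an illustration of what a PDF data object is, the sketch below pulls `N G obj ... endobj` spans out of raw PDF bytes with a regular expression. It is a naive approximation and not the extraction tool used to build this dataset: the name `extract_pdf_objects` is hypothetical, objects stored inside compressed object streams are not found, and a binary stream that happens to contain the bytes `endobj` would be cut short.

```python
import re

# Indirect objects in a PDF body look like "12 0 obj ... endobj".
PDF_OBJECT = re.compile(rb"\d+\s+\d+\s+obj\b.*?\bendobj", re.DOTALL)


def extract_pdf_objects(data):
    """Return the raw bytes of every 'N G obj ... endobj' span in data."""
    return [match.group(0) for match in PDF_OBJECT.finditer(data)]


if __name__ == "__main__":
    sample = (b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
              b"2 0 obj\n(hello)\nendobj\ntrailer")
    for obj in extract_pdf_objects(sample):
        print(obj)
```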

The extracted objects are placed under a _pdfs_ directory. We divide the object dataset into two sub-datasets: _large-size_ and _small-size_. The small-size dataset is created to develop and test the generative models and has about 120,000 PDF objects. The large-size dataset is used to train deep models and fuzz test PDF viewers and has 500,000 PDF objects.
We are extending this corpus and plan to add more PDF files as soon as possible.
We also extracted 1000 binary streams from the data objects. These streams are placed under the small-size subdirectory. All extracted objects are available to [view and download](./pdfs/) from the current GitHub repository. The complete set of PDF files will be available to view and download as soon as our relevant paper on IUST DeepFuzz is published.

* [View and download IUST PDF Corpus (version 1.0)](https://www.dropbox.com/sh/0gr8qscxdoawwtw/AAD_0Za_bFbrfCoSBTzoeE1Oa?dl=0) [Not available yet!]
We also extracted 1000 binary streams from the data objects. These streams are placed under the small-size subdirectory. All extracted objects are available to [view and download](./pdfs/) from the current GitHub repository. The complete set of PDF files will be available to view and download as soon as our relevant paper on IUST DeepFuzz is published.


## IUST XML Corpus