DeepBugs is a framework for learning name-based bug detectors from an existing code corpus. See this technical report for a detailed description.
- All commands are called from the main directory.
- Python code (most of the implementation) and JavaScript code (for extracting data from .js files) are in the `/python` and `/javascript` directories.
- All data to learn from, e.g., .js files, is expected to be in the `/data` directory.
- All generated data, e.g., intermediate representations, is written into the `/data/tech` directory.
- All generated data files have a timestamp as part of the file name. Below, these files are referenced with a `*` wildcard. When running commands multiple times, make sure to use the most recent files.
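Because generated files carry a timestamp in their name, it can help to resolve a wildcard to the newest match before passing it on. The following is a minimal sketch, not part of DeepBugs: the helper name `most_recent` is hypothetical, and it assumes that a file's modification time reflects when it was generated.

```python
import glob
import os


def most_recent(pattern):
    """Return the newest file matching a glob pattern, or None if nothing matches.

    Assumption: the modification time reflects the order in which files were
    generated. `most_recent` is a hypothetical helper, not a DeepBugs function.
    """
    matches = glob.glob(pattern)
    if not matches:
        return None
    return max(matches, key=os.path.getmtime)
```

For example, `most_recent("calls_prefix_*.json")` would return the latest matching file in the current directory, if one exists.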
- Node.js
- npm modules (install with `npm install module_name`): acorn, estraverse, walk-sync
- Python 3
- Python packages: keras, scipy, numpy, sklearn, asttokens
- The full corpus can be downloaded here and is expected to be stored in `data/js/programs_all`. It consists of 100,000 training files, listed in `data/js/programs_training.txt`, and 50,000 files for validation, listed in `data/js/programs_eval.txt`.
- This repository contains only a very small subset of the corpus. It is stored in `data/js/programs_50`. Training and validation files for the small corpus are listed in `data/js/programs_50_training.txt` and `data/js/programs_50_eval.txt`.
- The full corpus can be downloaded here and should be used in the same way as the JS corpus.
Creating a bug detector consists of two main steps:
- Extract positive (i.e., likely correct) and negative (i.e., likely buggy) training examples from code.
- Train a classifier to distinguish correct from incorrect code examples.
Each bug detector addresses a particular bug pattern, e.g.:
- The `SwappedArgs` bug detector looks for accidentally swapped arguments of a function call, e.g., calling `setPoint(y,x)` instead of `setPoint(x,y)`.
- The `BinOperator` bug detector looks for incorrect operators in binary operations, e.g., `i <= len` instead of `i < len`.
- The `IncorrectBinaryOperand` bug detector looks for incorrect operands in binary operations, e.g., `height - x` instead of `height - y`.
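To make the three bug patterns concrete, the sketch below derives a likely-buggy variant from a likely-correct example for each pattern. It is only an illustration: the dictionary layouts and function names are hypothetical and do not reflect the data format that the extraction scripts actually produce.

```python
def swap_arguments(call):
    """SwappedArgs-style variant: swap the first two arguments of a call.

    `call` uses a hypothetical layout: {"callee": str, "arguments": [str, ...]}.
    Returns None if the call has fewer than two arguments.
    """
    args = list(call["arguments"])
    if len(args) < 2:
        return None
    args[0], args[1] = args[1], args[0]
    return {"callee": call["callee"], "arguments": args}


def replace_operator(bin_op, new_operator):
    """BinOperator-style variant: exchange the operator of a binary operation.

    `bin_op` uses a hypothetical layout: {"left": str, "operator": str, "right": str}.
    Returns None if the operator would not change.
    """
    if new_operator == bin_op["operator"]:
        return None
    return {**bin_op, "operator": new_operator}


def replace_right_operand(bin_op, new_operand):
    """IncorrectBinaryOperand-style variant: exchange the right operand.

    Returns None if the operand would not change.
    """
    if new_operand == bin_op["right"]:
        return None
    return {**bin_op, "right": new_operand}
```

Applied to the examples above, `swap_arguments` turns `setPoint(x,y)` into `setPoint(y,x)`, and `replace_operator` turns `i < len` into `i <= len`.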
node javascript/extractFromJS.js calls prefix --parallel 4 data/js/programs_50_training.txt data/js/programs_50
or
python3 python/extractor/ExtractFromPython.py calls prefix data/python/programs_50_training.txt data/python/programs_50
- The second argument is a prefix added to the names of all generated files.
- The `--parallel` argument sets the number of processes to run.
- `programs_50_training.txt` contains files to include (one file per line). To extract data for validation, run the command with `data/js/programs_50_eval.txt`.
- The last argument is a directory that gets recursively scanned for .js files, considering only files listed in the file given as the preceding argument.
- The command produces `calls_prefix_*.json` files, which contain data suitable for the `SwappedArgs` bug detector. For the other two bug detectors, replace `calls` with `binOps` in the above command.
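The file selection described above (scan a directory recursively, keep only the files named in a list file) can be sketched as follows. This is not the extractor's actual code: the function names are hypothetical, and the sketch assumes that each line of the list file is a path written the same way as the paths found while scanning, e.g., relative to the main directory.

```python
import os


def read_file_list(list_path):
    """Read a list file with one path per line, ignoring blank lines."""
    with open(list_path, encoding="utf-8") as f:
        return {os.path.normpath(line.strip()) for line in f if line.strip()}


def select_listed_files(root_dir, list_path, extension=".js"):
    """Recursively scan root_dir and keep only files named in the list file.

    Assumption: listed paths and scanned paths are spelled the same way,
    so a normalized string comparison is enough to match them.
    """
    listed = read_file_list(list_path)
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.normpath(os.path.join(dirpath, name))
            if name.endswith(extension) and path in listed:
                selected.append(path)
    return sorted(selected)
```

A call such as `select_listed_files("data/js/programs_50", "data/js/programs_50_training.txt")` would then return the training files, provided the list entries match that path convention.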
python3 python/AnomalyDetector2.py SwappedArgs --learn token_to_vector.json type_to_vector.json node_type_to_vector.json --trainingData calls_xx*.json --validationData calls_yy*.json
- The first argument selects the bug pattern.
- The next three arguments are vector representations for tokens (here: identifiers and literals), for types, and for AST node types. These files are provided in the repository.
- The remaining arguments are two lists of .json files. They contain the training and validation data extracted in Step 1.
- After learning the bug detector, the command measures accuracy and recall w.r.t. seeded bugs and writes a list of potential bugs in the unmodified validation code (see `poss_anomalies.txt`).
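For readers unfamiliar with the two metrics, the sketch below computes accuracy and recall from their standard definitions, with `True` marking an example that contains a seeded bug. How `AnomalyDetector2.py` computes its reported numbers is not stated here, so treat this as an illustration of the definitions only; the function name is hypothetical.

```python
def accuracy_and_recall(labels, predictions):
    """Compute accuracy and recall for binary labels (True = buggy).

    accuracy = correctly classified examples / all examples
    recall   = buggy examples flagged as buggy / all buggy examples
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    true_positives = sum(1 for l, p in zip(labels, predictions) if l and p)
    actual_positives = sum(1 for l in labels if l)
    accuracy = correct / len(labels)
    # Recall is undefined without buggy examples; report 0.0 in that case.
    recall = true_positives / actual_positives if actual_positives else 0.0
    return accuracy, recall
```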
Note that learning a bug detector from the very small corpus of 50 programs will yield a classifier with low accuracy that is unlikely to be useful. To leverage the full power of DeepBugs, you'll need a larger code corpus, e.g., the JS150 corpus mentioned above.
The above bug detectors rely on a vector representation for identifier names and literals. The easiest way to use our framework is with the shipped `token_to_vector.json` file. Alternatively, you can learn the embeddings via Word2Vec as follows:
- Extract identifiers and tokens:
node javascript/extractFromJS.js tokens prefix --parallel 4 data/js/programs_50_training.txt data/js/programs_50
or
python3 python/extractor/ExtractFromPython.py tokens prefix data/python/programs_50_training.txt data/python/programs_50
- The command produces `tokens_prefix_*.json` files.
- Encode identifiers and literals with context into arrays of numbers (for faster reading during learning):
python3 python/TokensToTopTokens.py prefix tokens_*.json
- The arguments are the files just created.
- The command produces `encoded_tokens_prefix_*.json` files and a file `token_to_number_prefix_*.json` that assigns a number to each identifier and literal.
- Learn embeddings for identifiers and literals:
python3 python/EmbeddingsLearnerWord2Vec.py prefix token_to_number_*.json encoded_tokens_*.json
- The arguments are the files just created.
- The command produces a file `token_to_vector_prefix_*.json`.
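The encoding step in the middle of this pipeline, which maps each identifier and literal to a number, can be illustrated with a small sketch. It assumes a vocabulary capped at the most frequent tokens with a shared placeholder for everything else; the source does not confirm that `TokensToTopTokens.py` works this way, and the names `UNKNOWN`, `build_token_to_number`, and `encode` are hypothetical.

```python
from collections import Counter

UNKNOWN = "UNK"  # hypothetical placeholder for tokens outside the vocabulary


def build_token_to_number(token_sequences, max_tokens):
    """Assign a number to each of the max_tokens most frequent tokens.

    Number 0 is reserved for UNKNOWN. Ties in frequency are broken
    alphabetically so that the mapping is deterministic.
    """
    counts = Counter(t for seq in token_sequences for t in seq if t != UNKNOWN)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    token_to_number = {UNKNOWN: 0}
    for token, _count in ranked[:max_tokens]:
        token_to_number[token] = len(token_to_number)
    return token_to_number


def encode(sequence, token_to_number):
    """Replace each token by its number, using the UNKNOWN number as fallback."""
    unknown_number = token_to_number[UNKNOWN]
    return [token_to_number.get(t, unknown_number) for t in sequence]
```

Storing sequences as arrays of small integers rather than strings is what makes them faster to read during learning, as noted in the encoding step above.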