Because of limitations in GitHub storage, we provide a compressed SQL dump of the CVEfixes vulnerability dataset via Zenodo with DOI: 10.5281/zenodo.4476563. The following script can be used to convert this compressed SQL dump into an SQLite3 database:
$ sh Code/create_CVEfixes_from_dump.sh
It is also possible to do this by hand (Windows users can use
sqlite3.exe
instead of sqlite3
)
$ gzcat Data/CVEfixes.sql.gz | sqlite3 Data/CVEfixes.db
The overall structure of the database is as shown in ER diagram. You can use any SQLite tool to view and query the database. DB Browser for SQLite is an open source visual explorer for SQLite databases that is available for Mac, Windows and Linux.
We provide a Jupyter Notebook with example queries to extract the data at different abstraction levels, code to generate statistics, as well as the code to replicate all tables and plots presented in the paper based on the CVEfixes data.
Some example queries to extract the part of CVEfixes database are as follows:
- a query to extract all the method_level vulnerability data of C programming language.
SQL_QUERY = "SELECT m.method_change_id, m.name, m.code, m.before_change, f.programming_language
from file_change f, method_change m
WHERE m.file_change_id=f.file_change_id
AND f.programming_language='C';"
- another example as a query to extract all the code patches of Java programming language that have added/removed only a single statement to fix vulnerabilities.
SQL_QUERY = "SELECT cv.cve_id, f.filename, f.num_lines_added, f.num_lines_deleted, f.code_before, f.code_after, cc.cwe_id
FROM file_change f, commits c, fixes fx, cve cv, cwe_classification cc
WHERE f.hash = c.hash
AND c.hash = fx.hash
AND fx.cve_id = cv.cve_id
AND cv.cve_id = cc.cve_id
AND f.num_lines_added<=1
AND f.num_lines_deleted<=1
AND f.programming_language='Java';;"
Create a .CVEfixes.ini
configuration file in the project root
(where this INSTALL.md file lives) defines the following variables
to customize paths in the [CVEfixes]
section:
-
database_path
: directory that should contain the CVEfixes database file (and some temporary files during extraction). -
sample_limit
: The number of samples to be extracted, sample_limit = 0 is interpreted as unlimited samples, this is discussed in more detail below.
The repository contains a file example.CVEfixes.ini
.
We observe that more than 98% repositories of CVEfixes are hosted on
GitHub. The GitHub API
allows us to retrieve meta information of public repositories. Without
configuration of a token, this API is rate limited and only allows
gathering the information for approximately 25 repositories. Thus, we
recommend the user to obtain a GitHub token and configure a [GitHub]
section of the .CVEfixes.ini
configuration file which has two
variables user
and token
for your GitHub username and access token
respectively. Change the values of these variables to set up the token.
It is recommended to not disclose the token information to prevent its
misuse.
The following script recreates the CVEfixes database file from scratch
up for a time-frame from 2002 up to the latest available CVE records. Be
warned that depending on available internet speed and processing power,
the process can take quite some time (with a modern laptop, around 8-10
hours on T1, and up to 3 days on a simple home connection). Note that
running this script will overwrite an existing CVEfixes.db
file in
database_path
, but that it is possible to restore the database from
the SQL dump on Zenodo, as mentioned above. Before running the
extraction, see also the above remark on setting up a GitHub
token.
$ sh Code/create_CVEfixes_from_scratch.sh
As mentioned, the complete extraction of the CVEfixes dataset could
take up to 3 days depending on the internet connection. To enable simple
demonstrations, the .CVEfixes.ini
configuration file has a
sample_limit
variable that is used to define the number of commits for
CVEs to collect. Setting sample_limit = 0
is interpreted as unlimited,
i.e. collect all available data. When the sample_limit
is set to a
value different than 0
, only that many commits are collected from CVEs
for the current year. We tested sample_limit = 25
, which took
approximately 12 minutes to gather. However, this can take much longer
if any of the CVEs corresponds to major projects. Therefore, whenever
sample_limit
is set to a value different than 0
, we filter out five
major projects to minimize the time to collect a sample.