Skip to content

Latest commit

 

History

History
148 lines (107 loc) · 7.78 KB

README.md

File metadata and controls

148 lines (107 loc) · 7.78 KB
Logo

Dataset Download

GitSchemas (Permissive Licenses)

Introduction

This repository contains scripts to crawl SQL-files from GitHub, parse them and extract structured database schema information from them. We do this, because we want to learn about the semantics of database tables in the wild (table names, column names, foreign key relations etc.).

Disclaimer: The dataset was created with research problems in the area of large scale data integration in mind, such as foreign key detection, and CSV-header detection. For other use cases, we strongly encourage users to revisit the data collection, parsing and extraction methods and align them with their concrete research objective.

Other Datasets

Samples, containing only data from repos with a permissive license like MIT, Apache-2.0, are available for direct download above. We can provide other datasets for reproduction purposes upon request.

Crawler

The GitHub SQL Crawler was built using the GitHub Search API. We search GitHub for SQL code that contains a CREATE TABLE and a FOREIGN KEY statement to make sure we retrieve SQL scripts which actually define a schema.

The GitHub Search index contains ~7.8M SQL files, out of which ~0.7M contain the "CREATE TABLE FOREIGN KEY" keywords (s. Table).

Query # Search Reults (Nov. 2021)
language: SQL ~7.8M
keywords: "CREAT TABLE FOREIGN KEY" language: SQL ~ 0.8M
keywords: "CREAT TABLE FOREIGN KEY" language: SQL (successfully downloaded) ~ 0.7M
keywords: "CREAT TABLE FOREIGN KEY" language: SQL (deduplicated) ~ 0.37M

The file-sizes of all SQL scripts containing the desired keywords, are distributed as follows:

f4b12a00f8561cc4d3315af2e3f906f6.png

It is Interesting to note is that there are no files >400 KB. Maybe GitHub does not perform language detection on files >400 KB.

The crawler performs the following three steps:

  • (1) crawl a list of URLs
  • (2) download files based on the list of URLs
  • (3) deduplicate the downloaded files based on their sha256 hash

SQL Parser

To facilitate downstream use cases, we want to extract structured schema information from the crawled SQL files. This includes things such as table names, column names, primary keys, and foreign keys including their reference table and their reference column names. However, the extraction is no simple feat, because the crawled SQL files may use different SQL dialects, contain comments, be incomplete, and/or contain syntactical errors.

We tried different parsing options available in Python, including the libraries sqlparse (Non-validating), mysqlparse (MySQL), pglast (Postgres), and queryparser (MySQL/Postrgres). Untimately, we foundpglastto provide the best tradeoff between conveniece and parsing success (~16%), leaving us with a total of 61,038 schemas (s. Table).

Description # of files
All SQL files 373,156
Parsable with pglast 61,038 (16.36%)

pglast extracts an abstract syntax tree (AST) from the SQL script, which we ultimately translate to JSON to facilitate further analysis.

Note: There should be quite some room for improvement in terms of the parsing success rate. E.g., backtick-quotes ` (MySQL-style) are incompatible with the postgres parser and currently lead to an immediate error. While something like this might be easy to solve with a search-and-replace, other issues are more intricate. There are many different SQL dialects, and for not every dialect there is an open-source parser available. Therefore it does not seem feasible to simply trial-and-error all possible parsers/dialects. Creating a robust SQL parser which can extract schema information from SQL files, without knowing which particular database system the query was written for, could be an interesting project in and of itself. Any useful pointers regarding this are highly apprecited.

Schema Data

The parsing step results in a JSON file which looks as follows. A sample dataset can be downloaded (s. Download Links section):

{'schema_000001':
	{'INFO': {
		'user': 'user_xxx',
		'url': 'https://github.com/2212khushboo/Project/blob/5a6746d../db/sql/tables.sql?raw=true\n',
		'filename': 'tables.sql',
		'project': 'Project',
		'filesize': 2885},
	 'TABLES': {
	  'state': {
		'COLUMNS': [
			['id', 'serial'],
			['uuid', 'varchar'],
			['state_name', 'varchar']],
	    'PRIMARY_KEYS': ['id'],
	    'FOREIGN_KEYS': []},
	  'address': {
		'COLUMNS': [
			['id', 'serial'],
			['uuid', 'varchar'],
			['flat_buil_number', 'varchar'],
			['locality', 'varchar'],
			['city', 'varchar'],
			['pincode', 'varchar'],
			['state_id', 'int4']],
	    'PRIMARY_KEYS': ['id'],
	    'FOREIGN_KEYS': [{
		 'FOREIGN_KEY': ['state_id'],
		 'REFERENCE_TABLE': 'state',
		 'REFERENCE_COLUMN': ['id']}]}}},
   'schema_000002': {....}
}

The following section gives an overview of some of the properties of the dataset.

Entity #
schemas 61,038
tables 393,653
columns 2,544,164
unique column names 303,443
primary keys 322,627
unique primary key names 31,599
foreign keys 175,589
unique foreign key names 31,041

4f85b31e3fddfd3215355ab1486c42c9.png

25581e9d96abcfd4a11c19aecbe806bb.png

da3cbc115c114ce3aed85e9a16ef31e3.png

454df2ff5a645df073fe9ef87ffc0189.png

e425621af2b5c9008219ca10e6525b07.png

1b11d64f92e09062ba46306b84535370.png

Apart from the raw JSON data, we derived a tabular data set particularly for the foreign key detection problem. The dataset contains only those tables, that have a non-composite-key foreign key relation to another table. For tables that have multiple foreign key relations, we added one line for every relation. It has the following schema and a sample can be downloaded as CSV or parquet file (s. Download Links section):

schema table_name_a table_name_b columns_a columns_b primary_keys_a primary_keys_b key_a key_b
000361_xxxx.sql event resourcetbl planno, lineno, locno, resno, timestart, time... id id resno resno
000376_yyyy.derby.sql category db_category id, name, parent_id id, name, parent_id id id parent_id id

total rows: 124605

Publication

📄 GitSchemas: A Dataset for Automating Relational Data Preparation Tasks

@inproceedings{gitschemas2022,
  author    = {D{\"o}hmen, Till and Hulsebos, Madelon and Beecks, Christian and Schelter, Sebastian},
  title     = {GitSchemas: A Dataset for Automating Relational Data Preparation Tasks},
  year      = {2022},
  maintitle = {ICDE 2022 – 38th IEEE International Conference on Data Engineering},
  booktitle = {Workshop on Databases and Machine Learning (DBML)},
  organization={IEEE}
}