Faster position-based slice or lookup on multiple bams.
You have dozens or more indexed bams sitting on fast disk and you wish to query them all by position. Possible solutions include:
- Query all the bams in parallel and aggregate.
- Create an aggregate data structure out of all the bams (e.g. in HDF5) and query that instead.
Both have some appeal: the first needs no pre-processing time or extra storage; the second requires less compute at runtime and should be faster. Here we try to improve on the most naïve implementation of the first approach by building an aggregate index from the .bai files, so that instead of reading every index to locate the offsets we make a single query to the aggregate index.
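To make the aggregate-index idea concrete, here is a minimal sketch. It assumes the bin and chunk entries from each .bai have already been extracted into rows. The table layout, column names, and the `build_index` and `lookup` helpers are hypothetical illustrations and not necessarily how mumbai stores its data. Only `reg2bins` follows the binning scheme from the SAM/BAM specification.

```python
import sqlite3


def reg2bins(beg, end):
    """Bins that may hold records overlapping the 0-based half-open
    interval [beg, end), per the SAM/BAM specification binning scheme."""
    end -= 1
    bins = [0]
    for offset, shift in ((1, 26), (9, 23), (73, 20), (585, 17), (4681, 14)):
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins


def build_index(rows):
    """Hypothetical aggregate index: one row per (bam, ref, bin, chunk)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE chunks (bam TEXT, ref TEXT, bin INTEGER, "
        "chunk_beg INTEGER, chunk_end INTEGER)"
    )
    db.execute("CREATE INDEX by_ref_bin ON chunks (ref, bin)")
    db.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?, ?)", rows)
    return db


def lookup(db, ref, beg, end):
    """One query returns candidate chunk offsets for every bam at once."""
    bins = reg2bins(beg, end)
    marks = ",".join("?" * len(bins))
    sql = (
        "SELECT bam, chunk_beg, chunk_end FROM chunks "
        f"WHERE ref = ? AND bin IN ({marks}) ORDER BY bam, chunk_beg"
    )
    return db.execute(sql, [ref, *bins]).fetchall()


# Made-up rows standing in for entries parsed from several .bai files.
db = build_index([
    ("a.bam", "chr1", 4742, 100, 200),
    ("b.bam", "chr1", 592, 300, 400),
    ("c.bam", "chr1", 4743, 500, 600),  # neighbouring bin, not returned
    ("a.bam", "chr2", 4742, 1, 2),      # other reference, not returned
])
hits = lookup(db, "chr1", 1000000, 1000100)
```

The point of the design is that the number of index reads no longer grows with the number of bams: one indexed query replaces opening and parsing every .bai. For very wide regions the bin list grows, so a real implementation would probably query bin ranges rather than an `IN` list.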
- A database of bam index ('bai') files that can be queried by genomic coordinates and returns virtual file offsets to the indexed bams. This allows lookup on hundreds or thousands of bams without first reading all their indexes.
- A client that queries the database by genomic coordinates and returns records that overlap.
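For readers unfamiliar with virtual file offsets, the sketch below shows how one is laid out in the BGZF format used by bam files: the upper 48 bits locate a compressed block in the file and the lower 16 bits locate a position inside that block once it is decompressed. The function names are illustrative only and are not part of mumbai. Because an index lookup yields candidate chunks rather than exact matches, a client also needs an overlap test on each record it reads; the half-open interval convention used in `overlaps` is an assumption made for this example.

```python
def split_voffset(voffset):
    """Split a BGZF virtual offset into (compressed, uncompressed) parts."""
    return voffset >> 16, voffset & 0xFFFF


def make_voffset(coffset, uoffset):
    """Combine a block's file offset and an in-block offset."""
    return (coffset << 16) | uoffset


def overlaps(rec_beg, rec_end, beg, end):
    """True if half-open [rec_beg, rec_end) overlaps half-open [beg, end)."""
    return rec_beg < end and rec_end > beg


# Round trip: seek to byte 123456 in the file, then skip 789 bytes
# into the decompressed block.
voffset = make_voffset(123456, 789)
coffset, uoffset = split_voffset(voffset)
```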
git clone https://github.com/delocalizer/mumbai
cd mumbai
pip install -r requirements.txt --user .
(optional, requires tox)
tox
- get help
mumbai_db --help # for help on the db create and load tool
mumbai --help # for help on the db client
- create a bam index database for bams aligned to GRCh38
mumbai_db create newdb GRCh38
- load the database with bam indexes
mumbai_db load newdb /path/to/first.bam /path/to/second.bam ...
- query the bams by position and return overlapping records in SAM format
mumbai sam newdb chr1 1000000 1000100
- query the bams by position and count overlapping records
mumbai count newdb chr1 1000000 1000100
- query the bams by position and visualize the overlapping region
mumbai tview newdb chr1 1000000 1000100
- query the bams by position and visualize the overlapping region as a pileup
mumbai pileup newdb chr1 1000000 1000100