EMBL-my-Genbank: Convert a complicated Genbank file to a barebones EMBL file

Usage

This repo contains a minimal-dependency, Ruff-formatted, pure Python module that can be run directly as a script, e.g.,

python3 src/embl_my_genbank/embl_my_genbank.py -g file.gb -s "Homo sapiens"

You can also reproduce the project environment with uv and the repo's pyproject.toml. To do so, make sure you have uv installed, and then set up the environment with the commands uv venv, source .venv/bin/activate, and uv sync. From there, the tool will be available as the command emb_my_gbk, like so:

emb_my_gbk -g file.gb -s "Homo sapiens"

The simplest setup option of all, again using uv, is uv run:

uv run src/embl_my_genbank/embl_my_genbank.py -g file.gb -s "Homo sapiens"

That said, feel free to use whichever Python environment manager you're used to; the only non-standard-library dependencies are BioPython, Polars, and Loguru.

For documentation of the API, see the HTML in docs/. Recommended usage looks like this:

usage: emb_my_gbk [-h] --gb_path GB_PATH --species SPECIES [--out_fmt OUT_FMT] [--view_intermediate VIEW_INTERMEDIATE]

options:
  -h, --help            show this help message and exit
  --gb_path GB_PATH, -g GB_PATH
                        Genbank file to be converted.
  --metadata METADATA, -m METADATA
                        Metadata file with, at minimum, a column of allele names and a column of representative animals.
  --species SPECIES, -s SPECIES
                        Scientific name for the species under examination.
  --out_fmt OUT_FMT, -o OUT_FMT
                        Format to convert to. Can convert to EMBL, IPD_EMBL, and FASTA.
  --view_intermediate, -v
                        Whether to write out the intermediate cleaned Genbank file for inspection.
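For readers who want to see how the options above fit together, here is a minimal sketch of an argparse parser that would accept the same flags. It is a reconstruction from the help text only, not the module's actual source: the function name build_parser, the default output format of "EMBL", the optional status of --metadata, and the treatment of --view_intermediate as a boolean switch are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the help text; not the project's real code."""
    parser = argparse.ArgumentParser(prog="emb_my_gbk")
    parser.add_argument(
        "--gb_path", "-g", required=True,
        help="Genbank file to be converted.",
    )
    parser.add_argument(
        "--metadata", "-m", default=None,  # assumed optional
        help="Metadata file with allele names and representative animals.",
    )
    parser.add_argument(
        "--species", "-s", required=True,
        help="Scientific name for the species under examination.",
    )
    parser.add_argument(
        "--out_fmt", "-o", default="EMBL",  # the default is an assumption
        choices=["EMBL", "IPD_EMBL", "FASTA"],
        help="Format to convert to.",
    )
    parser.add_argument(
        "--view_intermediate", "-v", action="store_true",  # assumed to be a switch
        help="Write out the intermediate cleaned Genbank file for inspection.",
    )
    return parser


# Parse the same arguments used in the command-line examples above.
args = build_parser().parse_args(["-g", "file.gb", "-s", "Homo sapiens"])
print(args.gb_path, args.species, args.out_fmt)
```

Passing an argument list to parse_args, as done here, makes it possible to exercise the parser without touching the real command line.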

Purpose

With some regularity, the Genomic Services Unit at the Wisconsin National Primate Research Center runs an allele discovery pipeline on primate MHC amplicons, the goal of which is to submit previously undocumented MHC alleles to the Immuno-Polymorphism Database (IPD). One of the primary pain points in this process is converting between Genbank format, which we use internally to annotate and review allele candidates alongside exon annotations, and the equally complicated EMBL format, which IPD requires. Additionally, IPD imposes non-standard EMBL format requirements, which we have to follow for as many as hundreds of new allele candidates at a time. This Python module implements many cleaning and conversion steps to take our internal review Genbank files and convert them into IPD-compatible EMBL files, a surprisingly tricky process!
