Issue #123: Add anib #338

Open
wants to merge 81 commits into
base: master
Changes from 78 commits
Commits
81 commits
b64231c
Initial split of v2 ANIb code into `anib.py` and `aniblastall.py` files
baileythegreen Jul 15, 2021
39bdde1
Updated comments/logging
baileythegreen Jul 21, 2021
7e9ebe2
Added `sysexit` when adding genomes to the database fails
baileythegreen Jul 21, 2021
70abeb9
Added initial versions of functions and comments about needed code
baileythegreen Jul 21, 2021
9dcade6
Rename `fraglens` to `fragsizes` for consistency with `anib.py`
baileythegreen Jul 21, 2021
c660e63
Add `add_blastdb()` to `pyani_orm.py`
baileythegreen Sep 13, 2021
bf4fa20
Change default for `maxmatch` from `None` to `False`
baileythegreen Sep 13, 2021
c1b4be1
Add/expand code to process input genomes
baileythegreen Sep 13, 2021
34f9ec3
Split genome files into contiguous fragments
baileythegreen Sep 13, 2021
5cdd452
Implement `generate_joblist()`
baileythegreen Sep 13, 2021
d78b479
Implement `run_anib_jobs()`
baileythegreen Sep 13, 2021
0d42e79
Implement `update_comparisons_results()` and commit to database
baileythegreen Sep 13, 2021
9e25141
Update call to `generate_joblist()`
baileythegreen Sep 13, 2021
1e8a47d
Update name of output file in `fragment_fasta_file()`
baileythegreen Sep 13, 2021
b07a1dd
Update value passed for `maxmatch` to a boolean
baileythegreen Sep 13, 2021
8f4b131
Implement remainder of `subcmd_anib()`
baileythegreen Sep 13, 2021
c76f54c
Add a commented question
baileythegreen Sep 13, 2021
c71ce5d
Alter `generate_blastn_commands()` to only take one query/subject pair
baileythegreen Sep 13, 2021
585b98f
Alter `construct_blastn_cmdline()` for a single query/subject pair
baileythegreen Sep 13, 2021
059ffdf
Remove `method` parameter from `process_blast()` call
baileythegreen Sep 13, 2021
aed81e1
Change `outfilename` in `construct_makeblastdb_cmd()`
baileythegreen Sep 13, 2021
0f0653f
Move `aniblastall`-specific tests to `test_aniblastall.py`
baileythegreen Sep 13, 2021
fc273df
Change call to `parse_blast_tab()` to not use `method` parameter
baileythegreen Sep 13, 2021
cc4d346
Skip tests as a result of changes to how command lines are generated
baileythegreen Sep 13, 2021
0e3d471
Add `get_version()` tests and boilerplate
baileythegreen Sep 13, 2021
f6f52ec
Merge branch 'master' of https://github.com/widdowquinn/pyani into an…
baileythegreen Sep 13, 2021
8f94ba0
Make changes to reflect splitting of `anib` and `aniblastall`
baileythegreen Sep 14, 2021
6de8b0f
Add tests for `get_version()`
baileythegreen Sep 21, 2021
7da06d9
Fixed typos and put one variable name in camelcase
baileythegreen Sep 22, 2021
fba3662
Fix which file is passed as input and which as blastdb
baileythegreen Sep 22, 2021
1fb3c08
Move serialisation of fragment dictionary to primary function
baileythegreen Sep 22, 2021
6c12a72
Added useful debugging lines to `anib.py`
baileythegreen Sep 22, 2021
419db64
Add long versions of CLI flags to `anib_parser.py`
baileythegreen Sep 22, 2021
fd41e4c
Comment out (likely) unnecessary function
baileythegreen Sep 22, 2021
104178a
Update docstrings to reflect new comparison handling
baileythegreen Sep 22, 2021
37477db
Add useful debugging lines
baileythegreen Sep 22, 2021
8cf205b
Populate `aniblastall_parser.py`
baileythegreen Sep 22, 2021
afd1be1
Comment out (likely) unnecessary function
baileythegreen Sep 22, 2021
7dc1113
Simplify generation of command lines and update supporting docstrings
baileythegreen Sep 22, 2021
ee90a68
Update command name referenced in docstrings
baileythegreen Sep 22, 2021
e8f6e41
Fix which files are passed as input and blastdb in command generation
baileythegreen Sep 22, 2021
c0c518a
Add useful debugging lines
baileythegreen Sep 22, 2021
dd1ee44
Add imports and convenience struct for `ComparisonJob`
baileythegreen Sep 22, 2021
8cfbfe8
Add initial code for starting a run and adding it to the database
baileythegreen Sep 22, 2021
2aa686b
Add genomes for the run to the database
baileythegreen Sep 22, 2021
10bd2a2
Get genomes for the run and create output directories
baileythegreen Sep 22, 2021
8df94c7
Add code to fragment fasta files
baileythegreen Sep 22, 2021
e42c089
Add code to create `blastdb`
baileythegreen Sep 22, 2021
05665ee
Add code to generate list of comparisons and filter existing ones
baileythegreen Sep 22, 2021
502becf
Add code for case where all comparisons have been run already
baileythegreen Sep 22, 2021
3589efc
Add code for recovery mode
baileythegreen Sep 22, 2021
c74bf3a
Rename `blastcmd` to `blastallcmd`
baileythegreen Sep 22, 2021
953be89
Add `generate_joblist()`
baileythegreen Sep 22, 2021
1971d25
Add code to generate `joblist`
baileythegreen Sep 22, 2021
641a843
Add code to run comparisons
baileythegreen Sep 22, 2021
d53c4e8
Add code to update database
baileythegreen Sep 22, 2021
aba213c
Remove or rename things specific to `anib` to `aniblastall`
baileythegreen Sep 22, 2021
3233fd5
Skip unnecessary tests
baileythegreen Sep 22, 2021
92932ba
Add `dir_aniblastall_in()` fixture
baileythegreen Sep 22, 2021
9631001
Add lines debugging
baileythegreen Sep 30, 2021
c7f60cc
Fix naming of elements in `test_subcmd_anib` namespace
baileythegreen Oct 5, 2021
c174133
Update naming of `TestANIbSubcommand()`
baileythegreen Oct 5, 2021
dc89bcd
Switch to use `NamedTuple`
baileythegreen Oct 5, 2021
13add41
Add `test_subcmd_10_aniblastall.py`
baileythegreen Oct 5, 2021
7566329
Make `test_subcmd_04_anim.py` match other method subcommand tests
baileythegreen Oct 5, 2021
3c688d4
Fix names of executable variables
baileythegreen Oct 5, 2021
bc25ccc
Pass both stdout and stderr to the regex version search
baileythegreen Oct 7, 2021
a313d2f
Set `indir` and `outdir` to required in `anib_parser.py`
baileythegreen Oct 7, 2021
3ccd1d9
Add `docs/subcmd_anib.rst`
baileythegreen Oct 7, 2021
f423dbe
Add documentation for `aniblastall`
baileythegreen Oct 7, 2021
c252188
Merge branch 'master' of https://github.com/widdowquinn/pyani into an…
baileythegreen Dec 12, 2021
8c40113
Replace f-strings in logging statements
baileythegreen Dec 12, 2021
0596e77
Update .gitignore
baileythegreen Apr 13, 2022
03a28fa
Merge branch 'master' of https://github.com/widdowquinn/pyani into an…
baileythegreen Apr 13, 2022
6b7b52f
Update call to `add_run()` to fit the function's new return value
baileythegreen Apr 13, 2022
23d951f
Remove f-strings from logging calls
baileythegreen Apr 13, 2022
8b43a50
Merge branch 'master' of https://github.com/widdowquinn/pyani into an…
baileythegreen May 11, 2022
5b0cd43
Merge branch 'master' of https://github.com/widdowquinn/pyani into an…
baileythegreen May 16, 2022
ffba902
update deprecated behaviours in pandas
widdowquinn Jun 7, 2022
bdefef0
remove unused import
widdowquinn Jun 7, 2022
dfee675
Merge branch 'master' of https://github.com/widdowquinn/pyani into an…
baileythegreen Jun 16, 2022
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,5 +1,6 @@
# Scratch directory for local testing
scratch/
makefile_template

# Mac-related dreck
.DS_Store
@@ -72,4 +73,4 @@ venv-*

# Extra documentation output
classes_pyani.pdf
packages_pyani.pdf
packages_pyani.pdf
121 changes: 121 additions & 0 deletions docs/run_aniblastall.rst
@@ -0,0 +1,121 @@
.. _pyani-run_aniblastall:

============================
Running ANIblastall analysis
============================

``pyani`` implements average nucleotide identity analysis using `NCBI-BLAST`_ (*ANIblastall*) as defined in Goris `et al.` (2007) (`doi:10.1099/ijs.0.64483-0`_). To run ANIblastall on a set of input genomes, use the ``pyani aniblastall`` subcommand.

In brief, the analysis proceeds as follows for a set of input prokaryotic genomes:

1. Each input genome is fragmented into consecutive sequences of a given size (default: 1020 bp).
2. A new ``BLAST`` database is built from each input genome sequence.
3. `NCBI-BLAST`_ is used to perform pairwise comparisons of each input genome fragment set against the databases for each other input genome, to identify homologous (alignable) regions.
4. For each comparison, the alignment output is parsed, and the following values are calculated:

- total number of aligned bases on each genome
- fraction of each genome that is aligned (the *coverage*)
- the proportion of all aligned regions that is identical in each genome (the *ANI*)
- the number of unaligned or non-identical bases (the *similarity errors*)
- the product of *coverage* and *ANI*

The output values are recorded in the ``pyani`` database.
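
The fragmentation in step 1 and the summary values in step 4 can be sketched in a few lines of Python. This is a minimal illustration rather than ``pyani``'s own implementation: ``fragment_sequence``, ``Hit`` and ``summarise_hits`` are hypothetical names, each hit is assumed to already carry an aligned length and a count of identical positions, and any filtering of weak hits is left out.

```python
from typing import Dict, List, NamedTuple


def fragment_sequence(seq: str, fragsize: int = 1020) -> List[str]:
    """Split a sequence into consecutive, non-overlapping fragments.

    The final fragment may be shorter than fragsize.
    """
    if fragsize <= 0:
        raise ValueError("fragsize must be a positive integer")
    return [seq[start:start + fragsize] for start in range(0, len(seq), fragsize)]


class Hit(NamedTuple):
    """One fragment-vs-database alignment (hypothetical record layout)."""

    aligned_length: int  # number of bases covered by the alignment
    identical: int  # number of identical positions within the alignment


def summarise_hits(hits: List[Hit], genome_length: int) -> Dict[str, float]:
    """Derive the per-comparison values listed above from a set of hits."""
    if genome_length <= 0:
        raise ValueError("genome_length must be a positive integer")
    aligned = sum(hit.aligned_length for hit in hits)
    identical = sum(hit.identical for hit in hits)
    coverage = aligned / genome_length
    identity = identical / aligned if aligned else 0.0
    return {
        "aligned_bases": aligned,
        "coverage": coverage,
        "identity": identity,
        # Assumption: errors are aligned positions that are not identical
        "similarity_errors": aligned - identical,
        "coverage_x_identity": coverage * identity,
    }


fragments = fragment_sequence("A" * 2500)
summary = summarise_hits([Hit(1000, 950), Hit(1000, 1000)], genome_length=4000)
```

With the default fragment size, a 2500 bp sequence yields two full fragments and one shorter remainder. How the real analysis treats short trailing fragments is not specified here.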

.. NOTE::
The `NCBI-BLAST`_ comparisons are asymmetric, and performed in both directions for a pair of genomes (i.e. "fragmented A vs complete B" and "fragmented B vs complete A").
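
Because each comparison is directional, a run over *N* genomes involves *N* x (*N* - 1) comparisons rather than half that number. The sketch below enumerates them with the standard library; ``ordered_pairs`` is a hypothetical helper name, not part of ``pyani``.

```python
from itertools import permutations
from typing import List, Tuple


def ordered_pairs(genomes: List[str]) -> List[Tuple[str, str]]:
    """Return every (query, subject) pair, covering both directions."""
    return list(permutations(genomes, 2))


pairs = ordered_pairs(["A", "B", "C"])
# Each tuple reads "fragments of the first genome vs database of the second"
for query, subject in pairs:
    print(f"fragmented {query} vs complete {subject}")
```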

.. TIP::
The `NCBI-BLAST`_ comparisons are embarrassingly parallel, and can be distributed across cores on an `Open Grid Scheduler`_-compatible cluster, using the ``--scheduler SGE`` option.

.. ATTENTION::
``pyani aniblastall`` requires that a working copy of `NCBI-BLAST`_ is available. Please see :ref:`pyani-installation` for information about installing this package.

For more information about the ``pyani aniblastall`` subcommand, please see the :ref:`pyani-subcmd-aniblastall` page, or issue the command ``pyani aniblastall -h`` to see the inline help.

----------------------------
Perform ANIblastall analysis
----------------------------

The basic form of the command is:

.. code-block:: bash

pyani aniblastall -i <INPUT_DIRECTORY> -o <OUTPUT_DIRECTORY>

This instructs ``pyani`` to perform ANIblastall on the genome FASTA files in ``<INPUT_DIRECTORY>``, and write any output files to ``<OUTPUT_DIRECTORY>``. For example, the following command performs ANIblastall on genomes in the directory ``genomes`` and writes output to a new directory ``genomes_ANIblastall``:

.. code-block:: bash

pyani aniblastall -i genomes -o genomes_ANIblastall

.. NOTE::
While running, ``pyani aniblastall`` will show progress bars unless these are disabled with the option ``--disable_tqdm``.

This command will write the intermediate `NCBI-BLAST`_ output to the directory ``genomes_ANIblastall``, where the results can be inspected if required.

..
I am unsure if this is relevant for aniblastall
.. code-block:: bash

$ ls genomes_ANIblastall/
nucmer_output

.. ATTENTION::
To view the output ANIblastall results, you will need to use the ``pyani report`` or ``pyani plot`` subcommands. Please see :ref:`pyani-subcmd-report` and :ref:`pyani-subcmd-plot` for more details.

-----------------------------------------------------
Perform ANIblastall analysis with Open Grid Scheduler
-----------------------------------------------------

The `NCBI-BLAST`_ comparisons are embarrassingly parallel, and these jobs can be distributed across cores in a cluster using the `Open Grid Scheduler`_. To enable this during the analysis, use the ``--scheduler SGE`` option:

.. code-block:: bash

pyani aniblastall --scheduler SGE -i genomes -o genomes_ANIblastall

.. NOTE::
Jobs are submitted as *array jobs* to keep the scheduler queue short.

.. NOTE::
If ``--scheduler SGE`` is not specified, all `NCBI-BLAST`_ jobs are run locally with ``Python``'s ``multiprocessing`` module.
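
As a rough picture of local execution, the sketch below runs a list of independent command lines through a worker pool from Python's ``multiprocessing`` package. It is an illustration under stated assumptions, not ``pyani``'s scheduler code: ``run_locally`` is a hypothetical name, ``multiprocessing.pool.ThreadPool`` is chosen here because each worker only waits on an external process, and the stand-in commands simply start and exit a Python interpreter.

```python
import os
import subprocess
import sys
from multiprocessing.pool import ThreadPool
from typing import List, Sequence


def run_locally(commands: Sequence[List[str]], workers: int = 0) -> List[int]:
    """Run each command as a subprocess and return exit codes in input order.

    workers=0 is taken to mean "use all available cores".
    """
    pool_size = workers if workers > 0 else (os.cpu_count() or 1)
    with ThreadPool(pool_size) as pool:
        return pool.map(
            lambda cmd: subprocess.run(cmd, check=False).returncode, commands
        )


# Stand-in commands; a real run would pass BLAST command lines instead
demo_commands = [[sys.executable, "-c", "pass"] for _ in range(4)]
exit_codes = run_locally(demo_commands, workers=2)
```

A non-zero entry in ``exit_codes`` identifies which comparison failed, since results come back in the same order as the commands.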

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Controlling parameters of Open Grid Scheduler
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to control the following features of `Open Grid Scheduler`_ `via` the ``pyani aniblastall`` subcommand:

- The array job size (by default, comparison jobs are batched in arrays of 10,000)
- The prefix string for the job, as reported in the scheduler queue
- Arguments to the ``qsub`` job submission command

These allow for useful control of job execution. For example, the command:

.. code-block:: bash

pyani aniblastall --scheduler SGE --SGEgroupsize 5000 -i genomes -o genomes_ANIblastall

will batch ``ANIblastall`` jobs in groups of 5000 for the scheduler. The command:

.. code-block:: bash

pyani aniblastall --scheduler SGE --jobprefix My_Ace_Job -i genomes -o genomes_ANIblastall

will prepend the string ``My_Ace_Job`` to your job in the scheduler queue. And the command:

.. code-block:: bash

pyani aniblastall --scheduler SGE --SGEargs "-m e -M [email protected]" --SGEgroupsize 5000 -i genomes -o genomes_ANIblastall

will email ``[email protected]`` when the jobs finish.
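
The effect of ``--SGEgroupsize`` can be pictured as slicing the full list of comparison jobs into fixed-size groups, one array job per group. The following is a minimal sketch of that batching, assuming jobs are held in a simple list; ``batch_jobs`` is a hypothetical name and the job labels are placeholders.

```python
from typing import List, Sequence


def batch_jobs(jobs: Sequence[str], groupsize: int = 10000) -> List[List[str]]:
    """Split jobs into consecutive groups of at most groupsize items."""
    if groupsize <= 0:
        raise ValueError("groupsize must be a positive integer")
    return [list(jobs[i:i + groupsize]) for i in range(0, len(jobs), groupsize)]


jobs = [f"job_{i}" for i in range(12000)]
batches = batch_jobs(jobs, groupsize=5000)
print([len(batch) for batch in batches])  # prints [5000, 5000, 2000]
```

Smaller groups mean more, shorter array jobs in the queue; larger groups mean fewer entries but longer-running arrays.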


----------
References
----------

- Goris `et al.` (2007) `Int J Syst Evol Micr` *57*: 81-91. `doi:10.1099/ijs.0.64483-0`_.

.. _doi:10.1099/ijs.0.64483-0: https://dx.doi.org/10.1099/ijs.0.64483-0
.. _NCBI-BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
.. _Open Grid Scheduler: http://gridscheduler.sourceforge.net/
80 changes: 80 additions & 0 deletions docs/subcmd_aniblastall.rst
@@ -0,0 +1,80 @@
.. _pyani-subcmd-aniblastall:

=====================
``pyani aniblastall``
=====================

The ``aniblastall`` subcommand will carry out ANIblastall analysis using genome files contained in the ``indir`` directory, writing result files to the ``outdir`` directory, and recording data about each comparison and run in a local `SQLite3`_ database.

.. code-block:: text

usage: pyani.py aniblastall [-h] [-l LOGFILE] [-v] [--debug] [--disable_tqdm] [--version]
[--citation] [--scheduler {multiprocessing,SGE}]
[--workers WORKERS] [--SGEgroupsize SGEGROUPSIZE]
[--SGEargs SGEARGS] [--jobprefix JOBPREFIX] [--name NAME]
[--classes CLASSES] [--labels LABELS] [--recovery] -i INDIR -o
OUTDIR [--dbpath DBPATH] [--blastall_exe BLASTALL_EXE]
[--format_exe FORMAT_EXE] [--fragsize FRAGSIZE]

.. _SQLite3: https://www.sqlite.org/index.html

-----------------
Flagged arguments
-----------------

``--blastall_exe BLASTALL_EXE``
Path to the ``blastall`` executable. Default: ``blastall``

``--classes CLASSFNAME``
Use the set of classes (one per genome sequence file) found in the file ``CLASSFNAME`` in ``INDIR``. Default: ``classes.txt``

``--dbpath DBPATH``
Path to the location of the local ``pyani`` database to be used. Default: ``.pyani/pyanidb``

``--disable_tqdm``
Disable the ``tqdm`` progress bar while the ANIblastall process runs. This is useful when testing to avoid aesthetic problems with test output.

``--format_exe FORMAT_EXE``
Path to the ``formatdb`` executable. Default: ``formatdb``

``--fragsize FRAGSIZE``
Fragment size to use in analysis. Default: 1020

``-h, --help``
Display usage information for ``pyani aniblastall``.

``-i INDIR, --input INDIR``
Path to the directory containing indexed genome files to be used for the analysis.

``--jobprefix JOBPREFIX``
Use the string ``JOBPREFIX`` as a prefix for SGE job submission names. Default: ``PYANI``

``--labels LABELFNAME``
Use the set of labels (one per genome sequence file) found in the file ``LABELFNAME`` in ``INDIR``. Default: ``labels.txt``

``-l LOGFILE, --logfile LOGFILE``
Provide the location ``LOGFILE`` to which a logfile of the ANIblastall process will be written.

``--name NAME``
Use the string ``NAME`` to identify this ``ANIblastall`` run in the ``pyani`` database.

``-o OUTDIR, --outdir OUTDIR``
Path to a directory where comparison output files will be written.

``--recovery``
Use existing ``ANIblastall`` comparison output if available, e.g. if recovering from a failed job submission. Using this option will not generate a new comparison if the old output files exist.

``--scheduler {multiprocessing, SGE}``
Specify the job scheduler to be used when parallelising genome comparisons: one of ``multiprocessing`` (use many cores on the current machine) or ``SGE`` (use an SGE or OGE job scheduler). Default: ``multiprocessing``.

``--SGEargs SGEARGS``
Pass additional arguments ``SGEARGS`` to ``qsub`` when running the SGE-distributed jobs.

``--SGEgroupsize SGEGROUPSIZE``
Create SGE arrays containing SGEGROUPSIZE comparison jobs. Default: 10000

``-v, --verbose``
Provide verbose output to ``STDOUT``.

``--workers WORKERS``
Spawn WORKERS worker processes with the ``--scheduler multiprocessing`` option. Default: 0 (use all cores)
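
To show how these options and defaults fit together, here is a small ``argparse`` sketch covering a subset of the flags documented above. It is a stand-alone illustration, not the contents of ``aniblastall_parser.py``: the flag names and defaults are copied from the list above, while ``build_parser``, the ``prog`` name and the ``dest`` names are assumptions made for the example.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring a subset of the documented options."""
    parser = argparse.ArgumentParser(prog="aniblastall-sketch")
    # Input and output directories are both required
    parser.add_argument("-i", "--input", dest="indir", type=Path, required=True)
    parser.add_argument("-o", "--outdir", dest="outdir", type=Path, required=True)
    parser.add_argument("--fragsize", type=int, default=1020)
    parser.add_argument(
        "--scheduler", choices=["multiprocessing", "SGE"], default="multiprocessing"
    )
    # 0 is documented as "use all cores"
    parser.add_argument("--workers", type=int, default=0)
    parser.add_argument("--SGEgroupsize", type=int, default=10000)
    parser.add_argument("--jobprefix", default="PYANI")
    parser.add_argument("--recovery", action="store_true")
    return parser


args = build_parser().parse_args(["-i", "genomes", "-o", "genomes_ANIblastall"])
print(args.fragsize, args.scheduler, args.SGEgroupsize)
```

Only ``-i`` and ``-o`` need to be supplied; every other value falls back to the default documented for that flag.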
1 change: 1 addition & 0 deletions docs/subcommands.rst
@@ -22,6 +22,7 @@ This document links out to detailed instructions for each of the ``pyani`` subco
subcmd_createdb
subcmd_anim
subcmd_anib
subcmd_aniblastall
subcmd_report
subcmd_plot
subcmd_classify