More adjustments to the docs
HenrikBengtsson committed Apr 23, 2024
1 parent 1382333 commit 4a9292c
Showing 2 changed files with 29 additions and 26 deletions.
10 changes: 5 additions & 5 deletions src/seguid/__init__.py
@@ -24,8 +24,7 @@
is also provided.
This package works without external dependencies, but ``csseguid()`` and ``cdseguid()``
can be made faster by installing `pydivsufsort <https://pypi.org/project/pydivsufsort/>`_
since pydivsufsort provides a faster algorithm for finding the
can be made faster by installing `pydivsufsort <https://pypi.org/project/pydivsufsort/>`_, which provides a faster implementation of the algorithm for finding the
`smallest string rotation <https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation>`_.
@@ -52,6 +51,9 @@
The sequence for which the checksum should be calculated. The sequence may only comprise symbols in the
alphabet specified by the alphabet argument.
**watson, crick** (strings)
Two reverse-complementary DNA sequences. Both sequences should be specified in the 5'-to-3' direction.
**alphabet** (string)
The type of sequence used. If ``"{DNA}"`` (default), then the input is a DNA sequence. If ``"{RNA}"``, then the
input is an RNA sequence. If ``"{protein}"``, then the input is an amino-acid sequence. If ``"{DNA-extended}"``
@@ -66,8 +68,6 @@
How the checksum is presented. If ``"long"`` (default), the full-length checksum is returned. If ``"short"``,
the short, six-digit checksum is returned. If ``"both"``, both the short and the long checksums are returned.
**watson, crick** (strings)
Two reverse-complementary DNA sequences. Both sequences should be specified in the 5'-to-3' direction.
Value
-----
@@ -76,7 +76,7 @@
is ``"both"``, then a tuple of two strings is returned, where the first component holds the ``"short"``
checksum and the second the ``"long"`` checksum. The long checksum, without the prefix, is a string with 27
characters. The short checksum, without the prefix, is the first six characters of the long checksum.
All checksums are prefixed with a label indicating which SEGUID method was used. Except for ``seguid()``,
All long checksums are prefixed with a label indicating which SEGUID method was used. Except for ``seguid()``,
which uses Base64 encoding, all functions produce checksums using the Base64url encoding
("Base 64 Encoding with URL and Filename Safe Alphabet").
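As a minimal sketch of the encoding difference described in the Value section: the code below assumes that a SEGUID checksum is the SHA-1 digest of the ASCII sequence, Base64-encoded with the trailing padding removed. That assumption is consistent with the 27-character length stated above, but it is not spelled out in this diff. The helper name `sha1_checksum` is hypothetical and not part of the seguid package.

```python
import base64
import hashlib


def sha1_checksum(seq: str, urlsafe: bool = False) -> str:
    """Hypothetical helper: SHA-1 digest of `seq`, Base64-encoded, padding stripped."""
    digest = hashlib.sha1(seq.encode("ascii")).digest()  # always 20 bytes
    # Assumption: Base64 for seguid(), Base64url for the other functions.
    encode = base64.urlsafe_b64encode if urlsafe else base64.b64encode
    return encode(digest).decode("ascii").rstrip("=")


standard = sha1_checksum("AT")                # Base64: may contain "+" and "/"
url_safe = sha1_checksum("AT", urlsafe=True)  # Base64url: "-" and "_" instead
short = url_safe[:6]                          # short form: first six characters
```

The two encodings differ only in the two substituted symbols, so the Base64url form can be derived from the Base64 form by replacing ``+`` with ``-`` and ``/`` with ``_``.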
45 changes: 24 additions & 21 deletions src/seguid/_chksum.py
@@ -74,7 +74,7 @@ def seguid(seq: str, alphabet: str = "{DNA}", form: str = "long") -> str:
The original definition of the SEGUID v1 checksum algorithm (Babnigg & Giometti, 2006)
included transformation to uppercase before calculating the checksum.
Here, ``seguid()`` does *not* coerce the input sequence to upper case. If your input sequence
has lower-case symbols, you can use :meth:`str.upper` to emulate what the original method does.
has lower-case symbols, you can use :meth:`str.upper` to achieve what the original method does.
``seguid()`` only accepts symbols as specified by the `alphabet` argument.
Thus, our implementation is more conservative, which has the benefit of
lowering the risk of passing the incorrect sequence by mistake.
@@ -103,20 +103,22 @@ def seguid(seq: str, alphabet: str = "{DNA}", form: str = "long") -> str:
def lsseguid(seq: str, alphabet: str = "{DNA}", form: str = "long") -> str:
"""SEGUID checksum for linear single-stranded DNA.
Identical to the :func:`seguid()` function except for that forward slashes (``/``) and plus signs (``+``)
The same as the :func:`seguid()` function except that forward slashes (``/``) and plus signs (``+``)
in the resulting checksum are replaced by underscores (``_``) and minus signs (``-``), respectively
following the Base64url standard in RFC 4648 section 5.
following the Base64url standard in RFC 4648.
This checksum is applicable to linear single-stranded DNA sequences or
protein sequences. If protein sequences are analyzed, the alphabet
argument should be ``"{protein}"`` or ``"{protein-extended}"``.
This checksum is applicable to linear single-stranded DNA
sequences and protein sequences, among other sequences. If
protein sequences are analyzed, the alphabet argument should be
``"{protein}"`` or ``"{protein-extended}"``.
The checksum is prefixed with ``lsseguid=``.
Examples
--------
>>> lsseguid("AT")
'lsseguid=Ax_RG6hzSrMEEWoCO1IWMGska-4'
"""
return _form(
lsseguid_prefix,
@@ -128,10 +130,10 @@ def lsseguid(seq: str, alphabet: str = "{DNA}", form: str = "long") -> str:
def csseguid(seq: str, alphabet: str = "{DNA}", form: str = "long") -> str:
r"""SEGUID checksum for circular single-stranded DNA.
The ``csseguid()`` is the :func:`lsseguid()` checksum calculated for the lexicographically
smallest string rotation of ``seq``.
Only defined for circular single-stranded sequences.
The ``csseguid()`` is the :func:`lsseguid()` checksum calculated
for the lexicographically smallest string rotation of ``seq``.
This checksum is only defined for circular single-stranded
sequences.
The checksum is prefixed with ``csseguid=``.
@@ -145,6 +147,7 @@ def csseguid(seq: str, alphabet: str = "{DNA}", form: str = "long") -> str:
'csseguid=ot6JPLeAeMmfztW1736Kc6DAqlo'
>>> lsseguid("TTTA")
'lsseguid=8zCvKwyQAEsbPtC4yTV-pY0H93Q'
"""
return _form(
csseguid_prefix,
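The "lexicographically smallest string rotation" used by ``csseguid()`` can be sketched as follows. This is a naive quadratic illustration, not the package's implementation; the function name `smallest_rotation` is hypothetical, and plain Python string ordering is assumed.

```python
def smallest_rotation(seq: str) -> str:
    """Return the lexicographically smallest rotation of `seq` (naive O(n^2) sketch)."""
    if not seq:
        return seq
    # Enumerate every rotation and keep the smallest under Python's string ordering.
    return min(seq[i:] + seq[:i] for i in range(len(seq)))


# All rotations of the same circular sequence map to one canonical string.
rotations = ["ATTT", "TATT", "TTAT", "TTTA"]
canonical = {smallest_rotation(s) for s in rotations}
```

Because every rotation collapses to the same canonical string, a checksum computed on that string does not depend on where the circular sequence was opened.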
@@ -160,16 +163,17 @@ def ldseguid(
) -> str:
r"""SEGUID checksum for linear double-stranded DNA.
Calculates the ``ldseguid()`` checksum for a double-stranded DNA (dsDNA) sequence defined by two
This function calculates the SEGUID checksum for a double-stranded DNA (dsDNA) sequence defined by two
strings representing the upper (Watson) and the complementary (Crick) DNA strands. Watson and Crick
strands are always equal in length.
strands should be of equal length.
Optional single-stranded DNA regions in the ends are indicated by a dash (``-``) in either strand.
The algorithm first selects the lexicographically smallest of the Watson and Crick strands.
The algorithm first selects the lexicographically smallest of the
Watson and Crick strands. The two strings are then joined 5'-3',
separated by a semicolon (``;``), and the :func:`lsseguid()`
function is used on the resulting string.
The two strings are joined 5'-3', separated by a semicolon (``;``) and the :func:`lsseguid()` function
is used on the resulting string.
For example, consider the linear dsDNA sequence defined by ``watson="-TATGCC"`` and ``crick="-GCATAC"`` as in:
::
@@ -183,15 +187,13 @@ def ldseguid(
|||||
CCGTAT-
For the linear dsDNA sequence defined by ``watson="-TATGCC"``, ``crick="-GCATAC"``
(see figures above), The ``"-GCATAC"`` strand is selected since lexicographically,
``"-GCATAC"`` < ``"-TATGCC"``.
A string is constructed like so:
The SEGUID algorithm identifies the ``"-GCATAC"`` strand as the lexicographically smallest of the two. Then it concatenates the two as:
::
"-GCATAC" + ";" + "-TATGCC"
and calculates the final checksum based on that sequence.
The checksum is prefixed with ``ldseguid=``.
Examples
@@ -200,6 +202,7 @@ def ldseguid(
'ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I'
>>> ldseguid("-GCATAC", "-TATGCC")
'ldseguid=rr65d6AYuP-CdMaVmdw3L9FPt6I'
"""
assert watson, "Watson sequence must not be empty"
assert crick, "Crick sequence must not be empty"
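The strand-ordering step described in the ``ldseguid()`` docstring can be sketched as below. This is an illustration only: the helper name `ldseguid_payload` is hypothetical, and plain Python string comparison is assumed to be the lexicographic ordering. The resulting string is what would then be passed to the single-stranded checksum.

```python
def ldseguid_payload(watson: str, crick: str) -> str:
    """Hypothetical helper: build the string that is checksummed for linear dsDNA."""
    if not watson or not crick:
        raise ValueError("Both strands must be non-empty")
    if len(watson) != len(crick):
        raise ValueError("Watson and Crick strands should be of equal length")
    # The lexicographically smaller strand goes first, so argument order is irrelevant.
    first, second = sorted((watson, crick))
    return first + ";" + second


payload = ldseguid_payload("-TATGCC", "-GCATAC")
swapped = ldseguid_payload("-GCATAC", "-TATGCC")
```

Sorting the two strands before joining them is what makes the two docstring examples, with the arguments swapped, produce the same checksum.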