Skip to content

Commit

Permalink
Simpler cut representation (#187)
Browse files Browse the repository at this point in the history
* implemented and tests pass

* added cutsite_is_valid to check that the cutting site and recognition site are double stranded

* finished notebook new cut implementation

* extra function documentation and tests

* fix small error
  • Loading branch information
manulera authored and BjornFJohansson committed Feb 5, 2024
1 parent d8695d7 commit fa0d2d4
Show file tree
Hide file tree
Showing 6 changed files with 610 additions and 83 deletions.
271 changes: 248 additions & 23 deletions scripts/cutsite_pairs.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -4,46 +4,271 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# New cut implementation\n"
"# New cut implementation\n",
"\n",
"The most important thing is that cuts are now represented as `((cut_watson, ovhg), enz)`:\n",
"\n",
"- `cut_watson` is a positive integer contained in `[0,len(seq))`, where `seq` is the sequence that will be cut. It represents the position of the cut on the watson strand, using the full sequence as a reference. By \"full sequence\" I mean the one you would get from `str(Dseq)`. See example below.\n",
"- `ovhg` is the overhang left after the cut. It has the same meaning as `ovhg` in the `Bio.Restriction` enzyme objects, or pydna's `Dseq` property.\n",
"- `enz` is the enzyme object. It's not necessary to perform the cut, but can be used to keep track of which enzyme was used.\n",
"\n",
"The new implementation of `Dseq.cut` now looks like this:\n",
"\n",
"```python\n",
"cutsites = self.get_cutsites(*enzymes)\n",
"cutsite_pairs = self.get_cutsite_pairs(cutsites)\n",
"return tuple(self.apply_cut(*cs) for cs in cutsite_pairs)\n",
"```\n",
"\n",
"Let's go through it step by step"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"get_cutsites output: [((3, -4), EcoRI), ((11, -4), EcoRI)]\n",
"EcoRI.search output: [4, 12] < (positions are 1-based)\n",
"EcoRI.ovhg: -4\n",
"\n",
"get_cutsites output: [((6, -4), EcoRI)]\n",
"EcoRI.search output: [7] < (positions are 1-based)\n",
"EcoRI.ovhg: -4\n",
"\n",
"get_cutsites output: [((1, 2), PacI)]\n",
"PacI.search output: [2] < (positions are 1-based)\n",
"PacI.ovhg: 2\n",
"\n"
]
}
],
"source": [
"# Coming soon\n",
"# seq = Dseq('aaGAATTCaa', circular=True)\n",
"from pydna.dseq import Dseq\n",
"from Bio.Restriction import EcoRI, PacI\n",
"\n",
"# print('EcORI', EcoRI.ovhg, len(seq))\n",
"# for shift in range(len(seq)):\n",
"# seq_shifted = seq.shifted(shift)\n",
"# cut_site = seq_shifted.get_cutsites(EcoRI)[0][0]\n",
"# print(shift, seq_shifted, cut_site, cut_site[0] - cut_site[1])\n",
"dseq = Dseq('aaGAATTCaaGAATTCaa')\n",
"\n",
"# seq = Dseq('ccTTAATTAAcc', circular=True)\n",
"# print('PacI', PacI.ovhg, len(seq))\n",
"# for shift in range(len(seq)):\n",
"# seq_shifted = seq.shifted(shift)\n",
"# cut_site = seq_shifted.get_cutsites(PacI)[0][0]\n",
"# print(shift, seq_shifted, cut_site, cut_site[0] - cut_site[1])\n",
"# what this function does is basically handle the format of the enzymes, and return the cut positions\n",
"# that are returned by enz.search, along with the enzyme name and overhang. Positions are made zero-base\n",
"# instead of one-based\n",
"\n",
"print('get_cutsites output:', dseq.get_cutsites([EcoRI]))\n",
"print('EcoRI.search output:', EcoRI.search(dseq), '< (positions are 1-based)')\n",
"print('EcoRI.ovhg:', EcoRI.ovhg)\n",
"print()\n",
"\n",
"# seq = Dseq('TTAAccccTTAA', circular=True)\n",
"# custom_cut = ((1, 11), type('DynamicClass', (), {'ovhg': 2})())\n",
"# print(seq.apply_cut(custom_cut, custom_cut).__repr__())\n",
"# Below are two examples of circular sequences with a cutsite that spans the origin.\n",
"dseq = Dseq('TTCaaGAA', circular=True)\n",
"print('get_cutsites output:', dseq.get_cutsites([EcoRI]))\n",
"print('EcoRI.search output:', EcoRI.search(dseq, linear=False), '< (positions are 1-based)')\n",
"print('EcoRI.ovhg:', EcoRI.ovhg)\n",
"print()\n",
"\n",
"# print()\n",
"dseq = Dseq('TTAAaaTTAA', circular=True)\n",
"print('get_cutsites output:', dseq.get_cutsites([PacI]))\n",
"print('PacI.search output:', PacI.search(dseq, linear=False), '< (positions are 1-based)')\n",
"print('PacI.ovhg:', PacI.ovhg)\n",
"print()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note in the above printed output how if the ovhg is negative, for an origin spanning cutsite, the position lies on the left side of the origin, and viceversa.\n",
"\n",
"# custom_cut = ((1, 11), type('DynamicClass', (), {'ovhg': -10})())\n",
"# print(seq.apply_cut(custom_cut, custom_cut).__repr__())"
"Below, you can see that the `cut_watson` is defined with respect to the \"full sequence\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dseq(-10)\n",
"aaGAATTCaa\n",
" tCTTAAGtt\n",
"ovhg: -1 >> [((3, -4), EcoRI)]\n",
"\n",
"Dseq(-10)\n",
"aaGAATTCaa\n",
"ttCTTAAGtt\n",
"ovhg: 0 >> [((3, -4), EcoRI)]\n",
"\n",
"Dseq(-10)\n",
" aGAATTCaa\n",
"ttCTTAAGtt\n",
"ovhg: 1 >> [((3, -4), EcoRI)]\n",
"\n"
]
}
],
"source": [
"# `cut_watson` is defined with respect to the \"full sequence\"\n",
"for ovhg in [-1, 0, 1]:\n",
" dseq = Dseq.from_full_sequence_and_overhangs('aaGAATTCaa', ovhg, 0)\n",
" print(dseq.__repr__())\n",
" print('ovhg:', ovhg, '>>', dseq.get_cutsites([EcoRI]))\n",
" print()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cuts are only returned if the recognition site and overhang are on the double-strand part of the sequence."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[((1, -4), EcoRI)]\n",
"[]\n"
]
}
],
"source": [
"\n",
"seq = Dseq('GAATTC')\n",
"print(seq.get_cutsites([EcoRI]))\n",
"\n",
"seq = Dseq.from_full_sequence_and_overhangs('GAATTC', -1, 0)\n",
"print(seq.get_cutsites([EcoRI]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Pairing cutsites\n",
"\n",
"A fragment produced by restriction is represented by a tuple of length 2 that may contain cutsites or `None`:\n",
"\n",
"- Two cutsites: represents the extraction of a fragment between those two cutsites, in that orientation. To represent the opening of a circular molecule with a single cutsite, we put the same cutsite twice. See below.\n",
"- `None`, cutsite: represents the extraction of a fragment between the left edge of linear sequence and the cutsite.\n",
"- cutsite, `None`: represents the extraction of a fragment between the cutsite and the right edge of a linear sequence.\n",
"\n",
"## Generating the sequence\n",
"\n",
"To get the fragment, we use the function `dseq.apply_cut`, passing the two elements of the tuple as arguments."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> None, cutsite : (None, ((3, -4), EcoRI))\n",
"Dseq(-7)\n",
"aaG\n",
"ttCTTAA\n",
"\n",
"> cutsite, cutsite : (((3, -4), EcoRI), ((11, -4), EcoRI))\n",
"Dseq(-12)\n",
"AATTCaaG\n",
" GttCTTAA\n",
"\n",
"> cutsite, None : (((11, -4), EcoRI), None)\n",
"Dseq(-7)\n",
"AATTCaa\n",
" Gtt\n",
"\n",
"Circular molecule\n",
"> cutsite, cutsite : (((6, -4), EcoRI), ((6, -4), EcoRI))\n",
"Dseq(-12)\n",
"AATTCaaG\n",
" GttCTTAA\n"
]
}
],
"source": [
"dseq = Dseq('aaGAATTCaaGAATTCaa')\n",
"cutsites = dseq.get_cutsites([EcoRI])\n",
"\n",
"cutsite_pairs = dseq.get_cutsite_pairs(cutsites)\n",
"pair_types = ['None, cutsite', 'cutsite, cutsite', 'cutsite, None']\n",
"\n",
"for pair, pair_type in zip(cutsite_pairs, pair_types):\n",
" print('>', pair_type, ':',pair)\n",
" print(dseq.apply_cut(*pair).__repr__())\n",
" print()\n",
"\n",
"# Opening a circular sequence\n",
"print('Circular molecule')\n",
"dseq = Dseq('TTCaaGAA', circular=True)\n",
"cutsites = dseq.get_cutsites([EcoRI])\n",
"cutsite_pairs = dseq.get_cutsite_pairs(cutsites)\n",
"print('> cutsite, cutsite :', cutsite_pairs[0])\n",
"print(dseq.apply_cut(*cutsite_pairs[0]).__repr__())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dseq(-7)\n",
" aG\n",
"ttCTTAA\n",
"\n",
"Dseq(-7)\n",
"AATTCaa\n",
" Gt\n"
]
}
],
"source": [
"# Note that the cutsite respects the ovhg of the parent sequence:\n",
"dseq = Dseq.from_full_sequence_and_overhangs('aaGAATTCaaGAATTCaa', 1, 1)\n",
"f1, f2, f3 = dseq.cut([EcoRI])\n",
"print(f1.__repr__())\n",
"print()\n",
"print(f3.__repr__())\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
Expand Down
Loading

0 comments on commit fa0d2d4

Please sign in to comment.