Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simpler cut representation #187

Merged
merged 5 commits into from
Jan 25, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
271 changes: 248 additions & 23 deletions scripts/cutsite_pairs.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -4,46 +4,271 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# New cut implementation\n"
"# New cut implementation\n",
"\n",
"The most important thing is that cuts are now represented as `((cut_watson, ovhg), enz)`:\n",
"\n",
"- `cut_watson` is a positive integer contained in `[0,len(seq))`, where `seq` is the sequence that will be cut. It represents the position of the cut on the watson strand, using the full sequence as a reference. By \"full sequence\" I mean the one you would get from `str(Dseq)`. See example below.\n",
"- `ovhg` is the overhang left after the cut. It has the same meaning as `ovhg` in the `Bio.Restriction` enzyme objects, or pydna's `Dseq` property.\n",
"- `enz` is the enzyme object. It's not necessary to perform the cut, but can be used to keep track of which enzyme was used.\n",
"\n",
"The new implementation of `Dseq.cut` now looks like this:\n",
"\n",
"```python\n",
"cutsites = self.get_cutsites(*enzymes)\n",
"cutsite_pairs = self.get_cutsite_pairs(cutsites)\n",
"return tuple(self.apply_cut(*cs) for cs in cutsite_pairs)\n",
"```\n",
"\n",
"Let's go through it step by step"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"get_cutsites output: [((3, -4), EcoRI), ((11, -4), EcoRI)]\n",
"EcoRI.search output: [4, 12] < (positions are 1-based)\n",
"EcoRI.ovhg: -4\n",
"\n",
"get_cutsites output: [((6, -4), EcoRI)]\n",
"EcoRI.search output: [7] < (positions are 1-based)\n",
"EcoRI.ovhg: -4\n",
"\n",
"get_cutsites output: [((1, 2), PacI)]\n",
"PacI.search output: [2] < (positions are 1-based)\n",
"PacI.ovhg: 2\n",
"\n"
]
}
],
"source": [
"# Coming soon\n",
"# seq = Dseq('aaGAATTCaa', circular=True)\n",
"from pydna.dseq import Dseq\n",
"from Bio.Restriction import EcoRI, PacI\n",
"\n",
"# print('EcORI', EcoRI.ovhg, len(seq))\n",
"# for shift in range(len(seq)):\n",
"# seq_shifted = seq.shifted(shift)\n",
"# cut_site = seq_shifted.get_cutsites(EcoRI)[0][0]\n",
"# print(shift, seq_shifted, cut_site, cut_site[0] - cut_site[1])\n",
"dseq = Dseq('aaGAATTCaaGAATTCaa')\n",
"\n",
"# seq = Dseq('ccTTAATTAAcc', circular=True)\n",
"# print('PacI', PacI.ovhg, len(seq))\n",
"# for shift in range(len(seq)):\n",
"# seq_shifted = seq.shifted(shift)\n",
"# cut_site = seq_shifted.get_cutsites(PacI)[0][0]\n",
"# print(shift, seq_shifted, cut_site, cut_site[0] - cut_site[1])\n",
"# what this function does is basically handle the format of the enzymes, and return the cut positions\n",
"# that are returned by enz.search, along with the enzyme name and overhang. Positions are made zero-base\n",
"# instead of one-based\n",
"\n",
"print('get_cutsites output:', dseq.get_cutsites([EcoRI]))\n",
"print('EcoRI.search output:', EcoRI.search(dseq), '< (positions are 1-based)')\n",
"print('EcoRI.ovhg:', EcoRI.ovhg)\n",
"print()\n",
"\n",
"# seq = Dseq('TTAAccccTTAA', circular=True)\n",
"# custom_cut = ((1, 11), type('DynamicClass', (), {'ovhg': 2})())\n",
"# print(seq.apply_cut(custom_cut, custom_cut).__repr__())\n",
"# Below are two examples of circular sequences with a cutsite that spans the origin.\n",
"dseq = Dseq('TTCaaGAA', circular=True)\n",
"print('get_cutsites output:', dseq.get_cutsites([EcoRI]))\n",
"print('EcoRI.search output:', EcoRI.search(dseq, linear=False), '< (positions are 1-based)')\n",
"print('EcoRI.ovhg:', EcoRI.ovhg)\n",
"print()\n",
"\n",
"# print()\n",
"dseq = Dseq('TTAAaaTTAA', circular=True)\n",
"print('get_cutsites output:', dseq.get_cutsites([PacI]))\n",
"print('PacI.search output:', PacI.search(dseq, linear=False), '< (positions are 1-based)')\n",
"print('PacI.ovhg:', PacI.ovhg)\n",
"print()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note in the above printed output how if the ovhg is negative, for an origin spanning cutsite, the position lies on the left side of the origin, and viceversa.\n",
"\n",
"# custom_cut = ((1, 11), type('DynamicClass', (), {'ovhg': -10})())\n",
"# print(seq.apply_cut(custom_cut, custom_cut).__repr__())"
"Below, you can see that the `cut_watson` is defined with respect to the \"full sequence\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dseq(-10)\n",
"aaGAATTCaa\n",
" tCTTAAGtt\n",
"ovhg: -1 >> [((3, -4), EcoRI)]\n",
"\n",
"Dseq(-10)\n",
"aaGAATTCaa\n",
"ttCTTAAGtt\n",
"ovhg: 0 >> [((3, -4), EcoRI)]\n",
"\n",
"Dseq(-10)\n",
" aGAATTCaa\n",
"ttCTTAAGtt\n",
"ovhg: 1 >> [((3, -4), EcoRI)]\n",
"\n"
]
}
],
"source": [
"# `cut_watson` is defined with respect to the \"full sequence\"\n",
"for ovhg in [-1, 0, 1]:\n",
" dseq = Dseq.from_full_sequence_and_overhangs('aaGAATTCaa', ovhg, 0)\n",
" print(dseq.__repr__())\n",
" print('ovhg:', ovhg, '>>', dseq.get_cutsites([EcoRI]))\n",
" print()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cuts are only returned if the recognition site and overhang are on the double-strand part of the sequence."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[((1, -4), EcoRI)]\n",
"[]\n"
]
}
],
"source": [
"\n",
"seq = Dseq('GAATTC')\n",
"print(seq.get_cutsites([EcoRI]))\n",
"\n",
"seq = Dseq.from_full_sequence_and_overhangs('GAATTC', -1, 0)\n",
"print(seq.get_cutsites([EcoRI]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Pairing cutsites\n",
"\n",
"A fragment produced by restriction is represented by a tuple of length 2 that may contain cutsites or `None`:\n",
"\n",
"- Two cutsites: represents the extraction of a fragment between those two cutsites, in that orientation. To represent the opening of a circular molecule with a single cutsite, we put the same cutsite twice. See below.\n",
"- `None`, cutsite: represents the extraction of a fragment between the left edge of linear sequence and the cutsite.\n",
"- cutsite, `None`: represents the extraction of a fragment between the cutsite and the right edge of a linear sequence.\n",
"\n",
"## Generating the sequence\n",
"\n",
"To get the fragment, we use the function `dseq.apply_cut`, passing the two elements of the tuple as arguments."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> None, cutsite : (None, ((3, -4), EcoRI))\n",
"Dseq(-7)\n",
"aaG\n",
"ttCTTAA\n",
"\n",
"> cutsite, cutsite : (((3, -4), EcoRI), ((11, -4), EcoRI))\n",
"Dseq(-12)\n",
"AATTCaaG\n",
" GttCTTAA\n",
"\n",
"> cutsite, None : (((11, -4), EcoRI), None)\n",
"Dseq(-7)\n",
"AATTCaa\n",
" Gtt\n",
"\n",
"Circular molecule\n",
"> cutsite, cutsite : (((6, -4), EcoRI), ((6, -4), EcoRI))\n",
"Dseq(-12)\n",
"AATTCaaG\n",
" GttCTTAA\n"
]
}
],
"source": [
"dseq = Dseq('aaGAATTCaaGAATTCaa')\n",
"cutsites = dseq.get_cutsites([EcoRI])\n",
"\n",
"cutsite_pairs = dseq.get_cutsite_pairs(cutsites)\n",
"pair_types = ['None, cutsite', 'cutsite, cutsite', 'cutsite, None']\n",
"\n",
"for pair, pair_type in zip(cutsite_pairs, pair_types):\n",
" print('>', pair_type, ':',pair)\n",
" print(dseq.apply_cut(*pair).__repr__())\n",
" print()\n",
"\n",
"# Opening a circular sequence\n",
"print('Circular molecule')\n",
"dseq = Dseq('TTCaaGAA', circular=True)\n",
"cutsites = dseq.get_cutsites([EcoRI])\n",
"cutsite_pairs = dseq.get_cutsite_pairs(cutsites)\n",
"print('> cutsite, cutsite :', cutsite_pairs[0])\n",
"print(dseq.apply_cut(*cutsite_pairs[0]).__repr__())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dseq(-7)\n",
" aG\n",
"ttCTTAA\n",
"\n",
"Dseq(-7)\n",
"AATTCaa\n",
" Gt\n"
]
}
],
"source": [
"# Note that the cutsite respects the ovhg of the parent sequence:\n",
"dseq = Dseq.from_full_sequence_and_overhangs('aaGAATTCaaGAATTCaa', 1, 1)\n",
"f1, f2, f3 = dseq.cut([EcoRI])\n",
"print(f1.__repr__())\n",
"print()\n",
"print(f3.__repr__())\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
Expand Down
Loading
Loading