diff --git a/.github/workflows/buid_and_release.yml b/.github/workflows/buid_and_release.yml
index d6ee9e3..264ea84 100644
--- a/.github/workflows/buid_and_release.yml
+++ b/.github/workflows/buid_and_release.yml
@@ -11,7 +11,7 @@ on:
jobs:
build_and_release:
runs-on: ubuntu-latest
- container: obolibrary/odkfull:v1.4.3
+ container: obolibrary/odkfull:v1.5.3
steps:
- uses: actions/checkout@v2
- name: Install
diff --git a/README.md b/README.md
index 5c398f0..b4c36dd 100644
--- a/README.md
+++ b/README.md
@@ -105,15 +105,148 @@ always in sync, and that one or the other may be slightly more up-to or out-of d
### `review.tsv`
Columns:
-- `classCode`: integer: ID of review case class
-- `classShortName`: string (camelCase): describing the review case class
+- `classCode`: integer
+- `classLabel`: string
- `value`: any: Some form of data to review
- `comment`: string (optional)
-#### 1. `causalD2gButMarkedDigenic`
-This review case involves what would be otherwise considered a valid disease-gene relationship, but for the fact that
-it quite unusually includes 'digenic' in the label, even though it only had 1 association. OMIM doesn't have a
-guaranatee on the data quality of its disease-gene associations marked 'digenic', so for any of these entries, it could
-be the case that either (a) it is not 'digenic'; OMIM should remove that from the label, and Mondo can make an explicit
-exception to add the relationship, or could otherwise wait until OMIM fixes the issue and it will automatically be
-added, or (b) it is in fact 'digenic', and OMIM should add the missing 2nd gene association.
+#### 1. D2G: digenic
+This review case involves what would otherwise be considered a valid, disease-defining disease-gene (D2G) relationship,
+except that the label unusually includes 'digenic' even though the disease has only 1 association. OMIM does not
+guarantee the data quality of its disease-gene associations marked 'digenic', so for any of these entries, either
+(a) it is not 'digenic': OMIM should remove that from the label, and Mondo can make an explicit exception to add the
+relationship or wait until OMIM fixes the issue, at which point it will be added automatically; or (b) it is in fact
+'digenic', and OMIM should add the missing 2nd gene association.
+
+#### 2. D2G: self-referential
+The unique characteristics of cases of this class are as follows:
+- Each case has 2 rows in `morbidmap.txt` that together form a pattern.
+- Row 1: One row is a typical, valid, disease-defining entry. For the given phenotype MIM in that row, there are no
+  other rows in `morbidmap.txt` where it appears as a phenotype having an association with another gene.
+ - All such cases seen as of 2024/11/18 are cancer cases, and the label ends with "somatic".
+ - This entry appears in the Phenotype-Gene Relationships table on the MIM's omim.org/entry page.
+- Row 2: There is a second row where the phenotype in the first row appears as a gene.
+ - For this row, there is no MIM in the phenotype field.
+ - This row does not appear in the Gene-Phenotype Relationships table on the MIM's omim.org/entry page.
+ - This row is self-referential. The label in the Phenotype field is one of the titles of the MIM in the Gene field.
+
+**Example case**:
+|Phenotype|Gene/Locus And Other Related Symbols|MIM Number|Cyto Location|
+|-|-|-|-|
+|Small cell cancer of the lung, somatic, 182280 (3)|RB1|614041|13q14.2|
+|Small-cell cancer of lung (2)|SCLC1|182280|3p23-p21|
+
+**All known cases**:
+There is a spreadsheet which collates all known cases as of 2024/11/18:
+[google sheet](https://docs.google.com/spreadsheets/d/1hKSp2dyKye6y_20NK2HwLsaKNzWfGCMJMP52lKrkHtU/). The MIMs of the known cases are: `159595`, `182280`, `607107`, and `615830`.
+
+**Additional notes**:
+Note that unlike the other cases, a single case of "D2G: self-referential" spans multiple rows in `review.tsv`.
+The cases are enumerated in the TSV, with individual cases identifiable via a leading integer in the `value` column,
+e.g. "1: " for the first case, "2: " for the second, and so on.
+
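The enumeration scheme described above can be undone with a small helper when reading `review.tsv`. The following is a minimal sketch, not code from this repository: the function name `group_cases` is hypothetical, and it assumes each `value` begins with an integer followed by `: `, as in the "1: " and "2: " examples.

```python
import re
from collections import defaultdict
from typing import Dict, List


def group_cases(values: List[str]) -> Dict[int, List[str]]:
    """Group `value` strings by their leading case number (assumed 'N: ' prefix)."""
    cases = defaultdict(list)
    for value in values:
        match = re.match(r'^(\d+):\s*(.*)$', value)
        if match:  # rows without a leading integer are skipped
            cases[int(match.group(1))].append(match.group(2))
    return dict(cases)
```

Each key of the returned dictionary is one "D2G: self-referential" case, and its list holds the rows belonging to that case.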
+Also, see note in section "3. D2G: somatic" about intersection between these two cases.
+
+#### 3. D2G: somatic
+Happens when all conditions were met for this association to be considered disease-defining, but the mutation is a somatic cell mutation, rather than a germline mutation. This is indicated by the appearance of the word 'somatic' in the label of the phenotype MIM in the association. These cases should be reviewed because currently any association meeting the criteria to be considered disease-defining is also considered a germline mutation and the association is represented in `omim.owl` using the property 'is causal germline mutation in' (RO:0004013).
+
+Note that there is an intersection between this case and case 2, "D2G: self-referential". Sometimes the somatic cases
+will also be self-referential, but not always. However, all cases of "D2G: self-referential" have historically included
+a row where the phenotype includes the word 'somatic'.
+
+#### 4. D2G: Phenotype is gene
+Happens when all conditions were met for this association to be considered disease-defining. However, the phenotype in
+the association unexpectedly has the type of "gene" rather than "phenotype". This is unexpected and considered a data
+quality issue on the OMIM side. As of 2024/10, we flagged this to the OMIM team and they corrected all such cases.
+
+#### 5. D2G: Phenotype type error
+Happens when all conditions were met for this association to be considered disease-defining. However, the phenotype in
+the association has an unexpected type of either 'OBSOLETE', 'SUSPECTED', or 'HAS_AFFECTED_FEATURE'. As of 2024/12, we
+have not seen such cases appear, but we have set this review case up to watch for them should they occur.
+
+## Under the hood: Design decisions, etc.
+### Gene-Disease pipeline
+This pipeline involves the processing of `morbidmap.txt` to create ontological representations of Gene --> Disease and
+Disease --> Gene associations.
+
+#### Example input/output
+##### Input: `morbidmap.txt`
+| Phenotype | Gene/Locus And Other Related Symbols | MIM Number | Cyto Location |
+|----------------------------------|--------------------------------------|------------|---------------|
+| Prune belly syndrome, 100100 (3) | CHRM3, PBS, EGBRS | 118494 | 1q43 |
+
+`OMIM:100100` (Prune belly syndrome) is the Phenotype ("Disease"), and `OMIM:118494` (CHRM3) is the associated Gene.
+They are related via mapping key `(3)` (explained below).
+
+##### Output: `omim.ttl`
+```ttl
+OMIM:100100 a owl:Class ;
+ rdfs:label "prune belly syndrome" ;
+ rdfs:subClassOf _:N2fd22c9bb2f04630b81414cff9514660 ;
+ biolink:category biolink:Disease .
+
+_:N2fd22c9bb2f04630b81414cff9514660 a owl:Restriction ;
+ owl:onProperty RO:0004003 ;
+ owl:someValuesFrom OMIM:118494 .
+```
+
+The association is represented as an `rdfs:subClassOf` `owl:Restriction`, where mapping key `(3)` is represented as
+`RO:0004003`.
+
+#### OMIM MorbidMap mapping keys & Relationship Ontology predicates
+In order to add these associations to an OWL ontology, we must use an appropriate predicate. Below are the 4 OMIM
+`morbidmap.txt` mapping keys and [their definitions](https://omim.org/help/faq#1_6), alongside the RO predicates we've
+chosen to represent them.
+
+Note that the directionality of these associations / predicates is in the Gene->Disease direction:
+(Gene MIM) --(Mapping key / RO predicate)--> (Disease MIM)
+
+1: The disorder is placed on the map based on its association with a gene, but the underlying defect is not known.
+Not ontologized. These types are ignored due to the uncertainty of the nature of the association.
+
+2: The disorder has been placed on the map by linkage or other statistical method; no mutation has been found.
+[RO:0003303 (causes condition)](https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0003303):
+A relationship between an entity (e.g. a genotype, genetic variation, chemical, or environmental exposure) and a
+condition (a phenotype or disease), where the entity has some causal role for the condition.
+
+3: The molecular basis for the disorder is known; a mutation has been found in the gene.
+[RO:0004013 (is causal germline mutation in)](https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0004013):
+Relates a gene to condition, such that a mutation in this gene is sufficient to produce the condition and that can be
+passed on to offspring[modified from orphanet].
+
+Note: For these "mapping key (3)" cases, there also exists an inverse predicate which we ontologize in the
+inverse direction: (Disease MIM) --(Mapping key 3 / RO:0004003)--> (Gene MIM):
+[RO:0004003 (has material basis in germline mutation in)](https://www.ebi.ac.uk/ols4/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0004003)
+
+4: A contiguous gene deletion or duplication syndrome, multiple genes are deleted or duplicated causing the phenotype.
+[RO:0003304 (contributes to condition)](https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0003304):
+A relationship between an entity (e.g. a genotype, genetic variation, chemical, or environmental exposure) and a
+condition (a phenotype or disease), where the entity has some contributing role that influences the condition.
+
+**Important caveat: Singular vs multiple associations**
+The above RO predicates are only used if there is only 1 gene associated with a given disease, i.e.
+in `morbidmap.txt`, there is only 1 row where the MIM appears in the `Phenotype` field.
+
+In cases where there is >1 association, the following RO predicate is used instead, regardless of whether the mapping
+key is (2), (3), or (4):
+[RO:0003302 (causes or contributes to condition)](https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0003302):
+A relationship between an entity (e.g. a genotype, genetic variation, chemical, or environmental exposure) and a
+condition (a phenotype or disease), where the entity has some causal or contributing role that influences the condition.
+
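The predicate selection described in this section can be summarized in a few lines. This is a hedged sketch rather than the pipeline's actual code: the function name `select_predicate` and the dictionary are illustrative assumptions, while the CURIEs are the ones listed above. It covers only the mapping key and the association count, not any label-based conditions.

```python
from typing import Optional

# Mapping key -> RO predicate, Gene->Disease direction, single association only
SINGLE_ASSOCIATION_PREDICATES = {
    2: 'RO:0003303',  # causes condition
    3: 'RO:0004013',  # is causal germline mutation in
    4: 'RO:0003304',  # contributes to condition
}
MULTIPLE_ASSOCIATION_PREDICATE = 'RO:0003302'  # causes or contributes to condition


def select_predicate(mapping_key: int, n_associations: int) -> Optional[str]:
    """Return the RO predicate CURIE, or None if the row is not ontologized."""
    if mapping_key not in SINGLE_ASSOCIATION_PREDICATES:
        return None  # mapping key (1) is ignored
    if n_associations > 1:
        return MULTIPLE_ASSOCIATION_PREDICATE
    return SINGLE_ASSOCIATION_PREDICATES[mapping_key]
```

Here `n_associations` stands for the number of rows in `morbidmap.txt` where the disease MIM appears in the `Phenotype` field.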
+#### Necessary conditions for disease-defining associations
+Of the above 3 Gene->Disease association predicates (those with mapping keys (2), (3), and (4)), the one which we
+consider "disease defining" is (3) (RO:0004013). For these cases, as mentioned above, we also declare an association in
+the Disease->Gene direction, RO:0004003. However, we only declare these associations if several other conditions are
+also met. The Phenotype must (i) not be marked as a non-disease (represented by the label being wrapped in `[]`),
+(ii) not be marked as a mutation that contributes to susceptibility to multifactorial disorders (e.g., diabetes,
+asthma) or to susceptibility to infection (e.g., malaria) (represented by the label being wrapped in `{}`), and
+(iii) not be marked provisional (represented by the label beginning with `?`). These 3 special markers are
+further explained in the [OMIM FAQ](https://omim.org/help/faq#1_6). Additionally, as mentioned above, we only declare
+the association in `omim.ttl` if there is 1 and only 1 association shown in `morbidmap.txt`.
+
+So, all of the conditions together are:
+1. Mapping key is (3)
+2. Only 1 association
+3. Phenotype not marked as non-disease (`[]`)
+4. Phenotype not marked as susceptibility to multifactorial disorders or infection (`{}`)
+5. Phenotype not marked provisional (`?`)
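
The five conditions above can be sketched as a single check on a `morbidmap.txt` row. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the function name `is_disease_defining`, its parameters, and the regular expression are hypothetical, and the caller is assumed to have already counted the associations for the phenotype MIM.

```python
import re

# Assumed layout of the Phenotype field: "<label>, <6-digit MIM> (<mapping key>)"
PHENOTYPE_PATTERN = re.compile(r'^(.*?),?\s*(\d{6})\s*\((\d)\)\s*$')


def is_disease_defining(phenotype_field: str, n_associations: int) -> bool:
    """Return True if the Phenotype field meets all five conditions."""
    match = PHENOTYPE_PATTERN.match(phenotype_field.strip())
    if match is None:
        return False  # no phenotype MIM or no mapping key in the field
    label, _mim, mapping_key = match.groups()
    return (
        mapping_key == '3'              # 1. mapping key is (3)
        and n_associations == 1         # 2. only 1 association
        and not label.startswith('[')   # 3. not marked as non-disease
        and not label.startswith('{')   # 4. not marked as susceptibility
        and not label.startswith('?')   # 5. not marked provisional
    )
```

For the example row shown earlier, `is_disease_defining('Prune belly syndrome, 100100 (3)', 1)` evaluates to `True`.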
diff --git a/analyses/morbidmap-data-analysis/Analyze_morbidmap - v3.ipynb b/analyses/morbidmap-data-analysis/Analyze_morbidmap - v3.ipynb
new file mode 100644
index 0000000..8e01dcd
--- /dev/null
+++ b/analyses/morbidmap-data-analysis/Analyze_morbidmap - v3.ipynb
@@ -0,0 +1,1207 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "5601677f-11b5-488b-b6f2-c7ccc4377d58",
+ "metadata": {},
+ "source": [
+ "## Analyze Morbidmap content - v3\n",
+ "The goal of this notebook is to analyze the content of the files from OMIM called morbidmap and mimTitles in order to create a gold standard list of diseases that should be represented in Mondo with 'has material basis in germline mutation in' some GENE. The diseases in this list can be used for comparison of results that occur through the various transformations of the omim content to confirm the final representation is correct in downstream files, e.g. omim.owl.\n",
+ "\n",
+ "To download these input files (morbidmap and mimTitles), request an API key from OMIM (https://omim.org/contact#) and then create the files using python -m omim2obo based on the instructions in the README in the omim repo.\n",
+ "\n",
+ "For this analysis, the working assumption from Sabrina's latest email ('Gene association in Mondo' on Fri, Nov 1, 6:46 PM) is that the gene associations to add into Mondo are:\n",
+ "1) The disease has exactly 1 associated gene\n",
+ "2) The association is causal (mapping key = 3)\n",
+ "3) Classified as a disease, non-provisional, and not a susceptibility relationship (phenotype label does NOT include [], {}, or ?)\n",
+ " \n",
+ "See https://omim.org/help/faq#1_6 for more details on what the Phenotype mapping key values mean and additional formatting, [], {}, ?, found in phenotype labels. See https://omim.org/help/faq#1_3 for information on what the Prefix values in the file mimTitles means.\n",
+ "\n",
+ "NOTE: Unless diseases whose phenotype label contains 'digenic' are filtered out of this set, a handful of them will be included,\n",
+ "as we saw in Analyze Morbidmap content - v1. These are also covered in the summary doc of [OMIM Disease-Gene Issues](https://docs.google.com/document/d/1cLfBgPIZWiN5LX-E-xwSyBeFdT-vw0JuSfSM7HL3_hc/edit?tab=t.0#heading=h.h4y343h64cck).\n",
+ "\n",
+ "Likewise, unless diseases whose labels start with [, {, or ? are filtered out of this set, non-diseases, susceptibility, and provisional diseases \n",
+ "will be included. See https://omim.org/help/faq#1_6 for a description of these special characters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "26da9796-e6bb-4e5c-aabf-ad2648610195",
+ "metadata": {},
+ "source": [
+ "### Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "e2a21e21-bf58-433a-9237-b083c9288a1b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Imports\n",
+ "import pandas as pd\n",
+ "import re\n",
+ "import time\n",
+ "\n",
+ "# Set the display option to show full column width\n",
+ "pd.set_option('display.max_colwidth', None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f6bc4df4-d5de-4df1-bf15-94a5ace2f346",
+ "metadata": {},
+ "source": [
+ "### Read in Data file"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "580ee461-c4fa-4902-a1f8-bbe9d56b0c5b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " Phenotype | \n",
+ " Gene/Locus And Other Related Symbols | \n",
+ " MIM Number | \n",
+ " Cyto Location | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 17,20-lyase deficiency, isolated, 202110 (3) | \n",
+ " CYP17A1, CYP17, P450C17 | \n",
+ " 609300 | \n",
+ " 10q24.32 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 17-alpha-hydroxylase/17,20-lyase deficiency, 202110 (3) | \n",
+ " CYP17A1, CYP17, P450C17 | \n",
+ " 609300 | \n",
+ " 10q24.32 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2,4-dienoyl-CoA reductase deficiency, 616034 (3) | \n",
+ " NADK2, C5orf33, DECRD | \n",
+ " 615787 | \n",
+ " 5p13.2 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2-methylbutyrylglycinuria, 610006 (3) | \n",
+ " ACADSB, SBCAD | \n",
+ " 600301 | \n",
+ " 10q26.13 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 3-M syndrome 1, 273750 (3) | \n",
+ " CUL7, 3M1 | \n",
+ " 609577 | \n",
+ " 6p21.1 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " Phenotype \\\n",
+ "0 17,20-lyase deficiency, isolated, 202110 (3) \n",
+ "1 17-alpha-hydroxylase/17,20-lyase deficiency, 202110 (3) \n",
+ "2 2,4-dienoyl-CoA reductase deficiency, 616034 (3) \n",
+ "3 2-methylbutyrylglycinuria, 610006 (3) \n",
+ "4 3-M syndrome 1, 273750 (3) \n",
+ "\n",
+ " Gene/Locus And Other Related Symbols MIM Number Cyto Location \n",
+ "0 CYP17A1, CYP17, P450C17 609300 10q24.32 \n",
+ "1 CYP17A1, CYP17, P450C17 609300 10q24.32 \n",
+ "2 NADK2, C5orf33, DECRD 615787 5p13.2 \n",
+ "3 ACADSB, SBCAD 600301 10q26.13 \n",
+ "4 CUL7, 3M1 609577 6p21.1 "
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Read in file. This version of morbidmap.tsv was downloaded on 18-Nov-2024\n",
+ "# NOTE: You will need to follow the instructions in the README to get the morbidmap file. \n",
+ "# IMPORTANT !!The morbidmap file is not a file that should be posted publicly in this repo!!\n",
+ "\n",
+ "df = pd.read_csv('../../data/morbidmap.tsv', sep='\\t')\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c7d9885d-c998-45ab-943d-a621b2b6e644",
+ "metadata": {},
+ "source": [
+ "### Process file to parse out phenotype mim number from Phenotype column"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "1259b124-160b-4eb2-b60a-83f73e61ed3c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " Phenotype | \n",
+ " Gene/Locus And Other Related Symbols | \n",
+ " MIM Number | \n",
+ " Cyto Location | \n",
+ " p_label | \n",
+ " p_mim | \n",
+ " p_mapping_key | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 17,20-lyase deficiency, isolated, 202110 (3) | \n",
+ " CYP17A1, CYP17, P450C17 | \n",
+ " 609300 | \n",
+ " 10q24.32 | \n",
+ " 17,20-lyase deficiency, isolated, | \n",
+ " 202110 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 17-alpha-hydroxylase/17,20-lyase deficiency, 202110 (3) | \n",
+ " CYP17A1, CYP17, P450C17 | \n",
+ " 609300 | \n",
+ " 10q24.32 | \n",
+ " 17-alpha-hydroxylase/17,20-lyase deficiency, | \n",
+ " 202110 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2,4-dienoyl-CoA reductase deficiency, 616034 (3) | \n",
+ " NADK2, C5orf33, DECRD | \n",
+ " 615787 | \n",
+ " 5p13.2 | \n",
+ " 2,4-dienoyl-CoA reductase deficiency, | \n",
+ " 616034 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2-methylbutyrylglycinuria, 610006 (3) | \n",
+ " ACADSB, SBCAD | \n",
+ " 600301 | \n",
+ " 10q26.13 | \n",
+ " 2-methylbutyrylglycinuria, | \n",
+ " 610006 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 3-M syndrome 1, 273750 (3) | \n",
+ " CUL7, 3M1 | \n",
+ " 609577 | \n",
+ " 6p21.1 | \n",
+ " 3-M syndrome 1, | \n",
+ " 273750 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " Phenotype \\\n",
+ "0 17,20-lyase deficiency, isolated, 202110 (3) \n",
+ "1 17-alpha-hydroxylase/17,20-lyase deficiency, 202110 (3) \n",
+ "2 2,4-dienoyl-CoA reductase deficiency, 616034 (3) \n",
+ "3 2-methylbutyrylglycinuria, 610006 (3) \n",
+ "4 3-M syndrome 1, 273750 (3) \n",
+ "\n",
+ " Gene/Locus And Other Related Symbols MIM Number Cyto Location \\\n",
+ "0 CYP17A1, CYP17, P450C17 609300 10q24.32 \n",
+ "1 CYP17A1, CYP17, P450C17 609300 10q24.32 \n",
+ "2 NADK2, C5orf33, DECRD 615787 5p13.2 \n",
+ "3 ACADSB, SBCAD 600301 10q26.13 \n",
+ "4 CUL7, 3M1 609577 6p21.1 \n",
+ "\n",
+ " p_label p_mim p_mapping_key \n",
+ "0 17,20-lyase deficiency, isolated, 202110 3 \n",
+ "1 17-alpha-hydroxylase/17,20-lyase deficiency, 202110 3 \n",
+ "2 2,4-dienoyl-CoA reductase deficiency, 616034 3 \n",
+ "3 2-methylbutyrylglycinuria, 610006 3 \n",
+ "4 3-M syndrome 1, 273750 3 "
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Parse out phenotype mim number from Phenotype column\n",
+ "\n",
+ "# Updated pattern - from https://github.com/monarch-initiative/omim/pull/158/files#diff-712ded77b46725c43257568450e7c94df2a64f683c77dfd88e9726fbcbc7c5fbR351\n",
+ "pattern = r'(.*)(\\d{6})\\s*(?:\\((\\d+)\\))?'\n",
+ "\n",
+ "# Use .str.extract() to apply the pattern and store matches in new columns\n",
+ "df[['p_label', 'p_mim', 'p_mapping_key']] = df['Phenotype'].str.extract(pattern)\n",
+ "\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "eccc15f5-cfa2-4738-92aa-7bf11573e7a7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[<class 'str'>]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Convert type of p_mapping_key to a string\n",
+ "\n",
+ "df['p_mapping_key'] = df['p_mapping_key'].astype(str)\n",
+ "\n",
+ "# Check that each value is now a string\n",
+ "print(df['p_mapping_key'].apply(type).unique())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4a6813b9-3556-4a0a-87c0-34726a49025b",
+ "metadata": {},
+ "source": [
+ "### Get all rows where the p_mim value occurs only 1 time in the dataframe and has p_mapping_key='3'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "a425363d-a7fd-495a-be8d-508a46f26b6b",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Length of unique_p_mim: 6350\n",
+ "\n",
+ "---\n",
+ "unique_and_pkey3_df.nunique() values:\n",
+ "Phenotype 6331\n",
+ "Gene/Locus And Other Related Symbols 4622\n",
+ "MIM Number 4622\n",
+ "Cyto Location 834\n",
+ "p_label 6330\n",
+ "p_mim 6331\n",
+ "p_mapping_key 1\n",
+ "dtype: int64\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " Phenotype | \n",
+ " Gene/Locus And Other Related Symbols | \n",
+ " MIM Number | \n",
+ " Cyto Location | \n",
+ " p_label | \n",
+ " p_mim | \n",
+ " p_mapping_key | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 2 | \n",
+ " 2,4-dienoyl-CoA reductase deficiency, 616034 (3) | \n",
+ " NADK2, C5orf33, DECRD | \n",
+ " 615787 | \n",
+ " 5p13.2 | \n",
+ " 2,4-dienoyl-CoA reductase deficiency, | \n",
+ " 616034 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2-methylbutyrylglycinuria, 610006 (3) | \n",
+ " ACADSB, SBCAD | \n",
+ " 600301 | \n",
+ " 10q26.13 | \n",
+ " 2-methylbutyrylglycinuria, | \n",
+ " 610006 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 3-M syndrome 1, 273750 (3) | \n",
+ " CUL7, 3M1 | \n",
+ " 609577 | \n",
+ " 6p21.1 | \n",
+ " 3-M syndrome 1, | \n",
+ " 273750 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 5 | \n",
+ " 3-M syndrome 2, 612921 (3) | \n",
+ " OBSL1, KIAA0657, 3M2 | \n",
+ " 610991 | \n",
+ " 2q35 | \n",
+ " 3-M syndrome 2, | \n",
+ " 612921 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 6 | \n",
+ " 3-M syndrome 3, 614205 (3) | \n",
+ " CCDC8, 3M3 | \n",
+ " 614145 | \n",
+ " 19q13.32 | \n",
+ " 3-M syndrome 3, | \n",
+ " 614205 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " Phenotype \\\n",
+ "2 2,4-dienoyl-CoA reductase deficiency, 616034 (3) \n",
+ "3 2-methylbutyrylglycinuria, 610006 (3) \n",
+ "4 3-M syndrome 1, 273750 (3) \n",
+ "5 3-M syndrome 2, 612921 (3) \n",
+ "6 3-M syndrome 3, 614205 (3) \n",
+ "\n",
+ " Gene/Locus And Other Related Symbols MIM Number Cyto Location \\\n",
+ "2 NADK2, C5orf33, DECRD 615787 5p13.2 \n",
+ "3 ACADSB, SBCAD 600301 10q26.13 \n",
+ "4 CUL7, 3M1 609577 6p21.1 \n",
+ "5 OBSL1, KIAA0657, 3M2 610991 2q35 \n",
+ "6 CCDC8, 3M3 614145 19q13.32 \n",
+ "\n",
+ " p_label p_mim p_mapping_key \n",
+ "2 2,4-dienoyl-CoA reductase deficiency, 616034 3 \n",
+ "3 2-methylbutyrylglycinuria, 610006 3 \n",
+ "4 3-M syndrome 1, 273750 3 \n",
+ "5 3-M syndrome 2, 612921 3 \n",
+ "6 3-M syndrome 3, 614205 3 "
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Step 1: Filter for rows where p_mim occurs only once and p_mapping_key is 3\n",
+ "unique_p_mim = df['p_mim'].value_counts()[df['p_mim'].value_counts() == 1].index\n",
+ "print(\"Length of unique_p_mim: \", len(unique_p_mim))\n",
+ "\n",
+ "unique_and_pkey3_df = df[(df['p_mim'].isin(unique_p_mim)) & (df['p_mapping_key'] == '3')]\n",
+ "print (\"\\n---\\nunique_and_pkey3_df.nunique() values:\")\n",
+ "print(unique_and_pkey3_df.nunique())\n",
+ "\n",
+ "unique_and_pkey3_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "febe84d1-40fb-45ff-b520-48481cf36dc8",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Phenotype 5340\n",
+ "Gene/Locus And Other Related Symbols 4014\n",
+ "MIM Number 4014\n",
+ "Cyto Location 811\n",
+ "p_label 5339\n",
+ "p_mim 5340\n",
+ "p_mapping_key 1\n",
+ "dtype: int64\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " Phenotype | \n",
+ " Gene/Locus And Other Related Symbols | \n",
+ " MIM Number | \n",
+ " Cyto Location | \n",
+ " p_label | \n",
+ " p_mim | \n",
+ " p_mapping_key | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 2 | \n",
+ " 2,4-dienoyl-CoA reductase deficiency, 616034 (3) | \n",
+ " NADK2, C5orf33, DECRD | \n",
+ " 615787 | \n",
+ " 5p13.2 | \n",
+ " 2,4-dienoyl-CoA reductase deficiency, | \n",
+ " 616034 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 2-methylbutyrylglycinuria, 610006 (3) | \n",
+ " ACADSB, SBCAD | \n",
+ " 600301 | \n",
+ " 10q26.13 | \n",
+ " 2-methylbutyrylglycinuria, | \n",
+ " 610006 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 3-M syndrome 1, 273750 (3) | \n",
+ " CUL7, 3M1 | \n",
+ " 609577 | \n",
+ " 6p21.1 | \n",
+ " 3-M syndrome 1, | \n",
+ " 273750 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 5 | \n",
+ " 3-M syndrome 2, 612921 (3) | \n",
+ " OBSL1, KIAA0657, 3M2 | \n",
+ " 610991 | \n",
+ " 2q35 | \n",
+ " 3-M syndrome 2, | \n",
+ " 612921 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 6 | \n",
+ " 3-M syndrome 3, 614205 (3) | \n",
+ " CCDC8, 3M3 | \n",
+ " 614145 | \n",
+ " 19q13.32 | \n",
+ " 3-M syndrome 3, | \n",
+ " 614205 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " Phenotype \\\n",
+ "2 2,4-dienoyl-CoA reductase deficiency, 616034 (3) \n",
+ "3 2-methylbutyrylglycinuria, 610006 (3) \n",
+ "4 3-M syndrome 1, 273750 (3) \n",
+ "5 3-M syndrome 2, 612921 (3) \n",
+ "6 3-M syndrome 3, 614205 (3) \n",
+ "\n",
+ " Gene/Locus And Other Related Symbols MIM Number Cyto Location \\\n",
+ "2 NADK2, C5orf33, DECRD 615787 5p13.2 \n",
+ "3 ACADSB, SBCAD 600301 10q26.13 \n",
+ "4 CUL7, 3M1 609577 6p21.1 \n",
+ "5 OBSL1, KIAA0657, 3M2 610991 2q35 \n",
+ "6 CCDC8, 3M3 614145 19q13.32 \n",
+ "\n",
+ " p_label p_mim p_mapping_key \n",
+ "2 2,4-dienoyl-CoA reductase deficiency, 616034 3 \n",
+ "3 2-methylbutyrylglycinuria, 610006 3 \n",
+ "4 3-M syndrome 1, 273750 3 \n",
+ "5 3-M syndrome 2, 612921 3 \n",
+ "6 3-M syndrome 3, 614205 3 "
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Remove rows where the p_label starts with an \"OMIM special character\". See https://omim.org/help/faq#1_6\n",
+ "\n",
+ "# Filter out rows where p_label starts with [, {, or ?\n",
+ "unique_and_key3_filtered_df = unique_and_pkey3_df[~unique_and_pkey3_df['p_label'].str.match(r'^[\\[{?]', na=False)]\n",
+ "print(unique_and_key3_filtered_df.nunique())\n",
+ "\n",
+ "unique_and_key3_filtered_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "b02e7433-f960-4d25-b0d4-5046b141404a",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Prefix 5\n",
+ "MIM Number 28952\n",
+ "Preferred Title; symbol 28684\n",
+ "Alternative Title(s); symbol(s) 18991\n",
+ "Included Title(s); symbols 1314\n",
+ "dtype: int64\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " Prefix | \n",
+ " MIM Number | \n",
+ " Preferred Title; symbol | \n",
+ " Alternative Title(s); symbol(s) | \n",
+ " Included Title(s); symbols | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " NaN | \n",
+ " 100050 | \n",
+ " AARSKOG SYNDROME, AUTOSOMAL DOMINANT | \n",
+ " NaN | \n",
+ " NaN | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Percent | \n",
+ " 100070 | \n",
+ " AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1; AAA1 | \n",
+ " ANEURYSM, ABDOMINAL AORTIC; AAA;; ABDOMINAL AORTIC ANEURYSM | \n",
+ " NaN | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " Number Sign | \n",
+ " 100100 | \n",
+ " PRUNE BELLY SYNDROME; PBS | \n",
+ " ABDOMINAL MUSCLES, ABSENCE OF, WITH URINARY TRACT ABNORMALITY AND CRYPTORCHIDISM;; EAGLE-BARRETT SYNDROME; EGBRS | \n",
+ " NaN | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " NaN | \n",
+ " 100200 | \n",
+ " ABDUCENS PALSY | \n",
+ " NaN | \n",
+ " NaN | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " Number Sign | \n",
+ " 100300 | \n",
+ " ADAMS-OLIVER SYNDROME 1; AOS1 | \n",
+ " AOS;; ABSENCE DEFECT OF LIMBS, SCALP, AND SKULL;; CONGENITAL SCALP DEFECTS WITH DISTAL LIMB REDUCTION ANOMALIES;; APLASIA CUTIS CONGENITA WITH TERMINAL TRANSVERSE LIMB DEFECTS | \n",
+ " APLASIA CUTIS CONGENITA, CONGENITAL HEART DEFECT, AND FRONTONASAL CYSTS, INCLUDED | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " Prefix MIM Number Preferred Title; symbol \\\n",
+ "0 NaN 100050 AARSKOG SYNDROME, AUTOSOMAL DOMINANT \n",
+ "1 Percent 100070 AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1; AAA1 \n",
+ "2 Number Sign 100100 PRUNE BELLY SYNDROME; PBS \n",
+ "3 NaN 100200 ABDUCENS PALSY \n",
+ "4 Number Sign 100300 ADAMS-OLIVER SYNDROME 1; AOS1 \n",
+ "\n",
+ " Alternative Title(s); symbol(s) \\\n",
+ "0 NaN \n",
+ "1 ANEURYSM, ABDOMINAL AORTIC; AAA;; ABDOMINAL AORTIC ANEURYSM \n",
+ "2 ABDOMINAL MUSCLES, ABSENCE OF, WITH URINARY TRACT ABNORMALITY AND CRYPTORCHIDISM;; EAGLE-BARRETT SYNDROME; EGBRS \n",
+ "3 NaN \n",
+ "4 AOS;; ABSENCE DEFECT OF LIMBS, SCALP, AND SKULL;; CONGENITAL SCALP DEFECTS WITH DISTAL LIMB REDUCTION ANOMALIES;; APLASIA CUTIS CONGENITA WITH TERMINAL TRANSVERSE LIMB DEFECTS \n",
+ "\n",
+ " Included Title(s); symbols \n",
+ "0 NaN \n",
+ "1 NaN \n",
+ "2 NaN \n",
+ "3 NaN \n",
+ "4 APLASIA CUTIS CONGENITA, CONGENITAL HEART DEFECT, AND FRONTONASAL CYSTS, INCLUDED "
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Read in mimTitles to join with unique_and_key3_filtered_df to get Prefix values\n",
+ "mimTitles_df = pd.read_csv('../../data/mimTitles.tsv', sep='\\t')\n",
+ "print(mimTitles_df.nunique())\n",
+ "\n",
+ "mimTitles_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "b31fc8c8-8ed6-4084-a500-d680f5099de9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Phenotype 5340\n",
+ "Gene/Locus And Other Related Symbols 4014\n",
+ "MIM Number_x 4014\n",
+ "Cyto Location 811\n",
+ "p_label 5339\n",
+ "p_mim 5340\n",
+ "p_mapping_key 1\n",
+ "Prefix 2\n",
+ "MIM Number_y 5340\n",
+ "Preferred Title; symbol 5340\n",
+ "Alternative Title(s); symbol(s) 2900\n",
+ "Included Title(s); symbols 219\n",
+ "dtype: int64\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/var/folders/cp/m4__ys497773m0zyz5l__yqw0000gq/T/ipykernel_76076/112016101.py:4: SettingWithCopyWarning: \n",
+ "A value is trying to be set on a copy of a slice from a DataFrame.\n",
+ "Try using .loc[row_indexer,col_indexer] = value instead\n",
+ "\n",
+ "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
+ " unique_and_key3_filtered_df['p_mim'] = unique_and_key3_filtered_df['p_mim'].astype(str)\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ " Phenotype \\\n",
+ "0 2,4-dienoyl-CoA reductase deficiency, 616034 (3) \n",
+ "1 2-methylbutyrylglycinuria, 610006 (3) \n",
+ "2 3-M syndrome 1, 273750 (3) \n",
+ "3 3-M syndrome 2, 612921 (3) \n",
+ "4 3-M syndrome 3, 614205 (3) \n",
+ "\n",
+ " Gene/Locus And Other Related Symbols MIM Number_x Cyto Location \\\n",
+ "0 NADK2, C5orf33, DECRD 615787 5p13.2 \n",
+ "1 ACADSB, SBCAD 600301 10q26.13 \n",
+ "2 CUL7, 3M1 609577 6p21.1 \n",
+ "3 OBSL1, KIAA0657, 3M2 610991 2q35 \n",
+ "4 CCDC8, 3M3 614145 19q13.32 \n",
+ "\n",
+ " p_label p_mim p_mapping_key Prefix \\\n",
+ "0 2,4-dienoyl-CoA reductase deficiency, 616034 3 Number Sign \n",
+ "1 2-methylbutyrylglycinuria, 610006 3 Number Sign \n",
+ "2 3-M syndrome 1, 273750 3 Number Sign \n",
+ "3 3-M syndrome 2, 612921 3 Number Sign \n",
+ "4 3-M syndrome 3, 614205 3 Number Sign \n",
+ "\n",
+ " MIM Number_y Preferred Title; symbol \\\n",
+ "0 616034 2,4-DIENOYL-CoA REDUCTASE DEFICIENCY; DECRD \n",
+ "1 610006 2-METHYLBUTYRYL-CoA DEHYDROGENASE DEFICIENCY \n",
+ "2 273750 THREE M SYNDROME 1; 3M1 \n",
+ "3 612921 THREE M SYNDROME 2; 3M2 \n",
+ "4 614205 THREE M SYNDROME 3; 3M3 \n",
+ "\n",
+ " Alternative Title(s); symbol(s) \\\n",
+ "0 NaN \n",
+ "1 2-METHYLBUTYRYL GLYCINURIA;; SHORT/BRANCHED-CHAIN ACYL-CoA DEHYDROGENASE DEFICIENCY; SBCADD \n",
+ "2 3M SYNDROME;; LE MERRER SYNDROME;; DOLICHOSPONDYLIC DYSPLASIA;; GLOOMY FACE SYNDROME \n",
+ "3 3M SYNDROME 2 \n",
+ "4 3M SYNDROME 3 \n",
+ "\n",
+ " Included Title(s); symbols \n",
+ "0 NaN \n",
+ "1 NaN \n",
+ "2 YAKUT SHORT STATURE SYNDROME, INCLUDED \n",
+ "3 NaN \n",
+ "4 NaN "
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Merge dataframes\n",
+ "\n",
+ "# Ensure both p_mim and MIM Number columns are of the same data type (string in this example)\n",
+ "unique_and_key3_filtered_df['p_mim'] = unique_and_key3_filtered_df['p_mim'].astype(str)\n",
+ "mimTitles_df['MIM Number'] = mimTitles_df['MIM Number'].astype(str)\n",
+ "\n",
+ "# Perform the join based on p_mim and MIM Number\n",
+ "merged_df = unique_and_key3_filtered_df.merge(\n",
+ " mimTitles_df, left_on='p_mim', right_on='MIM Number', how='left'\n",
+ ")\n",
+ "\n",
+ "print(merged_df.nunique())\n",
+ "\n",
+ "merged_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "3fc9dba6-e5b6-425d-ae21-3d49a0fa85ca",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Phenotype 5340\n",
+ "Gene/Locus And Other Related Symbols 4014\n",
+ "Gene MIM 4014\n",
+ "Cyto Location 811\n",
+ "p_label 5339\n",
+ "p_mim 5340\n",
+ "p_mapping_key 1\n",
+ "Prefix 2\n",
+ "Phenotype MIM 5340\n",
+ "dtype: int64\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ " Phenotype \\\n",
+ "0 2,4-dienoyl-CoA reductase deficiency, 616034 (3) \n",
+ "1 2-methylbutyrylglycinuria, 610006 (3) \n",
+ "2 3-M syndrome 1, 273750 (3) \n",
+ "3 3-M syndrome 2, 612921 (3) \n",
+ "4 3-M syndrome 3, 614205 (3) \n",
+ "\n",
+ " Gene/Locus And Other Related Symbols Gene MIM Cyto Location \\\n",
+ "0 NADK2, C5orf33, DECRD 615787 5p13.2 \n",
+ "1 ACADSB, SBCAD 600301 10q26.13 \n",
+ "2 CUL7, 3M1 609577 6p21.1 \n",
+ "3 OBSL1, KIAA0657, 3M2 610991 2q35 \n",
+ "4 CCDC8, 3M3 614145 19q13.32 \n",
+ "\n",
+ " p_label p_mim p_mapping_key Prefix \\\n",
+ "0 2,4-dienoyl-CoA reductase deficiency, 616034 3 Number Sign \n",
+ "1 2-methylbutyrylglycinuria, 610006 3 Number Sign \n",
+ "2 3-M syndrome 1, 273750 3 Number Sign \n",
+ "3 3-M syndrome 2, 612921 3 Number Sign \n",
+ "4 3-M syndrome 3, 614205 3 Number Sign \n",
+ "\n",
+ " Phenotype MIM \n",
+ "0 616034 \n",
+ "1 610006 \n",
+ "2 273750 \n",
+ "3 612921 \n",
+ "4 614205 "
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Modify to keep only certain columns\n",
+ "\n",
+ "# Specify the columns you want to keep\n",
+ "columns_to_keep = ['Phenotype', 'Gene/Locus And Other Related Symbols', 'MIM Number_x', 'Cyto Location', 'p_label',\n",
+ " 'p_mim', 'p_mapping_key', 'Prefix', 'MIM Number_y']\n",
+ "\n",
+ "# Create a new DataFrame with only these columns\n",
+ "new_df = merged_df[columns_to_keep]\n",
+ "\n",
+ "# Re-name columns\n",
+ "new_df = new_df.rename(columns={\n",
+ " 'MIM Number_x': 'Gene MIM',\n",
+ " 'MIM Number_y': 'Phenotype MIM'\n",
+ "})\n",
+ "\n",
+ "print(new_df.nunique())\n",
+ "\n",
+ "new_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "dce503a0-e6a7-4c66-97b9-253db63f1796",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['Number Sign' 'Percent']\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Prefix has two values, let's see what these are:\n",
+ "\n",
+ "print(new_df['Prefix'].unique())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "d29e5ff2-c5e2-4066-ac06-4daf72fc84b1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Save to file\n",
+ "timestamp = time.time()\n",
+ "\n",
+ "new_df.to_csv(f'unique_and_key3_filtered_df_{timestamp}.tsv', sep='\\t', index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "199aba9a-ebc4-4037-bfd1-f14bdc666c54",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/data/exclusions-disease-gene.tsv b/data/exclusions-disease-gene.tsv
new file mode 100644
index 0000000..3d283a0
--- /dev/null
+++ b/data/exclusions-disease-gene.tsv
@@ -0,0 +1,8 @@
+omim_id mondo_id mondo_label orcid exclusion_reason_comment
+OMIM:603956 MONDO:0002974 cervical cancer' https://orcid.org/0000-0002-4142-7153 evidence of various genes involved
+OMIM:619151 MONDO:0030894 "AMED syndrome, digenic'" https://orcid.org/0000-0002-4142-7153 digenic
+OMIM:158901 MONDO:0008031 https://orcid.org/0000-0002-4142-7153 digenic
+OMIM:108770 MONDO:0007171 atrial standstill 1' https://orcid.org/0000-0002-4142-7153 digenic
+OMIM:620040 MONDO:0031057 "dyskeratosis congenita, digenic'" https://orcid.org/0000-0002-4142-7153 digenic
+OMIM:619478 MONDO:0030355 "facioscapulohumeral muscular dystrophy 4, digenic'" https://orcid.org/0000-0002-4142-7153 digenic
+OMIM:300818 MONDO:0010438 paroxysmal nocturnal hemoglobinuria 1 https://orcid.org/0000-0002-4142-7153 "disease caused by a somatic mutation, therefore a gene association stating this is due to a germline mutation should not be added"
\ No newline at end of file
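The exclusions file above is consumed in `main.py` through `get_d2g_exclusions_by_curator()`, whose implementation is not part of this diff. As a rough, hypothetical sketch of what such a loader might look like, the following reads a tab-separated table with the same column names and maps each bare MIM number to the curator's ORCID. The function name `load_exclusions`, the in-memory sample, and the shortened reason text are assumptions made for illustration only.

```python
import csv
import io
from typing import Dict, TextIO

# Hypothetical sample mirroring the columns of data/exclusions-disease-gene.tsv.
# The reason text in the second row is shortened for the example.
SAMPLE_TSV = (
    "omim_id\tmondo_id\tmondo_label\torcid\texclusion_reason_comment\n"
    "OMIM:158901\tMONDO:0008031\t\thttps://orcid.org/0000-0002-4142-7153\tdigenic\n"
    "OMIM:300818\tMONDO:0010438\tparoxysmal nocturnal hemoglobinuria 1\t"
    "https://orcid.org/0000-0002-4142-7153\tsomatic\n"
)


def load_exclusions(handle: TextIO) -> Dict[str, str]:
    """Map a bare MIM number (e.g. '158901') to the curator's ORCID URL."""
    reader = csv.DictReader(handle, delimiter="\t")
    exclusions: Dict[str, str] = {}
    for row in reader:
        # Drop the 'OMIM:' prefix so keys match the bare p_mim values used in main.py
        mim = row["omim_id"].split(":", 1)[-1].strip()
        exclusions[mim] = row["orcid"].strip()
    return exclusions


exclusions = load_exclusions(io.StringIO(SAMPLE_TSV))
print("158901" in exclusions)
```

Keying on the bare MIM number keeps the membership check in `main.py` (`p_mim in exclusions_p_mim_orcid_map`) a plain dictionary lookup. The real helper presumably also converts the ORCID to a `URIRef`, which is omitted here.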
diff --git a/makefile b/makefile
index aec5583..f300b5d 100644
--- a/makefile
+++ b/makefile
@@ -37,7 +37,7 @@ omim.owl: omim.ttl mondo_exactmatch_omim.sssom.owl mondo_exactmatch_omimps.sssom
# Create a TSV of relational information for gene and disease classes
mondo-omim-genes.tsv: omim.owl
- robot query -i omim.owl --query sparql/mondo-omim-genes.sparql $@
+ robot query -i $< --query sparql/mondo-omim-genes.sparql $@
# Create a TSV of relational information for gene and disease classes, as a ROBOT template
mondo-omim-genes.robot.tsv: mondo-omim-genes.tsv
@@ -69,7 +69,7 @@ get-pmids:
# SETUP / INSTALLATION -------------------------------------------------------------------------------------------------
install:
- pip install -r requirements-unlocked.txt
+ pip install -r requirements-unlocked.txt --user --break-system-packages
# QA / TESTING ---------------------------------------------------------------------------------------------------------
test:
diff --git a/omim2obo/config.py b/omim2obo/config.py
index 6d4802e..dd93118 100644
--- a/omim2obo/config.py
+++ b/omim2obo/config.py
@@ -9,6 +9,7 @@
DATA_DIR = ROOT_DIR / 'data'
ENV_PATH = ROOT_DIR / '.env'
REVIEW_CASES_PATH = ROOT_DIR / 'review.tsv'
+DISEASE_GENE_EXCLUSIONS_PATH = DATA_DIR / 'exclusions-disease-gene.tsv'
with open(DATA_DIR / 'dipper/GLOBAL_TERMS.yaml') as file:
GLOBAL_TERMS = yaml.safe_load(file)
@@ -20,6 +21,7 @@
# ReviewCase: See README.md for review class documentation
class ReviewCase(TypedDict):
+ """See README.md docs for: review.tsv"""
classCode: int
classShortName: str
value: str
diff --git a/omim2obo/main.py b/omim2obo/main.py
index 234562f..541db46 100644
--- a/omim2obo/main.py
+++ b/omim2obo/main.py
@@ -3,6 +3,10 @@
Resources
- https://monarch-initiative.github.io/monarch-ingest/Sources/OMIM/
+FYIs
+"Included Title(s)" in mimTitles.txt is the same as the "Other entities represented in this entry" section in omim.org
+entry pages.
+
Steps
- Loads prefixes
- Parses mimTitles.txt
@@ -27,6 +31,7 @@
- links omim entry (~gene) to phenotype
- I thought phenotypic series did something like the sme?
- links omim entry to chromosome location
+ - Add disease-gene associations
- Parses BioPortal's omim.ttl
At least, I think that's where that .ttl file comes from. Adds following info to graph:
- pmid info
@@ -41,21 +46,25 @@
A tab-delimited file with purpose unknown to me (Joe), but has mappings between HGNC symbols and IDs.
- Get HGNC symbol::id mappings.
todo: The downloads should all happen at beginning of script
+todo: This was last updated 4/2022 and no longer fully describes everything that happens.
Assumptions
1. Mappings obtained from official OMIM files as described above are interpreted correctly (e.g. skos:exactMatch).
"""
+from typing import Optional, Set
+
import yaml
from hashlib import md5
from rdflib import Graph, RDF, OWL, RDFS, Literal, BNode, URIRef, SKOS
from rdflib.term import Identifier
-from omim2obo.config import REVIEW_CASES_PATH, ROOT_DIR, GLOBAL_TERMS, ReviewCase
+from omim2obo.config import REVIEW_CASES_PATH, ROOT_DIR, GLOBAL_TERMS
from omim2obo.namespaces import *
-from omim2obo.parsers.omim_entry_parser import get_alt_labels, get_pubs, get_mapped_ids, LabelCleaner
+from omim2obo.parsers.omim_entry_parser import REVIEW_CASES, cleanup_title, get_alt_and_included_titles_and_symbols, \
+ get_pubs, get_mapped_ids, log_review_cases, recapitalize_acronyms_in_titles
from omim2obo.parsers.omim_txt_parser import * # todo: change to specific imports
-
+from omim2obo.utils.utils import get_d2g_exclusions_by_curator
# Vars
OUTPATH = os.path.join(ROOT_DIR / 'omim.ttl')
@@ -64,7 +73,6 @@
LOG = logging.getLogger(__name__)
LOG.setLevel(logging.DEBUG)
LOG.addHandler(logging.StreamHandler(sys.stdout))
-REVIEW_CASES: List[ReviewCase] = []
# Funcs
@@ -76,6 +84,35 @@ def get_curie_maps():
return maps
+def add_axiom_annotations(
+ graph: Graph, source: URIRef, prop: URIRef, target: Union[Literal, str, URIRef],
+ anno_pred_vals: List[Tuple[URIRef, Union[Literal, str, URIRef]]]
+):
+ """Add an axiom annotation to the graph."""
+ target = Literal(target) if type(target) is str else target
+
+ axiom = BNode()
+ graph.add((axiom, RDF.type, OWL.Axiom))
+ graph.add((axiom, OWL.annotatedSource, source))
+ graph.add((axiom, OWL.annotatedProperty, prop))
+ graph.add((axiom, OWL.annotatedTarget, target))
+ for pred, val in anno_pred_vals:
+ val = Literal(val) if type(val) is str else val
+ graph.add((axiom, pred, val))
+
+
+def add_triple_and_optional_annotations(
+ graph: Graph, source: URIRef, prop: URIRef, target: Union[Literal, str, URIRef],
+ anno_pred_vals: Optional[List[Tuple[URIRef, Union[Literal, str, URIRef]]]] = None
+):
+ """Add a triple and optional annotations to the graph."""
+ target = Literal(target) if type(target) is str else target
+
+ graph.add((source, prop, target))
+ if anno_pred_vals:
+ add_axiom_annotations(graph, source, prop, target, anno_pred_vals)
+
+
def add_subclassof_restriction(graph: Graph, predicate: URIRef, some_values_from: URIRef, on: URIRef) -> BNode:
"""Creates a subClassOf someValuesFrom restriction"""
b = BNode()
@@ -86,21 +123,22 @@ def add_subclassof_restriction(graph: Graph, predicate: URIRef, some_values_from
return b
-def add_subclassof_restriction_with_evidence(
- graph: Graph, predicate: URIRef, some_values_from: URIRef, on: URIRef, evidence: Union[str, Literal]
+def add_subclassof_restriction_with_evidence_and_source(
+ graph: Graph, predicate: URIRef, some_values_from: URIRef, on: URIRef, evidence: Union[str, Literal],
+ source: Optional[URIRef] = None,
):
"""Creates a subClassOf someValuesFrom restriction, and annotates it with evidence and, optionally, a source."""
evidence = Literal(evidence) if type(evidence) is str else evidence
# Add restriction on MIM class
b: BNode = add_subclassof_restriction(graph, predicate, some_values_from, on)
# Add axiom to restriction
- b2 = BNode()
- graph.add((b2, RDF['type'], OWL['Axiom']))
- graph.add((b2, OWL['annotatedSource'], on))
- graph.add((b2, OWL['annotatedProperty'], RDFS['subClassOf']))
- graph.add((b2, OWL['annotatedTarget'], b))
- graph.add((b2, BIOLINK['has_evidence'], evidence))
- graph.add((b2, RDFS['comment'], evidence))
+ annotation_pred_vals = [
+ (BIOLINK['has_evidence'], evidence),
+ (RDFS['comment'], evidence)
+ ]
+ annotation_pred_vals += [(oboInOwl.source, source)] if source else []
+
+ add_axiom_annotations(graph, on, RDFS['subClassOf'], b, annotation_pred_vals)
# Classes
@@ -134,7 +172,6 @@ def get_graph():
TAX_ID = GLOBAL_TERMS[TAX_LABEL]
TAX_URI = URIRef(NCBITAXON + TAX_ID.split(':')[1])
CURIE_MAP = get_curie_maps()
-label_cleaner = LabelCleaner()
CONFIG = {
'verbose': False
}
@@ -164,6 +201,7 @@ def omim2obo(use_cache: bool = False):
# - Non-OMIM triples
graph.add((URIRef('http://purl.obolibrary.org/obo/mondo/omim.owl'), RDF.type, OWL.Ontology))
graph.add((URIRef(oboInOwl.hasSynonymType), RDF.type, OWL.AnnotationProperty))
+ graph.add((URIRef(oboInOwl.source), RDF.type, OWL.AnnotationProperty))
graph.add((URIRef(MONDONS.omim_included), RDF.type, OWL.AnnotationProperty))
graph.add((URIRef(OMO['0003000']), RDF.type, OWL.AnnotationProperty))
graph.add((BIOLINK['has_evidence'], RDF.type, OWL.AnnotationProperty))
@@ -189,27 +227,31 @@ def omim2obo(use_cache: bool = False):
continue
# - Non-deprecated
- # Parse titles
- omim_type, pref_labels_str, alt_labels, inc_labels = omim_type_and_titles[omim_id]
- other_labels = []
- cleaned_inc_labels = []
- label_endswith_included_alt = False
- label_endswith_included_inc = False
- pref_labels: List[str] = [x.strip() for x in pref_labels_str.split(';')]
- pref_title: str = pref_labels[0]
- pref_symbols: List[str] = pref_labels[1:]
- if alt_labels:
- cleaned_alt_labels, label_endswith_included_alt = get_alt_labels(alt_labels)
- other_labels += cleaned_alt_labels
- if inc_labels:
- cleaned_inc_labels, label_endswith_included_inc = get_alt_labels(inc_labels)
- # other_labels += cleaned_inc_labels # deactivated 7/2024 in favor of alternative for tagging 'included'
+ # Parse titles & symbols
+ omim_type, pref_titles_str, alt_titles_str, inc_titles_str = omim_type_and_titles[omim_id]
+ pref_titles_and_symbols: List[str] = [x.strip() for x in pref_titles_str.split(';')]
+ pref_title, pref_symbols = cleanup_title(pref_titles_and_symbols[0]), pref_titles_and_symbols[1:]
+ alt_titles, alt_symbols, former_alt_titles, former_alt_symbols = \
+ get_alt_and_included_titles_and_symbols(alt_titles_str)
+ included_titles, included_symbols, former_included_titles, former_included_symbols = \
+ get_alt_and_included_titles_and_symbols(inc_titles_str)
+ included_is_included = included_titles or included_symbols # redundant. can't be included symbol w/out title
+
+ # Recapitalize acronyms in titles
+ all_abbrevs: Set[str] = \
+ set(pref_symbols + alt_symbols + former_alt_symbols + included_symbols + former_included_symbols)
+ # todo: consider DRYing to 1 call by passing all 5 title types to a wrapper function
+ pref_title = recapitalize_acronyms_in_titles(pref_title, all_abbrevs)
+ alt_titles = recapitalize_acronyms_in_titles(alt_titles, all_abbrevs)
+ former_alt_titles = recapitalize_acronyms_in_titles(former_alt_titles, all_abbrevs)
+ included_titles = recapitalize_acronyms_in_titles(included_titles, all_abbrevs)
+ former_included_titles = recapitalize_acronyms_in_titles(former_included_titles, all_abbrevs)
# Special cases depending on OMIM term type
is_gene = omim_type == OmimType.GENE or omim_type == OmimType.HAS_AFFECTED_FEATURE
if omim_type == OmimType.HERITABLE_PHENOTYPIC_MARKER: # '%' char
graph.add((omim_uri, BIOLINK['category'], BIOLINK['Disease']))
- elif is_gene: # * or + chars
+ elif is_gene: # Represented by: * or + chars
graph.add((omim_uri, RDFS.subClassOf, SO['0000704'])) # gene
graph.add((omim_uri, MONDO.exclusionReason, MONDO.nonDisease))
graph.add((omim_uri, BIOLINK['category'], BIOLINK['Gene']))
@@ -223,36 +265,54 @@ def omim2obo(use_cache: bool = False):
gene_label_err = 'Warning: Only 1 symbol picked for label for gene term, but there were 2 to choose ' \
f'from. Unsure which is best. Picking the first.\nhttps://omim.org/entry/{omim_id} - {pref_symbols}'
if len(pref_symbols) > 1:
- LOG.warning(gene_label_err) # todo: decide the best way to handle these situations
+ LOG.warning(gene_label_err) # todo: rare (n=1?), but decide the best way to handle these situations
graph.add((omim_uri, RDFS.label, Literal(pref_symbols[0])))
else:
- graph.add((omim_uri, RDFS.label, Literal(label_cleaner.clean(pref_title))))
-
- # todo: .clean()/.cleanup_label() 2nd param `explicit_abbrev` should be List[str] instead of str. And below,
- # should pass all symbols/abbrevs from each of preferred, alt, included each time it is called. If no symbols
- # for given term, should pass empty list. See: https://github.com/monarch-initiative/omim/issues/129
- abbrev: Union[str, None] = None if not pref_symbols else pref_symbols[0]
+ graph.add((omim_uri, RDFS.label, Literal(pref_title)))
# Add synonyms
- graph.add((omim_uri, oboInOwl.hasExactSynonym, Literal(label_cleaner.clean(pref_title, abbrev))))
- for alt_label in other_labels:
- graph.add((omim_uri, oboInOwl.hasExactSynonym, Literal(label_cleaner.clean(alt_label, abbrev))))
- for abbreviation in pref_symbols:
- graph.add((omim_uri, oboInOwl.hasExactSynonym, Literal(abbreviation)))
- # Reify on abbreviations. See: https://github.com/monarch-initiative/omim/issues/2
- axiom = BNode()
- graph.add((axiom, RDF.type, OWL.Axiom))
- graph.add((axiom, OWL.annotatedSource, omim_uri))
- graph.add((axiom, OWL.annotatedProperty, oboInOwl.hasExactSynonym))
- graph.add((axiom, OWL.annotatedTarget, Literal(abbreviation)))
- graph.add((axiom, oboInOwl.hasSynonymType, OMO['0003000']))
-
- # Add 'included' entry properties
- included_detected_comment = "This term has one or more labels that end with ', INCLUDED'."
- if label_endswith_included_alt or label_endswith_included_inc:
- graph.add((omim_uri, RDFS['comment'], Literal(included_detected_comment)))
- for included_label in cleaned_inc_labels:
- graph.add((omim_uri, URIRef(MONDONS.omim_included), Literal(label_cleaner.clean(included_label, abbrev))))
+ # - exact titles
+ graph.add((omim_uri, oboInOwl.hasExactSynonym, Literal(pref_title)))
+ for title in alt_titles:
+ graph.add((omim_uri, oboInOwl.hasExactSynonym, Literal(title)))
+ # - exact abbreviations
+ for abbrevs in [pref_symbols, alt_symbols]:
+ for abbreviation in abbrevs:
+ add_triple_and_optional_annotations(graph, omim_uri, oboInOwl.hasExactSynonym, abbreviation,
+ [(oboInOwl.hasSynonymType, OMO['0003000'])])
+ # - related, deprecated 'former' titles
+ for title in former_alt_titles:
+ add_triple_and_optional_annotations(graph, omim_uri, oboInOwl.hasRelatedSynonym, title,
+ [(OWL.deprecated, Literal(True))])
+ # - related, deprecated 'former' abbreviations
+ for abbreviation in former_alt_symbols:
+ add_triple_and_optional_annotations(graph, omim_uri, oboInOwl.hasRelatedSynonym, abbreviation,
+ [(OWL.deprecated, Literal(True)), (oboInOwl.hasSynonymType, OMO['0003000'])])
+
+ # Add 'included' entries
+ # - comment
+ if included_is_included:
+ included_comment = "This term has one or more labels that end with ', INCLUDED'."
+ graph.add((omim_uri, RDFS['comment'], Literal(included_comment)))
+ # - titles
+ for title in included_titles:
+ graph.add((omim_uri, URIRef(MONDONS.omim_included), Literal(title)))
+ # - symbols
+ for symbol in included_symbols:
+ add_triple_and_optional_annotations(graph, omim_uri, URIRef(MONDONS.omim_included), symbol, [
+ # Though these are abbreviations, MONDONS.omim_included is not a synonym type, so can't add axiom:
+ # (oboInOwl.hasSynonymType, OMO['0003000'])
+ ])
+ # - deprecated, 'former'
+ for title in former_included_titles:
+ add_triple_and_optional_annotations(graph, omim_uri, URIRef(MONDONS.omim_included), title,
+ [(OWL.deprecated, Literal(True))])
+ for symbol in former_included_symbols:
+ add_triple_and_optional_annotations(graph, omim_uri, URIRef(MONDONS.omim_included), symbol, [
+ (OWL.deprecated, Literal(True)),
+ # Though these are abbreviations, MONDONS.omim_included is not a synonym type, so can't add axiom:
+ # (oboInOwl.hasSynonymType, OMO['0003000'])
+ ])
# Gene ID
# Why is 'skos:exactMatch' appropriate for disease::gene relationships? - joeflack4 2022/06/06
@@ -304,13 +364,15 @@ def omim2obo(use_cache: bool = False):
'gene_id': gene_mim, 'phenotype_label': p_lab, 'mapping_key': p_map_key, 'mapping_label': p_map_lab})
# - Add relations (subclass restrictions)
+ exclusions_p_mim_orcid_map = get_d2g_exclusions_by_curator()
for p_mim, assocs in phenotype_genes.items():
for assoc in assocs:
gene_mim, p_lab, p_map_key, p_map_lab = assoc['gene_id'], assoc['phenotype_label'], \
assoc['mapping_key'], assoc['mapping_label']
evidence = f'Evidence: ({p_map_key}) {p_map_lab}'
+ p_mim_excluded = p_mim in exclusions_p_mim_orcid_map
- # General skippable cases
+ # Skip: No phenotype or unknown defect
# - not p_mim: Skip because not an association to another MIM (Provenance:
# https://github.com/monarch-initiative/omim/issues/78)
# - p_map_key == '1': Skip because association w/ unknown defect (Provenance:
@@ -318,40 +380,33 @@ def omim2obo(use_cache: bool = False):
if not p_mim or p_map_key == '1':
continue
- # Gene->Disease non-causal relationships
+ # Add restrictions: Gene->Disease non-causal / non-disease-defining relationships
# - RO:0003302 docs: see MORBIDMAP_PHENOTYPE_MAPPING_KEY_PREDICATES
- if p_map_key != '3': # 3 = 'causal'. Handled separately below.
- g2d_pred = MORBIDMAP_PHENOTYPE_MAPPING_KEY_PREDICATES[p_map_key] if len(assocs) == 1 else RO['0003302']
- add_subclassof_restriction_with_evidence(graph, g2d_pred, OMIM[p_mim], OMIM[gene_mim], evidence)
-
- # Disease->Gene & Gene->Disease: Causal relationships
- # - Skip non-causal cases
- # - 3: The molecular basis for the disorder is known; a mutation has been found in the gene.
- if len(assocs) > 1 or p_map_key != '3' or not p2g_is_definitive(p_lab):
+ # - Mapping key 3 = 'causal' (disease-defining). Handled separately below.
+ if p_map_key != '3' or p_mim_excluded:
+ g2d_pred = MORBIDMAP_PHENOTYPE_MAPPING_KEY_PREDICATES[p_map_key] \
+ if len(assocs) == 1 and not p_mim_excluded \
+ else RO['0003302']
+ orcid: Optional[URIRef] = exclusions_p_mim_orcid_map[p_mim] if p_mim_excluded else None
+ add_subclassof_restriction_with_evidence_and_source(
+ graph, g2d_pred, OMIM[p_mim], OMIM[gene_mim], evidence, orcid)
continue
- # - Digenic: Should technically be none marked 'digenic' if only 1 association, but there are.
- if 'digenic' in p_lab.lower():
- # noinspection PyTypeChecker typecheck_fail_old_Python
- REVIEW_CASES.append({
- "classCode": 1,
- "classShortName": "causalD2gButMarkedDigenic",
- "value": f"OMIM:{p_mim}: {p_lab} (Gene: OMIM:{gene_mim})",
- })
- p_mim_type: str = omim_types[p_mim] # Allowable: PHENOTYPE, HERITABLE_PHENOTYPIC_MARKER (#, %)
- mim_type_err = f"Warning: Unexpected MIM type {p_mim_type} for Phenotype {p_mim} when parsing phenotype-" \
- f"disease relationships. Skipping."
- if p_mim_type in ('OBSOLETE', 'SUSPECTED', 'HAS_AFFECTED_FEATURE'): # ^, NULL, +
- print(mim_type_err, file=sys.stderr) # Hasn't happened. Failsafe.
- if p_mim_type == 'GENE': # *
- print(mim_type_err, file=sys.stderr) # OMIM recognized as data quality issue. Fixed 2024/11. Failsafe.
-
- # Disease --(RO:0004003 'has material basis in germline mutation in')--> Gene
- # https://www.ebi.ac.uk/ols4/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0004003
- add_subclassof_restriction_with_evidence(
+
+ # Skip non-disease-defining (non-causal) cases
+ if len(assocs) > 1 or not p2g_is_definitive(p_lab): # p_map_key != '3' or p_mim_excluded: handled above
+ continue
+
+ # Log review.tsv cases
+ log_review_cases(p_mim, p_lab, p_map_key, gene_mim, gene_phenotypes, omim_types)
+
+ # Add restrictions: Disease-defining ('causal germline mutation')
+ # - Disease --(RO:0004003 'has material basis in germline mutation in')--> Gene
+ # https://www.ebi.ac.uk/ols4/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0004003
+ add_subclassof_restriction_with_evidence_and_source(
graph, RO['0004003'], OMIM[gene_mim], OMIM[p_mim], evidence)
- # Gene --(RO:0004013 'is causal germline mutation in')--> Disease
- # https://www.ebi.ac.uk/ols4/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0004013
- add_subclassof_restriction_with_evidence(
+ # - Gene --(RO:0004013 'is causal germline mutation in')--> Disease
+ # https://www.ebi.ac.uk/ols4/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0004013
+ add_subclassof_restriction_with_evidence_and_source(
graph, RO['0004013'], OMIM[p_mim], OMIM[gene_mim], evidence)
# PUBMED, UMLS
@@ -379,7 +434,8 @@ def omim2obo(use_cache: bool = False):
for orphanet_id in orphanet_ids:
graph.add((OMIM[mim_number], SKOS.exactMatch, ORPHANET[orphanet_id]))
- review_df = pd.DataFrame(REVIEW_CASES) # todo: ensure comment field exists even when no row uses
+ # todo: ensure comment field exists even when no row uses
+ review_df = pd.DataFrame(REVIEW_CASES).sort_values(by=['classCode', 'value'])
review_df.to_csv(REVIEW_CASES_PATH, index=False, sep='\t')
with open(OUTPATH, 'w') as f:
f.write(graph.serialize(format='turtle'))
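The branching in the disease-gene loop above can be summarized as a small decision function. This is a simplified sketch of the control flow as it reads in this diff, not the project's code: `classify_association` is a hypothetical name, `is_definitive` stands in for the result of `p2g_is_definitive(p_lab)`, and `excluded` stands in for `p_mim in exclusions_p_mim_orcid_map`.

```python
from typing import Optional


def classify_association(
    p_mim: Optional[str], p_map_key: str, n_assocs: int,
    is_definitive: bool, excluded: bool,
) -> str:
    """Return 'skip', 'non_causal', or 'causal' for one gene-phenotype association."""
    # No phenotype MIM, or mapping key 1 (unknown defect): nothing is added
    if not p_mim or p_map_key == "1":
        return "skip"
    # Any key other than 3, or a curator-excluded phenotype: only a
    # Gene->Disease non-causal restriction is added
    if p_map_key != "3" or excluded:
        return "non_causal"
    # Key 3 but not disease-defining: several associations or a non-definitive label
    if n_assocs > 1 or not is_definitive:
        return "skip"
    # Key 3, single association, definitive label, not excluded
    return "causal"


print(classify_association("616034", "3", 1, True, False))
```

Note that an excluded phenotype with mapping key 3 falls into the non-causal branch, which is where the curator's ORCID is attached as the axiom's source.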
diff --git a/omim2obo/namespaces.py b/omim2obo/namespaces.py
index 268f2dc..7212e41 100644
--- a/omim2obo/namespaces.py
+++ b/omim2obo/namespaces.py
@@ -102,6 +102,7 @@
# publication/citation/reference sources
DOI = Namespace('http://dx.doi.org/') # Digital Object identifier
GENEREVIEWS = Namespace('http://www.ncbi.nlm.nih.gov/books/') # NCBI gene and diseases
+ORCID = Namespace('https://orcid.org/') # Open Researcher and Contributor ID
# more bogus IRIs
ISBN = Namespace('https://monarchinitiative.org/ISBN_') # International Standard Book Number
ISBN_10 = Namespace('https://monarchinitiative.org/ISBN10_') # Same as ISBN has 10 digits pre 2007
diff --git a/omim2obo/parsers/omim_entry_parser.py b/omim2obo/parsers/omim_entry_parser.py
index b97a618..5d1fa95 100644
--- a/omim2obo/parsers/omim_entry_parser.py
+++ b/omim2obo/parsers/omim_entry_parser.py
@@ -1,25 +1,63 @@
"""OMIM Entry parsers"""
import csv
import logging
-# import re
from collections import defaultdict
-from copy import copy
-from typing import List, Dict, Tuple
+from typing import List, Dict, Set, Tuple, Union
import pandas as pd
from rdflib import Graph, RDF, RDFS, DC, Literal, OWL, SKOS, URIRef
-# from rdflib import Namespace
-from omim2obo.config import DATA_DIR
+from omim2obo.config import DATA_DIR, ReviewCase
from omim2obo.omim_type import OmimType, get_omim_type
from omim2obo.namespaces import *
from omim2obo.utils.romanplus import *
LOG = logging.getLogger('omim2obo.parsers.api_entry_parser')
+REVIEW_SELF_REF_CASE_I = 0
+REVIEW_CASES: List[ReviewCase] = []
+REVIEW_CASE_NAME_MAP: Dict[int, str] = {
+ 1: "D2G: digenic",
+ 2: "D2G: self-referential",
+ 3: "D2G: somatic",
+ 4: "D2G: Phenotype is gene",
+ 5: "D2G: Phenotype type error",
+}
+
+def get_known_capitalizations() -> Dict[str, str]:
+ """Get a mapping of known capitalizations for proper names, acronyms, and the like.
+ todo: Contains space-delimited words, e.g. "vitamin d". The way that
+ cleanup_label is currently implemented, each word in the label gets
+ replaced; i.e. it would try to replace "vitamin" and "d" separately. Hence,
+ this would fail.
+ Therefore, we should probably do this in 2 different operations: (1) use
+ the current 'word replacement' logic, but also, (2), at the end, do a
+ generic string replacement (e.g. my_str.replace(a, b)). When implementing
+ (2), we should also split this dictionary into two separate dictionaries,
+ each for 1 of these 2 different purposes.
+
+ todo: known_capitalizations.tsv could possibly be refactored. It really only needs 1 column, the case to replace. The
+ pattern column is not used, and the first column (lowercase) can be computed by using .lower() on the case to
+ replace. We could also leave it as-is since this file is shared elsewhere in the project infrastructure, though I do
+ not know its source-of-truth location.
+ """
+ path = DATA_DIR / 'known_capitalizations.tsv'
+ with open(path, "r") as file:
+ data_io = csv.reader(file, delimiter="\t")
+ data: List[List[str]] = [x for x in data_io]
+ df = pd.DataFrame(data[1:], columns=data[0])
+ d = {}
+ for index, row in df.iterrows():
+ d[row['lower_name']] = row['cap_name']
+ return d
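The two-step approach proposed in the docstring todo above could look roughly like the following. This is a minimal sketch, not the project's implementation: `apply_capitalizations`, `word_map`, and `phrase_map` are hypothetical names, and the sample entries are made up for illustration.

```python
from typing import Dict


def apply_capitalizations(label: str, word_map: Dict[str, str], phrase_map: Dict[str, str]) -> str:
    """Hypothetical helper: word-level replacement first, then phrase-level replacement."""
    # (1) Word-by-word lookup, the same idea as the current 'word replacement' logic
    words = [word_map.get(word, word) for word in label.split()]
    result = ' '.join(words)
    # (2) Generic string replacement for space-delimited entries such as "vitamin d"
    for lower_phrase, cap_phrase in phrase_map.items():
        result = result.replace(lower_phrase, cap_phrase)
    return result


# Made-up entries; the real ones would come from known_capitalizations.tsv
word_map = {'down': 'Down'}
phrase_map = {'vitamin d': 'vitamin D'}
print(apply_capitalizations('vitamin d deficiency in down syndrome', word_map, phrase_map))
# -> vitamin D deficiency in Down syndrome
```

Note that plain `str.replace()` matches substrings, so "vitamin d" would also match the start of "vitamin deficiency". A word-boundary regular expression may be a safer choice for step (2).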
+
+
+CAPITALIZATION_REPLACEMENTS: Dict[str, str] = get_known_capitalizations()
# todo: This isn't used in the ingest to create omim.ttl. Did this have some other use case?
+# - If working on this again, remove these noinspection warning suppressions and address them
+# noinspection PyUnusedLocal,PyUnboundLocalVariable,PyTypeChecker
def transform_entry(entry) -> Graph:
"""
Transforms an OMIM API entry to a graph.
@@ -38,10 +76,10 @@ def transform_entry(entry) -> Graph:
omim_uri = URIRef(OMIM[omim_num])
other_labels = []
if 'alternativeTitles' in titles:
- cleaned, label_endswith_included = get_alt_labels(titles['alternativeTitles'])
+ cleaned, label_endswith_included = parse_title_symbol_pairs(titles['alternativeTitles'])
other_labels += cleaned
if 'includedTitles' in titles:
- cleaned, label_endswith_included = get_alt_labels(titles['includedTitles'])
+ cleaned, label_endswith_included = parse_title_symbol_pairs(titles['includedTitles'])
other_labels += cleaned
graph.add((omim_uri, RDF.type, OWL.Class))
@@ -49,7 +87,7 @@ def transform_entry(entry) -> Graph:
abbrev = label.split(';')[1].strip() if ';' in label else None
if omim_type == OmimType.HERITABLE_PHENOTYPIC_MARKER.value: # %
- graph.add((omim_uri, RDFS.label, Literal(cleanup_label(label))))
+ graph.add((omim_uri, RDFS.label, Literal(cleanup_title(label))))
graph.add((omim_uri, BIOLINK['category'], BIOLINK['Disease']))
elif omim_type == OmimType.GENE.value or omim_type == OmimType.HAS_AFFECTED_FEATURE.value: # * or +
omim_type = OmimType.GENE.value
@@ -57,10 +95,10 @@ def transform_entry(entry) -> Graph:
graph.add((omim_uri, RDFS.subClassOf, SO['0000704']))
graph.add((omim_uri, BIOLINK['category'], BIOLINK['Gene']))
elif omim_type == OmimType.PHENOTYPE.value: # #
- graph.add((omim_uri, RDFS.label, Literal(cleanup_label(label))))
+ graph.add((omim_uri, RDFS.label, Literal(cleanup_title(label))))
graph.add((omim_uri, BIOLINK['category'], BIOLINK['Disease']))
else: # ^ or NULL (no prefix character)
- graph.add((omim_uri, RDFS.label, Literal(cleanup_label(label))))
+ graph.add((omim_uri, RDFS.label, Literal(cleanup_title(label))))
graph.add((omim_uri, oboInOwl.hasExactSynonym, Literal(label)))
for label in other_labels:
@@ -108,11 +146,11 @@ def transform_entry(entry) -> Graph:
for phenotypic_serie in get_phenotypic_series(entry):
if omim_type == OmimType.HERITABLE_PHENOTYPIC_MARKER.value or omim_type == OmimType.PHENOTYPE.value:
graph.add((omim_uri, RDFS.subClassOf, OMIMPS[phenotypic_serie]))
- elif omim_type == OmimType.GENE.vaule or omim_type == OmimType.HAS_AFFECTED_FEATURE.value:
+ elif omim_type == OmimType.GENE.value or omim_type == OmimType.HAS_AFFECTED_FEATURE.value:
graph.add((omim_uri, RO['0003304'], OMIMPS[phenotypic_serie]))
# NCBI ENTREZ Gene IDs
- if omim_type == OmimType.GENE.value or omim_type == OmimType.HAS_AFFECTED_FEATURE.value:
+ if omim_type == OmimType.GENE.value or omim_type == OmimType.HAS_AFFECTED_FEATURE.value:
for gene_id in get_mapped_gene_ids(entry):
graph.add((omim_uri, OWL.equivalentClass, NCBIGENE[gene_id]))
@@ -122,15 +160,12 @@ def transform_entry(entry) -> Graph:
return graph
-def _detect_abbreviations(
- label: str,
- explicit_abbrev: str = None,
- trailing_abbrev: str = None,
- CAPITALIZATION_THRESHOLD = 0.75
-):
+def detect_abbreviations(label: str, capitalization_threshold=0.75) -> List[str]:
"""Detect possible abbreviations / acronyms"""
# Compile regexp
+ # todo: handle several warnings: {1} redundant, {1,} simplified to +
acronyms_without_periods_compiler = re.compile('[A-Z]{1}[A-Z0-9]{1,}')
+ # todo: PyCharm flagged next 2 lines as invalid escape sequence, but this code seems to work? Should double check
acronyms_with_periods_compiler = re.compile('[A-Z]{1}\.([A-Z0-9]\.){1,}')
title_cased_abbrev_compiler = re.compile('[A-Z]{1}[a-zA-Z]{1,}\.')
@@ -142,99 +177,72 @@ def _detect_abbreviations(
if word.upper() == word:
fully_capitalized_count += 1
is_largely_uppercase = \
- fully_capitalized_count / len(words) >= CAPITALIZATION_THRESHOLD
+ fully_capitalized_count / len(words) >= capitalization_threshold
- # Detect acronyms without periods
+ # Detect cases
if is_largely_uppercase:
acronyms_without_periods = [] # can't infer because everything was uppercase
else:
- acronyms_without_periods = acronyms_without_periods_compiler.findall(label)
- # Detect more
- title_cased_abbrevs = title_cased_abbrev_compiler.findall(label)
- acronyms_with_periods = acronyms_with_periods_compiler.findall(label)
- # Combine list of things to re-format
- replacements = []
- candidates: List[List[str]] = [
- acronyms_with_periods, acronyms_without_periods, title_cased_abbrevs,
- [trailing_abbrev], [explicit_abbrev]]
- for item_list in candidates:
- for item in item_list:
- if item:
- replacements.append(item)
-
- return replacements
-
-
-# todo: explicit_abbrev: Change to List[str]. See: https://github.com/monarch-initiative/omim/issues/129
-def cleanup_label(
- label: str,
- explicit_abbrev: str = None,
- replacement_case_method: str = 'lower', # lower | title | upper
- replacement_case_method_acronyms = 'upper', # lower | title | upper
- conjunctions: List[str] = ['and', 'but', 'yet', 'for', 'nor', 'so'],
- little_preps: List[str] = [
- 'at', 'by', 'in', 'of', 'on', 'to', 'up', 'as', 'it', 'or'],
- articles: List[str] = ['a', 'an', 'the'],
- CAPITALIZATION_THRESHOLD = 0.75,
- word_replacements: Dict[str, str] = None # w/ known cols
+ acronyms_without_periods: List[str] = acronyms_without_periods_compiler.findall(label)
+ title_cased_abbrevs: List[str] = title_cased_abbrev_compiler.findall(label)
+ acronyms_with_periods: List[str] = acronyms_with_periods_compiler.findall(label)
+
+ return acronyms_with_periods + acronyms_without_periods + title_cased_abbrevs
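For reference, the detection logic above can be sketched with raw-string patterns, which should address the escape-sequence and `{1}` / `{1,}` warnings mentioned in the todos. This is a standalone sketch under assumptions: `detect` and the pattern constant names are hypothetical. It also swaps in a non-capturing group for the period-delimited pattern, because `findall()` returns only the group contents when a pattern contains a capture group; that is a behavioral difference from the diff.

```python
import re
from typing import List

# Raw strings avoid invalid-escape warnings; `+` replaces `{1,}` and the redundant `{1}` is dropped
ACRONYM_NO_PERIODS = re.compile(r'[A-Z][A-Z0-9]+')
ACRONYM_WITH_PERIODS = re.compile(r'[A-Z]\.(?:[A-Z0-9]\.)+')  # non-capturing, so findall() yields full matches
TITLE_CASED_ABBREV = re.compile(r'[A-Z][a-zA-Z]+\.')


def detect(label: str, capitalization_threshold: float = 0.75) -> List[str]:
    """Hypothetical stand-in for detect_abbreviations()."""
    words = label.split()
    if not words:
        return []
    fully_capitalized = sum(1 for word in words if word.upper() == word)
    is_largely_uppercase = fully_capitalized / len(words) >= capitalization_threshold
    # If nearly everything is uppercase, period-less acronyms cannot be inferred
    no_periods = [] if is_largely_uppercase else ACRONYM_NO_PERIODS.findall(label)
    return ACRONYM_WITH_PERIODS.findall(label) + no_periods + TITLE_CASED_ABBREV.findall(label)


print(detect('Noack syndrome type V caused by FGFR2'))  # -> ['FGFR2']
print(detect('NOACK SYNDROME'))  # -> []
```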
+
+
+# todo: rename? It's doing more than cleaning; it's mutating
+def cleanup_title(
+ title: str,
+ replacement_case_method: str = 'lower', # 'upper', 'title', 'lower', 'capitalize' (=sentence case)
+ conjunctions: List[str] = ['and', 'but', 'yet', 'for', 'nor', 'so'],
+ little_preps: List[str] = ['at', 'by', 'in', 'of', 'on', 'to', 'up', 'as', 'it', 'or'],
+ articles: List[str] = ['a', 'an', 'the'],
+ word_replacements: Dict[str, str] = CAPITALIZATION_REPLACEMENTS,
) -> str:
- """
- Reformat the ALL CAPS OMIM labels to something more pleasant to read.
- This will:
- 1. remove the abbreviation suffixes
- 2. convert the roman numerals to integer numbers
- 3. make the text title case,
- except for suplied conjunctions/prepositions/articles
+ """Reformat the ALL CAPS OMIM labels to something more pleasant to read.
+
+ :param title: A preferred, alternative, or included title.
- Resources
- - https://pythex.org/
+ 1. Converts roman numerals to arabic
+ 2. Makes the text adhere to the case of `replacement_case_method`, except for supplied
+ conjunctions, prepositions, and articles, which will always be lowercased. NOTE: The default for this is 'lower',
+ meaning that this operation by default does nothing.
Assumptions:
- 1. All acronyms are capitalized
-
- # TODO Laters:
- # 1: Find a pattern for hyphenated types, and maintain acronym capitalization
- # ...e.g. MITF-related melanoma and renal cell carcinoma predisposition syndrome
- # ...e.g. ATP1A3-associated neurological disorder
- # 2. Make pattern for chromosomes
- # ...agonadism, 46,XY, with intellectual disability, short stature, retarded bone age, and multiple extragenital malformations
- # ...Chromosome special formatting capitalization?
- # ...There seems to be special formatting for chromosome refs; they have a comma in the middle, but with no space
- # ...after the comma, though some places I saw on the internet contained a space.
- # ...e.g. "46,XY" in: agonadism, 46,XY, with intellectual disability, short stature, retarded bone age, and multiple extragenital malformations
- # 3. How to find acronym if it is capitalized but only includes char [A-Z], and
- # ... every other char in the string is also capitalized? I don't see a way unless
- # ... checking every word against an explicit dictionary of terms, though there are sure
- # ... to also be (i) acronyms in that dictionary, and (ii) non-acronyms missing from
- # ... that dictionary. And also concern (iii), where to get such an extensive dictionary?
- # 4. Add "special character" inclusion into acronym regexp. But which special
- # ... chars to include, and which not to include?
- # 5. Acronym capture extension: case where at least 1 word is not capitalized:
- # ... any word that is fully capitalized might as well be acronym, so long
- # ...as at least 1 other word in the label is not all caps. Maybe not a good rule,
- # ...because there could be some violations, and this probably would not happen
- # ...that often anwyay
- # ... - Not sure what I meant about (5) - joeflack4 2021/09/10
- # 6. Eponyms: re-capitalize first char?
- # ...e.g.: Balint syndrome, Barre-Lieou syndrome, Wallerian degeneration, etc.
- # ...How to do this? Simply get/create a list of known eponyms? Is this feasible?
-
- :param synonym: str
- :return: str
+ 1. All acronyms are capitalized
+
+ todo later's:
+ 1: Find a pattern for hyphenated types, and maintain acronym capitalization
+ e.g. MITF-related melanoma and renal cell carcinoma predisposition syndrome
+ e.g. ATP1A3-associated neurological disorder
+ 2. Make pattern for chromosomes
+ agonadism, 46,XY, with intellectual disability, short stature, retarded bone age, and multiple extragenital
+ malformations
+ Chromosome special formatting capitalization?
+ There seems to be special formatting for chromosome refs; they have a comma in the middle, but with no space
+ after the comma, though some places I saw on the internet contained a space.
+ e.g. "46,XY" in: agonadism, 46,XY, with intellectual disability, short stature, retarded bone age, and multiple
+ extragenital malformations
+ 3. How to find acronym if it is capitalized but only includes char [A-Z], and
+ every other char in the string is also capitalized? I don't see a way unless
+ checking every word against an explicit dictionary of terms, though there are sure
+ to also be (i) acronyms in that dictionary, and (ii) non-acronyms missing from
+ that dictionary. And also concern (iii), where to get such an extensive dictionary?
+ 4. Add "special character" inclusion into acronym regexp. But which special
+ chars to include, and which not to include?
+ 5. Acronym capture extension: case where at least 1 word is not capitalized:
+ any word that is fully capitalized might as well be acronym, so long
+ as at least 1 other word in the label is not all caps. Maybe not a good rule,
+ because there could be some violations, and this probably would not happen
+ that often anyway
+ - Not sure what I meant about (5) - joeflack4 2021/09/10
+ 6. Eponyms: re-capitalize first char?
+ e.g.: Balint syndrome, Barre-Lieou syndrome, Wallerian degeneration, etc.
+ How to do this? Simply get/create a list of known eponyms? Is this feasible?
"""
- # 1/3: Detect abbreviations / acronyms
- label2 = label.split(r';')[0] if r';' in label else label
- trailing_abbrev = label.split(r';')[1] if r';' in label else ''
- possible_abbreviations = _detect_abbreviations(
- label2, explicit_abbrev, trailing_abbrev, CAPITALIZATION_THRESHOLD)
-
- # 2/3: Format label
- # Simple method: Lower/title case everything but acronyms
- # label_newcase = getattr(label2, replacement_case_method)()
- # Advanced method: iteritavely format words
fixedwords = []
i = 0
- for wrd in label2.split():
+ for wrd in title.split():
i += 1
# convert the roman numerals to numbers,
# but assume that the first word is not
@@ -252,69 +260,137 @@ def cleanup_label(
wrd = fixed
wrd = getattr(wrd, replacement_case_method)()
# replace interior conjunctions, prepositions, and articles with lowercase, always
- if wrd.lower() in (conjunctions + little_preps + articles) and i != 1:
+ if wrd in (conjunctions + little_preps + articles) and i != 1:
wrd = wrd.lower()
if word_replacements:
wrd = word_replacements.get(wrd, wrd)
fixedwords.append(wrd)
label_newcase = ' '.join(fixedwords)
- # 3/3 Re-capitalize acronyms / words based on information contained w/in original label
- formatted_label = copy(label_newcase)
- for item in possible_abbreviations:
- to_replace = getattr(item, replacement_case_method_acronyms)()
- formatted_label = formatted_label.replace(to_replace, item)
+ return label_newcase
- return formatted_label
+def recapitalize_acronyms_in_title(title: str, known_abbrevs: Set[str] = None, capitalization_threshold=0.75) -> str:
+ """Re-capitalize acronyms / words based on information contained w/in original label
-def get_alt_labels(titles: str) -> Tuple[List[str], bool]:
+ todo: If cleanup_title() has been run on the title with a replacement_case_method other than the default 'lower',
+ then the abbreviation matching below will not work. To solve this: (a) capture the replacement_case_method used and
+ pass that here, (b) duplicate the matching line and call it on alternative casing variations (.title() and
+ .capitalize() (=sentence case)), or (c) possibly just compare to word.lower() instead of word.
+ todo: (more important): It's probable that .split(' ') is not enough to cover all cases. Should also run the check
+ by splitting on other characters. E.g. consider the following potential cases: "TITLE (ACRONYM)",
+ "TITLE: ACRONYM1&ACRONYM2", "TITLE/ACRONYM" or "TITLE ACRONYM/ACRONYM", "TITLE {ACRONYM1,ACRONYM2}",
+ "TITLE[ACRONYM]", "TITLE-ACRONYM", or less likely cases such as "TITLE_ACRONYM", "TITLE.ACRONYM". There are quite
+ a few different combos of special char usage that could theoretically arise. It might be possible for this to
+ utilize the regular expressions in detect_abbreviations(), and substitute in the acronym in the place of the [A-Z]
+ part. It is also possible to improve detect_abbreviations() by considering some of these other possible example
+ cases above.
"""
- From a string of delimited titles, make an array.
- This assumes that the titles are double-semicolon (';;') delimited.
- This will additionally pass each through the _cleanup_label method to
- convert the screaming ALL CAPS to something more pleasant to read.
- :param titles:
- :return: an array of cleaned-up labels
+ inferred_abbrevs: Set[str] = set(detect_abbreviations(title, capitalization_threshold))
+ abbrevs: Set[str] = (known_abbrevs or set()).union(inferred_abbrevs)
+ if not abbrevs:
+ return title
+ title2_words: List[str] = []
+ for word in title.split():
+ abbrev_match = False
+ for abbrev in abbrevs:
+ if abbrev.lower() == word:
+ title2_words.append(abbrev)
+ abbrev_match = True
+ break
+ if not abbrev_match:
+ title2_words.append(word)
+ title2 = ' '.join(title2_words)
+ return title2
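The second todo in the docstring above (delimiters other than spaces) might be approached by tokenizing on any run of non-alphanumeric characters while keeping the delimiters. The sketch below is one possible direction, not the author's method: `recapitalize` is a hypothetical name, and it assumes the title has already been lowercased.

```python
import re
from typing import Set


def recapitalize(title: str, abbrevs: Set[str]) -> str:
    """Hypothetical variant: restore abbreviations found between any non-alphanumeric delimiters."""
    by_lower = {abbrev.lower(): abbrev for abbrev in abbrevs}
    # The capture group makes re.split() keep the delimiters, so the title can be rejoined unchanged
    tokens = re.split(r'([^A-Za-z0-9]+)', title)
    return ''.join(by_lower.get(token, token) for token in tokens)


print(recapitalize('noack syndrome (acs5)', {'ACS5'}))  # -> noack syndrome (ACS5)
print(recapitalize('acrocephalosyndactyly/acs5', {'ACS5'}))  # -> acrocephalosyndactyly/ACS5
```

A limitation of this approach: abbreviations that themselves contain delimiters, such as period-separated acronyms, would be split into separate tokens and never matched.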
+
+
+def recapitalize_acronyms_in_titles(
+ titles: Union[str, List[str]], known_abbrevs: Set[str] = None, capitalization_threshold=0.75
+) -> Union[str, List[str]]:
+ """Re-capitalize acronyms in a list of titles"""
+ if isinstance(titles, str):
+ return recapitalize_acronyms_in_title(titles, known_abbrevs, capitalization_threshold)
+ return [recapitalize_acronyms_in_title(title, known_abbrevs, capitalization_threshold) for title in titles]
+
+
+def remove_included_and_formerly_suffixes(title: str) -> str:
+ """Remove ', INCLUDED' and ', FORMERLY' suffixes from a title"""
+ for suffix in ['FORMERLY', 'INCLUDED']:
+ title = re.sub(r',\s*' + suffix, '', title, flags=re.IGNORECASE)
+ return title
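One detail worth keeping in mind for suffix removal: the fourth positional parameter of `re.sub()` is `count`, so flags such as `re.IGNORECASE` take effect only when passed by keyword. The following standalone sketch shows the keyword form; `strip_suffixes` is a hypothetical stand-in for the function above.

```python
import re


def strip_suffixes(title: str) -> str:
    """Hypothetical stand-in for remove_included_and_formerly_suffixes()."""
    for suffix in ['FORMERLY', 'INCLUDED']:
        # flags= must be a keyword argument; the 4th positional argument of re.sub() is `count`
        title = re.sub(r',\s*' + suffix, '', title, flags=re.IGNORECASE)
    return title


print(strip_suffixes('CRANIOFACIAL-SKELETAL-DERMATOLOGIC DYSPLASIA, INCLUDED'))
# -> CRANIOFACIAL-SKELETAL-DERMATOLOGIC DYSPLASIA
print(strip_suffixes('Noack syndrome, formerly'))
# -> Noack syndrome
```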
+
+
+def separate_former_titles_and_symbols(
+ titles: List[str], symbols: List[str]
+) -> Tuple[List[str], List[str], List[str], List[str]]:
+ """Separate current title/symbols from deprecated (marked 'former') ones"""
+ former_titles = [x for x in titles if ', FORMERLY' in x.upper()]
+ former_symbols = [x for x in symbols if ', FORMERLY' in x.upper()]
+ current_titles = [x for x in titles if ', FORMERLY' not in x.upper()]
+ current_symbols = [x for x in symbols if ', FORMERLY' not in x.upper()]
+ return current_titles, current_symbols, former_titles, former_symbols
+
+
+def clean_alt_and_included_titles(titles: List[str], symbols: List[str]) -> Tuple[List[str], List[str]]:
+ """Remove ', INCLUDED' and ', FORMERLY' suffixes from titles/symbols & misc title reformatting"""
+ # remove ', included' and ', formerly', if present
+ titles2 = [remove_included_and_formerly_suffixes(x) for x in titles]
+ symbols2 = [remove_included_and_formerly_suffixes(x) for x in symbols]
+ # additional reformatting for titles
+ titles3 = [cleanup_title(x) for x in titles2]
+ return titles3, symbols2
+
+
+def parse_title_symbol_pairs(title_symbol_pairs_str: str) -> Tuple[List[str], List[str]]:
+ """Parses a string containing title-symbol pairs.
+
+ :param title_symbol_pairs_str: A string representing title-symbol pairs.
+ Format:
+ - Pairs are separated by ';;'
+ - Within each pair:
+ - The first element is always a title
+ - Optionally followed by zero or more symbols, separated by ';'
+
+ Examples:
+ Positional semantics:
+ Title1;Symbol1;Symbol2;;Title2;;Title3;Symbol3
+ Alternative Title(s); symbol(s):
+ ACROCEPHALOSYNDACTYLY, TYPE V; ACS5;; ACS V;; NOACK SYNDROME
+ Included Title(s); symbols:
+ CRANIOFACIAL-SKELETAL-DERMATOLOGIC DYSPLASIA, INCLUDED
"""
-
- labels = []
- label_endswith_included = False
- # "alternativeTitles": "
- # ACROCEPHALOSYNDACTYLY, TYPE V; ACS5;;\nACS V;;\nNOACK SYNDROME",
- # "includedTitles":
- # "CRANIOFACIAL-SKELETAL-DERMATOLOGIC DYSPLASIA, INCLUDED"
- for title in titles.split(';;'):
- # remove ', included', if present
- title = title.strip()
- label = re.sub(r',\s*INCLUDED', '', title, re.IGNORECASE)
- label_endswith_included = label != title
- label = cleanup_label(label)
- labels.append(label)
-
- return labels, label_endswith_included
+ titles: List[str] = []
+ symbols: List[str] = []
+ title_symbol_pairs: List[str] = title_symbol_pairs_str.split(';;')
+ for pair_str in title_symbol_pairs:
+ pair: List[str] = [x.strip() for x in pair_str.split(';')]
+ titles.append(pair[0])
+ symbols.extend(pair[1:])
+ return titles, symbols
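As a worked example of the positional semantics described in the docstring, here is the same parsing logic as a standalone function (`parse_pairs` is a hypothetical name), run against the alternative-titles string from the docstring:

```python
from typing import List, Tuple


def parse_pairs(pairs_str: str) -> Tuple[List[str], List[str]]:
    """Standalone copy of the parsing logic, for illustration only."""
    titles: List[str] = []
    symbols: List[str] = []
    for pair_str in pairs_str.split(';;'):
        pair = [x.strip() for x in pair_str.split(';')]
        titles.append(pair[0])    # first element of each pair is always a title
        symbols.extend(pair[1:])  # anything after it is a symbol
    return titles, symbols


titles, symbols = parse_pairs('ACROCEPHALOSYNDACTYLY, TYPE V; ACS5;; ACS V;; NOACK SYNDROME')
print(titles)   # -> ['ACROCEPHALOSYNDACTYLY, TYPE V', 'ACS V', 'NOACK SYNDROME']
print(symbols)  # -> ['ACS5']
```

Because classification is purely positional, `ACS V` is returned as a title here even though it looks like a symbol: it is the first element of its `;;`-delimited pair.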
+
+
+def get_alt_and_included_titles_and_symbols(title_symbol_pair_str) -> Tuple[List[str], List[str], List[str], List[str]]:
+ """Separates different types of titles/symbols, and cleans them."""
+ titles: List[str] = []
+ symbols: List[str] = []
+ former_titles: List[str] = []
+ former_symbols: List[str] = []
+ if title_symbol_pair_str:
+ titles, symbols = parse_title_symbol_pairs(title_symbol_pair_str)
+ titles, symbols, former_titles, former_symbols = separate_former_titles_and_symbols(titles, symbols)
+ titles, symbols = clean_alt_and_included_titles(titles, symbols)
+ former_titles, former_symbols = clean_alt_and_included_titles(former_titles, former_symbols)
+ return titles, symbols, former_titles, former_symbols
def get_mapped_gene_ids(entry) -> List[str]:
+ """Get mapped gene IDs from an OMIM entry"""
gene_ids = entry.get('externalLinks', {}).get('geneIDs', '')
return [s.strip() for s in gene_ids.split(',')]
- # omim_num = str(entry['mimNumber'])
- # omim_curie = 'OMIM:' + omim_num
- # if 'externalLinks' in entry:
- # links = entry['externalLinks']
- # omimtype = omim_type[omim_num]
- # if 'geneIDs' in links:
- # entrez_mappings = links['geneIDs']
- # gene_ids = entrez_mappings.split(',')
- # omim_ncbigene_idmap[omim_curie] = gene_ids
- # if omimtype in [
- # globaltt['gene'], self.globaltt['has_affected_feature']]:
- # for ncbi in gene_ids:
- # model.addEquivalentClass(omim_curie, 'NCBIGene:' + str(ncbi))
- # return gene_ids
def get_pubs(entry) -> List[str]:
+ """Get pubmed information from an OMIM entry"""
result = []
for rlst in entry.get('referenceList', []):
if 'pubmedID' in rlst['reference']:
@@ -323,6 +399,7 @@ def get_pubs(entry) -> List[str]:
def get_mapped_ids(entry) -> Dict[Namespace, List[str]]:
+ """Get mapped IDs from an OMIM entry"""
external_links = entry.get('externalLinks', {})
result = defaultdict(list)
if 'orphanetDiseases' in external_links:
@@ -334,6 +411,7 @@ def get_mapped_ids(entry) -> Dict[Namespace, List[str]]:
def get_phenotypic_series(entry) -> List[str]:
+ """Get phenotypic series info from an OMIM entry"""
result = []
for pheno in entry.get('phenotypeMapList', []):
if 'phenotypicSeriesNumber' in pheno['phenotypeMap']:
@@ -346,40 +424,59 @@ def get_phenotypic_series(entry) -> List[str]:
# noinspection PyUnusedLocal
def get_process_allelic_variants(entry) -> List:
+ """Process allelic variants from an OMIM entry"""
# Not sure when/if Dazhi intended to use this - joeflack4 2021/12/20
return []
-def get_known_capitalizations() -> Dict[str, str]:
- """Get list of known capitalizations for proper names, acronyms, and the like.
- TODO: Contains space-delimited words, e.g. "vitamin d". The way that
- cleanup_label is currently implemented, each word in the label gets
- replaced; i.e. it would try to replace "vitamin" and "d" separately. Hence,
- this would fail.
- Therefore, we should probably do this in 2 different operations: (1) use
- the current 'word replacement' logic, but also, (2), at the end, do a
- generic string replacement (e.g. my_str.replace(a, b). When implementing
- (2), we should also split this dictionary into two separate dictionaries,
- each for 1 of these 2 different purposes."""
- path = DATA_DIR / 'known_capitalizations.tsv'
- with open(path, "r") as file:
- data_io = csv.reader(file, delimiter="\t")
- data: List[List[str]] = [x for x in data_io]
- df = pd.DataFrame(data[1:], columns=data[0])
- d = {}
- for index, row in df.iterrows():
- d[row['lower_name']] = row['cap_name']
- return d
+def get_self_ref_assocs(phenotype_mim: str, gene_phenotypes: Dict[str, Dict]) -> List[Dict]:
+ """Find any cases where it appears that there is a self-referential gene-disease association"""
+ if phenotype_mim not in gene_phenotypes:
+ return []
+ _assocs = gene_phenotypes[phenotype_mim]['phenotype_associations']
+ _self_ref_assocs = []
+ for _assoc in _assocs:
+ if not _assoc['phenotype_mim_number']:
+ _self_ref_assocs.append(_assoc)
+ return _self_ref_assocs
-class LabelCleaner():
- """Cleans labels"""
+def _add_to_review_tsv(class_code: int, value: str):
+ """Update REVIEW_CASES with review cases, which will later be written to review.tsv"""
+ REVIEW_CASES.append({
+ "classCode": class_code,
+ "classShortName": REVIEW_CASE_NAME_MAP[class_code],
+ "value": value,
+ })
- def __init__(self):
- """New obj"""
- self.word_replacements: Dict[str, str] = get_known_capitalizations()
- def clean(self, label, *args, **kwargs):
- """Overrides cleanup_label by adding word_replacements"""
- return cleanup_label(
- label, *args, **kwargs, word_replacements=self.word_replacements)
+def log_review_cases(
+ p_mim: str, p_lab: str, p_map_key: str, gene_mim: str, gene_phenotypes: Dict[str, Dict],
+ omim_types: Dict[str, str]
+):
+ """Log cases that need to be reviewed"""
+ global REVIEW_SELF_REF_CASE_I
+ p_lab_lower: str = p_lab.lower()
+ basic_review_info = f"(Phenotype: {p_mim} {p_lab}), (Map key: {p_map_key}), (Gene: {gene_mim})"
+
+ # - Digenic: Technically, no entry with only 1 association should be marked 'digenic', but some are.
+ if 'digenic' in p_lab_lower:
+ _add_to_review_tsv(1, basic_review_info)
+ # - Somatic mutations
+ if 'somatic' in p_lab_lower:
+ _add_to_review_tsv(3, basic_review_info)
+ # - Self-referential cases
+ self_ref_assocs: List[Dict] = get_self_ref_assocs(p_mim, gene_phenotypes)
+ if self_ref_assocs:
+ REVIEW_SELF_REF_CASE_I += 1
+ _add_to_review_tsv(2, f"{REVIEW_SELF_REF_CASE_I}: {basic_review_info}")
+ for self_ref_assoc in self_ref_assocs:
+ _add_to_review_tsv(2, f"{REVIEW_SELF_REF_CASE_I}: (Phenotype: {self_ref_assoc['phenotype_label']}), (Map key: "
+ f"{self_ref_assoc['phenotype_mapping_info_key']}), (Gene: {p_mim})", )
+ # - Unexpected non-phenotype MIM types
+ p_mim_type: str = omim_types[p_mim] # Allowable: PHENOTYPE, HERITABLE_PHENOTYPIC_MARKER (#, %)
+ mim_type_err = f"(Phenotype MIM type {p_mim_type}), {basic_review_info}"
+ if p_mim_type == 'GENE': # Represented by: *
+ _add_to_review_tsv(4, mim_type_err)
+ elif p_mim_type in ('OBSOLETE', 'SUSPECTED', 'HAS_AFFECTED_FEATURE'): # Represented by: ^, NULL, +
+ _add_to_review_tsv(5, mim_type_err)
diff --git a/omim2obo/parsers/omim_txt_parser.py b/omim2obo/parsers/omim_txt_parser.py
index 1d6872d..c8e889e 100644
--- a/omim2obo/parsers/omim_txt_parser.py
+++ b/omim2obo/parsers/omim_txt_parser.py
@@ -10,7 +10,6 @@
import requests
import re
import pandas as pd
-# from rdflib import URIRef
from omim2obo.config import CONFIG, DATA_DIR
from omim2obo.namespaces import RO
@@ -69,7 +68,7 @@
# - Multiple rows, same mapping key: https://github.com/monarch-initiative/omim/issues/75
# - Multiple rows, diff mapping keys: https://github.com/monarch-initiative/omim/issues/81
-## todo: these are unused variables. remove?:
+# todo: these are unused variables. remove?:
# - Disease-to-Gene predicates
# RO:0004013 (is causal germline mutation in)
# https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0004013
@@ -250,6 +249,7 @@ def parse_mim_titles(lines) -> Tuple[Dict[str, Tuple[OmimType, str, str, str]],
def parse_phenotypic_series_titles(lines) -> Dict[str, List]:
+ """Parse phenotypic series titles"""
ret = defaultdict(list)
for line in lines:
if line.startswith('#'):
@@ -274,8 +274,8 @@ def parse_gene_map(lines):
def get_hgnc_map(filename, symbol_col, mim_col='MIM Number') -> Dict:
"""Get HGNC Map"""
- map = {}
- input_path = os.path.join(DATA_DIR, filename)
+ d = {}
+ input_path = DATA_DIR / filename
try:
df = pd.read_csv(input_path, delimiter='\t', comment='#').fillna('')
df[mim_col] = df[mim_col].astype(int) # these were being read as floats
@@ -298,9 +298,9 @@ def get_hgnc_map(filename, symbol_col, mim_col='MIM Number') -> Dict:
if symbol:
# Useful to read as `int` to catch any erroneous entries, but convert to str for compatibility with rest of
# codebase, which is currently reading as `str` for now.
- map[str(row[mim_col])] = symbol
+ d[str(row[mim_col])] = symbol
- return map
+ return d
def parse_mim2gene(lines: List[str], filename='mim2gene.tsv', filename2='genemap2.tsv') -> Tuple[Dict, Dict, Dict]:
@@ -334,8 +334,8 @@ def parse_mim2gene(lines: List[str], filename='mim2gene.tsv', filename2='genemap
if mim_num not in hgnc_map:
hgnc_map[mim_num] = symbol
elif hgnc_map[mim_num] != symbol:
- LOG.warning(warning.format(mim_num, hgnc_map[mim_num], symbol))
- del hgnc_map[mim_num]
+ LOG.warning(warning.format(mim_num, hgnc_map[mim_num], symbol))
+ del hgnc_map[mim_num]
return gene_map, pheno_map, hgnc_map
@@ -428,11 +428,10 @@ def get_maps_from_turtle() -> Tuple[Dict, Dict, Dict]:
return pmid_maps, umls_maps, orphanet_maps
+# todo: Update this function to dynamically retrieve the updated records
+# noinspection PyUnusedLocal address_if_this_gets_reimplemented
def get_updated_entries(start_year=2020, start_month=1, end_year=2021, end_month=8):
- """
- TODO: Update this function to dynamically retrieve the updated records
- :return:
- """
+ """Get updated entries from OMIM API."""
# updated_mims = set()
# updated_entries = []
# for year in range(start_year, end_year):
@@ -452,16 +451,17 @@ def get_hgnc_symbol_id_map(input_path=os.path.join(DATA_DIR, 'hgnc', 'hgnc_compl
"""Get mapping between HGNC symbols and IDs
todo: Ideally download the latest file: http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt
todo: Address or suppress warning. I dont even need these columns anyway:
- /Users/joeflack4/projects/omim/omim2obo/main.py:208: DtypeWarning: Columns (32,34,38,40,50) have mixed types.Specify dtype option on import or set low_memory=False.
+ /Users/joeflack4/projects/omim/omim2obo/main.py:208: DtypeWarning: Columns (32,34,38,40,50) have mixed types.
+ Specify dtype option on import or set low_memory=False.
hgnc_symbol_id_map: Dict = get_hgnc_symbol_id_map()
"""
- map = {}
+ d = {}
df = pd.read_csv(input_path, sep='\t')
for index, row in df.iterrows():
# hgnc_id is formatted as "hgnc:"
- map[row['symbol']] = row['hgnc_id'].split(':')[1]
+ d[row['symbol']] = row['hgnc_id'].split(':')[1]
- return map
+ return d
def p2g_is_definitive(label: str) -> bool:
"""Is phenotype to gene association definitive?
diff --git a/omim2obo/utils/utils.py b/omim2obo/utils/utils.py
index abd119f..d6a67d8 100644
--- a/omim2obo/utils/utils.py
+++ b/omim2obo/utils/utils.py
@@ -1,5 +1,10 @@
"""Misc utilities"""
-from typing import List, Union
+from typing import Dict, List, Optional, Union
+
+import pandas as pd
+
+from omim2obo.config import DISEASE_GENE_EXCLUSIONS_PATH
+from omim2obo.namespaces import ORCID
# todo: also in mondo-ingest. Refactor into mondolib: https://github.com/monarch-initiative/mondolib/issues/13
@@ -14,3 +19,14 @@ def remove_angle_brackets(uris: Union[str, List[str]]) -> Union[str, List[str]]:
x = x[:-1] if x.endswith('>') else x
uris2.append(x)
return uris2[0] if str_input else uris2
+
+
+def get_d2g_exclusions_by_curator(path=DISEASE_GENE_EXCLUSIONS_PATH) -> Dict[str, Optional[str]]:
+ """Get disease-gene exclusions
+
+ :return: Dict[str, Optional[str]]: Phenotype MIM as keys, ORCID of curator (or None if absent) as values
+ """
+ df = pd.read_csv(path, sep='\t').fillna('')
+ df['phenotype_mim'] = df['omim_id'].apply(lambda x: x.split(':')[1])
+ phenotype_mim_orcid_map = {x['phenotype_mim']: x['orcid'] for x in df.to_dict(orient='records')}
+ return {k: ORCID[v] if v else None for k, v in phenotype_mim_orcid_map.items()}
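To illustrate the shape of the mapping that `get_d2g_exclusions_by_curator()` returns, here is a dependency-free sketch using the standard `csv` module instead of pandas. Several things are assumptions, not taken from the diff: the `ORCID_PREFIX` value, the function name `read_exclusions`, the assumption that the `orcid` column holds a bare ORCID iD, and the sample rows (placeholder MIM numbers and ORCID's documented example iD).

```python
import csv
import io
from typing import Dict, Optional

ORCID_PREFIX = 'https://orcid.org/'  # assumption: the real ORCID namespace is defined in omim2obo.namespaces


def read_exclusions(tsv_text: str) -> Dict[str, Optional[str]]:
    """Hypothetical stand-in: map phenotype MIM number to curator ORCID URI, or None if no curator."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    result: Dict[str, Optional[str]] = {}
    for row in reader:
        mim = row['omim_id'].split(':')[1]  # 'OMIM:123456' -> '123456'
        orcid = (row.get('orcid') or '').strip()
        result[mim] = ORCID_PREFIX + orcid if orcid else None
    return result


# Placeholder rows, made up for illustration
sample = 'omim_id\torcid\nOMIM:123456\t0000-0002-1825-0097\nOMIM:654321\t\n'
print(read_exclusions(sample))
# -> {'123456': 'https://orcid.org/0000-0002-1825-0097', '654321': None}
```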
diff --git a/requirements-unlocked.txt b/requirements-unlocked.txt
index 566cccb..00203da 100644
--- a/requirements-unlocked.txt
+++ b/requirements-unlocked.txt
@@ -1 +1,7 @@
+beautifulsoup4
+pandas
python-dotenv
+pyyaml
+rdflib
+requests
+sssom
diff --git a/run.sh b/run.sh
index 1f6db72..3a43b59 100644
--- a/run.sh
+++ b/run.sh
@@ -36,7 +36,7 @@ if [ -n "$TAG_IN_IMAGE" ]; then
ODK_TAG=$TAG_IN_IMAGE
ODK_IMAGE=$(echo $ODK_IMAGE | awk -F':' '{ print $1 }')
fi
-ODK_TAG=${ODK_TAG:-v1.4.3}
+ODK_TAG=${ODK_TAG:-v1.5.3}
ODK_JAVA_OPTS=${ODK_JAVA_OPTS:--Xmx20G}
ODK_DEBUG=${ODK_DEBUG:-no}
diff --git a/sparql/disease-gene-relationships.sparql b/sparql/disease-gene-relationships.sparql
index 8895cdb..d1359bb 100644
--- a/sparql/disease-gene-relationships.sparql
+++ b/sparql/disease-gene-relationships.sparql
@@ -24,6 +24,7 @@ WHERE {
FILTER(
?PredUri IN (
+ ,
,
,
,