This repository is organized into six folders, one for each stage of constructing and evaluating the FamPlex resource.
- Download Reactome signaling proteins list
(see step1_genes_pmids/signaling_proteins_readme.txt)
- output: signaling_proteins.tsv
- Run process_proteins.py to get a gene list for literature search
- For each protein, get gene name.
- output: signaling_genes.txt
- Get the genes listed in famplex/relations.csv (which includes families from BEL and from the Ras family).
- Combine the two gene lists
- output: combined_genes.txt
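The gene-list merge above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `combine_gene_lists` and the example gene names are assumptions, and the real scripts may read and write files differently.

```python
def combine_gene_lists(signaling_genes, famplex_genes):
    """Return the sorted union of two collections of gene names.

    Hypothetical helper: strips whitespace, drops empty entries,
    and removes duplicates that appear in both lists.
    """
    cleaned = {g.strip() for g in list(signaling_genes) + list(famplex_genes)}
    cleaned.discard("")
    return sorted(cleaned)


# Illustrative inputs only; the real lists come from signaling_genes.txt
# and from the genes found in famplex/relations.csv.
combined = combine_gene_lists(["KRAS", "BRAF", ""], ["KRAS", "NRAS "])
print("\n".join(combined))
```

Sorting the result keeps the output file stable between runs, which makes later diffs easier to read.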
- Run get_pmids.py to get a list of PMIDs from gene list
- For each gene in combined_genes.txt, get papers curated from Entrez gene.
- Save dict mapping gene to list of papers
- output: combined_genes_to_pmids.pkl
- Make set of unique PMIDs, save
- output: combined_pids.txt
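Collapsing the gene-to-papers dictionary into a set of unique PMIDs might look like the sketch below. The helper name `unique_pmids` and the sample dictionary are assumptions made for illustration; how get_pmids.py actually stores identifiers is not stated here.

```python
def unique_pmids(gene_to_pmids):
    """Collect every PMID that appears under any gene, without duplicates.

    Hypothetical helper: PMIDs are normalized to strings so that the same
    paper listed as 123 and "123" is counted once.
    """
    pmids = set()
    for pmid_list in gene_to_pmids.values():
        pmids.update(str(p) for p in pmid_list)
    return sorted(pmids)


# Made-up identifiers, used only to show the shape of the data.
example = {"KRAS": ["100", "200"], "BRAF": ["200", 300]}
print(unique_pmids(example))
```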
- Build REACH without FamPlex
- output: combined_genes_no_famplex_stmts.pkl (the pickle file contains a dictionary mapping each PMID to a list of Statements)
- Run get_training_test_stmts.py to split the papers: 80% to the training set, 20% to the test set
- output: training_pmid_stmts.pkl
- output: test_pmid_stmts.pkl
- output: training_pmids.txt
- output: test_pmids.txt
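One way to perform an 80/20 split at the paper level is sketched below. The function names, the fixed seed, and the shuffling strategy are assumptions for illustration; get_training_test_stmts.py may split the papers differently.

```python
import random


def split_pmids(pmids, train_fraction=0.8, seed=0):
    """Shuffle PMIDs reproducibly and split them into training and test lists."""
    shuffled = sorted(pmids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


def stmts_for(pmids, pmid_to_stmts):
    """Restrict a PMID-to-Statements dictionary to the given PMIDs."""
    return {p: pmid_to_stmts[p] for p in pmids if p in pmid_to_stmts}


# Made-up PMIDs and placeholder statements, for illustration only.
all_stmts = {str(i): ["stmt-%d" % i] for i in range(10)}
training_pmids, test_pmids = split_pmids(all_stmts.keys())
training_stmts = stmts_for(training_pmids, all_stmts)
test_stmts = stmts_for(test_pmids, all_stmts)
```

Splitting by paper rather than by Statement keeps all Statements from one paper in the same set.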
- Run get_agents_for_evaluation.py to create a random sample of agents from the training (or test) set for evaluation.
- output: training_agents_sample.csv
- output: training_agents.csv
- output: training_ungrounded.csv
- output: training_agent_distribution.pdf
- output: training_ungrounded_distribution.pdf
- Open training_agents_sample_curated.csv in Excel to curate
- Curate agents using the following codes:
- P: protein
- F: family
- C: complex
- X: complex of families
- S: small molecule
- B: biological process
- U: unknown/other
- M: microRNA
- For ease of curation, run generate_agent_links.py to create an HTML table to check groundings in different databases.
- Same for test set.
- texts_for_gene.py helps in finding lexical synonyms for unmapped families.
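Because the curation codes are typed by hand in a spreadsheet, a quick check for unrecognized codes can catch typos before analysis. The sketch below is an assumption about how such a check could be written; the names `CURATION_CODES` and `invalid_codes` are hypothetical and are not part of the repository.

```python
# Curation codes as listed above.
CURATION_CODES = {
    "P": "protein",
    "F": "family",
    "C": "complex",
    "X": "complex of families",
    "S": "small molecule",
    "B": "biological process",
    "U": "unknown/other",
    "M": "microRNA",
}


def invalid_codes(codes):
    """Return the entries that are not recognized curation codes.

    Hypothetical helper: comparison ignores case and surrounding whitespace.
    """
    return [c for c in codes if c.strip().upper() not in CURATION_CODES]


# Illustrative column of curated codes; "Q" is a deliberate typo.
print(invalid_codes(["P", "f", " X ", "Q"]))
```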
- Fraction of most frequent agents that include family or complex
- Fraction of mentions
- Fraction of most frequent ungrounded that include family or complex
- Fraction of ungrounded mentions
- Table with 4 columns: Entity, Frequency, Category (F, C, X, or empty), Curator
- Creates a plot of frequency of groundings to each entity
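The two kinds of fractions listed above (fraction of entities versus fraction of mentions) can be computed from the curated table as in the sketch below. The function name, the tuple layout, and the example counts are assumptions for illustration; the repository's own analysis scripts may compute these differently.

```python
# Categories counted as "family or complex" (F, C, or X in the curated table).
FAMILY_OR_COMPLEX = {"F", "C", "X"}


def family_complex_fractions(rows):
    """Compute (entity fraction, mention fraction) for family/complex entries.

    rows: iterable of (entity, frequency, category) tuples, where category
    is "F", "C", "X", or an empty string.
    """
    rows = list(rows)
    total_mentions = sum(freq for _, freq, _ in rows)
    fc_rows = [r for r in rows if r[2] in FAMILY_OR_COMPLEX]
    fc_mentions = sum(freq for _, freq, _ in fc_rows)
    entity_fraction = len(fc_rows) / len(rows) if rows else 0.0
    mention_fraction = fc_mentions / total_mentions if total_mentions else 0.0
    return entity_fraction, mention_fraction


# Made-up frequencies, used only to show the calculation.
example_rows = [("ERK", 60, "F"), ("AKT", 30, "F"), ("EGFR", 10, "")]
entity_frac, mention_frac = family_complex_fractions(example_rows)
```

The entity fraction treats every distinct entity equally, while the mention fraction weights each entity by how often it appears, so the two can differ considerably when a few families dominate the text.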