Scripts to reproduce the results reported in the FamPlex manuscript

This repository is organized into six folders, one for each stage of constructing and evaluating the FamPlex resource.

Step 1: Get list of genes and corresponding PMIDs

Download Reactome signaling proteins list (see step1_genes_pmids/signaling_proteins_readme.txt)
- output: signaling_proteins.tsv
Run process_proteins.py to get a gene list for literature search
- For each protein, get gene name.
- output: signaling_genes.txt
- Get genes identified in famplex/relations.csv (which included families from BEL and from the Ras family).
- Combine the two gene lists
- output: combined_genes.txt
Run get_pmids.py to get a list of PMIDs from gene list
- For each gene in combined_genes.txt, get papers curated from Entrez Gene.
- Save dict mapping gene to list of papers
- output: combined_genes_to_pmids.pkl
- Make set of unique PMIDs, save
- output: combined_pids.txt
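
The list-combining and PMID-deduplication steps above can be sketched roughly as follows. This is only an illustration on toy data, not the code in process_proteins.py or get_pmids.py; the function names combine_gene_lists and unique_pmids are hypothetical, and the Entrez Gene lookup and file I/O are omitted.

```python
def combine_gene_lists(signaling_genes, famplex_genes):
    """Return the sorted union of two collections of gene names."""
    return sorted(set(signaling_genes) | set(famplex_genes))


def unique_pmids(genes_to_pmids):
    """Flatten a gene -> list-of-PMIDs dict into a sorted list of unique PMIDs."""
    pmids = set()
    for pmid_list in genes_to_pmids.values():
        pmids.update(pmid_list)
    return sorted(pmids)


# Toy inputs standing in for signaling_genes.txt and the FamPlex-derived genes
genes = combine_gene_lists(["KRAS", "MAPK1"], ["KRAS", "HRAS"])

# Toy stand-in for the dict saved in combined_genes_to_pmids.pkl
genes_to_pmids = {"KRAS": ["100", "200"], "HRAS": ["200", "300"], "MAPK1": []}
pmids = unique_pmids(genes_to_pmids)
```

Sorting the results is a choice made here so that repeated runs produce identical output files; the original scripts may order them differently.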

Build REACH without FamPlex
- output: combined_genes_no_famplex_stmts.pkl (a pickled dictionary mapping each PMID to a list of Statements)

Run get_training_test_stmts.py to assign 80% of papers to the training set and 20% of papers to the test set
- output: training_pmid_stmts.pkl
- output: test_pmid_stmts.pkl
- output: training_pmids.txt
- output: test_pmids.txt
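
A minimal sketch of an 80/20 split at the paper level is shown below. It assumes the input is a dictionary mapping each PMID to a list of Statements, as described for the REACH output; the function name split_pmids, the seed, and the rounding rule are assumptions for illustration and may not match get_training_test_stmts.py.

```python
import random


def split_pmids(pmid_stmts, train_fraction=0.8, seed=0):
    """Split a PMID -> statements dict into training and test dicts by paper."""
    pmids = sorted(pmid_stmts)      # fixed starting order for reproducibility
    rng = random.Random(seed)       # local generator, leaves global state alone
    rng.shuffle(pmids)
    n_train = int(round(train_fraction * len(pmids)))
    training = {p: pmid_stmts[p] for p in pmids[:n_train]}
    test = {p: pmid_stmts[p] for p in pmids[n_train:]}
    return training, test


# Toy data: ten papers, each with a single placeholder statement
toy_stmts = {str(i): ["stmt_%d" % i] for i in range(10)}
training, test = split_pmids(toy_stmts)
```

Splitting by PMID rather than by individual Statement keeps all Statements from one paper in the same set.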

Run get_agents_for_evaluation.py. This creates a random sample of agents from the training (or test) set for evaluation.
- output: training_agents_sample.csv
- output: training_agents.csv
- output: training_ungrounded.csv
- output: training_agent_distribution.pdf
- output: training_ungrounded_distribution.pdf
Open training_agents_sample_curated.csv in Excel to curate
- Curate each agent using one of the following codes:
  - P: protein
  - F: family
  - C: complex
  - X: complex of families
  - S: small molecule
  - B: biological process
  - U: unknown/other
  - M: microRNA
For ease of curation, run generate_agent_links.py to create an HTML table to check groundings in different databases.
Repeat the same procedure for the test set.
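
The sampling and curation-code steps above might look something like the sketch below. It assumes a uniform random sample over distinct agent texts, which may differ from what get_agents_for_evaluation.py actually does; the names CURATION_CODES, sample_agents and is_valid_code are hypothetical.

```python
import random
from collections import Counter

# Curation codes as listed in this README
CURATION_CODES = {
    "P": "protein",
    "F": "family",
    "C": "complex",
    "X": "complex of families",
    "S": "small molecule",
    "B": "biological process",
    "U": "unknown/other",
    "M": "microRNA",
}


def sample_agents(agent_texts, sample_size, seed=0):
    """Return (text, mention_count) pairs for a random sample of distinct agent texts."""
    counts = Counter(agent_texts)
    names = sorted(counts)
    rng = random.Random(seed)
    chosen = rng.sample(names, min(sample_size, len(names)))
    return [(name, counts[name]) for name in chosen]


def is_valid_code(code):
    """Check that a curated category is one of the known codes."""
    return code in CURATION_CODES


# Toy list of agent mentions
mentions = ["ERK", "ERK", "AKT", "p53", "ERK", "AKT"]
sample = sample_agents(mentions, sample_size=2)
```

A check such as is_valid_code can be run over the curated CSV to catch typos introduced during manual curation.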

texts_for_gene.py helps in finding lexical synonyms for unmapped families.
Fraction of most frequent agents that include a family or complex
- Fraction of mentions
Fraction of most frequent ungrounded agents that include a family or complex
- Fraction of ungrounded mentions
Table with 4 columns: Entity, Frequency, Category (F, C, X, or empty), Curator
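
As a rough illustration, the two fractions can be computed from such a table as follows. The rows are made-up toy values, the Curator column is left out, and the function name family_complex_fractions is hypothetical.

```python
FAMILY_COMPLEX_CODES = {"F", "C", "X"}


def family_complex_fractions(rows):
    """Compute the fraction of entities and of mentions that are families or complexes.

    rows: list of (entity, frequency, category) tuples, where category is
    'F', 'C', 'X', or '' for anything else.
    """
    fc_rows = [row for row in rows if row[2] in FAMILY_COMPLEX_CODES]
    frac_entities = len(fc_rows) / len(rows)
    frac_mentions = sum(row[1] for row in fc_rows) / sum(row[1] for row in rows)
    return frac_entities, frac_mentions


# Toy table: Entity, Frequency, Category
rows = [
    ("ERK", 50, "F"),
    ("AKT", 30, "F"),
    ("p53", 15, ""),
    ("NF-kappaB", 5, "C"),
]
frac_entities, frac_mentions = family_complex_fractions(rows)
```

In this toy table, 3 of 4 entities and 85 of 100 mentions fall in the family or complex categories.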