This Python Jupyter notebook call sites of srong escape from each antibody / sera
Import Python modules:
import os
from dms_variants.constants import CBPALETTE
from IPython.display import display, HTML
import pandas as pd
from plotnine import *
import yaml
Versions of key software:
Read the configuration file:
with open('config.yaml') as f:
config = yaml.safe_load(f)
Create output directory:
os.makedirs(config['escape_profiles_dir'], exist_ok=True)
Extract from configuration what we will use as the site- and mutation-level metrics:
site_metric = config['site_metric']
mut_metric = config['mut_metric']
print(f"At site level, quantifying selection by {site_metric}")
print(f"At mutation level, quantify selection by {mut_metric}")
At site level, quantifying selection by site_total_escape_frac_epistasis_model
At mutation level, quantify selection by mut_escape_frac_epistasis_model
Read the escape fractions.
We only retain the average of the libraries for plotting here, not the individual libraries.
Also, we work in the full-Spike rather than RBD numbering, which means we use label_site
as site
(and so rename as such below):
print(f"Reading escape fractions from {config['escape_fracs']}")
escape_fracs = (pd.read_csv(config['escape_fracs'])
.query('library == "average"')
.drop(columns=['site', 'selection', 'library'])
.rename(columns={'label_site': 'site'})
)
print('First few lines of escape-fraction data frame with sample-information added:')
display(HTML(escape_fracs.head().to_html(index=False)))
Reading escape fractions from results/escape_scores/escape_fracs.csv
First few lines of escape-fraction data frame with sample-information added:
condition | site | wildtype | mutation | protein_chain | protein_site | mut_escape_frac_epistasis_model | mut_escape_frac_single_mut | site_total_escape_frac_epistasis_model | site_total_escape_frac_single_mut | site_avg_escape_frac_epistasis_model | site_avg_escape_frac_single_mut | nlibs |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CAB-A17_139 | 331 | N | A | E | 331 | 0.001006 | 0.000179 | 0.01628 | 0.003478 | 0.001017 | 0.000217 | 2 |
CAB-A17_139 | 331 | N | D | E | 331 | 0.001013 | 0.000167 | 0.01628 | 0.003478 | 0.001017 | 0.000217 | 1 |
CAB-A17_139 | 331 | N | E | E | 331 | 0.000979 | 0.000237 | 0.01628 | 0.003478 | 0.001017 | 0.000217 | 2 |
CAB-A17_139 | 331 | N | F | E | 331 | 0.001090 | 0.000095 | 0.01628 | 0.003478 | 0.001017 | 0.000217 | 2 |
CAB-A17_139 | 331 | N | G | E | 331 | 0.001025 | 0.000109 | 0.01628 | 0.003478 | 0.001017 | 0.000217 | 2 |
Now identify sites of "strong" selection using three criteria for each threshold.
-
The selection at a site has to exceed the median selection across all sites for that condition (e.g., antibody or sera) by some threshold.
-
The selection at a site has to be greater than some fraction of the maximum selection observed at any site for that condition (e.g., antibody or sera).
-
The selection at a site has to exceed some absolute minimum.
site_threshold_df = pd.DataFrame()
for threshold, params in config['strong_escape_sites_calling_params'].items():
print(f"\nCalling sites of strong escape for the {threshold} calling parameters:")
exceed_median = params['exceed_median']
frac_max = params['frac_max']
min_value = params['min_value']
print(f"Here are criteria used to call sites of strong escape for the {threshold} params:\n"
f" 1. Selection at site exceeds median by >{exceed_median} fold\n"
f" 2. Selection at site is >{frac_max} of the max for any site\n"
f" 3. Selection at site is >{min_value}")
if 'min_value_mut' in params:
min_value_mut = params['min_value_mut']
print(f"Sites are also called as significant if they have **any** mutation with escape fraction > {min_value_mut}")
else:
min_value_mut = None
site_threshold_df = site_threshold_df.append(
escape_fracs
.assign(max_mut=lambda x: x.groupby(['condition', 'site'])[mut_metric].transform('max'))
[['condition', 'site', site_metric, 'max_mut']]
.drop_duplicates()
.assign(
threshold=threshold,
median=lambda x: x.groupby('condition')[site_metric].transform('median'),
max=lambda x: x.groupby('condition')[site_metric].transform('max'),
exceed_median=lambda x: x['median'] * exceed_median,
meet_exceed_median=lambda x: x[site_metric] > x['exceed_median'],
frac_max=lambda x: x['max'] * frac_max,
meet_frac_max=lambda x: x[site_metric] > x['frac_max'],
min_value=lambda x: min_value,
meet_min_value=lambda x: x[site_metric] > x['min_value'],
meet_mut_min_value=lambda x: False if min_value_mut is None else x['max_mut'] > min_value_mut,
strong_site=lambda x: (x['meet_exceed_median'] & x['meet_frac_max'] & x['meet_min_value']) | x['meet_mut_min_value'],
n_strong_sites=lambda x: x.groupby('condition')['strong_site'].transform('sum'),
)
)
# orders for plotting
site_threshold_df = (
site_threshold_df
.assign(condition=lambda x: pd.Categorical(x['condition'],
reversed(x['condition'].unique()),
ordered=True),
)
)
Calling sites of strong escape for the default calling parameters:
Here are criteria used to call sites of strong escape for the default params:
1. Selection at site exceeds median by >10 fold
2. Selection at site is >0.1 of the max for any site
3. Selection at site is >0
Calling sites of strong escape for the sensitive calling parameters:
Here are criteria used to call sites of strong escape for the sensitive params:
1. Selection at site exceeds median by >5 fold
2. Selection at site is >0.05 of the max for any site
3. Selection at site is >0
Calling sites of strong escape for the sensitive_max_mut calling parameters:
Here are criteria used to call sites of strong escape for the sensitive_max_mut params:
1. Selection at site exceeds median by >5 fold
2. Selection at site is >0.05 of the max for any site
3. Selection at site is >0
Sites are also called as significant if they have **any** mutation with escape fraction > 0.5
Now plot the selection for all sites for all conditions, indicating which sites are strong escape and using lines to draw the three thresholds:
max_sel = site_threshold_df[site_metric].max() # max y-value plotted
p = (ggplot(site_threshold_df.assign(alpha=lambda x: x['strong_site'].astype(int))) +
aes(site_metric, 'condition', alpha='alpha', fill='strong_site') +
geom_jitter(size=2.5, height=0.1, width=0, stroke=0, random_state=1) +
scale_fill_manual(values=CBPALETTE) +
scale_alpha(range=(0.15, 0.5), guide=None) +
theme(figure_size=(4 * site_threshold_df['threshold'].nunique(),
0.25 * site_threshold_df['condition'].nunique()),
legend_position='top',
) +
xlab('site-level selection') +
expand_limits(x=1.2 * max_sel) +
geom_text(data=site_threshold_df[['condition', 'threshold', 'n_strong_sites']].drop_duplicates(),
mapping=aes(y='condition', label='n_strong_sites'),
inherit_aes=False,
x=1.1 * max_sel,
color=CBPALETTE[1],
alpha=1,
size=8,
ha='left',
va='center',
) +
geom_linerange(data=site_threshold_df
.melt(id_vars=['condition', 'threshold'],
value_vars=['exceed_median', 'frac_max', 'min_value'],
var_name='threshold_type',
value_name='threshold_line')
.reset_index()
.drop_duplicates()
.assign(ymin=lambda x: x['condition'].cat.codes + 1 - 0.3,
ymax=lambda x: x['condition'].cat.codes + 1 + 0.3),
mapping=aes(x='threshold_line', ymin='ymin', ymax='ymax', color='threshold_type'),
inherit_aes=False,
size=0.1) +
scale_color_manual(values=CBPALETTE[2: ]) +
facet_wrap('~ threshold', nrow=1)
)
_ = p.draw()
Now get the strong sites for each condition, and write to a CSV file:
strong_sites = site_threshold_df.query('strong_site').reset_index(drop=True)
print('Number of sites of strong escape for each threshold:')
display(HTML(
strong_sites
.groupby('threshold')
.aggregate(n_sites=pd.NamedAgg('site', 'nunique'))
.to_html()
))
print(f"Writing to {config['strong_escape_sites']}, and printing the first few lines below:")
display(HTML(strong_sites.head().to_html()))
strong_sites.to_csv(config['strong_escape_sites'], index=False)
Number of sites of strong escape for each threshold:
n_sites | |
---|---|
threshold | |
default | 5 |
sensitive | 5 |
sensitive_max_mut | 5 |
Writing to results/escape_profiles/strong_escape_sites.csv, and printing the first few lines below:
condition | site | site_total_escape_frac_epistasis_model | max_mut | threshold | median | max | exceed_median | meet_exceed_median | frac_max | meet_frac_max | min_value | meet_min_value | meet_mut_min_value | strong_site | n_strong_sites | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CAB-A17_139 | 456 | 0.5772 | 0.5368 | default | 0.01313 | 0.5772 | 0.1313 | True | 0.05772 | True | 0 | True | False | True | 2 |
1 | CAB-A17_139 | 475 | 0.4891 | 0.4616 | default | 0.01313 | 0.5772 | 0.1313 | True | 0.05772 | True | 0 | True | False | True | 2 |
2 | CAB-A49_222 | 417 | 2.3510 | 0.9329 | default | 0.01927 | 2.8220 | 0.1927 | True | 0.28220 | True | 0 | True | False | True | 4 |
3 | CAB-A49_222 | 420 | 1.8470 | 0.9687 | default | 0.01927 | 2.8220 | 0.1927 | True | 0.28220 | True | 0 | True | False | True | 4 |
4 | CAB-A49_222 | 456 | 1.3500 | 0.9806 | default | 0.01927 | 2.8220 | 0.1927 | True | 0.28220 | True | 0 | True | False | True | 4 |
Plot which sites are auto-identified for each condition (e.g., antibody or sera):
p = (ggplot(strong_sites
.assign(site=lambda x: pd.Categorical(x['site'], # categorical so only plot observed sites
sorted(x['site'].unique()),
ordered=True)
)
) +
aes('site', 'condition') +
geom_tile() +
theme(axis_text_x=element_text(angle=90),
figure_size=(0.18 * strong_sites['site'].nunique(),
0.2 * strong_sites['condition'].nunique() * strong_sites['threshold'].nunique()),
) +
facet_wrap('~ threshold', ncol=1)
)
_ = p.draw()