Skip to content

Latest commit

 

History

History
952 lines (821 loc) · 27 KB

natural_mutations.md

File metadata and controls

952 lines (821 loc) · 27 KB

Analyze naturally occurring mutations at sites of strong escape

This Python Jupyter notebook sees how many naturally occuring mutations are observed at each site of strong escape

Set up analysis

Import Python modules:

import collections
import copy
import math
import os

import dms_variants.utils

from IPython.display import display, HTML

import matplotlib.pyplot as plt

import pandas as pd

from plotnine import *

import yaml

Read the configuration file:

with open('config.yaml') as f:
    config = yaml.safe_load(f)

Read escape profiles config, which tells which sets to make plots for:

with open(config['escape_profiles_config']) as f:
    escape_profiles_config = yaml.safe_load(f)

Create output directory:

os.makedirs(config['gisaid_mutations_dir'], exist_ok=True)

Read counts of naturally ocurring mutations:

print(f"Reading mutation counts from {config['gisaid_mutation_counts']}")

mut_counts = pd.read_csv(config['gisaid_mutation_counts'])
Reading mutation counts from results/GISAID_mutations/mutation_counts.csv

Read sites of "strong escape" from all antibodies / sera:

print(f"Reading sites of strong escape from {config['strong_escape_sites']}")

strong_sites = pd.read_csv(config['strong_escape_sites'])
Reading sites of strong escape from results/escape_profiles/strong_escape_sites.csv

Read escape fractions for all antibodies / sera:

print(f"Reading escape fractions from {config['escape_fracs']}")

escape_fracs = (
    pd.read_csv(config['escape_fracs'])
    .query('library == "average"')
    .drop(columns='site')
    .rename(columns={'mutation': 'mutant',
                     'label_site': 'site'})
    [['condition', 'site', 'wildtype', 'mutant', config['mut_metric'], config['site_metric']]]
    )

escape_fracs
Reading escape fractions from results/escape_scores/escape_fracs.csv
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
condition site wildtype mutant mut_escape_frac_epistasis_model site_total_escape_frac_epistasis_model
0 CAB-A17_139 331 N A 0.001006 0.01628
1 CAB-A17_139 331 N D 0.001013 0.01628
2 CAB-A17_139 331 N E 0.000979 0.01628
3 CAB-A17_139 331 N F 0.001090 0.01628
4 CAB-A17_139 331 N G 0.001025 0.01628
... ... ... ... ... ... ...
5833 CAB-C19_184 531 T R 0.001181 0.02607
5834 CAB-C19_184 531 T S 0.001409 0.02607
5835 CAB-C19_184 531 T V 0.001510 0.02607
5836 CAB-C19_184 531 T W 0.000796 0.02607
5837 CAB-C19_184 531 T Y 0.001528 0.02607

5838 rows × 6 columns

Counts of mutations at sites of escape

Get counts of naturally occurring mutations at sites of escape, along with the actual escape values:

First get mutation-level counts:

mutcounts_strong_sites = (
    strong_sites[['condition', 'threshold', 'site']]
    .merge(mut_counts, how='inner', on='site')
    .merge(escape_fracs[['condition', 'site', 'wildtype', config['site_metric']]].drop_duplicates(),
           on=['condition', 'site', 'wildtype'],
           validate='many_to_one')
    .assign(mutation=lambda x: x['wildtype'] + x['site'].astype(str) + x['mutant'])
    .sort_values('count', ascending=False)
    )

Now get site-level counts (aggregating all mutations at a site):

sitecounts_strong_sites = (
    mutcounts_strong_sites
    .assign(mut_count=lambda x: x['mutation'] + ' (' + x['count'].astype(str) + ')')
    .groupby(['condition', 'threshold', 'site', 'wildtype', config['site_metric']])
    .aggregate({'count': 'sum', 'mut_count': ', '.join})
    .rename(columns={'mut_count': 'counts_by_mutation'})
    .reset_index()
    .sort_values('count', ascending=False)
    )

print(f"Here are first few lines showing the most frequently mutated sites of escape:")
display(HTML(sitecounts_strong_sites.head(n=20).to_html(index=False)))
Here are first few lines showing the most frequently mutated sites of escape:
condition threshold site wildtype site_total_escape_frac_epistasis_model count counts_by_mutation
CAB-A49_222 sensitive 417 K 2.35100 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-A17_139 sensitive 417 K 0.09368 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-C19_184 default 417 K 0.70390 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-A17_139 sensitive_max_mut 417 K 0.09368 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-C19_184 sensitive 417 K 0.70390 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-A49_222 default 417 K 2.35100 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-C19_184 sensitive_max_mut 417 K 0.70390 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-A49_222 sensitive_max_mut 417 K 2.35100 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-C19_184 sensitive_max_mut 475 A 1.72900 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-A49_222 sensitive_max_mut 475 A 2.82200 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-A17_139 default 475 A 0.48910 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-A49_222 sensitive 475 A 2.82200 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-A49_222 default 475 A 2.82200 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-C19_184 sensitive 475 A 1.72900 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-A17_139 sensitive_max_mut 475 A 0.48910 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-A17_139 sensitive 475 A 0.48910 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-C19_184 default 475 A 1.72900 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-C19_184 sensitive 460 N 0.22870 29 N460I (15), N460K (6), N460T (5), N460S (3)
CAB-C19_184 default 460 N 0.22870 29 N460I (15), N460K (6), N460T (5), N460S (3)
CAB-C19_184 sensitive_max_mut 460 N 0.22870 29 N460I (15), N460K (6), N460T (5), N460S (3)

Plot sites of escape with natural variation

We perform analyses on all subsets in the escape profiles config for which this is specified:

for name, specs in escape_profiles_config.items():
    if 'analyze_natural_mutations' not in specs or not specs['analyze_natural_mutations']:
        continue
    print(f"\nAnalyzing natural mutations for {name}")
    
    conditions = specs['conditions']
    
    threshold = specs['plot_auto_identified_sites']
    if threshold not in sitecounts_strong_sites['threshold'].unique():
        raise ValueError(f"invalid threshold {threshold} for {name}")
    
    # get count for conditions of interest for this subset
    df = (sitecounts_strong_sites
          .query('condition in @conditions')
          .query('threshold == @threshold')
          .assign(condition=lambda x: x['condition'].map(conditions))
          .drop(columns=config['site_metric'])
          )
    countsfile = os.path.join(config['gisaid_mutations_dir'], f"{name}_mutation_counts.csv")
    print(f"Writing counts of mutations at sites of strong escape to {countsfile}. First few lines:")
    display(HTML(df.head(n=10).to_html(index=False)))
    df.to_csv(countsfile, index=False)
    
    # make plot showing escape sites with more than mincounts mutations
    if 'natural_mutations_mincounts' in specs:
        mincounts = specs['natural_mutations_mincounts']
    else:
        mincounts = 5
    plotsfile = os.path.join(config['gisaid_mutations_dir'], f"{name}_mutation_counts.pdf")
    print('Plotting which antibodies / sera are escaped by mutations at all sites of '
          f"escape with at least {mincounts} mutation counts and saving to {plotsfile}.")
    plot_df = (
        # data frame with all combinations of conditions and sites
        pd.DataFrame.from_records([(condition, site) for condition in conditions.values()
                                   for site in df['site'].unique()],
                                  columns=['condition', 'site'])
        # annotate sites of escape
        .merge(df.assign(escape=lambda x: x['count'] >= mincounts)[['condition', 'site', 'escape']],
               how='left',
               validate='one_to_one',
               on=['condition', 'site'])
        .assign(escape=lambda x: x['escape'].fillna(False))
        # add wildtype and counts of mutations at each site
        .merge(sitecounts_strong_sites[['site', 'wildtype', 'count']].drop_duplicates(),
               how='left',
               validate='many_to_one',
               on='site')
        # get only sites with sufficient mutation counts
        .query('count > @mincounts')
        # only get sites where at least one antibody escapes
        .assign(n_escape=lambda x: x.groupby('site')['escape'].transform('sum'))
        .query('n_escape > 0')
        # order conditions, and order sites by count after making nice label
        .assign(site_label=lambda x: x['wildtype'] + x['site'].astype(str) + ' (' + x['count'].astype(str) + ')')
        .sort_values('count')
        .assign(condition=lambda x: pd.Categorical(x['condition'], list(conditions.values()), ordered=True),
                site_label=lambda x: pd.Categorical(x['site_label'], x['site_label'].unique(), ordered=True)
                )
        )
    p = (ggplot(plot_df) +
         aes('condition', 'site_label', fill='escape') +
         geom_tile(color='black', size=0.3) +
         theme(axis_text_x=element_text(angle=90),
               figure_size=(0.3 * plot_df['condition'].nunique(), 0.3 * plot_df['site_label'].nunique()),
               panel_background=element_blank(),
               ) +
         xlab('') +
         ylab('') +
         scale_fill_manual(values=['white', 'dimgray'])
         )
    p.save(plotsfile, verbose=False)
    fig = p.draw()
    display(fig)
    plt.close(fig)
Analyzing natural mutations for karolinska
Writing counts of mutations at sites of strong escape to results/GISAID_mutations/karolinska_mutation_counts.csv. First few lines:
condition threshold site wildtype count counts_by_mutation
CAB-A17 sensitive_max_mut 417 K 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-C19 sensitive_max_mut 417 K 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-A49 sensitive_max_mut 417 K 3647 K417N (3010), K417T (632), K417R (4), K417E (1)
CAB-C19 sensitive_max_mut 475 A 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-A49 sensitive_max_mut 475 A 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-A17 sensitive_max_mut 475 A 229 A475V (149), A475S (68), A475T (11), A475P (1)
CAB-C19 sensitive_max_mut 460 N 29 N460I (15), N460K (6), N460T (5), N460S (3)
CAB-C19 sensitive_max_mut 456 F 11 F456L (9), F456Y (2)
CAB-A49 sensitive_max_mut 456 F 11 F456L (9), F456Y (2)
CAB-A17 sensitive_max_mut 456 F 11 F456L (9), F456Y (2)
Plotting which antibodies / sera are escaped by mutations at all sites of escape with at least 5 mutation counts and saving to results/GISAID_mutations/karolinska_mutation_counts.pdf.

png

Plot correlation between escape and natural frequency

First aggregate frequency of mutations and escape fractions:

escape_and_freq = (
    escape_fracs
    .rename(columns={config['mut_metric']: 'mut_escape',
                     config['site_metric']: 'tot_site_escape'})
    .assign(max_site_escape=lambda x: x.groupby(['condition', 'site'])['mut_escape'].transform('max'),
            mean_site_escape=lambda x: x.groupby(['condition', 'site'])['mut_escape'].transform('mean'))
    .merge(mut_counts[['site', 'wildtype', 'mutant', 'frequency']]
                     .rename(columns={'frequency': 'mut_freq'}),
           on=['site', 'wildtype', 'mutant'],
           how='left', validate='many_to_one')
    .assign(mut_freq=lambda x: x['mut_freq'].fillna(0),
            site_freq=lambda x: x.groupby(['condition', 'site'])['mut_freq'].transform('sum'),
            mutation=lambda x: x['wildtype'] + x['site'].astype(str) + x['mutant'],
            )
    )

display(HTML(escape_and_freq.head().to_html()))
condition site wildtype mutant mut_escape tot_site_escape max_site_escape mean_site_escape mut_freq site_freq mutation
0 CAB-A17_139 331 N A 0.001006 0.01628 0.00109 0.001017 0.0 0.000029 N331A
1 CAB-A17_139 331 N D 0.001013 0.01628 0.00109 0.001017 0.0 0.000029 N331D
2 CAB-A17_139 331 N E 0.000979 0.01628 0.00109 0.001017 0.0 0.000029 N331E
3 CAB-A17_139 331 N F 0.001090 0.01628 0.00109 0.001017 0.0 0.000029 N331F
4 CAB-A17_139 331 N G 0.001025 0.01628 0.00109 0.001017 0.0 0.000029 N331G

Now make plots. Note that you can configure below exactly what variables you want to plot (mutation frequency, mutation escape, site escape, etc):

# all the parameters below have the indicated defaults, but can be set in `escape_profiles_config`
# via analyze_natural_mutations_specs
default_analysis_specs = {
    'maxcol': 5,  # maximum columns in plot
    'minfreq': 1e-5,  # collapse any natural frequencies < this
    'freq': 'site_freq',  # type of frequency to plot: mut_freq or site_freq
    'escape': 'mean_site_escape',  # type of escape to plot: mean_site_escape, mut_escape, max_site_escape, tot_site_escape
    'xlabel': 'frequency of mutations at site',
    'ylabel': 'mean escape at site',
    'label_minfreq': 5e-5,  # label points with frequency >= this and...
    'label_minescape': 0.1,  # label points with escape >= this
    'also_label': [],  # also label any points (sites or mutations) listed here
    'label_font_size': 6,  # font size for labeling points
    'default_color': 'black',  # color for points not otherwise specified
    'default_alpha': 0.5,  # default alpha if not specified
    'set_point_color': {},  # set color; key by site / mutation, value is color
    'set_point_alpha': {},  # set alpha; key by site / mutations, value is alpha
    'plot_average_only': False,  # only plot average of conditions, not individual ones
    'scales': 'fixed', # can alternatively specify 'free_y'
    }
label_minfreq = 5e-5  # label points with frequency >= this
label_minescape = 0.05  # label points with escape >= this

for name, specs in escape_profiles_config.items():
    if 'analyze_natural_mutations' not in specs or not specs['analyze_natural_mutations']:
        continue
    print(f"\nAnalyzing natural mutations for {name}")
    
    analysis_specs = copy.deepcopy(default_analysis_specs)
    if 'analyze_natural_mutations_specs' in specs:
        for key, val in specs['analyze_natural_mutations_specs'].items():
            analysis_specs[key] = val
    
    conditions = specs['conditions']
    
    if 'site' in analysis_specs['freq'] and 'site' in analysis_specs['escape']:
        ptlabel = 'site'
    else:
        ptlabel = 'mutation'
    
    df = (escape_and_freq
          .query('condition in @conditions')
          .assign(condition=lambda x: x['condition'].map(conditions))
          .assign(**{analysis_specs['freq']: lambda x: x[analysis_specs['freq']].clip(lower=analysis_specs['minfreq'])})
          [['condition', analysis_specs['escape'], analysis_specs['freq'], ptlabel]]
          .drop_duplicates()
          )

    assert len(conditions) == df['condition'].nunique()
    
    for avg_conditions in (False, True):
        
        if analysis_specs['plot_average_only'] and not avg_conditions:
            continue
        
        if avg_conditions:
            plot_df = df.groupby(ptlabel, as_index=False).aggregate({analysis_specs['freq']: 'mean',
                                                                     analysis_specs['escape']: 'mean'})
            nrow = ncol = 1
            plotfile = os.path.join(config['gisaid_mutations_dir'],
                                    f"{name}_escape_vs_freq_average.pdf")
            print(f"Plotting average across conditions and saving to {plotfile}")
        else:
            nrow = math.ceil(len(conditions) / analysis_specs['maxcol'])
            ncol = min(len(conditions), analysis_specs['maxcol'])
            plot_df = df.copy()
            # make condition categorical to maintain order 
            plot_df=plot_df.assign(condition=lambda x: pd.Categorical(x['condition'], 
                                                                      list(conditions.values()), 
                                                                      ordered=True)
                                  )
            plotfile = os.path.join(config['gisaid_mutations_dir'],
                                    f"{name}_escape_vs_freq_by-condition.pdf")
            print(f"Plotting each condition and saving to {plotfile}")
         
        # color points and set alpha
        set_point_color = collections.defaultdict(lambda: analysis_specs['default_color'])
        set_point_alpha = collections.defaultdict(lambda: analysis_specs['default_alpha'])
        for point, color in analysis_specs['set_point_color'].items():
            set_point_color[point] = color
        for point, alpha in analysis_specs['set_point_alpha'].items():
            set_point_alpha[point] = alpha
        plot_df['color'] = plot_df[ptlabel].map(set_point_color)
        plot_df['alpha'] = plot_df[ptlabel].map(set_point_alpha)
        # need to make color categorical to assign as aesthetic
        colors = plot_df['color'].unique()
        plot_df['color'] = pd.Categorical(plot_df['color'], colors, ordered=True)
            
        label_df = (plot_df
                    .assign(label=lambda x: x[ptlabel].isin(analysis_specs['also_label']))
                    .query(f"label or ({analysis_specs['freq']} >= {analysis_specs['label_minfreq']})")
                    .query(f"label or ({analysis_specs['escape']} >= {analysis_specs['label_minescape']})")
                    )
        
        maxfreq = plot_df[analysis_specs['freq']].max()
        assert analysis_specs['minfreq'] == 10**(int(math.log10(analysis_specs['minfreq'])))
        logxbreaks = list(range(int(math.log10(analysis_specs['minfreq'])), round(math.log10(maxfreq)) + 1, 1))
        xbreaks = [10**logx for logx in logxbreaks]
        xlabels = [f"$10^{{{logx}}}$" for logx in logxbreaks]
        xlabels[0] = f"$<{xlabels[0][1:]}"
        
        p = (ggplot(plot_df) +
             aes(analysis_specs['freq'], analysis_specs['escape'], color='color', alpha='alpha') +
             geom_point() +
             geom_text(data=label_df,
                       mapping=aes(label=ptlabel),
                       size=analysis_specs['label_font_size'],
                       adjust_text={'expand_points': (1.05, 1.2),
                                    'expand_text': (1.05, 1.2)},
                       ) +
             theme_classic() +
             theme(figure_size=(2.5 * ncol, 2.5 * nrow),
                   panel_spacing=0.3,
                   legend_position='none',
                   ) +
             scale_x_log10(name=analysis_specs['xlabel'],
                           breaks=xbreaks,
                           labels=xlabels,
                           expand=(0.07, 0)) +
             ylab(analysis_specs['ylabel']) +
             scale_color_manual(values=colors) +
             scale_alpha_continuous(limits=(0, 1), range=(0, 1))
             )
        if not avg_conditions:
            p = p + facet_wrap('~ condition', ncol=ncol, scales=analysis_specs['scales'])
        p.save(plotfile, verbose=False)
        fig = p.draw()
        display(fig)
       # plt.close(fig)
Analyzing natural mutations for karolinska
Plotting each condition and saving to results/GISAID_mutations/karolinska_escape_vs_freq_by-condition.pdf


/fh/fast/bloom_j/software/miniconda3/envs/SARS-CoV-2-RBD_MAP/lib/python3.7/site-packages/plotnine/facets/facet.py:552: PlotnineWarning: If you need more space for the x-axis tick text use ... + theme(subplots_adjust={'wspace': 0.25}). Choose an appropriate value for 'wspace'.
/fh/fast/bloom_j/software/miniconda3/envs/SARS-CoV-2-RBD_MAP/lib/python3.7/site-packages/plotnine/facets/facet.py:552: PlotnineWarning: If you need more space for the x-axis tick text use ... + theme(subplots_adjust={'wspace': 0.25}). Choose an appropriate value for 'wspace'.

png

Plotting average across conditions and saving to results/GISAID_mutations/karolinska_escape_vs_freq_average.pdf

png

png

png