Skip to content

Latest commit

 

History

History
301 lines (236 loc) · 9.06 KB

counts_to_cells_ratio.md

File metadata and controls

301 lines (236 loc) · 9.06 KB

Ratio of sequencing counts to cells sorted

This Python Jupyter notebook looks at the ratio of sequencing counts to cells sorted for each sample, and flags any where this ratio seems too low.

First, import Python modules:

import Bio.SeqIO

from IPython.display import display, HTML

import matplotlib.pyplot as plt

import pandas as pd

from plotnine import *

import yaml

Read the configuration file:

with open('config.yaml') as f:
    config = yaml.safe_load(f)

Read information about the samples:

samples_df = pd.read_csv(config['barcode_runs'])

Read the variant counts:

variant_counts = pd.read_csv(config['variant_counts'])

Get the "primary target"

wt_seqrecord = Bio.SeqIO.read(config['wildtype_sequence'], 'fasta')
primary_target = wt_seqrecord.name
print(f"Primary target is {primary_target}")
Primary target is SARS-CoV-2

Merge the sample information and aggregated variant counts data frames into a new data frame that has the number of cells sorted and the total variant counts for each sample, as well as the ratio of the variant counts to cells. Importantly, counts and ratios are computed using counts for all variants (SARS-CoV-2) and counts for just variants of the primary target (SARS-CoV-2). The cell counts aren't meaningful for reference samples, so set to NA.

counts_cells = (
    samples_df
    .assign(sample_lib=lambda x: x['sample'] + ', ' + x['library'],
            cells_sorted=lambda x: x['cells_sorted'].where(x['selection'] != 'reference', pd.NA))
    [['sample_lib', 'date', 'experiment', 'antibody', 'concentration', 'selection', 'cells_sorted']]
    .merge(variant_counts
           .assign(sample_lib=lambda x: x['sample'] + ', ' + x['library'],
                   is_primary_target=lambda x: (x['target'] == primary_target).astype(int),
                   count_primary_target=lambda x: x['count'] * x['is_primary_target'])
           .groupby('sample_lib')
           .aggregate(counts_all=pd.NamedAgg('count', 'sum'),
                      counts_primary_target=pd.NamedAgg('count_primary_target', 'sum'))
           .astype(float)
           .reset_index(),
           on='sample_lib', how='left', validate='one_to_one',
           )
    .assign(sample_lib=lambda x: pd.Categorical(x['sample_lib'], reversed(x['sample_lib'].unique()), ordered=True),
            counts_all_to_cells=lambda x: x['counts_all'] / x['cells_sorted'],
            counts_primary_target_to_cells=lambda x: x['counts_primary_target'] / x['cells_sorted'],
            )
    .rename(columns={'counts_all': 'counts all variants',
                     'counts_primary_target': f"counts {primary_target} variants",
                     'counts_all_to_cells': 'counts:cells all variants',
                     'counts_primary_target_to_cells': f"counts:cells {primary_target} variants"
                     })
    )

print(f"First few lines of the data frame, writing entirety to {config['counts_to_cells_csv']}:")
display(HTML(counts_cells.head().to_html(index=False, float_format='{:.2g}'.format)))
counts_cells.to_csv(config['counts_to_cells_csv'], index=False, float_format='%.3g')
First few lines of the data frame, writing entirety to results/counts/counts_to_cells.csv:
sample_lib date experiment antibody concentration selection cells_sorted counts all variants counts SARS-CoV-2 variants counts:cells all variants counts:cells SARS-CoV-2 variants
expt_149-151-none-0-reference, lib1 220128 expt_149-151 none 0 reference <NA> 3e+07 3e+07 <NA> <NA>
expt_149-151-none-0-reference, lib2 220128 expt_149-151 none 0 reference <NA> 3.8e+07 3.8e+07 <NA> <NA>
expt_149-CAB-A17-139-escape, lib1 220128 expt_149 CAB-A17 139 escape 1.1e+06 2e+07 1.9e+07 18 17
expt_149-CAB-A17-139-escape, lib2 220128 expt_149 CAB-A17 139 escape 1.1e+06 2.2e+07 2.1e+07 20 19
expt_150-CAB-A49-222-escape, lib1 220128 expt_150 CAB-A49 222 escape 1.4e+06 2.8e+07 2.6e+07 20 19

Now we plot the number of counts or the counts:cells ratio for all samples. We stratify by reference samples and escape samples, and only plot the counts:cells ratio for escape samples as cells are not sorted for reference samples. We plot the counts / ratios for all variants, and primary target (SARS-CoV-2) variants.

In order to have the axis limits be reasonable, we clip very high / low values and draw dotted black lines to indicate the clipping. For reference sample counts, and escape sample counts:cells ratio, we have minimum desired values. We draw dashed green lines at these minimum values, and keep track of samples that don't achieve the minimum.

min_fail_sample_libs = set([])  # sample_lib that fail desired min

for var, selection, lower_clip, upper_clip, desired_min in [
            ('counts', 'reference', 1e5, None, config['reference_min_counts']),
            ('counts', 'escape', 1e5, None, None),
            ('counts:cells', 'escape', 0.1, 10, config['escape_min_counts_to_cells_ratio']),
            ]:

    # get tidy data frame with samples of interest
    tidy_df = (
        counts_cells
        .query('selection == @selection')
        .melt(id_vars='sample_lib',
              value_vars=[f"{var} all variants", f"{var} {primary_target} variants"],
              var_name='variant_type',
              value_name=var,
              )
        .assign(variant_type=lambda x: x['variant_type'].str.replace(f"{var} ", ''))
        )
    tidy_df[var] = tidy_df[var].clip(lower=lower_clip, upper=upper_clip).astype(float)

    # make plot
    p = (ggplot(tidy_df) +
         aes(var, 'sample_lib', shape='variant_type', color='variant_type') +
         geom_point(size=2, alpha=0.7) +
         theme(figure_size=(4, 0.25 * tidy_df['sample_lib'].nunique())) +
         ggtitle(f"{var} for {selection} samples")
         )
    if var == 'counts':  # plot counts on log scale
        p = p + scale_x_log10()
        
    # add dotted vertical lines if clipping if data
    if (lower_clip is not None) and (lower_clip >= tidy_df[var].min()):
        p = p + geom_vline(xintercept=lower_clip, color='black', linetype='dotted')
    if (upper_clip is not None) and (upper_clip <= tidy_df[var].max()):
        p = p + geom_vline(xintercept=upper_clip, color='black', linetype='dotted')
        
    # draw line at desired minimum value, and identify any samples that fail minimum
    if desired_min is not None:
        p = p + geom_vline(xintercept=desired_min, color='green', linetype='dashed')
        min_fail_sample_libs.update(set(
            tidy_df
            .assign(fail_min=lambda x: x[var] < desired_min)
            .query('fail_min')
            ['sample_lib']
            ))
    
    # draw figure
    fig = p.draw()
    display(fig)
    plt.close(fig)

png

png

png

Now list any samples that fail the minimum counts or counts:cell ratio:

print(f"Reference samples with < {config['reference_min_counts']} counts, "
      f"or escape samples with a counts:cell ratio < {config['escape_min_counts_to_cells_ratio']}.")

display(HTML(
    counts_cells
    .query('sample_lib in @min_fail_sample_libs')
    .to_html(index=False, float_format='{:.2g}'.format)
    ))
Reference samples with < 25000000.0 counts, or escape samples with a counts:cell ratio < 2.5.
sample_lib date experiment antibody concentration selection cells_sorted counts all variants counts SARS-CoV-2 variants counts:cells all variants counts:cells SARS-CoV-2 variants