# Regression Hazards

“Technical confounders” are likely correlated with real biology. Regressing them out can badly distort the data.

When analyzing single-cell sequencing data, it is natural to want to remove the effect of technical confounders. Statistics such as the cellular detection rate (number of genes detected per cell), library size (number of reads or UMI per cell), or percent mitochondrial reads may all represent some kind of bias. If those statistics vary between cell types, however, regressing them out will pull the expression vectors for those cell types closer together.

We can see a few examples of this phenomenon in the Tabula Muris dataset.

library(Seurat)
library(here)
library(tidyverse)

tmd <- CreateSeuratObject(raw.data = tm.droplet.matrix, meta.data = tm.droplet.metadata, project = "TabulaMuris")

# Only keep annotated cells
annotated_cells = tm.droplet.metadata %>% filter(!is.na(cell_ontology_class)) %>% pull(cell)
tmd <- SubsetData(tmd, cells.use = annotated_cells, do.clean = TRUE)

tmd <- NormalizeData(tmd)


Consider the Bladder. Because the epithelial cells have significantly more UMI on average than the mesenchymal or endothelial cells, we are set up for Simpson’s paradox: even if the expression of a gene is positively correlated with nUMI within each cell type, it may be negatively correlated if all cell types are considered together.

FetchData(tmd, c('nUMI', 'tissue', 'free_annotation'))%>% filter(tissue == 'Bladder') %>%
ggplot(aes(x = tissue, y = log(nUMI), fill = free_annotation)) + geom_boxplot() + coord_flip()


For example, the mesenchymal marker Car3 is positively correlated with nUMI within each cell type and negatively correlated at the level of the ensemble.

FetchData(tmd, c('nUMI', 'tissue', 'Car3', 'free_annotation')) %>% filter(tissue == 'Bladder') %>%
ggplot(aes(log(nUMI), Car3, color = free_annotation)) + geom_jitter(width = 0.2, alpha = 0.3) + geom_smooth(method = 'lm') +
geom_smooth(method = 'lm', color = 'black')


A similar phenomenon can be seen in the tongue, where the number of genes detected per cell differs between cell types.

FetchData(tmd, c('nGene', 'tissue', 'cell_ontology_class'))%>% filter(tissue == 'Tongue') %>%
ggplot(aes(x = tissue, y = nGene, fill = cell_ontology_class)) + geom_boxplot() + coord_flip()


Here, the proliferation marker Top2a has quite different correlations with nGene globally and within each cell type.

FetchData(tmd, c('nGene', 'tissue', 'Top2a', 'cell_ontology_class')) %>% filter(tissue == 'Tongue') %>%
ggplot(aes(nGene, Top2a, color = cell_ontology_class)) + geom_jitter(width = 0.2, alpha = 0.3) + geom_smooth(method = 'lm') +
geom_smooth(method = 'lm', color = 'black')


In fact, this sort of paradox is possible in every organ we collected. Each organ has a cell type which is an outlier for nGene, setting up a potential Simpson’s paradox for each gene differentially expressed in that cell type.

FetchData(tmd, c('nGene', 'tissue', 'cell_ontology_class'))%>%
ggplot(aes(x = tissue, y = nGene, fill = cell_ontology_class)) + geom_boxplot() + coord_flip() +
ggtitle('nGene variability between cell types within tissues') +
theme(axis.text = element_text(size=40), text = element_text(size=40), legend.position="none")


Any form of regularization (regressing confounders, batch effect correction) has the potential to introduce artifacts. This is why it’s essential that, whatever regularization you used to find clusters or trajectories, you use raw gene counts for computing differential expression. Terms like nUMI, nGene, or batch can be used as covariates in a regression model for gene expression. By giving that final model access to the covariates, instead of trying to ‘adjust’ for them beforehand, you increase the likelihood that the DE genes you find come from real biological variation.

Published on October 19, 2018 by Joshua Batson