In this vignette, we describe usage of a suite of tools, SEESAW, Statistical Estimation of allelic Expression using Salmon and Swish, which allow for testing allelic imbalance across samples.

The methods are described in Wu et al. (2022) doi: 10.1101/2022.08.12.503785.

SEESAW makes use of Swish (Zhu et al. 2019) for paired inference, which is an extension of the SAMseq (Li and Tibshirani 2011) methods for permutation-based FDR control.

Quick start

Minimal code for running SEESAW is shown below.

On the command line:

# generate diploid txome with g2gtools:
> salmon index -p #threads -t diploid_fasta -i diploid_txome --keepDuplicates
> salmon quant -i diploid_txome -l A -p #threads \
  --numBootstraps 30 -o output -1 read1 -2 read2

From within R/Bioconductor:

# first build tx2gene to gene- or TSS-level
# (for isoform level skip `tx2gene`)
# gene level:
tx2gene <- transcripts(edb) %>%
  select(tx_id, group_id=gene_id)
# TSS level:
tx2gene <- makeTx2Tss(edb, maxgap=50) %>%    
  select(tx_id, gene_id, group_id, tss)
# import counts:
y <- importAllelicCounts(
  coldata, a1="alt", a2="ref",
  format="wide", tx2gene=tx2gene
# testing with Swish:
y <- labelKeep(y)
y <- y[mcols(y)$keep,]
# see below for other tests and details
y <- swish(y, x="allele", pair="sample", fast=1) 
mcols(y)$qvalue # <-- gives FDR-bounded sets

Method overview

Type of tests

SEESAW allows for testing global allelic imbalance across all samples (pairwise testing within each individual), as well as differential, or dynamic allelic imbalance (pairwise allelic fold changes estimated within individual, followed by testing across two groups, or along an additional covariate). Each of these allelic imbalance (AI) analyses takes into account the potentially heterogeneous amount of inferential uncertainty per sample, per feature (transcript, transcript-group, or gene), and per allele.

Steps in SEESAW

Running SEESAW involves generation of a diploid transcriptome (e.g. using g2gtools, construction of a diploid Salmon index (specifying --keepDuplicates), followed by Salmon quantification with a number of bootstrap inferential replicates (we recommend 30 bootstrap replicates). These three steps (diploid reference preparation, indexing, quantification with bootstraps) provide the input data for the following statistical analyses in R/Bioconductor. The steps shown in this vignette leverage Bioconductor infrastructure including SummarizedExperiment for storage of input data and results, tximport for data import, and GRanges and Gviz for plotting.

In short the SEESAW steps are as listed, and diagrammed below:

  1. g2gtools (diploid reference preparation)
  2. Salmon indexing with --keepDuplicates
  3. Salmon quantification with bootstraps
  4. makeTx2Tss() aggregates data to TSS-level (optional)
  5. importAllelicCounts() creates a SummarizedExperiment
  6. Swish analysis: labelKeep() and swish() (skip scaling)
  7. Plotting

Below we demonstrate an analysis where transcripts are grouped by their transcription start site (TSS), although gene-level or transcript-level analysis is also possible. Additionally, any custom grouping could be used, by manually generating a t2g table as shown below. Special plotting functions in fishpond facilitate visualization of allelic and isoform changes at different resolutions, alongside gene models. In three examples, we perform global AI testing, differential AI testing, and dynamic AI testing, in all cases on simulated data associated with human genes.

Linking transcripts to TSS

We begin assuming steps 1-3 have been completed. We can use the makeTx2Tss function to generate a GRanges object t2g that connects transcripts to transcript groups.

edb <- EnsDb.Hsapiens.v86
t2g <- makeTx2Tss(edb) # GRanges object
## DataFrame with 216741 rows and 2 columns
##                           tx_id               group_id
##                     <character>            <character>
## ENST00000456328 ENST00000456328  ENSG00000223972-11869
## ENST00000450305 ENST00000450305  ENSG00000223972-12010
## ENST00000488147 ENST00000488147  ENSG00000227232-29570
## ENST00000619216 ENST00000619216  ENSG00000278267-17436
## ENST00000473358 ENST00000473358  ENSG00000243485-29554
## ...                         ...                    ...
## ENST00000420810 ENST00000420810 ENSG00000224240-2654..
## ENST00000456738 ENST00000456738 ENSG00000227629-2659..
## ENST00000435945 ENST00000435945 ENSG00000237917-2663..
## ENST00000435741 ENST00000435741 ENSG00000231514-2662..
## ENST00000431853 ENST00000431853 ENSG00000235857-5685..

Alternatively for gene-level analysis, one could either prepare a t2g data.frame with tx_id and gene_id columns, or a t2g GRanges object with a column group_id that is equal to gene_id.

Importing allelic counts

Here we will use simulated data, but we can import allelic counts with the importAllelicCounts() function. It is best to read over the manual page for this function. For TSS-level analysis, the t2g GRanges generated above should be passed to the tx2gene argument. This will summarize transcript-level counts to the TSS level, and will attach rowRanges that provide the genomic locations of the grouped transcripts. Note that importAllelicCounts does not yet have the ability to automatically generate ranges based on sequence hashing (as occurs in tximeta).

Filtering features

Because we use --keepDuplicates in the step when we build the Salmon index, there will be a number of features in which there is no information about the allelic expression in the reads. We can find these features in bootstrap data by examining when the inferential replicates are nearly identical for the two alleles, as this is how the EM will split the reads. Removing these features avoids downstream problems during differential testing. Code for this filtering follows:

n <- ncol(y)/2
rep1a1 <- assay(y, "infRep1")[,y$allele == "a1"]
rep1a2 <- assay(y, "infRep1")[,y$allele == "a2"]
mcols(y)$someInfo <- rowSums(abs(rep1a1 - rep1a2) < 1) < n
y <- y[ mcols(y)$someInfo, ]

Global allelic imbalance

We begin by generating a simulated data object that resembles what one would obtain with importAllelicCounts(). The import function arranges the a2 (non-effect) allelic counts first, followed by the a1 (effect) allelic counts. Allelic ratios are calculated as a1/a2, which follows the notational standard in PLINK and other tools.

y <- makeSimSwishData(allelic=TRUE)
## DataFrame with 20 rows and 2 columns
##          allele   sample
##        <factor> <factor>
## s1-a2        a2  sample1
## s2-a2        a2  sample2
## s3-a2        a2  sample3
## s4-a2        a2  sample4
## s5-a2        a2  sample5
## ...         ...      ...
## s6-a1        a1 sample6 
## s7-a1        a1 sample7 
## s8-a1        a1 sample8 
## s9-a1        a1 sample9 
## s10-a1       a1 sample10
levels(y$allele) # a1/a2 allelic fold changes
## [1] "a2" "a1"

A hidden code chunk is used to add ranges from the EnsDb to the simulated dataset. For a real dataset, the ranges would be added either by importAllelicCounts (if using tx2gene) or could be added manually for transcript- or gene-level analysis, using the rowRanges<- setter function. The ranges are only needed for the plotAllelicGene plotting function below.

<hidden code chunk>

We can already plot a heatmap of allelic ratios, before performing statistical testing. We can see in the first gene, ADSS, there appear to be two groups of transcripts with opposing allelic fold change. SEESAW makes use of pheatmap for plotting a heatmap of allelic ratios.

y <- computeInfRV(y) # for posterior mean, variance
gene <- rowRanges(y)$gene_id[1]
idx <- mcols(y)$gene_id == gene
plotAllelicHeatmap(y, idx=idx)