Contents

1 Introduction

If many genes are perturbed in a population of cells, this can lead to diseases like cancer. The perturbations can happen in different ways, e.g. via mutations, copy number abberations or methylation. However, not all perturbations are observed in all samples.

Nested Effects Model-based perturbation inference (NEM\(\pi\)) uses observed perturbation profiles and gene expression data to infer unobserved perturbations and augment observed ones. The causal network of the perturbed genes (P-genes) is modelled as an adjacency matrix \(\phi\) and the genes with observed gene expression (E-genes) are modelled with the attachment \(\theta\) with \(\theta_{ij}=1\), if E-gene \(j\) is attached to S-gene \(i\). If E-gene \(j\) is attached to P-gene \(i\), \(j\) shows an effect for a perturbation of P-gene \(i\). Hence, \(\phi\theta\) predicts gene expression profiles, which can be compared to the real data. NEM\(\pi\) iteratively infers a network \(\phi\) based on gene expression profiles and a perturbation profile, and the perturbation profile based on a network \(\phi\).

2 Installation and loading

Use devtools to install the latest version from github or use the BiocManahger to install the package from bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("nempi")

Load the package with the library function.

library(nempi)

3 Small example

We look at a small example for which we first simulate data and then infer the unobserved perturbations. We draw a random network for the perturbed genes (P-genes). Then we simulate gene expression for each sample given the subset of P-genes that has been perturbed in the sample. E.g., if P-gene A is upstream of P-gene B and P-gene A has been perturbed, all E-genes attached to B also show an effect.

3.1 Data simulation

library(mnem)
seed <- 8675309
Pgenes <- 10
Egenes <- 5
samples <- 100
edgeprob <- 0.5
uninform <- floor((Pgenes * Egenes) * 0.1)
Nems <- mw <- 1
noise <- 1
multi <- c(0.2, 0.1)
set.seed(seed)
simmini <- simData(Sgenes = Pgenes, Egenes = Egenes, Nems = Nems, mw = mw, nCells = samples,
    uninform = uninform, multi = multi, badCells = floor(samples * 0.1), edgeprob = edgeprob)
data <- simmini$data
ones <- which(data == 1)
zeros <- which(data == 0)
data[ones] <- rnorm(length(ones), 1, noise)
data[zeros] <- rnorm(length(zeros), -1, noise)
epiNEM::HeatmapOP(data, col = "RdBu", cexRow = 0.75, cexCol = 0.75, bordercol = "transparent",
    xrot = 0, dendrogram = "both")
Heatmap of the simulated log odds. Effects are blue and no effects
are red. Rows denote the observed E-genes and columns the samples annoted by
P-genes. Each P-gene
has been perturbed in many cells. The E-genes are annotated as how they are
attached in the ground truth. E.g. E-genes named '1' are attached to S-gene
'1' in the ground truth.

Figure 1: Heatmap of the simulated log odds
Effects are blue and no effects are red. Rows denote the observed E-genes and columns the samples annoted by P-genes. Each P-gene has been perturbed in many cells. The E-genes are annotated as how they are attached in the ground truth. E.g. E-genes named ‘1’ are attached to S-gene ‘1’ in the ground truth.

The typical data input for NEM\(\pi\) consists of a data matrix with samples as columns and E-genes as rows. The columns are either labeled by their perturbed gene(s) or unlabeled (default: ““). After the data simulation all samples are labeled. We unlabel \(50\%\) of the sample and pretend we do not know, which P-gene has been perturbed.

lost <- sample(1:ncol(data), floor(ncol(data) * 0.5))
colnames(data)[lost] <- ""

3.2 Perturbation inference

We use NEM\(\pi\) and other methods to infer the perturbations. We use the area under the precision-recall curve as the two measures of accuracy. We also plot the NEM\(\pi\) result.

res <- nempi(data)
fit <- pifit(res, simmini, data)
print(fit$auc)
## [1] 0.7788576
ressvm <- classpi(data)
fit <- pifit(ressvm, simmini, data, propagate = FALSE)
print(fit$auc)
## [1] 0.7090377
resnn <- classpi(data, method = "nnet")
fit <- pifit(resnn, simmini, data, propagate = FALSE)
print(fit$auc)
## [1] 0.6273228
resrf <- classpi(data, method = "randomForest")
fit <- pifit(resrf, simmini, data, propagate = FALSE)
print(fit$auc)
## [1] 0.6987766
col <- rgb(seq(0, 1, length.out = 10), seq(1, 0, length.out = 10), seq(1, 0, length.out = 10))
plot(res, heatlist = list(col = "RdBu"), barlist = list(col = col))

Compared to support vector machines (svm), neural nets (nnet) and random forest (rf) class prediction, NEM\(\pi\) achieves a higher accuracy.

Note that NEM\(\pi\) is in general more powerful, if the P-genes are connected in a denser network. The other methods perform equally well for sparse or even disconnected network \(\phi\). However, they usually profit from combinatorial perturbations.

3.3 Prior matrix

Alternatively, we can also provide NEM\(\pi\) with a probabilistic perturbation matrix \(\Gamma\) as a prior. In the rows are the potentially perturbed genes and in the columns are the samples. The entries are between \(0\) and \(1\), with the sum of all entries of a sample summing to \(1\).

Gamma <- matrix(0, Pgenes, ncol(data))
rownames(Gamma) <- seq_len(Pgenes)
colnames(Gamma) <- colnames(data)
for (i in seq_len(Pgenes)) {
    Gamma[i, grep(paste0("^", i, "_|_", i, "$|_", i, "_|^", i, "$"), colnames(data))] <- 1
}
Gamma <- apply(Gamma, 2, function(x) return(x/sum(x)))
Gamma[is.na(Gamma)] <- 0

epiNEM::HeatmapOP(Gamma, col = "RdBu", cexRow = 0.75, cexCol = 0.75, bordercol = "transparent",
    xrot = 0, dendrogram = "both")