PLSDA-batch is a batch-effect correction method based on Projection to Latent Structures Discriminant Analysis. It estimates latent components associated with treatment and batch effects and removes the batch-related variation prior to any downstream analysis. It also enables ordination and data visualisation. PLSDA-batch is particularly suitable for microbiome data because it is non-parametric and multivariate. When combined with the centered log-ratio transformation to address uneven library sizes and compositionality, PLSDA-batch directly tackles key characteristics of microbiome data that existing correction methods often overlook.
In addition to the main method, the R package includes two extensions:
1/ weighted PLSDA-batch designed for unbalanced batch-by-treatment designs commonly encountered in small-sample microbiome studies;
2/ sparse PLSDA-batch which performs discriminative variable selection to reduce overfitting in classification settings.
These two variants broaden the applicability of PLSDA-batch across diverse data scenarios (???).
This vignette covers microbiome data pre-processing, batch-effect detection and visualisation, the use of the PLSDA-batch family of methods, and post-correction evaluation and variable selection. For guidance on method choice according to experimental objectives and study designs, see “Batch Effects Management in Case Studies”.
Note: A new argument mode has been added to PLSDA_batch(). We recommend using “regression” mode, but if you need to reproduce results obtained using PLSDAbatch ≤ 1.6.0, please explicitly set mode = "canonical".
First, we load the packages required for the analysis and check their versions.
# CRAN
library(pheatmap)
library(vegan)
library(gridExtra)
# Bioconductor
library(mixOmics)
library(Biobase)
library(SummarizedExperiment)
library(PLSDAbatch)
# print package versions
package.version("pheatmap")
## [1] "1.0.13"
package.version("vegan")
## [1] "2.7-2"
package.version("gridExtra")
## [1] "2.3"
package.version("mixOmics")
## [1] "6.35.0"
package.version("Biobase")
## [1] "2.71.0"
package.version("PLSDAbatch")
## [1] "1.99.1"
We considered a case study to illustrate the application of PLSDA-batch. The study is described as follows:
\(\color{blue}{\bf{\text{Anaerobic digestion.}}}\) This study investigated microbial indicators that could enhance the efficacy of anaerobic digestion (AD) bioprocesses and prevent system failure (???). The dataset includes 75 samples and 567 microbial variables. Samples were treated with two different ranges of phenol concentration (effect of interest) and processed on five different dates (batch effect). This study exhibits a clear and strong batch effect with an approx. balanced batch x treatment design.
We load the \(\color{blue}{\text{AD data}}\) stored internally using the data() function, and then extract the batch and treatment information.
# AD data
data("AD_data")
ad.count <- assays(AD_data$FullData)$Count
dim(ad.count)
## [1] 75 567
ad.metadata <- rowData(AD_data$FullData)
ad.batch <- factor(ad.metadata$sequencing_run_date,
levels = unique(ad.metadata$sequencing_run_date)
)
ad.trt <- as.factor(ad.metadata$initial_phenol_concentration.regroup)
names(ad.batch) <- names(ad.trt) <- rownames(ad.metadata)
The raw \(\color{blue}{\text{AD data}}\) include 567 OTUs and 75 samples.
We then use the PreFL() function from our \(\color{orange}{\text{PLSDAbatch}}\)
R package to filter the data.
ad.filter.res <- PreFL(data = ad.count)
ad.filter <- ad.filter.res$data.filter
dim(ad.filter)
## [1] 75 231
# zero proportion before filtering
ad.filter.res$zero.prob.before
## [1] 0.6328042
# zero proportion after filtering
ad.filter.res$zero.prob.after
## [1] 0.3806638
After filtering, 231 OTUs remained, and the proportion of zeroes decreased from 63% to 38%.
Note: The PreFL() function is intended for raw count data rather than
relative abundance data. We also recommend performing pre-filtering on raw
counts to better mitigate compositionality issues.
Prior to CLR transformation, we recommend adding 1 as an offset for raw count data
(e.g., \(\color{blue}{\text{AD data}}\)), and half of the minimum value as
an offset for relative abundance data. We use the
logratio.transfo() function from the \(\color{orange}{\text{mixOmics}}\) package to
apply the CLR transformation.
ad.clr <- logratio.transfo(X = ad.filter, logratio = "CLR", offset = 1)
class(ad.clr) <- "matrix"
We apply the pca() function from the \(\color{orange}{\text{mixOmics}}\) package to
the \(\color{blue}{\text{AD data}}\) and use the Scatter_Density() function from
\(\color{orange}{\text{PLSDAbatch}}\) to display the PCA sample plot with
density overlays.
# AD data
ad.pca.before <- mixOmics::pca(ad.clr, ncomp = 3, scale = TRUE)
Scatter_Density(
components = ad.pca.before$variates$X, comp = c(1, 2),
expl.var = ad.pca.before$prop_expl_var$X,
batch = ad.batch, trt = ad.trt,
title = "AD data", trt.legend.title = "Phenol conc."
)
Figure 1: PCA sample plot with density overlays for the AD data
In the figure above, we observed (1) clear separation between samples treated with different phenol concentrations and (2) noticeable differences between samples sequenced on “14/04/2016” and “21/09/2017” compared with the other dates. Therefore, the date-related batch effect needs to be removed.
We first identify the top OTUs driving the major variance in the PCA using the
selectVar() function from the \(\color{orange}{\text{mixOmics}}\) package. Each selected OTU can then be visualised using boxplots and density plots with the box_plot() and density_plot() functions in \(\color{orange}{\text{PLSDAbatch}}\).
ad.OTU.name <- selectVar(ad.pca.before, comp = 1)$name[1]
ad.OTU_batch <- data.frame(value = ad.clr[, ad.OTU.name], batch = ad.batch)
box_plot(
df = ad.OTU_batch, title = paste(ad.OTU.name, "(AD data)"),
x.angle = 30
)
Figure 2: Boxplots of sample values for OTU28 before batch-effect correction in the AD data
density_plot(df = ad.OTU_batch, title = paste(ad.OTU.name, "(AD data)"))
Figure 3: Density plots of sample values for OTU28 before batch-effect correction in the AD data
The boxplot and density plot indicate a strong date-related batch effect, driven by the differences between “14/04/2016”, “21/09/2017”, and the other dates for “OTU28”.
We also apply a linear regression model to “OTU28” using the linear_regres() function from \(\color{orange}{\text{PLSDAbatch}}\), including batch and treatment effects as covariates. We set “14/04/2016” and “21/09/2017” as the reference batches,
respectively, using the relevel() function from \(\color{orange}{\text{stats}}\).
# reference batch: 14/04/2016
ad.batch <- relevel(x = ad.batch, ref = "14/04/2016")
ad.OTU.lm <- linear_regres(
data = ad.clr[, ad.OTU.name],
trt = ad.trt, batch.fix = ad.batch,
type = "linear model"
)
summary(ad.OTU.lm$model$data)
##
## Call:
## stats::lm(formula = y ~ trt + batch.fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9384 -0.3279 0.1635 0.3849 0.9887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8501 0.2196 3.871 0.000243 ***
## trt1-2 -1.6871 0.1754 -9.617 2.27e-14 ***
## batch.fix09/04/2015 1.5963 0.2950 5.410 8.55e-07 ***
## batch.fix01/07/2016 2.0839 0.2345 8.886 4.82e-13 ***
## batch.fix14/11/2016 1.7405 0.2480 7.018 1.24e-09 ***
## batch.fix21/09/2017 1.2646 0.2690 4.701 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared: 0.7546, Adjusted R-squared: 0.7368
## F-statistic: 42.44 on 5 and 69 DF, p-value: < 2.2e-16
# reference batch: 21/09/2017
ad.batch <- relevel(x = ad.batch, ref = "21/09/2017")
ad.OTU.lm <- linear_regres(
data = ad.clr[, ad.OTU.name],
trt = ad.trt, batch.fix = ad.batch,
type = "linear model"
)
summary(ad.OTU.lm$model$data)
##
## Call:
## stats::lm(formula = y ~ trt + batch.fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9384 -0.3279 0.1635 0.3849 0.9887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1147 0.2502 8.453 2.97e-12 ***
## trt1-2 -1.6871 0.1754 -9.617 2.27e-14 ***
## batch.fix14/04/2016 -1.2646 0.2690 -4.701 1.28e-05 ***
## batch.fix09/04/2015 0.3317 0.3139 1.056 0.29446
## batch.fix01/07/2016 0.8193 0.2573 3.185 0.00218 **
## batch.fix14/11/2016 0.4759 0.2705 1.760 0.08292 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared: 0.7546, Adjusted R-squared: 0.7368
## F-statistic: 42.44 on 5 and 69 DF, p-value: < 2.2e-16
From the linear regression results, we observed P < 0.001 for the regression coefficients associated with all other batches when the reference batch was “14/04/2016”, confirming the differences between samples from “14/04/2016” and those from the other dates, as seen in the previous plots. When the reference batch was “21/09/2017”, we also observed significant differences between “21/09/2017” and “14/04/2016”, as well as between “21/09/2017” and “01/07/2016”. Therefore, a batch effect associated with “21/09/2017” is also present.
We produce a heatmap using the \(\color{orange}{\text{pheatmap}}\) package. The data first need to be scaled across both OTUs and samples.
# scale the clr data on both OTUs and samples
ad.clr.s <- scale(ad.clr, center = TRUE, scale = TRUE)
ad.clr.ss <- scale(t(ad.clr.s), center = TRUE, scale = TRUE)
ad.anno_col <- data.frame(Batch = ad.batch, Treatment = ad.trt)
ad.anno_colors <- list(
Batch = color.mixo(seq_len(5)),
Treatment = pb_color(seq_len(2))
)
names(ad.anno_colors$Batch) <- levels(ad.batch)
names(ad.anno_colors$Treatment) <- levels(ad.trt)
pheatmap(ad.clr.ss,
cluster_rows = FALSE,
fontsize_row = 4,
fontsize_col = 6,
fontsize = 8,
clustering_distance_rows = "euclidean",
clustering_method = "ward.D",
treeheight_row = 30,
annotation_col = ad.anno_col,
annotation_colors = ad.anno_colors,
border_color = "NA",
main = "AD data - Scaled"
)
Figure 4: Hierarchical clustering of samples in the AD data
In the heatmap, samples in the \(\color{blue}{\text{AD data}}\) from the batch dated “14/04/2016” were clustered together and were distinct from the other samples, indicating the presence of a batch effect.
We apply pRDA using the varpart() function from the \(\color{orange}{\text{vegan}}\)
R package.
# AD data
ad.factors.df <- data.frame(trt = ad.trt, batch = ad.batch)
class(ad.clr) <- "matrix"
ad.rda.before <- varpart(ad.clr, ~trt, ~batch,
data = ad.factors.df, scale = TRUE
)
ad.rda.before$part$indfract
## Df R.squared Adj.R.squared Testable
## [a] = X1|X2 1 NA 0.08943682 TRUE
## [b] = X2|X1 4 NA 0.26604420 TRUE
## [c] 0 NA 0.01296248 FALSE
## [d] = Residuals NA NA 0.63155651 FALSE
In the results, X1 and X2 represent the first and second covariates fitted
in the model. [a] and [b] represent the independent proportions of variance
explained by X1 and X2 respectively, and [c] represents the shared
variance between X1 and X2. In the \(\color{blue}{\text{AD data}}\),
batch variance (X2) was larger than treatment variance (X1) with some
interaction variance (shown in line [c], Adj.R.squared = 0.013). A
larger shared variance indicates a more unbalanced batch x treatment design
is. In this study, we considered the design to be approx. balanced.
The PLSDA_batch() function is implemented in the
\(\color{orange}{\text{PLSDAbatch}}\) package. To use this function, we need to
specify the optimal number of components related to the treatment (ncomp.trt) and
batch effects (ncomp.bat).
For versions > 1.6.0, we introduced the mode argument. The default is “regression”, which is also the recommended mode. In our applications, both Y.trt and Y.batch are categorical, and our goal is to understand how X explains and predicts Y: specifically, to remove the information in X that explains Y.batch while preserving the information in X that explains Y.trt.
In the \(\color{blue}{\text{AD data}}\), ad.trt has 2 classes and ad.batch has 5 batches, so we choose 1 component for the treatment effect and 4 components for the batch effect.
In PLS-based methods, a categorical Y variable is converted into a 0/1 dummy matrix, A Y variable with \(k\) classes becomes a matrix with \(k\) columns, but only \(k-1\) of these are informative because the final column can always be derived from the others. Mathmetically, the rank of Y is \(k-1\).
# the optimal number of treatment components
nlevels(ad.trt) - 1
## [1] 1
# the optimal number of batch components
nlevels(ad.batch) - 1
## [1] 4
ad.PLSDA_batch.res.reg <- PLSDA_batch(
X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, ncomp.bat = 4,
mode = "regression"
)
ad.PLSDA_batch.reg <- ad.PLSDA_batch.res.reg$X.nobatch
The canonical mode is not recommended for batch-effect removal, as it measures how X and Y co-vary rather then how X explains Y. For PLSDAbatch <= 1.6.0, the canonical mode was used by default. Therefore, if you need to reproduce results obtained with versions ≤ 1.6.0, please explicitly set mode = "canonical". Do not worry, the results from the two modes are generally similar, but biologically the “regression” mode is more appropriate.
The procedure for choosing the optimal number of components is the same in both modes. In canonical mode, however, we can directly visualise that a \(k\)-class Y variable requires only \(k − 1\) informative components, reflecting the effective rank of its dummy-coded representation.
We first convert ad.trt into a dummy matrix using the unmap() function from
\(\color{orange}{\text{mixOmics}}\), and then use pls() in “canonical” mode from the same package to calculate the explained variance of Y.
# estimate the number of treatment components
ad.trt.mat <- unmap(ad.trt)
ad.trt.tune.cnc <- pls(
X = ad.clr, Y = ad.trt.mat,
ncomp = 2, mode = "canonical"
)
ad.trt.tune.cnc$prop_expl_var$Y # 1
## comp1 comp2
## 1.0000000 0.9979148
In the results, 1 component was sufficient to explain 100% of the variance in the outcome dummy matrix ad.trt.mat, as ad.trt contains only 2 classes.
We then use the PLSDA_batch() function, with both treatment and batch grouping
information, to visualise the optimal number of batch components to remove.
# estimate the number of batch components
ad.batch.tune.cnc <- PLSDA_batch(
X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, ncomp.bat = 6,
mode = "canonical"
)
ad.batch.tune.cnc$explained_variance.bat$Y # 4
## comp1 comp2 comp3 comp4 comp5 comp6
## 0.2474654 0.2615741 0.2301382 0.2608223 0.2300984 0.2524412
sum(ad.batch.tune.cnc$explained_variance.bat$Y[seq_len(4)])
## [1] 1
For the batch effect, 4 components were able to explain 100% of the variance in the batch dummy matrix, as ad.batch contains 5 groups.
For convenience, the optimal number of components can simply be chosen as nlevels(y)-1.
# the optimal number of treatment components
nlevels(ad.trt) - 1
## [1] 1
# the optimal number of batch components
nlevels(ad.batch) - 1
## [1] 4
ad.PLSDA_batch.res.cnc <- PLSDA_batch(
X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, ncomp.bat = 4,
mode = "canonical"
)
ad.PLSDA_batch.cnc <- ad.PLSDA_batch.res.cnc$X.nobatch
We apply sPLSDA-batch using the same PLSDA_batch() function, but we specify
the number of variables to select on each component (usually only
treatment-related components, keepX.trt). To determine the optimal number of
variables to select, we use the tune.splsda() function from the
\(\color{orange}{\text{mixOmics}}\) package (???), testing all candidate values provided in test.keepX for each component.
# estimate the number of variables to select per treatment component
ad.test.keepX <- c(
seq(1, 10, 1), seq(20, 100, 10),
seq(150, 231, 50), 231
)
ad.trt.tune.v <- tune.splsda(
X = ad.clr, Y = ad.trt,
ncomp = 1, test.keepX = ad.test.keepX,
folds = 4, nrepeat = 50, seed = 777
)
ad.trt.tune.v$choice.keepX # 100
Here the optimal number of variables to select for the treatment component was 100.
ad.sPLSDA_batch.res.reg <- PLSDA_batch(
X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, keepX.trt = 100,
ncomp.bat = 4, mode = "regression"
)
ad.sPLSDA_batch.reg <- ad.sPLSDA_batch.res.reg$X.nobatch
ad.sPLSDA_batch.res.cnc <- PLSDA_batch(
X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, keepX.trt = 100,
ncomp.bat = 4, mode = "canonical"
)
ad.sPLSDA_batch.cnc <- ad.sPLSDA_batch.res.cnc$X.nobatch
Note: For unbalanced batch x treatment designs (except for nested designs), we can set balance = FALSE in the PLSDA_batch() function to apply weighted PLSDA-batch.
We apply various visualisation and quantitative methods to assess the effectiveness of batch-effect correction.
PCA
In the \(\color{blue}{\text{AD data}}\), we compared the PCA sample plots before and after batch effect correction.
ad.pca.before <- pca(ad.clr, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch.cnc <- pca(ad.PLSDA_batch.cnc, ncomp = 3, scale = TRUE)
ad.pca.sPLSDA_batch.cnc <- pca(ad.sPLSDA_batch.cnc, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch.reg <- pca(ad.PLSDA_batch.cnc, ncomp = 3, scale = TRUE)
ad.pca.sPLSDA_batch.reg <- pca(ad.sPLSDA_batch.cnc, ncomp = 3, scale = TRUE)
# order batches
ad.batch <- factor(ad.metadata$sequencing_run_date,
levels = unique(ad.metadata$sequencing_run_date)
)
ad.pca.before.plot <- Scatter_Density(
components = ad.pca.before$variates$X, comp = c(1, 2),
expl.var = ad.pca.before$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "Before correction"
)
ad.pca.PLSDA_batch.cnc.plot <- Scatter_Density(
components = ad.pca.PLSDA_batch.cnc$variates$X, comp = c(1, 2),
expl.var = ad.pca.PLSDA_batch.cnc$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "PLSDA-batch (canonical)"
)
ad.pca.sPLSDA_batch.cnc.plot <- Scatter_Density(
components = ad.pca.sPLSDA_batch.cnc$variates$X, comp = c(1, 2),
expl.var = ad.pca.sPLSDA_batch.cnc$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "sPLSDA-batch (canonical)"
)
ad.pca.PLSDA_batch.reg.plot <- Scatter_Density(
components = ad.pca.PLSDA_batch.reg$variates$X, comp = c(1, 2),
expl.var = ad.pca.PLSDA_batch.reg$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "PLSDA-batch (regression)"
)
ad.pca.sPLSDA_batch.reg.plot <- Scatter_Density(
components = ad.pca.sPLSDA_batch.reg$variates$X, comp = c(1, 2),
expl.var = ad.pca.sPLSDA_batch.reg$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "sPLSDA-batch (regression)"
)
Figure 5: PCA sample plots with density overlays before and after batch-effect correction in the AD data
As shown in the PCA sample plots, the differences between samples sequenced on “14/04/2016”, “21/09/2017” and the other dates were removed after batch-effect correction. The data corrected with PLSDA-batch retained slightly more treatment variation, mainly on the first PC, compared with sPLSDA-batch, as indicated by the x-axis label (26%). We can also compare boxplots and density plots for key variables identified by PCA as major drivers of variance, as well as heatmaps showing clear patterns before and after correction (results not shown). No noticeable differences between the two modes can be observed from the PCA plots.
pRDA
We calculate the global explained variance across all microbial variables
using pRDA. To achieve this, we loop through each variable in both the original (uncorrected) data and the batch-corrected data. The final results are then visualised using the partVar_plot() function from the
\(\color{orange}{\text{PLSDAbatch}}\) package.
# AD data
ad.corrected.list <- list(
`Before correction` = ad.clr,
`PLSDA-batch (canonical)` = ad.PLSDA_batch.cnc,
`sPLSDA-batch (canonical)` = ad.sPLSDA_batch.cnc,
`PLSDA-batch (regression)` = ad.PLSDA_batch.reg,
`sPLSDA-batch (regression)` = ad.sPLSDA_batch.reg
)
ad.prop.df <- data.frame(
Treatment = NA, Batch = NA,
Intersection = NA,
Residuals = NA
)
for (i in seq_len(length(ad.corrected.list))) {
rda.res <- varpart(ad.corrected.list[[i]], ~trt, ~batch,
data = ad.factors.df, scale = TRUE
)
ad.prop.df[i, ] <- rda.res$part$indfract$Adj.R.squared
}
rownames(ad.prop.df) <- names(ad.corrected.list)
ad.prop.df <- ad.prop.df[, c(1, 3, 2, 4)]
partVar_plot(prop.df = ad.prop.df)
Figure 6: Global explained variance before and after batch effect correction for the AD data
As shown in the figure above, the intersection between batch and treatment variance was small (1.3%) for the \(\color{blue}{\text{AD data}}\), indicating that the batch-by-treatment design is not highly unbalanced. Therefore, the unweighted PLSDA-batch and sPLSDA-batch methods were appropriate, and the weighted versions were not required. The sPLSDA-batch–corrected data showed better performance, with lower intersection variance compared with PLSDA-batch. The results from the two modes were similar.
\(\mathbf{R^2}\)
The \(R^2\) values for each variable are calculated using lm() from the
\(\color{orange}{\text{stats}}\) package. To compare \(R^2\) values across
variables, we scale the corrected data prior to calculating \(R^2\). The results
are then visualised using ggplot() from the \(\color{orange}{\text{ggplot2}}\) R package.
# AD data
# scale
ad.corr_scale.list <- lapply(
ad.corrected.list,
function(x) {
apply(x, 2, scale)
}
)
ad.r_values.list <- list()
for (i in seq_len(length(ad.corr_scale.list))) {
ad.r_values <- data.frame(trt = NA, batch = NA)
for (c in seq_len(ncol(ad.corr_scale.list[[i]]))) {
ad.fit.res.trt <- lm(ad.corr_scale.list[[i]][, c] ~ ad.trt)
ad.r_values[c, 1] <- summary(ad.fit.res.trt)$r.squared
ad.fit.res.batch <- lm(ad.corr_scale.list[[i]][, c] ~ ad.batch)
ad.r_values[c, 2] <- summary(ad.fit.res.batch)$r.squared
}
ad.r_values.list[[i]] <- ad.r_values
}
names(ad.r_values.list) <- names(ad.corr_scale.list)
ad.boxp.list <- list()
for (i in seq_len(length(ad.r_values.list))) {
ad.boxp.list[[i]] <-
data.frame(
r2 = c(
ad.r_values.list[[i]][, "trt"],
ad.r_values.list[[i]][, "batch"]
),
Effects = as.factor(rep(c("Treatment", "Batch"),
each = 231
))
)
}
names(ad.boxp.list) <- names(ad.r_values.list)
ad.r2.boxp <- rbind(
ad.boxp.list$`Before correction`,
ad.boxp.list$`PLSDA-batch (canonical)`,
ad.boxp.list$`sPLSDA-batch (canonical)`,
ad.boxp.list$`PLSDA-batch (regression)`,
ad.boxp.list$`sPLSDA-batch (regression)`
)
ad.r2.boxp$methods <- rep(
c(
"Before correction", "PLSDA-batch (canonical)",
"sPLSDA-batch (canonical)", "PLSDA-batch (regression)",
"sPLSDA-batch (regression)"
),
each = 462
)
ad.r2.boxp$methods <- factor(ad.r2.boxp$methods,
levels = unique(ad.r2.boxp$methods)
)
ggplot(ad.r2.boxp, aes(x = Effects, y = r2, fill = Effects)) +
geom_boxplot(alpha = 0.80) +
theme_bw() +
theme(
text = element_text(size = 18),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1, size = 18),
axis.text.y = element_text(size = 18),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
legend.position = "right"
) +
facet_grid(~methods) +
scale_fill_manual(values = pb_color(c(12, 14)))
Figure 7: AD study: \(R^2\) values for each microbial variable before and after batch effect correction
The batch-related variance was reduced after both PLSDA-batch and sPLSDA-batch correction, with PLSDA-batch retaining slightly more treatment-related variance.
##################################
ad.barp.list <- list()
for (i in seq_len(length(ad.r_values.list))) {
ad.barp.list[[i]] <- data.frame(
r2 = c(
sum(ad.r_values.list[[i]][, "trt"]),
sum(ad.r_values.list[[i]][, "batch"])
),
Effects = c("Treatment", "Batch")
)
}
names(ad.barp.list) <- names(ad.r_values.list)
ad.r2.barp <- rbind(
ad.barp.list$`Before correction`,
ad.barp.list$`PLSDA-batch (canonical)`,
ad.barp.list$`sPLSDA-batch (canonical)`,
ad.barp.list$`PLSDA-batch (regression)`,
ad.barp.list$`sPLSDA-batch (regression)`
)
ad.r2.barp$methods <- rep(
c(
"Before correction", "PLSDA-batch (canonical)",
"sPLSDA-batch (canonical)", "PLSDA-batch (regression)",
"sPLSDA-batch (regression)"
),
each = 2
)
ad.r2.barp$methods <- factor(ad.r2.barp$methods,
levels = unique(ad.r2.barp$methods)
)
ggplot(ad.r2.barp, aes(x = Effects, y = r2, fill = Effects)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(
text = element_text(size = 18),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1, size = 18),
axis.text.y = element_text(size = 18),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
legend.position = "right"
) +
facet_grid(~methods) +
scale_fill_manual(values = pb_color(c(12, 14)))
Figure 8: AD study: Sum of \(R^2\) values for each microbial variable before and after batch effect correction
The overall sum of \(R^2\) values indicated that sPLSDA-batch removed slightly more batch variance than PLSDA-batch (Regression mode: PLSDA-batch (12.49), sPLSDA-batch (9.27); Canonical mode: PLSDA-batch (12.40), sPLSDA-batch (9.25) but preserved less treatment variance (Regression mode: PLSDA-batch (40.21), sPLSDA-batch (36.80); Canonical mode: PLSDA-batch (40.00), sPLSDA-batch (36.22)) compared with PLSDA-batch.
Alignment scores
To use the alignment_score() function from
\(\color{orange}{\text{PLSDAbatch}}\), we need to specify the proportion of
variance to explain (var), the number of nearest neighbours (k), and the
number of principal components to estimate (ncomp). The results are then visualised using ggplot() from the \(\color{orange}{\text{ggplot2}}\) package.
# AD data
ad.scores <- c()
names(ad.batch) <- rownames(ad.clr)
for (i in seq_len(length(ad.corrected.list))) {
res <- alignment_score(
data = ad.corrected.list[[i]],
batch = ad.batch,
var = 0.95,
k = 8,
ncomp = 50
)
ad.scores <- c(ad.scores, res)
}
ad.scores.df <- data.frame(
scores = ad.scores,
methods = names(ad.corrected.list)
)
ad.scores.df$methods <- factor(ad.scores.df$methods,
levels = rev(names(ad.corrected.list))
)
ggplot() +
geom_col(aes(
x = ad.scores.df$methods,
y = ad.scores.df$scores
)) +
geom_text(
aes(
x = ad.scores.df$methods,
y = ad.scores.df$scores / 2,
label = round(ad.scores.df$scores, 3)
),
size = 3, col = "white"
) +
coord_flip() +
theme_bw() +
ylab("Alignment Scores") +
xlab("") +
ylim(0, 0.85)
Figure 9: Comparison of alignment scores before and after batch effect correction using different methods for the AD data
The alignment scores complement the PCA results, especially when the extent of batch-effect removal is difficult to judge from PCA sample plots. In Figure 5, we can clearly see that all methods reduce the major batch effect, but the PCA plots do not reveal the more subtle differences in how effectively each method performs.
Because a higher alignment score indicates better mixing of samples, the bar plot above shows that sPLSDA-batch achieved superior performance compared with PLSDA-batch.
After batch effect correction, we can identify discriminative variables associated with different treatments.
Here, we use the splsda() function from \(\color{orange}{\text{mixOmics}}\) to select the
top 50 microbial variables that, in combination, discriminate the
treatment
groups in the \(\color{blue}{\text{AD data}}\).
For details on how to apply sPLS-DA, see the mixOmics documentation: mixOmics.
splsda.plsda_batch.reg <- splsda(
X = ad.PLSDA_batch.reg, Y = ad.trt,
ncomp = 3, keepX = rep(50, 3)
)
select.plsda_batch.reg <- selectVar(splsda.plsda_batch.reg, comp = 1)
head(select.plsda_batch.reg$value)
## value.var
## OTU46 0.3034711
## OTU24 0.2816639
## OTU28 0.2815924
## OTU35 -0.2666526
## OTU61 0.2614321
## OTU68 0.2544252
splsda.splsda_batch.reg <- splsda(
X = ad.sPLSDA_batch.reg, Y = ad.trt,
ncomp = 3, keepX = rep(50, 3)
)
select.splsda_batch.reg <- selectVar(splsda.splsda_batch.reg, comp = 1)
head(select.splsda_batch.reg$value)
## value.var
## OTU46 0.2949833
## OTU24 0.2744206
## OTU28 0.2716075
## OTU35 -0.2621345
## OTU61 0.2588398
## OTU68 0.2492678
length(intersect(select.plsda_batch.reg$name, select.splsda_batch.reg$name))
## [1] 44
The discriminative variables were selected and ranked according to their contributions to separating the sample groups treated with different ranges of phenol concentration (0–0.5 vs. 1–2 g/L).
The overlap between the variables selected from the data corrected with PLSDA-batch and sPLSDA-batch was high (44 out of 50), although some differences between the two selections remained.
sessionInfo()
## R Under development (unstable) (2025-12-22 r89219)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] PLSDAbatch_1.99.1 SummarizedExperiment_1.41.0
## [3] GenomicRanges_1.63.1 Seqinfo_1.1.0
## [5] IRanges_2.45.0 S4Vectors_0.49.0
## [7] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [9] Biobase_2.71.0 BiocGenerics_0.57.0
## [11] generics_0.1.4 mixOmics_6.35.0
## [13] ggplot2_4.0.1 lattice_0.22-7
## [15] MASS_7.3-65 gridExtra_2.3
## [17] vegan_2.7-2 permute_0.9-8
## [19] pheatmap_1.0.13 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.4 rlang_1.1.6
## [3] magrittr_2.0.4 otel_0.2.0
## [5] compiler_4.6.0 mgcv_1.9-4
## [7] vctrs_0.6.5 reshape2_1.4.5
## [9] stringr_1.6.0 crayon_1.5.3
## [11] pkgconfig_2.0.3 fastmap_1.2.0
## [13] magick_2.9.0 backports_1.5.0
## [15] XVector_0.51.0 labeling_0.4.3
## [17] rmarkdown_2.30 nloptr_2.2.1
## [19] tinytex_0.58 purrr_1.2.0
## [21] xfun_0.55 cachem_1.1.0
## [23] jsonlite_2.0.0 DelayedArray_0.37.0
## [25] BiocParallel_1.45.0 broom_1.0.11
## [27] parallel_4.6.0 cluster_2.1.8.1
## [29] R6_2.6.1 bslib_0.9.0
## [31] stringi_1.8.7 RColorBrewer_1.1-3
## [33] car_3.1-3 boot_1.3-32
## [35] jquerylib_0.1.4 numDeriv_2016.8-1.1
## [37] Rcpp_1.1.0.8.1 bookdown_0.46
## [39] knitr_1.51 Matrix_1.7-4
## [41] splines_4.6.0 igraph_2.2.1
## [43] tidyselect_1.2.1 dichromat_2.0-0.1
## [45] abind_1.4-8 yaml_2.3.12
## [47] TreeSummarizedExperiment_2.19.0 codetools_0.2-20
## [49] tibble_3.3.0 lmerTest_3.1-3
## [51] plyr_1.8.9 treeio_1.35.0
## [53] withr_3.0.2 rARPACK_0.11-0
## [55] S7_0.2.1 evaluate_1.0.5
## [57] Biostrings_2.79.3 pillar_1.11.1
## [59] BiocManager_1.30.27 ggpubr_0.6.2
## [61] carData_3.0-5 reformulas_0.4.3
## [63] ellipse_0.5.0 insight_1.4.4
## [65] tidytree_0.4.6 scales_1.4.0
## [67] minqa_1.2.8 glue_1.8.0
## [69] lazyeval_0.2.2 tools_4.6.0
## [71] lme4_1.1-38 RSpectra_0.16-2
## [73] ggsignif_0.6.4 fs_1.6.6
## [75] grid_4.6.0 ape_5.8-1
## [77] tidyr_1.3.2 rbibutils_2.4
## [79] SingleCellExperiment_1.33.0 nlme_3.1-168
## [81] performance_0.15.3 Formula_1.2-5
## [83] cli_3.6.5 rappdirs_0.3.3
## [85] S4Arrays_1.11.1 dplyr_1.1.4
## [87] corpcor_1.6.10 gtable_0.3.6
## [89] yulab.utils_0.2.3 rstatix_0.7.3
## [91] sass_0.4.10 digest_0.6.39
## [93] SparseArray_1.11.10 ggrepel_0.9.6
## [95] farver_2.1.2 htmltools_0.5.9
## [97] lifecycle_1.0.4