1 Brief Introduction

PLSDA-batch is a batch-effect correction method based on Projection to Latent Structures Discriminant Analysis. It estimates latent components associated with treatment and batch effects and removes the batch-related variation prior to any downstream analysis. It also enables ordination and data visualisation. PLSDA-batch is particularly suitable for microbiome data because it is non-parametric and multivariate. When combined with the centered log-ratio transformation to address uneven library sizes and compositionality, PLSDA-batch directly tackles key characteristics of microbiome data that existing correction methods often overlook.

In addition to the main method, the R package includes two extensions:

1/ weighted PLSDA-batch designed for unbalanced batch-by-treatment designs commonly encountered in small-sample microbiome studies;

2/ sparse PLSDA-batch which performs discriminative variable selection to reduce overfitting in classification settings.

These two variants broaden the applicability of PLSDA-batch across diverse data scenarios (???).

This vignette covers microbiome data pre-processing, batch-effect detection and visualisation, the use of the PLSDA-batch family of methods, and post-correction evaluation and variable selection. For guidance on method choice according to experimental objectives and study designs, see “Batch Effects Management in Case Studies”.

Note: A new argument mode has been added to PLSDA_batch(). We recommend using “regression” mode, but if you need to reproduce results obtained using PLSDAbatch ≤ 1.6.0, please explicitly set mode = "canonical".

2 Packages installation and loading

First, we load the packages required for the analysis and check their versions.

# CRAN
library(pheatmap)
library(vegan)
library(gridExtra)

# Bioconductor
library(mixOmics)
library(Biobase)
library(SummarizedExperiment)
library(PLSDAbatch)

# print package versions
package.version("pheatmap")

## [1] "1.0.13"

package.version("vegan")

## [1] "2.7-2"

package.version("gridExtra")

## [1] "2.3"

package.version("mixOmics")

## [1] "6.35.0"

package.version("Biobase")

## [1] "2.71.0"

package.version("PLSDAbatch")

## [1] "1.99.1"

3 Case study description

We considered a case study to illustrate the application of PLSDA-batch. The study is described as follows:

\(\color{blue}{\bf{\text{Anaerobic digestion.}}}\) This study investigated microbial indicators that could enhance the efficacy of anaerobic digestion (AD) bioprocesses and prevent system failure (???). The dataset includes 75 samples and 567 microbial variables. Samples were treated with two different ranges of phenol concentration (effect of interest) and processed on five different dates (batch effect). This study exhibits a clear and strong batch effect with an approx. balanced batch x treatment design.

4 Data pre-processing

4.1 Pre-filtering

We load the \(\color{blue}{\text{AD data}}\) stored internally using the data() function, and then extract the batch and treatment information.

# AD data
data("AD_data")
ad.count <- assays(AD_data$FullData)$Count
dim(ad.count)

## [1]  75 567

ad.metadata <- rowData(AD_data$FullData)
ad.batch <- factor(ad.metadata$sequencing_run_date,
    levels = unique(ad.metadata$sequencing_run_date)
)
ad.trt <- as.factor(ad.metadata$initial_phenol_concentration.regroup)
names(ad.batch) <- names(ad.trt) <- rownames(ad.metadata)

The raw \(\color{blue}{\text{AD data}}\) include 567 OTUs and 75 samples. We then use the PreFL() function from our \(\color{orange}{\text{PLSDAbatch}}\) R package to filter the data.

ad.filter.res <- PreFL(data = ad.count)
ad.filter <- ad.filter.res$data.filter
dim(ad.filter)

## [1]  75 231

# zero proportion before filtering
ad.filter.res$zero.prob.before

## [1] 0.6328042

# zero proportion after filtering
ad.filter.res$zero.prob.after

## [1] 0.3806638

After filtering, 231 OTUs remained, and the proportion of zeroes decreased from 63% to 38%.

Note: The PreFL() function is intended for raw count data rather than relative abundance data. We also recommend performing pre-filtering on raw counts to better mitigate compositionality issues.

4.2 Transformation

Prior to CLR transformation, we recommend adding 1 as an offset for raw count data (e.g., \(\color{blue}{\text{AD data}}\)), and half of the minimum value as an offset for relative abundance data. We use the logratio.transfo() function from the \(\color{orange}{\text{mixOmics}}\) package to apply the CLR transformation.

ad.clr <- logratio.transfo(X = ad.filter, logratio = "CLR", offset = 1)
class(ad.clr) <- "matrix"

5 Batch effect detection

5.1 PCA

We apply the pca() function from the \(\color{orange}{\text{mixOmics}}\) package to the \(\color{blue}{\text{AD data}}\) and use the Scatter_Density() function from \(\color{orange}{\text{PLSDAbatch}}\) to display the PCA sample plot with density overlays.

# AD data
ad.pca.before <- mixOmics::pca(ad.clr, ncomp = 3, scale = TRUE)

Scatter_Density(
    components = ad.pca.before$variates$X, comp = c(1, 2),
    expl.var = ad.pca.before$prop_expl_var$X,
    batch = ad.batch, trt = ad.trt,
    title = "AD data", trt.legend.title = "Phenol conc."
)

Figure 1: PCA sample plot with density overlays for the AD data

In the figure above, we observed (1) clear separation between samples treated with different phenol concentrations and (2) noticeable differences between samples sequenced on “14/04/2016” and “21/09/2017” compared with the other dates. Therefore, the date-related batch effect needs to be removed.

5.2 Boxplots and density plots

We first identify the top OTUs driving the major variance in the PCA using the selectVar() function from the \(\color{orange}{\text{mixOmics}}\) package. Each selected OTU can then be visualised using boxplots and density plots with the box_plot() and density_plot() functions in \(\color{orange}{\text{PLSDAbatch}}\).

ad.OTU.name <- selectVar(ad.pca.before, comp = 1)$name[1]
ad.OTU_batch <- data.frame(value = ad.clr[, ad.OTU.name], batch = ad.batch)
box_plot(
    df = ad.OTU_batch, title = paste(ad.OTU.name, "(AD data)"),
    x.angle = 30
)

Figure 2: Boxplots of sample values for OTU28 before batch-effect correction in the AD data

density_plot(df = ad.OTU_batch, title = paste(ad.OTU.name, "(AD data)"))

Figure 3: Density plots of sample values for OTU28 before batch-effect correction in the AD data

The boxplot and density plot indicate a strong date-related batch effect, driven by the differences between “14/04/2016”, “21/09/2017”, and the other dates for “OTU28”.

We also apply a linear regression model to “OTU28” using the linear_regres() function from \(\color{orange}{\text{PLSDAbatch}}\), including batch and treatment effects as covariates. We set “14/04/2016” and “21/09/2017” as the reference batches, respectively, using the relevel() function from \(\color{orange}{\text{stats}}\).

# reference batch: 14/04/2016
ad.batch <- relevel(x = ad.batch, ref = "14/04/2016")

ad.OTU.lm <- linear_regres(
    data = ad.clr[, ad.OTU.name],
    trt = ad.trt, batch.fix = ad.batch,
    type = "linear model"
)
summary(ad.OTU.lm$model$data)

## 
## Call:
## stats::lm(formula = y ~ trt + batch.fix)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9384 -0.3279  0.1635  0.3849  0.9887 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.8501     0.2196   3.871 0.000243 ***
## trt1-2               -1.6871     0.1754  -9.617 2.27e-14 ***
## batch.fix09/04/2015   1.5963     0.2950   5.410 8.55e-07 ***
## batch.fix01/07/2016   2.0839     0.2345   8.886 4.82e-13 ***
## batch.fix14/11/2016   1.7405     0.2480   7.018 1.24e-09 ***
## batch.fix21/09/2017   1.2646     0.2690   4.701 1.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared:  0.7546, Adjusted R-squared:  0.7368 
## F-statistic: 42.44 on 5 and 69 DF,  p-value: < 2.2e-16

# reference batch: 21/09/2017
ad.batch <- relevel(x = ad.batch, ref = "21/09/2017")

ad.OTU.lm <- linear_regres(
    data = ad.clr[, ad.OTU.name],
    trt = ad.trt, batch.fix = ad.batch,
    type = "linear model"
)
summary(ad.OTU.lm$model$data)

## 
## Call:
## stats::lm(formula = y ~ trt + batch.fix)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9384 -0.3279  0.1635  0.3849  0.9887 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.1147     0.2502   8.453 2.97e-12 ***
## trt1-2               -1.6871     0.1754  -9.617 2.27e-14 ***
## batch.fix14/04/2016  -1.2646     0.2690  -4.701 1.28e-05 ***
## batch.fix09/04/2015   0.3317     0.3139   1.056  0.29446    
## batch.fix01/07/2016   0.8193     0.2573   3.185  0.00218 ** 
## batch.fix14/11/2016   0.4759     0.2705   1.760  0.08292 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared:  0.7546, Adjusted R-squared:  0.7368 
## F-statistic: 42.44 on 5 and 69 DF,  p-value: < 2.2e-16

From the linear regression results, we observed P < 0.001 for the regression coefficients associated with all other batches when the reference batch was “14/04/2016”, confirming the differences between samples from “14/04/2016” and those from the other dates, as seen in the previous plots. When the reference batch was “21/09/2017”, we also observed significant differences between “21/09/2017” and “14/04/2016”, as well as between “21/09/2017” and “01/07/2016”. Therefore, a batch effect associated with “21/09/2017” is also present.

5.3 Heatmap

We produce a heatmap using the \(\color{orange}{\text{pheatmap}}\) package. The data first need to be scaled across both OTUs and samples.

# scale the clr data on both OTUs and samples
ad.clr.s <- scale(ad.clr, center = TRUE, scale = TRUE)
ad.clr.ss <- scale(t(ad.clr.s), center = TRUE, scale = TRUE)

ad.anno_col <- data.frame(Batch = ad.batch, Treatment = ad.trt)
ad.anno_colors <- list(
    Batch = color.mixo(seq_len(5)),
    Treatment = pb_color(seq_len(2))
)
names(ad.anno_colors$Batch) <- levels(ad.batch)
names(ad.anno_colors$Treatment) <- levels(ad.trt)

pheatmap(ad.clr.ss,
    cluster_rows = FALSE,
    fontsize_row = 4,
    fontsize_col = 6,
    fontsize = 8,
    clustering_distance_rows = "euclidean",
    clustering_method = "ward.D",
    treeheight_row = 30,
    annotation_col = ad.anno_col,
    annotation_colors = ad.anno_colors,
    border_color = "NA",
    main = "AD data - Scaled"
)

Figure 4: Hierarchical clustering of samples in the AD data

In the heatmap, samples in the \(\color{blue}{\text{AD data}}\) from the batch dated “14/04/2016” were clustered together and were distinct from the other samples, indicating the presence of a batch effect.

5.4 pRDA

We apply pRDA using the varpart() function from the \(\color{orange}{\text{vegan}}\) R package.

# AD data
ad.factors.df <- data.frame(trt = ad.trt, batch = ad.batch)
class(ad.clr) <- "matrix"
ad.rda.before <- varpart(ad.clr, ~trt, ~batch,
    data = ad.factors.df, scale = TRUE
)
ad.rda.before$part$indfract

##                 Df R.squared Adj.R.squared Testable
## [a] = X1|X2      1        NA    0.08943682     TRUE
## [b] = X2|X1      4        NA    0.26604420     TRUE
## [c]              0        NA    0.01296248    FALSE
## [d] = Residuals NA        NA    0.63155651    FALSE

In the results, X1 and X2 represent the first and second covariates fitted in the model. [a] and [b] represent the independent proportions of variance explained by X1 and X2 respectively, and [c] represents the shared variance between X1 and X2. In the \(\color{blue}{\text{AD data}}\), batch variance (X2) was larger than treatment variance (X1) with some interaction variance (shown in line [c], Adj.R.squared = 0.013). A larger shared variance indicates a more unbalanced batch x treatment design is. In this study, we considered the design to be approx. balanced.

6 Batch effect correction

6.1 PLSDA-batch

The PLSDA_batch() function is implemented in the \(\color{orange}{\text{PLSDAbatch}}\) package. To use this function, we need to specify the optimal number of components related to the treatment (ncomp.trt) and batch effects (ncomp.bat).

For versions > 1.6.0, we introduced the mode argument. The default is “regression”, which is also the recommended mode. In our applications, both Y.trt and Y.batch are categorical, and our goal is to understand how X explains and predicts Y: specifically, to remove the information in X that explains Y.batch while preserving the information in X that explains Y.trt.

6.1.1 Regression mode

In the \(\color{blue}{\text{AD data}}\), ad.trt has 2 classes and ad.batch has 5 batches, so we choose 1 component for the treatment effect and 4 components for the batch effect.

In PLS-based methods, a categorical Y variable is converted into a 0/1 dummy matrix, A Y variable with \(k\) classes becomes a matrix with \(k\) columns, but only \(k-1\) of these are informative because the final column can always be derived from the others. Mathmetically, the rank of Y is \(k-1\).

# the optimal number of treatment components
nlevels(ad.trt) - 1

## [1] 1

# the optimal number of batch components
nlevels(ad.batch) - 1

## [1] 4

ad.PLSDA_batch.res.reg <- PLSDA_batch(
    X = ad.clr,
    Y.trt = ad.trt, Y.bat = ad.batch,
    ncomp.trt = 1, ncomp.bat = 4,
    mode = "regression"
)
ad.PLSDA_batch.reg <- ad.PLSDA_batch.res.reg$X.nobatch

6.1.2 Canonical mode

The canonical mode is not recommended for batch-effect removal, as it measures how X and Y co-vary rather then how X explains Y. For PLSDAbatch <= 1.6.0, the canonical mode was used by default. Therefore, if you need to reproduce results obtained with versions ≤ 1.6.0, please explicitly set mode = "canonical". Do not worry, the results from the two modes are generally similar, but biologically the “regression” mode is more appropriate.

The procedure for choosing the optimal number of components is the same in both modes. In canonical mode, however, we can directly visualise that a \(k\)-class Y variable requires only \(k − 1\) informative components, reflecting the effective rank of its dummy-coded representation.

We first convert ad.trt into a dummy matrix using the unmap() function from \(\color{orange}{\text{mixOmics}}\), and then use pls() in “canonical” mode from the same package to calculate the explained variance of Y.

# estimate the number of treatment components
ad.trt.mat <- unmap(ad.trt)
ad.trt.tune.cnc <- pls(
    X = ad.clr, Y = ad.trt.mat,
    ncomp = 2, mode = "canonical"
)
ad.trt.tune.cnc$prop_expl_var$Y # 1

##     comp1     comp2 
## 1.0000000 0.9979148

In the results, 1 component was sufficient to explain 100% of the variance in the outcome dummy matrix ad.trt.mat, as ad.trt contains only 2 classes.

We then use the PLSDA_batch() function, with both treatment and batch grouping information, to visualise the optimal number of batch components to remove.

# estimate the number of batch components
ad.batch.tune.cnc <- PLSDA_batch(
    X = ad.clr,
    Y.trt = ad.trt, Y.bat = ad.batch,
    ncomp.trt = 1, ncomp.bat = 6,
    mode = "canonical"
)
ad.batch.tune.cnc$explained_variance.bat$Y # 4

##     comp1     comp2     comp3     comp4     comp5     comp6 
## 0.2474654 0.2615741 0.2301382 0.2608223 0.2300984 0.2524412

sum(ad.batch.tune.cnc$explained_variance.bat$Y[seq_len(4)])

## [1] 1

For the batch effect, 4 components were able to explain 100% of the variance in the batch dummy matrix, as ad.batch contains 5 groups.

For convenience, the optimal number of components can simply be chosen as nlevels(y)-1.

# the optimal number of treatment components
nlevels(ad.trt) - 1

## [1] 1

# the optimal number of batch components
nlevels(ad.batch) - 1

## [1] 4

ad.PLSDA_batch.res.cnc <- PLSDA_batch(
    X = ad.clr,
    Y.trt = ad.trt, Y.bat = ad.batch,
    ncomp.trt = 1, ncomp.bat = 4,
    mode = "canonical"
)
ad.PLSDA_batch.cnc <- ad.PLSDA_batch.res.cnc$X.nobatch

6.2 sPLSDA-batch

We apply sPLSDA-batch using the same PLSDA_batch() function, but we specify the number of variables to select on each component (usually only treatment-related components, keepX.trt). To determine the optimal number of variables to select, we use the tune.splsda() function from the \(\color{orange}{\text{mixOmics}}\) package (???), testing all candidate values provided in test.keepX for each component.

# estimate the number of variables to select per treatment component
ad.test.keepX <- c(
    seq(1, 10, 1), seq(20, 100, 10),
    seq(150, 231, 50), 231
)
ad.trt.tune.v <- tune.splsda(
    X = ad.clr, Y = ad.trt,
    ncomp = 1, test.keepX = ad.test.keepX,
    folds = 4, nrepeat = 50, seed = 777
)
ad.trt.tune.v$choice.keepX # 100

Here the optimal number of variables to select for the treatment component was 100.

6.2.1 Regression mode

ad.sPLSDA_batch.res.reg <- PLSDA_batch(
    X = ad.clr,
    Y.trt = ad.trt, Y.bat = ad.batch,
    ncomp.trt = 1, keepX.trt = 100,
    ncomp.bat = 4, mode = "regression"
)
ad.sPLSDA_batch.reg <- ad.sPLSDA_batch.res.reg$X.nobatch

6.2.2 Canonical mode

ad.sPLSDA_batch.res.cnc <- PLSDA_batch(
    X = ad.clr,
    Y.trt = ad.trt, Y.bat = ad.batch,
    ncomp.trt = 1, keepX.trt = 100,
    ncomp.bat = 4, mode = "canonical"
)
ad.sPLSDA_batch.cnc <- ad.sPLSDA_batch.res.cnc$X.nobatch

Note: For unbalanced batch x treatment designs (except for nested designs), we can set balance = FALSE in the PLSDA_batch() function to apply weighted PLSDA-batch.

7 Assessing batch effect correction

We apply various visualisation and quantitative methods to assess the effectiveness of batch-effect correction.

7.1 Methods that detect batch effects

PCA

In the \(\color{blue}{\text{AD data}}\), we compared the PCA sample plots before and after batch effect correction.

ad.pca.before <- pca(ad.clr, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch.cnc <- pca(ad.PLSDA_batch.cnc, ncomp = 3, scale = TRUE)
ad.pca.sPLSDA_batch.cnc <- pca(ad.sPLSDA_batch.cnc, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch.reg <- pca(ad.PLSDA_batch.cnc, ncomp = 3, scale = TRUE)
ad.pca.sPLSDA_batch.reg <- pca(ad.sPLSDA_batch.cnc, ncomp = 3, scale = TRUE)

# order batches
ad.batch <- factor(ad.metadata$sequencing_run_date,
    levels = unique(ad.metadata$sequencing_run_date)
)

ad.pca.before.plot <- Scatter_Density(
    components = ad.pca.before$variates$X, comp = c(1, 2),
    expl.var = ad.pca.before$prop_expl_var$X,
    batch = ad.batch,
    trt = ad.trt,
    title = "Before correction"
)

ad.pca.PLSDA_batch.cnc.plot <- Scatter_Density(
    components = ad.pca.PLSDA_batch.cnc$variates$X, comp = c(1, 2),
    expl.var = ad.pca.PLSDA_batch.cnc$prop_expl_var$X,
    batch = ad.batch,
    trt = ad.trt,
    title = "PLSDA-batch (canonical)"
)

ad.pca.sPLSDA_batch.cnc.plot <- Scatter_Density(
    components = ad.pca.sPLSDA_batch.cnc$variates$X, comp = c(1, 2),
    expl.var = ad.pca.sPLSDA_batch.cnc$prop_expl_var$X,
    batch = ad.batch,
    trt = ad.trt,
    title = "sPLSDA-batch (canonical)"
)

ad.pca.PLSDA_batch.reg.plot <- Scatter_Density(
    components = ad.pca.PLSDA_batch.reg$variates$X, comp = c(1, 2),
    expl.var = ad.pca.PLSDA_batch.reg$prop_expl_var$X,
    batch = ad.batch,
    trt = ad.trt,
    title = "PLSDA-batch (regression)"
)

ad.pca.sPLSDA_batch.reg.plot <- Scatter_Density(
    components = ad.pca.sPLSDA_batch.reg$variates$X, comp = c(1, 2),
    expl.var = ad.pca.sPLSDA_batch.reg$prop_expl_var$X,
    batch = ad.batch,
    trt = ad.trt,
    title = "sPLSDA-batch (regression)"
)

Figure 5: PCA sample plots with density overlays before and after batch-effect correction in the AD data

As shown in the PCA sample plots, the differences between samples sequenced on “14/04/2016”, “21/09/2017” and the other dates were removed after batch-effect correction. The data corrected with PLSDA-batch retained slightly more treatment variation, mainly on the first PC, compared with sPLSDA-batch, as indicated by the x-axis label (26%). We can also compare boxplots and density plots for key variables identified by PCA as major drivers of variance, as well as heatmaps showing clear patterns before and after correction (results not shown). No noticeable differences between the two modes can be observed from the PCA plots.

pRDA

We calculate the global explained variance across all microbial variables using pRDA. To achieve this, we loop through each variable in both the original (uncorrected) data and the batch-corrected data. The final results are then visualised using the partVar_plot() function from the \(\color{orange}{\text{PLSDAbatch}}\) package.

# AD data
ad.corrected.list <- list(
    `Before correction` = ad.clr,
    `PLSDA-batch (canonical)` = ad.PLSDA_batch.cnc,
    `sPLSDA-batch (canonical)` = ad.sPLSDA_batch.cnc,
    `PLSDA-batch (regression)` = ad.PLSDA_batch.reg,
    `sPLSDA-batch (regression)` = ad.sPLSDA_batch.reg
)

ad.prop.df <- data.frame(
    Treatment = NA, Batch = NA,
    Intersection = NA,
    Residuals = NA
)
for (i in seq_len(length(ad.corrected.list))) {
    rda.res <- varpart(ad.corrected.list[[i]], ~trt, ~batch,
        data = ad.factors.df, scale = TRUE
    )
    ad.prop.df[i, ] <- rda.res$part$indfract$Adj.R.squared
}

rownames(ad.prop.df) <- names(ad.corrected.list)

ad.prop.df <- ad.prop.df[, c(1, 3, 2, 4)]

partVar_plot(prop.df = ad.prop.df)

Figure 6: Global explained variance before and after batch effect correction for the AD data

As shown in the figure above, the intersection between batch and treatment variance was small (1.3%) for the \(\color{blue}{\text{AD data}}\), indicating that the batch-by-treatment design is not highly unbalanced. Therefore, the unweighted PLSDA-batch and sPLSDA-batch methods were appropriate, and the weighted versions were not required. The sPLSDA-batch–corrected data showed better performance, with lower intersection variance compared with PLSDA-batch. The results from the two modes were similar.

7.2 Other methods

\(\mathbf{R^2}\)

The \(R^2\) values for each variable are calculated using lm() from the \(\color{orange}{\text{stats}}\) package. To compare \(R^2\) values across variables, we scale the corrected data prior to calculating \(R^2\). The results are then visualised using ggplot() from the \(\color{orange}{\text{ggplot2}}\) R package.

# AD data
# scale
ad.corr_scale.list <- lapply(
    ad.corrected.list,
    function(x) {
        apply(x, 2, scale)
    }
)

ad.r_values.list <- list()
for (i in seq_len(length(ad.corr_scale.list))) {
    ad.r_values <- data.frame(trt = NA, batch = NA)
    for (c in seq_len(ncol(ad.corr_scale.list[[i]]))) {
        ad.fit.res.trt <- lm(ad.corr_scale.list[[i]][, c] ~ ad.trt)
        ad.r_values[c, 1] <- summary(ad.fit.res.trt)$r.squared
        ad.fit.res.batch <- lm(ad.corr_scale.list[[i]][, c] ~ ad.batch)
        ad.r_values[c, 2] <- summary(ad.fit.res.batch)$r.squared
    }
    ad.r_values.list[[i]] <- ad.r_values
}
names(ad.r_values.list) <- names(ad.corr_scale.list)

ad.boxp.list <- list()
for (i in seq_len(length(ad.r_values.list))) {
    ad.boxp.list[[i]] <-
        data.frame(
            r2 = c(
                ad.r_values.list[[i]][, "trt"],
                ad.r_values.list[[i]][, "batch"]
            ),
            Effects = as.factor(rep(c("Treatment", "Batch"),
                each = 231
            ))
        )
}
names(ad.boxp.list) <- names(ad.r_values.list)

ad.r2.boxp <- rbind(
    ad.boxp.list$`Before correction`,
    ad.boxp.list$`PLSDA-batch (canonical)`,
    ad.boxp.list$`sPLSDA-batch (canonical)`,
    ad.boxp.list$`PLSDA-batch (regression)`,
    ad.boxp.list$`sPLSDA-batch (regression)`
)

ad.r2.boxp$methods <- rep(
    c(
        "Before correction", "PLSDA-batch (canonical)",
        "sPLSDA-batch (canonical)", "PLSDA-batch (regression)",
        "sPLSDA-batch (regression)"
    ),
    each = 462
)

ad.r2.boxp$methods <- factor(ad.r2.boxp$methods,
    levels = unique(ad.r2.boxp$methods)
)

ggplot(ad.r2.boxp, aes(x = Effects, y = r2, fill = Effects)) +
    geom_boxplot(alpha = 0.80) +
    theme_bw() +
    theme(
        text = element_text(size = 18),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 60, hjust = 1, size = 18),
        axis.text.y = element_text(size = 18),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.position = "right"
    ) +
    facet_grid(~methods) +
    scale_fill_manual(values = pb_color(c(12, 14)))

Figure 7: AD study: \(R^2\) values for each microbial variable before and after batch effect correction

The batch-related variance was reduced after both PLSDA-batch and sPLSDA-batch correction, with PLSDA-batch retaining slightly more treatment-related variance.

##################################
ad.barp.list <- list()
for (i in seq_len(length(ad.r_values.list))) {
    ad.barp.list[[i]] <- data.frame(
        r2 = c(
            sum(ad.r_values.list[[i]][, "trt"]),
            sum(ad.r_values.list[[i]][, "batch"])
        ),
        Effects = c("Treatment", "Batch")
    )
}
names(ad.barp.list) <- names(ad.r_values.list)

ad.r2.barp <- rbind(
    ad.barp.list$`Before correction`,
    ad.barp.list$`PLSDA-batch (canonical)`,
    ad.barp.list$`sPLSDA-batch (canonical)`,
    ad.barp.list$`PLSDA-batch (regression)`,
    ad.barp.list$`sPLSDA-batch (regression)`
)


ad.r2.barp$methods <- rep(
    c(
        "Before correction", "PLSDA-batch (canonical)",
        "sPLSDA-batch (canonical)", "PLSDA-batch (regression)",
        "sPLSDA-batch (regression)"
    ),
    each = 2
)

ad.r2.barp$methods <- factor(ad.r2.barp$methods,
    levels = unique(ad.r2.barp$methods)
)


ggplot(ad.r2.barp, aes(x = Effects, y = r2, fill = Effects)) +
    geom_bar(stat = "identity") +
    theme_bw() +
    theme(
        text = element_text(size = 18),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 60, hjust = 1, size = 18),
        axis.text.y = element_text(size = 18),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.position = "right"
    ) +
    facet_grid(~methods) +
    scale_fill_manual(values = pb_color(c(12, 14)))

Figure 8: AD study: Sum of \(R^2\) values for each microbial variable before and after batch effect correction

The overall sum of \(R^2\) values indicated that sPLSDA-batch removed slightly more batch variance than PLSDA-batch (Regression mode: PLSDA-batch (12.49), sPLSDA-batch (9.27); Canonical mode: PLSDA-batch (12.40), sPLSDA-batch (9.25) but preserved less treatment variance (Regression mode: PLSDA-batch (40.21), sPLSDA-batch (36.80); Canonical mode: PLSDA-batch (40.00), sPLSDA-batch (36.22)) compared with PLSDA-batch.

Alignment scores

To use the alignment_score() function from \(\color{orange}{\text{PLSDAbatch}}\), we need to specify the proportion of variance to explain (var), the number of nearest neighbours (k), and the number of principal components to estimate (ncomp). The results are then visualised using ggplot() from the \(\color{orange}{\text{ggplot2}}\) package.

# AD data
ad.scores <- c()
names(ad.batch) <- rownames(ad.clr)
for (i in seq_len(length(ad.corrected.list))) {
    res <- alignment_score(
        data = ad.corrected.list[[i]],
        batch = ad.batch,
        var = 0.95,
        k = 8,
        ncomp = 50
    )
    ad.scores <- c(ad.scores, res)
}

ad.scores.df <- data.frame(
    scores = ad.scores,
    methods = names(ad.corrected.list)
)

ad.scores.df$methods <- factor(ad.scores.df$methods,
    levels = rev(names(ad.corrected.list))
)


ggplot() +
    geom_col(aes(
        x = ad.scores.df$methods,
        y = ad.scores.df$scores
    )) +
    geom_text(
        aes(
            x = ad.scores.df$methods,
            y = ad.scores.df$scores / 2,
            label = round(ad.scores.df$scores, 3)
        ),
        size = 3, col = "white"
    ) +
    coord_flip() +
    theme_bw() +
    ylab("Alignment Scores") +
    xlab("") +
    ylim(0, 0.85)

Figure 9: Comparison of alignment scores before and after batch effect correction using different methods for the AD data

The alignment scores complement the PCA results, especially when the extent of batch-effect removal is difficult to judge from PCA sample plots. In Figure 5, we can clearly see that all methods reduce the major batch effect, but the PCA plots do not reveal the more subtle differences in how effectively each method performs.

Because a higher alignment score indicates better mixing of samples, the bar plot above shows that sPLSDA-batch achieved superior performance compared with PLSDA-batch.

8 Variable selection

After batch effect correction, we can identify discriminative variables associated with different treatments.

Here, we use the splsda() function from \(\color{orange}{\text{mixOmics}}\) to select the top 50 microbial variables that, in combination, discriminate the treatment groups in the \(\color{blue}{\text{AD data}}\).

For details on how to apply sPLS-DA, see the mixOmics documentation: mixOmics.

splsda.plsda_batch.reg <- splsda(
    X = ad.PLSDA_batch.reg, Y = ad.trt,
    ncomp = 3, keepX = rep(50, 3)
)
select.plsda_batch.reg <- selectVar(splsda.plsda_batch.reg, comp = 1)
head(select.plsda_batch.reg$value)

##        value.var
## OTU46  0.3034711
## OTU24  0.2816639
## OTU28  0.2815924
## OTU35 -0.2666526
## OTU61  0.2614321
## OTU68  0.2544252

splsda.splsda_batch.reg <- splsda(
    X = ad.sPLSDA_batch.reg, Y = ad.trt,
    ncomp = 3, keepX = rep(50, 3)
)
select.splsda_batch.reg <- selectVar(splsda.splsda_batch.reg, comp = 1)
head(select.splsda_batch.reg$value)

##        value.var
## OTU46  0.2949833
## OTU24  0.2744206
## OTU28  0.2716075
## OTU35 -0.2621345
## OTU61  0.2588398
## OTU68  0.2492678

length(intersect(select.plsda_batch.reg$name, select.splsda_batch.reg$name))

## [1] 44

The discriminative variables were selected and ranked according to their contributions to separating the sample groups treated with different ranges of phenol concentration (0–0.5 vs. 1–2 g/L).

The overlap between the variables selected from the data corrected with PLSDA-batch and sPLSDA-batch was high (44 out of 50), although some differences between the two selections remained.

9 Session Information

sessionInfo()

## R Under development (unstable) (2025-12-22 r89219)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] PLSDAbatch_1.99.1           SummarizedExperiment_1.41.0
##  [3] GenomicRanges_1.63.1        Seqinfo_1.1.0              
##  [5] IRanges_2.45.0              S4Vectors_0.49.0           
##  [7] MatrixGenerics_1.23.0       matrixStats_1.5.0          
##  [9] Biobase_2.71.0              BiocGenerics_0.57.0        
## [11] generics_0.1.4              mixOmics_6.35.0            
## [13] ggplot2_4.0.1               lattice_0.22-7             
## [15] MASS_7.3-65                 gridExtra_2.3              
## [17] vegan_2.7-2                 permute_0.9-8              
## [19] pheatmap_1.0.13             BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##  [1] Rdpack_2.6.4                    rlang_1.1.6                    
##  [3] magrittr_2.0.4                  otel_0.2.0                     
##  [5] compiler_4.6.0                  mgcv_1.9-4                     
##  [7] vctrs_0.6.5                     reshape2_1.4.5                 
##  [9] stringr_1.6.0                   crayon_1.5.3                   
## [11] pkgconfig_2.0.3                 fastmap_1.2.0                  
## [13] magick_2.9.0                    backports_1.5.0                
## [15] XVector_0.51.0                  labeling_0.4.3                 
## [17] rmarkdown_2.30                  nloptr_2.2.1                   
## [19] tinytex_0.58                    purrr_1.2.0                    
## [21] xfun_0.55                       cachem_1.1.0                   
## [23] jsonlite_2.0.0                  DelayedArray_0.37.0            
## [25] BiocParallel_1.45.0             broom_1.0.11                   
## [27] parallel_4.6.0                  cluster_2.1.8.1                
## [29] R6_2.6.1                        bslib_0.9.0                    
## [31] stringi_1.8.7                   RColorBrewer_1.1-3             
## [33] car_3.1-3                       boot_1.3-32                    
## [35] jquerylib_0.1.4                 numDeriv_2016.8-1.1            
## [37] Rcpp_1.1.0.8.1                  bookdown_0.46                  
## [39] knitr_1.51                      Matrix_1.7-4                   
## [41] splines_4.6.0                   igraph_2.2.1                   
## [43] tidyselect_1.2.1                dichromat_2.0-0.1              
## [45] abind_1.4-8                     yaml_2.3.12                    
## [47] TreeSummarizedExperiment_2.19.0 codetools_0.2-20               
## [49] tibble_3.3.0                    lmerTest_3.1-3                 
## [51] plyr_1.8.9                      treeio_1.35.0                  
## [53] withr_3.0.2                     rARPACK_0.11-0                 
## [55] S7_0.2.1                        evaluate_1.0.5                 
## [57] Biostrings_2.79.3               pillar_1.11.1                  
## [59] BiocManager_1.30.27             ggpubr_0.6.2                   
## [61] carData_3.0-5                   reformulas_0.4.3               
## [63] ellipse_0.5.0                   insight_1.4.4                  
## [65] tidytree_0.4.6                  scales_1.4.0                   
## [67] minqa_1.2.8                     glue_1.8.0                     
## [69] lazyeval_0.2.2                  tools_4.6.0                    
## [71] lme4_1.1-38                     RSpectra_0.16-2                
## [73] ggsignif_0.6.4                  fs_1.6.6                       
## [75] grid_4.6.0                      ape_5.8-1                      
## [77] tidyr_1.3.2                     rbibutils_2.4                  
## [79] SingleCellExperiment_1.33.0     nlme_3.1-168                   
## [81] performance_0.15.3              Formula_1.2-5                  
## [83] cli_3.6.5                       rappdirs_0.3.3                 
## [85] S4Arrays_1.11.1                 dplyr_1.1.4                    
## [87] corpcor_1.6.10                  gtable_0.3.6                   
## [89] yulab.utils_0.2.3               rstatix_0.7.3                  
## [91] sass_0.4.10                     digest_0.6.39                  
## [93] SparseArray_1.11.10             ggrepel_0.9.6                  
## [95] farver_2.1.2                    htmltools_0.5.9                
## [97] lifecycle_1.0.4

Contents