Quantifying batch effects with HVP

Wei Xin Chan1,2*, Hafiz Bin Ismail1, Wilson Wen Bin Goh1,2,3,4,5 and Limsoon Wong6

1School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore
2Center for Biomedical Informatics, Nanyang Technological University, 59 Nanyang Drive, 636921, Singapore
3Lee Kong Chian School of Medicine, Nanyang Technological University, 59 Nanyang Drive, 636921, Singapore
4Center for AI in Medicine, Nanyang Technological University, 59 Nanyang Drive, 636921, Singapore
5Division of Neurology, Department of Brain Sciences, Faculty of Medicine, Imperial College London, Burlington Danes, The Hammersmith Hospital, Du Cane Road, London, W12 0NN, United Kingdom
6School of Computing, National University of Singapore, 13 Computing Drive, 117417, Singapore

*cweixin@gmail.com

22 June 2025

Abstract

Batch effects are technical biases that can mislead data interpretation and analysis. Accurate detection and quantification of batch effects are critical for determining whether a correction is necessary and for assessing the effectiveness of the correction. Unfortunately, existing batch effect metrics often lack objectivity and are confounded by factors such as batch-class imbalance or high sensitivity to internal parameters, which results in additional errors. We present hierarchical variance partitioning (HVP), a quantitative batch effects metric that estimates the proportion of variance associated with batch effects in a data set. HVP is based on a “partition of sum of squares” concept that is used in the analysis of variance (ANOVA).

Package

HVP

1 Introduction
2 Installation
3 Quick start
4 Quantifying batch effects using HVP
- 4.1 Class: SummarizedExperiment / SingleCellExperiment
- 4.2 Class: Seurat
5 Testing the statistical significance of batch effects
6 Simulating microarray gene expression data
7 HVP: Theory
- 7.1 Batch effects
- 7.2 Hierarchical variance partitioning
8 Session info
References

1 Introduction

Hierarchical variance partitioning (HVP) is a quantitative batch effect metric that estimates the proportion of variance associated with batch effects in a data set. HVP is suitable for use in all types of high-dimensional biological data, such as bulk gene expression, single-cell RNA-sequencing (scRNA-seq), and proteomics data. In addition, HVP can be used on both small and large data sets. HVP is designed to be robust to batch-class imbalance, where different batches have different proportions of classes. We implement HVP as an R package, with the main functionality of the package provided through the HVP function. HVP is developed to be compatible with existing Bioconductor classes (e.g. SummarizedExperiment, SingleCellExperiment) and the Seurat class for scRNA-seq analysis. HVP is intended for use on log-normalised data. In the following sections, we demonstrate how to use the HVP function to quantify batch effects. For details on how HVP works, please refer to the theory section.

2 Installation

The HVP package can be installed from GitHub using the devtools package as follows:

# Install package: devtools if not present
if (!require("devtools", quietly = TRUE))
  install.packages("devtools")

devtools::install_github("dblux/HVP")

3 Quick start

We show how to use the HVP function to quantify the magnitude of batch effects in a matrix object containing log-normalised gene expression values. First, we use the simulateMicroarray function provided in the HVP package to simulate gene expression data with three classes and two batches (see the simulation section for further details). We indicate the batch-class design using a cross table with rows representing classes and columns representing batches, specifying the number of samples in each batch-class group in the corresponding cell of the table. After simulating the data, we provide the 1) matrix containing log-normalised expression values, along with the 2) batch information and 3) class information to the HVP function, which in turn returns a list containing two elements:

HVP: The estimated proportion of variance associated with batch effects in the data set.
sum.squares: A matrix of the sum of squares between batch and total sum of squares for all features.

library(HVP)

crosstab <- matrix(10, 3, 2)
# Simulate microarray data with 100 features
data <- simulateMicroarray(crosstab, 100)

X <- data$X # matrix with dimensions (nfeature, nsamples)
batch <- data$metadata$batch # vector representing batch
class <- data$metadata$class # vector representing class

res <- HVP(X, batch, class)
res@HVP # proportion of variance attributable to batch effects
#> [1] 0.3632038

4 Quantifying batch effects using HVP

The HVP function accepts as input objects from common classes that are used to store high-throughput biological data, such as the SummarizedExperiment, SingleCellExperiment and Seurat, in addition to base classes such as the matrix and data.frame classes. HVP is implemented as a S4 generic function that dispatches to different S4 class methods based on the class of the first argument x. The default S4 method is implemented for the classes matrix or data.frame, and S4 methods for SummarizedExperiment, SingleCellExperiment and Seurat have also been implemented. Users are free to implement their own S4 methods for other classes as well. We demonstrate below how to call the HVP function for objects of different classes. For array-like objects, please refer to the quick start section.

4.1 Class: `SummarizedExperiment` / `SingleCellExperiment`

HVP is compatible with existing Bioconductor workflows, and accepts objects from the SummarizedExperiment class and its descendent classes, such as the SingleCellExperiment class. We simulate scRNA-seq data with batch effects using the splatter package:

library(splatter)
library(scater)

params <- newSplatParams()
params <- setParams(
  params,
  nGenes = 100, # 100 features
  batchCells = c(150, 150), # Simulate 300 cells
  group.prob = c(0.5, 0.5),
  batch.facLoc = 0, # location parameter of log-normal distribution
  batch.facScale = 0.1 # scale parameter of log-normal distribution
)
sce_data <- splatSimulate(params, method = "groups")
sce_data <- logNormCounts(sce_data)

Applying the HVP function on a SingleCellExperiment object is straightforward, with sample metadata being stored in the colData slot of the SingleCellExperiment object. The arguments which need to be provided are:

x: SingleCellExperiment object.
batchname: name of column in metadata that indicates the batch of samples.
clsname: name of the column/s in metadata indicating class information of samples.

res_sce <- HVP(sce_data, "Batch", "Group")

We also provide an example below of how to apply the HVP function on real-world scRNA-seq data that can be downloaded using the ExperimentHub package. The data set is stored as a SingleCellExperiment object.

library(ExperimentHub)
library(scater)

eh <- ExperimentHub()
data <- query(eh, "muscData")
kang <- data[["EH2259"]] # downloads data set
kang_ctrl <- kang[rowSums(counts(kang)) > 0, kang$stim == 'ctrl']
kang_ctrl <- logNormCounts(kang_ctrl)
kang_ctrl$ind <- as.factor(kang_ctrl$ind)

res <- HVP(kang_ctrl, "ind", "cell")
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 2.1 GiB

4.2 Class: `Seurat`

Similarly, HVP can be used to quantify batch effects in Seurat objects by providing it with the arguments:

x: Seurat object.
batchname: name of column in metadata that indicates the batch of samples.
clsname: name of the column/s in metadata indicating class information of samples.

if (requireNamespace("Seurat", quietly = TRUE)) {
  library(Seurat)
  seurat_data <- as.Seurat(sce_data)
  res_seurat <- HVP(seurat_data, "Batch", "Group")
} else {
  message("Please install the 'Seurat' package to run this example.")
}
#> Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
#> ℹ Please use `rlang::check_installed()` instead.
#> ℹ The deprecated feature was likely used in the Seurat package.
#>   Please report the issue at <https://github.com/satijalab/seurat/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `slot` argument of `SetAssayData()` is deprecated as of SeuratObject 5.0.0.
#> ℹ Please use the `layer` argument instead.
#> ℹ The deprecated feature was likely used in the Seurat package.
#>   Please report the issue at <https://github.com/satijalab/seurat/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

5 Testing the statistical significance of batch effects

To determine whether batch effects are statistically significant in a data set, a Monte Carlo permutation test can be performed by setting the optional argument nperm of the HVP function to the desired number of permutations. We recommend performing at least 1000 permutations. In each permutation, batch labels of the samples are permuted and a HVP value is subsequently computed. HVP values measured in all permutations are gathered to form a null distribution of HVP values. The \(p\)-value is calculated by dividing the number of permutations with a higher HVP value than the HVP value observed in the original data, by the total number of permutations. Running HVP with the permutation test option returns a list with two additional elements:

p.value: p-value of the permutation test, i.e. the proportion of permutations with a larger HVP value than the observed HVP value of the original data.
null.distribution: numeric vector containing the null distribution of HVP values.

res_permtest <- HVP(X, batch, class, nperm = 1000)
res_permtest@p.value
#> [1] 0

We plot a histogram of the null distribution of HVP values obtained from the permutation test below and indicate the HVP value observed in the original data set as well.

library(ggplot2)

plot(res_permtest)

6 Simulating microarray gene expression data

We propose a two-stage Bayesian hierarchical model to simulate log expression values of microarray gene expression data with batch and class effects, which is illustrated in the figure below. Our model simulates batch effects in all genes, and is able to simulate both additive and multiplicative batch effects (in log space).

We implement the proposed Bayesian hierarchical model as the simulateMicroarray function, which we include in the HVP package. At the moment, the simulateMicroarray function only has the option to simulate additive batch effects. Users are able to specify the magnitude of batch and class effects, percentage of differentially expressed genes, and other distribution parameters to simulate gene expression data tailored for their needs. The names of the optional arguments correspond to the distribution parameters (highlighted in yellow) in the figure above. Please refer to the documentation provided in R (?simulateMicroarray) for details on each parameter.

The simulateMicroarray function requires a cross table specifying the number of samples in each batch-class group to simulate. The rows and columns of the cross table are to correspond to classes and batches, respectively. We show how to simulate a data set with three classes and two batches, with ten samples in each batch-class group:

crosstab <- matrix(10, 3, 2)
crosstab
#>      [,1] [,2]
#> [1,]   10   10
#> [2,]   10   10
#> [3,]   10   10

data <- simulateMicroarray(crosstab, 100) # 100 genes
names(data)
#> [1] "X"           "metadata"    "diff.genes"  "Y"           "batch.terms"
#> [6] "class.logfc" "batch.logfc" "params"
dim(data$X)
#> [1] 100  60
head(data$metadata)
#>         class batch
#> ID1_A_1     A     1
#> ID2_A_1     A     1
#> ID3_A_1     A     1
#> ID4_A_1     A     1
#> ID5_A_1     A     1
#> ID6_A_1     A     1

X <- data$X
batch <- data$metadata$batch
class <- data$metadata$class

7 HVP: Theory

7.1 Batch effects

Batch effects in gene expression data are generally modelled either as additive or multiplicative batch effects, or both (Lazar et al. 2012). The popular batch effects correction method ComBat (Johnson, Li, and Rabinovic 2007) models batch effects as both additive and multiplicative batch effects as follows:

\[ Y_{ijk} = X_{ijk} + \delta_{ik} + \gamma_{ik}\epsilon_{ijk} \]

where \(Y_{ijk}\) is the measured log-transformed expression value for gene \(i\) of sample \(j\) from batch \(k\), \(X_{ijk}\) is the theoretical gene expression value, \(\epsilon_{ijk}\) represents random Gaussian noise with mean zero and variance \(\sigma^2_i\), and \(\delta_{ik}\) and \(\gamma_{ik}\) represents additive and multiplicative batch effects, respectively. Following the above model, additive batch effects can be visualised intuitively as the distance separating different batches of samples in a PCA plot of log-transformed data, while multiplicative batch effects can be visualised as clusters of samples from different batches having different levels of dispersion.

7.2 Hierarchical variance partitioning

HVP measures the proportion of variance associated with batch effects in a data set. It is based on the partition of sums of squares concept in the analysis of variance (ANOVA), where the total sum of squares is equals to the addition of sum of squares between groups and the sum of squares within groups. The total sum of squares can also be calculated by multiplying the variance of a group of samples by the number of samples less one.

We illustrate the concept of HVP in the figure below. First, classes are used as groups to partition the total sum of squares into sum of squares between classes and within classes. Subsequently, batches are used as groups to partition the sum of squares within each class into the sum of squares between batches and the sum of squares within batches. HVP is calculated by summing up the sum of squares between batches for all classes across all features, before dividing it by the total sum of squares across all features.

We visualise how the sum of squares between batches (across all classes) and the total sum of squares is calculated for a single feature in the figure below. The sum of squares between batches is computed for a class by taking the weighted sum of squared distances between the means of each batch-class group (orange and purple diamonds) and their respective class mean (grey dashed line). Subsequently, these values are summed across all classes with multiple batches. The total sum of squares is computed by taking the squared distance between each sample and the grand mean (black dashed line).

The hierarchical partitioning of the total sum of squares in HVP follows the hierarchical structure of the data, where differences in mean expression values between different classes exist (in differentially expressed genes), and additive batch effects manifest as differences in the mean expression values of samples from different batches within each class.

8 Session info

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] Seurat_5.3.0                SeuratObject_5.1.0         
#>  [3] sp_2.2-0                    muscData_1.23.0            
#>  [5] ExperimentHub_2.99.5        AnnotationHub_3.99.6       
#>  [7] BiocFileCache_2.99.5        dbplyr_2.5.0               
#>  [9] scater_1.37.0               ggplot2_3.5.2              
#> [11] scuttle_1.19.0              splatter_1.33.0            
#> [13] SingleCellExperiment_1.31.0 SummarizedExperiment_1.39.1
#> [15] Biobase_2.69.0              GenomicRanges_1.61.1       
#> [17] Seqinfo_0.99.1              IRanges_2.43.0             
#> [19] S4Vectors_0.47.0            BiocGenerics_0.55.0        
#> [21] generics_0.1.4              MatrixGenerics_1.21.0      
#> [23] matrixStats_1.5.0           HVP_0.99.5                 
#> [25] BiocStyle_2.37.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] RcppAnnoy_0.0.22       splines_4.5.1          later_1.4.2           
#>   [4] filelock_1.0.3         tibble_3.3.0           polyclip_1.10-7       
#>   [7] fastDummies_1.7.5      lifecycle_1.0.4        httr2_1.1.2           
#>  [10] globals_0.18.0         lattice_0.22-7         MASS_7.3-65           
#>  [13] backports_1.5.0        magrittr_2.0.3         plotly_4.11.0         
#>  [16] sass_0.4.10            rmarkdown_2.29         jquerylib_0.1.4       
#>  [19] yaml_2.3.10            httpuv_1.6.16          sctransform_0.4.2     
#>  [22] spam_2.11-1            spatstat.sparse_3.1-0  reticulate_1.42.0     
#>  [25] cowplot_1.1.3          pbapply_1.7-2          DBI_1.2.3             
#>  [28] RColorBrewer_1.1-3     abind_1.4-8            Rtsne_0.17            
#>  [31] purrr_1.0.4            rappdirs_0.3.3         ggrepel_0.9.6         
#>  [34] irlba_2.3.5.1          spatstat.utils_3.1-4   listenv_0.9.1         
#>  [37] goftest_1.2-3          RSpectra_0.16-2        spatstat.random_3.4-1 
#>  [40] fitdistrplus_1.2-2     parallelly_1.45.0      codetools_0.2-20      
#>  [43] DelayedArray_0.35.2    tidyselect_1.2.1       farver_2.1.2          
#>  [46] ScaledMatrix_1.17.0    viridis_0.6.5          spatstat.explore_3.4-3
#>  [49] jsonlite_2.0.0         BiocNeighbors_2.3.1    progressr_0.15.1      
#>  [52] ggridges_0.5.6         survival_3.8-3         progress_1.2.3        
#>  [55] tools_4.5.1            ica_1.0-3              Rcpp_1.0.14           
#>  [58] glue_1.8.0             gridExtra_2.3          SparseArray_1.9.0     
#>  [61] xfun_0.52              dplyr_1.1.4            withr_3.0.2           
#>  [64] BiocManager_1.30.26    fastmap_1.2.0          digest_0.6.37         
#>  [67] rsvd_1.0.5             R6_2.6.1               mime_0.13             
#>  [70] scattermore_1.2        tensor_1.5.1           spatstat.data_3.1-6   
#>  [73] dichromat_2.0-0.1      RSQLite_2.4.1          tidyr_1.3.1           
#>  [76] data.table_1.17.6      prettyunits_1.2.0      httr_1.4.7            
#>  [79] htmlwidgets_1.6.4      S4Arrays_1.9.1         uwot_0.2.3            
#>  [82] pkgconfig_2.0.3        gtable_0.3.6           blob_1.2.4            
#>  [85] lmtest_0.9-40          XVector_0.49.0         htmltools_0.5.8.1     
#>  [88] dotCall64_1.2          bookdown_0.43          scales_1.4.0          
#>  [91] png_0.1-8              spatstat.univar_3.1-3  knitr_1.50            
#>  [94] reshape2_1.4.4         nlme_3.1-168           checkmate_2.3.2       
#>  [97] curl_6.4.0             cachem_1.1.0           zoo_1.8-14            
#> [100] stringr_1.5.1          BiocVersion_3.22.0     KernSmooth_2.23-26    
#> [103] parallel_4.5.1         miniUI_0.1.2           vipor_0.4.7           
#> [106] AnnotationDbi_1.71.0   pillar_1.10.2          grid_4.5.1            
#> [109] vctrs_0.6.5            RANN_2.6.2             promises_1.3.3        
#> [112] BiocSingular_1.25.0    beachmat_2.25.1        xtable_1.8-4          
#> [115] cluster_2.1.8.1        beeswarm_0.4.0         evaluate_1.0.4        
#> [118] magick_2.8.7           tinytex_0.57           cli_3.6.5             
#> [121] locfit_1.5-9.12        compiler_4.5.1         rlang_1.1.6           
#> [124] crayon_1.5.3           future.apply_1.20.0    labeling_0.4.3        
#> [127] plyr_1.8.9             ggbeeswarm_0.7.2       stringi_1.8.7         
#> [130] deldir_2.0-4           viridisLite_0.4.2      BiocParallel_1.43.4   
#> [133] Biostrings_2.77.2      lazyeval_0.2.2         spatstat.geom_3.4-1   
#> [136] Matrix_1.7-3           RcppHNSW_0.6.0         hms_1.1.3             
#> [139] patchwork_1.3.1        bit64_4.6.0-1          future_1.58.0         
#> [142] KEGGREST_1.49.1        shiny_1.10.0           ROCR_1.0-11           
#> [145] igraph_2.1.4           memoise_2.0.1          bslib_0.9.0           
#> [148] bit_4.6.0

References

Johnson, W Evan, Cheng Li, and Ariel Rabinovic. 2007. “Adjusting Batch Effects in Microarray Expression Data Using Empirical Bayes Methods.” Biostatistics 8 (1): 118–27.

Lazar, Cosmin, Stijn Meganck, Jonatan Taminau, David Steenhoff, Alain Coletta, Colin Molter, David Y Weiss-Solı́s, Robin Duque, Hugues Bersini, and Ann Nowé. 2012. “Batch Effect Removal Methods for Microarray Gene Expression Data Integration: A Survey.” Briefings in Bioinformatics 14 (4): 469–90.