Contents

1 Introduction

The MicrobiomeBenchamrkData package provides access to a collection of datasets with biological ground truth for benchmarking differential abundance methods. The datasets are deposited on Zenodo: https://doi.org/10.5281/zenodo.6911026

2 Installation

## Install BioConductor if not installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

## Release version (not yet in Bioc, so it doesn't work yet)
BiocManager::install("MicrobiomeBenchmarkData")

## Development version
BiocManager::install("waldronlab/MicrobiomeBenchmarkData") 
library(MicrobiomeBenchmarkData)
library(purrr)

3 Sample metadata

All sample metadata is merged into a single data frame and provided as a data object:

data('sampleMetadata', package = 'MicrobiomeBenchmarkData')
## Get columns present in all samples
sample_metadata <- sampleMetadata |> 
    discard(~any(is.na(.x))) |> 
    head()
knitr::kable(sample_metadata)
dataset sample_id body_site library_size pmid study_condition sequencing_method
HMP_2012_16S_gingival_V13 700103497 oral_cavity 5356 22699609 control 16S
HMP_2012_16S_gingival_V13 700106940 oral_cavity 4489 22699609 control 16S
HMP_2012_16S_gingival_V13 700097304 oral_cavity 3043 22699609 control 16S
HMP_2012_16S_gingival_V13 700099015 oral_cavity 2832 22699609 control 16S
HMP_2012_16S_gingival_V13 700097644 oral_cavity 2815 22699609 control 16S
HMP_2012_16S_gingival_V13 700097247 oral_cavity 6333 22699609 control 16S

4 Accessing datasets

Currently, there are 6 datasets available through the MicrobiomeBenchmarkData. These datasets are accessed through the getBenchmarkData function.

4.2 Access a single dataset

In order to import a dataset, the getBenchmarkData function must be used with the name of the dataset as the first argument (x) and the dryrun argument set to FALSE. The output is a list vector with the dataset imported as a TreeSummarizedExperiment object.

tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]
#> Finished HMP_2012_16S_gingival_V35_subset.
tse
#> class: TreeSummarizedExperiment 
#> dim: 892 76 
#> metadata(0):
#> assays(1): counts
#> rownames(892): OTU_97.31247 OTU_97.44487 ... OTU_97.45365 OTU_97.45307
#> rowData names(7): kingdom phylum ... genus taxon_annotation
#> colnames(76): 700023057 700023179 ... 700114009 700114338
#> colData names(13): dataset subject_id ... sequencing_method
#>   variable_region_16s
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (892 rows)
#> rowTree: 1 phylo tree(s) (892 leaves)
#> colLinks: NULL
#> colTree: NULL

4.3 Access a few datasets

Several datasets can be imported simultaneously by giving the names of the different datasets in a character vector:

list_tse <- getBenchmarkData(dats$Dataset[2:4], dryrun = FALSE)
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
str(list_tse, max.level = 1)
#> List of 3
#>  $ HMP_2012_16S_gingival_V35       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_WMS_gingival           :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots

4.4 Access all of the datasets

If all of the datasets must to be imported, this can be done by providing the dryrun = FALSE argument alone.

mbd <- getBenchmarkData(dryrun = FALSE)
#> Finished HMP_2012_16S_gingival_V13.
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
#> Warning: No taxonomy_tree available for Ravel_2011_16S_BV.
#> Finished Ravel_2011_16S_BV.
#> Warning: No taxonomy_tree available for Stammler_2016_16S_spikein.
#> Finished Stammler_2016_16S_spikein.
str(mbd, max.level = 1)
#> List of 6
#>  $ HMP_2012_16S_gingival_V13       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_16S_gingival_V35       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ HMP_2012_WMS_gingival           :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ Ravel_2011_16S_BV               :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#>  $ Stammler_2016_16S_spikein       :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots

5 Annotations for each taxa are included in rowData

The biological annotations of each taxa are provided as a column in the rowData slot of the TreeSummarizedExperiment.

## In the case, the column is named as taxon_annotation 
tse <- mbd$HMP_2012_16S_gingival_V35_subset
rowData(tse)
#> DataFrame with 892 rows and 7 columns
#>                  kingdom      phylum       class           order
#>              <character> <character> <character>     <character>
#> OTU_97.31247    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.44487    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.34979    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.34572    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.42259    Bacteria  Firmicutes     Bacilli Lactobacillales
#> ...                  ...         ...         ...             ...
#> OTU_97.44294    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.45429    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.44375    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.45365    Bacteria  Firmicutes     Bacilli Lactobacillales
#> OTU_97.45307    Bacteria  Firmicutes     Bacilli Lactobacillales
#>                        family         genus      taxon_annotation
#>                   <character>   <character>           <character>
#> OTU_97.31247 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44487 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34979 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34572 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.42259 Streptococcaceae Streptococcus facultative_anaerobic
#> ...                       ...           ...                   ...
#> OTU_97.44294 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45429 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44375 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45365 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45307 Streptococcaceae Streptococcus facultative_anaerobic

6 Cache

The datasets are cached so they’re only downloaded once. The cache and all of the files contained in it can be removed with the removeCache function.

removeCache()

7 Session information

sessionInfo()
#> R version 4.5.0 beta (2025-04-02 r88102)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] purrr_1.0.4                     MicrobiomeBenchmarkData_1.11.0 
#>  [3] TreeSummarizedExperiment_2.17.0 Biostrings_2.77.0              
#>  [5] XVector_0.49.0                  SingleCellExperiment_1.31.0    
#>  [7] SummarizedExperiment_1.39.0     Biobase_2.69.0                 
#>  [9] GenomicRanges_1.61.0            GenomeInfoDb_1.45.0            
#> [11] IRanges_2.43.0                  S4Vectors_0.47.0               
#> [13] BiocGenerics_0.55.0             generics_0.1.3                 
#> [15] MatrixGenerics_1.21.0           matrixStats_1.5.0              
#> [17] BiocStyle_2.37.0               
#> 
#> loaded via a namespace (and not attached):
#>  [1] xfun_0.52               bslib_0.9.0             lattice_0.22-7         
#>  [4] vctrs_0.6.5             tools_4.5.0             yulab.utils_0.2.0      
#>  [7] curl_6.2.2              parallel_4.5.0          RSQLite_2.3.9          
#> [10] tibble_3.2.1            blob_1.2.4              pkgconfig_2.0.3        
#> [13] Matrix_1.7-3            dbplyr_2.5.0            lifecycle_1.0.4        
#> [16] GenomeInfoDbData_1.2.14 compiler_4.5.0          treeio_1.33.0          
#> [19] codetools_0.2-20        htmltools_0.5.8.1       sass_0.4.10            
#> [22] yaml_2.3.10             lazyeval_0.2.2          pillar_1.10.2          
#> [25] crayon_1.5.3            jquerylib_0.1.4         tidyr_1.3.1            
#> [28] BiocParallel_1.43.0     DelayedArray_0.35.0     cachem_1.1.0           
#> [31] abind_1.4-8             nlme_3.1-168            tidyselect_1.2.1       
#> [34] digest_0.6.37           dplyr_1.1.4             bookdown_0.43          
#> [37] fastmap_1.2.0           grid_4.5.0              cli_3.6.4              
#> [40] SparseArray_1.9.0       magrittr_2.0.3          S4Arrays_1.9.0         
#> [43] ape_5.8-1               withr_3.0.2             filelock_1.0.3         
#> [46] UCSC.utils_1.5.0        bit64_4.6.0-1           rmarkdown_2.29         
#> [49] httr_1.4.7              bit_4.6.0               memoise_2.0.1          
#> [52] evaluate_1.0.3          knitr_1.50              BiocFileCache_2.17.0   
#> [55] rlang_1.1.6             Rcpp_1.0.14             DBI_1.2.3              
#> [58] glue_1.8.0              tidytree_0.4.6          BiocManager_1.30.25    
#> [61] jsonlite_2.0.0          R6_2.6.1                fs_1.6.6