knitr::opts_chunk$set(warning = FALSE, comment = NA,
                      fig.width = 6.25, fig.height = 5)
library(ANCOMBC)
library(tidyverse)

The data_sanity_check function performs essential validations on the
input data before further processing. It verifies the data type,
confirms the structure of the input data, and checks that sample names
are consistent between the metadata and the feature table, guarding
against common data input errors.
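To make the kinds of checks described above concrete, here is a minimal base-R sketch. It is not the ANCOMBC implementation: the helper `sketch_sanity_check` and the toy data are hypothetical, and the sketch only mirrors the three ideas named in the paragraph (data type, formula variables present in the metadata, matching sample names), plus an optional per-group sample count.

```r
# Hypothetical helper, NOT the ANCOMBC implementation: it only mimics the
# kinds of checks described above, using base R.
sketch_sanity_check <- function(counts, meta, fix_formula, group = NULL) {
  # 1. Data type: accept a numeric matrix or an all-numeric data.frame
  if (is.data.frame(counts)) counts <- as.matrix(counts)
  if (!is.matrix(counts) || !is.numeric(counts))
    stop("The feature table must be numeric.")

  # 2. Every variable in the formula must exist in the sample metadata
  vars <- all.vars(stats::as.formula(paste("~", fix_formula)))
  missing_vars <- setdiff(vars, colnames(meta))
  if (length(missing_vars) > 0)
    stop("Variables not found in metadata: ",
         paste(missing_vars, collapse = ", "))

  # 3. Sample names must agree between the feature table and the metadata
  if (!setequal(colnames(counts), rownames(meta)))
    stop("Sample names differ between the feature table and the metadata.")

  # 4. Optional: sample size per group
  group_sizes <- if (!is.null(group)) table(meta[[group]]) else NULL
  list(vars = vars, group_sizes = group_sizes)
}

# Toy data (made up): 3 taxa x 4 samples
counts <- matrix(c(5, 0, 3, 2, 7, 1, 0, 4, 6, 8, 2, 0), nrow = 3,
                 dimnames = list(paste0("taxon", 1:3), paste0("s", 1:4)))
meta <- data.frame(age = c(30, 45, 52, 28),
                   sex = c("f", "m", "f", "m"),
                   bmi_group = c("lean", "obese", "lean", "obese"),
                   row.names = paste0("s", 1:4))

res <- sketch_sanity_check(counts, meta, "age + sex + bmi_group",
                           group = "bmi_group")
```

Calling the helper with a formula that names a variable absent from `meta` (for example `"age + height"`) stops with an error, which is the behaviour one would want from a pre-flight check of this kind.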
Download and install the package.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ANCOMBC")
Load the package.

phyloseq object
The HITChip Atlas dataset contains genus-level microbiota profiles, generated with HITChip, for 1006 western adults with no reported health complications (Lahti et al. 2014). The dataset is available via the microbiome R package (Lahti et al. 2017) in phyloseq (McMurdie and Holmes 2013) format.
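The code that loads the dataset is not shown in this text. A minimal sketch, assuming the microbiome and phyloseq packages are installed, might look like the following; the object name `atlas1006` matches the one used in the calls further down, and the `requireNamespace` guard is an addition so that the snippet does nothing when the packages are unavailable.

```r
# Assumption: the 'microbiome' package ships the atlas1006 dataset and the
# 'phyloseq' package provides the class and its print method.
if (requireNamespace("microbiome", quietly = TRUE) &&
    requireNamespace("phyloseq", quietly = TRUE)) {
  # Load the dataset into the current environment
  data(atlas1006, package = "microbiome")
  # Print a summary of the phyloseq object
  print(atlas1006)
}
```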
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 130 taxa and 1151 samples ]
sample_data() Sample Data:       [ 1151 samples by 10 sample variables ]
tax_table()   Taxonomy Table:    [ 130 taxa by 3 taxonomic ranks ]
List the taxonomic levels available for data aggregation.
[1] "Phylum" "Family" "Genus" 
List the variables available in the sample metadata.
 [1] "age"                   "sex"                   "nationality"          
 [4] "DNA_extraction_method" "project"               "diversity"            
 [7] "bmi_group"             "subject"               "time"                 
[10] "sample"               
Data sanity and integrity check.
# With `group` variable
check_results = data_sanity_check(data = atlas1006,
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)
Checking the input data type ...
The input data is of type: phyloseq
PASS
Checking the sample metadata ...
The specified variables in the formula: age, sex, bmi_group
The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample
PASS
Checking other arguments ...
The number of groups of interest is: 6
The sample size per group is: underweight = 21, lean = 484, overweight = 197, obese = 222, severeobese = 99, morbidobese = 22
PASS
# Without `group` variable
check_results = data_sanity_check(data = atlas1006,
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = NULL,
                                  struc_zero = FALSE,
                                  global = FALSE,
                                  verbose = TRUE)
Checking the input data type ...
The input data is of type: phyloseq
PASS
Checking the sample metadata ...
The specified variables in the formula: age, sex, bmi_group
The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample
PASS
Checking other arguments ...
PASS

tse object
List the taxonomic levels available for data aggregation.
[1] "Phylum" "Family" "Genus" 
List the variables available in the sample metadata.
 [1] "age"                   "sex"                   "nationality"          
 [4] "DNA_extraction_method" "project"               "diversity"            
 [7] "bmi_group"             "subject"               "time"                 
[10] "sample"               
Data sanity and integrity check.
check_results = data_sanity_check(data = tse,
                                  assay_name = "counts",
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)
Checking the input data type ...
The input data is of type: TreeSummarizedExperiment
PASS
Checking the sample metadata ...
The specified variables in the formula: age, sex, bmi_group
The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample
PASS
Checking other arguments ...
The number of groups of interest is: 6
The sample size per group is: underweight = 21, lean = 484, overweight = 197, obese = 222, severeobese = 99, morbidobese = 22
PASS

matrix or data.frame
Both abundance data and sample metadata are required for this import method.
Note that aggregating taxa to higher taxonomic levels is not
supported in this method. Ensure that the data is already aggregated to
the desired taxonomic level before proceeding. If aggregation is needed,
consider creating a phyloseq or tse object for
importing.
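If the counts are held as a plain matrix, one way to aggregate them beforehand is base R's `rowsum()`, which sums rows that share a group label. The sketch below is only an illustration under stated assumptions: the genus-level counts and the genus-to-family lookup `family` are made up, and a real lookup would come from your own taxonomy table.

```r
# Toy genus-level counts (made-up numbers): 4 genera x 3 samples
genus_counts <- matrix(c(10, 0, 3,
                          2, 5, 1,
                          0, 7, 4,
                          6, 1, 0),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(c("GenusA", "GenusB", "GenusC", "GenusD"),
                                       c("s1", "s2", "s3")))

# Hypothetical genus-to-family lookup, in the same row order as genus_counts
family <- c("Fam1", "Fam1", "Fam2", "Fam2")

# rowsum() adds up the rows that share a family label, giving one row per family
family_counts <- rowsum(genus_counts, group = family)
family_counts
```

Because aggregation only sums rows, the per-sample totals (`colSums`) are unchanged, which is a quick way to confirm that no counts were lost.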
Ensure that the rownames of the metadata correspond to
the colnames of the abundance data.
[1] TRUE
List the variables available in the sample metadata.
 [1] "age"                   "sex"                   "nationality"          
 [4] "DNA_extraction_method" "project"               "diversity"            
 [7] "bmi_group"             "subject"               "time"                 
[10] "sample"               
Data sanity and integrity check.
check_results = data_sanity_check(data = abundance_data,
                                  assay_name = "counts",
                                  tax_level = "Family",
                                  meta_data = meta_data,
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)
Checking the input data type ...
The input data is of type: matrix
The imported data is in a generic 'matrix'/'data.frame' format.
PASS
Checking the sample metadata ...
The specified variables in the formula: age, sex, bmi_group
The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample
PASS
Checking other arguments ...
The number of groups of interest is: 6
The sample size per group is: underweight = 21, lean = 484, overweight = 197, obese = 222, severeobese = 99, morbidobese = 22
PASS

R version 4.5.0 Patched (2025-04-21 r88169)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.7.1
Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
locale:
[1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
 [1] doRNG_1.8.6.2   rngtools_1.5.2  foreach_1.5.2   DT_0.33        
 [5] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [9] purrr_1.0.4     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
[13] ggplot2_3.5.2   tidyverse_2.0.0 ANCOMBC_2.11.0 
loaded via a namespace (and not attached):
  [1] ggtext_0.1.2                    fs_1.6.6                       
  [3] matrixStats_1.5.0               DirichletMultinomial_1.51.0    
  [5] httr_1.4.7                      RColorBrewer_1.1-3             
  [7] doParallel_1.0.17               numDeriv_2016.8-1.1            
  [9] tools_4.5.0                     backports_1.5.0                
 [11] R6_2.6.1                        vegan_2.6-10                   
 [13] lazyeval_0.2.2                  mgcv_1.9-3                     
 [15] rhdf5filters_1.21.0             permute_0.9-7                  
 [17] withr_3.0.2                     gridExtra_2.3                  
 [19] textshaping_1.0.0               cli_3.6.5                      
 [21] Biobase_2.69.0                  sandwich_3.1-1                 
 [23] labeling_0.4.3                  slam_0.1-55                    
 [25] sass_0.4.10                     mvtnorm_1.3-3                  
 [27] proxy_0.4-27                    systemfonts_1.2.2              
 [29] yulab.utils_0.2.0               foreign_0.8-90                 
 [31] dichromat_2.0-0.1               scater_1.37.0                  
 [33] decontam_1.29.0                 parallelly_1.43.0              
 [35] readxl_1.4.5                    fillpattern_1.0.2              
 [37] rstudioapi_0.17.1               generics_0.1.3                 
 [39] gtools_3.9.5                    crosstalk_1.2.1                
 [41] rbiom_2.2.0                     Matrix_1.7-3                   
 [43] biomformat_1.37.0               ggbeeswarm_0.7.2               
 [45] DescTools_0.99.60               S4Vectors_0.47.0               
 [47] DECIPHER_3.5.0                  abind_1.4-8                    
 [49] lifecycle_1.0.4                 multcomp_1.4-28                
 [51] yaml_2.3.10                     SummarizedExperiment_1.39.0    
 [53] rhdf5_2.53.0                    SparseArray_1.9.0              
 [55] Rtsne_0.17                      grid_4.5.0                     
 [57] crayon_1.5.3                    lattice_0.22-7                 
 [59] haven_2.5.4                     beachmat_2.25.0                
 [61] pillar_1.10.2                   knitr_1.50                     
 [63] GenomicRanges_1.61.0            boot_1.3-31                    
 [65] gld_2.6.7                       estimability_1.5.1             
 [67] codetools_0.2-20                glue_1.8.0                     
 [69] data.table_1.17.0               MultiAssayExperiment_1.35.1    
 [71] vctrs_0.6.5                     treeio_1.33.0                  
 [73] Rdpack_2.6.4                    cellranger_1.1.0               
 [75] gtable_0.3.6                    cachem_1.1.0                   
 [77] xfun_0.52                       rbibutils_2.3                  
 [79] S4Arrays_1.9.0                  coda_0.19-4.1                  
 [81] reformulas_0.4.0                survival_3.8-3                 
 [83] SingleCellExperiment_1.31.0     iterators_1.0.14               
 [85] bluster_1.19.0                  gmp_0.7-5                      
 [87] TH.data_1.1-3                   nlme_3.1-168                   
 [89] phyloseq_1.53.0                 bit64_4.6.0-1                  
 [91] GenomeInfoDb_1.45.0             bslib_0.9.0                    
 [93] irlba_2.3.5.1                   vipor_0.4.7                    
 [95] rpart_4.1.24                    colorspace_2.1-1               
 [97] BiocGenerics_0.55.0             DBI_1.2.3                      
 [99] Hmisc_5.2-3                     nnet_7.3-20                    
[101] ade4_1.7-23                     Exact_3.3                      
[103] tidyselect_1.2.1                emmeans_1.11.0                 
[105] bit_4.6.0                       compiler_4.5.0                 
[107] microbiome_1.31.0               htmlTable_2.4.3                
[109] BiocNeighbors_2.3.0             expm_1.0-0                     
[111] xml2_1.3.8                      DelayedArray_0.35.1            
[113] checkmate_2.3.2                 scales_1.4.0                   
[115] digest_0.6.37                   minqa_1.2.8                    
[117] rmarkdown_2.29                  XVector_0.49.0                 
[119] htmltools_0.5.8.1               pkgconfig_2.0.3                
[121] base64enc_0.1-3                 lme4_1.1-37                    
[123] sparseMatrixStats_1.21.0        MatrixGenerics_1.21.0          
[125] fastmap_1.2.0                   rlang_1.1.6                    
[127] htmlwidgets_1.6.4               UCSC.utils_1.5.0               
[129] DelayedMatrixStats_1.31.0       farver_2.1.2                   
[131] jquerylib_0.1.4                 zoo_1.8-14                     
[133] jsonlite_2.0.0                  energy_1.7-12                  
[135] BiocParallel_1.43.0             BiocSingular_1.25.0            
[137] magrittr_2.0.3                  Formula_1.2-5                  
[139] scuttle_1.19.0                  GenomeInfoDbData_1.2.14        
[141] patchwork_1.3.0                 Rhdf5lib_1.31.0                
[143] Rcpp_1.0.14                     ape_5.8-1                      
[145] ggnewscale_0.5.1                viridis_0.6.5                  
[147] CVXR_1.0-15                     stringi_1.8.7                  
[149] rootSolve_1.8.2.4               MASS_7.3-65                    
[151] plyr_1.8.9                      parallel_4.5.0                 
[153] ggrepel_0.9.6                   lmom_3.2                       
[155] Biostrings_2.77.0               splines_4.5.0                  
[157] gridtext_0.1.5                  multtest_2.65.0                
[159] hms_1.1.3                       igraph_2.1.4                   
[161] reshape2_1.4.4                  stats4_4.5.0                   
[163] ScaledMatrix_1.17.0             evaluate_1.0.3                 
[165] nloptr_2.2.1                    tzdb_0.5.0                     
[167] BiocBaseUtils_1.11.0            rsvd_1.0.5                     
[169] xtable_1.8-4                    Rmpfr_1.0-0                    
[171] e1071_1.7-16                    tidytree_0.4.6                 
[173] ragg_1.4.0                      viridisLite_0.4.2              
[175] class_7.3-23                    gsl_2.1-8                      
[177] lmerTest_3.1-3                  beeswarm_0.4.0                 
[179] IRanges_2.43.0                  cluster_2.1.8.1                
[181] TreeSummarizedExperiment_2.17.0 timechange_0.3.0               
[183] mia_1.17.0