The statistics functionality in notameStats aims to identify interesting
features across study groups. See the project example vignette in the
notame package and the reference index on the notame website for a
listing of functions. Similar functionality is available in several
other packages.
Unless otherwise stated, all functions return separate data frames or
other objects with the results. These can then be added to the feature
data of the object using join_rowData(object, results). The
reason for not adding these to the objects automatically is that most of
the functions return excess information that is not always worth saving.
We encourage you to choose which information is important to you.
To install notameStats, first install BiocManager if it is not already
installed. Then use the install function from BiocManager and load
notameStats.
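The installation steps described above follow the usual Bioconductor pattern. A minimal sketch, assuming notameStats is available from the Bioconductor repositories configured for your R version:

```r
# Install BiocManager from CRAN only if it is missing
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install notameStats through BiocManager, then attach it
BiocManager::install("notameStats")
library(notameStats)
```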
It is straightforward to provide summary statistics and effect sizes for all features:
toy_notame_set <- mark_nas(toy_notame_set, value = 0)
# Impute missing values, required especially for multivariate methods
toy_notame_set <- notame::impute_rf(toy_notame_set)
sum_stats <- summary_statistics(toy_notame_set, grouping_cols = "Group")
toy_notame_set <- notame::join_rowData(toy_notame_set, sum_stats)
d_results <- cohens_d(toy_notame_set, group = "Group")
toy_notame_set <- notame::join_rowData(toy_notame_set, d_results)
fc <- fold_change(toy_notame_set, group = "Group")
toy_notame_set <- notame::join_rowData(toy_notame_set, fc)
colnames(rowData(toy_notame_set))
## [1] "Feature_ID" "Split" "Alignment" "Average_Mz"
## [5] "Average_Rt_min" "Column" "Ion_mode" "Flag"
## [9] "A_mean" "A_sd" "A_median" "A_mad"
## [13] "A_min" "A_Q25" "A_Q75" "A_max"
## [17] "B_mean" "B_sd" "B_median" "B_mad"
## [21] "B_min" "B_Q25" "B_Q75" "B_max"
## [25] "QC_mean" "QC_sd" "QC_median" "QC_mad"
## [29] "QC_min" "QC_Q25" "QC_Q75" "QC_max"
## [33] "B_vs_A_Cohen_d" "QC_vs_A_Cohen_d" "QC_vs_B_Cohen_d" "B_vs_A_FC"
## [37] "QC_vs_A_FC" "QC_vs_B_FC"
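To make the effect size columns above easier to interpret, the following base R sketch computes a Cohen's d and a fold change for a single feature. It uses textbook definitions (pooled standard deviation, ratio of group means) on made-up values; the exact formulas used by cohens_d() and fold_change() may differ in detail, so treat this as an illustration rather than a description of the package internals.

```r
# Hypothetical abundances of one feature in two study groups
a <- c(10, 12, 11, 13, 14)
b <- c(15, 17, 16, 18, 19)

# Cohen's d: difference in means scaled by the pooled standard deviation
pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
    (length(a) + length(b) - 2))
d_b_vs_a <- (mean(b) - mean(a)) / pooled_sd

# Fold change: ratio of the group means
fc_b_vs_a <- mean(b) / mean(a)

c(Cohen_d = d_b_vs_a, FC = fc_b_vs_a)
```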
These functions perform univariate hypothesis tests for each feature,
report relevant statistics and correct the p-values using FDR
correction. For features where the model fails, all statistics are
recorded as NA. Note: setting all_features = FALSE does not prevent the
tests on flagged features. It only affects p-value correction: flagged
features are left out of the correction and thus do not have an
FDR-corrected p-value. To prevent the testing of flagged features
altogether, use notame::drop_flagged before the tests.
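The effect of leaving flagged features out of the correction can be sketched in base R. The data frame, its column names and the convention that unflagged features have Flag set to NA are assumptions made for illustration, as is the use of the Benjamini-Hochberg method through p.adjust(); this is not the package's own code.

```r
# Hypothetical per-feature p-values and quality flags (NA = not flagged)
results <- data.frame(
    Feature_ID = c("f1", "f2", "f3", "f4"),
    P = c(0.01, 0.04, 0.03, 0.20),
    Flag = c(NA, "Low_quality", NA, NA),
    stringsAsFactors = FALSE
)

# Every feature has a raw p-value, but only unflagged features
# enter the correction; flagged features keep NA as corrected value
unflagged <- is.na(results$Flag)
results$P_FDR <- NA_real_
results$P_FDR[unflagged] <- p.adjust(results$P[unflagged], method = "fdr")
results
```

Because the correction depends on the number of tests, excluding flagged features changes the corrected p-values of the remaining features as well.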
Most of the univariate statistical test functions in this package use the formula interface, where the formula is provided as a character string, with one special condition: the word “Feature” is replaced at each iteration by the name of the feature being tested. For example, when testing whether any of the features predict the difference between study groups, the formula would be “Group ~ Feature”. When testing whether group and time point affect metabolite levels, the formula could be “Feature ~ Group + Time + Group:Time”, where the last term is an interaction term (“Feature ~ Group * Time” is equivalent).
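The placeholder mechanism can be illustrated with a small base R sketch. The data frame and the feature names feat_1 and feat_2 are hypothetical, and the loop only mimics the idea of substituting “Feature” before fitting; it is not how the package functions are implemented.

```r
# Hypothetical data: two feature columns plus sample information
set.seed(1)
toy <- data.frame(
    Group = factor(rep(c("A", "B"), each = 10)),
    Time = factor(rep(c(1, 2), times = 10)),
    feat_1 = rnorm(20),
    feat_2 = rnorm(20)
)
formula_char <- "Feature ~ Group + Time"

# Replace the placeholder with each feature name and fit one model per feature
feature_names <- c("feat_1", "feat_2")
fits <- lapply(feature_names, function(feature) {
    f <- as.formula(gsub("Feature", feature, formula_char, fixed = TRUE))
    lm(f, data = toy)
})
names(fits) <- feature_names

# The fitted model for feat_2 uses the substituted formula
formula(fits$feat_2)
```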
toy_notame_set <- notame::flag_quality(toy_notame_set)
toy_notame_set <- notame::drop_qcs(toy_notame_set)
lm_results <- perform_lm(toy_notame_set,
    formula_char = "Feature ~ Group + Time")
Most of the functions allow you to pass extra arguments to the underlying functions performing the actual tests, so you can set custom contrasts etc.
Functions not using the formula interface include correlation tests
between molecular features and/or sample information variables
(perform_correlation_tests()) and area under the curve
computation (perform_auc()).
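As a rough illustration of what a per-feature correlation test involves, the base R sketch below correlates each feature with one sample information variable and corrects the p-values. The data, the output column names and the choice of Spearman correlation are assumptions for this example; see the documentation of perform_correlation_tests() for its actual arguments and output.

```r
# Hypothetical data: two features and one sample information variable
set.seed(42)
samples <- data.frame(
    feat_1 = rnorm(30),
    feat_2 = rnorm(30),
    Injection_order = seq_len(30)
)

# Spearman correlation test of each feature against the sample variable
cor_results <- do.call(rbind, lapply(c("feat_1", "feat_2"), function(feature) {
    res <- cor.test(samples[[feature]], samples$Injection_order,
        method = "spearman")
    data.frame(X = feature, Y = "Injection_order",
        Correlation_coefficient = unname(res$estimate), P = res$p.value)
}))

# FDR correction across all tested pairs
cor_results$P_FDR <- p.adjust(cor_results$P, method = "fdr")
cor_results
```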
notame provides a wrapper for the MUVR analysis (Multivariate methods
with Unbiased Variable selection in R, [shi2019variable]) using the MUVR2
package. MUVR2 allows fitting both RF and PLS models with variable
selection, both for finding a minimal subset of features that achieves
good performance and for finding all relevant features. MUVR2 also
provides a set of useful visualizations.
# nRep = 2 for quick example
pls_model <- muvr_analysis(toy_notame_set,
    y = "Injection_order", nRep = 2, method = "PLS")
## Warning in (function (X, Y, ID, scale = TRUE, nRep = 5, nOuter = 6, nInner, :
## Missing ID -> Assume all unique (i.e. sample independence)
## Warning: executing %dopar% sequentially: no parallel backend registered
## [1] "MUVR" "Regression" "PLS"
For random forest models, we use the randomForest package and include a
wrapper for getting feature importance.
## [1] "randomForest"
## Feature_ID A B
## HILIC_neg_118_9111a4_1865 HILIC_neg_118_9111a4_1865 -0.001266017 0.020308924
## HILIC_pos_255_0094a7_9288 HILIC_pos_255_0094a7_9288 -0.046865584 -0.001874653
## RP_neg_139_4456a4_1251 RP_neg_139_4456a4_1251 0.018118326 0.015059646
## MeanDecreaseAccuracy MeanDecreaseGini
## HILIC_neg_118_9111a4_1865 0.009517204 6.565834
## HILIC_pos_255_0094a7_9288 -0.023202559 5.799136
## RP_neg_139_4456a4_1251 0.014382649 7.124830
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pROC_1.19.0.1 notameStats_1.1.1
## [3] notameViz_1.1.1 notame_1.1.1
## [5] SummarizedExperiment_1.41.0 Biobase_2.71.0
## [7] GenomicRanges_1.63.1 Seqinfo_1.1.0
## [9] IRanges_2.45.0 S4Vectors_0.49.0
## [11] BiocGenerics_0.57.0 generics_0.1.4
## [13] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [15] ggplot2_4.0.1 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_2.0.0
## [4] shape_1.4.6.1 magrittr_2.0.4 farver_2.1.2
## [7] rmarkdown_2.30 vctrs_0.7.0 htmltools_0.5.9
## [10] S4Arrays_1.11.1 itertools_0.1-3 missForest_1.6.1
## [13] lambda.r_1.2.4 SparseArray_1.11.10 caret_7.0-1
## [16] sass_0.4.10 parallelly_1.46.1 bslib_0.9.0
## [19] plyr_1.8.9 futile.options_1.0.1 lubridate_1.9.4
## [22] cachem_1.1.0 buildtools_1.0.0 igraph_2.2.1
## [25] lifecycle_1.0.5 iterators_1.0.14 pkgconfig_2.0.3
## [28] Matrix_1.7-4 R6_2.6.1 fastmap_1.2.0
## [31] rbibutils_2.4 future_1.69.0 digest_0.6.39
## [34] rARPACK_0.11-0 RSpectra_0.16-2 ellipse_0.5.0
## [37] randomForest_4.7-1.2 timechange_0.3.0 abind_1.4-8
## [40] mgcv_1.9-4 compiler_4.5.2 rngtools_1.5.2
## [43] withr_3.0.2 doParallel_1.0.17 S7_0.2.1
## [46] BiocParallel_1.45.0 psych_2.5.6 MASS_7.3-65
## [49] lava_1.8.2 DelayedArray_0.37.0 corpcor_1.6.10
## [52] MUVR2_0.1.0 ModelMetrics_1.2.2.2 tools_4.5.2
## [55] ranger_0.18.0 future.apply_1.20.1 nnet_7.3-20
## [58] glue_1.8.0 nlme_3.1-168 grid_4.5.2
## [61] reshape2_1.4.5 recipes_1.3.1 gtable_0.3.6
## [64] class_7.3-23 tidyr_1.3.2 data.table_1.18.0
## [67] XVector_0.51.0 ggrepel_0.9.6 foreach_1.5.2
## [70] pillar_1.11.1 stringr_1.6.0 splines_4.5.2
## [73] dplyr_1.1.4 lattice_0.22-7 survival_3.8-6
## [76] tidyselect_1.2.1 mixOmics_6.35.0 maketools_1.3.2
## [79] knitr_1.51 gridExtra_2.3 futile.logger_1.4.9
## [82] xfun_0.56 hardhat_1.4.2 timeDate_4051.111
## [85] stringi_1.8.7 yaml_2.3.12 evaluate_1.0.5
## [88] codetools_0.2-20 tibble_3.3.1 BiocManager_1.30.27
## [91] cli_3.6.5 rpart_4.1.24 Rdpack_2.6.4
## [94] jquerylib_0.1.4 Rcpp_1.1.1 globals_0.18.0
## [97] parallel_4.5.2 gower_1.0.2 doRNG_1.8.6.2
## [100] listenv_0.10.0 glmnet_4.1-10 viridisLite_0.4.2
## [103] ipred_0.9-15 scales_1.4.0 prodlim_2025.04.28
## [106] purrr_1.2.1 rlang_1.1.7 formatR_1.14
## [109] mnormt_2.1.1