1 Installation
2 Citing BreastSubtypeR
3 Brief description
- 3.1 Features
- 3.2 Implemented approaches
4 Quick start
5 Guidance & best practices
6 Shiny app
- 6.1 Launch the local Shiny app
- 6.2 Example data (for Shiny & scripts)
7 Limitations
8 Appendix

1 Installation

Install the released version from Bioconductor:

# Requires R >= 4.5.0
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("BreastSubtypeR")

2 Citing BreastSubtypeR

If you use BreastSubtypeR, please cite:

Yang Q., Hartman J., Sifakis E.G. (2025). BreastSubtypeR: a unified R/Bioconductor package for intrinsic molecular subtyping in breast cancer research. NAR Genomics and Bioinformatics, 7(4):lqaf131. https://doi.org/10.1093/nargab/lqaf131

For BibTeX/LaTeX, run in R:

citation("BreastSubtypeR")

3 Brief description

Breast cancer (BC) is a biologically heterogeneous disease with intrinsic molecular subtypes (e.g., Luminal A, Luminal B, HER2-enriched, Basal-like, Normal-like) that inform biological interpretation and clinical decision-making. While clinical assays such as Prosigna provide standardized subtyping in the clinic, research implementations have proliferated and diverge in pre-processing, gene mapping, and algorithmic assumptions, which reduces reproducibility and complicates cross-cohort analyses.

BreastSubtypeR consolidates multiple published gene-expression signature classifiers into a unified, assumption-aware Bioconductor package with:
- a unified multi-method API (run many classifiers in one call),
- AUTO mode for cohort-aware method selection,
- standardized, method-specific pre-processing for multiple input types (raw counts, FPKM, log2-processed arrays),
- Entrez ID–based probe/gene mapping,
- and a local Shiny app (iBreastSubtypeR) for non-programmers.

3.1 Features

Unified interface for published methods: consolidates PAM50 variants, AIMS, ssBC/sspbc, and others under one consistent API.
Run multiple methods at once (BS_Multi): execute several classifiers in a single call and compare results side by side.
AUTO (cohort-aware selection): checks ER/HER2 distribution, subtype purity, and subgroup sizes; disables incompatible classifiers.
Method-specific pre-processing: automatically routes raw RNA-seq counts, precomputed FPKM, or log2-processed microarray/nCounter matrices.
Robust mapping: Entrez ID–based gene mapping with conflict resolution.
Local Shiny app (iBreastSubtypeR): point-and-click analysis; data stay on your machine.
Reproducibility: Bioconductor distribution, unit tests, vignettes, and SummarizedExperiment compatibility.

3.2 Implemented approaches

The package includes implementations of commonly used subtyping methods (NC-based and SSP-based):

Method id	Short description	Group	Reference
`parker.original`	Original PAM50 by Parker et al., 2009	NC-based	Parker et al., 2009
`genefu.scale`	PAM50 implementation as in the genefu R package (scaled version)	NC-based	Gendoo et al., 2016
`genefu.robust`	PAM50 implementation as in the genefu R package (robust version)	NC-based	Gendoo et al., 2016
`cIHC`	Conventional ER-balancing using immunohistochemistry (IHC)	NC-based	Ciriello et al., 2015
`cIHC.itr`	Iterative version of cIHC	NC-based	Curtis et al., 2012
`PCAPAM50`	Selects IHC-defined ER subsets, then uses Principal Component Analysis (PCA) to create ESR1 expression-based ER-balancing	NC-based	Raj-Kumar et al., 2019
`ssBC`	Subgroup-specific gene-centering PAM50	NC-based	Zhao et al., 2015
`ssBC.v2`	Updated subgroup-specific gene-centering PAM50 with refined quantiles	NC-based	Fernandez-Martinez et al., 2020
`AIMS`	Absolute Intrinsic Molecular Subtyping (AIMS) method	SSP-based	Paquet & Hallett, 2015
`sspbc`	Single-Sample Predictors for Breast Cancer (AIMS adaptation)	SSP-based	Staaf et al., 2022

4 Quick start

The examples below use small example datasets shipped with the package. For your own data, provide a SummarizedExperiment with clinical metadata in colData (e.g., PatientID, ER/HER2; for risk-of-recurrence (ROR) scores: TSIZE, NODE).

library(BreastSubtypeR)

# Example data
data("BreastSubtypeRobj")
data("OSLO2EMIT0obj")

1) Map & prepare (method-specific pre-processing + mapping)

# Pre-processing: automatically apply tailored normalization, map probes/IDs to Entrez,
# and (optionally) impute missing values
data_input <- Mapping(
    OSLO2EMIT0obj$se_obj,
    method = "max", # mapping strategy (example)
    RawCounts = FALSE,
    impute = TRUE,
    verbose = FALSE
)

Notes

Mapping() prepares expression inputs for downstream subtyping functions by:
- automatically applying tailored normalization workflows depending on input type
  - Raw RNA-seq counts (+ gene lengths): converted to log2-CPM (upper-quartile normalized) for NC-based methods; converted to linear FPKM for SSP-based methods.
  - Precomputed RNA-seq FPKM (log₂-transformed): used directly for NC-based methods; back-transformed to linear scale (2^x) for SSP-based methods.
  - Microarray/nCounter (log₂-processed): used directly for NC-based methods; back-transformed to linear scale (2^x) for SSP-based methods.
- resolving probe/ID → Entrez mappings,
- selecting or collapsing multiple probes per gene (method argument),
- optionally imputing missing marker values,
- and returning a packaged object ready for BS_Multi or single-method callers.
See ?Mapping for the full parameter list (e.g., RawCounts, method, impute, verbose) and Methods (Sections 2.3–2.4) in the paper for a complete description of the input/normalization pipeline.
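
As a rough illustration of the input-dependent transformations listed above, the base-R sketch below works on a toy matrix. It is not the package's internal code: the helper names `counts_to_fpkm()` and `log2_to_linear()` are hypothetical, the toy values are made up, the FPKM formula is the conventional one (counts per kilobase of gene length per million mapped reads), and the upper-quartile log2-CPM step is left out.

```r
# Toy raw counts: 3 genes x 2 samples (illustrative values only)
counts <- matrix(
    c(100, 200, 300,
      400, 500, 600),
    nrow = 3,
    dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)
gene_length_bp <- c(g1 = 1000, g2 = 2000, g3 = 500)

# Hypothetical helper: conventional FPKM =
# counts / (gene length in kb) / (library size in millions)
counts_to_fpkm <- function(counts, gene_length_bp) {
    lib_size_millions <- colSums(counts) / 1e6
    per_kb <- counts / (gene_length_bp / 1e3) # each row divided by its gene length
    sweep(per_kb, 2, lib_size_millions, "/")  # each column divided by its library size
}

# Hypothetical helper: back-transform log2 values to linear scale (2^x)
log2_to_linear <- function(x) 2^x

fpkm <- counts_to_fpkm(counts, gene_length_bp)
log2_fpkm <- log2(fpkm)                      # the scale NC-based methods would receive
linear_again <- log2_to_linear(log2_fpkm)    # the scale SSP-based methods would receive
```

In practice Mapping() performs these steps for you; the sketch only shows why the same input ends up on two different scales for NC-based and SSP-based methods.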

2) Multi-method run (user-defined)

methods <- c("parker.original", "PCAPAM50", "sspbc")

res <- BS_Multi(
    data_input = data_input,
    methods = methods,
    Subtype = FALSE,
    hasClinical = FALSE
)

# Per-sample calls (samples × methods, plus a per-sample entropy column)
head(res$res_subtypes, 5)
#>                parker.original PCAPAM50  sspbc   entropy
#> OSLO2EMIT0.001            LumA     LumA   LumB 0.9182958
#> OSLO2EMIT0.002           Basal    Basal  Basal 0.0000000
#> OSLO2EMIT0.003            LumA     LumA   LumA 0.0000000
#> OSLO2EMIT0.004            LumA     LumA   LumA 0.0000000
#> OSLO2EMIT0.005          Normal     LumA Normal 0.9182958
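
The entropy values shown above are consistent with the base-2 Shannon entropy of each sample's calls across methods (0 when all methods agree, about 0.918 for a 2-versus-1 split among three methods). The sketch below reproduces those two values in base R; `call_entropy()` is a hypothetical helper, and whether the package computes the column in exactly this way is an assumption.

```r
# Hypothetical helper: base-2 Shannon entropy of one sample's subtype calls
call_entropy <- function(calls) {
    p <- table(calls) / length(calls) # proportion of methods voting for each subtype
    -sum(p * log2(p))
}

# Calls for two samples, copied from the output above
sample_001 <- c(parker.original = "LumA", PCAPAM50 = "LumA", sspbc = "LumB")
sample_002 <- c(parker.original = "Basal", PCAPAM50 = "Basal", sspbc = "Basal")

entropy_001 <- call_entropy(sample_001) # 2-vs-1 split
entropy_002 <- call_entropy(sample_002) # full agreement
```

Higher entropy flags samples on which the classifiers disagree and which may deserve a closer look.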

3) AUTO mode (cohort-aware selection) + visualize

AUTO evaluates cohort diagnostics (for example, ER/HER2 distribution, subtype purity, and subgroup sizes) and selects methods compatible with the cohort. It disables classifiers whose distributional assumptions would likely be violated.

res_auto <- BS_Multi(
    data_input = data_input,
    methods = "AUTO",
    Subtype = FALSE,
    hasClinical = FALSE
)

# visualize multi-method output and concordance
Vis_Multi(res_auto$res_subtypes)
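
Alongside the plot, a simple numeric summary of concordance can be useful. The base-R sketch below computes the fraction of samples on which each pair of methods agrees, using the five example rows printed for the user-defined run above. `pairwise_agreement()` is a hypothetical helper and not part of the package.

```r
# Five example calls copied from the multi-method output shown earlier
calls <- data.frame(
    parker.original = c("LumA", "Basal", "LumA", "LumA", "Normal"),
    PCAPAM50 = c("LumA", "Basal", "LumA", "LumA", "LumA"),
    sspbc = c("LumB", "Basal", "LumA", "LumA", "Normal"),
    row.names = sprintf("OSLO2EMIT0.%03d", 1:5)
)

# Hypothetical helper: fraction of samples on which each pair of methods agrees
pairwise_agreement <- function(calls) {
    m <- names(calls)
    out <- matrix(1, length(m), length(m), dimnames = list(m, m))
    for (i in seq_along(m)) {
        for (j in seq_along(m)) {
            out[i, j] <- mean(calls[[i]] == calls[[j]])
        }
    }
    out
}

agreement <- pairwise_agreement(calls)
```

With a real result you would pass only the method columns, dropping the entropy column first.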

AUTO logic (clarifications)

ER/HER2-defined cohorts (any of ER+/HER2−, ER−/HER2−, ER+/HER2+, ER−/HER2+): AUTO runs ssBC.v2 only, plus SSP-based methods (AIMS, sspbc).
ER-only cohorts (ER+ or ER−) and TNBC: when above minimum sizes (see below), AUTO runs ssBC and/or ssBC.v2, plus SSP-based methods.
ER+ fraction gate (simulation-based): lower_ratio = 0.39, upper_ratio = 0.69.
Minimum sample group sizes (defaults used by AUTO):
- ER+ total: n_ERpos_threshold = 15
- ER− total: n_ERneg_threshold = 18
- TNBC total: n_TN_threshold = 18 (currently aligned with ER−)
- ER+ subgroups (HER2+ or HER2−): n_ERposHER2pos_threshold = n_ERposHER2neg_threshold = round(n_ERpos_threshold / 2)
- ER− subgroups (HER2+ or HER2−): n_ERnegHER2pos_threshold = n_ERnegHER2neg_threshold = round(n_ERneg_threshold / 2)

Notes. Thresholds are selection gates for method eligibility; they do not force a consensus call.

Provenance & future updates. The ER+ (15) and ER− (18) cohort minimums are simulation-based defaults. ER/HER2 subgroup thresholds (approx. half of each ER total) are heuristic and may be updated as additional simulation studies are completed. For TNBC, we currently use the ER− minimum (18) as the cohort cutoff; TN-specific thresholds may likewise be refined in future releases.
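
To make the gates above concrete, the sketch below checks a vector of ER statuses against the quoted defaults. It is not the package's AUTO implementation: `check_cohort_gates()` is a hypothetical helper, the "+"/"-" coding of ER status is an assumption, and so are the inclusive comparisons (the text does not say whether the bounds and minimums are inclusive).

```r
# Hypothetical helper; NOT the package's internal AUTO logic.
# Default values are copied from the thresholds quoted above.
check_cohort_gates <- function(er_status,
                               lower_ratio = 0.39,
                               upper_ratio = 0.69,
                               n_ERpos_threshold = 15,
                               n_ERneg_threshold = 18) {
    er_status <- er_status[!is.na(er_status)]
    n_pos <- sum(er_status == "+") # assumed coding: "+" for ER-positive
    n_neg <- sum(er_status == "-") # assumed coding: "-" for ER-negative
    er_pos_fraction <- n_pos / length(er_status)
    list(
        er_pos_fraction = er_pos_fraction,
        # Assumption: both bounds are inclusive
        er_fraction_in_range = er_pos_fraction >= lower_ratio &&
            er_pos_fraction <= upper_ratio,
        # Assumption: a group exactly at the minimum size passes
        enough_ERpos = n_pos >= n_ERpos_threshold,
        enough_ERneg = n_neg >= n_ERneg_threshold,
        # Subgroup minimums as stated above: half of each ER total, rounded
        n_ERpos_subgroup_threshold = round(n_ERpos_threshold / 2),
        n_ERneg_subgroup_threshold = round(n_ERneg_threshold / 2)
    )
}

# Toy cohort: 30 ER+ and 20 ER- samples (ER+ fraction 0.6)
gates <- check_cohort_gates(c(rep("+", 30), rep("-", 20)))
```

A cohort passing these checks only becomes eligible for the corresponding methods; as noted above, the thresholds do not force a consensus call.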

4) Single-method run

PAM50 (NC-based)

res_pam <- BS_parker(
    se_obj = data_input$se_NC, # object prepared for NC-based methods
    calibration = "Internal",
    internal = "medianCtr",
    Subtype = FALSE,
    hasClinical = FALSE
)

AIMS (SSP-based)

res_aims <- BS_AIMS(data_input$se_SSP)

5 Guidance & best practices

5.1 Input types

Provide one of the following as input:
- raw counts plus gene lengths (for internal calculation of CPM/FPKM),
- precomputed FPKM/TPM matrices,
- log2-processed microarray/nCounter matrices (e.g., RMA).
BreastSubtypeR routes the supplied input to the appropriate, method-specific pre-processing pipeline automatically — see ?BS_Multi and Methods (Section 2.3) in the paper for details.

5.2 When to use `AUTO`

Use methods = "AUTO" (i.e. BS_Multi(methods = "AUTO", ...)) for exploratory datasets or cohorts of unknown / skewed composition.
Use AUTO when you want the package to select only classifiers compatible with the cohort (it disables methods whose assumptions appear violated).
For validation against a single published method or a clinical assay (e.g., Prosigna®), run the corresponding single-method implementation directly (e.g., BS_parker()).

5.3 Interpretation

AUTO is designed to avoid misapplication of NC-based classifiers when cohort assumptions are violated; it does not produce a forced consensus label.

6 Shiny app

For users new to R, we offer an intuitive Shiny app for interactive molecular subtyping.

6.1 Launch the local Shiny app

BreastSubtypeR::iBreastSubtypeR() # interactive GUI (local)

If needed, install UI dependencies and re-run:

install.packages(c("shiny", "bslib"))

The app runs locally; no data leave your machine.

What you can do:
- Upload expression, clinical, and feature-annotation tables (clinical lives in colData).
- Run single methods, or run multiple classifiers at once with BS_Multi and AUTO enabled for cohort-aware selection.
- Choose 5-class (incl. Normal-like) or 4-class (AIMS is 5-class only).
- Inspect per-sample concordance (entropy), heatmap and pie summaries.
- Export Calls-only or Full metrics. ROR is available for NC methods when TSIZE/NODE are present and numeric.
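
Because ROR depends on TSIZE and NODE being present and numeric, it can help to check the clinical table before uploading it. The base-R sketch below is one way to do that; `ror_ready()` is a hypothetical helper (not a package function), and the toy values are illustrative rather than a statement of how the package expects these variables to be coded.

```r
# Hypothetical helper: report whether the ROR covariates are present and numeric
ror_ready <- function(clinical, required = c("TSIZE", "NODE")) {
    present <- required %in% names(clinical)
    numeric_ok <- vapply(
        required,
        function(col) col %in% names(clinical) && is.numeric(clinical[[col]]),
        logical(1)
    )
    data.frame(column = required, present = present, numeric = unname(numeric_ok))
}

# Toy clinical table (illustrative values only)
clinical <- data.frame(
    PatientID = c("P1", "P2", "P3"),
    TSIZE = c(0, 1, 1),
    NODE = c("0", "2", "1"), # stored as character, so it fails the numeric check
    stringsAsFactors = FALSE
)

check_before <- ror_ready(clinical)

# Convert the offending column and check again
clinical$NODE <- as.numeric(clinical$NODE)
check_after <- ror_ready(clinical)
```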

6.2 Example data (for Shiny & scripts)

The Shiny UI provides a “Load example data…” button that preloads a small demo cohort (expression, clinical, annotation). After loading, click Preprocess & map (Step 1), then proceed to analyses (Step 2).

Programmatic access to the same files:

exdir <- system.file("RshinyTest", package = "BreastSubtypeR")
gex <- file.path(exdir, "OSLO2EMIT0_GEX_log2.FPKM.txt")
clin <- file.path(exdir, "OSLO2EMIT0_clinical.txt")
anno <- file.path(exdir, "OSLO2EMIT0_anno.txt")
file.exists(gex)
file.exists(clin)
file.exists(anno)

7 Limitations

BreastSubtypeR harmonises many published, signature-based classifiers but has known limitations:

It is not a clinical-grade replacement for assays like Prosigna; clinical validation requires paired clinical assay data.
AUTO selects compatible methods; it does not perform consensus voting by default.

8 Appendix

8.1 Sources & support

Bioconductor package page: https://bioconductor.org/packages/BreastSubtypeR
Bioconductor DOI: https://doi.org/10.18129/B9.bioc.BreastSubtypeR
GitHub mirrors: https://github.com/yqkiuo/BreastSubtypeR (personal), https://github.com/JohanHartmanGroupBioteam/BreastSubtypeR (org)
Bugs / pull requests: https://github.com/yqkiuo/BreastSubtypeR/issues

8.2 References

Yang Q., Hartman J., Sifakis E.G. (2025). BreastSubtypeR: a unified R/Bioconductor package for intrinsic molecular subtyping in breast cancer research. NAR Genomics and Bioinformatics, 7(4):lqaf131. https://doi.org/10.1093/nargab/lqaf131
Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, et al. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol, 27(8):1160–1167. https://doi.org/10.1200/JCO.2008.18.1370
Gendoo DMA, Ratanasirigulchai N, Schröder MS, Pare L, Parker JS, Prat A, Haibe-Kains B. (2016). Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics, 32(7):1097–1099. https://doi.org/10.1093/bioinformatics/btv693
Ciriello G, Gatza ML, Beck AH, Wilkerson MD, Rhie SK, Pastore A, et al. (2015). Comprehensive molecular portraits of invasive lobular breast cancer. Cell, 163(2):506–519. https://doi.org/10.1016/j.cell.2015.09.033
Curtis C, Shah SP, Chin S-F, Turashvili G, Rueda OM, Dunning MJ, et al. (2012). The genomic and transcriptomic architecture of 2,000 breast tumors reveals novel subgroups. Nature, 486:346–352. https://doi.org/10.1038/nature10983
Raj-Kumar PK, Liu J, Hooke JA, Kovatich AJ, Kvecher L, Shriver CD, Hu H. (2019). PCA-PAM50 improves subtype assignment in ER-positive breast cancer. Sci Rep, 9:14386. https://doi.org/10.1038/s41598-019-44339-4
Zhao X, Rodland EA, Tibshirani R, Edvardsen H, Sauer T, Hovig E. (2015). Systematic evaluation of subtype prediction using gene expression profiles and intrinsic subtyping methods. Breast Cancer Res, 17:55. https://doi.org/10.1186/s13058-015-0520-4
Fernandez-Martinez A, Krop IE, Hillman DW, Polley M-YC, Parker JS, Huebner L, et al. (2020). Survival, pathologic response, and PAM50 subtype in stage II–III HER2-positive breast cancer treated with neoadjuvant chemotherapy and trastuzumab ± lapatinib. J Clin Oncol, 38(19):2140–2150. https://doi.org/10.1200/JCO.20.01276
Paquet ER, Hallett MT. (2015). Absolute assignment of breast cancer intrinsic molecular subtype. J Natl Cancer Inst, 107(1):357. https://doi.org/10.1093/jnci/dju357
Staaf J, Ringnér M, Vallon-Christersson J. (2022). Simple single-sample predictors for breast cancer subtype identification using gene expression data. npj Breast Cancer, 8:104. https://doi.org/10.1038/s41523-022-00465-3

8.3 Session information

sessionInfo()
#> R Under development (unstable) (2025-10-20 r88955)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] BreastSubtypeR_1.3.2 BiocStyle_2.39.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.41.0 gtable_0.3.6               
#>  [3] impute_1.85.0               circlize_0.4.17            
#>  [5] shape_1.4.6.1               rjson_0.2.23               
#>  [7] xfun_0.55                   bslib_0.9.0                
#>  [9] ggplot2_4.0.1               GlobalOptions_0.1.3        
#> [11] ggrepel_0.9.6               Biobase_2.71.0             
#> [13] lattice_0.22-7              Cairo_1.7-0                
#> [15] vctrs_0.6.5                 tools_4.6.0                
#> [17] generics_0.1.4              stats4_4.6.0               
#> [19] parallel_4.6.0              proxy_0.4-28               
#> [21] tibble_3.3.0                cluster_2.1.8.1            
#> [23] pkgconfig_2.0.3             Matrix_1.7-4               
#> [25] data.table_1.17.8           RColorBrewer_1.1-3         
#> [27] S7_0.2.1                    S4Vectors_0.49.0           
#> [29] lifecycle_1.0.4             compiler_4.6.0             
#> [31] farver_2.1.2                stringr_1.6.0              
#> [33] tinytex_0.58                Seqinfo_1.1.0              
#> [35] codetools_0.2-20            ComplexHeatmap_2.27.0      
#> [37] clue_0.3-66                 class_7.3-23               
#> [39] htmltools_0.5.9             sass_0.4.10                
#> [41] yaml_2.3.12                 pillar_1.11.1              
#> [43] crayon_1.5.3                jquerylib_0.1.4            
#> [45] cachem_1.1.0                DelayedArray_0.37.0        
#> [47] magick_2.9.0                iterators_1.0.14           
#> [49] abind_1.4-8                 foreach_1.5.2              
#> [51] tidyselect_1.2.1            digest_0.6.39              
#> [53] stringi_1.8.7               dplyr_1.1.4                
#> [55] bookdown_0.46               fastmap_1.2.0              
#> [57] grid_4.6.0                  SparseArray_1.11.10        
#> [59] colorspace_2.1-2            cli_3.6.5                  
#> [61] magrittr_2.0.4              S4Arrays_1.11.1            
#> [63] dichromat_2.0-0.1           e1071_1.7-17               
#> [65] withr_3.0.2                 scales_1.4.0               
#> [67] XVector_0.51.0              rmarkdown_2.30             
#> [69] matrixStats_1.5.0           otel_0.2.0                 
#> [71] png_0.1-8                   GetoptLong_1.1.0           
#> [73] evaluate_1.0.5              knitr_1.51                 
#> [75] GenomicRanges_1.63.1        IRanges_2.45.0             
#> [77] doParallel_1.0.17           rlang_1.1.6                
#> [79] Rcpp_1.1.0.8.1              glue_1.8.0                 
#> [81] BiocManager_1.30.27         BiocGenerics_0.57.0        
#> [83] jsonlite_2.0.0              R6_2.6.1                   
#> [85] MatrixGenerics_1.23.0

BreastSubtypeR: Introduction and Workflow

21 December 2025
