ProBatchFeatures is an extension of
proBatch based on QFeatures (1), which extends it’s functionality by storing
each processing stage (for example, raw data → log-transformation →
normalization → batch effect correction → …) as a new
SummarizedExperiment assay, and logging every step. It
maintains full compatibility with QFeatures functions, and
facilitates quick benchmarking of data preprocessing pipelines.
This vignette demonstrates how to:
Construct a ProBatchFeatures object from
matrix-formatted quantification data.
Apply and record standard preprocessing steps (filtering features with too many missing values, log-transformation, normalization, and batch-effect correction) in a chained workflow.
Perform diagnostic checks.
Retrieve processed assay data and the corresponding preprocessing pipeline for benchmarking or reuse.
To install the latest version of proBatch package, use
BiocManager package:
This vignette uses dplyr, tibble, and
ggplot2 (alongside proBatch) for data
manipulation and visualization.
As an example, we will use the E. coli dataset described in the FedProt paper (2):
The dataset consists of data from five centers that were measured and analyzed independently. For simplicity, data from all centers have been merged into a single dataset, and a subset containing 400 protein groups and their corresponding peptides is used in this example.
ProBatchFeatures objectThe dataset contains two hierarchical levels. While additional levels
can be added in a manner similar to QFeatures, this example
focuses on the peptide and protein group levels. The
all_precursor_pg_match dataframe stores the mapping between
these levels. We begin by using the ProBatchFeatures()
constructor to create an object representing the peptide-level data:
# Building from a (features x samples) matrix
pbf <- ProBatchFeatures(
data_matrix = all_precursors, # matrix of peptide intensities (features x samples)
sample_annotation = all_metadata, # data.frame of sample metadata
sample_id_col = "Run", # column in metadata that matches sample IDs
level = "peptide" # label this assay level as "peptide"
)
pbf # show basic info about the ProBatchFeatures object
#> An instance of class ProBatchFeatures (type: bulk) with 1 set:
#>
#> [1] peptide::raw: SummarizedExperiment with 1463 rows and 118 columns
#> Processing chain: unprocessed data (raw)Assay names follow the convention “
Next, we add the protein-level data as a second assay in the same
ProBatchFeatures instance. Internally, this is achieved by
creating a SummarizedExperiment for the protein-level data
and adding it to the existing object using QFeatures
functionality. The same sample annotation should be used for the new
assay to ensure correct column alignment:
# Adding proteins as a new level and mapping them to precursors
# all_precursor_pg_match has columns: "Precursor.Id", "Protein.Ids"
pbf <- pb_add_level(
object = pbf,
from = "peptide::raw",
new_matrix = all_protein_groups,
to_level = "protein", # will name "protein::raw" by default
mapping_df = all_precursor_pg_match,
from_id = "Precursor.Id",
to_id = "Protein.Ids",
map_strategy = "as_is"
)
pbf
#> An instance of class ProBatchFeatures (type: bulk) with 2 sets:
#>
#> [1] peptide::raw: SummarizedExperiment with 1463 rows and 118 columns
#> [2] protein::raw: SummarizedExperiment with 400 rows and 118 columns
#> Processing chain:
#> [1] add_level(protein)_byVar
#> Steps logged: 1 (see get_operation_log())If a precursor (peptide) maps to multiple protein groups,
map_strategy parameter specifies how to resolve
multimappers. Available options are “first”, “longest”, and “as_is”. The
“first” and “longest” strategies select a single ProteinID per peptide,
where first takes the first match found in the mapping dataframe, and
longest selects the ProteinID with the longest sequence, while “as_is”
assumes a one-to-one mapping between identifiers.
# Check
validObject(pbf) # should be TRUE
#> [1] TRUE
assayLink(pbf, "protein::raw") # verify link summary
#> AssayLink for assay <protein::raw>
#> [from:peptide::raw|fcol:ProteinID|hits:1463]
# Remove unused variables to clean up the workspace
rm(all_metadata, all_precursor_pg_match, all_precursors, all_protein_groups)With the data now loaded into a ProBatchFeatures object
with peptide- and protein-level assays, we can demonstrate an example of
typical preprocessing workflow. This pipeline includes filtering
low-quality features, applying a log2-transformation, median
normalization, and batch effect correction.
Features detected in only a small subset of samples are unreliable
and may obscure systematic effects in subsequent analyses. To examine
the proportion of missing values across datasets, pb_nNA()
function is used:
Function pb_nNA() summarizes missingness per assay:
# pb_nNA returns a DataFrame with counts of NAs per feature/sample
na_counts <- pb_nNA(pbf)
# na_counts contains info about total number of NAs and % in the data:
na_counts[["nNA"]] %>% as_tibble()
#> # A tibble: 2 × 3
#> assay nNA pNA
#> <chr> <int> <dbl>
#> 1 peptide::raw 78870 0.457
#> 2 protein::raw 9163 0.194Here nNA is a number of missing values in the assay and pNA is its proportion. It is possible to inspect the per-feature and per-sample breakdown for each assay:
# check NAs per sample:
head(na_counts[["peptide::raw"]]$nNAcols) %>% as_tibble()
#> # A tibble: 6 × 4
#> assay name nNA pNA
#> <chr> <chr> <int> <dbl>
#> 1 peptide::raw lab_A_S1 496 0.339
#> 2 peptide::raw lab_A_S2 483 0.330
#> 3 peptide::raw lab_A_S3 524 0.358
#> 4 peptide::raw lab_A_S4 525 0.359
#> 5 peptide::raw lab_A_S5 541 0.370
#> 6 peptide::raw lab_A_S6 555 0.379For each assay the following tables are availabe:
Visualisations helps explore batch-specific patterns of missingness. Firstly, we define a color scheme which will be used in all diagnostic plots:
# color scheme for lab and condition labels
color_scheme <- sample_annotation_to_colors(
pbf,
sample_id_col = "Run",
factor_columns = c("Lab", "Condition"),
numeric_columns = NULL
)Displaying features that contain at least one missing as a heatmap:
plot_NA_heatmap(
pbf, # by default the last created assay
show_rownames = FALSE, show_row_dend = FALSE,
color_by = "Lab"
)To display all features, set the parameter
drop_complete=F. When the number of features or samples
exceeds 5000, a random subset of 5000 rows/columns is used for
visualization by default. To display the entire dataset, set
use_subset = T.
To display multiple assays in one plot, specify them using
pbf_name parameter:
plot_NA_heatmap(
pbf,
pbf_name = c("peptide::raw", "protein::raw"),
show_rownames = FALSE, show_row_dend = FALSE,
color_by = "Lab"
)The overlap between peptide and protein identifications across samples can be visualized as follows:
In the resulting plot, particularly for peptides, batch-specific patterns are evident, reflecting the 19-20 samples contributed by each laboratory.
Here, we apply a simple filter that retains only features quantified in at least 40% of the samples. This threshold can be adjusted if necessary.
pbf <- pb_filterNA(
pbf, # without specifying, the filter will be applied to all assays in the object
inplace = TRUE, # if false, filtered matrix will be saved as a new assay.
pNA = 0.6 # the maximal proportion of missing values per feature
)
pbf
#> An instance of class ProBatchFeatures (type: bulk) with 2 sets:
#>
#> [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns
#> [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns
#> Processing chain:
#> [1] add_level(protein)_byVar filterNA filterNA
#> Steps logged: 3 (see get_operation_log())After filtering, the ProBatchFeatures object stores
fewer features, which sharpens downstream diagnostics.
plot_NA_heatmap(
pbf,
pbf_name = c("peptide::raw", "protein::raw"),
show_rownames = FALSE, show_row_dend = FALSE,
color_by = "Lab", # currently only one level is supported
border_color = NA # pheatmap parameter to remove cell borders
)However, as the plots indicate, the remaining patterns of missingness
are still associated with the lab.
To remove the residual systematic effects from the data, we further
perform normalization and batch-effect correction.
Proteomics data are commonly log-transformed to stabilize variance as required for furter statistical analyzes and data visualization. Here, we perform a log2-transformation of both peptide- and protein-level data:
pbf <- log_transform_dm(
pbf,
log_base = 2, offset = 1,
pbf_name = "protein::raw"
)
pbf <- log_transform_dm(
pbf,
log_base = 2, offset = 1,
pbf_name = "peptide::raw"
)
pbf
#> An instance of class ProBatchFeatures (type: bulk) with 2 sets:
#>
#> [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns
#> [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns
#> Processing chain:
#> [1] add_level(protein)_byVar filterNA filterNA ; [4] log2 log2
#> - peptide: log2_on_raw
#> - protein: log2_on_raw
#> Steps logged: 5 (see get_operation_log())After log-transformation, the ProBatchFeatures instance
pbf contains additional assays
(peptide::log2_on_raw and
protein::log2_on_raw). If the assay was transformed using
“fast to recompute” step (for exmaple, log-transformation or median
normalization), the transformed data will not be saved in the list of
assays, but only re-computed for plotting. To force storing the results
of this steps store_fast_steps = T parameter can be
added.
The intensity distributions of each sample after log-transformation can be visualized as boxplots to identify potential batch-specific shifts:
plot_boxplot(
pbf,
pbf_name = c("peptide::log2_on_raw", "protein::log2_on_raw"), # plot multiple on one
sample_id_col = "Run",
order_col = "Run",
batch_col = "Lab",
color_by_batch = TRUE,
color_scheme = color_scheme,
base_size = 7,
plot_ncol = 1, # how many columns use for plotting multy-assay plot
plot_title = c( # title for each assay
"Peptide level - log2 scale",
"Protein level - log2 scale"
)
)Additionally, the intensity distribution of peptides and proteins with and without NAs can be plotted:
The comparison confirms that log2 stabilises the variance for both
peptides and proteins, yet the cross-lab shifts in median intensity
persist and will need to be addressed downstream.
Next, we will perform a Principal Component Analysis (PCA) to get a high-level overview of the sample clustering. We expect to see a strong batch effect, where samples cluster by their center of origin (“Lab”) rather than their biological condition (“Condition”).
pca1 <- plot_PCA(
pbf,
pbf_name = "protein::log2_on_raw",
sample_id_col = "Run",
color_scheme = color_scheme,
color_by = "Lab",
shape_by = "Condition",
fill_the_missing = NULL, # remove all rows with missing values. Can be also "remove"
plot_title = "NA rows removed, protein, log2",
base_size = 10, point_size = 3, point_alpha = 0.5
)
pca2 <- plot_PCA(
pbf,
pbf_name = "protein::log2_on_raw",
sample_id_col = "Run",
color_scheme = color_scheme,
color_by = "Condition",
shape_by = "Lab",
fill_the_missing = 0, # default value is -1
plot_title = "NA replaced with 0, protein, log2",
base_size = 10, point_size = 3, point_alpha = 0.5
)
grid.arrange(pca1, pca2, ncol = 2, nrow = 1)Simiarly, the PCA plots clearly demonstrate that the primary source of variance in the raw data is the lab.
To make the samples more comparable by removing the effect of sample load on intensities, we apply median normalization:
pbf <- pb_transform(
object = pbf,
from = "peptide::log2_on_raw",
steps = "medianNorm"
)
pbf <- pb_transform(
object = pbf,
from = "protein::log2_on_raw",
steps = "medianNorm"
)
pbf
#> An instance of class ProBatchFeatures (type: bulk) with 2 sets:
#>
#> [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns
#> [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns
#> Processing chain:
#> [1] add_level(protein)_byVar filterNA filterNA ; [4] log2 log2 medianNorm ; [7] medianNorm
#> - peptide: log2_on_raw, medianNorm_on_log2_on_raw
#> - protein: log2_on_raw, medianNorm_on_log2_on_raw
#> Steps logged: 7 (see get_operation_log())pb_transform() appends
peptide::medianNorm_on_log2_on_raw and
protein::medianNorm_on_log2_on_raw. We can now plot and
compare these assays before and after normalization:
# boxplot
plot_boxplot(
pbf,
sample_id_col = "Run",
order_col = "Run",
batch_col = "Lab",
color_by_batch = TRUE,
color_scheme = color_scheme,
base_size = 7,
pbf_name = c("protein::log2_on_raw", "protein::medianNorm_on_log2_on_raw"),
plot_ncol = 1,
plot_title = c(
"Before Median Normalization, protein level",
"After Median Normalization, protein level"
)
)
# plot PCA plot
plot_PCA(
pbf,
pbf_name = c("protein::log2_on_raw", "protein::medianNorm_on_log2_on_raw"),
sample_id_col = "Run",
color_scheme = color_scheme,
color_by = "Lab",
shape_by = "Condition",
fill_the_missing = NULL, # remove all rows with missing values. Can be also "remove"
plot_title = c(
"Before Median Normalization, protein level",
"After Median Normalization, protein level"
),
base_size = 10, point_size = 3, point_alpha = 0.5
)Although median normalization made itensity distributions more similar, the residual lab-specific batch effects remain evident on the PCA plots and need to be corrected.
The user can choose from two methods for batch effects correction:
ComBat (3) from
sva and removeBatchEffect() from
limma (4). Both methods
can be accessed via the same function pb_transform().
Apply limma’s removeBatchEffect() to
log-transformed and normalized data:
# limma::removeBatchEffect method
pbf <- pb_transform(
pbf,
from = "protein::log2_on_raw",
steps = "limmaRBE",
params_list = list(list(
sample_annotation = sa,
batch_col = "Lab",
covariates_cols = c("Condition"),
fill_the_missing = FALSE
))
)
pbf <- pb_transform(
pbf,
from = "protein::medianNorm_on_log2_on_raw",
steps = "limmaRBE",
params_list = list(list(
sample_annotation = sa,
batch_col = "Lab",
covariates_cols = c("Condition"),
fill_the_missing = FALSE
))
)As batch effect correction does not listed as a
fast step, two new protein-level assay are added to the
pbd: - [3] protein::limmaRBE_on_log2_on_raw and - [4]
protein::limmaRBE_on_medianNorm_on_log2_on_raw.
print(pbf)
#> An instance of class ProBatchFeatures (type: bulk) with 4 sets:
#>
#> [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns
#> [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns
#> [3] protein::limmaRBE_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns
#> [4] protein::limmaRBE_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns
#> Processing chain:
#> [1] add_level(protein)_byVar filterNA filterNA ; [4] log2 log2 medianNorm ; [7] medianNorm limmaRBE limmaRBE
#> - peptide: log2_on_raw, medianNorm_on_log2_on_raw
#> - protein: log2_on_raw, limmaRBE_on_log2_on_raw, medianNorm_on_log2_on_raw, limmaRBE_on_medianNorm_on_log2_on_raw
#> Steps logged: 9 (see get_operation_log())Alternatively, batch effect correction can be performed using
ComBat:
# sva::ComBat wrapped via pb_transform()
pbf <- pb_transform(
pbf,
from = "protein::log2_on_raw",
steps = "combat",
params_list = list(list(
sample_annotation = sa,
batch_col = "Lab",
sample_id_col = "Run",
par.prior = TRUE,
fill_the_missing = "remove"
))
)
pbf <- pb_transform(
pbf,
from = "protein::medianNorm_on_log2_on_raw",
steps = "combat",
params_list = list(list(
sample_annotation = sa,
batch_col = "Lab",
sample_id_col = "Run",
par.prior = TRUE,
fill_the_missing = "remove"
))
)The ComBat method from the sva package
cannot be applied to data containing missing values.
Therefore, features with any missing values are removed by
default.
Alternatively, missing values can be imputed with a user-defined
constant using the fill_the_missing parameter (default:
-1).
print(pbf)
#> An instance of class ProBatchFeatures (type: bulk) with 6 sets:
#>
#> [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns
#> [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns
#> [3] protein::limmaRBE_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns
#> [4] protein::limmaRBE_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns
#> [5] protein::combat_on_log2_on_raw: SummarizedExperiment with 244 rows and 118 columns
#> [6] protein::combat_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 244 rows and 118 columns
#> Processing chain:
#> [1] add_level(protein)_byVar filterNA filterNA ; [4] log2 log2 medianNorm ; [7] medianNorm limmaRBE limmaRBE ; [10] combat combat
#> - peptide: log2_on_raw, medianNorm_on_log2_on_raw
#> - protein: log2_on_raw, combat_on_log2_on_raw, limmaRBE_on_log2_on_raw, medianNorm_on_log2_on_raw, combat_on_medianNorm_on_log2_on_raw, limmaRBE_on_medianNorm_on_log2_on_raw
#> Steps logged: 11 (see get_operation_log())Similarly, two new protein-level assays were added to
pdf: - [5] protein::combat_on_log2_on_raw: … with 244 rows
- [6] protein::combat_on_medianNorm_on_log2_on_raw: … with 244 rows.
And as we removed features with missing values, these assays contain less protein groups than the raw filtered data (331 rows).
All steps can be applied to peptide-level data in the same way:
pbf <- pb_transform(
pbf,
from = "peptide::medianNorm_on_log2_on_raw",
steps = "limmaRBE",
params_list = list(
list(
sample_annotation = sa,
batch_col = "Lab",
covariates_cols = c("Condition"),
fill_the_missing = FALSE
)
)
)
pbf <- pb_transform(
pbf,
from = "peptide::medianNorm_on_log2_on_raw",
steps = "combat",
params_list = list(
list(
sample_annotation = sa,
batch_col = "Lab",
sample_id_col = "Run",
par.prior = TRUE,
fill_the_missing = "remove"
)
)
)
pbf
#> An instance of class ProBatchFeatures (type: bulk) with 8 sets:
#>
#> [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns
#> [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns
#> [3] protein::limmaRBE_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns
#> ...
#> [6] protein::combat_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 244 rows and 118 columns
#> [7] peptide::limmaRBE_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 867 rows and 118 columns
#> [8] peptide::combat_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 152 rows and 118 columns
#> Processing chain:
#> [1] add_level(protein)_byVar filterNA filterNA ; [4] log2 log2 medianNorm ; [7] medianNorm limmaRBE limmaRBE ; [10] combat combat limmaRBE ; [13] combat
#> - peptide: log2_on_raw, medianNorm_on_log2_on_raw, combat_on_medianNorm_on_log2_on_raw, limmaRBE_on_medianNorm_on_log2_on_raw
#> - protein: log2_on_raw, combat_on_log2_on_raw, limmaRBE_on_log2_on_raw, medianNorm_on_log2_on_raw, combat_on_medianNorm_on_log2_on_raw, limmaRBE_on_medianNorm_on_log2_on_raw
#> Steps logged: 13 (see get_operation_log())Two new peptide-level assays were added to pdf: - [7]
peptide::limmaRBE_on_medianNorm_on_log2_on_raw: … with 867 rows - [8]
peptide::combat_on_medianNorm_on_log2_on_raw: … with 152 rows (as
NA-containing features were removed before ComBat).
The four new assays (protein::limmaRBE_on_… and
protein::combat_on_…) are now stored alongside the previous
processing steps. We can explore these assays using PCA to verify that
the batch effect has been successfully removed and compare them to the
data before batch effect correction.
The progression from log\(_2\)-transformed to batch-corrected assays demonstrates the expected reduction in lab-specific variance and a clearer separation by biological condition in the final panel.
plot_PCA(
pbf,
pbf_name = c(
"protein::log2_on_raw",
"protein::medianNorm_on_log2_on_raw",
"protein::limmaRBE_on_medianNorm_on_log2_on_raw",
"protein::combat_on_medianNorm_on_log2_on_raw"
),
sample_id_col = "Run",
color_scheme = color_scheme,
color_by = "Lab",
shape_by = "Condition",
fill_the_missing = NULL,
base_size = 8, point_size = 3, point_alpha = 0.5
)Similarly, for peptide-level data:
plot_PCA(
pbf,
pbf_name = c(
"peptide::log2_on_raw",
"peptide::medianNorm_on_log2_on_raw",
"peptide::limmaRBE_on_medianNorm_on_log2_on_raw",
"peptide::combat_on_medianNorm_on_log2_on_raw"
),
sample_id_col = "Run",
color_scheme = color_scheme,
color_by = "Lab",
shape_by = "Condition",
fill_the_missing = "remove",
base_size = 8, point_size = 3, point_alpha = 0.5
)Hierarchical clustering provides another way to explore sample
similarities. Here we display clustering results for
protein::combat_on_medianNorm_on_log2_on_raw and
protein::limmaRBE_on_medianNorm_on_log2_on_raw assays and
colour samples by lab and condition:
plot_hierarchical_clustering(
pbf,
pbf_name = "protein::combat_on_medianNorm_on_log2_on_raw",
sample_id_col = "Run",
label_font = 0.6,
color_list = color_scheme
)plot_hierarchical_clustering(
pbf,
pbf_name = "protein::limmaRBE_on_medianNorm_on_log2_on_raw",
sample_id_col = "Run",
label_font = 0.6,
color_list = color_scheme,
fill_the_missing = "remove"
)PVCA quantifies the contribution of known factors to the observed variance (5) which can be compared across raw, normalised, and ComBat-corrected assays:
plot_PVCA(
pbf,
pbf_name = c(
"protein::medianNorm_on_log2_on_raw",
"protein::combat_on_medianNorm_on_log2_on_raw"
),
sample_id_col = "Run",
technical_factors = c("Lab"),
biological_factors = c("Condition"),
fill_the_missing = NULL,
base_size = 7,
plot_ncol = 2, # the number of plots in a row
variance_threshold = 0 # the the percentile value of weight each of the
# covariates needs to explain
# (the rest will be lumped together)
)Each call to pb_transform() or
pb_filterNA() creates a log entry that records the source
assay, the applied operation, and the resulting assay name. Reviewing
this log ensures that complex processing pipelines remain transparent
and reproducible.
get_operation_log(pbf) %>%
as_tibble()
#> # A tibble: 13 × 7
#> step fun from to params timestamp pkg
#> <chr> <chr> <chr> <chr> <list> <dttm> <chr>
#> 1 add_level(protein)_… addA… pept… prot… <named list> 2026-01-29 04:17:10 proB…
#> 2 filterNA filt… pept… pept… <named list> 2026-01-29 04:17:12 proB…
#> 3 filterNA filt… prot… prot… <named list> 2026-01-29 04:17:12 proB…
#> 4 log2 log2 prot… prot… <named list> 2026-01-29 04:17:13 proB…
#> 5 log2 log2 pept… pept… <named list> 2026-01-29 04:17:13 proB…
#> 6 medianNorm medi… pept… pept… <list [0]> 2026-01-29 04:17:18 proB…
#> 7 medianNorm medi… prot… prot… <list [0]> 2026-01-29 04:17:18 proB…
#> 8 limmaRBE limm… prot… prot… <named list> 2026-01-29 04:17:22 proB…
#> 9 limmaRBE limm… prot… prot… <named list> 2026-01-29 04:17:22 proB…
#> 10 combat comb… prot… prot… <named list> 2026-01-29 04:17:23 proB…
#> 11 combat comb… prot… prot… <named list> 2026-01-29 04:17:23 proB…
#> 12 limmaRBE limm… pept… pept… <named list> 2026-01-29 04:17:23 proB…
#> 13 combat comb… pept… pept… <named list> 2026-01-29 04:17:23 proB…If you need a compact name for the current assay chain (for example,
to label benchmark results), use
pb_pipeline_name(pbf, "protein::combat_on_medianNorm_on_log2_on_raw").
pb_pipeline_name(pbf, "protein::combat_on_medianNorm_on_log2_on_raw")
#> [1] "combat_on_medianNorm_on_log2_on_raw"pbf objectIt is possible to extract a specific data matrix from a
ProBatchFeatures object using the assay name:
extracted_matrix <- pb_assay_matrix(
pbf,
assay = "peptide::raw"
)
# show
extracted_matrix[1:5, 1:5]
#> lab_A_S1 lab_A_S2 lab_A_S3 lab_A_S4 lab_A_S5
#> ELGNWKDFIEVMLR3 NA NA NA NA NA
#> DFIEVMLR2 NA NA NA NA NA
#> AAADLISR2 14647900 13463600 8518130 12172600 11612300
#> AGIGINAGR2 11785300 11025600 9713860 9612410 13795300
#> AEQYLLENETTK2 30835300 29444100 19451100 20090000 22758800
rm(extracted_matrix)For consistency with the main ProBatch vignette,
assay can also be extracted in the long format. In this case, the
sample_id_col needs to be specified:
extracted_long <- pb_as_long(
pbf,
sample_id_col = "Run",
pbf_name = "peptide::raw"
)
# show
extracted_long[1:3, ]
#> feature_label Run Intensity Lab Condition
#> 1 ELGNWKDFIEVMLR3 lab_A_S1 NA lab_A Pyr
#> 2 DFIEVMLR2 lab_A_S1 NA lab_A Pyr
#> 3 AAADLISR2 lab_A_S1 14647900 lab_A Pyr
rm(extracted_long)sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] QFeatures_1.21.0 MultiAssayExperiment_1.37.2
#> [3] SummarizedExperiment_1.41.0 Biobase_2.71.0
#> [5] GenomicRanges_1.63.1 Seqinfo_1.1.0
#> [7] IRanges_2.45.0 S4Vectors_0.49.0
#> [9] BiocGenerics_0.57.0 generics_0.1.4
#> [11] MatrixGenerics_1.23.0 matrixStats_1.5.0
#> [13] gridExtra_2.3 knitr_1.51
#> [15] proBatch_1.99.4 ggplot2_4.0.1
#> [17] tibble_3.3.1 dplyr_1.1.4
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.5.2 ggplotify_0.1.3 preprocessCore_1.73.0
#> [4] XML_3.99-0.20 rpart_4.1.24 lifecycle_1.0.5
#> [7] Rdpack_2.6.5 fastcluster_1.3.0 edgeR_4.9.2
#> [10] doParallel_1.0.17 lattice_0.22-7 MASS_7.3-65
#> [13] backports_1.5.0 magrittr_2.0.4 limma_3.67.0
#> [16] Hmisc_5.2-5 sass_0.4.10 rmarkdown_2.30
#> [19] jquerylib_0.1.4 yaml_2.3.12 wesanderson_0.3.7
#> [22] otel_0.2.0 MsCoreUtils_1.23.2 DBI_1.2.3
#> [25] buildtools_1.0.0 minqa_1.2.8 RColorBrewer_1.1-3
#> [28] lubridate_1.9.4 abind_1.4-8 purrr_1.2.1
#> [31] AnnotationFilter_1.35.0 yulab.utils_0.2.3 nnet_7.3-20
#> [34] rappdirs_0.3.4 sva_3.59.0 maketools_1.3.2
#> [37] genefilter_1.93.0 pheatmap_1.0.13 annotate_1.89.0
#> [40] codetools_0.2-20 DelayedArray_0.37.0 tidyselect_1.2.1
#> [43] farver_2.1.2 lme4_1.1-38 viridis_0.6.5
#> [46] dynamicTreeCut_1.63-1 base64enc_0.1-3 jsonlite_2.0.0
#> [49] Formula_1.2-5 survival_3.8-6 iterators_1.0.14
#> [52] foreach_1.5.2 tools_4.5.2 Rcpp_1.1.1
#> [55] glue_1.8.0 SparseArray_1.11.10 BiocBaseUtils_1.13.0
#> [58] xfun_0.56 mgcv_1.9-4 ggfortify_0.4.19
#> [61] HDF5Array_1.39.0 withr_3.0.2 BiocManager_1.30.27
#> [64] fastmap_1.2.0 pvca_1.51.0 boot_1.3-32
#> [67] rhdf5filters_1.23.3 digest_0.6.39 timechange_0.3.0
#> [70] R6_2.6.1 gridGraphics_0.5-1 colorspace_2.1-2
#> [73] GO.db_3.22.0 RSQLite_2.4.5 h5mread_1.3.1
#> [76] utf8_1.2.6 tidyr_1.3.2 data.table_1.18.2.1
#> [79] httr_1.4.7 htmlwidgets_1.6.4 S4Arrays_1.11.1
#> [82] pkgconfig_2.0.3 gtable_0.3.6 blob_1.3.0
#> [85] S7_0.2.1 impute_1.85.0 XVector_0.51.0
#> [88] sys_3.4.3 htmltools_0.5.9 ProtGenerics_1.43.0
#> [91] clue_0.3-66 scales_1.4.0 png_0.1-8
#> [94] reformulas_0.4.3.1 corrplot_0.95 rstudioapi_0.18.0
#> [97] reshape2_1.4.5 checkmate_2.3.3 nlme_3.1-168
#> [100] nloptr_2.2.1 cachem_1.1.0 rhdf5_2.55.12
#> [103] stringr_1.6.0 parallel_4.5.2 foreign_0.8-90
#> [106] AnnotationDbi_1.73.0 vsn_3.79.5 pillar_1.11.1
#> [109] grid_4.5.2 vctrs_0.7.1 xtable_1.8-4
#> [112] cluster_2.1.8.1 htmlTable_2.4.3 evaluate_1.0.5
#> [115] cli_3.6.5 locfit_1.5-9.12 compiler_4.5.2
#> [118] rlang_1.1.7 crayon_1.5.3 labeling_0.4.3
#> [121] affy_1.89.0 plyr_1.8.9 fs_1.6.6
#> [124] stringi_1.8.7 viridisLite_0.4.2 WGCNA_1.73
#> [127] BiocParallel_1.45.0 Biostrings_2.79.4 lazyeval_0.2.2
#> [130] Matrix_1.7-4 bit64_4.6.0-1 Rhdf5lib_1.33.0
#> [133] KEGGREST_1.51.1 statmod_1.5.1 rbibutils_2.4.1
#> [136] igraph_2.2.1 memoise_2.0.1 affyio_1.81.0
#> [139] bslib_0.10.0 bit_4.6.0To cite this package, please use:
citation("proBatch")
#> To cite proBatch in publications use:
#>
#> Cuklina J., Lee C.H., Williams E.G., Sajic T., Collins B.,
#> Rodriguez-Martinez M., Aebersold R., Pedrioli P. Diagnostics and
#> correction of batch effects in large-scale proteomic studies: a
#> tutorial Molecular Systems Biology 17, e10240 (2021)
#> https://doi.org/10.15252/msb.202110240
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {Diagnostics and correction of batch effects in large-scale proteomic studies: a tutorial},
#> author = {Jelena Cuklina and Chloe H. Lee and Evan G. Willams and Tatjana Sajic and Ben Collins and Maria Rodriguez-Martinez and Patrick Pedrioli and Ruedi Aebersold},
#> journal = {Molecular Systems Biology},
#> year = {2021},
#> url = {https://doi.org/10.15252/msb.202110240},
#> doi = {10.15252/msb.202110240},
#> }