1 Introduction

Data quality assessment is an integral part of preparatory data analysis to ensure sound biological information retrieval. We present here the MatrixQCvis package, which provides shiny-based interactive visualization of data quality metrics at the per-sample and per-feature level. It is broadly applicable to quantitative omics data types that come in matrix-like format (features x samples). It enables the detection of low-quality samples, drifts, outliers and batch effects in data sets. Visualizations include amongst others bar- and violin plots of the (count/intensity) values, mean vs standard deviation plots, MA plots, empirical cumulative distribution function (ECDF) plots, visualizations of the distances between samples, and multiple types of dimension reduction plots. MatrixQCvis builds upon the Bioconductor SummarizedExperiment S4 class and enables thus the facile integration into existing workflows.

MatrixQCvis is especially addressed to analyze the quality of proteomics and metabolomics data sets that are characterized by missing values as it allows the user for imputation of missing values and differential expression analysis using the proDA package (Ahlman-Eltze and Anders 2019). Besides this, MatrixQCvis is extensible to other type of data(e.g. transcriptomics count data) that can be represented as a SummarizedExperiment object. Furthermore, the shiny application facilitates simple differential expression analysis using either moderated t-tests (from the limma package, Ritchie et al. (2015)) or Wald tests (from the proDA package, Ahlman-Eltze and Anders (2019)).

Within this vignette, the term feature will refer to a probed molecular entity, e.g. gene, transcript, protein, peptide, or metabolite.

In the following, we will describe the major setup of MatrixQCvis and the navigation through the shiny application, shinyQC.

Questions and bugs

MatrixQCvis is currently under active development. If you discover any bugs, typos or develop ideas of improving MatrixQCvis feel free to raise an issue via GitHub or send a mail to the developer.

2 Prepare the environment

To install MatrixQCvis enter the following to the R console

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MatrixQCvis")

Before starting with the analysis, load the MatrixQCvis package. This will also load the required packages Biobase, BiocGenerics, GenomeInfoDb, GenomicRanges, ggplot2, IRanges, MatrixGenerics, parallel, matrixStats, plotly, shiny, SummarizedExperiment, and stats4.

library(MatrixQCvis)

3 Appearance of user interface depends on the data input

Please note: Depending on the supplied SummarizedExperiment object the user interface of shinyQC will differ:

  • for a SummarizedExperiment object containing missing values
    • the tabs Samples, Measured Values, Missing Values, Values, Dimension Reduction and DE will be displayed,
    • within the tabs Values and Dimension Reduction the imputed data set will be visualized,
    • the sidebar panel includes a drop-down menu for the imputation method.
  • for a SummarizedExperiment object containing no missing values (i.e. with complete observations)
    • the tabs Samples, Values, Dimension Reduction and DE will be displayed,
    • within the tabs Values and DE the imputed data set will not be visualized,
    • the sidebar panel does not includes a drop-down menu for the imputation method.

In the following, the vignette will be (mainly) described from the point of view of a SummarizedExperiment containing no missing values (RNA-seq dataset) and missing values (proteomics dataset).

4 QC analysis of TCGA RNA-seq data

Here, we will retrieve a SummarizedExperiment, se, object
from the ExperimentHubpackage. The dataset (GEO accession GSE62944) contains 741 normal samples across 24 cancer types from the TCGA re-processed RNA-seq data. We will use the dataset obtained by ExperimentHub to showcase the functionality of shinyQC.

library(ExperimentHub)
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
## 
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
## 
##     cache
eh <- ExperimentHub()
## Assuming valid proxy connection through ':1'
##  If you experience connection issues consider using 'localHub=TRUE'
## the SummarizedExperiment object has the title "RNA-Sequencing and clinical  
## data for 741 normal samples from The Cancer Genome Atlas"
eh[eh$title == "RNA-Sequencing and clinical data for 741 normal samples from The Cancer Genome Atlas"]
## ExperimentHub with 1 record
## # snapshotDate(): 2024-10-24
## # names(): EH1044
## # package(): GSE62944
## # $dataprovider: GEO
## # $species: Homo sapiens
## # $rdataclass: SummarizedExperiment
## # $rdatadateadded: 2017-10-29
## # $title: RNA-Sequencing and clinical data for 741 normal samples from The C...
## # $description: TCGA RNA-seq Rsubread-summarized raw count data for 741 norm...
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: tar.gz
## # $sourceurl: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
## # $sourcesize: NA
## # $tags: c("ExperimentData", "Genome", "DNASeqData", "RNASeqData") 
## # retrieve record with 'object[["EH1044"]]'
## in a next step download the SummarizedExperiment object from ExperimentHub
se <- eh[["EH1044"]]
## see ?GSE62944 and browseVignettes('GSE62944') for documentation
## loading from cache
## here we will restrain the analysis on 40 samples and remove the features
## that have a standard deviation of 0
se <- se[, seq_len(40)]
se_sds <- apply(assay(se), 1, sd, na.rm = TRUE)
se <- se[!is.na(se_sds) & se_sds > 0, ]

The most important function to assess the data quality is the shinyQC function and its most important argument is se. shinyQC expects a SummarizedExperiment object.

The shinyQC function sets the following requirements to the SummarizedExperiment object se: - rownames(se) are not allowed to be NULL and have to be set to the feature names, - colnames(se) are not allowed to be NULL and have to be set to the sample names, - colnames(se), colnames(assay(se)) and rownames(colData(se)) all have to be identical.

If these requirements are not met, shinyQC will stop and throw an error.

Alternatively, a SummarizedExperiment object can also be loaded from within shinyQC (when no se object is supplied to shinyQC).

Objects belonging to the SummarizedExperiment class are containers for one or more assays, which are (numerical) matrices containing the quantitative, measured information of the experiment. The rows represent features of interest (e.g.  transcripts, peptides, proteins, or metabolites) and the columns represent the samples. The SummarizedExperiment object stores also information on the features of interest (accessible by rowData) and information on the samples (accessible by colData). The name of samples and features will be accessed from colnames(se) and rownames(se), respectively.

If there is more than one experimental data set (assay) stored in the SummarizedExperiment object, a select option will appear in the sidebar allowing the user to select the assay.

The actual shiny application can then be started by entering the following to the R console:

qc <- shinyQC(se)

The assignment to qc or any other object is not mandatory. Upon exiting the shiny application, shinyQC will return a SummarizedExperiment object containing the imputed dataset that can be in the following further analyzed. The object will only be returned if the function call was assigned to an object, e.g.  qc <- shinyQC(se).

Now, we will have a closer look on the user interface of the Shiny application.

4.2 Tab: Samples

The tab Samples gives general information on the number of samples in the se object.

4.2.1 Histogram

The first panel shows a barplot and displays the number of samples per sample type, treatment, etc. As an example, if we want to display how many samples are in se for the different types (type is a column name in colData(se) and any column in colData(se) can be selected), this panel will show the following output:

4.2.2 Mosaic plot

The figure in this panel displays the relative proportions of the numbers, e.g. how many samples (in %) are there for type against arbitrary_values. As the dataset was only shipped with a type and sample column, for demonstration, we manually added the column arbitrary_levels to the se object. This column is filled with the the values "A" and "B".

Again, type and arbitrary_values are columns in colData(se) and any column in colData(se) can be selected to create the Mosaic plot.

The figure will tell us to what extent the se contains the different types in a balanced manner depending on arbitrary_values:

4.3 Tab: Values

The tab Values will take a closer look on the assay slot of the SummarizedExperiment.

4.3.1 Boxplot/Violin plot

This panel shows the (count/intensity) values for raw (raw), batch corrected (normalized), normalized+batch corrected (batch corrected), normalized+batch corrected+transformed (transformed), and normalized+batch corrected+transformed+imputed (imputed) (count/intensity) values (imputation of missing values, imputed will only be shown if there are missing values in the SummarizedExperiment). As already mentioned, the different methods for normalization, batch correction, transformation (and imputation) are specified in the side panel.

For visualization purposes only, the (count/intensity) values for the raw, normalized and batch corrected data sets can be log2 transformed (see the radio buttons in Display log2 values? (only for 'raw', 'normalized' and 'batch corrected')).

The type of visualization (boxplot or violin plot) can be specified by selecting boxplot or violin in the radio button panel (Type of display).

The figure (violin plot) using the raw values will look like (log set to TRUE)

4.3.2 Trend/drift

This panel shows a trend line for aggregated values to indicate drifts/trends in data acquisition. It shows the sum- or median-aggregated values (specified in Select aggregation). The plot displays trends in data acquisition that originate e.g. from differences in instrument sensitivity. The panel displays aggregated values for raw (raw), batch corrected (batch corrected), batch corrected+normalized (normalized), batch corrected+normalized+transformed (transformed), and batch corrected+normalized+transformed+imputed (imputed) (count/intensity) values (imputation of missing values, imputed will only be shown if there are missing values in the SummarizedExperiment). The different methods for normalization, batch correction, transformation (and imputation) are specified in the sidebar panel.

The smoothing is calculated from the selection of samples that are specified by the drop-down menus Select variable and Select level to highlight. The menu Select variable corresponds to the colnames in colData(se). Here, we can select for the higher-order variable, e.g. the type (containing for example BLCA, BRCA, etc.). The drop-down menu Select level to highlight will specify the actual selection from which the trend line will be calculated (e.g. BLCA, BRCA, etc.). Also, the menu will always include the level all, which will use all points to calculate the trend line. If we want to calculate the trend line of aggregated values of all samples belonging to the type QC, we select QC in the drop-down menu.

The panel allows the users for further customization after expanding the collapsed box. The data input is selected in the drop-down menu under Select data input. The smoothing method (either LOESS or linear model) is selected in the drop-down menu under Select smoothing method. The aggregation method is selected in the drop-down menu Select smoothing method.

With the drop-down menu Select categorical variable to order samples, the samples (x-axis) will be ordered alphanumerically according to the selected level (and the sample name).

Here, we are interested in observing if there is a trend/drift for samples of type BRCA. We select LOESS as the method for the trend line and median as the aggregation method. The figure will then look as follows:

4.3.3 Coefficient of variation

This panel shows the coefficient of variation values for raw (raw), normalized (normalized), normalized+batch corrected (batch corrected), normalized+batch corrected+transformed (transformed), and normalized+batch corrected+transformed+imputed (imputed) (count/intensity) values (imputation of missing values, imputed will only be shown if there are missing values in the SummarizedExperiment) among the samples. The different methods for normalization, batch corrected, transformation, (and imputation) are specified in the sidebar panel.

The panel displays the coefficient of variation values from the samples of the SummarizedExperiment object. The coefficients of variation are calculated according to the formula sd(x) / mean(x) * 100 with x the sample values and sd the standard deviation. The plot might be useful when looking at the coefficient of variation values of a specific sample type (e.g. QCs) and trying to identify outliers.

Here, we shows the plot of coefficient of variation values from the raw values (as obtained by assay(se)), normalized values (using sum normalization), transformed values (using vsn), batch corrected values (using none) and imputed values (using the MinDet algorithm, Lazar et al. (2016)).

4.3.4 Mean-sd plot

The panel shows the three mean-sd (standard deviation) plots for normalized+batch corrected+transformed (transformed) and normalized+batch corrected+transformed+imputed (imputed) values (imputed will only be shown if there are missing values in the SummarizedExperiment). The sd and mean are calculated feature-wise from the values of the respective data set. The plot allows the user to visualize if there is a dependence of the sd on the mean. The red line depicts the running median estimator (window-width 10%). In case of sd-mean independence, the running median should be approximately horizontal.

For the transformed values, the mean-sd plot will look like

4.3.5 MA plot

The panel displays MA plots and Hoeffding’s D statistic.

In the first part of the panel the A vs. M plots per sample are depicted.

The values are defined as follows

  • \(A = \frac{1}{2} \cdot (I_i + I_j)\) and
  • \(M = I_i−I_j\) ,

where \(I_i\) and \(I_j\) are per definition log2-transformed values. In the case of raw, normalized, or batch corrected the values are log2-transformed prior to calculating A and M. In case of transformed or imputed the values are taken as they are (N.B. when the transformation method is set to none the values are not log2-transformed).

The values for \(I_i\) are taken from the sample \(i\). For \(I_j\), the feature-wise means are calculated from the values of the group \(j\) of samples specified by the drop-down menu group. The sample for calculating \(I_i\) is excluded from the group \(j\). The group can be set to "all" (i.e. all samples except sample \(i\) are used to calculate \(I_j\)) or any other column in colData(se). For any group except "all" the group is taken to which the sample \(i\) belongs to and the sample \(i\) is excluded from the feature-wise calculation.

The MA values for all samples are by default displayed facet-wise. The MA plot can be set to a specific sample by changing the selected value in the drop-down menu plot.

The underlying data set can be selected by the drop-down menu (Data set for the MA plot).

In the second part of the panel, the Hoeffding’s D statistic values are visualized for the different data sets raw, normalized, batch corrected, transformed, and imputed (imputed will only be shown if there are missing values in the SummarizedExperiment).

D is a measure of the distance between F(A, M) and G(A)H(M), where F(A, M) is the joint cumulative distribution function (CDF) of A and M, and G and H are marginal CDFs. The higher the value of D, the more dependent are A and M. The D values are connected for the same samples along the different data sets (when lines is selected), enabling the user to track the influence of normalization, batch correction, transformation (and imputation) methods on the D values.

The MA plot using the raw values and group = "all" will look like the following (plot = “Sample_1-1”`):

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_hex()`).

4.3.6 ECDF

The plot shows the ECDF of the values of the sample \(i\) and the feature-wise mean values of a group \(j\) of samples specified by the drop-down menu Group. The sample for calculating \(I_i\) is excluded from the group \(j\). The group can be set to "all" (i.e. all samples except sample \(i\) are used to calculate \(I_j\)) or any other column in colData(se). For any group except "all" the group is taken to which the sample \(i\) belongs to and the sample \(i\) is excluded from the feature-wise calculation.

The underlying data set can be selected by the drop-down menu Data set for the MA plot. The sample \(i\) can be selected by the drop-down menu Sample. The group can be selected by the drop-down menu Group.

The ECDF plot for the sample "TCGA-K4-A3WV-11A-21R-A22U-07", group = "all", and the raw values (as obtained by assay(se)) will look like:

## Warning in ks.test.default(x = value_s, y = value_g, exact = NULL, alternative
## = "two.sided"): p-value will be approximate in the presence of ties

4.3.7 Distance matrix

On the left, the panel depicts heatmaps of distances between samples for the data sets of raw (raw), normalized (normalized), normalized+batch corrected (batch corrected), normalized+batch corrected+transformed (transformed), and normalized+batch corrected+transformed+imputed (imputed, imputed will only be shown if there are missing values in the SummarizedExperiment). The annotation of the heatmaps can be adjusted by the drop-down menu annotation.

On the right panel the sum of distances to other samples is depicted.

The distance matrix and sum of distances (for raw values, as obtained by assay(se)) will look like:

4.3.8 Features

The first plot shows the values for each samples along the data processing steps. The feature to be displayed is selected via the drop-down menu Select feature.

The second plot shows the coefficient of variation of the values for all features in the data set along the data processing steps. The features of same identity can be connected by lines by clicking on lines.