box
variantviolin
variantboxdot
variantviodot
variantviosumm
variantnotch
variantvsScatterPlot()
vsMAPlot()
vsVolcano()
vsFourWay()
Differential gene expression (DGE) is one of the most common applications of RNA-sequencing (RNA-seq) data. This process allows for the elucidation of differentially expressed genes (DEGs) across two or more conditions. Interpretation of the DGE results can be non-intuitive and time consuming due to the variety of formats based on the tool of choice and the numerous pieces of information provided in these results files. Here we present an R package, ViDGER (Visualization of Differential Gene Expression Results using R), which contains nine functions that generate information-rich visualizations for the interpretation of DGE results from three widely-used tools, Cuffdiff, DESeq2, and edgeR.
The stable version of this package is available on Bioconductor. You can install it by running the following:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("vidger")
The latest developmental version of ViDGER
can be installed via GitHub
using the devtools
package:
if (!require("devtools")) install.packages("devtools")
devtools::install_github("btmonier/vidger", ref = "devel")
Once installed, you will have access to the following functions:
vsBoxplot()
vsScatterPlot()
vsScatterMatrix()
vsDEGMatrix()
vsMAPlot()
vsMAMatrix()
vsVolcano()
vsVolcanoMatrix()
vsFourWay()
Further explanation will be given to how these functions work later on in the
documentation. For the following examples, three toy data sets will be used:
df.cuff
, df.deseq
, and df.edger
. Each of these data sets reflect the
three RNA-seq analyses this package covers. These can be loaded in the R
workspace by using the following command:
data(<data_set>)
Where <data_set>
is one of the previously mentioned data sets. Some of the
recurring elements that are found in each of these functions are the type
and d.factor
arguments. The type
argument tells the function how to
process the data for each analytical type (i.e. "cuffdiff"
, "deseq"
, or
"edger"
). The d.factor
argument is used specifically for DESeq2
objects
which we will discuss in the DESeq2 section. All other arguments are discussed
in further detail by looking at the respective help file for each functions
(i.e. ?vsScatterPlot
).
As mentioned earlier, three toy data sets are included with this package. In addition to these data sets, 5 “real-world” data sets were also used. All real-world data used is currently unpublished from ongoing collaborations. Summaries of this data can be found in the following tables:
Table 1: An overview of the toy data sets included in this package. In this table, each data set is summarized in terms of what analytical software was used, organism ID, experimental layout (replicates and treatments), number of transcripts (IDs), and size of the data object in terms of megabytes (MB).
Data | Software | Organism | Reps | Treat. | IDs | Size (MB) |
---|---|---|---|---|---|---|
df.cuff |
CuffDiff | H | 2 | 3 | 1200 | 0.2 |
sapiens | ||||||
df.deseq |
DESeq2 | D. | 2 | 3 | 29391 | 2.3 |
melanogaster | ||||||
df.deseq |
edgeR | A. | 2 | 3 | 724 | 0.1 |
thaliana |
Table 2: “Real-world” (RW) data set statistics. To test the reliability of our package, real data was used from human collections and several plant samples. Each data set is summarized in terms of organism ID, number of experimental samples (n), experimental conditions, and number of transcripts ( IDs).
Data | Organism | n | Exp. Conditions | IDs |
---|---|---|---|---|
RW-1 | H. | 10 | Two treatment dosages taken at two | 198002 |
sapiens | time points and one control sample | |||
taken at one time point | ||||
RW-2 | M. | 24 | Two phenotypes taken at four time | 63517 |
domestia | points (three replicates each) | |||
RW-3 | V. | 6 | Two conditions (three replicates | 59262 |
ripria: | each). | |||
bud | ||||
RW-4 | V. | 6 | Two conditions (three replicates | 17962 |
ripria: | each). | |||
shoot-tip | ||||
(7 days) | ||||
RW-5 | V. | 6 | Two conditions (three replicates | 19064 |
ripria: | each). | |||
shoot-tip | ||||
(21 days) |
Box plots are a useful way to determine the distribution of data. In this case
we can determine the distribution of FPKM or CPM values by using the
vsBoxPlot()
function. This function allows you to extract necessary
results-based data from analytical objects to create a box plot comparing
\(log_{10}\) (FPKM or CPM) distributions for experimental treatments.
vsBoxPlot(
data = df.cuff, d.factor = NULL, type = 'cuffdiff', title = TRUE,
legend = TRUE, grid = TRUE
)
vsBoxPlot(
data = df.deseq, d.factor = 'condition', type = 'deseq',
title = TRUE, legend = TRUE, grid = TRUE
)
vsBoxPlot(
data = df.edger, d.factor = NULL, type = 'edger',
title = TRUE, legend = TRUE, grid = TRUE
)
vsBoxPlot()
can allow for different iterations to showcase data
distribution. These changes can be implemented using the aes
parameter.
Currently, there are 6 different variants:
box
: standard box plotviolin
: violin plotboxdot
: box plot with dot plot overlayviodot
: violin plot with dot plot overlayviosumm
: violin plot with summary stats overlaynotch
: box plot with notchbox
variantdata("df.edger")
vsBoxPlot(
data = df.edger, d.factor = NULL, type = "edger", title = TRUE,
legend = TRUE, grid = TRUE, aes = "box"
)
violin
variantdata("df.edger")
vsBoxPlot(
data = df.edger, d.factor = NULL, type = "edger", title = TRUE,
legend = TRUE, grid = TRUE, aes = "violin"
)
boxdot
variantdata("df.edger")
vsBoxPlot(
data = df.edger, d.factor = NULL, type = "edger", title = TRUE,
legend = TRUE, grid = TRUE, aes = "boxdot"
)
viodot
variantdata("df.edger")
vsBoxPlot(
data = df.edger, d.factor = NULL, type = "edger", title = TRUE,
legend = TRUE, grid = TRUE, aes = "viodot"
)
viosumm
variantdata("df.edger")
vsBoxPlot(
data = df.edger, d.factor = NULL, type = "edger", title = TRUE,
legend = TRUE, grid = TRUE, aes = "viosumm"
)
notch
variantdata("df.edger")
vsBoxPlot(
data = df.edger, d.factor = NULL, type = "edger", title = TRUE,
legend = TRUE, grid = TRUE, aes = "notch"
)
In addition to aesthetic changes, the fill color of each variant can
also be changed. This can be implemented by modifiying the fill.color
parameter.
The palettes that can be used for this parameter are based off of the
palettes found in the RColorBrewer
package. A visual list of all the palettes can be found
here.
data("df.edger")
vsBoxPlot(
data = df.edger, d.factor = NULL, type = "edger", title = TRUE,
legend = TRUE, grid = TRUE, aes = "box", fill.color = "RdGy"
)
data("df.edger")
vsBoxPlot(
data = df.edger, d.factor = NULL, type = "edger", title = TRUE,
legend = TRUE, grid = TRUE, aes = "viosumm", fill.color = "Paired"
)
data("df.edger")
vsBoxPlot(
data = df.edger, d.factor = NULL, type = "edger", title = TRUE,
legend = TRUE, grid = TRUE, aes = "notch", fill.color = "Greys"
)
This example will look at a basic scatter plot function, vsScatterPlot()
.
This function allows you to visualize comparisons of \(log_{10}\) values of
either FPKM or CPM measurements of two treatments depending on analytical type.
vsScatterPlot(
x = 'hESC', y = 'iPS', data = df.cuff, type = 'cuffdiff',
d.factor = NULL, title = TRUE, grid = TRUE
)
vsScatterPlot(
x = 'treated_paired.end', y = 'untreated_paired.end',
data = df.deseq, type = 'deseq', d.factor = 'condition',
title = TRUE, grid = TRUE
)
vsScatterPlot(
x = 'WM', y = 'MM', data = df.edger, type = 'edger',
d.factor = NULL, title = TRUE, grid = TRUE
)
This example will look at an extension of the vsScatterPlot()
function which
is vsScatterMatrix()
. This function will create a matrix of all possible
comparisons of treatments within an experiment with additional info.
vsScatterMatrix(
data = df.cuff, d.factor = NULL, type = 'cuffdiff',
comp = NULL, title = TRUE, grid = TRUE, man.title = NULL
)
vsScatterMatrix(
data = df.deseq, d.factor = 'condition', type = 'deseq',
comp = NULL, title = TRUE, grid = TRUE, man.title = NULL
)
vsScatterMatrix(
data = df.edger, d.factor = NULL, type = 'edger', comp = NULL,
title = TRUE, grid = TRUE, man.title = NULL
)
Using the vsDEGMatrix()
function allows the user to visualize the number of
differentially expressed genes (DEGs) at a given adjusted p-value (padj =
) for each experimental treatment level. Higher color intensity correlates to
a higher number of DEGs.
vsDEGMatrix(
data = df.cuff, padj = 0.05, d.factor = NULL, type = 'cuffdiff',
title = TRUE, legend = TRUE, grid = TRUE
)
vsDEGMatrix(
data = df.deseq, padj = 0.05, d.factor = 'condition',
type = 'deseq', title = TRUE, legend = TRUE, grid = TRUE
)
vsDEGMatrix(
data = df.edger, padj = 0.05, d.factor = NULL, type = 'edger',
title = TRUE, legend = TRUE, grid = TRUE
)
A grey-scale option is available for this function if you wish to use a
grey-to-white gradient instead of the classic blue-to-white gradient. This
can be invoked by setting the grey.scale
parameter to TRUE
.
vsDEGMatrix(data = df.deseq, d.factor = "condition", type = "deseq",
grey.scale = TRUE
)
vsMAPlot()
visualizes the variance between two samples in terms of gene
expression values where logarithmic fold changes of count data are plotted
against mean counts. For more information on how each of the aesthetics are
plotted, please refer to the figure captions and Method S1.
vsMAPlot(
x = 'iPS', y = 'hESC', data = df.cuff, d.factor = NULL,
type = 'cuffdiff', padj = 0.05, y.lim = NULL, lfc = NULL,
title = TRUE, legend = TRUE, grid = TRUE
)
vsMAPlot(
x = 'treated_paired.end', y = 'untreated_paired.end',
data = df.deseq, d.factor = 'condition', type = 'deseq',
padj = 0.05, y.lim = NULL, lfc = NULL, title = TRUE,
legend = TRUE, grid = TRUE
)
vsMAPlot(
x = 'WW', y = 'MM', data = df.edger, d.factor = NULL,
type = 'edger', padj = 0.05, y.lim = NULL, lfc = NULL,
title = TRUE, legend = TRUE, grid = TRUE
)
Similar to a scatter plot matrix, vsMAMatrix()
will produce visualizations
for all comparisons within your data set. For more information on how the
aesthetics are plotted in these visualizations, please refer to the figure
caption and Method S1.
vsMAMatrix(
data = df.cuff, d.factor = NULL, type = 'cuffdiff',
padj = 0.05, y.lim = NULL, lfc = 1, title = TRUE,
grid = TRUE, counts = TRUE, data.return = FALSE
)
vsMAMatrix(
data = df.deseq, d.factor = 'condition', type = 'deseq',
padj = 0.05, y.lim = NULL, lfc = 1, title = TRUE,
grid = TRUE, counts = TRUE, data.return = FALSE
)
vsMAMatrix(
data = df.edger, d.factor = NULL, type = 'edger',
padj = 0.05, y.lim = NULL, lfc = 1, title = TRUE,
grid = TRUE, counts = TRUE, data.return = FALSE
)
The next few visualizations will focus on ways to display differential gene
expression between two or more treatments. Volcano plots visualize the variance
between two samples in terms of gene expression values where the \(-log_{10}\) of
calculated p-values (y-axis) are a plotted against the \(log_2\) changes
(x-axis). These plots can be visualized with the vsVolcano()
function.
For more information on how each of the aesthetics are plotted, please refer
to the figure captions and Method S1.
vsVolcano(
x = 'iPS', y = 'hESC', data = df.cuff, d.factor = NULL,
type = 'cuffdiff', padj = 0.05, x.lim = NULL, lfc = NULL,
title = TRUE, legend = TRUE, grid = TRUE, data.return = FALSE
)
vsVolcano(
x = 'treated_paired.end', y = 'untreated_paired.end',
data = df.deseq, d.factor = 'condition', type = 'deseq',
padj = 0.05, x.lim = NULL, lfc = NULL, title = TRUE,
legend = TRUE, grid = TRUE, data.return = FALSE
)
vsVolcano(
x = 'WW', y = 'MM', data = df.edger, d.factor = NULL,
type = 'edger', padj = 0.05, x.lim = NULL, lfc = NULL,
title = TRUE, legend = TRUE, grid = TRUE, data.return = FALSE
)
Similar to the prior matrix functions, vsVolcanoMatrix()
will produce
visualizations for all comparisons within your data set. For more information
on how the aesthetics are plotted in these visualizations, please refer to the
figure caption and Method S1.
vsVolcanoMatrix(
data = df.cuff, d.factor = NULL, type = 'cuffdiff',
padj = 0.05, x.lim = NULL, lfc = NULL, title = TRUE,
legend = TRUE, grid = TRUE, counts = TRUE
)
vsVolcanoMatrix(
data = df.deseq, d.factor = 'condition', type = 'deseq',
padj = 0.05, x.lim = NULL, lfc = NULL, title = TRUE,
legend = TRUE, grid = TRUE, counts = TRUE
)
vsVolcanoMatrix(
data = df.edger, d.factor = NULL, type = 'edger', padj = 0.05,
x.lim = NULL, lfc = NULL, title = TRUE, legend = TRUE,
grid = TRUE, counts = TRUE
)
To create four-way plots, the function, vsFourWay()
is used. This plot
compares the \(log_2\) fold changes between two samples and a ‘control’. For more
information on how each of the aesthetics are plotted, please refer to the
figure captions and Method S1.
vsFourWay(
x = 'iPS', y = 'hESC', control = 'Fibroblasts', data = df.cuff,
d.factor = NULL, type = 'cuffdiff', padj = 0.05, x.lim = NULL,
y.lim = NULL, lfc = NULL, legend = TRUE, title = TRUE, grid = TRUE
)
vsFourWay(
x = 'treated_paired.end', y = 'untreated_single.read',
control = 'untreated_paired.end', data = df.deseq,
d.factor = 'condition', type = 'deseq', padj = 0.05, x.lim = NULL,
y.lim = NULL, lfc = NULL, legend = TRUE, title = TRUE, grid = TRUE
)
vsFourWay(
x = 'WW', y = 'WM', control = 'MM', data = df.edger,
d.factor = NULL, type = 'edger', padj = 0.05, x.lim = NULL,
y.lim = NULL, lfc = NULL, legend = TRUE, title = TRUE, grid = TRUE
)
For point-based plots, users can highlight IDs of interest (i.e. genes, transcripts, etc.). Currently, this functionality is implemented in the following functions:
vsScatterPlot()
vsMAPlot()
vsVolcano()
vsFourWay()
To use this feature, simply provide a vector of specified IDs to the
highlight
parameter found in the prior functions. An example of a typical
vector would be as follows:
important_ids <- c(
"ID_001",
"ID_002",
"ID_003",
"ID_004",
"ID_005"
)
important_ids
## [1] "ID_001" "ID_002" "ID_003" "ID_004" "ID_005"
For specific examples using the toy data set, please see the proceeding 4 sub-sections.
vsScatterPlot()
data("df.cuff")
hl <- c(
"XLOC_000033",
"XLOC_000099",
"XLOC_001414",
"XLOC_001409"
)
vsScatterPlot(
x = "hESC", y = "iPS", data = df.cuff, d.factor = NULL,
type = "cuffdiff", title = TRUE, grid = TRUE, highlight = hl
)
vsMAPlot()
hl <- c(
"FBgn0022201",
"FBgn0003042",
"FBgn0031957",
"FBgn0033853",
"FBgn0003371"
)
vsMAPlot(
x = "treated_paired.end", y = "untreated_paired.end",
data = df.deseq, d.factor = "condition", type = "deseq",
padj = 0.05, y.lim = NULL, lfc = NULL, title = TRUE,
legend = TRUE, grid = TRUE, data.return = FALSE, highlight = hl
)
vsVolcano()
hl <- c(
"FBgn0036248",
"FBgn0026573",
"FBgn0259742",
"FBgn0038961",
"FBgn0038928"
)
vsVolcano(
x = "treated_paired.end", y = "untreated_paired.end",
data = df.deseq, d.factor = "condition",
type = "deseq", padj = 0.05, x.lim = NULL, lfc = NULL,
title = TRUE, grid = TRUE, data.return = FALSE, highlight = hl
)
vsFourWay()
data("df.edger")
hl <- c(
"ID_639",
"ID_518",
"ID_602",
"ID_449",
"ID_076"
)
vsFourWay(
x = "WM", y = "WW", control = "MM", data = df.edger,
d.factor = NULL, type = "edger", padj = 0.05, x.lim = NULL,
y.lim = NULL, lfc = 2, title = TRUE, grid = TRUE,
data.return = FALSE, highlight = hl
)
For all plots, users can extract datasets used for the visualizations. You may want to pursue this option if you want to use a highly customized plot script or you would like to perform some unmentioned analysis, for example.
To use this this feature, set the data.return
parameter in the function
you are using to TRUE
. You will also need to assign the function to an
object. See the following example for further details.
In this example, we will use the toy data set df.cuff
, a cuffdiff output
on the function vsScatterPlot()
. Take note that we are assigning the
function to an object tmp
:
# Extract data frame from visualization
data("df.cuff")
tmp <- vsScatterPlot(
x = "hESC", y = "iPS", data = df.cuff, d.factor = NULL,
type = "cuffdiff", title = TRUE, grid = TRUE, data.return = TRUE
)
The object we have created is a list with two elements: data
and plot
.
To extract the data, we can call the first element of the list using the
subset method (<object>[[1]]
) or by invoking its element name
(<object>$data
):
df_scatter <- tmp[[1]] ## or use tmp$data
head(df_scatter)
## id x y
## 1 XLOC_000001 3.47386e-01 20.21750
## 2 XLOC_000002 0.00000e+00 0.00000
## 3 XLOC_000003 0.00000e+00 0.00000
## 4 XLOC_000004 6.97259e+05 0.00000
## 5 XLOC_000005 6.96704e+02 355.82300
## 6 XLOC_000006 0.00000e+00 1.51396
By assigning each of these functions to a list, we can also store the plot as another element. To extract the plot, we can call the second element of the list using the aformentioned procedures:
my_plot <- tmp[[2]] ## or use tmp$plot
my_plot
For all functions, users can modify the font size of multiple portions of the plot. These portions primarily revolve around these components:
To manipulate these components, users can modify the default values of the following parameters:
xaxis.text.size
yaxis.text.size
xaxis.title.size
yaxis.title.size
main.title.size
legend.text.size
legend.title.size
facet.title.size
Each of parameters mentioned in the prior section refer to numerical values. These values correlate to font size in typographic points. To illustrate what exactly these parameters modify, please refer to the following figure:
The facet.title.size
parameter refers to the facets which are allocated in
the matrix functions (vsScatterMatrix()
, vsMAMatrix()
,
vsVolcanoMatrix()
). This is illustrated in the following figure:
Since not all functions are equal in their parameters and component layout, some functions will either have or lack some of the prior parameters. To get an idea of which have functions have which, please refer to the following figure:
The shape and size of each data point will also change depending on several conditions. To maximize the viewing area while retaining high resolution, some data points will not be present within the viewing area. If they exceed the viewing area, they will change shape from a circle to a triangular orientation.
The extent (i.e. fold change) to how far these points exceed the viewing area are based on the following criteria:
To further clarify theses conditions, please refer to the following figure:
Function efficiencies were determined by calculating system times by using the
microbenchmark
R package. Each function was ran 100 times with the prior code
used in the documentation. All benchmarks were determined on a machine running
a 64-bit Windows 10 operating system, 8 GB of RAM, and an Intel Core i5-6400
processor running at 2.7 GHz.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] edgeR_4.4.0 limma_3.62.0
## [3] DESeq2_1.46.0 SummarizedExperiment_1.36.0
## [5] Biobase_2.66.0 MatrixGenerics_1.18.0
## [7] matrixStats_1.4.1 GenomicRanges_1.58.0
## [9] GenomeInfoDb_1.42.0 IRanges_2.40.0
## [11] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [13] vidger_1.26.0 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0
## [4] ggplot2_3.5.1 ggrepel_0.9.6 GGally_2.2.1
## [7] lattice_0.22-6 vctrs_0.6.5 tools_4.4.1
## [10] generics_0.1.3 parallel_4.4.1 tibble_3.2.1
## [13] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3
## [16] Matrix_1.7-1 RColorBrewer_1.1-3 lifecycle_1.0.4
## [19] GenomeInfoDbData_1.2.13 farver_2.1.2 compiler_4.4.1
## [22] tinytex_0.53 statmod_1.5.0 munsell_0.5.1
## [25] codetools_0.2-20 htmltools_0.5.8.1 sass_0.4.9
## [28] yaml_2.3.10 tidyr_1.3.1 pillar_1.9.0
## [31] crayon_1.5.3 jquerylib_0.1.4 BiocParallel_1.40.0
## [34] DelayedArray_0.32.0 cachem_1.1.0 magick_2.8.5
## [37] abind_1.4-8 ggstats_0.7.0 tidyselect_1.2.1
## [40] locfit_1.5-9.10 digest_0.6.37 purrr_1.0.2
## [43] dplyr_1.1.4 bookdown_0.41 labeling_0.4.3
## [46] fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
## [49] cli_3.6.3 SparseArray_1.6.0 magrittr_2.0.3
## [52] S4Arrays_1.6.0 utf8_1.2.4 withr_3.0.2
## [55] scales_1.3.0 UCSC.utils_1.2.0 rmarkdown_2.28
## [58] XVector_0.46.0 httr_1.4.7 evaluate_1.0.1
## [61] knitr_1.48 rlang_1.1.4 Rcpp_1.0.13
## [64] glue_1.8.0 BiocManager_1.30.25 jsonlite_1.8.9
## [67] plyr_1.8.9 R6_2.5.1 zlibbioc_1.52.0