How to use iSEE with big data
iSEE 2.19.3
Compiled date: 2025-03-06
Last edited: 2018-03-08
License: MIT + file LICENSE
Some tweaks can be performed to enable iSEE to run efficiently on large datasets. This includes datasets with many features (methylation, SNPs) or many columns (cytometry, single-cell RNA-seq). To demonstrate some of this functionality, we will use a dataset from the TENxPBMCData dataset:
library(TENxPBMCData)
sce.pbmc <- TENxPBMCData("pbmc68k")
sce.pbmc$Library <- factor(sce.pbmc$Library)
sce.pbmc
#> class: SingleCellExperiment
#> dim: 32738 68579
#> metadata(0):
#> assays(1): counts
#> rownames(32738): ENSG00000243485 ENSG00000237613 ... ENSG00000215616
#> ENSG00000215611
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
Many SummarizedExperiment
objects store assay matrices as in-memory matrix-like objects,
be they ordinary matrices or alternative representations such as sparse matrices from the Matrix package.
For example, if we looked at the Allen data, we would see that the counts are stored as an ordinary matrix.
library(scRNAseq)
sce.allen <- ReprocessedAllenData("tophat_counts")
class(assay(sce.allen, "tophat_counts"))
#> [1] "matrix" "array"
In situations involving large datasets and limited computational resources, storing the entire assay in memory may not be feasible.
Rather, we can represent the data as a file-backed matrix where contents are stored on disk and retrieved on demand.
Within the Bioconductor ecosystem, the easiest way of doing this is to create a HDF5Matrix
, which uses a HDF5 file to store all the assay data.
We see that this has already been done for use in the 68K PBMC dataset:
counts(sce.pbmc, withDimnames=FALSE)
#> <32738 x 68579> sparse HDF5Matrix object of type "integer":
#> [,1] [,2] [,3] [,4] ... [,68576] [,68577] [,68578]
#> [1,] 0 0 0 0 . 0 0 0
#> [2,] 0 0 0 0 . 0 0 0
#> [3,] 0 0 0 0 . 0 0 0
#> [4,] 0 0 0 0 . 0 0 0
#> [5,] 0 0 0 0 . 0 0 0
#> ... . . . . . . . .
#> [32734,] 0 0 0 0 . 0 0 0
#> [32735,] 0 0 0 0 . 0 0 0
#> [32736,] 0 0 0 0 . 0 0 0
#> [32737,] 0 0 0 0 . 0 0 0
#> [32738,] 0 0 0 0 . 0 0 0
#> [,68579]
#> [1,] 0
#> [2,] 0
#> [3,] 0
#> [4,] 0
#> [5,] 0
#> ... .
#> [32734,] 0
#> [32735,] 0
#> [32736,] 0
#> [32737,] 0
#> [32738,] 0
Despite the dimensions of this matrix, the HDF5Matrix
object occupies very little space in memory.
object.size(counts(sce.pbmc, withDimnames=FALSE))
#> 2496 bytes
However, parts of the data can still be read in on demand.
For all intents and purposes, the HDF5Matrix
appears to be an ordinary matrix to downstream applications and can be used as such.
first.gene <- counts(sce.pbmc)[1,]
head(first.gene)
#> [1] 0 0 0 0 0 0
This means that we can use the 68K PBMC SingleCellExperiment
object in iSEE()
without any extra work.
The app below shows the distribution of counts for everyone’s favorite gene MALAT1 across libraries.
Here, iSEE()
is simply retrieving data on demand from the HDF5Matrix
without ever loading the entire assay matrix into memory.
This enables it to run efficiently on arbitrary large datasets with limited resources.
library(iSEE)
app <- iSEE(sce.pbmc, initial=
list(RowDataTable(Selected="ENSG00000251562", Search="MALAT1"),
FeatureAssayPlot(XAxis="Column data", XAxisColumnData="Library",
YAxisFeatureSource="RowDataTable1")
)
)
Generally speaking, these HDF5 files are written once by a process with sufficient computational resources (i.e., memory and time).
We typically create HDF5Matrix
objects using the writeHDF5Array()
function from the HDF5Array package.
After the file is created, the objects can be read many times in more deprived environments.
sce.h5 <- sce.allen
library(HDF5Array)
assay(sce.h5, "tophat_counts", withDimnames=FALSE) <-
writeHDF5Array(assay(sce.h5, "tophat_counts"), file="assay.h5", name="counts")
class(assay(sce.h5, "tophat_counts", withDimnames=FALSE))
#> [1] "HDF5Matrix"
#> attr(,"package")
#> [1] "HDF5Array"
list.files("assay.h5")
#> character(0)
It is worth noting that iSEE()
does not know or care that the data is stored in a HDF5 file.
The app is fully compatible with any matrix-like representation of the assay data that supports dim()
and [,
.
As such, iSEE()
can be directly used with other memory-efficient objects like the DeferredMatrix
and LowRankMatrix
from the BiocSingular package, or perhaps the ResidualMatrix
from the batchelor package.
It is also possible to downsample points to reduce the time required to generate the plot. This involves subsetting the dataset so that only the most recently plotted point for an overlapping set of points is shown. In this manner, we avoid wasting time in plotting many points that would not be visible anyway. To demonstrate, we will re-use the 68K PBMC example and perform downsampling on the feature assay plot; we can see that its aesthetics are largely similar to the non-downsampled counterpart above.
library(iSEE)
app <- iSEE(sce.pbmc, initial=
list(RowDataTable(Selected="ENSG00000251562", Search="MALAT1"),
FeatureAssayPlot(XAxis="Column data", XAxisColumnData="Library",
YAxisFeatureSource="RowDataTable1",
VisualChoices="Point", Downsample=TRUE,
VisualBoxOpen=TRUE
)
)
)