Summarizing omics data in SummarizedExperiment
format
from airway
into KEGG pathway level functional activity
scores
Here we illustrate through some examples how to apply the function
summarize_pathway_level
to aggregate data in
SummarizedExperiment
format into KEGG pathway-level
functional activity scores. Note that the function can also be used to
summarize other types of omics data into any higher-level functional
representations beyond pathways, such as protein complexes or cellular
locations.
Let’s first get an example dataset stored as a
SummarizedExperiment
from the airway
package.
This data represents an actual RNA sequencing experiment on four human
airway smooth muscle cell lines.
library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:NMF':
##
## nrun
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
The measurement data can be accessed by assay and assays. Note that
SummarizedExperiment
object can contain multiple
measurement matrices (all of the same dimension), but in this case
airway
contains only one matrix of RNA sequencing data
named counts
:
## [1] "counts"
head(assay(airway, "counts"))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
dim(assay(airway, "counts"))
## [1] 63677 8
The data matrix contains 63677 genes (or transcripts) and 8 samples.
The features names are Ensembl identifiers, let’s get a list of KEGG
gene sets with Ensembl IDs through the function
get_kegg_sets
provided by funOmics
package.
Note that get_kegg_sets
can be used to retrieve a list of
KEGG gene sets from any organism available, given its abbreviation
(e.g., “hsa” for Homo sapiens or “ecj” for Escherichia coli).
Since airway
data corresponds to human samples, the
parameter geneid_type
in get_kegg_sets
can be
used to retrieve the molecular sets with Ensembl IDs, and the organism
is set to default (“hsa”):
kegg_sets <- get_kegg_sets(geneid_type = "ensembl")
head(kegg_sets)
## $`2-Oxocarboxylic acid metabolism`
## [1] "ENSG00000114786" "ENSG00000169154" "ENSG00000062485" "ENSG00000161653"
## [5] "ENSG00000137992" "ENSG00000150768" "ENSG00000091140" "ENSG00000119689"
## [9] "ENSG00000172482" "ENSG00000120053" "ENSG00000125166" "ENSG00000167701"
## [13] "ENSG00000138413" "ENSG00000182054" "ENSG00000166411" "ENSG00000101365"
## [17] "ENSG00000067829" "ENSG00000122729" "ENSG00000105953" "ENSG00000100412"
## [21] "ENSG00000109576" "ENSG00000131828" "ENSG00000163114" "ENSG00000168291"
## [25] "ENSG00000181192" "ENSG00000197444" "ENSG00000060982" "ENSG00000105552"
## [29] "ENSG00000248098" "ENSG00000083123" "ENSG00000113492" "ENSG00000166123"
## [33] "ENSG00000243989"
##
## $`ABC transporters`
## [1] "ENSG00000114770" "ENSG00000115657" "ENSG00000069431" "ENSG00000125257"
## [5] "ENSG00000064687" "ENSG00000154263" "ENSG00000154258" "ENSG00000141338"
## [9] "ENSG00000001626" "ENSG00000197150" "ENSG00000023839" "ENSG00000179869"
## [13] "ENSG00000164825" "ENSG00000165029" "ENSG00000107331" "ENSG00000167972"
## [17] "ENSG00000101986" "ENSG00000131269" "ENSG00000173208" "ENSG00000135776"
## [21] "ENSG00000150967" "ENSG00000154262" "ENSG00000154265" "ENSG00000198691"
## [25] "ENSG00000144452" "ENSG00000004846" "ENSG00000091262" "ENSG00000103222"
## [29] "ENSG00000085563" "ENSG00000005471" "ENSG00000117528" "ENSG00000119688"
## [33] "ENSG00000172350" "ENSG00000138075" "ENSG00000143921" "ENSG00000006071"
## [37] "ENSG00000168394" "ENSG00000204267" "ENSG00000121270" "ENSG00000073734"
## [41] "ENSG00000108846" "ENSG00000124574" "ENSG00000140798" "ENSG00000118777"
## [45] "ENSG00000160179"
##
## $`AGE-RAGE signaling pathway in diabetic complications`
## [1] "ENSG00000117020" "ENSG00000135446" "ENSG00000111276" "ENSG00000278139"
## [5] "ENSG00000161714" "ENSG00000108821" "ENSG00000164692" "ENSG00000168542"
## [9] "ENSG00000187498" "ENSG00000134871" "ENSG00000169031" "ENSG00000081052"
## [13] "ENSG00000188153" "ENSG00000197565" "ENSG00000112062" "ENSG00000165168"
## [17] "ENSG00000131504" "ENSG00000204305" "ENSG00000135744" "ENSG00000144891"
## [21] "ENSG00000078401" "ENSG00000120738" "ENSG00000142208" "ENSG00000105221"
## [25] "ENSG00000117525" "ENSG00000165197" "ENSG00000150907" "ENSG00000182621"
## [29] "ENSG00000115414" "ENSG00000007952" "ENSG00000174775" "ENSG00000090339"
## [33] "ENSG00000115008" "ENSG00000125538" "ENSG00000136244" "ENSG00000169429"
## [37] "ENSG00000096968" "ENSG00000177606" "ENSG00000133703" "ENSG00000175387"
## [41] "ENSG00000166949" "ENSG00000141646" "ENSG00000087245" "ENSG00000131196"
## [45] "ENSG00000109320" "ENSG00000164867" "ENSG00000213281" "ENSG00000086991"
## [49] "ENSG00000106366" "ENSG00000138193" "ENSG00000121879" "ENSG00000051382"
## [53] "ENSG00000137193" "ENSG00000171608" "ENSG00000145675" "ENSG00000105647"
## [57] "ENSG00000137841" "ENSG00000149782" "ENSG00000101333" "ENSG00000187091"
## [61] "ENSG00000124181" "ENSG00000197943" "ENSG00000154229" "ENSG00000166501"
## [65] "ENSG00000163932" "ENSG00000171132" "ENSG00000067606" "ENSG00000100030"
## [69] "ENSG00000102882" "ENSG00000107643" "ENSG00000185386" "ENSG00000050748"
## [73] "ENSG00000109339" "ENSG00000156711" "ENSG00000087088" "ENSG00000136238"
## [77] "ENSG00000110092" "ENSG00000171791" "ENSG00000173039" "ENSG00000188130"
## [81] "ENSG00000108691" "ENSG00000007908" "ENSG00000115415" "ENSG00000168610"
## [85] "ENSG00000126561" "ENSG00000173757" "ENSG00000105329" "ENSG00000092969"
## [89] "ENSG00000119699" "ENSG00000106799" "ENSG00000163513" "ENSG00000178726"
## [93] "ENSG00000232810" "ENSG00000162692" "ENSG00000112715" "ENSG00000173511"
## [97] "ENSG00000150630" "ENSG00000164305" "ENSG00000115556" "ENSG00000117461"
## [101] "ENSG00000070831"
##
## $`AMPK signaling pathway`
## [1] "ENSG00000117020" "ENSG00000107175" "ENSG00000110931" "ENSG00000001626"
## [5] "ENSG00000084733" "ENSG00000109819" "ENSG00000278139" "ENSG00000176194"
## [9] "ENSG00000169169" "ENSG00000110090" "ENSG00000205560" "ENSG00000118260"
## [13] "ENSG00000120907" "ENSG00000143578" "ENSG00000167658" "ENSG00000187840"
## [17] "ENSG00000066044" "ENSG00000160741" "ENSG00000142208" "ENSG00000105221"
## [21] "ENSG00000169710" "ENSG00000165140" "ENSG00000150907" "ENSG00000118689"
## [25] "ENSG00000065882" "ENSG00000096717" "ENSG00000103150" "ENSG00000198793"
## [29] "ENSG00000131482" "ENSG00000167393" "ENSG00000103319" "ENSG00000104812"
## [33] "ENSG00000111713" "ENSG00000278540" "ENSG00000113161" "ENSG00000101076"
## [37] "ENSG00000076555" "ENSG00000017427" "ENSG00000140443" "ENSG00000254647"
## [41] "ENSG00000171105" "ENSG00000169047" "ENSG00000174697" "ENSG00000116678"
## [45] "ENSG00000079435" "ENSG00000167461" "ENSG00000124253" "ENSG00000100889"
## [49] "ENSG00000159346" "ENSG00000106617" "ENSG00000119396" "ENSG00000140992"
## [53] "ENSG00000135932" "ENSG00000158571" "ENSG00000123836" "ENSG00000170525"
## [57] "ENSG00000114268" "ENSG00000141959" "ENSG00000152556" "ENSG00000067057"
## [61] "ENSG00000121879" "ENSG00000051382" "ENSG00000171608" "ENSG00000145675"
## [65] "ENSG00000105647" "ENSG00000115592" "ENSG00000132170" "ENSG00000092020"
## [69] "ENSG00000113575" "ENSG00000104695" "ENSG00000105568" "ENSG00000137713"
## [73] "ENSG00000221914" "ENSG00000156475" "ENSG00000074211" "ENSG00000073711"
## [77] "ENSG00000066027" "ENSG00000068971" "ENSG00000078304" "ENSG00000112640"
## [81] "ENSG00000154001" "ENSG00000082146" "ENSG00000132356" "ENSG00000162409"
## [85] "ENSG00000111725" "ENSG00000131791" "ENSG00000181929" "ENSG00000175470"
## [89] "ENSG00000141564" "ENSG00000152254" "ENSG00000104388" "ENSG00000110092"
## [93] "ENSG00000106615" "ENSG00000108443" "ENSG00000175634" "ENSG00000099194"
## [97] "ENSG00000182158" "ENSG00000181856" "ENSG00000072310" "ENSG00000118046"
## [101] "ENSG00000135341" "ENSG00000165699" "ENSG00000103197" "ENSG00000006831"
## [105] "ENSG00000145284" "ENSG00000102547" "ENSG00000177169" "ENSG00000204673"
## [109] "ENSG00000060566" "ENSG00000133124" "ENSG00000117461" "ENSG00000185950"
## [113] "ENSG00000130957" "ENSG00000145386" "ENSG00000133101" "ENSG00000157613"
## [117] "ENSG00000185236" "ENSG00000266173" "ENSG00000141349" "ENSG00000181092"
## [121] "ENSG00000135218" "ENSG00000146592"
##
## $`ATP-dependent chromatin remodeling`
## [1] "ENSG00000167797" "ENSG00000187778" "ENSG00000106400" "ENSG00000172977"
## [5] "ENSG00000080603" "ENSG00000183207" "ENSG00000112983" "ENSG00000185787"
## [9] "ENSG00000170004" "ENSG00000111642" "ENSG00000076108" "ENSG00000198604"
## [13] "ENSG00000229674" "ENSG00000258315" "ENSG00000153391" "ENSG00000189079"
## [17] "ENSG00000171634" "ENSG00000164508" "ENSG00000112624" "ENSG00000184402"
## [21] "ENSG00000135999" "ENSG00000099954" "ENSG00000169592" "ENSG00000166164"
## [25] "ENSG00000105619" "ENSG00000123636" "ENSG00000063169" "ENSG00000277075"
## [29] "ENSG00000196866" "ENSG00000188486" "ENSG00000164032" "ENSG00000116478"
## [33] "ENSG00000196591" "ENSG00000184270" "ENSG00000230797" "ENSG00000277858"
## [37] "ENSG00000274183" "ENSG00000170322" "ENSG00000116750" "ENSG00000077080"
## [41] "ENSG00000048649" "ENSG00000071655" "ENSG00000148229" "ENSG00000104472"
## [45] "ENSG00000071243" "ENSG00000128908" "ENSG00000167491" "ENSG00000114933"
## [49] "ENSG00000163939" "ENSG00000101189" "ENSG00000130024" "ENSG00000099284"
## [53] "ENSG00000246705" "ENSG00000178028" "ENSG00000143614" "ENSG00000049618"
## [57] "ENSG00000057935" "ENSG00000183495" "ENSG00000162521" "ENSG00000102054"
## [61] "ENSG00000133884" "ENSG00000075624" "ENSG00000110987" "ENSG00000075089"
## [65] "ENSG00000163875" "ENSG00000102038" "ENSG00000080503" "ENSG00000127616"
## [69] "ENSG00000099956" "ENSG00000028310" "ENSG00000173473" "ENSG00000139613"
## [73] "ENSG00000066117" "ENSG00000108604" "ENSG00000082014" "ENSG00000073584"
## [77] "ENSG00000141380" "ENSG00000163159" "ENSG00000184009" "ENSG00000288859"
## [81] "ENSG00000100811" "ENSG00000101442" "ENSG00000120616" "ENSG00000127337"
## [85] "ENSG00000111328" "ENSG00000205683" "ENSG00000011332" "ENSG00000117713"
## [89] "ENSG00000196367" "ENSG00000196747" "ENSG00000275221" "ENSG00000276368"
## [93] "ENSG00000276903" "ENSG00000180573" "ENSG00000278463" "ENSG00000278677"
## [97] "ENSG00000288825" "ENSG00000184260" "ENSG00000115274" "ENSG00000277745"
## [101] "ENSG00000156531" "ENSG00000153147" "ENSG00000274997" "ENSG00000136518"
## [105] "ENSG00000175792" "ENSG00000134046" "ENSG00000196787" "ENSG00000009954"
## [109] "ENSG00000182979" "ENSG00000149480" "ENSG00000099385" "ENSG00000106635"
## [113] "ENSG00000181218" "ENSG00000113812" "ENSG00000105968" "ENSG00000113648"
## [117] "ENSG00000123562"
##
## $`Acute myeloid leukemia`
## [1] "ENSG00000117020" "ENSG00000245848" "ENSG00000092067" "ENSG00000278139"
## [5] "ENSG00000102096" "ENSG00000213341" "ENSG00000182578" "ENSG00000164400"
## [9] "ENSG00000139318" "ENSG00000187840" "ENSG00000142208" "ENSG00000105221"
## [13] "ENSG00000150337" "ENSG00000122025" "ENSG00000198793" "ENSG00000177885"
## [17] "ENSG00000174775" "ENSG00000104365" "ENSG00000164399" "ENSG00000169896"
## [21] "ENSG00000078061" "ENSG00000173801" "ENSG00000157404" "ENSG00000133703"
## [25] "ENSG00000005381" "ENSG00000136997" "ENSG00000109320" "ENSG00000213281"
## [29] "ENSG00000138795" "ENSG00000121879" "ENSG00000051382" "ENSG00000137193"
## [33] "ENSG00000171608" "ENSG00000145675" "ENSG00000105647" "ENSG00000140464"
## [37] "ENSG00000112033" "ENSG00000100030" "ENSG00000102882" "ENSG00000169032"
## [41] "ENSG00000126934" "ENSG00000002330" "ENSG00000132155" "ENSG00000131759"
## [45] "ENSG00000110092" "ENSG00000140379" "ENSG00000173039" "ENSG00000108443"
## [49] "ENSG00000175634" "ENSG00000115904" "ENSG00000100485" "ENSG00000066336"
## [53] "ENSG00000157764" "ENSG00000168610" "ENSG00000126561" "ENSG00000173757"
## [57] "ENSG00000081059" "ENSG00000148737" "ENSG00000109906" "ENSG00000152284"
## [61] "ENSG00000117461" "ENSG00000269335" "ENSG00000159216" "ENSG00000079102"
## [65] "ENSG00000132326" "ENSG00000145386" "ENSG00000133101" "ENSG00000170458"
Example usage 1: Summarize airway
omics data into
dimension-reduction derived activity scores at KEGG pathway level.
The dimension-reduction operators implemented in
funOmics
include PCA (Principal Component Analysis), NMF
(Non-Negative Matrix Factorization), MDS (Multidimensional Scaling), and
pathifier deregulation scores from the pathifier
package derived from principal curves.
Now, let’s summarize the counts data using PCA. The PCA-aggregated
activity scores values represent the projection of the overall
expression of all genes in each pathway onto the first principal
component. For this example, let’s use the default minimum size of sets
(10). Note that when default value for minsize
parameter is
used, it is not necessary to assign a value for this parameter in the
function call:
pathway_activity_pca <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "pca")
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 345 successful functional aggregations over minsize
## 16 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 345 , 8
From the original airway
data containing 63677 genes
(transcripts) and 8 samples, the function
summarize_pathway_level
has generated a pathway-level
activity score for each of the 8 samples, for 343 KEGG pathways
containing more than 10 genes (16 failed functional aggregations under
minsize
). Let’s see how this matrix looks like:
print(head(pathway_activity_pca))
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism -0.06702655 1.980602
## ABC transporters -0.62811819 3.493143
## AGE-RAGE signaling pathway in diabetic complications -0.94405736 -4.372707
## AMPK signaling pathway -2.33237340 -3.446439
## ATP-dependent chromatin remodeling -0.10612468 4.822368
## Acute myeloid leukemia -1.68843459 -2.745095
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism -2.253388 6.875470
## ABC transporters -4.375987 5.828548
## AGE-RAGE signaling pathway in diabetic complications 4.514782 -9.753186
## AMPK signaling pathway 2.596147 -10.301870
## ATP-dependent chromatin remodeling -5.045007 13.846549
## Acute myeloid leukemia 1.499643 -7.342374
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism -2.670499 -8.032398
## ABC transporters -4.399747 -4.392270
## AGE-RAGE signaling pathway in diabetic complications 4.931905 11.913342
## AMPK signaling pathway 3.463877 16.514465
## ATP-dependent chromatin remodeling -6.676558 -14.120201
## Acute myeloid leukemia 2.162537 11.748711
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 2.713740 1.4534992
## ABC transporters 1.040788 3.4336426
## AGE-RAGE signaling pathway in diabetic complications -3.754588 -2.5354917
## AMPK signaling pathway -6.065388 -0.4284188
## ATP-dependent chromatin remodeling 4.129495 3.1494778
## Acute myeloid leukemia -4.224764 0.5897765
The resulting matrix of higher-level functional representations looks
very similar to the original one, except that the original had many more
features (63677 instead of 343). This reduction in dimensionality can
facilitate the interpretation of the data and the identification of
patterns in the samples.
In this illustrative example, the RNA sequencing data in
airway
package has been directly summarized or aggregated
by the summarize_pathway_level
function of the
funOmics
package without intermediate processing. Depending
on the type of omics data, you may want to apply corresponding
processing steps to the omics abundance matrix prior to the aggregation
into higher level functional features. For instance, it is common
practice to filter out rows (genes or features) with low counts when
analyzing transcriptomics data. This filtering step is often performed
to remove genes that are expressed at very low levels and may not be
biologically relevant or reliable.
Some aggregation methods, may also have specific assumptions or
requirements on the input data. Let’s see another example where some
filtering is indeed necessary.
funOmics
allows to generate aggregated representations
using dimension-reduction derived scores from the NMF (Non-Negative
Matrix Factorization) method. The NMF-aggregated activity scores values
represent the weight (or contribution) of a single underlying basis
component or latent factor contributing to the pathway activity (or
higher level functional structure in use) for each sample in your data
set. Rank=1 is used for the basis matrix in the internal NMF
dimension-reduction.
Notably, the NMF method does not allow for negative values or null
rows in the input matrix. Transcriptomics data in the
airway
dataset are measured as counts, hence the matrix
presumably does not contain negative values, but it may contain null
rows. To avoid errors, we can filter out rows with less than 10 counts
across all samples before applying the NMF method:
print(any(assay(airway, "counts")[rowSums(assay(airway, "counts")) <0, ]))
## [1] FALSE
X <- assay(airway, "counts")[rowSums(assay(airway, "counts")) >= 10, ]
Let’s summarize the filtered counts data using NMF method:
pathway_activity_nmf <- summarize_pathway_level(X, kegg_sets, type = "nmf") # note that the NMF operation can take some minutes
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 341 successful functional aggregations over minsize
## 20 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 341 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity_nmf), ",", ncol(pathway_activity_nmf)))
## [1] "Pathway activity score matrix has dimensions: 341 , 8"
Let’s see how this matrix looks like:
head(pathway_activity_nmf)
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 0.21933137 0.21414736
## ABC transporters 0.14649148 0.14146039
## AGE-RAGE signaling pathway in diabetic complications 0.05019464 0.03417927
## AMPK signaling pathway 0.04659867 0.04510255
## ATP-dependent chromatin remodeling 0.02218234 0.02351434
## Acute myeloid leukemia 0.10952784 0.11225654
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 0.26082219 0.15792009
## ABC transporters 0.18231911 0.11445482
## AGE-RAGE signaling pathway in diabetic complications 0.06757896 0.03250244
## AMPK signaling pathway 0.05351472 0.03309145
## ATP-dependent chromatin remodeling 0.02773240 0.02123981
## Acute myeloid leukemia 0.13540167 0.09523485
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 0.27035589 0.33491695
## ABC transporters 0.20120839 0.22881386
## AGE-RAGE signaling pathway in diabetic complications 0.05692811 0.06025348
## AMPK signaling pathway 0.05154539 0.06775307
## ATP-dependent chromatin remodeling 0.02509445 0.04749533
## Acute myeloid leukemia 0.13291262 0.18361574
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 0.19404878 0.22087814
## ABC transporters 0.15330715 0.18110419
## AGE-RAGE signaling pathway in diabetic complications 0.05316288 0.04528241
## AMPK signaling pathway 0.03967482 0.04909675
## ATP-dependent chromatin remodeling 0.02061707 0.02689728
## Acute myeloid leukemia 0.10198303 0.13587105
In this example, summarize_pathway_level
has generated a
pathway-level activity score for each of the 8 samples, for 340 KEGG
pathways containing more than 10 genes (here 19 failed functional
aggregations under minsize
, slightly more than for the PCA
since some genes were removed in the filtering). Note that the resulting
matrix looks similar to that of the previous example in terms of shape
and format, but the values are derived from the NMF dimension-reduction
method instead of the PCA method. Same analogies apply for other types
of aggregation operator; only the interpretation of the resulting
functional activity scores will change.
Note that beyond pre-processing, you can also post-process the
resulting summarized matrix as you see appropriate for your analyses and
workflows.
Integrating the pathway-level activity scores with the
airway
SummarizedExperiment
object in a
MultiAssayExperiment
object
The resulting matrix of pathway-level activity scores can be further
analyzed as an independent dataset, or can also be integrated with the
airway
SummarizedExperiment
object in a
MultiAssayExperiment
structure (note that
SummarizedExperiment
can simultaneously manage several
experimental assays only if they have the same dimensions, which is not
the case here, hence the need for a MultiAssayExperiment
object). The MultiAssayExperiment library has to be loaded, and a
MultiAssayExperiment (airwayMultiAssay
) can be created and
filled with a list of assays-like matrices that may have different
dimensions. Here, airwayMultiAssay
contains the
counts
and the recently generated KEGG pathway activity
scores by NMF and PCA pooling.
library(MultiAssayExperiment)
assays_list <- list( counts = assay(airway, "counts"), kegg_nmf_agg = pathway_activity_nmf, kegg_pca_agg = pathway_activity_pca)
airwayMultiAssay <- MultiAssayExperiment(experiments=assays_list)
colData(airwayMultiAssay) <- colData(airway)
airwayMultiAssay
## A MultiAssayExperiment object of 3 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 3:
## [1] counts: matrix with 63677 rows and 8 columns
## [2] kegg_nmf_agg: matrix with 341 rows and 8 columns
## [3] kegg_pca_agg: matrix with 345 rows and 8 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Example usage 2: Summarize airway
omics data with
summary statistics and a minimum size of the KEGG gene sets
Here we will apply the function summarize_pathway_level
to summarize pathway activity using the mean pooling aggregation for
those sets containing at least 12 genes. Remember that you can adjust
the parameters minsize
and type
of aggregation
as desired.
min <- 12
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "mean", minsize = min)
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 343 successful functional aggregations over minsize
## 18 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 343 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 343 , 8"
Now from the original airway
data of dimensions 63677
genes (transcripts) x 8 samples, summarize_pathway_level
has generated through mean-pooling a pathway-level activity score matrix
of 341 pathways x 8 samples, for gene sets containing more than 12
genes.
print(head(pathway_activity))
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 709.0000 692.2424
## ABC transporters 645.8000 623.6222
## AGE-RAGE signaling pathway in diabetic complications 10087.0200 6868.5900
## AMPK signaling pathway 1594.4750 1543.2833
## ATP-dependent chromatin remodeling 1921.8095 2037.2000
## Acute myeloid leukemia 989.1045 1013.7612
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 843.1818 510.4848
## ABC transporters 803.7556 504.5556
## AGE-RAGE signaling pathway in diabetic complications 13580.5200 6531.6100
## AMPK signaling pathway 1831.0917 1132.3000
## ATP-dependent chromatin remodeling 2402.6381 1840.1524
## Acute myeloid leukemia 1222.8209 860.0448
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 873.9394 1082.636
## ABC transporters 886.9778 1008.644
## AGE-RAGE signaling pathway in diabetic complications 11440.2100 12108.400
## AMPK signaling pathway 1763.6917 2318.308
## ATP-dependent chromatin remodeling 2174.1048 4114.829
## Acute myeloid leukemia 1200.2985 1658.224
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 627.2727 714.000
## ABC transporters 675.8222 798.400
## AGE-RAGE signaling pathway in diabetic complications 10683.4900 9099.840
## AMPK signaling pathway 1357.5417 1679.950
## ATP-dependent chromatin remodeling 1786.1905 2330.286
## Acute myeloid leukemia 921.0448 1227.000
In this example, 18 of the gene sets in kegg_sets
have
size < 12. You can then use the function
short_sets_detail
to get information about which pathways
have been left out, how many genes they had, and which genes are
involved in these shorter sets:
short_sets <- short_sets_detail(kegg_sets, min)
print(short_sets$short_sets)
## [1] "Biotin metabolism"
## [2] "Caffeine metabolism"
## [3] "D-Amino acid metabolism"
## [4] "Folate transport and metabolism"
## [5] "Neomycin, kanamycin and gentamicin biosynthesis"
## [6] "Phenylalanine, tyrosine and tryptophan biosynthesis"
## [7] "Phosphonate and phosphinate metabolism"
## [8] "Riboflavin metabolism"
## [9] "Sulfur metabolism"
## [10] "Sulfur relay system"
## [11] "Ubiquinone and other terpenoid-quinone biosynthesis"
## [12] "Valine, leucine and isoleucine biosynthesis"
## [13] "Virion - Adenovirus"
## [14] "Virion - Flavivirus and Alphavirus"
## [15] "Virion - Herpesvirus"
## [16] "Virion - Human immunodeficiency virus"
## [17] "Virion - Rotavirus"
## [18] "Vitamin B6 metabolism"
print(short_sets$short_sets_lengths)
## Biotin metabolism
## 3
## Caffeine metabolism
## 6
## D-Amino acid metabolism
## 6
## Folate transport and metabolism
## 1
## Neomycin, kanamycin and gentamicin biosynthesis
## 5
## Phenylalanine, tyrosine and tryptophan biosynthesis
## 6
## Phosphonate and phosphinate metabolism
## 6
## Riboflavin metabolism
## 8
## Sulfur metabolism
## 10
## Sulfur relay system
## 8
## Ubiquinone and other terpenoid-quinone biosynthesis
## 11
## Valine, leucine and isoleucine biosynthesis
## 4
## Virion - Adenovirus
## 4
## Virion - Flavivirus and Alphavirus
## 4
## Virion - Herpesvirus
## 9
## Virion - Human immunodeficiency virus
## 5
## Virion - Rotavirus
## 2
## Vitamin B6 metabolism
## 6
print(short_sets$short_sets_molecules)
## $`Biotin metabolism`
## [1] "ENSG00000159267" "ENSG00000151093" "ENSG00000169814"
##
## $`Caffeine metabolism`
## [1] "ENSG00000156006" "ENSG00000140505" "ENSG00000255974" "ENSG00000198077"
## [5] "ENSG00000158125" "ENSG00000171428"
##
## $`D-Amino acid metabolism`
## [1] "ENSG00000110887" "ENSG00000135423" "ENSG00000115419" "ENSG00000167720"
## [5] "ENSG00000133943" "ENSG00000203797"
##
## $`Folate transport and metabolism`
## [1] NA
##
## $`Neomycin, kanamycin and gentamicin biosynthesis`
## [1] "ENSG00000106633" "ENSG00000156515" "ENSG00000159399" "ENSG00000160883"
## [5] "ENSG00000156510"
##
## $`Phenylalanine, tyrosine and tryptophan biosynthesis`
## [1] "ENSG00000169154" "ENSG00000104951" "ENSG00000120053" "ENSG00000125166"
## [5] "ENSG00000171759" "ENSG00000198650"
##
## $`Phosphonate and phosphinate metabolism`
## [1] "ENSG00000134255" "ENSG00000161217" "ENSG00000111666" "ENSG00000185813"
## [5] "ENSG00000138018" "ENSG00000102230"
##
## $`Riboflavin metabolism`
## [1] "ENSG00000197594" "ENSG00000154269" "ENSG00000143727" "ENSG00000134575"
## [5] "ENSG00000102575" "ENSG00000135002" "ENSG00000090013" "ENSG00000160688"
##
## $`Sulfur metabolism`
## [1] "ENSG00000162813" "ENSG00000105755" "ENSG00000128309" "ENSG00000104331"
## [5] "ENSG00000137767" "ENSG00000139531" "ENSG00000128311" "ENSG00000143416"
## [9] "ENSG00000198682" "ENSG00000138801"
##
## $`Sulfur relay system`
## [1] "ENSG00000124217" "ENSG00000174177" "ENSG00000164172" "ENSG00000128309"
## [5] "ENSG00000128311" "ENSG00000167118" "ENSG00000142544" "ENSG00000244005"
##
## $`Ubiquinone and other terpenoid-quinone biosynthesis`
## [1] "ENSG00000167186" "ENSG00000196715" "ENSG00000181019" "ENSG00000115486"
## [5] "ENSG00000173085" "ENSG00000158104" "ENSG00000119723" "ENSG00000132423"
## [9] "ENSG00000198650" "ENSG00000167397" "ENSG00000110871"
##
## $`Valine, leucine and isoleucine biosynthesis`
## [1] "ENSG00000135094" "ENSG00000139410" "ENSG00000060982" "ENSG00000105552"
##
## $`Virion - Adenovirus`
## [1] "ENSG00000154639" "ENSG00000117335" "ENSG00000121594" "ENSG00000114013"
##
## $`Virion - Flavivirus and Alphavirus`
## [1] "ENSG00000104938" "ENSG00000113249" "ENSG00000090659" "ENSG00000092445"
##
## $`Virion - Herpesvirus`
## [1] "ENSG00000121716" "ENSG00000085514" "ENSG00000119912" "ENSG00000197081"
## [5] "ENSG00000161638" "ENSG00000259207" "ENSG00000110400" "ENSG00000130202"
## [9] "ENSG00000157873"
##
## $`Virion - Human immunodeficiency virus`
## [1] "ENSG00000104938" "ENSG00000160791" "ENSG00000090659" "ENSG00000121966"
## [5] "ENSG00000010610"
##
## $`Virion - Rotavirus`
## [1] "ENSG00000164171" "ENSG00000150093"
##
## $`Vitamin B6 metabolism`
## [1] "ENSG00000135069" "ENSG00000138356" "ENSG00000144362" "ENSG00000108439"
## [5] "ENSG00000241360" "ENSG00000160209"
Other summary statistics can be used for the aggregation, such as
median, standard deviation, min, or max. See below some more examples
with varying number of genes in the gene sets:
min <- 15
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "sd", minsize = min)
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 340 successful functional aggregations over minsize
## 21 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 340 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 340 , 8"
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 897.8148 933.7587
## ABC transporters 1343.7024 1247.1723
## AGE-RAGE signaling pathway in diabetic complications 41059.3414 27919.9726
## AMPK signaling pathway 6261.0234 5764.0069
## ATP-dependent chromatin remodeling 6221.1335 7862.9219
## Acute myeloid leukemia 1540.3614 1571.4687
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 1176.073 700.9372
## ABC transporters 1700.936 1034.8811
## AGE-RAGE signaling pathway in diabetic complications 61813.278 31134.0064
## AMPK signaling pathway 7173.520 3452.9315
## ATP-dependent chromatin remodeling 8738.562 8199.3336
## Acute myeloid leukemia 2372.710 1613.5125
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 1235.497 1476.768
## ABC transporters 2075.315 2100.908
## AGE-RAGE signaling pathway in diabetic complications 44791.910 50360.551
## AMPK signaling pathway 6498.032 6844.707
## ATP-dependent chromatin remodeling 6647.548 19701.660
## Acute myeloid leukemia 2178.260 3040.191
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 818.0049 958.0687
## ABC transporters 1482.2698 1751.8586
## AGE-RAGE signaling pathway in diabetic complications 46530.1838 42718.1856
## AMPK signaling pathway 5062.8806 5557.3168
## ATP-dependent chromatin remodeling 6034.5251 9437.6834
## Acute myeloid leukemia 1703.6101 2032.9305
min <- 7
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "median", minsize = min)
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 348 successful functional aggregations over minsize
## 13 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 348 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 348 , 8"
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 424.0 553.0
## ABC transporters 136.0 92.0
## AGE-RAGE signaling pathway in diabetic complications 683.5 581.5
## AMPK signaling pathway 667.0 636.5
## ATP-dependent chromatin remodeling 663.0 597.0
## Acute myeloid leukemia 555.0 566.0
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 444.0 405.0
## ABC transporters 225.0 76.0
## AGE-RAGE signaling pathway in diabetic complications 779.5 435.0
## AMPK signaling pathway 801.5 491.5
## ATP-dependent chromatin remodeling 786.0 471.0
## Acute myeloid leukemia 700.0 393.0
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 518.0 780
## ABC transporters 224.0 138
## AGE-RAGE signaling pathway in diabetic complications 786.5 983
## AMPK signaling pathway 756.5 1084
## ATP-dependent chromatin remodeling 797.0 959
## Acute myeloid leukemia 663.0 767
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 375 614.0
## ABC transporters 155 86.0
## AGE-RAGE signaling pathway in diabetic complications 575 601.5
## AMPK signaling pathway 622 664.0
## ATP-dependent chromatin remodeling 602 666.0
## Acute myeloid leukemia 504 618.0
Example usage 3: Summarize airway
omics data with test
statistics
Using the same airway
data and gene sets
kegg_sets
from previous examples, let’s generate aggregated
representations using test statistics. These operators allow to compare
the measurements for each sample between the molecules in each
functional set and the molecules not in the given functional set. They
may help identify functionally related genes/molecules that exhibit
coordinated or significant deviations in their expression patterns
across samples. Currently, the implemented available statistical tests
in funOmics
are the t-test, Wilcoxon test, and
Kolmogorov–Smirnov test.
When using test statistics, one has to be mindful about the
assumptions of the test and the distribution of the data. For instance,
the t-test assumes that the data is normally distributed and compares
the means of two groups; the Wilcoxon test assumes that the data is
continuous and symmetric, and compares the medians of two groups; the
Kolmogorov–Smirnov test is a non-parametric test that does not assume
any distribution and compares the entire cumulative distribution
functions (i.e, in this context, the overall shapes of the distributions
of values) of two groups. Here we provide an example using the t-test
statistic:
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "ttest", minsize = 15)
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 340 successful functional aggregations over minsize
## 21 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 340 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 340 , 8"
In this case, summarize_pathway_level
has generated a
pathway-level activity score for each of the 8 samples, for 339 pathways
containing more than 15 genes. The resulting test statistic for each
sample represents the difference between the two groups (i.e., the
functional set and rest of molecules) for each gene or molecule. These
statistics can be then be used as features for further analysis or
modeling:
print(head(pathway_activity))
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 2.454453 2.436486
## ABC transporters 1.603352 1.763264
## AGE-RAGE signaling pathway in diabetic complications 2.381487 2.357986
## AMPK signaling pathway 2.226225 2.375512
## ATP-dependent chromatin remodeling 2.635281 2.273416
## Acute myeloid leukemia 3.528012 3.738857
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 2.165767 2.224862
## ABC transporters 1.596487 1.724146
## AGE-RAGE signaling pathway in diabetic complications 2.135969 2.024587
## AMPK signaling pathway 2.191499 2.840510
## ATP-dependent chromatin remodeling 2.353859 2.005228
## Acute myeloid leukemia 2.841968 3.153732
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 2.271840 2.322606
## ABC transporters 1.624509 1.672874
## AGE-RAGE signaling pathway in diabetic complications 2.472229 2.311854
## AMPK signaling pathway 2.329357 2.939678
## ATP-dependent chromatin remodeling 2.763006 1.891442
## Acute myeloid leukemia 3.064161 3.160045
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 2.285466 2.280632
## ABC transporters 1.696842 1.782976
## AGE-RAGE signaling pathway in diabetic complications 2.234984 2.055617
## AMPK signaling pathway 2.290621 2.660227
## ATP-dependent chromatin remodeling 2.526441 2.172547
## Acute myeloid leukemia 2.978309 3.599640
Importantly, this is just an illustrative example. In real-world
experiments, you may have to consider that genes or molecules in a
certain set might be longer or naturally more abundant and therefore
might have higher overall read counts (abundance) compared to genes or
molecules outside the set, even if their expression levels (as a
proportion of transcripts) are similar. Normalizing the expression
counts for all genes (including those within and outside the sets) using
a suitable method (e.g., library size normalization, transcript length
normalization) can help make the expression levels more comparable
before performing the aggregation and set comparisons.
Molecular sets beyond KEGG and omics matrices beyond
SummarizedExperiment
The package funOmics
interoperates with KEGGREST to
retrieve molecular sets from the KEGG through the function
get_kegg_sets
(see description and example above). Other
real-world molecular sets can be downloaded from several sources. In
terms of gene sets, the Gene Ontology is a versatile resource that
covers three domains: cellular components, biological processes and
molecular functions. Reactome pathways can also be used to generate
higher-level functional representations from omics data. Explore the
different releases and download the corresponding gene sets for the
different types of GO terms, and reactome pathways here.
You can also aggregate genes into protein complexes, which you can find
in the CORUM
database.
Regarding other omics types, such as metabolomics, the function
summarize_pathway_level
can be applied in a similar manner
to a metabolomics matrix X
and KEGG metabolic pathways.
Metabolite sets from KEGG pathways can also be downloaded with the KEGG API.
After obtaining the molecular sets information, this data has to be
formatted as a list of lists (similar to what is obtained from the
get_kegg_sets
function). In other words, you need a
structure where you have a list of multiple molecular sets names, and
each of these sets is represented as a list of molecule identifiers,
such as entrez IDs, PubChem CIDs, Uniprot IDs, etc. For instance, let’s
retrieve gene sets from GO terms for cellular compartments. The
information can be downloaded from the msigdb
link or accessed programmatically as follows:
goccdb <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5/c5.go.cc.v7.5.entrez.gmt"
downdb <- sapply(readLines(goccdb), function(x) strsplit(x, "\t")[[1]])
gocc_sets <- sapply(as.matrix(downdb), function(x) x[3:length(x)])
names(gocc_sets) = sapply(as.matrix(downdb), function(x) x[1])
gocc_sets[1:3]
## $GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX
## [1] "1069" "1161" "2067" "2071" "2072" "2073" "5424" "5887" "7507" "7508"
## [11] "7515"
##
## $GOCC_HISTONE_DEACETYLASE_COMPLEX
## [1] "10013" "10014" "10284" "10428" "10467" "10524" "10629" "10847"
## [9] "10856" "10902" "10933" "1107" "1108" "116092" "1457" "221037"
## [17] "23186" "23309" "23314" "23468" "25855" "25942" "26038" "283248"
## [25] "3065" "3066" "3094" "3622" "473" "51317" "51564" "51780"
## [33] "53615" "54556" "54815" "55758" "55806" "55809" "55818" "55869"
## [41] "55929" "57459" "57504" "57634" "57649" "58516" "5928" "5931"
## [49] "64426" "64431" "6907" "7764" "79595" "79685" "79718" "79885"
## [57] "81611" "8204" "8295" "83933" "84215" "84312" "8607" "8819"
## [65] "8841" "90665" "9112" "91748" "9219" "9611" "9734" "9759"
##
## $GOCC_SAGA_COMPLEX
## [1] "100130302" "10474" "10629" "112869" "117143" "170067"
## [7] "23326" "2648" "27097" "55578" "56943" "56970"
## [13] "6871" "6878" "6880" "6881" "6883" "8295"
## [19] "8464" "8850" "9913"
As you can see, the resulting gocc_sets
object is a list
of lists where each element represents a GO cellular compartment gene
set. Here you can also use the function short_sets_detail
to get information about which and how many pathways contain less than a
given number of molecules:
min <- 8
short_sets <- short_sets_detail(gocc_sets, min)
print(head(short_sets$short_sets_molecules))
## $GOCC_TRANSCRIPTION_FACTOR_TFIIIC_COMPLEX
## [1] "112495" "2975" "2976" "9328" "9329" "9330"
##
## $GOCC_COMMITMENT_COMPLEX
## [1] "11338" "55015" "6631" "6632" "6634"
##
## $GOCC_THO_COMPLEX
## [1] "57187" "79228" "80145" "84321" "8563" "9984"
##
## $GOCC_EKC_KEOPS_COMPLEX
## [1] "112858" "51002" "55644" "8270" "84520"
##
## $GOCC_GLYCOSYLPHOSPHATIDYLINOSITOL_N_ACETYLGLUCOSAMINYLTRANSFERASE_GPI_GNT_COMPLEX
## [1] "51227" "5277" "5279" "5283" "84992" "8818" "9091"
##
## $GOCC_HRD1P_UBIQUITIN_LIGASE_ERAD_L_COMPLEX
## [1] "51009" "550" "57414" "6400" "79139" "84447" "91319"
Notably, omics data does not always come from a
SummarizedExperiment
object. Some times, it is imported
from a csv file, generated through other pre-processing steps and
packages, or even generated from a simulation. In these cases, the data
has to be formatted as a matrix of dimensions g*s
(g
molecules and s
samples).
Let’s see an example usage of the GO cellular compartments gene sets
gocc_sets
where omics data is of type matrix. For this
purpose, we will create an expression matrix where the expression values
are random positive values sampled from a standard normal distribution.
Please, note that funOmics
can be used to aggregate other
types of omics data and molecular sets, such as metabolomics or
proteomics that may have a similar range of values.
Let’s simulate a gene expression matrix X_expr
, where
gene IDs are codes between 1:10000 (to match entrez IDs), and
summarize_pathway_level
can be applied:
# Example usage:
set.seed(1)
g <- 10000
s <- 20
X_expr <- matrix(abs(rnorm(g * s)), nrow = g, dimnames = list(1:g, paste0("s", 1:s)))
print(paste("Dimensions of omics matrix X:", dim(X_expr)[1], "*", dim(X_expr)[2]))
## [1] "Dimensions of omics matrix X: 10000 * 20"
## s1 s2 s3 s4 s5 s6 s7
## 1 0.6264538 0.8043316 0.2353485 0.6179223 0.2212571 0.5258908 0.3413341
## 2 0.1836433 1.0565257 0.2448250 0.8935057 0.3517935 0.4875444 0.4136665
## 3 0.8356286 1.0353958 0.6421869 0.4277562 0.1606019 1.1382508 0.1220357
## 4 1.5952808 1.1855604 1.9348085 0.2999012 0.1240523 1.2151344 1.5893806
## 5 0.3295078 0.5004395 1.0386957 0.5319833 0.6598739 0.4248307 0.7874385
## 6 0.8204684 0.5249887 0.2835501 1.7059816 0.5038493 1.4508403 1.5920640
## s8 s9 s10 s11 s12 s13 s14
## 1 1.00203611 1.55915937 0.09504307 0.7914415 0.0145724495 0.8663644 0.1604426
## 2 0.02590761 0.20166217 0.38805939 0.3921679 1.7854043337 0.9476952 0.9241849
## 3 0.44814178 1.04017610 2.13657003 0.4726670 0.0002997544 0.4522428 1.5561751
## 4 0.84323332 0.07195772 0.55661945 0.4579517 0.4356948690 0.2782408 0.8812202
## 5 0.21846310 0.01526544 0.59094164 0.1681319 1.4076452475 1.4175945 0.5263595
## 6 0.47678629 0.33938598 1.52014345 0.5856737 0.6929698698 0.6329981 0.4627372
## s15 s16 s17 s18 s19 s20
## 1 0.249371112 1.2520152 2.3150930 0.4414410 0.82485558 1.37086468
## 2 0.335346796 0.3351313 1.0603800 0.4130862 0.74402087 0.54569610
## 3 0.004405287 0.1080678 0.3970672 0.8660777 0.69009734 1.62446330
## 4 0.986768348 0.4717051 0.4840034 2.2615708 1.76900681 0.06247283
## 5 0.543575705 2.5070607 1.3584146 0.1787018 0.55215640 0.57021255
## 6 0.626142823 1.2451344 0.6370574 0.4713582 0.03257056 0.31350574
Now, let’s summarize the expression data using standard deviation
pooling for the GO cellular compartments gene sets. We won’t specify a
minimum size of sets in this case, so the default minsize
of 10 is used:
sd_gocc_expr <- summarize_pathway_level(X_expr, gocc_sets, type="sd", minsize=8)
##
## 1006 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## iteration 400
## iteration 500
## iteration 600
## iteration 700
## iteration 800
## iteration 900
## iteration 1000
## 550 successful functional aggregations over minsize
## 456 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 550 , 20
## s1 s2 s3 s4
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.4045096 0.3841496 0.4876056 0.6158703
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.5961716 0.5524931 0.4077048 0.6363483
## GOCC_SAGA_COMPLEX 0.6190007 0.4613048 0.4179023 0.5202579
## GOCC_GOLGI_MEMBRANE 0.6050940 0.5866127 0.6168553 0.6264271
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.7146701 0.7208348 0.6495247 0.6066239
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.5324259 0.6635115 0.6381171 0.6187512
## s5 s6 s7 s8
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.5427526 0.8412107 0.6130915 0.5778133
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.7490914 0.4977517 0.5737141 0.6572306
## GOCC_SAGA_COMPLEX 0.4176742 0.5104733 0.6437678 0.6215122
## GOCC_GOLGI_MEMBRANE 0.5953969 0.6037806 0.6283209 0.6215994
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.6213030 0.6611160 0.6128709 0.5919404
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.6536113 0.6025263 0.6968490 0.4874398
## s9 s10 s11 s12
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.8376548 0.6066229 0.6922118 0.9214760
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.6117108 0.7045937 0.3173755 0.6027702
## GOCC_SAGA_COMPLEX 0.6731533 0.7353114 0.2952001 0.4611380
## GOCC_GOLGI_MEMBRANE 0.5759732 0.6395941 0.6302452 0.6040828
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.6589884 0.5835952 0.6676812 0.5805518
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.6787391 0.3562284 0.6064547 0.4911328
## s13 s14 s15 s16
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.4879332 0.6815041 0.3723823 0.4737714
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.5398999 0.5982277 0.4842830 0.7086422
## GOCC_SAGA_COMPLEX 0.6219926 0.4270019 0.9318551 0.4333853
## GOCC_GOLGI_MEMBRANE 0.5775356 0.5659483 0.5885292 0.6093459
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.4800778 0.5803810 0.5722157 0.6796406
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.5056348 0.5565119 0.4373176 0.6570645
## s17 s18 s19 s20
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.8067935 0.3302759 0.5167987 0.7764892
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.6092391 0.5663914 0.5499079 0.4634814
## GOCC_SAGA_COMPLEX 0.5610913 0.6366751 0.4932885 0.6815148
## GOCC_GOLGI_MEMBRANE 0.5888242 0.6110807 0.5752147 0.6149827
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.5656959 0.6670322 0.5867161 0.6975617
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.5784575 0.4706088 0.5487221 0.5346914
GO cellular compartments level expression signatures have been
generated via standard deviation aggregation. You can apply similar
procedures for other types of molecular sets, aggregation functions and
omics types. The package funOmics
is conceived to be
flexible across omics types and types of molecular sets, so you can also
tailor or directly create your own list of molecular sets based on
specific criteria of your experiments (e.g., include only protein
complexes involved in ubiquitination, or define ad hoc
metabolic routes involving specific metabolites).