This is intended to be a short introduction to the leapr package. First we
need to load the required libraries:
# install from bioconductor
if (!require(BiocManager)) {
install.packages('BiocManager')
BiocManager::install('leapR')
}
# load the library
library(leapR)
library(gplots)
#>
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#>
#> lowess
library(rmarkdown)
#>
#> Attaching package: 'rmarkdown'
#> The following objects are masked from 'package:BiocStyle':
#>
#> html_document, md_document, pdf_document
Dataset - an expression dataset, contained in the Bioconductor object, that at the bare minimum has a matrix of components (rows) measured in the same system under multiple different conditions (columns)
Component - the things being measured, genes, proteins, methylation site, phosphosite, etc. For functional (currently) the component must be associated with a gene name. That is, there’s not currently a way to calculate pathway enrichment using lipids.
Pathway - a set of components that works together to accomplish something or are related to each other in some other way. This includes classic signaling and metabolic pathways, but also molecular function and localization categories and other groups of related components, like genome location, conservation, etc.
Condition - a sample where the treatment, environmental conditions, patient, time point or some combination of those is varied.
The overall idea for functional enrichment is to determine which pathways are statistically over-represented in one group versus another, display statistically differential abundance from one group to another, or are statistically differentially distributed in a ranked list based on the abundance of one sample. Each of these purposes has a different underlying statistical test (or family of tests) and the results of each can be interpreted in somewhat different ways. The purpose of this vignette is to give the user a very brief introduction on how to use the package, not to discuss the underlying statistical choices that need to be made when analyzing such data.
There are a number of caveats (probably non-exhaustive) with doing this kind of analysis.
One important point is to use data that’s been normalized in a particular way to do these analyses. Data here has been normalized as a Z score by row (gene/protein/etc.). So, for each row, calculate the mean and standard deviation across all the conditions (columns) and then express as a Z score.
Here’s why. All high-throughput technologies (microarray, RNAseq, MS-assisted proteomics, metabolomics, lipidomics, etc.) suffer from the same limitation. The detectability of each molecule being detected (protein, RNA, etc.) is different and, in general, it’s impossible to accurately determine how detectable each one is. The multi-omic functional enrichment process lumps together measurements from different components (proteins, genes, etc.) to summarize a pathway. If the component measurements aren’t directly comparable (they aren’t) then this can and will introduce significant systematic errors and won’t produce the results you’re looking for. Careful consideration must be given that the results of the analysis reflect the question being asked and that the normalization method hasn’t obscured the desired results.
The background of comparison for functional enrichment is always important, but really only impacts the Fisher’s exact tests in my applications below. The background is the answer to “my functional group of interest is s tatistically enriched relative to what?” For Fisher’s exact tests this is really important. Generally, it is best to compare enrichment versus those components observed in the data - not against the universe of all possible components. For example, a proteomics dataset from plasma may have a limited number of proteins observed relative to all possible human proteins. Functional enrichment using all possible proteins will yield very different results than if it’s performed using only those proteins observed in plasma as a background.
When testing the statistical significance of differences in a lot of pathways it’s necessary to correct for multiple hypotheses. This essentially accounts for the possibility you might see SOMETHING significant by chance if you just test enough things- so it moves p values in a less significant direction. The more things you test, the greater this move will be. So pathway databases with lots of pathways are affected more by this correction, making it harder to get a significant result (which is a good thing actually).
Two ‘databases’ (organized text files) are included for pathways. The example is taken from the NCI’s Pathway Interaction Database (PID) and covers signaling pathways in human - but is no longer being actively maintained. They can be loaded as follows:
data(ncipid)
The identifiers (gene names, e.g.) for the data input MUST match the identifiers used in the pathway database. The two included human databases use the HGNC-approved gene names. Which means your data has to use the same identifiers.
A sample data set is included that is from the CPTAC study of 169 ovarian
tumors. We include the dataset as a object,
containing three assays (transcriptomics, global proteomics, and
phosphoproteomics) to enable interoperability with other tools, and
store example file as rda on
Figshare
as example.
This data can be loaded as follows:
url <- "https://api.figshare.com/v2/file/download/56536217"
pdata <- download.file(url, method = "libcurl", destfile = "protData.rda")
# as.matrix()
load("protData.rda")
p <- file.remove("protData.rda")
url <- "https://api.figshare.com/v2/file/download/56536214"
tdata <- download.file(url, method = "libcurl", destfile = "transData.rda")
load("transData.rda")
p <- file.remove("transData.rda")
url <- "https://api.figshare.com/v2/file/download/56536211"
phdata <- download.file(url, method = "libcurl", destfile = "phosData.rda")
load("phosData.rda")
p <- file.remove("phosData.rda")
We also include some groups of patients to compare stored as R data objects:
data(shortlist)
data(longlist)
## columns that we want to use for results
cols_to_display <- c("ingroup_n", "outgroup_n", "background_n",
"pvalue", "BH_pvalue")
The data are now loaded and ready to go through some of the examples.
There are a number of ways to do this. I generally use a simple approach which assesses the statistical difference in distributions between the abundance values from all the members of a pathway in all the group members from one group with those from the other group using a t test.
This is a ‘bag of values’ approach and it does not pay attention to the relationships between values in different groups (i.e. that each group has measurements for the same component). There are likely issues that rise because of this and caveats associated with it. However, it works fairly well.
In this example we are assessing the enrichment of pathways in
a group of short surviving patients versus in a group of long surviving
patients. We can also do a single patient-to-patient comparison or compare a
single patient to a group of patients.
Better corrected p-values are more enriched. However, you can get good
p-values when the algorithm only considers a limited number of components from
a pathway. That is, the pathway may have 30 members and the p-value is coming
from values from just 3 members. You can look at the ingroup_n column from
the result matrix to see this (and screen out if desired).
It is VERY important to also consider the effect size. That is, the difference
between the mean of one group and the mean of the other group. If there are
large numbers of components in the pathway being compared it is relatively easy
to get a significant p value with small effect size. Though this may be a real
difference it is often not as interesting as a smaller group with worse p value
and greater effect size. You can look at the effect size by comparing the
ingroup_mean and outgroup_mean columns.
# in this example we lump a bunch of patients together (the 'short survivors')
# and compare them to another group (the 'long survivors')
### using enrichment_wrapper function
protdata.enrichment.svl <- leapR::leapR(
geneset = ncipid,
enrichment_method = "enrichment_comparison",
eset = pset,
assay_name = "proteomics",
primary_columns = shortlist,
secondary_columns = longlist
)
or <- order(unlist(protdata.enrichment.svl[, "pvalue"]))
rmarkdown::paged_table(protdata.enrichment.svl[or, cols_to_display])
# another application is to compare just one patient against another
# (this would be the equivalent of comparing one time point to another)
### using enrichment_wrapper function
protdata.enrichment.svl.ovo <- leapR::leapR(
geneset = ncipid,
enrichment_method = "enrichment_comparison",
eset = pset,
assay_name = "proteomics",
primary_columns = shortlist[1],
secondary_columns = longlist[1]
)
or <- order(unlist(protdata.enrichment.svl.ovo[, "pvalue"]))
rmarkdown::paged_table(protdata.enrichment.svl.ovo[or, cols_to_display])
When we only compare one sample to another, we get no enriched pathways.
For this test I use Fisher’s exact which is a simple comparison of the overlap of two sets (think of it like a statistical Venn diagram with two groups). It’s also referred to as a hypergeometric test.
Caveat 1. Fisher’s exact does not consider abundance values but only lists of components. Generally this requires some separation of a group of interest using differential expression, module membership (from a network for example), or some other method.
Caveat 2. The choice of background for comparison can make a big difference on outcome. For example, in a proteomics experiment where you’re looking at enrichment in a group of highly differentially expressed proteins, you could choose to use all possible proteins as a background, or you could use just those proteins that were observed by proteomics (generally a much more limited set). The second option is generally the best since the first options will result in (partly to mostly) functions that are enriched in proteins that are seen in proteomics. That is, the most abundant proteins, which is generally not the desired outcome.
In the example below I construct a genelist of interest using a simple abundance threshold on the data then use a background of all the genes in the example dataset (which is a limited number). I then do a simple hierarchical clustering on the data, extract modules, and step through each module to calculate enrichment for them, outputting the results into a separate text file.
As with the t test comparison above it is important to look at the number of pathway members included in the comparison (look at the in_path column). There is no ‘effect size’ problem with Fisher’s exact since it’s just a set comparison, but it’s important to note that significant p values can arise from a pathway being underrepresented in the genelist, which often times is not the desired result. The foldx column gives a ratio of in versus not in the genelist, values > 1 being enriched and <1 being depleted.
# for this example we will construct a list of genes from the expression data
# to emulate what you might be inputting
genelist <- rownames(pset)[which(SummarizedExperiment::assay(pset,
"proteomics")[, 1] > 0.5)]
protdata.enrichment.sets.test <- leapR::leapR(
geneset = ncipid,
enrichment_method = "enrichment_in_sets",
eset = pset,
assay_name = "proteomics",
targets = genelist
)
or <- order(protdata.enrichment.sets.test[, "pvalue"])
rmarkdown::paged_table(protdata.enrichment.sets.test[or, cols_to_display])
### using enrichment_wrapper function
protdata.enrichment.order <- leapR::leapR(
geneset = ncipid,
enrichment_method = "enrichment_in_order",
eset = pset,
assay_name = "proteomics",
primary_columns = "TCGA-13-1484"
)
or <- order(protdata.enrichment.order[, "pvalue"])
rmarkdown::paged_table(protdata.enrichment.order[or, cols_to_display])
# in this example we construct some modules from the hierarchical clustering
# of the data
protdata_naf <- SummarizedExperiment::assay(pset, "proteomics")
# hierarchical clustering is not too happy with lots of missing values
# so we'll do a zero fill on this to get the modules
protdata_naf[which(is.na(protdata_naf))] <- 0
# construct the hierarchical clustering using the 'wardD' method, which
# seems to give more even sized modules
protdata_hc <- hclust(dist(protdata_naf), method = "ward.D2")
# arbitrarily we'll chop the clusters into 5 modules
modules <- cutree(protdata_hc, k = 5)
## sara: created list
clusters <- lapply(unique(modules), function(x) names(which(modules == x)))
# modules is a named list of values where each value is a module
# number and the name is the gene name
# To do enrichment for one module (module 1 in this case) do this
protdata.enrichment.sets.module_1 <- leapR::leapR(
geneset = ncipid,
enrichment_method = "enrichment_in_sets",
eset = pset,
assay_name = "proteomics",
targets = names(modules[which(modules == 1)])
)
# To do enrichment on all modules and return the list of enrichment results
protdata.enrichment.sets.modules <- do.call(rbind,
leapR::cluster_enrichment(
eset = pset,
assay_name= 'proteomics',
geneset = ncipid,
clusters = clusters,
sigfilter = 0.25))
## nothing is enriched
rmarkdown::paged_table(protdata.enrichment.sets.modules[, cols_to_display])
Similar to the popular GSEA, KS tests whether a group of components (the pathway) is distributed in a statistically significant manner in a ranked list of components. That is, if all the members of the pathway are clustered together at the top of the list (highly abundant, e.g.) or at the bottom of the list (low abundance, e.g.) this will return good p values. I should note that GSEA uses a more sophisticated approach than this and their application has a lot of bells and whistles.
In the example below I’m simply calculating enrichment for one of the patients in the list (arbitrarily selected). The ranking value is relative protein abundance in this case, but can be any continuous measure or derived value. For example, you could calculate the topology of all proteins in a network and use the topology measure (degree) as the measure.
Similar to the other examples be cautious of pathways with good p values that consider a small number of pathway numbers (in_path column). The MeanPath column gives a measure that shows how far above or below the median the mean rank of the pathway is (normalized to -1,1). The Zscore column is a Zscore calculated on the basis of the mean percentage rank of the pathway relative to the mean of the entire list divided by the standard deviation of the pathway rank. The foldx column expresses the mean percentage rank of the pathway relative to the entire list - closer to 0 is higher in the list and closer to 1 is closer to the bottom of the list. Each of these should give consistent results, but will be somewhat different.
# This is how you calculate enrichment in a ranked list
# (for example from topology)
### using enrichment_wrapper function
protdata.enrichment.sets <- leapR::leapR(
geneset = ncipid, "enrichment_in_order",
eset = pset,
assay_name = "proteomics",
primary_columns = shortlist[1]
)
or <- order(protdata.enrichment.sets[, "pvalue"])
rmarkdown::paged_table(protdata.enrichment.sets[or, cols_to_display])
In this section I’ll go through the ideas behind and use of several different kinds of enrichment that I’ve been working on. Consider this section to be under development and so use at your risk.
The idea here is to use the correlation of pathway members to each other versus to non-pathway members as a way to assess functional enrichment. This idea seems sound- pathways that are varying in a correlated way across a bunch of conditions (say time points or patients) may be more active and more important than others. However, more testing and validation is needed to show that this is the case.
The ingroup_mean gives the mean correlation of the pathway members to each
other and outgroup_mean gives the correlation of the pathway members to
non-pathway members. Background_mean gives the mean correlation of all
non-pathway members. The pvalue and BH_pvalue are for the pathway members
to each other versus those pathway members to non-pathway components. The
pvalue_background and BH_pvalue_background are for the pathway member
correlation relative to non-pathway member correlation (which is similar but
slightly different than the other p-values).
### using enrichment_wrapper function
protdata.enrichment.correlation <- leapR::leapR(
geneset = ncipid,
enrichment_method = "correlation_enrichment",
assay_name = "proteomics",
eset = pset
)
or <- order(protdata.enrichment.correlation[, "pvalue"])
rmarkdown::paged_table(head(protdata.enrichment.correlation[or,
cols_to_display]))
protdata.enrichment.correlation.short <- leapR::leapR(
geneset = ncipid,
enrichment_method = "correlation_enrichment",
assay_name = "proteomics",
eset = pset[, shortlist]
)
or <- order(protdata.enrichment.correlation.short[, "pvalue"])
rmarkdown::paged_table(head(protdata.enrichment.correlation.short[or,
cols_to_display]))
protdata.enrichment.correlation.long <- leapR::leapR(
geneset = ncipid,
enrichment_method = "correlation_enrichment",
assay_name = "proteomics",
eset = pset[, longlist]
)
or <- order(protdata.enrichment.correlation.long[, "pvalue"])
rmarkdown::paged_table(head(protdata.enrichment.correlation.long[or,
cols_to_display]))
In this example we will use phosphoproteomics data to assess the enrichment in known kinase substrates (a proxy for kinase activity)
data("kinasesubstrates")
# for an individual tumor calculate the Kinase-Substrate
# Enrichment (similar to KSEA)
# This uses the site-specific phosphorylation data to determine
# which kinases
# might be active by assessing the enrichment of the
# phosphorylation of their known substrates
phosphodata.ksea.order <- leapR::leapR(
geneset = kinasesubstrates,
enrichment_method = "enrichment_in_order",
assay_name = "phosphoproteomics",
eset = phset,
primary_columns = "TCGA-13-1484")
or <- order(phosphodata.ksea.order[, "pvalue"])
rmarkdown::paged_table(phosphodata.ksea.order[or, cols_to_display])
# now do the same thing but use a threshold
phosphodata.sets.order <- leapR::leapR(
geneset = kinasesubstrates,
enrichment_method = "enrichment_in_sets",
eset = phset,
assay_name = "phosphoproteomics",
threshold = 0.5,
primary_columns = "TCGA-13-1484"
)
or <- order(phosphodata.sets.order[, "pvalue"])
rmarkdown::paged_table(phosphodata.sets.order[or, cols_to_display])
We will compare the ability of transcriptomics, proteomics, and
phosphoproteomics to inform about
differences between short and long surviving patient groups. In addition
to other methods, we also employ a calcTTest function that takes
two sets of samples from the SummarizedExperiment object and computes
the t-test between them. The results are then stored in the rowData of the
same object, so that they can be used for enrichment later on.
This spans the multiple enrichment methods in leapR and also includes multi-omics.
The resulting heatmap is presented as Figure 2 in the paper.
# load the single omic and multi-omic pathway databases
data("krbpaths")
data("mo_krbpaths")
# comparison enrichment in transcriptional data
transdata.comp.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "enrichment_comparison",
eset = tset,
assay_name = "transcriptomics",
primary_columns = shortlist,
secondary_columns = longlist
)
# comparison enrichment in proteomics data
# this is the same code used above, just repeated here for clarity
protdata.comp.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "enrichment_comparison",
eset = pset,
assay_name = "proteomics",
primary_columns = shortlist,
secondary_columns = longlist
)
# comparison enrichment in phosphoproteomics data
phosphodata.comp.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "enrichment_comparison",
eset = phset,
assay_name = "phosphoproteomics",
primary_columns = shortlist,
secondary_columns = longlist, id_column = "hgnc_id"
)
# set enrichment in transcriptomics data
# perform the comparison t tes
tset <- leapR::calcTTest(tset, assay_name = "transcriptomics",
shortlist, longlist)
## now we need to run enrichment in sets with target list, not eset
transdata.set.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
eset = tset,
assay_name = "transcriptomics",
enrichment_method = "enrichment_in_sets",
primary_columns = "pvalue",
greaterthan = FALSE, threshold = 0.05
)
pset <- leapR::calcTTest(pset, assay_name = "proteomics",
shortlist, longlist)
protdata.set.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
eset = pset,
assay_name = "proteomics",
enrichment_method = "enrichment_in_sets",
primary_columns = "pvalue",
greaterthan = FALSE, threshold = 0.05
)
phset <- leapR::calcTTest(phset, assay_name = "phosphoproteomics",
shortlist, longlist)
phosphodata.set.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "enrichment_in_sets",
id_column = "hgnc_id",
assay_name = "phosphoproteomics",
eset = phset, primary_columns = "pvalue",
greaterthan = FALSE, threshold = 0.05
)
# order enrichment in transcriptomics data
transdata.order.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "enrichment_in_order",
eset = tset,
assay_name = "transcriptomics",
primary_columns = "difference"
)
# order enrichment in proteomics data
protdata.order.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "enrichment_in_order",
eset = pset,
assay_name = "proteomics",
primary_columns = "difference"
)
# order enrichment in phosphoproteomics data
phosphodata.order.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "enrichment_in_order",
id_column = "hgnc_id",
eset = phset,
assay_name = "phosphoproteomics",
primary_columns = "difference"
)
# correlation difference in transcriptomics data
transdata.corr.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "correlation_comparison",
eset = tset,
assay_name = "transcriptomics",
primary_columns = shortlist,
secondary_columns = longlist
)
# correlation difference in proteomics data
protdata.corr.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "correlation_comparison",
eset = pset,
assay_name = "proteomics",
primary_columns = shortlist,
secondary_columns = longlist
)
# correlation difference in phosphoproteomics data
phosphodata.corr.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "correlation_comparison",
eset = phset,
assay_name = "phosphoproteomics",
primary_columns = shortlist,
secondary_columns = longlist, id_column = "hgnc_id"
)
# combine the omics data into one with prefix tags
comboset <- leapR::combine_omics(list(pset, phset, tset),
c(NA, "hgnc_id", NA))
# comparison enrichment for combodata
# when we use expression set, we do not need to use the mo_krbpaths
#since the id mapping column is used
combodata.enrichment.svl <- leapR::leapR(
geneset = krbpaths, # mo_krbpaths,
enrichment_method = "enrichment_comparison",
eset = comboset,
assay_name = "combined",
primary_columns = shortlist,
secondary_columns = longlist, id_column = "id"
)
# set enrichment in combo data
# perform the comparison t test
comboset <- leapR::calcTTest(comboset,
assay_name = "combined",
shortlist, longlist)
combodata.set.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "enrichment_in_sets",
eset = comboset, primary_columns = "pvalue",
assay_name = "combined",
id_column = "id",
greaterthan = FALSE, threshold = 0.05
)
# order enrichment in combo data
combodata.order.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "enrichment_in_order",
assay_name = "combined",
eset = comboset, primary_columns = "difference",
id_column = "id"
)
# correlation difference in combo data
combodata.corr.enrichment.svl <- leapR::leapR(
geneset = krbpaths,
enrichment_method = "correlation_comparison",
eset = comboset,
assay_name = "combined",
primary_columns = shortlist,
id_column = "id",
secondary_columns = longlist
)
# now take all these results and combine them into one figure
all_results <- list(
transdata.comp.enrichment.svl,
protdata.comp.enrichment.svl,
phosphodata.comp.enrichment.svl,
combodata.enrichment.svl,
transdata.set.enrichment.svl,
protdata.set.enrichment.svl,
phosphodata.set.enrichment.svl,
combodata.set.enrichment.svl,
transdata.order.enrichment.svl,
protdata.order.enrichment.svl,
phosphodata.order.enrichment.svl,
combodata.order.enrichment.svl,
transdata.corr.enrichment.svl,
protdata.corr.enrichment.svl,
phosphodata.corr.enrichment.svl,
combodata.corr.enrichment.svl
)
pathways_of_interest <- c(
"KEGG_APOPTOSIS",
"KEGG_CELL_CYCLE",
"KEGG_ERBB_SIGNALING_PATHWAY",
"KEGG_FOCAL_ADHESION",
"KEGG_INSULIN_SIGNALING_PATHWAY",
"KEGG_MAPK_SIGNALING_PATHWAY",
"KEGG_MISMATCH_REPAIR",
"KEGG_MTOR_SIGNALING_PATHWAY",
"KEGG_OXIDATIVE_PHOSPHORYLATION",
"KEGG_P53_SIGNALING_PATHWAY",
"KEGG_PATHWAYS_IN_CANCER",
"KEGG_PROTEASOME",
"KEGG_RIBOSOME",
"KEGG_VEGF_SIGNALING_PATHWAY",
"KEGG_WNT_SIGNALING_PATHWAY"
)
results.frame <- data.frame(
pathway = pathways_of_interest,
td.comp = all_results[[1]][pathways_of_interest, "BH_pvalue"] < 0.05,
pd.comp = all_results[[2]][pathways_of_interest, "BH_pvalue"] < 0.05,
fd.comp = all_results[[3]][pathways_of_interest, "BH_pvalue"] < 0.05,
cd.comp = all_results[[4]][pathways_of_interest, "BH_pvalue"] < 0.05,
td.set = all_results[[5]][pathways_of_interest, "BH_pvalue"] < 0.05,
pd.set = all_results[[6]][pathways_of_interest, "BH_pvalue"] < 0.05,
fd.set = all_results[[7]][pathways_of_interest, "BH_pvalue"] < 0.05,
cd.set = all_results[[8]][pathways_of_interest, "BH_pvalue"] < 0.05,
td.order = all_results[[9]][pathways_of_interest, "BH_pvalue"] < 0.05,
pd.order = all_results[[10]][pathways_of_interest, "BH_pvalue"] < 0.05,
fd.order = all_results[[11]][pathways_of_interest, "BH_pvalue"] < 0.05,
cd.order = all_results[[12]][pathways_of_interest, "BH_pvalue"] < 0.05,
td.corr = all_results[[13]][pathways_of_interest, "BH_pvalue"] < 0.05,
pd.corr = all_results[[14]][pathways_of_interest, "BH_pvalue"] < 0.05,
fd.corr = all_results[[15]][pathways_of_interest, "BH_pvalue"] < 0.05,
cd.corr = all_results[[16]][pathways_of_interest, "BH_pvalue"] < 0.05
)
results.frame.or <- data.frame(
pathway = pathways_of_interest,
td.comp = all_results[[1]][pathways_of_interest, "oddsratio"],
pd.comp = all_results[[2]][pathways_of_interest, "oddsratio"],
fd.comp = all_results[[3]][pathways_of_interest, "oddsratio"],
cd.comp = all_results[[4]][pathways_of_interest, "oddsratio"],
td.set = log(all_results[[5]][pathways_of_interest, "oddsratio"], 2),
pd.set = log(all_results[[6]][pathways_of_interest, "oddsratio"], 2),
fd.set = log(all_results[[7]][pathways_of_interest, "oddsratio"], 2),
cd.set = log(all_results[[8]][pathways_of_interest, "oddsratio"], 2),
td.order = all_results[[9]][pathways_of_interest, "oddsratio"],
pd.order = all_results[[10]][pathways_of_interest, "oddsratio"],
fd.order = all_results[[11]][pathways_of_interest, "oddsratio"],
cd.order = all_results[[12]][pathways_of_interest, "oddsratio"],
td.corr = all_results[[13]][pathways_of_interest, "oddsratio"],
pd.corr = all_results[[14]][pathways_of_interest, "oddsratio"],
fd.corr = all_results[[15]][pathways_of_interest, "oddsratio"],
cd.corr = all_results[[16]][pathways_of_interest, "oddsratio"]
)
rownames(results.frame) <- results.frame[, 1]
rownames(results.frame.or) <- results.frame.or[, 1]
results.frame.sig <- results.frame[, 2:17] * results.frame.or[, 2:17]
heatmap.2(as.matrix(results.frame.sig[, c(1:4, 9:16)]), Colv = NA,
trace = "none", breaks = c(-1, -.1, -0.0001, 0, 0.1, 1),
col = c("blue", "lightblue", "grey", "pink", "red"),
dendrogram = "none")
Figure 3. An application of the KSEA-like approach in leapR as applied to our example data. In this example we are looking for known substrate sets of kinases (from Phosphosite Plus) that are enriched in the short vs long comparison of phosphopeptides.
# this comparison of abundance in substrates between case and control
# is lopsided in the sense that phosphorylation levels were previously
# reported to be overall higher in the short survivors. Thus the
# results are not terribly interesting (all kinases are in the same
# direction)
phosphodata.ksea.comp.svl <- leapR::leapR(
geneset = kinasesubstrates,
enrichment_method = "enrichment_comparison",
eset = phset,
assay_name = "phosphoproteomics",
primary_columns = shortlist, secondary_columns = longlist
)
# thus for the example we'll look at correlation between known substrates
# in the case v control conditions
phosphodata.ksea.corr.svl <- leapR::leapR(
geneset = kinasesubstrates,
enrichment_method = "correlation_comparison",
eset = phset,
assay_name = "phosphoproteomics",
primary_columns = shortlist,
secondary_columns = longlist
)
# for the example we are using an UNCORRECTED PVALUE
# which will allow us to plot more values, but
# for real applications it's necessary to use the
# CORRECTED PVALUE
# here are all the kinases *significant (*uncorrected) from the analysis
or <- order(phosphodata.ksea.corr.svl[, "pvalue"])
ksea_result <- phosphodata.ksea.corr.svl[or, ][1:9, ]
ksea_cols <- rep("grey", 9)
ksea_cols[which(ksea_result[, "oddsratio"] > 0)] <- "black"
# plot left panel: correlation significance of top most significant kinases
barplot(ksea_result[, "oddsratio"],
horiz = TRUE, xlim = c(-1, 0.5),
names.arg = rownames(ksea_result), las = 1, col = ksea_cols
)
# plot right panel: abundance comparison results of the same kinases
barplot(phosphodata.ksea.comp.svl[rownames(ksea_result), "oddsratio"],
horiz = TRUE, names.arg = rownames(ksea_result), las = 1, col = "black"
)
Lastly we print out the session info!
sessionInfo()
#> R Under development (unstable) (2025-10-21 r88958)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.7.8
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.6-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.6-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] rmarkdown_2.30 gplots_3.2.0 leapR_0.99.5 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 generics_0.1.4
#> [3] SparseArray_1.11.1 bitops_1.0-9
#> [5] KernSmooth_2.23-26 gtools_3.9.5
#> [7] lattice_0.22-7 hms_1.1.4
#> [9] digest_0.6.38 magrittr_2.0.4
#> [11] caTools_1.18.3 evaluate_1.0.5
#> [13] grid_4.6.0 bookdown_0.45
#> [15] fastmap_1.2.0 jsonlite_2.0.0
#> [17] Matrix_1.7-4 tinytex_0.57
#> [19] BiocManager_1.30.27 jquerylib_0.1.4
#> [21] abind_1.4-8 cli_3.6.5
#> [23] rlang_1.1.6 XVector_0.51.0
#> [25] Biobase_2.71.0 cachem_1.1.0
#> [27] DelayedArray_0.37.0 yaml_2.3.10
#> [29] S4Arrays_1.11.0 tools_4.6.0
#> [31] tzdb_0.5.0 SummarizedExperiment_1.41.0
#> [33] BiocGenerics_0.57.0 vctrs_0.6.5
#> [35] R6_2.6.1 magick_2.9.0
#> [37] matrixStats_1.5.0 stats4_4.6.0
#> [39] lifecycle_1.0.4 Seqinfo_1.1.0
#> [41] S4Vectors_0.49.0 IRanges_2.45.0
#> [43] pkgconfig_2.0.3 bslib_0.9.0
#> [45] pillar_1.11.1 Rcpp_1.1.0
#> [47] glue_1.8.0 xfun_0.54
#> [49] tibble_3.3.0 GenomicRanges_1.63.0
#> [51] MatrixGenerics_1.23.0 knitr_1.50
#> [53] htmltools_0.5.8.1 readr_2.1.6
#> [55] compiler_4.6.0