Installation

Install OGRE using Bioconductor’s package installer.

if(!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("OGRE")

Load the OGRE package:

library(OGRE)

Quick start: load datasets from hard drive

To get started with OGRE, generate an OGREDataSet, which stores your datasets along with additional information about the analysis you are conducting. Query and subject files can be conveniently kept in separate folders as GenomicRanges objects saved as .rds / .RDS files. Point OGRE to the correct locations by supplying a path to each folder through the character vectors queryFolder and subjectFolder. This vignette uses lightweight example query and subject datasets to show OGRE’s functionality.

myQueryFolder <- file.path(system.file('extdata', package = 'OGRE'),"query")
mySubjectFolder <- file.path(system.file('extdata', package = 'OGRE'),"subject")

myOGRE <- OGREDataSetFromDir(queryFolder=myQueryFolder,
                             subjectFolder=mySubjectFolder)
## Initializing OGREDataSet...
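If your own regions are not yet available as .rds files, the sketch below shows one way to create a GRanges object and save it into a query folder. It is a minimal example under assumptions: the coordinates, the file name myRegions.rds, and the temporary folder are hypothetical, and the metadata columns ID, name, and score simply mirror the columns of the example datasets printed later in this vignette.

```r
library(GenomicRanges)

# Hypothetical toy regions; the metadata columns (ID, name, score) mirror
# the columns of the example datasets shipped with OGRE
myRegions <- GRanges(
  seqnames = "21",
  ranges = IRanges(start = c(1000, 5000), end = c(2000, 5500)),
  strand = c("+", "-"),
  ID = c("region1", "region2"),
  name = c("A", "B"),
  score = c(NA_real_, NA_real_)
)

# Write the object into a (temporary, hypothetical) query folder
queryDir <- file.path(tempdir(), "query")
dir.create(queryDir, showWarnings = FALSE, recursive = TRUE)
saveRDS(myRegions, file.path(queryDir, "myRegions.rds"))

# Read it back to confirm that the file round-trips
reloaded <- readRDS(file.path(queryDir, "myRegions.rds"))
```

A folder prepared this way could then be passed as queryFolder, with subject files prepared in the same manner in a second folder.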

By inspecting the metadata of the OGREDataSet, you can make sure the input paths you supplied are stored correctly.

metadata(myOGRE)
## $queryFolder
## [1] "/tmp/RtmpUYpn7l/Rinst15271e425a48f9/OGRE/extdata/query"
## 
## $subjectFolder
## [1] "/tmp/RtmpUYpn7l/Rinst15271e425a48f9/OGRE/extdata/subject"
## 
## $outputFolder
## [1] "/tmp/RtmpUYpn7l/Rinst15271e425a48f9/OGRE/extdata/output"
## 
## $gvizPlotsFolder
## [1] "/tmp/RtmpUYpn7l/Rinst15271e425a48f9/OGRE/extdata/gvizPlots"
## 
## $summaryDT
## list()
## 
## $itracks
## list()

Query and subject datasets are read by loadAnnotations() and stored in the OGREDataSet as GRanges objects. Here we read in the example datasets:

myOGRE <- loadAnnotations(myOGRE)
## Reading query dataset...
## Reading subject datasets...

OGRE uses your dataset file names to label query and subject datasets internally. Because every OGREDataSet is a GRangesList, we can check these names with the names() function.

names(myOGRE)
## [1] "genes" "CGI"   "TFBS"
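To illustrate how dataset labels relate to file names, the following base R sketch derives labels from the .rds / .RDS files in a folder. It only mimics the naming idea; the folder and the empty placeholder files are hypothetical, and OGRE’s internal code may work differently.

```r
# Hypothetical folder holding three empty placeholder files
demoDir <- file.path(tempdir(), "ogreNameDemo")
dir.create(demoDir, showWarnings = FALSE)
file.create(file.path(demoDir, c("genes.rds", "CGI.RDS", "TFBS.rds")))

# Find .rds / .RDS files regardless of the case of the extension
rdsFiles <- list.files(demoDir, pattern = "\\.rds$", ignore.case = TRUE)

# Strip the extension to obtain one label per dataset
datasetNames <- tools::file_path_sans_ext(rdsFiles)
```

Choosing short, descriptive file names therefore pays off, since they reappear in tables and plots.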

Let’s have a look at the stored datasets:

myOGRE
## GRangesList object of length 3:
## $genes
## GRanges object with 242 ranges and 3 metadata columns:
##         seqnames            ranges strand |              ID        name
##            <Rle>         <IRanges>  <Rle> |     <character> <character>
##     [1]       21 10906201-11029719      - | ENSG00000166157        TPTE
##     [2]       21 14741931-14745386      - | ENSG00000256715  AL050302.1
##     [3]       21 14982498-15013906      + | ENSG00000166351       POTED
##     [4]       21 15051621-15053459      - | ENSG00000269011  AL050303.1
##     [5]       21 15481134-15583166      - | ENSG00000188992        LIPI
##     ...      ...               ...    ... .             ...         ...
##   [238]       21 47720095-47743789      - | ENSG00000160298    C21orf58
##   [239]       21 47744036-47865682      + | ENSG00000160299        PCNT
##   [240]       21 47878812-47989926      + | ENSG00000160305       DIP2A
##   [241]       21 48018875-48025121      - | ENSG00000160307       S100B
##   [242]       21 48055079-48085036      + | ENSG00000160310       PRMT2
##             score
##         <numeric>
##     [1]        NA
##     [2]        NA
##     [3]        NA
##     [4]        NA
##     [5]        NA
##     ...       ...
##   [238]        NA
##   [239]        NA
##   [240]        NA
##   [241]        NA
##   [242]        NA
##   -------
##   seqinfo: 25 sequences (1 circular) from hg19 genome
## 
## $CGI
## GRanges object with 365 ranges and 3 metadata columns:
##         seqnames            ranges strand |          ID        name     score
##            <Rle>         <IRanges>  <Rle> | <character> <character> <numeric>
##     [1]       21   9437273-9439473      * |       26635    CpG:_285        NA
##     [2]       21   9483486-9484663      * |       26636    CpG:_165        NA
##     [3]       21   9647867-9648116      * |       26637     CpG:_18        NA
##     [4]       21   9708936-9709231      * |       26638     CpG:_31        NA
##     [5]       21   9825443-9826296      * |       26639    CpG:_120        NA
##     ...      ...               ...    ... .         ...         ...       ...
##   [361]       21 48018543-48018791      * |       26995     CpG:_21        NA
##   [362]       21 48055200-48056060      * |       26996     CpG:_88        NA
##   [363]       21 48068518-48068808      * |       26997     CpG:_24        NA
##   [364]       21 48081242-48081849      * |       26998     CpG:_55        NA
##   [365]       21 48087201-48088106      * |       26999     CpG:_93        NA
##   -------
##   seqinfo: 25 sequences (1 circular) from hg19 genome
## 
## $TFBS
## GRanges object with 48761 ranges and 3 metadata columns:
##           seqnames            ranges strand |           ID        name
##              <Rle>         <IRanges>  <Rle> |  <character> <character>
##       [1]       21 29884415-29884427      + |  GATA1.85108    GATA1_04
##       [2]       21 46923766-46923780      + |    CDP.81529      CDP_02
##       [3]       21   9491627-9491638      - |   HFH1.46541     HFH1_01
##       [4]       21   9491706-9491725      - |  PPARA.24892    PPARA_01
##       [5]       21   9491792-9491815      + |   GFI1.35413     GFI1_01
##       ...      ...               ...    ... .          ...         ...
##   [48757]       21 48083381-48083404      + | STAT5A.43326   STAT5A_02
##   [48758]       21 48083400-48083419      + |   ARNT.19751     ARNT_02
##   [48759]       21 48084826-48084841      + |   BRN2.40426     BRN2_01
##   [48760]       21 48084830-48084847      + | FOXJ2.121681    FOXJ2_01
##   [48761]       21 48084834-48084845      + |  NKX3A.47953    NKX3A_01
##               score
##           <numeric>
##       [1]       891
##       [2]       831
##       [3]       865
##       [4]       757
##       [5]       817
##       ...       ...
##   [48757]       751
##   [48758]       792
##   [48759]       803
##   [48760]       889
##   [48761]       851
##   -------
##   seqinfo: 25 sequences (1 circular) from hg19 genome

To find overlaps between your query and subject datasets, call fOverlaps(). Internally, OGRE uses the GenomicRanges package to calculate full and partial overlaps, as schematically shown.



Any subject-query hits found are then listed in detailDT, which is stored as a data.table.

myOGRE <- fOverlaps(myOGRE)
head(metadata(myOGRE)$detailDT,n=2)
##            queryID queryType subjID subjType queryChr queryStart queryEnd
##             <char>    <char> <char>   <char>   <char>      <int>    <int>
## 1: ENSG00000166157     genes  26649      CGI       21   10906201 11029719
## 2: ENSG00000269011     genes  26654      CGI       21   15051621 15053459
##    queryStrand subjChr subjStart  subjEnd subjStrand overlapWidth overlapRatio
##         <char>  <char>     <int>    <int>     <char>        <int>        <num>
## 1:           -      21  10989914 10991413          *         1500   0.01214388
## 2:           -      21  15052411 15052644          *          234   0.12724307
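The overlapWidth and overlapRatio columns can be reproduced by hand from the coordinates in the two rows above. The base R sketch below assumes that overlapRatio is the overlap width divided by the query width; this assumption is consistent with the printed values, but it is inferred from the output rather than taken from OGRE’s documentation.

```r
# Coordinates copied from the two detailDT rows shown above
queryStart <- c(10906201, 15051621)
queryEnd   <- c(11029719, 15053459)
subjStart  <- c(10989914, 15052411)
subjEnd    <- c(10991413, 15052644)

# Width of the intersection of two closed, 1-based intervals
overlapWidth <- pmin(queryEnd, subjEnd) - pmax(queryStart, subjStart) + 1

# Assumption: the ratio is expressed relative to the query width
queryWidth   <- queryEnd - queryStart + 1
overlapRatio <- overlapWidth / queryWidth
```

In both rows the CpG island lies entirely within the gene, so the overlap width equals the subject width (1500 and 234 nucleotides).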

The summary plot shows the number of overlaps between your datasets.

 myOGRE <- sumPlot(myOGRE)
 metadata(myOGRE)$barplot_summary

Using Gviz visualization, each query can be displayed together with all overlapping subject elements. Choose labels for all region tracks by supplying a trackRegionLabels vector. Plots are stored in the same location as your dataset files.

 myOGRE <- gvizPlot(myOGRE,"ENSG00000142168",showPlot = TRUE,
                    trackRegionLabels = setNames(c("name","name"),c("genes","CGI")))
## Plotting query: ENSG00000142168

The overlap distribution can be computed with summarizeOverlap(myOGRE), which outputs a table of summary statistics: minimum, lower quartile, median, mean, upper quartile, and maximum number of overlaps per region and per dataset. The distribution can also be displayed as histograms using plotHist(myOGRE). The results are accessed with metadata(myOGRE)$summaryDT and metadata(myOGRE)$hist. Two tables / plots are generated: the first covers regions with and without overlap, and the second covers only regions with overlap. Next, we generate a histogram with the number of TFBS per gene (x-axis, log scale) and the TFBS frequency (y-axis). When focusing only on regions with overlap, genes have a median of 54 TFBS overlaps (black dashed line).

 myOGRE <- summarizeOverlap(myOGRE) 
 myOGRE <- plotHist(myOGRE)
 metadata(myOGRE)$summaryDT
## $includes0
##               CGI      TFBS
## Min.     0.000000    0.0000
## 1st Qu.  0.000000    8.0000
## Median   1.000000   36.0000
## Mean     1.210744  119.6116
## 3rd Qu.  1.750000  129.7500
## Max.    14.000000 3136.0000
## 
## $excludes0
##              CGI      TFBS
## Min.     1.00000    1.0000
## 1st Qu.  1.00000   15.0000
## Median   1.00000   54.0000
## Mean     2.02069  139.8357
## 3rd Qu.  2.00000  159.5000
## Max.    14.00000 3136.0000
## NA's    97.00000   35.0000
 metadata(myOGRE)$hist$TFBS
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
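The difference between the includes0 and excludes0 tables can be illustrated in base R. The sketch below uses hypothetical overlap counts and treats regions without overlap as missing values, which is consistent with the NA's row printed above; it is not OGRE’s own implementation.

```r
# Hypothetical overlap counts for six query regions
overlapsPerRegion <- c(0, 0, 1, 3, 4, 12)

# Summary over all regions, including those without any overlap
includes0 <- summary(overlapsPerRegion)

# Treat regions without overlap as missing before summarising
overlapsNoZero <- replace(overlapsPerRegion, overlapsPerRegion == 0, NA)
excludes0 <- summary(overlapsNoZero)
```

Here the median rises from 2 to 3.5 once the two regions without overlap are set aside, and those two regions are reported in the NA's entry.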

It is possible to create an average coverage profile of all gene-TFBS overlaps, split into 100 bins that represent the gene bodies of all 242 genes. Both forward- and reverse-strand genes are arranged along the x-axis, and peaks indicate TFBS overlap enrichment. Overlap coverage is calculated as the sum of all gene-TFBS overlaps in 5’-3’ direction. Generated plots can be accessed with metadata(myOGRE)$covPlot$TFBS, and the resulting profile shows an accumulation of TFBS around gene start and end positions.

 myOGRE <- covPlot(myOGRE) 
## Generating coverage plot(s), this might take a while...
## Excluding regions with nucleotides<nbin
 metadata(myOGRE)$covPlot$TFBS$plot
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
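The binning idea behind such a profile can be sketched in base R. The helper binCoverage() below is hypothetical and not part of OGRE: it orients each gene in 5’-3’ direction, averages a per-nucleotide coverage vector within equally sized bins, and skips genes shorter than the number of bins, in the spirit of the message printed above. The toy example uses 4 bins instead of 100.

```r
# Hypothetical helper: summarise per-nucleotide coverage into `nbin` bins
binCoverage <- function(cov, strand, nbin = 100) {
  if (length(cov) < nbin) return(NULL)   # region too short to fill every bin
  if (strand == "-") cov <- rev(cov)     # orient reverse-strand genes 5' -> 3'
  bin <- ceiling(seq_along(cov) * nbin / length(cov))
  as.numeric(tapply(cov, bin, mean))     # mean coverage within each bin
}

# Toy coverage vectors for three hypothetical genes
profiles <- list(
  binCoverage(c(2, 2, 0, 0, 0, 0, 1, 1), "+", nbin = 4),
  binCoverage(c(1, 1, 0, 0, 0, 0, 3, 3), "-", nbin = 4),
  binCoverage(c(1, 1), "+", nbin = 4)    # too short, returns NULL
)

# Drop skipped genes and sum the binned profiles across genes
profiles <- Filter(Negate(is.null), profiles)
coverageProfile <- Reduce(`+`, profiles)
```

Because the reverse-strand gene is flipped before binning, its coverage near the gene start adds to the first bin rather than the last, which is what makes peaks at gene start and end positions comparable across strands.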