extraChIPs 1.11.0
knitr::opts_chunk$set(message = FALSE, crop = NULL)
The GRAVI workflow, for which this
package is designed, uses sliding windows for differential signal
analysis. However, the use of fixed-width windows, as is common under
DiffBind-style (Ross-Innes et al. 2012) approaches is also possible with
extraChIPs
. This vignette focusses on using conventional peak calls
and fixed-width approaches to replicate and extend these approaches.
The majority of examples below use heavily reduced datasets to provide general guidance on using the functions. Some results may appear trivial as a result, but will hopefully prove far more useful in a true experimental context. All data, along with this vignette are available here. Please place all contents of the data directory in a directory named data in your own working directory.
In order to use the package extraChIPs
and follow this vignette, we
recommend using the package BiocManager
hosted on CRAN. Once this is
installed, the additional packages required for this vignette
(tidyverse
, Rsamtools
, csaw
, BiocParallel
and rtracklayer
) can
also be installed.
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager")
pkg <- c(
"tidyverse", "Rsamtools", "csaw", "BiocParallel", "rtracklayer", "edgeR",
"patchwork", "extraChIPs", "plyranges", "scales", "here", "quantro"
)
BiocManager::install(pkg, update = FALSE)
Once these packages are installed, we can load them easily
library(tidyverse)
library(Rsamtools)
library(csaw)
library(BiocParallel)
library(rtracklayer)
library(edgeR)
library(patchwork)
library(extraChIPs)
library(plyranges)
library(scales)
library(glue)
library(ggrepel)
library(here)
library(quantro)
theme_set(theme_bw())
All data for this vignette is expected to be in a sub-directory of the working directory named “data”, and all paths will be predicated on this. Please ensure you have all data in this location, obtained from here.
The data itself is ChIP-Seq data targeting the Estrogen Receptor (ER),
and is taken from the cell-line ZR-75-1 cell-line using data from the
BioProject , Pre-processing was performed using the
prepareChIPs
workflow,
written in snakemake (Mölder et al. 2021) and all code is available at
https://github.com/smped/PRJNA509779. ER binding was assessed under
Vehicle (E2) and DHT-stimulated (E2DHT) conditions. Using GRCh37 as the
reference genome, a subset of regions found on chromosome 10 are
included in this dataset for simplicity.
First we’ll define our sample data then define our two treatment groups. Defining a consistent colour palette for all plots is also a good habit to develop.
treat_levels <- c("E2", "E2DHT")
treat_colours <- setNames(c("steelblue", "red3"), treat_levels)
samples <- tibble(
accession = paste0("SRR831518", seq(0, 5)),
target = "ER",
treatment = factor(rep(treat_levels, each = 3), levels = treat_levels)
)
samples
## # A tibble: 6 × 3
## accession target treatment
## <chr> <chr> <fct>
## 1 SRR8315180 ER E2
## 2 SRR8315181 ER E2
## 3 SRR8315182 ER E2
## 4 SRR8315183 ER E2DHT
## 5 SRR8315184 ER E2DHT
## 6 SRR8315185 ER E2DHT
accessions <- samples %>%
split(f = .$treatment) %>%
lapply(pull, "accession")
We’ll eventually be loading counts for differential signal analysis from
a set of BamFiles, so first we’ll create a BamFileList
with all of
these files.
bfl <- here("data", "ER", glue("{samples$accession}.bam")) %>%
BamFileList() %>%
setNames(str_remove_all(names(.), ".bam"))
file.exists(path(bfl))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Seqinfo
objects are the foundation of working with GRanges, so
defining an object at the start of a workflow is good practice. This is
simply enabled with extraChIPs
by using defineSeqinfo()
and
specifying the appropriate genome.
sq <- defineSeqinfo("GRCh37")
Another key preparatory step for working with peaks is to define a set of regions as either blacklisted or grey-listed regions. The former are known problematic regions based on each genome, with data freely available from https://github.com/Boyle-Lab/Blacklist/tree/master/lists, whilst grey-listed regions are defined from potentially problematic regions as detected within the input sample. For our samples code for this is included in the previously provided repository (https://github.com/smped/PRJNA509779).
greylist <- import.bed(here("data/chr10_greylist.bed"), seqinfo = sq)
blacklist <- import.bed( here("data/chr10_blacklist.bed"), seqinfo = sq)
omit_ranges <- c(greylist, blacklist)
The provided dataset includes six files produced by macs2 callpeak
(Zhang et al. 2008) in the narrowPeak
format, and these are able to be
easily parsed using extraChIPs
. We’ll immediately pass our black &
grey-listed regions to our parsing function so we can exclude these
regions right from the start.
By passing the above seqinfo object to this function, we’re also telling
importPeaks()
to ignore any peaks not on the included chromosomes.
peaks <- here("data", "ER", glue("{samples$accession}_peaks.narrowPeak")) %>%
importPeaks(seqinfo = sq, blacklist = omit_ranges)
This will import the peaks from all files as a single GRangesList
object, adding the file-name to each element by default. We can easily
modify these names if we so wish.
names(peaks) <- str_remove_all(names(peaks), "_peaks.narrowPeak")
Once loaded, we can easily check how similar our replicates are using
plotOverlaps()
. When three or more sets of peaks are contained in the
GRangesList
, an UpSet plot will be drawn by default.
plotOverlaps(
peaks, min_size = 10, .sort_sets = FALSE,
set_col = treat_colours[as.character(samples$treatment)]
)
Optionally, specifying a column and a suitable function will produce an additional panel summarising that value. In the following, we’ll show the maximum score obtained, highlighting that for peaks identified in only one or two replicates, the overall signal intensity is generally lower, even in the sample with the strongest signal.
plotOverlaps(
peaks, min_size = 10, .sort_sets = FALSE, var = "score", f = "max",
set_col = treat_colours[as.character(samples$treatment)]
)
A common task at this point may be to define consensus peaks within each
treatment group, by retaining only the peaks found in 2 of the 3
replicates (p = 2/3)
. The default approach is to take the union of all
ranges, with the returned object containing logical values for each
sample, as well as the number of samples where an overlapping peak was
found.
If we wish to retain any of the original columns, such as the
macs2 callpeak
score, we can simply pass the column names to
makeConsensus()
consensus_e2 <- makeConsensus(peaks[accessions$E2], p = 2/3, var = "score")
consensus_e2dht <- makeConsensus(peaks[accessions$E2DHT], p = 2/3, var = "score")
Alternatively, we could find the centre of the peaks as part of this process, by averaging across the estimated peak centres for each sample. This is a very common step for ChIP-Seq data where the target is a transcription factor, and also forms a key step in the DiffBind workflow.
In the following code chunk, we first find the centre for each sample
using the information provided by macs2
, before retaining this column
when calling makeConsensus()
. This will return each of the individual
centre-position estimates as a list for each merged range, and using
vapply()
we then take the mean position as our estimate for the
combined peak centre.
consensus_e2 <- peaks[accessions$E2] %>%
endoapply(mutate, centre = start + peak) %>%
makeConsensus(p = 2/3, var = "centre") %>%
mutate(centre = vapply(centre, mean, numeric(1)))
consensus_e2
## GRanges object with 164 ranges and 5 metadata columns:
## seqnames ranges strand | centre SRR8315180 SRR8315181
## <Rle> <IRanges> <Rle> | <numeric> <logical> <logical>
## [1] chr10 43048195-43048529 * | 43048362 TRUE TRUE
## [2] chr10 43521739-43522260 * | 43522020 TRUE TRUE
## [3] chr10 43540042-43540390 * | 43540272 TRUE FALSE
## [4] chr10 43606238-43606573 * | 43606416 TRUE TRUE
## [5] chr10 43851214-43851989 * | 43851719 FALSE TRUE
## ... ... ... ... . ... ... ...
## [160] chr10 99096784-99097428 * | 99097254 TRUE TRUE
## [161] chr10 99168353-99168649 * | 99168502 TRUE TRUE
## [162] chr10 99207868-99208156 * | 99207998 FALSE TRUE
## [163] chr10 99331363-99331730 * | 99331595 TRUE TRUE
## [164] chr10 99621632-99621961 * | 99621818 FALSE TRUE
## SRR8315182 n
## <logical> <numeric>
## [1] TRUE 3
## [2] TRUE 3
## [3] TRUE 2
## [4] TRUE 3
## [5] TRUE 2
## ... ... ...
## [160] TRUE 3
## [161] TRUE 3
## [162] TRUE 2
## [163] TRUE 3
## [164] TRUE 2
## -------
## seqinfo: 25 sequences from GRCh37 genome
consensus_e2dht <- peaks[accessions$E2DHT] %>%
endoapply(mutate, centre = start + peak) %>%
makeConsensus(p = 2/3, var = "centre") %>%
mutate(centre = vapply(centre, mean, numeric(1)))
We can also inspect these using plotOverlaps()
provided we use a
GRangesList
for the input. Now that we only have two elements (one for
each treatment) a VennDiagram will be generated instead of an UpSet
plot.
GRangesList(E2 = granges(consensus_e2), E2DHT = granges(consensus_e2dht)) %>%
plotOverlaps(set_col = treat_colours[treat_levels])
We can now go one step further and define the set of peaks found in either treatment. Given we’re being inclusive here, we can leave p = 0 so any peak found in either treatment is included.
union_peaks <- GRangesList(
E2 = select(consensus_e2, centre),
E2DHT = select(consensus_e2dht, centre)
) %>%
makeConsensus(var = c("centre")) %>%
mutate(
centre = vapply(centre, mean, numeric(1)) %>% round(0)
)
Now we have a set of peaks, found in at least 2/3 of samples from either condition, with estimates of each peak’s centre. The next step would be to set all peaks as the same width based on the centre position, with a common width being 500bp.
In the following we’ll perform multiple operations in a single call mutate, so let’s make sure we know what’s happening.
glue("{seqnames}:{centre}:{strand}")
uses glue
syntax to parse
the seqnames, centre position and strand information as a
character-like vector with a width of only 1, and using the
estimated centre as the Range.GRanges
object, before resizing to the
desired width.GRanges
structure, but discarding anything in the
mcols()
element, thencolToRanges()
, we take the centred ranges and place them as
the core set of GRanges for this object.This gives a GRanges object with all original information, but with centred peaks of a fixed width.
w <- 500
centred_peaks <- union_peaks %>%
mutate(
centre = glue("{seqnames}:{centre}:{strand}") %>%
GRanges(seqinfo = sq) %>%
resize(width = w),
union_peak = granges(.)
) %>%
colToRanges("centre")
Now we have our centred, fixed-width peaks, we can count reads using
csaw::regionCounts()
(Lun and Smyth 2016). We know our fragment length
is about 200bp, so we can pass this to the function for a slightly more
sophisticated approach to counting.
se <- regionCounts(bfl, centred_peaks, ext = 200)
se
## class: RangedSummarizedExperiment
## dim: 188 6
## metadata(2): final.ext param
## assays(1): counts
## rownames: NULL
## rowData names(4): E2 E2DHT n union_peak
## colnames(6): SRR8315180 SRR8315181 ... SRR8315184 SRR8315185
## colData names(4): bam.files totals ext rlen
The colData()
element of the returned object as the columns
bam.files, totals, ext and rlen, which are all informative and
can be supplemented with our samples
data frame. In the following,
we’ll 1) coerce to a tibble
, 2) left_join()
the samples
object, 3)
add the accession as the sample column, 4) set the accession back as the
rownames, then 5) coerce back to the required DataFrame()
structure.
colData(se) <- colData(se) %>%
as_tibble(rownames = "accession") %>%
left_join(samples) %>%
mutate(sample = accession) %>%
as.data.frame() %>%
DataFrame(row.names = .$accession)
colData(se)
## DataFrame with 6 rows and 8 columns
## accession bam.files totals ext rlen
## <character> <character> <integer> <integer> <integer>
## SRR8315180 SRR8315180 /home/steviep/github.. 317845 200 75
## SRR8315181 SRR8315181 /home/steviep/github.. 337623 200 75
## SRR8315182 SRR8315182 /home/steviep/github.. 341998 200 75
## SRR8315183 SRR8315183 /home/steviep/github.. 315872 200 75
## SRR8315184 SRR8315184 /home/steviep/github.. 352908 200 75
## SRR8315185 SRR8315185 /home/steviep/github.. 347709 200 75
## target treatment sample
## <character> <factor> <character>
## SRR8315180 ER E2 SRR8315180
## SRR8315181 ER E2 SRR8315181
## SRR8315182 ER E2 SRR8315182
## SRR8315183 ER E2DHT SRR8315183
## SRR8315184 ER E2DHT SRR8315184
## SRR8315185 ER E2DHT SRR8315185
For QC and visualisation, we can add an additional logCPM
assay to our
object as well.
assay(se, "logCPM") <- cpm(assay(se, "counts"), lib.size = se$totals, log = TRUE)
First we might like to check our distribution of counts
plotAssayDensities(se, assay = "counts", colour = "treat", trans = "log1p") +
scale_colour_manual(values = treat_colours)