Abstract
epiSeeker is an R package for analyzing multi-omics epigenetic data. Data of fragment type, like ChIP-seq data and ATAC-seq data, are supported by epiSeeker. It supports annotating ChIP peaks and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions. Comparison of ChIP peak profiles and annotation are also supported. Moreover, it supports evaluating significant overlap among ChIP-seq datasets. Currently, it contains 17,000 bed file information from GEO database. These datasets can be downloaded and compare with user’s own data to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation. Data of base type, like oxBS‑seq data, are also supported. It provides function to visualize single-base resolution epigenetic data by considering the strand, motif, and additional information.
Citation
If you use ChIPseeker(Yu, Wang, and He 2015) in published research, please cite:
- Q Wang#, M Li#, T Wu, L Zhan, L Li, M Chen, W Xie, Z Xie, E Hu, S Xu, G Yu*. Exploring epigenomic datasets by ChIPseeker. Current Protocols, 2022, 2(10): e585.
- G Yu*, LG Wang, QY He*. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparision and visualization. Bioinformatics. 2015, 31(14):2382-2383.
Introduction
Biology relies heavily on epigenetics as a method of altering genetic information without altering nucleic acids. Based on the type of epigenomic data can be divided into two types: (1) fragment type, (2) base type.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is an canonical fragment type data. It has become standard technologies for genome wide identification of DNA-binding protein target sites. After read mappings and peak callings, the peak should be annotated to answer the biological questions. Annotation also create the possibility of integrating expression profile data to predict gene expression regulation. ChIPseeker(Yu, Wang, and He 2015) was developed for annotating nearest genes and genomic features to peaks. ChIP peak data set comparison is also very important. We can use it as an index to estimate how well biological replications are. Even more important is applying to infer cooperative regulation. If two ChIP seq data, obtained by two different binding proteins, overlap significantly, these two proteins may form a complex or have interaction in regulation chromosome remodelling or gene expression. ChIPseeker(Yu, Wang, and He 2015) support statistical testing of significant overlap among ChIP seq data sets, and incorporate open access database GEO for users to compare their own dataset to those deposited in database. Protein interaction hypothesis can be generated by mining data deposited in database. Converting genome coordinations from one genome version to another is also supported, making this comparison available for different genome version and different species. Several visualization functions are implemented to visualize the coverage of the ChIP seq data, peak annotation, average profile and heatmap of peaks binding to TSS region. Functional enrichment analysis of the peaks can be performed by my Bioconductor packages DOSE(Yu et al. 2015), ReactomePA(Yu and He 2016), clusterProfiler(Yu et al. 2012).
Epigenomic data of base type refer to epigenomic modification at a single-base resolution. Many methods are used to detect the epigenetic status with the combination of sequencing and chemical methods, such as Me‑DIP and oxBS‑seq for 5mC; TAB-seq and hme‑DIP for 5hmC; 6mA‑DIP and 6mA‑RE-seq for 6mA. Aside from these, third-generation sequencing methods, which refer to direct single-molecule sequencing, such as Oxford Nanopore Sequencing and PacBio SMRT Sequencing.
Data profiling
The datasets CBX6 and CBX7 in this vignettes were downloaded from GEO (GSE40740)(Pemberton et al. 2014) while ARmo_0M, ARmo_1nM and ARmo_100nM were downloaded from GEO (GSE48308)(Urbanucci et al. 2012) . ChIPseeker provides readPeakFile to load the peak and store in GRanges object.
## $ARmo_0M
## [1] "/tmp/RtmplndEOU/Rinsted7c6688e1e97/epiSeeker/extdata/GEO_sample_data/GSM1174480_ARmo_0M_peaks.bed.gz"
##
## $ARmo_1nM
## [1] "/tmp/RtmplndEOU/Rinsted7c6688e1e97/epiSeeker/extdata/GEO_sample_data/GSM1174481_ARmo_1nM_peaks.bed.gz"
##
## $ARmo_100nM
## [1] "/tmp/RtmplndEOU/Rinsted7c6688e1e97/epiSeeker/extdata/GEO_sample_data/GSM1174482_ARmo_100nM_peaks.bed.gz"
##
## $CBX6_BF
## [1] "/tmp/RtmplndEOU/Rinsted7c6688e1e97/epiSeeker/extdata/GEO_sample_data/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz"
##
## $CBX7_BF
## [1] "/tmp/RtmplndEOU/Rinsted7c6688e1e97/epiSeeker/extdata/GEO_sample_data/GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz"
## GRanges object with 1331 ranges and 2 metadata columns:
## seqnames ranges strand | V4 V5
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 815093-817883 * | MACS_peak_1 295.76
## [2] chr1 1243288-1244338 * | MACS_peak_2 63.19
## [3] chr1 2979977-2981228 * | MACS_peak_3 100.16
## [4] chr1 3566182-3567876 * | MACS_peak_4 558.89
## [5] chr1 3816546-3818111 * | MACS_peak_5 57.57
## ... ... ... ... . ... ...
## [1327] chrX 135244783-135245821 * | MACS_peak_1327 55.54
## [1328] chrX 139171964-139173506 * | MACS_peak_1328 270.19
## [1329] chrX 139583954-139586126 * | MACS_peak_1329 918.73
## [1330] chrX 139592002-139593238 * | MACS_peak_1330 210.88
## [1331] chrY 13845134-13845777 * | MACS_peak_1331 58.39
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Coverage plot
After peak calling, we would like to know the peak locations over the whole genome, plotCov function calculates the coverage of peak regions over chromosomes and generate a figure to visualize. GRangesList is also supported and can be used to compare coverage of multiple bed files. Interactive function is also supported by epiSeeker. When the user hovers the mouse over the corresponding element, it can display the position and value of the peak.
plotCov(peak, weightCol = "V5", chrs = c("chr17", "chr18"), xlim = c(4.5e7, 5e7), interactive = TRUE)Hierarchical clustering and co-accessibility relationships and will emerge in the process of analyzing the peak coverage of multiple samples. Here we provide function to plot the hierarchical clustering tree based on the input peak data. Co-accessibility is also supported to be visualized in the picture by showing the peak pairs with the highest positive correlation scores and the highest negative correlation scores.
# Details of the dataset and pre-processing we used here can be found from xxxx
# We only show usage here for demonstration
pancancer_atac <- readRDS("./data/pancancer_atac.rds")
fill_color <- c(
"#a10589", "#fce364", "#854d23", "#c0b742",
"#1c670d", "#3ec486", "#f7bbff", "#ee0700",
"#172ec5", "#f56103", "#3f3a95", "#7a7875",
"#e06cdc", "#7effff", "#f9b250", "#af1e18",
"#997f69", "#9d82ed", "#40bcf5", "#f28582",
"#97be8c", "#7b2ed0", "#81f75b"
)
design <- "AAAB
AAAB
AAAB
CCC#"
atac_p <- plotCov(pancancer_atac,
weightCol = "V5",
chrs = "chr8", title = "",
xlim = c(126712193, 128412193),
xlab = "", fill_col = fill_color,
legend_position = "none", facet_var = ".id~.",
add_cluster_tree = TRUE,
add_coaccess = TRUE,
design = design
)Profile of data in specific regions
First of all, for calculating the profile of data in specific region, we should align the data that are mapping to these regions, and generate the tagMatrix. Following parameters control what kinds of matrix can be derived.
- Region parameters: two parameters to control which region to plot: (1)
by: users can select one of gene, transcript, exon, intron, 3UTR, 5UTR, and UTR. This parameter control which type of region to be showed.
type: users can select one of start_site, end_site and body. This parameter controls the start site or end site or body region to be showed.
- Mode parameters: users can use two modes to get the tagMatrix: (1) normal mode: each region of interest will be visualized by original resolution, which means one figure in the picture represents one base pairs in the genome.
- binning mode: Users can input
nbinparameter to turn on this mode, each region of interest will be visualized in a binning resolution, which means one point in the figure may represent several base pairs(e.g. 100 bp). Using this mode can speed up the function. The most important is that if users select the body region to visualize, binning mode is necessary to align all the body region.
Extension parameters: we provides
upstreamanddownstreamparameters to extend the region of interest. If we want to visualize the gene region, start_site mode will return region of (TSS – upstream, TSS + downstream), end_site mode will return region of (TES – upstream, TES + downstream), and body mode will region region of (TSS – upstream, TES + downstream). If user select body mode, upstream and downstream parameters can also be a rel object(e.g. rel(0.2)), which means upstream and downstream parameter be length(region) * 0.2.Reference parameters: users can input
TxDbobject as reference. If users do not want to methods above to get the region of interest, users can provide a granges through windows parameters.
Since tagMatrix has been derived, epiSeeker provide two functions to visualize it: (1) plotPeakProfile() function: visualize tagMatrix in a line graph. (2)plotPeakHeatmap() function: visualize the tagMatrix in a heatmap. We also provide plot_prof parameter in plotPeakHeatmap() that can plot the line graph and heatmap in a figure.
There are two crucial mode in visualizing tagMatrix. (1) missingDataAsZero: this parameter determine how to treat the area that do not have signal. For example, users specific a region of 6,000 bp, but the peak only have a region of 200 bp, the rest region is the missing data. If missingDataAsZero is TRUE, then all the missing value will be set to 0. If this value is FALSE, then the missing value will be NULL, and will be filtered in the following visualization. (2) statistic_method : tagMatrix is a matrix that contains many regions of the same type, for example, all the TSS region. statistic_method parameter determine how to combine these regions in the figure. We use mean method as default, which means we will calculate the mean signal of each point.
Profile of one sample of one reference region
Here we show the usage of one sample. by = "transcript" means we visualize data in transcript region. type = "start_site" means plot the start site region, here is the transcript start site(TSS). upstream = 3000, downstream = 3000 means the extension of start site, here is the region of (TSS - 3000 bp, TSS + 3000 bp).
plotPeakHeatmap() can be used to visualize the tagMatrix, with plot_prof can plot the heatmap and line graph in a single figure. statistic_method = "mean" means that this function aggregate multiple region by mean methods. missingDataAsZero = TRUE means that we treat missing data as zero that will not be filtered in the following analysis. conf = 0.95, resample = 1000 means in the line plot using 95% confidence interval and resample 1000 times.
tssTagMatrix <- getTagMatrix(
peak = peak, upstream = 3000, downstream = 3000, weightCol = "V5",
type = "start_site", by = "transcript", TxDb = txdb
)
# plot_prof = FALSE will plot only heatmap
plotPeakHeatmap(tssTagMatrix,
plot_prof = TRUE,
statistic_method = "mean",
missingDataAsZero = TRUE,
conf = 0.95, resample = 1000
)If users want to plot line graph only, can use plotPeakProf() to plot.
Visualization of base modification
The datasets of base modification in this vignettes were downloaded from GEO (GSE223153)(Nipper et al. 2024). epiSeeker provide getBmMatrix() and plotBmProf() function to visualize the base modification. getBmMatrix() accept folloing necessary parameters: (1) region: region of interset in the format of data frame. (2) BSgenome: sequence reference object. (3) input: BSseq object or bmData object. bmData object can be constructed by makeBmDataFromData() provided by epiSeeker. (4) base : type of base that has modification. (5) motif: motif to be analyzed.
data(demo_bmdata)
bmMatrix <- getBmMatrix(
region = data.frame(chr = "chr22", start = 10525991, end = 10526342),
BSgenome = BSgenome.Hsapiens.UCSC.hg38,
input = demo_bmdata,
base = "C",
motif = c("CG")
)
# Interactive function is also supported by interactive parameter
plotBmProf(bmMatrix, interactive = TRUE)Visualization of gene track
Gene information can be added to visualization result produced by epiSeeker in the form of a track. plotGeneTrack() accepts several parameters to plot gene track: (1) txdb: Gene annotation object, txdb object. (2) chr,start_pos,end_pos : region inforamtion to specify which region to plot. (3) OrgDb: annotation object to change gene id.
# this region is the same as the region of pancancer figure
plotGeneTrack(
txdb = txdb, chr = "chr8", start_pos = 126712193,
end_pos = 128412193, OrgDb = "org.Hs.eg.db"
)Visualization of motif
epiSeeker provide functions to visualize the motif in a specific region. Users should input a Granges object to region to specify the region. Sequence reference and motif reference are also necessary.
# motif reference of position-specific weight matrix
# opts_base <- list()
# opts_base[["collection"]] <- "CORE"
# opts_base[["all_versions"]] <- FALSE
# opts_human <- opts_base
# opts_human[["species"]] <- "Homo sapiens"
# opts_human[["tax_group"]] <- "vertebrates"
# sq24 <- RSQLite::dbConnect(RSQLite::SQLite(), JASPAR2024::db(JASPAR2024::JASPAR2024()))
# pwm_obj <- TFBSTools::getMatrixSet(sq24, opts_human)
data(pwm_obj)
motifMatrix <- getMotifMatrix(
region = GRanges(
seqnames = "chr22",
ranges = IRanges(start = 10525991, end = 10526342)
),
pwm = pwm_obj, ref_obj = BSgenome.Hsapiens.UCSC.hg38
)
# Interactive function is also supported by interactive parameter
plotMotifProf(motifMatrix, interactive = TRUE)Peak Annotation
peakAnno <- annotateSeq(files[[4]],
tssRegion = c(-3000, 3000),
TxDb = txdb, annoDb = "org.Hs.eg.db"
)Note that it would also be possible to use Ensembl-based EnsDb annotation databases created by the ensembldb package for the peak annotations by providing it with the TxDb parameter. Since UCSC-style chromosome names are used we have to change the style of the chromosome names from Ensembl to UCSC in the example below.
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75
seqlevelsStyle(edb) <- "UCSC"
peakAnno.edb <- annotateSeq(files[[4]],
tssRegion = c(-3000, 3000),
TxDb = edb, annoDb = "org.Hs.eg.db"
)Peak Annotation is performed by annotateSeq. User can define TSS (transcription start site) region, by default TSS is defined from -3kb to +3kb. The output of annotateSeq is csAnno instance. ChIPseeker provides as.GRanges to convert csAnno to GRanges instance, and as.data.frame to convert csAnno to data.frame which can be exported to file by write.table.
TxDb object contained transcript-related features of a particular genome. Bioconductor provides several package that containing TxDb object of model organisms with multiple commonly used genome version, for instance TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Hsapiens.UCSC.hg19.knownGene for human genome hg38 and hg19, TxDb.Mmusculus.UCSC.mm10.knownGene and TxDb.Mmusculus.UCSC.mm9.knownGene for mouse genome mm10 and mm9, etc. User can also prepare their own TxDb object by retrieving information from UCSC Genome Bioinformatics and BioMart data resources by R function makeTxDbFromBiomart and makeTxDbFromUCSC. TxDb object should be passed for peak annotation.
All the peak information contained in peakfile will be retained in the output of annotateSeq. The position and strand information of nearest genes are reported. The distance from peak to the TSS of its nearest gene is also reported. The genomic region of the peak is reported in annotation column. Since some annotation may overlap, ChIPseeker adopted the following priority in genomic annotation.
- Promoter
- 5’ UTR
- 3’ UTR
- Exon
- Intron
- Downstream
- Intergenic
Downstream is defined as the downstream of gene end.
ChIPseeker also provides parameter genomicAnnotationPriority for user to prioritize this hierachy.
annotateSeq report detail information when the annotation is Exon or Intron, for instance “Exon (uc002sbe.3/9736, exon 69 of 80)”, means that the peak is overlap with an Exon of transcript uc002sbe.3, and the corresponding Entrez gene ID is 9736 (Transcripts that belong to the same gene ID may differ in splice events), and this overlaped exon is the 69th exon of the 80 exons that this transcript uc002sbe.3 prossess.
Parameter annoDb is optional, if provided, extra columns including SYMBOL, GENENAME, ENSEMBL/ENTREZID will be added. The geneId column in annotation output will be consistent with the geneID in TxDb. If it is ENTREZID, ENSEMBL will be added if annoDb is provided, while if it is ENSEMBL ID, ENTREZID will be added.
Visualize Genomic Annotation
To annotate the location of a given peak in terms of genomic features, annotateSeq assigns peaks to genomic annotation in “annotation” column of the output, which includes whether a peak is in the TSS, Exon, 5’ UTR, 3’ UTR, Intronic or Intergenic. Many researchers are very interesting in these annotations. TSS region can be defined by user and annotateSeq output in details of which exon/intron of which genes as illustrated in previous section. Pie and Bar plot are supported to visualize the genomic annotation. Since some annotation overlap, user may interested to view the full annotation with their overlap, which can be partially resolved by vennpie function.
Genomic Annotation by barplot
Visualize distribution of TF-binding loci relative to TSS
The distance from the peak (binding site) to the TSS of the nearest gene is calculated by annotateSeq and reported in the output. We provide plotDistToTSS to calculate the percentage of binding sites upstream and downstream from the TSS of the nearest genes, and visualize the distribution.
plotDistToTSS(peakAnno,
title = "Distribution of transcription factor-binding loci\nrelative to TSS"
)Distribution of Binding Sites
Functional enrichment analysis
Once we have obtained the annotated nearest genes, we can perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating biological knowledge provided by biological ontologies. For instance, Gene Ontology (GO)(Ashburner et al. 2000) annotates genes to biological processes, molecular functions, and cellular components in a directed acyclic graph structure, Kyoto Encyclopedia of Genes and Genomes (KEGG)(Kanehisa et al. 2004) annotates genes to pathways, Disease Ontology (DO)(Schriml et al. 2011) annotates genes with human disease association, and Reactome(Croft et al. 2013) annotates gene to pathways and reactions.
ChIPseeker also provides a function, seq2gene, for linking genomc regions to genes in a many-to-many mapping. It consider host gene (exon/intron), promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function is designed to link both coding and non-coding genomic regions to coding genes and facilitate functional analysis.
Enrichment analysis is a widely used approach to identify biological themes. I have developed several Bioconductor packages for investigating whether the number of selected genes associated with a particular biological term is larger than expected, including DOSE(Yu et al. 2015) for Disease Ontology, ReactomePA for reactome pathway, clusterProfiler(Yu et al. 2012) for Gene Ontology and KEGG enrichment analysis.
library(ReactomePA)
pathway1 <- enrichPathway(as.data.frame(peakAnno)$geneId)
head(pathway1, 2)
## ID Description GeneRatio BgRatio RichFactor
## R-HSA-9830369 R-HSA-9830369 Kidney development 19/506 46/11214 0.4130435
## R-HSA-9758941 R-HSA-9758941 Gastrulation 27/506 115/11214 0.2347826
## FoldEnrichment zScore pvalue p.adjust qvalue
## R-HSA-9830369 9.153892 12.045870 2.601506e-14 2.796619e-11 2.760335e-11
## R-HSA-9758941 5.203265 9.848629 8.375539e-13 4.501852e-10 4.443444e-10
## geneID
## R-HSA-9830369 2625/5076/7490/652/6495/2303/3975/6928/10736/5455/7849/3237/6092/2122/255743/2296/3400/28514/2138
## R-HSA-9758941 5453/5692/5076/5080/7546/3169/652/5015/2303/5717/3975/2627/5714/344022/7849/5077/2637/7022/8320/7545/6657/4487/51176/2296/28514/2626/64321
## Count
## R-HSA-9830369 19
## R-HSA-9758941 27
gene <- seq2gene(peak, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb = txdb)
pathway2 <- enrichPathway(gene)
head(pathway2, 2)
## ID Description GeneRatio BgRatio RichFactor
## R-HSA-9830369 R-HSA-9830369 Kidney development 17/415 46/11214 0.3695652
## R-HSA-9758941 R-HSA-9758941 Gastrulation 24/415 115/11214 0.2086957
## FoldEnrichment zScore pvalue p.adjust qvalue
## R-HSA-9830369 9.986276 11.971929 2.162576e-13 2.175552e-10 2.114772e-10
## R-HSA-9758941 5.639309 9.802874 3.528656e-12 1.774914e-09 1.725327e-09
## geneID
## R-HSA-9830369 2625/5076/3227/652/6495/2303/3975/3237/6092/2122/255743/7490/6928/7849/5455/2296/28514
## R-HSA-9758941 5453/5692/5076/7546/3169/652/2303/5717/3975/2627/5714/344022/2637/8320/7545/7020/2626/5080/5015/7849/51176/2296/64321/28514
## Count
## R-HSA-9830369 17
## R-HSA-9758941 24
dotplot(pathway2)More information can be found in the vignettes of Bioconductor packages DOSE(Yu et al. 2015), ReactomePA, clusterProfiler(Yu et al. 2012), which also provide several methods to visualize enrichment results. The clusterProfiler(Yu et al. 2012) is designed for comparing and visualizing functional profiles among gene clusters, and can directly applied to compare biological themes at GO, DO, KEGG, Reactome perspective.
ChIP peak data set comparison
ChIP peak annotation comparision
The plotAnnoBar and plotDistToTSS can also accept input of a named list of annotated peaks (output of annotateSeq). We can use plotAnnoBar to comparing their genomic annotation.
Genomic Annotation among different ChIPseq data
Functional profiles comparison
As shown in section 4, the annotated genes can analyzed by r Biocpkg("clusterProfiler")(Yu et al. 2012), r Biocpkg("DOSE")(Yu et al. 2015), meshes and r Biocpkg("ReactomePA") for Gene Ontology, KEGG, Disease Ontology, MeSH and Reactome Pathway enrichment analysis.
The clusterProfiler(Yu et al. 2012) package provides compareCluster function for comparing biological themes among gene clusters, and can be easily adopted to compare different ChIP peak experiments.
genes <- lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
names(genes) <- sub("_", "\n", names(genes))
compKEGG <- compareCluster(
geneCluster = genes,
fun = "enrichKEGG",
pvalueCutoff = 0.05,
pAdjustMethod = "BH"
)
dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")Overlap of peaks and annotated genes
User may want to compare the overlap peaks of replicate experiments or from different experiments. ChIPseeker provides peak2GRanges that can read peak file and stored in GRanges object. Several files can be read simultaneously using lapply, and then passed to vennplot to calculate their overlap and draw venn plot.
vennplot accept a list of object, can be a list of GRanges or a list of vector. Here, I will demonstrate using vennplot to visualize the overlap of the nearest genes stored in peakAnnoList.
Overlap of annotated genes
Statistical testing of ChIP seq overlap
Overlap is very important, if two ChIP experiment by two different proteins overlap in a large fraction of their peaks, they may cooperative in regulation. Calculating the overlap is only touch the surface. ChIPseeker implemented statistical methods to measure the significance of the overlap.
Shuffle genome coordination
p <- GRanges(
seqnames = c("chr1", "chr3"),
ranges = IRanges(start = c(1, 100), end = c(50, 130))
)
shuffle(p, TxDb = txdb)## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 969139-969188 *
## [2] chr3 164930876-164930906 *
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
We implement the shuffle function to randomly permute the genomic locations of ChIP peaks defined in a genome which stored in TxDb object.
Peak overlap enrichment analysis
With the ease of this shuffle method, we can generate thousands of random ChIP data and calculate the background null distribution of the overlap among ChIP data sets.
enrichPeakOverlap(
queryPeak = files[[5]],
targetPeak = unlist(files[1:4]),
TxDb = txdb,
pAdjustMethod = "BH",
nShuffle = 50,
chainFile = NULL,
verbose = FALSE
)
## qSample
## ARmo_0M GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## ARmo_1nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## CBX6_BF GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
## tSample qLen tLen N_OL
## ARmo_0M GSM1174480_ARmo_0M_peaks.bed.gz 1663 812 0
## ARmo_1nM GSM1174481_ARmo_1nM_peaks.bed.gz 1663 2296 8
## ARmo_100nM GSM1174482_ARmo_100nM_peaks.bed.gz 1663 1359 3
## CBX6_BF GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663 1331 968
## pvalue p.adjust
## ARmo_0M 0.82352941 0.82352941
## ARmo_1nM 0.17647059 0.35294118
## ARmo_100nM 0.33333333 0.44444444
## CBX6_BF 0.01960784 0.07843137Parameter queryPeak is the query ChIP data, while targetPeak is bed file name or a vector of bed file names from comparison; nShuffle is the number to shuffle the peaks in targetPeak. To speed up the compilation of this vignettes, we only set nShuffle to 50 as an example for only demonstration. User should set the number to 1000 or above for more robust result. Parameter chainFile are chain file name for mapping the targetPeak to the genome version consistent with queryPeak when their genome version are different. This creat the possibility of comparison among different genome version and cross species.
In the output, qSample is the name of queryPeak and qLen is the the number of peaks in queryPeak. N_OL is the number of overlap between queryPeak and targetPeak.
Data Mining with ChIP seq data deposited in GEO
There are many ChIP seq data sets that have been published and deposited in GEO database. We can compare our own dataset to those deposited in GEO to search for significant overlap data. Significant overlap of ChIP seq data by different binding proteins may be used to infer cooperative regulation and thus can be used to generate hypotheses.
We collect about 17,000 bed files deposited in GEO, user can use getGEOspecies to get a summary based on speices.
GEO data collection
## species Freq
## 1 Actinobacillus pleuropneumoniae 1
## 2 Actinobacillus pleuropneumoniae serovar 3 str. JL03 1
## 3 Aedes aegypti 11
## 4 Anabaena 6
## 5 Anolis carolinensis 7
## 6 Anopheles gambiae 2
The summary can also based on genome version as illustrated below:
## organism genomeVersion Freq
## 1 Anolis carolinensis anoCar2 7
## 2 Bos taurus bosTau4 2
## 3 Bos taurus bosTau6 33
## 4 Bos taurus bosTau7 2
## 5 Canis lupus familiaris canFam3 10
## 6 Caenorhabditis elegans ce10 328
User can access the detail information by getGEOInfo, for each genome version.
## series_id gsm organism
## 111 GSE16256 GSM521889 Homo sapiens
## 112 GSE16256 GSM521887 Homo sapiens
## 113 GSE16256 GSM521883 Homo sapiens
## 114 GSE16256 GSM1010966 Homo sapiens
## 115 GSE16256 GSM896166 Homo sapiens
## 116 GSE16256 GSM910577 Homo sapiens
## title
## 111 Reference Epigenome: ChIP-Seq Analysis of H3K27me3 in IMR90 Cells; renlab.H3K27me3.IMR90-02.01
## 112 Reference Epigenome: ChIP-Seq Analysis of H3K27ac in IMR90 Cells; renlab.H3K27ac.IMR90-03.01
## 113 Reference Epigenome: ChIP-Seq Analysis of H3K14ac in IMR90 Cells; renlab.H3K14ac.IMR90-02.01
## 114 polyA RNA sequencing of STL003 Pancreas Cultured Cells; polyA-RNA-seq_STL003PA_r1a
## 115 Reference Epigenome: ChIP-Seq Analysis of H4K8ac in hESC H1 Cells; renlab.H4K8ac.hESC.H1.01.01
## 116 Reference Epigenome: ChIP-Seq Analysis of H3K4me1 in Human Spleen Tissue; renlab.H3K4me1.STL003SX.01.01
## supplementary_file
## 111 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521889/suppl/GSM521889_UCSD.IMR90.H3K27me3.SK05.bed.gz
## 112 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521887/suppl/GSM521887_UCSD.IMR90.H3K27ac.YL58.bed.gz
## 113 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM521nnn/GSM521883/suppl/GSM521883_UCSD.IMR90.H3K14ac.SK17.bed.gz
## 114 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1010nnn/GSM1010966/suppl/GSM1010966_UCSD.Pancreas.mRNA-Seq.STL003.bed.gz
## 115 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM896nnn/GSM896166/suppl/GSM896166_UCSD.H1.H4K8ac.AY17.bed.gz
## 116 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM910nnn/GSM910577/suppl/GSM910577_UCSD.Spleen.H3K4me1.STL003.bed.gz
## genomeVersion pubmed_id
## 111 hg19 19829295
## 112 hg19 19829295
## 113 hg19 19829295
## 114 hg19 19829295
## 115 hg19 19829295
## 116 hg19 19829295
If simplify is set to FALSE, extra information including source_name, extract_protocol, description, data_processing and submission_date will be incorporated.
Download GEO ChIP data sets
ChIPseeker provide function downloadGEObedFiles to download all the bed files of a particular genome.
Or a vector of GSM accession number by downloadGSMbedFiles.
Overlap significant testing
After download the bed files from GEO, we can pass them to enrichPeakOverlap for testing the significant of overlap. Parameter targetPeak can be the folder, e.g. hg19, that containing bed files. enrichPeakOverlap will parse the folder and compare all the bed files. It is possible to test the overlap with bed files that are mapping to different genome or different genome versions, enrichPeakOverlap provide a parameter chainFile that can pass a chain file and liftOver the targetPeak to the genome version consistent with queryPeak. Signifcant overlap can be use to generate hypothesis of cooperative regulation. By mining the data deposited in GEO, we can identify some putative complex or interacted regulators in gene expression regulation or chromsome remodelling for further validation.
Need helps?
If you have questions/issues, please visit ChIPseeker homepage first. Your problems are mostly documented. If you think you found a bug, please follow the guide and provide a reproducible example to be posted on github issue tracker. For questions, please post to Bioconductor support site and tag your post with ChIPseeker.
For Chinese user, you can follow me on WeChat (微信).
Session Information
Here is the output of sessionInfo() on the system on which this document was compiled:
## R Under development (unstable) (2025-12-22 r89219)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] epiSeeker_0.99.14
## [2] ReactomePA_1.55.1
## [3] clusterProfiler_4.19.3
## [4] ggplot2_4.0.1
## [5] org.Hs.eg.db_3.22.0
## [6] BSgenome.Hsapiens.UCSC.hg38_1.4.5
## [7] BSgenome_1.79.1
## [8] rtracklayer_1.71.3
## [9] BiocIO_1.21.0
## [10] Biostrings_2.79.4
## [11] XVector_0.51.0
## [12] GenomeInfoDb_1.47.2
## [13] TxDb.Hsapiens.UCSC.hg38.knownGene_3.22.0
## [14] GenomicFeatures_1.63.1
## [15] AnnotationDbi_1.73.0
## [16] Biobase_2.71.0
## [17] GenomicRanges_1.63.1
## [18] Seqinfo_1.1.0
## [19] IRanges_2.45.0
## [20] S4Vectors_0.49.0
## [21] BiocGenerics_0.57.0
## [22] generics_0.1.4
## [23] yulab.utils_0.2.3
##
## loaded via a namespace (and not attached):
## [1] splines_4.6.0 bitops_1.0-9
## [3] ggplotify_0.1.3 R.oo_1.27.1
## [5] tibble_3.3.1 polyclip_1.10-7
## [7] enrichit_0.0.9 graph_1.89.1
## [9] DirichletMultinomial_1.53.0 XML_3.99-0.20
## [11] gggenes_0.6.0 lifecycle_1.0.5
## [13] pwalign_1.7.0 lattice_0.22-7
## [15] MASS_7.3-65 magrittr_2.0.4
## [17] limma_3.67.0 sass_0.4.10
## [19] rmarkdown_2.30 plotrix_3.8-13
## [21] jquerylib_0.1.4 yaml_2.3.12
## [23] otel_0.2.0 ggtangle_0.1.0
## [25] DBI_1.2.3 RColorBrewer_1.1-3
## [27] abind_1.4-8 R.utils_2.13.0
## [29] purrr_1.2.1 ggraph_2.2.2
## [31] RCurl_1.98-1.17 tweenr_2.0.3
## [33] rappdirs_0.3.3 gdtools_0.4.4
## [35] enrichplot_1.31.3 ggrepel_0.9.6
## [37] seqLogo_1.77.0 tidytree_0.4.7
## [39] reactome.db_1.94.0 permute_0.9-8
## [41] DelayedMatrixStats_1.33.0 codetools_0.2-20
## [43] DelayedArray_0.37.0 DOSE_4.5.1
## [45] ggforce_0.5.0 prettydoc_0.4.1
## [47] tidyselect_1.2.1 aplot_0.2.9
## [49] UCSC.utils_1.7.1 farver_2.1.2
## [51] viridis_0.6.5 universalmotif_1.29.0
## [53] matrixStats_1.5.0 GenomicAlignments_1.47.0
## [55] jsonlite_2.0.0 tidygraph_1.3.1
## [57] systemfonts_1.3.1 tools_4.6.0
## [59] ggnewscale_0.5.2 treeio_1.35.0
## [61] TFMPvalue_0.0.9 ggVennDiagram_1.5.7
## [63] Rcpp_1.1.1 glue_1.8.0
## [65] gridExtra_2.3 SparseArray_1.11.10
## [67] xfun_0.55 qvalue_2.43.0
## [69] MatrixGenerics_1.23.0 dplyr_1.1.4
## [71] HDF5Array_1.39.0 withr_3.0.2
## [73] fastmap_1.2.0 boot_1.3-32
## [75] rhdf5filters_1.23.3 caTools_1.18.3
## [77] digest_0.6.39 R6_2.6.1
## [79] gridGraphics_0.5-1 GO.db_3.22.0
## [81] gtools_3.9.5 dichromat_2.0-0.1
## [83] RSQLite_2.4.5 R.methodsS3_1.8.2
## [85] cigarillo_1.1.0 h5mread_1.3.1
## [87] tidyr_1.3.2 data.table_1.18.0
## [89] fontLiberation_0.1.0 graphlayouts_1.2.2
## [91] httr_1.4.7 htmlwidgets_1.6.4
## [93] S4Arrays_1.11.1 scatterpie_0.2.6
## [95] TFBSTools_1.49.0 graphite_1.57.0
## [97] pkgconfig_2.0.3 gtable_0.3.6
## [99] blob_1.2.4 S7_0.2.1
## [101] htmltools_0.5.9 fontBitstreamVera_0.1.1
## [103] scales_1.4.0 png_0.1-8
## [105] ggfun_0.2.0 knitr_1.51
## [107] reshape2_1.4.5 rjson_0.2.23
## [109] nlme_3.1-168 curl_7.0.0
## [111] cachem_1.1.0 rhdf5_2.55.12
## [113] stringr_1.6.0 parallel_4.6.0
## [115] restfulr_0.0.16 pillar_1.11.1
## [117] grid_4.6.0 vctrs_0.6.5
## [119] ggfittext_0.10.3 tidydr_0.0.6
## [121] beachmat_2.27.2 cluster_2.1.8.1
## [123] evaluate_1.0.5 bsseq_1.47.1
## [125] cli_3.6.5 locfit_1.5-9.12
## [127] compiler_4.6.0 Rsamtools_2.27.0
## [129] rlang_1.1.7 crayon_1.5.3
## [131] labeling_0.4.3 plyr_1.8.9
## [133] fs_1.6.6 ggiraph_0.9.2
## [135] stringi_1.8.7 viridisLite_0.4.2
## [137] BiocParallel_1.45.0 lazyeval_0.2.2
## [139] GOSemSim_2.37.1 fontquiver_0.2.1
## [141] Matrix_1.7-4 patchwork_1.3.2
## [143] sparseMatrixStats_1.23.0 bit64_4.6.0-1
## [145] Rhdf5lib_1.33.0 KEGGREST_1.51.1
## [147] statmod_1.5.1 SummarizedExperiment_1.41.0
## [149] igraph_2.2.1 memoise_2.0.1
## [151] bslib_0.9.0 ggtree_4.1.1
## [153] bit_4.6.0 ape_5.8-1
## [155] gson_0.1.0
References
Ashburner, Michael, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, et al. 2000. “Gene Ontology: Tool for the Unification of Biology.” Nat Genet 25 (1): 25–29. https://doi.org/10.1038/75556.
Croft, D., A. F. Mundo, R. Haw, M. Milacic, J. Weiser, G. Wu, M. Caudy, et al. 2013. “The Reactome Pathway Knowledgebase.” Nucleic Acids Research 42 (D1): D472–D477. https://doi.org/10.1093/nar/gkt1102.
Kanehisa, Minoru, Susumu Goto, Shuichi Kawashima, Yasushi Okuno, and Masahiro Hattori. 2004. “The KEGG Resource for Deciphering the Genome.” Nucleic Acids Research 32 (suppl 1): D277–D280. https://doi.org/10.1093/nar/gkh063.
Nipper, Michael, Yi Xu, Jun Liu, Xue Yin, Zhijie Liu, Zhengqing Ye, Jianmin Zhang, Yidong Chen, and Pei Wang. 2024. “TGF\(\beta\) and Hippo Signaling Pathways Coordinate to Promote Acinar to Ductal Metaplasia in Human Pancreas.” Cells 13 (2): 186.
Pemberton, Helen, Emma Anderton, Harshil Patel, Sharon Brookes, Hollie Chandler, Richard Palermo, Julie Stock, et al. 2014. “Genome-Wide Co-Localization of Polycomb Orthologs and Their Effects on Gene Expression in Human Fibroblasts.” Genome Biology 15 (2): R23. https://doi.org/10.1186/gb-2014-15-2-r23.
Schriml, L. M., C. Arze, S. Nadendla, Y.-W. W. Chang, M. Mazaitis, V. Felix, G. Feng, and W. A. Kibbe. 2011. “Disease Ontology: A Backbone for Disease Semantic Integration.” Nucleic Acids Research 40 (D1): D940–D946. https://doi.org/10.1093/nar/gkr972.
Urbanucci, A, B Sahu, J Seppälä, A Larjo, L M Latonen, K K Waltering, T L J Tammela, et al. 2012. “Overexpression of Androgen Receptor Enhances the Binding of the Receptor to the Chromatin in Prostate Cancer.” Oncogene 31 (17): 2153–63. https://doi.org/10.1038/onc.2011.401.
Yu, Guangchuang, and Qing-Yu He. 2016. “ReactomePA: An R/Bioconductor Package for Reactome Pathway Analysis and Visualization.” Molecular BioSystems 12 (2): 477–79. https://doi.org/10.1039/C5MB00663E.
Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: an R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.
Yu, Guangchuang, Li-Gen Wang, and Qing-Yu He. 2015. “ChIPseeker: An R/Bioconductor Package for Chip Peak Annotation, Comparison and Visualization.” Bioinformatics 31 (14): 2382–3. https://doi.org/10.1093/bioinformatics/btv145.
Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. “DOSE: An R/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4): 608–9. https://doi.org/10.1093/bioinformatics/btu684.