Over the past years, RNA-seq data for several species have accumulated in public repositories. Additionally, genome-wide association studies (GWAS) have
identified SNPs associated with phenotypes of interest, such as agronomic
traits in plants, production traits in livestock, and complex human diseases.
However, although GWAS can identify SNPs, they cannot identify causative genes
associated with the studied phenotype. The goal of cageminer
is to integrate
GWAS-derived SNPs with transcriptomic data to mine candidate genes and identify
high-confidence genes associated with traits of interest.
If you use cageminer
in your research, please cite us. You can obtain
citation information with citation('cageminer')
, as demonstrated below:
print(citation('cageminer'), bibtex = TRUE)
#> To cite cageminer in publications use:
#>
#> Almeida-Silva, F., & Venancio, T. M. (2022). cageminer: an
#> R/Bioconductor package to prioritize candidate genes by integrating
#> genome-wide association studies and gene coexpression networks. in
#> silico Plants, 4(2), diac018.
#> https://doi.org/10.1093/insilicoplants/diac018
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {cageminer: an R/Bioconductor package to prioritize candidate genes by integrating genome-wide association studies and gene coexpression networks},
#> author = {Fabricio Almeida-Silva and Thiago M. Venancio},
#> journal = {in silico Plants},
#> year = {2022},
#> volume = {4},
#> number = {2},
#> pages = {diac018},
#> url = {https://doi.org/10.1093/insilicoplants/diac018},
#> doi = {10.1093/insilicoplants/diac018},
#> }
if(!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install("cageminer")
# Load package after installation
library(cageminer)
set.seed(123) # for reproducibility
For this vignette, we will use transcriptomic data on pepper (Capsicum annuum) response to Phytophthora root rot (Kim et al. 2018), and GWAS SNPs associated with resistance to Phytophthora root rot from Siddique et al. (2019). To ensure interoperability with other Bioconductor packages, expression data are stored as SummarizedExperiment objects, and gene/SNP positions are stored as GRanges objects.
# GRanges of SNP positions
data(snp_pos)
snp_pos
#> GRanges object with 116 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> 2 Chr02 149068682 *
#> 3 Chr03 5274098 *
#> 4 Chr05 27703815 *
#> 5 Chr05 27761792 *
#> 6 Chr05 27807397 *
#> ... ... ... ...
#> 114 Chr12 230514706 *
#> 115 Chr12 230579178 *
#> 116 Chr12 230812962 *
#> 117 Chr12 230887290 *
#> 118 Chr12 231022812 *
#> -------
#> seqinfo: 8 sequences from an unspecified genome; no seqlengths
# GRanges of chromosome lengths
data(chr_length)
chr_length
#> GRanges object with 12 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] Chr01 1-272704604 *
#> [2] Chr02 1-171128871 *
#> [3] Chr03 1-257900543 *
#> [4] Chr04 1-222584275 *
#> [5] Chr05 1-233468049 *
#> ... ... ... ...
#> [8] Chr08 1-145103255 *
#> [9] Chr09 1-252779264 *
#> [10] Chr10 1-233593809 *
#> [11] Chr11 1-259726002 *
#> [12] Chr12 1-235688241 *
#> -------
#> seqinfo: 12 sequences from an unspecified genome; no seqlengths
# GRanges of gene coordinates
data(gene_ranges)
gene_ranges
#> GRanges object with 30242 ranges and 6 metadata columns:
#> seqnames ranges strand | source type score
#> <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
#> [1] Chr01 63209-63880 - | PGA1.55 gene NA
#> [2] Chr01 112298-112938 - | PGA1.55 gene NA
#> [3] Chr01 117979-118392 + | PGA1.55 gene NA
#> [4] Chr01 119464-119712 + | PGA1.55 gene NA
#> [5] Chr01 119892-120101 + | PGA1.55 gene NA
#> ... ... ... ... . ... ... ...
#> [30238] Chr12 235631138-235631467 - | PGA1.55 gene NA
#> [30239] Chr12 235642644-235645110 + | PGA1.55 gene NA
#> [30240] Chr12 235645483-235651927 - | PGA1.55 gene NA
#> [30241] Chr12 235652709-235655955 - | PGA1.55 gene NA
#> [30242] Chr12 235663655-235665276 - | PGA1.55 gene NA
#> phase ID Parent
#> <integer> <character> <CharacterList>
#> [1] <NA> CA01g00010
#> [2] <NA> CA01g00020
#> [3] <NA> CA01g00030
#> [4] <NA> CA01g00040
#> [5] <NA> CA01g00050
#> ... ... ... ...
#> [30238] <NA> CA12g22890
#> [30239] <NA> CA12g22900
#> [30240] <NA> CA12g22910
#> [30241] <NA> CA12g22920
#> [30242] <NA> CA12g22930
#> -------
#> seqinfo: 12 sequences from an unspecified genome; no seqlengths
# SummarizedExperiment of pepper response to Phytophthora root rot (RNA-seq)
data(pepper_se)
pepper_se
#> class: SummarizedExperiment
#> dim: 3892 45
#> metadata(0):
#> assays(1): ''
#> rownames(3892): CA02g23440 CA02g05510 ... CA03g35110 CA02g12750
#> rowData names(0):
#> colnames(45): PL1 PL2 ... TMV-P0-3D TMV-P0-Up
#> colData names(1): Condition
Before mining high-confidence candidates, you can visualize the SNP distribution
in the genome to explore possible patterns. First, let’s see if SNPs are
uniformly across chromosomes with plot_snp_distribution()
.
plot_snp_distribution(snp_pos)
As we can see, SNPs associated with resistance to Phytophthora root rot tend to
co-occur in chromosome 5. Now, we can see if they are close to each other in the
genome, and if they are located in gene-rich regions. We can visualize it with
plot_snp_circos
, which displays a circos plot of SNPs across chromosomes.
plot_snp_circos(chr_length, gene_ranges, snp_pos)
There seems to be no clustering in gene-rich regions, but we can see that SNPs in the same chromosome tend to be physically close to each other.
If you have SNP positions for multiple traits, you need to store them in GRangesList or CompressedGRangesList objects, so each element will have SNP positions for a particular trait. Then, you can visualize their distribution as you would do for a single trait. Let’s simulate multiple traits to see how it works:
# Simulate multiple traits by sampling 20 SNPs 4 times
snp_list <- GenomicRanges::GRangesList(
Trait1 = sample(snp_pos, 20),
Trait2 = sample(snp_pos, 20),
Trait3 = sample(snp_pos, 20),
Trait4 = sample(snp_pos, 20)
)
# Visualize SNP distribution across chromosomes
plot_snp_distribution(snp_list)