SingleMoleculeFootprinting is an R package for Single Molecule Footprinting (SMF) data.
Starting from an aligned bam file, we show how to
For installation, the user can use the following:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SingleMoleculeFootprinting")
For compatibility with our analysis tools, we recommend the user to perform genomic alignments using the qAlign
function from QuasR as exemplified in the chunk below.
For detailed pre-processing instructions we refer to steps 179 to 186 of Kleinendorst & Barzaghi et al., 2021
clObj <- makeCluster(40)
prj <- QuasR::qAlign(
sampleFile = sampleFile,
genome = "BSgenome.Mmusculus.UCSC.mm10",
aligner = "Rbowtie",
projectName = "prj",
paired = "fr",
bisulfite = "undir",
alignmentParameter = "-e 70 -X 1000 -k 2 --best -strata",
alignmentsDir = "./",
cacheDir = tempdir(),
clObj = clObj
)
stopCluster(clObj)
SingleMoleculeFootprinting inherits QuasR’s sampleFile
style of feeding .bam files. For instructions, refer to the qAlign
documentation. Here we show how our sampleFile looks like.
N.b.: This vignette and some functions of SingleMoleculeFootprinting depend on data available through the data package SingleMoleculeFootprintingData.
For user-friendliness, this data is fetched during the installation of SingleMoleculeFootprinting and stored in tempdir()
.
Please make sure that tempdir()
has enough storage capacity (~1 Gb). You can check this by running ExperimentHub::getExperimentHubOption(arg = "CACHE")
and if needed change the location by running ExperimentHub::setExperimentHubOption(arg = "CACHE", value = "/your/favourite/location")
.
sampleFile = paste0(CacheDir, "/NRF1Pair_sampleFile.txt")
knitr::kable(suppressMessages(readr::read_delim(sampleFile, delim = "\t")))
FileName | SampleName |
---|---|
/home/biocbuild/.cache/R/ExperimentHub/NRF1pair.bam | NRF1pair_DE_ |
For generic sequencing QCs, we refer to QuasR’s qQCreport
.
If a bait capture step was included to enrich for regulatory regions, it is useful to check how efficiently that worked.
Here we calculate the ratio of molecules overlapping the enrichment targets over the total. A Bait capture efficiency below 70% might be problematic.
In that case we refer to the troubleshooting section of our Kleinendorst & Barzaghi et al., 2021.
The bisulfite conversion step, chemically converts un-methylated cytosines to thymines. This process has a margin of error.
Here we ask what is the percentage of converted cytosines among those which shouldn’t be methylated, i.e. those outside of CpG or GpC contexts. Normally, we expect a conversion rate of ~95%.
N.b.: this function runs much slower than the one above, which is why we prefer to check this metric for chr19 only.
It is useful to compare the distribution of cytosine methylation rates across replicates.
RegionOfInterest = GRanges(BSgenome.Mmusculus.UCSC.mm10@seqinfo["chr19"])
CallContextMethylation(
sampleFile = sampleFile,
samples = samples,
genome = BSgenome.Mmusculus.UCSC.mm10,
RegionOfInterest = RegionOfInterest,
coverage = 20,
ConvRate.thr = NULL,
returnSM = FALSE,
clObj = NULL,
verbose = FALSE
) -> Methylation
Methylation %>%
elementMetadata() %>%
as.data.frame() %>%
dplyr::select(grep("_MethRate", colnames(.))) -> MethylationRate_matrix
png("../inst/extdata/MethRateCorr_AllCs.png", units = "cm", res = 100, width = 20, height = 20)
pairs(
MethylationRate_matrix,
upper.panel = panel.cor,
diag.panel = panel.hist,
lower.panel = panel.jet,
labels = gsub("SMF_MM_|DE_|_MethRate", "", colnames(MethylationRate_matrix))
)
dev.off()
knitr::include_graphics(system.file("extdata", "MethRateCorr_AllCs.png", package="SingleMoleculeFootprinting"))
It is also useful, especially in the case of single enzyme treatments, to split the genomics contexts of cytosines based on the MTase used.
Methylation %>%
elementMetadata() %>%
as.data.frame() %>%
filter(GenomicContext %in% c("DGCHN", "GCH")) %>%
dplyr::select(grep("_MethRate", colnames(.))) -> MethylationRate_matrix_GCH
png("../inst/extdata/MethRateCorr_GCHs.png", units = "cm", res = 100, width = 20, height = 20)
pairs(MethylationRate_matrix_GCH, upper.panel = panel.cor, diag.panel = panel.hist, lower.panel = panel.jet, labels = colnames(MethylationRate_matrix_GCH))
dev.off()
Methylation %>%
elementMetadata() %>%
as.data.frame() %>%
filter(GenomicContext %in% c("NWCGW", "HCG")) %>%
dplyr::select(grep("_MethRate", colnames(.))) -> MethylationRate_matrix_HCG
png("../inst/extdata/MethRateCorr_HCGs.png", units = "cm", res = 100, width = 20, height = 20)
pairs(MethylationRate_matrix_HCG, upper.panel = panel.cor, diag.panel = panel.hist, lower.panel = panel.jet, labels = colnames(MethylationRate_matrix_HCG))
dev.off()
Before investing in deep sequencing, it is advisable to shallowly sequence libraries to assess the footprinting quality of the libraries.
However the per-cytosine coverage obtained at shallow sequencing is insufficient to estimate methylation rates for individual cytosines.
A solution is to pile up molecules covering cytosines from genomic loci that are known to behave similarly and compute a “composite” methylation rate.
Such composite methylation rate allows to assess the quality of footprinting at low coverage.
The following chunk exemplifies how to proceed.
First we want to call methylation for the new low depth samples, paying attention to setting the parameter coverage=1
.
Then we want to call methylation for a reference high coverage sample.
Finally, we can use the wrapper function CompositeMethylationCorrelation
to extract composite methylation rates.
N.b.: the methylation calling step is quite computationally demanding for full genomes, so we ran this in the background and reported the results only.
RegionOfInterest = GRanges(BSgenome.Mmusculus.UCSC.mm10@seqinfo["chr19"])
CallContextMethylation(
sampleFile = sampleFile_LowCoverage,
samples = samples_LowCoverage,
genome = BSgenome.Mmusculus.UCSC.mm10,
RegionOfInterest = RegionOfInterest,
coverage = 1,
ConvRate.thr = NULL,
returnSM = FALSE,
clObj = NULL,
verbose = FALSE
)$DGCHN -> LowCoverageMethylation
CallContextMethylation(
sampleFile = sampleFile_HighCoverage_reference,
samples = samples_HighCoverage_reference,
genome = BSgenome.Mmusculus.UCSC.mm10,
RegionOfInterest = RegionOfInterest,
coverage = 20,
ConvRate.thr = NULL,
returnSM = FALSE,
clObj = NULL,
verbose = FALSE
)$DGCHN -> HighCoverageMethylation
CompositeMethylationCorrelation(
LowCoverage = LowCoverageMethylation,
LowCoverage_samples = samples_LowCoverage,
HighCoverage = HighCoverageMethylation,
HighCoverage_samples = samples_HighCoverage_reference,
bins = 50,
returnDF = TRUE,
returnPlot = TRUE,
RMSE = TRUE,
return_RMSE_DF = TRUE,
return_RMSE_plot = TRUE
) -> results
The methylation distribution plot reveals that replicates SMF_MM_NP_NO_R3_MiSeq and SMF_MM_NP_NO_R3_MiSeq deviate fairly from the reference high coverage sample.
results <- qs::qread(system.file("extdata", "low_coverage_methylation_correlation.qs",
package="SingleMoleculeFootprinting"))
results$MethylationDistribution_plot +
scale_color_manual(
values = c("#00BFC4", "#00BFC4", "#00BFC4", "#F8766D", "#F8766D"),
breaks = c("SMF_MM_TKO_as_NO_R_NextSeq", "SMF_MM_NP_NO_R1_MiSeq", "SMF_MM_NP_NO_R2_MiSeq",
"SMF_MM_NP_NO_R3_MiSeq", "SMF_MM_NP_NO_R4_MiSeq"))