Introduction

SingleMoleculeFootprinting is an R package for Single Molecule Footprinting (SMF) data.

Starting from an aligned bam file, we show how to

For installation, the user can use the following:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SingleMoleculeFootprinting")

For compatibility with our analysis tools, we recommend the user to perform genomic alignments using the qAlign function from QuasR as exemplified in the chunk below.
For detailed pre-processing instructions we refer to steps 179 to 186 of Kleinendorst & Barzaghi et al., 2021

clObj <- makeCluster(40)
prj <- QuasR::qAlign(
  sampleFile = sampleFile,
  genome = "BSgenome.Mmusculus.UCSC.mm10",
  aligner = "Rbowtie",
  projectName = "prj",
  paired = "fr",
  bisulfite = "undir",
  alignmentParameter = "-e 70 -X 1000 -k 2 --best -strata",
  alignmentsDir = "./", 
  cacheDir = tempdir(),
  clObj = clObj
  )
stopCluster(clObj)

Loading libraries

suppressWarnings(library(SingleMoleculeFootprinting))
suppressWarnings(library(SingleMoleculeFootprintingData))
suppressWarnings(library(BSgenome.Mmusculus.UCSC.mm10))
suppressWarnings(library(parallel))
suppressWarnings(library(ggplot2))

Prepare inputs

SingleMoleculeFootprinting inherits QuasR’s sampleFile style of feeding .bam files. For instructions, refer to the qAlign documentation. Here we show how our sampleFile looks like.
N.b.: This vignette and some functions of SingleMoleculeFootprinting depend on data available through the data package SingleMoleculeFootprintingData.
For user-friendliness, this data is fetched during the installation of SingleMoleculeFootprinting and stored in tempdir().
Please make sure that tempdir() has enough storage capacity (~1 Gb). You can check this by running ExperimentHub::getExperimentHubOption(arg = "CACHE") and if needed change the location by running ExperimentHub::setExperimentHubOption(arg = "CACHE", value = "/your/favourite/location").

sampleFile = paste0(CacheDir, "/NRF1Pair_sampleFile.txt")
knitr::kable(suppressMessages(readr::read_delim(sampleFile, delim = "\t")))
FileName SampleName
/home/biocbuild/.cache/R/ExperimentHub/NRF1pair.bam NRF1pair_DE_

Library QCs

QuasR QC report

For generic sequencing QCs, we refer to QuasR’s qQCreport.

Bait capture efficiency

If a bait capture step was included to enrich for regulatory regions, it is useful to check how efficiently that worked.
Here we calculate the ratio of molecules overlapping the enrichment targets over the total. A Bait capture efficiency below 70% might be problematic.
In that case we refer to the troubleshooting section of our Kleinendorst & Barzaghi et al., 2021.

BaitRegions <- SingleMoleculeFootprintingData::EnrichmentRegions_mm10.rds()
clObj = makeCluster(4)
BaitCapture(
  sampleFile = sampleFile, 
  genome = BSgenome.Mmusculus.UCSC.mm10,
  baits = BaitRegions, 
  clObj = clObj
) -> BaitCaptureEfficiency
stopCluster(clObj)

Conversion rate accuracy

The bisulfite conversion step, chemically converts un-methylated cytosines to thymines. This process has a margin of error.
Here we ask what is the percentage of converted cytosines among those which shouldn’t be methylated, i.e. those outside of CpG or GpC contexts. Normally, we expect a conversion rate of ~95%.
N.b.: this function runs much slower than the one above, which is why we prefer to check this metric for chr19 only.

ConversionRate(
  sampleFile = sampleFile, 
  genome = BSgenome.Mmusculus.UCSC.mm10, 
  chr = 19
) -> ConversionRateValue

Methylation rate sample correlation (high coverage)

It is useful to compare the distribution of cytosine methylation rates across replicates.

RegionOfInterest = GRanges(BSgenome.Mmusculus.UCSC.mm10@seqinfo["chr19"])

CallContextMethylation(
  sampleFile = sampleFile, 
  samples = samples, 
  genome = BSgenome.Mmusculus.UCSC.mm10, 
  RegionOfInterest = RegionOfInterest, 
  coverage = 20, 
  ConvRate.thr = NULL, 
  returnSM = FALSE, 
  clObj = NULL,
  verbose = FALSE
  ) -> Methylation

Methylation %>%
  elementMetadata() %>%
  as.data.frame() %>%
  dplyr::select(grep("_MethRate", colnames(.))) -> MethylationRate_matrix

png("../inst/extdata/MethRateCorr_AllCs.png", units = "cm", res = 100, width = 20, height = 20)
pairs(
  MethylationRate_matrix, 
  upper.panel = panel.cor, 
  diag.panel = panel.hist, 
  lower.panel = panel.jet, 
  labels = gsub("SMF_MM_|DE_|_MethRate", "", colnames(MethylationRate_matrix))
  )
dev.off()
knitr::include_graphics(system.file("extdata", "MethRateCorr_AllCs.png", package="SingleMoleculeFootprinting"))

It is also useful, especially in the case of single enzyme treatments, to split the genomics contexts of cytosines based on the MTase used.

Methylation %>%
  elementMetadata() %>%
  as.data.frame() %>%
  filter(GenomicContext %in% c("DGCHN", "GCH")) %>%
  dplyr::select(grep("_MethRate", colnames(.))) -> MethylationRate_matrix_GCH

png("../inst/extdata/MethRateCorr_GCHs.png", units = "cm", res = 100, width = 20, height = 20)
pairs(MethylationRate_matrix_GCH, upper.panel = panel.cor, diag.panel = panel.hist, lower.panel = panel.jet, labels = colnames(MethylationRate_matrix_GCH))
dev.off()

Methylation %>%
  elementMetadata() %>%
  as.data.frame() %>%
  filter(GenomicContext %in% c("NWCGW", "HCG")) %>%
  dplyr::select(grep("_MethRate", colnames(.))) -> MethylationRate_matrix_HCG

png("../inst/extdata/MethRateCorr_HCGs.png", units = "cm", res = 100, width = 20, height = 20)
pairs(MethylationRate_matrix_HCG, upper.panel = panel.cor, diag.panel = panel.hist, lower.panel = panel.jet, labels = colnames(MethylationRate_matrix_HCG))
dev.off()
knitr::include_graphics(system.file("extdata", "MethRateCorr_GCHs.png", package="SingleMoleculeFootprinting"))

knitr::include_graphics(system.file("extdata", "MethRateCorr_HCGs.png", package="SingleMoleculeFootprinting"))

Methylation rate sample correlation (low coverage)

Before investing in deep sequencing, it is advisable to shallowly sequence libraries to assess the footprinting quality of the libraries.
However the per-cytosine coverage obtained at shallow sequencing is insufficient to estimate methylation rates for individual cytosines.
A solution is to pile up molecules covering cytosines from genomic loci that are known to behave similarly and compute a “composite” methylation rate.
Such composite methylation rate allows to assess the quality of footprinting at low coverage.

The following chunk exemplifies how to proceed.
First we want to call methylation for the new low depth samples, paying attention to setting the parameter coverage=1.
Then we want to call methylation for a reference high coverage sample.
Finally, we can use the wrapper function CompositeMethylationCorrelation to extract composite methylation rates.
N.b.: the methylation calling step is quite computationally demanding for full genomes, so we ran this in the background and reported the results only.

RegionOfInterest = GRanges(BSgenome.Mmusculus.UCSC.mm10@seqinfo["chr19"])

CallContextMethylation(
  sampleFile = sampleFile_LowCoverage, 
  samples = samples_LowCoverage, 
  genome = BSgenome.Mmusculus.UCSC.mm10, 
  RegionOfInterest = RegionOfInterest, 
  coverage = 1, 
  ConvRate.thr = NULL, 
  returnSM = FALSE, 
  clObj = NULL,
  verbose = FALSE
  )$DGCHN -> LowCoverageMethylation

CallContextMethylation(
  sampleFile = sampleFile_HighCoverage_reference, 
  samples = samples_HighCoverage_reference, 
  genome = BSgenome.Mmusculus.UCSC.mm10, 
  RegionOfInterest = RegionOfInterest, 
  coverage = 20, 
  ConvRate.thr = NULL, 
  returnSM = FALSE, 
  clObj = NULL,
  verbose = FALSE
  )$DGCHN -> HighCoverageMethylation

CompositeMethylationCorrelation(
  LowCoverage = LowCoverageMethylation, 
  LowCoverage_samples = samples_LowCoverage, 
  HighCoverage = HighCoverageMethylation, 
  HighCoverage_samples = samples_HighCoverage_reference,
  bins = 50, 
  returnDF = TRUE, 
  returnPlot = TRUE, 
  RMSE = TRUE, 
  return_RMSE_DF = TRUE, 
  return_RMSE_plot = TRUE
  ) -> results

The methylation distribution plot reveals that replicates SMF_MM_NP_NO_R3_MiSeq and SMF_MM_NP_NO_R3_MiSeq deviate fairly from the reference high coverage sample.

results <- qs::qread(system.file("extdata", "low_coverage_methylation_correlation.qs", 
                                 package="SingleMoleculeFootprinting"))
results$MethylationDistribution_plot +
  scale_color_manual(
    values = c("#00BFC4", "#00BFC4", "#00BFC4", "#F8766D", "#F8766D"), 
    breaks = c("SMF_MM_TKO_as_NO_R_NextSeq", "SMF_MM_NP_NO_R1_MiSeq", "SMF_MM_NP_NO_R2_MiSeq", 
               "SMF_MM_NP_NO_R3_MiSeq", "SMF_MM_NP_NO_R4_MiSeq"))