Contents

1 1. Introduction

This document gives an introduction to the R package Damsel, for use in DamID analysis; from BAM file input to gene ontology analysis.

Designed for use with DamID data, the Damsel methodology could be modified for use on any similar technology that seeks to identify enriched regions relative to a control sample.

Utilising the power of edgeR for differential analysis and goseq for gene ontology bias correction, Damsel provides a unique end to end analysis for DamID.

The DamID example data used in this vignette is available in the package and has been taken from Vissers et al., (2018), ‘The Scalloped and Nerfin-1 Transcription Factors Cooperate to Maintain Neuronal Cell Fate’. The fastq files were downloaded from SRA, aligned using Rsubread::index and Rsubread::align, before sorting and making bai files with Samtools.

2 Installation

if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("Damsel")
library(Damsel)

3 Processing the BAM files

As a DamID analysis tool, Damsel requires a GATC region file for analysis. These regions serve as a guide to extract counts from the BAM files.

3.1 Introducing the GATC region file

It can be generated with getGatcRegions() using a BSGenome object or a FASTA file.

It is a GRangesList with the consecutive GATC regions across the genome - representing the region (or the length) between GATC sites, as well as the positions of the sites themselves.

If you have another species of DamID data or would prefer to make your own region file, you can use the following function, providing a BSgenome object or a FASTA file.

library(BSgenome.Dmelanogaster.UCSC.dm6)
#> Loading required package: BSgenome
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: BiocIO
#> Loading required package: rtracklayer
#> 
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:BiocIO':
#> 
#>     FileForFormat
regions_and_sites <- getGatcRegions(BSgenome.Dmelanogaster.UCSC.dm6)
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in GenomeInfoDb::renameSeqlevels(x = df_, value = newStyle): invalid
#> seqlevels 'chrM' ignored
regions <- regions_and_sites$regions
knitr::kable(head(regions))
seqnames start end width strand Position
2L 82 230 149 * chr2L-82
2L 231 371 141 * chr2L-231
2L 372 539 168 * chr2L-372
2L 540 688 149 * chr2L-540
2L 689 829 141 * chr2L-689
2L 830 997 168 * chr2L-830
knitr::kable(head(regions_and_sites$sites))
seqnames start end width strand Position
2L 80 83 4 * chr2L-80
2L 229 232 4 * chr2L-229
2L 370 373 4 * chr2L-370
2L 538 541 4 * chr2L-538
2L 687 690 4 * chr2L-687
2L 828 831 4 * chr2L-828

If you already have your own GATC region file, ensure that it has the same format with 6 columns:

  • Position: chromosome-start
  • seqnames: chromosome name
  • start: start of region
  • end: end of region
  • width: length of region (ensure that is is correct according to [plyranges::as_granges()])

3.2 Extracting the counts within the GATC regions

Note: Damsel requires BAM files that have been mapped to the reference genome.

Provided the path to a folder of BAM files (and their .bai files) and the appropriate GATC region file, the function countBamInGATC() will extract the counts for each region for each available BAM and add them as columns to a data frame. The columns will be named by the BAM file name - please rename them before running the function if they do not make sense.

path_to_bams <- system.file("extdata", package = "Damsel")
counts.df <- countBamInGATC(path_to_bams,
    regions = regions
)
  • If necessary, at this stage please rearrange the BAM file columns so they are ordered in the following way: Dam_1, Fusion_1, Dam_2, Fusion_2 etc
knitr::kable(head(counts.df))
X Position seqnames start end width strand dam_1_SRR7948872.BAM sd_1_SRR7948874.BAM dam_2_SRR7948876.BAM sd_2_SRR7948877.BAM
1 chr2L-82 chr2L 82 230 149 * 1.0 0.33 0.0 0.0
2 chr2L-231 chr2L 231 371 141 * 1.5 5.67 87.0 57.5
3 chr2L-372 chr2L 372 539 168 * 2.5 6.17 88.0 58.5
4 chr2L-540 chr2L 540 688 149 * 2.0 4.83 0.0 0.0
5 chr2L-689 chr2L 689 829 141 * 0.0 0.00 0.5 0.5
6 chr2L-830 chr2L 830 997 168 * 0.0 1.33 4.5 3.5

This example data is also directly available as a counts file via data.

data("dros_counts")
  • Do not remove the .bam extension on the column names as this is used as a check in later functions to ensure only the BAM files are selected from the data frame.

  • The DamID data captures the ~75bp region extending from each GATC site, so although regions are of differing widths, there is a null to minimal length bias present on the data, and does not require length correction.

3.3 Correlation analysis of samples

At this stage, the similarities and differences between the samples can be analysed via correlation. plotCorrHeatmap plots the correlation of all available BAM files Dam and Fusion, to visualise the similarity between files. The default for all Damsel correlation analysis is the non-parametric “spearman’s” correlation. The correlation between Dam_1 and Fusion_1 can be expected to reach ~ 0.7, whereas the correlation between Dam_1 & Dam_3 or Fusion_1 & Fusion_2 would be expected to be closer to ~0.9

plotCorrHeatmap(df = counts.df, method = "spearman")

Two specific samples can also be compared using ggscatter which plots a scatterplot of the two samples, overlaid with the correlation results. [ggpubr::ggscatter()]

3.4 Visualisation of coverage

The overall coverage of different samples can be compared

plotCountsDistribution(counts.df, constant = 1)

A specific region can be selected to view the counts across samples.

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    layout = "spread"
)

As shown in the following plots, the default layout is "stacked", where the replicates are overlaid. The counts can also be displayed in a log2 ratio with log2_scale=TRUE

4 Differential methylation analysis

The goal with DamID analysis is to identify regions that are enriched in the fusion sample relative to the control. In Damsel, this step is referred to as differential methylation analysis, and makes use of [edgeR].

For ease of use, Damsel has four main edgeR based functions which compile different steps and functions from within edgeR.

4.1 Setting up edgeR analysis

makeDGE sets up the edgeR analysis for differential methylation testing. Taking the data frame of samples and regions as input, it conducts the following steps:

  • it extracts the sample data
  • groups the samples (Dam or Fusion)
  • filters the samples (remove regions with very low counts, the filtering parameters may be adjusted)
  • normalises the data
  • establishes the design matrix (this includes the sample group and pairing replicates together - Dam_1 & Fusion_1)
  • estimates the dispersion
dge <- makeDGE(counts.df)
head(dge)
#> An object of class "DGEList"
#> $counts
#>            dam_1_SRR7948872.BAM sd_1_SRR7948874.BAM dam_2_SRR7948876.BAM
#> chr2L-231                  1.50                5.67                 87.0
#> chr2L-372                  2.50                6.17                 88.0
#> chr2L-3578                 4.00                1.50                  6.0
#> chr2L-4494                 1.42                7.83                  8.0
#> chr2L-4952                 3.58                1.83                  5.5
#> chr2L-5120                 4.42                3.00                  6.0
#>            sd_2_SRR7948877.BAM
#> chr2L-231                57.50
#> chr2L-372                58.50
#> chr2L-3578                5.00
#> chr2L-4494                6.00
#> chr2L-4952                5.33
#> chr2L-5120                5.83
#> 
#> $samples
#>                       group lib.size norm.factors
#> dam_1_SRR7948872.BAM    Dam  2238419    1.7342299
#> sd_1_SRR7948874.BAM  Fusion  2027147    0.8980197
#> dam_2_SRR7948876.BAM    Dam 10752265    1.2158660
#> sd_2_SRR7948877.BAM  Fusion 11540604    0.5281068
#> 
#> $genes
#>              Position seqnames start  end
#> chr2L-231   chr2L-231    chr2L   231  371
#> chr2L-372   chr2L-372    chr2L   372  539
#> chr2L-3578 chr2L-3578    chr2L  3578 3745
#> chr2L-4494 chr2L-4494    chr2L  4494 4661
#> chr2L-4952 chr2L-4952    chr2L  4952 5119
#> chr2L-5120 chr2L-5120    chr2L  5120 5153
#> 
#> $design
#>   (Intercept) group V2
#> 1           1     0  1
#> 2           1     1  1
#> 3           1     0  0
#> 4           1     1  0
#> attr(,"assign")
#> [1] 0 1 2
#> 
#> $common.dispersion
#> [1] 0.0226594
#> 
#> $trended.dispersion
#> [1] 1.471754e-02 1.502779e-02 9.765625e-05 9.765625e-05 9.765625e-05
#> [6] 9.765625e-05
#> 
#> $tagwise.dispersion
#> [1] 1.550797e-02 1.560039e-02 9.765625e-05 9.765625e-05 9.765625e-05
#> [6] 9.765625e-05
#> 
#> $AveLogCPM
#> [1]  2.583582499  2.620713568  0.004159126  0.388971306 -0.007506024
#> [6]  0.175425183
#> 
#> $trend.method
#> [1] "locfit"
#> 
#> $prior.df
#> [1] 115.1677 116.6016 122.7404 109.0151 122.7404 122.7404
#> 
#> $prior.n
#> [1] 115.1677 116.6016 122.7404 109.0151 122.7404 122.7404
#> 
#> $span
#> [1] 0.2767543

The output from this step is a DGEList containing all of the information from the steps.

4.2 Examining the data - multidimensional scaling plot

It’s important to visualise the differences between the samples.

You would expect the Dam samples to cluster together, and for the Fusion samples to cluster together. You would expect the majority of the variation to be within the 1st dimension (the x axis), and less variation in the 2nd dimension (y axis)

group <- dge$samples$group %>% as.character()
limma::plotMDS(dge, col = as.numeric(factor(group)))