Contents

1 Introduction

Single-cell DNA sequencing (scDNA-seq) enables the analysis of mutation profiles at single-cell resolution. This cutting-edge technique makes it possible to shed light on the previously inaccessible complexity of heterogeneous tissues such as tumors. scafari is an R package that offers easy-to-use data quality control as well as explorative variant analyses and visualization for scDNA-seq data.

Other scDNA-seq analysis packages are: - mosaic (https://missionbio.github.io/mosaic/) - optima optima - scDNA (https://github.com/bowmanr/scDNA)

2 Requirements

To run scafari, you need R (Version 4.4 or higher).

3 Running scafari

3.1 Installation

A development version of scafari will be available on GitHub and can be installed as follows:

BiocManager::install("scafari")

3.2 Getting started

A basic workflow analysis of scDNA-seq data is demonstrated in the following on an exemplary dataset. First we need to load scafari.

library(scafari)
#> 

Once the package is installed and have been loaded, the analysis can start. scafari is available as R package and shiny GUI. You can run the shiny version by RStudio executing the function launchScafariShiny(). The scafari workflow can be split into four main parts.

  1. Sequencing information and quality control
  2. Panel analysis
  3. Variant analysis
  4. Analyses of variants of special interest

These parts are represented in the four tabs of the shiny app and several functions in the R package.

3.3 Sequencing information and quality control

The analysis start with uploading the .h5 file. In the shiny app an interactive upload is provided on the starting page. Information from the .h5 file are loaded in an sce class object in this step.

library(SingleCellExperiment)

# Load .h5 file
h5_file_path <- system.file("extdata", "demo.h5", package = "scafari")
h5 <- h5ToSce(h5_file_path)
sce <- h5$sce_amp
se.var <- h5$se_var

After successful data upload the metadata including sequencing information can be accessed and basic quality information can be inspected.

# Get metadata
metadata(sce)
#> $ado_rate
#> [1] "NA"
#> 
#> $avg_mapping_error_rate
#> [1] "0.006751065"
#> 
#> $avg_panel_uniformity
#> [1] "0.960629921259842"
#> 
#> $chemistry_version
#> [1] "V2"
#> 
#> $genome_version
#> [1] "hg19"
#> 
#> $n_amplicons
#> [1] "127"
#> 
#> $n_bases_r1
#> [1] "1313531484"
#> 
#> $n_bases_r1_q30
#> [1] "1257088621"
#> 
#> $n_bases_r2
#> [1] "1313531484"
#> 
#> $n_bases_r2_q30
#> [1] "1191181775"
#> 
#> $n_cell_barcode_bases
#> [1] "433072697"
#> 
#> $n_cell_barcode_bases_q30
#> [1] "433315642"
#> 
#> $n_cells
#> [1] "1313"
#> 
#> $n_read_pairs
#> [1] "8698884"
#> 
#> $n_read_pairs_mapped_to_cells
#> [1] "5417202"
#> 
#> $n_read_pairs_trimmed
#> [1] "8573237"
#> 
#> $n_read_pairs_valid_cell_barcodes
#> [1] "8419726"
#> 
#> $n_reads_mapped
#> [1] "16453845"
#> 
#> $n_reads_mapped_insert
#> [1] "13777334"
#> 
#> $panel_name
#> [1] "AML"
#> 
#> $pipeline_version
#> [1] "2.0.2"
#> 
#> $af_cutoff
#> [1] "20"
#> 
#> $dp_cutoff
#> [1] "10"
#> 
#> $gq_cutoff
#> [1] "30"
#> 
#> $high_quality_variants
#> [1] "0"
#> 
#> $missing_cells_cutoff
#> [1] "50"
#> 
#> $missing_variants_cutoff
#> [1] "50"
#> 
#> $mutated_cells_cutoff
#> [1] "1"
#> 
#> $n_passing_cells
#> [1] "1287"
#> 
#> $n_passing_variants
#> [1] "40"
#> 
#> $n_passing_variants_per_cell
#> [1] "17"
#> 
#> $n_variants_per_cell
#> [1] "76"
#> 
#> $sample_name
#> [1] "4-cell-lines-AML-multiomics"

Further, reads per barcode can be visualized in a log-log plot.

logLogPlot(sce)

3.4 Panel analysis

The panel analysis start by normalizing the read counts using normalizeReadCounts(). The normalized read counts are stored in the corresponding assay (normalized.counts).

sce <- normalizeReadCounts(sce = sce)

After read counts are normalized they can be annotated with biological relevant information such as exon number and canonical transcript ID.

Therefore, known canonicals should be provided. They can be dowloaded like it is shown here:

library(R.utils)
#> Loading required package: R.oo
#> Loading required package: R.methodsS3
#> R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
#> R.oo v1.27.1 (2025-05-02 21:00:05 UTC) successfully loaded. See ?R.oo for help.
#> 
#> Attaching package: 'R.oo'
#> The following object is masked from 'package:R.methodsS3':
#> 
#>     throw
#> The following object is masked from 'package:SummarizedExperiment':
#> 
#>     trim
#> The following object is masked from 'package:GenomicRanges':
#> 
#>     trim
#> The following object is masked from 'package:IRanges':
#> 
#>     trim
#> The following object is masked from 'package:generics':
#> 
#>     compile
#> The following objects are masked from 'package:methods':
#> 
#>     getClasses, getMethods
#> The following objects are masked from 'package:base':
#> 
#>     attach, detach, load, save
#> R.utils v2.13.0 (2025-02-24 21:20:02 UTC) successfully loaded. See ?R.utils for help.
#> 
#> Attaching package: 'R.utils'
#> The following object is masked from 'package:generics':
#> 
#>     evaluate
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
#> The following objects are masked from 'package:base':
#> 
#>     cat, commandArgs, getOption, isOpen, nullfile, parse, use, warnings

temp_dir <- tempdir()
known_canon_path <- file.path(
    temp_dir,
    "UCSC_hg19_knownCanonical_goldenPath.txt"
)

if (!file.exists(known_canon_path)) {
    url <- paste0(
        "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/",
        "knownCanonical.txt.gz"
    )
    destfile <- file.path(
        temp_dir,
        "UCSC_hg19_knownCanonical_goldenPath.txt.gz"
    )

    # Download the file
    download.file(url, destfile)

    # Decompress the file
    R.utils::gunzip(destfile, remove = FALSE)
}
try(annotateAmplicons(sce = sce, known.canon = known_canon_path))

The exact genomic locations of the amplicons can be inspected in the amplicon distribution plot.

plotAmpliconDistribution(sce)
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.

The amplicon performance plot depicts normalized read counts per amplicon. This can help to identify low performing amplicons and get deeper insights into potential copy number alterations (Sarah).

plotNormalizedReadCounts(sce = sce)