CellBarcode 1.12.0
library(data.table)
library(ggplot2)
library(CellBarcode)
What’s this package used for?
Cellular DNA barcoding (genetic lineage tracing) is a powerful tool for lineage tracing and clonal tracking studies. This package provides a toolbox for DNA barcode analysis, from extraction from fastq files to barcode error correction and quantification.
What types of barcode can this package handle?
The package can handle all kinds of barcodes, as long as the barcodes have a pattern which can be matched by a regular expression, and each barcode is within a single sequencing read. It can handle barcodes with flexible length, and barcodes with UMI (unique molecular identifier).
This tool can also be used for the pre-processing part of amplicon data analysis such as CRISPR gRNA screening, immune repertoire sequencing and meta genome data.
What can the package do?
The package provides functions for 1). Sequence quality control and filtering, 2). Barcode (and UMI) extraction from sequencing reads, 3). Sample and barcode management with metadata, 4). Barcode filtering.
Most of the functions in this packages have names with bc_
as initiation. We
hope it can facilitate the syntax auto-complement function of IDE (integrated
development toolkit) or IDE-like tools such as RStudio, R-NVIM (in VIM), and
ESS (in Emacs). By typing bc_
you can have a list of suggested functions,
then you can pick the function you need from the list.
TODO: the function brain-map
The test data set in this package can be accessed by
system.file("extdata", "mef_test_data", package="CellBarcode")
The data are from Jos et. al (TODO: citation). There are 7 mouse embryo fibroblast (MEF) lines with different DNA barcodes. The barcodes are in vivo inducible VDJ barcodes (TODO: add citation when have). These MEF lines were mixed in a ratio of 1:2:4:8:16:32:64.
sequence | clone size 2^x |
---|---|
AAGTCCAGTTCTACTATCGTAGCTACTA | 1 |
AAGTCCAGTATCGTTACGCTACTA | 2 |
AAGTCCAGTCTACTATCGTTACGACAGCTACTA | 3 |
AAGTCCAGTTCTACTATCGTTACGAGCTACTA | 4 |
AAGTCCATCGTAGCTACTA | 5 |
AAGTCCAGTACTGTAGCTACTA | 6 |
AAGTCCAGTACTATCGTACTA | 7 |
Then 5 pools of 196 to 50000 cells were prepared from the MEF lines mixture. For each pool 2 technical replicates (NGS libraries) were prepared and sequenced, finally resulting in 10 samples.
sample name | cell number | replication |
---|---|---|
195_mixa | 195 | mixa |
195_mixb | 195 | mixb |
781_mixa | 781 | mixa |
781_mixb | 781 | mixb |
3125_mixa | 3125 | mixa |
3125_mixb | 3125 | mixb |
12500_mixa | 12500 | mixa |
12500_mixb | 12500 | mixb |
50000_mixa | 50000 | mixa |
50000_mixb | 50000 | mixb |
The original FASTQ files are relatively large, so only 2000 reads for each sample have been randomly sampled as a test set here.
example_data <- system.file("extdata", "mef_test_data", package = "CellBarcode")
fq_files <- dir(example_data, "fastq.gz", full=TRUE)
# prepare metadata for the samples
metadata <- stringr::str_split_fixed(basename(fq_files), "_", 10)[, c(4, 6)]
metadata <- as.data.frame(metadata)
sample_name <- apply(metadata, 1, paste, collapse = "_")
colnames(metadata) = c("cell_number", "replication")
# metadata should has the row names consistent to the sample names
rownames(metadata) = sample_name
metadata
#> cell_number replication
#> 195_mixb 195 mixb
#> 50000_mixa 50000 mixa
#> 50000_mixb 50000 mixb
#> 12500_mixa 12500 mixa
#> 12500_mixb 12500 mixb
#> 3125_mixa 3125 mixa
#> 3125_mixb 3125 mixb
#> 781_mixa 781 mixa
#> 781_mixb 781 mixb
#> 195_mixa 195 mixa
Install from Bioconductor.
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CellBarcode")
Install the development version from Github.
# install.packages("remotes")
remotes::install_github("wenjie1991/CellBarcode")
Here is an example of a basic workflow:
# install.packages("stringr")
library(CellBarcode)
library(magrittr)
# The example data is the mix of MEF lines with known barcodes
# 2000 reads for each file have been sampled for this test dataset
# extract UMI barcode with regular expression
bc_obj <- bc_extract(
fq_files, # fastq file
pattern = "([ACGT]{12})CTCGAGGTCATCGAAGTATC([ACGT]+)CCGTAGCAAGCTCGAGAGTAGACCTACT",
pattern_type = c("UMI" = 1, "barcode" = 2),
sample_name = sample_name,
metadata = metadata
)
bc_obj
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains:
#> ----------
#> @metadata: 4 field(s) available:
#> cell_number replication raw_read_count barcode_read_count
#> ----------
#> @messyBc: 10 sample(s) for raw barcodes:
#> In sample $195_mixb there are: 1318 Tags
#> In sample $50000_mixa there are: 1310 Tags
#> In sample $50000_mixb there are: 1385 Tags
#> In sample $12500_mixa there are: 1321 Tags
#> In sample $12500_mixb there are: 1361 Tags
#> In sample $3125_mixa there are: 1287 Tags
#> In sample $3125_mixb there are: 1297 Tags
#> In sample $781_mixa there are: 1295 Tags
#> In sample $781_mixb there are: 1303 Tags
#> In sample $195_mixa there are: 1343 Tags
# sample subset operation, select technical repeats 'mixa'
bc_sub = bc_subset(bc_obj, sample=replication == "mixa")
bc_sub
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains:
#> ----------
#> @metadata: 4 field(s) available:
#> cell_number replication raw_read_count barcode_read_count
#> ----------
#> @messyBc: 5 sample(s) for raw barcodes:
#> In sample $50000_mixa there are: 1310 Tags
#> In sample $12500_mixa there are: 1321 Tags
#> In sample $3125_mixa there are: 1287 Tags
#> In sample $781_mixa there are: 1295 Tags
#> In sample $195_mixa there are: 1343 Tags
# filter the barcode, UMI barcode amplicon >= 2 & UMI counts >= 2
bc_sub <- bc_cure_umi(bc_sub, depth = 2) %>% bc_cure_depth(depth = 2)
# select barcodes with a white list
bc_2df(bc_sub)
#> sample_name barcode_seq count
#> 1 50000_mixa AAGTCCAGTATCGTTACGCTACTA 10
#> 2 50000_mixa AAGTCCAGTACTGTAGCTACTA 11
#> 3 50000_mixa AAGTCCATCGTAGCTACTA 3
#> 4 12500_mixa AAGTCCAGTATCGTTACGCTACTA 20
#> 5 12500_mixa AAGTCCAGTACTGTAGCTACTA 11
#> 6 12500_mixa AAGTCCATCGTAGCTACTA 4
#> 7 12500_mixa AAGTCCAGTTCTACTATCGTTACGAGCTACTA 3
#> 8 3125_mixa AAGTCCATCGTAGCTACTA 7
#> 9 3125_mixa AAGTCCAGTATCGTTACGCTACTA 17
#> 10 3125_mixa AAGTCCAGTACTGTAGCTACTA 9
#> 11 3125_mixa AAGTCCAGTTCTACTATCGTTACGAGCTACTA 7
#> 12 781_mixa AAGTCCAGTATCGTTACGCTACTA 7
#> 13 781_mixa AAGTCCAGTACTGTAGCTACTA 9
#> 14 781_mixa AAGTCCATCGTAGCTACTA 2
#> 15 195_mixa AAGTCCATCGTAGCTACTA 9
#> 16 195_mixa AAGTCCAGTACTGTAGCTACTA 11
#> 17 195_mixa AAGTCCAGTATCGTTACGCTACTA 12
#> 18 195_mixa AAGTCCAGTACTATCGTACTA 2
#> 19 195_mixa AAGTCCAGTTCTACTATCGTTACGAGCTACTA 4
bc_sub[c("AAGTCCAGTACTATCGTACTA", "AAGTCCAGTACTGTAGCTACTA"), ]
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains:
#> ----------
#> @metadata: 5 field(s) available:
#> cell_number replication raw_read_count barcode_read_count depth_cutoff
#> ----------
#> @messyBc: 5 sample(s) for raw barcodes:
#> In sample $50000_mixa there are: 434 Tags
#> In sample $12500_mixa there are: 490 Tags
#> In sample $3125_mixa there are: 455 Tags
#> In sample $781_mixa there are: 524 Tags
#> In sample $195_mixa there are: 484 Tags
#> ----------
#> @cleanBc: 5 samples for cleaned barcodes
#> In sample $50000_mixa there are: 1 barcodes
#> In sample $12500_mixa there are: 1 barcodes
#> In sample $3125_mixa there are: 1 barcodes
#> In sample $781_mixa there are: 1 barcodes
#> In sample $195_mixa there are: 2 barcodes
# export the barcode counts to data.frame
head(bc_2df(bc_sub))
#> sample_name barcode_seq count
#> 1 50000_mixa AAGTCCAGTATCGTTACGCTACTA 10
#> 2 50000_mixa AAGTCCAGTACTGTAGCTACTA 11
#> 3 50000_mixa AAGTCCATCGTAGCTACTA 3
#> 4 12500_mixa AAGTCCAGTATCGTTACGCTACTA 20
#> 5 12500_mixa AAGTCCAGTACTGTAGCTACTA 11
#> 6 12500_mixa AAGTCCATCGTAGCTACTA 4
# export the barcode counts to matrix
head(bc_2matrix(bc_sub))
#> X12500_mixa X195_mixa X3125_mixa X50000_mixa
#> AAGTCCAGTACTATCGTACTA 0 2 0 0
#> AAGTCCAGTACTGTAGCTACTA 11 11 9 11
#> AAGTCCAGTATCGTTACGCTACTA 20 12 17 10
#> AAGTCCAGTTCTACTATCGTTACGAGCTACTA 3 4 7 0
#> AAGTCCATCGTAGCTACTA 4 9 7 3
#> X781_mixa
#> AAGTCCAGTACTATCGTACTA 0
#> AAGTCCAGTACTGTAGCTACTA 9
#> AAGTCCAGTATCGTTACGCTACTA 7
#> AAGTCCAGTTCTACTATCGTTACGAGCTACTA 0
#> AAGTCCATCGTAGCTACTA 2
In a full analysis starting from fastq files, the first step is to check the
seqencing quality and filter as required. The bc_seq_qc
function is for
checking the sequencing quality. If multiple samples are input the output is a
BarcodeQcSet
object, otherwise a BarcodeQC
object will be returned. In
addition, bc_seq_qc
also can handle the ShortReadQ
, DNAStringSet
and
other data types.
qc_noFilter <- bc_seq_qc(fq_files)
qc_noFilter
#> The sequence QC set, use `[]` to select sample:
#> 5290_10_BCM_195_mef_mixb_GTCATTG_S11_R1_001.fastq.gz
#> 5290_1_BCM_50000_mef_mixa_GTTCTCC_S2_R1_001.fastq.gz
#> 5290_2_BCM_50000_mef_mixb_GATGTGT_S5_R1_001.fastq.gz
#> 5290_3_BCM_12500_mef_mixa_TGCCTTG_S4_R1_001.fastq.gz
#> 5290_4_BCM_12500_mef_mixb_TAACTGC_S8_R1_001.fastq.gz
#> 5290_5_BCM_3125_mef_mixa_GCTTCCA_S9_R1_001.fastq.gz
#> 5290_6_BCM_3125_mef_mixb_TGTGAGT_S7_R1_001.fastq.gz
#> 5290_7_BCM_781_mef_mixa_CCTTACC_S12_R1_001.fastq.gz
#> 5290_8_BCM_781_mef_mixb_CGTATCC_S13_R1_001.fastq.gz
#> 5290_9_BCM_195_mef_mixa_GTACTGT_S14_R1_001.fastq.gz
bc_names(qc_noFilter)
#> [1] "5290_10_BCM_195_mef_mixb_GTCATTG_S11_R1_001.fastq.gz"
#> [2] "5290_1_BCM_50000_mef_mixa_GTTCTCC_S2_R1_001.fastq.gz"
#> [3] "5290_2_BCM_50000_mef_mixb_GATGTGT_S5_R1_001.fastq.gz"
#> [4] "5290_3_BCM_12500_mef_mixa_TGCCTTG_S4_R1_001.fastq.gz"
#> [5] "5290_4_BCM_12500_mef_mixb_TAACTGC_S8_R1_001.fastq.gz"
#> [6] "5290_5_BCM_3125_mef_mixa_GCTTCCA_S9_R1_001.fastq.gz"
#> [7] "5290_6_BCM_3125_mef_mixb_TGTGAGT_S7_R1_001.fastq.gz"
#> [8] "5290_7_BCM_781_mef_mixa_CCTTACC_S12_R1_001.fastq.gz"
#> [9] "5290_8_BCM_781_mef_mixb_CGTATCC_S13_R1_001.fastq.gz"
#> [10] "5290_9_BCM_195_mef_mixa_GTACTGT_S14_R1_001.fastq.gz"
class(qc_noFilter)
#> [1] "BarcodeQcSet"
#> attr(,"package")
#> [1] "CellBarcode"
The bc_plot_seqQc
function can be invoked with a BarcodeQcSet
as argument,
and the output is a QC summary with two panels. The first shows the ratio of
ATCG bases for each sequencing cycle with one sample per row; this allows the
user to, for example, identify constant or random parts of the sequencing read.
The second figure shows the average sequencing quality index of each cycle
(base).
For the test set, the first 12 bases are UMI, which are random. This is followed by the constant region of the barcode (the PCR primer selects reads with this sequence), and here we observe a specific base for each cycle across all the samples.
bc_plot_seqQc(qc_noFilter)