The barbieQ package provides a series of robust statistical tools for analysing barcode count data generated from cell clonal tracking (lineage tracing) experiments.
In these experiments, an initial cell and its offspring collectively form a clone (or lineage). A unique DNA barcode, incorporated into the genome of an initial cell, is inherited by all its progeny within the clone. This one-to-one mapping of barcodes to clones enables tracking of clonal behaviours. By quantifying barcode counts, researchers can measure the abundance of individual clones under various experimental conditions or perturbations.
Existing tools for barcode count data analysis [barcodetrackR, CellDestiny, genBaRcode] rely primarily on qualitative interpretation through visualizations and often lack robust methods to model the sources of barcode variability.
To address this gap, this R software package, barbieQ, provides advanced statistical methods to model barcode variability. The package supports preprocessing, visualization, and statistical testing to identify barcodes with significant differences in proportions or occurrences across experimental conditions. Key functionalities include initializing data structures, filtering barcodes, and applying regression models to test for significant clonal changes.
The main functions include:
createBarbieQ()
tagTopBarcodes()
plotBarcodePairCorrelation()
clusterCorrelatingBarcodes()
plotSamplePairCorrelation()
plotBarcodeProportion()
testBarcodeSignif()
plotSignifBarcodeProportion()
plotBarcodeMA()
## You can install the released version of barbieQ like so:
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("barbieQ")
## Alternatively, you can install the development version of barbieQ from GitHub
devtools::install_github("Oshlack/barbieQ")
The example dataset (monkeyHSPC) is a subset of data from a study on monkey HSPC expansion using a barcoding technique [NK clonal expansion, barcodetrackRData]. Barcode counts within different samples of various cell types were used to interpret the patterns of HSPC differentiation.
It is a SummarizedExperiment object created using the
barbieQ::createBarbieQ() function, containing a barcode count
matrix with 16,603 rows and 62 columns, and a data frame of sample
metadata.
Start by creating a barbieQ object: pass the barcode count matrix
as input to the createBarbieQ() function.
When the object is created, a series of data transformations is
applied automatically, and the transformed data are saved within the
barbieQ object for easy use in subsequent analyses.
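To make the idea of stored transformations concrete, here is a minimal base R sketch on a toy count matrix. It is an illustration only: the toy data and the variable names are assumptions, and it does not reproduce the internals of createBarbieQ(). The transformations shown (proportion, CPM, log2 CPM+1, and occurrence) are the scales this vignette refers to later.

```r
# Toy barcode count matrix: 4 barcodes (rows) x 3 samples (columns).
# matrix() fills column by column, so each group of four values is one sample.
counts <- matrix(
  c(90, 10,  0,  0,
    50, 30, 20,  0,
     5,  5,  0, 90),
  nrow = 4,
  dimnames = list(paste0("BC", 1:4), paste0("S", 1:3))
)

# Proportion: each count divided by its sample (column) total.
proportion <- sweep(counts, 2, colSums(counts), "/")

# CPM: counts per million, i.e. the proportion scaled by 1e6.
cpm <- proportion * 1e6

# log2(CPM + 1): the scale named in the sample correlation message.
logCPM <- log2(cpm + 1)

# Occurrence: whether a barcode is detected in a sample.
# Assumption: "detected" means a nonzero count.
occurrence <- counts > 0

colSums(proportion)  # every sample's proportions sum to 1
```

Keeping all of these alongside the raw counts means later steps can pick the scale they need without recomputing it.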
Here we subset the object by selecting samples from specific stages of collection time.
In sampleMetadata, define “early” (before 6 months), “late” (55 months or later), and “mid” (everything in between)
stages based on “Months”, and clean up “Celltype”.
updateSampleMetadata <- exampleBBQ$sampleMetadata %>%
  as.data.frame() %>%
  select(Celltype, Months) %>%
  mutate(Phase = ifelse(Months < 6, "early", ifelse(Months >= 55, "late", "mid"))) %>%
  mutate(Celltype = gsub("(Gr).*", "\\1", Celltype))
SummarizedExperiment::colData(exampleBBQ)$sampleMetadata <- S4Vectors::DataFrame(updateSampleMetadata)
exampleBBQ$sampleMetadata
#> DataFrame with 62 rows and 3 columns
#> Celltype Months Phase
#> <character> <numeric> <character>
#> ZG66_6.5m_T T 6.5 mid
#> ZG66_12m_T T 12.0 mid
#> ZG66_17m_T T 17.0 mid
#> ZG66_27m_T T 27.0 mid
#> ZG66_36m_T T 36.0 mid
#> ... ... ... ...
#> ZG66_58m_NK_NKG2Ap_CD16p_KIR3DL01n NK_NKG2Ap_CD16p_KIR3.. 58 late
#> ZG66_58m_NK_NKG2Ap_CD16p_KIR3DL01p NK_NKG2Ap_CD16p_KIR3.. 58 late
#> ZG66_68m_NK_NKG2Ap_CD16p NK_NKG2Ap_CD16p 68 late
#> ZG66_68m_NK_NKG2Ap_CD16p_KIR3DL01n NK_NKG2Ap_CD16p_KIR3.. 68 late
#> ZG66_68m_NK_NKG2Ap_CD16p_KIR3DL01p NK_NKG2Ap_CD16p_KIR3.. 68 late
Subset the object to retain only the samples from the “mid” stage.
flag_sample <- exampleBBQ$sampleMetadata$Phase == "mid"
exampleBBQ <- exampleBBQ[, flag_sample]
exampleBBQ$sampleMetadata
#> DataFrame with 42 rows and 3 columns
#> Celltype Months Phase
#> <character> <numeric> <character>
#> ZG66_6.5m_T T 6.5 mid
#> ZG66_12m_T T 12.0 mid
#> ZG66_17m_T T 17.0 mid
#> ZG66_27m_T T 27.0 mid
#> ZG66_36m_T T 36.0 mid
#> ... ... ... ...
#> ZG66_22m_Gr Gr 22.0 mid
#> ZG66_24m_Gr Gr 24.0 mid
#> ZG66_36m_Gr_2 Gr 36.0 mid
#> ZG66_14.5m_NK_CD56n_CD16p NK_CD56n_CD16p 14.5 mid
#> ZG66_14.5m_NK_CD56p_CD16n NK_CD56p_CD16n 14.5 mid
A filtering step is recommended to remove barcodes that consistently show low counts across the dataset. The retained barcodes, which are considered to make an essential contribution, are referred to as “top barcodes”.
Applying the tagTopBarcodes() function to the
barbieQ object identifies and tags the “top
barcodes” within the object.
In this example dataset, we are interested in the differences in
barcode outcomes between cell types, so we group samples by cell
type and set nSampleThreshold to 6, the
minimum group size.
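The following base R sketch shows one plausible way a "top barcode" rule with a sample-count threshold could work. The rule itself is an assumption made for illustration, as are the function name tagTopSketch and its propThreshold argument; the actual criteria used by tagTopBarcodes() may differ. In this sketch, a barcode counts as a top contributor in a sample if it is needed to reach a cumulative proportion threshold in that sample, and it is tagged "top" overall only if that happens in at least nSampleThreshold samples.

```r
# Hypothetical helper, not part of barbieQ.
tagTopSketch <- function(counts, propThreshold = 0.95, nSampleThreshold = 1) {
  proportion <- sweep(counts, 2, colSums(counts), "/")

  # For each sample, mark the barcodes needed to reach propThreshold.
  topInSample <- apply(proportion, 2, function(p) {
    ord <- order(p, decreasing = TRUE)
    # Cumulative proportion already covered before each ranked barcode.
    cumBefore <- c(0, head(cumsum(p[ord]), -1))
    keep <- logical(length(p))
    keep[ord] <- cumBefore < propThreshold & p[ord] > 0
    keep
  })

  # A barcode is "top" if it is a top contributor in enough samples.
  isTop <- rowSums(topInSample) >= nSampleThreshold
  names(isTop) <- rownames(counts)
  isTop
}

# Toy data: 4 barcodes x 3 samples (each group of four values is one sample).
counts <- matrix(
  c(90,  8,  2,  0,
    70, 20, 10,  0,
     0,  1,  3, 96),
  nrow = 4,
  dimnames = list(paste0("BC", 1:4), paste0("S", 1:3))
)

isTop <- tagTopSketch(counts, propThreshold = 0.95, nSampleThreshold = 2)
isTop
```

With nSampleThreshold = 2, BC4 is not tagged even though it dominates sample S3, because it is a top contributor in only one sample. Setting the threshold to the smallest group size, as done in this vignette, avoids discarding barcodes that are abundant throughout a single small group.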
## Check out minimum group size.
table(exampleBBQ$sampleMetadata$Celltype)
#>
#> B Gr NK_CD56n_CD16p NK_CD56p_CD16n T
#> 6 12 7 7 10
## Tag top Barcodes.
exampleBBQ <- tagTopBarcodes(barbieQ = exampleBBQ, nSampleThreshold = 6)
Once “top barcodes” are determined and tagged, it’s useful to assess their contributions before actually removing the “bottom barcodes”, which are considered non-essential contributors.
By applying the plotBarcodePareto() function to the
barbieQ object, you can visualize the contribution of each
barcode, colour-coded as “top” or “bottom”. (Here,
“contribution” refers to the average proportion of individual barcodes
across samples in the dataset.)
By applying the plotBarcodeSankey() function to the
barbieQ object, you can visualize the collective
contribution of the “top” and “bottom” barcode
groups.
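The two quantities described above can be computed by hand on a toy matrix. This is a sketch under stated assumptions: the counts and the isTop tags are made up, and the plotting functions may aggregate differently in detail. The per-barcode contribution is the average proportion across samples, and the collective contribution is the sum of those averages within each tag group.

```r
# Toy data: 4 barcodes x 3 samples (each group of four values is one sample).
counts <- matrix(
  c(90,  8,  2,  0,
    70, 20, 10,  0,
     0,  1,  3, 96),
  nrow = 4,
  dimnames = list(paste0("BC", 1:4), paste0("S", 1:3))
)

# Hypothetical top/bottom tags for the four barcodes.
isTop <- c(BC1 = TRUE, BC2 = TRUE, BC3 = FALSE, BC4 = TRUE)

# Per-barcode contribution: mean proportion across samples.
proportion <- sweep(counts, 2, colSums(counts), "/")
contribution <- rowMeans(proportion)
sort(contribution, decreasing = TRUE)

# Collective contribution of each tag group.
collective <- c(
  top = sum(contribution[isTop]),
  bottom = sum(contribution[!isTop])
)
collective
```

If the "bottom" group holds only a small share of the total, removing it loses little information.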
## visualize contribution of top vs. bottom barcodes
plotBarcodePareto(barbieQ = exampleBBQ) |> plot()
#> Warning: Removed 10 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
## visualize collective contribution of top vs. bottom barcodes
plotBarcodeSankey(barbieQ = exampleBBQ) |> plot()
Filter the barbieQ object based on the tagged array.
To gain a general understanding of sample similarity, you can
visualize sample pairwise correlations in a checkerboard style by applying
the plotSamplePairCorrelation() function to the
barbieQ object.
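As a rough picture of what such a plot summarizes, the sketch below computes Pearson correlations between samples on log2(CPM + 1) values, the scale reported in the function's message. The toy counts are an assumption, and the plotting function may apply further steps not shown here.

```r
# Toy data: 4 barcodes x 3 samples (each group of four values is one sample).
counts <- matrix(
  c(90,  8,  2,  0,
    70, 20, 10,  0,
     0,  1,  3, 96),
  nrow = 4,
  dimnames = list(paste0("BC", 1:4), paste0("S", 1:3))
)

# Counts per million, then log2(CPM + 1).
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6
logCPM <- log2(cpm + 1)

# cor() correlates columns, so this is a sample-by-sample matrix.
sampleCor <- cor(logCPM, method = "pearson")
round(sampleCor, 2)
```

Samples S1 and S2 share the same dominant barcode and correlate positively, while S3 is dominated by a different barcode and correlates negatively with both.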
## visualize sample pairwise correlation
plotSamplePairCorrelation(barbieQ = exampleBBQ) |> plot()
#> setting Celltype as the primary factor in `sampleMetadata`.
#> displaying pearson correlation coefficient between samples on Barcode log2 CPM+1.
The barbieQ object is interoperable with other packages,
such as bartools. Below is an example of how to import a
barbieQ object into the bartools pipeline for
visualization. This code chunk is not executed in the vignette, but you
can run it in your local environment.
# devtools::install_github("DaneVass/bartools", dependencies = TRUE, force = TRUE)
#
# dge <- edgeR::DGEList(
#   counts = SummarizedExperiment::assay(exampleBBQ),
#   group = exampleBBQ$sampleMetadata$Celltype)
#
# bartools::plotBarcodeHistogram(dge)
Below is an example of inspecting barcode data variance using
the speckle package. This code chunk is not executed in the
vignette, but you can run it in your local environment.
Based on your understanding of the sample conditions that are likely to be the
source of variability in barcode outcomes, you can robustly test the
significance of barcode changes between those conditions by
applying the testBarcodeSignif() function to the
barbieQ object. The test results are saved in the
object and can be further visualized using functions such as
plotBarcodeMA(), plotSignifBarcodeHeatmap(), and
plotSignifBarcodeProportion().
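The contrastFormula strings used in this section compare one cell type against the average of the other four. The base R sketch below shows what that comparison means as a linear contrast on group means for a single barcode. The values in y are made up, and how barbieQ evaluates the formula internally is not shown here; the sketch only illustrates the arithmetic implied by the formula.

```r
# Two toy samples per cell type.
celltype <- factor(rep(
  c("B", "Gr", "NK_CD56n_CD16p", "NK_CD56p_CD16n", "T"),
  each = 2
))

# Cell-means design: one column per cell type, no intercept.
design <- model.matrix(~ 0 + celltype)
colnames(design) <- paste0("Celltype", levels(celltype))

# Contrast weights: +1 for the group of interest, -1/4 for each of the others.
contrast <- c(
  CelltypeNK_CD56n_CD16p = 1,
  CelltypeB = -1 / 4,
  CelltypeGr = -1 / 4,
  CelltypeT = -1 / 4,
  CelltypeNK_CD56p_CD16n = -1 / 4
)

# Toy (already transformed) proportions of one barcode across the 10 samples.
y <- c(0.1, 0.1, 0.2, 0.2, 0.5, 0.7, 0.1, 0.3, 0.2, 0.2)

# With a cell-means design, the fitted coefficients are the group means.
groupMeans <- lm.fit(design, y)$coefficients

# Align the weights to the design columns by name, then combine.
estimate <- sum(contrast[colnames(design)] * groupMeans)
estimate
```

The weights sum to zero, so the estimate is zero when all cell types have the same mean and positive when the barcode is enriched in NK_CD56n_CD16p samples.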
By setting the method parameter to
“diffProp” (default), you test each barcode’s differential
proportion between conditions.
By setting the method parameter to
“diffOcc”, you test each barcode’s differential
occurrence between conditions.
We recommend setting the transformation parameter to
“asin-sqrt” (default), although alternatives such as
“logit” and “none” are also available. Statistical
tests are performed on the data following the specified proportion
transformation.
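For reference, the two named transformations can be written in a few lines of base R. This is a generic sketch of the standard arcsine square root and logit functions applied to example proportions; any offsets or adjustments the package applies before transforming are not shown and are not assumed here.

```r
# Example proportions, including an exact zero.
p <- c(0, 0.001, 0.01, 0.1, 0.5)

# Arcsine square root: defined on [0, 1], maps 0 to 0 and 1 to pi / 2.
asinSqrt <- asin(sqrt(p))

# Logit: log odds, defined on (0, 1) only.
logitP <- log(p / (1 - p))

data.frame(p = p, asinSqrt = asinSqrt, logit = logitP)
```

The arcsine square root stays finite at a proportion of exactly 0, whereas the plain logit returns -Inf there, which is one reason zero-heavy barcode data can be awkward on the logit scale.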
## test Barcode differential proportion between sample groups
## Default transformation: asin-sqrt
asinTrans <- testBarcodeSignif(
  barbieQ = exampleBBQ,
  contrastFormula = "(CelltypeNK_CD56n_CD16p) - (CelltypeB+CelltypeGr+CelltypeT+CelltypeNK_CD56p_CD16n)/4",
  method = "diffProp", transformation = "asin-sqrt"
)
#> setting Celltype as the primary factor in `sampleMetadata`.
#> removing factors with only one level from sampleMetadata: NA
#> no block specified, so there are no duplicate measurements.
## Alternatively: using logit transformation
logitTrans <- testBarcodeSignif(
  barbieQ = exampleBBQ,
  contrastFormula = "(CelltypeNK_CD56n_CD16p) - (CelltypeB+CelltypeGr+CelltypeT+CelltypeNK_CD56p_CD16n)/4",
  method = "diffProp", transformation = "logit"
)
#> setting Celltype as the primary factor in `sampleMetadata`.
#> removing factors with only one level from sampleMetadata: NA
#> Warning in FUN(X[[i]], ...): NaNs produced
#> no block specified, so there are no duplicate measurements.
## Alternatively: no transformation
noTrans <- testBarcodeSignif(
  barbieQ = exampleBBQ,
  contrastFormula = "(CelltypeNK_CD56n_CD16p) - (CelltypeB+CelltypeGr+CelltypeT+CelltypeNK_CD56p_CD16n)/4",
  method = "diffProp", transformation = "none"
)
#> setting Celltype as the primary factor in `sampleMetadata`.
#> removing factors with only one level from sampleMetadata: NA
#> no block specified, so there are no duplicate measurements.
Draw an MA plot for the differential proportion tests following different transformations.
(plotBarcodeMA(asinTrans) + coord_trans(x = "log10"))|> plot()
#> Warning: `coord_trans()` was deprecated in ggplot2 4.0.0.
#> ℹ Please use `coord_transform()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
Annotate barcodes in the heatmap based on significance derived from differential proportion tests following different transformations.
Visualize the aggregated barcode proportion in each sample, grouped by significance.
In the differential occurrence test, the regularization
parameter is set to “firth” by default, which is strongly
recommended, especially with small sample sizes.
## test Barcode differential occurrence between sample groups
## set up the targets (sample conditions)
targets <- exampleBBQ$sampleMetadata %>%
  as.data.frame() %>%
  mutate(Group = ifelse(
    Celltype == "NK_CD56n_CD16p",
    "NK_CD56n_CD16p",
    "B.Gr.T.NK_CD56p_CD16n"))
exampleBBQ <- testBarcodeSignif(
  barbieQ = exampleBBQ,
  sampleMetadata = targets[, "Group", drop = FALSE],
  method = "diffOcc"
)
#> setting Group as the primary factor in `sampleMetadata`.
#> setting up contrastFormula: GroupNK_CD56n_CD16p - GroupB.Gr.T.NK_CD56p_CD16n
Draw an “MA plot” for the differential occurrence test by plotting the Log Odds Ratio (LOR) against the Mean Occurrence Frequency (number of total occurrences across samples / number of total samples) for each barcode.
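The two axes of this plot can be illustrated with a small base R calculation. Note the substitution: instead of the Firth-regularized fit used by the package, this sketch takes the log odds ratio from a 2x2 table with a 0.5 continuity correction, which is a simpler stand-in and will not give the same numbers. The toy occurrence matrix, the group labels, and the helper logOddsRatio are all assumptions made for illustration.

```r
# Toy occurrence matrix: 3 barcodes x 8 samples (TRUE = barcode detected).
occurrence <- rbind(
  BC1 = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  BC2 = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  BC3 = rep(TRUE, 8)
)
group <- rep(c("A", "B"), each = 4)

# Hypothetical helper: log odds ratio of occurrence in `target` vs. the rest,
# with a pseudo-count added to every cell of the 2x2 table.
logOddsRatio <- function(occ, group, target, pseudo = 0.5) {
  inTarget <- group == target
  presentTarget <- sum(occ & inTarget) + pseudo
  absentTarget <- sum(!occ & inTarget) + pseudo
  presentOther <- sum(occ & !inTarget) + pseudo
  absentOther <- sum(!occ & !inTarget) + pseudo
  log((presentTarget * absentOther) / (absentTarget * presentOther))
}

# y-axis: log odds ratio per barcode.
lor <- apply(occurrence, 1, logOddsRatio, group = group, target = "A")

# x-axis: total occurrences / total samples per barcode.
meanOccFreq <- rowMeans(occurrence)

data.frame(meanOccFreq = meanOccFreq, LOR = lor)
```

BC1 occurs only in group A and gets a large positive LOR, while BC2 and BC3 occur equally often in both groups and sit at zero despite having different mean occurrence frequencies.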
We are currently writing a paper to introduce the methods and
approaches implemented in this barbieQ package and will
update with a citation once available.
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 grid stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.41.0 Biobase_2.71.0
#> [3] GenomicRanges_1.63.1 Seqinfo_1.1.0
#> [5] IRanges_2.45.0 S4Vectors_0.49.0
#> [7] BiocGenerics_0.57.0 generics_0.1.4
#> [9] MatrixGenerics_1.23.0 matrixStats_1.5.0
#> [11] limma_3.67.0 ComplexHeatmap_2.27.0
#> [13] data.table_1.18.0 igraph_2.2.1
#> [15] logistf_1.26.1 circlize_0.4.17
#> [17] ggplot2_4.0.1 dplyr_1.1.4
#> [19] tidyr_1.3.2 magrittr_2.0.4
#> [21] barbieQ_1.3.0 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rdpack_2.6.4 rlang_1.1.6 clue_0.3-66
#> [4] GetoptLong_1.1.0 compiler_4.5.2 mgcv_1.9-4
#> [7] png_0.1-8 vctrs_0.6.5 pkgconfig_2.0.3
#> [10] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
#> [13] backports_1.5.0 XVector_0.51.0 labeling_0.4.3
#> [16] rmarkdown_2.30 nloptr_2.2.1 purrr_1.2.0
#> [19] xfun_0.55 glmnet_4.1-10 jomo_2.7-6
#> [22] cachem_1.1.0 jsonlite_2.0.0 DelayedArray_0.37.0
#> [25] pan_1.9 broom_1.0.11 parallel_4.5.2
#> [28] cluster_2.1.8.1 R6_2.6.1 bslib_0.9.0
#> [31] RColorBrewer_1.1-3 boot_1.3-32 rpart_4.1.24
#> [34] jquerylib_0.1.4 Rcpp_1.1.0.8.1 iterators_1.0.14
#> [37] knitr_1.51 Matrix_1.7-4 splines_4.5.2
#> [40] nnet_7.3-20 tidyselect_1.2.1 abind_1.4-8
#> [43] yaml_2.3.12 doParallel_1.0.17 codetools_0.2-20
#> [46] lattice_0.22-7 tibble_3.3.0 withr_3.0.2
#> [49] S7_0.2.1 evaluate_1.0.5 survival_3.8-3
#> [52] pillar_1.11.1 BiocManager_1.30.27 mice_3.19.0
#> [55] foreach_1.5.2 reformulas_0.4.3 scales_1.4.0
#> [58] minqa_1.2.8 glue_1.8.0 maketools_1.3.2
#> [61] tools_4.5.2 sys_3.4.3 lme4_1.1-38
#> [64] buildtools_1.0.0 rbibutils_2.4 colorspace_2.1-2
#> [67] nlme_3.1-168 formula.tools_1.7.1 cli_3.6.5
#> [70] S4Arrays_1.11.1 gtable_0.3.6 sass_0.4.10
#> [73] digest_0.6.39 operator.tools_1.6.3 SparseArray_1.11.10
#> [76] rjson_0.2.23 farver_2.1.2 htmltools_0.5.9
#> [79] lifecycle_1.0.4 GlobalOptions_0.1.3 mitml_0.4-5
#> [82] statmod_1.5.1 MASS_7.3-65