Single cell RNA sequencing (scRNAseq) has made it possible to examine the cellular heterogeneity within a tissue or sample, and observe changes and characteristics in specific cell types. To do this, we need to group the cells into clusters and figure out what they are.

In a typical scRNAseq experiment the gene expression levels are first quantified to per-cell counts. Then, cells are clustered into related groups (or clusters) on the basis of transcriptional similarity. There are many different cell-clustering tools that can do this (Freytag et al. 2017).

Clustering tools generally define groups of similar cells - but do not offer explanation as to their biological contents. The annotation of the ‘cell type’ of each cluster is performed by a domain expert biologist - who can examine the known marker genes, or differential expression to understand what type of cell each cluster might describe. This can be a time-consuming semi-manual process, and must be performed before addressing the actual biological question of interest.

The celaref package aims to streamline this cell-type identification step, by suggesting cluster labels on the basis of similarity to an already-characterised reference dataset - whether that’s from a similar experiment performed previously in the same lab, or from a public dataset from a similar sample.

Celaref differs from other cell-type identification tools like scmap (Kiselev, Yiu, and Hemberg 2018) or (functions in) MUDAN in that it operates at the cluster-level.

Celaref requires a table of read counts per cell per gene, and a list of the cells belonging to each of the clusters, (for both test and reference data). It compares the reference sample rankings of the most distinctly enriched genes in each query group to match cell types.


A typical celaref workflow is below, characterising a query dataset’s cell clusters on the basis of transcriptomic similarity to a annotated reference dataset.

Prepare dataset

To compare scRNAseq datasets with celaref, two inputs are needed for each dataset:

  1. A counts matrix of number of reads per gene, per cell.
gene Cell1 cell2 cell3 cell4 cell954
GeneA 0 1 0 1 0
GeneB 0 3 0 2 2
GeneC 1 40 1 0 0
  1. Cluster assignment for each cell.
CellId Cluster
cell1 cluster1
cell2 cluster7
cell954 cluster8

See Input for details.

Cell clusters might be defined by any cell-clustering technique, such as those implemented in tools such as Seurat (Satija et al. 2015), cellRanger (10X genomics), SC3(Kiselev et al. 2017), among many others.

Within dataset differential expression

Every dataset, whether a query or a reference, is prepared the same way. For each cluster, cells within that cluster are compared to the rest of the cells pooled together, calculating differential gene expression using MAST (Finak et al. 2015). Because of the low counts and potential drop-out issues in single cell RNAseq data, only genes enriched in each cluster are considered. For every cluster – cells are ranked from most to least enriched according to their lower 95% CI of fold-change. Each gene is assigned a ‘rescaled rank’ from 0 (most enriched) to 1 (most absent).

That this step is the most time consuming, but only needs to be done once per dataset.

Query-Reference comparison

A list of ‘Up’ genes are extracted for each query cluster – defined as those that have significantly higher expression in that cluster versus the rest of the sample (p<0.01 after BH multiple hypothesis correction). The ‘Up’ gene list is capped at the top 100 (ranked by lower 95% FC). Then, those genes are looked up in the ranking of genes in each reference cell cluster. The distribution of these ‘up gene’ ranks is plotted to evaluate similarity of the query cell-group to a reference cell-group.

Output plots are described here.


Interpreting output

Typically, every cell cluster in the query data (each box) is plotted against everything in the reference data (X-axis). Each of the ‘up’ genes is represented by a tick mark, and the median generank is shown as a thick bar. A biased distribution near the top (i.e.. rescaled rank of 0) indicates similarity of the groups – essentially the same genes are representative of the clusters within their respective samples.

  • A median gene rank of 0.5 would indicate a completely random distribution. However, much lower values are common. The reciprocal nature of the within-dataset differential expression can cause this - what’s up in one cluster is down in another.

  • A small or heterogeneous cell group will not have much statistical power to select many ‘top’ genes (few tick marks) and these distributions will not be particuarly informative. If there are no ‘top’ genes it won’t be plotted at all.

  • Because ‘top’ genes are compared to total reference rankings - the comparison between two datasets is not symmetrical. In ambiguous cases, it might helpful to plot the reverse comparison from reference to query. Note that these receiprocal comparisons are considered in Assigning labels to clusters. For instance - if a query cluster happens to be a mix of two reference cell groups, a reciprocal plot may make this more obvious.

Assigning labels to clusters

Lastly, there is a function to suggest some semi-sensible query cluster labels.

The first 4 columns of output (below) are the most interesting, the rest are described at bottom of section. The suggested cluster label is in the shortlab column. e.g.

test_group shortlab pval stepped_pvals
cluster_1 cluster_1:astrocytes_ependymal 2.98e-23 astrocytes_ependymal:2.98e-23,microglia:0.208,interneurons:0.1,pyramidal SS:0.455,endothelial-mural:0.0444,oligodendrocytes:NA
cluster_2 cluster_2:endothelial-mural 8.44e-10 endothelial-mural:8.44e-10,microglia:2.37e-06,astrocytes_ependymal:0.000818,interneurons:0.435,oligodendrocytes:0.245,pyramidal SS:NA
cluster_3 cluster_3:no_similarity NA astrocytes_ependymal:0.41,microglia:0.634,oligodendrocytes:0.305,endothelial-mural:0.512,interneurons:0.204,pyramidal SS:NA
cluster_4 cluster_4:microglia 2.71e-19 microglia:2.71e-19,interneurons:0.435,pyramidal SS:0.11,endothelial-mural:0.221,astrocytes_ependymal:0.627,oligodendrocytes:NA
cluster_5 cluster_5:pyramidal SS|interneurons 3.49e-10 pyramidal SS:0.362,interneurons:3.49e-10,endothelial-mural:0.09,astrocytes_ependymal:0.0449,microglia:7.68e-19,oligodendrocytes:NA
cluster_6 cluster_6:oligodendrocytes 2.15e-28 oligodendrocytes:2.15e-28,interneurons:0.624,astrocytes_ependymal:0.207,endothelial-mural:0.755,microglia:0.0432,pyramidal SS:NA

There can be none, one or multiple reference group similarities for the query group. This is expected when there are similar cell sub-populations in the reference data. This can be visualised throught the relative shapes of the top gene distribution for the reference group, and reference group similarity labels are calculated as follows:

These labels are based on the distributions of the ranks of the query cluster’s ‘top’ genes in each of the reference groups (as plotted in the violin plots), rescaled to be in the 0-1 range.

  1. The median gene rank for the ‘top’ genes in each reference group is calculated.

  2. Reference groups are ordered from most to least similar (ascending median rank).

  3. Mann-Whitney U tests are calculated between the adjacent reference groups - ie. 1st-2nd most similar, 2nd-3rd, 3rd-4th e.t.c. These are the stepped_pvals reported above - the last value will always be undefined NA. Essentially this is testing if the ‘top’ genes representative of the query group are significantly lower ranked (more similar) in one reference group vs the next most similar reference group. A genuine similarity of cell types should result in an abrupt change in these gene rank distributions.

  4. Initial calls are made on which reference groups to include in the group label.

      1. If there is a significant difference (p < 0.01) between the first and second most similar groups - report on the top ranked reference group. The pval in this case will be the first stepped pvalue (which is significant).
      1. If there are no significant differences between any steps, report a lack of similarity. No p-value (NA) is reported - because no call could be made on what group(s) are more similar. Check the stepped_pvals to look at borderline cases.
      1. Otherwise, start at the most similar reference group, and report everything up to the first signifcantly different ranking. Effectively, this reports multiple reference group similarities when a call can’t be made between them e.g. subtypes, or a mixed query group. Here the reported pval describes the ‘jump’ between the bottom ranked similar reference group, and the first of the other groups. e.g. the 2nd or 3rd stepped pvalue.
  5. The group assignment from step 4 is checked to ensure that the (median of the) gene ranks is significantly above a random distribution. Ie. above the 0.5 halfway point in the violin plots.

    • The distribution of the ‘top’ genes in the matched group (or last-ranked matching group when there are multiple matches) is tested to see if its median rank is significantly above 0.5 (sign test, using one-tailed binomial test on ranks < 0.5). If this fails (<0.01), ‘No_similarity’ will be reported, even if the p-value is significant. See Group1 and Group2 results in the quickstart example (occurs due to being only a 200 gene test). *B: This means that a significant ‘pval’ column can sometime occur on a ‘No_similarity’ label - but the pval_vs_random column (described below) will be non-significant.
    • This test exists for cases of significant difference from an ‘anti-some-other-celltype’ signature with genes distributed at the low end. That can occur when there are uneven gene cluster sizes and no genuine match for the query.
  6. Reciprocal-only matches are added to cluster labels in brackets.

    • e.g. c1:celltypeA(celltypeB) or c2:(celltypeC)
    • The cluster groups are reciprocally tested against the query groups. All reciprocal matches are listed in the reciprocal_matches column. If there is a recriprocal match that isn’t in the query->ref match list, its added on to the end of the cluster label.
    • Reciprocal matches reported in the cluster names might indicate a mixed query cell type. It woudl be a good idea to look at the reciprocalviolin plot of the reference dataset vs the query.

The full version of this table is:

test_group shortlab pval stepped_pvals pval_to_random matches reciprocal_matches similar_non_match similar_non_match_detail differences_within
astrocytes astrocytes:astrocytes_ependymal 2.98e-23 astrocytes_ependymal:2.98e-23,microglia:0.208,interneurons:0.1,pyramidal SS:0.455,endothelial-mural:0.0444,oligodendrocytes:NA 1.00e-21 astrocytes_ependymal astrocytes_ependymal
endothelial endothelial:endothelial-mural 8.44e-10 endothelial-mural:8.44e-10,microglia:2.37e-06,astrocytes_ependymal:0.000818,interneurons:0.435,oligodendrocytes:0.245,pyramidal SS:NA 3.55e-21 endothelial-mural endothelial-mural
hybrid hybrid:No similarity astrocytes_ependymal:0.41,microglia:0.634,oligodendrocytes:0.305,endothelial-mural:0.512,interneurons:0.204,pyramidal SS:NA
microglia microglia:microglia 2.71e-19 microglia:2.71e-19,interneurons:0.435,pyramidal SS:0.11,endothelial-mural:0.221,astrocytes_ependymal:0.627,oligodendrocytes:NA 3.54e-16 microglia microglia
neurons neurons:pyramidal SS|interneurons 3.49e-10 pyramidal SS:0.362,interneurons:3.49e-10,endothelial-mural:0.09,astrocytes_ependymal:0.0449,microglia:7.68e-19,oligodendrocytes:NA 2.19e-12 pyramidal SS|interneurons interneurons|pyramidal SS
oligodendrocytes oligodendrocytes:oligodendrocytes 2.15e-28 oligodendrocytes:2.15e-28,interneurons:0.624,astrocytes_ependymal:0.207,endothelial-mural:0.755,microglia:0.0432,pyramidal SS:NA 4.72e-20 oligodendrocytes oligodendrocytes

The next few columns of the ouput describe some of the heuristics used in the cluster labelling.

  • pval_to_random : P-value of test of median rank (of last matched reference group) < random, from binomial test on top gene ranks (<0.5). If this isn’t signiicant, ‘No similarity’ will be reported. A completely random distribution would have a median rank in the middle of the violin plots, at 0.5.

  • matches : List of all reference groups that ‘match’, as described, except it also includes (rare) examples where pval_to_random is not significant. “|” separated, in descending order of match.

  • reciprocal_matches : List of all reference groups that flagged test group as a match when directon of comparison is reversed. (significant pval and pval_to_random). “|” separated, in descending order of match.

The last 3 columns of the output are usually empty. When defined they may indicate borderline labelling or edge cases - checking the violin plots is advised! Tests are again Mann-Whitney U, but on non-adjacently ranked groups.

  • similar_non_match : This column lists any reference groups outside of the reported label match that are not signifcantly different to a reported match group. E.g If a query matches group A (because A is significantly different to the second-ranked B); if A is then not significantly different to the third -ranked C, then C will be reported here. This is more likely with smaller sets of ‘top genes’, for instance in Group3 of the quickstart example. Examples:
    • NA [Nothing]
    • C
    • microglia|interneurons
  • similar_non_match_detail: P-values for any details about similar_non_match results. These p-values will always be non-significant (but again may be borderline). Examples:
    • NA [Nothing]
    • A > C (p=0.0214,n.s)
    • dunno > Exciting (p=0.0104,n.s)
  • differences_within: This feild lists any pairs of matched groups (in shortlab) that are significantly different. This could happen if there are many matched groups listed in shortlab, that are all slightly different. Examples
    • NA [Nothing]
    • B > E (p=0.004)|C > E (p=0.009)

Using the package


The bioconductor landing page with information about this package is at https://bioconductor.org/packages/celaref

To install from bioconductor via BiocManager

# Installing BiocManager if necessary:
# install.packages("BiocManager")

Or, to use the dev version from github

# Or


Suppose there’s a new scRNAseq dataset (the query), whose cells have already been clustered into 4 groups : Groups 1-4. But we don’t know which group corresponds to which cell type yet.

Luckily, there’s an older dataset (the reference) of the same tissue type in which someone else has already determined the cell types. They very helpfully named them ‘Weird subtype’, ‘Exciting’, ‘Mystery cell type’ and ‘Dunno’.

This example uses the reference dataset to flag likely cell types in the new experiment.

It is a tiny simulated dataset (using splatter (Zappia, Phipson, and Oshlack 2017)) of 200 genes included in the package that can be copy-pasted, and will complete fairly quickly.


# Paths to data files.
counts_filepath.query    <- system.file("extdata", "sim_query_counts.tab",    package = "celaref")
cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref")
counts_filepath.ref      <- system.file("extdata", "sim_ref_counts.tab",      package = "celaref")
cell_info_filepath.ref   <- system.file("extdata", "sim_ref_cell_info.tab",   package = "celaref")

# Load data
toy_ref_se   <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref)
toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)

# Filter data
toy_ref_se     <- trim_small_groups_and_low_expression_genes(toy_ref_se)
toy_query_se   <- trim_small_groups_and_low_expression_genes(toy_query_se)

# Setup within-experiment differential expression
de_table.toy_ref   <- contrast_each_group_to_the_rest(toy_ref_se,    dataset_name="ref")
de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se,  dataset_name="query")
# Plot
make_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)
#> Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
#> ℹ Please use the `fun` argument instead.
#> ℹ The deprecated feature was likely used in the celaref package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `fun.ymin` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
#> ℹ Please use the `fun.min` argument instead.
#> ℹ The deprecated feature was likely used in the celaref package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `fun.ymax` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
#> ℹ Please use the `fun.max` argument instead.
#> ℹ The deprecated feature was likely used in the celaref package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.