% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/CB2FindCell.R
\name{CB2FindCell}
\alias{CB2FindCell}
\title{Main function to distinguish real cells from empty droplets using a
clustering-based Monte-Carlo test}
\usage{
CB2FindCell(
  RawDat,
  FDR_threshold = 0.01,
  lower = 100,
  upper = NULL,
  GeneExpressionOnly = TRUE,
  Ncores = 2,
  TopNGene = 30000,
  verbose = TRUE
)
}
\arguments{
\item{RawDat}{Matrix. Supports standard matrix or sparse matrix. 
This is the raw feature-by-barcode count matrix.}

\item{FDR_threshold}{Numeric between 0 and 1. Default: 0.01. 
The False Discovery Rate (FDR) to be controlled for multiple testing.}

\item{lower}{Positive integer. Default: 100. All barcodes 
whose total count is below or equal to this threshold are defined as 
background empty droplets. They are used to estimate the background 
distribution. The remaining barcodes are tested against the background 
distribution. If sequencing depth is deliberately made higher (lower)
than usual, this threshold can be raised (lowered) correspondingly to 
obtain a reasonable number of cells. Recommended sequencing depth for this 
default threshold: 40,000 to 80,000 reads per cell.}

\item{upper}{Positive integer. Default: \code{NULL}. This is the upper 
threshold for large barcodes. All barcodes whose total counts are greater 
than or equal to the upper threshold are directly classified as real cells 
prior to testing. If \code{upper = NULL}, the knee point of the log-rank 
curve of barcode total counts serves as the upper threshold, which is 
calculated using the method in package \code{DropletUtils}. If 
\code{upper = Inf}, no barcodes are classified as real cells prior to 
testing. If manually specified, it should be greater than the pooling 
threshold \code{lower}.}

\item{GeneExpressionOnly}{Logical. Default: \code{TRUE}. For 10x Cell Ranger 
version >=3, extra features (surface proteins, cell multiplexing oligos, etc.) 
are measured simultaneously alongside genes. If 
\code{GeneExpressionOnly = TRUE}, only genes 
are used for testing. Removing extra features is recommended
because the default pooling threshold (100) is chosen only for handling 
gene expression. The expression level of extra features differs greatly
from that of genes. If the default pooling threshold is used 
while extra features are kept, the estimated background distribution
will be heavily biased and will not reflect the real background distribution 
of empty droplets.}

\item{Ncores}{Positive integer. Default: 2. 
Number of cores for parallel computation.}

\item{TopNGene}{Positive integer. Default: 30000. 
Number of top highly expressed genes to use. This threshold avoids 
a high number of false positives in ultra-high-dimensional datasets, 
e.g. 10x barnyard data.}

\item{verbose}{Logical. Default: \code{TRUE}. If \code{verbose = TRUE}, 
progress messages will be printed.}
}
\value{
An object of class \code{SummarizedExperiment}. The slot 
\code{assays} contains the count matrix of real cell barcodes identified by 
the cluster-level test and the single-barcode-level test, plus large 
barcodes that exceed the upper threshold. The slot \code{metadata} contains
(1) testing statistics (Pearson correlation to the background) for all 
candidate barcode clusters, (2) barcode IDs for all candidate barcode 
clusters, where the name of each cluster is its median barcode size, 
(3) testing statistics (log likelihood under the background distribution) 
for the remaining single barcodes that were not clustered, and (4) the 
background distribution count vector without Good-Turing correction.
}
\description{
The main function of the \code{scCB2} package. Distinguishes real cells 
from empty droplets using a clustering-based Monte-Carlo test.
}
\details{
Input data is a feature-by-barcode matrix. Background barcodes are 
defined based on \code{lower}. Large barcodes are 
automatically treated as real cells based on \code{upper}. The remaining 
barcodes are first clustered into subgroups, then 
tested against the background using Monte-Carlo p-values simulated from 
a multinomial distribution. Barcodes that were not clustered are further 
tested using EmptyDrops (Aaron T. L. Lun \emph{et al.}, 2019).
FDR is controlled based on \code{FDR_threshold}.

This function supports parallel computation. \code{Ncores} specifies the
number of cores.

Under Cell Ranger version >=3, extra features other than genes are 
simultaneously measured (e.g. surface proteins, cell multiplexing oligos). 
We recommend filtering them out using 
\code{GeneExpressionOnly = TRUE} because the expression of 
extra features is not on the same scale as gene expression counts.
If the default pooling threshold is used while extra features are kept, the 
estimated background distribution will be heavily biased and will not 
reflect the real background distribution of empty droplets. The resulting 
matrix will contain many barcodes that have almost zero gene expression
and relatively high extra feature expression, which are usually not useful 
for RNA-seq studies.
}
\examples{
# raw data, all barcodes
data(mbrainSub)
str(mbrainSub)

# run CB2 on the first 10000 barcodes
CBOut <- CB2FindCell(mbrainSub[,1:10000], FDR_threshold = 0.01, 
    lower = 100, Ncores = 2)
RealCell <- GetCellMat(CBOut, MTfilter = 0.05)

# real cells
str(RealCell)
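
# Illustrative sketch only (NOT the package's internal code): how the
# `lower` and `upper` thresholds described above split barcodes by their
# total count. The toy matrix and the names `toy`, `lower_toy` and
# `upper_toy` are made up for this illustration.
toy <- matrix(
    c(1, 0, 2,         # barcode A, total count 3
      40, 60, 50,      # barcode B, total count 150
      300, 500, 400),  # barcode C, total count 1200
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("A", "B", "C"))
)
total <- colSums(toy)
lower_toy <- 100   # same value as the default of `lower`
upper_toy <- 1000  # hypothetical value; by default it comes from the knee point

# total count <= lower: background empty droplets
background <- names(total)[total <= lower_toy]
# total count >= upper: classified as real cells without testing
large <- names(total)[total >= upper_toy]
# everything in between is tested against the background
tested <- setdiff(names(total), c(background, large))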

}
