% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/refineSites.r
\name{refineSites}
\alias{refineSites}
\title{Adjust ChIP-seq Read Count Table}
\usage{
refineSites(counts, sites, flank = 250L, outputidx = rep(TRUE,
  nrow(counts)), gcrange = c(0.3, 0.8), emtrace = TRUE, plot = TRUE,
  model = c("nbinom", "poisson"), mu0 = 1, mu1 = 50, theta0 = mu0,
  theta1 = mu1, p = 0.2, converge = 1e-04, genome = "hg19",
  gctype = c("ladder", "tricube"))
}
\arguments{
\item{counts}{A count matrix in which each row corresponds to one element
of \code{sites} and each column corresponds to one sample. Each value
in the matrix is the read count for one site in one sample. Because this
function uses effective GC content, it is important, when counting
sequencing reads, to extend either the original reads or the original
\code{sites} so that reads whose 5' ends start in the \code{flank} regions
are included.}

\item{sites}{A GRanges object whose length equals the number of rows
in the \code{counts} matrix. Preferably, every range has the same
width; otherwise the mixture model treats sites that are not comparable
as equivalent, since wider ranges naturally contain more reads. It is
acceptable for a minority of ranges to differ in width, since the model is
fairly robust to outliers. It is also important that \code{sites} includes
both foreground and background regions in each sample; otherwise the
mixture model will fail to fit two components. If a large collection of
samples is provided, foreground sites in one sample may serve as
background in other samples, in which case manually selecting real
background sites is not necessary.}

\item{flank}{A non-negative integer specifying the flanking width of
ChIP-seq binding. This parameter allows reads to fall in the flanking
regions with a probability that decreases as the distance from the
binding region increases. It is used in the effective GC content
calculation.}

\item{outputidx}{A logical vector whose length equals the number
of rows in \code{counts}. It specifies which subset of the adjusted count
matrix should be output. This is useful if you have manually collected
background sites and want to export only the sites you care about.}

\item{gcrange}{A non-negative numeric vector of length 2 giving the range
of GC content used to filter regions for GC effect estimation.
For human, most regions have GC content between 0.3 and 0.8, which is
the default. Regions with GC content outside this range
are ignored. This range is critical when very few foreground regions
are selected for mixture model fitting, since outliers could drive the
regression lines. If possible, first make a scatter plot of counts
against GC content to choose this parameter. Alternatively,
select a narrower range, e.g. \code{c(0.35, 0.7)}, to avoid outlier
effects from both high and low GC-content regions.}

\item{emtrace}{A logical value. When TRUE (default), the trace of
log-likelihood changes across EM iterations is printed.}

\item{plot}{A logical value. When TRUE (default), the mixture
fitting plot is produced.}

\item{model}{A character string specifying the distribution model to be
used in generalized linear model fitting. The default is negative
binomial (\code{nbinom}); \code{poisson} is also supported.
For more details, see \code{gcEffects}.}

\item{mu0}{A non-negative numeric giving the initial read count signal for
background sites. It is used as the starting value of the background mean
for poisson/nbinom fitting.}

\item{mu1}{A non-negative numeric giving the initial read count signal for
foreground sites. It is used as the starting value of the foreground mean
for poisson/nbinom fitting.}

\item{theta0}{A non-negative numeric giving the initial shape parameter of
the negative binomial model for background sites. For more details, see
theta in the \code{\link[MASS]{glm.nb}} function.}

\item{theta1}{A non-negative numeric giving the initial shape parameter of
the negative binomial model for foreground sites. For more details, see
theta in the \code{\link[MASS]{glm.nb}} function.}

\item{p}{A non-negative numeric specifying the proportion of foreground
sites among all estimated sites. It is used as the starting value for
the EM algorithm.}

\item{converge}{A non-negative numeric specifying the termination
condition of the EM algorithm. The EM algorithm stops when the ratio of
the log-likelihood increment to the total log likelihood is less than or
equal to \code{converge}.}

\item{genome}{A \link[BSgenome]{BSgenome} object containing the sequences
of the reference genome that was used to align the reads, or the name of
this reference genome specified in a way that is accepted by the
\code{\link[BSgenome]{getBSgenome}} function defined in the \pkg{BSgenome}
software package. In the latter case, the corresponding BSgenome data
package must already be installed (see
\code{?\link[BSgenome]{getBSgenome}} in the \pkg{BSgenome} package for
the details).}

\item{gctype}{A character vector specifying the method used to calculate
effective GC content. The default, \code{ladder}, assumes a uniform
fragment distribution. A smoother method based on the tricube assumption
(\code{tricube}) is also available. However, \code{tricube} should not be
used if \code{flank} is too large.}
}
\value{
The count matrix after GC adjustment. The matrix values are no
longer integers.
}
\description{
For a given set of sites with the same or comparable width, the
read count table from multiple samples is adjusted for
potential GC effects. For each sample separately, GC effects are
estimated from the effective GC content and read counts of the sites
using generalized linear mixture models. The count
table is then adjusted based on the estimated GC effects.
It is important that the given sites include both foreground and
background regions; see \code{sites} below.
}
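\examples{
# A minimal usage sketch, not taken from the package sources. The sites
# and counts are simulated purely for illustration (hypothetical
# coordinates on chr1, with an assumed 400 foreground and 1600
# background sites). Running it assumes that GenomicRanges and the
# BSgenome data package for hg19 are installed, so it is wrapped in
# dontrun. Whether the mixture model fits well on simulated data like
# this is not guaranteed.
\dontrun{
library(GenomicRanges)
set.seed(1)

# 2000 equal-width sites, as recommended for the sites argument
starts <- seq(1e6, by = 5000, length.out = 2000)
sites <- GRanges("chr1", IRanges(start = starts, width = 200))

# Two simulated samples containing both foreground and background sites
fg <- rep(c(TRUE, FALSE), c(400, 1600))
counts <- sapply(1:2, function(i)
    rnbinom(length(sites), mu = ifelse(fg, 50, 1), size = 2))

# Adjust the counts, spelling out a few of the defaults from the usage
adj <- refineSites(counts, sites, flank = 250L, gcrange = c(0.3, 0.8),
    model = "nbinom", genome = "hg19", gctype = "ladder")
dim(adj)
}
}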
