\name{samplesize}
\alias{samplesize}
\title{FDR as a function of sample size}
\description{
This function tabulates the false discovery rate (FDR) for selecting differentially expressed genes as a function of sample size and cutoff level. Optionally, the same information is displayed as a plot.
}
\usage{
samplesize(n = seq(5, 50, by = 5), p0 = 0.99, sigma = 1, D, F0, F1, 
           paired = FALSE, crit, crit.style = c("top percentage", "cutoff"),
           plot = TRUE, local.show = FALSE, nplot = 100, ylim = c(0, 1), main,
           legend.show = FALSE, grid.show = FALSE, ...)
}
\arguments{
  \item{n}{sample size (as subjects per group)}
  \item{p0}{the proportion of non-differentially expressed genes}
  \item{sigma}{the standard deviation for the log expression values}
  \item{D}{assumed average log fold change (in units of \code{sigma}), by default 1; this is a shortcut for specifying a simple symmetrical alternative hypothesis through \code{F1}.}
  \item{F0}{the distribution of the log2 expression values under the null hypothesis; by default, this is normal with mean zero and standard deviation \code{sigma}, but mixtures of normals can be specified, see Details and Examples.}
  \item{F1}{the distribution of the log2 expression values under the alternative hypothesis; by default, this is an equal mixture of two normals with means \code{D} and -\code{D} and standard deviation \code{sigma}; mixtures of normals are again possible, see Details and Examples.}
  \item{paired}{logical value indicating whether this is the independent sample case (default) or the paired sample case.}
  \item{crit}{a vector of cutoff values for selecting differentially expressed
              genes; the interpretation depends on \code{crit.style}.}
  \item{crit.style}{indicates how differentially expressed genes are selected: either by a fixed cutoff level for the absolute value of the t-statistic or as a fixed percentage of the largest absolute t-statistics.}
  \item{plot}{logical value indicating whether to draw the plot}
  \item{local.show}{logical value indicating whether to show local or global false discovery rate (default: global).}
  \item{nplot}{number of points that are evaluated for the curves}
  \item{ylim}{the usual limits on the vertical axis}
  \item{main}{the main title of the plot}
  \item{legend.show}{logical value indicating whether to show a legend for the types of gene selection in the plot}
  \item{grid.show}{logical value indicating whether to draw grid lines showing the sample sizes \code{n} to be tabulated in the plot}
  \item{\dots}{the usual graphical parameters, passed to \code{plot}}
}
\details{
This function plots the FDR as a function of the sample size when comparing the expression of multiple genes between two groups of subjects. This is based on a model assuming that a proportion \code{p0} of genes is not differentially expressed (regulated) between groups, and that the remaining proportion 1-\code{p0} is. The logarithmized gene expression values of regulated and non-regulated genes are assumed to be generated by mixtures of normal distributions; these mixtures can be specified through the parameters \code{F0}, \code{F1} or \code{D}, and \code{sigma}; please see \code{TOC} for details on the model and the specification of the mixtures. By default, the null distribution of the log expression values is a normal centered on zero, and the alternative is an equal mixture of normals centered at \code{+D} and \code{-D}.

The list of nominally differentially expressed genes can be selected in two ways:
\itemize{
\item all genes with an absolute t-statistic larger than the specified critical cutoff value (\code{crit.style = "cutoff"}),
\item all genes that make up the specified critical top percentage of the largest absolute t-statistics (\code{crit.style = "top percentage"}).
}

Multiple critical values correspond to multiple curves, each labeled by the
critical value, but only one value can be specified for the proportion of
non-regulated genes \code{p0} and the standard deviation \code{sigma}.
}
\value{
A matrix with rows corresponding to elements of \code{n} and columns corresponding to the specified critical values is returned. The matrix has the attribute \code{param} that contains the specified arguments, see Examples.
}
\references{
Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A (2005) False Discovery Rate, Sensitivity and Sample Size for Microarray Studies. \emph{Bioinformatics}, 21, 3017-3024.

Jung SH (2005) Sample size for FDR-control in microarray data analysis. \emph{Bioinformatics}, 21, 3097-3104.}
\author{Y. Pawitan and A. Ploner}
\note{Both the curve labels and the legend may be squashed if the plotting device is too small. Increasing the size of the device and re-plotting should improve readability.}
\seealso{\code{\link{FDR}}, \code{\link{TOC}}, \code{\link{EOC}}}
\examples{
# Default assumes a proportion of 0.01 regulated genes equally split
# between two-fold up- and down-regulated
# We select the top 3, 2 and 1 percent of the largest absolute t-statistics
samplesize(crit=c(0.03,0.02, 0.01))

# Same model, but using a hard cutoff for the t-statistics
samplesize(crit=2:4, crit.style="cutoff")

# Paired test of the same size has slightly better FDR (as expected)
samplesize(paired=TRUE)

# Compare the effect of p0 and effect size
par(mfrow=c(2,2))
samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=1)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=1)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=2)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=2)

# An asymmetric alternative distribution: 20 percent of the regulated genes 
# are expected to be (at least) four-fold up regulated
# NB, no graphical output
ret = samplesize(F1=list(D=c(-1,1,2), p=c(2,2,1)), p0=0.95, crit=0.05, plot=FALSE)
ret
# Look at the parameters
attr(ret, "param")

# A wide null distribution that allows disregarding genes with small effects
# Here: |log2 fold change| < 0.25, i.e. fold change of less than 19 percent
samplesize(F0=list(D=c(-0.25,0,0.25)), grid.show=TRUE)

# This is close to Example 3 in Jung's paper (see References):
# p0=0.99 and sensitivity=0.6, so we want a rejection rate of 
# around 0.006 from the top list.
# Here we require around 40 arrays/group, compared to 
# around 37 in Jung's paper, most likely because we use 
# the t-distribution instead of normal. Jung's alternative 
# is only one-sided, so the exact correspondence is:
samplesize(p0=0.99, crit.style="top percentage", crit=0.006, F1=list(D=1, p=1), grid.show=TRUE)
abline(h=0.01)

# The result is very close to the symmetric alternatives: 
samplesize(p0=0.99, crit=0.006, D=1, grid.show=TRUE, ylim=c(0,0.9))
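
# A minimal base-R sketch of the kind of calculation behind the table,
# for the default model with a fixed t-statistic cutoff. This is NOT the
# package's own code: fdr_sketch is a hypothetical helper introduced only
# for illustration. It assumes an equal-variance two-sample t-test with
# n subjects per group (2n - 2 degrees of freedom), a central t under the
# null, and a noncentral t with noncentrality D * sqrt(n / 2) under the
# symmetric alternative, with D in units of sigma. Its values need not
# agree exactly with those returned by samplesize().
fdr_sketch <- function(n, crit, p0 = 0.99, D = 1) {
  df  <- 2 * n - 2          # degrees of freedom, independent samples
  ncp <- D * sqrt(n / 2)    # noncentrality of the t-statistic
  # probability of |T| > crit for a non-regulated gene
  P0 <- 2 * pt(-crit, df)
  # probability of |T| > crit for a regulated gene; by symmetry this is
  # the same for the +D and the -D component of the mixture
  P1 <- pt(-crit, df, ncp = ncp) + pt(crit, df, ncp = ncp, lower.tail = FALSE)
  # global FDR: expected share of non-regulated genes among selected genes
  p0 * P0 / (p0 * P0 + (1 - p0) * P1)
}
fdr_sketch(n = seq(5, 50, by = 5), crit = 3)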

}
\keyword{hplot}
\keyword{design}
