% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/KS.R
\name{calculate_ks}
\alias{calculate_ks}
\title{Calculate the Kolmogorov-Smirnov test statistic and q-values for differential gene expression
analysis.}
\usage{
calculate_ks(data, outcomes, nperm = 100, pairwise.p = FALSE, seq = FALSE,
  quantile.norm = FALSE, verbose = TRUE, parallel = TRUE)
}
\arguments{
\item{data}{A matrix containing genomics data (e.g. gene expression levels).
The rownames should contain gene identifiers, while the column names should
contain sample identifiers.}

\item{outcomes}{A vector containing group labels for each of the samples provided
in the \code{data} matrix. The names should be the sample identifiers provided in \code{data}.}

\item{nperm}{An integer specifying the number of randomly permuted KS
scores to be computed. Defaults to 100.}

\item{pairwise.p}{Boolean specifying whether pairwise p-values should be returned. The
p-values returned by \code{\link{ks.test}} are adjusted within each pairwise comparison using the
Benjamini-Hochberg (BH) method. Defaults to \code{FALSE}.}

\item{seq}{Boolean specifying whether the given data are RNA sequencing data that ought to be
normalized. Set to \code{TRUE} if passing transcripts per million (TPM) data or raw
data that are not scaled. If \code{TRUE}, the data will be normalized by first multiplying by 1E6, then adding
1, then taking the log base 2. If \code{FALSE}, the data will be handled as is (unless
\code{quantile.norm} is \code{TRUE}). Note that, as a distribution comparison function, K-S will
compute faster with scaled data. Defaults to \code{FALSE}.}

\item{quantile.norm}{Boolean specifying whether the data should be normalized by quantiles. If
\code{TRUE}, then the \code{\link[preprocessCore]{normalize.quantiles}} function is used.
Defaults to \code{FALSE}.}

\item{verbose}{Boolean specifying whether to display progress messages. Defaults to \code{TRUE}.}

\item{parallel}{Boolean specifying whether to use parallel processing via
the \pkg{BiocParallel} package. Defaults to \code{TRUE}.}
}
\value{
The function returns a \code{\link{KSomics}} object.
}
\description{
This is the only function needed when conducting an analysis using the Kolmogorov-Smirnov
(KS) algorithm. Analyses can also be conducted with the EMD algorithm using
\code{calculate_emd} or the Cramer-von Mises (CVM) algorithm using \code{calculate_cvm}.

The algorithm is used to compare genomics data between any number of groups.
Usually the data will be gene expression
values from array-based or sequence-based experiments, but data from other
types of experiments can also be analyzed (e.g. copy number variation).

Traditional methods like Significance Analysis of Microarrays (SAM) and Linear
Models for Microarray Data (LIMMA) use significance tests based on summary
statistics (mean and standard deviation) of the two distributions. This
approach tends to give non-significant results if the two distributions are
highly heterogeneous, which can be the case in many biological circumstances
(e.g. sensitive vs. resistant tumor samples).

The Kolmogorov-Smirnov test instead calculates a test statistic that is the maximum distance between
two empirical cumulative distribution functions (CDFs). Unlike the EMD score, the KS test statistic
summarizes only the maximum difference, whereas EMD considers the quantity and distance of all
differences.
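
As a sketch of this definition (the symbols \eqn{F_A}, \eqn{F_B} and \eqn{D} are notation
introduced here for illustration only), let \eqn{F_A} and \eqn{F_B} be the empirical CDFs of a
gene's values in two groups A and B. The two-sample KS statistic is then
\deqn{D = \sup_x |F_A(x) - F_B(x)|}{D = sup_x |F_A(x) - F_B(x)|}
so \eqn{D} ranges from 0 (identical empirical distributions) to 1 (the values of the two groups
do not overlap at all).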

The KS algorithm implemented in \pkg{EMDomics} has two main steps.
First, a matrix (e.g. of expression data) is divided into data for each of the groups.
Every possible pairwise KS score is then computed and stored in a table. The KS score
for a single gene is calculated by averaging all of the pairwise KS scores. If the user
sets \code{pairwise.p} to \code{TRUE}, then the p-values
from the KS test are adjusted using the Benjamini-Hochberg method and stored in a table.
Next, the labels for each of the groups are randomly
permuted a specified number of times, and a KS score for each permutation is
calculated. The median of the permuted scores for each gene is used as
the null distribution, and the False Discovery Rate (FDR) is computed for
a range of permissive to restrictive significance thresholds. The threshold
that minimizes the FDR is defined as the q-value, and is used to interpret
the significance of the KS score analogously to a p-value (e.g. q-value
< 0.05 = significant). The p-values returned by the KS test (and adjusted for multiple
significance testing) can be compared to the permutation-based q-values.
}
\examples{
# 100 genes, 100 samples
dat <- matrix(rnorm(10000), nrow=100, ncol=100)
rownames(dat) <- paste("gene", 1:100, sep="")
colnames(dat) <- paste("sample", 1:100, sep="")

# "A": first 50 samples; "B": next 30 samples; "C": final 20 samples
outcomes <- c(rep("A",50), rep("B",30), rep("C",20))
names(outcomes) <- colnames(dat)
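
# Illustration only (base R, not the package's internal code): the KS statistic
# for one gene between groups "A" and "B" is the largest vertical gap between
# the two empirical CDFs. Uses dat and outcomes defined above; the variable
# names gene1_A, gene1_B, grid, ks_by_hand and ks_base are hypothetical.
gene1_A <- dat["gene1", outcomes == "A"]
gene1_B <- dat["gene1", outcomes == "B"]
grid <- sort(c(gene1_A, gene1_B))  # the gap can only change at observed values
ks_by_hand <- max(abs(ecdf(gene1_A)(grid) - ecdf(gene1_B)(grid)))
ks_base <- unname(ks.test(gene1_A, gene1_B)$statistic)
all.equal(ks_by_hand, ks_base)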

results <- calculate_ks(dat, outcomes, nperm=10, parallel=FALSE)
head(results$ks)
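
# Illustration only (base R, not the package's internal code): a sketch of the
# per-gene score outlined in the description, i.e. the mean of all pairwise KS
# statistics. The helper name mean_pairwise_ks is hypothetical, and its result
# is not guaranteed to match the package output exactly.
mean_pairwise_ks <- function(x, groups) {
  # One column per unordered pair of group labels
  pairs <- combn(sort(unique(groups)), 2)
  stats <- apply(pairs, 2, function(p) {
    unname(ks.test(x[groups == p[1]], x[groups == p[2]])$statistic)
  })
  mean(stats)
}
mean_pairwise_ks(dat["gene1", ], outcomes)

# Sketch of a Benjamini-Hochberg adjustment within one pairwise comparison
# ("A" vs. "B"), applied across all genes with base R's p.adjust.
p_AB <- apply(dat, 1, function(x) {
  ks.test(x[outcomes == "A"], x[outcomes == "B"])$p.value
})
head(p.adjust(p_AB, method = "BH"))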
}
\seealso{
\code{\link{EMDomics}} \code{\link{ks.test}}
}

