% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/DATAprepSE.R
\name{DATAprepSE}
\alias{DATAprepSE}
\title{Data preparation for exploratory and statistical analysis
(Main Function)}
\usage{
DATAprepSE(
  RawCounts,
  Column.gene,
  Group.position,
  Time.position,
  Individual.position,
  colData = NULL,
  VARfilter = 0,
  SUMfilter = 0,
  RNAlength = NULL
)
}
\arguments{
\item{RawCounts}{Data.frame with \eqn{N_g} rows and (\eqn{N_{s}+k}) columns,
where \eqn{N_g} is the number of genes,
\eqn{N_s} is the number of samples and
\eqn{k=1} if a column is used to specify gene names, or \eqn{k=0} otherwise.
If \eqn{k=1}, the position of the column containing gene names is given
by \code{Column.gene}.
The data.frame contains non-negative integers giving the expression of
each gene in each sample.
Column names of the data.frame must describe each sample's information
(individual, biological condition and time) and have the structure described
in the section \code{Details}.}

\item{Column.gene}{Integer indicating the column where gene names are given.
Set \code{Column.gene=NULL} if there is no such column.}

\item{Group.position}{Integer indicating the position of group information
in the string of characters of each sample name (see \code{Details}).
Set \code{Group.position=NULL} if the sample names contain no biological
condition information or if there is only one biological condition.}

\item{Time.position}{Integer indicating the position of time measurement
information in the string of characters of each sample name
(see \code{Details}).
Set \code{Time.position=NULL} if the sample names contain no time
measurement information or if there is only one time measurement.}

\item{Individual.position}{Integer indicating the position of the name of
the individual (e.g. patient, replicate, mouse, yeast culture, ...)
in the string of characters of each sample name (see \code{Details}).
Different individuals must have different names.
Furthermore, if individual names are just numbers, they will be transformed
into a vector of class "character" by
\code{\link[=CharacterNumbers]{CharacterNumbers()}} and
an "r" will be added to each individual name ("r" for replicate).}

\item{colData}{\code{NULL} or data.frame with \eqn{N_s} rows and two or
three columns describing the samples. \code{NULL} as default.
Optional input (see \code{Details}).
If \code{Group.position}, \code{Time.position} and
\code{Individual.position} are filled, set \code{colData=NULL}.
\itemize{
\item If samples belong to different time points and different biological
conditions
\itemize{
\item The first column must contain the biological condition of each sample.
The column name must be "Group".
\item The second column must contain the time measurement of each sample.
The column name must be "Time".
\item The third column must contain the individual name of each sample.
The column name must be "ID".
}
\item If samples belong either to different time points or to different
biological conditions (but not both)
\itemize{
\item The first column must contain either the biological condition
of each sample or the time measurement of each sample.
The column name must be either "Group" or "Time".
\item The second column must contain the individual name of each sample.
The column name must be "ID".
}
}}

\item{VARfilter}{Non-negative numeric value, 0 as default.
All rows of \code{RawCounts} whose variance of counts is strictly below
the threshold \code{VARfilter} are deleted.}

\item{SUMfilter}{Non-negative numeric value, 0 as default.
All rows of \code{RawCounts} whose sum of counts is strictly below
the threshold \code{SUMfilter} are deleted.}

\item{RNAlength}{\code{NULL} or "hsapiens" or data.frame with two columns.
\code{NULL} as default.
\itemize{
\item if \code{RNAlength} is a data.frame
\itemize{
\item the first column must contain gene names
(similar to those of \code{RawCounts})
\item the second column must contain the median transcript length
of each gene of the first column.
All rows of \code{RawCounts} whose genes are not included in
the first column of \code{RNAlength} will be deleted.
}
\item if \code{RNAlength=NULL}, no rows will be deleted.
}

If \code{RNAlength} is either "hsapiens" or a data.frame,
\code{Column.gene} cannot be \code{NULL}.}
}
\value{
The function returns a SummarizedExperiment object containing
all the information needed for exploratory (unsupervised) analysis and
for differential expression (DE) statistical analysis.
}
\description{
This function automatically creates a SummarizedExperiment (SE) object
from raw count data in order to store
\itemize{
\item information for exploratory (unsupervised) analysis, using the R function
\code{\link[SummarizedExperiment:SummarizedExperiment-class]{SummarizedExperiment::SummarizedExperiment()}}
\item a DESeq2 object, built from the raw count data, which holds all information
for statistical (supervised) analysis, using the R function
\code{\link[DESeq2:DESeqDataSet]{DESeq2::DESeqDataSetFromMatrix()}}.
}
}
\details{
The column names of \code{RawCounts} must be a character vector containing
\itemize{
\item a string of characters (if \eqn{k=1}) which is the label of the column
containing gene names.
\item \eqn{N_s} sample names, which must be strings of characters containing
at least: the name of the individual (e.g. patient, mouse, yeast culture),
its biological condition (if there are at least two) and
the time at which the data were collected (if there are at least two
time points). The time must be written 't0', 'T0' or '0' for time 0,
't1', 'T1' or '1' for time 1, and so on.
}

All this sample information must be separated by underscores
in the sample name. For instance, 'CLL_P_t0_r1'
corresponds to the patient 'r1', who belongs to the biological condition 'P'
and whose data were collected at time 't0'.
In this example, 'CLL' describes the type of cells
(here chronic lymphocytic leukemia) and is not used in our analysis.

In the string of characters 'CLL_P_t0_r1',
'r1' is located after the third underscore,
so \code{Individual.position=4},
'P' is located after the first underscore, so \code{Group.position=2} and
't0' is located after the second underscore, so \code{Time.position=3}.
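
These positions can be checked by splitting a sample name on underscores.
The lines below are only a minimal base R sketch of the naming convention,
not the internal code of \code{DATAprepSE()}, and the object name
\code{NameParts} is illustrative.
\preformatted{
NameParts <- strsplit("CLL_P_t0_r1", split="_", fixed=TRUE)[[1]]
NameParts[2]  # "P"  gives Group.position=2
NameParts[3]  # "t0" gives Time.position=3
NameParts[4]  # "r1" gives Individual.position=4
}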

If the sample names do not contain all this information separated by
underscores, the user can build the data.frame
\code{colData} describing the samples.
}
\examples{
## Simulated design: 2 biological conditions, 9 time points, 6 individuals
BgCdEx <- rep(c("P", "NP"), each=27)
TimeEx <- rep(paste0("t", seq_len(9) - 1), times=6)
IndvEx <- rep(paste0("pcl", seq_len(6)), each=9)

## Sample names with the structure 'Group_Individual_Time'
SampleNAMEex <- paste(BgCdEx, IndvEx, TimeEx, sep="_")
## Random raw counts for 10 genes; the first column contains gene names
RawCountEx <- data.frame(Gene.name=paste0("Name", seq_len(10)),
                         matrix(sample(seq_len(100),
                                       length(SampleNAMEex)*10,
                                       replace=TRUE),
                                ncol=length(SampleNAMEex), nrow=10))
colnames(RawCountEx) <- c("Gene.name", SampleNAMEex)
##------------------------------------------------------------------------##
resDATAprepSE <- DATAprepSE(RawCounts=RawCountEx,
                            Column.gene=1,
                            Group.position=1,
                            Time.position=3,
                            Individual.position=2)
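##
## Hedged sketch of what 'VARfilter' and 'SUMfilter' are documented to do,
## written in base R on a toy matrix. It does not reproduce the internal
## code of DATAprepSE(); 'ToyCounts', 'VarThr', 'SumThr' and 'KeepRow' are
## illustrative names, and applying both filters together is an assumption.
ToyCounts <- matrix(c(0, 0, 0, 0,
                      5, 5, 5, 5,
                      1, 9, 2, 8),
                    nrow=3, byrow=TRUE,
                    dimnames=list(paste0("Name", seq_len(3)), NULL))
VarThr <- 1
SumThr <- 10
## Rows whose variance or sum is strictly below its threshold are dropped
KeepRow <- apply(ToyCounts, 1, var) >= VarThr & rowSums(ToyCounts) >= SumThr
ToyCounts[KeepRow, , drop=FALSE]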
##
## colDataEx <- data.frame(Group=BgCdEx, Time=TimeEx, ID=IndvEx)
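##
## Hedged sketch of the 'colData' alternative described in Details, for
## sample names that do not follow the underscore convention.
## The objects ending in 'Sk' are illustrative names only.
GroupSk <- rep(c("P", "NP"), each=27)
TimeSk <- rep(paste0("t", seq_len(9) - 1), times=6)
IDSk <- rep(paste0("pcl", seq_len(6)), each=9)
colDataSk <- data.frame(Group=GroupSk, Time=TimeSk, ID=IDSk)
## One row per sample, with the column names required by the documentation
stopifnot(nrow(colDataSk) == 54,
          identical(colnames(colDataSk), c("Group", "Time", "ID")))
## Assumption (not checked against the package): the three position
## arguments are set to NULL when 'colData' is supplied.
## resSk <- DATAprepSE(RawCounts=RawCountEx,
##                     Column.gene=1,
##                     Group.position=NULL,
##                     Time.position=NULL,
##                     Individual.position=NULL,
##                     colData=colDataSk)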
}
\seealso{
The \code{\link[=DATAprepSE]{DATAprepSE()}} function
\itemize{
\item is used by the following functions of our package:
\code{\link[=DATAnormalization]{DATAnormalization()}},
\code{\link[=DEanalysisGlobal]{DEanalysisGlobal()}}.
\item calls the R functions
\code{\link[DESeq2:DESeqDataSet]{DESeq2::DESeqDataSetFromMatrix()}}
to create the DESeq2 object and
\code{\link[SummarizedExperiment:SummarizedExperiment-class]{SummarizedExperiment::SummarizedExperiment()}}
to create the SummarizedExperiment object.
}
}
