\name{prunePairs}
\alias{prunePairs}

\title{Prune read pairs}

\description{Prune read pairs that represent potential artifacts in a Hi-C library.}

\usage{
prunePairs(file.in, param, file.out=file.in, max.frag=NA, min.inward=NA, min.outward=NA)
}

\arguments{
	\item{file.in}{a character string specifying the path to the index file produced by \code{\link{preparePairs}}}
	\item{param}{a \code{pairParam} object containing read extraction parameters}
	\item{file.out}{a character string specifying a path to an output index file}
	\item{max.frag}{an integer scalar specifying the maximum length of any sequenced DNA fragment}
	\item{min.inward}{an integer scalar specifying the minimum distance between inward-facing reads on the same chromosome}
	\item{min.outward}{an integer scalar specifying the minimum distance between outward-facing reads on the same chromosome}
}

\value{
An integer vector is invisibly returned, containing \code{total}, the total number of read pairs; \code{length}, the number of read pairs with fragment lengths greater than \code{max.frag}; \code{inward}, the number of inward-facing read pairs with gap distances less than \code{min.inward}; and \code{outward}, the number of outward-facing read pairs with gap distances less than \code{min.outward}.

Multiple data frame objects are also produced within the specified \code{file.out}, one for each corresponding data frame object in \code{file.in}.
For each object, the number of rows may be reduced due to the removal of read pairs corresponding to potential artifacts.
}

\details{
This function removes potential artifacts from the input index file, based on the coordinates of the reads in each pair.
It will then produce a new HDF5 file containing only the retained read pairs.

Non-\code{NA} values for \code{min.inward} and \code{min.outward} are designed to protect against dangling ends and self-circles, respectively.
This is particularly useful when restriction digestion is incomplete, as these structures then do not form within a single restriction fragment and cannot be identified earlier.
They can be removed by discarding inward- and outward-facing read pairs that are too close together.

A non-\code{NA} value for \code{max.frag} also protects against non-specific cleavage.
The fragment length here refers to the length of the actual DNA fragment used in sequencing, and is computed from the distance between each read and its nearest downstream restriction site.
Off-target cleavage will result in larger distances than expected.
However, \code{max.frag} should not be set for DNase Hi-C experiments, where there is no concept of non-specific cleavage.

Note the distinction between \emph{restriction} fragments and \emph{sequencing} fragments. 
The former is generated by pre-ligation digestion, and is of concern when choosing \code{min.inward} and \code{min.outward}.
The latter is generated by post-ligation shearing and is of concern when choosing \code{max.frag}.

Suitable values for each parameter can be chosen by examining the output of \code{\link{getPairData}}.
For example, a value for \code{min.inward} can be chosen by setting a suitable lower bound on the distribution of non-\code{NA} values for \code{insert} with \code{orientation==1}.

\code{prunePairs} respects any settings of \code{restrict}, \code{discard} and \code{cap} in the \code{pairParam} input object.
Read pairs will be correspondingly removed from the file if they lie outside of restricted chromosomes, lie within discarded regions or exceed the cap for a restriction fragment pair.
Note that \code{cap} will be ignored for DNase Hi-C experiments as this depends on an unknown bin size.
}

\examples{
hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
cuts <- readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
param <- pairParam(cuts)

# Note: don't save to a temporary file for actual data.
fout <- tempfile(fileext=".h5")
fout2 <- tempfile(fileext=".h5")
invisible(preparePairs(hic.file, param, fout))
x <- prunePairs(fout, param, fout2)

library(rhdf5)
h5read(fout2, "chrA/chrA")
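
# A hedged sketch of how diagnostics from getPairData() might guide the
# choice of thresholds (see Details). The summaries below are only for
# inspection; any cut-off chosen from them is a judgment call, and the
# toy data set here is too small to give meaningful distributions.
diags <- getPairData(fout, param)
intra <- !is.na(diags$insert)            # insert is NA for inter-chromosomal pairs
inward <- intra & diags$orientation==1L  # inward-facing read pairs
outward <- intra & diags$orientation==2L # outward-facing read pairs
summary(diags$insert[inward])            # candidate lower bound for min.inward
summary(diags$insert[outward])           # candidate lower bound for min.outward
summary(diags$length)                    # candidate upper bound for max.frag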

x <- prunePairs(fout, param, fout2, max.frag=50)
h5read(fout2, "chrA/chrA")

x <- prunePairs(fout, param, fout2, min.inward=50)
h5read(fout2, "chrA/chrA")

x <- prunePairs(fout, param, fout2, min.outward=50)
h5read(fout2, "chrA/chrA")
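
# A minimal sketch of pruning with a restricted pairParam object, assuming
# that the toy data contain a chromosome named "chrA" (as read above).
# Only read pairs on the restricted chromosomes should remain in the output.
param.restrict <- pairParam(cuts, restrict="chrA")
x <- prunePairs(fout, param.restrict, fout2)
x             # counts of total and removed read pairs
h5ls(fout2)   # list the objects retained in the pruned file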
}

\author{Aaron Lun}

\seealso{
	\code{\link{preparePairs}}, \code{\link{getPairData}}, \code{\link{squareCounts}}
}

\references{
Jin F et al. (2013). A high-resolution map of the three-dimensional chromatin interactome in human cells. \emph{Nature} doi:10.1038/nature12644.
}

\keyword{preprocessing}
