% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/STAR.R
\name{STAR.align.single}
\alias{STAR.align.single}
\title{Align a single-end library or a paired-end pair with STAR}
\usage{
STAR.align.single(
  file1,
  file2 = NULL,
  output.dir,
  index.dir,
  star.path = STAR.install(),
  fastp = install.fastp(),
  steps = "tr-ge",
  adapter.sequence = "auto",
  quality.filtering = FALSE,
  min.length = 20,
  mismatches = 3,
  trim.front = 0,
  trim.tail = 0,
  max.multimap = 10,
  alignment.type = "Local",
  allow.introns = TRUE,
  max.cpus = min(90, BiocParallel::bpparam()$workers),
  resume = NULL,
  multiQC = FALSE,
  keep.contaminants = FALSE,
  keep.unaligned.genome = FALSE,
  keep.index.in.memory = FALSE,
  script.single = system.file("STAR_Aligner", "RNA_Align_pipeline.sh", package = "ORFik")
)
}
\arguments{
\item{file1}{library file, if paired must be R1 file. Allowed formats are:
(.fasta, .fastq, .fq, or .fa) with or without .gz compression. This filename usually
 contains a suffix of .1}

\item{file2}{default NULL, set if paired end to R2 file. Allowed formats are:
(.fasta, .fastq, .fq, or .fa) with or without .gz compression. This filename usually
 contains a suffix of .2}

\item{output.dir}{directory to save the output files. Has no default in this
function; aligned bam files are written to the "aligned" subfolder of this
directory.}

\item{index.dir}{path to STAR index folder. Path returned from ORFik function
STAR.index, when you created the index folders.}

\item{star.path}{path to STAR, default: STAR.install(),
if you don't have STAR installed at default location, it will install it there,
set path to a runnable star if you already have it.}

\item{fastp}{path to fastp trimmer, default: install.fastp(). If you
already have it installed somewhere else, give that path. Only works on
unix (Linux or Mac OS); if not on unix, use your favorite trimmer and
give the output files from that trimmer as file1 / file2 here.}

\item{steps}{a character, default: "tr-ge" (trimming, then genome alignment).\cr
 The depletion and alignment steps to run.
 The possible candidates you can use are:\cr
\itemize{
 \item{tr : trim reads}
 \item{co : contamination merged depletion}
 \item{ph : phiX depletion}
 \item{rR : rRNA depletion}
 \item{nc : ncRNA depletion}
 \item{tR : tRNA depletion (mature tRNA, so no intron checks done)}
 \item{ge : genome alignment}
 \item{all: run steps: "tr-co-ge" or "tr-ph-rR-nc-tR-ge", depending on if you
 have merged contaminants or not}
}
 If not "all", a subset of these ("tr-co-ph-rR-nc-tR-ge").\cr
 If co (merged contaminants) is used, none of the specific contaminants can be specified,
 since they should be a subset of co.\cr
 The genome alignment step is almost always included, unless you
 are doing pure contaminant analysis or only trimming.
 For Ribo-seq and TCP(RCP-seq) you should do rR (ribosomal RNA depletion),
 so the STAR index must have been built with the rRNA step, using either rRNA
 from the .gtf or a manual download
 (usually just download a Silva rRNA database
 for SSU&LSU for your species at: https://www.arb-silva.de/).}

\item{adapter.sequence}{character, default: "auto". Auto detect adapter using fastp
adapter auto detection, checking the first 1.5M reads. Auto detection of adapter will
not work 100\% of the time (e.g. if the library is of low quality). In that case you must
rerun this function with a specified adapter, found from the fastp adapter analysis,
FASTQC or other adapter detection tools, else alignment will most likely fail!
If already trimmed or trimming not wanted:
adapter.sequence = "disable". You can manually assign an adapter like:
"ATCTCGTATGCCGTCTTCTGCTTG" or "AAAAAAAAAAAAA". You can also specify one of the three
presets:\cr
\itemize{
 \item{illumina (TruSeq ~75/100 bp sequencing) : AGATCGGAAGAGC}
 \item{small_RNA (standard for ~50 bp sequencing): TGGAATTCTCGG}
 \item{nextera: CTGTCTCTTATA}
}
Paired end auto detection uses the overlap sequence of pairs. To use the slower,
more secure paired end adapter detection, specify as: "autoPE".}

\item{quality.filtering}{logical, default FALSE. Not needed for modern
library prep of RNA-seq, Ribo-seq etc (usually < ~ 0.5\% of reads are removed).
If you are aligning bad quality data, set this to TRUE.\cr
These filters will then be applied (default of fastp), filter if:
\itemize{
 \item{Number of N bases in read : > 5}
 \item{Read quality : > 40\% of bases in the read are <Q15}
}}

\item{min.length}{20, minimum length of aligned read without mismatches
to pass filter. Anything under 20 is dangerous, as chance of random hits will
become high!}

\item{mismatches}{3, max non matched bases. Excludes soft-clipping, this only
filters reads that have defined mismatches in STAR.
Only applies for genome alignment step.}

\item{trim.front}{0, default trim 0 bases on 5' ends.
Ignored if tr (trim) is not one of the arguments in "steps".
For Ribo-seq use default 0, unless you have 5' end custom barcodes to remove.
Alignment to STAR might fail if you have large barcodes, which are not removed!}

\item{trim.tail}{0, default trim 0 bases on 3' ends.
Ignored if tr (trim) is not one of the arguments in "steps".
For Ribo-seq use default 0, unless you have 3' end custom barcodes to remove.
Alignment to STAR might fail if you have large barcodes, which are not removed!}

\item{max.multimap}{numeric, default 10. If a read maps to more locations than specified,
the read is skipped. Set to 1 to only get uniquely mapping reads. Only applies to the
genome alignment step. The depletion steps allow multimapping.}

\item{alignment.type}{default: "Local": standard local alignment with soft-clipping allowed,
"EndToEnd" (global): force end-to-end read alignment, does not soft-clip.}

\item{allow.introns}{logical, default TRUE. Allow large gaps of N in reads
during genome alignment, if FALSE:
sets --alignIntronMax to 1 (no introns). NOTE: You will still get some spliced reads
if you assigned a gtf at the index step.}

\item{max.cpus}{integer, default: \code{min(90, BiocParallel::bpparam()$workers)},
number of threads to use. Default is the minimum of 90 and maximum cores - 2. So if you
have 8 cores it will use 6. Note: fastp will use a maximum of 16 threads, as from testing
I see performance actually degrades using anything higher. From testing I also see
STAR gets no performance gain after ~50 threads. I do suspect this will change
when hard drives get better in the future.}

\item{resume}{default: NULL, continue from step. Let's say steps are "tr-ph-ge"
(trim, phix depletion, genome alignment) and resume is "ge": you will then use
the assumed already trimmed and phix depleted data and start at genome alignment.
Useful if something crashed, like if you specified the wrong STAR version, but the trimming
step was completed. Resume mode can only run 1 step at a time.}

\item{multiQC}{logical, default FALSE. Do multiQC comparison of STAR
alignment between all the samples. Outputted in aligned/LOGS folder.
See ?STAR.multiQC}

\item{keep.contaminants}{logical, default FALSE. Create and keep
contaminant aligning bam files, default is to only keep unaligned fastq reads,
which will be further processed in "ge" genome alignment step. Useful if you
want to do further processing on contaminants, like specific coverage of
specific tRNAs etc.}

\item{keep.unaligned.genome}{logical, default FALSE. Create and keep
reads that did not align at the genome alignment step,
default is to only keep the aligned bam file. Useful if you
want to do further processing on plasmids/custom sequences.}

\item{keep.index.in.memory}{logical or character, default FALSE (i.e. LoadAndRemove).
For STAR.align.single:\cr
If TRUE, will keep index in memory, useful if you need to loop over single calls,
instead of using STAR.align.folder (remember last run should use FALSE, to remove index).
For STAR.align.folder:\cr
Only applies to last library, will always keep for all libraries before last.
An alternative, especially useful for Mac machines, is "noShared", for machines
that do not support a shared memory index (these usually give the error: "abort trap 6").}

\item{script.single}{location of STAR single file alignment script,
default internal ORFik file. You can change it and give your own if you
need special alignments.}
}
\value{
output.dir, can be used as input in ORFik::create.experiment
}
\description{
Given a single NGS fastq/fasta library, or a paired setup of 2 mated
libraries, run any combination of fastq trimming, contamination removal and
genome alignment. Works on Linux, Mac and WSL (Windows Subsystem for Linux).
}
\details{
Can only run on unix systems (Linux, Mac and WSL (Windows Subsystem for Linux)),
and requires a minimum of 30GB memory on genomes like human, rat, zebrafish etc.\cr
If for some reason the internal STAR alignment bash script will not work for you,
for example if you want more customization of the STAR/fastp arguments,
you can copy the internal alignment script,
edit it and give that as the script used for this function (argument script.single).\cr
The trimmer used is fastp (the fastest I could find), which also works on
Linux, Mac and WSL.
If you want to use your own trimmer, set file1/file2 to the location of
the trimmed files from your program.\cr
A note from the creator of STAR about trimming:
"adapter trimming it definitely needed for short RNA sequencing.
For long RNA-seq, I would agree with Devon that in most cases adapter trimming
is not advantageous, since, by default, STAR performs local (not end-to-end) alignment,
i.e. it auto-trims." So trimming can be skipped for longer reads.
}
\examples{

## Specify output libraries (using temp config)
config_file <- tempfile()
#config.save(config_file, base.dir = tempdir())
#config <- ORFik::config(config_file)
#project <- ORFik::config.exper("yeast_1", "Saccharomyces_cerevisiae", "RNA-seq", config)
# Get genome of yeast (quite small)
# arguments <- getGenomeAndAnnotation("Saccharomyces cerevisiae", project["ref"])
# index <- STAR.index(arguments)
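
## Sketch, not part of ORFik: a sanity check of a "steps" string, following
## the rules described for the steps argument. check_steps is a hypothetical
## helper made up for illustration; it does not claim to mirror ORFik's
## internal checks.
check_steps <- function(steps) {
  valid <- c("tr", "co", "ph", "rR", "nc", "tR", "ge")
  if (identical(steps, "all")) return(TRUE)
  parts <- strsplit(steps, "-", fixed = TRUE)[[1]]
  # Every part must be one of the known step codes
  if (length(setdiff(parts, valid)) > 0) return(FALSE)
  # co (merged contaminants) cannot be combined with specific contaminants
  specific <- intersect(parts, c("ph", "rR", "nc", "tR"))
  if (is.element("co", parts) && length(specific) > 0) return(FALSE)
  TRUE
}
check_steps("tr-ge")       # TRUE, the default: trim, then genome alignment
check_steps("tr-co-rR-ge") # FALSE, co combined with rR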

## Make fake reads
#genome <- readDNAStringSet(arguments["genome"])
#which_chromosomes <- sample(seq_along(genome), 1000, TRUE, prob = width(genome))
#nt50_windows <- lapply(which_chromosomes, function(x)
# {window <- sample(width(genome[x]) - 51, 1); genome[[x]][seq(window, window+49)]})
#nt50_windows <- DNAStringSet(nt50_windows)
#names(nt50_windows) <- paste0("read_", seq_along(nt50_windows))
#dir.create(project["fastq RNA-seq"], recursive = TRUE)
#fake_fasta <- file.path(project["fastq RNA-seq"], "fake-RNA-seq.fasta")
#writeXStringSet(nt50_windows, fake_fasta, format = "fasta")
## Align the fake reads and import bam
# STAR.align.single(fake_fasta, NULL, project["bam RNA-seq"], index, steps = "ge")
#bam_file <- list.files(file.path(project["bam RNA-seq"], "aligned"),
#  pattern = "\\.bam$", full.names = TRUE)
#fimport(bam_file)
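
## Sketch: looping over single calls while keeping the STAR index in memory.
## As described for keep.index.in.memory, every call except the last uses TRUE
## and the last uses FALSE, so the index is removed at the end.
## The file names below are hypothetical placeholders, and the loop itself is
## commented out since it needs STAR and an index (see above).
fastq_files <- c("lib1.fastq.gz", "lib2.fastq.gz", "lib3.fastq.gz")
keep_index <- seq_along(fastq_files) < length(fastq_files)
keep_index # TRUE TRUE FALSE
#for (i in seq_along(fastq_files)) {
#  STAR.align.single(fastq_files[i], NULL, project["bam RNA-seq"], index,
#                    steps = "tr-ge", keep.index.in.memory = keep_index[i])
#}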
}
\seealso{
Other STAR: 
\code{\link{STAR.align.folder}()},
\code{\link{STAR.allsteps.multiQC}()},
\code{\link{STAR.index}()},
\code{\link{STAR.install}()},
\code{\link{STAR.multiQC}()},
\code{\link{STAR.remove.crashed.genome}()},
\code{\link{getGenomeAndAnnotation}()},
\code{\link{install.fastp}()}
}
\concept{STAR}
