1 Introduction

The sangerseqR package provides basic functions for importing and working with Sanger sequencing data files. It currently supports Scf and ABIF files. The Scf file specification is an open, although somewhat limited, data file format. Several tools designed to view and/or edit chromatogram data can convert other file types to Scf. The ABIF file specification is a proprietary storage format for sequencing data generated by Applied Biosystems machines. More information on each filetype can be found at the following sites:

The objects and functions included in this package were developed as part of the Poly Peak Parser web application (http://yost.genetics.utah.edu/software.php), which automates the process of separating ambiguous double peaks in Sanger sequencing results from individuals carrying heterozygous indels. This package contains a complete working copy of the Poly Peak Parser web application that can be run locally using the PolyPeakParser() function. In addition to the web program, this package also provides general objects and functions for working with Sanger sequencing data.

This vignette will walk you through a typical workflow using two sequence files: 1) homozygous.scf and 2) heterozygous.ab1. As their names indicate, the first example contains results typical of sequencing from a PCR product of a homozygous individual or from a plasmid. The second example contains results from sequencing the same region in an individual with a small indel.

2 Loading Data

The first step of most workflows is to load data from a sequencing results file. This can be done using one of three included functions: read.abif(), read.scf() and readsangerseq(). The first two functions directly import all of the fields into abif and scf class objects, respectively. These are intermediate classes that exist to let the user inspect the file contents, which may vary between basecallers and sequencing machines. Users should generally use the readsangerseq() function. This function automatically detects the file type, reads the file, and then extracts the fields necessary to create a sangerseq class object, which is used by all of the other functions in this package.

2.1 read.abif

read.abif takes a single argument: the filename of the ABIF file to be read. The resulting object contains three major parts: the header, containing information on the file structure; the directory, containing information on each of the data fields included in the file; and the data fields themselves. Here is an example:

hetab1 <- read.abif(system.file("extdata", "heterozygous.ab1", package = "sangerseqR"))
str(hetab1, list.len = 20)
## Formal class 'abif' [package "sangerseqR"] with 3 slots
##   ..@ header   :Formal class 'abifHeader' [package "sangerseqR"] with 9 slots
##   .. .. ..@ abif       : chr "ABIF"
##   .. .. ..@ version    : int 101
##   .. .. ..@ name       : chr "tdir"
##   .. .. ..@ number     : int 1
##   .. .. ..@ elementtype: int 1023
##   .. .. ..@ elementsize: int 28
##   .. .. ..@ numelements: int 130
##   .. .. ..@ dataoffset : int 323971
##   .. .. ..@ datahandle : int 0
##   ..@ directory:Formal class 'abifDirectory' [package "sangerseqR"] with 7 slots
##   .. .. ..@ name       : chr [1:130] "AEPt" "AEPt" "APFN" "APXV" ...
##   .. .. ..@ tagnumber  : int [1:130] 1 2 2 1 1 1 1 1 1 1 ...
##   .. .. ..@ elementtype: int [1:130] 4 4 18 19 19 19 2 5 4 4 ...
##   .. .. ..@ elementsize: int [1:130] 2 2 1 1 1 1 1 4 2 2 ...
##   .. .. ..@ numelements: int [1:130] 1 1 6 2 6 2 4503 1 1 1 ...
##   .. .. ..@ datasize   : int [1:130] 2 2 6 2 6 2 4503 4 2 2 ...
##   .. .. ..@ dataoffset : int [1:130] 1113325568 1113325568 173231 838860800 163360 956301312 163366 0 65536 145752064 ...
##   ..@ data     :List of 130
##   .. ..$ AEPt.1 : int 16988
##   .. ..$ AEPt.2 : int 16988
##   .. ..$ APFN.2 : chr "SeqA"
##   .. ..$ APXV.1 : chr "2"
##   .. ..$ APrN.1 : chr "SeqA"
##   .. ..$ APrV.1 : chr "9"
##   .. ..$ APrX.1 : chr "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n<AnalysisProtocolContainer doAnalysis=\"true\" na"| __truncated__
##   .. ..$ ARTN.1 : int 0
##   .. ..$ ASPF.1 : int 1
##   .. ..$ ASPt.1 : int 2224
##   .. ..$ ASPt.2 : int 2224
##   .. ..$ AUDT.1 : int [1:1370] 64 126 65 54 55 79 81 183 49 123 ...
##   .. ..$ B1Pt.1 : int 2223
##   .. ..$ B1Pt.2 : int 2223
##   .. ..$ BCTS.1 : chr "2013-06-13 17:26:28 -06:00"
##   .. ..$ BufT.1 : int [1:1596] -27 -27 -27 -27 -27 -27 -27 -27 -27 -27 ...
##   .. ..$ CMNT.1 : chr "<ID:119209><WELL:G02>"
##   .. ..$ CTID.1 : chr "bdt1735"
##   .. ..$ CTNM.1 : chr "bdt1735"
##   .. ..$ CTOw.1 : chr "aadamson"
##   .. .. [list output truncated]

As you can see, the file is very long and contains many data fields (130 in this example). However, most of these hold run information, and only a few are directly relevant to data analysis:

Data Field         Description
DATA.9-DATA.12     Vectors containing the signal intensities for each channel.
FWO_.1             A string containing the base corresponding to each channel. For example, if it is “ACGT”, then DATA.9 = A, DATA.10 = C, DATA.11 = G and DATA.12 = T.
PLOC.2             Peak locations as an index of the trace vectors.
PBAS.1, PBAS.2     Primary basecalls. PBAS.1 may contain bases edited in the original basecaller, while PBAS.2 always contains the basecaller’s calls.
P1AM.1             Amplitude of the primary basecall peaks.
P2BA.1 (optional)  Secondary basecalls.
P2AM.1 (optional)  Amplitude of the secondary basecall peaks.
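As a rough illustration of how the channel-order string relates to the four DATA vectors, the sketch below builds a base-labelled trace matrix from a named list shaped like the data slot shown above. The helper name channel_matrix and the toy values are hypothetical and are not part of sangerseqR; the sketch also assumes the channel-order field is stored under the four-character ABIF tag FWO_ (list element FWO_.1).

```r
# Hypothetical helper (not part of sangerseqR): label the four trace
# channels with their bases and return them in A, C, G, T order.
# `fields` is assumed to be a named list shaped like the @data slot.
channel_matrix <- function(fields) {
  # The channel-order string, e.g. "GATC", split into single bases
  bases <- strsplit(fields[["FWO_.1"]], "")[[1]]
  # DATA.9 to DATA.12 hold the signal intensities, one vector per channel
  tags <- paste0("DATA.", 9:12)
  traces <- do.call(cbind, fields[tags])
  colnames(traces) <- bases
  # Reorder the columns so that they always read A, C, G, T
  traces[, c("A", "C", "G", "T"), drop = FALSE]
}

# Toy input standing in for a real file (all values are made up)
toy <- list(
  "FWO_.1" = "GATC",
  DATA.9   = c(1L, 2L, 3L),
  DATA.10  = c(4L, 5L, 6L),
  DATA.11  = c(7L, 8L, 9L),
  DATA.12  = c(10L, 11L, 12L)
)
m <- channel_matrix(toy)
m
```

With a real file, the same helper could be tried on the data slot of the object returned by read.abif, provided the field names in that file match those listed in the table.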

2.2 read.scf

Like read.abif, read.scf takes a single argument with the filename. However, the data structure of the resulting scf object is far less complicated. It contains only a header with file structure information, a matrix of the trace data (@sample_points), a matrix of relative probabilities of each base at each position (@sequence_probs), basecall positions (@basecall_positions), basecalls (@basecalls) and, optionally, a comments section with the run data (@comments). The last slot (@private) is rarely used and is impossible to interpret without knowing how it was created.

homoscf <- read.scf(system.file("extdata", "homozygous.scf", package = "sangerseqR"))
str(homoscf)
## Formal class 'scf' [package "sangerseqR"] with 7 slots
##   ..@ header            :Formal class 'scfHeader' [package "sangerseqR"] with 14 slots
##   .. .. ..@ scf             : chr ".scf"
##   .. .. ..@ samples         : int 16275
##   .. .. ..@ samples_offset  : int 128
##   .. .. ..@ bases           : int 722
##   .. .. ..@ bases_left_clip : int 0
##   .. .. ..@ bases_right_clip: int 0
##   .. .. ..@ bases_offset    : int 130328
##   .. .. ..@ comments_size   : int 1731
##   .. .. ..@ comments_offset : int 138992
##   .. .. ..@ version         : num 300
##   .. .. ..@ sample_size     : int 2
##   .. .. ..@ code_set        : int 2
##   .. .. ..@ private_size    : int 0
##   .. .. ..@ private_offset  : int 140723
##   ..@ sample_points     : num [1:16275, 1:4] 187 190 199 220 255 304 354 389 404 402 ...
##   ..@ sequence_probs    : int [1:722, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
##   ..@ basecall_positions: int [1:722] 2 18 25 39 45 56 63 68 85 94 ...
##   ..@ basecalls         : chr "ARGKRAMMYWACTATAGGGCGGAATTGAATTTAGCGGCCGCGAATTCGCCCTTTGGCAAGAGAGCGACAGTCAGTCGGACTTACGAGTTGTTTTTACAGGCGCAATTCTTT"| __truncated__
##   ..@ comments          : chr "STRT=6/21/2013\n18:04:27\nSTOP=6/21/2013\n20:02:45\nSIGN=G=124,A=134,T=204,C=159\nAEPt=16308\nAEPt=16308\nAPFN=S"| __truncated__
##   ..@ private           : raw [1:2] 00 31
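The slots listed in the str() output can be inspected directly with the @ operator. The following is a minimal sketch that assumes the slot names are exactly those printed above; it compares the counts stored in the header with the sizes of the data slots and shows the first few comment entries.

```r
library(sangerseqR)

homoscf <- read.scf(system.file("extdata", "homozygous.scf",
    package = "sangerseqR"))

# Header fields are slots of the nested scfHeader object
homoscf@header@bases    # number of basecalls
homoscf@header@samples  # number of sample points in each trace

# The header counts should agree with the sizes of the data slots
nrow(homoscf@sample_points) == homoscf@header@samples
length(homoscf@basecall_positions) == homoscf@header@bases

# The run data is stored as a single newline-separated string
head(strsplit(homoscf@comments, "\n")[[1]])
```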

2.3 readsangerseq

The readsangerseq function is a convenience function equivalent to sangerseq(read.abif(file)) or sangerseq(read.scf(file)). It should generally be used when the contents of the file do not need to be directly accessed because it returns a sangerseq object, described below.
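The equivalence described above can be checked on the example file. This is only a sketch: it compares the primary basecalls produced by the two routes rather than every slot, and the variable names are illustrative.

```r
library(sangerseqR)

ab1_path <- system.file("extdata", "heterozygous.ab1",
    package = "sangerseqR")

# Two-step route: import every field, then extract what sangerseq needs
two_step <- sangerseq(read.abif(ab1_path))

# One-step route: the file type is detected automatically
one_step <- readsangerseq(ab1_path)

# Both routes are expected to give the same primary basecalls
same_calls <- identical(primarySeq(two_step, string = TRUE),
    primarySeq(one_step, string = TRUE))
same_calls
```

The two-step route is mainly useful when the intermediate abif or scf object needs to be inspected before it is converted.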

3 Sangerseq Class Objects

The sangerseq class is the backbone of the sangerseqR package and contains the chromatogram data necessary to perform all other functions. It can be created in two ways: from an abif or scf object using the sangerseq method, or directly from an ABIF or Scf file using readsangerseq.

# from a sequence file object
homosangerseq <- sangerseq(homoscf)

# directly from the file
hetsangerseq <- readsangerseq(system.file("extdata", "heterozygous.ab1",
    package = "sangerseqR"))
str(hetsangerseq)
## Formal class 'sangerseq' [package "sangerseqR"] with 7 slots
##   ..@ primarySeqID  : chr "From ab1 file"
##   ..@ primarySeq    :Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   .. .. ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. .. .. ..@ xp                    :<externalptr> 
##   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x64cc8c912928> 
##   .. .. ..@ offset         : int 0
##   .. .. ..@ length         : int 605
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ secondarySeqID: chr "From ab1 file"
##   ..@ secondarySeq  :Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   .. .. ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. .. .. ..@ xp                    :<externalptr> 
##   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x64cc8c912928> 
##   .. .. ..@ offset         : int 0
##   .. .. ..@ length         : int 605
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ traceMatrix   : int [1:16215, 1:4] 0 0 0 1 2 4 4 2 1 0 ...
##   ..@ peakPosMatrix : num [1:605, 1:4] 4 13 21 31 43 58 64 73 83 98 ...
##   ..@ peakAmpMatrix : int [1:605, 1:4] 380 694 836 934 1367 1063 2072 1502 1234 539 ...

The slots are as follows:

Slot            Description
primarySeqID    Identification of the primary basecalls.
primarySeq      The primary basecalls formatted as a DNAString object.
secondarySeqID  Identification of the secondary basecalls.
secondarySeq    The secondary basecalls formatted as a DNAString object.
traceMatrix     A numerical matrix containing 4 columns corresponding to the normalized signal values for the chromatogram traces. Column order = A,C,G,T.
peakPosMatrix   A numerical matrix containing the position of the maximum peak values for each base within each basecall window. If no peak was detected for a given base in a given window, then “NA”. Column order = A,C,G,T.
peakAmpMatrix   A numerical matrix containing the maximum peak amplitudes for each base within each basecall window. If no peak was detected for a given base in a given window, then 0. Column order = A,C,G,T.
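To make the slot descriptions more concrete, the sketch below reads the example file and looks at the matrices through their accessor functions. The column labels are added locally for readability, following the A,C,G,T order given in the table; the variable names bases and amps are illustrative only.

```r
library(sangerseqR)

hetsangerseq <- readsangerseq(system.file("extdata", "heterozygous.ab1",
    package = "sangerseqR"))

# Identification strings for the two sets of basecalls
primarySeqID(hetsangerseq)
secondarySeqID(hetsangerseq)

# Label the amplitude columns locally, using the documented column order
bases <- c("A", "C", "G", "T")
amps <- peakAmpMatrix(hetsangerseq)
colnames(amps) <- bases
head(amps)

# Peak matrices have one row per basecall window,
# while the trace matrix has one row per sample point
dim(amps)
dim(peakPosMatrix(hetsangerseq))
dim(traceMatrix(hetsangerseq))
```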

Accessor functions also exist for each slot in the sangerseq object. Most of the accessors return the data in its native format, but the primarySeq and secondarySeq accessors can return the sequence either as a character string (string=TRUE) or as a DNAString class object from the Biostrings package (string=FALSE, the default). The DNAString class supports several convenient functions for manipulating the sequence, including generating the reverse complement and performing alignments. The Biostrings package is automatically loaded with the sangerseqR package, so all of its methods should be available.

# default is to return a DNAString object
Seq1 <- primarySeq(homosangerseq)
reverseComplement(Seq1)
## 722-letter DNAString object
## seq: TTAACCCTCACTAAAAGGGAATTAGTCCTGCAGGTT...CGCTAAATTCAATTCCGCCCTATAGTWRKKTYMCYT
# can return as string
primarySeq(homosangerseq, string = TRUE)
## [1] "ARGKRAMMYWACTATAGGGCGGAATTGAATTTAGCGGCCGCGAATTCGCCCTTTGGCAAGAGAGCGACAGTCAGTCGGACTTACGAGTTGTTTTTACAGGCGCAATTCTTTTTTTAGAATATTATACATTCATCTGGCTTTTTGGGTGCACCGATGAGAGATCCAGTTTTCACAGCGAACGCTATGGCTTATCACCCTTTTCACGCGCACAGGCCGGCCGACTTTCCCATGTCAGCTTTCCTTGCGGCGGCTCAACCTTCGTTCTTTCCAGCGCTCACTTTACCAGTAAACCGCTGGCGGATCATGCGCTCTCCGGTGCGGCTGAAGCTGGTTTACACGCGGCGCTTGGACATCACCACCAGGCGGCTCATCTGCGCTCTTTCAAGGGTCTCGAGCCAGAGGAGGATGTTGAGGACGATCCTAAAGTTACATTAGAAGCTAAGGAGCTTTGGGATCAATTCCACAAAATTGGAACAGAAATGGTCATCACTAAATCAGGAAGGTAAGGTCTTTACATTATTTAACCTATTGAATGCTGCATAGGGTGATGTTATTATATTACTCCGCGAAGAGTTGGGTCTATTTTATCGTAAAATATACTTTACATTATAAAATATTGCTCGGTTAAAATTCAGATGTACTGGATGCTGACATAGCATCGAAGCCTCTAARGGCGAATTCGTTTAAACCTGCAGGACTAATTCCCTTTTAGTGAGGGTTAA"
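Beyond reverseComplement, other Biostrings functions can be applied to the returned DNAString in the same way. The calls below are a small, non-exhaustive sketch; Biostrings is loaded explicitly here so that the example does not depend on how it gets attached, and the chosen window (positions 11 to 30) is arbitrary.

```r
library(sangerseqR)
library(Biostrings)

homosangerseq <- readsangerseq(system.file("extdata", "homozygous.scf",
    package = "sangerseqR"))
Seq1 <- primarySeq(homosangerseq)

# Number of basecalls in the sequence
length(Seq1)

# Extract an arbitrary window of the sequence
window <- subseq(Seq1, start = 11, end = 30)
window

# Counts of A, C, G and T; ambiguity codes are pooled under "other"
alphabetFrequency(Seq1, baseOnly = TRUE)

# The string form has one character per basecall
nchar(primarySeq(homosangerseq, string = TRUE)) == length(Seq1)
```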

4 Creating Chromatograms

Basic chromatogram plots can be made using the chromatogram function. These plots are optimized for printing, so the trace is split across several rows in order to show all of the data at once. The downside is that plotting can fail with an error if the graphics device dimensions are not large enough. If this occurs, we suggest providing a filename in the call, which saves the plot to a PDF that is automatically sized to fit everything. Several parameters can also be set to change how the plot appears. These are documented in the chromatogram help file.

chromatogram(hetsangerseq, width = 80, height = 3, trim5 = 50, trim3 = 100,
    showcalls = "both")
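If the on-screen device is too small, the same call can be redirected to a file through the filename argument. The sketch below writes to a temporary location; the output path and the variable name out are arbitrary choices made for illustration.

```r
library(sangerseqR)

hetsangerseq <- readsangerseq(system.file("extdata", "heterozygous.ab1",
    package = "sangerseqR"))

# Arbitrary output location, chosen only for this example
out <- file.path(tempdir(), "heterozygous_chromatogram.pdf")

# Same plot as above, saved to a PDF instead of the current device
chromatogram(hetsangerseq, width = 80, height = 3, trim5 = 50, trim3 = 100,
    showcalls = "both", filename = out)

# Confirm that the file was written
file.exists(out)
```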