Contents

1 Introduction

Seqinfo is an R/Bioconductor infrastructure package that implements a simple S4 class, the Seqinfo class, for storing the following information about a collection of genomic sequences:

The sequences described in a Seqinfo object are typically, but not necessarily, the chromosomes and/or scaffolds of a specific genome assembly of a given organism.

Note that Seqinfo objects are rarely used as standalone objects. Instead, they are used as part of higher-level objects to represent their seqinfo() component. Examples of such higher-level objects are GRanges, RangedSummarizedExperiment, VCF, GAlignments, TxDb, etc… defined in other Bioconductor infrastructure packages.

2 Install and load the package

Like any Bioconductor package, Seqinfo should be installed with BiocManager::install():

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("Seqinfo")

BiocManager::install() will take care of installing the package dependencies that are missing.

Load the package:

library(Seqinfo)

3 The Seqinfo constructor

Seqinfo objects can be created with the Seqinfo() constructor function:

## Note that all the arguments (except 'genome') must have the
## same length. 'genome' can be of length 1, whatever the lengths
## of the other arguments are.
si1 <- Seqinfo(seqnames=c("chr1", "chr2", "chr3", "chrM"),
               seqlengths=c(100, 200, NA, 15),
               isCircular=c(NA, FALSE, FALSE, TRUE),
               genome="toy")
si1
## Seqinfo object with 4 sequences (1 circular) from toy genome:
##   seqnames seqlengths isCircular genome
##   chr1            100         NA    toy
##   chr2            200      FALSE    toy
##   chr3             NA      FALSE    toy
##   chrM             15       TRUE    toy

One special form of calling the Seqinfo() constructor function is to specify only the genome argument and set it to the name of an NCBI assembly (e.g. Seqinfo(genome="GRCh38.p14")) or UCSC genome (e.g. Seqinfo(genome="hg38")), in which case the sequence information is fetched from NCBI or UCSC. This requires the GenomeInfoDb package (the package only needs to be installed):

library(GenomeInfoDb)  # just making sure that the package is installed

Seqinfo(genome="GRCh38.p14")
## Seqinfo object with 709 sequences (1 circular) from GRCh38.p14 genome:
##   seqnames       seqlengths isCircular     genome
##   1               248956422      FALSE GRCh38.p14
##   2               242193529      FALSE GRCh38.p14
##   3               198295559      FALSE GRCh38.p14
##   4               190214555      FALSE GRCh38.p14
##   5               181538259      FALSE GRCh38.p14
##   ...                   ...        ...        ...
##   HSCHR22_8_CTG1     145162      FALSE GRCh38.p14
##   HSCHRX_3_CTG7      188004      FALSE GRCh38.p14
##   HSCHRX_1_CTG14     619716      FALSE GRCh38.p14
##   HSCHRX_2_CTG14     294119      FALSE GRCh38.p14
##   HSCHRX_3_CTG3      330493      FALSE GRCh38.p14
Seqinfo(genome="hg38")
## Seqinfo object with 711 sequences (1 circular) from hg38 genome:
##   seqnames             seqlengths isCircular genome
##   chr1                  248956422      FALSE   hg38
##   chr2                  242193529      FALSE   hg38
##   chr3                  198295559      FALSE   hg38
##   chr4                  190214555      FALSE   hg38
##   chr5                  181538259      FALSE   hg38
##   ...                         ...        ...    ...
##   chr22_KQ759761v1_alt     145162      FALSE   hg38
##   chrX_KV766199v1_alt      188004      FALSE   hg38
##   chrX_MU273395v1_alt      619716      FALSE   hg38
##   chrX_MU273396v1_alt      294119      FALSE   hg38
##   chrX_MU273397v1_alt      330493      FALSE   hg38

See ?Seqinfo for more information.

4 Seqinfo accessors

Various accessor functions are provided:

length(si1)
## [1] 4
seqnames(si1)
## [1] "chr1" "chr2" "chr3" "chrM"
names(si1)
## [1] "chr1" "chr2" "chr3" "chrM"
seqlevels(si1)
## [1] "chr1" "chr2" "chr3" "chrM"
seqlengths(si1)
## chr1 chr2 chr3 chrM 
##  100  200   NA   15
isCircular(si1)
##  chr1  chr2  chr3  chrM 
##    NA FALSE FALSE  TRUE
genome(si1)
##  chr1  chr2  chr3  chrM 
## "toy" "toy" "toy" "toy"

See ?Seqinfo for more information.

5 Operations on Seqinfo objects

5.1 Subset by seqnames

A Seqinfo object can be subsetted by seqnames:

si1[c("chrY", "chr3", "chr1")]
## Seqinfo object with 3 sequences from 2 genomes (NA, toy):
##   seqnames seqlengths isCircular genome
##   chrY             NA         NA   <NA>
##   chr3             NA      FALSE    toy
##   chr1            100         NA    toy

5.2 Rename, drop, add and/or reorder the sequences

Rename:

si <- si1
seqlevels(si) <- sub("chr", "ch", seqlevels(si))
si
## Seqinfo object with 4 sequences (1 circular) from toy genome:
##   seqnames seqlengths isCircular genome
##   ch1             100         NA    toy
##   ch2             200      FALSE    toy
##   ch3              NA      FALSE    toy
##   chM              15       TRUE    toy

Reorder:

seqlevels(si) <- rev(seqlevels(si))
si
## Seqinfo object with 4 sequences (1 circular) from toy genome:
##   seqnames seqlengths isCircular genome
##   chM              15       TRUE    toy
##   ch3              NA      FALSE    toy
##   ch2             200      FALSE    toy
##   ch1             100         NA    toy

Drop/add/reorder:

seqlevels(si) <- c("ch1", "ch2", "chY")
si
## Seqinfo object with 3 sequences from 2 genomes (toy, NA):
##   seqnames seqlengths isCircular genome
##   ch1             100         NA    toy
##   ch2             200      FALSE    toy
##   chY              NA         NA   <NA>

Rename/reorder/drop/add:

seqlevels(si) <- c(chY="Y", ch1="1", "22")
si
## Seqinfo object with 3 sequences from 2 genomes (NA, toy):
##   seqnames seqlengths isCircular genome
##   Y                NA         NA   <NA>
##   1               100         NA    toy
##   22               NA         NA   <NA>

See ?Seqinfo for more information.

5.3 Merge Seqinfo objects

Two Seqinfo objects can be merged if they are compatible:

si2 <- Seqinfo(seqnames=c("chr3", "chr4", "chrM"),
               seqlengths=c(300, NA, 15))
si2
## Seqinfo object with 3 sequences from an unspecified genome:
##   seqnames seqlengths isCircular genome
##   chr3            300         NA   <NA>
##   chr4             NA         NA   <NA>
##   chrM             15         NA   <NA>
merge(si1, si2)  # rows for chr3 and chrM are merged
## Warning in .merge_two_Seqinfo_objects(x, y): Each of the 2 combined objects has sequence levels not in the other:
##   - in 'x': chr1, chr2
##   - in 'y': chr4
##   Make sure to always combine/compare objects based on the same reference
##   genome (use suppressWarnings() to suppress this warning).
## Seqinfo object with 5 sequences (1 circular) from 2 genomes (toy, NA):
##   seqnames seqlengths isCircular genome
##   chr1            100         NA    toy
##   chr2            200      FALSE    toy
##   chr3            300      FALSE    toy
##   chrM             15       TRUE    toy
##   chr4             NA         NA   <NA>
suppressWarnings(merge(si1, si2))
## Seqinfo object with 5 sequences (1 circular) from 2 genomes (toy, NA):
##   seqnames seqlengths isCircular genome
##   chr1            100         NA    toy
##   chr2            200      FALSE    toy
##   chr3            300      FALSE    toy
##   chrM             15       TRUE    toy
##   chr4             NA         NA   <NA>

Note that, strictly speaking, merging two Seqinfo objects is not a commutative operation, i.e., in general z1 <- merge(x, y) is not identical to z2 <- merge(y, x). However z1 and z2 are guaranteed to contain the same information (i.e. the same rows, but typically not in the same order):

suppressWarnings(merge(si2, si1))
## Seqinfo object with 5 sequences (1 circular) from 2 genomes (toy, NA):
##   seqnames seqlengths isCircular genome
##   chr3            300      FALSE    toy
##   chr4             NA         NA   <NA>
##   chrM             15       TRUE    toy
##   chr1            100         NA    toy
##   chr2            200      FALSE    toy

Trying to merge Seqinfo objects that are not compatible will raise an error:

## This contradicts what 'x' says about circularity of chr3 and chrM:
isCircular(si2)[c("chr3", "chrM")] <- c(TRUE, FALSE)
si2
## Seqinfo object with 3 sequences (1 circular) from an unspecified genome:
##   seqnames seqlengths isCircular genome
##   chr3            300       TRUE   <NA>
##   chr4             NA         NA   <NA>
##   chrM             15      FALSE   <NA>
merge(si1, si2)  # ERROR!
## Error in mergeNamedAtomicVectors(isCircular(x), isCircular(y), what = c("sequence",  :
##   sequences chr3, chrM have incompatible circularity flags:
##   - in 'x': FALSE, TRUE
##   - in 'y': TRUE, FALSE

See ?Seqinfo for more information.

6 Seqinfo objects as parts of higher-level objects

Seqinfo objects are typically found as part of higher-level objects to represent their seqinfo() component. For example, in TxDb objects:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
class(txdb)
## [1] "TxDb"
## attr(,"package")
## [1] "GenomicFeatures"
seqinfo(txdb)
## Seqinfo object with 711 sequences (1 circular) from hg38 genome:
##   seqnames             seqlengths isCircular genome
##   chr1                  248956422      FALSE   hg38
##   chr2                  242193529      FALSE   hg38
##   chr3                  198295559      FALSE   hg38
##   chr4                  190214555      FALSE   hg38
##   chr5                  181538259      FALSE   hg38
##   ...                         ...        ...    ...
##   chr22_KQ759761v1_alt     145162      FALSE   hg38
##   chrX_KV766199v1_alt      188004      FALSE   hg38
##   chrX_MU273395v1_alt      619716      FALSE   hg38
##   chrX_MU273396v1_alt      294119      FALSE   hg38
##   chrX_MU273397v1_alt      330493      FALSE   hg38

and in BSgenome objects:

library(BSgenome.Hsapiens.UCSC.hg38)
bsg <- BSgenome.Hsapiens.UCSC.hg38
class(bsg)
## [1] "BSgenome"
## attr(,"package")
## [1] "BSgenome"
seqinfo(bsg)
## Seqinfo object with 711 sequences (1 circular) from hg38 genome:
##   seqnames             seqlengths isCircular genome
##   chr1                  248956422      FALSE   hg38
##   chr2                  242193529      FALSE   hg38
##   chr3                  198295559      FALSE   hg38
##   chr4                  190214555      FALSE   hg38
##   chr5                  181538259      FALSE   hg38
##   ...                         ...        ...    ...
##   chr22_KQ759761v1_alt     145162      FALSE   hg38
##   chrX_KV766199v1_alt      188004      FALSE   hg38
##   chrX_MU273395v1_alt      619716      FALSE   hg38
##   chrX_MU273396v1_alt      294119      FALSE   hg38
##   chrX_MU273397v1_alt      330493      FALSE   hg38

Sanity checks:

stopifnot(identical(seqinfo(txdb), Seqinfo(genome="hg38")))
stopifnot(identical(seqinfo(bsg), Seqinfo(genome="hg38")))

7 Session information

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5       
##  [2] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0
##  [3] BSgenome_1.77.0                         
##  [4] rtracklayer_1.69.0                      
##  [5] BiocIO_1.19.0                           
##  [6] Biostrings_2.77.1                       
##  [7] XVector_0.49.0                          
##  [8] GenomicFeatures_1.61.3                  
##  [9] AnnotationDbi_1.71.0                    
## [10] Biobase_2.69.0                          
## [11] GenomicRanges_1.61.0                    
## [12] GenomeInfoDb_1.45.4                     
## [13] IRanges_2.43.0                          
## [14] S4Vectors_0.47.0                        
## [15] Seqinfo_0.99.0                          
## [16] BiocGenerics_0.55.0                     
## [17] generics_0.1.4                          
## [18] BiocStyle_2.37.0                        
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.49.0             SummarizedExperiment_1.39.0
##  [3] rjson_0.2.23                xfun_0.52                  
##  [5] bslib_0.9.0                 lattice_0.22-7             
##  [7] vctrs_0.6.5                 tools_4.5.0                
##  [9] bitops_1.0-9                curl_6.2.3                 
## [11] parallel_4.5.0              RSQLite_2.4.0              
## [13] blob_1.2.4                  pkgconfig_2.0.3            
## [15] Matrix_1.7-3                lifecycle_1.0.4            
## [17] compiler_4.5.0              Rsamtools_2.25.0           
## [19] codetools_0.2-20            htmltools_0.5.8.1          
## [21] sass_0.4.10                 RCurl_1.98-1.17            
## [23] yaml_2.3.10                 crayon_1.5.3               
## [25] jquerylib_0.1.4             BiocParallel_1.43.3        
## [27] cachem_1.1.0                DelayedArray_0.35.1        
## [29] abind_1.4-8                 digest_0.6.37              
## [31] restfulr_0.0.15             bookdown_0.43              
## [33] fastmap_1.2.0               grid_4.5.0                 
## [35] SparseArray_1.9.0           cli_3.6.5                  
## [37] S4Arrays_1.9.1              XML_3.99-0.18              
## [39] UCSC.utils_1.5.0            bit64_4.6.0-1              
## [41] rmarkdown_2.29              httr_1.4.7                 
## [43] matrixStats_1.5.0           bit_4.6.0                  
## [45] png_0.1-8                   memoise_2.0.1              
## [47] evaluate_1.0.3              knitr_1.50                 
## [49] rlang_1.1.6                 DBI_1.2.3                  
## [51] BiocManager_1.30.25         jsonlite_2.0.0             
## [53] R6_2.6.1                    MatrixGenerics_1.21.0      
## [55] GenomicAlignments_1.45.0