Seqinfo 0.99.0
Seqinfo is an R/Bioconductor infrastructure package that implements a simple S4 class, the Seqinfo class, for storing the following information about a collection of genomic sequences:
the sequence names a.k.a. the seqnames;
the sequence lengths a.k.a. the seqlengths;
the sequence circularity flags, that is, whether the sequences are circular or not;
the sequence genomes, that is, the genome assembly that each sequence is coming from, if any.
The sequences described in a Seqinfo object are typically, but not necessarily, the chromosomes and/or scaffolds of a specific genome assembly of a given organism.
Note that Seqinfo objects are rarely used as standalone objects. Instead,
they are used as part of higher-level objects to represent their seqinfo()
component. Examples of such higher-level objects are GRanges,
RangedSummarizedExperiment, VCF, GAlignments, TxDb, etc… defined
in other Bioconductor infrastructure packages.
Like any Bioconductor package, Seqinfo should be installed
with BiocManager::install()
:
if (!require("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("Seqinfo")
BiocManager::install()
will take care of installing the package dependencies
that are missing.
Load the package:
library(Seqinfo)
Seqinfo objects can be created with the Seqinfo()
constructor function:
## Note that all the arguments (except 'genome') must have the
## same length. 'genome' can be of length 1, whatever the lengths
## of the other arguments are.
si1 <- Seqinfo(seqnames=c("chr1", "chr2", "chr3", "chrM"),
seqlengths=c(100, 200, NA, 15),
isCircular=c(NA, FALSE, FALSE, TRUE),
genome="toy")
si1
## Seqinfo object with 4 sequences (1 circular) from toy genome:
## seqnames seqlengths isCircular genome
## chr1 100 NA toy
## chr2 200 FALSE toy
## chr3 NA FALSE toy
## chrM 15 TRUE toy
One special form of calling the Seqinfo()
constructor function is to
specify only the genome
argument and set it to the name of an NCBI
assembly (e.g. Seqinfo(genome="GRCh38.p14")
) or UCSC genome (e.g.
Seqinfo(genome="hg38")
), in which case the sequence information is
fetched from NCBI or UCSC. This requires the GenomeInfoDb
package (the package only needs to be installed):
library(GenomeInfoDb) # just making sure that the package is installed
Seqinfo(genome="GRCh38.p14")
## Seqinfo object with 709 sequences (1 circular) from GRCh38.p14 genome:
## seqnames seqlengths isCircular genome
## 1 248956422 FALSE GRCh38.p14
## 2 242193529 FALSE GRCh38.p14
## 3 198295559 FALSE GRCh38.p14
## 4 190214555 FALSE GRCh38.p14
## 5 181538259 FALSE GRCh38.p14
## ... ... ... ...
## HSCHR22_8_CTG1 145162 FALSE GRCh38.p14
## HSCHRX_3_CTG7 188004 FALSE GRCh38.p14
## HSCHRX_1_CTG14 619716 FALSE GRCh38.p14
## HSCHRX_2_CTG14 294119 FALSE GRCh38.p14
## HSCHRX_3_CTG3 330493 FALSE GRCh38.p14
Seqinfo(genome="hg38")
## Seqinfo object with 711 sequences (1 circular) from hg38 genome:
## seqnames seqlengths isCircular genome
## chr1 248956422 FALSE hg38
## chr2 242193529 FALSE hg38
## chr3 198295559 FALSE hg38
## chr4 190214555 FALSE hg38
## chr5 181538259 FALSE hg38
## ... ... ... ...
## chr22_KQ759761v1_alt 145162 FALSE hg38
## chrX_KV766199v1_alt 188004 FALSE hg38
## chrX_MU273395v1_alt 619716 FALSE hg38
## chrX_MU273396v1_alt 294119 FALSE hg38
## chrX_MU273397v1_alt 330493 FALSE hg38
See ?Seqinfo
for more information.
Various accessor functions are provided:
length(si1)
## [1] 4
seqnames(si1)
## [1] "chr1" "chr2" "chr3" "chrM"
names(si1)
## [1] "chr1" "chr2" "chr3" "chrM"
seqlevels(si1)
## [1] "chr1" "chr2" "chr3" "chrM"
seqlengths(si1)
## chr1 chr2 chr3 chrM
## 100 200 NA 15
isCircular(si1)
## chr1 chr2 chr3 chrM
## NA FALSE FALSE TRUE
genome(si1)
## chr1 chr2 chr3 chrM
## "toy" "toy" "toy" "toy"
See ?Seqinfo
for more information.
A Seqinfo object can be subsetted by seqnames:
si1[c("chrY", "chr3", "chr1")]
## Seqinfo object with 3 sequences from 2 genomes (NA, toy):
## seqnames seqlengths isCircular genome
## chrY NA NA <NA>
## chr3 NA FALSE toy
## chr1 100 NA toy
Rename:
si <- si1
seqlevels(si) <- sub("chr", "ch", seqlevels(si))
si
## Seqinfo object with 4 sequences (1 circular) from toy genome:
## seqnames seqlengths isCircular genome
## ch1 100 NA toy
## ch2 200 FALSE toy
## ch3 NA FALSE toy
## chM 15 TRUE toy
Reorder:
seqlevels(si) <- rev(seqlevels(si))
si
## Seqinfo object with 4 sequences (1 circular) from toy genome:
## seqnames seqlengths isCircular genome
## chM 15 TRUE toy
## ch3 NA FALSE toy
## ch2 200 FALSE toy
## ch1 100 NA toy
Drop/add/reorder:
seqlevels(si) <- c("ch1", "ch2", "chY")
si
## Seqinfo object with 3 sequences from 2 genomes (toy, NA):
## seqnames seqlengths isCircular genome
## ch1 100 NA toy
## ch2 200 FALSE toy
## chY NA NA <NA>
Rename/reorder/drop/add:
seqlevels(si) <- c(chY="Y", ch1="1", "22")
si
## Seqinfo object with 3 sequences from 2 genomes (NA, toy):
## seqnames seqlengths isCircular genome
## Y NA NA <NA>
## 1 100 NA toy
## 22 NA NA <NA>
See ?Seqinfo
for more information.
Two Seqinfo objects can be merged if they are compatible:
si2 <- Seqinfo(seqnames=c("chr3", "chr4", "chrM"),
seqlengths=c(300, NA, 15))
si2
## Seqinfo object with 3 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## chr3 300 NA <NA>
## chr4 NA NA <NA>
## chrM 15 NA <NA>
merge(si1, si2) # rows for chr3 and chrM are merged
## Warning in .merge_two_Seqinfo_objects(x, y): Each of the 2 combined objects has sequence levels not in the other:
## - in 'x': chr1, chr2
## - in 'y': chr4
## Make sure to always combine/compare objects based on the same reference
## genome (use suppressWarnings() to suppress this warning).
## Seqinfo object with 5 sequences (1 circular) from 2 genomes (toy, NA):
## seqnames seqlengths isCircular genome
## chr1 100 NA toy
## chr2 200 FALSE toy
## chr3 300 FALSE toy
## chrM 15 TRUE toy
## chr4 NA NA <NA>
suppressWarnings(merge(si1, si2))
## Seqinfo object with 5 sequences (1 circular) from 2 genomes (toy, NA):
## seqnames seqlengths isCircular genome
## chr1 100 NA toy
## chr2 200 FALSE toy
## chr3 300 FALSE toy
## chrM 15 TRUE toy
## chr4 NA NA <NA>
Note that, strictly speaking, merging two Seqinfo objects is not a
commutative operation, i.e., in general z1 <- merge(x, y)
is not
identical to z2 <- merge(y, x)
. However z1
and z2
are guaranteed
to contain the same information (i.e. the same rows, but typically not
in the same order):
suppressWarnings(merge(si2, si1))
## Seqinfo object with 5 sequences (1 circular) from 2 genomes (toy, NA):
## seqnames seqlengths isCircular genome
## chr3 300 FALSE toy
## chr4 NA NA <NA>
## chrM 15 TRUE toy
## chr1 100 NA toy
## chr2 200 FALSE toy
Trying to merge Seqinfo objects that are not compatible will raise an error:
## This contradicts what 'x' says about circularity of chr3 and chrM:
isCircular(si2)[c("chr3", "chrM")] <- c(TRUE, FALSE)
si2
## Seqinfo object with 3 sequences (1 circular) from an unspecified genome:
## seqnames seqlengths isCircular genome
## chr3 300 TRUE <NA>
## chr4 NA NA <NA>
## chrM 15 FALSE <NA>
merge(si1, si2) # ERROR!
## Error in mergeNamedAtomicVectors(isCircular(x), isCircular(y), what = c("sequence", :
## sequences chr3, chrM have incompatible circularity flags:
## - in 'x': FALSE, TRUE
## - in 'y': TRUE, FALSE
See ?Seqinfo
for more information.
Seqinfo objects are typically found as part of higher-level objects to
represent their seqinfo()
component. For example, in TxDb objects:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
class(txdb)
## [1] "TxDb"
## attr(,"package")
## [1] "GenomicFeatures"
seqinfo(txdb)
## Seqinfo object with 711 sequences (1 circular) from hg38 genome:
## seqnames seqlengths isCircular genome
## chr1 248956422 FALSE hg38
## chr2 242193529 FALSE hg38
## chr3 198295559 FALSE hg38
## chr4 190214555 FALSE hg38
## chr5 181538259 FALSE hg38
## ... ... ... ...
## chr22_KQ759761v1_alt 145162 FALSE hg38
## chrX_KV766199v1_alt 188004 FALSE hg38
## chrX_MU273395v1_alt 619716 FALSE hg38
## chrX_MU273396v1_alt 294119 FALSE hg38
## chrX_MU273397v1_alt 330493 FALSE hg38
and in BSgenome objects:
library(BSgenome.Hsapiens.UCSC.hg38)
bsg <- BSgenome.Hsapiens.UCSC.hg38
class(bsg)
## [1] "BSgenome"
## attr(,"package")
## [1] "BSgenome"
seqinfo(bsg)
## Seqinfo object with 711 sequences (1 circular) from hg38 genome:
## seqnames seqlengths isCircular genome
## chr1 248956422 FALSE hg38
## chr2 242193529 FALSE hg38
## chr3 198295559 FALSE hg38
## chr4 190214555 FALSE hg38
## chr5 181538259 FALSE hg38
## ... ... ... ...
## chr22_KQ759761v1_alt 145162 FALSE hg38
## chrX_KV766199v1_alt 188004 FALSE hg38
## chrX_MU273395v1_alt 619716 FALSE hg38
## chrX_MU273396v1_alt 294119 FALSE hg38
## chrX_MU273397v1_alt 330493 FALSE hg38
Sanity checks:
stopifnot(identical(seqinfo(txdb), Seqinfo(genome="hg38")))
stopifnot(identical(seqinfo(bsg), Seqinfo(genome="hg38")))
Here is the output of sessionInfo()
on the system on which this document
was compiled:
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5
## [2] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0
## [3] BSgenome_1.77.0
## [4] rtracklayer_1.69.0
## [5] BiocIO_1.19.0
## [6] Biostrings_2.77.1
## [7] XVector_0.49.0
## [8] GenomicFeatures_1.61.3
## [9] AnnotationDbi_1.71.0
## [10] Biobase_2.69.0
## [11] GenomicRanges_1.61.0
## [12] GenomeInfoDb_1.45.4
## [13] IRanges_2.43.0
## [14] S4Vectors_0.47.0
## [15] Seqinfo_0.99.0
## [16] BiocGenerics_0.55.0
## [17] generics_0.1.4
## [18] BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.49.0 SummarizedExperiment_1.39.0
## [3] rjson_0.2.23 xfun_0.52
## [5] bslib_0.9.0 lattice_0.22-7
## [7] vctrs_0.6.5 tools_4.5.0
## [9] bitops_1.0-9 curl_6.2.3
## [11] parallel_4.5.0 RSQLite_2.4.0
## [13] blob_1.2.4 pkgconfig_2.0.3
## [15] Matrix_1.7-3 lifecycle_1.0.4
## [17] compiler_4.5.0 Rsamtools_2.25.0
## [19] codetools_0.2-20 htmltools_0.5.8.1
## [21] sass_0.4.10 RCurl_1.98-1.17
## [23] yaml_2.3.10 crayon_1.5.3
## [25] jquerylib_0.1.4 BiocParallel_1.43.3
## [27] cachem_1.1.0 DelayedArray_0.35.1
## [29] abind_1.4-8 digest_0.6.37
## [31] restfulr_0.0.15 bookdown_0.43
## [33] fastmap_1.2.0 grid_4.5.0
## [35] SparseArray_1.9.0 cli_3.6.5
## [37] S4Arrays_1.9.1 XML_3.99-0.18
## [39] UCSC.utils_1.5.0 bit64_4.6.0-1
## [41] rmarkdown_2.29 httr_1.4.7
## [43] matrixStats_1.5.0 bit_4.6.0
## [45] png_0.1-8 memoise_2.0.1
## [47] evaluate_1.0.3 knitr_1.50
## [49] rlang_1.1.6 DBI_1.2.3
## [51] BiocManager_1.30.25 jsonlite_2.0.0
## [53] R6_2.6.1 MatrixGenerics_1.21.0
## [55] GenomicAlignments_1.45.0