Contents

1 Introduction

SynExtend started out as a package of tools for working with objects of class Synteny built from the package DECIPHER’s FindSynteny() function.

Synteny maps provide a powerful tool for quantifying and visualizing where pairs of genomes share ordered information. Typically these maps are built from predictions of orthologous feature pairs, where groups of pairs that provide contiguous and sequential blocks in their respective genomes are referred to as a ‘syntenic block’. The identification of syntenic blocks can then be used to further interrogate the predicted orthologs themselves, or query topics like genomic rearrangements, ancestral genome reconstruction, or shared functional neighborhoods.

FindSynteny takes a different approach, finding exactly matched shared k-mers and determining where shared k-mers, or blocks of proximally shared k-mers are significant. Combining the information generated by FindSynteny with locations of genomic features allows us to simply mark where features are linked by syntenic k-mers. These linked features represent potential orthologous pairs, and can be easily evaluated on the basis of the k-mers that they share, or alignment.

2 Package Structure

Currently SynExtend contains a set of functions for inferring orthologous relationships between features in different genomes, as well as tools for predicting feature interaction networks.

2.1 Installation

  1. Install the latest version of R using CRAN.
  2. Install SynExtend with BiocManager:
if (!requireNamespace("BiocManager",
                      quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("SynExtend")

3 Usage

Using the FindSynteny function in DECIPHER builds an object of class Synteny. In this tutorial, a prebuilt DECIPHER database is used. For database construction see ?Seqs2DB in DECIPHER. This example starts with a database containing three endosymbiont genomes that were chosen to keep package data to a minimum.

library(SynExtend)
## Loading required package: DECIPHER
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: generics
## 
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following object is masked from 'package:utils':
## 
##     data
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     rank, rbind, rownames, sapply, saveRDS, scale, sequence, table,
##     tapply, transform, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: Seqinfo
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## 
## Attaching package: 'SynExtend'
## The following object is masked from 'package:stats':
## 
##     dendrapply
library(DBI)

file01 <- system.file("extdata",
                      "example_db.sqlite",
                      package = "SynExtend")

syn <- FindSynteny(dbFile = file01)
## ================================================================================
## 
## Time difference of 4.96 secs

Synteny maps represent where genomes share order. Simply printing a synteny object to the console displays a gross level view of the data inside. Objects of class Synteny can also be plotted to provide clear visual representations of the data inside. The genomes used in this example are distantly related and fairly dissimilar.

syn
##           1         3        4
## 1     1 seq  18% hits 26% hits
## 3 50 blocks     1 seq 60% hits
## 4 66 blocks 54 blocks    1 seq
pairs(syn)

Data present inside objects of class Synteny can also be accessed relatively easily. The object itself is functionally a matrix of lists, with data describing exactly matched k-mers present in the upper triangle, and data describing blocks of chained k-mers in the lower triangle. For more information see ?FindSynteny in the package DECIPHER.

The upper triangle of the synteny upbject is populated by the individual hits.

print(head(syn[[1, 2]]))
##      index1 index2 strand width start1 start2 frame1 frame2
## [1,]      1      1      0   275  63033 219267      3      3
## [2,]      1      1      0   317  63810 219999      3      3
## [3,]      1      1      0  1031  64194 220364      3      2
## [4,]      1      1      0   210  65231 221395      3      2
## [5,]      1      1      0  1523  65481 221645      3      2
## [6,]      1      1      0   353 161688 105431      3      2

The lower triangle of the synteny object is populated by the block assignments of the hits.

print(head(syn[[2, 1]]))
##      index1 index2 strand score start1 start2   end1   end2 first_hit last_hit
## [1,]      1      1      0  1914  63033 219267  67003 223167         1        5
## [2,]      1      1      0  1826 161688 105431 167990 111398         6       25
## [3,]      1      1      0  1751 707150  43365 710999  47149        26       32
## [4,]      1      1      0  1588  52605      1  55680   2978        33       40
## [5,]      1      1      0  1157 154168  97916 161487 104655        41       58
## [6,]      1      1      0  1116 314963 134777 318319 138024        59       66

To take advantage of these synteny maps, we can then overlay the gene calls for each genome present on top of our map.

Next, GFF annotations for the associated genomes are parsed to provide gene calls in a use-able format. GFFs are not the only possible source of appropriate gene calls, but they are the source that was used during package construction and testing. GFFs are imported via rtracklater::import, and converted to a DataFrame via SquaregffBy. Gene calls for both the SummarizePairs and NucleotideOverlap functions must be a named list, and those names must all be present within dimnames(syn)[[1]].

gr_obj <- rtracklayer::import(system.file("extdata",
                                          "GCF_023585725.1_ASM2358572v1_genomic.gff.gz",
                                          package = "SynExtend"))
genecalls_object <- SquaregffBy(gff_object = gr_obj,
                                verbose = TRUE)
## parsing parent-child heirarchies...
## ================================================================================
## checking feature / contig bound conflicts...
## ================================================================================
## Time difference of 8.251893 secs
head(genecalls_object)
## DataFrame with 6 rows and 14 columns
##       Index    Strand     Start      Stop        Type                 ID
##   <integer> <integer> <integer> <integer> <character>        <character>
## 1         1         1      2972      3853        gene gene-M9404_RS00010
## 2         1         0      4525      5064        gene gene-M9404_RS00015
## 3         1         0      5220      5295        gene gene-M9404_RS00020
## 4         1         0      5877      6365        gene gene-M9404_RS00025
## 5         1         0      6392      7630        gene gene-M9404_RS00030
## 6         1         0      7890      8804        gene gene-M9404_RS00035
##           Range                Product    Coding Translation_Table
##   <IRangesList>            <character> <logical>       <character>
## 1     2972-3853 archaetidylserine de..      TRUE                11
## 2     4525-5064      oligoribonuclease      TRUE                11
## 3     5220-5295               tRNA-Gly     FALSE                NA
## 4     5877-6365 tRNA (adenosine(37)-..      TRUE                11
## 5     6392-7630 N-acetylmuramoyl-L-a..      TRUE                11
## 6     7890-8804 tRNA (adenosine(37)-..      TRUE                11
##          Contig                 Dbxref           Notes         Phase
##     <character>        <CharacterList> <CharacterList> <IntegerList>
## 1 NZ_CP097750.1 GenBank:WP_250247045.1                             0
## 2 NZ_CP097750.1 GenBank:WP_250247043.1                             0
## 3 NZ_CP097750.1                                                    0
## 4 NZ_CP097750.1 GenBank:WP_250247041.1                             0
## 5 NZ_CP097750.1 GenBank:WP_250248113.1                             0
## 6 NZ_CP097750.1 GenBank:WP_250247037.1                             0
# in an effort to be space conscious, not all original gffs are kept within this package
genecalls <- get(data("genecalls", package = "SynExtend"))

SynExtend’s gffToDataFrame function will directly import gff files into a usable format, and includes other extracted information.

# this is similar to what was printed above...
head(genecalls[[1]])
## DataFrame with 6 rows and 14 columns
##       Index    Strand     Start      Stop        Type                 ID
##   <integer> <integer> <integer> <integer> <character>        <character>
## 1         1         1      2972      3853        gene gene-M9404_RS00010
## 2         1         0      4525      5064        gene gene-M9404_RS00015
## 3         1         0      5220      5295        gene gene-M9404_RS00020
## 4         1         0      5877      6365        gene gene-M9404_RS00025
## 5         1         0      6392      7630        gene gene-M9404_RS00030
## 6         1         0      7890      8804        gene gene-M9404_RS00035
##           Range                Product    Coding Translation_Table
##   <IRangesList>            <character> <logical>       <character>
## 1     2972-3853 archaetidylserine de..      TRUE                11
## 2     4525-5064      oligoribonuclease      TRUE                11
## 3     5220-5295               tRNA-Gly     FALSE                NA
## 4     5877-6365 tRNA (adenosine(37)-..      TRUE                11
## 5     6392-7630 N-acetylmuramoyl-L-a..      TRUE                11
## 6     7890-8804 tRNA (adenosine(37)-..      TRUE                11
##          Contig                 Dbxref           Notes         Phase
##     <character>        <CharacterList> <CharacterList> <IntegerList>
## 1 NZ_CP097750.1 GenBank:WP_250247045.1                             0
## 2 NZ_CP097750.1 GenBank:WP_250247043.1                             0
## 3 NZ_CP097750.1                                                    0
## 4 NZ_CP097750.1 GenBank:WP_250247041.1                             0
## 5 NZ_CP097750.1 GenBank:WP_250248113.1                             0
## 6 NZ_CP097750.1 GenBank:WP_250247037.1                             0

SynExtend’s primary functions provide a way to identify where pairs of genes are explicitly linked by syntenic hits, and then summarize those links. The first step is just identifying those links. We’re appending some funny business before generating the links because the example database that was loaded above is shipped without amino acid sequences to be conscious about package size.

tmp_db <- tempfile()

system(command = paste("cp",
                       file01,
                       tmp_db))

drv <- dbDriver("SQLite")
conn01 <- dbConnect(drv = drv,
                    tmp_db)

l01 <- NucleotideOverlap(SyntenyObject = syn,
                         GeneCalls = genecalls,
                         Verbose = TRUE)
## 
## Reconciling genecalls.
## ================================================================================
## Finding connected features.
## ================================================================================
## Time difference of 0.3560526 secs

The l01 object generated by NucleotideOverlap is a raw representation of positions on the synteny map where shared k-mers link genes between paired genomes. As such, it is analagous in shape to objects of class Synteny. This raw object is unlikely to be useful to most users, but has been left exposed to ensure that this data remains accessible should it be advantageous.

head(l01[[1, 2]])
##      QueryGene SubjectGene ExactOverlap QueryIndex SubjectIndex QLeftPos
## [1,]        24         105          332          1            1    27166
## [2,]        25         185          287          1            1    28116
## [3,]        27         186         1343          1            1    29879
## [4,]        33         189          844          1            1    35460
## [5,]        34         190         1007          1            1    37977
## [6,]        38         136         1131          1            1    42467
##      QRightPos SLeftPos SRightPos MaxKmerSize TotalKmerHits
## [1,]     27497   113974    114305         332             1
## [2,]     28402   223609    223895         287             1
## [3,]     31221   224340    225700         767             3
## [4,]     36319   226993    227852         476             2
## [5,]     39466   228899    230364         353             4
## [6,]     43626   155470    156635         320             4
head(l01[[2, 1]])
##      QueryGene SubjectGene ExactOverlap QueryIndex SubjectIndex QLeftPos
## [1,]        24         105          332          1            1    27166
## [2,]        25         185          287          1            1    28116
## [3,]        27         186          150          1            1    29879
## [4,]        27         186          426          1            1    30029
## [5,]        27         186          767          1            1    30455
## [6,]        33         189          368          1            1    35460
##      QRightPos SLeftPos SRightPos QuerySubKey SubjectSubKey SyntenicOrigin
## [1,]     27497   113974    114305           1             1            154
## [2,]     28402   223609    223895           1             1            159
## [3,]     30028   224340    224489           1             1             81
## [4,]     30454   224502    224927           1             1             82
## [5,]     31221   224934    225700           1             1             83
## [6,]     35827   226993    227360           1             1            122

This raw data can be processed to provide a straightforward summary of candidate pairs.

p01 <- SummarizePairs(SynExtendObject = l01,
                      DataBase01 = conn01,
                      Verbose = TRUE)
## Collecting pairs.
## ================================================================================
## Time difference of 7.147877 secs

The object p01 is a data.frame where each row is populated by information about a candidate orthologous pair. SummarizePairs is just summarizing the pairs identified by NucleotideOverlap, and offers no direct assignment of accuracy or truthiness. We can pass the object created by it to EvaluatePairs to get a sense of which candidates are acceptable to keep. EvaluatePairs has an un-intuitive mix of functionalities, but in its most basic form, it will generate a decoy data set is FALSE pairs, and drop initial candidates that appear below a False Discovery Rate threshold

p02 <- EvaluatePairs(InputPairs = p01,
                            DataBase01 = conn01,
                            EvaluationMethod = "none",
                            FDRCriteria = c("Delta_Background" = 0.001)) # this is the default behavior

SynExtend also has a variety of other helper functions and a suite of other tools. This is the most basic usage of the tools present here.

Session Info:

sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DBI_1.3.0           SynExtend_1.25.1    DECIPHER_3.9.0     
##  [4] Biostrings_2.81.2   Seqinfo_1.3.0       XVector_0.53.0     
##  [7] IRanges_2.47.2      S4Vectors_0.51.3    BiocGenerics_0.59.6
## [10] generics_0.1.4      BiocStyle_2.41.0   
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.43.0 rjson_0.2.23               
##  [3] xfun_0.58                   bslib_0.11.0               
##  [5] lattice_0.22-9              Biobase_2.73.1             
##  [7] vctrs_0.7.3                 tools_4.6.0                
##  [9] bitops_1.0-9                curl_7.1.0                 
## [11] parallel_4.6.0              RSQLite_3.53.1             
## [13] blob_1.3.0                  pkgconfig_2.0.3            
## [15] BiocBaseUtils_1.15.1        Matrix_1.7-5               
## [17] cigarillo_1.3.0             lifecycle_1.0.5            
## [19] compiler_4.6.0              Rsamtools_2.29.0           
## [21] tinytex_0.59                codetools_0.2-20           
## [23] htmltools_0.5.9             sass_0.4.10                
## [25] RCurl_1.98-1.18             yaml_2.3.12                
## [27] crayon_1.5.3                jquerylib_0.1.4            
## [29] BiocParallel_1.47.0         DelayedArray_0.39.3        
## [31] cachem_1.1.0                magick_2.9.1               
## [33] abind_1.4-8                 digest_0.6.39              
## [35] restfulr_0.0.16             bookdown_0.46              
## [37] grid_4.6.0                  fastmap_1.2.0              
## [39] SparseArray_1.13.2          cli_3.6.6                  
## [41] magrittr_2.0.5              S4Arrays_1.13.0            
## [43] XML_3.99-0.23               bit64_4.8.2                
## [45] rmarkdown_2.31              httr_1.4.8                 
## [47] matrixStats_1.5.0           bit_4.6.0                  
## [49] otel_0.2.0                  memoise_2.0.1              
## [51] evaluate_1.0.5              knitr_1.51                 
## [53] GenomicRanges_1.65.0        BiocIO_1.23.3              
## [55] rtracklayer_1.73.0          rlang_1.2.0                
## [57] Rcpp_1.1.1-1.1              BiocManager_1.30.27        
## [59] jsonlite_2.0.0              R6_2.6.1                   
## [61] MatrixGenerics_1.25.0       GenomicAlignments_1.49.0