1 Introduction

DOTSeq is an R package for detecting differential translation of open reading frames (ORFs) from ribosome profiling (Ribo-seq)^1,2 and matched RNA-seq datasets.³ Translation is a tightly regulated process, where multiple ORFs within the same gene can modulate protein synthesis.

While useful, traditional gene-level approaches (e.g., annota2seq and terapadog available at Bioconductor^4,5) may overlook these regulatory layers. DOTSeq addresses this limitation by modelling translation at the ORF level.

DOTSeq provides two complementary modules:

Differential ORF Usage (DOU): Detects cis-translational regulation by testing whether the relative usage of ORFs within a gene changes across biological conditions. Implemented via a beta-binomial generalised linear model (GLM) (glmmTMB).⁶
Differential Translation Efficiency (DTE): Tests for changes in ribosome occupancy relative to RNA abundance across conditions using a negative-binomial GLM (DESeq2).⁷

Together, DOU and DTE modules allow global identification of translation-specific effects, revealing regulatory events such as uORF-mediated repression.

2 Load required library

We start by loading the DOTSeq package and its dependencies. DOTSeq builds on the Bioconductor ecosystem, e.g., rtracklayer and txdbmaker for parsing GTF files, and the SummarizedExperiment container for storing count matrices and metadata.^8–11

library(DOTSeq)
library(SummarizedExperiment, quietly = TRUE)

3 Example dataset

We demonstrate DOTSeq functionalities using a HeLa cell cycle dataset.¹² This dataset includes the snapshots of translation landscapes in HeLa cells synchronised at Mitotic Cycling, Mitotic Arrest, and Interphase. Each condition contains two biological replicates.

Here, we use only a small portion of the dataset to speed up the vignette execution while still demonstrating all key functionalities.

dir <- system.file("extdata", package = "DOTSeq")
list.files(dir)

## [1] "featureCounts.cell_cycle_subset.txt.gz" 
## [2] "gencode.v47.orf_flattened_subset.bed.gz"
## [3] "gencode.v47.orf_flattened_subset.gtf.gz"
## [4] "metadata.txt.gz"

4 Input data

Ribo-seq and RNA-seq reads are preprocessed using standard pipelines. Reads were adapter-trimmed using Cutadapt,¹³ aligned to the reference genome using STAR,¹⁴ and summarised at the ORF level using featureCounts.¹⁵

Each row of the input count matrix corresponds to an ORF, while columns contain read counts for individual samples. The first six columns represent ORF-level annotation fields such as gene ID, transcript ID, chromosome, and coordinates. This ensures accurate mapping between read counts and annotated ORFs, which is critical for our analysis.

cnt <- read.table(
    file.path(dir, "featureCounts.cell_cycle_subset.txt.gz"), 
    header = TRUE, 
    comment.char = "#"
)
names(cnt) <- gsub(".*(SRR[0-9]+).*", "\\1", names(cnt))
head(cnt)

##              Geneid   Chr   Start     End Strand Length SRR24230465 SRR24230467
## 1 ENSG00000003056.8 chr12 8940362 8940426      -     65           0           1
## 2 ENSG00000003056.8 chr12 8940428 8940634      -    207           0           1
## 3 ENSG00000003056.8 chr12 8940636 8940698      -     63           0           0
## 4 ENSG00000003056.8 chr12 8941261 8941269      -      9           0           0
## 5 ENSG00000003056.8 chr12 8941316 8941360      -     45           0           0
## 6 ENSG00000003056.8 chr12 8941369 8941392      -     24           0           0
##   SRR24230469 SRR24230471 SRR24230480 SRR24230482 SRR24230462 SRR24230466
## 1           0           0           0           0           4          15
## 2           0           0           0           0           2           1
## 3           0           0           0           0           0           0
## 4           0           0           0           0           0           0
## 5           0           0           0           0           1           2
## 6           0           0           0           0           5           5
##   SRR24230472 SRR24230474 SRR24230477 SRR24230479
## 1           7           6           5          40
## 2           0           0           3          10
## 3           0           0           0           0
## 4           1           1           0           0
## 5           2           0           1           4
## 6           2           7           4          30

5 ORF annotation

Flattened ORF-level annotations were generated using the orf_to_gtf.py wrapper for RIBOSS.¹⁶ Alternatively, annotations can be prepared within R using getORFs() (see Prepare annotations). This pre-processing step ensures that each ORF occupies a distinct, non-overlapping genomic interval, avoiding ambiguity when assigning reads. This design follows the same logic as DEXSeq’s exon flattening, but at the level of ORFs rather than exons.¹⁷

gtf <- file.path(dir, "gencode.v47.orf_flattened_subset.gtf.gz")
bed <- file.path(dir, "gencode.v47.orf_flattened_subset.bed.gz")

6 Prepare condition table

The condition table defines the experimental design for both Ribo-seq and RNA-seq samples. It allows DOTSeq to match Ribo/RNA-seq pairs and to model translation-specific interactions.

Each row corresponds to one sequencing run. The table must include four essential columns: - run: unique identifier for each sequencing run

strategy: either "ribo" or "rna" (or "0" or "1"), indicating the sequencing strategy
replicate: biological replicate identifier
condition: experimental condition (e.g., control, treatment)

Additional columns (e.g., batch) can also be included. In the example below, we load the metadata and restrict the analysis to samples pre-treated with cycloheximide (chx) prior to library preparation:

cond <- read.table(file.path(dir, "metadata.txt.gz"))

# Ensure required column names for DOTSeq
names(cond) <- c("run", "strategy", "replicate", "treatment", "condition")

# Filter to include only chx-treated samples
cond <- cond[cond$treatment == "chx", ]
head(cond)

##            run strategy replicate treatment       condition
## 1  SRR24230462      rna         2       chx Mitotic_Cycling
## 4  SRR24230465     ribo         2       chx Mitotic_Cycling
## 5  SRR24230466      rna         1       chx Mitotic_Cycling
## 6  SRR24230467     ribo         1       chx Mitotic_Cycling
## 8  SRR24230469     ribo         2       chx  Mitotic_Arrest
## 10 SRR24230471     ribo         1       chx  Mitotic_Arrest

7 Run DOTSeq

The main function DOTSeq() performs the differential analysis workflow, starting from parsing input files to post hoc inference. It sequentially runs three key steps:

Construction of two SummarizedExperiment objects (DOTSeqDataSet()).
model fitting for DOU via beta-binomial GLM (fitDOU() using glmmTMB) and DTE via negative-binomial GLM (DESeq2).
Post hoc inference using emmeans¹⁸ for contrasts and ashr¹⁹ for adaptive shrinkage of effect sizes (testDOU()).

While each step can be run individually, we construct DOTSeqDataSets with DOTSeqDataSetsFromFeatureCounts(), followed by the DOTSeq() main function here for simplicity. To capture the translation-specific interaction effect (\(\beta_{\text{int}}\)), DOTSeq() uses formula = ~ condition * strategy with dispersion_modeling = "auto" by default. The design and dispersion formulas can be customised to include batch and random effects.

# Create a DOTSeqDataSets object
datasets <- DOTSeqDataSetsFromFeatureCounts(
    count_table = cnt,
    condition_table = cond,
    flattened_gtf = gtf,
    flattened_bed = bed,
    verbose = FALSE
)

# Run the DOTSeq workflow
d <- DOTSeq(datasets = datasets)

## starting Differential ORF Usage (DOU) analysis

## starting post hoc analysis

## effect sizes for interaction_specific contrasts calculated and shrunk succesfully

## effect sizes for strategy_specific contrasts calculated and shrunk successfully

## DOU runtime: 3 mins 44.655 secs

## DOU model fitting summary: Mitotic_Arrest - Interphase

##   models fitted: 695

##   non-convergence or invalid Hessian: 1434

##   not fitted (filtered out): 4513

## DOU model fitting summary: Mitotic_Cycling - Interphase

##   models fitted: 695

##   non-convergence or invalid Hessian: 1434

##   not fitted (filtered out): 4513

## DOU model fitting summary: Mitotic_Cycling - Mitotic_Arrest

##   models fitted: 695

##   non-convergence or invalid Hessian: 1434

##   not fitted (filtered out): 4513

## starting Differential Translation Efficiency (DTE) analysis

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## starting post hoc analysis

## DTE runtime: 8.446 secs

## DTE model fitting summary: Mitotic_Cycling - Mitotic_Arrest

##   models fitted: 1137

##   padj = NA (filtered or flagged): 5505

## DTE model fitting summary: Mitotic_Arrest - Interphase

##   models fitted: 1941

##   padj = NA (filtered or flagged): 4701

## DTE model fitting summary: Mitotic_Cycling - Interphase

##   models fitted: 1712

##   padj = NA (filtered or flagged): 4930

In this case Interphase was automatically assigned as the baseline. The target and baseline conditions can explicitly defined as target = "Mitotic_Cycling" and baseline = "Interphase".

8 Inspect results

The output is a DOTSeqDataSets object that encapsulates the post hoc contrast results for DOU and DTE analyses.

show(d)

## DOTSeqDataSets

##   DOU: DOUData

##   DTE: DTEData

## Use getDOU() or getDTE() to access contents.

## Use getContrasts() if post hoc inference has been performed.

As indicated, we will use getContrasts() to extract the results.

results <- getContrasts(d, type = "interaction")
results

## $DOU
## DataFrame with 2085 rows and 10 columns
##        betahat   sebetahat    waldpval    waldpadj    posterior     qvalue
##      <numeric>   <numeric>   <numeric>   <numeric>    <numeric>  <numeric>
## 1    -0.315677    0.291932 2.79546e-01 0.339763478    -0.046734 0.21887806
## 2     1.868553    0.457288 4.38573e-05 0.000104955     1.837607 0.00012196
## 3     0.896892    0.305651 3.34228e-03 0.005784391     0.791524 0.01020028
## 4    -1.420373    0.538238 8.31677e-03 0.013632323    -1.202033 0.01277786
## 5    -1.303596    0.440822 3.10452e-03 0.005399848    -1.199171 0.00661518
## ...        ...         ...         ...         ...          ...        ...
## 2081 -21.21978 1.86395e+04  0.99909167   0.9999930 -1.07384e-08  0.4883262
## 2082  -1.38197 9.02272e-01  0.12560732   0.3046531 -1.91868e-01  0.3665421
## 2083   2.09437 7.46692e-01  0.00503382   0.0296057  8.40108e-01  0.0853905
## 2084  -1.09952 4.58713e-01  0.01653135   0.0755777 -5.26534e-01  0.1287686
## 2085   1.04066 3.59667e-01  0.00381093   0.0244888  7.78991e-01  0.0488665
##            lfdr       lfsr                 orf_id               contrast
##       <numeric>  <numeric>            <character>                  <Rle>
## 1    0.85195633 0.87264884 ENSG00000003056.8:O008 Mitotic_Arrest - Int..
## 2    0.00160116 0.00162365 ENSG00000003056.8:O013 Mitotic_Arrest - Int..
## 3    0.11748084 0.11895566 ENSG00000004478.8:O001 Mitotic_Arrest - Int..
## 4    0.14775248 0.15132045 ENSG00000004478.8:O002 Mitotic_Arrest - Int..
## 5    0.07948470 0.08091441 ENSG00000004478.8:O005 Mitotic_Arrest - Int..
## ...         ...        ...                    ...                    ...
## 2081   0.832061   0.916030 ENSG00000133872.14:O.. Mitotic_Cycling - Mi..
## 2082   0.784013   0.804548 ENSG00000133872.14:O.. Mitotic_Cycling - Mi..
## 2083   0.356171   0.361086 ENSG00000133872.14:O.. Mitotic_Cycling - Mi..
## 2084   0.484493   0.489162 ENSG00000134153.10:O.. Mitotic_Cycling - Mi..
## 2085   0.232850   0.234358 ENSG00000134153.10:O.. Mitotic_Cycling - Mi..
## 
## $DTE
## log2 fold change (MMSE): 0,0,0,0,-1,+1 
## Wald test p-value: 0,0,0,0,-1,+1 
## DataFrame with 19926 rows and 7 columns
##        baseMean log2FoldChange     lfcSE      pvalue        padj
##       <numeric>      <numeric> <numeric>   <numeric>   <numeric>
## 1      6.612288      0.0517657  0.782074    0.819230          NA
## 2      1.566572     -0.0695095  0.829717    0.638754          NA
## 3      0.000000      0.0000000  0.854322          NA          NA
## 4      0.224656      0.0000000  0.843721    1.000000          NA
## 5      0.967834      0.0000000  0.833729    1.000000          NA
## ...         ...            ...       ...         ...         ...
## 19922   11.4769    -0.19809388  0.891331 7.36964e-01 8.91883e-01
## 19923   10.0378     0.00327321  1.050203 9.94842e-01 9.95424e-01
## 19924   30.6207    -1.41437812  0.723772 1.37107e-02 4.97302e-02
## 19925   10.2243    -0.13656856  1.028994 7.95281e-01 9.17725e-01
## 19926 1918.8799     1.42139618  0.194002 1.07081e-13 2.90989e-12
##                       orf_id               contrast
##                  <character>            <character>
## 1     ENSG00000003056.8:O001 Mitotic_Cycling - Mi..
## 2     ENSG00000003056.8:O002 Mitotic_Cycling - Mi..
## 3     ENSG00000003056.8:O003 Mitotic_Cycling - Mi..
## 4     ENSG00000003056.8:O004 Mitotic_Cycling - Mi..
## 5     ENSG00000003056.8:O005 Mitotic_Cycling - Mi..
## ...                      ...                    ...
## 19922 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19923 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19924 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19925 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19926 ENSG00000134153.10:O.. Mitotic_Cycling - In..

These data frames summarise differential translation results for each ORF. Key statistics include:

posterior: The shrunken log-odds fold change in ORF usage (DOU), reflecting the estimated effect size after empirical Bayes shrinkage.
lfsr (Local False Sign Rate): The probability that the estimated effect has the wrong sign, providing a measure of directional uncertainty.
log2FoldChange: The shrunken log2 fold change in ORF translation efficiency (DTE), indicating relative changes in ribosome loading.
padj: The adjusted p-value for DTE, accounting for multiple testing.

The DTE effect size (log2FoldChange) was derived from DESeq2::lfcShrink() within the DOTSeq() main function. We specify type = ashr to apply adaptive shrinkage and compute all pairwise contrasts between conditions.

Additional statistics for DOU results include:

betahat: The raw (unshrunken) effect size estimate.
sebetahat: The standard error of the raw estimate.
qvalue and lfdr (Local False Discovery Rate): Alternative measures of significance and false discovery, also computed by ashr.
waldpval and waldpadj: Wald test results are included for completeness, providing frequentist estimates of significance.

We can also inspect the rowData and rowRanges of the SummarizedExperiment objects to explore ORF level annotation. This includes mappings from gene to transcript to ORF IDs, as well as orf_type, which currently supports uORF, mORF, and dORF. Filtering criteria count_filter and singlet_filter determine whether an ORF is retained (is_kept) for model fitting. DOUResults stores the model fit results and the emmeans objects for post hoc contrasts.

rowData(getDOU(d))

## DataFrame with 2129 rows and 7 columns
##                                    gene_id  orf_number    orf_type count_filter
##                                <character> <character> <character>    <logical>
## ENSG00000003056.8:O007   ENSG00000003056.8         007        dORF         TRUE
## ENSG00000003056.8:O008   ENSG00000003056.8         008        dORF         TRUE
## ENSG00000003056.8:O009   ENSG00000003056.8         009        mORF         TRUE
## ENSG00000003056.8:O013   ENSG00000003056.8         013        uORF         TRUE
## ENSG00000004478.8:O001   ENSG00000004478.8         001        mORF         TRUE
## ...                                    ...         ...         ...          ...
## ENSG00000134153.10:O001 ENSG00000134153.10         001        dORF         TRUE
## ENSG00000134153.10:O002 ENSG00000134153.10         002        dORF         TRUE
## ENSG00000134153.10:O003 ENSG00000134153.10         003        dORF         TRUE
## ENSG00000134153.10:O004 ENSG00000134153.10         004        dORF         TRUE
## ENSG00000134153.10:O005 ENSG00000134153.10         005        mORF         TRUE
##                         singlet_filter   is_kept DOUResults
##                              <logical> <logical>     <list>
## ENSG00000003056.8:O007            TRUE      TRUE  <PostHoc>
## ENSG00000003056.8:O008            TRUE      TRUE  <PostHoc>
## ENSG00000003056.8:O009            TRUE      TRUE  <PostHoc>
## ENSG00000003056.8:O013            TRUE      TRUE  <PostHoc>
## ENSG00000004478.8:O001            TRUE      TRUE  <PostHoc>
## ...                                ...       ...        ...
## ENSG00000134153.10:O001           TRUE      TRUE  <PostHoc>
## ENSG00000134153.10:O002           TRUE      TRUE  <PostHoc>
## ENSG00000134153.10:O003           TRUE      TRUE  <PostHoc>
## ENSG00000134153.10:O004           TRUE      TRUE  <PostHoc>
## ENSG00000134153.10:O005           TRUE      TRUE  <PostHoc>

The rowRanges slot retains the genomic coordinates corresponding to ORFs, extending from the translational start (AUG) to stop codons. These ranges may include intronic regions, depending on the gene structures.

rowRanges(getDOU(d))

## GRanges object with 2129 ranges and 7 metadata columns:
##                           seqnames            ranges strand |
##                              <Rle>         <IRanges>  <Rle> |
##    ENSG00000003056.8:O007    chr12   8941510-8941518      - |
##    ENSG00000003056.8:O008    chr12   8941590-8941805      - |
##    ENSG00000003056.8:O009    chr12   8941818-8946404      - |
##    ENSG00000003056.8:O013    chr12   8949621-8949713      - |
##    ENSG00000004478.8:O001    chr12   2795140-2803258      + |
##                       ...      ...               ...    ... .
##   ENSG00000134153.10:O001    chr15 34084030-34084071      - |
##   ENSG00000134153.10:O002    chr15 34084109-34084123      - |
##   ENSG00000134153.10:O003    chr15 34084135-34084188      - |
##   ENSG00000134153.10:O004    chr15 34084192-34084206      - |
##   ENSG00000134153.10:O005    chr15 34084334-34101839      - |
##                                      gene_id  orf_number    orf_type
##                                  <character> <character> <character>
##    ENSG00000003056.8:O007  ENSG00000003056.8         007        dORF
##    ENSG00000003056.8:O008  ENSG00000003056.8         008        dORF
##    ENSG00000003056.8:O009  ENSG00000003056.8         009        mORF
##    ENSG00000003056.8:O013  ENSG00000003056.8         013        uORF
##    ENSG00000004478.8:O001  ENSG00000004478.8         001        mORF
##                       ...                ...         ...         ...
##   ENSG00000134153.10:O001 ENSG00000134153.10         001        dORF
##   ENSG00000134153.10:O002 ENSG00000134153.10         002        dORF
##   ENSG00000134153.10:O003 ENSG00000134153.10         003        dORF
##   ENSG00000134153.10:O004 ENSG00000134153.10         004        dORF
##   ENSG00000134153.10:O005 ENSG00000134153.10         005        mORF
##                           count_filter singlet_filter   is_kept DOUResults
##                              <logical>      <logical> <logical>     <list>
##    ENSG00000003056.8:O007         TRUE           TRUE      TRUE  <PostHoc>
##    ENSG00000003056.8:O008         TRUE           TRUE      TRUE  <PostHoc>
##    ENSG00000003056.8:O009         TRUE           TRUE      TRUE  <PostHoc>
##    ENSG00000003056.8:O013         TRUE           TRUE      TRUE  <PostHoc>
##    ENSG00000004478.8:O001         TRUE           TRUE      TRUE  <PostHoc>
##                       ...          ...            ...       ...        ...
##   ENSG00000134153.10:O001         TRUE           TRUE      TRUE  <PostHoc>
##   ENSG00000134153.10:O002         TRUE           TRUE      TRUE  <PostHoc>
##   ENSG00000134153.10:O003         TRUE           TRUE      TRUE  <PostHoc>
##   ENSG00000134153.10:O004         TRUE           TRUE      TRUE  <PostHoc>
##   ENSG00000134153.10:O005         TRUE           TRUE      TRUE  <PostHoc>
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths

9 Visualisation

DOTSeq provides the plotDOT() function for visualising DOU and DTE results. The following plots can be generated:

ou <- results$DOU[results$DOU$contrast == "Mitotic_Cycling - Interphase", ]
te <- results$DTE[results$DTE$contrast == "Mitotic_Cycling - Interphase", ]

results <- merge(ou, te, by = c("orf_id", "contrast"), all = TRUE)

9.1 Venn diagram:

To explore the relationship between DOU and DTE, we can visualise the overlap between significant ORFs identified by each module. ORFs that are significant in DOU and DTE analyses suggest changes in both usage and translation efficiency, potentially indicating strong translational control.

plotDOT(plot_type = "venn", results = results, force_new_device = FALSE)

## Venn diagram plotted

9.2 Composite scatter plot:

To compare DOU and DTE, we generate a composite scatter plot of their respective effect sizes. Marginal density plots along each axis illustrate the distribution of effect sizes for each metric. Spearman correlation is calculated and reported.

For this scatter plot, points are colour-coded by significance, allowing us to identify ORFs with strong effects in one analysis but weak in the other. This visualisation helps highlight patterns of differential translation across ORFs.

plotDOT(
    plot_type = "composite", 
    results = results, 
    plot_params = list(color_by = "significance", legend_position = "bottomright"),
    force_new_device = FALSE
)

## Spearman correlation between DOU and DTE estimates: 0.823, p-value: <2e-16

## composite plot colored by significance plotted

If DOUData is given as an input, points can also be colour-coded by orf_type. This enables quick identification of patterns in differential translation across different ORF types.

plotDOT(
    plot_type = "composite", 
    results = results, 
    data = getDOU(d), 
    plot_params = list(color_by = "orf_type", legend_position = "bottomright"),
    force_new_device = FALSE
)

## Spearman correlation between DOU and DTE estimates: 0.823, p-value: <2e-16

## composite plot colored by orf_type plotted

9.3 Volcano plot:

To highlight ORFs with strong differential usage, we construct a volcano plot comparing significance versus DOU effect size.

By default, gene identifiers are mapped from Ensembl gene IDs to HGNC gene symbols using biomaRt This mapping improves plot readability by displaying gene symbols. The resulting ID mapping dataframe can be stored as a dataframe and reused across plots to avoid repeated downloads.

mapping <- plotDOT(
    plot_type = "volcano", 
    results = results,
    id_mapping = TRUE,
    plot_params = list(color_by = "significance", top_hits = 3, legend_position = "topright"),
    force_new_device = FALSE
)

## retrieving gene annotation from BioMart

## Ensembl site unresponsive, trying useast mirror

## retrieved 275 gene symbols

## capping volcano plot y-axis at -log10(lfsr) = 20

## volcano plot colored by significance plotted

Similarly, volcano plots can be colour-coded by significance or ORF types, depending on whether rowdata is given as an input.

plotDOT(
    plot_type = "volcano", 
    results = results,
    data = getDOU(d),
    id_mapping = mapping,
    plot_params = list(color_by = "orf_type", top_hits = 3, legend_position = "topright"),
    force_new_device = FALSE
)

## capping volcano plot y-axis at -log10(lfsr) = 20

## volcano plot colored by orf_type plotted

9.4 Heatmap:

To visualise the top-ranked genes, we generate a heatmap based on DOU metrics for the top 20 uORF-regulated genes.

plotDOT(
    plot_type = "heatmap", 
    results = results, 
    data = getDOU(d), 
    id_mapping = mapping, 
    plot_params = list(rank_by = "score", top_hits = 20),
    force_new_device = FALSE
)

## plotting heatmap for the top 20 DOU genes

## heatmap plotted

9.5 ORF usage plot:

We can visualise the estimated ORF usage within a gene as inferred by the DOU module. This plot combines the probabilistic estimates of ORF usage with post hoc statistical significance annotations, allowing interpretation of DOU across conditions. If the ID mapping dataframe provided, the gene symbol is displayed as the plot title.

orderby <- c("Mitotic_Cycling", "Mitotic_Arrest", "Interphase")

id <- "TAF2"

plotDOT(
    plot_type = "usage",
    data = getDOU(d), 
    gene_id = id, 
    id_mapping = mapping, 
    plot_params = list(order_by = orderby),
    force_new_device = FALSE
)

## ORF usage plot for TAF2 plotted

The plotDOT() function supports querying according to the gene IDs or gene symbols, as listed in the rowData, rowRanges and/or the ID mapping dataframe. Stable gene IDs (e.g., ENSG00000060339) can be used without their version suffix (e.g., .14).

id <- "ENSG00000060339"

plotDOT(
    plot_type = "usage",
    data = getDOU(d),
    gene_id = id, 
    id_mapping = mapping, 
    plot_params = list(order_by = orderby),
    force_new_device = FALSE
)

## ORF usage plot for ENSG00000060339 plotted

10 Summary

DOTSeq offers a unified framework for modelling translational control at the ORF level. DOTSeq has two modules capturing distinct biological signals: DOU tests for shifts in relative usage within genes (cis-regulation), whereas DTE tests for changes in per-ORF translation efficiency relative to transcript abundance. They can be used together to provide a fuller picture of translational regulation.

11 References

1. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. science 324, 218–223 (2009).

2. Ingolia, N. T., Hussmann, J. A. & Weissman, J. S. Ribosome profiling: Global views of translation. Cold Spring Harbor perspectives in biology 11, a032698 (2019).

3. Lim, C. S. & Chieng, G. S. Differential analysis of translation efficiency and usage of open reading frames using dotseq. bioRxiv 2025–09 (2025).

4. Oertlin, C. et al. Generally applicable transcriptome-wide analysis of translation using anota2seq. Nucleic acids research 47, e70–e70 (2019).

5. Carancini, G. Terapadog: Translational Efficiency Regulation Analysis Using the Padog Method. (2025). doi:10.18129/B9.bioc.terapadog.

6. Brooks, M. E. et al. GlmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. The R Journal 9, 378–400 (2017).

7. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome biology 15, 550 (2014).

8. Lawrence, M., Gentleman, R. & Carey, V. Rtracklayer: An r package for interfacing with genome browsers. Bioinformatics 25, 1841 (2009).

9. Pagès, H., Carlson, M., Aboyoun, P., Falcon, S. & Morgan, M. Txdbmaker: Tools for making txdb objects from genomic annotations. R. package version 1.6 1, (2025).

10. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS computational biology 9, e1003118 (2013).

11. Morgan, M., Obenchain, V., Hester, J. & Pagès, H. Summarized Experiment: A Container (S4 Class) for Matrix-Like Assays. (2025). doi:10.18129/B9.bioc.SummarizedExperiment.

12. Ly, J. et al. Nuclear release of eIF1 restricts start-codon selection during mitosis. Nature 635, 490–498 (2024).

13. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. journal 17, 10–12 (2011).

14. Dobin, A. et al. STAR: Ultrafast universal rna-seq aligner. Bioinformatics 29, 15–21 (2013).

15. Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

16. Lim, C. S., Gibbon, A. K., Tran Nguyen, A. T., Chieng, G. S. & Brown, C. M. RIBOSS detects novel translational events by combining long-and short-read transcriptome and translatome profiling. Briefings in bioinformatics 26, bbaf164 (2025).

17. Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from rna-seq data. Genome research 22, 2008–2017 (2012).

18. Lenth, R. V. & Piaskowski, J. Emmeans: Estimated Marginal Means, Aka Least-Squares Means. (2025).

19. Stephens, M. False discovery rates: A new deal. Biostatistics 18, 275–294 (2017).

20. Tjeldnes, H. et al. ORFik: A comprehensive r toolkit for the analysis of translation. BMC bioinformatics 22, 336 (2021).

Appendix

A Prepare annotations

Besides using the orf_to_gtf.py wrapper,¹⁶ we can generate ORF-level annotation natively in R using the getORFs() function, which is based on the ORFik’s engine.²⁰ This approach creates a GRanges object.

library(withr)

falink <- "https://ftp.ensembl.org/pub/release-78/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP5.cdna.all.fa.gz"
gtflink <- "https://ftp.ensembl.org/pub/release-78/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.78.gtf.gz"

annotation <- basename(gtflink)
sequences <- basename(falink)

download.file(gtflink, destfile=annotation)
download.file(falink, destfile=sequences)

# This will generate flattened ORF-level annotation
gr <- getORFs(
    sequences = sequences,
    annotation = annotation,
    organism = "Drosophila melanogaster"
)

withr::defer(unlink(c(annotation, sequences)))

We perform read counting using the RNA-seq BAM files from the pasillaBamSubset package. We can optionally clean up the BAM files before read counting by discarding reads mapped to introns, noncoding genes, and intergenic regions. Note that this dataset is not derived from a Ribo-seq experiment and is used solely for illustrative purposes. It should not be used for biological interpretation.

library(pasillaBamSubset)
library(GenomeInfoDb)

bam_list <- c(untreated1_chr4(), untreated3_chr4())

# Keep only reads mapped to the exons of coding genes
temp_dir <- tempdir()
getExonicReads(gr = gr, seqlevels_style = "UCSC", bam_files = bam_list, bam_output_dir = temp_dir, coding_genes_only = TRUE)

# Get the list of filtered BAM files
bam_list <- list.files(
    path = temp_dir,
    pattern = "*exonic.*",
    full.names = TRUE
)

cnt <- countReads(gr = gr, bam_files = bam_list)
withr::defer(unlink(bam_list))

Here we make use the ORF-level annotation and create count and condition tables as input for DOTSeqDataSetsFromSummarizeOverlaps(). This is alternative to using DOTSeqDataSetsFromFeatureCounts() as demonstrated at the beginning.

set.seed(42)

# Create count_table
# Create two replicates for each condition with random scaling
rna_treated_reps <- do.call(cbind, replicate(2, cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.5, max = 2), simplify = FALSE))
rna_control_reps <- do.call(cbind, replicate(2, cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.1, max = 0.5), simplify = FALSE))
ribo_treated_reps <- do.call(cbind, replicate(2, cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.5, max = 2), simplify = FALSE))
ribo_control_reps <- do.call(cbind, replicate(2, cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.1, max = 0.5), simplify = FALSE))

# Combine and name columns
colnames(rna_treated_reps) <- paste0("rna_treated", 1:2)
colnames(rna_control_reps) <- paste0("rna_control", 1:2)
colnames(ribo_treated_reps) <- paste0("ribo_treated", 1:2)
colnames(ribo_control_reps) <- paste0("ribo_control", 1:2)

cnt_expanded <- cbind(
    rna_treated_reps, 
    rna_control_reps, 
    ribo_treated_reps, 
    ribo_control_reps
)

# Convert numbers to integer
cnt_expanded <- round(cnt_expanded)
storage.mode(cnt_expanded) <- "integer"

rownames(cnt_expanded) <- rownames(cnt)
cnt_expanded <- as.data.frame(cnt_expanded)


# Create condition_table
# Sample names from cnt_expanded
sample_names <- colnames(cnt_expanded)

# Define condition and strategy for each sample
condition <- c(
    rep("treated", 2), 
    rep("control", 2), 
    rep("treated", 2), 
    rep("control", 2)
)
strategy <- c(rep("RNA", 4), rep("Ribo", 4))

cond <- data.frame(
    run = sample_names,
    replicate = c(1,2),
    condition = factor(condition, levels = c("control", "treated")),
    strategy = factor(strategy, levels = c("RNA", "Ribo"))
)

# Create a DOTSeqDataSets object
d <- DOTSeqDataSetsFromSummarizeOverlaps(
    count_table = cnt_expanded, 
    condition_table = cond, 
    annotation = gr
)

invisible(file.remove(bam_list))
invisible(file.remove(sub("\\.gz$", ".sqlite", annotation)))

B Simulation and benchmarking

DOTSeq includes the function simDOT() for generating simulated Ribo-seq and RNA-seq count matrices. It enables benchmarking under controlled scenarios with configurable effect sizes, batch effects, and ORF types.

Simulations in our manuscript³ show that:

The DOU module performs well at low-to-moderate effect sizes on DOU-type scenarios. This module is specifically designed to detect condition-dependent shifts in the expected proportion of Ribo-seq to RNA-seq counts for individual ORFs relative to other ORFs within the same gene (i.e., changes in ORF usage). This module targets cis-regulatory usage changes that alter relative ORF proportions within genes.
The DTE module performs better at large-magnitude changes in translation efficiency. This module captures ORF-level changes in ribosome occupancy relative to RNA abundance.

C Session info

sessionInfo()

## R Under development (unstable) (2025-12-22 r89219)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.41.0 Biobase_2.71.0             
##  [3] GenomicRanges_1.63.1        Seqinfo_1.1.0              
##  [5] IRanges_2.45.0              S4Vectors_0.49.0           
##  [7] BiocGenerics_0.57.0         generics_0.1.4             
##  [9] MatrixGenerics_1.23.0       matrixStats_1.5.0          
## [11] DOTSeq_0.99.4               BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##   [1] eulerr_7.0.4             RColorBrewer_1.1-3       jsonlite_2.0.0          
##   [4] magrittr_2.0.4           magick_2.9.0             TH.data_1.1-5           
##   [7] estimability_1.5.1       GenomicFeatures_1.63.1   farver_2.1.2            
##  [10] nloptr_2.2.1             rmarkdown_2.30           BiocIO_1.21.0           
##  [13] vctrs_0.6.5              memoise_2.0.1            minqa_1.2.8             
##  [16] Rsamtools_2.27.0         RCurl_1.98-1.17          SQUAREM_2021.1          
##  [19] tinytex_0.58             mixsqp_0.3-54            progress_1.2.3          
##  [22] htmltools_0.5.9          S4Arrays_1.11.1          curl_7.0.0              
##  [25] truncnorm_1.0-9          SparseArray_1.11.10      sass_0.4.10             
##  [28] bslib_0.9.0              sandwich_3.1-1           httr2_1.2.2             
##  [31] emmeans_2.0.1            zoo_1.8-15               cachem_1.1.0            
##  [34] TMB_1.9.19               GenomicAlignments_1.47.0 lifecycle_1.0.5         
##  [37] pkgconfig_2.0.3          Matrix_1.7-4             R6_2.6.1                
##  [40] fastmap_1.2.0            GenomeInfoDbData_1.2.15  rbibutils_2.4           
##  [43] digest_0.6.39            numDeriv_2016.8-1.1      AnnotationDbi_1.73.0    
##  [46] DESeq2_1.51.6            irlba_2.3.5.1            RSQLite_2.4.5           
##  [49] labeling_0.4.3           filelock_1.0.3           invgamma_1.2            
##  [52] polyclip_1.10-7          httr_1.4.7               abind_1.4-8             
##  [55] mgcv_1.9-4               compiler_4.6.0           withr_3.0.2             
##  [58] bit64_4.6.0-1            S7_0.2.1                 BiocParallel_1.45.0     
##  [61] DBI_1.2.3                ggsignif_0.6.4           biomaRt_2.67.1          
##  [64] MASS_7.3-65              rappdirs_0.3.3           DelayedArray_0.37.0     
##  [67] rjson_0.2.23             tools_4.6.0              otel_0.2.0              
##  [70] glue_1.8.0               restfulr_0.0.16          nlme_3.1-168            
##  [73] polylabelr_0.3.0         grid_4.6.0               glmmTMB_1.1.13          
##  [76] gtable_0.3.6             BSgenome_1.79.1          hms_1.1.4               
##  [79] data.table_1.18.0        xml2_1.5.1               XVector_0.51.0          
##  [82] stringr_1.6.0            pillar_1.11.1            splines_4.6.0           
##  [85] dplyr_1.1.4              BiocFileCache_3.1.0      lattice_0.22-7          
##  [88] survival_3.8-3           rtracklayer_1.71.3       bit_4.6.0               
##  [91] tidyselect_1.2.1         locfit_1.5-9.12          Biostrings_2.79.4       
##  [94] pbapply_1.7-4            knitr_1.51               reformulas_0.4.3.1      
##  [97] bookdown_0.46            xfun_0.55                stringi_1.8.7           
## [100] UCSC.utils_1.7.1         yaml_2.3.12              boot_1.3-32             
## [103] evaluate_1.0.5           codetools_0.2-20         cigarillo_1.1.0         
## [106] tibble_3.3.1             BiocManager_1.30.27      cli_3.6.5               
## [109] xtable_1.8-4             Rdpack_2.6.4             jquerylib_0.1.4         
## [112] dichromat_2.0-0.1        Rcpp_1.1.1               GenomeInfoDb_1.47.2     
## [115] dbplyr_2.5.1             coda_0.19-4.1            png_0.1-8               
## [118] XML_3.99-0.20            parallel_4.6.0           ggplot2_4.0.1           
## [121] blob_1.2.4               prettyunits_1.2.0        bitops_1.0-9            
## [124] lme4_1.1-38              txdbmaker_1.7.3          mvtnorm_1.3-3           
## [127] scales_1.4.0             purrr_1.2.1              crayon_1.5.3            
## [130] rlang_1.1.7              ashr_2.2-63              KEGGREST_1.51.1         
## [133] multcomp_1.4-29

DOTSeq: Detecting Differential ORF Usage in Ribosome Profiling Data

13 January 2026

Package