DOTSeq 0.99.4
DOTSeq is an R package for detecting differential translation of open
reading frames (ORFs) from ribosome profiling (Ribo-seq)1,2 and matched RNA-seq datasets.3 Translation is a tightly regulated process,
where multiple ORFs within the same gene can modulate protein synthesis.
While useful, traditional gene-level approaches (e.g.,
annota2seq
and
terapadog
available at Bioconductor4,5)
may overlook these regulatory layers. DOTSeq addresses this limitation by
modelling translation at the ORF level.
DOTSeq provides two complementary modules:
Differential ORF Usage (DOU): Detects cis-translational regulation by testing whether the relative usage of ORFs within a gene changes across biological conditions. Implemented via a beta-binomial generalised linear model (GLM) (glmmTMB).6
Differential Translation Efficiency (DTE): Tests for changes in ribosome occupancy relative to RNA abundance across conditions using a negative-binomial GLM (DESeq2).7
Together, DOU and DTE modules allow global identification of translation-specific effects, revealing regulatory events such as uORF-mediated repression.
We start by loading the DOTSeq package and its dependencies.
DOTSeq builds on the Bioconductor ecosystem, e.g.,
rtracklayer
and
txdbmaker
for parsing GTF files, and the
SummarizedExperiment
container for storing count matrices and metadata.8–11
library(DOTSeq)
library(SummarizedExperiment, quietly = TRUE)
We demonstrate DOTSeq functionalities using a HeLa cell cycle dataset.12 This dataset includes the snapshots of translation
landscapes in HeLa cells synchronised at Mitotic Cycling, Mitotic Arrest,
and Interphase. Each condition contains two biological replicates.
Here, we use only a small portion of the dataset to speed up the vignette execution while still demonstrating all key functionalities.
dir <- system.file("extdata", package = "DOTSeq")
list.files(dir)
## [1] "featureCounts.cell_cycle_subset.txt.gz"
## [2] "gencode.v47.orf_flattened_subset.bed.gz"
## [3] "gencode.v47.orf_flattened_subset.gtf.gz"
## [4] "metadata.txt.gz"
Ribo-seq and RNA-seq reads are preprocessed using standard pipelines. Reads were adapter-trimmed using Cutadapt,13 aligned to the reference genome using STAR,14 and summarised at the ORF level using featureCounts.15
Each row of the input count matrix corresponds to an ORF, while columns contain read counts for individual samples. The first six columns represent ORF-level annotation fields such as gene ID, transcript ID, chromosome, and coordinates. This ensures accurate mapping between read counts and annotated ORFs, which is critical for our analysis.
cnt <- read.table(
file.path(dir, "featureCounts.cell_cycle_subset.txt.gz"),
header = TRUE,
comment.char = "#"
)
names(cnt) <- gsub(".*(SRR[0-9]+).*", "\\1", names(cnt))
head(cnt)
## Geneid Chr Start End Strand Length SRR24230465 SRR24230467
## 1 ENSG00000003056.8 chr12 8940362 8940426 - 65 0 1
## 2 ENSG00000003056.8 chr12 8940428 8940634 - 207 0 1
## 3 ENSG00000003056.8 chr12 8940636 8940698 - 63 0 0
## 4 ENSG00000003056.8 chr12 8941261 8941269 - 9 0 0
## 5 ENSG00000003056.8 chr12 8941316 8941360 - 45 0 0
## 6 ENSG00000003056.8 chr12 8941369 8941392 - 24 0 0
## SRR24230469 SRR24230471 SRR24230480 SRR24230482 SRR24230462 SRR24230466
## 1 0 0 0 0 4 15
## 2 0 0 0 0 2 1
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 1 2
## 6 0 0 0 0 5 5
## SRR24230472 SRR24230474 SRR24230477 SRR24230479
## 1 7 6 5 40
## 2 0 0 3 10
## 3 0 0 0 0
## 4 1 1 0 0
## 5 2 0 1 4
## 6 2 7 4 30
Flattened ORF-level annotations were generated using the orf_to_gtf.py
wrapper for RIBOSS.16 Alternatively, annotations can be
prepared within R using getORFs()
(see Prepare annotations). This pre-processing step
ensures that each ORF occupies a distinct, non-overlapping genomic interval,
avoiding ambiguity when assigning reads. This design follows the same
logic as DEXSeq’s exon flattening, but at the level of ORFs rather
than exons.17
gtf <- file.path(dir, "gencode.v47.orf_flattened_subset.gtf.gz")
bed <- file.path(dir, "gencode.v47.orf_flattened_subset.bed.gz")
The condition table defines the experimental design for both Ribo-seq and
RNA-seq samples. It allows DOTSeq to match Ribo/RNA-seq pairs and to
model translation-specific interactions.
Each row corresponds to one sequencing run. The table must include four essential columns: - run: unique identifier for each sequencing run
strategy: either "ribo" or "rna" (or "0" or "1"), indicating the
sequencing strategy
replicate: biological replicate identifier
condition: experimental condition (e.g., control, treatment)
Additional columns (e.g., batch) can also be included. In the example below, we load the metadata and restrict the analysis to samples pre-treated with cycloheximide (chx) prior to library preparation:
cond <- read.table(file.path(dir, "metadata.txt.gz"))
# Ensure required column names for DOTSeq
names(cond) <- c("run", "strategy", "replicate", "treatment", "condition")
# Filter to include only chx-treated samples
cond <- cond[cond$treatment == "chx", ]
head(cond)
## run strategy replicate treatment condition
## 1 SRR24230462 rna 2 chx Mitotic_Cycling
## 4 SRR24230465 ribo 2 chx Mitotic_Cycling
## 5 SRR24230466 rna 1 chx Mitotic_Cycling
## 6 SRR24230467 ribo 1 chx Mitotic_Cycling
## 8 SRR24230469 ribo 2 chx Mitotic_Arrest
## 10 SRR24230471 ribo 1 chx Mitotic_Arrest
The main function DOTSeq() performs the differential analysis workflow,
starting from parsing input files to post hoc inference. It sequentially
runs three key steps:
Construction of two SummarizedExperiment objects (DOTSeqDataSet()).
model fitting for DOU via beta-binomial GLM (fitDOU() using glmmTMB)
and DTE via negative-binomial GLM (DESeq2).
Post hoc inference using emmeans18 for contrasts and ashr19 for adaptive shrinkage of effect sizes (testDOU()).
While each step can be run individually, we construct DOTSeqDataSets with
DOTSeqDataSetsFromFeatureCounts(), followed by the DOTSeq() main
function here for simplicity. To capture the translation-specific
interaction effect (\(\beta_{\text{int}}\)), DOTSeq() uses
formula = ~ condition * strategy with dispersion_modeling = "auto" by
default. The design and dispersion formulas can be customised to include
batch and random effects.
# Create a DOTSeqDataSets object
datasets <- DOTSeqDataSetsFromFeatureCounts(
count_table = cnt,
condition_table = cond,
flattened_gtf = gtf,
flattened_bed = bed,
verbose = FALSE
)
# Run the DOTSeq workflow
d <- DOTSeq(datasets = datasets)
## starting Differential ORF Usage (DOU) analysis
## starting post hoc analysis
## effect sizes for interaction_specific contrasts calculated and shrunk succesfully
## effect sizes for strategy_specific contrasts calculated and shrunk successfully
## DOU runtime: 3 mins 44.655 secs
## DOU model fitting summary: Mitotic_Arrest - Interphase
## models fitted: 695
## non-convergence or invalid Hessian: 1434
## not fitted (filtered out): 4513
## DOU model fitting summary: Mitotic_Cycling - Interphase
## models fitted: 695
## non-convergence or invalid Hessian: 1434
## not fitted (filtered out): 4513
## DOU model fitting summary: Mitotic_Cycling - Mitotic_Arrest
## models fitted: 695
## non-convergence or invalid Hessian: 1434
## not fitted (filtered out): 4513
## starting Differential Translation Efficiency (DTE) analysis
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## starting post hoc analysis
## DTE runtime: 8.446 secs
## DTE model fitting summary: Mitotic_Cycling - Mitotic_Arrest
## models fitted: 1137
## padj = NA (filtered or flagged): 5505
## DTE model fitting summary: Mitotic_Arrest - Interphase
## models fitted: 1941
## padj = NA (filtered or flagged): 4701
## DTE model fitting summary: Mitotic_Cycling - Interphase
## models fitted: 1712
## padj = NA (filtered or flagged): 4930
In this case Interphase was automatically assigned as the baseline. The
target and baseline conditions can explicitly defined as
target = "Mitotic_Cycling" and baseline = "Interphase".
The output is a DOTSeqDataSets object that encapsulates the post hoc
contrast results for DOU and DTE analyses.
show(d)
## DOTSeqDataSets
## DOU: DOUData
## DTE: DTEData
## Use getDOU() or getDTE() to access contents.
## Use getContrasts() if post hoc inference has been performed.
As indicated, we will use getContrasts() to extract the results.
results <- getContrasts(d, type = "interaction")
results
## $DOU
## DataFrame with 2085 rows and 10 columns
## betahat sebetahat waldpval waldpadj posterior qvalue
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 -0.315677 0.291932 2.79546e-01 0.339763478 -0.046734 0.21887806
## 2 1.868553 0.457288 4.38573e-05 0.000104955 1.837607 0.00012196
## 3 0.896892 0.305651 3.34228e-03 0.005784391 0.791524 0.01020028
## 4 -1.420373 0.538238 8.31677e-03 0.013632323 -1.202033 0.01277786
## 5 -1.303596 0.440822 3.10452e-03 0.005399848 -1.199171 0.00661518
## ... ... ... ... ... ... ...
## 2081 -21.21978 1.86395e+04 0.99909167 0.9999930 -1.07384e-08 0.4883262
## 2082 -1.38197 9.02272e-01 0.12560732 0.3046531 -1.91868e-01 0.3665421
## 2083 2.09437 7.46692e-01 0.00503382 0.0296057 8.40108e-01 0.0853905
## 2084 -1.09952 4.58713e-01 0.01653135 0.0755777 -5.26534e-01 0.1287686
## 2085 1.04066 3.59667e-01 0.00381093 0.0244888 7.78991e-01 0.0488665
## lfdr lfsr orf_id contrast
## <numeric> <numeric> <character> <Rle>
## 1 0.85195633 0.87264884 ENSG00000003056.8:O008 Mitotic_Arrest - Int..
## 2 0.00160116 0.00162365 ENSG00000003056.8:O013 Mitotic_Arrest - Int..
## 3 0.11748084 0.11895566 ENSG00000004478.8:O001 Mitotic_Arrest - Int..
## 4 0.14775248 0.15132045 ENSG00000004478.8:O002 Mitotic_Arrest - Int..
## 5 0.07948470 0.08091441 ENSG00000004478.8:O005 Mitotic_Arrest - Int..
## ... ... ... ... ...
## 2081 0.832061 0.916030 ENSG00000133872.14:O.. Mitotic_Cycling - Mi..
## 2082 0.784013 0.804548 ENSG00000133872.14:O.. Mitotic_Cycling - Mi..
## 2083 0.356171 0.361086 ENSG00000133872.14:O.. Mitotic_Cycling - Mi..
## 2084 0.484493 0.489162 ENSG00000134153.10:O.. Mitotic_Cycling - Mi..
## 2085 0.232850 0.234358 ENSG00000134153.10:O.. Mitotic_Cycling - Mi..
##
## $DTE
## log2 fold change (MMSE): 0,0,0,0,-1,+1
## Wald test p-value: 0,0,0,0,-1,+1
## DataFrame with 19926 rows and 7 columns
## baseMean log2FoldChange lfcSE pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 6.612288 0.0517657 0.782074 0.819230 NA
## 2 1.566572 -0.0695095 0.829717 0.638754 NA
## 3 0.000000 0.0000000 0.854322 NA NA
## 4 0.224656 0.0000000 0.843721 1.000000 NA
## 5 0.967834 0.0000000 0.833729 1.000000 NA
## ... ... ... ... ... ...
## 19922 11.4769 -0.19809388 0.891331 7.36964e-01 8.91883e-01
## 19923 10.0378 0.00327321 1.050203 9.94842e-01 9.95424e-01
## 19924 30.6207 -1.41437812 0.723772 1.37107e-02 4.97302e-02
## 19925 10.2243 -0.13656856 1.028994 7.95281e-01 9.17725e-01
## 19926 1918.8799 1.42139618 0.194002 1.07081e-13 2.90989e-12
## orf_id contrast
## <character> <character>
## 1 ENSG00000003056.8:O001 Mitotic_Cycling - Mi..
## 2 ENSG00000003056.8:O002 Mitotic_Cycling - Mi..
## 3 ENSG00000003056.8:O003 Mitotic_Cycling - Mi..
## 4 ENSG00000003056.8:O004 Mitotic_Cycling - Mi..
## 5 ENSG00000003056.8:O005 Mitotic_Cycling - Mi..
## ... ... ...
## 19922 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19923 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19924 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19925 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19926 ENSG00000134153.10:O.. Mitotic_Cycling - In..
These data frames summarise differential translation results for each ORF. Key statistics include:
posterior: The shrunken log-odds fold change in ORF usage (DOU),
reflecting the estimated effect size after empirical Bayes shrinkage.
lfsr (Local False Sign Rate): The probability that the estimated effect
has the wrong sign, providing a measure of directional uncertainty.
log2FoldChange: The shrunken log2 fold change in ORF translation
efficiency (DTE), indicating relative changes in ribosome loading.
padj: The adjusted p-value for DTE, accounting for multiple testing.
The DTE effect size (log2FoldChange) was derived from
DESeq2::lfcShrink() within the DOTSeq() main function. We specify
type = ashr to apply adaptive shrinkage and compute all pairwise
contrasts between conditions.
Additional statistics for DOU results include:
betahat: The raw (unshrunken) effect size estimate.
sebetahat: The standard error of the raw estimate.
qvalue and lfdr (Local False Discovery Rate): Alternative measures
of significance and false discovery, also computed by ashr.
waldpval and waldpadj: Wald test results are included for completeness,
providing frequentist estimates of significance.
We can also inspect the rowData and rowRanges of the
SummarizedExperiment objects to explore ORF level annotation. This
includes mappings from gene to transcript to ORF IDs, as well as orf_type,
which currently supports uORF, mORF, and dORF. Filtering criteria
count_filter and singlet_filter determine whether an ORF is retained
(is_kept) for model fitting. DOUResults stores the model fit
results and the emmeans objects for post hoc contrasts.
rowData(getDOU(d))
## DataFrame with 2129 rows and 7 columns
## gene_id orf_number orf_type count_filter
## <character> <character> <character> <logical>
## ENSG00000003056.8:O007 ENSG00000003056.8 007 dORF TRUE
## ENSG00000003056.8:O008 ENSG00000003056.8 008 dORF TRUE
## ENSG00000003056.8:O009 ENSG00000003056.8 009 mORF TRUE
## ENSG00000003056.8:O013 ENSG00000003056.8 013 uORF TRUE
## ENSG00000004478.8:O001 ENSG00000004478.8 001 mORF TRUE
## ... ... ... ... ...
## ENSG00000134153.10:O001 ENSG00000134153.10 001 dORF TRUE
## ENSG00000134153.10:O002 ENSG00000134153.10 002 dORF TRUE
## ENSG00000134153.10:O003 ENSG00000134153.10 003 dORF TRUE
## ENSG00000134153.10:O004 ENSG00000134153.10 004 dORF TRUE
## ENSG00000134153.10:O005 ENSG00000134153.10 005 mORF TRUE
## singlet_filter is_kept DOUResults
## <logical> <logical> <list>
## ENSG00000003056.8:O007 TRUE TRUE <PostHoc>
## ENSG00000003056.8:O008 TRUE TRUE <PostHoc>
## ENSG00000003056.8:O009 TRUE TRUE <PostHoc>
## ENSG00000003056.8:O013 TRUE TRUE <PostHoc>
## ENSG00000004478.8:O001 TRUE TRUE <PostHoc>
## ... ... ... ...
## ENSG00000134153.10:O001 TRUE TRUE <PostHoc>
## ENSG00000134153.10:O002 TRUE TRUE <PostHoc>
## ENSG00000134153.10:O003 TRUE TRUE <PostHoc>
## ENSG00000134153.10:O004 TRUE TRUE <PostHoc>
## ENSG00000134153.10:O005 TRUE TRUE <PostHoc>
The rowRanges slot retains the genomic coordinates corresponding to ORFs,
extending from the translational start (AUG) to stop codons. These ranges
may include intronic regions, depending on the gene structures.
rowRanges(getDOU(d))
## GRanges object with 2129 ranges and 7 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## ENSG00000003056.8:O007 chr12 8941510-8941518 - |
## ENSG00000003056.8:O008 chr12 8941590-8941805 - |
## ENSG00000003056.8:O009 chr12 8941818-8946404 - |
## ENSG00000003056.8:O013 chr12 8949621-8949713 - |
## ENSG00000004478.8:O001 chr12 2795140-2803258 + |
## ... ... ... ... .
## ENSG00000134153.10:O001 chr15 34084030-34084071 - |
## ENSG00000134153.10:O002 chr15 34084109-34084123 - |
## ENSG00000134153.10:O003 chr15 34084135-34084188 - |
## ENSG00000134153.10:O004 chr15 34084192-34084206 - |
## ENSG00000134153.10:O005 chr15 34084334-34101839 - |
## gene_id orf_number orf_type
## <character> <character> <character>
## ENSG00000003056.8:O007 ENSG00000003056.8 007 dORF
## ENSG00000003056.8:O008 ENSG00000003056.8 008 dORF
## ENSG00000003056.8:O009 ENSG00000003056.8 009 mORF
## ENSG00000003056.8:O013 ENSG00000003056.8 013 uORF
## ENSG00000004478.8:O001 ENSG00000004478.8 001 mORF
## ... ... ... ...
## ENSG00000134153.10:O001 ENSG00000134153.10 001 dORF
## ENSG00000134153.10:O002 ENSG00000134153.10 002 dORF
## ENSG00000134153.10:O003 ENSG00000134153.10 003 dORF
## ENSG00000134153.10:O004 ENSG00000134153.10 004 dORF
## ENSG00000134153.10:O005 ENSG00000134153.10 005 mORF
## count_filter singlet_filter is_kept DOUResults
## <logical> <logical> <logical> <list>
## ENSG00000003056.8:O007 TRUE TRUE TRUE <PostHoc>
## ENSG00000003056.8:O008 TRUE TRUE TRUE <PostHoc>
## ENSG00000003056.8:O009 TRUE TRUE TRUE <PostHoc>
## ENSG00000003056.8:O013 TRUE TRUE TRUE <PostHoc>
## ENSG00000004478.8:O001 TRUE TRUE TRUE <PostHoc>
## ... ... ... ... ...
## ENSG00000134153.10:O001 TRUE TRUE TRUE <PostHoc>
## ENSG00000134153.10:O002 TRUE TRUE TRUE <PostHoc>
## ENSG00000134153.10:O003 TRUE TRUE TRUE <PostHoc>
## ENSG00000134153.10:O004 TRUE TRUE TRUE <PostHoc>
## ENSG00000134153.10:O005 TRUE TRUE TRUE <PostHoc>
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
DOTSeq provides the plotDOT() function for visualising DOU and DTE
results. The following plots can be generated:
ou <- results$DOU[results$DOU$contrast == "Mitotic_Cycling - Interphase", ]
te <- results$DTE[results$DTE$contrast == "Mitotic_Cycling - Interphase", ]
results <- merge(ou, te, by = c("orf_id", "contrast"), all = TRUE)
To explore the relationship between DOU and DTE, we can visualise the overlap between significant ORFs identified by each module. ORFs that are significant in DOU and DTE analyses suggest changes in both usage and translation efficiency, potentially indicating strong translational control.
plotDOT(plot_type = "venn", results = results, force_new_device = FALSE)
## Venn diagram plotted
To compare DOU and DTE, we generate a composite scatter plot of their respective effect sizes. Marginal density plots along each axis illustrate the distribution of effect sizes for each metric. Spearman correlation is calculated and reported.
For this scatter plot, points are colour-coded by significance,
allowing us to identify ORFs with strong effects in one analysis but
weak in the other. This visualisation helps highlight patterns of
differential translation across ORFs.
plotDOT(
plot_type = "composite",
results = results,
plot_params = list(color_by = "significance", legend_position = "bottomright"),
force_new_device = FALSE
)
## Spearman correlation between DOU and DTE estimates: 0.823, p-value: <2e-16
## composite plot colored by significance plotted
If DOUData is given as an input, points can also be colour-coded by
orf_type. This enables quick identification of patterns in differential
translation across different ORF types.
plotDOT(
plot_type = "composite",
results = results,
data = getDOU(d),
plot_params = list(color_by = "orf_type", legend_position = "bottomright"),
force_new_device = FALSE
)
## Spearman correlation between DOU and DTE estimates: 0.823, p-value: <2e-16
## composite plot colored by orf_type plotted
To highlight ORFs with strong differential usage, we construct a volcano plot comparing significance versus DOU effect size.
By default, gene identifiers are mapped from Ensembl gene IDs to HGNC
gene symbols using biomaRt This mapping improves plot readability by
displaying gene symbols. The resulting ID mapping dataframe can be
stored as a dataframe and reused across plots to avoid repeated downloads.
mapping <- plotDOT(
plot_type = "volcano",
results = results,
id_mapping = TRUE,
plot_params = list(color_by = "significance", top_hits = 3, legend_position = "topright"),
force_new_device = FALSE
)
## retrieving gene annotation from BioMart
## Ensembl site unresponsive, trying useast mirror
## retrieved 275 gene symbols
## capping volcano plot y-axis at -log10(lfsr) = 20
## volcano plot colored by significance plotted
Similarly, volcano plots can be colour-coded by significance or ORF types, depending on whether rowdata is given as an input.
plotDOT(
plot_type = "volcano",
results = results,
data = getDOU(d),
id_mapping = mapping,
plot_params = list(color_by = "orf_type", top_hits = 3, legend_position = "topright"),
force_new_device = FALSE
)
## capping volcano plot y-axis at -log10(lfsr) = 20
## volcano plot colored by orf_type plotted
To visualise the top-ranked genes, we generate a heatmap based on DOU metrics for the top 20 uORF-regulated genes.
plotDOT(
plot_type = "heatmap",
results = results,
data = getDOU(d),
id_mapping = mapping,
plot_params = list(rank_by = "score", top_hits = 20),
force_new_device = FALSE
)
## plotting heatmap for the top 20 DOU genes
## heatmap plotted
We can visualise the estimated ORF usage within a gene as inferred by the DOU module. This plot combines the probabilistic estimates of ORF usage with post hoc statistical significance annotations, allowing interpretation of DOU across conditions. If the ID mapping dataframe provided, the gene symbol is displayed as the plot title.
orderby <- c("Mitotic_Cycling", "Mitotic_Arrest", "Interphase")
id <- "TAF2"
plotDOT(
plot_type = "usage",
data = getDOU(d),
gene_id = id,
id_mapping = mapping,
plot_params = list(order_by = orderby),
force_new_device = FALSE
)
## ORF usage plot for TAF2 plotted
The plotDOT() function supports querying according to the gene
IDs or gene symbols, as listed in the rowData, rowRanges and/or the
ID mapping dataframe. Stable gene IDs (e.g., ENSG00000060339) can be
used without their version suffix (e.g., .14).
id <- "ENSG00000060339"
plotDOT(
plot_type = "usage",
data = getDOU(d),
gene_id = id,
id_mapping = mapping,
plot_params = list(order_by = orderby),
force_new_device = FALSE
)
## ORF usage plot for ENSG00000060339 plotted
DOTSeq offers a unified framework for modelling translational control at
the ORF level. DOTSeq has two modules capturing distinct biological
signals: DOU tests for shifts in relative usage within genes
(cis-regulation), whereas DTE tests for changes in per-ORF translation
efficiency relative to transcript abundance. They can be used together to
provide a fuller picture of translational regulation.
1. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. science 324, 218–223 (2009).
2. Ingolia, N. T., Hussmann, J. A. & Weissman, J. S. Ribosome profiling: Global views of translation. Cold Spring Harbor perspectives in biology 11, a032698 (2019).
3. Lim, C. S. & Chieng, G. S. Differential analysis of translation efficiency and usage of open reading frames using dotseq. bioRxiv 2025–09 (2025).
4. Oertlin, C. et al. Generally applicable transcriptome-wide analysis of translation using anota2seq. Nucleic acids research 47, e70–e70 (2019).
5. Carancini, G. Terapadog: Translational Efficiency Regulation Analysis Using the Padog Method. (2025). doi:10.18129/B9.bioc.terapadog.
6. Brooks, M. E. et al. GlmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. The R Journal 9, 378–400 (2017).
7. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome biology 15, 550 (2014).
8. Lawrence, M., Gentleman, R. & Carey, V. Rtracklayer: An r package for interfacing with genome browsers. Bioinformatics 25, 1841 (2009).
9. Pagès, H., Carlson, M., Aboyoun, P., Falcon, S. & Morgan, M. Txdbmaker: Tools for making txdb objects from genomic annotations. R. package version 1.6 1, (2025).
10. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS computational biology 9, e1003118 (2013).
11. Morgan, M., Obenchain, V., Hester, J. & Pagès, H. Summarized Experiment: A Container (S4 Class) for Matrix-Like Assays. (2025). doi:10.18129/B9.bioc.SummarizedExperiment.
12. Ly, J. et al. Nuclear release of eIF1 restricts start-codon selection during mitosis. Nature 635, 490–498 (2024).
13. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. journal 17, 10–12 (2011).
14. Dobin, A. et al. STAR: Ultrafast universal rna-seq aligner. Bioinformatics 29, 15–21 (2013).
15. Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).
16. Lim, C. S., Gibbon, A. K., Tran Nguyen, A. T., Chieng, G. S. & Brown, C. M. RIBOSS detects novel translational events by combining long-and short-read transcriptome and translatome profiling. Briefings in bioinformatics 26, bbaf164 (2025).
17. Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from rna-seq data. Genome research 22, 2008–2017 (2012).
18. Lenth, R. V. & Piaskowski, J. Emmeans: Estimated Marginal Means, Aka Least-Squares Means. (2025).
19. Stephens, M. False discovery rates: A new deal. Biostatistics 18, 275–294 (2017).
20. Tjeldnes, H. et al. ORFik: A comprehensive r toolkit for the analysis of translation. BMC bioinformatics 22, 336 (2021).
Besides using the orf_to_gtf.py wrapper,16
we can generate ORF-level annotation natively in R using the getORFs()
function, which is based on the
ORFik’s
engine.20 This approach creates a GRanges object.
library(withr)
falink <- "https://ftp.ensembl.org/pub/release-78/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP5.cdna.all.fa.gz"
gtflink <- "https://ftp.ensembl.org/pub/release-78/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.78.gtf.gz"
annotation <- basename(gtflink)
sequences <- basename(falink)
download.file(gtflink, destfile=annotation)
download.file(falink, destfile=sequences)
# This will generate flattened ORF-level annotation
gr <- getORFs(
sequences = sequences,
annotation = annotation,
organism = "Drosophila melanogaster"
)
withr::defer(unlink(c(annotation, sequences)))
We perform read counting using the RNA-seq BAM files from the pasillaBamSubset package. We can optionally clean up the BAM files before read counting by discarding reads mapped to introns, noncoding genes, and intergenic regions. Note that this dataset is not derived from a Ribo-seq experiment and is used solely for illustrative purposes. It should not be used for biological interpretation.
library(pasillaBamSubset)
library(GenomeInfoDb)
bam_list <- c(untreated1_chr4(), untreated3_chr4())
# Keep only reads mapped to the exons of coding genes
temp_dir <- tempdir()
getExonicReads(gr = gr, seqlevels_style = "UCSC", bam_files = bam_list, bam_output_dir = temp_dir, coding_genes_only = TRUE)
# Get the list of filtered BAM files
bam_list <- list.files(
path = temp_dir,
pattern = "*exonic.*",
full.names = TRUE
)
cnt <- countReads(gr = gr, bam_files = bam_list)
withr::defer(unlink(bam_list))
Here we make use the ORF-level annotation and create count and condition
tables as input for DOTSeqDataSetsFromSummarizeOverlaps(). This is alternative to
using DOTSeqDataSetsFromFeatureCounts() as demonstrated at the beginning.
set.seed(42)
# Create count_table
# Create two replicates for each condition with random scaling
rna_treated_reps <- do.call(cbind, replicate(2, cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.5, max = 2), simplify = FALSE))
rna_control_reps <- do.call(cbind, replicate(2, cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.1, max = 0.5), simplify = FALSE))
ribo_treated_reps <- do.call(cbind, replicate(2, cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.5, max = 2), simplify = FALSE))
ribo_control_reps <- do.call(cbind, replicate(2, cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.1, max = 0.5), simplify = FALSE))
# Combine and name columns
colnames(rna_treated_reps) <- paste0("rna_treated", 1:2)
colnames(rna_control_reps) <- paste0("rna_control", 1:2)
colnames(ribo_treated_reps) <- paste0("ribo_treated", 1:2)
colnames(ribo_control_reps) <- paste0("ribo_control", 1:2)
cnt_expanded <- cbind(
rna_treated_reps,
rna_control_reps,
ribo_treated_reps,
ribo_control_reps
)
# Convert numbers to integer
cnt_expanded <- round(cnt_expanded)
storage.mode(cnt_expanded) <- "integer"
rownames(cnt_expanded) <- rownames(cnt)
cnt_expanded <- as.data.frame(cnt_expanded)
# Create condition_table
# Sample names from cnt_expanded
sample_names <- colnames(cnt_expanded)
# Define condition and strategy for each sample
condition <- c(
rep("treated", 2),
rep("control", 2),
rep("treated", 2),
rep("control", 2)
)
strategy <- c(rep("RNA", 4), rep("Ribo", 4))
cond <- data.frame(
run = sample_names,
replicate = c(1,2),
condition = factor(condition, levels = c("control", "treated")),
strategy = factor(strategy, levels = c("RNA", "Ribo"))
)
# Create a DOTSeqDataSets object
d <- DOTSeqDataSetsFromSummarizeOverlaps(
count_table = cnt_expanded,
condition_table = cond,
annotation = gr
)
invisible(file.remove(bam_list))
invisible(file.remove(sub("\\.gz$", ".sqlite", annotation)))
DOTSeq includes the function simDOT() for generating simulated Ribo-seq
and RNA-seq count matrices. It enables benchmarking under controlled
scenarios with configurable effect sizes, batch effects, and ORF types.
Simulations in our manuscript3 show that:
The DOU module performs well at low-to-moderate effect sizes on DOU-type scenarios. This module is specifically designed to detect condition-dependent shifts in the expected proportion of Ribo-seq to RNA-seq counts for individual ORFs relative to other ORFs within the same gene (i.e., changes in ORF usage). This module targets cis-regulatory usage changes that alter relative ORF proportions within genes.
The DTE module performs better at large-magnitude changes in translation efficiency. This module captures ORF-level changes in ribosome occupancy relative to RNA abundance.
sessionInfo()
## R Under development (unstable) (2025-12-22 r89219)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.41.0 Biobase_2.71.0
## [3] GenomicRanges_1.63.1 Seqinfo_1.1.0
## [5] IRanges_2.45.0 S4Vectors_0.49.0
## [7] BiocGenerics_0.57.0 generics_0.1.4
## [9] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [11] DOTSeq_0.99.4 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] eulerr_7.0.4 RColorBrewer_1.1-3 jsonlite_2.0.0
## [4] magrittr_2.0.4 magick_2.9.0 TH.data_1.1-5
## [7] estimability_1.5.1 GenomicFeatures_1.63.1 farver_2.1.2
## [10] nloptr_2.2.1 rmarkdown_2.30 BiocIO_1.21.0
## [13] vctrs_0.6.5 memoise_2.0.1 minqa_1.2.8
## [16] Rsamtools_2.27.0 RCurl_1.98-1.17 SQUAREM_2021.1
## [19] tinytex_0.58 mixsqp_0.3-54 progress_1.2.3
## [22] htmltools_0.5.9 S4Arrays_1.11.1 curl_7.0.0
## [25] truncnorm_1.0-9 SparseArray_1.11.10 sass_0.4.10
## [28] bslib_0.9.0 sandwich_3.1-1 httr2_1.2.2
## [31] emmeans_2.0.1 zoo_1.8-15 cachem_1.1.0
## [34] TMB_1.9.19 GenomicAlignments_1.47.0 lifecycle_1.0.5
## [37] pkgconfig_2.0.3 Matrix_1.7-4 R6_2.6.1
## [40] fastmap_1.2.0 GenomeInfoDbData_1.2.15 rbibutils_2.4
## [43] digest_0.6.39 numDeriv_2016.8-1.1 AnnotationDbi_1.73.0
## [46] DESeq2_1.51.6 irlba_2.3.5.1 RSQLite_2.4.5
## [49] labeling_0.4.3 filelock_1.0.3 invgamma_1.2
## [52] polyclip_1.10-7 httr_1.4.7 abind_1.4-8
## [55] mgcv_1.9-4 compiler_4.6.0 withr_3.0.2
## [58] bit64_4.6.0-1 S7_0.2.1 BiocParallel_1.45.0
## [61] DBI_1.2.3 ggsignif_0.6.4 biomaRt_2.67.1
## [64] MASS_7.3-65 rappdirs_0.3.3 DelayedArray_0.37.0
## [67] rjson_0.2.23 tools_4.6.0 otel_0.2.0
## [70] glue_1.8.0 restfulr_0.0.16 nlme_3.1-168
## [73] polylabelr_0.3.0 grid_4.6.0 glmmTMB_1.1.13
## [76] gtable_0.3.6 BSgenome_1.79.1 hms_1.1.4
## [79] data.table_1.18.0 xml2_1.5.1 XVector_0.51.0
## [82] stringr_1.6.0 pillar_1.11.1 splines_4.6.0
## [85] dplyr_1.1.4 BiocFileCache_3.1.0 lattice_0.22-7
## [88] survival_3.8-3 rtracklayer_1.71.3 bit_4.6.0
## [91] tidyselect_1.2.1 locfit_1.5-9.12 Biostrings_2.79.4
## [94] pbapply_1.7-4 knitr_1.51 reformulas_0.4.3.1
## [97] bookdown_0.46 xfun_0.55 stringi_1.8.7
## [100] UCSC.utils_1.7.1 yaml_2.3.12 boot_1.3-32
## [103] evaluate_1.0.5 codetools_0.2-20 cigarillo_1.1.0
## [106] tibble_3.3.1 BiocManager_1.30.27 cli_3.6.5
## [109] xtable_1.8-4 Rdpack_2.6.4 jquerylib_0.1.4
## [112] dichromat_2.0-0.1 Rcpp_1.1.1 GenomeInfoDb_1.47.2
## [115] dbplyr_2.5.1 coda_0.19-4.1 png_0.1-8
## [118] XML_3.99-0.20 parallel_4.6.0 ggplot2_4.0.1
## [121] blob_1.2.4 prettyunits_1.2.0 bitops_1.0-9
## [124] lme4_1.1-38 txdbmaker_1.7.3 mvtnorm_1.3-3
## [127] scales_1.4.0 purrr_1.2.1 crayon_1.5.3
## [130] rlang_1.1.7 ashr_2.2-63 KEGGREST_1.51.1
## [133] multcomp_1.4-29