The immReferent package provides a centralized and
easy-to-use interface for downloading, managing, and loading immune
repertoire and HLA reference sequences from the IMGT, IPD-IMGT/HLA and
OGRDB databases. Its primary goal is to ensure that analyses are based
on consistent, up-to-date, and correctly formatted reference data.
This vignette will walk you through the basic functionality of the package.
Or via Bioconductor (once accepted)
The main function for all data retrieval is getIMGT().
It handles both downloading new data from the source and loading
previously downloaded data from a local cache.
The IPD-IMGT/HLA database provides reference sequences for the Human
Leukocyte Antigen (HLA) system. You can download the complete set of
nucleotide or protein sequences using gene = "HLA".
# Download all available HLA protein sequences
# This will download the file to the cache on the first run
hla_prot <- getIMGT(gene = "HLA",
type = "PROT")
# Inspect the result
print(hla_prot)## AAStringSet object of length 43796:
## width seq names
## [1] 365 MAVMAPRTLLLLLSGALALTQ...TQAASSDSAQGSDVSLTACKV HLA:HLA00001 A*01...
## [2] 200 MAVMAPRTLLLLLSGALALTQ...GGARGTGLTAGSGPGSHTIQX HLA:HLA02169 A*01...
## [3] 365 MAVMAPRTLLLLLSGALALTQ...TQAASSDSAQGSDVSLTACKV HLA:HLA14798 A*01...
## [4] 365 MAVMAPRTLLLLLSGALALTQ...TQAASSDSAQGSDVSLTACKV HLA:HLA15760 A*01...
## [5] 365 MAVMAPRTLLLLLSGALALTQ...TQAASSDSAQGSDVSLTACKV HLA:HLA16415 A*01...
## ... ... ...
## [43792] 704 MRLPDLRPWTSLLLVDAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38075 TAP2...
## [43793] 704 MRLPDLRPWTSLLLVDAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38025 TAP2...
## [43794] 704 MRLPDLRPWTSLLLADAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38029 TAP2...
## [43795] 704 MRLPDLRPWTSLLLVDAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38159 TAP2...
## [43796] 704 MRLPDLRPWTSLLLVDAALLW...AQLQEGQDLYSRLVQQRLMDX HLA:HLA38463 TAP2...
## Number of sequences: 43796
## First sequence name: HLA:HLA00001 A*01:01:01:01 365 bp
For T-cell receptor (TCR) and B-cell receptor (BCR) genes, you can specify the species and the gene or gene family you are interested in.
# Download human IGHV nucleotide sequences
ighv_nuc <- getIMGT(species = "human",
gene = "IGHV",
type = "NUC")
# Inspect the result
print(ighv_nuc)## DNAStringSet object of length 469:
## width seq names
## [1] 320 CAGGTTCAGCTGGTGCAGTCTG...CCGTGTATTACTGTGCGAGAGA M99641|IGHV1-18*0...
## [2] 300 CAGGTTCAGCTGGTGCAGTCTG...CCTAAGATCTGACGACACGGCC X60503|IGHV1-18*0...
## [3] 320 CAGGTTCAGCTGGTGCAGTCTG...CCGTGTATTACTGTGCGAGAGA HM855463|IGHV1-18...
## [4] 320 CAGGTTCAGCTGGTGCAGTCTG...CCGTGTATTACTGTGCGAGAGA KC713938|IGHV1-18...
## [5] 320 CAGGTGCAGCTGGTGCAGTCTG...TCGTGTATTACTGTGCGAGAGA X07448|IGHV1-2*01...
## ... ... ...
## [465] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA AB019438|IGHV8-51...
## [466] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA BK063799|IGHV8-51...
## [467] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IMGT000055|IGHV8-...
## [468] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA AC279961|IGHV8-51...
## [469] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA BK068299|IGHV8-51...
You can also download entire families of genes at once by specifying
a group name like "IGH", "IGK",
"TRB", etc.
# Download all mouse TRB genes (V, D, J, and C)
trb_mouse <- getIMGT(species = "mouse",
gene = "TRB",
type = "NUC")
# This object will contain TRBV, TRBD, TRBJ, and TRBC sequences
print(trb_mouse)## DNAStringSet object of length 76:
## width seq names
## [1] 326 GTGACTTTGCTGGAGCAAAACCC...TGTACTGCACCTGCAGTGCAGA AE000663|TRBV1*01...
## [2] 324 GTGACTTTGCTGGAGCAAAACCC...CTTGTACTGCACCTGCAGTGCG X01642|TRBV1*02|M...
## [3] 325 GATGGTGGAATCACCCAGACACC...TTCTGGGCCAGCAGTGAACAAA X16694|TRBV10*01|...
## [4] 324 GATTCTGGGGTTGTCCAGTCTCC...GTACTTCTGTGCCAGCTCTCTC M15614|TRBV12-1*0...
## [5] 321 GATTCTGGGGTTGTCCAGTCTCC...TATGTACTTCTGTGCCAGCTCT M30881|TRBV12-1*0...
## ... ... ...
## [72] 47 CTCCTATGAACAGTACTTCGGTCCCGGCACCAGGCTCACGGTTTTAG K02802|TRBJ2-7*01...
## [73] 47 CTCCTATGAACAGTACTTCGGTCCCGGCACTAGGCTCACGGTTTTAG M16122|TRBJ2-7*02...
## [74] 591 NGATTGCGGCTTTCCTATGCAAG...TATGGTCAAAAGAAAGAATTCA IMGT000132|TRBC1*...
## [75] 519 NAGGATCTGAGAAATGTGACTCC...CATGGTCAAGAAAAAAAATTCC M26057+M26058+M26...
## [76] 519 NAGGATCTGAGAAATGTGACTCC...CATGGTCAAGAAAAAAAATTCC AE000665|TRBC2*03...
OGRDB provides AIRR-compliant germline sets for immunoglobulin loci (and growing coverage more broadly). You can retrieve:
DNAStringSet (and optionally translate to
AAStringSet).# Human IGH nucleotide sequences (gapped FASTA)
igh_ogrdb <- getOGRDB(
species = "human",
locus = "IGH",
type = "NUC",
format = "FASTA_GAPPED"
)
igh_ogrdb## DNAStringSet object of length 236:
## width seq names
## [1] 17 GGTACAACTGGAACGAC IGHD1-1*01
## [2] 17 GGTATAACCGGAACCAC IGHD1-14*01
## [3] 17 GGTATAACTGGAACGAC IGHD1-20*01
## [4] 20 GGTATAGTGGGAGCTACTAC IGHD1-26*01
## [5] 17 GGTATAACTGGAACTAC IGHD1-7*01
## ... ... ...
## [232] 320 CAGGTGCAGCTGGTGCAGTCTG...CCATGTATTACTGTGCGAGATA IGHV7-81*01
## [233] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IGHV8-51-1*02
## [234] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IGHV8-51-1*03
## [235] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IGHV8-51-1*04
## [236] 320 GAGGCCCAGCTTACAGAGTCTG...CAGCATTTAACTGTGCAGGAAA IGHV8-51-1*05
Pulling the AIRR-formatted sequences:
# Human IGK sequences via AIRR JSON (parsed to DNAStringSet)
igk_airr <- getOGRDB(
species = "human",
locus = "IGK",
type = "NUC",
format = "AIRR"
)
igk_airr## DNAStringSet object of length 71:
## width seq names
## [1] 764 ATGAGGCTCCCTGCTCAGCTCCT...GGTACAGCCCTGAACAGAAACC IGKV2D-40*01
## [2] 499 ATGAGGGTCCCCGCTCAGCTCCT...GTTACACACCCGAACAAAAACC IGKV1-NL1*01
## [3] 582 ATGTCGCCATCACAACTCATTGG...GTTACAACCCAGAACAAAAACT IGKV6D-21*02
## [4] 833 ATGAGGCTCCTTGCTCAGCTTCT...GGTACAGCTCTGAACAAAAACC IGKV2-24*02
## [5] 499 ATGAGGGTCCCCGCTCAGCTCCT...GTTACCAACCCGAACATAAACC IGKV1-12*01
## ... ... ...
## [67] 77 GATTTTTGTTAAGGGGAAAGTAA...GGGACACGACTGGAGATTAAAC IGKJ5*01
## [68] 77 GGTTTCTGTTCAGCAAGACAATG...GGGACCAAGGTGGAAATCAAAC IGKJ1*01
## [69] 78 AGTTTTTGTATAGGAGGGAAGTT...GGGACCAAGCTGGAGATCAAAC IGKJ2*04
## [70] 77 GGTTTTTGTTGAGGGGAAAGGGT...GGGACCAAGGTGGAGATCAAAC IGKJ4*01
## [71] 76 GGTTTTTGTAAAGGGAAAAGTTA...GGGACCAAAGTGGATATCAAAC IGKJ3*01
immReferent automatically caches all downloaded data to
avoid repeated downloads and to enable offline work.
You can see all the files currently in your cache using the
listIMGT() or listOGRDB() function.
## [1] "/github/home/.immReferent/Human/ogrdb/Human_IGH_VDJ_published_gapped.fasta"
## [2] "/github/home/.immReferent/Human/ogrdb/Human_IGKappa_VJ_published_airr.json"
## [3] "/github/home/.immReferent/human/hla/hla_prot.fasta"
## [4] "/github/home/.immReferent/human/vdj/imgt_human_IGHV.fasta"
## [5] "/github/home/.immReferent/immReferent_log.yaml"
## [6] "/github/home/.immReferent/mouse/constant/imgt_mouse_TRBC.fasta"
## [7] "/github/home/.immReferent/mouse/vdj/imgt_mouse_TRBD.fasta"
## [8] "/github/home/.immReferent/mouse/vdj/imgt_mouse_TRBJ.fasta"
## [9] "/github/home/.immReferent/mouse/vdj/imgt_mouse_TRBV.fasta"
## character(0)
When you call getIMGT(), it will always load data from
the cache if it’s available. If you want to only load from the
cache and prevent any possibility of a download, you can use
loadIMGT(). This function is useful in offline environments
or for ensuring strict reproducibility.
# This will load from the cache if available, or download otherwise
ighv_nuc <- getIMGT(species = "human",
gene = "IGHV",
type = "NUC")
# This will load from the cache, or fail if not found and offline
ighv_nuc_from_cache <- loadIMGT(species = "human",
gene = "IGHV",
type = "NUC")Similar to the above, we can pull and load from OGRDB using
getOGRDB() and loadOGRDB().
# This will load from the cache if available, or download otherwise
igh_nuc <- getOGRDB(species = "human",
locus = "IGH",
type = "NUC",
format = "FASTA_GAPPED")
# This will load from the cache, or fail if not found and offline
igh_from_cache <- loadOGRDB(species = "human",
locus = "IGH",
type = "NUC",
format = "FASTA_GAPPED")If you suspect the online data has been updated and you want to
re-download it, you can use refreshIMGT() or
refreshOGRDB(). This is just a convenient shortcut for
getIMGT(..., refresh = TRUE).
immReferent provides export functions to create
reference databases compatible with popular immune repertoire analysis
tools. These functions format the downloaded sequences appropriately for
each tool.
MiXCR is a popular tool for
analyzing immune repertoire sequencing data. The
exportMiXCR() function creates FASTA files formatted for
MiXCR’s buildLibrary command.
# Download human IGH sequences
igh_seqs <- getIMGT(species = "human",
gene = "IGH",
type = "NUC",
suppressMessages = TRUE)
# Export to MiXCR format
mixcr_dir <- tempdir()
mixcr_files <- exportMiXCR(igh_seqs, mixcr_dir, chain = "IGH")
# View created files
print(mixcr_files)## list()
# View first few lines of V gene file
if (!is.null(mixcr_files$v_genes)) {
cat(head(readLines(mixcr_files$v_genes), 4), sep = "\n")
}The output files can be used with MiXCR’s buildLibrary
command:
TRUST4 is a tool
for reconstructing immune repertoires from RNA-seq data. The
exportTRUST4() function creates a FASTA file compatible
with TRUST4’s --ref parameter.
# Export to TRUST4 format (includes constant regions by default)
trust4_file <- tempfile(fileext = ".fa")
exportTRUST4(igh_seqs, trust4_file)
# View header format
cat(head(readLines(trust4_file), 6), sep = "\n")## >M99641
## CAGGTTCAGCTGGTGCAGTCTGGAGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGG
## TTACACCTTT............ACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGG
## GATGGATCAGCGCTTAC......AATGGTAACACAAACTATGCACAGAAGCTCCAG...GGCAGAGTCACCATGACCACA
## GACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA
## >X60503
Use the exported file with TRUST4:
Cell
Ranger from 10x Genomics includes VDJ analysis capabilities. The
exportCellRanger() function creates a FASTA file suitable
for building a custom VDJ reference.
# Export V genes for Cell Ranger
cellranger_file <- tempfile(fileext = ".fa")
exportCellRanger(igh_seqs, cellranger_file)
# View header format
cat(head(readLines(cellranger_file), 4), sep = "\n")## >M99641
## CAGGTTCAGCTGGTGCAGTCTGGAGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGG
## TTACACCTTT............ACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGG
## GATGGATCAGCGCTTAC......AATGGTAACACAAACTATGCACAGAAGCTCCAG...GGCAGAGTCACCATGACCACA
Note: For a complete Cell Ranger VDJ reference, you also need a GTF
file with gene annotations and should use
cellranger mkvdjref to build the reference.
IgBLAST is NCBI’s tool
for analyzing immunoglobulin and T cell receptor sequences. The
exportIgBLAST() function creates FASTA files with
simplified headers compatible with IgBLAST.
# Export to IgBLAST format
igblast_dir <- tempdir()
igblast_files <- exportIgBLAST(igh_seqs, igblast_dir,
organism = "human",
receptor_type = "ig")
# View created files
print(igblast_files)## list()
# View header format
if (!is.null(igblast_files$v_genes)) {
cat(head(readLines(igblast_files$v_genes), 4), sep = "\n")
}After exporting, create BLAST databases using
makeblastdb:
This has been a general overview of the capabilities of immReferent for downloading and caching immune receptor and HLA sequences from IMGT and OGRDB, as well as exporting them to formats compatible with popular analysis tools. If you have any questions, comments, or suggestions, feel free to visit the GitHub repository.
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] immReferent_0.99.6 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_2.0.0 selectr_0.5-1 compiler_4.5.2
## [4] BiocManager_1.30.27 crayon_1.5.3 stringr_1.6.0
## [7] xml2_1.5.1 Biostrings_2.79.3 jquerylib_0.1.4
## [10] IRanges_2.45.0 Seqinfo_1.1.0 yaml_2.3.12
## [13] fastmap_1.2.0 R6_2.6.1 XVector_0.51.0
## [16] generics_0.1.4 curl_7.0.0 knitr_1.51
## [19] BiocGenerics_0.57.0 tibble_3.3.0 maketools_1.3.2
## [22] bslib_0.9.0 pillar_1.11.1 rlang_1.1.6
## [25] stringi_1.8.7 cachem_1.1.0 xfun_0.55
## [28] sass_0.4.10 sys_3.4.3 cli_3.6.5
## [31] magrittr_2.0.4 digest_0.6.39 rvest_1.0.5
## [34] lifecycle_1.0.4 S4Vectors_0.49.0 vctrs_0.6.5
## [37] evaluate_1.0.5 glue_1.8.0 buildtools_1.0.0
## [40] stats4_4.5.2 rmarkdown_2.30 httr_1.4.7
## [43] tools_4.5.2 pkgconfig_2.0.3 htmltools_0.5.9