1 Setup

2 Introduction

This vignette demonstrates how to explore and visualize DMS and model scores from the ProteinGym database v1.2. Specifically, it walks through the functionality to generate heatmaps of all pairwise DMS substitution mutants and projects these scores onto 3D protein structures. ProteinGymR uses functionality from the Bioconductor package ComplexHeatmap and r3dmol from CRAN under the hood to generate the heatmaps and 3D protein structures, respectively. Finally, this vignette demonstrates how to contrast models with DMS experiment scores using correlation plots.

3 Visualize DMS scores along a protein

Explore the “ACE2_HUMAN” assay from Chan et al. 2020 and create a heatmap of the DMS scores with plot_dms_heatmap(). If the argument dms_data is not specified, the default will load the most recent DMS substitution data from ProteinGym with ProteinGymR::dms_substitutions(). This function only requires a specific assay name. To obtain all assay names, run: names(dms_substitutions()). By default, the function also plots the full range of positions where DMS data is available for this assay. To plot a specific region of interest, use the arguments start_pos() and end_pos() which takes in an integer for the first and last residue position to plot in the protein.

## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache

The heatmap shows the DMS score at each position along the given protein (x-axis) where a residue was mutated (alternate amino acid on displayed on the y-axis and the reference allele at the position is shown on top). For this demonstration, we subset to the first 1-100 positions and grouped the amino acids by their physiochemical properties (DE,KRH,NQ,ST,PGAVIL,MC,FYW). See [here][physiochem] for more information. As a note, not all positions along the protein sequence may be subjected to mutation for every DMS assay. This results from the specific research objectives, prioritization choices of the investigators, or technical constraints inherent to the experimental design.

A low DMS score indicates low fitness, while a higher DMS score indicates high fitness. We can think of higher DMS scores as being more benign, while lower DMS score indicates more pathogenic regions.

Based on the “ACE2_HUMAN_Chan_2020” assay, we can see that at positions 90 and 92, fitness remained high despite across amino acid changes; possibly suggestive of a benign region of the protein. However, several mutations at position 48 resulted in low fitness. This could represent an important region for protein function where any perturbation would likely be deleterious.

Let’s plot another assay, specifying a region and invoking the ComplexHeatmap row clustering under the hood. For more details about this clustering method or to view more function parameters, read the function documentation with ?plot_dms_heatmap().

## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache

In this region of the SHOC2_HUMAN protein, mutating to a K (y-axis) seemed to have the most benign affect across all mutations.

4 Visualize model scores along a protein

ProteinGymR Bioc 3.21 provides functionality to generate heatmaps of zero-shot mode scores for 79 variant effect prediction models and 11 semi-supervised models with the function plot_zeroshot_heatmap(). The required arguments for this function are the assay name to plot (same as for the DMS heatmap), and a model to plot. For a complete list of models, run available_models() for zero-shot models, and supervised_available_models() for the 11 semi-supervised models. If model_data is not provided, the default model scores from ProteinGym will be loaded in from ProteinGymR::zeroshot_substitutions().

## 'model_data' not provided, using default data loaded with zeroshot_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache

As with the DMS scores, we are plotting the GEMME zero-shot scores for positions 1 to 100 in the assay “ACE2_HUMAN_Chan_2020”. At first glance, both the DMS data and GEMME model reveal position 48 to be quite pathogenic across amino acid substitutions.Note that the model scores here are mostly negative; however because these are model prediction scores, negative values do not necessarily indicate lower fitness after mutation as with DMS scores. Thus, model scores are always represented with another color palette to distinguish from experimental scores. Also note, model scores are not rescaled or normalized across the 79 models, and comparison across of raw model predictions should be cautioned in this context. For more information on model scores and how to interpret them, consult the original ProteinGym publication.

It can be useful to visualize the DMS and model scores side by side for a given assay to compare the experimental DMS scores and predicted zero-shot scores outputted from the model. This is easily done with %v% which stacks the heatmaps in one column, while + will display them in two columns, side by side. This can also be done with any output of class ComplexHeatmap::Heatmap().


5 3D protein structure

This section demonstrates how to explore and visualize DMS or model scores on a 3D protein structure using the package r3dmol under the hood. The function requires DMS or model assay to aggregate scores that will be projected onto the 3D structure.

By default, if no data_scores argument is provided, the DMS substitutions from dms_substitutions() are loaded in, or if viewing model scores, set this argument to any model available in ProteinGym v1.2. Get a list of zero-shot and semi-supervised models with available_models() and supervised_available_model().

If a model is chosen, a helper function is invoked which normalizes the model prediction scores using a rank-based normal quantile transformation. The result is a set of normalized scores that preserve the rank order of the models scores, while standardizing the distribution. Transformed values typically fall between -3 and 3. This normalization ensures the scores are approximately standard normally distributed (mean = 0, SD = 1), allowing comparisons across models.

The user may also specify what aggregation method to use for calculating the summary statistic at each residue position. By default, the mean DMS score/model prediction score is calculated for each position. See the function documentation for details: ?plot_structure()

First, let’s use all the default settings. The only required arguments are the assay_name.

Importantly, because the plot shows one protein structure, all DMS fitness scores across amino acids are aggregated within a position. By defaut this aggregation function is just the average of all the DMS scores at that position. However, it is possible to set any user-defined aggregation function with the aggregation_func argument.

For DMS assays, a score of zero will always be represented as white, corresponding to the biological interpretation of neutral fitness effect.

## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
2.54
0
-2.85


In this example, we are plotting the 3D structure of the ACE2_HUMAN protein and overlaying the mean DMS score across all mutants in a given position. Chan et al. 2020 who generated the DMS assay data only experimentally assessed a subset of the entire ACE2_HUMAN protein. By default the function only colors the regions where there is information available in the assay. Red colors represent more pathogenic (lower DMS scores) and blue colors show more benign positions (higher DMS scores). Regions that appear white indicate closer to no change before and after the DMS perturbation. Grey regions represent the range of the protein assessed in the assay; however, only the colored regions include DMS data. Finally, by default, regions of the protein itself outside the range of the experimental assay have the “ball and stick” representation.

We can also overlap model scores from the any of our zero-shot or semi-supervised models. Do this by setting data_scores argument to any model string matching available_models() or supervised_available_models(). Here, let’s demonstrate plotting the “Kermut” model and also allowing the full visualization of the complete protein structure, rather than just the “ball and stick” representation. Do this by seting the argument full_structure == TRUE.


## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## Using semi-supervised model scores loaded with supervised_substitutions()
## No fold_scheme specified, using contiguous scheme as default.
## Loading semi-supervised model scores with contiguous folding scheme
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'full_structure' is set to TRUE by default, displaying complete protein structure.
3
0.31
-2.38


Now we can more clearly see the entire protein structure for ACE2_HUMAN in the ribbon representation, and we have overlaid the model prediction scores from the Kermut model.

Some assays extensively assessed nearly every position of the complete protein, for example: the C6KNH7_9INFA protein from Lee et al. 2018. Let’s visualize this protein and set the aggregation method to view the minimum DMS score across all mutants at each position by setting aggregation_fun = min. To view a specific region in detail: use start_pos and end_pos.

## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
-0.23
0
-4.85


As we might expect, the minimum DMS value (more pathogenic) is almost always a negative number across all positions of this protein. Therefore, there seems to be at least one amino acid mutation that could severely disrupt the fitness at any position of this protein.

Fnally, it is possible to use the same color scheme as the popEVE mutation portal. We can do this for any of the heatmaps or protein structure plots. Do this by setting the color_scheme argument = “EVE”.

6 Correlate DMS scores with model scores

The dms_corr_plot() function allows the user to evaluate the correlation between experimental and model prediction scores. By default, it takes in a protein UniProt ID and runs a Spearman correlation between the ProteinGym DMS assay scores and AlphaMissense predicted pathogenicity scores. It returns a ggplot for visualization. However, as with plot_structure(), you may specify any model in ProteinGym v1.2 to examine.


dms_corr_plot(uniprotId = "Q9NV35")
## Using default AlphaMissense modelfrom `ProteinGymR::am_scores()`.
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_table' not provided, using default table from `ProteinGymR::dms_substitutions()`
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## [1] "r = -0.68; Pval = 0"


By default, the dms_corr_plot() function gathers any of the 217 DMS assays of the chosen UniProt ID and correlates the average DMS score across relevant assays and the AlphaMissense model predictions.

Although the default uses the AlphaMissense scores, it is simple to correlate DMS experimental scores with predictions from any of the 79 zero-shot or 11 supervised models. Below is an example of the workflow to accomplish this for the same protein “Q9NV35”.


7 Correlate prediction scores between two models.

Similar to the above, we can also explore the correlation between two different models for a given protein instead of looking at the DMS experimental data. We can do this for the protein “P04637” and the model_corr_plot() function. By default, the function only requires a UniProt ID, and uses “AlphaMissense” and “EVE_single” models as defaults. Let’s change that to “Kermut” and “ProteinNPT” or our demonstration.

model_corr_plot(
     uniprotId = "P04637",
     model1 = "Kermut",
     model2 = "ProteinNPT"
)
## No fold_scheme specified, using contiguous scheme as default.
## Loading semi-supervised model scores with contiguous folding scheme
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## No fold_scheme specified, using contiguous scheme as default.
## Loading semi-supervised model scores with contiguous folding scheme
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## [1] "r = 0.63; Pval = 0"


There seems to be good correlation between the model predictions for all variants in assays assessing the “P04637” protein.

8 Reference

Notin, P., Kollasch, A., Ritter, D., van Niekerk, L., Paul, S., Spinner, H., Rollins, N., Shaw, A., Orenbuch, R., Weitzman, R., Frazer, J., Dias, M., Franceschi, D., Gal, Y., & Marks, D. (2023). ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems (Vol. 36, pp. 64331-64379). Curran Associates, Inc.

9 Session Info

sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggExtra_0.10.1        ComplexHeatmap_2.25.0 ggplot2_3.5.2        
## [4] stringr_1.5.1         dplyr_1.1.4           tidyr_1.3.1          
## [7] ProteinGymR_1.3.2     BiocStyle_2.37.0     
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3            httr2_1.1.2          rlang_1.1.6         
##   [4] magrittr_2.0.3       clue_0.3-66          GetoptLong_1.0.5    
##   [7] matrixStats_1.5.0    compiler_4.5.0       RSQLite_2.3.11      
##  [10] png_0.1-8            vctrs_0.6.5          maps_3.4.2.1        
##  [13] pkgconfig_2.0.3      shape_1.4.6.1        crayon_1.5.3        
##  [16] fastmap_1.2.0        magick_2.8.6         dbplyr_2.5.0        
##  [19] XVector_0.49.0       labeling_0.4.3       promises_1.3.2      
##  [22] rmarkdown_2.29       UCSC.utils_1.5.0     tinytex_0.57        
##  [25] purrr_1.0.4          bit_4.6.0            xfun_0.52           
##  [28] cachem_1.1.0         pals_1.10            queryup_1.0.5       
##  [31] GenomeInfoDb_1.45.3  jsonlite_2.0.0       gghalves_0.1.4      
##  [34] blob_1.2.4           later_1.4.2          spdl_0.0.5          
##  [37] parallel_4.5.0       cluster_2.1.8.1      R6_2.6.1            
##  [40] stringi_1.8.7        bslib_0.9.0          RColorBrewer_1.1-3  
##  [43] jquerylib_0.1.4      Rcpp_1.0.14          bookdown_0.43       
##  [46] iterators_1.0.14     knitr_1.50           IRanges_2.43.0      
##  [49] httpuv_1.6.16        tidyselect_1.2.1     dichromat_2.0-0.1   
##  [52] yaml_2.3.10          doParallel_1.0.17    codetools_0.2-20    
##  [55] miniUI_0.1.2         curl_6.2.2           tibble_3.2.1        
##  [58] withr_3.0.2          Biobase_2.69.0       shiny_1.10.0        
##  [61] bio3d_2.4-5          KEGGREST_1.49.0      r3dmol_0.1.2        
##  [64] evaluate_1.0.3       BiocFileCache_2.99.1 ggdist_3.3.3        
##  [67] circlize_0.4.16      ExperimentHub_2.99.0 Biostrings_2.77.1   
##  [70] pillar_1.10.2        BiocManager_1.30.25  filelock_1.0.3      
##  [73] foreach_1.5.2        stats4_4.5.0         distributional_0.5.0
##  [76] generics_0.1.4       BiocVersion_3.22.0   S4Vectors_0.47.0    
##  [79] scales_1.4.0         xtable_1.8-4         glue_1.8.0          
##  [82] mapproj_1.2.11       tools_4.5.0          AnnotationHub_3.99.1
##  [85] forcats_1.0.0        Cairo_1.6-2          AnnotationDbi_1.71.0
##  [88] colorspace_2.1-1     RcppSpdlog_0.0.22    cli_3.6.5           
##  [91] rappdirs_0.3.3       viridisLite_0.4.2    gtable_0.3.6        
##  [94] sass_0.4.10          digest_0.6.37        BiocGenerics_0.55.0 
##  [97] htmlwidgets_1.6.4    rjson_0.2.23         farver_2.1.2        
## [100] memoise_2.0.1        htmltools_0.5.8.1    lifecycle_1.0.4     
## [103] httr_1.4.7           GlobalOptions_0.1.2  mime_0.13           
## [106] bit64_4.6.0-1