Background

Bio3D¹ is an R package that provides interactive tools for structural bioinformatics. The primary focus of Bio3D is the analysis of bimolecular structure, sequence and simulation data (Grant et al. 2006).

Normal mode analysis (NMA) is one of the major simulation techniques used to probe large-scale motions in biomolecules. Typical application is for the prediction of functional motions in proteins. Version 2.0 of the Bio3D package now includes extensive NMA facilities (Skjaerven et al. 2015). These include a unique collection of multiple elastic network model force-fields, automated ensemble analysis methods, and variance weighted NMA (see also the NMA Vignette). Here we provide an in-depth demonstration of ensemble NMA with working code that comprise complete executable examples².

Requirements

Detailed instructions for obtaining and installing the Bio3D package on various platforms can be found in the Installing Bio3D Vignette available both on-line and from within the Bio3D package. In addition to Bio3D the MUSCLE and CLUSTALO multiple sequence alignment programs (available from the muscle home page and clustalo home page) must be installed on your system and in the search path for executables. Please see the installation vignette for further details.

About this document

This vignette was generated using Bio3D version 2.4.1.9000.

Part II: Ensemble NMA across multiple species of DHFR

In this vignette we extend the analysis from Part I by including a more extensive search of distant homologues within the DHFR family. Based on a HMMER search we identify and collect protein species down to a pairwise sequence identity of 20%. Normal modes analysis (NMA) across these species reveals a remarkable similarity of the fluctuation profiles, but also features which are characteristic to specific species.

HMMER search for distantly related DHFR species

Below we use the sequence of E.coli DHFR to perform an initial search against the Pfam HMM database with function hmmer(). The arguments type and db specifies the type of hmmer search and the database to search, respectively. In this particular example, our query sequence is searched against the Pfam profile HMM library (arguments type="hmmscan" and db="pfam") to identify its respecitve protein family. The to hmmer() will return a data frame object containing the Pfam accession ID ($acc), description of the identified family ($desc), family name ($name), etc.

library(bio3d)

# get sequence of Ecoli DHFR
seq <- get.seq("1rx2_A")

## Warning in get.seq("1rx2_A"): Removing existing file: seqs.fasta

# scan the Pfam database for our sequence
pfam <- hmmer(seq, type="hmmscan", db="pfam")
pfam$hit.tbl

Sidenote: The hmmer() function facilitates four different types of searches at a multitude of databases. Use function help(hmmer) for a complete overview of the different options.

Having identified the Pfam entry of our query protein we can use function pfam() to fetch the curated sequence alignment of the DHFR family. Use function print.fasta() to print a short summary of the downloaded sequence alignment to the screen. Note that if argument alignment=TRUE the sequence alignmenment itself will be written to screen.

# download pfam alignment for the DHFR family
pfam.aln <- pfam(pfam$hit.tbl$acc[1])
print(pfam.aln, alignment=FALSE)

## 
## Call:
##   pfam(id = pfam$hit.tbl$acc[1])
## 
## Class:
##   fasta
## 
## Alignment dimensions:
##   55 sequence rows; 206 position columns (114 non-gap, 92 gap) 
## 
## + attr: id, ali, call

The next hmmer search builds a profile HMM from the Pfam multiple sequence alignment and uses this HMM to search against a target sequence database (use type="hmmsearch"). In this case our target sequence database is the PDB (db="pdb"), but there are also other options such as "swissprot" and "uniprot".

# use Pfam alignment in search 
hmm <- hmmer(pfam.aln, type="hmmsearch", db="pdb")

Function plot.hmmer() (the equivalent to plot.blast()) provides a quick overview of the rearch results, and can aid in the identification of a sensible hit similarity threshold. The normalized scores (-log(E-Value)) are shown in the upper panel, and the lower panel provides an overview of the kingdom and specie each hit are associated with. Here we specify a cutoff of 56 yielding 655 hits:

hits <- plot(hmm, cutoff=56)

##   * Possible cutoff values:    56 0 
##             Yielding Nhits:    696 770 
## 
##   * Chosen cutoff value of:    56 
##             Yielding Nhits:    696

Overview of hits obtained from the HMMER search. The two upper panels show the normalized scores and bitscore. The two lower panels show a bar graph of the distribution of significant hits according to the bitscores and taxonomic kingdom. The first bar plot corresponds to the hits above cutoff, while the second par plot corresponds to those hits below the cutoff. Observe e.g. that the majority of the top scoring hits is from bacteria (red) while there are several eukaryotes (green) with lower scores.

Side-note: To visualize the hmmer search results in a web-browser go to the URL in the hmm$url attribute:

# view hmmer results in web-browser
print(hmm$url)

## [1] "http://www.ebi.ac.uk/Tools/hmmer/results/13A8B756-F6C6-11EA-BDF8-1E9153F04F9B"

An summary over the hit species can be obtained by investigatin the hmm$hit.tbl$species attribute:

ids <- hits$acc
species <- hmm$hit.tbl$species[hits$inds]

# print collected species
print(unique(species))

##  [1] "Geobacillus stearothermophilus"                        
##  [2] "Bacillus anthracis"                                    
##  [3] "Coxiella burnetii"                                     
##  [4] "Bacillus anthracis str. Sterne (GCA_000832635)"        
##  [5] "Bacillus anthracis str. Sterne"                        
##  [6] "Moritella profunda"                                    
##  [7] "Yersinia pestis CO92"                                  
##  [8] "Klebsiella pneumoniae CG43"                            
##  [9] "Escherichia coli str. K-12 substr. DH10B"              
## [10] "Escherichia coli"                                      
## [11] "Escherichia coli str. K-12 substr. MC4100"             
## [12] "Escherichia coli (strain K12)"                         
## [13] "Escherichia coli CFT073"                               
## [14] "Enterococcus faecalis V583"                            
## [15] "Staphylococcus aureus"                                 
## [16] "Escherichia coli UTI89"                                
## [17] "Staphylococcus aureus MUF168"                          
## [18] "Staphylococcus aureus RF122"                           
## [19] "Staphylococcus aureus (strain bovine RF122 / ET3-1)"   
## [20] "Streptococcus pneumoniae"                              
## [21] "Lactobacillus casei"                                   
## [22] "Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra)"
## [23] "Mycobacterium tuberculosis H37Rv (GCA_000667805)"      
## [24] "Mycobacterium tuberculosis"                            
## [25] "Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv)"
## [26] "Mycolicibacterium smegmatis MC2 155"                   
## [27] "Mycobacterium avium"                                   
## [28] "synthetic construct"                                   
## [29] "Haloferax volcanii"                                    
## [30] "Klebsiella pneumoniae"                                 
## [31] "Schistosoma mansoni"                                   
## [32] "Pneumocystis carinii"                                  
## [33] "Mus musculus"                                          
## [34] "Gallus gallus"                                         
## [35] "Homo sapiens"                                          
## [36] "Cryptosporidium hominis"                               
## [37] "Candida albicans"                                      
## [38] "[Candida] glabrata CBS 138"                            
## [39] "[Candida] glabrata"                                    
## [40] "Toxoplasma gondii"                                     
## [41] "Babesia bovis T2Bo"                                    
## [42] "Babesia bovis"                                         
## [43] "Plasmodium vivax"                                      
## [44] "Plasmodium falciparum"

Retrieve and process structures from the PDB

Having identified relevant PDB structures through the hmmer search we proceed by fetching and pre-processing the PDB files with functions get.pdb() and pdbsplit().

As in the previous vignette, we are interested in protein structures without missing in-structure residues, and we also want to limit the number of identifical conformers:

# fetch and split PDBs
raw.files <- get.pdb(ids, path = "raw_pdbs", gzip=TRUE)
files <- pdbsplit(raw.files, ids = ids,
                  path = "raw_pdbs/split_chain", ncore=4)
pdbs.all <- pdbaln(files)

# exclude hits with fusion proteins
gaps <- gap.inspect(pdbs.all$ali)
pdbs <- trim(pdbs.all, row.inds=which(gaps$row > 200))

# exclude specific hits
excl.inds <- unlist(lapply(c("5dxv"), grep, pdbs$id))
pdbs <- trim(pdbs, row.inds=-excl.inds)

# exclude structures with missing residues
conn <- inspect.connectivity(pdbs, cut=4.05)
pdbs <- trim(pdbs, row.inds=which(conn))
          
# exclude conformational redundant structures
rd <- filter.rmsd(pdbs$xyz, cutoff=0.25, fit=TRUE, ncore=4)
pdbs <- trim(pdbs, row.inds=rd$ind)

In this particular case a standard sequence alignment (e.g. through function pdbaln() or seqaln()) is not sufficient for a correct alignment. We will therefore make use of the Pfam profile alignment, and align our selected PDBs to this using argument profile to function seqaln(). Subsequently, we re-read the fasta file, and use function read.fasta.pdb() to obtain aligned C-alpha atom data (including coordinates etc.) for the PDB ensemble:

# align pdbs to Pfam-profile alignment
aln <- seqaln(pdbs, profile=pfam.aln, exefile="clustalo", extra.args="--dealign")

# final alignment will also contain the profile
# store only PDBs in alignment
aln$ali <- aln$ali[1:length(pdbs$id),]
aln$id  <- aln$id[1:length(pdbs$id)]

# re-read PDBs to match the new alignment
pdbs <- read.fasta.pdb(aln)

# exclude gap-only columns
pdbs <- trim(pdbs)

# refit coordinates
pdbs$xyz <- pdbfit(pdbs)

# refit coordinates, and write PDBs to disk
pdbs$xyz <- pdbfit(pdbs, outpath="flsq/")

# fetch IDs again
ids <- basename.pdb(pdbs$id)
species <- hmm$hit.tbl$species[hmm$hit.tbl$acc %in% ids]

# labels for annotating plots
labs <- paste(substr(species, 1,1), ". ", 
              lapply(strsplit(species, " "), function(x) x[2]), sep="")
print(unique(labs))

##  [1] "G. stearothermophilus" "B. anthracis"          "M. profunda"          
##  [4] "Y. pestis"             "K. pneumoniae"         "E. coli"              
##  [7] "E. faecalis"           "S. aureus"             "S. pneumoniae"        
## [10] "L. casei"              "M. tuberculosis"       "M. smegmatis"         
## [13] "M. avium"              "H. volcanii"           "S. mansoni"           
## [16] "P. carinii"            "M. musculus"           "G. gallus"            
## [19] "H. sapiens"            "C. albicans"           "[. glabrata"

The pdbs object now contains aligned C-alpha atom data, including Cartesian coordinates, residue numbers, residue types, and B-factors. The sequence alignment is also stored by default to the FASTA format file ‘aln.fa’ (to view this you can use an alignment viewer such as SEAVIEW, see Requirements section above).

Sequence conservation analysis

Function seqidentity() can be used to calculate the sequence identity for the PDBs ensemble. Below we also print a summary of the calculated sequence identities, and perform a clustering of the structures based on sequence identity:

seqide <- seqidentity(pdbs)
summary(c(seqide))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2130  0.3080  0.3380  0.4485  0.3910  1.0000

hc <- hclust(as.dist(1-seqide))
grps.seq <- cutree(hc, h=0.6)

hclustplot(hc, k=3, labels=labs, cex=0.25, fillbox=FALSE)

Clustering of collected structures based on sequence identity.

Normal modes analysis

Function nma.pdbs() will calculate the normal modes of each protein structures stored in the pdbs object. The normal modes are calculated on the full structures as provided by object pdbs. Use argument rm.gaps=FALSE to visualize fluctuations also of un-aligned residues:

modes <- nma(pdbs, rm.gaps=FALSE, ncore=4)

The modes object of class enma contains aligned normal mode data including fluctuations, RMSIP data (only whenrm.gaps=FALSE), and aligned eigenvectors. A short summary of the modes object can be obtain by calling the function print(), and the aligned fluctuations can be plotted with function plot.enma():

print(modes)

## 
## Call:
##   nma.pdbs(pdbs = pdbs, rm.gaps = FALSE, ncore = 4)
## 
## Class:
##   enma
## 
## Number of structures:
##   259
## 
## Attributes stored:
##   - Aligned atomic fluctuations
##   - Aligned eigenvectors (gaps not removed)
##   - Dimensions of x$U.subspace: 720x381x259
## 
## Coordinates were aligned prior to NMA calculations
## 
## + attr: fluctuations, rmsip, U.subspace, L, full.nma, xyz,
##         call

cons <- conserv(pdbs$ali)
sse <- pdbs2sse(pdbs, rm.gaps=FALSE, ind=1)

## Warning in dssp.pdb(pdb.ref, ...): Non-protein residues detected in input PDB:
## SO4, HOH

resno <- pdbs$resno[1, ]
resno[is.na(resno)] <- ""
xlab <- paste0("Residue number (reference PDB: ", basename.pdb(pdbs$id[1]), ")")

par(mfrow=c(2,1), mar=c(4, 4, 2, 2))
plot(modes, pdbs, ylim=c(0,2), col=grps.seq, label=NULL, xlab=xlab)

## Warning in dssp.pdb(pdb.ref, ...): Non-protein residues detected in input PDB:
## SO4, HOH

plot.bio3d(cons, resno=resno, sse=sse, ylab="Conservation", xlab=xlab)

Flexibility profiles and sequence conservation. The figure shows the modes fluctuations colored according their sequence identity. The lower panel shows the sequence conservation for the PDBs. The plot is generated with function plot.enma() along with function call to conerv().

In some cases it can be difficult to interpret the fluctuation plot when all lines are plotted on top of each other. Argument spread=TRUE adds a small gap between grouped fluctuation profiles. Use this argument in combination with a new groups (grps) variable to function plot.enma():

grps <- rep(NA, length(grps.seq))
grps[grep("coli", labs)]=1
grps[grep("aureus", labs)]=2
grps[grep("anthracis", labs)]=3
grps[grep("tubercu", labs)]=4
grps[grep("casei", labs)]=5
grps[grep("sapiens", labs)]=6
grps[grep("albicans", labs)]=7
grps[grep("glabrata", labs)]=8
grps[grep("carinii", labs)]=9

plot(modes, pdbs=pdbs, col=grps, spread=TRUE, ylim=c(0,1.5), label=NULL)

## Warning in dssp.pdb(pdb.ref, ...): Non-protein residues detected in input PDB:
## NDP, TOP, HOH

Flexibility profiles for selected species. The plot is generated with function plot.enma() with argument spread=TRUE.

Visualize modes

A function call to mktrj.enma() will generate a trajectory PDB file for the visualization of a specific normal mode for one of the structures in the pdbs object. This allows for a visual comparison of the calculated normal modes. Below we make a PDB trajectory of the first mode (argument m.inds=1) of 3 relevant species (e.g. argument s.inds=1). Note that we use grep() to fetch the indices (in the modes and pdbs objects) of the relevant species:

inds <- c(grep("coli", species)[1], 
          grep("sapiens", species)[1],
          grep("albicans", species)[1])

# E.coli
mktrj(modes, pdbs, m.inds=1, s.inds=inds[1], file="ecoli-mode1.pdb")

# H. sapiens
mktrj(modes, pdbs, m.inds=1, s.inds=inds[2], file="hsapiens-mode1.pdb")

# C. albicans
mktrj(modes, pdbs, m.inds=1, s.inds=inds[3], file="calbicans-mode1.pdb")

Mode comparison of E.coli, H.sapiens, and C.albicans. The trajectories are made with function mktrj.enma() and visualized in PyMol.

Document Details

This document is shipped with the Bio3D package in both R and PDF formats. All code can be extracted and automatically executed to generate Figures and/or the PDF with the following commands:

library(rmarkdown)
render("Bio3D_nma-dhfr-partII.Rmd", "all")

Information About the Current Bio3D Session

print(sessionInfo(), FALSE)

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bio3d_2.4-1.9000
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         digest_0.6.25      crayon_1.3.4       rprojroot_1.3-2   
##  [5] assertthat_0.2.1   grid_4.0.2         R6_2.4.1           backports_1.1.9   
##  [9] magrittr_1.5       evaluate_0.14      highr_0.8          stringi_1.5.3     
## [13] rlang_0.4.7        fs_1.5.0           ragg_0.3.1         rmarkdown_2.3     
## [17] pkgdown_1.6.1.9000 desc_1.2.0         tools_4.0.2        stringr_1.4.0     
## [21] parallel_4.0.2     yaml_2.2.1         xfun_0.17          compiler_4.0.2    
## [25] systemfonts_0.3.1  cpp11_0.2.1        memoise_1.1.0      htmltools_0.5.0   
## [29] knitr_1.29

References

Grant, B. J., A. P. D. C Rodrigues, K. M. Elsawy, A. J. Mccammon, and L. S. D. Caves. 2006. “Bio3d: An R Package for the Comparative Analysis of Protein Structures.” Bioinformatics 22: 2695–6. https://doi.org/10.1093/bioinformatics/btl461.

Skjaerven, L., X. Q. Yao, G. Scarabelli, and B. J. Grant. 2015. “Integrating Protein Structural Dynamics and Evolutionary Analysis with Bio3d.” BMC Bioinformatics 15: 399. https://doi.org/10.1186/s12859-014-0399-6.

The latest version of the package, full documentation and further vignettes (including detailed installation instructions) can be obtained from the main Bio3D website: http://thegrantlab.org/bio3d/↩︎
This vignette contains executable examples, see help(vignette) for further details.↩︎

Ensemble NMA across multiple species of DHFR

Lars Skjaerven, Xin-Qiu Yao, Guido Scarabelli & Barry J. Grant