Background

Bio3D1 is an R package that provides interactive tools for structural bioinformatics. The primary focus of Bio3D is the analysis of bimolecular structure, sequence and simulation data (Grant et al. 2006).

Normal mode analysis (NMA) is one of the major simulation techniques used to probe large-scale motions in biomolecules. Typical application is for the prediction of functional motions in proteins. Version 2.0 of the Bio3D package now includes extensive NMA facilities (Skjaerven et al. 2015). These include a unique collection of multiple elastic network model force-fields, automated ensemble analysis methods, and variance weighted NMA (see also the NMA Vignette). Here we provide an in-depth demonstration of ensemble NMA with working code that comprise complete executable examples2.

Requirements

Detailed instructions for obtaining and installing the Bio3D package on various platforms can be found in the Installing Bio3D Vignette available both on-line and from within the Bio3D package. In addition to Bio3D the MUSCLE and CLUSTALO multiple sequence alignment programs (available from the muscle home page and clustalo home page) must be installed on your system and in the search path for executables. Please see the installation vignette for further details.

About this document

This vignette was generated using Bio3D version 2.4.1.9000.

Part II: Ensemble NMA across multiple species of DHFR

In this vignette we extend the analysis from Part I by including a more extensive search of distant homologues within the DHFR family. Based on a HMMER search we identify and collect protein species down to a pairwise sequence identity of 20%. Normal modes analysis (NMA) across these species reveals a remarkable similarity of the fluctuation profiles, but also features which are characteristic to specific species.

Retrieve and process structures from the PDB

Having identified relevant PDB structures through the hmmer search we proceed by fetching and pre-processing the PDB files with functions get.pdb() and pdbsplit().

As in the previous vignette, we are interested in protein structures without missing in-structure residues, and we also want to limit the number of identifical conformers:

# fetch and split PDBs
raw.files <- get.pdb(ids, path = "raw_pdbs", gzip=TRUE)
files <- pdbsplit(raw.files, ids = ids,
                  path = "raw_pdbs/split_chain", ncore=4)
pdbs.all <- pdbaln(files)

# exclude hits with fusion proteins
gaps <- gap.inspect(pdbs.all$ali)
pdbs <- trim(pdbs.all, row.inds=which(gaps$row > 200))

# exclude specific hits
excl.inds <- unlist(lapply(c("5dxv"), grep, pdbs$id))
pdbs <- trim(pdbs, row.inds=-excl.inds)

# exclude structures with missing residues
conn <- inspect.connectivity(pdbs, cut=4.05)
pdbs <- trim(pdbs, row.inds=which(conn))
          
# exclude conformational redundant structures
rd <- filter.rmsd(pdbs$xyz, cutoff=0.25, fit=TRUE, ncore=4)
pdbs <- trim(pdbs, row.inds=rd$ind)

In this particular case a standard sequence alignment (e.g. through function pdbaln() or seqaln()) is not sufficient for a correct alignment. We will therefore make use of the Pfam profile alignment, and align our selected PDBs to this using argument profile to function seqaln(). Subsequently, we re-read the fasta file, and use function read.fasta.pdb() to obtain aligned C-alpha atom data (including coordinates etc.) for the PDB ensemble:

# align pdbs to Pfam-profile alignment
aln <- seqaln(pdbs, profile=pfam.aln, exefile="clustalo", extra.args="--dealign")

# final alignment will also contain the profile
# store only PDBs in alignment
aln$ali <- aln$ali[1:length(pdbs$id),]
aln$id  <- aln$id[1:length(pdbs$id)]

# re-read PDBs to match the new alignment
pdbs <- read.fasta.pdb(aln)

# exclude gap-only columns
pdbs <- trim(pdbs)

# refit coordinates
pdbs$xyz <- pdbfit(pdbs)
# refit coordinates, and write PDBs to disk
pdbs$xyz <- pdbfit(pdbs, outpath="flsq/")
# fetch IDs again
ids <- basename.pdb(pdbs$id)
species <- hmm$hit.tbl$species[hmm$hit.tbl$acc %in% ids]
# labels for annotating plots
labs <- paste(substr(species, 1,1), ". ", 
              lapply(strsplit(species, " "), function(x) x[2]), sep="")
print(unique(labs))
##  [1] "G. stearothermophilus" "B. anthracis"          "M. profunda"          
##  [4] "Y. pestis"             "K. pneumoniae"         "E. coli"              
##  [7] "E. faecalis"           "S. aureus"             "S. pneumoniae"        
## [10] "L. casei"              "M. tuberculosis"       "M. smegmatis"         
## [13] "M. avium"              "H. volcanii"           "S. mansoni"           
## [16] "P. carinii"            "M. musculus"           "G. gallus"            
## [19] "H. sapiens"            "C. albicans"           "[. glabrata"

The pdbs object now contains aligned C-alpha atom data, including Cartesian coordinates, residue numbers, residue types, and B-factors. The sequence alignment is also stored by default to the FASTA format file ‘aln.fa’ (to view this you can use an alignment viewer such as SEAVIEW, see Requirements section above).

Sequence conservation analysis

Function seqidentity() can be used to calculate the sequence identity for the PDBs ensemble. Below we also print a summary of the calculated sequence identities, and perform a clustering of the structures based on sequence identity:

seqide <- seqidentity(pdbs)
summary(c(seqide))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2130  0.3080  0.3380  0.4485  0.3910  1.0000
hc <- hclust(as.dist(1-seqide))
grps.seq <- cutree(hc, h=0.6)
hclustplot(hc, k=3, labels=labs, cex=0.25, fillbox=FALSE)
Clustering of collected structures based on sequence identity.

Clustering of collected structures based on sequence identity.

Normal modes analysis

Function nma.pdbs() will calculate the normal modes of each protein structures stored in the pdbs object. The normal modes are calculated on the full structures as provided by object pdbs. Use argument rm.gaps=FALSE to visualize fluctuations also of un-aligned residues:

modes <- nma(pdbs, rm.gaps=FALSE, ncore=4)

The modes object of class enma contains aligned normal mode data including fluctuations, RMSIP data (only whenrm.gaps=FALSE), and aligned eigenvectors. A short summary of the modes object can be obtain by calling the function print(), and the aligned fluctuations can be plotted with function plot.enma():

print(modes)
## 
## Call:
##   nma.pdbs(pdbs = pdbs, rm.gaps = FALSE, ncore = 4)
## 
## Class:
##   enma
## 
## Number of structures:
##   259
## 
## Attributes stored:
##   - Aligned atomic fluctuations
##   - Aligned eigenvectors (gaps not removed)
##   - Dimensions of x$U.subspace: 720x381x259
## 
## Coordinates were aligned prior to NMA calculations
## 
## + attr: fluctuations, rmsip, U.subspace, L, full.nma, xyz,
##         call
cons <- conserv(pdbs$ali)
sse <- pdbs2sse(pdbs, rm.gaps=FALSE, ind=1)
## Warning in dssp.pdb(pdb.ref, ...): Non-protein residues detected in input PDB:
## SO4, HOH
resno <- pdbs$resno[1, ]
resno[is.na(resno)] <- ""
xlab <- paste0("Residue number (reference PDB: ", basename.pdb(pdbs$id[1]), ")")

par(mfrow=c(2,1), mar=c(4, 4, 2, 2))
plot(modes, pdbs, ylim=c(0,2), col=grps.seq, label=NULL, xlab=xlab)
## Warning in dssp.pdb(pdb.ref, ...): Non-protein residues detected in input PDB:
## SO4, HOH
plot.bio3d(cons, resno=resno, sse=sse, ylab="Conservation", xlab=xlab)
Flexibility profiles and sequence conservation. The figure shows the modes fluctuations colored according their sequence identity. The lower panel shows the sequence conservation for the PDBs. The plot is generated with function **plot.enma()** along with function call to **conerv()**.

Flexibility profiles and sequence conservation. The figure shows the modes fluctuations colored according their sequence identity. The lower panel shows the sequence conservation for the PDBs. The plot is generated with function plot.enma() along with function call to conerv().

In some cases it can be difficult to interpret the fluctuation plot when all lines are plotted on top of each other. Argument spread=TRUE adds a small gap between grouped fluctuation profiles. Use this argument in combination with a new groups (grps) variable to function plot.enma():

grps <- rep(NA, length(grps.seq))
grps[grep("coli", labs)]=1
grps[grep("aureus", labs)]=2
grps[grep("anthracis", labs)]=3
grps[grep("tubercu", labs)]=4
grps[grep("casei", labs)]=5
grps[grep("sapiens", labs)]=6
grps[grep("albicans", labs)]=7
grps[grep("glabrata", labs)]=8
grps[grep("carinii", labs)]=9

plot(modes, pdbs=pdbs, col=grps, spread=TRUE, ylim=c(0,1.5), label=NULL)
## Warning in dssp.pdb(pdb.ref, ...): Non-protein residues detected in input PDB:
## NDP, TOP, HOH
Flexibility profiles for selected species. The plot is generated with function **plot.enma()** with argument `spread=TRUE`.

Flexibility profiles for selected species. The plot is generated with function plot.enma() with argument spread=TRUE.

Visualize modes

A function call to mktrj.enma() will generate a trajectory PDB file for the visualization of a specific normal mode for one of the structures in the pdbs object. This allows for a visual comparison of the calculated normal modes. Below we make a PDB trajectory of the first mode (argument m.inds=1) of 3 relevant species (e.g. argument s.inds=1). Note that we use grep() to fetch the indices (in the modes and pdbs objects) of the relevant species:

inds <- c(grep("coli", species)[1], 
          grep("sapiens", species)[1],
          grep("albicans", species)[1])

# E.coli
mktrj(modes, pdbs, m.inds=1, s.inds=inds[1], file="ecoli-mode1.pdb")

# H. sapiens
mktrj(modes, pdbs, m.inds=1, s.inds=inds[2], file="hsapiens-mode1.pdb")

# C. albicans
mktrj(modes, pdbs, m.inds=1, s.inds=inds[3], file="calbicans-mode1.pdb")

Mode comparison of E.coli, H.sapiens, and C.albicans. The trajectories are made with function mktrj.enma() and visualized in PyMol.

Document Details

This document is shipped with the Bio3D package in both R and PDF formats. All code can be extracted and automatically executed to generate Figures and/or the PDF with the following commands:

library(rmarkdown)
render("Bio3D_nma-dhfr-partII.Rmd", "all")

Information About the Current Bio3D Session

print(sessionInfo(), FALSE)
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bio3d_2.4-1.9000
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         digest_0.6.25      crayon_1.3.4       rprojroot_1.3-2   
##  [5] assertthat_0.2.1   grid_4.0.2         R6_2.4.1           backports_1.1.9   
##  [9] magrittr_1.5       evaluate_0.14      highr_0.8          stringi_1.5.3     
## [13] rlang_0.4.7        fs_1.5.0           ragg_0.3.1         rmarkdown_2.3     
## [17] pkgdown_1.6.1.9000 desc_1.2.0         tools_4.0.2        stringr_1.4.0     
## [21] parallel_4.0.2     yaml_2.2.1         xfun_0.17          compiler_4.0.2    
## [25] systemfonts_0.3.1  cpp11_0.2.1        memoise_1.1.0      htmltools_0.5.0   
## [29] knitr_1.29

References

Grant, B. J., A. P. D. C Rodrigues, K. M. Elsawy, A. J. Mccammon, and L. S. D. Caves. 2006. “Bio3d: An R Package for the Comparative Analysis of Protein Structures.” Bioinformatics 22: 2695–6. https://doi.org/10.1093/bioinformatics/btl461.

Skjaerven, L., X. Q. Yao, G. Scarabelli, and B. J. Grant. 2015. “Integrating Protein Structural Dynamics and Evolutionary Analysis with Bio3d.” BMC Bioinformatics 15: 399. https://doi.org/10.1186/s12859-014-0399-6.


  1. The latest version of the package, full documentation and further vignettes (including detailed installation instructions) can be obtained from the main Bio3D website: http://thegrantlab.org/bio3d/↩︎

  2. This vignette contains executable examples, see help(vignette) for further details.↩︎