cna_vignette.spin.Rmd
Bio3D1 is a family of R packages that provide interactive tools for structural bioinformatics. The primary focus of Bio3D is the analysis of bimolecular structure, sequence and simulation data.
The Bio3D package, bio3d.cna, contains functionality for the creation and analysis of protein structure networks. This vignette will introduce this functionality by building and analyzing networks obtained from atomic correlation data (named correlation network analysis) derived from normal mode analysis (NMA) and molecular dynamics (MD) simulation (Figure 1)2. Note that Bio3D contains extensive functionality for both NMA and MD analysis. An introduction to each of these areas is detailed in separate NMA and MD tutorals/vignettes available from http://thegrantlab.org/bio3d.
Detailed instructions for obtaining and installing the Bio3D packages on various platforms can be found in the Installing Bio3D Vignette available on-line. Note that this vignette also makes use of the IGRAPH R package, which can be installed with the command:
install.packages("igraph")
In this section we will first focus on correlation network analysis derived from NMA, and in subsequent sections MD.
The code snippet below first loads the Bio3D and related packages, then reads an example PDB structure from the RCSB online database. We then follow the flow in Figure 1 by performing NMA, then dynamic cross-correlation analysis, and finally network analysis with the cna() function.
library(bio3d) library(bio3d.cna) library(bio3d.nma) library(igraph) pdbfile <- system.file("examples/hivp.pdb", package = "bio3d") pdb <- read.pdb( pdbfile ) modes <- nma(pdb) cij <- dccm(modes) net <- cna(cij, cutoff.cij=0.3)
To enable quick examination of the network analysis results we provide cna specific variants of the print(), summary() and plot() functions, as demonstrated below:
print(net)
##
## Call:
## cna(cij = cij, cutoff.cij = 0.3)
##
## Structure:
## - NETWORK NODES#: 198 EDGES#: 752
## - COMMUNITY NODES#: 8 EDGES#: 11
##
## + attr: network, communities, community.network,
## community.cij, cij, call
# Summary information as a table x <- summary(net)
## id size members
## 1 36 c(1:8, 89:107, 190:198)
## 2 20 c(9:11, 22:32, 83:88)
## 3 24 c(12:21, 60:73)
## 4 28 c(33:46, 55:59, 74:82)
## 5 16 c(47:54, 146:153)
## 6 17 c(108, 122:130, 183:189)
## 7 28 c(109:121, 159:173)
## 8 29 c(131:145, 154:158, 174:182)
# Other useful things are returned from **summary.cna()** attributes(x)
## $names
## [1] "id" "size" "members" "tbl"
# List the residues belonging to the 5th community x$members[5]
## $`5`
## 47 48 49 50 51 52 53 54 146 147 148 149 150 151 152 153
## 47 48 49 50 51 52 53 54 146 147 148 149 150 151 152 153
# Plot both the ‘full’ all-residue network and simplified community network plot(net, pdb, full = TRUE, vertex.label.cex=0.7) plot(net, pdb)
The raw correlation matrix obtained from NMA (or indeed MD) typically has many small but non-zero value elements. By default the cna() function excludes some of the weakest values (close to zero) to avoid building an edge heavy network, which would considerably slow down further analysis. The cutoff.cij argument in the cna() function specifies the lower boundary for considering correlation values. All values whose absolute value is lower than cutoff.cij will be ignored when defining network edges.
Note that the above code plots both a full all-residue network as well as a more coarse grained community network - see Figure 2. In this example, and by default, the Girvan-Newman clustering method was used to map the full network into communities of highly intra-connected but loosely inter-connected nodes that were in turn used to derive the simplified network displayed in Figure 2 (right). We will discuss community detection in a separate section below and for now simply highlight that a comparison of the full network and the distribution and connectivity of communities between similar biomolecular systems (e.g. in the presence and absence of a binding partner or mutation) has the potential to highlight differences in coupled motions that may be indicative of potential allosteric pathways. This approach has previously been applied to investigate allostery in tRNA–protein complexes, kinesin motor domains, G-proteins, thrombin, and other systems [1-5].
Another useful function for the inspection of community structure is the vmd.cna() function (invoked by calling vmd()), which permits interactive network visualization in the VMD molecular graphics program (see Figure 3).
# View the network in VMD (see Figure 3) vmd(net, pdb, launch = TRUE)
One of the main advantages of correlation network analysis is that it often facilitates the interpretation of correlated motions that can be challenging to rationalize from the output of a dynamic correlation map alone (e.g. the output of the dccm() function). For example, the identified communities of highly intra-connected residues can be tied into the visualization of the correlation data (Figure 4) and cross referenced to the network visualizations presented in Figures 2 and 3. In this example the colored marginal segments of Figure 4 match the identified communities in Figures 2 and 3:
# Plot the correlation matrix with community annotation, see Figure 4. plot(cij, margin.segments = net$communities$membership, main="")
To more fully appreciate the results of network analysis one should become somewhat familiar with the structure of the object returned from the cna() function. This will allow you to generate customized visualizations beyond those in Figures 2 to 4 and also enable you to more fully examine the effect of different parameters on network structure and topology.
The output of the cna() function is a list object of class cna with both network and community structure attributes. These are summarized at the bottom of the print(net) results above (see the line beginning with ‘+ attr’) and can be explicitly listed with the attributes(cna) command:
attributes(net)
## $names
## [1] "network" "communities" "community.network"
## [4] "community.cij" "cij" "call"
##
## $class
## [1] "cna"
Here we introduce each of these components in turn:
$network
: The full protein structure network typically with a node per residue and connecting edges weighted by their corresponding correlation values (i.e., input cij values optionally filtered by the user defined ‘cutoff.cij’). This filtered cij matrix is also returned in the output as $cij
, as listed below.
$communities
: A community clustering object detailing the results of clustering the full $network
. This is itself a list object with a number of attributes (see attributes(net$communities)
for details). Notable attributes include $communities$membership
, $communities$modularity
and $communities$algorithm
. These components detail the nodes (typically residues) belonging to each community; the modularity (which represents the difference in probability of intra- and intercommunity connections for a given network division); and the clustering algorithm used respectively.
$cij
: The filtered correlation matrix used to build the full $network.
$community.network
: A coarse-grained community network object with the number of nodes equal to the number of communities present in $communities
(i.e., the community clustering of the full network discussed below) and edges based on the inter-community coupling as determined by the input ‘collapse.method’ (by default this is the maximum coupling between the original nodes of the respective communities).
$community.cij
: A cij matrix obtained by applying a “collapse.method” on $cij
. The rows and columns match the number of communities in $communities. The individual values are based on the cij couplings of inter-community residues. The currently available collapse methods for defining these values include: max, mean, trimmed and median.
Both the $network
and $community.network
attributes are igraph compatible network objects [6] and contain additional node (or vertex) and edge annotations that can be accessed with the V() and E() functions in the following way:
## [1] "#0000FF" "#0000FF" "#0000FF" "#0000FF" "#0000FF" "#0000FF"
V(net$community.network)$size
## [1] 36 20 24 28 16 17 28 29
E(net$community.network)$weight
## [1] 0.4959895 0.4579519 0.5620385 0.5190477 0.7274177 0.4988692 0.3649480
## [8] 0.3648937 0.6414050 0.5867744 0.4989730
V(net$community.network)$name
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
In addition to various Bio3D functions, the full set of extensive igraph package features can thus be used to further analyze these network objects.
The $communities
attribute returned by the cna() function provides details of the community clustering procedure. This procedure aims to split the full network into highly correlated local substructures (referred to as communities) within which the network connections are dense but between which they are sparser. A number of community detection algorithms can be used to solve this so called graph-partitioning problem. The default Girvan-Newman edge-betweenness approach is a divisive algorithm that is based on the use of the edge betweenness as a partitioning criterion. “Betweenness” is calculated by finding the shortest path(s) between a pair of vertexes and scoring each of the edges on this/these path(s) with the inverse value of the number of shortest paths. (So if there was only one path of the shortest length, each edge on it would score 1 and if there were 10 paths of that length, each edge would score 1/10.) This is done for every pair of vertexes. In this way each edge accumulates a “betweenness” score for the whole network. The network is separated into clusters by removing the edge with the highest “betweenness”, then recalculating betweenness and repeating. The method is fully described in (Girvan and Newman, 2002 PNAS) [7].
A useful quantity used to measure the quality of a given community partition is the modularity (detailed in the $communities$modularity
component of cna() output). The modularity represents the difference in probability of intra- and inter-community connections for a given network division. Modularity values fall in the range from 0 to 1, with higher values indicating higher quality of the community structure. The optimum community structures obtained for typical MD and NMA derived correlation networks fall in the 0.4 to 0.7 range.
head(net$communities$modularity)
## [1] -5.770111e-03 -4.430715e-03 -3.616681e-03 -8.542082e-04 2.043481e-05
## [6] 2.076868e-03
max(net$communities$modularity)
## [1] 0.7381234
The community.tree() function can be used to reconstruct the community membership vector for each clustering step. Here we plot modularity vs number of communities.
# See Figure 5. tree <- community.tree(net, rescale=TRUE) plot( tree$num.of.comms, tree$modularity, xlab="Communities", ylab="Modularity")
Note that there are many points near the maximum modularity value in Figure 5. In such cases a number of alternative community partitions may be equally high scoring and therefore worth inspecting more closely. For example we will often favor a partition point with a smaller number of overall communities but slightly lower modularity score based on some prior knowledge of the system (for example results from a related system).
Below we inspect the maximum modularity value partitioning and the number of communities (k) at the point of maximum modularity. Based on a quick inspection of the clustering dendrogram we explore a new membership partitioning (at k=3). Again this example is purely for demonstrating some of the available functionality within the package, and we encourage motivated readers to explore the true structure within their data by examining multiple structures together with the effects of different input parameters and clustering procedures.
max.mod.ind <- which.max(tree$modularity) tree$num.of.comms[ max.mod.ind ]
## [1] 8
# Membership vector at this partition point tree$tree[max.mod.ind,]
## [1] 1 1 1 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4
## [38] 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4
## [75] 4 4 4 4 4 4 4 4 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 7 7 7
## [112] 7 7 7 7 7 7 7 7 7 7 6 6 6 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 5 5 5
## [149] 5 5 5 5 5 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 6 6 6
## [186] 6 6 6 6 1 1 1 1 1 1 1 1 1
# Should be the same as that contained in the original CNA network object all( net$communities$membership == tree$tree[max.mod.ind,] )
## [1] TRUE
# Inspect the clustering dendrogram (Figure 6). # To match colors use 'colors=vmd_colors()[memb.k3]' h <- as.hclust(net$communities) hclustplot(h, k=3)
# Inspect a new membership partitioning (at k=3) memb.k3 <- tree$tree[ tree$num.of.comms == 3, ]
# Produce a new k=3 community network (Figure 7) net.3 <- network.amendment(net, memb.k3) plot(net.3, pdb)
# View the network in VMD (see Figure 8) vmd(net.3, trim.pdb(pdb, atom.select(pdb,"calpha")), launch=TRUE )
For this example section we apply correlation network analysis to a short molecular dynamics trajectory of Human Immunodeficiency Virus aspartic protease (HIVpr). This trajectory is included with the bio3d package and stored in CHARMM/NAMD DCD format. To reduce file size and speed up example execution, non C-alpha atoms have been previously removed (see the Trajectory Analysis vignette for full details).
The code snippet below first sets the file paths for the example HIVpr starting structure (pdbfile) and trajectory data (dcdfile), then reads these files (producing the objects dcd and pdb).
dcdfile <- system.file("examples/hivp.dcd", package = "bio3d") pdbfile <- system.file("examples/hivp.pdb", package = "bio3d")
# Read MD data dcd <- read.dcd(dcdfile)
## Warning in readChar(trj, 80): truncating string with embedded nuls
## Warning in readChar(trj, 80): truncating string with embedded nuls
pdb <- read.pdb(pdbfile)
inds <- atom.select(pdb, resno = c(24:27, 85:90), elety = "CA") trj <- fit.xyz(fixed = pdb$xyz, mobile = dcd, fixed.inds = inds$xyz, mobile.inds = inds$xyz)
Once we have the superposed trajectory frames we can asses the extent to which the atomic fluctuations of individual residues (in this very short example simulation) are correlated with one another and build a network from this data:
Due to the often noisy nature of correlations calculated from short MD simulations, we typically suggest running longer simulations and multiple replica simulations from which a consensus correlation matrix and consensus network can be generated and compared to results obtained from individual simulations. For example, see the following works: Scarabelli and Grant, (2014). Kinesin-5 allosteric inhibitors uncouple the dynamics of nucleotide, microtubule and neck-linker binding sites, Biophys J. 107:2204 and Yao, et al., and Grant, (2016). Dynamic coupling and allosteric networks in the alpha subunit of heterotrimeric G proteins. J. Biol. Chem. 291:4742. A similar approach can be applied to multiple correlation matrices calculated from an ensemble normal mode analysis. An example is given in the Section Ensemble correlation network analysis below.
Note that the dccm() function can be slow to run on very large trajectories. In such cases you may want to use multiple CPU cores for the calculation by setting the ‘ncore’ option to an appropriate value for your computer - see help(dccm) for details.
# View the correlations in pymol (see Figure 10). pymol(cij, pdb, type = "launch")
# View the structure mapped network (see Figure 11). vmd(net, pdb, launch = TRUE)
As noted above, maximization of modularity sometimes creates unexpected community partitions splitting visually obvious domains into many small community ‘islands’. In such cases we look into partitions with modularity close to the maximal value but with an overall smaller number of communities (see above).
# See Figure 12. tree <- community.tree(net, rescale=TRUE) plot( tree$num.of.comms, tree$modularity, xlab="Communities", ylab="Modularity" )
# Example function using community.tree() and network.amendment() mod.select <- function(x, thres=0.1) { remodel <- community.tree(x, rescale = TRUE) n.max = length(unique(x$communities$membership)) ind.max = which(remodel$num.of.comms == n.max) v = remodel$modularity[length(remodel$modularity):ind.max] v = rev(diff(v)) fa = which(v>=thres)[1] - 1 ncomm = ifelse(is.na(fa), min(remodel$num.of.comms), n.max - fa) ind <- which(remodel$num.of.comms == ncomm) network.amendment(x, remodel$tree[ind, ]) }
nnet = mod.select(net) nnet # See Figure 13. plot(nnet, pdb)
# See Figure 14. vmd(nnet, pdb, launch=TRUE)
In the original Luthey-Schulten and co-workers approach to correlation network analysis[1], edges were only included between “in contact” nodes. Contacting nodes were defined as those having any heavy atom within 4.5 Å for greater than 75% of the simulation. The weight of these contact edges was then taken from the cross-correlation data. This approach effectively ignores long range correlations between residues that are not in physical contact for the majority of the trajectory. To replicate this approach one can simply define a contact map with the cmap() function and provide this to the cna() function along with the required cross-correlation matrix, e.g.:
One could also simply multiple the correlation matrix by the contact map to zero out these long-range correlations. However, we currently prefer not to use this filtering step as we have found that this approach can remove potentially interesting correlations that might prove to be important for the types of long-range coupling we are most interested in exploring. For example, compare the plot of correlation values below in Figure 15 to those in Figure 4.
# See Figure 15. plot((cij * cm), main="")
## Plot non-filtered DCCM ##plot.dccm(cij, margin.segments = net$communities$membership)
It is also possible to use a consensus of contact map and dynamical correlation matrices from replica simulations with the functions filter.cmap() and filter.dccm() - see their respective help pages for further details. (A more advanced use of filter.dccm() for an ensemble correlation network analysis is available in the Section 4.)
We have already introduced the plot.cna()
and vmd.cna()
functions in earlier sections. Here we focus on methods for customizing network layouts, colors, labels, nodes and edges for visualization.
The layout.cna() function can be used to determine the geometric center of protein residues and communities and thus provide 3D and 2D network and community graph layouts that most closely match the protein systems 3D coords.
layout3D <- layout.cna(net, pdb, k=3) layout2D <- layout.cna(net, pdb, k=2)
Other network layout options are available via the igraph package, see ?igraph::layout in R for details. In addition, one can provide any 2D plotting coordinate values to plot.cna() to generate alternate orientations.
The identify.cna() function (invoked by calling identify()) allows one to click on the nodes of a plot.cna() network plot to identify the residues within a give set of nodes. This can also be a useful first step to determine the placement of text labels for a plot -see help(identify.cna) for details.
The vmd_colors() function is used by default to define node color. This can be changed to any input color vector as demonstrated below. Note that the vmd_colors() function outputs a vector of colors that is ordered as they are in the VMD molecular graphics program and changing the colors will likely break this correspondence.
grp.col <- bwr.colors(20)[net$communities$membership]
Useful options in plot.cna() to be aware of include the optional flags ‘mark.groups’ and ‘mark.col’, which let one draw and color areas around groups of residues corresponding to the community clustering or any other residue partitioning of interest (see Figure 16):
grp.col <- list() for (i in 1:max(net$communities$membership)) { grp.tmp <- which(net$communities$membership == i) grp.col[[i]] <- grp.tmp }
colbar.full <- vmd_colors()[net$communities$membership] colbar.comms <- vmd_colors(max(net$communities$membership), alpha = 0.5) # See Figure 16. plot(net, pdb, full = TRUE, mark.groups = grp.col, mark.col = colbar.comms)
The prune.cna() function allows one to remove communities under a certain size or with less less than a certain number of edges to other communities prior to visualization or further analysis. For example, to remove communities composed of only 1 residue (see Figure 17):
net.pruned <- prune.cna(net, size.min = 2)
## id size members
## 1 21 c(1:7, 95:102, 193:198)
## 2 2 8:9
## 3 24 c(10:23, 63:72)
## 4 4 24:27
## 5 47 c(28:62, 73:84)
## 6 1 85
## 7 1 86
## 8 1 87
## 9 2 88:89
## 10 5 90:94
## 11 5 103:107
## 12 26 c(108:122, 162:172)
## 13 6 123:128
## 14 44 c(129:161, 173:183)
## 15 2 184:185
## 16 1 186
## 17 2 187:188
## 18 4 189:192
## Removing Nodes: 7, 8, 9, 16, 17, 6
## id size edges members
## 6 1 1 85
## 7 1 0 86
## 8 1 0 87
## 9 2 0 88:89
## 16 1 0 186
## 17 2 0 187:188
Applying this command to our network example, you can see how the communities composed by less than 2 residues are deleted from the graph.
Node centrality describes the distribution of the edges in the network. In protein networks, identifying the nodes with many edges (hubs) as well as the segments with a high number of connections can yield insights into the internal dynamic coordination of protein regions. Centrality measures can also be employed to characterize protein sections showing differences in coupled motions between networks derived from different states (e.g. ligand bound and unbound etc.).
There are a number of different measures of node centrality, from the simplest one that counts the number of edges of each node (node degree) to the more complex (such as betweenness centrality). For example, the betweenness centrality of a node is defined as the number of unique shortest paths crossing that node. This measure has the advantage of considering the whole network topology and not only the closest neighbors in its calculation (as is the case with node degree).
The igraph package provides different functions to perform these calculations (such as degree, betweenness and closeness). Below we illustrate an example output of the betweenness centrality on our simple HIVP network:
node.betweenness <- betweenness(net$network)
# See Figure 18. plot(node.betweenness, xlab="Residue No", ylab="Centrality", type="h")
Interestingly, in extended trajectories of this system the regions with the highest centrality values are the two flaps of the HIVpr structure, whose dynamics and interactions are likely important of substrate processing. To examine this for your own system you could map these values (or any other metric of interest) to the B-factor column of a PDB file for viewing in VMD or other molecular graphics software that can color by this column:
# Output a PDB file with normalized betweenness mapped to B-factor write.pdb(pdb, b=normalize.vector(node.betweenness), file="tmp.pdb")
Identifying residues potentially involved in the dynamic coupling of distal protein regions can be facilitated by calculating possible linking paths through the correlation network objects. In this analysis, multiple distinct shortest or (sub)optimal paths between pre-specified sites (termed ‘source’ and ‘sink’, which are usually functionally relevant such as the active site and an allosteric site) are searched. To run such an analysis, we can use the Bio3D function cnapath() and associated tools for printing and visualization (e.g., the cnapath variants of summary(), print(), plot(), and vmd()).
pa <- cnapath(net, from=32, to=131, k=50)
The ‘from’ and ‘to’ arguments in above function call represent the node IDs of the two ends (i.e., ‘source’ and ‘sink’, respectively) in the path analysis. The number of paths calculated is specified by the argument ‘k’ (Here we explore 50 suboptimal paths).
A path is a set of network edges connecting the two sites, and path length is defined by the sum of edge weights. For example, network edges can be weighted by \(-ln(|cij|)\), where cij is the correlation between residues forming the edge. In this case, path length distribution can be used to measure the overall coupling strength of the network (longer paths mean weaker coupling). Such information can be obtained by using the function print() (or summary() for a similar effect). Also generated by the function is the importance of a residue measured by normalized node degeneracy (i.e., the fraction of suboptimal paths going through the node or residue).
print(pa)
## Number of paths in network(s):
## 50
##
## Path length distribution:
##
## [2.287,2.297] (2.297,2.306] (2.306,2.315] (2.315,2.324] (2.324,2.333]
## 4 6 13 12 15
##
## Node degeneracy table:
##
## 32 46 47 48 53 54 55 56 131 147 152 153 155
## 1.00 0.12 0.22 0.66 0.56 0.26 0.40 0.98 1.00 1.00 0.64 1.00 0.42
# Or, simply type 'pa'
# Compare two or more sets of paths # (e.g., from networks of the same system but under distinct states) print(pa1, pa2, ...)
To quickly inspect on path length distribution and node degeneracy, use plot.cnapath():
plot(pa, label="HIVP", pdb=pdb)
Or, for the same effect:
print(pa, label="HIVP", pdb=pdb, plot=TRUE)
Returning a value for further analyses (of, e.g., node degeneracy):
out <- print(pa, pdb=pdb) out$degeneracy
To visulize paths use the function vmd.cnapath() (NOTE: you need VMD installed properly).
vmd(pa, pdb=pdb, spline=TRUE, col='cyan', launch=TRUE)
Set spline=FALSE to get faster loading of paths in VMD. Also, set launch=FALSE if you want to view paths later. To do that, load the VMD script file generated by the function from the command line (e.g., in Unix-like systems, type vmd -e vmd.cnapath.vmd
. Read VMD manual for more detailed instruction). Customized output file names are supported through the out.prefix argument of the function (See ?vmd.cnapath
for more detail).
To view example applications of the method and how key residues involved in allosteric signaling were identified, please read the following work: Scarabelli and Grant, (2014). Kinesin-5 allosteric inhibitors uncouple the dynamics of nucleotide, microtubule and neck-linker binding sites, Biophys J, 107, 2204-2213 and Yao, et al., and Grant, (2016). Dynamic coupling and allosteric networks in the alpha subunit of heterotrimeric G proteins. J Biol Chem, 291, 4742-4753.
Although we encourage users to do suboptimal path analysis with Bio3D, users may have their own preferred software packages. In this case, Bio3D can be used to do all calculations during the network construction. Then, the results are interfaced with other software for subsequent path analysis. For example, users may choose the WISP program [8] to search shortest paths between select network nodes. The program requires an input adjacency matrix of the network, which can be calculated with Bio3D and saved in a file using the following command:
write.table(net$cij, quote=FALSE, row.names=FALSE, col.names=FALSE, file="adj.txt")
In this section, we will introduce the use of Bio3D to perform an ensemble correlation network analysis (eCNA). The method represents an enhanced version of the protein structure or correlation network analysis introduced in preceding sections that is based on a single structure or correlation matrix. In eCNA, multiple correlation matrices are taken as input, either from multiple MD simulations or an ensemble NMA. Matrices sampled for the same state (e.g., from multiple simulation replicas or NMA of the same structural category) are examined to identify significant correlations for the state, which are then used to define edges of the state-specific network. The objectives here are to screen out spurious long-distance strong correlations due to statistical variations (i.e., false positives in a correlation network analysis using a single correlation matrix), and simultaneously to capture robustly strong correlations that may be excluded in contact map based method just because the residue distance is beyond an empirical cutoff.
In the following example we will use the dataset named ‘transducin’ available from the bio3d package, which contains 53 pre-assembled crystallographic structures of G\(\alpha\)i categorized into active GTP-analog-bound, inactive GDP-bound, and inhibitory GDI-bound states. The dataset includes aligned PDB coordinates (pdbs
), structural invariant core positions (core
), and annotations for each PDB structure (annotation
). Further details of the dataset can be obtained via help(transducin)
along with the Comparative Protein Structure Analysis with Bio3D vignette available online.
attach(transducin)
In Bio3D, residue cross-correlations are calculated with the dccm() function. The dynamical data can be from MD, NMA, multiple experimental structures, or a single multi-model PDB from, e.g., NMR. In this example, we employ NMA applied to the pre-assembled aligned structures. For clarity, we only consider structures in the GTP and GDI state.
Significant correlations are then identified with the function filter.dccm(), which inspects both robustness of correlations across structures and spatial distance between related residues. Here, the correlation inspection is done for each structural category or group. Briefly, a correlation value (denoted by ‘cij’) is retained if:
A user-defined cutoff value \(c\) can be provided. Otherwise, a roughly guessed cutoff value will be used, which approximately keeps 5% of all residue-residue correlations. A contact map is also required, which can be either provided by the user (pre-calculated using, e.g., the bio3d function cmap()) or calculated internally. The internal algorithm in filter.dccm() defines a contact by that the C\(\alpha\)-C\(\alpha\) distance is no greater than 10 Å for 75% of all the structures in a group. See ?filter.dccm
for more detail.
# The 'scut=0' indicates to include all residue pairs for the contact map # calculation, including neighboring residues in sequence. cij <- filter.dccm(cijs, cutoff.cij=0.35, xyz=pdbs, fac=annotation[, "state3"], scut=0)
Correlation networks are finally constructed with the function cna(). Note that since correlation matrices are already filtered in the last step, here the ‘cutoff.cij’ is set to zero in cna()
. Networks are then visualized with the function plot.cna().
nets <- cna(cij, cutoff.cij = 0) #plot(nets, pdb=pdb)
In a plot by simply calling plot.cna()
, colors of community vertexes are not consistent across networks, which impedes network comparisons. This is due to small variations of community definition between networks. A convenient tool to correct it is the community.aln() function provided in the bio3d.cna package. This function finds equivalent communities from two or more networks and re-assign colors to them in a consistent way across networks.
nets.aln <- community.aln(nets) layout(matrix(c(1:4), 2, 2)) par(mar=c(0.1, 0.1, 3, 0.1)) cent.gtp = layout.cna(nets$GTP, pdb=pdb, k=3)[,1:2] cent.gdi = layout.cna(nets$GDI, pdb=pdb, k=3)[,1:2] plot(nets$GTP, layout=cent.gtp, vertex.label=NA, main="GTP") plot(nets.aln$GTP, layout=cent.gtp, vertex.label=NA) plot(nets$GDI, layout=cent.gdi, vertex.label=NA, main="GDI") plot(nets.aln$GDI, layout=cent.gdi, vertex.label=NA)
The eCNA method can be coupled with suboptimal path analysis for more detailed network comparisons. Results are visualized with vmd.cnapath(). Here, the source and sink nodes are chosen to represent the receptor (which catalyzes the G protein activation) binding site and the active site. It clearly shows that distinct suboptimal paths are present in the two networks connecting the source and sink nodes, and the paths in the GTP (active) network are overall much shorter (indicating much stronger coupling between the distal sites) than those in the GDI (inhibited) network.
pa <- cnapath(nets, from=307, to=168, k=100)
vmd(pa, pdb=pdb, spline=TRUE)
Quantitative comparisons of path length distributions can be done by calling the plot.cnapath() function. More detailed results are returned by the print.cnapath() function, which contains normalized node degeneracy for identifying specific crucial residues underlying each set of coupling paths.
plot(pa, pdb=pdb)
out <- print(pa, pdb=pdb)
out$degeneracy
## K31 L32 L33 L34 M194 F195 D196 V197 G198 T215 C216 I217 F219 I220 S259
## GDI 0.00 0.00 0.00 1.0 0.00 0.00 1.00 0.29 1 0.00 0.00 0.00 0.99 1 0
## GTP 0.41 0.36 0.47 0.1 0.23 0.52 0.96 0.38 1 0.44 0.53 0.17 0.00 0 1
## N265 C321 A322 N327 V328 F330 V331 F332 D333 A334 V335 T336 D337 I338 I339
## GDI 0.91 0.92 0.1 0.89 0.2 0.4 0.48 0.2 0.24 0.3 0.47 0.32 0.1 0.18 1
## GTP 0.00 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.0 0.00 0.00 0.0 0.00 1
detach(transducin)
In this vignette we have confined ourselves to introducing relatively simple correlation network generation protocols. However, the individual Bio3D functions for network analysis permit extensive customization of the network generation procedure. For example, one can chose different node representations (including all heavy atoms, residue center of mass, Calpha atoms for each residue, or collections of user defined atoms), different correlation methods (including Pearson and mutual information), different community detection methods (including Girvan-Newman betweenness, Random walk, and Greedy modularity optimization approaches), as well as customized network layouts and result visualizations. We encourage users to explore these options and provide feedback as appropriate.
Although the protocols detailed herein can provide compelling results in a number of cases, defining an optimal procedure for network generation is an area of active research. For example rearrangement of some residues in the communities and changes in the total number of communities may be observed among different community structures based on different independent MD simulations of the same system. These variations of community structures are partially due to the small changes in the calculated cij values that reflect statistical fluctuations among independent simulations. However, changes in mutual information or correlation values may still be relatively small and differences among independent community structures may be largely due to limitations of the modularity metric, a methodology that was originally developed for binary adjacency matrices, where edges are counted without taking into account their lengths. We are currently developing a modularity methodology that is more closely aligned to protein dynamics data. We are also developing integrated 3D viewing functions (so VMD and PyMol will no longer be absolutely required during the exploratory interactive phase of analysis), as well as functionality to facilitate analysis of the interactions that generate changes in the community network.
[1] Sethi A, Eargle J, Black AA, Luthey-Schulten Z. Proc Natl Acad Sci U S A.2009, 106(16):6620-5.
[2] Scarabelli G and Grant BJ, PLoS Comput Biol 9, e1003329.
[3] Yao X, Grant BJ, Biophys J 105, L08–L10.
[4] Yao X, Malik RU, Griggs NW, Skjaerven L, Traynor JR, Sivaramakrishnan S, and Grant BJ, J Biol Chem 291, 4742-4753.
[5] Gasper PM, Fuglestad B, Komives EA, Markwick PR, McCammon JA. Proc Natl Acad Sci U S A. 2012 Dec 26;109(52):21216-22.
[6] Gábor Csárdi, Tamás Nepusz. InterJournal Complex Systems, 1695, 2006.
[7] Girvan M, Newman ME. Proc Natl Acad Sci U S A. 2002 Jun 11;99(12):7821-6.
[8] Van Wart AT, Durrant J, Votapka L, Amaro RE. J. Chem. Theory Comput., 2014, 10 (2), pp 511–517.
This document is available from the Bio3D website in R, HTML, and PDF formats. All code can be extracted and automatically executed to generate Figures and/or the PDF with the following commands:
print(sessionInfo(), FALSE)
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] igraph_1.2.5 bio3d.nma_0.1.0.9000 bio3d.cna_0.1.0.9000
## [4] bio3d_2.4-1.9000
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 cpp11_0.2.1 knitr_1.29 magrittr_1.5
## [5] lattice_0.20-41 R6_2.4.1 ragg_0.3.1 rlang_0.4.7
## [9] highr_0.8 stringr_1.4.0 tools_4.0.2 parallel_4.0.2
## [13] grid_4.0.2 xfun_0.17 htmltools_0.5.0 systemfonts_0.3.1
## [17] yaml_2.2.1 assertthat_0.2.1 rprojroot_1.3-2 digest_0.6.25
## [21] pkgdown_1.6.1.9000 crayon_1.3.4 fs_1.5.0 memoise_1.1.0
## [25] evaluate_0.14 rmarkdown_2.3 stringi_1.5.3 compiler_4.0.2
## [29] desc_1.2.0 backports_1.1.9 pkgconfig_2.0.3
The latest version of the package, full documentation and further vignettes (including detailed installation instructions) can be obtained from the main Bio3D website: [http://thegrantlab.org/bio3d/(http:/thegrantlab.org/bio3d/)↩︎
This document contains executable code that generates all figures contained within this document. See help(vignette) within R for full details.↩︎