NCBI BLAST Sequence Search and Summary Plot of Hit Statistics

Usage

blast.pdb(seq, database = "pdb", time.out = NULL, chain.single=TRUE)
get.blast(urlget, time.out = NULL, chain.single=TRUE)
"plot"(x, cutoff = NULL, cut.seed=NULL, cluster=TRUE, mar=c(2, 5, 1, 1), cex=1.5, ...)

Arguments

seq
a single element or multi-element character vector containing the query sequence. Alternatively a ‘fasta’ object from function get.seq or ‘pdb’ object from function read.pdb can be provided.
database
a single element character vector specifying the database against which to search. Current options are ‘pdb’, ‘nr’ and ‘swissprot’.
time.out
integer specifying the number of seconds to wait for the blast reply before a time out occurs.
urlget
the URL to retrieve BLAST results; Usually it is returned by blast.pdb if time.out is set and met.
chain.single
logical, if TRUE double NCBI character PDB database chain identifiers are simplified to lowercase '1WF4_GG' > '1WF4_g'. If FALSE no conversion to match RCSB PDB files is performed.
x
BLAST results as obtained from the function blast.pdb.
cutoff
A numeric cutoff value, in terms of minus the log of the evalue, for returned hits. If null then the function will try to find a suitable cutoff near ‘cut.seed’ which can be used as an initial guide (see below).
cut.seed
A numeric seed cutoff value, used for initial cutoff estimation. If null then a seed position is set to the point of largest drop-off in normalized scores (i.e. the biggest jump in E-values).
cluster
Logical, if TRUE (and ‘cutoff’ is null) a clustering of normalized scores is performed to partition hits in groups by similarity to query. If FALSE the partition point is set to the point of largest drop-off in normalized scores.
mar
A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot.
cex
a numerical single element vector giving the amount by which plot labels should be magnified relative to the default.
...
extra plotting arguments.

Description

Run NCBI blastp, on a given sequence, against the PDB, NR and swissprot sequence databases. Produce plots that facilitate hit selection from the match statistics of a BLAST result.

Details

The blast.pdb function employs direct HTTP-encoded requests to the NCBI web server to run BLASTP, the protein search algorithm of the BLAST software package.

BLAST, currently the most popular pairwise sequence comparison algorithm for database searching, performs gapped local alignments via a heuristic strategy: it identifies short nearly exact matches or hits, bidirectionally extends non-overlapping hits resulting in ungapped extended hits or high-scoring segment pairs(HSPs), and finally extends the highest scoring HSP in both directions via a gapped alignment (Altschul et al., 1997)

For each pairwise alignment BLAST reports the raw score, bitscore and an E-value that assess the statistical significance of the raw score. Note that unlike the raw score E-values are normalized with respect to both the substitution matrix and the query and database lengths.

Here we also return a corrected normalized score (mlog.evalue) that in our experience is easier to handle and store than conventional E-values. In practice, this score is equivalent to minus the natural log of the E-value. Note that, unlike the raw score, this score is independent of the substitution matrix and and the query and database lengths, and thus is comparable between BLASTP searches.

Examining plots of BLAST alignment lengths, scores, E-values and normalized scores (-log(E-Value) from the blast.pdb function can aid in the identification sensible hit similarity thresholds. This is facilitated by the plot.blast function.

If a ‘cutoff’ value is not supplied then a basic hierarchical clustering of normalized scores is performed with initial group partitioning implemented at a hopefully sensible point in the vicinity of ‘h=cut.seed’. Inspection of the resultant plot can then be use to refine the value of ‘cut.seed’ or indeed ‘cutoff’. As the ‘cutoff’ value can vary depending on the desired application and indeed the properties of the system under study it is envisaged that ‘plot.blast’ will be called multiple times to aid selection of a suitable ‘cutoff’ value. See the examples below for further details.

Value

The function blast.pdb returns a list with three components, hit.tbl, raw, and url. The function plot.blast produces a plot on the active graphics device and returns a list object with four components, hits, pdb.id, acc, and inds. See below:

hit.tbl
a data frame summarizing BLAST results for each reported hit. It contains following major columns:
  • ‘bitscore’, a numeric vector containing the raw score for each alignment.
  • ‘evalue’, a numeric vector containing the E-value of the raw score for each alignment.
  • ‘mlog.evalue’, a numeric vector containing minus the natural log of the E-value.
  • ‘acc’, a character vector containing the accession database identifier of each hit.
  • ‘pdb.id’, a character vector containing the PDB database identifier of each hit.

raw
a data frame containing the raw BLAST output. Note multiple hits may appear in the same row.

url
a single element character vector with the NCBI result URL and RID code. This can be passed to the get.blast function.

hits
an ordered matrix detailing the subset of hits with a normalized score above the chosen cutoff. Database identifiers are listed along with their cluster group number.

pdb.id
a character vector containing the PDB database identifier of each hit above the chosen threshold.

acc
a character vector containing the gi database identifier of each hit above the chosen threshold.

inds
a numeric vector containing the indices of the hits relative to the input blast object.

References

Grant, B.J. et al. (2006) Bioinformatics 22, 2695--2696.

‘BLAST’ is the work of Altschul et al.: Altschul, S.F. et al. (1990) J. Mol. Biol. 215, 403--410.

Full details of the ‘BLAST’ algorithm, along with download and installation instructions can be obtained from: http://www.ncbi.nlm.nih.gov/BLAST/.

Note

Online access is required to query NCBI blast services.

Examples

pdb <- read.pdb("4q21")
Note: Accessing on-line PDB file
blast <- blast.pdb( pdbseq(pdb) )
Searching ... please wait (updates every 5 seconds) RID = YVA3R9PP015 . Reporting 319 hits
head(blast$hit.tbl)
queryid subjectids identity positives alignmentlength 1 Query_6213 gi|231226|pdb|6Q21|A 100 100 168 2 Query_6213 gi|231227|pdb|6Q21|B 100 100 168 3 Query_6213 gi|231228|pdb|6Q21|C 100 100 168 4 Query_6213 gi|231229|pdb|6Q21|D 100 100 168 5 Query_6213 gi|15988032|pdb|1IOZ|A 100 100 168 6 Query_6213 gi|157829765|pdb|1AA9|A 100 100 168 mismatches gapopens q.start q.end s.start s.end evalue bitscore 1 0 0 1 168 1 168 1.11e-124 348 2 0 0 1 168 1 168 1.11e-124 348 3 0 0 1 168 1 168 1.11e-124 348 4 0 0 1 168 1 168 1.11e-124 348 5 0 0 1 168 1 168 1.11e-124 348 6 0 0 1 168 1 168 1.11e-124 348 mlog.evalue pdb.id acc 1 285.4162 6Q21_A 231226 2 285.4162 6Q21_B 231227 3 285.4162 6Q21_C 231228 4 285.4162 6Q21_D 231229 5 285.4162 1IOZ_A 15988032 6 285.4162 1AA9_A 157829765
top.hits <- plot(blast)
* Possible cutoff values: 241 74 Yielding Nhits: 233 319 * Chosen cutoff value of: 241 Yielding Nhits: 233
head(top.hits$hits)

pdb.id acc group 1 "6Q21_A" "231226" "1" 2 "6Q21_B" "231227" "1" 3 "6Q21_C" "231228" "1" 4 "6Q21_D" "231229" "1" 5 "1IOZ_A" "15988032" "1" 6 "1AA9_A" "157829765" "1"
## Use 'get.blast()' to retrieve results at a later time. #x <- get.blast(blast$url) #head(x$hit.tbl) # Examine and download 'best' hits top.hits <- plot.blast(blast, cutoff=188)
* Possible cutoff values: 241 74 Yielding Nhits: 233 319 * Chosen cutoff value of: 188 Yielding Nhits: 233

head(top.hits$hits)
pdb.id acc group 1 "6Q21_A" "231226" "1" 2 "6Q21_B" "231227" "1" 3 "6Q21_C" "231228" "1" 4 "6Q21_D" "231229" "1" 5 "1IOZ_A" "15988032" "1" 6 "1AA9_A" "157829765" "1"
#get.pdb(top.hits)

See also

plot.blast, hmmer, seqaln, get.pdb

Author

Barry Grant