[BioPython] blast parser ideas
Jeffrey Chang
jchang@SMI.Stanford.EDU
Tue, 9 Nov 1999 19:14:12 -0800 (PST)
Hi everybody,
I have been giving some thought to writing BLAST parsers. Some of the
problems that I think a BLAST parser should address are:
1. Multiple flavors of BLAST (blastp, blastn, tblastn, etc.)
2. Multiple versions of BLAST (blast1, blast2, psi-blast, phi-blast,
wu-blast)
3. Frequent format changes
4. BLAST reports can be large.
5. Much of the output may not relevant.
I'm not happy with the default design where a parser takes some BLAST
information and fills out some data structure. In my experience,
problems 2 and 3 make this design hard to manage, and you end up with
code that's no longer backwards compatible or lots of bits of parser
code.
Instead, I have been thinking of using an event-oriented parser. This
style of parser has been discussed in bionet.software and is used for
the *ML parsers in the standard Python distribution. I believe Andrew
Dalke has played with this in various projects.
The way this works, is that the client feeds data into a Parser
object. The parser recognizes information in the data stream and
calls a function in a Consumer to handle the information. The
Consumer is supplied by the client, and can do application-specific
things with the data. Typically, it would capture the information in
a data structure suitable for the application.
For example:
class Parser:
def parse(self, handle, consumer):
# Read data from handle.
# Call functions in the consumer, as appropriate.
class Consumer:
#def start_XXX(self):
# # Called at the beginning of a section named XXX.
#def end_XXX(self):
# # Called at the end of a section named XXX.
#def XXX(self, data):
# # Called when data of type XXX is encountered.
def unhandled_section(self):
pass
def unhandled(self, data):
pass
def __getattr__(self, attr):
method = getattr(self, attr)
if method is None:
if attr[:6] == 'start_' or attr[:4] == 'end_':
method = unhandled_section
else:
method = unhandled
return method
Here's a sample consumer that would just print out alignment
information from some BLAST output:
class MyConsumer(Consumer):
def start_alignment(self):
print "Alignments start"
def end_alignment(self):
print "Alignments end"
def query(self, data):
print "Query line:", data,
def align(self, data):
print "Align line:", data,
def sbjct(self, data):
print "Sbjct line:", data,
A sample session might be:
>>> parser = Parser()
>>> consumer = MyConsumer()
>>> parser.parse(open('blast_data.txt'), consumer)
Alignments start
Query line: Query: 8 DRLNTTFRQMEQELAIFAAHLE 29
Align line: +RL +R++E+ L+ AH+E
Sbjct line: Sbjct: 135 ERLLWLYREVERPLSAVLAHME 156
Alignments end
>>>
This event-oriented design decouples the parsing from the handler, so
you can use the same consumer for multiple versions and flavors of
BLAST. Plus, you can ignore data that you're not interested in by not
implementing handler methods in your consumer.
By doing things this way, the Parser and Consumer need to agree on an
interface through which to pass data. Thus, I went through the latest
BLAST code and named the lines in the output. These will be the names
of the methods in the Consumer class.
I've attached the list of names as well as some sample BLAST output.
Please let me know what you think about the parser ideas as well as
the proposed names.
Thanks,
Jeff
SECTION
DATATYPE WHEN AVAILABLE
header
version
reference
query
database_information
descriptions
description
score
title
length
score
identities
frame frame
strand strand
alignment
query
align
sbjct
database_report
database not subset
posted_date not subset
num_letters_in_database not subset
num_sequences_in_database not subset
num_letters_searched subset
num_sequences_searched subset
parameters
matrix
gap_penalties gapped mode
second_pass_hits not two pass method
second_pass_sequences not two pass method
second_pass_extends not two pass method
second_pass_good_extends not two pass method
num_hits two pass method
num_sequences two pass method
num_extends two pass method
num_good_extends two pass method
num_seqs_better_e gapped and not blastn
hsps_no_gap gapped and not blastn
hsps_prelim_gapped gapped and not blastn
hsps_prelim_gap_attempted gapped and not blastn
hsps_gapped gapped and not blastn
query_length
database_length
effective_hsp_length
effective_query_length
effective_database_length
effective_search_space
effective_search_space_used
frameshift_decay blastx or tblastn or tblastx
threshold second
window_size
dropoff_1st_pass
gap_x_dropoff
gap_x_dropoff_final not blastn and gapped calculation (?)
gap_trigger
blast_cutoff
BLASTP 2.0.10 [Aug-26-1999]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= test
(140 letters)
Database: sdqib40-1.35.seg.fa
1323 sequences; 223,339 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus stearothermo... 23 2.5
d1rlr_1 1.56.1.1.1 (1-212) R1 subunit of ribonucleotide reducta... 23 2.5
d1lfaa_ 3.42.1.1.1 Integrin CD11a/CD18 (LFA-1) [Human (Homo sap... 22 5.6
d1ktq_1 3.38.3.4.2 (1-161) Exonuclease domain of DNA polymerase... 21 9.7
d1prea1 4.88.1.2.2 (1-83) Proaerolysin, N-terminal domain [Aero... 21 9.7
>d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus
stearothermophilus]
Length = 81
Score = 22.7 bits (47), Expect = 2.5
Identities = 10/36 (27%), Positives = 18/36 (49%)
Query: 16 QMEQELAIFAAHLEQHKLLVARVFSLPEVKKEDEHN 51
+M++ + + ++H L RV + K DEHN
Sbjct: 13 KMDKTITVLVETYKKHPLYGKRVKYSKKYKAHDEHN 48
>d1rlr_1 1.56.1.1.1 (1-212) R1 subunit of ribonucleotide reductase,
N-terminal domain [Escherichia coli]
Length = 212
Score = 22.7 bits (47), Expect = 2.5
Identities = 11/36 (30%), Positives = 20/36 (55%)
Query: 104 NLSQAALVSHIQHINKLKTTFEHIVTVESELPTAAR 139
++SQ L SHIQ + +KT+ H +++ +R
Sbjct: 28 SISQVELRSHIQFYDGIKTSDIHETIIKAAADLISR 63
>d1lfaa_ 3.42.1.1.1 Integrin CD11a/CD18 (LFA-1) [Human (Homo
sapiens)]
Length = 183
Score = 21.6 bits (44), Expect = 5.6
Identities = 10/26 (38%), Positives = 16/26 (61%)
Query: 109 ALVSHIQHINKLKTTFEHIVTVESEL 134
AL+ H++H+ L TF I V +E+
Sbjct: 67 ALLKHVKHMLLLTNTFGAINYVATEV 92
>d1ktq_1 3.38.3.4.2 (1-161) Exonuclease domain of DNA polymerase KF
[Thermus aquaticus]
Length = 161
Score = 20.8 bits (42), Expect = 9.7
Identities = 8/22 (36%), Positives = 15/22 (67%)
Query: 8 DRLNTTFRQMEQELAIFAAHLE 29
+RL +R++E+ L+ AH+E
Sbjct: 135 ERLLWLYREVERPLSAVLAHME 156
>d1prea1 4.88.1.2.2 (1-83) Proaerolysin, N-terminal domain
[Aeromonas hydrophila]
Length = 83
Score = 20.8 bits (42), Expect = 9.7
Identities = 10/28 (35%), Positives = 16/28 (56%)
Query: 37 RVFSLPEVKKEDEHNPLNRIEVKQHLGN 64
R+FSL + D++ P+NR E + N
Sbjct: 9 RLFSLGQGVCGDKYRPVNREEAQSVKSN 36
Database: sdqib40-1.35.seg.fa
Posted date: Nov 1, 1999 4:25 PM
Number of letters in database: 223,339
Number of sequences in database: 1323
Lambda K H
0.322 0.133 0.369
Lambda K H
0.270 0.0470 0.230
# SECTION: PARAMETERS
Matrix: BLOSUM62 # MATRIX
Gap Penalties: Existence: 11, Extension: 1 # GAP_PENALTIES (only if gapped)
Number of Hits to DB: 50604 # SECOND_PASS_HITS
Number of Sequences: 1323
Number of extensions: 1526
Number of successful extensions: 6
Number of sequences better than 10.0: 5
Number of HSP's better than 10.0 without gapping: 5
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 1
Number of HSP's gapped (non-prelim): 5
length of query: 140
length of database: 223,339
effective HSP length: 39
effective length of query: 101
effective length of database: 171,742
effective search space: 17345942
effective search space used: 17345942
T: 11
A: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.8 bits)
X3: 64 (24.9 bits)
S1: 41 (21.9 bits)
S2: 42 (20.8 bits)