[BioPython] blast parser ideas
Arne Mueller
a.mueller@icrf.icnet.uk
Wed, 10 Nov 1999 15:23:52 +0000
Jeffrey Chang wrote:
>
> Hi everybody,
>
> I have been giving some thought to writing BLAST parsers. Some of the
> problems that I think a BLAST parser should address are:
> 1. Multiple flavors of BLAST (blastp, blastn, tblastn, etc.)
> 2. Multiple versions of BLAST (blast1, blast2, psi-blast, phi-blast,
> wu-blast)
> 3. Frequent format changes
> 4. BLAST reports can be large.
> 5. Much of the output may not be relevant.
>
> I'm not happy with the default design where a parser takes some BLAST
> information and fills out some data structure. In my experience,
> problems 2 and 3 make this design hard to manage, and you end up with
> code that's no longer backwards compatible or lots of bits of parser
> code.
> Instead, I have been thinking of using an event-oriented parser. This
> style of parser has been discussed in bionet.software and is used for
> the *ML parsers in the standard Python distribution. I believe Andrew
> Dalke has played with this in various projects.
>
> The way this works is that the client feeds data into a Parser
> object. The parser recognizes information in the data stream and
> calls a function in a Consumer to handle the information. The
> Consumer is supplied by the client, and can do application-specific
> things with the data. Typically, it would capture the information in
> a data structure suitable for the application.
Very good idea, I think, but I'm worried about point 2 of the list above.
If I understand the principle of the system (parser feeds consumer), the
input stream is parsed by the parser object, which recognizes information
(like 'Sbjct' ...) and then calls
consumer.start_Sbjct()
consumer.Sbjct()
consumer.end_Sbjct()
The consumer can handle 'Sbjct' lines of the BLAST output by defining
the above three functions. But in the end it's up to the parser to
recognize certain keywords like 'Sbjct', isn't it? That means the parser
has to recognize all keywords of all the different BLAST programs (e.g.
'Results from Round' for PSI-BLAST) and is no longer independent of the
format.
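To make the concern concrete, here is a minimal sketch of the design as I understand it. The class names, the feed() method, the prefix test, and the sample lines are all assumptions made up for illustration; this is not Jeff's code.

```python
class Parser:
    """Hypothetical parser: the 'Sbjct' keyword is hard-coded here."""

    def __init__(self, consumer):
        self.consumer = consumer

    def feed(self, line):
        # The parser itself has to know that 'Sbjct' marks a subject line.
        if line.startswith("Sbjct"):
            self.consumer.start_Sbjct()
            self.consumer.Sbjct(line)
            self.consumer.end_Sbjct()
        # Anything else (e.g. 'Results from Round') is dropped silently
        # until someone teaches the parser about it.


class CollectingConsumer:
    """Hypothetical consumer that stores every subject line it is handed."""

    def __init__(self):
        self.sbjct_lines = []

    def start_Sbjct(self):
        pass

    def Sbjct(self, line):
        self.sbjct_lines.append(line.rstrip("\n"))

    def end_Sbjct(self):
        pass


# Made-up input lines, only loosely shaped like BLAST output.
consumer = CollectingConsumer()
parser = Parser(consumer)
for line in ["Query: 1 MKV 3\n", "Sbjct: 5 MRV 7\n", "Results from Round 2\n"]:
    parser.feed(line)
```

The consumer never sees the 'Results from Round' line, because the keyword test lives in the parser and not in the consumer.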
[> ... ]
skipped the code part ...
>
> This event-oriented design decouples the parsing from the handler, so
> you can use the same consumer for multiple versions and flavors of
> BLAST. Plus, you can ignore data that you're not interested in by not
> implementing handler methods in your consumer.
>
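A small sketch of how "ignore data by not implementing a handler" might look. The send_event() helper, the TitleConsumer class, and the sample values are hypothetical; only the event names title, length and identities come from the proposed list.

```python
def send_event(consumer, name, data):
    """Call consumer.<name>(data) if it exists; report whether it was handled."""
    handler = getattr(consumer, name, None)
    if handler is None:
        return False  # the consumer is not interested in this event
    handler(data)
    return True


class TitleConsumer:
    """Hypothetical consumer that only cares about 'title' events."""

    def __init__(self):
        self.titles = []

    def title(self, text):
        self.titles.append(text)


consumer = TitleConsumer()
handled = [
    send_event(consumer, "title", "d1rip__ Ribosomal S17 protein"),
    send_event(consumer, "length", "140"),        # no handler, skipped
    send_event(consumer, "identities", "12/40"),  # no handler, skipped
]
```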
> By doing things this way, the Parser and Consumer need to agree on an
> interface through which to pass data. Thus, I went through the latest
> BLAST code and named the lines in the output. These will be the names
> of the methods in the Consumer class.
>
> I've attached the list of names as well as some sample BLAST output.
> Please let me know what you think about the parser ideas as well as
> the proposed names.
>
> Thanks,
> Jeff
>
> SECTION
>     DATATYPE                      WHEN AVAILABLE
>
> header
>     version
>     reference
>     query
>     database_information
>
> descriptions
>     description
>
> score
>     title
>     length
>     score
>     identities
>     frame                         frame
>     strand                        strand
>
> alignment
>     query
>     align
>     sbjct
query_start
query_end
sbjct_start
sbjct_end
> database_report
>     database                      not subset
>     posted_date                   not subset
>     num_letters_in_database       not subset
>     num_sequences_in_database     not subset
>     num_letters_searched          subset
>     num_sequences_searched        subset
>
> parameters
>     matrix
>     gap_penalties                 gapped mode
>     second_pass_hits              not two pass method
>     second_pass_sequences         not two pass method
>     second_pass_extends           not two pass method
>     second_pass_good_extends      not two pass method
>     num_hits                      two pass method
>     num_sequences                 two pass method
>     num_extends                   two pass method
>     num_good_extends              two pass method
>     num_seqs_better_e             gapped and not blastn
>     hsps_no_gap                   gapped and not blastn
>     hsps_prelim_gapped            gapped and not blastn
>     hsps_prelim_gap_attempted     gapped and not blastn
>     hsps_gapped                   gapped and not blastn
>     query_length
>     database_length
>     effective_hsp_length
>     effective_query_length
>     effective_database_length
>     effective_search_space
>     effective_search_space_used
>     frameshift_decay              blastx or tblastn or tblastx
>     threshold                     second
>     window_size
>     dropoff_1st_pass
>     gap_x_dropoff
>     gap_x_dropoff_final           not blastn and gapped calculation (?)
>     gap_trigger
>     blast_cutoff
>
>
>
>
> BLASTP 2.0.10 [Aug-26-1999]
>
> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> programs", Nucleic Acids Res. 25:3389-3402.
>
> Query= test
> (140 letters)
>
> Database: sdqib40-1.35.seg.fa
> 1323 sequences; 223,339 total letters
>
> Searching..................................................done
>
> Score E
> Sequences producing significant alignments: (bits) Value
>
> d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus stearothermo... 23 2.5
> d1rlr_1 1.56.1.1.1 (1-212) R1 subunit of ribonucleotide reducta... 23 2.5
> d1lfaa_ 3.42.1.1.1 Integrin CD11a/CD18 (LFA-1) [Human (Homo sap... 22 5.6
> d1ktq_1 3.38.3.4.2 (1-161) Exonuclease domain of DNA polymerase... 21 9.7
> d1prea1 4.88.1.2.2 (1-83) Proaerolysin, N-terminal domain [Aero... 21 9.7
How can the parser handle this summary block?
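One possible approach, as a sketch only: treat each line of the summary block as one 'description' event and split the score and E value off the right-hand end of the line. The regular expression, the function name, and the returned tuple are assumptions for illustration and are not taken from any existing parser.

```python
import re

# Title (anything), then an integer bit score, then the E value,
# with the last two anchored at the end of the line.
DESCRIPTION_RE = re.compile(
    r"^(?P<title>.+?)\s+(?P<score>\d+)\s+(?P<evalue>\S+)\s*$"
)


def parse_description(line):
    """Return (title, score, evalue) or None if the line does not fit."""
    match = DESCRIPTION_RE.match(line)
    if match is None:
        return None
    evalue = match.group("evalue")
    # As far as I know, some reports write 1e-100 as 'e-100';
    # float() needs the leading digit.
    if evalue.startswith("e"):
        evalue = "1" + evalue
    return match.group("title"), int(match.group("score")), float(evalue)


line = ("d1rip__ 2.24.7.1.1 Ribosomal S17 protein "
        "[Bacillus stearothermo... 23 2.5")
result = parse_description(line)
header = parse_description(
    "Sequences producing significant alignments: (bits) Value"
)
```

Here result holds the title, 23 and 2.5 for the first line of the sample, and header is None because the column heading does not end in two numeric fields.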
[> ...]
I think this event-oriented idea is great! The problems with BLAST
output (which was and is intended to be read by humans) are huge, and
there's no simple solution. Therefore the only reasonable way is to
provide a flexible parser, which on the other hand means that you
have to spend some time defining your own interface to the parser for
each application you write.
My suggestion, or extension to the event-oriented model:
The consumer class defines a list of the keywords (i.e. the
information) that the Parser object should recognize, and a triplet of
start_KEYWORD, KEYWORD, end_KEYWORD functions is defined for each of the
keywords in the list. These keywords could also be regular expressions.
The parser is then a very simple class that's completely independent
of the BLAST format.
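A rough sketch of what I mean, with the keyword list living in the consumer. GenericParser, PsiBlastConsumer, the keywords attribute, and the sample lines are all hypothetical names and data chosen for this example.

```python
import re


class GenericParser:
    """Hypothetical format-independent parser: keywords come from the consumer."""

    def __init__(self, consumer):
        self.consumer = consumer
        # Compile each (name, pattern) pair that the consumer declares.
        self.patterns = [
            (name, re.compile(pattern)) for name, pattern in consumer.keywords
        ]

    def feed(self, line):
        for name, pattern in self.patterns:
            if pattern.search(line):
                # Fire the start_KEYWORD, KEYWORD, end_KEYWORD triplet.
                getattr(self.consumer, "start_" + name)()
                getattr(self.consumer, name)(line)
                getattr(self.consumer, "end_" + name)()
                break  # first matching keyword wins


class PsiBlastConsumer:
    """Hypothetical consumer that declares its own keywords as regexes."""

    keywords = [("round", r"^Results from Round"), ("Sbjct", r"^Sbjct")]

    def __init__(self):
        self.events = []

    def start_round(self):
        self.events.append("start_round")

    def round(self, line):
        self.events.append("round")

    def end_round(self):
        self.events.append("end_round")

    def start_Sbjct(self):
        self.events.append("start_Sbjct")

    def Sbjct(self, line):
        self.events.append("Sbjct")

    def end_Sbjct(self):
        self.events.append("end_Sbjct")


consumer = PsiBlastConsumer()
parser = GenericParser(consumer)
for line in ["Results from Round 1", "Query: 1 MKV 3", "Sbjct: 5 MRV 7"]:
    parser.feed(line)
```

The price is that every consumer has to carry its own keyword list, so knowledge of the format moves from the parser into each application.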
Please do let me know if I'm completely wrong and misunderstood the
event-oriented philosophy.
greetings,
Arne
--
Arne Mueller
Biomolecular Modelling Laboratory
Imperial Cancer Research Fund
44 Lincoln's Inn Fields
London WC2A 3PX, U.K.
phone : +44-(0)171 2693405 | fax :+44-(0)171-269-3534
email : a.mueller@icrf.icnet.uk | http://www.icnet.uk/bmm/