[Biojava-dev] blast parsing continued

Doug Rusch drusch@tcag.org
Wed, 13 Nov 2002 18:54:47 -0500


Alright, here is my attempt at persuasion:

My position is that an open source blast parser should be as generic as possible. In practice this means that everything that might be of value is parsed, and it is left to the user to decide what they really want to use. The parser should make minimal assumptions about the output; for example, the output could be small or large (genome vs genome is a real possibility these days). As a consequence, the implementation must be efficient in both memory and speed. Finally, the parser should not care what the source of the blast data is. In my mind this is not so much an issue of format (such as wu-blast or ncbi-blast) as one of construction. It seems quite reasonable to me that someone might run a number of individual blasts and concatenate the results into a single file for parsing. Being independent of the source of the blast output also means that the parser cannot rely on having access to any of the original fasta files or accessory files used by blast.
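To make the concatenation case concrete, here is a rough sketch (not existing BioJava code) of reading several concatenated reports one line at a time, so memory use does not grow with file size. It assumes each report opens with a program banner line such as BLASTN or TBLASTX; that is my assumption about the text output, and the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ConcatenatedBlastSketch {

    // Assumption: a report begins with a banner line naming the program
    // (BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX). Hypothetical helper.
    static boolean startsReport(String line) {
        return line.startsWith("BLAST") || line.startsWith("TBLAST");
    }

    // Counts reports in a stream, holding only the current line in memory.
    static int countReports(Reader in) {
        try (BufferedReader reader = new BufferedReader(in)) {
            int reports = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (startsReport(line)) {
                    reports++;
                }
            }
            return reports;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up stand-in for two reports concatenated into one file.
        String text = "BLASTN\nQuery= a\n\nTBLASTX\nQuery= b\n";
        System.out.println(countReports(new StringReader(text)));
    }
}
```

A real parser would hand each report to a handler instead of just counting, but the point is the same: nothing outside the blast output itself is consulted.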

So my proposal is for the parser to produce stripped-down Sequence objects containing empty symbol lists as needed, rather than just maintaining an id (subject, query, and database) as in the current approach, or requiring that a full Sequence object be available prior to parsing as was previously the case. For the subject and query sequences, I think that the parser should capture the id, definition line, and length. This information can be very valuable for filtering. One example would be to filter based on the global percent identity relative to the query sequence. Alternatively, the user could parse the species name from the subject defline and filter based on that. Requiring that the query and subject sequences be loaded prior to parsing is cumbersome and in some cases impractical. If the user were doing a large blast (say all of dbest vs nr) they probably wouldn't want all of nr in memory, and would probably be even less inclined to read in sequences from an index as needed (slow, given that every sequence would most likely be hit multiple times). I can also imagine use cases where the user is providing a service and does not have access to the original fasta files but can see the blast output.
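As a rough illustration of the two filters mentioned above, here is a sketch using plain stand-in classes rather than the real BioJava Sequence interface. All names here (SeqStub, Hit, globalPercentIdentity, speciesOf) are hypothetical, "global percent identity" is taken to mean identical positions divided by query length, and the species is assumed to sit in trailing square brackets in the defline, which will not hold for every database.

```java
import java.util.ArrayList;
import java.util.List;

public class BlastHitFilterSketch {

    // Hypothetical stand-in for a stripped-down Sequence: no residues,
    // only the id, definition line, and length taken from the blast output.
    static final class SeqStub {
        final String id;
        final String defline;
        final int length;

        SeqStub(String id, String defline, int length) {
            this.id = id;
            this.defline = defline;
            this.length = length;
        }
    }

    // Hypothetical hit: query, subject, and total identical positions.
    static final class Hit {
        final SeqStub query;
        final SeqStub subject;
        final int identities;

        Hit(SeqStub query, SeqStub subject, int identities) {
            this.query = query;
            this.subject = subject;
            this.identities = identities;
        }
    }

    // Assumed definition: identities as a percentage of the full query length.
    static double globalPercentIdentity(Hit hit) {
        return 100.0 * hit.identities / hit.query.length;
    }

    // Assumed defline convention: species in the last [...] pair; null if absent.
    static String speciesOf(SeqStub seq) {
        int open = seq.defline.lastIndexOf('[');
        int close = seq.defline.lastIndexOf(']');
        if (open < 0 || close < open) {
            return null;
        }
        return seq.defline.substring(open + 1, close);
    }

    // Keeps hits at or above minPercent whose subject species matches.
    static List<Hit> filter(List<Hit> hits, double minPercent, String species) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (globalPercentIdentity(hit) >= minPercent
                    && species.equals(speciesOf(hit.subject))) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up example data, not real database entries.
        SeqStub query = new SeqStub("q1", "example query", 200);
        SeqStub subject = new SeqStub("s1", "example protein [Homo sapiens]", 220);
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(query, subject, 150));
        System.out.println(filter(hits, 70.0, "Homo sapiens").size());
    }
}
```

Neither filter needs the residues themselves, which is why empty symbol lists would be enough for this kind of use.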

My proposal would make better use of the information available in the blast output in a way that is compatible with the existing code base. It will provide increased efficiency by reducing the situations where the user would need to have access to the source fasta files. It is a more general solution which would make the use of the parser practical in a broader set of situations.

Persuaded?

-----Original Message-----
From:	Matthew Pocock [mailto:matthew_pocock@yahoo.co.uk]
Sent:	Wed 11/13/02 2:03 PM
To:	Doug Rusch
Cc:	biojava-dev@biojava.org
Subject:	Re: [Biojava-dev] blast parsing continued
> What are everyone else's opinions on this?

open to persuasion