[Biojava-l] Fasta search parsing design

Keith James kdj@sanger.ac.uk
05 Dec 2000 12:39:02 +0000


Hi,

For now I'm going with the direct conversion of Fasta -m 10 search
output to biojava.bio.search objects as I want to keep it simple. So
far I've put together a FastaSearchParser which recognises various
elements from the search output and a FastaSearchProcessor (which
implements SearchParseListener) to respond to the parser by doing
'stuff'.
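
A minimal sketch of the parser/listener split described above. The names HitListener, SketchFastaParser and CollectingListener are hypothetical and are not the FastaSearchParser or SearchParseListener mentioned in this mail. It assumes that in -m 10 output each hit starts with a ">>" line followed by "; key: value" lines, and that a single ">" line starts a per-sequence section; the key names in the sample are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical callback interface: the parser reports what it recognises.
interface HitListener {
    void startHit(String hitId);
    void hitProperty(String key, String value);
    void endHit();
}

// Hypothetical line-oriented parser. It only recognises elements and
// leaves object construction to the listener.
class SketchFastaParser {
    void parse(Iterable<String> lines, HitListener listener) {
        boolean inHit = false;     // a ">>" line has been seen and not yet closed
        boolean inHeader = false;  // still reading hit-level "; key: value" lines
        for (String line : lines) {
            if (line.startsWith(">>>")) {
                // Query header or end-of-results marker: close any open hit.
                if (inHit) {
                    listener.endHit();
                    inHit = false;
                }
                inHeader = false;
            } else if (line.startsWith(">>")) {
                if (inHit) {
                    listener.endHit();
                }
                String rest = line.substring(2).trim();
                int space = rest.indexOf(' ');
                listener.startHit(space < 0 ? rest : rest.substring(0, space));
                inHit = true;
                inHeader = true;
            } else if (line.startsWith(">")) {
                // Per-sequence section: its properties are skipped in this sketch.
                inHeader = false;
            } else if (inHeader && line.startsWith("; ")) {
                int colon = line.indexOf(':');
                if (colon > 2) {
                    listener.hitProperty(line.substring(2, colon).trim(),
                                         line.substring(colon + 1).trim());
                }
            }
        }
        if (inHit) {
            listener.endHit();
        }
    }
}

// A listener that just records events; a real processor would build
// result, hit and subhit objects here instead.
class CollectingListener implements HitListener {
    final List<String> events = new ArrayList<>();
    public void startHit(String hitId) { events.add("start " + hitId); }
    public void hitProperty(String key, String value) { events.add(key + "=" + value); }
    public void endHit() { events.add("end"); }
}

class ParserSketchDemo {
    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            ">>>query, 100 aa vs db library",
            ">>HIT1 some description",
            "; fa_opt: 120",
            ">query ..",
            "; sq_len: 100",
            ">>><<<");
        CollectingListener listener = new CollectingListener();
        new SketchFastaParser().parse(sample, listener);
        System.out.println(listener.events);
    }
}
```

Keeping the parser unaware of the search object model means the same parser could feed a different listener later without being changed.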

There are a few issues with representing the data as biojava.bio.search
result, hit and subhit objects if I use the interfaces that are already there.


Interface SeqSimilaritySearchResult

getSequenceDB() - there probably won't be a sensible SequenceDB object
if the search has been done externally. It could return null instead?
(the interface docs discourage this)

getSearcher() - again the interface docs discourage returning
null. Returning a Searcher could be okay if its getSearchableDBs()
returns an empty set, indicating that you can't actually run a search
because the database is external. Maybe omit the Searcher entirely and
allow getSearcher() to return null?
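
The "empty set" option can be sketched as follows. SketchSearcher and ExternalSearcher are hypothetical stand-ins, not the actual biojava.bio.search types, and the Set<String> of database names is an assumption made to keep the example self-contained.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical stand-in for a searcher interface; not the BioJava type.
interface SketchSearcher {
    // Names of the databases this searcher can actually query.
    Set<String> getSearchableDBs();
}

// Describes a search that was run externally, so nothing can be re-run.
final class ExternalSearcher implements SketchSearcher {
    private final String programName;

    ExternalSearcher(String programName) {
        this.programName = programName;
    }

    String getProgramName() {
        return programName;
    }

    @Override
    public Set<String> getSearchableDBs() {
        // Empty rather than null: callers can iterate without a null check.
        return Collections.emptySet();
    }
}

class SearcherSketchDemo {
    // Works for both conventions: a null searcher or one with no databases.
    static boolean canRunSearch(SketchSearcher searcher) {
        return searcher != null && !searcher.getSearchableDBs().isEmpty();
    }

    public static void main(String[] args) {
        SketchSearcher external = new ExternalSearcher("fasta");
        System.out.println(canRunSearch(external));
        System.out.println(canRunSearch(null));
    }
}
```

With the placeholder object, client code never has to special-case null, at the cost of an extra class whose only job is to say "this search cannot be repeated from here".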


Interface SeqSimilaritySearchHit

getSubHits() - Fasta hits don't have subhits as such. However, you
could view each one as a hit which only ever contains a single
subhit.
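
A sketch of the "one hit, one subhit" view. All type names here (SketchSubHit, SketchHit, SingleSubHitHit) are hypothetical and the method set is deliberately reduced; it is not the SeqSimilaritySearchHit interface itself.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical, reduced subhit: one aligned region with its score.
final class SketchSubHit {
    final int queryStart;
    final int queryEnd;
    final double score;

    SketchSubHit(int queryStart, int queryEnd, double score) {
        this.queryStart = queryStart;
        this.queryEnd = queryEnd;
        this.score = score;
    }
}

// Hypothetical, reduced hit interface.
interface SketchHit {
    String getSubjectId();
    double getScore();
    List<SketchSubHit> getSubHits();
}

// A Fasta-style hit: always exactly one subhit, and the hit score is
// simply that subhit's score.
final class SingleSubHitHit implements SketchHit {
    private final String subjectId;
    private final SketchSubHit onlySubHit;

    SingleSubHitHit(String subjectId, SketchSubHit onlySubHit) {
        this.subjectId = subjectId;
        this.onlySubHit = onlySubHit;
    }

    public String getSubjectId() {
        return subjectId;
    }

    public double getScore() {
        return onlySubHit.score;
    }

    public List<SketchSubHit> getSubHits() {
        // Immutable one-element list; no separate list is stored per hit.
        return Collections.singletonList(onlySubHit);
    }
}

class SubHitSketchDemo {
    public static void main(String[] args) {
        SketchHit hit = new SingleSubHitHit("HIT1", new SketchSubHit(5, 95, 120.0));
        System.out.println(hit.getSubjectId() + " has " + hit.getSubHits().size() + " subhit(s)");
    }
}
```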

Fasta search output also contains extra information (several scores,
positions of the presented alignment in the query and subject
sequences, percent identity). I was thinking of adding a sub-interface
to specify the extra methods. Incidentally, we find % id useful, but if
the alignment is retained the same information then lives in two
places (stored, and derivable by calculation), which is a bad thing I guess.
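
One way to avoid the duplication is to let the sub-interface derive % id from the retained alignment rather than store it. The sketch below assumes hypothetical names (BasicHit, FastaHitDetails, SimpleFastaHit) and one particular definition of percent identity: identical, non-gap columns divided by the total number of alignment columns. The figure Fasta itself reports may use a different denominator.

```java
// Hypothetical base interface with the common hit methods.
interface BasicHit {
    String getSubjectId();
    double getScore();
}

// Hypothetical sub-interface adding Fasta-specific information.
interface FastaHitDetails extends BasicHit {
    String getAlignedQuery();    // query row of the alignment, '-' for gaps
    String getAlignedSubject();  // subject row of the alignment, '-' for gaps

    // Derived from the alignment on demand, so it is never stored twice.
    default double getPercentIdentity() {
        String q = getAlignedQuery();
        String s = getAlignedSubject();
        if (q.length() != s.length()) {
            throw new IllegalStateException("aligned rows differ in length");
        }
        if (q.isEmpty()) {
            return 0.0;
        }
        int identical = 0;
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            if (c != '-' && c == s.charAt(i)) {
                identical++;
            }
        }
        return 100.0 * identical / q.length();
    }
}

final class SimpleFastaHit implements FastaHitDetails {
    private final String subjectId;
    private final double score;
    private final String alignedQuery;
    private final String alignedSubject;

    SimpleFastaHit(String subjectId, double score,
                   String alignedQuery, String alignedSubject) {
        this.subjectId = subjectId;
        this.score = score;
        this.alignedQuery = alignedQuery;
        this.alignedSubject = alignedSubject;
    }

    public String getSubjectId() { return subjectId; }
    public double getScore() { return score; }
    public String getAlignedQuery() { return alignedQuery; }
    public String getAlignedSubject() { return alignedSubject; }
}

class PercentIdSketchDemo {
    public static void main(String[] args) {
        FastaHitDetails hit = new SimpleFastaHit("HIT1", 120.0, "ACGT-A", "ACCTTA");
        System.out.println(hit.getPercentIdentity());
    }
}
```

The trade-off is that the value is recomputed on every call; if the alignment is discarded to save memory, the stored figure from the search output would be needed instead.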

Also, not having written any Java before, I don't know what memory use
will be like for storing big lists of hits. I've seen Perl hoover up
rather a lot of memory dealing with unfiltered Blast output.

Any advice/comments welcome,

Keith

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA