[Biojava-l] Fasta search parsing design

Tue, 05 Dec 2000 13:03:38 +0000

Hi Keith,

Keith James wrote:

> Hi,
>
> For now I'm going with the direct conversion of Fasta -m 10 search
> output to biojava.bio.search objects as I want to keep it simple. So
> far I've put together a FastaSearchParser which recognises various
> elements from the search output and a FastaSearchProcessor (which
> implements SearchParseListener) to respond to the parser by doing
> 'stuff'.
>
> There's a few issues about representing the data as biojava.bio.search
> result, hit and subhit objects if I use the interfaces there already.
>
> Interface SeqSimilaritySearchResult
>
> getSequenceDB() - there probably won't be a sensible SequenceDB object
> if the search has been done externally. It could return null instead?
> (the interface docs discourage this)

Gerald? In some cases you will have access to the database, especially if you
ran the search locally. Perhaps null is an appropriate 'no value here' thing
to return, or perhaps the method should throw some sort of 'information not
available' exception.
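
To make the two options concrete, here is a rough sketch. None of these names
are real biojava classes or methods; they are made up purely for illustration,
and the real SeqSimilaritySearchResult interface has more on it than this.

```java
// Hypothetical names throughout; this is NOT the real biojava.bio.search API.

// A checked exception meaning "this piece of information is not available".
class InfoNotAvailableException extends Exception {
    InfoNotAvailableException(String message) {
        super(message);
    }
}

// Stand-in for a sequence database object.
interface SequenceDBHandle {
    String getName();
}

// Stand-in for the search result interface, showing both options side by side.
interface SearchResultSketch {
    // Option 1: return null when there is no database object to hand back.
    SequenceDBHandle getSequenceDBOrNull();

    // Option 2: throw when there is no database object to hand back.
    SequenceDBHandle getSequenceDB() throws InfoNotAvailableException;
}

// A result parsed from a search that was run externally, so no database exists.
class ExternalSearchResult implements SearchResultSketch {
    public SequenceDBHandle getSequenceDBOrNull() {
        return null;
    }

    public SequenceDBHandle getSequenceDB() throws InfoNotAvailableException {
        throw new InfoNotAvailableException(
            "search was run externally; no SequenceDB is available");
    }
}
```

The exception forces callers to deal with the missing database explicitly,
whereas null is cheaper to write but easy to forget to check.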

>
>
> getSearcher() - again the interface docs discourage returning
> null. Returning a Searcher could be okay if its getSearchableDBs()
> returns an empty set, indicating that you can't actually run a search
> because the database is external. Maybe omit the Searcher entirely and
> allow getSearcher() to return null?
>

getSearcher should really return the object that you used to trigger the
fasta search, or at the very least an object that read in the parameters and
could be used to run another similar search (using exec?). If you ran the
searches from the command line and dumped the output to disk, then you should
be able to find out the parameters & write a little SequenceSearcher instance
that could re-run the search.
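
Something along these lines is what I have in mind. This is only a sketch:
RecordedFastaSearch is a made-up class, not part of biojava, and the order of
the arguments after "-m 10" (query file, then database file) is an assumption
that should be checked against the fasta version you actually run.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class, not part of biojava. It only remembers how a search
// was run so that the same search can be started again later.
class RecordedFastaSearch {
    private final String program;      // name or path of the fasta executable
    private final String queryFile;    // file holding the query sequence
    private final String databaseFile; // file holding the database sequences

    RecordedFastaSearch(String program, String queryFile, String databaseFile) {
        this.program = program;
        this.queryFile = queryFile;
        this.databaseFile = databaseFile;
    }

    // Rebuild the command line. "-m 10" is the output format being parsed;
    // the position of the two file arguments is an assumption.
    List<String> commandLine() {
        List<String> cmd = new ArrayList<>();
        cmd.add(program);
        cmd.add("-m");
        cmd.add("10");
        cmd.add(queryFile);
        cmd.add(databaseFile);
        return Collections.unmodifiableList(cmd);
    }

    // Start the search again as an external process. The caller is expected
    // to read the process output and feed it to the parser.
    Process rerun() throws IOException {
        return new ProcessBuilder(commandLine())
            .redirectErrorStream(true)
            .start();
    }
}
```

A getSearcher() implementation could then hand back a wrapper around an object
like this, even when the original search was run by hand.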

>
> Interface SeqSimilaritySearchHit
>
> getSubHits() - Fasta hits don't have subhits as such. However, you
> could view them as a case where they are a hit which only ever
> contains one subhit.

This is the way I'd go.
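
Roughly like this, with simplified stand-in types (SubHitSketch and
FastaHitSketch are invented names, and the fields are just the ones you
mentioned, not the full biojava interfaces):

```java
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified types; the real biojava interfaces differ.
class SubHitSketch {
    final double score;
    final int queryStart;
    final int queryEnd;

    SubHitSketch(double score, int queryStart, int queryEnd) {
        this.score = score;
        this.queryStart = queryStart;
        this.queryEnd = queryEnd;
    }
}

// A Fasta hit modelled as a hit that always contains exactly one subhit.
class FastaHitSketch {
    private final String subjectId;
    private final SubHitSketch onlySubHit;

    FastaHitSketch(String subjectId, SubHitSketch onlySubHit) {
        this.subjectId = subjectId;
        this.onlySubHit = onlySubHit;
    }

    String getSubjectId() {
        return subjectId;
    }

    // Always a one-element, immutable list.
    List<SubHitSketch> getSubHits() {
        return Collections.singletonList(onlySubHit);
    }
}
```

Code written against hits and subhits then works unchanged for Fasta and for
programs that really do report several subhits per hit.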

>
> Fasta search output also contains extra information (several scores,
> positions of the presented alignment in the query and subject
> sequences, percent identity). I was thinking of maybe a sub-interface
> to specify extra methods. Incidentally, we find % id useful, but if
> the alignment is retained you then have the same information in two
> places (via a calculation), which is a bad thing I guess.
>

I'd add the position information. Doesn't blast also give you hit
coordinates? The %id field sounds fine to me. Gerald? For programs that
don't report it, it can be easily calculated, as you say.
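
For what it's worth, the calculation could look something like the sketch
below. PercentIdentity is a made-up helper, and I am assuming one particular
definition: identical positions divided by the total number of alignment
columns, with '-' as the gap character. Fasta's own figure may be computed
differently, so the two numbers need not agree exactly.

```java
// Hypothetical helper, not part of biojava.
class PercentIdentity {
    // Both strings are rows of the same alignment, so they must have equal
    // length. Gaps are written as '-'. Comparison ignores case.
    static double of(String alignedQuery, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException(
                "aligned sequences must have the same length");
        }
        if (alignedQuery.isEmpty()) {
            return 0.0;
        }
        int identical = 0;
        for (int i = 0; i < alignedQuery.length(); i++) {
            char q = Character.toUpperCase(alignedQuery.charAt(i));
            char s = Character.toUpperCase(alignedSubject.charAt(i));
            // A column counts as identical only if it is not a gap.
            if (q != '-' && q == s) {
                identical++;
            }
        }
        // Gap columns are included in the denominator (an assumption).
        return 100.0 * identical / alignedQuery.length();
    }
}
```

If the alignment is kept, a sub-interface could compute the value on demand
like this instead of storing it a second time.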

>
> Also, not having written any Java before, I don't know what memory use
> will be like for storing big lists of hits. I've seen Perl hoover up
> rather a lot of memory dealing with unfiltered Blast output.
>

Perl stores arrays in an incredibly inefficient way. Java arrays have the
same overhead as C arrays plus a small amount of book-keeping memory. You
should be fine handling *lots* of hits.

>
> Any advice/comments welcome,
>
> Keith
>
> --
>
> -= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
> The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA