[Biojava-l] Fasta search parsing design

Thomas Down td2@sanger.ac.uk
Tue, 5 Dec 2000 12:59:56 +0000


On Tue, Dec 05, 2000 at 12:39:02PM +0000, Keith James wrote:
> 
> Hi,

Hi...

> There's a few issues about representing the data as biojava.bio.search
> result, hit and subhit objects if I use the interfaces there already.
> 
> 
> Interface SeqSimilaritySearchResult
> 
> getSequenceDB() - there probably won't be a sensible SequenceDB object
> if the search has been done externally. It could return null instead?
> (the interface docs discourage this)

I'm generally a little bit suspicious of returning null values
(they're an easy way to cause bugs), although I'll concede that
in this case it does look kind-of sensible.

On the other hand, you could return a `dummy' SequenceDB which
has a name but doesn't contain any sequences.  This feels a little
more correct to me, and has the advantage that it would be easy
to extend if in the future you want to provide a mechanism for
fetching whole sequences from the remote search database (a potentially
quite useful function for some applications).
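
To make the `dummy' idea concrete, here is a minimal sketch.  Note that
SimpleSequenceDB and DummySequenceDB are made-up names for illustration;
the interface is a cut-down stand-in, not the real BioJava SequenceDB,
which has more methods than this.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical, cut-down stand-in for a sequence database interface.
// This is NOT the real BioJava SequenceDB; it only illustrates the idea.
interface SimpleSequenceDB {
    String getName();
    Set<String> ids();
}

// A database that knows its own name but holds no sequences.
// Callers get a real object back instead of null.
final class DummySequenceDB implements SimpleSequenceDB {
    private final String name;

    DummySequenceDB(String name) {
        if (name == null) {
            throw new IllegalArgumentException("database name must not be null");
        }
        this.name = name;
    }

    public String getName() {
        return name;
    }

    // Always empty: the sequences live in the external search database.
    public Set<String> ids() {
        return Collections.emptySet();
    }
}
```

If remote fetching is added later, only ids() and any fetch method need
to change; code that already calls getName() keeps working.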

> getSearcher() - again the interface docs discourage returning
> null. Returning a Searcher could be okay if its getSearchableDBs()
> returns an empty set, indicating that you can't actually run a search
> because the database is external. Maybe omit the Searcher entirely and
> allow getSearcher() to return null?

This seems a shame to me -- SeqSimilaritySearcher looks a potentially
nice interface for `end users' of the code.

I assume we're still talking about cases where we might be launching
searches on remote servers?  I could envisage a situation where a
service like this is wrapped up as a SeqSimilaritySearcher with one
of the aforementioned `dummy' SequenceDBs for each database installed
on the server.
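
Roughly what I have in mind, as a sketch only.  SimpleSearcher,
RemoteSearcher and RemoteDBHandle are hypothetical names, and the
interface below is a stand-in rather than the actual
SeqSimilaritySearcher; how the list of installed databases is obtained
from the server is left out entirely.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical placeholder for a database that lives on the server.
final class RemoteDBHandle {
    private final String name;

    RemoteDBHandle(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Hypothetical stand-in for a searcher interface; not the BioJava one.
interface SimpleSearcher {
    Set<RemoteDBHandle> getSearchableDBs();
}

// Wraps a remote search service and advertises one placeholder
// database per database installed on the server.
final class RemoteSearcher implements SimpleSearcher {
    private final Set<RemoteDBHandle> dbs;

    RemoteSearcher(List<String> installedNames) {
        Set<RemoteDBHandle> tmp = new LinkedHashSet<>();
        for (String n : installedNames) {
            tmp.add(new RemoteDBHandle(n));
        }
        this.dbs = Collections.unmodifiableSet(tmp);
    }

    public Set<RemoteDBHandle> getSearchableDBs() {
        return dbs;
    }
}
```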

Of course, this isn't necessarily `first-pass' functionality.  Returning
null for now is fine if you don't need this sort of reflection.

> Interface SeqSimilaritySearchHit
> 
> getSubHits() - Fasta hits don't have subhits as such. However, you
> could view them as a case where they are a hit which only ever
> contains one subhit.

I didn't write the interface, but I would assume from the
documentation that in the `no sub-hits' case you are expected
to return a singleton List containing yourself.  Certainly,
there doesn't appear to be any harm which could result from
implementing it this way.
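
Something along these lines, as a sketch.  SimpleHit and FastaHit are
invented names, and I have collapsed hit and sub-hit into a single
stand-in interface to keep it short; the real interfaces are richer.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a search-hit interface; not the BioJava one.
interface SimpleHit {
    double getScore();
    List<SimpleHit> getSubHits();
}

// A Fasta-style hit with no real sub-hits: it reports itself as its
// only sub-hit, so callers can always iterate over getSubHits().
final class FastaHit implements SimpleHit {
    private final double score;

    FastaHit(double score) {
        this.score = score;
    }

    public double getScore() {
        return score;
    }

    public List<SimpleHit> getSubHits() {
        // Immutable one-element list containing this hit.
        return Collections.<SimpleHit>singletonList(this);
    }
}
```

Client code written against Blast-style results with several sub-hits
then works unchanged on Fasta results.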

> Fasta search output also contains extra information (several scores,
> positions of the presented alignment in the query and subject
> sequences, percent identity). I was thinking of maybe a sub-interface
> to specify extra methods. Incidentally, we find % id useful, but if
> the alignment is retained you then have the same information in two
> places (via a calculation), which is a bad thing I guess.

Sub-interfacing is fine, if that's useful to you.  Another
possibility to look at might be to attach BioJava `Annotation'
objects to these interfaces -- I guess I don't really have
strong feelings either way on this one.
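
For comparison, the two options might look like this.  Everything here
is hypothetical: BasicHit, FastaStyleHit, SimpleFastaHit and
AnnotatedHit are illustrative names, and a plain Map stands in for a
BioJava Annotation object, whose actual API I am not reproducing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal hit interface; not the BioJava one.
interface BasicHit {
    double getScore();
}

// Option 1: a sub-interface with typed accessors for the extra
// Fasta fields.
interface FastaStyleHit extends BasicHit {
    int getQueryStart();
    int getQueryEnd();
    double getPercentIdentity();
}

final class SimpleFastaHit implements FastaStyleHit {
    private final double score;
    private final int queryStart;
    private final int queryEnd;
    private final double percentIdentity;

    SimpleFastaHit(double score, int queryStart, int queryEnd,
                   double percentIdentity) {
        this.score = score;
        this.queryStart = queryStart;
        this.queryEnd = queryEnd;
        this.percentIdentity = percentIdentity;
    }

    public double getScore() { return score; }
    public int getQueryStart() { return queryStart; }
    public int getQueryEnd() { return queryEnd; }
    public double getPercentIdentity() { return percentIdentity; }
}

// Option 2: a generic hit carrying free-form key/value annotation.
// The Map is only a stand-in for an Annotation object.
final class AnnotatedHit implements BasicHit {
    private final double score;
    private final Map<String, Object> annotation = new HashMap<>();

    AnnotatedHit(double score) {
        this.score = score;
    }

    public double getScore() { return score; }

    Map<String, Object> getAnnotation() { return annotation; }
}
```

The sub-interface gives compile-time checking of the extra fields; the
annotation route needs no new types but relies on agreed key names.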

Gerald (as the original developer of these interfaces): any
comments on this?

> Also, not having written any Java before, I don't know what memory use
> will be like for storing big lists of hits. I've seen Perl hoover up
> rather a lot of memory dealing with unfiltered Blast output.

Well-designed Java is generally a lot more efficient than Perl in
this respect, mainly because OO Perl has a large per-object overhead
from the `object-is-a-hash' trick.  Per-object overhead in Java is
typically 8-12 bytes in a modern VM -- much better.

From experience, the Java garbage collector is also rather more
trustworthy.

Happy hacking,

Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett