[Biojava-l] blast parsing and empty hits

Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com
Thu, 03 Oct 2002 20:11:35 +0100


Doug Rusch wrote:
> 
> Actually I have made changes that fix both the no summary and "No hits
> found" problems though I have not done extensive testing and I do not
> know if this would work for wu-blast yet. It's more of a hack though
> than a nice solution. It would be nice to use the regex in 1.4 to put
> together a nice clear parser and I may do that in the near future. I am
> still surprised that this is even a problem. Is the community that
> small that obvious problems like this have not been fixed much earlier?
> 

Hi Doug,

I think you're assuming that what I'm sure is a genuine problem for you
is also a problem across a broad variety of use cases.  I suspect that's
an incorrect assumption. In general, people don't get terribly excited
at the prospect of parsing search reports that don't have any hits.
Furthermore, for many use cases, workarounds to deal with missing SAX
events and/or empty documents in the special case of empty blast reports
will often either be trivial or not be required at all.
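
As an illustration of the kind of trivial workaround I mean, a caller
could pre-scan the raw report and skip parsing altogether when it
contains no hits. This is only a minimal sketch using the standard Java
library: the class and method names are hypothetical, and the marker
string is an assumption based on the "No hits found" message quoted
above (WU-BLAST output may well differ).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of biojava.
public class EmptyBlastReportCheck {

    // Assumed marker text, taken from the "No hits found" message in this thread.
    private static final String NO_HITS_MARKER = "No hits found";

    // Returns true if any line of the report contains the marker.
    // The reader is consumed, so the report must be reopened before parsing.
    public static boolean hasNoHits(Reader report) {
        try {
            BufferedReader in = new BufferedReader(report);
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(NO_HITS_MARKER)) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up report fragment, for demonstration only.
        String report = "BLASTP\n\n ***** No hits found ******\n";
        if (hasNoHits(new StringReader(report))) {
            System.out.println("Empty report: skipping the parser");
        } else {
            System.out.println("Report has hits: hand it to the parser");
        }
    }
}
```

Because the check consumes the reader, the caller would reopen the file
(or buffer the text) before handing a non-empty report to the parser.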

This issue of the biojava blast SAX driver producing events equivalent
to malformed XML in the case of empty blast reports has been known for
a while (I think problems with empty blast reports were first posted to
the biojava list in Dec 2001).  Clearly, having this as a known bug and
not fixing it isn't ideal - but the reason no-one has fixed it yet is
that it simply hasn't caused anyone enough grief.
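
For anyone who wants to see what "events equivalent to malformed XML"
means in practice, one option is a small handler that checks whether the
start/end element events it receives are balanced. The sketch below uses
only the standard org.xml.sax classes; the class name and the element
names in main are hypothetical and are not the names the biojava driver
actually emits.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler, not part of biojava.
public class BalanceCheckingHandler extends DefaultHandler {

    private final Deque<String> open = new ArrayDeque<>();
    private boolean sawRoot = false;
    private String problem = null;   // first problem seen, or null if none

    private void report(String message) {
        if (problem == null) {
            problem = message;
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (open.isEmpty() && sawRoot) {
            report("second root element: " + qName);
        }
        sawRoot = true;
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty() || !open.peek().equals(qName)) {
            report("unexpected end of element: " + qName);
        } else {
            open.pop();
        }
    }

    @Override
    public void endDocument() {
        if (!sawRoot) {
            report("no root element");
        } else if (!open.isEmpty()) {
            report("unclosed element: " + open.peek());
        }
    }

    public String getProblem() {
        return problem;
    }

    public static void main(String[] args) {
        // Simulate a truncated event stream: two elements opened, none closed.
        BalanceCheckingHandler handler = new BalanceCheckingHandler();
        Attributes none = new AttributesImpl();
        handler.startElement("", "", "Report", none);
        handler.startElement("", "", "Hit", none);
        handler.endDocument();
        System.out.println(handler.getProblem());
    }
}
```

A handler like this could be registered as the content handler when
parsing a report; a non-null result from getProblem() afterwards would
indicate that the event stream was not well-formed.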

Regarding your previous comments about the frequency of code updates:
I'm not sure where you got the idea that the Blast parsing code hadn't
been updated in almost two years.
If you're interested in update histories of classes in the SAX parsing
biojava package, you can see them at the URL below:

http://cvs.biojava.org/cgi-bin/viewcvs/viewcvs.cgi/biojava-live/src/org/biojava/bio/program/sax/?cvsroot=biojava

The last update seems to be four weeks ago, when Keith James from the
Sanger Centre added support for NCBI Blast versions 2.2.2 and 2.2.3.

Despite your initial problems, I do hope you'll give the biojava parser
a chance - it might not actually be as bad as you think! Your bug fixes
for NCBI Blast parsing for empty reports would be really appreciated by
the community, I'm sure. If you get yourself CVS access, you could
easily apply them.

Simon
--
Simon M. Brocklehurst, Ph.D.
Director of Informatics & Robotics
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com