[Biojava-l] Blast Parser oddity

Fri Apr 2 08:20:20 EST 2004

Hi!

I am currently evaluating the XML output of NCBI Blast and BioJava's 
ability to parse it. For this purpose, I ran identical blastp and blastn 
searches twice (i.e. the same sequence against the same database with 
the same parameters), once with the standard output and once with XML 
output ("-m 7"). I then parsed the files with BlastLikeSAXParser 
(standard output) or BlastXMLParserFacade (XML output) and compared the 
results. Surprisingly, the two parsers gave different results...

Here is a list of the fields that are different:

SeqSimilaritySearchResult:
   Annotation:
     databaseId
     program
     queryId
     version

SeqSimilaritySearchHit:
   subjectId
   queryStrand
   subjectStrand
   Annotation:
     subjectDescription
     subjectId

SeqSimilaritySearchSubHit:
   queryStrand
   subjectStrand
   score
   numberOfIdentities
   numberOfPositives
   percentageIdentity
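
A field-by-field comparison of this kind can be sketched as follows. This is only a minimal illustration, assuming the fields of each parsed object have already been copied into a plain Map; the class name FieldDiff, the method differingKeys and the sample values are made up for this example and are not part of BioJava.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

public class FieldDiff {

    /**
     * Returns the sorted names of all fields whose values differ between
     * the two maps, including fields present in only one of them.
     */
    public static Set<String> differingKeys(Map<String, ?> a, Map<String, ?> b) {
        // Collect every field name seen in either result.
        Set<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());

        Set<String> diff = new TreeSet<>();
        for (String key : keys) {
            boolean inBoth = a.containsKey(key) && b.containsKey(key);
            if (!inBoth || !Objects.equals(a.get(key), b.get(key))) {
                diff.add(key);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        // Hypothetical values, purely for illustration (not real parser output).
        Map<String, Object> fromText =
            Map.of("subjectId", "hit_1", "score", 250.0, "program", "blastp");
        Map<String, Object> fromXml =
            Map.of("subjectId", "1", "score", 250.0);

        System.out.println(differingKeys(fromText, fromXml));
    }
}
```

Using a TreeSet keeps the reported field names in a stable, sorted order, which makes the two reports easy to compare across runs.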

These are all rather important fields, for example subjectId, the 
description or score. Having looked at both, I think that the output of 
BlastLikeSAXParser is OK, but that of BlastXMLParserFacade is broken.

What now? I think that the parsing results are supposed to be identical 
(as far as possible), but changing the parser might break existing code. 
If it's OK with you, I'd like to volunteer to change BlastXMLParserFacade 
so that its output more closely matches that of BlastLikeSAXParser.

By the way, is there a guaranteed set of Annotation entries for these 
different classes? For example, I find percentageIdentity, but no 
percentagePositives.
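
Until that is settled, calling code probably has to treat every Annotation key as optional. Here is a rough sketch of such a defensive lookup, again on a plain Map rather than a BioJava Annotation. The class and method names are hypothetical, and the "alignmentSize" key is an assumption that would need to be checked against what the parsers actually emit.

```java
import java.util.Map;
import java.util.OptionalDouble;

public class PercentagePositives {

    /** Converts a Number or a numeric String to a Double; returns null otherwise. */
    static Double asDouble(Object value) {
        if (value instanceof Number) {
            return ((Number) value).doubleValue();
        }
        if (value instanceof String) {
            try {
                return Double.valueOf(((String) value).trim());
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return null;
    }

    /**
     * Returns percentagePositives if the key is present; otherwise tries to
     * derive it from numberOfPositives and alignmentSize (assumed key name).
     * Returns an empty result if neither route works.
     */
    public static OptionalDouble percentagePositives(Map<String, ?> annotation) {
        Double direct = asDouble(annotation.get("percentagePositives"));
        if (direct != null) {
            return OptionalDouble.of(direct);
        }
        Double positives = asDouble(annotation.get("numberOfPositives"));
        Double length = asDouble(annotation.get("alignmentSize")); // assumed key
        if (positives != null && length != null && length > 0) {
            return OptionalDouble.of(100.0 * positives / length);
        }
        return OptionalDouble.empty();
    }
}
```

Returning an OptionalDouble instead of throwing keeps the caller working whichever parser produced the entries.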

Greetings,
Christian