[Biojava-l] Blast Parser oddity
Christian Gruber
Christian.Gruber at biomax.com
Fri Apr 2 08:20:20 EST 2004
Hi!
I am currently evaluating the XML output of NCBI Blast and BioJava's
ability to parse it. For this purpose, I ran the same blastp and blastn
searches twice (i.e. the same sequence against the same database with
the same parameters), once with the standard output and once with XML
output ("-m 7"). I then parsed the files with either BlastLikeSAXParser
(standard output) or BlastXMLParserFacade (XML output) and compared the
results. Surprisingly, the two parsers gave different results...
Here is a list of the fields that are different:
SeqSimilaritySearchResult:
    Annotation:
        databaseId
        program
        queryId
        version
SeqSimilaritySearchHit:
    subjectId
    queryStrand
    subjectStrand
    Annotation:
        subjectDescription
        subjectId
SeqSimilaritySearchSubHit:
    queryStrand
    subjectStrand
    score
    numberOfIdentities
    numberOfPositives
    percentageIdentity
    score
These are all rather important fields, for example subjectId, the
description, or the score. Having looked at both, I think the output of
BlastLikeSAXParser is OK, but that of BlastXMLParserFacade is rotten.
What now? I think the two parsers are supposed to produce identical
results (as far as possible), but changing a parser might break existing
code. If it's OK with you, I'd like to volunteer to change
BlastXMLParserFacade so that its output more closely resembles that of
BlastLikeSAXParser.
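
For reference, a field-by-field comparison of this kind could look
roughly like the sketch below. It is only an illustration, not code from
BioJava: it assumes the Annotation bundles of the two parse results have
first been copied into plain java.util.Map objects (BioJava's
Annotation.asMap() should provide this), and the class name
AnnotationDiff, the "<missing>" marker and the sample values are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical helper, not part of BioJava.
final class AnnotationDiff {

    /**
     * Returns one entry per key whose value differs between the two maps.
     * A key that is absent on one side is reported as "<missing>" there.
     */
    static Map<String, String> diff(Map<String, ?> fromText, Map<String, ?> fromXml) {
        // Union of both key sets, sorted so the report order is stable.
        TreeSet<String> keys = new TreeSet<>(fromText.keySet());
        keys.addAll(fromXml.keySet());

        Map<String, String> report = new LinkedHashMap<>();
        for (String key : keys) {
            Object a = fromText.containsKey(key) ? fromText.get(key) : "<missing>";
            Object b = fromXml.containsKey(key) ? fromXml.get(key) : "<missing>";
            if (!Objects.equals(a, b)) {
                report.put(key, a + " != " + b);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from a real Blast run.
        Map<String, Object> text = new LinkedHashMap<>();
        text.put("program", "BLASTP");
        text.put("percentageIdentity", "98");

        Map<String, Object> xml = new LinkedHashMap<>();
        xml.put("program", "blastp");

        // Print every key on which the two parse results disagree.
        diff(text, xml).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

The same diff could be applied at each level (result, hit, sub-hit), and
it would also show which Annotation keys only one of the parsers sets.
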
By the way, is there a guaranteed set of Annotation entries for these
different classes? For example, I find percentageIdentity, but no
percentagePositives.
Greetings,
Christian