[Biojava-dev] BlastXMLParserFacade - multiple iteration support

James Diggans jdiggans at excelsiortech.com
Sun Dec 5 16:13:45 EST 2004


I'm relatively new to BioJava but am using BlastXMLParserFacade to parse
a large body of output from MegaBLAST 2.2.10. I've encountered two minor
problems: one may be easy to fix, but the other seems rather systemic.
The first is the lack of direct support for MegaBLAST, which can be
worked around in the short term by patching BlastOutputHandler.java
(see the patch at the bottom of this email).

The larger problem seems to stem from the historical need to wrap the
XML-based BLAST output from NCBI in a parent tag. This is no longer
necessary. However, from what I can tell, the current StAX framework
assumes that every BlastOutput tag set stems from a *single* query
sequence. That is, if I send 1,000 query sequences up to MegaBLAST and
get a large XML result file back, the list of result objects filled in
by a parser call such as

ArrayList alignmentResults = new ArrayList();
SearchContentHandler builder = new BlastLikeSearchBuilder(
        alignmentResults,
        new DummySequenceDB("queries"),
        new DummySequenceDBInstallation());

adapter.setSearchContentHandler(builder);
parser.parse(new InputSource(is));

*always* contains a single SeqSimilaritySearchResult. This is because the
parser pays no attention to the query information within each Iteration
tag set and assumes all hits relate to the query specified in the
BlastOutput tag set, which by default contains only the first query ID.
I'd like to fix it, but as I'm completely new to StAX (just tracking
this bug down has been educational), I'm going to need some help!
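
One quick way to confirm that a result file really carries one query per
Iteration is to list the Iteration_query-def values. This is only a
sketch: it uses the JDK's XPath API rather than BioJava, the class name
IterationQueryCheck and the sample XML are made up for illustration, and
it loads the whole document into memory, so it only suits a small sample
of the output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class IterationQueryCheck {
    // Hand-made sample with two queries; real output has many more elements.
    static final String SAMPLE =
        "<BlastOutput><BlastOutput_iterations>"
        + "<Iteration><Iteration_query-def>q1</Iteration_query-def></Iteration>"
        + "<Iteration><Iteration_query-def>q2</Iteration_query-def></Iteration>"
        + "</BlastOutput_iterations></BlastOutput>";

    /** Returns the text of every Iteration_query-def, in document order. */
    static List<String> queryDefs(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//Iteration/Iteration_query-def",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> defs = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                defs.add(nodes.item(i).getTextContent().trim());
            }
            return defs;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse BLAST XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(queryDefs(SAMPLE)); // prints [q1, q2]
    }
}
```

If this reports more than one query while the parser hands back a single
result object, the hits from all queries have been merged as described
above.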

The easiest fix would seem to be to alter the IterationHandler to read
the Iteration_query-ID and Iteration_query-def tags (MegaBLAST puts the
query ID in the query-def field; don't ask me why). However, the
'BioJava alignment results DTD' to which the StAX events map the
incoming XML contains a QueryId *only* within a BioJavaHit element. This
means the IterationHandler would need to pass these text elements down
to the IterationHitHandler, which writes the BioJavaHit elements. I
don't see any way to do this and it worries me.
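
To illustrate the idea of carrying the per-Iteration query down to the
hits, here is a minimal sketch using a plain JDK SAX handler. It is not
BioJava's StAX handler framework and is not a patch for IterationHandler
or IterationHitHandler; the class name HitsByQuery and the sample XML
are hypothetical. The handler remembers the current Iteration_query-def
in a field and files every Hit_id it sees under that query.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HitsByQuery extends DefaultHandler {
    // Hand-made sample: query q1 has two hits, query q2 has one.
    static final String SAMPLE =
        "<BlastOutput><BlastOutput_iterations>"
        + "<Iteration><Iteration_query-def>q1</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_id>h1</Hit_id></Hit><Hit><Hit_id>h2</Hit_id></Hit>"
        + "</Iteration_hits></Iteration>"
        + "<Iteration><Iteration_query-def>q2</Iteration_query-def><Iteration_hits>"
        + "<Hit><Hit_id>h3</Hit_id></Hit>"
        + "</Iteration_hits></Iteration>"
        + "</BlastOutput_iterations></BlastOutput>";

    // Hit ids grouped under the query def of the enclosing Iteration.
    private final Map<String, List<String>> hitsByQuery =
            new LinkedHashMap<String, List<String>>();
    private final StringBuilder text = new StringBuilder();
    private String currentQuery; // state shared by iteration and hit handling

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0); // start collecting character data for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("Iteration_query-def")) {
            // Remember the query so later hits can be attributed to it.
            currentQuery = text.toString().trim();
            hitsByQuery.computeIfAbsent(currentQuery, k -> new ArrayList<String>());
        } else if (qName.equals("Hit_id") && currentQuery != null) {
            hitsByQuery.get(currentQuery).add(text.toString().trim());
        } else if (qName.equals("Iteration")) {
            currentQuery = null; // leaving this query's hits
        }
    }

    static Map<String, List<String>> parse(String xml) {
        try {
            HitsByQuery handler = new HitsByQuery();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.hitsByQuery;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse BLAST XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(SAMPLE)); // prints {q1=[h1, h2], q2=[h3]}
    }
}
```

The sketch keys on Iteration_query-def because, as noted above,
MegaBLAST puts the query ID there. Hits in an Iteration without that tag
are skipped.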

Anyone who knows more about this care to offer some advice?

Regards,
-j

--- BlastOutputHandler.java	2003-06-01 05:42:24.000000000 -0400
+++ BlastOutputHandler.java	2004-11-24 23:05:35.375000000 -0500
@@ -146,17 +146,22 @@
                             else if (program.equals("tblastx")) {
                                 // dna query translated in all frames against
                                 // dna database in all frames
                                 // irrespective of frame, both sequences displayed
                                 // in increasing seq DNA coordinates.
                                 querySequenceType = "protein";
                                 hitSequenceType = "protein";
                             }
-                            else throw new SAXException("unknown BLAST program.");
+                            else if (program.equals("megablast")) {
+                                // nucleotide query against dna database
+                                querySequenceType = "dna";
+                                hitSequenceType = "dna";
+                            }
+                            else throw new SAXException("unknown BLAST program.");
                         }
                     };
                 }
             }
         );

         // delegate handling of <BlastOutput_version>
         super.addHandler(new ElementRecognizer.ByLocalName("BlastOutput_version"),


