[Biojava-dev] BlastXMLParserFacade - MegaBLAST/Iteration_query-def support

James Diggans jdiggans at excelsiortech.com
Wed Dec 1 18:56:19 EST 2004


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


I posted to the list a week or so ago about support for MegaBLAST in the
BlastXMLParserFacade. Recognizing the program is the easy part (see
attached patch). However, MegaBLAST seems to populate the
Iteration_query-ID tags with near-nonsense characters:

     <Iteration>
       <Iteration_iter-num>0</Iteration_iter-num>
-->    <Iteration_query-ID>lcl|1_</Iteration_query-ID>
       <Iteration_query-def>AB2040_B08_061.g1</Iteration_query-def>
       <Iteration_query-len>552</Iteration_query-len>

What I *actually* want is in the Iteration_query-def field. After
reading through BlastOutputHandler, IterationHandler, IterationHitsHandler
and HitHandler, two things strike me as in need of repair:

1) BlastOutputHandler assumes that the content of BlastOutput_query-ID
applies to the entire document, which is not the case. For BLAST output
files (at least, for MegaBLAST output) containing hits for multiple
input query sequences (now that NCBI has repaired their XML format),
it's the Iteration_query-ID (or Iteration_query-def) tag that contains
the query-specific information for each group of hits. Support for this
needs to be added - would IterationHandler be the proper location?

2) *_query-def is not supported at *all* in the current framework (it
never makes it into the Annotation associated with the hit). I'm willing
to add this but want to confirm, first, that it is indeed not supported
before I go mucking about.
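
To make points 1) and 2) concrete, here is a rough sketch of per-iteration capture of both tags. It uses only the JDK's SAX API, not the BioJava handler classes, so it says nothing about where this belongs in the existing framework; the class name IterationQueryCollector and its collect() method are hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of BioJava): records the query ID and
// query definition of every <Iteration> instead of one per document.
class IterationQueryCollector extends DefaultHandler {
    // One {query-ID, query-def} pair per <Iteration>, in document order.
    final List<String[]> queries = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String queryId;
    private String queryDef;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0);               // start collecting text for this element
        if (qName.equals("Iteration")) { // new group of hits: forget the old query
            queryId = null;
            queryDef = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("Iteration_query-ID")) {
            queryId = text.toString().trim();
        } else if (qName.equals("Iteration_query-def")) {
            queryDef = text.toString().trim();
        } else if (qName.equals("Iteration")) {
            queries.add(new String[] { queryId, queryDef });
        }
    }

    // Parses an XML string and returns the collected pairs.
    static List<String[]> collect(String xml) {
        IterationQueryCollector handler = new IterationQueryCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse BLAST XML", e);
        }
        return handler.queries;
    }
}
```

In the real framework the same two values would presumably have to reach the Annotation associated with each hit; the sketch only shows that they can be read per Iteration.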

Thanks,
-j



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3-nr1 (Windows XP)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBrloj75jgGJzUhNkRAjclAJ9CrUg4bDb2Hy92wXp+HBmEx3DR8QCg8DKV
mRZ2uzAzdRd4kyXsyrp890w=
=UUbt
-----END PGP SIGNATURE-----
-------------- next part --------------
--- C:\Java\clean-biojava\biojava-1.4pre1\src\org\biojava\bio\program\sax\blastxml\BlastOutputHandler.java	2003-06-01 05:42:24.000000000 -0400
+++ C:\Java\biojava-1.4pre1\src\org\biojava\bio\program\sax\blastxml\BlastOutputHandler.java	2004-11-24 23:05:35.375000000 -0500
@@ -146,17 +146,22 @@
                             else if (program.equals("tblastx")) {
                                 // dna query translated in all frames against
                                 // dna database in all frames
                                 // irrespective of frame, both sequences displayed
                                 // in increasing seq DNA coordinates.
                                 querySequenceType = "protein";
                                 hitSequenceType = "protein";
                             }
-                            else throw new SAXException("unknown BLAST program.");
+                            else if (program.equals("megablast")) {
+                                // nucleotide query against dna database
+                                querySequenceType = "dna";
+                                hitSequenceType = "dna";
+                            }
+                            else throw new SAXException("unknown BLAST program.");
                         }
                     };
                 }
             }
         );
 
         // delegate handling of <BlastOutput_version>
         super.addHandler(new ElementRecognizer.ByLocalName("BlastOutput_version"),

