[Biojava-dev] blast parsing continued

Matthew Pocock matthew_pocock@yahoo.co.uk
Sat, 16 Nov 2002 14:09:06 +0000


Hi Doug,

You've persuaded me ;-)

Just a handful of comments. Firstly, you said that your code uses the 1.4
regex stuff. If it does, then please locate it under src-1.4 rather than
src in the dev tree. Secondly, whatever happens, we need to make sure
that there is a functional BLAST parsing solution for Java 1.2 and 1.3
users. Thirdly, I'm a little bit worried about us ending up with two
incompatible BLAST DTDs. If we modify the DTD for one parser, then if
humanly possible, could we make sure the other one spits out the same
SAX events? I know this may not be trivial, but it would make me happier.
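
One way to check that two parsers really do emit the same SAX is to
record each event stream and compare the recordings. What follows is
only a minimal sketch, not BioJava code: the class name SaxEventRecorder
and the record() helper are hypothetical, the XML strings stand in for
the output of the two BLAST parsers, and it uses the standard JAXP API
with current Java syntax, so it would not compile on 1.2 or 1.3.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records SAX events as strings for comparison. */
public class SaxEventRecorder extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Sort attributes by name, since attribute order is not significant.
        Map<String, String> sorted = new TreeMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            sorted.put(atts.getQName(i), atts.getValue(i));
        }
        events.add("start " + qName + " " + sorted);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Skip whitespace-only text. A parser may split one text run over
        // several calls, so a stricter comparison would merge adjacent runs.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text " + text);
        }
    }

    public List<String> getEvents() {
        return events;
    }

    /** Stand-in for a BLAST parser: replays an XML string as SAX events. */
    public static List<String> record(String xml) throws Exception {
        SaxEventRecorder recorder = new SaxEventRecorder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        return recorder.getEvents();
    }

    public static void main(String[] args) throws Exception {
        // Same element, attributes in a different order.
        String a = "<HSPSummary score=\"44\" bitScore=\"87.5\"/>";
        String b = "<HSPSummary bitScore=\"87.5\" score=\"44\"/>";
        System.out.println(record(a).equals(record(b)));
    }
}
```

In practice the recorder would be handed to each parser as its
ContentHandler, both would be run over the same BLAST report, and the
two event lists compared with equals().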

Lastly, I can't remember if you have a CVS account, but I guess it's
time to sort you out with one so that you can work on this without
mailing source around the list all the time. Welcome aboard :-)

Matthew

Doug Rusch wrote:
> Yes, an XML parser would be best, except that I find the NCBI BLAST XML output option tends to core dump on me. In any case, here is my modification of the BlastLikeDataSetCollection DTD, which I call BlastLikeResultSetCollection. See if this fits with your expectations, David; if not, we can thrash out what should be changed.
> 
> <!-- BlastLikeResultSetCollection DTD - this is a heavily modified version 
>      of the BlastLikeDataSetCollection DTD. It is currently under
>      development but hopefully will serve as a unified DTD for
>      a variety of analysis tools including :
> 
>                 o BLAST                (NCBI)                         
>                 o WU-BLAST             (Washington University)        
>                 o HMMER                (Washington University)        
>                 o DBA                  (Sanger Center)                
>                 o Genewise             (Sanger Center)                
>                 o Sim4                 (Pennsylvania State University)                
>                                                                       
>      NB This DTD covers output from the above software, when run      
>      in modes such that the detailed output is based around           
>      pairwise alignments.          
> 
>      This is as opposed to other output formats such as ASN.1
> 
>      The root element is a BlastLikeResultSetCollection.  This is
>      described towards the end of the DTD.
>      ================================================================ 
>      The BlastLikeDataSetCollection DTD is Copyright 1999, 2000, 2001 Cambridge
>      Antibody Technology Group plc (CAT). All Rights Reserved.                            
> 
>      The BlastLikeResultSetCollection DTD is Copyright 2002 The Center for the
>      Advancement of Genomics (TCAG). All Rights Reserved.
>                                                                       
>      Version 0.5
> 
>      Author List for BlastLikeResultSetCollection:
>        Primary Author: Douglas Rusch      (TCAG) 
> 
>      Author List for BlastLikeDataSetCollection:                                                     
>        Primary Author: Simon Brocklehurst (CAT)                       
>        Other Authors:  Colin H. Hardman   (CAT)                       
>                        Stuart Johnson     (CAT)                       
>                        Tim Dilks          (CAT)                       
>                        Keith James        (Sanger Center)
>      ================================================================ -->
> 
> <!-- PARAMETER ENTITY DECLARATIONS 
>      ============================= -->
> 
> <!-- ELEMENT DECLARATIONS
>      ==================== -->
> 
> <!-- The RawOutput element is used to represent sections of the
>      output from programs "as is".  This enables information from
>      software to be represented, without being parsed in detail.
>                                                                       -->
> <!ELEMENT biojava:RawOutput (#PCDATA)>
> <!ATTLIST biojava:RawOutput
>                      xml:space       (default|preserve) #IMPLIED >
> 
> <!-- ================================================================ -->
> <!-- Elements for Query, Subject, and Database information            -->
> <!-- Changes include the addition of the description or definition    -->
> <!-- line and the length (in letters) of the subject and query        -->
> <!-- sequences. For the database, length in letters and number of     -->
> <!-- sequences has been added.                                        -->
> <!-- Why is there a metadata field? How is this supposed to be used?? -->
> <!-- Parsers seem to ignore this attribute.                           -->
> 
> <!ELEMENT biojava:QueryInfo EMPTY>
> <!ATTLIST biojava:QueryInfo
>                     id             CDATA  #REQUIRED
>                     desc           CDATA  #IMPLIED
>                     length         CDATA  #IMPLIED
>                     metadata       CDATA  #REQUIRED >
> 
> <!ELEMENT biojava:SbjctInfo EMPTY>
> <!ATTLIST biojava:SbjctInfo
>                      id                  CDATA  #REQUIRED
>                      desc                CDATA  #IMPLIED
>                      length              CDATA  #IMPLIED
>                      metaData            CDATA  #REQUIRED >
> 
> <!ELEMENT biojava:DatabaseInfo EMPTY>
> <!ATTLIST biojava:DatabaseInfo
>                     name           CDATA  #REQUIRED
>                     letters        CDATA  #IMPLIED
>                     entries        CDATA  #IMPLIED
>                     metadata       CDATA  #REQUIRED >
> 
> <!-- ================================================================ -->
> <!-- Mainly HSPSummary related information derived from HitSummary.   -->
> <!-- Neither of these names seems correct, perhaps MatchSummary       -->
> <!-- would be best. Changes include removing a count of HSPs and      -->
> <!-- reading frame. Reading frame is easily derived from the          -->
> <!-- coordinates of the alignment. Also removed sumProbability value  -->
> <!-- though this should probably be kept. Added similarity count.     -->
> 
> <!ELEMENT biojava:HSPSummary EMPTY>
> <!ATTLIST biojava:HSPSummary
>                 score               CDATA #REQUIRED
>                 bitScore            CDATA #IMPLIED
>                 expectValue         CDATA #IMPLIED
>                 identical           CDATA #IMPLIED
>                 alignmentLength     CDATA #IMPLIED
>                 similar             CDATA #IMPLIED
>                 pValue              CDATA #IMPLIED
>                 sumPValues          CDATA #IMPLIED >
> 
> <!-- ================================================================ -->
> <!-- Elements for Query, Subject, and Match alignment information     -->
> 
> <!ELEMENT biojava:QuerySequence (#PCDATA)>
> <!ATTLIST biojava:QuerySequence 
>                 begin           CDATA #REQUIRED
>                 end             CDATA #REQUIRED
>                 strand          CDATA #REQUIRED
>                 type            CDATA #IMPLIED
>                 gaps            CDATA #IMPLIED >
> 
> <!-- A MatchConsensus element represents the consensus information
>      present in a pairwise alignment produced by Blast-like programs
>      (i.e. the middle line of the alignment).                          -->
> 
> <!ELEMENT biojava:MatchConsensus (#PCDATA)>
> <!ATTLIST biojava:MatchConsensus
>                      xml:space       (default|preserve) #IMPLIED >
> 
> 
> <!ELEMENT biojava:SbjctSequence (#PCDATA)>
> <!ATTLIST biojava:SbjctSequence 
>                 begin           CDATA #REQUIRED
>                 end             CDATA #REQUIRED
>                 strand          CDATA #REQUIRED
>                 type            CDATA #IMPLIED
>                 gaps            CDATA #IMPLIED >
> 
> <!-- The BlastLikeAlignment element represents information from the
>      pairwise alignments produced by Blast-like programs. Rather than
>      representing the alignment simply as preformatted raw text, it
>      separates out the information into a QuerySequence, a SbjctSequence
>      and a MatchConsensus.                                             -->
> 
> <!ELEMENT biojava:BlastLikeAlignment (biojava:QuerySequence,
>                                       biojava:MatchConsensus,
>                                       biojava:SbjctSequence) >
> 
> <!ELEMENT biojava:HSP (biojava:HSPSummary, biojava:BlastLikeAlignment?)>
> 
> <!-- HSPCollections model related groups of HSPs. For example, this
>      allows all plus strand HSPs to be grouped separately from all
>      minus strand HSPs                                                 -->
> 
> <!ELEMENT biojava:HSPCollection (biojava:HSP+)>
> 
> <!-- A hit, besides containing the subject and alignment information
>      should also hold things like frameshifts where it is assumed that
>      a frameshift terminates a given match or HSP                      -->
> 
> <!ELEMENT biojava:Hit (biojava:SbjctInfo, biojava:HSPCollection+)>
> <!ATTLIST biojava:Hit >
> 
> <!ELEMENT biojava:Detail (biojava:Hit*)>
> 
> <!-- ================================================================ -->
> <!-- Statistics found at end of blast                                 -->
> 
> <!ELEMENT biojava:KAStats EMPTY>
> <!ATTLIST biojava:KAStats
>                     K              CDATA  #REQUIRED
>                     H              CDATA  #REQUIRED
>                     lambda         CDATA  #REQUIRED >
> 
> <!ELEMENT biojava:GappedKAStats EMPTY>
> <!ATTLIST biojava:GappedKAStats
>                     K              CDATA  #REQUIRED
>                     H              CDATA  #REQUIRED
>                     lambda         CDATA  #REQUIRED >
>                     
> <!ELEMENT biojava:SearchMatrix EMPTY>
> <!ATTLIST biojava:SearchMatrix
>                     name           CDATA  #REQUIRED
>                     matchScore     CDATA  #IMPLIED
>                     mismatchScore  CDATA  #IMPLIED >
> 
> <!ELEMENT biojava:GapPenalties EMPTY>
> <!ATTLIST biojava:GapPenalties
>                     gapOpen        CDATA  #REQUIRED
>                     gapExtend      CDATA  #REQUIRED >
> 
> <!ELEMENT biojava:SearchSpaceStats EMPTY>
> <!ATTLIST biojava:SearchSpaceStats
>                     effectiveSpace CDATA  #REQUIRED
>                     usedSpace      CDATA  #REQUIRED >
> 
> <!ELEMENT biojava:Statistics (biojava:KAStats, 
>                            biojava:GappedKAStats,
>                            biojava:SearchMatrix,
>                            biojava:GapPenalties,
>                            biojava:SearchSpaceStats)>
> 
> <!-- ================================================================ -->
> <!-- Relating to overall results of searches                          -->
> 
> <!ELEMENT biojava:Header (biojava:RawOutput?, biojava:QueryInfo?, biojava:DatabaseInfo?)>
> 
> <!ELEMENT biojava:BlastLikeResultSet (biojava:Header,
>                                       biojava:Summary?,
>                                       biojava:Detail?,
>                                       biojava:Statistics?)>
> <!ATTLIST biojava:BlastLikeResultSet
>                  program             CDATA #REQUIRED
>                  version             CDATA #REQUIRED>
> 
> <!-- A BlastLikeResultSetCollection contains data from groups of results
>      obtained from  bioinformatics software that produces Blast-like 
>      output. For example, it can model the output from Blast run on 
>      multiple sequences. Or it could be used to group together analyses
>      on a single sequence obtained from multiple programs.             -->
> 
> <!ELEMENT biojava:BlastLikeResultSetCollection (biojava:BlastLikeResultSet+) >
> <!ATTLIST biojava:BlastLikeResultSetCollection
>                  xmlns               CDATA #FIXED ""
>                  xmlns:biojava       CDATA #FIXED "http://www.biojava.org" >
> 
> 
> -----Original Message-----
> From:	David Huen [mailto:smh1008@cus.cam.ac.uk]
> Sent:	Fri 11/15/02 11:49 AM
> To:	Doug Rusch; Keith James
> Cc:	biojava-dev@biojava.org
> Subject:	Re: [Biojava-dev] blast parsing continued
> Could I have a copy of whatever DTD you might settle upon please?
> 
> I have an NCBI BLAST XML parser that I use and would like to check in, and
> I still need to implement an adaptor so that its events match those expected
> by downstream builders.
> 
> Regards,
> David Huen
> 
> _______________________________________________
> biojava-dev mailing list
> biojava-dev@biojava.org
> http://biojava.org/mailman/listinfo/biojava-dev
> 


-- 
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk
