[Biojava-dev] blast parsing continued

Doug Rusch drusch@tcag.org
Fri, 15 Nov 2002 14:28:21 -0500


Yes, an XML parser would be best if I didn't find that the NCBI BLAST XML output option tends to core dump on me. In any case, here is my modification of the BlastLikeDataSetCollection DTD, which I call BlastLikeResultSetCollection. See if this fits your expectations, David; if not, we can thrash out what should be changed.

<!-- BlastLikeResultSetCollection DTD - this is a heavily modified version 
     of the BlastLikeDataSetCollection DTD. It is currently under
     development but hopefully will serve as a unified DTD for
     a variety of analysis tools including:

                o BLAST                (NCBI)                         
                o WU-BLAST             (Washington University)        
                o HMMER                (Washington University)        
                o DBA                  (Sanger Center)                
                o Genewise             (Sanger Center)                
                o Sim4                 (Pennsylvania State University)                
                                                                      
     NB This DTD covers output from the above software, when run      
     in modes such that the detailed output is based around           
     pairwise alignments.          

     This is as opposed to other output formats such as ASN.1.

     The root element is a BlastLikeResultSetCollection.  This is
     described towards the end of the DTD.
     ================================================================ 
     The BlastLikeDataSetCollection DTD is Copyright 1999, 2000, 2001 Cambridge
     Antibody Technology Group plc (CAT). All Rights Reserved.                            

     The BlastLikeResultSetCollection DTD is Copyright 2002 The Center for the
     Advancement of Genomics (TCAG). All Rights Reserved.
                                                                      
     Version 0.5

     Author List for BlastLikeResultSetCollection:
       Primary Author: Douglas Rusch      (TCAG) 

     Author List for BlastLikeDataSetCollection:                                                     
       Primary Author: Simon Brocklehurst (CAT)                       
       Other Authors:  Colin H. Hardman   (CAT)                       
                       Stuart Johnson     (CAT)                       
                       Tim Dilks          (CAT)                       
                       Keith James        (Sanger Center)
     ================================================================ -->

<!-- PARAMETER ENTITY DECLARATIONS 
     ============================= -->

<!-- ELEMENT DECLARATIONS
     ==================== -->

<!-- The RawOutput element is used to represent sections of the
     output from programs "as is".  This enables information from
     software to be represented, without being parsed in detail.
                                                                      -->
<!ELEMENT biojava:RawOutput (#PCDATA)>
<!ATTLIST biojava:RawOutput
                     xml:space       (default|preserve) #IMPLIED >

<!-- ================================================================ -->
<!-- Elements for Query, Subject, and Database information            -->
<!-- Changes include the addition of the description or definition    -->
<!-- line and the length (in letters) of the subject and query        -->
<!-- sequences. For the database, length in letters and number of     -->
<!-- sequences have been added.                                       -->
<!-- Why is there a metadata field? How is this supposed to be used?? -->
<!-- Parsers seem to ignore this attribute.                           -->

<!ELEMENT biojava:QueryInfo EMPTY>
<!ATTLIST biojava:QueryInfo
                    id             CDATA  #REQUIRED
                    desc           CDATA  #IMPLIED
                    length         CDATA  #IMPLIED
                    metadata       CDATA  #REQUIRED >

<!ELEMENT biojava:SbjctInfo EMPTY>
<!ATTLIST biojava:SbjctInfo
                     id                  CDATA  #REQUIRED
                     desc                CDATA  #IMPLIED
                     length              CDATA  #IMPLIED
                     metaData            CDATA  #REQUIRED >

<!ELEMENT biojava:DatabaseInfo EMPTY>
<!ATTLIST biojava:DatabaseInfo
                    name           CDATA  #REQUIRED
                    letters        CDATA  #IMPLIED
                    entries        CDATA  #IMPLIED
                    metadata       CDATA  #REQUIRED >

<!-- ================================================================ -->
<!-- Mainly HSPSummary related information derived from HitSummary.   -->
<!-- Neither of these names seems correct, perhaps MatchSummary       -->
<!-- would be best. Changes include removing a count of HSPs and      -->
<!-- reading frame. Reading frame is easily derived from the          -->
<!-- coordinates of the alignment. Also removed sumProbability value  -->
<!-- though this should probably be kept. Added similarity count.     -->

<!ELEMENT biojava:HSPSummary EMPTY>
<!ATTLIST biojava:HSPSummary
                score               CDATA #REQUIRED
                bitScore            CDATA #IMPLIED
                expectValue         CDATA #IMPLIED
                identical           CDATA #IMPLIED
                alignmentLength     CDATA #IMPLIED
                similar             CDATA #IMPLIED
                pValue              CDATA #IMPLIED
                sumPValues          CDATA #IMPLIED >

<!-- ================================================================ -->
<!-- Elements for Query, Subject, and Match alignment information     -->

<!ELEMENT biojava:QuerySequence (#PCDATA)>
<!ATTLIST biojava:QuerySequence 
                begin           CDATA #REQUIRED
                end             CDATA #REQUIRED
                strand          CDATA #REQUIRED
                type            CDATA #IMPLIED
                gaps            CDATA #IMPLIED >

<!-- A MatchConsensus element represents the consensus information
     present in a pairwise alignment produced by Blast-like programs
     (i.e. the middle line of the alignment).                          -->

<!ELEMENT biojava:MatchConsensus (#PCDATA)>
<!ATTLIST biojava:MatchConsensus
                     xml:space       (default|preserve) #IMPLIED >


<!ELEMENT biojava:SbjctSequence (#PCDATA)>
<!ATTLIST biojava:SbjctSequence 
                begin           CDATA #REQUIRED
                end             CDATA #REQUIRED
                strand          CDATA #REQUIRED
                type            CDATA #IMPLIED
                gaps            CDATA #IMPLIED >

<!-- The BlastLikeAlignment element represents information from the
     pairwise alignments produced by Blast-like programs. Rather than
     representing the alignment simply as preformatted raw text, it
     separates out the information into a QuerySequence, a
     MatchConsensus and a SbjctSequence.                               -->

<!ELEMENT biojava:BlastLikeAlignment (biojava:QuerySequence,
                                      biojava:MatchConsensus,
                                      biojava:SbjctSequence) >

<!ELEMENT biojava:HSP (biojava:HSPSummary, biojava:BlastLikeAlignment?)>

<!-- HSPCollections model related groups of HSPs. For example, this
     allows all plus strand HSPs to be grouped separately from all
     minus strand HSPs.                                                -->

<!ELEMENT biojava:HSPCollection (biojava:HSP+)>

<!-- A hit, besides containing the subject and alignment information
     should also hold things like frameshifts where it is assumed that
     a frameshift terminates a given match or HSP                      -->

<!ELEMENT biojava:Hit (biojava:SbjctInfo, biojava:HSPCollection+)>
<!ATTLIST biojava:Hit >

<!ELEMENT biojava:Detail (biojava:Hit*)>

<!-- ================================================================ -->
<!-- Statistics found at end of blast                                 -->

<!ELEMENT biojava:KAStats EMPTY>
<!ATTLIST biojava:KAStats
                    K              CDATA  #REQUIRED
                    H              CDATA  #REQUIRED
                    lambda         CDATA  #REQUIRED >

<!ELEMENT biojava:GappedKAStats EMPTY>
<!ATTLIST biojava:GappedKAStats
                    K              CDATA  #REQUIRED
                    H              CDATA  #REQUIRED
                    lambda         CDATA  #REQUIRED >

<!ELEMENT biojava:SearchMatrix EMPTY>
<!ATTLIST biojava:SearchMatrix
                    name           CDATA  #REQUIRED
                    matchScore     CDATA  #IMPLIED
                    mismatchScore  CDATA  #IMPLIED >

<!ELEMENT biojava:GapPenalties EMPTY>
<!ATTLIST biojava:GapPenalties
                    gapOpen        CDATA  #REQUIRED
                    gapExtend      CDATA  #REQUIRED >

<!ELEMENT biojava:SearchSpaceStats EMPTY>
<!ATTLIST biojava:SearchSpaceStats
                    effectiveSpace CDATA  #REQUIRED
                    usedSpace      CDATA  #REQUIRED >

<!ELEMENT biojava:Statistics (biojava:KAStats, 
                           biojava:GappedKAStats,
                           biojava:SearchMatrix,
                           biojava:GapPenalties,
                           biojava:SearchSpaceStats)>

<!-- ================================================================ -->
<!-- Relating to overall results of searches                          -->

<!ELEMENT biojava:Header (biojava:RawOutput?, biojava:QueryInfo?, biojava:DatabaseInfo?)>

<!ELEMENT biojava:BlastLikeResultSet (biojava:Header,
                                      biojava:Summary?,
                                      biojava:Detail?,
                                      biojava:Statistics?)>
<!ATTLIST biojava:BlastLikeResultSet
                 program             CDATA #REQUIRED
                 version             CDATA #REQUIRED>

<!-- A BlastLikeResultSetCollection contains data from groups of results
     obtained from  bioinformatics software that produces Blast-like 
     output. For example, it can model the output from Blast run on 
     multiple sequences. Or it could be used to group together analyses
     on a single sequence obtained from multiple programs.             -->

<!ELEMENT biojava:BlastLikeResultSetCollection (biojava:BlastLikeResultSet+) >
<!ATTLIST biojava:BlastLikeResultSetCollection
                 xmlns               CDATA #FIXED ""
                 xmlns:biojava       CDATA #FIXED "http://www.biojava.org" >
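
To make the nesting concrete, here is a minimal Python sketch (standard library only) that reads a hand-written instance of the DTD above and pulls out one row per HSP. The sample document and every value in it (ids, score, coordinates, the "plus" strand label, the program and version strings) are made-up placeholders rather than real BLAST output, and hsp_rows is a hypothetical helper name. ElementTree does not validate against the DTD, so this only illustrates how the elements nest.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:biojava attribute declared in the DTD.
NS = {"biojava": "http://www.biojava.org"}

# Hypothetical instance document; all values are illustrative placeholders.
SAMPLE = """\
<biojava:BlastLikeResultSetCollection xmlns:biojava="http://www.biojava.org">
  <biojava:BlastLikeResultSet program="blastn" version="placeholder">
    <biojava:Header/>
    <biojava:Detail>
      <biojava:Hit>
        <biojava:SbjctInfo id="sbjct1" length="500" metaData="none"/>
        <biojava:HSPCollection>
          <biojava:HSP>
            <biojava:HSPSummary score="40" alignmentLength="20"/>
            <biojava:BlastLikeAlignment>
              <biojava:QuerySequence begin="1" end="20" strand="plus">ACGTACGTACGTACGTACGT</biojava:QuerySequence>
              <biojava:MatchConsensus xml:space="preserve">||||||||||||||||||||</biojava:MatchConsensus>
              <biojava:SbjctSequence begin="101" end="120" strand="plus">ACGTACGTACGTACGTACGT</biojava:SbjctSequence>
            </biojava:BlastLikeAlignment>
          </biojava:HSP>
        </biojava:HSPCollection>
      </biojava:Hit>
    </biojava:Detail>
  </biojava:BlastLikeResultSet>
</biojava:BlastLikeResultSetCollection>
"""


def hsp_rows(xml_text):
    """Yield (subject id, score, query begin, query end) for each HSP."""
    root = ET.fromstring(xml_text)
    for hit in root.iterfind(".//biojava:Hit", NS):
        sbjct_id = hit.find("biojava:SbjctInfo", NS).get("id")
        for hsp in hit.iterfind("biojava:HSPCollection/biojava:HSP", NS):
            summary = hsp.find("biojava:HSPSummary", NS)
            query = hsp.find("biojava:BlastLikeAlignment/biojava:QuerySequence", NS)
            # BlastLikeAlignment is optional in the DTD, so query may be None.
            begin = int(query.get("begin")) if query is not None else None
            end = int(query.get("end")) if query is not None else None
            yield sbjct_id, float(summary.get("score")), begin, end


rows = list(hsp_rows(SAMPLE))
print(rows)
```

The Header is left empty here because all of its children are optional; a real result set would normally carry QueryInfo and DatabaseInfo as well.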

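On the HSPSummary comment that reading frame can be derived from the alignment coordinates: a rough Python sketch of one way to do it is below. It assumes 1-based nucleotide coordinates, assumes the strand attribute holds "plus" or "minus" (the DTD only says CDATA), and follows what is believed to be the usual NCBI-style signed-frame convention; reading_frame is a hypothetical helper name. Note that the minus strand case also needs the sequence length, which is one reason the length attribute on QueryInfo and SbjctInfo is useful. It only makes sense for translated searches.

```python
def reading_frame(begin, end, strand, seq_length):
    """Return a signed frame (+1..+3 or -1..-3) for a nucleotide range.

    Assumptions (not stated in the DTD): coordinates are 1-based and
    inclusive, and strand is the string "plus" or "minus".
    """
    low, high = min(begin, end), max(begin, end)
    if strand == "plus":
        # Frame is set by the offset of the first base from the sequence start.
        return (low - 1) % 3 + 1
    if strand == "minus":
        # On the reverse strand, count the offset from the sequence end.
        return -((seq_length - high) % 3 + 1)
    raise ValueError("unexpected strand value: %r" % (strand,))


print(reading_frame(5, 34, "plus", 90))
print(reading_frame(90, 61, "minus", 90))
```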

-----Original Message-----
From:	David Huen [mailto:smh1008@cus.cam.ac.uk]
Sent:	Fri 11/15/02 11:49 AM
To:	Doug Rusch; Keith James
Cc:	biojava-dev@biojava.org
Subject:	Re: [Biojava-dev] blast parsing continued
Could I have a copy of whatever DTD you might settle upon please?

I have a NCBI Blast XML parser that I use that I'd like to check in and an 
adaptor to implement to make the events match those expected by downstream 
builders.

Regards,
David Huen