[Biojava-dev] blast parsing continued

Tue, 19 Nov 2002 10:51:18 +0000

On Tuesday 19 Nov 2002 9:35 am, Simon Brocklehurst wrote:
> Doug Rusch wrote:
> > I agree that checking in the code I have now is a problem. Breaking
> >the HMMer, FASTA, and possibly wu-blast parser would be very bad, not to
> >mention that it requires java 1.4. Short of overhauling all the existing
> >tool parsers, there are only a few options that I can see
> >
> > 1) branching
> > 2) creating new packages parallel to the existing parsing code
> >    (search/ssbind/sax)
> > 3) starting a code base for BioJava 2
> >
> > I would like to make my code available for the community to look at,
> > test, and comment on but not at the inconvenience of a large number of
> > biojava's users. Is there a preferred solution?
>
> Doug,
>
> I haven't had a chance to look at your DTD in detail, so I don't know
> how much similarity there is to the original.  But, if it were possible
> to make the changes you need *optional* additions in the DTD, then this
> would allow people to slot your new parser right in to their existing
> ContentHandlers.
>
> To me, that seems the ideal solution.  I'd be surprised if your parser
> didn't work better for NCBI Blast - using the new 1.4 regular
> expressions is a neat way to go. It would be nice for people to be able
> just to plug it in.
>
> As I say, I haven't had a chance to look at the DTD - so this may not be
> possible.
>
I think the suggestion I made of extending the existing DTD as an interim
solution will probably still be useful - it will keep the existing parsers
working while letting the new information that is needed through.

Actually, AFAICT, there are two classes of changes in Doug's proposal:-
i) changes that add functionality, or may represent rational improvements
on the current DTD, but will break things;
ii) changes that rename an existing element/entity without adding
functionality, and will break things.

Under i),
a) QuerySequence and SubjctSequence replace QuerySequence and HitSequence.
The changes see startPosition and stopPosition replaced by a more
Biojava-esque begin, end and strand, plus type and gaps (what is this?).
Some attention will be needed in this area anyway as NCBI Blast XML is truly
bizarre in its use of the Hsp_hit-frame and Hsp_query-frame elements.
Examples: for blastn, when Hsp_hit-frame is -1 (reverse strand), it is the
QUERY strand that is shown reverse complemented, with the from and to
coordinates swapped; for blastx and the tblast programs, irrespective of
whether the frames are forward or reversed, the from and to coordinates are
always forward (i.e. from < to).  Could we define a standard for ourselves
to avoid confusion downstream of the parsers?
b) biojava:QueryInfo, biojava:SbjctInfo, biojava:DatabaseInfo replace 
biojava:HitId, biojava:HitDescription, biojava:QueryId, biojava:DatabaseId.
The main additions here are a description line for the query, and length
attributes.

These are:-
<!ELEMENT biojava:QueryInfo EMPTY>
<!ATTLIST biojava:QueryInfo
                    id             CDATA  #REQUIRED
                    desc           CDATA  #IMPLIED
                    length         CDATA  #IMPLIED
                    metadata       CDATA  #REQUIRED >

<!ELEMENT biojava:SbjctInfo EMPTY>
<!ATTLIST biojava:SbjctInfo
                     id                  CDATA  #REQUIRED
                     desc                CDATA  #IMPLIED
                     length              CDATA  #IMPLIED
                     metaData            CDATA  #REQUIRED >

<!ELEMENT biojava:DatabaseInfo EMPTY>
<!ATTLIST biojava:DatabaseInfo
                    name           CDATA  #REQUIRED
                    letters        CDATA  #IMPLIED
                    entries        CDATA  #IMPLIED
                    metadata       CDATA  #REQUIRED >

c) <biojava:Statistics> introduced to capture the statistics that appear at 
the end of blast output.

Under ii),
a) changes to HSPSummary - 
numberOfIdentities -> identical
alignmentSize -> alignmentLength
numberOfPositives -> similar
strand/frame information has been moved from here into QuerySequence and
SubjctSequence
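
Since the class ii) changes are pure renames, one way to cope with them
without touching existing handlers would be a small lookup that maps the
new attribute names back to the old ones.  A minimal sketch follows; the
class HspSummaryNames and its method are made up for illustration and are
not part of BioJava.  Only the three name pairs listed above are taken from
the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a BioJava class): translates the renamed
// HSPSummary attribute names back to the names existing handlers expect.
final class HspSummaryNames {
    private static final Map<String, String> NEW_TO_OLD = new HashMap<>();
    static {
        NEW_TO_OLD.put("identical", "numberOfIdentities");
        NEW_TO_OLD.put("alignmentLength", "alignmentSize");
        NEW_TO_OLD.put("similar", "numberOfPositives");
    }

    private HspSummaryNames() { }

    // Returns the old name for a renamed attribute; any other name is
    // passed through unchanged.
    static String toOld(String name) {
        String old = NEW_TO_OLD.get(name);
        return old != null ? old : name;
    }
}
```

Names that were not renamed (score, say) come back untouched, so the lookup
can be applied blindly to every attribute.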

-------------------

As I see it, at least for BJ1.3, we can extend the existing DTD to cover 
these cases so we get the additional functionality without breaking things.  
It would be nice to move away from the more loosely defined (and positively 
mangled by NCBI BlastXML) startPosition and stopPosition perhaps at BJ2.

We can extend <biojava:BlastLikeDataSetCollection>:-
<!ATTLIST biojava:BlastLikeDataSetCollection
                 dtdVersion          CDATA #IMPLIED
                 xmlns               CDATA #FIXED ""
                 xmlns:biojava       CDATA #FIXED "http://www.biojava.org" >

where, if dtdVersion is absent, the original DTD subset can be assumed.  For
the sake of argument, we define the new interim DTD as version 2.0.
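
To make the fallback rule concrete, here is a rough sketch of how a
consumer might read dtdVersion with the standard SAX API.  The class name
DtdVersionSniffer is invented for this example, and the label "1.0" for the
original DTD is purely an assumption (the original DTD carries no version
at all).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of BioJava): reports which DTD version
// a BlastLikeDataSetCollection document claims to follow.
final class DtdVersionSniffer extends DefaultHandler {
    // Assumed label for "no dtdVersion attribute present".
    static final String ORIGINAL = "1.0";

    private String version = ORIGINAL;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("biojava:BlastLikeDataSetCollection".equals(qName)) {
            String v = atts.getValue("dtdVersion");
            if (v != null) {
                version = v;   // attribute present: trust it
            }                  // attribute absent: keep the original DTD
        }
    }

    String version() {
        return version;
    }

    // Convenience wrapper: parse a document held in a String.
    // The default factory is not namespace aware, so qName carries the
    // "biojava:" prefix verbatim.
    static String sniff(String xml) throws Exception {
        DtdVersionSniffer handler = new DtdVersionSniffer();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.version();
    }
}
```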

Extend <biojava:HitId> to get the new info
<!ATTLIST biojava:HitId
                     id                  CDATA #REQUIRED
                     metaData            CDATA #REQUIRED
                     desc                CDATA #IMPLIED
                     length              CDATA #IMPLIED >

do likewise with <biojava:QueryId>.

Extend biojava:QuerySequence to include sequence type:-
<!ATTLIST biojava:QuerySequence
                startPosition       CDATA #REQUIRED
                stopPosition        CDATA #REQUIRED
                type                CDATA #IMPLIED >

Do likewise with biojava:HitSequence.
I see this as being less important given that HSPSummary has this kind of 
data.

Extend <biojava:DatabaseId> for the additional info:-
<!ATTLIST biojava:DatabaseId
                    id             CDATA #REQUIRED
                    metaData       CDATA #REQUIRED
                    letters        CDATA #IMPLIED
                    entries        CDATA #IMPLIED >

Add the <biojava:Statistics> element wholesale.

The advantage of doing it this way is that nothing breaks - all existing
downstream stuff continues to work - elements they don't understand get
ignored.  Anything we have written to take advantage of the newer info can
be put in too with minor mods.  It may not be very difficult to modify the
existing parsers to pick up the additional info either.
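
As an illustration of why optional additions are harmless, here is a
minimal sketch of a SAX handler that picks up the proposed desc and length
attributes on biojava:HitId when they are there and carries on when they
are not.  HitIdCollector is a made-up class for this example, not an
existing BioJava handler, and the "id|desc|length" string format is just a
convenience for showing the result.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of BioJava): collects HitId information,
// tolerating both the original and the extended attribute list.
final class HitIdCollector extends DefaultHandler {
    final List<String> hits = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (!"biojava:HitId".equals(qName)) {
            return;   // elements we don't understand are simply skipped
        }
        // getValue returns null for an attribute that is not present,
        // which is how old-style documents show up here.
        String desc = atts.getValue("desc");
        String length = atts.getValue("length");
        hits.add(atts.getValue("id")
                + "|" + (desc == null ? "" : desc)
                + "|" + (length == null ? "" : length));
    }

    // Convenience wrapper: parse a document held in a String.
    static List<String> collect(String xml) throws Exception {
        HitIdCollector handler = new HitIdCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.hits;
    }
}
```

An old handler that only ever asks for id and metaData never sees the new
attributes at all, which is the whole point of making them #IMPLIED.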

The disadvantage is that the vagaries of what start and stop mean remain
with us.  We can however formally define what they mean to BJ now and make
adjustments in our parsers to fit our (naturally more rational) meaning.
We should formally define what we mean by strand/frame too wrt the
sequence depiction we store.
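
By way of example only, one possible convention would be "begin <= end
always, with orientation carried separately as a strand".  The sketch below
assumes exactly that; HspRange is an invented class, not anything in
BioJava, and it deliberately says nothing about WHICH sequence a report
chose to reverse complement, which would still need sorting out per parser.

```java
// Hypothetical value class (not part of BioJava) illustrating one
// candidate convention for HSP coordinates downstream of the parsers.
final class HspRange {
    final int begin;    // always <= end
    final int end;
    final int strand;   // +1 forward, -1 reverse, 0 unknown/not applicable

    private HspRange(int begin, int end, int strand) {
        this.begin = begin;
        this.end = end;
        this.strand = strand;
    }

    // Normalise a reported (from, to, frame) triple.  The sign of the
    // frame gives the strand; the coordinates are reordered so that
    // begin <= end whichever way round the report wrote them.
    static HspRange normalise(int from, int to, int frame) {
        return new HspRange(Math.min(from, to),
                            Math.max(from, to),
                            Integer.signum(frame));
    }
}
```

With this, a report that swaps from and to on the reverse strand and one
that always writes from < to both end up as the same begin/end/strand
triple.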

When moving to BJ2 eventually, there will be much rewriting of stuff in this
area anyway so it could be a good time to make the changes Doug proposes
then.  But I think it will be important to get that DTD right, as changes to
it will be difficult once a significant body of software gets written
around it.  We have an opportunity to do that soon but we get one shot
only.  It would mean, at the very least, that the newly proposed DTD also
covers the domain stuff in the existing DTD in one way or another.

I am personally unconvinced it is worth the pain of breaking existing 
functionality with a substantially changed DTD that might yet have to be 
changed further.  We do need to stabilise as far as possible the BJ1 series 
and make it bulletproof.  Like it or not, development of BJ2 will take away 
resources from BJ1 and BJ1 needs to be as good as possible before it goes 
into maintenance mode and (at least some) developers get distracted with 
BJ2 concerns.  It is a good time however to explore what that DTD should 
look like.  Perhaps we could introduce a new package called experimental in 
CVS?  

One other aspect to consider eventually is whether a DTD is really where we
should be defining what might really be a Java interface.  As it is today
we have:-

Parsers -> DTD -> Middleware -> (SearchContentHandler interface) -> 
SearchContentHandlers

Could we improve/replace the SearchContentHandler interface in BJ2 and just 
have parsers output to it?
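
To show the shape of what I have in mind, here is a toy sketch in which a
parser talks straight to a Java callback interface with no XML stage in
between.  Everything in it is hypothetical: SearchEvents is NOT the
existing SearchContentHandler interface and its methods are invented, and
ToyParser reads a made-up "id<TAB>score" line format rather than real blast
output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface; NOT the real BioJava
// SearchContentHandler, just a stand-in for whatever BJ2 might define.
interface SearchEvents {
    void startHit(String id);
    void hitProperty(String key, String value);
    void endHit();
}

// Toy parser for an invented "id<TAB>score" line format.  It emits
// events directly to the interface instead of going through a DTD.
final class ToyParser {
    private ToyParser() { }

    static void parse(List<String> lines, SearchEvents out) {
        for (String line : lines) {
            String[] fields = line.split("\t");
            out.startHit(fields[0]);
            out.hitProperty("score", fields[1]);
            out.endHit();
        }
    }
}

// Simple consumer that records the events it receives, in order.
final class Recorder implements SearchEvents {
    final List<String> log = new ArrayList<>();

    public void startHit(String id) {
        log.add("start " + id);
    }

    public void hitProperty(String key, String value) {
        log.add(key + "=" + value);
    }

    public void endHit() {
        log.add("end");
    }
}
```

The contract then lives in the interface, where the compiler can check it,
rather than in a DTD that each parser and handler has to agree on by
convention.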

Anyway, these are my thoughts on the issue and I'm not even a minor deity... :-)

Best wishes,
David Huen