[Biojava-dev] blast parsing continued
David Huen
smh1008@cus.cam.ac.uk
Tue, 19 Nov 2002 10:51:18 +0000
On Tuesday 19 Nov 2002 9:35 am, Simon Brocklehurst wrote:
> Doug Rusch wrote:
> > I agree that checking in the code I have now is a problem. Breaking
> >the HMMer, FASTA, and possibly wu-blast parser would be very bad, not to
> >mention that it requires java 1.4. Short of overhauling all the existing
> >tool parsers, there are only a few options that I can see
> >
> > 1) branching
> > 2) creating new packages parallel to the existing parsing code
> > (search/ssbind/sax)
> > 3) starting a code base for BioJava 2
> >
> > I would like to make my code available for the community to look at,
> > test, and comment on but not at the inconvenience of a large number of
> > biojava's users. Is there a preferred solution?
>
> Doug,
>
> I haven't had a chance to look at your DTD in detail, so I don't know
> how much similarity there is to the original. But, if it were possible
> to make the changes you need *optional* additions in the DTD, then this
> would allow people to slot your new parser right in to their existing
> ContentHandlers.
>
> To me, that seems the ideal solution. I'd be surprised if your parser
> didn't work better for NCBI Blast - using the new 1.4 regular
> expressions is a neat way to go. It would be nice for people to be able
> just to plug it in.
>
> As I say, I haven't had a chance to look at the DTD - so this may not be
> possible.
>
I think my earlier suggestion of extending the existing DTD is still a useful
interim solution: it will keep the existing parsers working while letting the
new information that is needed through.
Actually, AFAICT, there are two classes of changes in Doug's proposal:
i) changes that may be rational improvements on the current DTD, or that add
functionality, but will break things;
ii) changes that rename an existing element/entity without adding
functionality, and will break things.
Under i),
a) QuerySequence and SubjctSequence replace QuerySequence and HitSequence.
The changes see startPosition and stopPosition replaced by the more
Biojava-esque begin, end and strand, plus type and gaps (what is this?). Some
attention will be needed in this area anyway, as NCBI Blast XML is truly
bizarre in its use of the Hsp_hit-frame and Hsp_query-frame elements.
Examples: for blastn, when Hsp_hit-frame is -1 (reverse strand), it is the
QUERY strand that is shown reverse complemented, with the from and to
coordinates swapped; for blastx and the tblast programs, irrespective of
whether the frames are forward or reversed, the from and to coordinates are
always forward (i.e. from < to). Could we define a standard for ourselves to
avoid confusion downstream of the parsers?
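To make the idea of a standard concrete, here is a minimal Java sketch of one possible convention: always store begin <= end and carry the orientation separately as a strand. The class and method names are hypothetical (this is not BioJava API), and the rule that a negative frame or descending coordinates means reverse strand is an assumption, not something the NCBI output guarantees.

```java
// Hypothetical sketch, not part of BioJava.
final class NormalizedSpan {
    final int begin;   // always <= end
    final int end;
    final int strand;  // +1 forward, -1 reverse

    private NormalizedSpan(int begin, int end, int strand) {
        this.begin = begin;
        this.end = end;
        this.strand = strand;
    }

    /**
     * Builds a span from the raw coordinates found in a report.
     * Assumption: the span is on the reverse strand if the frame is
     * negative or the coordinates were reported in descending order.
     */
    static NormalizedSpan of(int from, int to, int frame) {
        int strand = (frame < 0 || from > to) ? -1 : 1;
        return new NormalizedSpan(Math.min(from, to), Math.max(from, to), strand);
    }
}
```

Each parser would apply whatever program-specific correction it needs before calling something like this, so that everything downstream sees a single meaning for begin, end and strand.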
b) biojava:QueryInfo, biojava:SbjctInfo, biojava:DatabaseInfo replace
biojava:HitId, biojava:HitDescription, biojava:QueryId, biojava:DatabaseId.
The main additions here are a description line for the query and the length
attributes.
These are:-
<!ELEMENT biojava:QueryInfo EMPTY>
<!ATTLIST biojava:QueryInfo
    id       CDATA #REQUIRED
    desc     CDATA #IMPLIED
    length   CDATA #IMPLIED
    metadata CDATA #REQUIRED >
<!ELEMENT biojava:SbjctInfo EMPTY>
<!ATTLIST biojava:SbjctInfo
    id       CDATA #REQUIRED
    desc     CDATA #IMPLIED
    length   CDATA #IMPLIED
    metaData CDATA #REQUIRED >
<!ELEMENT biojava:DatabaseInfo EMPTY>
<!ATTLIST biojava:DatabaseInfo
    name     CDATA #REQUIRED
    letters  CDATA #IMPLIED
    entries  CDATA #IMPLIED
    metadata CDATA #REQUIRED >
c) <biojava:Statistics> introduced to capture the statistics that appear at
the end of blast output.
Under ii),
a) changes to HSPSummary:
    numberOfIdentities -> identical
    alignmentSize      -> alignmentLength
    numberOfPositives  -> similar
Strand/frame information has been moved from here into QuerySequence and
SubjctSequence.
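Since these are pure renames, a thin translation layer could bridge the two vocabularies. The Java sketch below is illustrative only: the class is hypothetical (not BioJava API) and it assumes the attributes have already been collected into a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical adapter, not part of BioJava.
final class HspSummaryNames {
    // The renames listed above: existing attribute name -> proposed name.
    private static final Map<String, String> OLD_TO_NEW = new LinkedHashMap<>();
    static {
        OLD_TO_NEW.put("numberOfIdentities", "identical");
        OLD_TO_NEW.put("alignmentSize", "alignmentLength");
        OLD_TO_NEW.put("numberOfPositives", "similar");
    }

    /** Returns a copy of attrs with old names translated; other names pass through unchanged. */
    static Map<String, String> toProposed(Map<String, String> attrs) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            out.put(OLD_TO_NEW.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }
}
```

The same table could be inverted to let handlers written against the existing names consume output that uses the proposed ones.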
-------------------
As I see it, at least for BJ1.3, we can extend the existing DTD to cover
these cases so we get the additional functionality without breaking things.
It would be nice to move away from the more loosely defined (and positively
mangled by NCBI BlastXML) startPosition and stopPosition perhaps at BJ2.
We can extend <biojava:BlastLikeDataSetCollection>:-
<!ATTLIST biojava:BlastLikeDataSetCollection
    dtdVersion    CDATA #IMPLIED
    xmlns         CDATA #FIXED ""
    xmlns:biojava CDATA #FIXED "http://www.biojava.org" >
where if dtdVersion is absent, the original DTD subset can be assumed. For
the sake of argument, we define the new interim DTD as version 2.0.
Extend <biojava:HitId> to get the new info:-
<!ATTLIST biojava:HitId
    id       CDATA #REQUIRED
    metaData CDATA #REQUIRED
    desc     CDATA #IMPLIED
    length   CDATA #IMPLIED >
Do likewise with <biojava:QueryId>.
Extend biojava:QuerySequence to include sequence type:-
<!ATTLIST biojava:QuerySequence
    startPosition CDATA #REQUIRED
    stopPosition  CDATA #REQUIRED
    type          CDATA #IMPLIED >
Do likewise with biojava:HitSequence.
I see this as being less important given that HSPSummary has this kind of
data.
Extend <biojava:DatabaseId> for the additional info:-
<!ATTLIST biojava:DatabaseId
    id       CDATA #REQUIRED
    metaData CDATA #REQUIRED
    letters  CDATA #IMPLIED
    entries  CDATA #IMPLIED >
Add wholesale the <biojava:Statistics> element.
The advantage of doing it this way is that nothing breaks - all existing
downstream stuff continues to work - elements they don't understand get
ignored. Anything we have written to take advantage of the newer info can
be put in too with minor mods. It may not be very difficult to modify
existing parsers to pick up the additional info too.
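As a rough illustration of why optional additions are harmless, here is a small Java SAX handler that treats the new attributes as optional and silently skips elements it does not know. The handler class is hypothetical (not BioJava code), and "1.0" is only an assumed label for "dtdVersion absent, original DTD".

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of BioJava.
final class TolerantHitHandler extends DefaultHandler {
    String dtdVersion = "1.0"; // assumed label for the original DTD
    String hitId;
    String hitDesc;            // stays null when the producer omits desc
    int hitLength = -1;        // -1 means "not reported"

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("biojava:BlastLikeDataSetCollection".equals(qName)) {
            String v = atts.getValue("dtdVersion");
            if (v != null) dtdVersion = v;
        } else if ("biojava:HitId".equals(qName)) {
            hitId = atts.getValue("id");
            hitDesc = atts.getValue("desc");        // optional in the interim DTD
            String len = atts.getValue("length");   // optional in the interim DTD
            if (len != null) hitLength = Integer.parseInt(len);
        }
        // Any other element, e.g. biojava:Statistics, is simply ignored.
    }

    /** Parses an XML string with the default (non-namespace-aware) SAX parser. */
    static TolerantHitHandler parse(String xml) {
        try {
            TolerantHitHandler h = new TolerantHitHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A handler written before the extension never asks for desc or length, so it behaves exactly as before; one written afterwards sees null or -1 when fed output from an older parser.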
The disadvantage is that the vagaries of what start and stop mean remain
with us. We can however formally define what they mean to BJ now and make
adjustments in our parsers to fit our (naturally more rational) meaning.
We should formally define what we mean by strand/frame too, with respect to
the sequence depiction we store.
When moving to BJ2 eventually, there will be much rewriting of stuff in this
area anyway so it could be a good time to make the changes Doug proposes
then. But I think it will be important to get that DTD right as changes to
it will be difficult once a significant body of software gets written
around it. We have an opportunity to do that soon but we get one shot
only. It would mean at the very least that the new proposed DTD covers the
domain stuff in the existing DTD too in one way or another.
I am personally unconvinced it is worth the pain of breaking existing
functionality with a substantially changed DTD that might yet have to be
changed further. We do need to stabilise as far as possible the BJ1 series
and make it bulletproof. Like it or not, development of BJ2 will take away
resources from BJ1 and BJ1 needs to be as good as possible before it goes
into maintenance mode and (at least some) developers get distracted with
BJ2 concerns. It is a good time however to explore what that DTD should
look like. Perhaps we could introduce a new package called experimental in
CVS?
One other aspect to consider eventually is whether a DTD is really where we
should be defining what might really be a Java interface. As it is today
we have:-
Parsers -> DTD -> Middleware -> (SearchContentHandler interface) ->
SearchContentHandlers
Could we improve/replace the SearchContentHandler interface in BJ2 and just
have parsers output to it?
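For the sake of discussion, a parser writing straight to a Java interface might look roughly like the sketch below. HspListener is a made-up, deliberately tiny interface and is NOT the existing SearchContentHandler; the regular expression assumes the familiar "Identities = n/m" line of text-format Blast output and uses the java.util.regex package that arrived with 1.4.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical event interface, NOT the real SearchContentHandler.
interface HspListener {
    void hsp(int identical, int alignmentLength);
}

// Toy parser that calls the interface directly, with no intermediate XML events.
final class ToyHspParser {
    // Assumed line shape: "Identities = 45/50 (90%)"
    private static final Pattern IDENTITIES =
        Pattern.compile("Identities\\s*=\\s*(\\d+)/(\\d+)");

    /** Reports one hsp(...) call for every Identities line found in the report. */
    static void parse(String report, HspListener listener) {
        Matcher m = IDENTITIES.matcher(report);
        while (m.find()) {
            listener.hsp(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
        }
    }
}
```

The trade-off is that the contract then lives in Java types rather than in a DTD, so it gets checked by the compiler but is no longer available to non-Java consumers of the SAX stream.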
Anyway, these are my thoughts on the issue and I'm not even a minor deity... :-)
Best wishes,
David Huen