[Biojava-dev] blast parsing continued

David Huen smh1008@cus.cam.ac.uk
Sun, 17 Nov 2002 21:05:15 +0000


On Friday 15 Nov 2002 7:28 pm, Doug Rusch wrote:
> Yes an XML parser would be best if I didn't find that the NCBI blast XML
> output option tends to core dump on me. In any case, here is my
> modification of the BlastLikeDataSetCollection DTD which I call
> BlastLikeResultSetCollection. See if this fits with your expectations
> David, if not we can thrash out what should be changed.

I've had a look through your proposal and have no objections per se.

Could you clarify parts of your proposed DTD please?  You specify begin.end
and strand for the query and hit sequences.  I take it that these carry the
BJ meanings of the terms?  In the case of strand, what are the actual
values you intend to use (plus|minus|unknown? +|-|.? whatever)?  Next, what
does the attribute 'gap' mean?  Do I assume that the hit and query sequence
elements still contain the sequence string, gaps and all, as PCDATA?  And
with the same semantics (i.e. when the query is reverse complemented for
display in normal BLAST output, does it still appear in the same form)?

Of greater concern to me is that this proposed DTD is a
non-backwards-compatible replacement for, rather than an extension of, the
previous DTD, and will have consequences for both parsers and downstream
components (I think you pointed that out somewhere in one of the earlier
threads but I might be mistaken).  As BJ is preparing for a 1.3 final
release, this would not be a good thing to attempt at this stage.  It would
seem to me that the safer, though perhaps less elegant, option would be to
develop it such that the existing DTD is a subset of a new interim DTD
proposal, in which you introduce new elements that pass the additional
information you propose.  I believe CAT uses the existing parsers in-house,
as perhaps do others, and it does seem an inappropriate time for drastic
action on the parsing package.  A significantly different, better DTD can
then be left to BJ2, where we can break things (almost) to our heart's
content and perhaps rewrite the parsing package afresh.
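
To illustrate why a superset DTD should be the safer route, here is a rough
sketch (in Python for brevity, not BJ code).  The element and attribute
names (Collection, Hit, QuerySequence, QueryExtent, begin, end, strand) are
made up for the example and are not taken from either DTD.  It assumes the
downstream component is a non-validating SAX-style consumer that simply
skips elements it does not recognise; a validating parser would of course
need the interim DTD as well.

```python
import xml.sax

# Hypothetical element names, for illustration only; not copied from either DTD.
OLD_DOC = """<Collection>
  <Hit id="h1"><QuerySequence>ACG-T</QuerySequence></Hit>
</Collection>"""

# Same document plus an extra (hypothetical) element carrying the new information.
NEW_DOC = """<Collection>
  <Hit id="h1"><QuerySequence>ACG-T</QuerySequence>
    <QueryExtent begin="1" end="4" strand="plus"/></Hit>
</Collection>"""


class OldConsumer(xml.sax.ContentHandler):
    """Stands in for a downstream component written against the old DTD."""

    def __init__(self):
        super().__init__()
        self.sequences = []
        self._in_query = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "QuerySequence":
            self._in_query = True
            self._buf = []
        # Any element this consumer does not recognise is silently skipped.

    def characters(self, content):
        if self._in_query:
            self._buf.append(content)

    def endElement(self, name):
        if name == "QuerySequence":
            self.sequences.append("".join(self._buf))
            self._in_query = False


def query_sequences(doc):
    """Return the text of every QuerySequence element in the document."""
    handler = OldConsumer()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.sequences


if __name__ == "__main__":
    print(query_sequences(OLD_DOC))
    print(query_sequences(NEW_DOC))
```

Under those assumptions the old consumer extracts the same sequence string
from both documents, because the additional information travels in a new
element rather than in a changed meaning for an existing one.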

On another note, is there a statement in the current DTD licensing us to use 
it?

Just my $0.02 worth,
David Huen