[Biojava-dev] blast parsing continued
David Huen
smh1008@cus.cam.ac.uk
Sun, 17 Nov 2002 21:05:15 +0000
On Friday 15 Nov 2002 7:28 pm, Doug Rusch wrote:
> Yes an XML parser would be best if I didnt find that the NCBI blast XML
> output option tends to core dump on me. In any case, here is my
> modification of the BlastLikeDataSetCollection DTD which I call
> BlastLikeResultSetCollection. See if this fits with your expectations
> David, if not we can thrash out what should be changed.
I've had a look thru your proposal and have no objections per se.
Could you clarify parts of your proposed DTD please? For the query and hit
sequences you specify begin, end and strand. I take it that these carry the
BJ meanings of the terms? In the case of strand, what actual values do you
intend to use (plus|minus|unknown? +|-|.? whatever)? Next, what does the
attribute 'gap' mean? Do I assume that the hit and query sequence elements
still contain the sequence string, gaps and all, as PCDATA? And with the
same semantics (i.e. when the query is reverse complemented for display in
normal BLAST output, does it still appear in the same form)?
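To make the strand question concrete, here is a minimal sketch of a reader
that tolerates either spelling until the DTD pins one down. The class, enum
and method names are hypothetical; nothing here is taken from BJ or from the
proposed DTD, and the accepted tokens are only the candidates listed above.

```java
import java.util.Locale;

/** Hypothetical helper; not part of BioJava or of the proposed DTD. */
final class StrandNotation {
    enum Strand { PLUS, MINUS, UNKNOWN }

    /** Maps a strand attribute value in word form (plus|minus|unknown) or symbol form (+|-|.). */
    static Strand parse(String token) {
        if (token == null) {
            throw new IllegalArgumentException("strand attribute missing");
        }
        switch (token.trim().toLowerCase(Locale.ROOT)) {
            case "plus":
            case "+":
                return Strand.PLUS;
            case "minus":
            case "-":
                return Strand.MINUS;
            case "unknown":
            case ".":
                return Strand.UNKNOWN;
            default:
                // Fail loudly so an undocumented value is noticed rather than guessed at.
                throw new IllegalArgumentException("unrecognised strand value: " + token);
        }
    }
}
```

Fixing a single set of values in the DTD itself would make the lenient
mapping unnecessary, which is really the point of the question.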
Of greater concern to me is that this proposed DTD is not backwards-compatible
with the previous DTD, rather than being an extension of it, and will have
consequences for both parsers and downstream components (I
think you pointed that out somewhere in one of the earlier threads but I
might be mistaken). As BJ is preparing for a 1.3 final release, this would
not be a good thing to attempt at this stage. It would seem to me that the
safer though perhaps less elegant option would be to develop it such that
the existing DTD is a subset of a new interim DTD proposal in which you
introduce new elements that pass the additional information you propose. I
believe CAT uses the existing parsers in-house, as perhaps do others, and it
does seem an inappropriate time for drastic action on the parsing package.
A significantly different, better DTD can then be left to BJ2 where we can
break things (almost) to our heart's content and perhaps rewrite the
parsing package afresh.
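
As an illustration of why the superset route is safer, here is a minimal
sketch of a downstream consumer that reacts only to the elements it already
knows and skips anything new. The element names (Collection, Hit, Strand) are
hypothetical stand-ins rather than names from the existing or proposed DTD,
and whether the current BJ handlers ignore unknown elements in this way is an
assumption that would need checking.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo; element names do not come from the real DTD. */
final class SupersetDemo {
    // Stand-ins for the vocabulary an existing consumer already understands.
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList("Collection", "Hit"));

    /** Returns the known element names in document order, skipping unrecognised ones. */
    static List<String> knownElements(String xml) {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (KNOWN.contains(qName)) {
                    seen.add(qName);
                }
                // Any other element is treated as an interim extension and ignored.
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
        return seen;
    }
}
```

Feeding this consumer a document with an extra Strand element inside Hit
yields the same result as the document without it, which is the property an
interim superset DTD would rely on. A DTD that renames or restructures the
existing elements would not give that guarantee.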
On another note, is there a statement in the current DTD licensing us to use
it?
Just my $0.02 worth,
David Huen