[Biojava-l] request/proposal for change to BlastLike DTD
Keith James
kdj@sanger.ac.uk
12 Jul 2001 10:42:42 +0100
Hi,
In the course of writing the BlastLike SAX -> SeqSimilaritySearch*
object code I encountered a couple of minor difficulties in
transferring some of the search information (metadata, really) to
XML.
I would like to add optional fields for the query sequence ID, the
name (or whatever identifier is available) of the database searched
and the type of sequence (dna/protein).
1. query seq ID
For Blast this currently appears in the RawOutput element of Header as
the line 'Query= xxxxx'. I've had to add similar lines into the Fasta
RawOutput element to maintain compatibility.
(Fasta -m 10 output doesn't explicitly put the query ID into the
header, but into every hit, so a query seq ID element analogous to
HitId would be welcome, but I can live without it.)
2. database name
Similar situation, only the line is 'Database: xxxx'.
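
To illustrate the workaround described in points 1 and 2, here is a minimal sketch of pulling the two values back out of the RawOutput text. The class and method names are hypothetical (not part of BioJava), and it assumes the query ID is the first whitespace-delimited token after 'Query=' and that the database name is the rest of the 'Database:' line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not BioJava API: scrapes metadata from RawOutput text.
public final class RawHeaderFields {
    // ASSUMPTION: the ID is the first token following "Query=" at line start.
    private static final Pattern QUERY =
        Pattern.compile("^Query=[ \\t]*(\\S+)", Pattern.MULTILINE);
    // ASSUMPTION: the database name runs to the end of the "Database:" line.
    private static final Pattern DATABASE =
        Pattern.compile("^Database:[ \\t]*(.+?)[ \\t]*$", Pattern.MULTILINE);

    private RawHeaderFields() {}

    /** Returns the query ID, or null if no 'Query=' line is present. */
    public static String queryId(String rawOutput) {
        return first(QUERY, rawOutput);
    }

    /** Returns the database name, or null if no 'Database:' line is present. */
    public static String databaseName(String rawOutput) {
        return first(DATABASE, rawOutput);
    }

    private static String first(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

Dedicated QueryId and DatabaseId elements would make this kind of text scraping unnecessary.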
3. sequence types (for both query and hit)
I'm currently resolving this from the program name for Blast, but it's
harder with Fasta because the program name is always the same.
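
For reference, resolving the types from the Blast program name could look roughly like the sketch below. The class name is hypothetical and this is not the actual BioJava code. It reports the types of the underlying query and database sequences, using the "nucleic"/"protein" values proposed further down, and assumes the program name comes first in the string (as in a header line such as 'BLASTN 2.0.11').

```java
import java.util.Locale;

// Hypothetical helper, not BioJava API.
public final class BlastSequenceTypes {
    private BlastSequenceTypes() {}

    /** Returns {querySequenceType, hitSequenceType} for a Blast program name. */
    public static String[] forProgram(String program) {
        String p = program.trim().toLowerCase(Locale.ROOT);
        if (p.startsWith("blastn")) {
            return new String[] {"nucleic", "nucleic"};
        }
        if (p.startsWith("blastp")) {
            return new String[] {"protein", "protein"};
        }
        if (p.startsWith("blastx")) {
            // Translated nucleotide query against a protein database.
            return new String[] {"nucleic", "protein"};
        }
        if (p.startsWith("tblastn")) {
            // Protein query against a translated nucleotide database.
            return new String[] {"protein", "nucleic"};
        }
        if (p.startsWith("tblastx")) {
            // Both query and database are translated nucleotide sequences.
            return new String[] {"nucleic", "nucleic"};
        }
        throw new IllegalArgumentException("Unrecognised Blast program: " + program);
    }
}
```

No equivalent lookup is possible for Fasta, which is why explicit attributes would help.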
Perhaps the Header could be changed from
<!ELEMENT biojava:Header (biojava:RawOutput)>
to
<!ELEMENT biojava:QueryId EMPTY>
<!ATTLIST biojava:QueryId
id CDATA #REQUIRED
metaData CDATA #REQUIRED >
<!ELEMENT biojava:DatabaseId EMPTY>
<!ATTLIST biojava:DatabaseId
id CDATA #REQUIRED
metaData CDATA #REQUIRED >
<!ELEMENT biojava:Header (biojava:RawOutput, biojava:QueryId?, biojava:DatabaseId?)>
and add
<!ENTITY % sequenceType "(nucleic|protein)">
<!ENTITY % querySequenceType "querySequenceType %sequenceType; ">
<!ENTITY % hitSequenceType "hitSequenceType %sequenceType; ">
to go in the HSPSummary
<!ATTLIST biojava:HSPSummary
score CDATA #REQUIRED
expectValue CDATA #REQUIRED
numberOfIdentities CDATA #REQUIRED
alignmentSize CDATA #REQUIRED
percentageIdentity CDATA #REQUIRED
numberOfPositives CDATA #IMPLIED
percentagePositives CDATA #IMPLIED
pValue CDATA #IMPLIED
sumPValues CDATA #IMPLIED
numberOfGaps CDATA #IMPLIED
%queryStrand; #IMPLIED
%hitStrand; #IMPLIED
%queryFrame; #IMPLIED
%hitFrame; #IMPLIED
%querySequenceType; #IMPLIED
%hitSequenceType; #IMPLIED >
As I mentioned above, changing (to add an optional QueryId)
<!ELEMENT biojava:HitSummary (biojava:HitId, biojava:HitDescription?)>
to
<!ELEMENT biojava:HitSummary (biojava:HitId, biojava:QueryId?,
biojava:HitDescription?)>
and
<!ELEMENT biojava:Hit (biojava:HitId, biojava:HitDescription?,
biojava:HSPCollection+)>
to
<!ELEMENT biojava:Hit (biojava:HitId, biojava:QueryId?,
biojava:HitDescription?,
biojava:HSPCollection+)>
would also be useful.
The changes should not invalidate any XML which validated against the
DTD beforehand. How does that sound? This is just a request for
comments - I'm not going to touch the DTD without full agreement, not
least from the original authors :)
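
The backward-compatibility claim can be checked mechanically. Below is a rough sketch using the JDK's validating DOM parser with the proposed Header declarations in an internal DTD subset. The class name is hypothetical, RawOutput is declared as (#PCDATA) purely to keep the example small (an assumption, not the real declaration), and the new child elements are assumed to carry the biojava: prefix like the rest of the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical check, not BioJava API.
public final class HeaderDtdCheck {
    // Internal subset with only the proposed Header declarations.
    // ASSUMPTION: RawOutput simplified to (#PCDATA) for this sketch.
    private static final String DOCTYPE =
          "<!DOCTYPE biojava:Header [\n"
        + "<!ELEMENT biojava:RawOutput (#PCDATA)>\n"
        + "<!ELEMENT biojava:QueryId EMPTY>\n"
        + "<!ATTLIST biojava:QueryId id CDATA #REQUIRED metaData CDATA #REQUIRED>\n"
        + "<!ELEMENT biojava:DatabaseId EMPTY>\n"
        + "<!ATTLIST biojava:DatabaseId id CDATA #REQUIRED metaData CDATA #REQUIRED>\n"
        + "<!ELEMENT biojava:Header (biojava:RawOutput, biojava:QueryId?, biojava:DatabaseId?)>\n"
        + "]>\n";

    private HeaderDtdCheck() {}

    /** Returns true if the Header fragment is valid against the proposed DTD. */
    public static boolean isValid(String headerXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation; namespace-unaware by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity errors are only reported to the handler, so rethrow them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + headerXml)));
            return true;
        } catch (Exception e) {
            // Any parse, validity, or configuration problem counts as invalid here.
            return false;
        }
    }
}
```

A Header containing only RawOutput (the old form) and one that also carries QueryId and DatabaseId should both pass, since the new children are optional.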
Keith
--
-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA