[Biojava-l] request/proposal for change to BlastLike DTD

Keith James kdj@sanger.ac.uk
12 Jul 2001 10:42:42 +0100


Hi,

In the course of writing the BlastLike SAX -> SeqSimilaritySearch*
object code I encountered a couple of minor difficulties in
transferring some of the search information (metadata, really) to
XML.

I would like to add optional fields for the query sequence ID, the
name (or whatever identifier is available) of the database searched
and the type of sequence (dna/protein).

1. query seq ID

 For Blast this currently appears in the RawOutput element of Header
 as the line 'Query= xxxxx'. I've had to add similar lines to the
 Fasta RawOutput element to maintain compatibility.

 (Fasta -m 10 output doesn't explicitly put the query ID into the
 header, but into every hit, so a query seq ID element analogous to
 HitId would be welcome, but I can live without it)

2. database name

 Similar situation, only the line is 'Database: xxxx'. 

3. sequence types (for both query and hit)

 I'm currently resolving this from the program name for Blast, but it's
 harder with Fasta because the program name is always the same.
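
To make the three points above concrete, here is a rough sketch (in
Python for brevity, not the actual BioJava code) of the workarounds
described: pulling the 'Query=' and 'Database:' lines out of the raw
header text, and resolving sequence types from the Blast program name.
The function names and the sample text in the usage note are invented
for illustration; only the line prefixes and the standard Blast program
names come from real output.

```python
import re

# Header lines as they appear in the raw output text.
QUERY_LINE = re.compile(r"^[ \t]*Query=[ \t]*(\S+)", re.MULTILINE)
DATABASE_LINE = re.compile(r"^[ \t]*Database:[ \t]*(.+?)[ \t]*$", re.MULTILINE)

# Blast program name -> (query sequence type, hit sequence type).
# These are the types of the sequences supplied, before any translation.
BLAST_SEQUENCE_TYPES = {
    "blastn": ("nucleic", "nucleic"),
    "blastp": ("protein", "protein"),
    "blastx": ("nucleic", "protein"),
    "tblastn": ("protein", "nucleic"),
    "tblastx": ("nucleic", "nucleic"),
}


def header_ids(raw_output):
    """Return (query id, database name); either is None if its line is absent."""
    query = QUERY_LINE.search(raw_output)
    database = DATABASE_LINE.search(raw_output)
    return (
        query.group(1) if query else None,
        database.group(1) if database else None,
    )


def sequence_types(program_line):
    """Return (query type, hit type) for a Blast program line, else None."""
    words = program_line.split()
    if not words:
        return None
    # The program name is the first word, e.g. 'BLASTX' in 'BLASTX 2.0.11'.
    return BLAST_SEQUENCE_TYPES.get(words[0].lower())
```

For example, header_ids("Query= seq1 test\nDatabase: db1\n") gives
("seq1", "db1"), while sequence_types() has nothing to go on for Fasta
and returns None, which is the gap the proposed attributes would fill.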


Perhaps the Header could be changed from

<!ELEMENT biojava:Header (biojava:RawOutput)>

to

<!ELEMENT biojava:QueryId EMPTY>
<!ATTLIST biojava:QueryId
                       id           CDATA #REQUIRED
                       metaData     CDATA #REQUIRED >

<!ELEMENT biojava:DatabaseId EMPTY>
<!ATTLIST biojava:DatabaseId
                          id           CDATA #REQUIRED
                          metaData     CDATA #REQUIRED >

<!ELEMENT biojava:Header (biojava:RawOutput, biojava:QueryId?,
                          biojava:DatabaseId?)>


and add

<!ENTITY % sequenceType    "(nucleic|protein)">

<!ENTITY % querySequenceType "querySequenceType %sequenceType; ">
<!ENTITY % hitSequenceType   "hitSequenceType   %sequenceType; ">

to go in the HSPSummary

<!ATTLIST biojava:HSPSummary
                score               CDATA #REQUIRED
                expectValue         CDATA #REQUIRED
                numberOfIdentities  CDATA #REQUIRED
                alignmentSize       CDATA #REQUIRED
                percentageIdentity  CDATA #REQUIRED
                numberOfPositives   CDATA #IMPLIED
                percentagePositives CDATA #IMPLIED
                pValue              CDATA #IMPLIED
                sumPValues          CDATA #IMPLIED
                numberOfGaps        CDATA #IMPLIED
                %queryStrand;             #IMPLIED
                %hitStrand;               #IMPLIED
                %queryFrame;              #IMPLIED
                %hitFrame;                #IMPLIED 
                %querySequenceType;       #IMPLIED
                %hitSequenceType;         #IMPLIED >
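
As a rough illustration of how a consumer might read these optional
fields, here is a minimal SAX handler sketch (Python's xml.sax rather
than the BioJava code). The Example wrapper element, the handler class
name and all attribute values in the sample are made up; only the
element and attribute names come from the proposal above. Anything
absent from the document simply comes back as None.

```python
import xml.sax


class SearchMetadataHandler(xml.sax.ContentHandler):
    """Collect the proposed optional metadata from a BlastLike event stream."""

    def __init__(self):
        super().__init__()
        self.query_id = None
        self.database_id = None
        self.hsp_types = []  # one (query type, hit type) pair per HSPSummary
        self._in_header = False

    def startElement(self, name, attrs):
        if name == "biojava:Header":
            self._in_header = True
        elif name == "biojava:QueryId" and self._in_header:
            self.query_id = attrs.get("id")
        elif name == "biojava:DatabaseId" and self._in_header:
            self.database_id = attrs.get("id")
        elif name == "biojava:HSPSummary":
            # Both attributes are #IMPLIED, so either may be missing (None).
            self.hsp_types.append(
                (attrs.get("querySequenceType"), attrs.get("hitSequenceType"))
            )

    def endElement(self, name):
        if name == "biojava:Header":
            self._in_header = False


def read_metadata(xml_bytes):
    """Parse the bytes and return the populated handler."""
    handler = SearchMetadataHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler


# Hypothetical sample: 'Example' is a stand-in root, all values are invented.
SAMPLE = b"""<Example>
  <biojava:Header>
    <biojava:RawOutput/>
    <biojava:QueryId id="query1" metaData="none"/>
    <biojava:DatabaseId id="testdb" metaData="none"/>
  </biojava:Header>
  <biojava:HSPSummary score="100" expectValue="1e-10"
      numberOfIdentities="50" alignmentSize="60" percentageIdentity="83.3"
      querySequenceType="nucleic" hitSequenceType="protein"/>
  <biojava:HSPSummary score="90" expectValue="1e-8"
      numberOfIdentities="40" alignmentSize="60" percentageIdentity="66.7"/>
</Example>"""
```

Calling read_metadata(SAMPLE) fills in query_id, database_id and one
type pair per HSPSummary; the second HSPSummary, which omits the new
attributes, yields (None, None).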


As I mentioned above, adding an optional QueryId by changing

<!ELEMENT biojava:HitSummary (biojava:HitId, biojava:HitDescription?)>

to

<!ELEMENT biojava:HitSummary (biojava:HitId, biojava:QueryId?,
                              biojava:HitDescription?)>

and

<!ELEMENT biojava:Hit (biojava:HitId, biojava:HitDescription?,
                       biojava:HSPCollection+)>

to

<!ELEMENT biojava:Hit (biojava:HitId, biojava:QueryId?,
                       biojava:HitDescription?,
                       biojava:HSPCollection+)>

would also be useful.

Because every new element and attribute is optional, the changes should
not invalidate any XML which validated against the DTD beforehand. How
does that sound? This is just a request for
comments - I'm not going to touch the DTD without full agreement, not
least from the original authors :)
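
The backward-compatibility claim can be checked by hand for the three
changed content models. The sketch below is not a DTD validator: it is
a small hand-rolled check that turns a flat sequence model into a
regular expression over child element names, assuming the new children
carry the biojava: prefix like the rest of the DTD. The helper names
are invented for illustration.

```python
import re


def model_to_regex(model):
    """Translate a flat DTD sequence such as '(a, b?, c+)' into a regex.

    Only comma-separated names with an optional ?, * or + are handled;
    nested groups and '|' choices are outside this sketch.
    """
    text = " ".join(model.split())  # collapse line breaks in the model
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("expected a parenthesised sequence: " + model)
    pieces = []
    for item in text[1:-1].split(","):
        match = re.fullmatch(r"([A-Za-z_][\w:.-]*)([?*+]?)", item.strip())
        if match is None:
            raise ValueError("unsupported content particle: " + item)
        name, occurrence = match.groups()
        # Each child name is followed by one space in the encoded string.
        pieces.append("(?:" + re.escape(name) + " )" + occurrence)
    return re.compile("".join(pieces))


def accepts(model, children):
    """True if the list of child element names satisfies the content model."""
    encoded = "".join(child + " " for child in children)
    return model_to_regex(model).fullmatch(encoded) is not None


OLD_HEADER = "(biojava:RawOutput)"
NEW_HEADER = "(biojava:RawOutput, biojava:QueryId?, biojava:DatabaseId?)"

OLD_HIT_SUMMARY = "(biojava:HitId, biojava:HitDescription?)"
NEW_HIT_SUMMARY = "(biojava:HitId, biojava:QueryId?, biojava:HitDescription?)"

OLD_HIT = "(biojava:HitId, biojava:HitDescription?, biojava:HSPCollection+)"
NEW_HIT = """(biojava:HitId, biojava:QueryId?,
             biojava:HitDescription?,
             biojava:HSPCollection+)"""
```

Any child sequence accepted by an old model is also accepted by the
corresponding new one, e.g. accepts(NEW_HIT, ["biojava:HitId",
"biojava:HSPCollection"]) holds, whereas the old models reject
documents that use QueryId or DatabaseId.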


Keith

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA