[Biojava-l] Blast-like parsing
Simon Brocklehurst
simon.brocklehurst@CambridgeAntibody.com
Mon, 15 May 2000 17:13:58 +0100
Dear All,
We're nearly ready to put some parsing and visualization code into
biojava (better late than never!). Before we finalize the initial
release, there is now an opportunity to have some input. I can't promise
that we will include every suggestion, but any opinions, constructive,
destructive or otherwise, are very welcome.
A quick recap - as some of you may recall, the idea was to make it
trivial to build applications that use the output from software that
produces Blast-like format e.g. NCBI Blast, WU-Blast, HMMer, DBA etc.
It was decided to write SAX parsers that parse the raw software
output and generate messages that are sent to XML DocumentHandlers.
The BlastLikeDataSetCollection SAXParser currently takes as input raw
output from Blast-like software and sends messages that are consistent
with an XML document conforming to the following DTD. The opportunity
here is to cast your eyes over the DTD and give some feedback.
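To give a feel for what consuming these messages might look like, here is
a minimal sketch of a handler that collects the id, score and expect value
of each HitSummary. It is only an illustration under several assumptions:
it uses the SAX2 DefaultHandler from the standard Java XML API rather than
a SAX1 DocumentHandler, it feeds a small hand-written XML document through
the stock JDK parser as a stand-in for the event stream (the
BlastLikeDataSetCollection parser itself is not shown), and the namespace
URI, class name and sample values are all made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects one line per HitSummary element.
class HitSummaryCollector extends DefaultHandler {
    // Assumed namespace URI; the DTD does not fix a value for xmlns:biojava.
    static final String NS = "http://www.biojava.org";

    // Made-up sample document shaped like the DTD (not validated here).
    static final String SAMPLE =
        "<biojava:BlastLikeDataSetCollection xmlns='" + NS + "' xmlns:biojava='" + NS + "'>"
      + "<biojava:BlastLikeDataSet program='ncbi-blastp' version='2.0.11'>"
      + "<biojava:Header><biojava:RawOutput>BLASTP 2.0.11</biojava:RawOutput></biojava:Header>"
      + "<biojava:Summary>"
      + "<biojava:HitSummary score='120' expectValue='1e-30'>"
      + "<biojava:HitId id='hitA' metaData='none'/></biojava:HitSummary>"
      + "<biojava:HitSummary score='45' expectValue='0.002'>"
      + "<biojava:HitId id='hitB' metaData='none'/></biojava:HitSummary>"
      + "</biojava:Summary>"
      + "</biojava:BlastLikeDataSet>"
      + "</biojava:BlastLikeDataSetCollection>";

    final List<String> summaries = new ArrayList<>();
    private String score;   // non-null only while inside a HitSummary
    private String expect;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (!NS.equals(uri)) {
            return;
        }
        if (local.equals("HitSummary")) {
            // Unprefixed attributes are in no namespace, hence the "" URI.
            score = atts.getValue("", "score");
            expect = atts.getValue("", "expectValue");
        } else if (local.equals("HitId") && score != null) {
            summaries.add(atts.getValue("", "id") + " score=" + score + " E=" + expect);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (NS.equals(uri) && local.equals("HitSummary")) {
            score = null;
            expect = null;
        }
    }

    // Parses the given XML text and returns the collected summary lines.
    static List<String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            HitSummaryCollector handler = new HitSummaryCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.summaries;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        for (String line : collect(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

The handler only looks at HitId elements while a HitSummary is open, so
HitId elements that appear under Hit in the Detail section are ignored.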
Any feedback during the next couple of weeks should be OK - the closer
we get to the end of that period, the less likely it is that we will be
able to incorporate things before the release.
Simon
<!-- BlastLikeDataSetCollection DTD - attempts to provide a
unified model for the output from the following pieces of
bioinformatics sequence similarity search software:
o BLAST (NCBI)
o WU-BLAST (Washington University)
o HMMER (Washington University)
o DBA (Sanger Center)
NB This DTD covers output from the above software, when run
in modes such that the detailed output is based around
pairwise alignments.
This is as opposed to other output formats such as ASN.1.
The root element is a BlastLikeDataSetCollection. This is
described towards the end of the DTD.
================================================================
This DTD is Copyright 1999, 2000 Cambridge Antibody Technology
Group plc (CAT). All Rights Reserved.
Author List:
Primary Author: Simon Brocklehurst (CAT)
Other Authors: Colin H. Hardman (CAT)
Stuart Johnson (CAT)
Tim Dilks (CAT)
================================================================
-->
<!-- PARAMETER ENTITY DECLARATIONS
============================= -->
<!ENTITY % strandType "(plus|minus)">
<!ENTITY % frameNumber "(minus3|minus2|minus1|plus1|plus2|plus3)">
<!ENTITY % alignmentType "(local|global)">
<!ENTITY % queryStrand "queryStrand %strandType;">
<!ENTITY % hitStrand "hitStrand %strandType;">
<!ENTITY % queryFrame "queryFrame %frameNumber;">
<!ENTITY % hitFrame "hitFrame %frameNumber;">
<!-- HMMER sequence and model alignment types - i.e. local or
global -->
<!ENTITY % startPositionOfSequence "startPositionOfSequence
%alignmentType; ">
<!ENTITY % endPositionOfSequence "endPositionOfSequence
%alignmentType; ">
<!ENTITY % startPositionOfModel "startPositionOfModel
%alignmentType; ">
<!ENTITY % endPositionOfModel "endPositionOfModel
%alignmentType; ">
<!-- ELEMENT DECLARATIONS
==================== -->
<!-- ================================================================
-->
<!-- Elements used in more than one section of the data set.
-->
<!-- For example, in both Summary and Detail sections
-->
<!ELEMENT biojava:HitDescription (#PCDATA)>
<!ELEMENT biojava:HitId EMPTY>
<!ATTLIST biojava:HitId
id CDATA #REQUIRED
metaData CDATA #REQUIRED >
<!-- The RawOutput element is used to represent sections of the
output from programs "as is". This enables information from
software to be represented, without being parsed in detail.
-->
<!ELEMENT biojava:RawOutput (#PCDATA)>
<!ATTLIST biojava:RawOutput
xml:space (default|preserve) #IMPLIED >
<!-- ================================================================
-->
<!-- Header section related information
-->
<!ELEMENT biojava:Header (biojava:RawOutput)>
<!-- ================================================================
-->
<!-- Summary section related information
-->
<!ELEMENT biojava:HitSummary (biojava:HitId,biojava:HitDescription?)>
<!ATTLIST biojava:HitSummary
score CDATA #REQUIRED
expectValue CDATA #REQUIRED
numberOfHSPs CDATA #IMPLIED
readingFrame CDATA #IMPLIED
numberOfDomains CDATA #IMPLIED >
<!-- DomainHit and DomainSummary elements are HMMER-specific -->
<!ELEMENT biojava:DomainHit EMPTY>
<!ATTLIST biojava:DomainHit
modelId CDATA #REQUIRED
domainPosition CDATA #REQUIRED
sequenceFrom CDATA #REQUIRED
sequenceTo CDATA #REQUIRED
hmmFrom CDATA #REQUIRED
hmmTo CDATA #REQUIRED
%startPositionOfSequence; #IMPLIED
%endPositionOfSequence; #IMPLIED
%startPositionOfModel; #IMPLIED
%endPositionOfModel; #IMPLIED
score CDATA #REQUIRED
expectValue CDATA #REQUIRED >
<!ELEMENT biojava:DomainSummary (biojava:DomainHit*) >
<!ATTLIST biojava:DomainSummary
domainCount CDATA #REQUIRED >
<!-- End of DomainSummarySection
-->
<!ELEMENT biojava:Summary (biojava:HitSummary*, biojava:DomainSummary?)
>
<!-- ================================================================
-->
<!-- Mainly DetailSection related information
-->
<!ELEMENT biojava:HSPSummary (biojava:RawOutput?)>
<!ATTLIST biojava:HSPSummary
score CDATA #REQUIRED
expectValue CDATA #REQUIRED
numberOfIdentities CDATA #REQUIRED
alignmentSize CDATA #REQUIRED
percentageIdentity CDATA #REQUIRED
numberOfPositives CDATA #IMPLIED
percentagePositives CDATA #IMPLIED
pValue CDATA #IMPLIED
sumPValues CDATA #IMPLIED
HSPCollectionSize CDATA #IMPLIED
numberOfGaps CDATA #IMPLIED
%queryStrand; #IMPLIED
%hitStrand; #IMPLIED
%queryFrame; #IMPLIED
%hitFrame; #IMPLIED >
<!ELEMENT biojava:QuerySequence (#PCDATA)>
<!ATTLIST biojava:QuerySequence
startPosition CDATA #REQUIRED
stopPosition CDATA #REQUIRED >
<!-- A MatchConsensus element represents the consensus information
present in a pairwise alignment produced by Blast-like programs
(i.e. the middle line of the alignment).
-->
<!ELEMENT biojava:MatchConsensus (#PCDATA)>
<!ATTLIST biojava:MatchConsensus
xml:space (default|preserve) #IMPLIED >
<!ELEMENT biojava:HitSequence (#PCDATA)>
<!ATTLIST biojava:HitSequence
startPosition CDATA #REQUIRED
stopPosition CDATA #REQUIRED >
<!-- The BlastLikeAlignment element represents information from the
pairwise alignments produced by Blast-like programs. Rather than
representing the alignment simply as preformatted raw text, it
separates out the information into a QuerySequence, a HitSequence
and a MatchConsensus.
-->
<!ELEMENT biojava:BlastLikeAlignment (biojava:QuerySequence,
biojava:MatchConsensus,
biojava:HitSequence) >
<!ELEMENT biojava:HSP (biojava:HSPSummary, biojava:BlastLikeAlignment)>
<!-- HSPCollections model related groups of HSPs. For example, this
allows all plus strand HSPs to be grouped separately from all
minus strand HSPs
-->
<!ELEMENT biojava:HSPCollection (biojava:HSP+)>
<!ELEMENT biojava:Hit (biojava:HitId, biojava:HitDescription?,
biojava:HSPCollection+)>
<!ATTLIST biojava:Hit
sequenceLength CDATA #REQUIRED >
<!ELEMENT biojava:Detail (biojava:Hit*)>
<!-- ================================================================
-->
<!-- TailSection related information -->
<!ELEMENT biojava:Trailer (biojava:RawOutput)>
<!-- ================================================================
-->
<!-- Relating to overall results of searches
-->
<!ELEMENT biojava:BlastLikeDataSet (biojava:Header,
biojava:Summary?,
biojava:Detail?,
biojava:Trailer?)>
<!ATTLIST biojava:BlastLikeDataSet
program CDATA #REQUIRED
version CDATA #REQUIRED>
<!-- A BlastLikeDataSetCollection contains data from groups of results
obtained from bioinformatics software that produces Blast-like
output. For example, it can model the output from Blast run on
multiple sequences. Or it could be used to group together analyses
on a single sequence obtained from multiple programs.
-->
<!ELEMENT biojava:BlastLikeDataSetCollection (biojava:BlastLikeDataSet+)
>
<!ATTLIST biojava:BlastLikeDataSetCollection
xmlns CDATA #REQUIRED
xmlns:biojava CDATA #REQUIRED >
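
As an illustration of how the BlastLikeAlignment structure declared above
might be used, the following sketch gathers the QuerySequence,
MatchConsensus and HitSequence text of each alignment and recounts the
identities from the two sequences. The same caveats apply as for any
example here: it is a sketch, not part of the release. It assumes a SAX2
DefaultHandler and the stock JDK parser, and the namespace URI, class
name, helper methods and sample data are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects {query, consensus, hit} per BlastLikeAlignment.
class AlignmentCollector extends DefaultHandler {
    // Assumed namespace URI; the DTD does not fix a value for xmlns:biojava.
    static final String NS = "http://www.biojava.org";

    // Made-up sample document shaped like the DTD (not validated here).
    static final String SAMPLE =
        "<biojava:BlastLikeDataSetCollection xmlns='" + NS + "' xmlns:biojava='" + NS + "'>"
      + "<biojava:BlastLikeDataSet program='ncbi-blastp' version='2.0.11'>"
      + "<biojava:Header><biojava:RawOutput>BLASTP 2.0.11</biojava:RawOutput></biojava:Header>"
      + "<biojava:Detail><biojava:Hit sequenceLength='250'>"
      + "<biojava:HitId id='hitA' metaData='none'/>"
      + "<biojava:HSPCollection><biojava:HSP>"
      + "<biojava:HSPSummary score='40' expectValue='0.01' numberOfIdentities='7'"
      + " alignmentSize='8' percentageIdentity='87'/>"
      + "<biojava:BlastLikeAlignment>"
      + "<biojava:QuerySequence startPosition='1' stopPosition='8'>MKTAYIAK</biojava:QuerySequence>"
      + "<biojava:MatchConsensus xml:space='preserve'>MKT YIAK</biojava:MatchConsensus>"
      + "<biojava:HitSequence startPosition='11' stopPosition='18'>MKTGYIAK</biojava:HitSequence>"
      + "</biojava:BlastLikeAlignment>"
      + "</biojava:HSP></biojava:HSPCollection>"
      + "</biojava:Hit></biojava:Detail>"
      + "</biojava:BlastLikeDataSet>"
      + "</biojava:BlastLikeDataSetCollection>";

    final List<String[]> alignments = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String query, consensus, hit;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0); // begin collecting character data for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (!NS.equals(uri)) {
            return;
        }
        switch (local) {
            case "QuerySequence": query = text.toString(); break;
            case "MatchConsensus": consensus = text.toString(); break;
            case "HitSequence": hit = text.toString(); break;
            case "BlastLikeAlignment":
                alignments.add(new String[] {query, consensus, hit});
                break;
            default: break;
        }
    }

    // Counts columns where query and hit carry the same non-gap character.
    static int identities(String[] alignment) {
        String q = alignment[0];
        String h = alignment[2];
        int count = 0;
        for (int i = 0; i < Math.min(q.length(), h.length()); i++) {
            if (q.charAt(i) != '-' && q.charAt(i) == h.charAt(i)) {
                count++;
            }
        }
        return count;
    }

    // Parses the given XML text and returns one String[3] per alignment.
    static List<String[]> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            AlignmentCollector handler = new AlignmentCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.alignments;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        for (String[] a : collect(SAMPLE)) {
            System.out.println(a[0]);
            System.out.println(a[1]);
            System.out.println(a[2]);
            System.out.println("identities=" + identities(a));
        }
    }
}
```

Because the three parts arrive as separate elements, a client can rebuild
the familiar three-line display or work on the sequences directly without
re-parsing preformatted text.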
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com