[Biojava-dev] blast parsing continued : what is gaps?

Tue, 19 Nov 2002 11:30:34 -0500

Just a quickie email,

For those of you wondering what the gaps attribute means on the QuerySequence and SbjctSequence elements: it is the number of gaps introduced in the subject and query strands to generate the alignment (so, in BLAST output, a count of the number of dashes).
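
As a rough illustration of that definition, the gaps value for one strand could be recomputed from its aligned sequence string by counting dashes. This is only a sketch: the class and method names (GapCounter, countGaps) are hypothetical and are not part of the parser.

```java
// Hypothetical helper, not part of BioJava: counts the '-' characters
// in one aligned strand of an HSP.
final class GapCounter {

    static int countGaps(String alignedSeq) {
        int gaps = 0;
        for (int i = 0; i < alignedSeq.length(); i++) {
            if (alignedSeq.charAt(i) == '-') {
                gaps++;
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Made-up aligned strand with three dashes in it.
        System.out.println(countGaps("ACG--TTA-C"));
    }
}
```

The count is taken per strand, so the query and subject strings of the same HSP would each be passed in separately.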

I think there should be an additional attribute called ambiguous on the Sequence elements. This would give a count of the number of ambiguous characters in the alignment (N's in blastn and X's in the other blasts). Otherwise it is impossible to tell whether a low identity was due to low-complexity masking (-F T) or to mismatches without storing the actual alignment.
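
A minimal sketch of how such an ambiguous count might be computed from an aligned strand. The names (AmbiguityCounter, countAmbiguous) and the boolean flag are assumptions for illustration, and matching lower-case n/x as well is a choice made here, not something the proposal specifies.

```java
// Hypothetical helper, not part of BioJava: counts ambiguous symbols
// in one aligned strand of an HSP.
final class AmbiguityCounter {

    // nucleotide == true  -> count N (blastn alignments)
    // nucleotide == false -> count X (the other BLAST programs)
    static int countAmbiguous(String alignedSeq, boolean nucleotide) {
        char symbol = nucleotide ? 'N' : 'X';
        int count = 0;
        for (int i = 0; i < alignedSeq.length(); i++) {
            // Assumption: treat lower-case n/x the same as upper-case.
            if (Character.toUpperCase(alignedSeq.charAt(i)) == symbol) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countAmbiguous("ACGNNNNTA-C", true));
        System.out.println(countAmbiguous("MKXXXLV", false));
    }
}
```

With this count stored next to gaps, a low identity could be compared against the number of masked positions without keeping the alignment itself.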

Hope this clarifies things (yes, I am updating the documentation as well!)
Doug

-----Original Message-----
From:	David Huen [mailto:david.huen@ntlworld.com]
Sent:	Tue 11/19/02 5:51 AM
To:	Simon Brocklehurst; Doug Rusch
Cc:	biojava-dev@biojava.org
Subject:	Re: [Biojava-dev] blast parsing continued
On Tuesday 19 Nov 2002 9:35 am, Simon Brocklehurst wrote:

a) QuerySequence and SbjctSequence replace QuerySequence and HitSequence.  
The changes see the startPosition and stopPosition replaced by a more 
Biojava-esque begin, end and strand, type and gaps (what is this?).  Some 
attention will be needed in this area anyway as NCBI Blast XML is truly 
bizarre in its use of the Hsp_hit-frame and Hsp_query-frame elements.  
Examples: for blastn, when Hsp_hit-frame is -1 (reverse strand), it is 
the QUERY strand that is shown reverse complemented, with the from and to 
coordinates swapped; for blastx and the tblast programs, irrespective of 
whether the frames are forward or reversed, the from and to coordinates are 
always forward (i.e. from < to).  Could we define a standard of our own to 
avoid confusion downstream of the parsers?
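
One possible shape for such a standard, sketched under assumptions: begin is always <= end, and orientation is carried only by a separate strand value (+1 or -1). The HspCoordinates class and its normalize method are hypothetical. How the reverse flag is derived from Hsp_hit-frame, Hsp_query-frame and the from/to elements for each BLAST program is deliberately left to the caller, since that is the part that would need to be agreed.

```java
// Hypothetical value class, not part of BioJava: one candidate convention
// in which begin <= end always holds and strand alone carries orientation.
final class HspCoordinates {
    final int begin;
    final int end;
    final int strand; // +1 = forward, -1 = reverse

    HspCoordinates(int begin, int end, int strand) {
        this.begin = begin;
        this.end = end;
        this.strand = strand;
    }

    // from/to may arrive in either order; 'reverse' is decided by the
    // caller from whatever the BLAST program reports for this sequence.
    static HspCoordinates normalize(int from, int to, boolean reverse) {
        return new HspCoordinates(Math.min(from, to),
                                  Math.max(from, to),
                                  reverse ? -1 : 1);
    }

    public static void main(String[] args) {
        HspCoordinates c = normalize(120, 30, true);
        System.out.println(c.begin + ".." + c.end + " strand " + c.strand);
    }
}
```

Under this convention, downstream code never has to compare begin with end to work out orientation.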