[Bioperl-l] Fwd: [blast-help] Fwd: Question about the definition of 'gaps' in blast -m8 output...

Chris Fields cjfields at illinois.edu
Fri Mar 20 19:06:05 UTC 2009


We have a way to check for both within HSP objects I believe.

1) gaps() is documented to return the number of gap characters within  
the query/hit/total
2) seq_inds('gap', 'hit/query') returns the number of gap positions,  
with the position repeated for every gap character I believe, so  
getting this in scalar context should be similar to gaps().  However,  
to reduce that down to just the gap positions (no repeats) use the  
collapse flag: seq_inds('gap', 'hit/query', 1)

chris

On Mar 20, 2009, at 12:13 PM, Dan Bolser wrote:

> Here is what the man from the NCBI said:
>
>
> ---------- Forwarded message ----------
> From: Peter Cooper <cooper at ncbi.nlm.nih.gov>
> Date: 2009/3/20
> Subject: Re: [blast-help] Fwd: Question about the definition of 'gaps'
> in blast -m8 output...
> To: dan.bolser at gmail.com
> Cc: blast-help at ncbi.nlm.nih.gov
>
>
> Hello,
>
> The number reported tin the -m  8 output is the number of gap
> openings. This will only equal the number of gap characters if the
> length of each gap is 1.
>
>
> Peter
> -------------------------------
> Peter S. Cooper, Ph.D.
> Public Services
> The National Center for Biotechnology Information
> 301-435-5951
>
>
>
>
>
>
> On Mar 19, 2009, at 12:04 PM, romiti wrote:
>
>>
>>
>> Begin forwarded message:
>>
>>> From: User Services Service Account <info at ncbi.nlm.nih.gov>
>>> Date: March 19, 2009 10:23:01 AM EDT
>>> To: romiti at ncbi.nlm.nih.gov
>>> Subject: Question about the definition of 'gaps' in blast -m8  
>>> output...
>>> Reply-To: User Services Service Account <info at ncbi.nlm.nih.gov>
>>>
>>>
>>> ------------- Begin Forwarded Message -------------
>>>
>>> Date: Wed, 18 Mar 2009 19:31:01 +0000
>>> Subject: Question about the definition of 'gaps' in blast -m8  
>>> output...
>>> From: Dan Bolser <dan.bolser at gmail.com>
>>> To: bbb at bioinformatics.org, bioperl-l at lists.open-bio.org, info at ncbi.nlm.nih.gov
>>>
>>>
>>> Hi,
>>>
>>> I'm sure this question comes up again and again, but searching the  
>>> BioPerl
>>> mailing list didn't turn up any answers (to the second question).  
>>> Basically
>>> I want to manually merge HSP's into 'contigious hits', and I want  
>>> to look at
>>> the effect of various parameters on an algorithm to do this. This  
>>> task
>>> prompted me to run a 'sanity check' on the blast data that I had,  
>>> and I
>>> found that this check fails to fulfil my expectation of the data.  
>>> This means
>>> that either I don't understand the data or the results are buggy.
>>>
>>> Can someone clarify the definition of the 'gaps' column in the  
>>> blast -m8
>>> output format for me?
>>>
>>> I thought that the column 'gaps' was basically the number of  
>>> columns in the
>>> HSP that contains a gap character. To test this on my data, I  
>>> checked the
>>> following equality:
>>>
>>> GAPS + 2 =
>>> LENGTH - abs(QUERY_END - QUERY_START) + LENGTH - abs(HIT_END -  
>>> HIT_START)
>>>
>>>
>>> This says that the number of GAPS should be equal to the  
>>> difference between
>>> the LENGTH of the alignment minus the distance between the START  
>>> and END
>>> point on either the QUERY or the HIT (+2 for the 'off by one' error
>>> introduced by the two END-START calculations).
>>>
>>> e.g.
>>>
>>> 10-> MMMMMMMM**MMMM*M <-22
>>>   ||||  ||  | |  |
>>> 20-> MMMM**MMMMM*M*MM <-31
>>>
>>>
>>> where MISMATCHES = 7, LENGTH = 16, QUERY_END - QUERY_START = 12,  
>>> and HIT_END
>>> - HIT_START = 11. The formula gives:
>>>
>>> 7+2(9) = 16-12(4) + 16-11(5)
>>>
>>>
>>> The formula is correct for 11,282 out of 12,745 HSPs in my dataset  
>>> (89%),
>>> however it fails for 1,463 cases (11%). Each of these cases has a  
>>> value of
>>> MISMATCHES smaller than calculated by the formula. The difference  
>>> is usually
>>> 1 or 2, but is seen to go as high as 96, and scales roughly  
>>> linearly with
>>> the size of GAPS.
>>>
>>>
>>> Did I misunderstand what the value of GAPS is supposed to mean?  
>>> How come it
>>> does apparently mean what I thought for so much of the data?
>>>
>>>
>>> Thanks very much for any help on the above.
>>>
>>> Dan.
>>>
>>> ------------- End Forwarded Message -------------
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list