[BioPython] Iterating through an alignment to calculate the number of gaps and their lengths
Peter
biopython at maubp.freeserve.co.uk
Wed Feb 6 21:57:48 UTC 2008
On Feb 6, 2008 9:21 PM, Matthew Abravanel <vmatthewa at gmail.com> wrote:
> Hi Everyone,
>
> I was wondering if anyone could help, I am trying to write a little python
> script to iterate through an alignment and determine the number of gaps the
> alignment has and their lengths and output that information as a list.
> Such as this made up alignemt:
>
> Seq1 ATT-AGC-C
> Seq2 AT--AGCTC
>
> and your program runs and outputs like 2 gaps of length 1 outputted as a
> list like this [1,1] or something like that. I am still learning about
> python strings and iterators and am not sure how you would approach this?
> Appreciate any help you could give. Thanks.
I would start with using Bio.SeqIO to read in the sequences as
SeqRecord objects - I'm assuming you have them in a file (e.g. fasta
format, or maybe clustal?). See the tutorial or
http://biopython.org/wiki/SeqIO
e.g.
from Bio import SeqIO
handle = open("example.fasta")
for rec in SeqIO.parse(handle, "fasta") :
print rec.id, len(rec.seq), rec.seq.count("-")
The above code will simple count the number of gap characters. I
think you wanted to look at the sequence strings and how long each
stretch of gap characters is? Rather than counting the number of gap
characters? Well that is a little more complicated... perhaps
something like this:
from Bio import SeqIO
handle = open("example.fasta")
gap = "-"
for rec in SeqIO.parse(handle, "fasta") :
print rec.id, rec.seq
#TODO - Handle leading or trailing gaps
in_gap = False
gap_len = 0
for letter in rec.seq :
if letter == gap and not in_gap :
#Start of a gap
in_gap = True
assert gap_len == 0, "Logic error?"
gap_len = 1
elif in_gap and letter == gap :
#Continuation of a gap
gap_len += 1
elif in_gap and letter <> gap :
#End of the gap...
print " - Found a gap of length %i" % gap_len
#Reset
in_gap = False
gap_len = 0
Note that this doesn't record a running tally of the gap lengths
found, for which a python dictionary might be sensible.
Peter
More information about the Biopython
mailing list