[BioPython] Question about Seq.count()
Giovanni Marco Dall'Olio
dalloliogm at gmail.com
Fri Oct 19 13:38:50 UTC 2007
2007/10/18, Peter <biopython at maubp.freeserve.co.uk>:
> >>> from Bio.Seq import Seq
> >>> my_seq = Seq("AAACACACGGTTTT")
> >>> my_seq.count("G")
> 2
> >>> my_seq.count("GG")
> 0
I've found the bug!
The code for Bio.Seq.count is:
def count(self, item):
return len([x for x in self.data if x == item])
it does not work for patterns of two nucleotides, because '[x for x in
self.data]' reiterates on a list of strings of one letter each:
>>> s = Seq( 'ACTTgGCATYCGgtGACGACTGGGcATCGGTCAGTCGGTTT')
>>> [x for x in s.data]
['A', 'C', 'T', 'T', 'g', 'G', 'C', 'A', 'T', 'Y', 'C', 'G', 'g', 't',
'G', 'A', 'C', 'G', 'A', 'C', 'T', 'G', 'G', 'G', 'c', 'A', 'T', 'C',
'G', 'G', 'T', 'C', 'A', 'G', 'T', 'C', 'G', 'G', 'T', 'T', 'T']
>>> for x in s.data:
>>> print x, 'GG', x == 'GG'
(always false)
Something like [len('GG' in s.data)] also won't work, because "'GG' in
s.data" returns a Boolean value:
>>> 'GG' in s.data
True
What about using regular expressions instead?
>>> import re
>>> r = re.compile('GG')
>>> count = len(r.findall(my_seq.data))
They don't seem to be too different as for the execution time:
# for i in $( seq 10); do time python -m re -c '"cdasd".count("cc")';
done 2>&1| grep real
real 0m0.091s
real 0m0.106s
real 0m0.081s
real 0m0.110s
real 0m0.076s
real 0m0.109s
real 0m0.109s
real 0m0.062s
real 0m0.110s
real 0m0.062s
# for i in $(seq 10); do time python -m re -c 'len(re.findall("cc",
"cdasd"))'; done 2>&1|grep real
real 0m0.065s
real 0m0.108s
real 0m0.079s
real 0m0.082s
real 0m0.111s
real 0m0.113s
real 0m0.110s
real 0m0.112s
real 0m0.112s
real 0m0.111s
Compiling a short pattern with the re module shouldn't take too much
time and maybe in future implementations, it will allows us to do more
interesting things: for example, we will be able to add an
'ignorecase' parameter to Seq.count:
>>> Bio.Seq('ACAGtcAGgCATGCGG').count('GG', 'ignorecase')
2
>>> Bio.Seq('ACAGtcAGgCATGCGG').count('GG')
1
What do you think?
Cheers,
Giovanni
--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com
More information about the Biopython
mailing list