[BioPython] More string methods for the Seq object
Peter
biopython at maubp.freeserve.co.uk
Fri Sep 26 21:22:48 UTC 2008
>>> Suppose you have some sequences which you have aligned in ClustalW,
>>> and most have leading or trailing gap characters. For example, given
>>> "---SAD-KCNKADND---" (as a Seq object with a gapped protein alphabet)
>>> you might want to strip off the leading and trailing gaps to have just
>>> "SAD-KCNKADND" (as a Seq object with the same alphabet). Right now
>>> the Seq object doesn't have a strip method, so you would have to
>>> switch to a string and back again.
>>
>> Using pure python strings:
>>
>> long_seq_str = "---SAD-KCNKADND---"
>> trimmed_seq_str = long_seq_str.strip("-")
This gives "SAD-KCNKADND"; it does NOT remove the internal "-" character.
>> Using Biopython Seq objects:
>>
>> from Bio.Seq import Seq
>> from Bio.Alphabet import generic_protein
>> long_seq = Seq("---SAD-KCNKADND---", generic_protein)
>> #I want to be able to do this:
>> trimmed_seq = long_seq.strip("-")
>> #Right now, I have to do something like this:
>> trimmed_seq = Seq(long_seq.tostring().strip("-"), generic_protein)
This gives Seq("SAD-KCNKADND", ProteinAlphabet()), i.e. it would NOT
remove the internal "-" character.
> While I do like the idea, strip(), as defined here, is inconsistent with the
> Python string version. Python documentation: strip([chars]): "Return a
> copy of the string with the leading and trailing characters removed."
My proposed Seq strip method is intended to behave EXACTLY like the
Python string method, except that I would suggest defaulting to the
gap character rather than white space. My proposed implementation
even calls the Python string strip method internally.
Have another look at the suggested code:
http://bugzilla.open-bio.org/show_bug.cgi?id=2596
> Rather you should use an alternative word like compress to remove the said
> character from within a sequence.
I suspect you have misunderstood my intention. My Seq object .strip()
method would NOT remove the given characters from the interior of the
sequence - only from the ends.
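To make the intended behaviour concrete, here is a rough sketch. Note that SimpleSeq is a made-up stand-in class for illustration only (it is not the real Seq object, nor the code attached to the bug), and the "protein" alphabet is just a placeholder string. The only point is that strip() delegates to the Python string method and so trims the ends only:

```python
class SimpleSeq:
    """Hypothetical stand-in for a Seq-like object (not Biopython's Seq)."""

    def __init__(self, data, alphabet=None):
        self.data = data          # the sequence as a plain Python string
        self.alphabet = alphabet  # placeholder for an alphabet object

    def __repr__(self):
        return "SimpleSeq(%r, %r)" % (self.data, self.alphabet)

    def strip(self, chars="-"):
        # Delegate to str.strip, which only trims leading and trailing
        # characters. The default is the gap character, not white space.
        return SimpleSeq(self.data.strip(chars), self.alphabet)


long_seq = SimpleSeq("---SAD-KCNKADND---", "protein")
trimmed_seq = long_seq.strip()
print(trimmed_seq)  # the internal "-" is still there
```

The result keeps the same alphabet as the original, and the gap inside "SAD-KCNKADND" is left untouched.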
However, there is certainly a case for wanting an .ungap() method for
the Seq class (or a more general method to remove all of a particular
character), but I hadn't intended to raise this issue yet.
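For comparison, removing every occurrence of a character is a different operation from stripping the ends. A minimal sketch using plain Python strings (remove_char is a hypothetical helper name of my own, not an existing Biopython function):

```python
def remove_char(seq_str, char="-"):
    """Return seq_str with every occurrence of char removed.

    Hypothetical helper for illustration; works on plain strings only.
    """
    return seq_str.replace(char, "")


gapped = "---SAD-KCNKADND---"
ends_only = gapped.strip("-")       # trims the ends, keeps internal gap
no_gaps = remove_char(gapped, "-")  # removes all gaps, including internal
print(ends_only)
print(no_gaps)
```

Here strip leaves "SAD-KCNKADND", while the removal helper gives "SADKCNKADND".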
Peter