[Biopython-dev] numpy/matlab style index arrays for Seq objects

Fri Dec 21 00:12:43 UTC 2012

Hello,

While developing a project, I came across an issue that I want to
share. As far as I know, a Bio.Seq.Seq object can only be indexed using
an int or a slice object, just like a regular string:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())

However, it would be really nice to be able to index Seq objects using
index arrays as in numpy.array, like

>>> my_indices = [0, 3, 7]
>>> my_seq[my_indices]
Seq('GCG', IUPACUnambiguousDNA())
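
To make the intended behaviour concrete without depending on Biopython,
here is a minimal sketch on a plain Python string. The helper name
take() is hypothetical (it is not part of Biopython); it simply picks
the characters at the requested positions, in the order given:

```python
def take(sequence, indices):
    """Return the characters of `sequence` at the given positions, in order.

    `take` is a hypothetical helper used for illustration only.
    Negative indices count from the end, as with normal indexing.
    """
    return "".join(sequence[i] for i in indices)


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
picked = take(dna, [0, 3, 7])
print(picked)  # GCG
```

The same expression should work on anything that supports integer
indexing and yields single-character strings.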

(Since I'm not really familiar with the Biopython API and codebase,
please forgive me if such a thing already exists.)

For example, in my project I'm trying to eliminate noisy columns from
an MSA FASTA file. Assuming I have a list of the non-noisy column
indices, this would solve my problem:

In [1]: from Bio import AlignIO
In [2]: msa = AlignIO.read("s001.fasta", "fasta")
In [3]: print msa[:, [0, 3, 4]]

SingleLetterAlphabet() alignment with 5 rows and 3 columns
KPG sp2
TPG sp11
SPG sp7
KPP sp6
SPG sp10
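
For comparison, the column selection can be sketched today on plain
(name, sequence) pairs without any change to Biopython. The function
select_columns() and the toy rows below are made up for illustration
and are not taken from s001.fasta:

```python
def select_columns(rows, columns):
    """Keep only the given column indices in every aligned row.

    `rows` is a list of (name, sequence) pairs of equal-length strings.
    `select_columns` is a hypothetical helper, not a Biopython function.
    """
    return [(name, "".join(seq[c] for c in columns)) for name, seq in rows]


# Toy alignment, invented for this example.
rows = [("a", "ACGTA"), ("b", "TCGAA")]
filtered = select_columns(rows, [0, 3, 4])
for name, seq in filtered:
    print(seq, name)
```

With these toy rows, columns 0, 3 and 4 give "ATA" for row a and "TAA"
for row b.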

I have attached a tiny patch (~4 lines) implementing this. At first I
considered keeping the sequence string as numpy.array(list()) so that I
could use numpy's indexing mechanism, but that would be
over-engineering, so I just used a simple list comprehension.
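
To give an idea of the approach (this is only a sketch, not the attached
patch itself), a __getitem__ that accepts an int, a slice, or a list of
indices could look roughly like this. The class name IndexableSeq is
hypothetical and wraps a plain string rather than a real Seq:

```python
class IndexableSeq:
    """Toy stand-in for a sequence class; not Biopython's Seq."""

    def __init__(self, data):
        self._data = str(data)

    def __getitem__(self, index):
        if isinstance(index, int):
            # Single position: return one letter, like a string does.
            return self._data[index]
        if isinstance(index, slice):
            # Slice: return a new sequence object.
            return IndexableSeq(self._data[index])
        # Anything else is assumed to be an iterable of ints.
        return IndexableSeq("".join(self._data[i] for i in index))

    def __str__(self):
        return self._data

    def __repr__(self):
        return "IndexableSeq(%r)" % self._data


seq = IndexableSeq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
print(seq[4:12])       # GATGGGCC
print(seq[[0, 3, 7]])  # GCG
```

Existing int and slice behaviour is left untouched; only the fallback
branch is new, which is why the change can stay small.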

Regards.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: biopython-index-array-for-seq.diff
Type: text/x-patch
Size: 3845 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20121221/326baacf/attachment-0002.bin>