[BioPython] a sequence set object in biopython?
Giovanni Marco Dall'Olio
dalloliogm at gmail.com
Thu Nov 13 00:16:44 UTC 2008
On Wed, Nov 12, 2008 at 7:36 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Giovanni Marco Dall'Olio wrote:
>>> All sensible use cases - but all seem to be covered by a simple python
>>> list of SeqRecord objects, or in some cases a list of Seq objects
>>> (e.g. the introns example, as I doube the introns have names).
>>
>> Not always.
>> For example, if I have a set of genes in an organism, sometimes I
>> would need to access to only some of them, by their id; so, a
>> __getattribute__ method to make it work as a dictionary could also be
>> useful.
>
> OK, then use a dict of SeqRecords for this, as shown in the tutorial
> chapter for Bio.SeqIO and the wiki. We even have a helper function
> Bio.SeqIO.to_dict() to do this and check for duplicate keys.
I would prefer a SeqRecordSet object with a to_dict method :)
> If you need an order preserving dictionary, there are examples of this
> on the net and there is even PEP372 for adding this to python itself:
> http://www.python.org/dev/peps/pep-0372/
>> The fact is that I think that such an object would be so widely used,
>> that maybe it would be useful to implement it in biopython.
>> What I would do, honestly, is to create a GenericSeqRecordSet class
>> from which to derive Alignment, specifying that in an alignment all
>> the sequences should have the same lenght. It would not require much
>> work and it would change the interface.
>
> I agree that IF we added some sort of "GenericSeqRecordSet class", it
> might be sensible for the alignment objects to subclass it -
> especially if you want it to behave list a python list primarily.
Let's see it from another point of view.
In biopython, if you want to print a set of sequences in fasta format,
you have to do the following:
>>> s1 = SeqRecord(Seq('cacacac'))
>>> s2 = SeqRecord(Seq('cacacac'))
>>> seqs = s1, s2
>>> out = ''
>>> for seq in seqs:
>>> # a "print seq.format('fasta')" statement won't work properly here, because of blank lines
>>> out += seq.format('fasta')
>>> print out
On the other side, printing an alignment in fasta format is a lot simpler:
>>> al = Alignment(SingleLetterAlphabet)
>>> al.add_sequence('s1', 'cacaca')
>>> al.add_sequence('s2, 'cacaca')
>>> print al.format('fasta')
I work more often with sets of sequences rather than with alignments.
So, why it is more difficult to print some un-related sequences in a
certain format, than aligned sequence? I would end up using Alignment
objects also for sequences that are not aligned.
I am also thinking about many format parsers.
Wouldn't it be easier:
>>> seqs = Bio.SeqIO.parse(filehandler, 'fasta')
>>> record_dict = seqs.to_dict()
than invoking SeqIO twice?
> Note that in python sets are not order preserving.
>
>> very tiny little minusculus p.s. if you need help for implement such a
>> thing or anything else I can volounteer :).
>
> That's good to hear :)
>
> However, we'd have to establish the need for this new object first -
> but so far we've only had two people's view so its too early to form a
> consensus. I don't see a strong reason for adding yet another object,
> when the core language provides lists, sets and dict which seem to be
> enough.
Take for example this code you wrote for me before:
> class SeqRecordList(list) :
> """Subclass of the python list, to hold SeqRecord objects only."""
> #TODO - Override the list methods to make sure all the items
> #are indeed SeqRecord objects
>
> def format(self, format) :
> """Returns a string of all the records in a requested file format.
>
> The argument format should be any file format supported by
> the Bio.SeqIO.write() function. This must be a lower case string.
> """
> from Bio import SeqIO
> from StringIO import StringIO
> handle = StringIO()
> SeqIO.write(self, handle, format)
> handle.seek(0)
> return handle.read()
It's very useful, but I don't think a python/biopython newbie would be
able to write it.
That's why I think it should be included.
Last year, I was in another laboratory and I didn't have much
experience with biopython, and I was missing such a kind of object.
> Peter
>
Goodnight!!
--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://bioinfoblog.it
More information about the Biopython
mailing list