[Biopython-dev] SeqRecord to file format as string

Jared Flatow jflatow at northwestern.edu
Fri Jun 20 16:16:10 UTC 2008


On Jun 20, 2008, at 9:42 AM, Peter wrote:

> On Wed, Jun 18, 2008 at 4:16 PM, Jared Flatow <jflatow at northwestern.edu>
> wrote:
>> However, py3k and 2.6 will make available the functionality described in
>> PEP 3101:
>>
>> http://www.python.org/dev/peps/pep-3101/
>>
>> I think it would be best to define some semantics that are compatible with
>> this PEP.
>
> That is interesting - the PEP has been accepted, but I guess we should
> wait and see exactly what python 2.6 and 3.0 end up using before
> trying to integrate this into the SeqRecord.

I agree, there are a couple of things that may still change, but the
betas for 2.6 and 3.0 are out and that PEP has been around a while, so
I would say it's pretty much stable. At least the general mechanism is
unlikely to change.

>> In short, I think creating methods to return formatted versions of objects
>> (SeqRecords) is a good idea, but most especially if it is done in a way
>> consistent with the language's vision.
>
> That does sound wise - but I'm a little hazy on how exactly PEP-3101
> will work in practice for generic complex objects.

Yes, I had to read it through a few times to understand exactly how it
will work. Here is what I know:

All objects now get a __format__ method, which has a signature like
this:

def __format__(self, format_spec):
    # return a formatted string
    return str(self)  # placeholder body

The format_spec (format specifier) can be defined by the object, so
essentially it's totally customizable (if you want to do really crazy
things there is a Formatter class that can be customized, but we can
and should avoid this). This method works like Python's other special
methods, and there's a corresponding builtin: calling
format(obj, "the format specifier") simply calls
obj.__format__("the format specifier"). Thus we can define the
format_spec for a SeqRecord to differentiate between FASTA and
whatever other formats we want to support.
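
As a rough sketch of that dispatch under 2.6 or 3.0 (the Record class
below is a hypothetical stand-in, not the real SeqRecord, and its FASTA
output is hand-rolled purely for illustration):

```python
class Record(object):
    """Hypothetical stand-in for SeqRecord, only for illustration."""

    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

    def __format__(self, format_spec):
        # The object itself decides what each format_spec means.
        if format_spec == "fasta":
            return ">%s\n%s\n" % (self.name, self.seq)
        if not format_spec:
            return self.seq
        raise ValueError("unknown format spec: %r" % format_spec)


rec = Record("seq1", "ACGT")
fasta_text = format(rec, "fasta")  # dispatches to rec.__format__("fasta")
plain_text = format(rec)           # dispatches to rec.__format__("")
```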

The string class is also getting a .format method, which calls the
.__format__ method of each value substituted into the string, in an OO
way instead of through the builtin. We can offer a similar method, and
it seems like most use cases will be to call seq_rec.format('fasta').
That method call works in all Python versions; only the builtin form,
format(seq_rec, 'fasta'), requires 2.6 or 3.0.
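
A minimal sketch of such a method, assuming it simply delegates to
__format__ (again with a hypothetical Record class rather than the real
SeqRecord):

```python
class Record(object):
    """Hypothetical stand-in for SeqRecord, only for illustration."""

    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

    def __format__(self, format_spec):
        if format_spec == "fasta":
            return ">%s\n%s\n" % (self.name, self.seq)
        return self.seq

    def format(self, format_spec=""):
        # A plain method call, so it does not depend on the format() builtin.
        return self.__format__(format_spec)


rec = Record("seq1", "ACGT")
via_method = rec.format("fasta")
```

The point of the delegation is that the formatting logic lives in one
place, whichever way it ends up being called.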

Besides the builtin format, we gain the ability to embed the format  
within other strings. So, using the implementation you provided  
earlier which just returns the underlying Seq as a string if no format  
is specified, we might define the __format__ method like this:

def __format__(self, format_spec=None):
    if format_spec:
        # Write this record to an in-memory handle in the requested format.
        from StringIO import StringIO
        from Bio import SeqIO
        handle = StringIO()
        SeqIO.write([self], handle, format_spec)
        return handle.getvalue()
    # No format given: fall back to the plain sequence string.
    return str(self)

def __str__(self):
    return str(self.seq)

Now that means I can also embed this in formatted strings, like so:

"this is my sequence: {0}".format(seq_rec)

Or:

"this is my sequence in fasta format: {0:fasta}".format(seq_rec)

All in all, it's pretty much what you'd expect (and the same as what
you had before). There are only a few small benefits to doing it this
way (right now), but I don't think we can go wrong using the
__format__ method the way it was meant to be used, and who knows what
future use cases this may simplify.
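
For what it's worth, the two embedded forms shown earlier can be tried
without Biopython using a toy class (hypothetical, standing in for
SeqRecord; the real version would go through SeqIO.write rather than
the hand-rolled FASTA string here):

```python
class Record(object):
    """Hypothetical stand-in for SeqRecord, only for illustration."""

    def __init__(self, name, seq):
        self.name = name
        self.seq = seq

    def __str__(self):
        return self.seq

    def __format__(self, format_spec):
        if format_spec == "fasta":
            return ">%s\n%s\n" % (self.name, self.seq)
        if format_spec:
            raise ValueError("unknown format spec: %r" % format_spec)
        return str(self)


rec = Record("seq1", "ACGT")
# Nothing after the field number: __format__ receives an empty spec.
plain = "this is my sequence: {0}".format(rec)
# Everything after the colon is handed to __format__ as the spec.
fasta = "this is my sequence in fasta format:\n{0:fasta}".format(rec)
```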

jared


