[Biopython-dev] Output sequence files

Brad Chapman chapmanb at arches.uga.edu
Fri May 25 00:28:40 EDT 2001


Hi Iddo!
Nice to hear from you again! Hope things are going well.

Iddo:
> Does Biopython provide anything in the field of writing out a sequence
> (Seq/MutableSeq classes) in the usual GenBank/SwissProt/Fasta/... formats?

Nope, not yet. As Sarah noticed, no one has coded this up. I give
a big +1 for adding something that has this type of functionality.

Iddo:
> If Biopython really does not provide this feature, maybe a discussion
> could be started. Writing out the sequence part is easy. Can be
> implemented with a few functions (to_fasta, to_swiss), or those can even
> be methods within Seq/MutableSeq

Sarah:
> Also, to output meaningful genbank files I think we really need 
> to operate on SeqFeature objects? In fact even fasta needs the
> information in a SeqRecord object rather than just a MutableSeq?

I think Sarah is right on this. Seq/MutableSeq classes do not store
any useful annotations on the sequence (except the alphabet/type of
the sequence). Things should focus on SeqRecord, which has all of the
annotation stuff.


I was thinking about this problem while I was writing the output
functionality for GenBank.Record objects (the GenBank specific Record
classes). Here's my 2 cents on what I think should be done:

=> First, someone needs to work on SeqRecord to beef it up and make it 
a nicer class for storing annotation information. Right now,
everything gets shoved into the annotations or features attribute
(take a look at the GenBank stuff for a good example of how someone
(me!) can abuse these badly). I think a more full-featured SeqRecord
class would be great.

=> In my mind, instead of focusing on conversions like: 

SeqRecord -> FASTA flat file format

we should do the conversions like:

SeqRecord -> Fasta.Record class -> FASTA flat file format

(and something similar for GenBank, SwissProt, etc). Since in
Biopython we have nice classes for representing specific flat file
formats, and also have a way to output the flat file from the record
(at least for FASTA and GenBank right now), this allows us to use
this strength of Biopython and also not duplicate code.
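
A rough sketch of the two-step conversion described above, to make the
division of labor concrete. FastaRecord and SimpleSeqRecord are
hypothetical stand-ins written for this example only (they are not the
actual Biopython classes), and the 60-column wrapping is an assumption.
The point is that the format-specific record owns the flat file layout,
so the conversion step never has to know how FASTA is written out.

```python
from dataclasses import dataclass


@dataclass
class FastaRecord:
    """Hypothetical stand-in for a format-specific record class."""
    title: str
    sequence: str
    line_width: int = 60  # assumed wrap width, not taken from Biopython

    def __str__(self):
        # The record itself owns the flat file layout: a '>' title line,
        # then the sequence wrapped at line_width characters.
        lines = [">" + self.title]
        for start in range(0, len(self.sequence), self.line_width):
            lines.append(self.sequence[start:start + self.line_width])
        return "\n".join(lines) + "\n"


@dataclass
class SimpleSeqRecord:
    """Hypothetical stand-in for a generic annotated sequence record."""
    id: str
    description: str
    seq: str


def seq_record_to_fasta_record(record):
    # Step 1: generic record -> format-specific record (no formatting here).
    title = (record.id + " " + record.description).strip()
    return FastaRecord(title=title, sequence=str(record.seq))


generic = SimpleSeqRecord(id="seq1", description="example", seq="ACGT" * 20)
# Step 2: format-specific record -> flat file text.
fasta_text = str(seq_record_to_fasta_record(generic))
```

Supporting another format in this style would mean writing one more
small mapping function and reusing that format's existing output code.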

This is a big bonus for more complicated formats like GenBank --
writing a function that outputs FASTA is not too bad, but GenBank is
much more complicated -- I was amazed at the amount of work I had to
do to get output working, even from a GenBank specific record
class. I'd rather not duplicate this type of code.

=> So, since we've already got the Record -> flat file converters (or can 
write them), I think we could focus on writing a converter that will
take a SeqRecord and give you a format specific Record object, like:

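class SeqRecordConverter:
    def __init__(self, seq_record):
        # keep the SeqRecord (or SeqRecord-like object) to convert
        self.seq_record = seq_record

    def to_fasta(self):
        # would return a Fasta.Record
        raise NotImplementedError

    def to_genbank(self):
        # would return a GenBank.Record
        raise NotImplementedError

    def to_swissprot(self):
        # would return a SwissProt-specific record
        raise NotImplementedError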

This could either go in the Bio/SeqRecord.py module or into something
like Bio/Tools/Converter.py, but I think it is better to separate
these functions away from the SeqRecord class itself: this would help
keep SeqRecord small, and would also allow you to use the
SeqRecordConverter with "SeqRecord-like" objects (i.e., you could code
up your own SeqRecord-like classes for specialized behavior or
whatever).
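
To illustrate the "SeqRecord-like" point, here is a minimal sketch of a
converter that only relies on attribute access. Everything in it is an
assumption made for illustration: the attribute names id and seq, the
MyRecord class, and the (title, sequence) tuple that stands in for a real
format-specific record.

```python
class SeqRecordConverter:
    """Sketch of a converter kept separate from the record class."""

    def __init__(self, seq_record):
        # Only attribute access is assumed, not a particular class.
        self.seq_record = seq_record

    def to_fasta(self):
        # Returns a (title, sequence) pair as a stand-in for a Fasta record.
        rec = self.seq_record
        return (str(rec.id), str(rec.seq))


class MyRecord:
    """A home-grown SeqRecord-like class; it shares no base class."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


fasta_pair = SeqRecordConverter(MyRecord("abc", "ATGC")).to_fasta()
```

Because the converter never checks the type of what it is given, any
object exposing the expected attributes can be converted without touching
the SeqRecord class itself.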


So anyways, these are the ideas that have been mulling around in my
brain concerning this. What do people think? Other opinions on how to
implement this type of functionality? 

Thanks Iddo and Sarah -- I'm really glad y'all are interested in
working on this!

Brad



