[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Peter (BioPython Dev) biopython-dev at maubp.freeserve.co.uk
Wed Aug 2 10:45:34 UTC 2006


Leighton Pritchard wrote:
> On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> 
>>Question One
>>============
>>Is reading sequence files an important function to you, and if so which 
>>file formats in particular (e.g. Fasta, GenBank, ...)
> 
> Yes.  FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW
> 

PTT (Protein table files)

http://www.ibt.unam.mx/biocomputo/hom_make_db.html
(Anyone got an NCBI link for the file format?)

GFF (General Feature Format)

http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

GFF and PTT aren't exactly what I would call sequence files, in that
they don't contain any sequence data.  But thinking about it, maybe
those files could be turned into SeqRecords or SeqFeatures (with empty
sequences).
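
As a rough sketch of that idea, a single GFF line could be mapped onto a
record-like object whose sequence is simply left empty.  The FeatureRecord
class and parse_gff_line function below are hypothetical names invented for
illustration, not existing BioPython code, and the sketch assumes the usual
tab-separated layout with eight fixed columns plus an optional attributes
column.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureRecord:
    """Hypothetical stand-in for a SeqRecord/SeqFeature with no sequence."""
    seq_id: str
    feature_type: str
    start: int          # coordinates kept exactly as written in the file
    end: int
    strand: str
    sequence: str = ""  # GFF carries no residues, so this stays empty
    extra: dict = field(default_factory=dict)


def parse_gff_line(line):
    """Turn one tab-separated GFF line into a FeatureRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) == 8:
        cols.append("")  # the attributes column is optional
    if len(cols) != 9:
        raise ValueError("expected 8 or 9 tab-separated columns, got %d" % len(cols))
    seqname, source, ftype, start, end, score, strand, frame, attrs = cols
    return FeatureRecord(
        seq_id=seqname,
        feature_type=ftype,
        start=int(start),
        end=int(end),
        strand=strand,
        extra={"source": source, "score": score, "frame": frame, "attributes": attrs},
    )


# Made-up example line, not taken from a real annotation file.
example = parse_gff_line("chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene0001\n")
```

A PTT reader could presumably fill the same kind of object, since it is also
a table of coordinates and names without any sequence data.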

> 
>>If you have had to write your own code to read a "common" file format 
>>which BioPython doesn't support, please get in touch.
> 
> EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
> pretty).
> 

It looks like there is enough overlap between the EMBL and GenBank
formats to make sharing code between them a good idea.  Certainly EMBL
was a file format I was thinking we should try to support.

Reading your other comments, it looks like you wouldn't miss FastaRecord
or GenBank records if they were phased out.

Personally, I'm suggesting that any Sequence IO framework should
standardise on returning SeqRecord objects.

Does anyone know if SeqIO stood for Sequence or Sequential Input/Output?

I think we should have a generic "Sequence Iterator" object to do this,
which takes a file handle and is subclassed for each file format - giving
a "Fasta Iterator", a "GenBank Iterator", a "Clustal Iterator" etc.

I'm inclined not to give any choice of parser object (e.g.
Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a
SeqRecord.
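
To make the proposal concrete, here is a minimal sketch of what such an
iterator hierarchy might look like.  The class names SequenceIterator and
FastaIterator, and the tiny SeqRecord stand-in, are assumptions made up for
this example; the real Bio.SeqRecord class has more fields and the final
design may well differ.

```python
import io
from dataclasses import dataclass


@dataclass
class SeqRecord:
    """Minimal stand-in for Bio.SeqRecord.SeqRecord (not the real class)."""
    id: str
    description: str
    seq: str


class SequenceIterator:
    """Hypothetical base class: wraps a file handle, yields SeqRecord objects."""

    def __init__(self, handle):
        self.handle = handle

    def __iter__(self):
        return self.parse()

    def parse(self):
        raise NotImplementedError("subclasses implement one file format")


class FastaIterator(SequenceIterator):
    """Format-specific subclass: only the parsing logic changes."""

    def parse(self):
        title, chunks = None, []
        for line in self.handle:
            line = line.strip()
            if line.startswith(">"):
                if title is not None:
                    yield self._build(title, chunks)
                title, chunks = line[1:], []
            elif line and title is not None:
                chunks.append(line)
        if title is not None:
            yield self._build(title, chunks)

    @staticmethod
    def _build(title, chunks):
        # The first word of the title line is used as the record id.
        rec_id = title.split(None, 1)[0] if title.strip() else ""
        return SeqRecord(id=rec_id, description=title, seq="".join(chunks))


handle = io.StringIO(">seq1 first\nACGT\nTTAA\n>seq2\nGGCC\n")
records = list(FastaIterator(handle))
```

A "GenBank Iterator" or "Clustal Iterator" would then only need to override
parse(), and calling code could loop over any of them in the same way.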

The individual readers should offer some level of control, for example
the title2ids function for Fasta files lets the user decide how the
title line should be broken up into id/name/description.  Also for some
file formats the user should be able to specify the alphabet.
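
For illustration, the title2ids idea could work roughly as below.  This is
only a sketch: read_fasta, default_title2ids and ncbi_style_title2ids are
hypothetical names, and the assumption that the callback returns an
(id, name, description) tuple is mine rather than a settled interface.

```python
import io


def default_title2ids(title):
    """Default rule: first word is both id and name, whole title is the description."""
    first = title.split(None, 1)[0] if title.strip() else ""
    return first, first, title


def read_fasta(handle, title2ids=default_title2ids):
    """Yield (id, name, description, sequence) tuples from a FASTA handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title2ids(title) + ("".join(chunks),)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title2ids(title) + ("".join(chunks),)


def ncbi_style_title2ids(title):
    """User-supplied rule for titles like 'gi|123|ref|NP_000001.1| description'."""
    ident, _, description = title.partition(" ")
    fields = ident.split("|")
    # Use the accession (4th field) as id and the gi number as name, if present.
    if len(fields) >= 4:
        return fields[3], fields[1], description
    return ident, ident, description


# Made-up record, used only to show the callback in action.
data = ">gi|123|ref|NP_000001.1| example protein\nMKV\nLLA\n"
rows = list(read_fasta(io.StringIO(data), title2ids=ncbi_style_title2ids))
```

An alphabet argument could be passed to the reader in the same way, as one
more optional keyword.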

Peter



