[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Mon Aug 21 19:26:06 UTC 2006

You probably noticed I sent out a "Dealing with sequence files"
questionnaire on the main discussion list:

http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html

I've had four replies so far (off the list); with the previous list
discussion and counting myself, that makes eight views.  Not a very big
sample, I know.

> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)?

Fasta is very popular, with GenBank also scoring highly.  Michiel and I
both use ClustalW.  Apart from EMBL (next question), no other popular
file format was mentioned.

I'm tempted to ask again regarding multiple alignment formats.

> Question Two
> ============
> Are there any sequence formats you would like to be able to read using
> BioPython that are not currently supported (e.g. EMBL, ...)?

It may have been a leading question, but several respondents would like
to be able to read in EMBL format.

Other requests included:

XML based 454 sequence files
UniGene sequence cluster format

Leighton mentioned:

PTT (Protein table files)
GFF (General Feature Format)

And I wanted to be able to read Stockholm alignments.

> Question Three - Reading Fasta Files
> ====================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)

Answers ranged across (a), (b) and (d), plus DIY parsers.

> Question Four - Reading GenBank Files
> =====================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
> (b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
> (c) Other (Could you tell us more?)

Both (a) and (b) with no clear majority.

> Question Five - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
> 
> (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.
> 
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> random access to the records using their identifier.
> 
> (c) A list giving random access by index number (e.g. load the records
> using an iterator but save them in a list).

Most of you use iterators, storing records in memory as required.
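
To make the three access styles concrete, here is a minimal sketch in
plain Python.  The iter_fasta helper is hypothetical and written only
for this illustration; it is not the Bio.Fasta or Bio.SeqIO API, and it
yields simple (identifier, sequence) tuples rather than SeqRecord
objects.

```python
from io import StringIO


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs in file order (hypothetical helper)."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the first word of the title line is the identifier.
            words = line[1:].split()
            identifier = words[0] if words else ""
            chunks = []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


example = ">alpha first record\nACGT\nAC\n>beta\nGGTT\n"

# (a) Iterator: one record at a time, in the order from the file.
first = next(iter_fasta(StringIO(example)))

# (b) Dictionary: random access to the records by identifier.
by_id = dict(iter_fasta(StringIO(example)))

# (c) List: random access by index number.
records = list(iter_fasta(StringIO(example)))
```

Styles (b) and (c) are built on top of the iterator here, which matches
the "iterate, then store in memory as required" pattern most
respondents described.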

> Question Six - Martel, Scanners and Consumers
> =============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an
> event/callback model, where the scanner component generates parsing
> events which are dealt with by the consumer component.
> 
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?
> 
> (a) I don't know, or don't care.  I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ... (please
> provide details).

Almost everyone said (a), which I think is a good thing if we are going
to try to re-work BioPython's sequence reading.
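
For anyone unfamiliar with the event/callback model the question
refers to, the following is a rough sketch of the idea only.  The
class and method names (Scanner, ListConsumer, UpperCaseConsumer,
start_record, and so on) are made up for this example and are not
taken from Martel or from any existing BioPython parser.

```python
from io import StringIO


class Scanner:
    """Reads FASTA-like text and reports parsing events to a consumer."""

    def feed(self, handle, consumer):
        in_record = False
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if in_record:
                    consumer.end_record()
                consumer.start_record()
                consumer.title(line[1:])
                in_record = True
            elif line and in_record:
                consumer.sequence(line)
        if in_record:
            consumer.end_record()


class ListConsumer:
    """Builds (title, sequence) tuples from the scanner's events."""

    def __init__(self):
        self.records = []

    def start_record(self):
        self._title, self._chunks = None, []

    def title(self, text):
        self._title = text

    def sequence(self, text):
        self._chunks.append(text)

    def end_record(self):
        self.records.append((self._title, "".join(self._chunks)))


class UpperCaseConsumer(ListConsumer):
    """Modifies parser behaviour by overriding a single callback."""

    def sequence(self, text):
        super().sequence(text.upper())


consumer = UpperCaseConsumer()
Scanner().feed(StringIO(">seq1 demo\nacgt\nac\n"), consumer)
```

Answer (b) in the question corresponds to writing something like
UpperCaseConsumer: the scanner is left alone and only the consumer is
swapped or subclassed.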

> And finally...
> ==============
> Do you have any general questions or comments?

Several people have commented that BioPerl has a nice unified system
with good documentation.

-----------------------------------------------------------------------

Where next...

I think my code could be included "in parallel" with the existing
parsers, without the upheaval of creating a new branch etc.

I have started thinking about writing files too.

Part of this will involve trying to be as consistent as possible about
mapping annotations from different file formats to the SeqRecord
object's annotations dictionary.
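
One possible way to approach this, sketched under stated assumptions:
a per-format table translating native field names into shared
dictionary keys.  The GenBank field names (ORGANISM, KEYWORDS,
DEFINITION) and the EMBL line codes (OS, KW, DE) are real, but the
shared key names and the normalise_annotations helper are hypothetical
choices made for this example, not an agreed convention.

```python
# Hypothetical mapping from format-specific field names to shared keys.
COMMON_KEYS = {
    "genbank": {
        "ORGANISM": "organism",
        "KEYWORDS": "keywords",
        "DEFINITION": "description",
    },
    "embl": {
        "OS": "organism",
        "KW": "keywords",
        "DE": "description",
    },
}


def normalise_annotations(file_format, raw_fields):
    """Return a dictionary keyed by the shared names (illustrative only)."""
    mapping = COMMON_KEYS[file_format]
    annotations = {}
    for name, value in raw_fields.items():
        # Unrecognised fields keep their original name so nothing is lost.
        annotations[mapping.get(name, name)] = value
    return annotations


from_genbank = normalise_annotations("genbank", {"ORGANISM": "Escherichia coli"})
from_embl = normalise_annotations("embl", {"OS": "Escherichia coli"})
```

With a table like this, the same organism ends up under the same key
whichever format it was read from, which would also help when writing
a record back out in a different format.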

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

My code currently on bug 2059 is written as a single Python file,
provisionally Bio/SeqIO/__init__.py, but this is clearly not a good idea
long term as more file formats are supported.

If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a
slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the
filenames would clash on Windows.  Some people are using the code in
Bio.SeqIO.FASTA, but I suppose the file could contain both the old code,
and my new fasta interface.
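
To illustrate the clash: on a case-insensitive file system, FASTA.py
and Fasta.py refer to the same file.  The sketch below uses a
hypothetical helper, case_insensitive_clashes, and approximates the
file system's comparison by lower-casing the names, which is an
assumption that holds for plain ASCII module names like these.

```python
from collections import defaultdict


def case_insensitive_clashes(filenames):
    """Group names that differ only by case (hypothetical helper)."""
    groups = defaultdict(list)
    for name in filenames:
        # Assumption: lower-casing approximates a case-insensitive lookup.
        groups[name.lower()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


clashes = case_insensitive_clashes(
    ["__init__.py", "FASTA.py", "Fasta.py", "GenBank.py"]
)
```

Here clashes contains a single group holding FASTA.py and Fasta.py,
which is exactly the pair that could not coexist in Bio/SeqIO/.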

Alternatively, the new system could be put in Bio.SequenceIO.  Are
there any other suggestions?

Peter