[Biopython-dev] parsing summary

Fri Dec 21 06:04:00 EST 2001

To summarize:

I'm working on a way to minimize the amount of work needed to handle
the standard case of

for record in data_file:
  do_something(record)
  write record to output_file

I think I have an API that is easy to use

from Bio import SeqRecord

writer = SeqRecord.io.make_writer("genbank")
for record in SeqRecord.io.readFile(open("unknown.dat")):
  do_something(record)
  writer.write(record)

and can handle different intermediate data types

from Bio import SimpleSeq

writer = SimpleSeq.io.make_writer("fasta")
for record in SimpleSeq.io.readFile(open("unknown.dat")):
  do_something(record)
  writer.write(record)

And it's all built on powerful lower-level forms which are still
relatively easy to use.
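
As a rough illustration of the read/transform/write pattern above, here is
a minimal self-contained sketch.  It is not the actual Bio code: read_fasta,
FastaWriter, WRITERS and the extra outfile argument to make_writer are all
hypothetical names made up for this example, and records are plain
(title, sequence) tuples.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) tuples from a FASTA-formatted handle."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


class FastaWriter:
    """Hypothetical writer: one header line, then the sequence on one line."""

    def __init__(self, outfile):
        self.outfile = outfile

    def write(self, record):
        title, seq = record
        self.outfile.write(">%s\n%s\n" % (title, seq))


# Assumed lookup table from format name to writer class.
WRITERS = {"fasta": FastaWriter}


def make_writer(format_name, outfile):
    # Unlike the one-argument call shown earlier, this sketch takes the
    # output handle explicitly so that the example stays self-contained.
    return WRITERS[format_name](outfile)


infile = io.StringIO(">seq1\nACGT\nTTAA\n>seq2\nGGCC\n")
outfile = io.StringIO()

writer = make_writer("fasta", outfile)
for record in read_fasta(infile):
    writer.write(record)
```

The loop body never mentions the format; only the make_writer call and the
choice of reader do, which is the property the API is aiming for.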

The biggest problem I have is in the registration of all the different
format and conversion types.  Ideally, adding a new format shouldn't
affect performance until its presence is needed.  That speaks for some
sort of file-based discovery mechanism.  The simplest solution is to
load all files at once, but I expect that to yield poor performance.
So there needs to be some sort of deferred loading mechanism.  Or at
least such a mechanism should not be precluded.
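
One possible shape for deferred loading, as a sketch only: registration
records just a module name, and the import happens the first time a format
is requested.  FormatRegistry and its methods are hypothetical names, and
the standard-library json and csv modules stand in for real format
handlers.

```python
import importlib


class FormatRegistry:
    """Map a format name to its implementing module, importing on demand."""

    def __init__(self):
        self._module_names = {}  # format name -> dotted module path
        self._loaded = {}        # format name -> imported module

    def register(self, format_name, module_name):
        # Cheap: only a string is stored, nothing is imported yet.
        self._module_names[format_name] = module_name

    def get(self, format_name):
        # Import the module the first time this format is requested,
        # then reuse the cached module on later calls.
        if format_name not in self._loaded:
            module_name = self._module_names[format_name]
            self._loaded[format_name] = importlib.import_module(module_name)
        return self._loaded[format_name]

    def is_loaded(self, format_name):
        return format_name in self._loaded


registry = FormatRegistry()
# Stand-in handlers; a real registry would point at format modules.
registry.register("json", "json")
registry.register("csv", "csv")

handler = registry.get("json")  # only now is the "json" handler imported
```

With this arrangement, registering many formats costs one dictionary entry
each; the cost of loading a handler is paid only by code that asks for it.
The register calls could equally be generated by scanning a directory,
which would give the file-based discovery mentioned above.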

What I want to do requires coming up with standardized names and data
types.  These include file formats, field types, and data structures.

Thank you for letting me write all this.  It's helped clear
up what my bottlenecks are in this work.  Hopefully you all have
some ideas - or you can say I'm trying to be too clever for my
own good!

                                Andrew
                                dalke at dalkescientific.com