[Biopython-dev] generic format reader interface

Mon Apr 9 02:43:21 EDT 2001

On Mon, 9 Apr 2001, Andrew Dalke wrote:

> We've been putting the different formats under Bio/*.  Bioperl
> makes things available through a standard interface at Bio::IO.
> I like the biopython way since I think that's the only way to
> capture everything a database might do, but I also see the need
> for a centralized way to do I/O.

In keeping with the current philosophy, I'd like to keep the formats
separate.  This would allow people interested in a particular package to
find all the code in one place.  That is, I should be able to find
all the SWISS-PROT stuff (or as much as possible) under the SwissProt
directory, rather than looking under SwissProt, SeqIO, etc.

That said, you're right in that having a centralized mechanism for I/O is
very useful.  I recall reading on the bioperl list somewhere that a user
thought it was the most valuable part of the package.  It's convenient
and easy to understand.  However, keeping things separate doesn't preclude
a centralized I/O mechanism.  It just means that you have to write wrappers
in SeqIO that understand where the rest of the code is.  It's an extra
layer of stuff to write, but well worth it, IMHO.
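
As a rough illustration of that wrapper layer, here is a minimal sketch.
Everything in it is an assumption made for the example: swissprot_ids
stands in for a parser that would live in its own format package, and
_PARSERS and parse_file are hypothetical names, not an existing
Biopython API.

```python
import io

# Hypothetical stand-in for a parser that lives in its own format package
# (for example, somewhere under the SwissProt directory).
def swissprot_ids(handle):
    """Yield the entry name from each 'ID' line of a SWISS-PROT-like file."""
    for line in handle:
        if line.startswith("ID"):
            yield line.split()[1]

# The central layer only records which parser handles which format name.
_PARSERS = {"swissprot": swissprot_ids}

def parse_file(handle, format):
    """Look up the format-specific parser and delegate to it."""
    try:
        parser = _PARSERS[format]
    except KeyError:
        raise ValueError("no parser registered for format %r" % format)
    return parser(handle)

# Usage: the caller names the format and never imports the format package.
handle = io.StringIO("ID   104K_THEPA   STANDARD\n//\nID   108_LYCES   STANDARD\n//\n")
for entry_name in parse_file(handle, "swissprot"):
    print(entry_name)
```

The format-specific code stays where it is; only the small lookup table
has to know about every format.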

> What I'm thinking of is a centralized registry, which let you
> specify:
>   - input data type (some unique string, like "swissprot version='38'"
>                      or a tuple like ("swissprot", "38") )
>   - requested record type (another unique string, like "Seq" or
>                          "SProt")
>
> This function would return an iterator for that input and output
> type.  For example:
>
>   iterator = Bio.IO.parseFile(open("sprot.dat"), input="swissprot",
>                               record="Seq")
>
>   while 1:
>       record = iterator.next()
>       if record is None:
>           break
>       ... work with the Seq record ...
>
> Not sure of the details now, but using this sort of interface
> allows resolution to the best parser available for that need.  Eg,
> it could be something which reads the record into a SProt then
> converts the Sprot to a Seq, or it could go directly from the
> record to a Seq, if someone wants to write the appropriate
> specialization.
>
> What would be really nice is if the API had the ability to allow
> something like
>
>   iterator = Bio.IO.parseFile(open("sprot.dat"), input="swissprot",
>                               record="fasta")
>
> and have this return each record as the FASTA formatted string,
> and work either because:
>   - there is a swissprot -> FASTA string builder directly
> or
>   - there is a swissprot -> Seq builder and a Seq -> FASTA converter

The first would be nice, so that we could preserve information that's not
storable cleanly in the Seq object.  However, I worry about the
N**2-number-of-converters problem.  Do you have any ideas for working
around that?
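
To make the trade-off concrete, here is a minimal sketch of one possible
workaround, assuming a registry that prefers a direct converter and
otherwise chains through a common intermediate type.  With N formats,
direct conversion needs up to N*(N-1) converters, while going through
one hub needs about 2*N, with direct ones added only where information
would otherwise be lost.  The names register and find_converter and the
toy record shapes are hypothetical, not an existing API.

```python
# Hypothetical registry: (source, target) -> converter function.
_CONVERTERS = {}

def register(source, target, func):
    """Record a converter from one record type to another."""
    _CONVERTERS[(source, target)] = func

def find_converter(source, target, hub="Seq"):
    """Prefer a direct converter; otherwise chain source -> hub -> target."""
    direct = _CONVERTERS.get((source, target))
    if direct is not None:
        return direct
    to_hub = _CONVERTERS.get((source, hub))
    from_hub = _CONVERTERS.get((hub, target))
    if to_hub is None or from_hub is None:
        raise LookupError("no conversion path from %r to %r" % (source, target))
    return lambda record: from_hub(to_hub(record))

# Toy records for illustration only: a swissprot record is a dict and a
# Seq is a (name, sequence) tuple.
register("swissprot", "Seq", lambda rec: (rec["id"], rec["sequence"]))
register("Seq", "fasta", lambda seq: ">%s\n%s\n" % seq)

# No direct swissprot -> fasta converter exists, so the two steps are chained.
to_fasta = find_converter("swissprot", "fasta")
print(to_fasta({"id": "P12345", "sequence": "MKV"}))
```

Registering a direct ("swissprot", "fasta") converter later would take
precedence over the two-step path without any change to callers.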

Jeff

>
> Still thinking about it.
>
> 					Andrew