[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Wed Aug 16 14:00:36 UTC 2006

(I changed the subject to that of the previous discussion, as this
isn't really about "contributing comparative genomics tools")

Albert Krewinkel wrote:
> Hello,
>
> I read Peter's SeqIO/__init__.py replacement and if I may say so: I
> love it.  Thanks a lot for this!  Still, there are some things I'd
> like to talk about.

Thank you :) The code is on Bug 2059 for anyone who hasn't looked yet.

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

> The _parse_genbank_features function could also be used to parse EMBL
> or DDBJ features, therefore I think it should be named differently.

First of all, that bit of code is for a new feature which I personally
wanted - to be able to iterate over CDS features in a GenBank file.

But yes, I did have in mind that it (and the GenBank parser) could be
re-used to deal with EMBL files.  I have not yet taken the time to
learn the EMBL file format and how it corresponds to the GenBank file
format - but I agree a lot of the code could be shared.
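
To illustrate what iterating over CDS features means in practice, here
is a minimal sketch in plain Python.  It is not the code attached to
Bug 2059 and does not use the Biopython parser; the names
iter_cds_locations and feature_line are made up for this example.  It
assumes the standard GenBank feature table layout (feature key starting
in column 6, location and qualifiers starting in column 22) and only
collects the location string of each CDS feature.

```python
def iter_cds_locations(lines):
    """Yield the location string of every CDS feature in one GenBank record.

    Hypothetical helper: assumes feature keys start in column 6 and
    locations/qualifiers start in column 22 of the FEATURES table.
    """
    in_table = False     # True once the FEATURES header has been seen
    location = None      # location text of the CDS being read, if any
    in_location = False  # True until the first qualifier of a feature
    for raw in lines:
        line = raw.rstrip()
        if not in_table:
            in_table = line.startswith("FEATURES")
            continue
        if not line.startswith(" "):
            break  # an unindented line (e.g. ORIGIN) ends the table
        key, value = line[5:21].strip(), line[21:].strip()
        if key:
            # A new feature starts; emit the previous CDS, if there was one.
            if location is not None:
                yield location
            location = value if key == "CDS" else None
            in_location = True
        elif value.startswith("/"):
            in_location = False  # qualifiers follow the location
        elif in_location and location is not None:
            location += value    # location continued on the next line
    if location is not None:
        yield location


def feature_line(key, text):
    """Format one feature-table line (key in column 6, text in column 22)."""
    return "     %-16s%s" % (key, text)


record = [
    "LOCUS       EXAMPLE",
    "FEATURES             Location/Qualifiers",
    feature_line("source", "1..300"),
    feature_line("", '/organism="Example"'),
    feature_line("CDS", "join(10..50,"),
    feature_line("", "60..100)"),
    feature_line("", '/gene="abc"'),
    feature_line("gene", "120..200"),
    feature_line("CDS", "complement(120..200)"),
    feature_line("", '/gene="xyz"'),
    "ORIGIN",
    "//",
]
locations = list(iter_cds_locations(record))
```

A real implementation would also have to keep the qualifiers and turn
each location into proper feature objects, which is where sharing code
with an EMBL parser would pay off.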

> Since there is a lot of clean up effort right now: How about moving
> the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
> are closely related and separate modules only clutter the namespace.

What real benefit does that give us?  It will cause a certain amount
of upheaval in the short term as people will have to change their
import statements in existing scripts.  If we do start a new branch
for "big changes" then I have no real problem with this suggestion.

> To me, this seems to be a general problem. It's very difficult to find
> a tool to use for a certain problem if one doesn't already know what
> to look for.  I'd very much favour creating modules like
> Bio.structure to group modules like Bio.PDB and Bio.NMR etc.  This is
> a very big change, and therefore I'd like to follow Marc's suggestion
> of splitting off a branch.  In general, I pretty much agree with what
> Marc said in his <rant />.
>
> I cannot estimate how much work it would be to maintain two separate
> biopython distributions, so please forgive me if I re-suggest
> something completely idiotic here.  I just don't believe there is much
> that could be lost that way.

Biopython probably would benefit from a little reorganising - and for
anything drastic like moving entire modules about, a new branch makes
sense.  On the other hand, do we have the man-power to do it?  Are any
of the developers familiar with all of (or even most of) the existing
modules?  I would guess I have used less than half of the modules - I
have looked at the very basics of Bio.PDB for example, but have never
tried Bio.NMR.

I would favour gradual, incremental (and backwards compatible) changes,
such as adding a new sequence reading module and then marking the old
code as deprecated.
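
As a rough sketch of that approach, the old entry point can stay in
place as a thin wrapper that issues a DeprecationWarning and hands the
work to the new code.  Both function names below (parse_fasta_titles
and get_titles) are invented for illustration and are not existing
Biopython functions; only the standard library warnings module is used.

```python
import warnings


def parse_fasta_titles(lines):
    """New-style function (hypothetical): yield FASTA titles without '>'."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def get_titles(lines):
    """Old-style function (hypothetical), kept so existing scripts still run."""
    warnings.warn(
        "get_titles is deprecated; use parse_fasta_titles instead",
        DeprecationWarning,
        stacklevel=2,  # report the caller's line, not this wrapper
    )
    return list(parse_fasta_titles(lines))
```

Existing scripts keep working unchanged, and anyone running with
warnings enabled is told which function to switch to before the old
one is eventually removed.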

As examples of some small changes, have any of you looked at:

Bug 2057 - SeqRecord has no __str__ or __repr__
http://bugzilla.open-bio.org/show_bug.cgi?id=2057

Bug 1963 - Adding __str__ method to codon tables and translators
http://bugzilla.open-bio.org/show_bug.cgi?id=1963

These are little things in themselves, but I think they would help.
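
To show the sort of thing Bug 2057 is asking for, here is a minimal
sketch using a made-up Record class.  It is only a stand-in: it is not
Biopython's SeqRecord and not the patch on the bug, and the attribute
names and output layout are assumptions chosen for the example.

```python
class Record:
    """Hypothetical stand-in for a sequence record (not Bio.SeqRecord)."""

    def __init__(self, seq, id="<unknown id>", description=""):
        self.seq = seq
        self.id = id
        self.description = description

    def __repr__(self):
        # Unambiguous form, useful at the interactive prompt and in logs.
        return "%s(seq=%r, id=%r, description=%r)" % (
            self.__class__.__name__, self.seq, self.id, self.description)

    def __str__(self):
        # Human-readable summary; long sequences are truncated to 20 chars.
        shown = self.seq if len(self.seq) <= 20 else self.seq[:17] + "..."
        return "ID: %s\nDescription: %s\nSequence: %s" % (
            self.id, self.description, shown)


example = Record("ACGT", id="x1", description="test")
print(repr(example))
print(example)
```

With both methods defined, printing a record or inspecting it at the
prompt gives something more useful than the default object address.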

Peter