[Biopython] get back raw records with SeqIO?

Peter biopython at maubp.freeserve.co.uk
Fri Sep 25 09:50:51 UTC 2009


On Thu, Sep 24, 2009 at 10:51 PM, Cedar McKay <cmckay at u.washington.edu> wrote:
> Hello all. Congratulations on the release of 1.52. I'm very pleased to see
> the large file index feature included.

I hoped you would be - our mailing list discussion earlier in the year
basically triggered including this in Biopython:
http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html

Were you able to update your script, which used the precursor index
code, to use the new Bio.SeqIO.index function? It should have been a
drop-in replacement ;)

> And even more thrilled to have more full featured support for writing
> genbank files with SeqIO. Thanks!

I guess you missed that earlier - the GenBank output included features
as of Biopython 1.51, but there have been a few tweaks since then.

> Are there plans to preserve more information in the in_genbank
> --> SeqIO --> out_genbank pipeline? For instance, at the moment,
> AUTHORS, COMMENT, etc are lost.

Like BioPerl, we are not expecting to offer a 100% round trip, but yes
there are some bits (like the references) which still need doing. I haven't
had the time or the need to follow up on those fields yet - but I would
certainly review a patch if you wanted to work on that.

> I have a use question about SeqIO. If I want to get back the raw records
> from a file, can I do that with SeqIO? For example, to parse a genbank file
> with many records, I do:
>
> genbank_records = GenBank.Iterator(in_file_handle)
>
> Can I use SeqIO similarly somehow? Can I tell it not to parse records?

No, the SeqIO system does not break up files into chunks of raw text.
One good reason for this is that it isn't possible for every file format
(e.g. interlaced alignments). For some of the specific file formats it
could be done. The mechanics of this are rather similar to what the
new indexing code does internally (for those file formats where it
is possible).
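
To make the idea concrete, here is a rough sketch in plain Python of how
raw GenBank records could be split out and looked up by file offset. This
is not Biopython code and not how Bio.SeqIO.index is actually implemented;
the function names (iter_raw_records, index_raw_records, read_raw_record)
are made up for illustration. It assumes the handle is opened in binary
mode, that every record starts with a LOCUS line, and that every record
ends with a line containing only "//".

```python
import io


def iter_raw_records(handle):
    """Yield each GenBank record from a binary handle as raw bytes.

    Hypothetical helper: assumes a record ends at a line holding only "//".
    """
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == b"//":
            yield b"".join(lines)
            lines = []
    # Yield any unterminated trailing text, unless it is only whitespace.
    tail = b"".join(lines)
    if tail.strip():
        yield tail


def index_raw_records(handle):
    """Map each LOCUS name to the (offset, length) of its raw record.

    Hypothetical helper: takes the second field of the LOCUS line as the key.
    """
    index = {}
    name = None
    start = 0
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"LOCUS"):
            name = line.split()[1].decode()
            start = offset
        elif line.rstrip() == b"//" and name is not None:
            index[name] = (start, handle.tell() - start)
            name = None
    return index


def read_raw_record(handle, index, name):
    """Seek to a previously indexed record and return its raw bytes."""
    offset, length = index[name]
    handle.seek(offset)
    return handle.read(length)


# Tiny in-memory example standing in for a real file opened with "rb".
example = (
    b"LOCUS       AB000001  10 bp\nORIGIN\n        1 acgtacgtac\n//\n"
    b"LOCUS       AB000002  4 bp\nORIGIN\n        1 acgt\n//\n"
)
example_index = index_raw_records(io.BytesIO(example))
print(sorted(example_index))
```

Only the offsets are kept in memory, so looking up one record later means
a seek and a read rather than re-scanning the whole file. Formats without
a clear per-record terminator would need a different approach.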

Why do you want to do this? I'd like to understand the desired usage.

> My way works fine, but I presume that Bio.GenBank is going to be
> fazed out sometime.

In the long term, perhaps we will phase out Bio.GenBank, but there
is nothing planned. It currently does both SeqRecord parsing (called
by Bio.SeqIO) and also offers a lower-level, more GenBank-faithful
record object. This still has its uses (especially while there is still
room for improvement in GenBank output via SeqIO).

Peter


