[Biopython] losing information

Thu Oct 29 10:13:04 UTC 2009

On Thu, Oct 29, 2009 at 4:53 AM, Liam Thompson <dejmail at gmail.com> wrote:
> hi everyone
>
> I'm running a simple script to remove genbank records from
> a GB file that I have identified as undesirable. The only
> problem is that when the script is run, all the annotation
> info (CDS etc) for entries is lost, only the sequence and ID
> is kept. I was wondering if there is an option I am missing,
> or if I am using an incorrect variable type somewhere. I just
> can't seem to get all the info written.

I guess since you are losing the CDS features you have an
old version of Biopython. From 1.51 onwards we do write
out the feature table, see:
http://www.biopython.org/wiki/SeqIO#File_Formats

However, using Bio.SeqIO to parse and write GenBank files
is still lossy. References, for example, are not (yet)
written out.

There are alternatives: internally Bio.SeqIO uses
Bio.GenBank to parse the files, and this offers two parsers,
one giving SeqRecord objects (used by SeqIO), and one
giving GenBank-specific Records. This latter parser should
do a better job of preserving the data on output.

That said, I would approach your problem in a very different
way. I would NOT parse the file into objects at all - I would
just loop over the lines, toggling between desired or not,
and outputting the lines for desired records as is. This
assumes your criterion for "desired" is simple to define,
e.g. a list of LOCUS identifiers.
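
As a rough sketch of that idea (not tested against your data),
something like the following should work in Python. The function
name filter_genbank and the unwanted_ids set are made up for
illustration, and it assumes the record name is the second
whitespace-separated field on each LOCUS line:

```python
def filter_genbank(in_handle, out_handle, unwanted_ids):
    """Copy GenBank records line by line, skipping unwanted ones.

    in_handle and out_handle are open text file handles, and
    unwanted_ids is a set of LOCUS names to drop (an assumption:
    the records are identified by the name on the LOCUS line).
    Returns the number of records written.
    """
    keep = False  # anything before the first LOCUS line is dropped
    kept = 0
    for line in in_handle:
        if line.startswith("LOCUS"):
            # A new record starts here, so decide whether to keep it.
            fields = line.split()
            name = fields[1] if len(fields) > 1 else ""
            keep = name not in unwanted_ids
            if keep:
                kept += 1
        if keep:
            # Written unchanged, so features, references and the
            # closing "//" line are all preserved exactly.
            out_handle.write(line)
    return kept
```

You would call it with two open file handles and a set of names,
e.g. filter_genbank(open("in.gb"), open("out.gb", "w"), bad_ids),
where the file names and bad_ids are placeholders. Because nothing
is parsed into objects, nothing can be lost from the kept records.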

Peter