[Biopython-dev] Bio.SeqIO.convert function?

Peter biopython at maubp.freeserve.co.uk
Mon Aug 10 16:46:16 UTC 2009


On Sat, Aug 8, 2009 at 12:14 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> I've stuck a branch up on github which (thus far) simply defines
> the Bio.SeqIO.convert and Bio.AlignIO.convert functions.
> Adding optimised code can come later.
>
> http://github.com/peterjc/biopython/commits/convert

There is now a new file Bio/SeqIO/_convert.py on this
branch, and a few optimised conversions have been done.
In particular GenBank/EMBL to FASTA, any FASTQ to
FASTA, and inter-conversion between any of the three
FASTQ formats.

In terms of speed, this new code takes under a minute to
convert a FASTQ file of 7 million short reads to another
FASTQ variant, or to a (line wrapped) FASTA file. In
comparison, doing the same conversion with Bio.SeqIO
parse/write takes over five minutes.
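
As a rough illustration of where a saving like this can come from, here
is a minimal standard-library sketch of a direct FASTQ to line-wrapped
FASTA conversion that never builds a record object per read. It is not
the code in Bio/SeqIO/_convert.py; the function name fastq_to_fasta, the
wrap parameter, and the restriction to unwrapped four-line FASTQ records
are all assumptions made for this example.

```python
import io


def fastq_to_fasta(in_handle, out_handle, wrap=60):
    """Hypothetical direct FASTQ to FASTA conversion (illustrative only).

    Assumes each FASTQ record is exactly four lines (no line wrapping).
    Returns the number of records written.
    """
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of file
        seq = in_handle.readline().rstrip("\r\n")
        plus = in_handle.readline()
        in_handle.readline()  # quality line, not needed for FASTA output
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Input does not look like four-line FASTQ")
        # Reuse the FASTQ title line as the FASTA title line.
        out_handle.write(">" + title[1:].rstrip("\r\n") + "\n")
        # Line wrap the sequence at the requested width.
        for start in range(0, len(seq), wrap):
            out_handle.write(seq[start:start + wrap] + "\n")
        count += 1
    return count


if __name__ == "__main__":
    example = "@read1 first\nACGTACGT\n+\nIIIIIIII\n"
    out = io.StringIO()
    print(fastq_to_fasta(io.StringIO(example), out, wrap=4))
    print(out.getvalue(), end="")
```

The quality string is read and discarded without being decoded, which
is the kind of shortcut a general parse/write route cannot take.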

In terms of code organisation within Bio/SeqIO/_convert.py
I am (as with Bio.SeqIO etc for parsing and writing) just
using a dictionary of functions, keyed on the format names.
Initially, as you can tell from the code history, I was thinking
about having each sub-function potentially dealing with more
than one conversion (e.g. GenBank to anything not needing
features), but have removed this level of complication in the
most recent commit.
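
The dictionary-of-functions layout described above might look roughly
like the following sketch. The names _converters, _fastq_to_fasta and
the fallback argument are hypothetical, as is the choice to key the
dictionary on an (input format, output format) tuple; in the real module
the fallback would presumably be the general Bio.SeqIO parse/write
route.

```python
import io


def _fastq_to_fasta(in_handle, out_handle):
    """Hypothetical fast path: assumes unwrapped four-line FASTQ records."""
    count = 0
    # Read the handle four lines at a time; a trailing partial record
    # is silently ignored in this sketch.
    for title, seq, _plus, _qual in zip(*[iter(in_handle)] * 4):
        out_handle.write(">" + title[1:].rstrip("\r\n") + "\n")
        out_handle.write(seq.rstrip("\r\n") + "\n")
        count += 1
    return count


# One function per conversion, looked up by the pair of format names.
_converters = {
    ("fastq", "fasta"): _fastq_to_fasta,
}


def convert(in_handle, in_format, out_handle, out_format, fallback=None):
    """Use an optimised converter if one is registered, else the fallback."""
    func = _converters.get((in_format, out_format))
    if func is not None:
        return func(in_handle, out_handle)
    if fallback is None:
        raise ValueError(
            "No converter from %r to %r" % (in_format, out_format)
        )
    # The fallback stands in for a general (slower) parse/write route.
    return fallback(in_handle, in_format, out_handle, out_format)


if __name__ == "__main__":
    data = "@a\nAC\n+\nII\n@b\nGT\n+\nII\n"
    out = io.StringIO()
    print(convert(io.StringIO(data), "fastq", out, "fasta"))
    print(out.getvalue(), end="")
```

With one function per conversion, adding a new optimised case is just a
new dictionary entry, and anything not listed drops through to the
general route.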

The current Bio/SeqIO/_convert.py file actually looks very
long and complicated - but if you ignore the doctests (which
I would probably move to a dedicated unit test), it isn't that
much code at all.

Would anyone like to try this out?

Peter