[Biopython-dev] Another Biopython release?

Tue Sep 15 13:51:43 UTC 2009

Hi all,

Looking ahead, Tiago has some population genetics code he hopes to
merge into the trunk at the end of the month (or in October), and we
still have Brad's GFF stuff, my SFF work, Kristian's RNA code, Kyle's
misc suggestions, and perhaps most importantly the phylogenetics
GSoC work to consider.

I know it's been only a month since we released Biopython 1.51, but
does anyone (other than me) think that we already have enough done
to warrant another release? The associated CVS freeze would also
serve as a good break point for moving to GitHub (see other threads).

Here is what we have in the NEWS file at the moment:

<quote>
New helper functions Bio.SeqIO.convert() and Bio.AlignIO.convert() allow an
easier way to use Biopython for simple file format conversions. Additionally,
these new functions allow Biopython to offer important file format specific
optimisations (e.g. FASTQ to FASTA, and interconverting FASTQ variants).

New function Bio.SeqIO.indexed_dict() allows indexing of most sequence file
formats (but not alignment file formats), allowing dictionary like random
access to all the entries in the file as SeqRecord objects, keyed on the
record id. This is especially useful for very large sequencing files, where
all the records cannot be held in memory at once. This supplements the more
flexible but memory demanding Bio.SeqIO.to_dict() function.

Bio.SeqIO can now write "phd" format files (used by PHRED, PHRAP and
CONSED), allowing interconversion with FASTQ files, or FASTA+QUAL files.

Bio.Emboss.Applications now includes wrappers for the "new" PHYLIP EMBASSY
package (e.g. fneighbor) which replace the "old" PHYLIP EMBASSY package
(e.g. efneighbor) whose Biopython wrappers are now obsolete.

See also the DEPRECATED file, as several old deprecated modules have finally
been removed (e.g. Bio.EUtils which had been replaced by Bio.Entrez).
</quote>

[As an aside - Cymon and David - do you want to be named in the NEWS
file for the PHD and PHYLIPNEW stuff?]

We're still debating the name of the new function Bio.SeqIO.indexed_dict(),
but I am otherwise happy with the code (and new documentation). The
related extension, adding indexing via a lookup file or an SQLite
database, is another big chunk of work which I don't have time for at the
moment, but the code already in CVS is still extremely useful as is.
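
To make the indexing idea concrete, here is a minimal sketch of dictionary-like
random access built on stored file offsets. It is not the code in CVS: the
class name OffsetIndex is hypothetical, it handles FASTA only, it returns
plain (title, sequence) tuples rather than SeqRecord objects, and it assumes
the handle is opened in binary mode so that tell() and seek() agree.

```python
import io


class OffsetIndex:
    """Hypothetical sketch: dictionary-like access to FASTA records by id.

    Only the byte offset of each record is kept in memory; the record
    itself is re-read from the handle on demand.
    """

    def __init__(self, handle):
        self._handle = handle
        self._offsets = {}
        handle.seek(0)
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                parts = line[1:].split(None, 1)
                if parts:
                    # Key on the first word of the title line (the record id).
                    self._offsets[parts[0].decode("ascii")] = position

    def __len__(self):
        return len(self._offsets)

    def __contains__(self, record_id):
        return record_id in self._offsets

    def __getitem__(self, record_id):
        # Raises KeyError for unknown ids, like a normal dictionary.
        self._handle.seek(self._offsets[record_id])
        title = self._handle.readline()[1:].decode("ascii").strip()
        sequence_lines = []
        for line in iter(self._handle.readline, b""):
            if line.startswith(b">"):
                break  # start of the next record
            sequence_lines.append(line.strip().decode("ascii"))
        return title, "".join(sequence_lines)


# Small in-memory demonstration (made-up data, not a real sequencing file).
demo = io.BytesIO(b">alpha first record\nACGT\nTTGA\n>beta\nGGCC\n")
index = OffsetIndex(demo)
title, sequence = index["beta"]
```

The memory cost is one integer per record, which is why this kind of approach
scales to files whose records cannot all be held in memory at once.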

Again, I'm biased, but I think the Bio.SeqIO.convert(...) function will be
a popular addition for its convenience, and especially valuable for anyone
converting between the different FASTQ variants, where the optimised
conversion code gives a big speed-up.
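
As a rough illustration of where such a speed-up can come from, the sketch
below converts FASTQ to FASTA and remaps Illumina 1.3+ quality strings to
Sanger encoding using plain strings, without building any record objects.
This is only one plausible shape for a format-specific shortcut, not the
actual Biopython code: all the function names are hypothetical, and the parser
assumes the simple four-lines-per-record FASTQ layout with no line wrapping.

```python
import io


def iter_fastq(handle):
    """Yield (title, sequence, quality) tuples from unwrapped FASTQ text."""
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # end of file
        if not title_line.startswith("@"):
            raise ValueError("FASTQ record should start with '@'")
        sequence = handle.readline().rstrip("\n")
        plus_line = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not plus_line.startswith("+"):
            raise ValueError("Expected '+' separator line")
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality lengths differ")
        yield title_line[1:].rstrip("\n"), sequence, quality


def fastq_to_fasta(in_handle, out_handle):
    """Write FASTA from FASTQ, ignoring the qualities; return record count."""
    count = 0
    for title, sequence, _quality in iter_fastq(in_handle):
        out_handle.write(">%s\n%s\n" % (title, sequence))
        count += 1
    return count


# Illumina 1.3+ stores PHRED + 64, Sanger stores PHRED + 33, so the
# remapping is a fixed shift of 31 on each quality character.
_ILLUMINA_TO_SANGER = {code: code - 31 for code in range(64, 127)}


def illumina_to_sanger_quality(quality):
    """Remap an Illumina 1.3+ quality string to Sanger encoding."""
    if any(ord(letter) not in _ILLUMINA_TO_SANGER for letter in quality):
        raise ValueError("Quality character outside Illumina 1.3+ range")
    return quality.translate(_ILLUMINA_TO_SANGER)


# Small in-memory demonstration (made-up reads).
source = io.StringIO("@r1 desc\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n")
target = io.StringIO()
records_written = fastq_to_fasta(source, target)
```

The remapping goes from Illumina 1.3+ to Sanger because that direction is
always representable; going the other way would need a policy for PHRED
scores above 62.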

Does doing another quick release (say at some point next week) sound
like a good plan? If people like the idea, then getting some extra testing
in now would be great - especially on the new stuff (it has unit tests of
course, but real world usage is also important - thanks Brad for already
trying out the FASTA indexing).

Peter