[Biopython-dev] Ideas for Biopython 2.0

Peter Cock p.j.a.cock at googlemail.com
Mon Jun 19 11:15:06 UTC 2017


Hi Patrick,

Thank you for your well thought out document - you've clearly
been thinking about this deeply, and perhaps this will be the
push we need as a community to make some of these important
decisions for the future.

[Here I am quoting from your PDF attachment]

Patrick wrote:
> Biopython is a wonderful package, providing a variety of tools that
> make the life of computational biologists a lot easier. Unfortunately
> Biopython is also very crowded: There is a mass of subpackages
> with inconsistent naming schemes, some of them not really
> updated for 10 years. The documentation styles reach from rst
> over numpydoc to standard python doc.

In fact, some of the code hasn't really changed in more than 10 years!
Much of the inconsistency stems from a broad range of contributed
content over the years, without formal coding standards - the fact
we now enforce much of the PEP8 guidelines and push unit test
coverage has been the result of a lot of clean up work.

> I want to propose a community effort to ‘tidy up’ Biopython: Removing
> deprecated subpackages, harmonizing naming and documentation
> schemes, sorting modules, updating for modern scientific Python.
> The following sections will contain the proposed changes. Since
> those drastic changes would make the new version completely
> incompatible with existing Biopython versions, the new version is
> Biopython 2.0. Of course, everything is open for discussion. The
> development could happen in a dedicated Biopython branch or
> alternatively in a new repository. The development and release of
> version 2.0 should be parallel to version 1.x, since 1.x should be
> supported for some further years for compatibility reasons.

I have pondered this as well - including continuing to ship both
the old "Bio" namespace and its v2.0 replacement in the same
package, which would allow moving code into the new namespace
and leaving stubs in place to support the old API.

(This has the advantage of avoiding two code-bases, and having
to manually back-port fixes)
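
A minimal sketch of what such a stub could look like, assuming a
module-level __getattr__ hook (PEP 562, Python 3.7 or later). The names
"biopy.seq", "Bio.OldSeq" and make_forwarding_stub are made up for
illustration and are not an agreed design; the modules are built in
memory only so the example runs on its own:

```python
import types
import warnings


def make_forwarding_stub(old_name, new_module):
    """Return a stub module that forwards lookups to new_module, with a warning."""
    stub = types.ModuleType(old_name)

    def forward(attr):
        # Only reached when the attribute is not found on the stub itself.
        target = getattr(new_module, attr)  # AttributeError propagates as usual
        warnings.warn(
            "%s.%s has moved to %s.%s"
            % (old_name, attr, new_module.__name__, attr),
            DeprecationWarning,
            stacklevel=2,
        )
        return target

    # Module-level __getattr__ hook (PEP 562).
    stub.__getattr__ = forward
    return stub


# Stand-in for a module living in the new namespace (hypothetical name).
new_seq = types.ModuleType("biopy.seq")
new_seq.complement = lambda s: s.translate(str.maketrans("ACGT", "TGCA"))

# Stand-in for the stub left behind under the old namespace.
old_seq = make_forwarding_stub("Bio.OldSeq", new_seq)
```

With this, old_seq.complement("ACGT") still works but emits a
DeprecationWarning pointing at the new location, so there is only one
copy of the real code to maintain.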

> I know, this endeavour might be a lot of work, but if enough of us
> Biopythoneers work on this, we might finish it in a few months.
> So, now let’s begin with the suggested changes.

Sadly few of us are able to spend as much time on Biopython as
it deserves, so I fear your few month goal is optimistic. This is a
major reason why I've not yet pushed on this, and instead have
been concentrating on things we can fix or improve without
backwards incompatible changes (like coding style, docstring style,
unit test coverage, continuous integration and automated testing).

Are you by chance going to be in Prague for ISMB/ECCB 2017? It
would be great to meet up in person at the (free) pre-BOSC Codefest
to discuss or prototype some of the v2 ideas in person:

https://www.open-bio.org/wiki/Codefest_2017

> 0.1 Deprecation of Python 2.x support
> Since Python 3 is now released for almost 10 years, I propose
> to take the opportunity of not caring for backwards compatibility and
> deprecate the Python 2.x support.

We're about to sign up to http://www.python3statement.org/ and drop
Python 2.7 support by 2020 (when the Python core developers drop
support for it), see
http://mailman.open-bio.org/pipermail/biopython-dev/2017-June/021739.html

> 0.2 numpy, scipy and matplotlib as dependencies
> numpy enables convenient and fast data handling in Python and is
> one of the reasons why Python is so popular in scientific computing.
> scipy adds a lot of functionality to numpy and matplotlib is a popular
> tool for plotting. Since all 3 packages are nearly a must-have in
> scientific usage of Python (e.g. bundled in Anaconda), I would
> suggest that these packages are required for using Biopython 2.0.

Large parts of Biopython already assume NumPy, and it is our only
compile time dependency. Jython does not support NumPy, but has
not really taken off. Nowadays PyPy's NumPy support is really good.
It would make the install and testing framework simpler to just require
NumPy absolutely, but that would break Jython...
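
To illustrate the kind of guard an optional dependency tends to need,
here is a rough sketch. The function name mean_quality is invented for
this example and is not taken from the Biopython code base:

```python
try:
    import numpy  # optional today; would become mandatory under the proposal
except ImportError:
    numpy = None


def mean_quality(scores):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if numpy is not None:
        return float(numpy.mean(scores))
    # Pure Python fallback: the branch a hard NumPy requirement would remove.
    return sum(scores) / len(scores)
```

Every such function needs both branches tested, with and without NumPy
installed, which is where the extra install and testing complexity
comes from.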

Quite a few bits of Biopython do use SciPy and matplotlib too. Until
the wheel format took off, this used to be quite a heavy dependency
(compiling scipy is quite demanding), but thankfully that is not a big
issue anymore.

> 0.3 Usage of numpydoc and Sphinx
> numpydoc is used by numpy for its documentation (surprise!). It
> is based on reStructuredText and can be interpreted by Sphinx via
> the numpy extension. Since Biopython lacks a uniform documentation
> format, I suggest the usage of numpydoc. Since the style is well
> defined, we do not have to invent a documentation style ourselves.
> Using Sphinx for creating our documentation, we also have the
> opportunity to put some other resources into the documentation
> (extended examples, tutorials, etc.). We could also build the entire
> Biopython website this way, so everything is in one place.

We've been working towards this as a possible replacement for the
current documentation - most of the existing docstrings do now
pass docutils' RST validation (frustratingly I had to write a tool to
check this):

https://github.com/peterjc/flake8-rst-docstrings
https://github.com/biopython/biopython/issues/1221

If someone would like to experiment with replacing epydoc with
sphinx for the API documentation that would be great:

https://github.com/biopython/biopython/issues/906

Agreeing a common docstring standard would be nice, and numpydoc
is one of the obvious candidates.
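
For anyone unfamiliar with the style, a numpydoc docstring looks
roughly like this. The function gc_ratio is a made-up example for
illustration, not an existing Biopython function:

```python
def gc_ratio(sequence):
    """Calculate the fraction of G and C letters in a sequence.

    Parameters
    ----------
    sequence : str
        Nucleotide sequence; upper and lower case letters are both counted.

    Returns
    -------
    float
        Value between 0.0 and 1.0, or 0.0 for an empty sequence.

    Examples
    --------
    >>> gc_ratio("ACGT")
    0.5
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)
```

The underlined section headings are still valid reStructuredText, so a
docstring in this style should fit alongside the RST checks mentioned
above.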

Shifting our LaTeX based Tutorial over to RST would also be
possible, but I personally would focus on the HTML version of the
docstrings first to improve the online API documentation:

https://github.com/biopython/biopython/issues/907

> 0.4 Lower case package and module naming
> Since we do not have to care for compatibility, we can now name all
> packages and modules uniformly in lower case.

This is one of the big attractions of a Biopython 2.0, for which we'd
have to drop the Bio namespace (and due to case-insensitive file
systems like Windows and most Macs, we could not use "bio" lower
case). Alternatives discussed in the past include biopython (long
but explicit) and biopy (like the style used in NumPy and SciPy).
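
The clash can be shown with a simple name comparison. This is only a
sketch of the reasoning (the helper name is invented), not a check of
any real file system:

```python
def clashes_on_case_insensitive_fs(existing, candidate):
    """True if two different names collide once letter case is ignored."""
    return existing != candidate and existing.casefold() == candidate.casefold()


# "bio" collides with the existing "Bio" directory; the longer names do not.
for name in ("bio", "biopython", "biopy"):
    print(name, clashes_on_case_insensitive_fs("Bio", name))
```

On such a file system "Bio" and "bio" would resolve to the same
directory, so the old and new packages could not be installed side by
side.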

> 0.5 Removal of outdated subpackages
> Biopython contains a lot of subpackages, that are not really
> maintained anymore. I suggest we try to look for new ‘owners’
> (via the mailing list) for these packages, but otherwise remove
> them in Biopython 2.0.

This does not need to and should not wait for a Biopython 2.0 release.
It already happens gradually, e.g. Bio.NeuralNetwork and Bio.GA
discussed recently:

http://mailman.open-bio.org/pipermail/biopython-dev/2017-June/021728.html

On a new thread please suggest other modules you (or anyone
else) think are no longer worth maintaining - ideally where there
is a mature alternative we can recommend outside of Biopython.

> 0.6 Module placement and imports
>
> For clarity reasons, the top-level package (bio) should not contain
> any modules, but only subpackages. These packages can contain
> modules or subpackages (folders) by themselves. [...]

It is not encouraged to put code in __init__.py files, so that would
be a simple style change to make at the same time.

Counter to your suggestion, I would resist automatically importing
anything at top level - NumPy does this, and the long import time
is a major annoyance for some use cases.
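
A quick way to get a feel for import cost is to time it. This is a
rough sketch using only the standard library; time_import is a
hypothetical helper name:

```python
import importlib
import time


def time_import(module_name):
    """Return the seconds spent importing module_name in this process."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# Any importable module name can be used here.
print("json: %.4f s" % time_import("json"))
```

Only the first import in a process is meaningful, because later imports
are served from the sys.modules cache, so run it in a fresh interpreter
when comparing packages.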

> 0.7 Package organisation
> [snip]

This is probably going to be one of the hardest bits to build a
consensus on.

I think broad topics as top level modules make more sense,
be that sequences, 3D structures, trees, clustering, ... - and
that most of these have their own file parsing requirements
which I personally would put under each top level module as
now. This also should work better for module ownership.

What you have not touched on is the interesting idea of making
Biopython 2.0 more modular - rather than a single installable
lump. The BioRuby project has gone to this extreme with the
BioGems effort (built on RubyGems, roughly the Ruby equivalent
of PyPI):

http://biogems.info/

I think the Python packaging ecosystem has matured enough
now that this could work for Biopython.

As one example, I have been wondering about splitting off
Bio.trie as a small self-contained PyPI package, with us leaving
a tiny import stub in Biopython with a deprecation warning about
getting and using the package directly in future.
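
Such an import stub might look something like the sketch below. The
helper import_split_off is invented for illustration, and no name for a
standalone trie package has been decided:

```python
import importlib
import warnings


def import_split_off(old_name, new_package):
    """Import new_package on behalf of the old module path, warning the caller."""
    try:
        module = importlib.import_module(new_package)
    except ImportError:
        raise ImportError(
            "%s now lives in the separate package %r; please install it"
            % (old_name, new_package)
        ) from None
    warnings.warn(
        "%s is deprecated; import %s directly" % (old_name, new_package),
        DeprecationWarning,
        stacklevel=2,
    )
    return module


# "json" merely stands in for whatever the split-off package is called.
trie = import_split_off("Bio.trie", "json")
```

Existing scripts would keep working as long as the new package is
installed, and would otherwise get an error saying what to install.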

Kind regards,

Peter


On Mon, Jun 19, 2017 at 10:46 AM, Patrick Kunzmann
<padix.kleber at gmail.com> wrote:
> Dear Biopython developers and users,
>
> Biopython exists now for more than one decade and with time age-related
> ailments sneaked into the code: A lot of different modules accumulated in
> the top level package, some of them not maintained for more than 5 years;
> the module naming and documentation style is inconsistent; a lot of code is
> not optimized for modern (scientific) Python.
>
> In the attached PDF I present my idea of an endeavor for Biopython
> modernisation. I would be glad to receive feedback and further suggestions
> from you.
>
> Best regards,
>
> Patrick Kunzmann


