[Biopython-dev] Ideas for Biopython 2.0

Peter Cock p.j.a.cock at googlemail.com
Mon Jun 19 14:45:05 UTC 2017


I am generally in agreement with your comments Tiago.

Note as per my reply to Patrick, we can't use "bio" (lower case)
as this would be the same directory on disk as "Bio" (title case) on
Windows and most Macs which use a case-insensitive file system.
Hence the suggestions of "biopy" and "biopython" instead.
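
To make the clash concrete, here is a minimal sketch that groups candidate package names by their case-folded form. The helper name `case_collisions` is made up for illustration, and `str.casefold()` only approximates the comparison rules a real case-insensitive file system applies.

```python
def case_collisions(names):
    """Group names that would share a directory on a case-insensitive file system."""
    groups = {}
    for name in names:
        # casefold() is an approximation of the file system's own folding rules.
        groups.setdefault(name.casefold(), []).append(name)
    # Keep only the groups where two or more distinct names collide.
    return {key: group for key, group in groups.items() if len(group) > 1}


candidates = ["Bio", "bio", "biopy", "biopython"]
print(case_collisions(candidates))  # {'bio': ['Bio', 'bio']}
```

Only "Bio" and "bio" collide, which is why the longer names remain usable alongside the existing namespace.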

Peter

On Mon, Jun 19, 2017 at 3:30 PM, Tiago Antão <tiagoantao at gmail.com> wrote:
> Some comments:
>
> 1. The code base is very, very old. Some things have been modernized but
> others are from the Jurassic era, a time when Python was a minority
> language with very little stuff for scientific computing.
>
> 2. I think the most important thing would be the module system (Bow had some
> cool ideas on this, if I remember)
>
> 3. In a module system there would be some core modules and people could do
> extensions (non-core modules). As Peter said, like biogems.
>
> 4. The rules for the core modules would be much stricter than for
> non-core modules.
>
> 5. I think there should be a list of allowable dependencies for core
> modules. Given the current maturity of the Python ecosystem I would say
> numpy, scipy, matplotlib but also at the very least pandas (and maybe
> scikit-learn, pymc3 and statsmodels). Being allowed does not mean that they
> will be used from day one. It would just be an indication to developers
> of what they could count on if they want to write a core module.
>
> 6. Great that Bio.* and bio.* would share some code base, but we have to
> make sure that lingering problems on Bio.* would not infect bio.* . Shared
> code would have to be done according to bio.* policies and Bio.* changed
> accordingly. This assumes a single code base could be maintained for all
> modules, which might not always be possible.
>
> 7. numpydoc and Sphinx seem like great ideas
>
> 8. Removal of outdated packages: I suggest we start with bio.* empty and see
> who volunteers to port; everything else would not make it. If there are no
> volunteers to port a package, then there is probably not enough manpower to
> maintain it anyway and/or it is not of much use anymore.
>
> 9. As the person with the biggest number of users on Jython, I say: forget
> supporting it.
>
> 10. I believe documentation should mostly be in Jupyter notebook format:
> this works perfectly as static HTML if needed.
>
> 11. the top-level of bio.* would require a bit of discussion
>
> 12. Time to implement this? We do not have it, right? So I have the feeling
> this is mostly academic (pun intended). This is especially true for the module
> system, sadly (as it would probably be the part requiring the most work,
> especially on the design side).
>
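The allow-list idea in point 5 could be checked mechanically. The following is only a sketch under assumptions: the function name `disallowed_imports` and the constant `ALLOWED_CORE_DEPENDENCIES` are invented here, the list contents are the ones floated above rather than an agreed policy, and standard library imports would also be reported unless they are added to the allowed set.

```python
import ast

# Allow-list taken from point 5 above; the exact contents are still up for discussion.
ALLOWED_CORE_DEPENDENCIES = {"numpy", "scipy", "matplotlib", "pandas"}


def disallowed_imports(source, allowed=ALLOWED_CORE_DEPENDENCIES):
    """Return top-level package names imported by `source` that are not in `allowed`."""
    imported = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" counts as a dependency on the top-level package "a".
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports, which stay inside the package.
            imported.add(node.module.split(".")[0])
    return imported - set(allowed)


example = "import numpy as np\nfrom pandas import DataFrame\nimport sklearn.svm\n"
print(disallowed_imports(example))  # {'sklearn'}
```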
>
> Tiago
>
> On 19 June 2017 at 05:15, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> Hi Patrick,
>>
>> Thank you for your well thought out document - you've clearly
>> been thinking about this deeply, and perhaps this will be the
>> push we need as a community to make some of these important
>> decisions for the future.
>>
>> [Here I am quoting from your PDF attachment]
>>
>> Patrick wrote:
>> > Biopython is a wonderful package, providing a variety of tools that
>> > make the life of computational biologists a lot easier. Unfortunately
>> > Biopython is also very crowded: There is a mass of subpackages
>> > with inconsistent naming schemes, some of them not really
>> > updated for 10 years. The documentation styles range from rst
>> > through numpydoc to standard Python docstrings.
>>
>> In fact, some of the code hasn't really changed in more than 10 years!
>> Much of the inconsistency stems from a broad range of contributed
>> content over the years, without formal coding standards - the fact
>> we now enforce much of the PEP8 guidelines and push unit test
>> coverage has been the result of a lot of clean up work.
>>
>> > I want to propose a community effort to ‘tidy up’ Biopython: Removing
>> > deprecated subpackages, harmonizing naming and documentation
>> > schemes, sorting modules, updating for modern scientific Python.
>> > The following sections will contain the proposed changes. Since
>> > those drastic changes would make the new version completely
>> > incompatible with existing Biopython versions, the new version is
>> > Biopython 2.0. Of course, everything is open for discussion. The
>> > development could happen in a dedicated Biopython branch or
>> > alternatively in a new repository. The development and release of
>> > version 2.0 should be parallel to version 1.x, since 1.x should be
>> > supported for some further years for compatibility reasons.
>>
>> I have pondered this as well - including continuing to ship both
>> the old "Bio" namespace and its v2.0 replacement in the same
>> package, which would allow moving code into the new namespace
>> and leaving in place stubs supporting the old API present.
>>
>> (This has the advantage of avoiding two code-bases, and having
>> to manually back-port fixes)
>>
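A minimal sketch of such a stub, assuming the real implementation has already moved to the new namespace. The helper `deprecated_alias` and the dotted names in the warning text are placeholders for illustration, not existing Biopython API.

```python
import functools
import warnings


def reverse_complement(sequence):
    """Stand-in for a function that has moved to the new namespace."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def deprecated_alias(func, old_name, new_name):
    """Wrap `func` so that calls through the old name emit a DeprecationWarning."""
    @functools.wraps(func)
    def stub(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at the stub
        )
        return func(*args, **kwargs)
    return stub


# What the old module would keep exporting (dotted names are placeholders).
old_reverse_complement = deprecated_alias(
    reverse_complement,
    "Bio.Example.reverse_complement",
    "biopy.example.reverse_complement",
)

print(old_reverse_complement("ATGC"))  # GCAT
```

Because the stub only forwards the call, a fix made in the new location reaches users of the old name without any back-porting.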
>> > I know, this endeavour might be a lot of work, but if enough of us
>> > Biopythoneers work on this, we might finish it in a few months.
>> > So, now let’s begin with the suggested changes.
>>
>> Sadly few of us are able to spend as much time on Biopython as
>> it deserves, so I fear your few month goal is optimistic. This is a
>> major reason why I've not yet pushed on this, and instead have
>> been concentrating on things we can fix or improve without
>> backwards incompatible changes (like coding style, docstring style,
>> unit test coverage, continuous integration and automated testing).
>>
>> Are you by chance going to be in Prague for ISMB/ECCB 2017? It
>> would be great to meet up in person at the (free) pre-BOSC Codefest
>> to discuss or prototype some of the v2 ideas in person:
>>
>> https://www.open-bio.org/wiki/Codefest_2017
>>
>> > 0.1 Deprecation of Python 2.x support
>> > Since Python 3 has now been released for almost 10 years, I propose
>> > to take the opportunity of not caring for backwards compatibility and
>> > deprecate the Python 2.x support.
>>
>> We're about to sign up to http://www.python3statement.org/ and drop
>> Python 2.7 support by 2020 (when Python itself drops support for it), see
>> http://mailman.open-bio.org/pipermail/biopython-dev/2017-June/021739.html
>>
>> > 0.2 numpy, scipy and matplotlib as dependencies
>> > numpy enables convenient and fast data handling in Python and is
>> > one of the reasons why Python is so popular in scientific computing.
>> > scipy adds a lot of functionality to numpy and matplotlib is a popular
>> > tool for plotting. Since all 3 packages are nearly a must-have in
>> > scientific usage of Python (e.g. bundled in Anaconda), I would
>> > suggest that these packages are required for using Biopython 2.0.
>>
>> Large parts of Biopython already assume NumPy, and it is our only
>> compile time dependency. Jython does not support NumPy, but has
>> not really taken off. Nowadays PyPy's NumPy support is really good.
>> It would make the install and testing framework simpler to just require
>> NumPy absolutely, but that would break Jython...
>>
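For illustration, this is roughly what keeping NumPy optional costs in code: a guarded import plus a second pure Python code path to maintain and test. The function `column_means` is a made-up example, not taken from Biopython.

```python
try:
    import numpy
except ImportError:  # e.g. on an interpreter without NumPy support
    numpy = None


def column_means(rows):
    """Return the mean of each column of a rectangular list of numbers."""
    if numpy is not None:
        return numpy.asarray(rows, dtype=float).mean(axis=0).tolist()
    # Pure Python fallback, needed only while NumPy stays optional.
    return [sum(column) / len(column) for column in zip(*rows)]


print(column_means([[1, 2], [3, 4]]))  # [2.0, 3.0]
```

Requiring NumPy outright would remove the `try`/`except` and the fallback branch, along with the need to test both paths.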
>> Quite a few bits of Biopython do use SciPy and matplotlib too. Until
>> the wheel format took off, this used to be quite a heavy dependency
>> (compiling scipy is quite demanding), but thankfully that is not a big
>> issue anymore.
>>
>> > 0.3 Usage of numpydoc and Sphinx
>> > numpydoc is used by numpy for its documentation (surprise!). It
>> > is based on reStructuredText and can be interpreted by Sphinx via
>> > the numpy extension. Since Biopython lacks a uniform documentation
>> > format, I suggest the usage of numpydoc. Since the style is well
>> > defined, we do not have to invent a documentation style ourselves.
>> > Using Sphinx for creating our documentation, we also have the
>> > opportunity to put some other resources into the documentation
>> > (extended examples, tutorials, etc.). We could also build the entire
>> > Biopython website this way, so everything is in one place.
>>
>> We've been working towards this as a possible replacement for the
>> current documentation - most of the existing docstrings do now
>> pass docutils' RST validation (frustratingly I had to write a tool to
>> check this):
>>
>> https://github.com/peterjc/flake8-rst-docstrings
>> https://github.com/biopython/biopython/issues/1221
>>
>> If someone would like to experiment with replacing epydoc with
>> sphinx for the API documentation that would be great:
>>
>> https://github.com/biopython/biopython/issues/906
>>
>> Agreeing a common docstring standard would be nice, and numpydoc
>> is one of the obvious candidates.
>>
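For readers who have not seen the style, a numpydoc docstring looks roughly like this. The function `gc_ratio` is a hypothetical helper written for this example, not part of Biopython.

```python
def gc_ratio(sequence):
    """Calculate the fraction of G and C letters in a sequence.

    Parameters
    ----------
    sequence : str
        Nucleotide sequence; upper and lower case are both accepted.

    Returns
    -------
    float
        Fraction of letters that are G or C, or 0.0 for an empty sequence.

    Examples
    --------
    >>> gc_ratio("ATGC")
    0.5
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


print(gc_ratio("ATGC"))  # 0.5
```

The underlined section headers are still valid reStructuredText, so a docstring like this should not conflict with the RST validation already in place.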
>> Shifting our LaTeX based Tutorial over to RST would also be
>> possible, but I personally would focus on the HTML version of the
>> docstrings first to improve the online API documentation:
>>
>> https://github.com/biopython/biopython/issues/907
>>
>> > 0.4 Lower case package and module naming
>> > Since we do not have to care for compatibility, we can now name all
>> > packages and modules uniformly in lower case.
>>
>> This is one of the big attractions of a Biopython 2.0, for which we'd
>> have to drop the Bio namespace (and due to case-insensitive file
>> systems like Windows and most Macs, we could not use "bio" lower
>> case). Alternatives discussed in the past include biopython (long
>> but explicit) and biopy (like the style used in NumPy and SciPy).
>>
>> > 0.5 Removal of outdated subpackages
>> > Biopython contains a lot of subpackages, that are not really
>> > maintained anymore. I suggest we try to look for new ‘owners’
>> > (via the mailing list) for these packages, but otherwise remove
>> > them in Biopython 2.0.
>>
>> This does not need to and should not wait for a Biopython 2.0 release.
>> It already happens gradually, e.g. Bio.NeuralNetwork and Bio.GA
>> discussed recently:
>>
>> http://mailman.open-bio.org/pipermail/biopython-dev/2017-June/021728.html
>>
>> On a new thread please suggest other modules you (or anyone
>> else) think are no longer worth maintaining - ideally where there
>> is a mature alternative we can recommend outside of Biopython.
>>
>> > 0.6 Module placement and imports
>> >
>> > For clarity, the top-level package (bio) should not contain
>> > any modules, but only subpackages. These packages can contain
>> > modules or subpackages (folders) by themselves. [...]
>>
>> It is not encouraged to put code in __init__.py files, so that would
>> be a simple style change to make at the same time.
>>
>> Counter to your suggestion, I would resist automatically importing
>> anything at top level - NumPy does this, and the long import time
>> is a major annoyance for some use cases.
>>
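One way to keep a top-level import cheap without importing everything eagerly is to defer loading until first use. The sketch below follows the lazy-import recipe from the standard library's `importlib` documentation; it is an illustration of the trade-off, not something proposed in this thread, and `colorsys` is just a convenient standard library module to demonstrate with.

```python
import importlib.util
import sys


def lazy_import(name):
    """Return a module object whose code only runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up the lazy module; execution is deferred
    return module


colorsys = lazy_import("colorsys")  # cheap: nothing has been executed yet
print(colorsys.rgb_to_hsv(1.0, 0.0, 0.0))  # first attribute access triggers the real import
```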
>> > 0.7 Package organisation
>> > [snip]
>>
>> This is probably going to be one of the hardest bits to build a
>> consensus on.
>>
>> I think broad topics as top level modules makes more sense,
>> be that sequences, 3D structures, trees, clustering, ... - and
>> that most of these have their own file parsing requirements
>> which I personally would put under each top level module as
>> now. This also should work better for module ownership.
>>
>> What you have not touched on is the interesting idea of making
>> Biopython 2.0 more modular - rather than a single installable
>> lump. The BioRuby project has gone to this extreme with the
>> BioGems effort (roughly the Ruby equivalent of PyPI):
>>
>> http://biogems.info/
>>
>> I think the Python packaging ecosystem has matured enough
>> now that this could work for Biopython.
>>
>> As one example, I have been wondering about splitting off
>> Bio.trie as a small self-contained PyPI package, with us leaving
>> a tiny import stub in Biopython with a deprecation warning about
>> getting and using the package directly in future.
>>
>> Kind regards,
>>
>> Peter
>>
>>
>> On Mon, Jun 19, 2017 at 10:46 AM, Patrick Kunzmann
>> <padix.kleber at gmail.com> wrote:
>> > Dear Biopython developers and users,
>> >
>> > Biopython exists now for more than one decade and with time age-related
>> > ailments sneaked into the code: A lot of different modules accumulated in
>> > the top level package, some of them not maintained for more than 5 years;
>> > the module naming and documentation style is inconsistent; a lot of code is
>> > not optimized for modern (scientific) Python.
>> >
>> > In the attached PDF I present my idea of an endeavor for Biopython
>> > modernisation. I would be glad to receive feedback and further suggestions
>> > from you.
>> >
>> > Best regards,
>> >
>> > Patrick Kunzmann
>> >
>> >
>> > _______________________________________________
>> > Biopython-dev mailing list
>> > Biopython-dev at mailman.open-bio.org
>> > http://mailman.open-bio.org/mailman/listinfo/biopython-dev
>>
>
>
>
>
> --
> Tiago Antao
> Scientific and HPC programmer
> http://tiago.org
> https://github.com/tiagoantao/


