[Biopython] Biopython Enhancement Proposal (BEP): Alphabets

Michiel de Hoon mjldehoon at yahoo.com
Thu Oct 18 04:09:55 UTC 2018


 
Hi Peter,
To keep things manageable, I would suggest removing Bio.Alphabet first, before considering whether to introduce a new system to replace it.
Also, after removing Bio.Alphabet, we may have a better idea of whether and how it should be replaced.
Best,
-Michiel


On Tuesday, October 16, 2018, 9:28:49 PM GMT+9, Peter Cock <p.j.a.cock at googlemail.com> wrote:

Thanks Thomas,

That seems like a fair summary.

Related to this, this proof of principle pull request / branch
hides the alphabet from the Seq objects' __repr__:

https://github.com/peterjc/biopython/tree/hide_alphabet
https://github.com/biopython/biopython/pull/1676

That could be applied as part of a gradual deprecation of
Bio.Alphabet if we agree to remove it.

Personally I lean towards replacing the Bio.Alphabet objects
with a minimal typing system (like an enum, essentially
maintaining the minimal hierarchy of generic, nucleotide,
DNA, RNA or protein). I started exploring this minimal
typing system idea here:

https://github.com/peterjc/biopython/tree/alpha_lite

That branch still needs more work, but I think it is a
viable approach.
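
A minimal sketch of what such an enum-based typing system could look
like. This is illustrative only and is not taken from the alpha_lite
branch; the names MolType and is_a, and the parent table, are
hypothetical:

```python
from enum import Enum


class MolType(Enum):
    """Hypothetical minimal sequence types; names are illustrative only."""

    GENERIC = "generic"
    NUCLEOTIDE = "nucleotide"
    DNA = "DNA"
    RNA = "RNA"
    PROTEIN = "protein"


# Parent of each type in the minimal hierarchy (GENERIC is the root).
# Kept outside the class so it does not become an enum member.
_PARENT = {
    MolType.NUCLEOTIDE: MolType.GENERIC,
    MolType.PROTEIN: MolType.GENERIC,
    MolType.DNA: MolType.NUCLEOTIDE,
    MolType.RNA: MolType.NUCLEOTIDE,
}


def is_a(mol_type, ancestor):
    """Return True if mol_type equals ancestor or descends from it."""
    current = mol_type
    while current is not None:
        if current is ancestor:
            return True
        current = _PARENT.get(current)
    return False
```

With this layout, a check such as is_a(MolType.DNA, MolType.NUCLEOTIDE)
would stand in for the isinstance tests currently done against alphabet
classes.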

Complete removal of Bio.Alphabet is also a sensible option,
but the above work caught several places in Bio.SeqIO where
existing round-trip parsing/saving requires somewhere to
hold the sequence type. To be viable, a pull request or full
Biopython Enhancement Proposal to remove Bio.Alphabet
would need to address this point (e.g. something in the
SeqRecord annotation instead).
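
To make the round-trip point concrete, here is a rough sketch of
holding the sequence type in a record's annotations. Everything in it
is an assumption for illustration: Record is a plain stand-in for
SeqRecord, the "molecule_type" key is a made-up name, and the
tab-separated format is invented for the example:

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Stand-in for a SeqRecord: a plain sequence string plus annotations."""

    seq: str
    id: str = "<unknown id>"
    annotations: dict = field(default_factory=dict)


def write_line(record):
    """Hypothetical writer for a made-up one-line format: id, type, sequence."""
    # The writer needs the sequence type, so it reads it from the annotations
    # rather than from an alphabet object attached to the sequence.
    try:
        mol_type = record.annotations["molecule_type"]
    except KeyError:
        raise ValueError("sequence type required to write this format") from None
    return f"{record.id}\t{mol_type}\t{record.seq}"


def parse_line(line):
    """Hypothetical parser for the same format; stores the type as an annotation."""
    rec_id, mol_type, seq = line.rstrip("\n").split("\t")
    return Record(seq=seq, id=rec_id, annotations={"molecule_type": mol_type})
```

Parsing a line and writing it back then preserves the type without any
alphabet object, while a record lacking the annotation fails loudly on
output instead of being written with a guessed type.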

Both removing Bio.Alphabet and my minimal enum typing
idea would discard the current (very baroque and fragile)
mechanism for recording gap characters (and other special
symbols like stop symbols). This could require end user
code changes in a few places like using the Seq objects'
.ungap method, but that already supports giving the gap
character as an argument.
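
As an illustration of the kind of end user change involved, the sketch
below removes gaps with the gap character passed explicitly. It works
on plain strings and the helper name ungap is only borrowed for
familiarity; it is not the Seq method itself:

```python
def ungap(sequence, gap="-"):
    """Return sequence with every occurrence of the gap character removed.

    Hypothetical helper on plain strings; the caller names the gap character
    instead of the function looking it up on an alphabet object.
    """
    if len(gap) != 1:
        raise ValueError("gap must be a single character")
    return sequence.replace(gap, "")
```

Code that previously relied on the alphabet to know the gap character
would instead state it at the call site, for example ungap(text, gap=".").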

Peter

On Tue, Oct 16, 2018 at 10:28 AM T.A. Wemyss <taw50 at cam.ac.uk> wrote:
>
> Dear all,
>
> Apologies for the second email, but Michiel and I felt it would be
> useful to explain my reasoning behind leaving the BEP process. I started
> off in favour of alphabets, but have since become converted to
> supporting their removal.
>
> Here is some background on my motivation for this:
> - Alphabets have been in Biopython at least since version 1.00a3
> (September 3, 2001)
> - Implementation is inconsistent (see MutableSeq,
> https://github.com/biopython/biopython/issues/1681 )
> - Their purpose is badly defined and their current implementation does
> not clarify this. Therefore any new implementation is likely to cause
> breaking changes for the few people who actually use them.
> - On the entire mailing list, only one person replied to say they used
> alphabets - it's clearly not a widely used feature, and risks just being
> an additional source of confusion.
>
> Michiel has suggested that we proceed directly to removing Alphabets if
> nobody else wants to take over the BEP.
>
> All the best,
> Thomas
>
> On 2018-08-04 03:04, Michiel de Hoon wrote:
> > Dear all,
> >
> > While sequence objects in Biopython have an associated alphabet, the
> > purpose of alphabets in Biopython is currently not well-defined.
> > I can imagine these three interpretations of their purpose:
> >
> >      * To define how the sequence data is stored internally in a Seq
> > object (i.e. what kind of objects are in seq.data);
> >      * To define conceptually what the Seq object contains (e.g. this is a
> > protein, or this is DNA, or this is DNA with or without methylation);
> >      * To define how a Seq object should be presented to the user (e.g. as
> > a single-letter string, a three-letter string, or something else).
> >
> > (and there may be others that I have overlooked).
> >
> > To justify having alphabets as a part of Biopython, their purpose
> > should be clearly defined.
> >
> > Because of the complexity of alphabets and their use in Biopython, we
> > felt that it may be a good idea to have a PEP (Python Enhancement
> > Proposal)-like discussion to define the purpose of alphabets and their
> > technical implementation in Biopython. This would mean that somebody
> > who is in favor of having alphabets in Biopython would work out a
> > proposal with all the details to allow developers and users to think
> > through the implications.
> >
> > Here you can find a description of PEPs and what should go in them:
> > https://www.python.org/dev/peps/pep-0001/ [1]
> >
> > Not all of it is applicable to Biopython, but it may serve as a
> > general guideline.
> >
> > The Alphabet BEP (Biopython Enhancement Proposal) could be hosted on
> > the Biopython website so that everybody can follow the discussion.
> >
> > Since alphabets have been under discussion for more than 10 years, we
> > are thinking of putting a time limit on the proposal (e.g., until January
> > 1st, 2020), meaning that if no agreement on the proposal is reached by
> > then, alphabets would be removed from Biopython. This would give
> > people who are in favor of alphabets a chance to make their case, while
> > guaranteeing that a conclusion will be reached (either a well-defined
> > and usable alphabet, or no alphabet) within the next ~1.5 years.
> >
> > Any volunteers? Seq objects and therefore their alphabets are a key
> > feature of Biopython, and working through a BEP can give you the
> > opportunity to help design a major part of Biopython.
> >
> > Best,
> > -Michiel
> >
> >
> >
> > Links:
> > ------
> > [1] https://www.python.org/dev/peps/pep-0001/
> >
> > _______________________________________________
> > Biopython mailing list  -  Biopython at mailman.open-bio.org
> > http://mailman.open-bio.org/mailman/listinfo/biopython
  