[Biopython] Biopython Enhancement Proposal (BEP): Alphabets

Peter Cock p.j.a.cock at googlemail.com
Fri Oct 19 10:12:51 UTC 2018


Hi Michiel,

My point was that *something* has to replace Bio.Alphabet
for some of the existing use cases - including various file
formats in SeqIO and AlignIO which record DNA, RNA
vs Protein. e.g.

https://github.com/biopython/biopython/blob/biopython-172/Bio/AlignIO/NexusIO.py#L129
https://github.com/biopython/biopython/blob/biopython-172/Bio/SeqIO/InsdcIO.py#L613
https://github.com/biopython/biopython/blob/biopython-172/Bio/SeqIO/SeqXmlIO.py#L348

This could be a new convention for where to store this
in the SeqRecord, or as I am currently exploring, a
simplified alphabet attribute of the Seq object.
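
As a rough illustration of the two options (a sketch only,
not code from either branch): the MoleculeType enum, the
Record stand-in for SeqRecord, and the "molecule_type"
annotation key are all hypothetical names chosen for this
example. The writer helper shows why a file format such as
NEXUS needs the sequence type to be stored somewhere.

```python
from dataclasses import dataclass, field
from enum import Enum


class MoleculeType(Enum):
    """Hypothetical minimal typing to stand in for Bio.Alphabet objects."""

    GENERIC = "generic"
    NUCLEOTIDE = "nucleotide"
    DNA = "DNA"
    RNA = "RNA"
    PROTEIN = "protein"

    def is_nucleotide(self):
        # DNA and RNA are treated as special cases of nucleotide.
        return self in (
            MoleculeType.NUCLEOTIDE,
            MoleculeType.DNA,
            MoleculeType.RNA,
        )


@dataclass
class Record:
    """Stand-in for SeqRecord with only the fields this sketch needs."""

    seq: str
    annotations: dict = field(default_factory=dict)


# NEXUS datatype keywords for the types a writer can express.
_NEXUS_DATATYPES = {
    MoleculeType.DNA: "dna",
    MoleculeType.RNA: "rna",
    MoleculeType.PROTEIN: "protein",
}


def nexus_datatype(record):
    """Return the NEXUS datatype for a record, or raise if it is unknown."""
    # "molecule_type" is an assumed annotation key, not an agreed convention.
    mol = record.annotations.get("molecule_type", MoleculeType.GENERIC)
    try:
        return _NEXUS_DATATYPES[mol]
    except KeyError:
        raise ValueError("Need a DNA, RNA or protein type to write NEXUS") from None
```

The same enum value could equally live as an attribute on
the sequence object rather than in the record annotation;
the lookup in the writer would be the only part to change.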

This effort should uncover any other roadblocks that a
removal or another replacement alphabet system would
face, and might help with coming up with a better idea.

Peter

On Thu, Oct 18, 2018 at 5:09 AM Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> Hi Peter,
>
> To keep things manageable, I would suggest removing Bio.Alphabet first, before considering introducing a new system to replace it.
> Also, after removing Bio.Alphabet, we may have a better idea of whether and how it should be replaced.
>
> Best,
> -Michiel
>
> On Tuesday, October 16, 2018, 9:28:49 PM GMT+9, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
>
> Thanks Thomas,
>
> That seems like a fair summary.
>
> Related to this, this proof of principle pull request / branch
> hides the alphabet from the Seq objects' __repr__:
>
> https://github.com/peterjc/biopython/tree/hide_alphabet
> https://github.com/biopython/biopython/pull/1676
>
> That could be applied as part of a gradual deprecation of
> Bio.Alphabet if we agree to remove it.
>
> Personally I lean to replacing the Bio.Alphabet objects
> with a minimal typing system (like an enum, essentially
> maintaining the minimal hierarchy of generic, nucleotide,
> RNA, DNA or protein). I started exploring this minimal
> typing system idea here:
>
> https://github.com/peterjc/biopython/tree/alpha_lite
>
> That branch still needs more work, but I think it is a
> viable approach.
>
> Complete removal of Bio.Alphabet is also a sensible option,
> but the above work caught several places in Bio.SeqIO where
> existing round-trip parsing/saving requires somewhere to
> hold the sequence type. To be viable a pull request or full
> Biopython Enhancement Proposal to remove Bio.Alphabet
> would need to address this point (e.g. something in the
> SeqRecord annotation instead).
>
> Both removing Bio.Alphabet and my minimal enum typing
> idea would discard the current (very baroque and fragile)
> mechanism for recording gap characters (and other special
> symbols like stop symbols). This could require end user
> code changes in a few places like using the Seq objects'
> .ungap method, but that already supports giving the gap
> character as an argument.
>
> Peter
>
> On Tue, Oct 16, 2018 at 10:28 AM T.A. Wemyss <taw50 at cam.ac.uk> wrote:
> >
> > Dear all,
> >
> > Apologies for the second email, but Michiel and I felt it would be
> > useful to explain my reasoning behind leaving the BEP process. I started
> > off in favour of alphabets, but have since become converted to
> > supporting their removal.
> >
> > Here is some background on my motivation for this:
> > - Alphabets have been in Biopython at least since version 1.00a3
> > (September 3, 2001)
> > - Implementation is inconsistent (see MutableSeq,
> > https://github.com/biopython/biopython/issues/1681 )
> > - Their purpose is badly defined and their current implementation does
> > not clarify this. Therefore any new implementation is likely to cause
> > breaking changes for the few people who actually use them.
> > - On the entire mailing list, only one person replied to say they used
> > alphabets - it's clearly not a widely used feature, and risks just being
> > an additional source of confusion.
> >
> > Michiel has suggested that we proceed directly to removing Alphabets if
> > nobody else wants to take over the BEP.
> >
> > All the best,
> > Thomas
> >
> > On 2018-08-04 03:04, Michiel de Hoon wrote:
> > > Dear all,
> > >
> > > While sequence objects in Biopython have an associated alphabet, the
> > > purpose of alphabets in Biopython is currently not well-defined.
> > > I can imagine these three interpretations of their purpose:
> > >
> > >      * To define how the sequence data is stored internally in a Seq
> > > object (i.e. what kind of objects are in seq.data);
> > >      * To define conceptually what the Seq object contains (e.g. this is a
> > > protein, or this is DNA, or this is DNA with or without methylation);
> > >      * To define how a Seq object should be presented to the user (e.g. as
> > > a single-letter string, a three-letter string, or something else).
> > >
> > > (and there may be others that I have overlooked).
> > >
> > > To justify having alphabets as a part of Biopython, their purpose
> > > should be clearly defined.
> > >
> > > Because of the complexity of alphabets and their use in Biopython, we
> > > felt that it may be a good idea to have a PEP (Python Enhancement
> > > Proposal)-like discussion to define the purpose of alphabets and their
> > > technical implementation in Biopython. This would mean that somebody
> > > who is in favor of having alphabets in Biopython would work out a
> > > proposal with all the details to allow developers and users to think
> > > through the implications.
> > >
> > > Here you can find a description of PEPs and what should go in them:
> > > https://www.python.org/dev/peps/pep-0001/ [1]
> > >
> > > Not all of it is applicable to Biopython, but it may serve as a
> > > general guideline.
> > >
> > > The Alphabet BEP (Biopython Enhancement Proposal) could be hosted on
> > > the Biopython website so that everybody can follow the discussion.
> > >
> > > Since alphabets have been under discussion for more than 10 years, we
> > > are thinking of putting a time limit on the proposal (e.g., until January
> > > 1st, 2020), meaning that if no agreement on the proposal is reached by
> > > then, alphabets would be removed from Biopython. This would give
> > > people who are in favor of alphabets time to make their case, while
> > > guaranteeing that a conclusion will be reached (either a well-defined
> > > and usable alphabet, or no alphabet) within the next ~1.5 years.
> > >
> > > Any volunteers? Seq objects and therefore their alphabets are a key
> > > feature of Biopython, and working through a BEP can give you the
> > > opportunity to help design a major part of Biopython.
> > >
> > > Best,
> > > -Michiel
> > >
> > >
> > >
> > > Links:
> > > ------
> > > [1] https://www.python.org/dev/peps/pep-0001/
> > >
> > > _______________________________________________
> > > Biopython mailing list  -  Biopython at mailman.open-bio.org
> > > http://mailman.open-bio.org/mailman/listinfo/biopython
>

