[Biopython] Biopython Enhancement Proposal (BEP): Alphabets

Markus Piotrowski Markus.Piotrowski at ruhr-uni-bochum.de
Fri Oct 19 13:06:54 UTC 2018

There are at least a few places that need to be looked at:

$ grep -r "from Bio import Alphabet" /g/git/biopython/Bio/*

/g/git/biopython/Bio/Align/AlignInfo.py:from Bio import Alphabet
/g/git/biopython/Bio/Align/__init__.py:from Bio import Alphabet
/g/git/biopython/Bio/AlignIO/NexusIO.py:from Bio import Alphabet
/g/git/biopython/Bio/Alphabet/IUPAC.py:from Bio import Alphabet
/g/git/biopython/Bio/Alphabet/Reduced.py:    >>> from Bio import Alphabet
/g/git/biopython/Bio/Alphabet/Reduced.py:from Bio import Alphabet
/g/git/biopython/Bio/Data/CodonTable.py:from Bio import Alphabet
/g/git/biopython/Bio/FSSP/FSSPTools.py:from Bio import Alphabet
/g/git/biopython/Bio/GenBank/__init__.py:        from Bio import Alphabet
/g/git/biopython/Bio/motifs/matrix.py:from Bio import Alphabet
/g/git/biopython/Bio/motifs/__init__.py:        from Bio import Alphabet
/g/git/biopython/Bio/NeuralNetwork/Gene/Schema.py:from Bio import Alphabet
/g/git/biopython/Bio/Phylo/PhyloXML.py:from Bio import Alphabet
/g/git/biopython/Bio/Seq.py:from Bio import Alphabet
/g/git/biopython/Bio/SeqIO/AbiIO.py:from Bio import Alphabet
/g/git/biopython/Bio/SeqIO/InsdcIO.py:from Bio import Alphabet
/g/git/biopython/Bio/SeqIO/SeqXmlIO.py:from Bio import Alphabet
/g/git/biopython/Bio/SeqIO/SffIO.py:from Bio import Alphabet
/g/git/biopython/Bio/SeqIO/SwissIO.py:from Bio import Alphabet
/g/git/biopython/Bio/SeqIO/UniprotIO.py:from Bio import Alphabet
/g/git/biopython/Bio/SeqIO/_index.py:from Bio import Alphabet
/g/git/biopython/Bio/SeqUtils/__init__.py:from Bio import Alphabet
/g/git/biopython/Bio/SubsMat/FreqTable.py:from Bio import Alphabet
/g/git/biopython/Bio/SubsMat/__init__.py:from Bio import Alphabet

$ grep -r "from Bio.Alphabet import" /g/git/biopython/Bio/*

/g/git/biopython/Bio/Align/AlignInfo.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Align/__init__.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Align/__init__.py:        >>> from Bio.Alphabet import IUPAC, Gapped
/g/git/biopython/Bio/Align/__init__.py:        >>> from Bio.Alphabet import IUPAC, Gapped
/g/git/biopython/Bio/Align/__init__.py:        >>> from Bio.Alphabet import IUPAC, Gapped
/g/git/biopython/Bio/Align/__init__.py:        >>> from Bio.Alphabet import IUPAC, Gapped
/g/git/biopython/Bio/Align/__init__.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Align/__init__.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Align/__init__.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Align/__init__.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/AlignIO/FastaIO.py:from Bio.Alphabet import single_letter_alphabet, generic_dna, generic_protein
/g/git/biopython/Bio/AlignIO/FastaIO.py:from Bio.Alphabet import Gapped
/g/git/biopython/Bio/AlignIO/Interfaces.py:from Bio.Alphabet import 
/g/git/biopython/Bio/AlignIO/MafIO.py:from Bio.Alphabet import 
/g/git/biopython/Bio/AlignIO/StockholmIO.py:    >>> from Bio.Alphabet import generic_rna
/g/git/biopython/Bio/AlignIO/StockholmIO.py:    >>> from Bio.Alphabet import generic_rna
/g/git/biopython/Bio/AlignIO/__init__.py:from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
/g/git/biopython/Bio/Alphabet/Reduced.py:    >>> from Bio.Alphabet import Reduced
/g/git/biopython/Bio/Alphabet/__init__.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Alphabet/__init__.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Alphabet/__init__.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/codonalign/codonalignment.py:    >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/codonalign/codonalignment.py:    >>> from Bio.Alphabet import IUPAC, Gapped
/g/git/biopython/Bio/codonalign/codonalignment.py: from Bio.Alphabet import generic_nucleotide
/g/git/biopython/Bio/codonalign/codonalphabet.py:from Bio.Alphabet import IUPAC, Gapped, HasStopCodon, Alphabet
/g/git/biopython/Bio/codonalign/codonseq.py:from Bio.Alphabet import generic_dna, _ungap
/g/git/biopython/Bio/codonalign/__init__.py:from Bio.Alphabet import 
/g/git/biopython/Bio/codonalign/__init__.py:    >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/codonalign/__init__.py:    from Bio.Alphabet import 
/g/git/biopython/Bio/codonalign/__init__.py:    from Bio.Alphabet import 
/g/git/biopython/Bio/Data/CodonTable.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/GenBank/Scanner.py:from Bio.Alphabet import 
/g/git/biopython/Bio/GenBank/__init__.py:        from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/motifs/alignace.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/motifs/mast.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/motifs/matrix.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/motifs/meme.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/motifs/transfac.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/motifs/__init__.py:        from Bio.Alphabet import 
/g/git/biopython/Bio/motifs/__init__.py:        from Bio.Alphabet import 
/g/git/biopython/Bio/NeuralNetwork/Gene/Motif.py:from Bio.Alphabet import _verify_alphabet
/g/git/biopython/Bio/NeuralNetwork/Gene/Pattern.py:from Bio.Alphabet import _verify_alphabet
/g/git/biopython/Bio/NeuralNetwork/Gene/Signature.py:from Bio.Alphabet import _verify_alphabet
/g/git/biopython/Bio/Nexus/Nexus.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/PDB/Polypeptide.py:from Bio.Alphabet import 
/g/git/biopython/Bio/SearchIO/BlatIO.py:from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/SearchIO/FastaIO.py:from Bio.Alphabet import generic_dna, generic_protein
/g/git/biopython/Bio/SearchIO/BlastIO/blast_text.py:from Bio.Alphabet import generic_dna, generic_protein
/g/git/biopython/Bio/SearchIO/BlastIO/blast_xml.py:from Bio.Alphabet import generic_dna, generic_protein
Bio.Alphabet import generic_protein
/g/git/biopython/Bio/SearchIO/HmmerIO/hmmer2_text.py:from Bio.Alphabet import generic_protein
/g/git/biopython/Bio/SearchIO/HmmerIO/hmmer3_domtab.py:from Bio.Alphabet import generic_protein
/g/git/biopython/Bio/SearchIO/HmmerIO/hmmer3_tab.py:from Bio.Alphabet import generic_protein
/g/git/biopython/Bio/SearchIO/HmmerIO/hmmer3_text.py:from Bio.Alphabet import generic_protein
Bio.Alphabet import generic_protein
/g/git/biopython/Bio/SearchIO/_model/hsp.py:from Bio.Alphabet import 
/g/git/biopython/Bio/Seq.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna, generic_rna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna, generic_protein
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna, generic_rna, generic_protein
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import HasStopCodon, generic_protein
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import Gapped, 
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import Gapped
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC, Gapped, HasStopCodon
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC, 
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import Gapped, 
/g/git/biopython/Bio/Seq.py:    >>> from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/Seq.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/SeqFeature.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqFeature.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqFeature.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqFeature.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqIO/AceIO.py:from Bio.Alphabet import generic_nucleotide, generic_dna, generic_rna, Gapped
/g/git/biopython/Bio/SeqIO/FastaIO.py:from Bio.Alphabet import 
/g/git/biopython/Bio/SeqIO/IgIO.py:from Bio.Alphabet import 
/g/git/biopython/Bio/SeqIO/Interfaces.py:from Bio.Alphabet import 
/g/git/biopython/Bio/SeqIO/PdbIO.py:from Bio.Alphabet import generic_protein
/g/git/biopython/Bio/SeqIO/PirIO.py:from Bio.Alphabet import single_letter_alphabet, generic_protein, \
/g/git/biopython/Bio/SeqIO/QualityIO.py:>>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqIO/QualityIO.py:from Bio.Alphabet import 
/g/git/biopython/Bio/SeqIO/QualityIO.py:    >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqIO/TabIO.py:from Bio.Alphabet import 
/g/git/biopython/Bio/SeqIO/__init__.py:from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
/g/git/biopython/Bio/SeqIO/__init__.py:    >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqIO/__init__.py:    >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqRecord.py:    >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/SeqRecord.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/SeqRecord.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqRecord.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/SeqRecord.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqRecord.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/SeqRecord.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/SeqRecord.py:        >>> from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/SeqRecord.py:        >>> from Bio.Alphabet import 
/g/git/biopython/Bio/Sequencing/Phd.py:from Bio.Alphabet import generic_dna
/g/git/biopython/Bio/SeqUtils/MeltingTemp.py:    >>> from Bio.Alphabet import generic_nucleotide
/g/git/biopython/Bio/SeqUtils/ProtParam.py:from Bio.Alphabet import IUPAC
/g/git/biopython/Bio/SeqUtils/__init__.py:    >>> from Bio.Alphabet import generic_dna, generic_rna, generic_protein
/g/git/biopython/Bio/SeqUtils/__init__.py:    >>> from Bio.Alphabet import generic_dna
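
For anyone who would rather have a per-file tally than the raw grep output, here is a rough Python sketch of the same two searches. It is only an approximation: unlike grep -r it looks at *.py files only, and the function name count_alphabet_imports is my own invention, not something in Biopython.

```python
import re
from collections import Counter
from pathlib import Path

# Same two patterns as the grep commands above.
ALPHABET_IMPORT = re.compile(r"from Bio import Alphabet\b|from Bio\.Alphabet import\b")


def count_alphabet_imports(root):
    """Map each .py file under root (relative path) to its number of matching lines.

    Hypothetical helper for illustration; doctest lines (">>> from ...") count too,
    just as they do in the grep output.
    """
    counts = Counter()
    for path in sorted(Path(root).rglob("*.py")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            hits = sum(1 for line in handle if ALPHABET_IMPORT.search(line))
        if hits:
            counts[str(path.relative_to(root))] = hits
    return counts


# Example usage (the path is whatever your local checkout is):
#     for name, hits in count_alphabet_imports("/g/git/biopython/Bio").most_common():
#         print(f"{hits:4d}  {name}")
```

Sorting by count should make it easier to see which modules need the most work.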


Am 19.10.2018 um 12:12 schrieb Peter Cock:
> Hi Michiel,
> My point was that *something* has to replace Bio.Alphabet
> for some of the existing use cases - including various file
> formats in SeqIO and AlignIO which record DNA, RNA
> vs Protein. e.g.
> https://github.com/biopython/biopython/blob/biopython-172/Bio/AlignIO/NexusIO.py#L129
> https://github.com/biopython/biopython/blob/biopython-172/Bio/SeqIO/InsdcIO.py#L613
> https://github.com/biopython/biopython/blob/biopython-172/Bio/SeqIO/SeqXmlIO.py#L348
> This could be a new convention for where to store this
> in the SeqRecord, or as I am currently exploring, a
> simplified alphabet attribute of the Seq object.
> This effort should uncover any other roadblocks to a
> removal or another replacement alphabet system would
> face - and might help with coming up with a better idea.
> Peter
> On Thu, Oct 18, 2018 at 5:09 AM Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> Hi Peter,
>> To keep things manageable, I would suggest removing Bio.Alphabet first, before considering whether to introduce a new system to replace it.
>> Also, after removing Bio.Alphabet, we may have a better idea if and how it should be replaced.
>> Best,
>> -Michiel
>> On Tuesday, October 16, 2018, 9:28:49 PM GMT+9, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> Thanks Thomas,
>> That seems like a fair summary.
>> Related to this, this proof of principle pull request / branch
>> hides the alphabet from the Seq objects' __repr__:
>> https://github.com/peterjc/biopython/tree/hide_alphabet
>> https://github.com/biopython/biopython/pull/1676
>> That could be applied as part of a gradual deprecation of
>> Bio.Alphabet if we agree to remove it.
>> Personally I lean to replacing the Bio.Alphabet objects
>> with a minimal typing system (like an enum, essentially
>> maintaining the minimal hierarchy of generic, nucleotide,
>> DNA, RNA or protein). I started exploring this minimal
>> typing system idea here:
>> https://github.com/peterjc/biopython/tree/alpha_lite
>> That branch still needs more work, but I think it is a
>> viable approach.
>> Complete removal of Bio.Alphabet is also a sensible option,
>> but the above work caught several places in Bio.SeqIO where
>> existing round-trip parsing/saving requires somewhere to
>> hold the sequence type. To be viable a pull request or full
>> Biopython Enhancement Proposal to remove Bio.Alphabet
>> would need to address this point (e.g. something in the
>> SeqRecord annotation instead).
>> Both removing Bio.Alphabet and my minimal enum typing
>> idea would discard the current (very baroque and fragile)
>> mechanism for recording gap characters (and other special
>> symbols like stop symbols). This could require end user
>> code changes in a few places like using the Seq objects'
>> .ungap method, but that already supports giving the gap
>> character as an argument.
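
To make the "minimal typing system" idea concrete, here is how I picture it, as a sketch only. This is not the code in the alpha_lite branch; the names SeqType, is_a and ungap below are hypothetical and chosen purely for illustration. The gap character is passed explicitly rather than stored in an alphabet object.

```python
from enum import Enum


class SeqType(Enum):
    """Hypothetical minimal sequence types (illustrative names only)."""

    GENERIC = "generic"
    NUCLEOTIDE = "nucleotide"
    DNA = "DNA"
    RNA = "RNA"
    PROTEIN = "protein"


# Parent of each type in the minimal hierarchy; GENERIC is the root.
_PARENT = {
    SeqType.NUCLEOTIDE: SeqType.GENERIC,
    SeqType.PROTEIN: SeqType.GENERIC,
    SeqType.DNA: SeqType.NUCLEOTIDE,
    SeqType.RNA: SeqType.NUCLEOTIDE,
}


def is_a(kind, ancestor):
    """Return True if kind equals ancestor or descends from it."""
    while kind is not None:
        if kind is ancestor:
            return True
        kind = _PARENT.get(kind)
    return False


def ungap(letters, gap="-"):
    """Strip an explicitly given gap character from a plain string."""
    return letters.replace(gap, "")
```

With something like this, a file writer that needs to know DNA vs. RNA vs. protein could check is_a(kind, SeqType.NUCLEOTIDE) without any of the Gapped/HasStopCodon wrapper machinery.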
>> Peter
>> On Tue, Oct 16, 2018 at 10:28 AM T.A. Wemyss <taw50 at cam.ac.uk> wrote:
>>> Dear all,
>>> Apologies for the second email, but Michiel and I felt it would be
>>> useful to explain my reasoning behind leaving the BEP process. I started
>>> off in favour of alphabets, but have since become converted to
>>> supporting their removal.
>>> Here is some background on my motivation for this:
>>> - Alphabets have been in Biopython at least since version 1.00a3
>>> (September 3, 2001)
>>> - Implementation is inconsistent (see MutableSeq,
>>> https://github.com/biopython/biopython/issues/1681 )
>>> - Their purpose is badly defined and their current implementation does
>>> not clarify this. Therefore any new implementation is likely to cause
>>> breaking changes for the few people who actually use them.
>>> - On the entire mailing list, only one person replied to say they used
>>> alphabets - it's clearly not a widely used feature, and risks just being
>>> an additional source of confusion.
>>> Michiel has suggested that we proceed directly to removing Alphabets if
>>> nobody else wants to take over the BEP.
>>> All the best,
>>> Thomas
>>> On 2018-08-04 03:04, Michiel de Hoon wrote:
>>>> Dear all,
>>>> While sequence objects in Biopython have an associated alphabet, the
>>>> purpose of alphabets in Biopython is currently not well-defined.
>>>> I can imagine these three interpretations of their purpose:
>>>>       * To define how the sequence data is stored internally in a Seq
>>>> object (i.e. what kind of objects are in seq.data);
>>>>       * To define conceptually what the Seq object contains (e.g. this is a
>>>> protein, or this is DNA, or this is DNA with or without methylation);
>>>>       * To define how a Seq object should be presented to the user (e.g. as
>>>> a single-letter string, a three-letter string, or something else).
>>>> (and there may be others that I have overlooked).
>>>> To justify having alphabets as a part of Biopython, their purpose
>>>> should be clearly defined.
>>>> Because of the complexity of alphabets and their use in Biopython, we
>>>> felt that it may be a good idea to have a PEP (Python Enhancement
>>>> Proposal)-like discussion to define the purpose of alphabets and their
>>>> technical implementation in Biopython. This would mean that somebody
>>>> who is in favor of having alphabets in Biopython would work out a
>>>> proposal with all the details to allow developers and users to think
>>>> through the implications.
>>>> Here you can find a description of PEPs and what should go in them:
>>>> https://www.python.org/dev/peps/pep-0001/ [1]
>>>> Not all of it is applicable to Biopython, but it may serve as a
>>>> general guideline.
>>>> The Alphabet BEP (Biopython Enhancement Proposal) could be hosted on
>>>> the Biopython website so that everybody can follow the discussion.
>>>> Since alphabets have been under discussion for more than 10 years, we
>>>> are thinking of putting a time limit on the proposal (e.g., until January
>>>> 1st, 2020), meaning that if no agreement on the proposal is reached by
>>>> then, alphabets would be removed from Biopython. This would give
>>>> people who are in favor of alphabets time to make their case, while
>>>> guaranteeing that a conclusion will be reached (either a well-defined
>>>> and usable alphabet, or no alphabet) within the next ~1.5 years.
>>>> Any volunteers? Seq objects and therefore their alphabets are a key
>>>> feature of Biopython, and working through a BEP can give you the
>>>> opportunity to help design a major part of Biopython.
>>>> Best,
>>>> -Michiel
>>>> Links:
>>>> ------
>>>> [1] https://www.python.org/dev/peps/pep-0001/
>>>> _______________________________________________
>>>> Biopython mailing list  -  Biopython at mailman.open-bio.org
>>>> http://mailman.open-bio.org/mailman/listinfo/biopython

Dr. Markus Piotrowski
Privatdozent/Akademischer Rat
Lehrstuhl für Molekulargenetik und Physiologie der Pflanzen
ND 3/49
Universitätsstr. 150
44801 Bochum

Tel. xx49-(0)234-3224290
Fax. xx49-(0)234-3214187

