[Biopython-dev] Alphabet case and standards

Mon Jan 12 23:04:46 UTC 2009

On Mon, Jan 12, 2009 at 10:24 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> Hi,
> I am moving a potential discussion away from the bugzilla because it affects
> at least the following Bugs (please add others):
> 2351 (Make Seq more like a string, even subclass string?
> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 ),
> 2532 (Using IUPAC alphabets in mixed case Seq objects
> http://bugzilla.open-bio.org/show_bug.cgi?id=2532 ),
> 2597 (Enforce alphabet letters in Seq objects
> http://bugzilla.open-bio.org/show_bug.cgi?id=2597 )
> 2731 (Adding .upper() and .lower() methods to the Seq object
> http://bugzilla.open-bio.org/show_bug.cgi?id=2731 ).
>
> I am hoping it gets wider feedback than using bugzilla, avoid unnecessary
> duplication and closure of these bugs.

Yes, having a discussion on the mailing list is probably better than
on bugzilla.  I should probably write up my views on this topic
explicitly, but I've tried to do so below in reply to your points.

> From Bug 2351, "Bio.Alphabets.IUPAC defines a number of alphabets with
> defined lists of valid letters which are in upper case ONLY". But various
> applications ignore the alphabet case and hence the standards. So this
> creates the problem of how Biopython should handle alphabet case.
> ...

I don't want to prevent people from using mixed case or lower case
sequences if they want to.  However, I do think doing so with an
alphabet which is intended to be upper case ONLY should be treated
as an error.

We currently have a number of generic alphabets which DO NOT define
a set of valid letters.  We also have some IUPAC derived alphabets
which define a set of expected letters, all upper case.

So, if you want to use lower or mixed case sequences in a Seq object,
(1) Use a generic alphabet which does not explicitly define the valid
letters (so any characters are allowed)
(2) Use an explicit alphabet which includes the relevant cases.  This
could be a user defined alphabet, or one we add to Biopython.
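
As a rough sketch of these two options (the classes below are
hypothetical stand-ins written for this example, not the actual
Bio.Alphabet code, and fits_alphabet is a made-up helper):

```python
# Hypothetical stand-ins for illustration only; these are NOT the
# real Bio.Alphabet classes.
class Alphabet:
    letters = None  # None means no explicit list, so anything goes


class GenericDNA(Alphabet):
    """Option (1): generic alphabet, no explicit letters."""


class UpperUnambiguousDNA(Alphabet):
    """Upper case only, in the spirit of the IUPAC alphabets."""
    letters = "GATC"


class MixedCaseUnambiguousDNA(Alphabet):
    """Option (2): explicit alphabet listing both cases."""
    letters = "GATCgatc"


def fits_alphabet(data, alphabet):
    """Return True if every character in data is allowed."""
    if alphabet.letters is None:
        return True
    return all(char in alphabet.letters for char in data)


print(fits_alphabet("ACGTacgt", GenericDNA()))               # True
print(fits_alphabet("ACGTacgt", UpperUnambiguousDNA()))      # False
print(fits_alphabet("ACGTacgt", MixedCaseUnambiguousDNA()))  # True
```

Only the upper case only alphabet rejects the mixed case string; the
generic alphabet and the explicit mixed case alphabet both accept it.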

Most of the time in my personal usage, I don't actually care about
the precise alphabet - the generic DNA/RNA/protein alphabets suffice.
These do not list the expected/allowed letters, and thus can be used
for upper case, lower case or mixed case sequences.  Working with well
defined alphabets is more important when working with things like
BLOSUM matrices.

> One suggestion given in two of the bugs is to change the Alphabet object but
> I believe that this is wrong because you do not know which alphabet to use.

The person creating the Seq object should know what kind of data they
are dealing with, and if they specifically want to use say "mixed case
unambiguous IUPAC DNA" (if this were in Biopython) then that's up to
them.  If you don't know exactly what you are dealing with, fall back
on the generic DNA alphabet, or the generic nucleotide alphabet, or
even the generic single letter alphabet.

> ... Also, if mixed case alphabets are used, then an excessive number
> of alphabets may be required.

We *could* introduce mixed case IUPAC alphabets, and lower case IUPAC
alphabets to complement the existing upper case IUPAC alphabets (see
my patch on 2532).  Yes, this does add a lot of alphabets, and I'm not
entirely keen on this either.  Maybe just adding mixed case versions
would suffice?

> I think that current approach is to force to user to using uppercase when
> interacting with the Alphabet object or derived from it (such as an actual
> alphabet). While this maintains storage of the input case, it does not
> enforce the standard. This is also inefficient because it requires constant
> checks for the correct case.

Right now we don't force the user to do anything.  I would like to
make the alphabet check strict (Bug 2597), or at least give a warning.
Running with this change locally has flagged up several typos in my
unit tests - I think it is a good thing.

> Similar to the first suggestion in Bug 2731, I think that we should
> automatically changes the case when creating any sequence-related object and
> provide a warning that the input has changed. This enforces standard and
> probably requires small changes to the code but loses the format of the
> input. Outside of Biopython, an example of this is the web version of NCBI
> blast silently converts input case of the query.

My personal view on automatically changing the case of the sequence
string when creating a Seq object: NO WAY.  You're throwing away
potentially important data, and also preventing people from working
with mixed case sequences - for no real benefit.
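
To illustrate the kind of data that would be lost: one common use of
lower case is soft masking, where repeats or low complexity regions
are written in lower case.  A minimal sketch using plain Python
strings (the example sequence and the lowercase_positions helper are
made up for this illustration):

```python
def lowercase_positions(sequence):
    """Return the indices of the lower case letters in sequence."""
    return [i for i, char in enumerate(sequence) if char.islower()]


# Made-up sequence with a lower case (e.g. soft masked) stretch.
original = "ACGTTTacacacacGGCA"
converted = original.upper()

print(lowercase_positions(original))   # the masked region is visible
print(lowercase_positions(converted))  # [] - nothing left to find
```

Once the string has been upper cased there is no way to get the
masked region back from the sequence alone.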

> Less desirable options:
> a) Enforces the standard such as with Bug 2597 so that an error is return
> for any sequence-related object if the case is incorrect. This is probably a
> little too harsh for a difference in case.

It could be done as a warning for a couple of releases, and later an
error.  Why do you think it is too harsh?  Maybe I am being pedantic
here, but lots of code gets written assuming uppercase letters only,
and in this situation having any unwanted lower case caught early is a
good thing.

To my mind the whole point about the user explicitly using for example
the IUPAC protein alphabet is they expect the sequence to comply with
the IUPAC conventions.  I *WANT* to get an error if the sequence
contains something invalid like a "@" character, or anything else not
in the IUPAC definition.  Mixed cases are a special case of this (the
IUPAC standards use upper case).
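
Something along these lines is what I have in mind, as a sketch only.
The check_letters function and its strict flag are hypothetical names
for this example, not an existing Biopython API; the letters are the
twenty standard amino acids in upper case:

```python
import warnings

# The twenty standard amino acids, upper case only.
IUPAC_PROTEIN_LETTERS = "ACDEFGHIKLMNPQRSTVWY"


def check_letters(data, letters, strict=False):
    """Warn (or raise, if strict) about characters not in letters.

    Hypothetical helper for illustration; not a Biopython function.
    """
    unexpected = sorted(set(data) - set(letters))
    if not unexpected:
        return
    message = "Letters not in alphabet: %s" % ", ".join(
        repr(char) for char in unexpected
    )
    if strict:
        raise ValueError(message)
    warnings.warn(message)


# A valid upper case sequence passes silently in either mode.
check_letters("MKVLA", IUPAC_PROTEIN_LETTERS, strict=True)
```

With strict=False a sequence like "MKV@" or "mkvla" would only trigger
a warning; switching the default to strict=True in a later release
would turn the same cases into a ValueError.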

> b) Use regular expressions to ignore case but this will create a large
> penalty especially if it is not required.

I'm not sure what you mean here, but I don't think regular expressions
are required.

Peter