[BioPython] sequence proposals (long)
Jeffrey Chang
jchang@SMI.Stanford.EDU
Thu, 30 Mar 2000 01:50:07 -0800 (PST)
On Mon, 27 Mar 2000, Andrew Dalke wrote:
> Hello,
>
> I've been thinking about how to define the basic sequence classes for
> biopython. I've got a set of proposals on the topic which I would
> love to get feedback about. Since I have a tendency to write long
> emails, I've broken them up into several messages, which I'll be
> sending over the next week or so.
Yikes! And this is the broken-up one!
> This one contains my proposals for the basic sequence protocol, and an
> idea of how to handle alphabets and encodings. I have a bias towards
> Python, but I also compare the Perl, Java and C++ ways of doing
> things.
Sure. I don't think it would be a failure if biopython were to make
sequences classes that were biased (even heavily) toward python's way of
doing things. I'd rather have something that works well here, rather than
sequences that suck equally on all languages! ;)
> Proposal 1) the basic sequence interface
>
> Conceptually, people think about sequences as a string - a list of
> residues. A residue can be an object (as with biojava's ResidueList)
> or a character (as with bioperl's PrimarySeq). Since I am a fan of
> generic programming, I want the basic sequence interface to act the
> same as a string, when possible.
Do you mean sequences should support the same slicing semantics? After
python 1.6, strings will become objects, with their own methods, and how
string objects and biological sequences act will diverge.
> Here is a list of the interface I think all sequences, and
> sequence-like objects, should implement.
>
> 1.1) There is a way to iterate forward through the sequence, element
> by element. When finished, all elements will have been visited in
> order. Forward-only iterators should be rarely implemented.
[discussion of forward iterators]
> 1.2) Random access sequences must be integer subscriptable via the
> appropriate means for a string for the given language.
[... I'm liberally cutting things from the email, for length and
relevance reasons. I hope that I'm not leaving anything without the
proper context. I apologize if I do!]
> 1.3) If the sequence length is known, there must be a way to get
> access to it in constant time. The returned value must be usable to
> get access to the last element of the list.
[...]
> 1.4) Position, lengths and ranges should be used in language
> appropriate form.
>
> This is the standard "do sequences start at 1, like biologists think
> about them or 0 like the language (except for Visual Basic and
> Fortran)." I strongly believe if the language is 0 based, then the
> sequence implementation is 0 based. Output meant for a biologist
> should be converted to 1 based, but not the internals.
Yes, this has been covered here before, and IIRC, the consensus.
> I say "should" in the proposal instead of a "must" because there is
> much contention about the subject. Biojava has sequences starting a 1
> even though Java is a 0 based langauge.
In defense of the biojava people, having 1-based sequences seem less
offensive in java than it would in python, where the subscripting is
overloadable. It feels to me like the semantics of the indexes for a
method call is less stringently enforced than that of subscripting, where
the syntax is built into the language.
> 1.5) Subsequences can be extracted from sequences given a range, and
> the subsequence implements the sequence interface.
[...]
> (should strides be supported?)
What's a stride?
> 1.6) Mutability - if the sequence class is mutable, then
> a) elements must be changed by a means equivalent to subscripts
> b) ranges must be changed by a means equivalent to subslicing
> At the implementor's discretion, neither or both of the following
> are implemented:
> c) assignment to slices may change the length of the list
> (eg, seq[1:5] = "AAAAAAAAAAAAAAAAAAAAAAAAA")
> d) elements may be able to be added and removed
> At the implementor's discretion:
> e) modification to slices may affect the original sequence
>
> This part of my proposal is meant mostly as a guideline to show that
> mutability is a complex problem.
Agreed.
(Slight tangent) I'm not sure if you've mentioned it explicitly, but
we're going to need both mutable and immutable sequences. Immutable
sequences are necessary in order to guarantee the consistency between the
sequence information and any annotations that may have been carried with
it. Because it would be so hairy otherwise, I propose that any annotated
sequences must be immutable.
> As an example of e), should the last line of following return 'C' or
> 'x'?
>
> >>> a = Seq("ABCDE")
> >>> b = a[2:4]
> >>> str(b[0])
> 'C'
> >>> str(a[2])
> 'C'
> >>> b[0] = "x"
> >>> b[0]
> 'x'
> >>> str(a[2])
>
>
> It turns out that an implementation with array.array will return 'C'
> while one with Numeric.array will return 'x'. Which is best?
I don't have a definitive answer for this, but can probably add to the
confusion. I'm not really a fan of the Numeric way of doing this, because
it breaks the usual python idiom where a[:] creates a copy of the a list.
However, the Numeric way does save a lot of memory when accessing just a
region of a large matrix (or DNA sequence).
> 1.7) If the sequence elements are meant to be viewable, their string
> value is found through the normal stringification operation.
>
> In the simplest case, str(seq[i]) should return the single, (or three
> letter) code for the given element. Adding this clause makes it
> easier to support aligned displays of different sequence types.
Yes, but I'm not sure we need to allow this kind of flexibility. I
believe str should just return a human-readable string, and leave
specialized formatting to other functions.
> Note that there can be other ways to get a name for a residue. You
> may prefer the three letter name, or the english name, or the spanish
> name. There is a general need for looking up an property of a
> residue, which will be discussed in one of my future proposals.
OK.
> 1.8) If possible, the stringification of a sequence object must return
> the string value of the elements such that, when given a sequence of
> size N and some x in a valid subrange, then:
>
> str(seq) == str(seq[:x]) + str(seq[x:])
>
>
> I'm not sure about this one, which is why I added the "if possible".
> The problem comes when there are seperators. Suppose you use a three
> letter code, then you might want the stringification of a sequence to
> be: ALA-GLY-PRO-ASP instead of ALAGLYPROASP. But splitting the
> sequence in half and joining them may create
>
> "ALA-GLY" + "PRO-ASP" -> "ALA-GLYPRO-ASP"
>From what I understand, str is just supposed to return a human-readable
string, useful for interactive mode or debugging. I'm uncomfortable
imposing other constraints on it.
> ===============================
>
> Proposal 2) Encodings
>
> My first proposal topic works fine for sequences which can be
> expressed as a set of letters/elements. In addition to the normal
> IUPAC protein/nucleotide usage, here are some of the ways it can be
> used (if you know of more, please let me know):
>
> A) non-standard residues have their own characters. Some are
> mentioned in the IUPAC definitons, like:
> X = selenocysteine; for proteins
> B = Asx = aspartic acid or asparagine; for protein
> W = wyosine; for DNA
> B) non-IUPAC usage
> - perhaps the most common is using upper/lower case as an
> indication of certainty.
> - Aaron J Mackey (ajm6q@virginia.edu) on bioperl-guts points
> out that SEG can convert low-information residues to lower case
> so they act like X in FASTA, but you can still see the original
> characters in the output
> C) non-residue information
> (I especially want more information about these.)
> - gaps, as in alignments (often with "-", "." or " ")
> - stop codons; "*" is often used to show where a protein residue
> would be, if the corresponding codon wasn't a stop codon.
> D) extra-residue information, still with a single character
> - show secondary structure prediction using "H"elix, "S"trand,
> "C"oil, "T"urn
> - digit to show % properties, like % conservation
> E) non-single character definitions, still per-residue
> - a prosite pattern match ("[FILAPVM]"), esp. as when aligned
> to the matching protein,
> - alignment of 3-character nucleotides to 1-character amino acids
>
> All of the above scenerios can be modeled with objects implementing
> the sequence protocol. That is, they are subscriptable, sliceable,
> stringifiable, etc. I know this is true for A)-D) because I've seen
> FASTA files used to store all of them, and the records of a FASTA file
> are easily mapped to the sequence protocol (trivially so, since the
> data is natually stored as a string of characters). And E) works
> because I've restricted it to have a one-to-one mapping to a residue.
>
> When read from a generic FASTA file, the sequence isn't really
> anything other than a glorified string. In a formal sense, you can't
> calculate the molecular weight because you don't know if
> "CCCCCHHHHHHHTCCC" is a protein or nucleotide sequence, or maybe a
> secondary structure prediction! (BTW, I consider the autoguessing
> requirement of bioperl to be a flaw because other encodings exist.)
>
> You can't even get the biological sequence length, because it can
> contain gap characters. Should the length of "A--T" be 2 or 4? I
> think 2, but that violates my "1.3)", above.
It depends on what you consider the sequence length.
I don't consider "A-T" to be a biological sequence. I think the sequence
is "AT", with extra information embedded within. B-D above all describe
cases in which information other than the sequence is contained in the
string representation of the sequence. This should be stored in some
other place, and the sequence protocol preserved for the actual sequence.
For example:
>>> seq = GappedSequence("AT-G--C")
>>> seq[1:3]
'TG'
>>> seq.gapped[1:3]
'T-G'
Here, the sequence slicing returns the actual biological sequence, while
the gapped representation is delegated to another object. The actual
storage of the gap information is unspecified.
This would solve the problem discussed on the bioperl list, where doing
upper or lower was destroying the information.
> The problem is two-fold; what's the per-residue alphabet, and what's
> the encoding for non-residue information? (The last part is C) in my
> list.)
>
>
> I got to thinking about this problem a lot. Staring from basics, if
> space and performance were of no concern, a sequence like "APGA..."
> could be represented as
>
> from IUPAC.Protein import ALA, PRO, GLY
> sequence = [ALA(), PRO(), GLY(), ALA(), ...]
>
>
> where PRO, GLY and ALA are constructors for a residue object. This
> would even solve the typing problem because each residue would be
> typed, and all you need to assert is that the element types are
> homogenous. Also, in Python, a list (the "[]") implements the
> sequence interface.
>
> In the general case, each residue may be different. For example, some
> of the residues may be modified - perhaps methylated, either at
> specific sites or in a statistical sense - or you want to store the 3D
> coordinates for each one.
>
> If the residues are identical and position independent, there's no
> need to create new objects, so this could be turned into:
>
> sequence = [ALA, PRO, GLY, ALA, ...]
>
> These are sometimes called fly-weight objects because the objects are
> reused. The new "objects" are really just the pointers/references to
> the real object. This, BTW, is the biojava approach.
>
> Space and performance are important, and there are fewer than 256 of
> them (okay, fewer than 27) , so this is further reduced to:
>
> sequence_type = "protein"
> sequence = "APGA"
> or
> sequence = Seq("APGA", moltype = "protein")
>
> This is the bioperl approach.
>
> This compression hinges on several predicates:
> a) all residues are identical
> b) there is one-to-one mapping from letters+seqtype to the "real"
> residue objects
> c) residues are of homogenous data types (can't mix protein and
> rna types)
>
> Using a gap character or stop codon symbol violates c) because
> non-homogenous types in the container. A stop codon symbol violates
> b) because there is nothing real about it. It indicates the
> non-existence of an object.
>
> So those characters (and likely others in different contexts) do not
> belong as elements of a sequence. They really should be considered
> part of the sequence itself.
>
> Flipping things around, how would I define something which holds a
> gapped sequence? I would make a new class which has the sequence and
> the location/length of the gaps. It takes the sequence string and
> knows the appropriate way to interpret the gap character. Very much
> like what bioperl's SimpleAlign.pm and UnivAln.pm do, except holding
> only one sequence.
>
>
> There is an interesting question here - should GappedSeq be derived
> from Seq, or vice versa? If GappedSeq is derived from Seq, then by
> good class design, a GappedSeq is usable anywhere a Seq is usable, but
> this is false - I can't use the same algorithm to calculate the
> molecular weight.
Yes. I think GappedSeq will need to behave like Seq, with the
gap-specific stuff in methods specific to GappedSeq. Unless there's
something I'm missing... (It's getting late).
> 2.1) Sequences using a finite alphabet must contain an "alphabet"
> member, which is used to map the alphabet back to the appropriate
> per-residue type.
Seems reasonable. I've never had to use an alphabet, but I've not been in
a situation where the requirement would've gotten in my way, either.
[...]
> There will be more in future emails about how the properties work
> together, so that you can preserve lower case letters if you want, or
> have lower case DNA go to upper case protein, or compute the molecular
> weight of a specialized encoding (eg, for selenocysteine). But I
> really need to implement it first to make sure it works
Looking forward to them!
Jeff
>
>
> Andrew Dalke
> dalke@acm.org
>
>
>
> _______________________________________________
> BioPython mailing list - BioPython@biopython.org
> http://biopython.org/mailman/listinfo/biopython
>