[Biojava-l] Introducing a mutation in a DNA sequence

Thu Apr 2 09:02:00 UTC 2015

Hi all,

thank you all for your answers and comments. I already mentioned the
possibility of using SequenceView to introduce sequence changes, or of using
the Edit class (which, as far as I can see, ultimately relies on a list of
Sequence objects that are concatenated, similar to the way a SequenceView
implementation would work).

Although this probably works perfectly well for many tasks, I think it would
have disadvantages when sequentially applying lots of mutations/edits to a
sequence (e.g. in a GUI-based alignment editor with Sequence objects as its
data backend, or in an application that simulates the evolution of sequences
along a large phylogenetic tree by sequentially applying mutations). In such
cases the resulting sequence (containing all mutations/edits) would be a
sequence view on top of a stack of other sequence views (each of them defining
one mutation). Depending on the index, a call to getCompoundAt() would, in the
worst case, have to traverse the whole stack (if that compound was already
present in the initial sequence). (I hope this is easier to understand this
time.)
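
To make this concern concrete, here is a minimal sketch. It assumes
hypothetical stand-in types (SimpleSequence, StoredSequence,
PointMutationView); none of them is BioJava API, and positions are 1-based
only to mirror getCompoundAt(). Each mutation wraps the previous sequence in
one more view, so looking up a position that was never mutated passes through
every view before it reaches the stored data.

```java
// Hypothetical stand-in for a read-only sequence; NOT the BioJava Sequence interface.
interface SimpleSequence {
    char getCompoundAt(int position); // 1-based position
    int getLength();
    int depth();                      // number of objects a lookup may pass through
}

// The initial sequence, stored directly.
final class StoredSequence implements SimpleSequence {
    private final String data;
    StoredSequence(String data) { this.data = data; }
    public char getCompoundAt(int position) { return data.charAt(position - 1); }
    public int getLength() { return data.length(); }
    public int depth() { return 1; }
}

// A view that changes exactly one position and delegates everything else.
final class PointMutationView implements SimpleSequence {
    private final SimpleSequence parent;
    private final int position;
    private final char replacement;
    PointMutationView(SimpleSequence parent, int position, char replacement) {
        this.parent = parent;
        this.position = position;
        this.replacement = replacement;
    }
    public char getCompoundAt(int p) {
        return p == position ? replacement : parent.getCompoundAt(p);
    }
    public int getLength() { return parent.getLength(); }
    public int depth() { return parent.depth() + 1; }
}

class ViewStackDemo {
    // Applies `count` point mutations, each one as a new view on the previous result.
    static SimpleSequence mutateRepeatedly(SimpleSequence start, int count) {
        SimpleSequence current = start;
        for (int i = 0; i < count; i++) {
            int position = (i % (start.getLength() - 1)) + 2; // never touches position 1
            current = new PointMutationView(current, position, 'T');
        }
        return current;
    }

    public static void main(String[] args) {
        SimpleSequence result = mutateRepeatedly(new StoredSequence("ACGTACGT"), 1000);
        // Position 1 was never mutated, so this call walks through all 1000 views.
        System.out.println(result.getCompoundAt(1));
        System.out.println(result.depth());
    }
}
```

In this sketch the cost of a single lookup grows linearly with the number of
edits applied so far, which is the behaviour I would like to avoid.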

For such applications it would be nice to have a Sequence implementation (or
an interface extending Sequence), or something similar, that can really edit
the underlying sequence without creating an ever-growing stack of views. I
have some applications that would benefit from this (e.g.
http://bioinfweb.info/LibrAlign/ ), but of course I would like to ask the
community how relevant such a feature would be in general.
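
As a rough illustration of what I mean by editing in place, here is a minimal
sketch. The class MutableDnaSequence and all of its methods are hypothetical
and not part of BioJava; the alphabet check is reduced to the four unambiguous
DNA letters to keep the example short. The compounds live in one buffer that
is modified directly, so no views accumulate.

```java
// Hypothetical in-place editable DNA sequence; NOT part of BioJava.
final class MutableDnaSequence {
    private static final String ALPHABET = "ACGT"; // simplified alphabet for this sketch
    private final StringBuilder compounds = new StringBuilder();

    MutableDnaSequence(String initial) {
        for (char c : initial.toCharArray()) {
            compounds.append(validate(c));
        }
    }

    // Positions are 1-based to mirror getCompoundAt() in BioJava.
    char getCompoundAt(int position) { return compounds.charAt(position - 1); }

    // Replaces the compound at the given position in the same buffer.
    void setCompoundAt(int position, char compound) {
        compounds.setCharAt(position - 1, validate(compound));
    }

    // Inserts a compound before the given position.
    void insertAt(int position, char compound) {
        compounds.insert(position - 1, validate(compound));
    }

    // Removes the compound at the given position.
    void deleteAt(int position) { compounds.deleteCharAt(position - 1); }

    int getLength() { return compounds.length(); }

    String getSequenceAsString() { return compounds.toString(); }

    // Rejects tokens outside the alphabet, as a sequence framework would.
    private static char validate(char compound) {
        if (ALPHABET.indexOf(compound) < 0) {
            throw new IllegalArgumentException("Invalid compound: " + compound);
        }
        return compound;
    }
}
```

With such a class, getCompoundAt() costs the same after one edit as after a
thousand. A real implementation would of course work with compound sets
instead of a hard-coded string of letters.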

@Andreas: Why would I not have an EditableSequence interface extend from
Sequence?

Generally I would have nothing against this, because all methods from Sequence
could be inherited, and implementations of EditableSequence could be passed to
methods which have parameters of type Sequence. (This was my initial idea,
which I posted in November.) Of course that would still violate the idea of
atomic sequences a bit, because methods with parameters of type Sequence would
have to check whether the passed objects also implement EditableSequence, to
know that they cannot assume their contents to be immutable. In that case such
methods could e.g. throw an exception if they cannot handle EditableSequences,
but every method out there would have to implement this behavior.
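
A minimal sketch of this trade-off follows. All types here are hypothetical:
ReadableSequence stands in for Sequence, and EditableSequence is only the
interface proposed in this thread, so none of this is existing BioJava API.
Inheritance lets editable objects be passed anywhere, and a method that
depends on immutability has to guard itself.

```java
// Hypothetical stand-in for the read-only Sequence interface; NOT BioJava API.
interface ReadableSequence {
    char getCompoundAt(int position); // 1-based position
    int getLength();
}

// The proposed subinterface: inherits all read methods and adds modification.
interface EditableSequence extends ReadableSequence {
    void setCompoundAt(int position, char compound);
}

final class FixedSequence implements ReadableSequence {
    private final String data;
    FixedSequence(String data) { this.data = data; }
    public char getCompoundAt(int position) { return data.charAt(position - 1); }
    public int getLength() { return data.length(); }
}

final class BufferedSequence implements EditableSequence {
    private final StringBuilder data;
    BufferedSequence(String data) { this.data = new StringBuilder(data); }
    public char getCompoundAt(int position) { return data.charAt(position - 1); }
    public int getLength() { return data.length(); }
    public void setCompoundAt(int position, char compound) {
        data.setCharAt(position - 1, compound);
    }
}

final class SequenceChecks {
    // Works for any ReadableSequence, so editable sequences can be passed in as well.
    static int countCompound(ReadableSequence sequence, char compound) {
        int count = 0;
        for (int i = 1; i <= sequence.getLength(); i++) {
            if (sequence.getCompoundAt(i) == compound) {
                count++;
            }
        }
        return count;
    }

    // A method that relies on immutable contents has to reject editable sequences itself.
    static ReadableSequence requireImmutable(ReadableSequence sequence) {
        if (sequence instanceof EditableSequence) {
            throw new IllegalArgumentException("This method cannot handle editable sequences");
        }
        return sequence;
    }
}
```

The check in requireImmutable() is exactly the piece that every existing
method relying on atomic sequences would need to add.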

Best
Ben

Andreas Prlic wrote on 2015-04-02:
> Hi,

> I agree with Ben's summary. The basic philosophy is that sequences are not
> mutable.  It is clear that we need some mechanism to introduce mutations in
> sequences, without having to allocate a copy of the sequence in memory.

> About Mark's suggestion: I think Paolo's comment to represent mutations via
> a "SequenceView" goes in a similar direction.

> I hear two suggestions for how to do this so far:

> A) Mutations via a SequenceView
> B) introduction of an EditableSequence interface.

> Ben: Could you comment a bit further why you would not have an
> EditableSequence interface extend from Sequence?

> ==

> Having said that, currently sequence manipulation is possible via "edits",
> however I suspect this is too complicated from an API perspective?

> From EditSequenceTest:

> public void substitute() throws CompoundNotFoundException {
>   DNASequence seq = new DNASequence("ACGT");
>   assertSeq(new Edit.Substitute<NucleotideCompound>("T", 2).edit(seq), "ATGT");
>   assertSeq(new Edit.Substitute<NucleotideCompound>("TT", 2).edit(seq), "ATTT");
>   assertSeq(new Edit.Substitute<NucleotideCompound>("T", 1).edit(seq), "TCGT");
>   assertSeq(new Edit.Substitute<NucleotideCompound>("TTC", 2).edit(seq), "ATTC");
> }

> .edit() is using the JoiningSequenceReader under the hood which has a
> getCompoundAt method.
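
To illustrate how a substitution can be expressed as a join without touching
the original, here is a small self-contained sketch. JoinedParts is a
hypothetical class working on plain strings; it is not the
JoiningSequenceReader, and the real implementation may differ. The result is
the untouched prefix, the replacement, and the untouched suffix, and a lookup
walks the parts until it finds the one containing the requested position. It
assumes the replacement fits inside the target.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a joined sequence; NOT the BioJava JoiningSequenceReader.
final class JoinedParts {
    private final List<String> parts = new ArrayList<>();

    JoinedParts(String... pieces) {
        for (String piece : pieces) {
            if (!piece.isEmpty()) {
                parts.add(piece);
            }
        }
    }

    // 1-based lookup: skip whole parts until the position falls inside one.
    char getCompoundAt(int position) {
        int offset = position - 1;
        for (String part : parts) {
            if (offset < part.length()) {
                return part.charAt(offset);
            }
            offset -= part.length();
        }
        throw new IndexOutOfBoundsException("Position out of range: " + position);
    }

    String asString() {
        StringBuilder joined = new StringBuilder();
        for (String part : parts) {
            joined.append(part);
        }
        return joined.toString();
    }

    // Overwrites target[start .. start + replacement.length() - 1] (1-based, inclusive).
    static JoinedParts substitute(String target, String replacement, int start) {
        int end = start + replacement.length() - 1; // last replaced position
        String prefix = target.substring(0, start - 1);
        String suffix = target.substring(end);
        return new JoinedParts(prefix, replacement, suffix);
    }
}
```

Called with the same arguments as the quoted test, e.g.
JoinedParts.substitute("ACGT", "TT", 2).asString(), this yields the same four
strings: "ATGT", "ATTT", "TCGT" and "ATTC".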

> Andreas

> On Wed, Apr 1, 2015 at 3:23 PM, Paolo Pavan <paolo.pavan at gmail.com>
> wrote:

> > Thank you Mark, I think it would be better to clarify this point; I may
> > have a different idea in mind.

> > Are we talking about a sequence object that, given a "parent" sequence,
> > shows the result of applying a set of mutation descriptors?
> > Should this result still be a Sequence object, so that it is possible to
> > apply any further processing that takes an AbstractSequence as input
> > (e.g. performing a sequence alignment with SmithWaterman)?
> > Should this result be the same Sequence object that was given as input,
> > which, through some mechanism still to be implemented, shows a sequence
> > string that differs from the original as a result of applying the mutation
> > descriptors?

> > If so, why not implement it with SequenceView, the same mechanism by which
> > we get a reverse-complemented sequence?
> > If this is accomplished, there will be no need for a new EditableSequence
> > interface and conversion to/from Sequence, or am I wrong?
> > Ben, could you clarify your concerns about such a design? Why do you still
> > see advantages in a mutable implementation of Sequence instead?

> > 2015-04-01 19:13 GMT+02:00 Mark Fortner <phidias51 at gmail.com>:

> >> Just out of curiosity, could mutations be applied as annotations to a
> >> wild-type sequence? The sequence would remain unedited, but you would still
> >> be able to represent the mutation and related annotations.  This might work
> >> for SNPs and indels, but I'm not sure how you would deal with chromosomal
> >> translocations.
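
One way to picture the annotation idea is the following minimal sketch. The
class AnnotatedWildType is hypothetical and not BioJava API, and it covers
single-position substitutions (SNPs) only; indels would shift coordinates and
need more bookkeeping. The wild-type string is never modified, and the
substitutions are kept in a map keyed by position.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: substitutions stored as annotations on an unedited wild type.
final class AnnotatedWildType {
    private final String wildType;
    private final Map<Integer, Character> substitutions = new TreeMap<>();

    AnnotatedWildType(String wildType) { this.wildType = wildType; }

    // Records a substitution at a 1-based position without editing the wild type.
    void annotateSubstitution(int position, char variant) {
        if (position < 1 || position > wildType.length()) {
            throw new IndexOutOfBoundsException("Position out of range: " + position);
        }
        substitutions.put(position, variant);
    }

    // Returns the variant if one is annotated, otherwise the wild-type compound.
    char getCompoundAt(int position) {
        Character variant = substitutions.get(position);
        return variant != null ? variant : wildType.charAt(position - 1);
    }

    String getWildType() { return wildType; }

    // Builds the mutated sequence on demand from the wild type and the annotations.
    String getMutatedSequence() {
        StringBuilder mutated = new StringBuilder(wildType);
        for (Map.Entry<Integer, Character> entry : substitutions.entrySet()) {
            mutated.setCharAt(entry.getKey() - 1, entry.getValue());
        }
        return mutated.toString();
    }
}
```

Because all substitutions sit in one map, a lookup does not get slower through
nesting, however many mutations have been recorded.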

> >> Also, would it be useful to be able to reference external variant
> >> databases like ClinVar or SwissVar when specifying a mutation?

> >> Regards,

> >> Mark

> >> On Wed, Apr 1, 2015 at 9:20 AM, Ben Stöver
> >> <benstoever at uni-muenster.de>
> >> wrote:

> >>> Hi Paolo and all,

> >>> yes, I guess that is the reason. Imagine a SequenceView implementation that
> >>> stores indices of the underlying sequence to make its modifications. If the
> >>> underlying sequence could be modified, the indices in the view would become
> >>> invalid, and all views of a Sequence would have to be notified about the
> >>> change (which would require implementing an observer pattern in Sequence,
> >>> which is currently not present). I guess the need for this change in logic
> >>> was the reason for keeping Sequence implementations atomic.
> >>> But maybe Andreas could comment on this, because that's just my
> >>> interpretation of his opinion.
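
The notification logic described here could look roughly like the following
sketch. All names (SequenceChangeListener, ObservableSequence, WindowView) are
hypothetical and nothing like this exists in BioJava's Sequence today; only
insertions are handled, to keep the example short. The view stores indices
into its parent and shifts them whenever the parent reports an insertion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface; Sequence currently has no observer support.
interface SequenceChangeListener {
    void compoundsInserted(int position, int count);
}

// Hypothetical mutable sequence that notifies its views about insertions.
final class ObservableSequence {
    private final StringBuilder compounds;
    private final List<SequenceChangeListener> listeners = new ArrayList<>();

    ObservableSequence(String initial) { this.compounds = new StringBuilder(initial); }

    void addListener(SequenceChangeListener listener) { listeners.add(listener); }

    char getCompoundAt(int position) { return compounds.charAt(position - 1); }

    // Inserts before the given 1-based position, then informs all registered views.
    void insert(int position, String inserted) {
        compounds.insert(position - 1, inserted);
        for (SequenceChangeListener listener : listeners) {
            listener.compoundsInserted(position, inserted.length());
        }
    }
}

// A view on parent[start..end] (1-based, inclusive) that keeps its indices valid.
final class WindowView implements SequenceChangeListener {
    private final ObservableSequence parent;
    private int start;
    private int end;

    private WindowView(ObservableSequence parent, int start, int end) {
        this.parent = parent;
        this.start = start;
        this.end = end;
    }

    // Factory method, so the view is registered only after it is fully constructed.
    static WindowView on(ObservableSequence parent, int start, int end) {
        WindowView view = new WindowView(parent, start, end);
        parent.addListener(view);
        return view;
    }

    @Override
    public void compoundsInserted(int position, int count) {
        if (position <= start) {      // insertion before the window: shift both ends
            start += count;
            end += count;
        } else if (position <= end) { // insertion inside the window: it grows
            end += count;
        }
    }

    String asString() {
        StringBuilder window = new StringBuilder();
        for (int i = start; i <= end; i++) {
            window.append(parent.getCompoundAt(i));
        }
        return window.toString();
    }
}
```

Without the listener callback, the window in this sketch would silently point
at different compounds after the first insertion, which is the invalid-index
problem described above.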

> >>> Although these are really good points, I would still say that having some
> >>> kind of mutable sequence would be a great thing, because mutating or
> >>> modifying sequences is a common task, and such applications might still
> >>> want or need to rely on a sequence framework which, e.g., checks that only
> >>> valid tokens are present or offers an implementation that can handle changes
> >>> in large sequences without having to copy everything to a new object, as
> >>> would be the case with simple String objects.

> >>> If other people agree that there is a need for this (I would be interested
> >>> in feedback here) and the community could agree on a way of implementing it
> >>> (without the disadvantages mentioned), I would be happy to help write the
> >>> corresponding code.

> >>> A separate EditableSequence interface and a tool class that converts
> >>> between Sequence and EditableSequence (without inheriting EditableSequence
> >>> from Sequence, as I initially proposed) might be one option, although this
> >>> would make Sequence and EditableSequence less compatible. I think this
> >>> would have to be discussed, but it might really be worth it.

> >>> Best
> >>> Ben

> >>> Paolo Pavan wrote on 2015-03-30:
> >>> > Hi Ben and all,
> >>> > I'm following this thread with interest.
> >>> > Just to look at this in more depth: what was the reason behind the idea
> >>> > of keeping the sequence atomic? Was it to keep working with the same
> >>> > instantiated object (and hence its reference) for the lifetime of the
> >>> > software run?
> >>> > If so, I like the idea that you yourself are suggesting, to accomplish
> >>> > the task of a DNA mutation with a SequenceView.

> >>> > Paolo

> >>> > 2015-03-30 16:36 GMT+02:00 Ben Stöver
> >>> > <benstoever at uni-muenster.de>:

> >>> > > Hi Jonas,

> >>> > > I proposed deriving a subinterface "EditableSequence" (with
> >>> > > corresponding implementations) from the existing Sequence interface
> >>> > > on this list last November. Some people liked this idea, some did
> >>> > > not, mainly because there seemed to be concerns that existing code
> >>> > > (using BioJava) relies on the assumption of atomic sequences, and
> >>> > > allowing their modification might break some of this code (at least
> >>> > > this was my interpretation of the concerns). (You can look at these
> >>> > > mails in an archive, or I can forward them to you if you want to have
> >>> > > a closer look at that discussion.)

> >>> > > To my knowledge it is indeed difficult to modify sequences in the
> >>> > > current architecture. The only way I'm aware of is creating a new
> >>> > > SequenceView on your sequence which provides a modified view of the
> >>> > > underlying sequence, modeling your mutation. I think there are even
> >>> > > some implementations out there based on this interface
> >>> > > https://github.com/biojava/biojava/blob/master/biojava-core/src/main/java/org/biojava/nbio/core/sequence/edits/Edit.java
> >>> > > but I never tried them. In my opinion, whether this approach makes
> >>> > > sense for you is mainly a question of performance. (If you e.g.
> >>> > > perform many mutations, you would not want to create a copy of your
> >>> > > whole sequence for each operation and have a chain of 1000 sequence
> >>> > > views in the end.)

> >>> > > Of course you are always free to create, or modify, an implementation
> >>> > > of "Sequence" that offers additional methods for modification, but
> >>> > > keep in mind that this would break the assumption of "atomic sequence
> >>> > > objects", which seems to be intended in the current BioJava sequence
> >>> > > model.

> >>> > > Anyway, if anyone knows about any other ways to do that in BioJava or
> >>> > > could think of a good way of integrating this functionality into the
> >>> > > existing architecture (without building up an alternative sequence
> >>> > > framework), I would be very interested to know.

> >>> > > Best
> >>> > > Ben

> >>> > > Dipl. Biologe Ben Stöver
> >>> > > Evolution and Biodiversity of Plants Group
> >>> > > Institute for Evolution and Biodiversity
> >>> > > University of Münster
> >>> > > Germany
> >>> > > http://www2.ieb.uni-muenster.de/EvolBiodivPlants/en/People/Stoever
> >>> > > BenStoever at uni-muenster.de

> >>> > > LAW Andy wrote on 2015-03-30:
> >>> > > > I think the philosophical view on this is that the mutated sequence
> >>> > > > is a *new* and *different* sequence.

> >>> > > > On 30 Mar 2015, at 09:30, Jose Manuel Duarte
> >>> > > > <jose.duarte at psi.ch>
> >>> > > > wrote:

> >>> > > > > Hi Jonas

> >>> > > > > I'm not very familiar with the sequence part of BioJava, but after
> >>> > > > > looking around a bit it seems that there is indeed no available
> >>> > > > > way to mutate sequences. It looks like people using BioJava before
> >>> > > > > had "read-only" applications in mind. I agree a
> >>> > > > > setCompoundAt(int position) would be needed; it should actually be
> >>> > > > > part of the Sequence interface. It would be a nice addition for
> >>> > > > > 4.1.

> >>> > > > > Anyway, sorry I can't be of more help; perhaps someone else has
> >>> > > > > some more background info on this.

> >>> > > > > Jose

> >>> > > > > On 28.03.2015 17:13, Jonas Dehairs wrote:
> >>> > > > >> I want to introduce a mutation to a DNA sequence at a particular
> >>> > > > >> location. I can't seem to find a suitable method for this in the
> >>> > > > >> 4.0 API. What would make most sense to me is a
> >>> > > > >> setCompoundAt(int position, C compound) method in the
> >>> > > > >> AbstractSequence class, similar to the getCompoundAt(int position)
> >>> > > > >> method, but this doesn't seem to exist. And the mutator class
> >>> > > > >> seems to be for proteins only. How can I do this?

> >>> > > > --
> >>> > > > The University of Edinburgh is a charitable body,
> >>> > > > registered in
> >>> > > > Scotland, with registration number SC005336.

> >>> > > > _______________________________________________
> >>> > > > Biojava-l mailing list  -  Biojava-l at mailman.open-bio.org
> >>> > > > http://mailman.open-bio.org/mailman/listinfo/biojava-l

> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> RCSB PDB Protein Data Bank
> University of California, San Diego

> Editor Software Section
> PLOS Computational Biology

> BioJava Project Lead
> -----------------------------------------------------------------------