[Biojava-l] How to find a sequence within a larger sequence and flip it

Doug Swisher big.swish at gmail.com
Fri Sep 19 14:20:08 UTC 2008


Mark & Richard,

Thanks for the quick responses.

It looks like the combination of KnuthMorrisPrattSearch and edit() will do
just what I need.
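
For illustration, here is a minimal sketch of that combination in plain Java.
It does not call BioJava: the class FlipSubsequence and its methods are
hypothetical stand-ins, with a hand-written Knuth-Morris-Pratt search over
Strings in place of KnuthMorrisPrattSearch and StringBuilder.replace() in
place of edit(). Offsets here are 0-based, whereas BioJava SymbolLists are
1-based, so positions would need adjusting in a real port.

```java
import java.util.ArrayList;
import java.util.List;

public class FlipSubsequence {

    // Build the KMP failure table: fail[i] is the length of the longest
    // proper prefix of pattern[0..i] that is also a suffix of it.
    public static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Return the 0-based start offset of every occurrence of pattern in text,
    // including overlapping occurrences.
    public static List<Integer> findMatches(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        if (pattern.isEmpty()) {
            return hits;
        }
        int[] fail = failureTable(pattern);
        int k = 0;
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                hits.add(i - k + 1);
                k = fail[k - 1];
            }
        }
        return hits;
    }

    // Replace the first occurrence of sub in big with sub reversed end-for-end.
    // Returns big unchanged when sub does not occur.
    public static String flipFirst(String big, String sub) {
        List<Integer> hits = findMatches(big, sub);
        if (hits.isEmpty()) {
            return big;
        }
        int start = hits.get(0);
        String flipped = new StringBuilder(sub).reverse().toString();
        return new StringBuilder(big)
                .replace(start, start + sub.length(), flipped)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(findMatches("aaaagacttttt", "gact")); // [4]
        System.out.println(flipFirst("aaaagacttttt", "gact"));   // aaaatcagtttt
    }
}
```

Only the first match is flipped; handling several matches would mean deciding
what to do when they overlap.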

FYI, SymbolListCharSequence won't work for me, as I'm actually porting
the code to .NET, and the .NET Regex engine only accepts strings as input.
(Please don't hate me; I'm working in a Java-averse environment, and I want
to take advantage of all the BioJava goodness.)
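
For readers staying on the Java side, the wrapper idea works because
java.util.regex accepts any CharSequence, not just String. The sketch below
is a hypothetical stand-in, not BioJava's SymbolListCharSequence: the class
SymbolArrayCharSequence and the helper firstMatch() are made up for
illustration, and a plain char array plays the role of the symbol list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A read-only CharSequence view over a char array (no String copy is made
// until toString() is called).
public final class SymbolArrayCharSequence implements CharSequence {
    private final char[] symbols;
    private final int start; // inclusive offset into symbols
    private final int end;   // exclusive offset into symbols

    public SymbolArrayCharSequence(char[] symbols) {
        this(symbols, 0, symbols.length);
    }

    private SymbolArrayCharSequence(char[] symbols, int start, int end) {
        this.symbols = symbols;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return symbols[start + index];
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new SymbolArrayCharSequence(symbols, start + from, start + to);
    }

    @Override
    public String toString() {
        return new String(symbols, start, end - start);
    }

    // Return the 0-based offset of the first regex match in seq, or -1.
    public static int firstMatch(CharSequence seq, String regex) {
        Matcher m = Pattern.compile(regex).matcher(seq);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        CharSequence dna = new SymbolArrayCharSequence("aaaagacttttt".toCharArray());
        System.out.println(firstMatch(dna, "ga[ct]t")); // 4
    }
}
```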

Cheers,
-Doug

On Fri, Sep 19, 2008 at 7:43 AM, Mark Schreiber <markjschreiber at gmail.com> wrote:

> Hi -
>
> You don't have to go to a String to make a match. There is a class,
> SymbolListCharSequence, that wraps a SymbolList as a CharSequence so that
> you can run regular expressions against it to identify the match. You can
> also use KnuthMorrisPrattSearch to find exact matches.
>
> Finally, to find non-exact matches you can use the Smith-Waterman or
> Needleman-Wunsch alignment algorithms.
>
> - Mark
>
> On Fri, Sep 19, 2008 at 4:42 PM, Richard Holland <holland at eaglegenomics.com> wrote:
>
>> Hello.
>>
>> To be honest, I think you've already got the only way to quickly
>> locate a subsequence within a sequence. For whatever reason, the
>> Sequence and SymbolList interfaces lack any kind of indexOf() or
>> find() functions, and the SequenceTools class, usually the provider of
>> all things useful, also fails to fill the gap.
>>
>> You're right about there being a SymbolList edit facility. This only
>> works on SymbolLists that have declared themselves editable, which
>> will depend on how your SymbolList objects were created. What you do
>> is create a new Edit object, based on starting position in the
>> original sequence, length of sequence to remove in the original, and
>> the SymbolList you want to use to replace the removed bits. Then you
>> pass this to the edit() method on the SymbolList/Sequence object you
>> want to modify.
>>
>> So, the end result is only a small improvement on your original plan,
>> but here goes:
>>
>>  1. Create your sequence.
>>  2. Create your other sequence.
>>  3. Convert both to strings and use an indexOf in the String object to
>> locate the subsequence in the original sequence.
>>  4. Use string tools to flip the subsequence then create a new
>> SymbolList based on it.
>>  5. If the original sequence is editable, use the Edit method
>> described above to replace a chunk of it with the new flipped
>> subsequence. Otherwise, construct a new string using the String object
>> methods and construct a new original sequence based on that instead.
>>
>> cheers.
>> Richard
>>
>> 2008/9/19 Doug Swisher <big.swish at gmail.com>:
>> > Hi,
>> >
>> > I'm pretty new to BioJava, and I'm a bit stuck.  I'm hoping someone can help
>> > out a bit...even if it's just a hint as to where to look next.
>> >
>> > I have a long DNA sequence and a shorter sequence that exists within the
>> > larger one.  I want to find the location of the smaller sequence within the
>> > larger one, and then create a new sequence with the small one flipped
>> > end-for-end.  That's confusing, so let me give an example.
>> >
>> > Long sequence: aaaagacttttt
>> > Short sequence: gact
>> > Goal sequence: aaaatcagtttt
>> >
>> > To find the location of the short sequence within the larger one, I could
>> > certainly do some string manipulation:
>> >
>> >    SymbolList bigDNA = DNATools.createDNA("aaaagacttttt");
>> >    SymbolList subDNA = DNATools.createDNA("gact");
>> >    int start = bigDNA.seqString().indexOf(subDNA.seqString());
>> >
>> > While that would work, I'm wondering if there is a more efficient method
>> > that avoids the conversion to strings (in my real code, I start with
>> > Sequences, not strings; I used SymbolLists here for simplicity).
>> >
>> > To "excise" the short sequence, flip it around, and construct a new
>> > SymbolList, I could also do some string manipulation, as in the following:
>> >
>> >    // Reverse the matched subsequence end-for-end.
>> >    StringBuilder middle = new StringBuilder(subDNA.seqString());
>> >    // Everything before the match (start is a 0-based String offset).
>> >    String leftPart = bigDNA.seqString().substring(0, start);
>> >    // Everything after the match.
>> >    String rightPart = bigDNA.seqString().substring(start + subDNA.length());
>> >    SymbolList goalDNA = DNATools.createDNA(leftPart + middle.reverse() + rightPart);
>> >
>> > Looking at the documentation, such as ProjectionUtils or SymbolList.edit(),
>> > it appears there might be some support for manipulating the sequence
>> > directly.  Is there a way to do it, without again dropping "down" to
>> > strings?
>> >
>> > Thanks in advance for any assistance.
>> >
>> > Cheers,
>> > -Doug
>> >
>> > P.S. Yeah, the second code snippet is pretty inefficient; I was trying to be
>> > clear rather than efficient.
>> > _______________________________________________
>> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>>
>>
>>
>> --
>> Richard Holland, BSc MBCS
>> Finance Director, Eagle Genomics Ltd
>> M: +44 7500 438846 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>>
>
>


