[Biojava-l] How to find a sequence within a larger sequence and flip it

Richard Holland holland at eaglegenomics.com
Fri Sep 19 08:42:50 UTC 2008


Hello.

To be honest, I think you've already got the only way to quickly
locate a subsequence within a sequence. For whatever reason, the
Sequence and SymbolList interfaces lack any kind of indexOf() or
find() functions, and the SequenceTools class, usually the provider of
all things useful, also fails to fill the gap.

You're right about there being a SymbolList edit facility. It only
works on SymbolLists that have declared themselves editable, which
depends on how your SymbolList objects were created. You create a new
Edit object from three things: the starting position in the original
sequence, the length of the stretch to remove, and the SymbolList to
put in its place. Then you pass that Edit to the edit() method of the
SymbolList/Sequence object you want to modify.

So, the end result is only a small improvement on your original plan,
but here goes:

 1. Create your sequence.
 2. Create your other sequence.
 3. Convert both to strings and use String.indexOf() to locate the
subsequence in the original sequence.
 4. Use string tools to reverse the subsequence, then create a new
SymbolList from it.
 5. If the original sequence is editable, use the Edit approach
described above to replace that chunk with the new flipped
subsequence (SymbolList positions are 1-based while indexOf() is
0-based, so add 1 to the index). Otherwise, build a new string with
the String methods and construct a new sequence from that instead.
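
As a rough sketch of the string-only fallback in steps 3 to 5, something
like the following should do. It uses nothing but java.lang.String and
StringBuilder; the class and method names are made up for illustration
and are not part of BioJava. It reverses the subsequence only (no
complementing) and touches just the first match.

```java
// Hypothetical helper, not a BioJava class: plain-String version of steps 3-5.
final class SubsequenceFlipper {
    private SubsequenceFlipper() {}

    /**
     * Returns a copy of big with the first occurrence of sub reversed
     * in place, or big unchanged if sub does not occur in it.
     */
    static String flipFirstOccurrence(String big, String sub) {
        int start = big.indexOf(sub);   // 0-based index, -1 if absent
        if (start < 0) {
            return big;
        }
        int end = start + sub.length(); // first index after the match
        String flipped = new StringBuilder(sub).reverse().toString();
        // Everything before the match + reversed match + everything after.
        return big.substring(0, start) + flipped + big.substring(end);
    }
}
```

For your example, flipFirstOccurrence("aaaagacttttt", "gact") gives
"aaaatcagtttt", which you can then hand to DNATools.createDNA() as in
your own snippet.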

cheers.
Richard

2008/9/19 Doug Swisher <big.swish at gmail.com>:
> Hi,
>
> I'm pretty new to BioJava, and I'm a bit stuck.  I'm hoping someone can help
> out a bit...even if it's just a hint as to where to look next.
>
> I have a long DNA sequence and a shorter sequence that exists within the
> larger one.  I want to find the location of the smaller sequence within the
> larger one, and then create a new sequence with the small one flipped
> end-for-end.  That's confusing, so let me give an example.
>
> Long sequence: aaaagacttttt
> Short sequence: gact
> Goal sequence: aaaatcagtttt
>
> To find the location of the short sequence within the larger one, I could
> certainly do some string manipulation:
>
>    SymbolList bigDNA = DNATools.createDNA("aaaagacttttt");
>    SymbolList subDNA = DNATools.createDNA("gact");
>    int start = bigDNA.seqString().indexOf(subDNA.seqString());
>
> While that would work, I'm wondering if there is a more efficient method
> that avoids the conversion to strings (in my real code, I start with
> Sequences, not strings; I used SymbolLists here for simplicity).
>
> To "excise" the short sequence, flip it around, and construct a new
> SymbolList, I could also do some string manipulation, as in the following:
>
>    // Reverse the subsequence (reverse only, no complement).
>    StringBuilder middle = new StringBuilder(subDNA.seqString());
>    // The left part is everything before the match, so it ends at start.
>    String leftPart = bigDNA.seqString().substring(0, start);
>    String rightPart = bigDNA.seqString().substring(start + subDNA.length());
>    SymbolList goalDNA = DNATools.createDNA(leftPart + middle.reverse() + rightPart);
>
> Looking at the documentation, such as ProjectionUtils or SymbolList.edit(),
> it appears there might be some support for manipulating the sequence
> directly.  Is there a way to do it, without again dropping "down" to
> strings?
>
> Thanks in advance for any assistance.
>
> Cheers,
> -Doug
>
> P.S. Yeah, the second code snippet is pretty inefficient; I was trying to be
> clear rather than efficient.



-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


