[Biojava-l] Sequence Iteration in BioJava(x)

mark.schreiber at novartis.com
Thu Dec 15 22:45:21 EST 2005


There is probably no performance benefit, except in the case of very 
large sequences, which BioJava often compresses behind the scenes.

The benefits may come from ease of use and object orientation.

For example, there is probably already a parser to read in and validate 
your sequence. The windowing or nMer code is already figured out and has 
been used by lots of people, so it has been "stress tested". The objects 
themselves also have a lot of functionality built in that a character 
stream does not. The downside of using objects is that they take up memory 
and there is a certain amount of overhead in their construction. To help 
overcome this, SymbolLists are actually lists of references to Symbols, 
not lists of Symbols themselves. This makes them much smaller (although 
not as small as a char[]).
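
As a rough illustration of the "list of references" idea, here is a 
minimal plain-Java sketch. It is not BioJava's actual implementation; the 
SharedSymbols class and the Base enum are made up for this example. Each 
position in the list points at one of four shared objects rather than 
holding its own copy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these types are not part of BioJava.
final class SharedSymbols {
    // Four singleton "symbol" objects, one per nucleotide.
    enum Base { A, C, G, T }

    // Build a list of references into the four shared Base instances.
    // Base.valueOf throws IllegalArgumentException for any other character,
    // which doubles as a crude validation step.
    static List<Base> parse(String dna) {
        List<Base> seq = new ArrayList<>(dna.length());
        for (int i = 0; i < dna.length(); i++) {
            char c = Character.toUpperCase(dna.charAt(i));
            seq.add(Base.valueOf(String.valueOf(c)));
        }
        return seq;
    }

    public static void main(String[] args) {
        List<Base> seq = parse("ACTGACTG");
        // Positions 0 and 4 both hold 'A' and point at the same object.
        System.out.println(seq.get(0) == seq.get(4)); // prints "true"
    }
}
```

On common JVMs a reference takes 4 or 8 bytes against 2 bytes for a char, 
which is why such a list is still larger than a char[].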

If you want superfast performance then you should bit encode the data and 
operate over it with memory pointers as in C or machine code. You should 
be aware, though, that any intensive loop like the ones that would be used 
to carry out this operation in BioJava will almost certainly be detected 
and compiled into native code by the Java runtime on the fly. This might 
make it hard to say whether the Java code would be much slower than the C 
code.
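
To make the bit-encoding suggestion concrete, here is a small sketch in 
plain Java rather than C. It assumes a 2-bit code per base (A=0, C=1, G=2, 
T=3) and packs each overlapping n-mer into a long. The TwoBitKmers class 
and its method names are invented for this example and are not part of 
BioJava.

```java
import java.util.function.LongConsumer;

// Hypothetical helper, not a BioJava class.
final class TwoBitKmers {
    // Map a base to its 2-bit code: A=0, C=1, G=2, T=3; -1 for anything else.
    static int code(char c) {
        switch (c) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            case 'T': case 't': return 3;
            default: return -1;
        }
    }

    // Pass every overlapping k-mer (1 <= k <= 31), packed into a long, to sink.
    // Windows that would contain a non-ACGT character are skipped.
    static void forEachKmer(CharSequence seq, int k, LongConsumer sink) {
        if (k < 1 || k > 31) throw new IllegalArgumentException("k must be in 1..31");
        long mask = (1L << (2 * k)) - 1; // keeps only the lowest 2*k bits
        long window = 0;
        int valid = 0; // consecutive valid bases seen so far
        for (int i = 0; i < seq.length(); i++) {
            int c = code(seq.charAt(i));
            if (c < 0) { valid = 0; window = 0; continue; }
            window = ((window << 2) | c) & mask; // shift in the new base
            if (++valid >= k) sink.accept(window);
        }
    }

    // Unpack a 2-bit encoded k-mer back into text.
    static String decode(long kmer, int k) {
        char[] chars = new char[k];
        for (int i = k - 1; i >= 0; i--) {
            chars[i] = "ACGT".charAt((int) (kmer & 3));
            kmer >>>= 2;
        }
        return new String(chars);
    }
}
```

Each step costs one shift, one OR and one mask, and nothing is allocated 
per window, so this is the kind of tight loop the runtime can compile to 
native code.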

- Mark





Mark Fortner <m.fortner at sbcglobal.net>
Sent by: biojava-l-bounces at portal.open-bio.org
12/16/2005 10:36 AM
Please respond to m.fortner

 
        To:     biojava-list <biojava-l at biojava.org>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        Re: [Biojava-l] Sequence Iteration in BioJava(x)


Richard,
Thanks for the example.  Your approach is very similar to a non-BioJava 
approach that I had worked out earlier.  I was wondering if the 
BioJava(x) API provides any performance benefit over simply running a 
window along a character stream? 

The work that we're doing involves iterating through the human genome, 
(and in a number of cases, metagenomic sequences) and we're trying to 
squeeze as much performance out of it as possible while minimizing the 
memory footprint.
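
For comparison, running a window along a character stream might look 
roughly like the sketch below. This is not the code from either message; 
the StreamWindow class and forEachNmer method are names invented for this 
example. It assumes plain sequence text with line breaks to be skipped, 
and it holds only n characters in memory at a time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

// Hypothetical helper, not a BioJava class.
final class StreamWindow {
    // Slide a window of length n along a character stream, one character
    // at a time. Line breaks are skipped.
    static void forEachNmer(Reader in, int n, Consumer<String> sink) throws IOException {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        char[] ring = new char[n]; // circular buffer holding the last n characters
        long seen = 0;             // number of sequence characters read so far
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n' || c == '\r') continue;
            ring[(int) (seen % n)] = (char) c;
            seen++;
            if (seen >= n) {
                // The oldest character sits right after the one just written.
                int start = (int) (seen % n);
                StringBuilder sb = new StringBuilder(n);
                for (int i = 0; i < n; i++) sb.append(ring[(start + i) % n]);
                sink.accept(sb.toString());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        forEachNmer(new StringReader("ACTGACTGACTG"), 3, System.out::println);
    }
}
```

Run on ACTGACTGACTG with n = 3, this emits the ten trimers listed in the 
original question. For a genome-sized file the Reader would normally be 
wrapped in a BufferedReader, and building a String per window is a 
convenience that a performance-critical version would likely avoid.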

Thanks,

Mark

Richard HOLLAND wrote:

>orderNSymbolList splits the sequence into non-overlapping chunks. What
>is required here is overlapping chunks, each starting one base further
>on than the previous chunk.
>
>The simplest way would be this:
>
>                SymbolList mySeq; // this is your sequence from somewhere else
>                for (int i = 1; i <= mySeq.length() - 2; i++) {
>                        // coords are inclusive, so i to i+2 = 3 bases
>                        SymbolList trimer = mySeq.subSeq(i, i + 2);
>                        // do something with your trimer here
>                }
>
>Note that positions start at 1 and run right up to and including
>length(), as symbols in a SymbolList are 1-indexed, not 0-indexed. The
>loop therefore stops at length()-2, the last position at which a full
>trimer still fits.
> 
>cheers,
>Richard
>
>Richard Holland
>Bioinformatics Specialist
>GIS extension 8199
>
>
> 
>
>>-----Original Message-----
>>From: biojava-l-bounces at portal.open-bio.org 
>>[mailto:biojava-l-bounces at portal.open-bio.org] On Behalf Of David Huen
>>Sent: Friday, December 16, 2005 7:34 AM
>>To: m.fortner at sbcglobal.net
>>Cc: biojava-list
>>Subject: Re: [Biojava-l] Sequence Iteration in BioJava(x)
>>
>>
>>On Dec 15 2005, Mark Fortner wrote:
>>I think what you want is the SymbolListViews.orderNSymbolList method.
>>
>>It will take a SymbolList and turn it into another one, viewed in a
>>compound alphabet of the required order.
>>
>>
>> 
>>
>>>I'm looking for the best way to iterate through all
>>>nmers within a given sequence.  For example, given a
>>>sequence that looks like this:
>>>
>>>ACTGACTGACTG
>>>
>>>If I extract all trimers from this I should get:
>>>
>>>ACT
>>>CTG
>>>TGA
>>>GAC
>>>ACT
>>>CTG
>>>TGA
>>>GAC
>>>ACT
>>>CTG
>>>
>>>Is there an existing class that will allow me to
>>>iterate through a sequence this way, or am I on my
>>>own?
>>>
>>> 
>>>
>>_______________________________________________
>>Biojava-l mailing list  -  Biojava-l at biojava.org
>>http://biojava.org/mailman/listinfo/biojava-l
>>
>> 
>>
>
> 
>





