[Biojava-l] RichSequence.IOTools performance

Khalil El Mazouari khalil.elmazouari at gmail.com
Tue Mar 29 14:41:13 UTC 2011


Hi,

Using NIO improved the app's performance noticeably. The app was tested with 6599 annotated GenBank sequences.

1. RichSequence.IOTools.writeGenbank(myFileOutputStream, mySeq, null): 57% of app exec time.
2. writing mySeq -> byteArrayOutputStream -> byteBuffer -> fileChannel (code below): 31% of exec time.

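         // Serialise the sequence into memory first, then hand the bytes to the channel.
         ByteArrayOutputStream baos = new ByteArrayOutputStream();
         RichSequence.IOTools.writeGenbank(baos, mySeq, null);
         ByteBuffer buf = ByteBuffer.wrap(baos.toByteArray());
         while (buf.hasRemaining()) {
             fileChannel.write(buf);   // write() is not guaranteed to drain the buffer in one call
         }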
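
For a whole batch, the two approaches discussed in this thread (a BufferedOutputStream around the file stream, and an in-memory buffer handed to a FileChannel) could be compared with something like the sketch below. It is only an illustration under assumptions: RecordWriter and BatchWriteSketch are hypothetical names, and RecordWriter stands in for a call such as RichSequence.IOTools.writeGenbank(out, seq, null) so that the example runs without BioJava. The 64 KB buffer size is an arbitrary starting point, not a measured optimum.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a serialiser such as RichSequence.IOTools.writeGenbank. */
interface RecordWriter<T> {
    void write(OutputStream out, T record) throws IOException;
}

class BatchWriteSketch {

    /** Option A: a single BufferedOutputStream shared by every record in the batch. */
    static <T> void writeBuffered(Path file, List<T> records, RecordWriter<T> writer)
            throws IOException {
        try (OutputStream out =
                 new BufferedOutputStream(Files.newOutputStream(file), 1 << 16)) {
            for (T record : records) {
                writer.write(out, record);
            }
        } // closing the stream flushes whatever is still buffered
    }

    /** Option B: serialise each record into a reused in-memory buffer, then write it via NIO. */
    static <T> void writeViaChannel(Path file, List<T> records, RecordWriter<T> writer)
            throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(1 << 16);
        try (FileChannel channel = FileChannel.open(file,
                 StandardOpenOption.CREATE,
                 StandardOpenOption.WRITE,
                 StandardOpenOption.TRUNCATE_EXISTING)) {
            for (T record : records) {
                baos.reset();                       // reuse the backing array between records
                writer.write(baos, record);
                ByteBuffer buf = ByteBuffer.wrap(baos.toByteArray());
                while (buf.hasRemaining()) {        // write() may not drain the buffer at once
                    channel.write(buf);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Toy serialiser: one line per record followed by a "//" terminator line.
        RecordWriter<String> toy =
            (out, s) -> out.write((s + "\n//\n").getBytes(StandardCharsets.US_ASCII));
        List<String> records = Arrays.asList("LOCUS A", "LOCUS B");

        Path buffered = Files.createTempFile("buffered", ".gb");
        Path channel = Files.createTempFile("channel", ".gb");
        writeBuffered(buffered, records, toy);
        writeViaChannel(channel, records, toy);

        // Both strategies should produce byte-identical files.
        System.out.println(Arrays.equals(
            Files.readAllBytes(buffered), Files.readAllBytes(channel)));
    }
}
```

Timing both methods on the same batch in a profiler would show whether the gain comes from NIO itself or simply from buffering the many small writes.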

Any suggestions on how to improve the performance (further ;-)) are welcome.

Regards,

khalil

On 28 Mar 2011, at 23:39, Andy Yates wrote:

> Dang Rich :). 
> 
> At the moment we've not done anything WRT Genbank outputting but would accept anything to help us out with this. 
> 
> As for the performance difference between BJ3 & BJ what happens if you use the writer objects directly with a BufferedOutputStream writer? Have you got any profiling results? It would be very interesting to see where we've lost the performance ...
> 
> Andy
> 
> On 28 Mar 2011, at 18:23, Richard Holland wrote:
> 
>> In which case you've got little option but to rewrite the GenbankFormat module to use NIO or other alternative methods for writing files. However before you do that I suggest you investigate the recent BioJava3 developments to see if they've already done anything in this area - Andy Yates is your man there.
>> 
>> On 28 Mar 2011, at 18:11, Khalil El Mazouari wrote:
>> 
>>> The sequence objects are all in memory.
>>> I agree, 10000 seq in ± 20 sec is not bad. However, scientists will process 100,000 seqs in each run, and IO is a real bottleneck. So I am trying, as far as I can, to fine-tune the app.
>>> 
>>> Regards,
>>> 
>>> khalil
>>> 
>>> On 28 Mar 2011, at 18:15, Richard Holland wrote:
>>> 
>>>> I would have thought 10,000 seqs written out in full Genbank format in 20 seconds was pretty good! However, the key to speeding it up would be to modify the OutputStream interactions to use faster things such as NIO. Also it would depend on the source of your sequence objects - if they are all in-memory then this isn't an issue, but if they are being read from a database using lazy or dynamic loading then that could be a bottleneck too.
>>>> 
>>>> 
>>>> On 28 Mar 2011, at 17:07, Khalil El Mazouari wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am developing a sequence annotation app. It should handle ± 100.000 sequences per run.
>>>>> 
>>>>> When profiling the app (with 10.000 seq), the total execution time was ± 20 seconds, of which 57% was spent in RichSequence.IOTools.writeGenbank!
>>>>> 
>>>>> How could one improve the RichSequence.IOTools performance?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> khalil
>>>> 
>>>> --
>>>> Richard Holland, BSc MBCS
>>>> Operations and Delivery Director, Eagle Genomics Ltd
>>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>>> http://www.eaglegenomics.com/
>>>> 
>>> 
>> 
>> 
> 
> -- 
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 