[Biojava-l] Genbank feature parsing performance

Mark Fortner phidias51 at gmail.com
Fri Jun 17 14:36:12 UTC 2011


Martin, Khalil

In the code sample you check whether the taxon is in a list. A list lookup
is a linear scan, so I suspect that operation is slower than you intend. You
might try a TreeSet (or a HashSet, if you don't need ordering) and see
whether the lookup performance improves.
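
To illustrate the set-based lookup, here is a minimal sketch. The class name,
the filterByTaxon helper and the taxon names are made up for the example and
are not taken from the original code sample; it only shows a membership test
against a TreeSet in place of a List.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TaxonLookup {
    // Hypothetical helper: keeps only the taxa that appear in the wanted set.
    static List<String> filterByTaxon(List<String> recordTaxa, Set<String> wanted) {
        List<String> kept = new ArrayList<String>();
        for (String taxon : recordTaxa) {
            // contains() is O(log n) on a TreeSet and O(1) on average on a
            // HashSet, versus a linear scan on a List.
            if (wanted.contains(taxon)) {
                kept.add(taxon);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed example data, not from the thread.
        Set<String> wanted = new TreeSet<String>(
                Arrays.asList("Homo sapiens", "Mus musculus"));
        List<String> taxa = Arrays.asList(
                "Homo sapiens", "Danio rerio", "Mus musculus");
        System.out.println(filterByTaxon(taxa, wanted));
    }
}
```

Whether this matters in practice depends on how many taxa are in the list;
for a handful of entries the difference is probably negligible.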

As for genbank parsing performance itself, I'm curious if you've tried
parsing the genbank XML files and noticed any performance difference?

If you're looking for something similar to GPars in Java, you might try
ThreadPoolExecutor<http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.html>,
which manages a thread pool and queues Runnable tasks for it.
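
A rough sketch of what that could look like, under some assumptions:
countRecords is only a stand-in for the real per-file parsing (it counts
lines starting with "LOCUS"), the inputs are plain strings rather than files,
and the class and method names are invented for the example. It uses
Executors.newFixedThreadPool, which returns a ThreadPoolExecutor, and Callable
rather than Runnable so that each task can return a result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelParse {
    // Placeholder for the real parsing work (hypothetical):
    // counts the lines that start with "LOCUS".
    static int countRecords(String text) {
        int n = 0;
        for (String line : text.split("\n")) {
            if (line.startsWith("LOCUS")) {
                n++;
            }
        }
        return n;
    }

    // Submits one task per input and collects the results in input order.
    static List<Integer> parseAll(List<String> inputs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (final String input : inputs) {
                futures.add(pool.submit(new Callable<Integer>() {
                    public Integer call() {
                        return countRecords(input);
                    }
                }));
            }
            List<Integer> results = new ArrayList<Integer>();
            for (Future<Integer> f : futures) {
                results.add(f.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown(); // stop accepting tasks and let workers exit
        }
    }
}
```

A pool size close to the number of available cores is a reasonable starting
point, but the best value depends on how much of the time goes to disk I/O.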

Hope this helps,

Mark

PS: If you have Groovy code that you'd like to share, feel free to add
examples to the BioGroovy wiki<http://biogroovy.open-bio.org/wiki/Main_Page>.



On Jun 17, 2011 4:16 AM, "Khalil El Mazouari" <khalil.elmazouari at gmail.com> wrote:

Good suggestion ;)
However, I am not familiar with Groovy. I'll look for something similar in
Java.

Regards,

khalil

On 17 Jun 2011, at 12:36, Martin Jones wrote:

> Yes, this approach won't be much use if you are interested in the
> contents of every genbank record.
>
> Have you thought about parsing the gb files in parallel? In my
> experience, parsing genbank files scales quite nicely when done in
> multiple threads. I have used the GPars library for this type of job
> and it is very nice to use:
>
> http://gpars.codehaus.org/Parallelizer
>
>
> M
>
> On 17 June 2011 11:33, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
>> Thanks ...


