[Biojava-l] readGenbank performance

Dean, David P david_p_dean@groton.pfizer.com
Wed, 30 Oct 2002 08:22:35 -0500


Hi,

Thanks for your prompt replies!

I'll try Java 1.4 and see how that compares. I should start looking at it
anyhow.

I like the idea of listening for certain fields. Ideally, it would be great
to tell the Genbank formatter I want fields X, Y and Z and have it process
just those. Not a big deal for small data sets, but if I want to, say,
plow through all Genbank ESTs it should make some difference!
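
To make the idea concrete, here is a minimal sketch of what such a field
filter could look like. It is plain Java and does not use BioJava at all:
the class GenbankFieldScan, the FieldListener interface, the collect()
helper and the sample record are all made up for illustration. It relies
only on the flat-file layout, where top-level keywords start in column 1,
continuation lines are indented, and "//" ends a record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, NOT part of BioJava: reports only the requested
// top-level GenBank keywords and skips every other line without parsing it.
class GenbankFieldScan {

    /** Illustrative callback: one call per wanted field, one per record end. */
    interface FieldListener {
        void field(String keyword, String value);
        void endRecord();
    }

    static void scan(BufferedReader in, Set<String> wanted, FieldListener listener)
            throws IOException {
        String current = null;                 // wanted keyword being collected, or null
        StringBuilder value = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) continue;
            if (Character.isWhitespace(line.charAt(0))) {
                // Indented line: continuation (or sub-keyword) of the current field.
                // Sub-keywords are simply folded into the parent field's value.
                if (current != null) {
                    if (value.length() > 0) value.append(' ');
                    value.append(line.trim());
                }
                continue;
            }
            // A line starting in column 1 closes whatever field was being collected.
            if (current != null) {
                listener.field(current, value.toString());
                current = null;
            }
            if (line.startsWith("//")) {
                listener.endRecord();
                continue;
            }
            int space = line.indexOf(' ');
            String keyword = space < 0 ? line : line.substring(0, space);
            if (wanted.contains(keyword)) {
                current = keyword;
                value.setLength(0);
                if (space >= 0) value.append(line.substring(space).trim());
            }
        }
        if (current != null) listener.field(current, value.toString());
    }

    /** Convenience wrapper for in-memory text; returns "KEYWORD=value" strings. */
    static List<String> collect(String text, String... wanted) {
        final List<String> out = new ArrayList<>();
        try {
            scan(new BufferedReader(new StringReader(text)),
                 new HashSet<>(Arrays.asList(wanted)),
                 new FieldListener() {
                     public void field(String keyword, String value) {
                         out.add(keyword + "=" + value);
                     }
                     public void endRecord() {
                         out.add("//");
                     }
                 });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up record, for illustration only.
        String record =
              "LOCUS       EXAMPLE01  12 bp\n"
            + "DEFINITION  Example EST\n"
            + "            clone.\n"
            + "ACCESSION   EXAMPLE01\n"
            + "ORIGIN\n"
            + "        1 acgtacgtac gt\n"
            + "//\n";
        for (String s : collect(record, "DEFINITION", "ACCESSION")) {
            System.out.println(s);
        }
    }
}
```

Lines belonging to unwanted fields cost one character test each, which is
where the saving over a full parse would come from. Whether BioJava's own
parser can be driven this way is a separate question.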

David
-----Original Message-----
From: Schreiber, Mark [mailto:mark.schreiber@agresearch.co.nz]
Sent: Tuesday, October 29, 2002 4:09 PM
To: David P Dean; biojava-l@biojava.org
Subject: RE: [Biojava-l] readGenbank performance


Hi -

It's often hard to compare a Perl library to BioJava without knowing what
the Perl library does. BioJava does a reasonable amount of checking (that
the symbols used match the alphabet, etc.) and does most of its work on
Symbols as Objects; the Perl library probably does everything as Strings.

You can cut down on overhead if you only want a particular part of the
sequence. Matthew and I were just discussing how a custom listener for a
particular field in a file can perform as fast as grep. If you are only
interested in the sequence information, for example, you could ignore
all the rest, which by default gets processed and stored as annotations
and features of the object.
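
As a rough illustration of the "sequence only" case, the sketch below
skips everything in each record except the ORIGIN block. It is a
stand-alone example, not BioJava code: the class GenbankSequenceOnly and
its method names are invented here, and it returns plain Strings with no
alphabet checking, so it shows the grep-like end of the trade-off rather
than what a BioJava listener would do.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch, NOT part of BioJava.
class GenbankSequenceOnly {

    /** Returns the residues of every record that has an ORIGIN block. */
    static List<String> readSequences(BufferedReader in) throws IOException {
        List<String> sequences = new ArrayList<>();
        StringBuilder seq = null;      // non-null only while inside an ORIGIN block
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("//")) {
                // End of record: keep the sequence if one was seen.
                if (seq != null) sequences.add(seq.toString());
                seq = null;
            } else if (line.startsWith("ORIGIN")) {
                seq = new StringBuilder();
            } else if (seq != null) {
                // Sequence lines mix position numbers, spaces and residues;
                // keep only the letters.
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (Character.isLetter(c)) seq.append(c);
                }
            }
            // Any other line (annotations, features) is ignored untouched.
        }
        return sequences;
    }

    /** Convenience wrapper for in-memory text. */
    static List<String> readSequencesFromString(String text) {
        try {
            return readSequences(new BufferedReader(new StringReader(text)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up record, for illustration only.
        String record =
              "LOCUS       EXAMPLE01  12 bp\n"
            + "DEFINITION  Example EST clone.\n"
            + "ORIGIN\n"
            + "        1 acgtacgtac gt\n"
            + "//\n";
        for (String s : readSequencesFromString(record)) {
            System.out.println(s);
        }
    }
}
```

Records without an ORIGIN block contribute nothing to the result, and no
annotation or feature objects are ever built.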

- Mark


> -----Original Message-----
> From: David P Dean [mailto:deandp@groton.pfizer.com] 
> Sent: Wednesday, 30 October 2002 10:02 a.m.
> To: biojava-l@biojava.org
> Subject: [Biojava-l] readGenbank performance
> 
> 
> Hi,
> I'm new to BioJava and am very keen to learn more about it. 
> I've got a routine to read some Genbank sequences and do 
> stuff and that works fine. But I'm surprised it doesn't run 
> faster. A basic read loop like:
> 
>      sit = SeqIOTools.readGenbank(br);
>      while (sit.hasNext()) {
>         Sequence entry = sit.nextSequence();
>         // ... process entry ...
>      }
> 
> takes about 90 seconds to read 10,000 Genbank EST entries on 
> my Sparc Ultra 10. A comparable perl library I have that 
> iterates over the set and parses all the records takes about 
> half the time. Is this expected, or any suggestions?
> 
> I have downloaded and built biojava-live and am game to tweak 
> things. Is there any kind of profiling tool that would show 
> where the time is going? Also, I am using an older Solaris 
> JVM, 1.3.0. Could this be a factor?
> 
> Thanks!
> David Dean
> ----
> Count your blessing.
> 