[Bioperl-l] New to Bioperl

Sun Jun 15 22:54:47 EDT 2003

On Sunday, June 15, 2003, at 06:48  AM, Niels Larsen wrote:

> Greetings,
>
> I am exploring bioperl, to get an idea of its advantages/disadvantages
> .. I hope to use it and contribute to it. So first I try to load the
> latest full EMBL release into MySQL 4.0.12 using Bio::SeqIO. Parsing of
> a typical .dat entry file (with ~100,000 entries) takes a full 5-6
> minutes, whereas zcat'ing plus reading each line in perl takes 5-6
> seconds.

You must have a pretty fast machine :-) People have reported 20-30 
minutes before.

Fast parsing of rich sequence databanks is not something bioperl can be
proud of. It's slow. We have optimized it on several occasions; it used
to be 3x slower. There is by now a way to speed up parsing of
genbank-formatted files by an order of magnitude if you want only, say,
the sequence, description, and ID. If you want the full annotation,
then there's not much that you can do.
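
To illustrate why a "sequence, description, and ID only" pass can be so
much cheaper than building the full object model, here is a rough sketch
of a line-oriented reader for EMBL-style flat files. It is plain Python,
not bioperl's own mechanism; the function name iter_embl_minimal is made
up for illustration, and it assumes the usual two-letter line codes (ID,
DE, SQ) and the "//" entry terminator.

```python
import io


def iter_embl_minimal(handle):
    """Yield (entry_id, description, sequence) for each EMBL-style entry.

    Hypothetical helper: everything except the ID, DE and sequence lines
    is skipped, so no feature or annotation objects are ever built.
    """
    entry_id, desc_parts, seq_parts, in_seq = None, [], [], False
    for line in handle:
        if line.startswith("//"):  # end-of-entry marker
            if entry_id is not None:
                yield entry_id, " ".join(desc_parts), "".join(seq_parts)
            entry_id, desc_parts, seq_parts, in_seq = None, [], [], False
        elif in_seq:
            # Sequence lines hold blocks of residues plus a running count;
            # keep only the letters.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            # First token after the line code is the entry name/accession.
            entry_id = line[2:].split()[0].rstrip(";")
        elif line.startswith("DE"):
            desc_parts.append(line[2:].strip())
        elif line.startswith("SQ"):
            in_seq = True


# Tiny made-up entry, just to show the call pattern.
sample = io.StringIO(
    "ID   TEST0001   standard; RNA; PLN; 20 BP.\n"
    "DE   Made-up example entry\n"
    "FT   source          1..20\n"
    "SQ   Sequence 20 BP;\n"
    "     aaacaaacca aatatggatt                                        20\n"
    "//\n"
)
for entry_id, description, sequence in iter_embl_minimal(sample):
    print(entry_id, description, len(sequence))
```

For a compressed release file, the same function could be fed a handle
from gzip.open(path, "rt") instead of the in-memory sample.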

Most of the time goes into building up the
Bio::Seq+SeqFeature+Annotation object model and populating it for every
entry. If you don't want the object model to be built, I wouldn't use
bioperl. If you do want it to be built and populated, we'd be grateful
for suggestions on how to build it faster ...

>  Loading each entry at
> a time (using bioperl-db/scripts/biosql/load_seqdatabase.pl) however
> takes 1-2 hours (didn't time exactly)

Not sure what you mean here by "each entry at a time". If you mean one
genbank entry (sequence) at a time, this certainly shouldn't take 1-2
hours, nor minutes, nor seconds. I used to get on the order of 3-10
entries per second for a database served by Oracle on a not-so-shiny
Linux box. MySQL supposedly is faster ...

At this rate you can load DBs like swissprot (120k entries) overnight.
If by EMBL/Genbank you mean the entire Genbank including ESTs, you'd be
talking about ~15 million entries at least (I believe it's actually a
lot more currently). You will need to run several loaders in parallel
on different chunks in order to load this in a sensible time.
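
The arithmetic behind these numbers can be sketched as below. The
function names are made up for illustration, and the loader count
assumes throughput scales linearly with the number of parallel loaders,
which a real database server may well not deliver.

```python
import math


def load_hours(n_entries, entries_per_second):
    """Hours needed by a single loader at a steady rate."""
    return n_entries / entries_per_second / 3600.0


def loaders_needed(n_entries, entries_per_second, target_hours):
    """Parallel loaders needed to finish within target_hours.

    Assumption: perfect linear scaling, i.e. no contention in the database.
    """
    return math.ceil(load_hours(n_entries, entries_per_second) / target_hours)


def chunk_ranges(n_entries, n_chunks):
    """Split entries 0..n_entries-1 into contiguous (start, stop) chunks."""
    base, extra = divmod(n_entries, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# swissprot-sized load (120k entries) at the quoted 3-10 entries per second
print(round(load_hours(120_000, 3), 1), round(load_hours(120_000, 10), 1))

# ~15 million entries: single-loader hours, and loaders for a 24-hour target
print(round(load_hours(15_000_000, 10)), loaders_needed(15_000_000, 10, 24))
```

At 3-10 entries per second, 120k entries come to roughly 3-11 hours,
while 15 million entries come to roughly 420-1390 hours for a single
loader.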

> , which means 300-600 hours for the release. The parsing time I could
> live with. Is there a supported way of loading faster? I could write
> something that creates loading-ready tables for each .dat file and then
> each would take 1-2 minutes I think.

You can do this only if you load into a flat target table; otherwise
you'd have to generate a primary key sequence yourself in order to
establish the foreign key relationships.

Both ways are doable, but unsupported ...
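
As a rough sketch of the second route, the keys can be assigned while
the load files are written, so that child rows already carry their
parent's key. The two-table layout below (entry, feature) and all names
are hypothetical and are not the BioSQL schema; the starting key values
would have to be chosen so they don't collide with rows already in the
database or with other loaders.

```python
import csv
import io
import itertools


def build_load_tables(entries, first_entry_id=1, first_feature_id=1):
    """Pre-assign primary keys and return (entry_rows, feature_rows).

    entries: iterable of (accession, [feature_type, ...]) pairs.
    Hypothetical layout: entry(entry_id, accession) and
    feature(feature_id, entry_id, feature_type).
    """
    entry_ids = itertools.count(first_entry_id)
    feature_ids = itertools.count(first_feature_id)
    entry_rows, feature_rows = [], []
    for accession, features in entries:
        entry_id = next(entry_ids)
        entry_rows.append((entry_id, accession))
        for feature_type in features:
            # The foreign key is known here because we generated it ourselves.
            feature_rows.append((next(feature_ids), entry_id, feature_type))
    return entry_rows, feature_rows


def to_tsv(rows):
    """Render rows as tab-delimited text suitable for a bulk loader."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


entry_rows, feature_rows = build_load_tables(
    [("ACC0001", ["source", "CDS"]), ("ACC0002", ["source"])]
)
print(to_tsv(entry_rows), end="")
print(to_tsv(feature_rows), end="")
```

Files written this way could then be handed to the database's bulk
loader (in MySQL, LOAD DATA INFILE) one table at a time, parents first.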

>  But finding which accessors to use in order to do that is a hard read
> for me. Do you have advice about 1) how to avoid loading
> release-entries one at a time

Let me know if what I answered here and in a separate email doesn't 
answer this.

> and 2) how to get a quick overview over which methods apply to any
> given object?

Hmm - not sure what your goal would be and what you mean. The adaptors 
apply to bioperl interfaces (or generally to bioperl objects). E.g., 
SeqAdaptor is for Bio::SeqI objects.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------