[Bioperl-l] New to Bioperl
Hilmar Lapp
hlapp at gnf.org
Sun Jun 15 22:54:47 EDT 2003
On Sunday, June 15, 2003, at 06:48 AM, Niels Larsen wrote:
> Greetings,
>
> I am exploring bioperl, to get an idea of its advantages/disadvantages
> ... I hope to use it and contribute to it. So first I try to load the
> latest full EMBL release into MySQL 4.0.12 using Bio::SeqIO. Parsing of
> a typical .dat entry file (with ~100,000 entries) takes a full 5-6
> minutes, whereas zcat'ing plus reading each line in perl takes 5-6
> seconds.
You must have a pretty fast machine :-) People have reported 20-30
minutes before.
Parsing rich sequence databanks fast is not something bioperl can be
proud of. It's slow. We have optimized it on several occasions; it used
to be 3x slower. There is now a way to speed up parsing of
genbank-formatted files by an order of magnitude if you want only, say,
the sequence, description, and ID. If you want the full annotation,
then there's not much that you can do.
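For illustration, the trimmed-down parsing would look roughly like this
(an untested sketch using the sequence_builder()/Bio::Seq::SeqBuilder
hooks; the file name and the slot selection are just examples):

  use Bio::SeqIO;

  my $seqio = Bio::SeqIO->new(-file   => 'chunk.gb',   # made-up file name
                              -format => 'genbank');

  # Tell the object builder to skip everything except the slots we need;
  # features and annotation are then never turned into objects.
  my $builder = $seqio->sequence_builder();
  $builder->want_none();
  $builder->add_wanted_slot('display_id', 'accession_number', 'desc', 'seq');

  while (my $seq = $seqio->next_seq()) {
      printf "%s\t%s\t%d bp\n",
          $seq->accession_number, $seq->desc, $seq->length;
  }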
What costs the time is mostly building up the
Bio::Seq+SeqFeature+Annotation object model and populating it for every
entry. If you don't want the object model to be built, I wouldn't use
bioperl. If you do want it to be built and populated, we'd be grateful
for suggestions on how to build it faster ...
> Loading each entry at a time (using
> bioperl-db/scripts/biosql/load_seqdatabase.pl) however takes 1-2 hours
> (didn't time it exactly)
Not sure what you mean here by each entry at a time. If you mean one
genbank entry (sequence) at a time, this certainly shouldn't take 1-2
hours, nor minutes, nor seconds. I used to get on the order of 3-10
entries per second for a database served by Oracle on a not-so-shiny
linux box. MySQL supposedly is faster ...
At this rate you can load DBs like swissprot (120k entries) overnight.
If by EMBL/Genbank you mean the entire Genbank including ESTs, that is
at least ~15 million entries (I believe it's actually a lot more
currently). You will need to run several loaders in parallel on
different chunks to load this in a sensible time.
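Roughly along these lines (only a sketch; the load_seqdatabase.pl option
names are from memory, so check the script's POD before running
anything):

  use strict;
  use warnings;

  my @chunks = glob('embl_chunks/*.dat');   # made-up directory of split files
  my $max_parallel = 4;
  my @running;

  for my $chunk (@chunks) {
      if (@running >= $max_parallel) {
          my $done = wait();
          @running = grep { $_ != $done } @running;
      }
      my $pid = fork();
      die "fork failed: $!" unless defined $pid;
      if ($pid == 0) {
          # child: run one loader over one chunk
          exec('load_seqdatabase.pl',
               '--dbname', 'biosql', '--dbuser', 'root',
               '--driver', 'mysql',  '--format', 'embl',
               $chunk) or die "exec failed: $!";
      }
      push @running, $pid;
  }
  1 while wait() != -1;   # reap the remaining children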
> , which means 300-600 hours for the release. The parsing time I could
> live with. Is there a supported way of loading faster? I could write
> something that creates loading-ready tables for each .dat file and then
> each would take 1-2 minutes I think.
You can do this only if you load into a flat target table; otherwise
you'd have to generate a primary key sequence yourself in order to
establish the foreign key relationships.
Both ways are doable, but unsupported ...
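If you go the unsupported route, the idea would look something like this
(a rough sketch; the table and column layout is illustrative, not the
actual biosql schema):

  use strict;
  use warnings;
  use Bio::SeqIO;

  # The point is pre-assigning the primary key yourself so that the
  # child rows can carry the foreign key.
  my $in = Bio::SeqIO->new(-file => 'chunk.dat', -format => 'embl');
  open my $bioentry_fh, '>', 'bioentry.tsv' or die "open: $!";
  open my $bioseq_fh,   '>', 'bioseq.tsv'   or die "open: $!";

  my $pk = 1;   # in a real run, start after max(bioentry_id)
  while (my $seq = $in->next_seq()) {
      print {$bioentry_fh} join("\t", $pk, $seq->accession_number,
                                $seq->desc), "\n";
      print {$bioseq_fh} join("\t", $pk, $seq->seq), "\n";  # FK = same $pk
      $pk++;
  }
  # afterwards: LOAD DATA INFILE 'bioentry.tsv' INTO TABLE bioentry; etc.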
> But finding which accessors to use in order to do that is a hard read
> for me. Do you have advice about 1) how to avoid loading release
> entries one at a time
Let me know if what I answered here and in a separate email doesn't
answer this.
> and 2) how to get a quick overview of which methods apply to any
> given object?
Hmm - not sure what your goal would be and what you mean. The adaptors
apply to bioperl interfaces (or generally to bioperl objects). E.g.,
SeqAdaptor is for Bio::SeqI objects.
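To make that concrete, getting hold of an adaptor and storing an object
would look roughly like this (method names and constructor arguments
from memory, so treat them as approximate and check the bioperl-db
docs):

  use strict;
  use warnings;
  use Bio::DB::BioDB;
  use Bio::Seq;

  my $db = Bio::DB::BioDB->new(-database => 'biosql',
                               -dbname   => 'biosql',
                               -driver   => 'mysql',
                               -user     => 'root');

  my $seq = Bio::Seq->new(-display_id => 'test', -seq => 'atgc');
  $seq->namespace('bioperl');   # entries need a namespace (biodatabase)

  # The db adaptor hands back the right object adaptor for an interface;
  # a Bio::SeqI object maps to the SeqAdaptor.
  my $adaptor = $db->get_object_adaptor($seq);

  # Or wrap the object as a persistent one and store it:
  my $pseq = $db->create_persistent($seq);
  $pseq->create();   # inserts the entry into the database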
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------