[Biopython-dev] BioSQL : BatchLoader

Peter biopython at maubp.freeserve.co.uk
Wed Apr 23 12:48:25 UTC 2008


> > That's impressive - you seem to have got the database side of things
> > down to about 30 seconds; a fraction of the time to parse the GenBank
> > file!  Although, as you pointed out, there are a lot of provisos here.
>
>  Yep.
>
>  Would it be helpful to do anything further with this code, i.e. put it into
> CVS and document it on the Wiki, perhaps once it's been a bit more tested?

I'm not ready to put this into the main Biopython CVS.  But by all
means, add a new page to the wiki to describe your approach.
Hopefully there are a few others who might be interested, and we'll
see.

> > There are still some slow bits in the current GenBank parser which
> > would be an obvious next target for you in your quest for speed.  I
> > did a little investigation a while ago, and concluded the parsing of
> > the feature locations was the biggest bottleneck.  However, this is a
> rather complicated lump of code, so it's not such an easy task.  I
> > tried out a "hack" which special-cased the most common feature
> > location types, with a fall back on the original parser, which gave
> > much better performance.  I didn't check this in as it made some
> > already complex code WAY more complicated!
>
>  Aha, sounds good. I haven't profiled the Biopython code, but I will check
> this. I'm mainly dealing with bacterial sequences, which mostly have simple
> location identifiers, so there could well be some mileage here.

Yes, I had been experimenting with bacterial sequences too.  Beware
that the location string in general can be extremely complex (and even
reference other files by their identifier).  A complete, backwards-compatible
re-write of the location parsing (into sub-features) looked like a big job.
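
The sort of special-casing I mean is roughly this (an untested sketch, not
the actual hack I tried): handle the plain "start..end" and
"complement(start..end)" strings with a regular expression, and hand
anything more exotic back to the existing parser.

    import re

    # Fast path for the two most common bacterial location strings,
    # e.g. "1234..5678" and "complement(1234..5678)".  Anything else
    # (joins, fuzzy positions, remote accessions) returns None so the
    # caller can fall back on the full location parser.
    _SIMPLE = re.compile(r"^(\d+)\.\.(\d+)$")
    _COMPLEMENT = re.compile(r"^complement\((\d+)\.\.(\d+)\)$")

    def simple_location(text):
        match = _SIMPLE.match(text)
        if match:
            # GenBank counts from one; Python slices count from zero.
            return int(match.group(1)) - 1, int(match.group(2)), +1
        match = _COMPLEMENT.match(text)
        if match:
            return int(match.group(1)) - 1, int(match.group(2)), -1
        return None  # caller falls back on the existing parser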

That said, if you do run some profiling, you may spot some other "low
hanging fruit" which would be easier to tackle.  I haven't done any
optimisation work since my original re-write of the GenBank parser
back in August 2006, when I replaced the older, slower Martel parser,
which didn't scale well with large input files.
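
If you do want to profile it, something as simple as this is enough to get
started (the filename here is just an example; any large GenBank file will do):

    import cProfile, pstats
    from Bio import SeqIO

    def parse_all(filename):
        # Force the parser to walk the whole file, features and all.
        return list(SeqIO.parse(open(filename), "genbank"))

    cProfile.run('parse_all("NC_000913.gbk")', "parser.prof")
    pstats.Stats("parser.prof").sort_stats("cumulative").print_stats(20)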

>  I mean process one GenBank file per core.
>
>  Locally, that would mean that on a 4-core machine you could have 3 parser
> threads working concurrently, each passing the generated Seq object to the
> Loader once it has been read.

I see - that means there is only one thread/job writing to the
database, which keeps that side of things thread-safe.  To be honest,
unless you are trying to import several hundred bacterial genomes into
BioSQL, I don't think this level of complexity is a worthwhile pay-off.
Right now, I would target the GenBank parsing itself (which
would be useful outside the task of loading sequences into BioSQL).
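
Just so we're picturing the same thing, a minimal sketch of that arrangement
might look like the following (untested; the connection details, sub-database
name and filenames are all placeholders):

    import threading, Queue
    from Bio import SeqIO
    from BioSQL import BioSeqDatabase

    filenames = ["genome1.gbk", "genome2.gbk", "genome3.gbk"]  # placeholders
    records = Queue.Queue(maxsize=100)

    def parse_worker(filename):
        # One parser thread per GenBank file; each finished record goes
        # onto the shared queue for the single loader thread.
        for record in SeqIO.parse(open(filename), "genbank"):
            records.put(record)

    def load_worker(db):
        # The only thread touching the database, so no locking needed there.
        while True:
            record = records.get()
            if record is None:
                break  # sentinel from the main thread
            db.load([record])

    server = BioSeqDatabase.open_database(driver="MySQLdb", user="me",
                                          passwd="secret", host="localhost",
                                          db="bioseqdb")
    db = server["test"]  # assumes this sub-database already exists

    loader = threading.Thread(target=load_worker, args=(db,))
    loader.start()
    parsers = [threading.Thread(target=parse_worker, args=(name,))
               for name in filenames]
    for thread in parsers:
        thread.start()
    for thread in parsers:
        thread.join()
    records.put(None)  # tell the loader we are finished
    loader.join()
    server.adaptor.commit()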

Something else you may want to consider is timing the BioPerl scripts
for importing a GenBank file into BioSQL.  There will probably be some
minor differences in their interpretation of the data and exactly how they
store it, but it would be a useful benchmark.
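
On the Biopython side a crude timing is only a few lines (again with
placeholder connection details and filename), which you could set against a
run of the equivalent BioPerl loading script under the Unix time command:

    import time
    from Bio import SeqIO
    from BioSQL import BioSeqDatabase

    server = BioSeqDatabase.open_database(driver="MySQLdb", user="me",
                                          passwd="secret", host="localhost",
                                          db="bioseqdb")
    db = server["test"]  # assumes this sub-database already exists

    start = time.time()
    records = list(SeqIO.parse(open("NC_000913.gbk"), "genbank"))
    parsed = time.time()
    db.load(records)
    server.adaptor.commit()
    done = time.time()
    print "Parsed in %0.1fs, loaded in %0.1fs" % (parsed - start, done - parsed)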

Peter


