[Bioperl-l] bp_bulk_load_gff.pl speed

Scott Cain cain at cshl.edu
Thu Jul 15 16:36:36 EDT 2004


Dustin,

Besides Aaron, a few other people have complained about this, and yes, I
had written them off as crazy :-)

Since I can't reproduce this problem, I'll have to ask you: is the
problem that the files are not being written to /usr/tmp (or wherever)
as quickly as before, or is it that, after the files are done being
written, they aren't loaded into MySQL as quickly?  Not that I have a
solution to either problem, but the first is presumably a Perl problem
and the second a MySQL problem.  If it were the latter (which I kind of
doubt), you could get around it by using a real database, like
PostgreSQL.
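
One rough way to tell the two cases apart is to watch the temporary
directory while the loader runs: as long as its total size keeps growing,
the files are still being written, and once it stops growing, any
remaining time is being spent in the database load.  The Python sketch
below does this.  It is not part of BioPerl; the function names
(dir_size, last_growth, watch) and the polling interval are made up for
illustration, and it assumes the loader finishes writing all of its
temporary files before the load into the database starts.

```python
import os
import time


def dir_size(path):
    """Total size in bytes of the regular files directly inside path."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            try:
                if entry.is_file(follow_symlinks=False):
                    total += entry.stat(follow_symlinks=False).st_size
            except FileNotFoundError:
                # The file was removed between listing and stat; skip it.
                continue
    return total


def last_growth(samples):
    """Return the time of the last size increase in (seconds, bytes) samples.

    Returns None if the size never grew between consecutive samples.
    """
    last = None
    for (_, prev_size), (t, size) in zip(samples, samples[1:]):
        if size > prev_size:
            last = t
    return last


def watch(path, interval=5.0, rounds=12):
    """Poll dir_size(path) every `interval` seconds, `rounds` times.

    Returns a list of (seconds_since_start, bytes) samples.
    """
    start = time.time()
    samples = []
    for _ in range(rounds):
        samples.append((time.time() - start, dir_size(path)))
        time.sleep(interval)
    return samples
```

Running watch("/usr/tmp") in a second terminal while the load is in
progress and passing the result to last_growth() gives the approximate
moment the writing stopped.  If the directory is still growing slowly at
the end of the run, the time is going into the first phase; if it went
flat early, it is going into the second.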

Scott


On Thu, 2004-07-15 at 13:45, bioperl-l-request at portal.open-bio.org
wrote:
> 
> I recently started using Bio::DB::GFF, beginning by using
> bp_bulk_load_gff.pl to load a simple but large gff2 file.  This file
> consisted only of transcripts and their subfeatures, so the group
> class of all features was "transcript".  The files loaded with no
> problem and I was able to write a few successful test scripts.
> 
> Now I have added new features (genes) to the gff file, and I
> attempted to load the new file exactly as before with
> bp_bulk_load_gff.pl, but now it takes _much_ longer to load, and takes
> more time the more features are added (the first 5K features take
> about 30 seconds, the next 5K features take nearly 2 minutes, and so
> on).  It took over an hour to load 50K features, at which point I stopped
> it.
> 
> I've played around with the gff file a bit and found that anything
> that doesn't have a group class of "transcript" has this problem, for
> example if I 'sed s/transcript/foo/g'  the original file it's slow,
> and if I 'sed s/gene/transcript/g' the new file it's fast.  I have
> manually verified that the MySQL database is empty before each attempt
> and even wiped the tmp directory before each attempt.
> 
> Any ideas why non-transcript features take so long? 
> 
> Thanks,
> 
> Dustin Cram

-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain at cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory
