[Bioperl-l] bp_bulk_load_gff.pl speed

Wed Jul 14 19:51:21 EDT 2004

Heh, I was sure I had to be missing something obvious too, glad to see
someone else has noticed this.

I'll have to wait till I go in to work tomorrow to check exact
versions, but MySQL is 3.23.x, Perl is 5.8.x, and the OS is Red Hat 9.

Dustin Cram

On Wed, 14 Jul 2004 19:10:39 -0400, Aaron J. Mackey
<amackey at pcbi.upenn.edu> wrote:
> 
> Aha, I'm *not* crazy!  I've experienced exactly this same behavior (I
> ended up "solving" it by batch loading in blocks of 500, which worked
> fine until my database got so big that the initial group loading got
> too slow).
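
The batching workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the message does not say whether "blocks of 500" means lines, features, or groups, so the sketch assumes 500 feature lines per block, and the helper name gff_batches is hypothetical. How each block is then handed to the loader is deliberately left out, since the thread does not describe it.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def gff_batches(lines: Iterable[str], batch_size: int = 500) -> Iterator[List[str]]:
    """Yield lists of up to batch_size GFF feature lines.

    Hypothetical helper: blank lines and '#' comment/directive lines are
    skipped, and a block boundary may fall in the middle of a group.
    """
    # Keep only lines that look like features.
    features = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    while True:
        batch = list(islice(features, batch_size))
        if not batch:
            return
        yield batch
```

Because a boundary can split the features of one group across two blocks, a real loader would probably want to cut on group boundaries instead; that refinement is not shown here.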
> 
> What's your MySQL version, Perl version (usemymalloc?), and OS?  I
> think this is a Perl hash/memory issue, but I'd love to solve it now
> that I know it's not just something stupid I'm doing wrong.
> 
> -Aaron
> 
> 
> 
> On Jul 14, 2004, at 6:22 PM, Dustin Cram wrote:
> 
> > I recently started using Bio::DB::GFF, beginning by using
> > bp_bulk_load_gff.pl to load a simple but large gff2 file.  This file
> > consisted only of transcripts and their subfeatures, so the group
> > class of all features was "transcript".  The files loaded with no
> > problem and I was able to write a few successful test scripts.
> >
> > Now I have added new features (genes) to the gff file, and I
> > attempted to load the new file exactly as before with
> > bp_bulk_load_gff.pl, but now it takes _much_ longer to load, and takes
> > more time the more features are added (the first 5K features take
> > about 30 seconds, the next 5K features take nearly 2 minutes, and so
> > on).  It took over an hour to load 50K features, at which point I
> > stopped it.
> >
> > I've played around with the gff file a bit and found that anything
> > that doesn't have a group class of "transcript" has this problem.  For
> > example, if I run 'sed s/transcript/foo/g' on the original file it's
> > slow, and if I run 'sed s/gene/transcript/g' on the new file it's
> > fast.  I have manually verified that the MySQL database is empty
> > before each attempt, and even wiped the tmp directory each time.
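
A global sed substitution like the ones above also rewrites any other column or attribute value that happens to contain the word. For a narrower version of the same experiment, a sketch along these lines would change only the group class. It assumes the GFF2 layout in which the ninth tab-separated column begins with the group class followed by a space and the group name; the function name rename_group_class is made up for illustration.

```python
def rename_group_class(line: str, old: str, new: str) -> str:
    """Return line with the group class in column 9 changed from old to new.

    Assumption: column 9 starts with '<class> <name>'. Comment lines, blank
    lines, and lines with fewer than nine columns are returned unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return line
    # Split off the first space-delimited token of the group column.
    head, sep, rest = cols[8].partition(" ")
    if head == old:
        cols[8] = new + sep + rest
    ending = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + ending
```

Applied line by line to the original file with old="transcript" and new="foo", this leaves the feature type column and any later attribute values alone, which helps confirm that the group class itself is what triggers the slowdown.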
> >
> > Any ideas why non-transcript features take so long?
> >
> > Thanks,
> >
> > Dustin Cram
> --
> Aaron J. Mackey, Ph.D.
> Dept. of Biology, Goddard 212
> University of Pennsylvania       email:  amackey at pcbi.upenn.edu
> 415 S. University Avenue         office: 215-898-1205
> Philadelphia, PA  19104-6017     fax:    215-746-6697
> 
>