[Bioperl-l] bp_bulk_load_gff.pl speed

Sat Jul 17 15:37:03 EDT 2004

My apologies for that old bug.

Lincoln

On Thursday 15 July 2004 05:30 pm, Dustin Cram wrote:
> Well, I think I've traced my problem to a bug in
> Bio::DB::GFF->_split_gff2_group that only existed for a while in CVS.
> I had assumed that release 1.4 was installed at our site, but it turns
> out that it was a CVS checkout from shortly after the 1.4 release.  The
> revision of Bio::DB::GFF.pm with the problem is 1.105 (maybe others too).
>
> It looks to me like $self->preferred_groups is being appended to with
> ("Sequence","Transcript") on every call of the method, so as time goes
> by the array gets huge, with just those elements repeated over and
> over.  That is why only my non-transcript features had problems - the
> entire array was searched unsuccessfully for each feature.
>
> I've grabbed the latest CVS and it seems to work fine.  Although I
> haven't tried the 1.4 release, I think it should work too.  If this
> isn't the problem for other folks, then I guess they're still just
> crazy :).
>
> Thanks,
>
> Dustin
>
> On Thu, 15 Jul 2004 16:36:36 -0400, Scott Cain <cain at cshl.edu> wrote:
> > Dustin,
> >
> > Besides Aaron, a few other people have complained about this, and yes, I
> > had written them off as crazy :-)
> >
> > Since I can't reproduce this problem, I'll have to ask you: is the
> > problem that the files are not being written to /usr/tmp (or wherever)
> > as quickly as before, or is it that, after the files are done being
> > written, they aren't loaded into MySQL as quickly?  Not that I have a
> > solution to either problem, but the first is presumably a Perl problem
> > and the second a MySQL problem.  If it were the latter (which I kind of
> > doubt), you could get around it by using a real database, like
> > PostgreSQL.
> >
> > Scott
> >
> > On Thu, 2004-07-15 at 13:45, bioperl-l-request at portal.open-bio.org wrote:
> > > I recently started using Bio::DB::GFF, beginning by using
> > > bp_bulk_load_gff.pl to load a simple but large gff2 file.  This file
> > > consisted only of transcripts and their subfeatures, so the group
> > > class of all features was "transcript".  The files loaded with no
> > > problem and I was able to write a few successful test scripts.
> > >
> > > Now I have added new features (genes) to the gff file, and I
> > > attempted to load the new file exactly as before with
> > > bp_bulk_load_gff.pl, but now it takes _much_ longer to load, and takes
> > > more time the more features are added (the first 5K features take
> > > about 30 seconds, the next 5K features take nearly 2 minutes, and so
> > > on).  It took over an hour to load 50K features, at which point I
> > > stopped it.
> > >
> > > I've played around with the gff file a bit and found that anything
> > > that doesn't have a group class of "transcript" has this problem: for
> > > example, if I 'sed s/transcript/foo/g' the original file it's slow,
> > > and if I 'sed s/gene/transcript/g' the new file it's fast.  I have
> > > manually verified that the MySQL database is empty before each attempt
> > > and even wiped the tmp directory before each attempt.
> > >
> > > Any ideas why non-transcript features take so long?
> > >
> > > Thanks,
> > >
> > > Dustin Cram
> >
> > --
> > ------------------------------------------------------------------------
> > Scott Cain, Ph. D.                                         cain at cshl.org
> > GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
> > Cold Spring Harbor Laboratory
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l

-- 
Lincoln Stein
lstein at cshl.edu
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
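
The failure mode Dustin describes in the quoted message (a default list
re-appended on every call, then searched in full for every feature whose
group class is not in it) can be illustrated with a minimal sketch.  This is
Python, not the Bio::DB::GFF source: the Loader class, its method names, and
the comparison counter are hypothetical names invented for illustration, and
the exact, case-sensitive match is an assumption.

```python
class Loader:
    """Toy stand-in for a GFF loader; hypothetical, not the Bio::DB::GFF code."""

    DEFAULT_GROUPS = ("Sequence", "Transcript")

    def __init__(self):
        self.preferred_groups = []
        self.comparisons = 0  # counts list entries examined, as a cost proxy

    def _is_preferred(self, group_class):
        # Linear scan: stops early on a match, walks the whole list on a miss.
        for name in self.preferred_groups:
            self.comparisons += 1
            if name == group_class:
                return True
        return False

    def split_group_buggy(self, group_class):
        # Bug pattern: the defaults are appended on *every* call,
        # so the list grows by two entries per feature processed.
        self.preferred_groups.extend(self.DEFAULT_GROUPS)
        return self._is_preferred(group_class)

    def split_group_fixed(self, group_class):
        # Fix pattern: install the defaults only once.
        if not self.preferred_groups:
            self.preferred_groups.extend(self.DEFAULT_GROUPS)
        return self._is_preferred(group_class)


def count_comparisons(method_name, group_class, calls):
    """Run one method `calls` times on a fresh Loader; return entries examined."""
    loader = Loader()
    method = getattr(loader, method_name)
    for _ in range(calls):
        method(group_class)
    return loader.comparisons


if __name__ == "__main__":
    for method_name in ("split_group_buggy", "split_group_fixed"):
        for group_class in ("Transcript", "Gene"):
            total = count_comparisons(method_name, group_class, 1000)
            print(method_name, group_class, total)
```

In this sketch a group class that matches near the front of the list stays
cheap even with the bug, because the scan stops early.  A group class that
never matches forces a scan of a list that has grown by two entries per
call, so the total work grows quadratically with the number of features,
which is consistent with the slowdown reported earlier in the thread.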