[Bioperl-l] bp_bulk_load_gff.pl speed
Lincoln Stein
lstein at cshl.edu
Sat Jul 17 15:37:03 EDT 2004
My apologies for that old bug.
Lincoln
On Thursday 15 July 2004 05:30 pm, Dustin Cram wrote:
> Well, I think I've traced my problem to a bug in
> Bio::DB::GFF->_split_gff2_group that only existed for a while in CVS.
> I had assumed that release 1.4 was installed at our site, but it turns
> out that it was a CVS checkout from shortly after the 1.4 release. The
> revision of Bio::DB::GFF.pm with the problem is 1.105 (maybe others too).
>
> It looks to me like $self->preferred_groups is being appended to with
> ("Sequence","Transcript") for every call of the method, so as time goes
> by the array gets huge, with just those elements repeated over and
> over. That is why only my non-transcript features had problems - the
> entire array was searched unsuccessfully for each feature.
>
> I've grabbed the latest CVS and it seems to work fine. Although I
> haven't tried 1.4 release, I think it should work too. If this isn't
> the problem for other folk, then I guess they're still just crazy :).
>
> Thanks,
>
> Dustin
>
> On Thu, 15 Jul 2004 16:36:36 -0400, Scott Cain <cain at cshl.edu> wrote:
> > Dustin,
> >
> > Besides Aaron, a few other people have complained about this, and yes, I
> > had written them off as crazy :-)
> >
> > Since I can't reproduce this problem, I'll have to ask you: is the
> > problem that the files are not being written to /usr/tmp (or wherever)
> > as quickly as before, or is it that, after the files are done being
> > written, they aren't loaded into mysql as quickly? Not that I have a
> > solution to either problem, but the first is presumably a perl problem
> > and the second a mysql problem. If it were the latter (which I kind of
> > doubt), you could get around it by using a real database, like
> > PostgreSQL.
> >
> > Scott
> >
> > On Thu, 2004-07-15 at 13:45, bioperl-l-request at portal.open-bio.org wrote:
> > > I recently started using Bio::DB::GFF, beginning by using
> > > bp_bulk_load_gff.pl to load a simple but large gff2 file. This file
> > > consisted only of transcripts and their subfeatures, so the group
> > > class of all features was "transcript". The files loaded with no
> > > problem and I was able to write a few successful test scripts.
> > >
> > > Now I have added new features (genes) to the gff file, and I
> > > attempted to load the new file exactly as before with
> > > bp_bulk_load_gff.pl, but now it takes _much_ longer to load, and takes
> > > more time the more features are added (the first 5K features take
> > > about 30 seconds, the next 5K features take nearly 2 minutes, and so
> > > on). It took over an hour to load 50K features, at which point I stopped
> > > it.
> > >
> > > I've played around with the gff file a bit and found that anything
> > > that doesn't have a group class of "transcript" has this problem, for
> > > example if I 'sed s/transcript/foo/g' the original file it's slow,
> > > and if I 'sed s/gene/transcript/g' the new file it's fast. I have
> > > manually verified that the MySQL database is empty before each attempt
> > > and even wiped the tmp directory before each attempt.
> > >
> > > Any ideas why non-transcript features take so long?
> > >
> > > Thanks,
> > >
> > > Dustin Cram
> >
> > --
> > ------------------------------------------------------------------------
> > Scott Cain, Ph. D. cain at cshl.org
> > GMOD Coordinator (http://www.gmod.org/) 216-392-3087
> > Cold Spring Harbor Laboratory
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
--
Lincoln Stein
lstein at cshl.edu
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
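
The slowdown Dustin describes in the quoted message can be illustrated with a small sketch. This is Python, not the actual Perl in Bio::DB::GFF, and the class and function names (BuggyParser, FixedParser, is_preferred, list_size_after) are hypothetical stand-ins for _split_gff2_group and preferred_groups. It only assumes what the message states: two defaults are appended on every call, and a group class that is not in the list forces a search of the whole list.

```python
DEFAULT_GROUPS = ("Sequence", "Transcript")


class BuggyParser:
    """Hypothetical model of the bug: defaults are re-appended on every call."""

    def __init__(self):
        self.preferred_groups = []

    def is_preferred(self, group_class):
        # The bug: this runs once per feature, so the list grows by two each time.
        self.preferred_groups.extend(DEFAULT_GROUPS)
        # Linear search: a hit on "Transcript" stops near the front,
        # but a miss has to walk the entire (ever-growing) list.
        return group_class in self.preferred_groups


class FixedParser:
    """Hypothetical fix: defaults are set once, so the list stays at two entries."""

    def __init__(self):
        self.preferred_groups = list(DEFAULT_GROUPS)

    def is_preferred(self, group_class):
        return group_class in self.preferred_groups


def list_size_after(parser, n_features, group_class="Gene"):
    """Feed n_features lookups to the parser and report how long its list became."""
    for _ in range(n_features):
        parser.is_preferred(group_class)
    return len(parser.preferred_groups)


if __name__ == "__main__":
    for n in (5000, 10000):
        print(n, list_size_after(BuggyParser(), n), list_size_after(FixedParser(), n))
```

In this model the buggy list holds 2n entries after n features, so the total work for unmatched group classes grows roughly with n squared. That is consistent with each successive batch of 5K features loading more slowly than the one before it.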