[Bioperl-l] bp_bulk_load_gff.pl speed

Dustin Cram dustin.cram at gmail.com
Wed Jul 14 18:22:28 EDT 2004


I recently started using Bio::DB::GFF, beginning by using
bp_bulk_load_gff.pl to load a simple but large GFF2 file.  This file
consisted only of transcripts and their subfeatures, so the group
class of all features was "transcript".  The file loaded with no
problem and I was able to write a few successful test scripts.

Now I have added new features (genes) to the GFF file, and I
attempted to load the new file exactly as before with
bp_bulk_load_gff.pl, but now it takes _much_ longer to load, and the
load slows down as more features are added (the first 5K features take
about 30 seconds, the next 5K features take nearly 2 minutes, and so
on).  It took over an hour to load 50K features, at which point I
stopped it.

I've played around with the GFF file a bit and found that anything
that doesn't have a group class of "transcript" has this problem.  For
example, if I 'sed s/transcript/foo/g' the original file it's slow,
and if I 'sed s/gene/transcript/g' the new file it's fast.  I have
manually verified that the MySQL database is empty before each attempt,
and I also wiped the tmp directory each time.
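
A minimal sketch of the substitution test described above, for anyone
who wants to reproduce it.  The two sample lines and the temporary
file names are hypothetical, made up for illustration only; they are
not taken from my real data.  The sketch assumes the group class is
the first word of column 9, as in a GFF2 group field such as
transcript "T1".

```shell
# Build a tiny two-line GFF2 sample (hypothetical data, illustration only).
sample=$(mktemp)
printf 'chr1\tsrc\texon\t100\t200\t.\t+\t.\ttranscript "T1"\n' > "$sample"
printf 'chr1\tsrc\texon\t300\t400\t.\t+\t.\ttranscript "T2"\n' >> "$sample"

# Rename the group class, as in the experiment described in the post.
renamed=$(mktemp)
sed 's/transcript/foo/g' "$sample" > "$renamed"

# Tally the group classes: take the first word of column 9 and count
# how many lines carry each class, printed as class=count.
classes=$(cut -f9 "$renamed" | awk '{print $1}' | sort | uniq -c | awk '{print $2 "=" $1}')
echo "$classes"
```

Note that the unanchored sed substitution rewrites every occurrence of
the word on a line, not only the one in the group column, so a file
whose feature type column also says "transcript" would be changed
there too.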

Any ideas why non-transcript features take so long? 

Thanks,

Dustin Cram

