[Bioperl-l] bp_bulk_load_gff.pl speed
Dustin Cram
dustin.cram at gmail.com
Wed Jul 14 18:22:28 EDT 2004
I recently started using Bio::DB::GFF, beginning by using
bp_bulk_load_gff.pl to load a simple but large gff2 file. This file
consisted only of transcripts and their subfeatures, so the group
class of all features was "transcript". The files loaded with no
problem and I was able to write a few successful test scripts.
Now I have added new features (genes) to the gff file and tried to
load the new file exactly as before with bp_bulk_load_gff.pl, but it
takes _much_ longer to load, and the load slows down as more features
are added (the first 5K features take about 30 seconds, the next 5K
nearly 2 minutes, and so on). It took over an hour to load 50K
features, at which point I stopped it.
I've played around with the gff file a bit and found that anything
that doesn't have a group class of "transcript" has this problem. For
example, if I run 'sed s/transcript/foo/g' on the original file, it
loads slowly, and if I run 'sed s/gene/transcript/g' on the new file,
it loads fast. I have manually verified that the MySQL database is
empty before each attempt, and I even wiped the tmp directory each
time.
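
To make the substitution test concrete, here is a minimal sketch on a
made-up three-line GFF2 file. The sequence name, source, coordinates
and the transcript name "T1" are assumptions for illustration only,
not taken from the real data. It only shows what the sed command does
to the group class in column 9; it does not run the loader.

```shell
# Hypothetical three-line GFF2 sample; names and coordinates are made up.
tmpdir=$(mktemp -d)
printf 'chr1\tcurated\ttranscript\t100\t900\t.\t+\t.\ttranscript "T1"\n'  > "$tmpdir/orig.gff"
printf 'chr1\tcurated\texon\t100\t300\t.\t+\t.\ttranscript "T1"\n'       >> "$tmpdir/orig.gff"
printf 'chr1\tcurated\texon\t500\t900\t.\t+\t.\ttranscript "T1"\n'       >> "$tmpdir/orig.gff"

# Same substitution as in the post: every "transcript" becomes "foo",
# in the feature-type column as well as in the group column.
sed 's/transcript/foo/g' "$tmpdir/orig.gff" > "$tmpdir/foo.gff"

# Tally the group classes (first word of column 9) in each file.
orig_classes=$(cut -f9 "$tmpdir/orig.gff" | cut -d' ' -f1 | sort | uniq -c | awk '{print $2 "=" $1}')
foo_classes=$(cut -f9 "$tmpdir/foo.gff" | cut -d' ' -f1 | sort | uniq -c | awk '{print $2 "=" $1}')
echo "original: $orig_classes"
echo "renamed:  $foo_classes"
```

The two files differ only in that one word, so any difference in load
time between them comes from the group class name rather than from
the number or layout of the features.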
Any ideas why non-transcript features take so long?
Thanks,
Dustin Cram