[Bioperl-l] Bio::Assembly bug/feature?
Chris Fields
cjfields at uiuc.edu
Mon Jul 23 15:41:35 UTC 2007
To all:
I think I have found a major problem with Bio::Assembly; this was
first noticed on Mac OS X in relation to bug 2320 and
Bio::Assembly::IO. I am uncertain whether this is meant to be a
feature or a bug, but it certainly needs to be documented or fixed, as
it leads to subtle errors. I also can't see the advantage of this
approach, but maybe I can be enlightened? Either way, I think it's
worth a discussion for those willing to follow. I'll file it as a bug
later if needed.
A bit of background: each instance of a Bio::Assembly::Contig has a
Bio::SeqFeature::Collection instance attached to it; each
Bio::SeqFeature::Collection itself has a tied DB_File handle attached
which remains open during the lifetime of the Bio::SF::Collection
object. When using Bio::Assembly one adds the various Contig objects
to a Bio::Assembly::Scaffold. So, for instance, if one had ~1000
Contigs in a Scaffold, one would also have ~1000 open tied db
handles, one per Contig instance. So far, so good.
Unfortunately, when adding a ton of Contig objects to a
Bio::Assembly::Scaffold one can run into a host of system-dependent
issues based on resource usage limits (as one might expect). This
script:
------------------------------
use strict;
use warnings;
use Bio::Assembly::Scaffold;
use Bio::Assembly::Contig;
use Bio::SeqFeature::Generic;

my $scaffold = Bio::Assembly::Scaffold->new();
for my $id (1..15000) {
    print "Contig #$id\n";
    # Each new Contig creates its own Bio::SeqFeature::Collection,
    # and with it one more open tied DB_File handle
    my $contig = Bio::Assembly::Contig->new(-id => $id);
    my $feat = Bio::SeqFeature::Generic->new(-start  => 1,
                                             -end    => 10,
                                             -strand => 1);
    $contig->add_features([$feat]);
    $scaffold->add_contig($contig);
}
------------------------------
may fail on Mac OS X once the process reaches its maximum number of
open file descriptors (on UNIX-like systems, the limit reported by
'ulimit -n'). The call to tie the DB_File handle in SF::Collection
fails silently, so the error only surfaces later, when the undefined
handle is used:
...
Contig #251
Contig #252
Contig #253
Contig #254
Can't call method "put" on an undefined value at /Users/cjfields/src/bioperl-live/Bio/SeqFeature/Collection.pm line 225.
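One way to surface this kind of failure at the tie itself, rather than at a later method call, is to check the return value of tie, which is false when the underlying file cannot be opened. The following is only a minimal sketch of that idea, not the code in Bio::SeqFeature::Collection: tie_or_die is a hypothetical helper, and SDBM_File is swapped in for DB_File on the assumption that it is available (it ships with core Perl) and that its tie also returns a false value on failure.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;
use File::Temp qw(tempdir);

# Hypothetical helper: tie a hash to an on-disk DBM file and die
# immediately if the tie fails, instead of returning an undefined handle.
sub tie_or_die {
    my ($path) = @_;
    my %store;
    my $handle = tie(%store, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666)
        or die "Could not tie DBM file '$path': $!\n";
    return ($handle, \%store);
}

my $dir = tempdir(CLEANUP => 1);

# Normal case: the tie succeeds and the hash is usable.
my ($handle, $store) = tie_or_die("$dir/features");
$store->{contig_1} = 'feature data';
print "Stored: $store->{contig_1}\n";

# Failure case: the parent directory does not exist, so the tie fails
# and the error is reported where it happens.
my $ok = eval { tie_or_die("$dir/no_such_dir/features"); 1 };
print "Caught: $@" unless $ok;
```

Running out of file descriptors makes the tie fail in the same way, with $! describing the cause, so the message would point at the resource limit instead of at an unrelated "put" call.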
I have added an exception to catch this. On Mac OS X you can
increase the file descriptor limit using ulimit, at least to a
certain point. However, when testing this out on dev.open-bio.org
(Linux) the 'tie' sometimes fails (and the exception pops up), but it
isn't dependent on 'ulimit -n'. This is what happens more often:
...
Contig #10567
Contig #10568
Contig #10569
Contig #10570
Out of memory!
Sometimes followed by a seg fault. Ick!
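To check how close a process is to the descriptor ceiling on a given system, something like the following might help. It is a rough sketch under stated assumptions: descriptor_limit and count_openable_handles are hypothetical helper names, the cap of 300 is arbitrary, and plain read-only file handles stand in for the tied DB_File handles. It only probes the descriptor limit and says nothing about the memory exhaustion seen on Linux.

```perl
use strict;
use warnings;
use POSIX qw(sysconf _SC_OPEN_MAX);
use File::Temp qw(tempfile);

# Ask the OS for the per-process limit on open descriptors
# (returns undef if the value is not available).
sub descriptor_limit {
    return sysconf(_SC_OPEN_MAX);
}

# Open the same file repeatedly until open() fails or the cap is
# reached, then close everything and report how many opens succeeded.
sub count_openable_handles {
    my ($path, $cap) = @_;
    my @handles;
    while (@handles < $cap) {
        open(my $fh, '<', $path) or last;
        push @handles, $fh;
    }
    my $count = scalar @handles;
    close $_ for @handles;
    return $count;
}

my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
my $limit  = descriptor_limit();
my $opened = count_openable_handles($tmp_path, 300);
printf "OS limit: %s, additional handles opened: %d (cap 300)\n",
    defined $limit ? $limit : 'unknown', $opened;
```

If the number of handles opened is below the cap, the process hit its descriptor limit before reaching it, which is the situation the Contig loop runs into.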
Any ideas? For instance, should we set this up so that one
SF::Collection is used for all the Contigs (since each one has a
unique ID anyway)? Leave as is and document/track the issue as a
bug? Both?
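To make the first option concrete, here is a minimal sketch of a single store shared by all Contigs and keyed by contig ID. SharedFeatureStore and its methods are hypothetical and are not part of the Bio::SeqFeature::Collection API; an in-memory hash stands in for what would presumably be one tied DB_File, and plain hash references stand in for feature objects.

```perl
use strict;
use warnings;

# Hypothetical class: one store shared by every contig, keyed by
# contig ID, instead of one collection (and one open handle) per contig.
package SharedFeatureStore;

sub new {
    my ($class) = @_;
    return bless { features => {} }, $class;
}

# Append features to the list held for this contig ID and return
# the number of features now stored for it.
sub add_features {
    my ($self, $contig_id, @features) = @_;
    push @{ $self->{features}{$contig_id} }, @features;
    return scalar @{ $self->{features}{$contig_id} };
}

# Return the features stored for one contig (empty list if none).
sub features_for {
    my ($self, $contig_id) = @_;
    return @{ $self->{features}{$contig_id} || [] };
}

sub contig_count {
    my ($self) = @_;
    return scalar keys %{ $self->{features} };
}

package main;

my $store = SharedFeatureStore->new;
for my $id (1 .. 15000) {
    $store->add_features($id, { start => 1, end => 10, strand => 1 });
}
printf "%d contigs share one store\n", $store->contig_count;
```

With this layout the number of open handles stays constant no matter how many Contigs are added to the Scaffold. The cost is that every feature lookup has to be qualified by contig ID.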
chris