[Bioperl-l] get_Stream_by_query Terminates Prematurely

Chris Fields cjfields at illinois.edu
Mon May 10 16:31:15 UTC 2010


On May 10, 2010, at 12:38 AM, Robert Bradbury wrote:

> I don't know whether this is related or not.  But the last time I tried to
> fetch a moderately large genome (NS_000198 for *Podospora anserina*) it
> failed [1].  It takes a *very* long time and eventually springs an "Out of
> Memory" error.  This is on a Pentium IV Prescott which only has a 4GB
> address space (configured for 3GB for user programs) and after running a
> long strace on the perl process it seemed that what was happening was that
> it was never properly returning and merging memory from the sequence chunks
> which were being fetched.  The final program address was brk(0xafd8c000) or
> 2,950,217,728 which is probably the maximum amount of data space a user
> program can have considering that one needs room for the stack.  After that
> the mmap2() calls started failing with ENOMEM.

That's odd.  What OS?

> If Bio::DB::GenBank::Query is intelligent enough to only fetch just the
> requested fields you should be ok.  But if it fetches the entire GenBank
> record and simply throws away the sequence information and you are running
> into large sequences (say a big chunk of a chromosome) and this ends up
> hitting the memory/swap space limits on your machine that could be a
> problem.

Yes, that may happen, as (at the moment) we push everything into memory; there are no lazy or DB-linked Seq instances, at least not yet.  Very large sequences take a lot of time (object instantiation) and a lot of memory.  To tell the truth, that seems to be the default for most toolkits, but we have recently talked about possible ways to deal with it; we just need the tuits for it (as with anything).

The other alternative is to pull the sequences down locally as a raw text file.  This can still be done within BioPerl, just using Bio::DB::EUtilities:

use Bio::DB::EUtilities;

my $id = 'NS_000198';
my $in = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                  -db      => 'nuccore',
                                  -email   => 'cjfields@bioperl.org',
                                  -rettype => 'gbwithparts',
                                  -id      => $id);

$in->get_Response(-file => "$id.gb");
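
Once the record is on disk it can be parsed with Bio::SeqIO (a minimal sketch; the filename assumes the efetch call above was used):

use Bio::SeqIO;

# Parse the locally saved GenBank file one record at a time
my $seqio = Bio::SeqIO->new(-file   => 'NS_000198.gb',
                            -format => 'genbank');
while (my $seq = $seqio->next_seq) {
    printf "%s\t%d bp\n", $seq->display_id, $seq->length;
}

Parsing still builds one Seq object per record in memory, but the download and the parsing are decoupled, so a slow or failed fetch doesn't have to be repeated.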

> If the program is running for a long time I'd be inclined to check my system
> logs to see if one is running out of memory/swap.  You can also watch the
> process using ps to determine if the VSZ grows continuously.
> 
> I think I mentioned this before on the BioPerl list but never had a clear
> understanding of what was going on and may not have filed a bug report.  I
> think I eventually worked around it, perhaps by fetching the offending
> (large) sequence using wget or a browser.

You can still file a bug on it; it does help with keeping track (just reporting it here doesn't help much, as it gets lost in the shuffle).

> Robert
> 
> 1. Given that NS_000198 is only ~7MB (4.6 million actual bases)  the BioPerl
> memory management has to be really poor in merging/reusing if the fetch uses
> ~3GB.

BioPerl stores everything in memory, but I've worked with 4.6 Mbp genomes quite a bit on my MacBook Pro.  However, the default mode for Bio::DB::GenBank is to pull down everything using 'gbwithparts', and the resulting file is much larger (sequence is ~34 Mbp, file is ~51 MB).  Maybe that's the problem?
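
If the annotation isn't needed, another option (just a sketch, untested against this accession) is to ask Bio::DB::GenBank for FASTA only and have it spool the response to a temporary file rather than an in-memory string:

use Bio::DB::GenBank;

# Sequence only (no features), response streamed to a temp file
my $gb = Bio::DB::GenBank->new(-format        => 'Fasta',
                               -retrievaltype => 'tempfile');
my $seq = $gb->get_Seq_by_acc('NS_000198');
print $seq->display_id, "\t", $seq->length, " bp\n";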

If you can, please file a bug report along with the relevant information.  That helps us determine the best course of action.

chris






