[Bioperl-l] get_Stream_by_query Terminates Prematurely

Robert Bradbury robert.bradbury at gmail.com
Mon May 10 05:38:09 UTC 2010


I don't know whether this is related or not, but the last time I tried to
fetch a moderately large genome (NS_000198 for *Podospora anserina*) it
failed [1].  It took a *very* long time and eventually died with an "Out of
Memory" error.  This was on a Pentium IV Prescott, which only has a 4GB
address space (configured with 3GB for user programs).  After running a
long strace on the perl process, it seemed that memory from the sequence
chunks being fetched was never properly returned and merged.  The final
program break was brk(0xafd8c000), or 2,950,217,728 bytes, which is probably
the maximum amount of data space a user program can have considering that
one needs room for the stack.  After that the mmap2() calls started failing
with ENOMEM.
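
As a quick check of the arithmetic above, here is a small sketch (in Python
rather than Perl, purely for illustration) that converts the break address to
decimal and shows how little room was left.  The 3 GiB limit is an assumption
based on the 3GB user-space configuration mentioned above.

```python
# Convert the final program break seen in strace to decimal.
brk_address = 0xAFD8C000

# Assumption: user space ends at 3 GiB (the 3GB/1GB split described above).
user_space_limit = 3 * 1024**3

# Address space left between the break and the top of user space; this has
# to hold the stack, shared libraries and any mmap'd regions.
headroom = user_space_limit - brk_address

print(f"{brk_address:,}")  # 2,950,217,728
print(f"{headroom / 1024**2:.1f} MiB of address space left above the break")
```

With so little address space left, it is not surprising that further mmap2()
requests could not be satisfied.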

If Bio::DB::GenBank::Query is intelligent enough to fetch only the
requested fields you should be ok.  But if it fetches the entire GenBank
record and simply throws away the sequence information, then large
sequences (say a big chunk of a chromosome) could push you into the
memory/swap space limits on your machine, and that could be a problem.

If the program runs for a long time, I'd be inclined to check the system
logs to see whether the machine is running out of memory/swap.  You can also
watch the process using ps to see whether the VSZ grows continuously.
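
A minimal sketch of that ps check, assuming a Unix-like system whose ps
accepts "-o vsz= -p PID" (procps on Linux does).  The function names, sample
count and polling interval are made up for illustration.

```python
import subprocess
import time


def read_vsz_kib(pid):
    """Return the VSZ of process `pid` in KiB, as reported by ps."""
    out = subprocess.run(
        ["ps", "-o", "vsz=", "-p", str(pid)],
        capture_output=True, text=True, check=True,  # raises if the pid is gone
    ).stdout
    return int(out.strip())


def grows_continuously(samples):
    """True if the samples never decrease and end higher than they started."""
    if len(samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_shrinks and samples[-1] > samples[0]


def watch_vsz(pid, count=10, interval=5.0):
    """Collect `count` VSZ readings, `interval` seconds apart."""
    samples = []
    for _ in range(count):
        samples.append(read_vsz_kib(pid))
        time.sleep(interval)
    return samples


# Example (hypothetical pid of the running perl process):
#   samples = watch_vsz(12345)
#   print(samples, grows_continuously(samples))
```

A VSZ that only ever climbs while records are being fetched would point
towards the kind of problem described above, though it does not by itself say
where the memory is going.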

I think I mentioned this before on the BioPerl list but never had a clear
understanding of what was going on and may not have filed a bug report.  I
think I eventually worked around it, perhaps by fetching the offending
(large) sequence using wget or a browser.
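
For what it's worth, the same kind of workaround can be scripted.  The sketch
below (Python, for illustration only) streams a record straight to disk so it
never has to be held in memory.  It assumes the NCBI E-utilities efetch
endpoint with db=nuccore and rettype=gb; the function names are hypothetical,
and I have not checked this against the record in question.

```python
import shutil
import urllib.parse
import urllib.request

# Assumption: NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="gb"):
    """Build an efetch URL asking for one nucleotide record as plain text."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_to_file(accession, path, chunk_size=1024 * 1024):
    """Stream the record to `path` in fixed-size chunks."""
    with urllib.request.urlopen(build_efetch_url(accession)) as response:
        with open(path, "wb") as out:
            # copyfileobj reads and writes chunk_size bytes at a time, so
            # memory use stays flat regardless of the record size.
            shutil.copyfileobj(response, out, chunk_size)


# Example (performs a network request):
#   fetch_to_file("NS_000198", "NS_000198.gb")
```

The saved file can then be parsed locally, which at least separates the
download from whatever is consuming the memory.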

Robert

1. Given that NS_000198 is only ~7MB (4.6 million actual bases), the BioPerl
memory management has to be really poor at merging/reusing memory if the
fetch uses ~3GB, which is over 400 times the size of the record.


