[Bioperl-l] Limit on sequence file size fetches?

Jason Stajich jason at bioperl.org
Sun Aug 16 19:22:35 UTC 2009


Robert -

Posting your script will help us replicate and diagnose the problem - I
am not sure which GenBank fetch option you are using. I have a feeling it
is trying to make recursive calls to stitch together the pseudoscaffold.
I presume it works fine, though, if you request each chromosome scaffold
individually, like CU607053, CU633438, ...

I guess posting it via a Bugzilla bug is the best way, unless you have
a GitHub account and want to post it as a 'gist'.

-jason
--
Jason Stajich
jason at bioperl.org
http://fungalgenomes.org/

On Aug 16, 2009, at 3:16 PM, Robert Bradbury wrote:

> Hello,
>
> I am trying to use get_sequence() to fetch the sequence NS_000198 for
> the fungus *Podospora anserina* from the database "GenBank" and, when
> that didn't work, "Gene". It is a simple script that fetches the
> sequence and then writes out the FASTA and GenBank files from the data
> structure.
>
> The errors I got suggested that the system was running out of memory,
> which I thought was unlikely since I've got something like 3GB of main
> memory and 9GB of swap space. After running strace on the script (which
> takes a while) I determined that the brk() calls were generating ENOMEM
> at ~3GB. This turns out to be due to the limit of the Linux memory
> model I am using (3GB/1GB) on a Pentium IV (Prescott).
>
> Now, I think the total genome size for the fungus is ~70MB, but I
> haven't verified this, so I "should" be able to fetch it unless BioPerl
> (or Perl itself) is doing extremely poor memory management as the reads
> take place (perhaps not coalescing memory segments into one large
> sequence) [1].
>
> Has anyone encountered this problem (fetching, say, large mammalian
> chromosomes)? Does anyone know what the limits are for "fetching"
> sequence files (on 32/64-bit machines)? The reason I am using
> get_sequence and BioPerl is that I can't seem to find the
> *Podospora anserina* sequence in an FTP database anywhere (so I can't
> use "wget" or "ftp"). I haven't tested accessing the GenBank file in a
> browser (I don't know what browsers would do with an HTML file that
> large, but I suspect it would not be pretty).
>
> Thanks in advance,
> Robert Bradbury
>
> 1. The strace seems to indicate periodic brk() calls to expand the
> process data segment size, between which there are lots of read() calls
> of size 4096, presumably reading the socket from NCBI. I don't know if
> there is an easy way to trace Perl's memory allocation/manipulation at
> a higher level.
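
The back-of-envelope reasoning in the quoted message can be checked with a
quick calculation. This is only a sketch in Python using the approximate
figures quoted above: a 3GB user address space from the 3GB/1GB split, and
a ~70MB genome that the poster says is unverified. Storing one byte per
base is an assumption made here, and the helper name overhead_factor is
hypothetical.

```python
# Rough address-space budget for a 32-bit Linux process with a 3GB/1GB split.
# All figures are the approximate ones quoted in the message, not measurements.

GIB = 1024 ** 3
MB = 10 ** 6

user_space_bytes = 3 * GIB   # user portion of the 3GB/1GB split
genome_bytes = 70 * MB       # ~70MB genome, assuming one byte per base


def overhead_factor(limit_bytes: int, payload_bytes: int) -> float:
    """Return how many times larger than the payload the limit is."""
    return limit_bytes / payload_bytes


factor = overhead_factor(user_space_bytes, genome_bytes)
print(f"The process can grow to about {factor:.0f}x the raw sequence size")
```

Under these assumptions the process would have to hold roughly 46 times
the raw sequence size before brk() hit the ceiling. Note that the limit is
on per-process address space, so the 3GB of RAM and 9GB of swap do not
raise it.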




