[Bioperl-l] Memory not sufficient when storing human chromosome 1 in BioSQL

Fri Jul 4 15:31:05 UTC 2008

On Jul 4, 2008, at 5:10 AM, Sendu Bala wrote:

> [CC:ing Gabrielle who had an identical problem]
>
> Chris Fields wrote:
>> On Jul 3, 2008, at 6:48 AM, Andreas Dräger wrote:
>>> Recently I have successfully installed the latest version of
>>> BioPerl and BioSQL on my computer, which has 2 GB RAM. Both work
>>> fine, but when trying to insert the GenBank file of human
>>> chromosome 1, which I have downloaded from the NCBI website
>>> (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/hs_ref_chr1.gbk.gz),
>>> I receive the error message 'Out of memory' after about
>>> one hour. My question is how I can insert large GenBank files into
>>> my BioSQL database using BioPerl. I do not know what to do. Thank
>>> you for your help!!!
>>
>> Have you tried just loading the sequence into memory using  
>> Bio::SeqIO?  The problem may be the size of the file itself.
>
> Just looping through:
> perl -MBio::SeqIO -e '$i = Bio::SeqIO->new(-file =>
> "hs_ref_chr1.gbk"); while ($seq = $i->next_seq) { $ac =
> $seq->accession; }'
>
> This gave me a variable memory usage, typically around 360MB,  
> peaking up to 980MB before dropping back down again. Seems a little  
> high to me, but it doesn't seem to be a memory leak?
>
>
> Keeping every seq object in memory:
> perl -MBio::SeqIO -e '$i = Bio::SeqIO->new(-file =>
> "hs_ref_chr1.gbk"); @seqs; while ($seq = $i->next_seq) { push(@seqs,  
> $seq); }'
>
> This used up to 810MB. I didn't notice any peakiness, but it may  
> have been there.
>
> SeqIO by itself shouldn't be causing any out of memory errors on
> 2GB and 4GB machines.
>
>
> What does bioperl-db do as it enters sequences into the db? How does  
> it currently deal with species information?

Are the 'latest versions of bioperl/bioperl-db' Andreas indicated  
above the latest versions from Subversion, or 1.5.2?  I can't recall  
whether 1.5.2 shipped with the memory issue fixes re: Bio::Species (I  
think it did, but maybe Sendu knows better than I).  Some more info  
from Andreas would also help, such as OS, RDBMS, etc.

Could this be a combination of RDBMS, bioperl-db, and bioperl memory  
issues?  Of course that would depend on how the local MySQL/Pg/Oracle  
is set up, but if the memory peaks out at 980MB (or 810MB for all  
sequences) for Bio::SeqIO alone, I could see how bioperl-db and the  
RDBMS may add quite a bit more to that.

If anyone has a local bioperl-db set up we should try replicating  
this.  Speaking of, does anyone know if we have set up bioperl-db  
testing on dev (or wherever it was to be hosted)?  This was discussed  
at one point.

chris