[Bioperl-l] Out of memory errors running Bio::ASN1::EntrezGene against latest Homo_sapiens.ags file

Sun Oct 14 03:28:09 UTC 2007

Hi, Susan,

Let us know how my suggestions worked for you.  I replied to both you 
and the bioperl mailing list last Friday in the hope that my answer 
would be helpful for the list discussion, but it seems that the mailing 
list server had serious problems and dropped both of my emails.  I'm 
therefore replying again, with the content of my 2 emails combined 
below.  Hopefully this email reaches the mailing list.  
If not, would one of you please forward it?  Thanks.

Mingyi Liu wrote:
> Hi, Susan,
>
> Mauricio is right. When there's a problem with Bio::ASN1::EntrezGene, 
> it's better to contact me directly.  I actually deleted a few messages 
> of this discussion before one caught my eye. Nowadays I'm working in 
> some other areas and not tracking the bioperl mailing list closely, so 
> a direct email to me would usually work out better.
>
> As for the problem you mentioned, there could be two reasons: 1. It 
> seems that you converted the file to an XML file instead of a text ASN 
> file. My parser is designed for text ASN, so please use gene2xml to 
> convert the downloaded file to text ASN instead of XML.  It is likely 
> that the unexpected syntax caused my parser to attempt to read the 
> entire file as a single string (because it couldn't find the record 
> start/end).  However, there's another, less likely possibility (which 
> you might have taken care of already): 2. Perl 5.8 added 64 bit 
> support, but I don't know if you have a 64 bit perl 5.8 installed on 
> your system to take advantage of the 256 GB of system memory you have.  
> If not, your >5 GB file is over the 4 GB limit of 32 bit Perl.
>
> Let me know if my suggestions work out for you.
>
> Best,
>
> Mingyi
>
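As a rough sketch of how one might check for both possibilities above before parsing: this is not part of Bio::ASN1::EntrezGene, the helper names sniff_format and perl_is_64bit are made up for illustration, and the check assumes that text ASN output begins with "Entrezgene ::=" while XML output begins with "<". Pointer size is used here as a stand-in for "64 bit perl", which is my own choice of test.

```perl
use strict;
use warnings;
use Config;

# Guess the format of a gene2xml output file from its first few KB.
# Only a small prefix is read, so this is safe on multi-GB files.
# Returns 'xml', 'asn_text', or 'unknown'.
sub sniff_format {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my $head = '';
    read($fh, $head, 4096);
    close $fh;
    $head = '' unless defined $head;

    return 'xml'      if $head =~ /\A\s*</;                  # assumed XML marker
    return 'asn_text' if $head =~ /\A\s*Entrezgene\s*::=/;   # assumed text ASN marker
    return 'unknown';
}

# True (1) when this perl was built with 8-byte pointers, i.e. it can
# address more than 4 GB of memory; 0 otherwise.
sub perl_is_64bit {
    return (defined $Config{ptrsize} && $Config{ptrsize} >= 8) ? 1 : 0;
}

# Optional command-line use: perl check.pl Homo_sapiens.xml
if (@ARGV) {
    my $file = $ARGV[0];
    printf "%s looks like: %s\n", $file, sniff_format($file);
    printf "64 bit perl: %s\n", perl_is_64bit() ? 'yes' : 'no';
}
```

If sniff_format reports 'xml', rerunning gene2xml with the -x switch is the first thing to try; the 64 bit check only matters once the input is in the right format.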
BTW, here's the syntax from one of my messages last year showing how to 
convert the compressed binary ASN format NCBI provides to the text ASN 
format my module (or Stefan's SeqIO::entrezgene) expects (the -x switch 
does the trick, overriding the default, which is to produce XML output):

# Homo_sapiens.ags.gz is the gzipped binary file directly downloaded from NCBI
my $parser = Bio::ASN1::EntrezGene->new(
    'file' => "gene2xml -i Homo_sapiens.ags.gz -c -x -b | ");

The same syntax should be used when you're going through SeqIO (and 
thus SeqIO::entrezgene).
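
A minimal sketch of the SeqIO variant, assuming BioPerl with SeqIO::entrezgene is installed and gene2xml is on the PATH. The helper names gene2xml_pipe and count_genes are hypothetical, and passing a pipe command through -file is assumed to behave the same way as in the Bio::ASN1::EntrezGene call above:

```perl
use strict;
use warnings;

# Build the pipe command that has gene2xml read the gzipped binary file
# from NCBI and write text ASN to stdout (same switches as the call above).
sub gene2xml_pipe {
    my ($ags_gz) = @_;
    return "gene2xml -i $ags_gz -c -x -b |";
}

# Stream the records through Bio::SeqIO one at a time and count them.
# Bio::SeqIO is loaded lazily, so gene2xml_pipe works without BioPerl.
sub count_genes {
    my ($ags_gz) = @_;
    require Bio::SeqIO;
    my $io = Bio::SeqIO->new(
        -format => 'entrezgene',
        -file   => gene2xml_pipe($ags_gz),
    );
    my $count = 0;
    while (my $gene = $io->next_seq) {
        $count++;    # replace with real per-gene processing
    }
    return $count;
}

# Example call (needs BioPerl and gene2xml):
#   print count_genes('Homo_sapiens.ags.gz'), "\n";
```

Because the records are read from the pipe one at a time, no intermediate multi-GB file has to be written to disk.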

BTW, text ASN is both smaller and faster to parse than XML format.

Best,

Mingyi

> Susan Wilson wrote:
>> Hi,
>>
>> I downloaded the latest
>> ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/Mammalia/Homo_sapiens.ags.gz
>> and ran gene2xml on it to generate Homo_sapiens.xml, which is
>> 5821420628 bytes.  I cannot parse this file with Bio::ASN1::EntrezGene,
>> even on a machine with 256GB of memory.  I get a simple "Out of memory"
>> output even with the following code:
>>
>> #!/usr/bin/perl
>> use strict;
>> use Bio::ASN1::EntrezGene;
>>    my $parser = Bio::ASN1::EntrezGene->new('file' => "Homo_sapiens.xml");
>>    while(my $result = $parser->next_seq)
>>    {
>>    }
>>
>>
>>
>> Thanks.
>> Susan