[Bioperl-l] need help with large genbank file - THANK YOU

Dinakar Desai Desai.Dinakar@mayo.edu
Wed, 24 Jul 2002 08:45:18 -0500


Jason Stajich wrote:
> You need to make sure that perl is compiled with the LARGE_FILE option (see
> perl -V); alternatively, you can open the filehandle through a pipe,
> open(FILE, "cat nt |")
> as Simon and Darin suggested.
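For the pipe route, a minimal sketch (assuming the uncompressed nt file sits in the current directory, and that cat itself can handle files over 2GB, as Chris notes below):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

# Let cat do the reading and hand Bioperl the pipe's filehandle
# instead of a file name, sidestepping perl's own open() size limit.
open(my $fh, "cat nt |") or die "cannot pipe from cat: $!";

my $seqio = Bio::SeqIO->new(-fh => $fh, -format => 'Fasta');
while (my $seq = $seqio->next_seq()) {
    print $seq->display_id, "\n";
}
close($fh);
```

Bio::SeqIO accepts -fh in place of -file, so the rest of the parsing code is unchanged.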
> 
> On Tue, 23 Jul 2002, Dinakar Desai wrote:
> 
> 
>>Chris Dagdigian wrote:
>>
>>>Dinakar,
>>>
>>>The file is too big for perl to open a filehandle on (at least, that is
>>>what your error message says).
>>>
>>>I know from painful experience :) that the file you are trying to read
>>>is larger than 2GB when it is uncompressed into its native form.  If
>>>your computer, filesystem, kernel or operating system cannot handle
>>>files larger than 2GB in size then you will get these sorts of errors.
>>>
>>>There are various tricks to make things work. Systems with 64-bit
>>>architectures (like Alphaservers) do not have these problems at all.
>>>
>>>Linux solved this in the kernel a long time ago and the common linux
>>>filesystems can all handle large files. There are however binary
>>>programs that you may run into like 'cat', 'more', 'uncompress' etc.
>>>etc. that will coredump or segfault on large files because they were not
>>>compiled to support 64-bit offsets.
>>>
>>>Without knowing your operating system or local configuration I'd
>>>recommend that you experiment with breaking NT into several smaller
>>>pieces. You should be able to determine experimentally the filesize
>>>limit that you appear to have.
>>>
>>>-Chris
>>>
>>>
>>>
>>>
>>>Dinakar Desai wrote:
>>>
>>>
>>>>Hello:
>>>>
>>>>I am new to perl and bioperl. I have downloaded a file from NCBI
>>>>(ftp://ftp.ncbi.nih.gov/blast/db/nt) and this file is quite large. I
>>>>am trying to parse it for certain patterns with Bioperl, and I get an
>>>>error. I have looked into largefasta.pm, and its docs suggest not to
>>>>use it. I would appreciate it if you could help me with this problem.
>>>>
>>>>My code to test only 5 records out of this big file is as follows:
>>>><code>
>>>>#!/usr/bin/env perl
>>>>
>>>>use lib '/home/desas2/perl_mod/lib/site_perl/5.6.0/';
>>>>
>>>>use Bio::SeqIO;
>>>>
>>>>$seqio = Bio::SeqIO->new( -file =>"/home/desas2/data/nt", '-format' =>
>>>>'Fasta');
>>>>
>>>>$seqobj = $seqio->next_seq();
>>>>$count = 5;
>>>>while ($count > 0){
>>>>        print $seqobj->seq(), "\n";
>>>>        $seqobj = $seqio->next_seq();
>>>>        $count--;
>>>>}
>>>></code>
>>>>and the error message is:
>>>><error>
>>>>------------ EXCEPTION  -------------
>>>>MSG: Could not open /home/desas2/data/nt for reading: File too large
>>>>STACK Bio::Root::IO::_initialize_io /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/Root/IO.pm:244
>>>>STACK Bio::SeqIO::_initialize /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:381
>>>>STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:314
>>>>STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:327
>>>>STACK toplevel ./test_fasta.pl:8
>>>>
>>>>--------------------------------------
>>>></error>
>>>>
>>>>Do you have any suggestions for how I could read this big file and
>>>>get sequence objects? I already know how to manipulate them.
>>>>
>>>>Thank you.
>>>>
>>>>Dinakar
>>>>
>>>
>>>
>>>
>>Thank you very much for your email. I am running this script on :
>>Linux  2.4.7-10 #1 Thu Sep 6 16:46:36 EDT 2001 i686 unknown
>>it has about 2.5 GB memory.
>>
>>I used Biopython and could open the file and do some work. I thought I
>>would try Bioperl (which seems to be more mature) and ran into this problem.
>>
>>The size of file is: 6298460844 bytes (6.2 GB)
>>
>>Can you suggest how I can break this file into smaller files and then
>>parse them?
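One way to split a multi-record FASTA file into smaller pieces (a sketch; the 10000-record chunk size and the nt_part_ prefix are arbitrary choices): open a new output file whenever the record count crosses a chunk boundary, so no record is ever split across files.

```shell
# Split the multi-FASTA file "nt" into pieces of 10000 records each.
# Each '>' header line starts a new record; a new output file is
# opened at every chunk boundary.
awk 'BEGIN { chunk = 10000 }
     /^>/  { if (n % chunk == 0) { close(file); file = sprintf("nt_part_%04d.fa", n / chunk) } n++ }
     { print > file }' nt
```

Each resulting nt_part_*.fa piece is itself a valid FASTA file that the original Bio::SeqIO script can read unmodified.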
>>
>>
>>
>>Thank you.
>>
>>Dinakar
>>
>>
> 
> 
Thank you very much for all the help with the large file.
Now it works great.

Thank you.

Dinakar

-- 

Dinakar Desai, Ph.D

----------------------

Everything should be made as simple as possible, but no
simpler.  -- Albert Einstein