[Bioperl-l] need help with large genbank file
Dinakar Desai
Desai.Dinakar@mayo.edu
Tue, 23 Jul 2002 18:10:49 -0500
Chris Dagdigian wrote:
>
> Dinakar,
>
> The file is too big for perl to open a filehandle on (at least, that
> is what your error message states).
>
> I know from painful experience :) that the file you are trying to
> read is larger than 2GB when it is uncompressed into its native form.
> If your computer, filesystem, kernel, or operating system cannot
> handle files larger than 2GB, you will get these sorts of errors.
>
> There are various tricks to make things work. Systems with 64-bit
> architectures (like Alphaservers) do not have these problems at all.
>
> Linux solved this in the kernel a long time ago, and the common Linux
> filesystems can all handle large files. There are, however, binary
> programs you may run into ('cat', 'more', 'uncompress', etc.) that
> will core dump or segfault on large files because they were not
> compiled with support for 64-bit file offsets.
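>
> (A quick sanity check: perl can report its own relevant build-time
> settings, e.g.
>
> <code>
> perl -V:uselargefiles
> perl -V:lseeksize
> </code>
>
> If uselargefiles comes back 'undef', or lseeksize is 4 rather than 8,
> that perl was built without large file support and will refuse to
> open anything past the 2GB mark.)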
>
> Without knowing your operating system or local configuration, I'd
> recommend that you experiment with breaking the nt file into several
> smaller pieces (see the sketch below). You should be able to
> determine experimentally the file-size limit you appear to have.
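>
> (Untested sketch of one way to split the file by record count. The
> 100,000-records-per-chunk figure and the nt.partN names are arbitrary
> choices, and the splitter itself still has to run under a perl built
> with large file support, or it will hit the same open() error:
>
> <code>
> #!/usr/bin/env perl
> use strict;
> use warnings;
>
> my $per_file = 100_000;   # FASTA records per output chunk (arbitrary)
> my ($count, $chunk, $out) = (0, 0, undef);
>
> open my $in, '<', '/home/desas2/data/nt' or die "open nt: $!";
> while (my $line = <$in>) {
>     # A '>' at the start of a line begins a new FASTA record.
>     if ($line =~ /^>/ && $count++ % $per_file == 0) {
>         close $out if $out;
>         $chunk++;
>         open $out, '>', "nt.part$chunk" or die "open nt.part$chunk: $!";
>     }
>     print {$out} $line;
> }
> close $out if $out;
> close $in;
> </code>
>
> Each nt.part* file should then be small enough to feed to Bio::SeqIO
> as before; adjust $per_file to taste.)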
>
> -Chris
>
> Dinakar Desai wrote:
>
>> Hello:
>>
>> I am new to perl and bioperl. I have downloaded a file from NCBI
>> (ftp://ftp.ncbi.nih.gov/blast/db/nt), and this file is quite large.
>> I am trying to parse this file for a certain pattern with Bioperl,
>> and I get an error. I have looked into largefasta.pm, and its
>> documentation suggests not to use it. I would appreciate it if you
>> could help me with this problem.
>>
>> My code to test only 5 records out of this big file is as follows:
>> <code>
>> #!/usr/bin/env perl
>> use strict;
>> use warnings;
>>
>> use lib '/home/desas2/perl_mod/lib/site_perl/5.6.0/';
>> use Bio::SeqIO;
>>
>> my $seqio = Bio::SeqIO->new( -file   => '/home/desas2/data/nt',
>>                              -format => 'Fasta' );
>>
>> my $seqobj = $seqio->next_seq();
>> my $count  = 5;
>> while ($count > 0 && defined $seqobj) {
>>     print $seqobj->seq();
>>     $seqobj = $seqio->next_seq();
>>     $count--;    # without this the loop would never terminate
>> }
>> </code>
>> and the error message is:
>> <error>
>> ------------ EXCEPTION -------------
>> MSG: Could not open /home/desas2/data/nt for reading: File too large
>> STACK Bio::Root::IO::_initialize_io /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/Root/IO.pm:244
>> STACK Bio::SeqIO::_initialize /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:381
>> STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:314
>> STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:327
>> STACK toplevel ./test_fasta.pl:8
>>
>> --------------------------------------
>> </error>
>>
>> Do you have any suggestions for how I could read this big file and
>> get sequence objects? I know how to manipulate sequence objects.
>>
>> Thank you.
>>
>> Dinakar
>>
>
Thank you very much for your email. I am running this script on:
Linux 2.4.7-10 #1 Thu Sep 6 16:46:36 EDT 2001 i686 unknown
with about 2.5 GB of memory.
I used Biopython, and I could open the file and do some work. I
thought I would try bioperl (which seems to be more mature), and I ran
into this problem.
The size of the file is 6298460844 bytes (6.2 GB).
Can you suggest how I can break this file into smaller files and then
parse them?
Thank you.
Dinakar
--
Dinakar Desai, Ph.D.
perl -e '$_ = "mqonx.zako\@ude";$_=~ tr /qnxzk\@.ue/npqmy.\@eu/; print'
----------------------
Everything should be made as simple as possible, but no
simpler. -- Albert Einstein