[Bioperl-l] need help with large genbank file

Jason Stajich jason@cgt.mc.duke.edu
Wed, 24 Jul 2002 08:38:19 -0400 (EDT)


You need to make sure that perl is compiled with large file support
(uselargefiles) - see the output of perl -V. Alternatively, you can open
the filehandle through a pipe with
open(FILE, "cat nt |")
as was suggested by Simon and Darin.
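
A minimal sketch of both suggestions, assuming a Unix-like system where
cat(1) itself handles large files. The helper names
(has_largefile_support, open_via_cat, count_fasta_records) are made up
for illustration and are not part of BioPerl. The list form of the pipe
open used here needs a newer perl than 5.6; on older perls the
two-argument form shown above does the same job.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Config;

# True if this perl was built with large file support.
sub has_largefile_support {
    return defined $Config{uselargefiles}
        && $Config{uselargefiles} eq 'define';
}

# Open a read handle on a pipe from cat(1), so that perl itself
# never has to open the large file.
sub open_via_cat {
    my ($path) = @_;
    open(my $fh, '-|', 'cat', $path)
        or die "Cannot start cat on $path: $!";
    return $fh;
}

# Count FASTA records (header lines starting with '>') on any handle.
sub count_fasta_records {
    my ($fh) = @_;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++ if $line =~ /^>/;
    }
    return $n;
}

if (@ARGV) {
    print "uselargefiles: ", (has_largefile_support() ? "yes" : "no"), "\n";
    my $fh = open_via_cat($ARGV[0]);
    # With BioPerl installed, the same handle can instead be passed to
    # Bio::SeqIO->new(-fh => $fh, -format => 'Fasta').
    print "records: ", count_fasta_records($fh), "\n";
    close $fh;
}
```

Run it as "perl check_nt.pl /home/desas2/data/nt" (the script name is
arbitrary). The pipe only supports sequential reading, which is all
that next_seq() needs.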

On Tue, 23 Jul 2002, Dinakar Desai wrote:

> Chris Dagdigian wrote:
> >
> > Dinakar,
> >
> > The file is too big for perl to open a filehandle on (at least that is
> > what your error message states).
> >
> > I know from painful experience :) that the file you are trying to read
> > is larger than 2GB when it is uncompressed into its native form.  If
> > your computer, filesystem, kernel or operating system cannot handle
> > files larger than 2GB in size then you will get these sorts of errors.
> >
> > There are various tricks to make things work. Systems with 64-bit
> > architectures (like Alphaservers) do not have these problems at all.
> >
> > Linux solved this in the kernel a long time ago and the common linux
> > filesystems can all handle large files. There are however binary
> > programs that you may run into like 'cat', 'more', 'uncompress' etc.
> > etc. that will coredump or segfault on large files because they were not
> > compiled to support 64-bit offsets.
> >
> > Without knowing your operating system or local configuration I'd
> > recommend that you experiment with breaking NT into several smaller
> > pieces. You should be able to determine experimentally the filesize
> > limit that you appear to have.
> >
> > -Chris
> >
> >
> >
> >
> > Dinakar Desai wrote:
> >
> >> Hello:
> >>
> >> I am new to perl and bioperl. I have downloaded a file from NCBI
> >> (ftp://ftp.ncbi.nih.gov/blast/db/nt) and this file is quite large. I
> >> am trying to parse this file for a certain pattern with Bioperl. I get
> >> an error. I have looked into largefasta.pm and they suggest not to use
> >> it. I would appreciate it if you could help me with this problem.
> >>
> >> My code to test only 5 records out of this big file is as follows:
> >> <code>
> >> #!/usr/bin/env perl
> >>
> >> use lib '/home/desas2/perl_mod/lib/site_perl/5.6.0/';
> >>
> >> use Bio::SeqIO;
> >>
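> >> # open the FASTA file as a sequence stream
> >> $seqio = Bio::SeqIO->new( -file   => "/home/desas2/data/nt",
> >>                           -format => 'Fasta' );
> >>
> >> $seqobj = $seqio->next_seq();
> >> $count = 5;
> >> # stop after 5 records, or earlier if the stream runs out
> >> while ($count > 0 && defined $seqobj){
> >>         print $seqobj->seq();
> >>         $seqobj = $seqio->next_seq();
> >>         $count--;    # without this decrement the loop never ends
> >> }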
> >> </code>
> >> and the error message is:
> >> <error>
> >> ------------ EXCEPTION  -------------
> >> MSG: Could not open /home/desas2/data/nt for reading: File too large
> >> STACK Bio::Root::IO::_initialize_io /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/Root/IO.pm:244
> >> STACK Bio::SeqIO::_initialize /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:381
> >> STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:314
> >> STACK Bio::SeqIO::new /home/desas2/perl_mod/lib/site_perl/5.6.0//Bio/SeqIO.pm:327
> >> STACK toplevel ./test_fasta.pl:8
> >>
> >> --------------------------------------
> >> </error>
> >>
> >> Do you have any suggestions on how I could read this big file and
> >> get a sequence object? I know how to manipulate a sequence object.
> >>
> >> Thank you.
> >>
> >> Dinakar
> >>
> >
> >
> >
>
> Thank you very much for your email. I am running this script on:
> Linux  2.4.7-10 #1 Thu Sep 6 16:46:36 EDT 2001 i686 unknown
> It has about 2.5 GB of memory.
>
> I used Biopython and I could open the file and do some work. I thought I
> would try bioperl (which seems to be more mature) and I ran into this
> problem.
>
> The size of the file is: 6298460844 bytes (about 6.3 GB)
>
> Can you suggest how I can break this file into smaller files and then
> parse them?
>
>
>
> Thank you.
>
> Dinakar
>
>
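
On the question above of breaking the file into smaller pieces, one
possible approach is sketched here. It assumes the input is FASTA and
that a piece may only end at a record boundary (a line starting with
'>'). The function name split_fasta, the output naming scheme and the
1 GB threshold are all illustrative choices, not anything BioPerl
provides.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Copy FASTA records from $in into "$prefix.001", "$prefix.002", ...
# A new file is started at a record boundary once the current file
# has reached $max_bytes, so no record is ever cut in half.
# Returns the list of files written.
sub split_fasta {
    my ($in, $prefix, $max_bytes) = @_;
    my (@files, $out);
    my $written = 0;
    while (my $line = <$in>) {
        my $new_record = $line =~ /^>/;
        if (!$out || ($new_record && $written >= $max_bytes)) {
            close $out if $out;
            my $name = sprintf '%s.%03d', $prefix, scalar(@files) + 1;
            open($out, '>', $name) or die "Cannot write $name: $!";
            push @files, $name;
            $written = 0;
        }
        print {$out} $line;
        $written += length $line;
    }
    close $out if $out;
    return @files;
}

if (@ARGV == 2) {
    my ($source, $prefix) = @ARGV;
    # Read through cat so that perl never opens the large file itself.
    open(my $in, '-|', 'cat', $source)
        or die "Cannot start cat on $source: $!";
    # 1 GB per piece is an arbitrary value below the 2 GB limit.
    print "$_\n" for split_fasta($in, $prefix, 1_000_000_000);
    close $in;
}
```

Each piece can exceed the threshold by at most one record, so the
threshold should sit comfortably below the real limit. The resulting
files can then be opened one at a time with
Bio::SeqIO->new(-file => ..., -format => 'Fasta').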

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu