[Bioperl-l] need help with large genbank file

simon andrews (BI) simon.andrews@bbsrc.ac.uk
Wed, 24 Jul 2002 08:43:02 +0100

> > Dinakar Desai wrote:
> >> and the error message is:
> >> <error>
> >> ------------ EXCEPTION  -------------
> >> MSG: Could not open /home/desas2/data/nt for reading:  
> >> File too large

> Chris Dagdigian wrote:
> > 
> > Dinakar,
> > 
> > The file is to big for perl to open a filehandle on (at 
> > least that is what your error message states)
> > 

> > Without knowing your operating system or local 
> > configuration I'd recommend that you experiment with 
> > breaking NT into several smaller pieces.

> Dinakar Desai wrote:
>
> Thank you very much for your email. I am running this 
> script on : Linux  2.4.7-10 
>
> Can you suggest how I can break this file into smaller 
> files and then parse them.

Dinakar,

You seemed to suggest before that your file contains lots of small sequences rather than a few large ones.  In that case there may be a quick fix.

Since you seem to be running a pretty recent kernel, you will hopefully find that your system commands (eg cat) can cope with >2GB files.  If not, try upgrading your textutils package (kernel 2.4.9 with textutils 2.0.11-7 definitely works with >2GB files).

If you can use cat on your large file, then simply create a script which reads its input from STDIN, and pipe the output of cat to it.  We have done this successfully in the past to process large files.

eg (untested):
-----------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# Read sequences from STDIN rather than opening the file directly
my $stream = Bio::SeqIO->new(-fh     => \*STDIN,
                             -format => 'fasta');

while (my $seqobj = $stream->next_seq()) {
  # Do Something
}
-----------------------------------------------------

Then run with:

cat your_big_file | the_perl_script.pl
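If you would rather physically split the file, as Chris suggested, GNU csplit can cut a multi-record fasta file at each header line.  A sketch (untested, like the above; the demo.fa file and the xx00, xx01, ... output names are just csplit's defaults):

```shell
# Make a tiny two-record fasta file to demonstrate the split
printf '>seq1\nACGTACGT\n>seq2\nGGCCTTAA\n' > demo.fa

# Split at every line starting with '>'; -z drops the empty
# leading piece, -s silences the per-file byte counts
csplit -s -z demo.fa '/^>/' '{*}'

ls xx*    # one output file per record
```

Each xx file can then be fed through Bio::SeqIO as normal.  Note that an old csplit may hit the same 2GB limit as perl's open, so check your textutils version first.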

Hope this helps

Simon.