[Bioperl-l] dealing with large files

Jason Stajich jason at bioperl.org
Thu Dec 20 07:13:55 UTC 2007


It gets buffered via the OS -- Bio::Root::IO calls next_line  
iteratively, but eventually the whole sequence object will get put  
into RAM as it is built up.
zcat or bzcat can also be used for gzipped and bzipped files
respectively; I like to use this where I want to keep the disk space
footprint down.
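
Something along these lines works (untested sketch off the top of my
head -- the file name is made up, and it assumes zcat is on your PATH):

  use Bio::SeqIO;

  # read a gzipped EMBL file through a pipe; next_seq streams the
  # record line by line via Bio::Root::IO, but each record is still
  # built up as a full Bio::Seq object in memory before it is returned
  my $in = Bio::SeqIO->new(-file   => "zcat seqs.embl.gz |",
                           -format => 'EMBL');

  while ( my $seq = $in->next_seq ) {
      printf "%s\t%d bp\n", $seq->display_id, $seq->length;
  }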

Because we usually treat data input as coming from a stream, ignoring
whether it is in a file or not, we would need a more flexible
structure to really handle this, although I'd argue the data really
belongs in a database when it is too big for memory.
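
For sequence data, Bio::DB::Fasta is one way to go database-ish
without a full RDBMS -- rough sketch, assuming you have dumped the
entries to FASTA first (file name and id below are made up):

  use Bio::DB::Fasta;

  # builds (or reuses) an on-disk index of the FASTA file, so sequence
  # is fetched on demand instead of being held in RAM
  my $db     = Bio::DB::Fasta->new('genome.fa');
  my $window = $db->seq('chr1', 1_000_000 => 1_001_000);  # 1 kb slice
  my $chr1   = $db->get_Seq_by_id('chr1');                # lazy seq object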
More compact Feature/Location objects would probably also help here.
I would not be surprised if the memory requirement has more to do
with the number of features than with the length of the sequence --
human chromosome 1 can fit into memory just fine on most machines
with 2GB of RAM.

But it would require someone taking an interest in some
re-architecting here.

-jason

On Dec 19, 2007, at 9:59 PM, Michael Thon wrote:

>
> On Dec 18, 2007, at 7:04 PM, Stefano Ghignone wrote:
>
>> my $in  = Bio::SeqIO->new(-file   => "/bin/gunzip -c $infile |",
>>                           -format => 'EMBL');
>
> This is just for the sake of curiosity, since you already found a  
> solution to your problem, but I wonder how perl will handle a file  
> opened this way.  Will it try to suck the whole thing into ram in  
> one go?
>
> Mike



