[Bioperl-l] dealing with large files
Chris Fields
cjfields at uiuc.edu
Thu Dec 20 16:14:55 UTC 2007
As Jason mentioned, it may be the number of features in the record if
the record itself is huge (e.g. human chromosome-sized, a full
metagenome, etc.). If (my) memory serves correctly, the memory
footprint for a Perl object is ~10x the actual data, give or take (it
depends on the complexity of the object itself). In cases like this,
indexing may not fix the problem unless you have an object which
retains the file position of the data instead of the data itself; I
don't think we have this object type in BioPerl.
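
For the common case (many records, each of manageable size), the
indexing approach works fine today. An untested sketch using
Bio::Index::EMBL (file names and ID made up), which maps IDs to file
offsets so only the record you fetch is parsed into memory:

  use strict;
  use warnings;
  use Bio::Index::EMBL;

  # build a persistent index mapping sequence IDs to byte offsets
  my $inx = Bio::Index::EMBL->new(-filename   => 'big_embl.idx',
                                  -write_flag => 1);
  $inx->make_index('big_file.embl');

  # only this record is read from disk and parsed into a Bio::Seq
  my $seq = $inx->fetch('AB000001');
  print $seq->display_id, "\t", $seq->length, "\n";

Note the fetched record is still parsed whole, so this doesn't help
when a single record is itself too big for memory.
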
The only way I can think of to fix this would be (as Jason also
suggested) lightweight objects, or something like the lazy sequence
objects a la the SwissKnife suite (which only bring what you want into
memory).
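
You can get part of the way there now with Bio::SeqIO's sequence
builder, which skips unwanted slots (features, annotation) at parse
time so they are never instantiated. A minimal sketch (the GenBank
parser honors the builder; I'd have to check which other parsers do):

  use strict;
  use warnings;
  use Bio::SeqIO;

  my $seqio = Bio::SeqIO->new(-file   => 'big_file.gb',
                              -format => 'genbank');

  # skip everything except a few cheap slots; features are never built
  my $builder = $seqio->sequence_builder;
  $builder->want_none;
  $builder->add_wanted_slot('display_id', 'accession_number', 'seq');

  while (my $seq = $seqio->next_seq) {
      printf "%s\t%d\n", $seq->display_id, $seq->length;
  }
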
Related to that, I have been testing a parser which uses iterators to
pass chunks of data from a stream to handlers, which then build the
sequence object. It wouldn't be too hard to reconfigure that to return
file positions as well. Maybe for the 1.7 release...
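
The file-position part could look something like this (hypothetical
sketch, not the actual code; the iterator and its return structure
are made up for illustration):

  use strict;
  use warnings;

  # chunk iterator: returns each record's raw text plus the byte
  # offset where it starts, so a lazy object could seek() back later
  sub make_record_iterator {
      my ($fh) = @_;
      return sub {
          local $/ = "//\n";    # EMBL/GenBank record terminator
          my $offset = tell($fh);
          my $chunk  = <$fh>;
          return unless defined $chunk;
          return { offset => $offset, data => $chunk };
      };
  }

  open my $fh, '<', 'big_file.embl' or die "open: $!";
  my $iter = make_record_iterator($fh);
  while (my $rec = $iter->()) {
      # hand $rec->{data} to a handler; keep $rec->{offset} around
      printf "record at byte %d, %d bytes\n",
          $rec->{offset}, length $rec->{data};
  }
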
chris
On Dec 20, 2007, at 7:57 AM, Stefano Ghignone wrote:
> I was wondering whether, when working with such big files, it would
> be better to first index the database and then query it, formatting
> the sequences as one wants...
>
>> It gets buffered via the OS -- Bio::Root::IO calls next_line
>> iteratively, but eventually the whole sequence object will get put
>> into RAM as it is built up.
>> zcat or bzcat can also be used for gzipped and bzipped files
>> respectively; I like to use this where I want to keep the disk
>> space footprint down.
>>
>> Because we usually treat data input as coming from a stream,
>> ignoring whether it is in a file or not, we would need a more
>> flexible structure to really handle this, although I'd argue the
>> data really belongs in a database when it is too big for memory.
>> More compact Feature/Location objects would probably also help here.
>> I would not be surprised if the memory requirement has more to do
>> with the number of features than the length of the sequence - human
>> chromosome 1 fits into memory just fine on most machines with 2GB
>> of RAM.
>>
>> But it would require someone taking an interest in some
>> re-architecting here.
>>
>> -jason
>>
>> On Dec 19, 2007, at 9:59 PM, Michael Thon wrote:
>>
>>>
>>> On Dec 18, 2007, at 7:04 PM, Stefano Ghignone wrote:
>>>
>>>> my $in = Bio::SeqIO->new(-file   => "/bin/gunzip -c $infile |",
>>>>                          -format => 'EMBL');
>>>
>>> This is just for the sake of curiosity, since you already found a
>>> solution to your problem, but I wonder how Perl will handle a file
>>> opened this way. Will it try to suck the whole thing into RAM in
>>> one go?
>>>
>>> Mike
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign