[Bioperl-l] dealing with large files
Chris Fields
cjfields at uiuc.edu
Thu Dec 20 16:14:55 UTC 2007
As Jason mentioned, it may be the number of features in the record if
the record itself is huge (e.g. human chromosome-sized, a full
metagenome, etc.). If (my) memory serves correctly, the memory
footprint for a Perl object is ~10x the actual data, give or take (it
depends on the complexity of the object itself). In cases like this,
indexing may not fix the problem unless you have an object which
retains the file position of the data instead of the data itself; I
don't think we have this object type in BioPerl.
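
For the common case (many records, each of manageable size), the
indexing approach works fine today. An untested sketch using
Bio::Index::EMBL (file names and ID made up), which maps IDs to file
offsets so only the record you fetch is parsed into memory:

  use strict;
  use warnings;
  use Bio::Index::EMBL;

  # build a persistent index mapping sequence IDs to byte offsets
  my $inx = Bio::Index::EMBL->new(-filename   => 'big_embl.idx',
                                  -write_flag => 1);
  $inx->make_index('big_file.embl');

  # only this record is read from disk and parsed into a Bio::Seq
  my $seq = $inx->fetch('AB000001');
  print $seq->display_id, "\t", $seq->length, "\n";

Note the fetched record is still parsed whole, so this doesn't help
when a single record is itself too big for memory.
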
The only way I can think of to fix this would be (as Jason also
suggested) lightweight objects, or something like the lazy sequence
objects a la the SwissKnife suite (which only bring what you want into
memory).
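
You can get part of the way there now with Bio::SeqIO's sequence
builder, which skips unwanted slots (features, annotation) at parse
time so they are never instantiated. A minimal sketch (the GenBank
parser honors the builder; I'd have to check which other parsers do):

  use strict;
  use warnings;
  use Bio::SeqIO;

  my $seqio = Bio::SeqIO->new(-file   => 'big_file.gb',
                              -format => 'genbank');

  # skip everything except a few cheap slots; features are never built
  my $builder = $seqio->sequence_builder;
  $builder->want_none;
  $builder->add_wanted_slot('display_id', 'accession_number', 'seq');

  while (my $seq = $seqio->next_seq) {
      printf "%s\t%d\n", $seq->display_id, $seq->length;
  }
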
Related to that, I have been testing a parser which uses iterators to
pass chunks of data from a stream to handlers, which then build the
sequence object. It wouldn't be too hard to reconfigure that to return
file positions as well. Maybe for the 1.7 release...
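
The file-position part could look something like this (hypothetical
sketch, not the actual code; the iterator and its return structure
are made up for illustration):

  use strict;
  use warnings;

  # chunk iterator: returns each record's raw text plus the byte
  # offset where it starts, so a lazy object could seek() back later
  sub make_record_iterator {
      my ($fh) = @_;
      return sub {
          local $/ = "//\n";    # EMBL/GenBank record terminator
          my $offset = tell($fh);
          my $chunk  = <$fh>;
          return unless defined $chunk;
          return { offset => $offset, data => $chunk };
      };
  }

  open my $fh, '<', 'big_file.embl' or die "open: $!";
  my $iter = make_record_iterator($fh);
  while (my $rec = $iter->()) {
      # hand $rec->{data} to a handler; keep $rec->{offset} around
      printf "record at byte %d, %d bytes\n",
          $rec->{offset}, length $rec->{data};
  }
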
chris
On Dec 20, 2007, at 7:57 AM, Stefano Ghignone wrote:
> I was wondering whether, when working with such big files, it would
> be better to first index the database and then query it, formatting
> the sequences as one wants...
>
>> It gets buffered via the OS -- Bio::Root::IO calls next_line
>> iteratively, but eventually the whole sequence object will get put
>> into RAM as it is built up.
>> zcat or bzcat can also be used for gzipped and bzipped files
>> respectively; I like to use this where I want to keep the disk
>> space footprint down.
>>
>> Because we usually treat data input as coming from a stream,
>> ignoring whether it is in a file or not, we would need a more
>> flexible structure to really handle this, although I'd argue the
>> data really belongs in a database when it is too big for memory.
>> More compact Feature/Location objects would probably also help here.
>> I would not be surprised if the memory requirement has more to do
>> with the number of features than the length of the sequence - human
>> chromosome 1 fits into memory just fine on most machines with 2GB
>> of RAM.
>>
>> But it would require someone taking an interest in some
>> re-architecting here.
>>
>> -jason
>>
>> On Dec 19, 2007, at 9:59 PM, Michael Thon wrote:
>>
>>>
>>> On Dec 18, 2007, at 7:04 PM, Stefano Ghignone wrote:
>>>
>>>> my $in = Bio::SeqIO->new(-file   => "/bin/gunzip -c $infile |",
>>>>                          -format => 'EMBL');
>>>
>>> This is just for the sake of curiosity, since you already found a
>>> solution to your problem, but I wonder how Perl will handle a file
>>> opened this way. Will it try to suck the whole thing into RAM in
>>> one go?
>>>
>>> Mike
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign