[Bioperl-l] dealing with large files

Amir Karger akarger at CGR.Harvard.edu
Thu Dec 20 16:48:58 UTC 2007

> -----Original Message-----
> From: Chris Fields [mailto:cjfields at uiuc.edu] 
> 
> 
> On Dec 19, 2007, at 10:45 AM, Stefano Ghignone wrote:
> 
> > In the end, I succeeded in the format conversion using this command:
> >
> > gunzip -c uniprot_trembl_bacteria.dat.gz | perl -ne 'print ">$1 " if
> > (/^AC\s+(\S+);/); print " $1" if (/^DE\s+(.*)/);print " [$1]\n" if
> > (/^OS\s+(.*)/); if (($a)=/^\s+(.*)/){$a=~s/ //g; print "$a\n"};'
> >
> > (Thanks to Riccardo Percudani). It's not bioperl...but it works!
> 
> 
> As this shows, BioPerl isn't always the best answer (I know,
> blasphemy...).  As Jason suggested, it's quite likely there are large
> sequence records causing your problems when using BioPerl.  The
> one-liner works because it doesn't retain data (sequence, annotation,
> etc.) in memory as a Bio::Seq object; it's a direct conversion.
> 
> It would be nice to code up a lazy sequence object and related  
> parsers; maybe for the next dev release.

Yes!
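
For what it's worth, the same conversion reads a little more easily as a
short script than as a one-liner. This is only a sketch that mirrors the
regexes above; it assumes the usual UniProt flat-file layout (AC/DE/OS
tag lines, sequence lines indented with spaces), and the script name
used below (uniprot2fasta.pl) is made up.

#!/usr/bin/env perl
# Same streaming UniProt-to-FASTA conversion as the one-liner above:
# read line by line and print as we go, never building Bio::Seq objects.
use strict;
use warnings;

while ( my $line = <> ) {
    if ( $line =~ /^AC\s+(\S+);/ ) {      # first accession opens a header
        print ">$1 ";
    }
    elsif ( $line =~ /^DE\s+(.*)/ ) {     # description line(s)
        print " $1";
    }
    elsif ( $line =~ /^OS\s+(.*)/ ) {     # organism name closes the header
        print " [$1]\n";
    }
    elsif ( $line =~ /^\s+(.*)/ ) {       # sequence lines are indented
        ( my $seq = $1 ) =~ s/ //g;       # strip the column spacing
        print "$seq\n";
    }
}

Run it on the decompressed stream, e.g.

gunzip -c uniprot_trembl_bacteria.dat.gz | perl uniprot2fasta.pl > out.fasta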

Also, BLAST parsing. Blasting the proteome against the genome makes for
rather large result files. Right now, if you want to delete queries that
hit, say, more than 1000 times, you still need to wait for Bioperl to
create objects and sub-objects for every single hit. Sadly, this example
isn't hypothetical. I'm going to solve it with something like:

perl -wne 'BEGIN {$/="TBLASTN"} print if length($_) < $some_big_value' \
    big_blast > filtered_blast
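
If the byte-length cutoff turns out to be too blunt, the same
record-splitting trick can count hits per query instead. Again, just a
sketch: it assumes plain-text NCBI TBLASTN output where each query's
report starts with the "TBLASTN" banner and each hit alignment header
starts with ">", and the script name and cutoff below are placeholders.

#!/usr/bin/env perl
# Drop query reports with more than $max_hits alignments, reading one
# TBLASTN record at a time rather than building BioPerl hit objects.
use strict;
use warnings;

my $max_hits = 1000;             # placeholder cutoff
local $/ = "TBLASTN";            # read roughly one query report at a time

while ( my $record = <> ) {
    my $hits = () = $record =~ /^>/mg;   # count alignment headers
    print $record if $hits <= $max_hits;
}

Used the same way:

perl filter_blast.pl big_blast > filtered_blast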

(Not that I'm volunteering to help with the parser writing, so I should
stop complaining.)

-Amir