[Bioperl-l] Bio::SeqIO -- add an ugly but fast grep hack?

Tue Sep 12 01:41:14 UTC 2006

I was assigned the task of wading through 4GB of GenBank data to find  
all of the sequences that mentioned "Staphylococcus epidermidis"  
anywhere in any annotation. Separate those hits out into smaller files.

Easy enough. Something like this (stripped down):

foreach my $source_file (@file_list) {
    my $outfile = Bio::SeqIO->new(-file => ">$dir/$source_file" .  
'_parsed',  -format => 'GenBank');
    my $infile  = Bio::SeqIO->new(-file => "$dir/ 
$source_file",               -format => 'GenBank');
    SEQUENCE:
    while(my $seqobj = $infile->next_seq()) {
       my $annotations = $seqobj->annotation;
       foreach my $key ( $annotations->get_all_annotation_keys() ) {
          my @values = $annotations->get_Annotations($key);
          foreach my $value ( @values ) {
             if ($value->as_text =~ /$select_feature/i) {   #  
$select_feature is 'Staph...'
                $outfile->write_seq($seqobj);
                next SEQUENCE;
             }
          }
       }
    }
}

Works great. The only problem is that it takes a long time to wade  
through all 4GB. I'm making Perl do a TON of work and throwing  
99.999% of that work away. :)

I noticed that I can discover the sweet spots in the 4GB a thousand  
times faster with a simple grep:

grep -in "Staphylococcus epidermidis" *.seq

 From the filenames:linenumbers that grep discovers for me I believe  
I could "hop" BioPerl (Bio::SeqIO) into the "sweet spots" of the  
files and have BioPerl just serialize *those* into sequence objects.

Has anyone ever thought of adding a "raw_file_grep" argument or  
something? Like this?

    my $infile  = Bio::SeqIO->new(
        -file          => "$dir/$source_file",
        -format        => 'GenBank',
        -raw_file_grep => 'Staphylococcus epidermidis'
    );

or perhaps implement it inside Bio::SeqIO::next_seq()?

    my $seqobj = $infile->next_seq(
        raw_file_grep => 'Staphylococcus epidermidis'
    );

(This would be a blind string-based hack for pure speed. Any string  
match anywhere -- no context intelligence whatsoever.)

Or am I just crazy and if I'm going to do a hack like this I should  
just write a stand-alone file filter outside of BioPerl?

Or has someone already optimized this somewhere and I just couldn't  
find it?

Thanks,

j

P.S.  If you're keeping score, yes I will be circling back to  
implement the ideas I started in previous threads...  :)