[Bioperl-l] Bio::SeqIO -- add an ugly but fast grep hack?

Tue Sep 12 15:30:23 UTC 2006

...
> I noticed that I can discover the sweet spots in the 4GB a thousand
> times faster with a simple grep:
> 
> grep -in "Staphylococcus epidermidis" *.seq
> 
>  From the filenames:linenumbers that grep discovers for me I believe
> I could "hop" BioPerl (Bio::SeqIO) into the "sweet spots" of the
> files and have BioPerl just serialize *those* into sequence objects.
> 
> Has anyone ever thought of adding a "raw_file_grep" argument or
> something? Like this?
> 
>     my $infile  = Bio::SeqIO->new(
>         -file          => "$dir/$source_file",
>         -format        => 'GenBank',
>         -raw_file_grep => 'Staphylococcus epidermidis'
>     );
> 
> or perhaps implement it inside Bio::SeqIO::next_seq()?
> 
>     my $seqobj = $infile->next_seq(
>         raw_file_grep => 'Staphylococcus epidermidis'
>     );

You have two problems with this approach:

1) It sounds like you want the system 'grep' instead of the perl built-in,
which would have to be used here for cross-platform issues (Windows does not
have grep).  

2) I don't think SeqIO would be the best way to go; you should use SeqIO for
getting sequences into objects, not searching for specific sequences.  

There may be a way to do this via flat databases, where you would index your
local genbank file (though I can't vouch for how they will work on a 4 GB
file).  But I think a non-Bioperl approach is better.

> (This would be a blind string-based hack for pure speed. Any string
> match anywhere -- no context intelligence whatsoever.)
> 
> Or am I just crazy and if I'm going to do a hack like this I should
> just write a stand-alone file filter outside of BioPerl?

That would probably be your best bet.  If your individual sequence records
aren't very large, you could iterate through the individual sequence records
in the file by changing the line separator to gulp each record and use a
plain ol' regex, like this (modified from a quickie script I use):

#! perl
use strict;
use warnings;

{
    local $/ = "//\n";
    while (my $gb = <>) {
        print $gb if $gb =~ m/Staphylococcus\sepidermidis/im;
    }
}

You could probably squeeze that into a one-liner if needed; this one was
from WinXP which has problems with using one-liners containing quotes.  

I use something like the above as a screen whose output is piped into a
second script, which then runs everything through SeqIO via STDIN to do with
what you want (print accession #, convert to FASTA, etc).  You could also
just save it as a file instead for later processing if the output data is
too large.

Chris

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign