[Bioperl-l] Bio::SeqIO -- add an ugly but fast grep hack?
Jay Hannah
jay at jays.net
Tue Sep 12 01:41:14 UTC 2006
I was assigned the task of wading through 4GB of GenBank data to find
all of the sequences that mentioned "Staphylococcus epidermidis"
anywhere in any annotation. Separate those hits out into smaller files.
Easy enough. Something like this (stripped down):
foreach my $source_file (@file_list) {
my $outfile = Bio::SeqIO->new(-file => ">$dir/$source_file" .
'_parsed', -format => 'GenBank');
my $infile = Bio::SeqIO->new(-file => "$dir/
$source_file", -format => 'GenBank');
SEQUENCE:
while(my $seqobj = $infile->next_seq()) {
my $annotations = $seqobj->annotation;
foreach my $key ( $annotations->get_all_annotation_keys() ) {
my @values = $annotations->get_Annotations($key);
foreach my $value ( @values ) {
if ($value->as_text =~ /$select_feature/i) { #
$select_feature is 'Staph...'
$outfile->write_seq($seqobj);
next SEQUENCE;
}
}
}
}
}
Works great. The only problem is that it takes a long time to wade
through all 4GB. I'm making Perl do a TON of work and throwing
99.999% of that work away. :)
I noticed that I can discover the sweet spots in the 4GB a thousand
times faster with a simple grep:
grep -in "Staphylococcus epidermidis" *.seq
From the filenames:linenumbers that grep discovers for me I believe
I could "hop" BioPerl (Bio::SeqIO) into the "sweet spots" of the
files and have BioPerl just serialize *those* into sequence objects.
Has anyone ever thought of adding a "raw_file_grep" argument or
something? Like this?
my $infile = Bio::SeqIO->new(
-file => "$dir/$source_file",
-format => 'GenBank',
-raw_file_grep => 'Staphylococcus epidermidis'
);
or perhaps implement it inside Bio::SeqIO::next_seq()?
my $seqobj = $infile->next_seq(
raw_file_grep => 'Staphylococcus epidermidis'
);
(This would be a blind string-based hack for pure speed. Any string
match anywhere -- no context intelligence whatsoever.)
Or am I just crazy and if I'm going to do a hack like this I should
just write a stand-alone file filter outside of BioPerl?
Or has someone already optimized this somewhere and I just couldn't
find it?
Thanks,
j
P.S. If you're keeping score, yes I will be circling back to
implement the ideas I started in previous threads... :)
More information about the Bioperl-l
mailing list