[Bioperl-l] Bio::SeqIO -- add an ugly but fast grep hack?

Amir Karger akarger at CGR.Harvard.edu
Thu Sep 14 15:58:49 UTC 2006

> From: Chris Fields [mailto:cjfields at uiuc.edu] 
> Subject: Re: [Bioperl-l] Bio::SeqIO -- add an ugly but fast grep hack?

> If your individual sequence records aren't very large, you could
> iterate through the individual sequence records in the file by
> changing the line separator to gulp each record and use a plain ol'
> regex, like this (modified from a quickie script I use):
>
> #! perl
> use strict;
> use warnings;
> {
>     local $/ = "//\n";
>     while (my $gb = <>) {
>         print $gb if $gb =~ m/Staphylococcus\sepidermidis/im;
>     }
> }

Perl Golf! (Untested, as all good Perl Golf should be.)

perl -wne 'BEGIN {$/="//\n"} print if /Staphylococcus\sepidermidis/im' blah.gb > filtered.gb

Unfortunately, I can't golf down the species name :)
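
For anyone who wants to check the record-splitting behaviour before
pointing it at a real file, here is a minimal sketch of the same idea as
a small script. The sub name grep_records and the two sample records are
made up for illustration; the only assumption about the data is that
every record ends with "//" on a line of its own, as in GenBank flat
files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every record read from $fh that matches $pattern.
# (grep_records is a hypothetical helper, not part of BioPerl.)
sub grep_records {
    my ($fh, $pattern) = @_;
    local $/ = "//\n";    # each <$fh> now returns one whole record
    my @hits;
    while (my $record = <$fh>) {
        push @hits, $record if $record =~ $pattern;
    }
    return @hits;
}

# Two invented records, for illustration only.
my $data = <<'END';
LOCUS       FAKE0001
  ORGANISM  Staphylococcus epidermidis
//
LOCUS       FAKE0002
  ORGANISM  Escherichia coli
//
END

# Read from an in-memory filehandle so the sketch needs no input file.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @hits = grep_records($fh, qr/Staphylococcus\sepidermidis/i);
close $fh;

print @hits;
```

Because \s also matches a newline, the pattern should still match if the
species name happens to be wrapped across two lines inside a record.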

> You could probably squeeze that into a one-liner if needed; this one
> was from WinXP which has problems with using one-liners containing
> quotes.


The Windows shell is very annoying. For those who don't know, it
basically requires you to put double quotes around scripts, not single.
(This automatically means you can't exactly port one-liners
UNIX<->Windows, because if you use single quotes, Windows doesn't get
it, and if you use double quotes, UNIX interprets any $ variables in
your script as SHELL variables instead of Perl ones.)

You can use qq~blah~ instead of "blah" in one-liners. So the above
one-liner could be used on Windows as:

perl -wne "BEGIN {$/=qq~//\n~} print if /Staphylococcus\sepidermidis/im" blah.gb > filtered.gb

(If you really need ~ inside your string, you can use other quote
characters: qq/blah/, qq-blah-, or qq{blah} will all work.)
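
A quick sketch to confirm that these forms really are interchangeable,
using the record separator from the one-liner as the test string. Note
that qq/.../ is an awkward choice here, because the string itself
contains slashes and they then have to be escaped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same three-character string (slash, slash, newline) written with
# ordinary double quotes and with several qq delimiters.
my @forms = (
    "//\n",
    qq~//\n~,
    qq-//\n-,
    qq{//\n},
    qq/\/\/\n/,    # slashes must be escaped when / is the delimiter
);

# Count the forms that differ from the plain double-quoted version.
my $differing = grep { $_ ne $forms[0] } @forms;

print $differing == 0 ? "all forms are identical\n" : "forms differ\n";
```

The practical rule is to pick a delimiter that does not occur in the
string and that the shell in question leaves alone.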

I've had pretty good luck with taking UNIX one-liners and just running
them through (a slightly more complicated version of):

# Use a non-greedy match so we correctly frame each pair of double quotes
# A greedy match matches the first ' on the whole line to the very last one,
# so we avoid messing up any apostrophes or \' inside the script
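
A minimal sketch of what a conversion along those lines might look like,
assuming just two substitutions that follow the comments above. The sub
name unix_to_windows is hypothetical, and this is a guess at the general
shape rather than the actual script. It ignores other cmd.exe special
characters such as % and ^, and it breaks if a double-quoted string
already contains ~.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a UNIX-style one-liner to a Windows-style one.
# (unix_to_windows is a hypothetical name used only for this sketch.)
sub unix_to_windows {
    my ($cmd) = @_;

    # Non-greedy, so each pair of double quotes is framed separately
    # and becomes its own qq~...~ string.
    $cmd =~ s/"(.*?)"/qq~$1~/g;

    # Greedy, so the first ' on the line pairs with the very last one
    # and any apostrophes inside the script are left alone.
    $cmd =~ s/'(.*)'/"$1"/;

    return $cmd;
}

# q{} keeps the backslashes and the $/ literal, exactly as typed.
my $unix = q{perl -wne 'BEGIN {$/="//\n"} print if /Staphylococcus\sepidermidis/im' blah.gb > filtered.gb};
my $windows = unix_to_windows($unix);

print "$windows\n";
```

The order matters: the inner double quotes have to be rewritten before
the outer single quotes are turned into double quotes, otherwise the
first substitution would also rewrite the new outer pair.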

- Amir Karger
Research Computing
Bauer Center for Genomics Research
Harvard University
