[Bioperl-l] Output a subset of FASTA data from a singlelargefile

Sat Jun 24 14:45:52 UTC 2006

----- Original Message ----- 
From: "Michael Oldham" <oldham at ucla.edu>
To: "Cook, Malcolm" <MEC at stowers-institute.org>; "Chris Fields" 
<cjfields at uiuc.edu>
Cc: <bioperl-l at lists.open-bio.org>
Sent: Friday, June 23, 2006 12:18 PM
Subject: Re: [Bioperl-l] Output a subset of FASTA data from a 
singlelargefile

> Hello again,
>
> I finally got it to work, using the following script.  However, it takes
> about 5 hours to run on a fast computer.  Using grep (in bash), on the
> other hand, takes about 5 minutes (see below if you are interested).
> Thanks to everyone for your help!
>
> SLOW perl script:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> my $IDs = 'ID_all_X';
>
> unless (open(IDFILE, $IDs)) {
> print "Could not open file $IDs!\n";
> }
>
> my $probes = 'HG_U95Av2_probe_fasta';
>
> unless (open(PROBES, $probes)) {
> print "Could not open file $probes!\n";
> }
>
> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>
> my @ID = <IDFILE>;
> print @ID;
> chomp @ID;
>
> while (my $line = <PROBES>) {
> foreach my $identifier (@ID) {
> if($line=~/^>probe:\w+:$identifier:/) {
> print OUT $line;
> print OUT scalar(<PROBES>);
> }
> }
> }

This could probably be done MUCH faster using a hash on the sequence 
identifier.  (I have to admit that I didn't follow the first part of this 
conversation, so I could be misunderstanding some part of what you are 
trying to do.)  If you have a couple hundred-thousand sequences, my guess is 
that it could be done in under 30 seconds, but I could be wrong about the 
exact time.  The important part is to make a hash of your sequences with the 
key being the $identifier.  Then, loop through your @ID array doing 
something like (untested):

#open files as before and read in @ID as before

my %seq_hash;

while (my $line = <PROBES>) {
    if ($line =~/^>probe:\w+:$identifier:/) {
        $seq_hash{$identifier}=<PROBES>;
    }
}

foreach my $id (@ID) {
    print OUT ">$id\n" . $seq_hash{$id};
}