[Bioperl-l] Output a subset of FASTA data from a single large file

Mon Jun 12 08:21:31 UTC 2006

Chris Fields wrote:
> There's information in the HOWTOs:
> 
> http://www.bioperl.org/wiki/HOWTO:Flat_databases
> 
> http://www.bioperl.org/wiki/HOWTO:OBDA
> 
> Just so you know, I tried, out of curiosity, passing this through Bio::SeqIO
> ('fasta' format I/O) and this is what I got as output:
> 
>> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
> 
> 
> i.e. an empty sequence, which is what I guessed might happen
[snip]

As you later discovered, that was an Outlook problem. Just to make this 
thread relevant to bioperl, the bioperl solution is:

use Bio::SeqIO;
use Bio::Index::Fasta;
my $inx = Bio::Index::Fasta->new(-write_flag => 1);
$inx->id_parser(\&get_id);
$inx->make_index(shift);

my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
my $wanted_ids_file = shift;
open(IDS, $wanted_ids_file);
while (<IDS>) {
   chomp;
   my $seq = $inx->fetch($_);
   $out->write_seq($seq);
}

sub get_id {
   my $line = shift;
   $line =~ /^>probe:\S+?:(\S+?):/;
   $1;
}

It works for me on the sample sequence given by the OP.