[Bioperl-l] Output a subset of FASTA data from a single large file

Fri Jun 9 02:07:34 UTC 2006

Dear all,

I am a total Bioperl newbie struggling to accomplish a conceptually simple
task.  I have a single large FASTA file containing about 200,000 probe
sequences (from an Affymetrix microarray), each of which looks like this:

>probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
TGGCTCCTGCTGAGGTCCCCTTTCC

What I would like to do is extract a subset of ~130,800 probes from this
file (both the header and the sequence) and write that subset to a new
FASTA file.  These 130,800 probes correspond to 8,175 probe set IDs
("1138_at" is the probe set ID in the header listed above); I have these
8,175 IDs listed in a separate file.  I *think* that I managed to create an
index of all 200,000 probes in the original FASTA file using the following
script:

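#!/usr/bin/perl -w

 # script 1: create the index
 # usage: <this script> <index_file> <fasta_file> [<fasta_file> ...]

 use strict;
 use Bio::Index::Fasta;

 # The first argument names the index file to write;
 # the remaining arguments are the FASTA files to index.
 my $Index_File_Name = shift;
 my $inx = Bio::Index::Fasta->new(
     -filename => $Index_File_Name,
     -write_flag => 1);
 $inx->make_index(@ARGV);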

I'm not sure if this is the most sensible approach, and even if it is, I'm
not sure what to do next.  Any help would be greatly appreciated!
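
One possible next step, sketched in plain Perl and not as the definitive
Bioperl way: because each header carries the probe set ID as its third
colon-separated field, the subset can be pulled out in a single pass over
the FASTA file without the index.  This sketch assumes the ID file holds
one probe set ID per line and that every header follows the format shown
above.  The sub names (probe_set_id, filter_fasta) and the file names in
the usage comment are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the probe set ID (third colon-separated field of the first word)
# from a header such as
# ">probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;"
# ASSUMPTION: all headers follow this layout.
sub probe_set_id {
    my ($header) = @_;
    my ($first_word) = $header =~ /^>?(\S+)/ or return;
    my @fields = split /:/, $first_word;
    return $fields[2];
}

# Copy every record whose probe set ID is a key of %$wanted from the
# filehandle $in to the filehandle $out; return the number of records copied.
sub filter_fasta {
    my ($in, $out, $wanted) = @_;
    my ($keep, $count) = (0, 0);
    while (my $line = <$in>) {
        if ($line =~ /^>/) {
            # A header line decides whether this whole record is kept.
            my $id = probe_set_id($line);
            $keep = defined $id && exists $wanted->{$id};
            $count++ if $keep;
        }
        print {$out} $line if $keep;
    }
    return $count;
}

# Hypothetical command-line use:
#   perl filter_probes.pl ids.txt probes.fasta > subset.fasta
if (@ARGV == 2) {
    my ($id_file, $fasta_file) = @ARGV;

    # ASSUMPTION: one probe set ID per line in the ID file.
    open my $ids, '<', $id_file or die "Cannot open $id_file: $!";
    my %wanted;
    while (my $id = <$ids>) {
        $id =~ s/^\s+|\s+$//g;          # trim whitespace and line endings
        $wanted{$id} = 1 if length $id;
    }
    close $ids;

    open my $in, '<', $fasta_file or die "Cannot open $fasta_file: $!";
    my $n = filter_fasta($in, \*STDOUT, \%wanted);
    close $in;
    print STDERR "Wrote $n records\n";
}
```

Holding the 8,175 IDs in a hash keeps each lookup cheap, so the run time is
dominated by reading the FASTA file once.  As far as I know,
Bio::Index::Fasta keys its index on the whole first word of the header by
default, not on the probe set ID alone, so a lookup through the index would
need the full probe names or a custom ID parser.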

Many thanks,
Mike O.
