[Bioperl-l] Output a subset of FASTA data from a single large file
Lincoln Stein
lstein at cshl.edu
Tue Jun 27 22:35:08 UTC 2006
Hi All,
This is rather late, but just for future reference on the mailing list, here
is how I would do the task using Bio::DB::Fasta.
Script 1: index the file for future use:
#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;
my $filename = shift; # name of file to index on command line
Bio::DB::Fasta->new($filename,-makeid=>\&make_my_id)
or die "Indexing failed";
print "Indexing succeeded!\n";
exit 0;
sub make_my_id {
my $description_line = shift;
$description_line =~ /(\d+_at)/ or die "malformed description line";
return $1;
}
Run this script once to create a reusable index of the file. The index will be
stored in the same directory as the FASTA file.
Script 2: extract the sequences using the IDs stored in a second file:
#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;
use Bio::SeqIO;
use IO::File;
my $indexed_fasta_file = shift;
my $probe_id_file = shift;
# open up the indexed fasta file
my $db = Bio::DB::Fasta->new($indexed_fasta_file) or die;
# open up a FASTA writer
my $out = Bio::SeqIO->new(-format=>'Fasta',-fh=>\*STDOUT) or die;
# open the probe id file
my $in = IO::File->new($probe_id_file) or die;
# do the work
while (my $id = <$in>) {
chomp $id;
my $seq = $db->get_Seq_by_id($id) or die;
$out->write_seq($seq);
}
exit 0;
Bio::Index::Fasta will work in almost exactly the same way. The only
difference is that the Bio::DB::Fasta will allow you to retrieve subsequences
efficiently.
Lincoln
--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu
More information about the Bioperl-l
mailing list