[Bioperl-l] Output a subset of FASTA data from a single large file

Tue Jun 27 22:35:08 UTC 2006

Hi All,

This is rather late, but just for future reference on the mailing list,  here 
is how I would do the task using Bio::DB::Fasta.

Script 1: index the file for future use:

	#!/usr/bin/perl
	use strict;
	use Bio::DB::Fasta;

	my $filename = shift;  # name of file to index on command line
	Bio::DB::Fasta->new($filename,-makeid=>\&make_my_id)
		or die "Indexing failed";
	print "Indexing succeeded!\n";
	exit 0;

	sub make_my_id {
		my $description_line = shift;
		$description_line =~ /(\d+_at)/ or die "malformed description line";
		return $1;
	}

Run this script once to create a reusable index of the file. The index will be 
stored in the same directory as the FASTA file.

Script 2: extract the sequences using the IDs stored in a second file:

	#!/usr/bin/perl
	use strict;
	use Bio::DB::Fasta;
	use Bio::SeqIO;
	use IO::File;

	my $indexed_fasta_file = shift;
	my $probe_id_file         = shift;

	# open up the indexed fasta file
	my $db = Bio::DB::Fasta->new($indexed_fasta_file) or die;
	# open up a FASTA writer
	my $out = Bio::SeqIO->new(-format=>'Fasta',-fh=>\*STDOUT) or die;
	# open the probe id file
	my $in   = IO::File->new($probe_id_file) or die;

	# do the work
	while (my $id = <$in>) {
		chomp $id;
		my $seq = $db->get_Seq_by_id($id) or die;
		$out->write_seq($seq);
	}

	exit 0;

Bio::Index::Fasta will work in almost exactly the same way. The only 
difference is that the Bio::DB::Fasta will allow you to retrieve subsequences 
efficiently.

Lincoln

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu