[Bioperl-l] retrieving sequences by ID
sokeeff at tcd.ie
sokeeff at tcd.ie
Tue Apr 26 13:50:04 EDT 2005
Hi all,
Probably an easy problem for ye, but one I'm having difficulty with none the
less.
I'm trying to extract only sequences from a fasta file(~38,000 sequences)
containing a specific ID in the header files e.g.
return only the sequence for header containing 'ABCD12346' from:
>ABCD12345|description
acgtacgtgttttgggccctttaaa.....
>ABCD12346|description
acgtacgtgttttgggccctttaaa.....
>ABCD12347|description
acgtacgtgttttgggccctttaaa.....
...
The ID's are contained in a list(~20,000) which I want to loop through.
This is what I have done so far w/out any luck:
############################
use strict;
use lib "/usr/lib/perl5/site_perl/5.8.1/";
use Bio::SeqIO;
my (@ids)=@_
my $seq_in = Bio::SeqIO->new( '-format' => "fasta",
'-file' => "$fastafile");
my $seq_out = Bio::SeqIO->new( '-format' => "fasta",
'-file' => "$outfile");
for ($i=0;$i<=scalar(@ids);$i++){
while ($sequence = $seq_in->next_seq){
if ($sequence->display_id =~ /^>$ids[$i](.*$)/){
$seq_out->write_seq($sequence);
}
}
}
exit;
#############################
This returns a small list of fasta headers. I'd like to know if it's the right
way to go about the task, where I'm going wrong and if possible, how I could
improve on things to prevent memory usage/speed it up.
Thanks in advance,
Sean O'Keeffe.
More information about the Bioperl-l
mailing list