[Bioperl-l] retrieving sequences by ID

Diego Riano diriano at rz.uni-potsdam.de
Wed Apr 27 11:18:10 EDT 2005


Hi Sean
I use to do something similar, here is how I am doing it:

Your ids would be in @pre_list, I just remove the final \n, so I have a
string without them.
Using the string is faster than looping through the list, But I am not
sure how it would behave with >20000 ids.
######################################################################
my @list=();
foreach my $seq(@pre_list){
    chomp $seq;
    push @list,$seq;
}
my $list=join("\t", at list);
 
my $output="outputfile"
my $in  = Bio::SeqIO->new(-file => "$ARGV[0]" , '-format' => 'Fasta');
my $out = Bio::SeqIO->new(-file => ">$output" , '-format' => 'Fasta');

while(my $seq = $in->next_seq()){
    my $seqid=$seq->id;
    if ($rmlist=~/\b$seqid\b/){
        $out->write_seq($seq);
    }
}
######################################################################
I hope this helps,

diego
On Wed, 2005-04-27 at 16:33, Sean O'Keeffe wrote:
> Hi all,
> This is probably an easy problem for ye, but one I'm having difficulty
> with none the
> less.
> I'm trying to extract only sequences from a fasta file (containing
> ~38,000 sequences)
> containing a specific ID in the header line e.g.
> return only the sequence for header containing 'ABCD12346' from:
> >ABCD12345|followed by the rest of the description
> acgtacgtgttttgggccctttaaa.....
> >ABCD12346|description
> acgtacgtgttttgggccctttaaa.....
> >ABCD12347|description
> acgtacgtgttttgggccctttaaa.....
> ...
> 
> The specific ID's are contained in a list(~20,000) which I want to loop through.
> This is what I have done so far w/out any luck:
> 
> ############################
> 
> use strict;
> use lib "/usr/lib/perl5/site_perl/5.8.1/";
> use Bio::SeqIO;
> 
> my (@ids)=@_
> my $seq_in = Bio::SeqIO->new( '-format' => "fasta",
>                               '-file' => "$fastafile");
> my $seq_out = Bio::SeqIO->new( '-format' => "fasta",
>                               '-file' => "$outfile");
> for ($i=0;$i<=scalar(@ids);$i++){
> while ($sequence = $seq_in->next_seq){
>         if ($sequence->display_id =~ /^>$ids[$i](.*$)/){
>                 $seq_out->write_seq($sequence);
>         }
> }
> }
> exit;
> 
> #############################
> 
> This has so far returned a list of 3 fasta headers and the programs
> then finishes without errors.
> I'd like to know where I'm going wrong and if possible, how I could
> improve on things to prevent memory usage/speed it up.
> 
> Thanks in advance,
> Sean O'Keeffe.
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
-- 
_______________________________________
Diego Mauricio Riano Pachon
Biologist
Institute of Biology and Biochemistry
Potsdam University
Karl-Liebknecht-Str. 24-25
Haus 20
14476 Golm
Germany
Tel:+49 331 977 2809
http://www.geocities.com/dmrp.geo/



More information about the Bioperl-l mailing list