[Bioperl-l] retrieving sequences by ID
Diego Riano
diriano at rz.uni-potsdam.de
Wed Apr 27 11:18:10 EDT 2005
Hi Sean
I use to do something similar, here is how I am doing it:
Your ids would be in @pre_list, I just remove the final \n, so I have a
string without them.
Using the string is faster than looping through the list, But I am not
sure how it would behave with >20000 ids.
######################################################################
my @list=();
foreach my $seq(@pre_list){
chomp $seq;
push @list,$seq;
}
my $list=join("\t", at list);
my $output="outputfile"
my $in = Bio::SeqIO->new(-file => "$ARGV[0]" , '-format' => 'Fasta');
my $out = Bio::SeqIO->new(-file => ">$output" , '-format' => 'Fasta');
while(my $seq = $in->next_seq()){
my $seqid=$seq->id;
if ($rmlist=~/\b$seqid\b/){
$out->write_seq($seq);
}
}
######################################################################
I hope this helps,
diego
On Wed, 2005-04-27 at 16:33, Sean O'Keeffe wrote:
> Hi all,
> This is probably an easy problem for ye, but one I'm having difficulty
> with none the
> less.
> I'm trying to extract only sequences from a fasta file (containing
> ~38,000 sequences)
> containing a specific ID in the header line e.g.
> return only the sequence for header containing 'ABCD12346' from:
> >ABCD12345|followed by the rest of the description
> acgtacgtgttttgggccctttaaa.....
> >ABCD12346|description
> acgtacgtgttttgggccctttaaa.....
> >ABCD12347|description
> acgtacgtgttttgggccctttaaa.....
> ...
>
> The specific ID's are contained in a list(~20,000) which I want to loop through.
> This is what I have done so far w/out any luck:
>
> ############################
>
> use strict;
> use lib "/usr/lib/perl5/site_perl/5.8.1/";
> use Bio::SeqIO;
>
> my (@ids)=@_
> my $seq_in = Bio::SeqIO->new( '-format' => "fasta",
> '-file' => "$fastafile");
> my $seq_out = Bio::SeqIO->new( '-format' => "fasta",
> '-file' => "$outfile");
> for ($i=0;$i<=scalar(@ids);$i++){
> while ($sequence = $seq_in->next_seq){
> if ($sequence->display_id =~ /^>$ids[$i](.*$)/){
> $seq_out->write_seq($sequence);
> }
> }
> }
> exit;
>
> #############################
>
> This has so far returned a list of 3 fasta headers and the programs
> then finishes without errors.
> I'd like to know where I'm going wrong and if possible, how I could
> improve on things to prevent memory usage/speed it up.
>
> Thanks in advance,
> Sean O'Keeffe.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
--
_______________________________________
Diego Mauricio Riano Pachon
Biologist
Institute of Biology and Biochemistry
Potsdam University
Karl-Liebknecht-Str. 24-25
Haus 20
14476 Golm
Germany
Tel:+49 331 977 2809
http://www.geocities.com/dmrp.geo/
More information about the Bioperl-l
mailing list