[Bioperl-l] match patterns to retrieve seqs

Roy Chaudhuri roy.chaudhuri at gmail.com
Mon Mar 19 15:00:19 UTC 2012


Hi Philipp,

There are several problems with your code. Firstly, you will want to 
remove trailing newlines from your headers using chomp. Also, Perl 
patterns interpolate as if they were double quoted strings, which means 
that the elements of your array are all concatenated (separated by 
spaces) and you are trying to match them all in one go. Finally, it 
would be better to output using Bio::SeqIO rather than print.

Since you are just looking for exact matches, it would be much more 
efficient to read your headers into a hash, and then use exists rather 
than pattern matching:
my %query;
while (<LIST>) {
   chomp;
   $query{$_}=1;
}
my $fastafile = Bio::SeqIO->new(-file =>$file, -format =>$format);
my $out=Bio::SeqIO->new(-format=>$format);
while (my $fastaseq = $fastafile->next_seq){
   if (exists $query{$fastaseq->id}) {
      $out->write_seq($fastaseq);
   }
}

You should also look at the Bio::DB::Fasta module, which indexes your 
fasta file so is particularly useful for long files that you will access 
more than once:
http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm

Hope this helps.
Cheers,
Roy.

On 18/03/2012 20:36, Philipp Schiffer wrote:
> Sorry,
>
> at least my $queries should read my queries at .
>
>
> Am 18.03.12 17:49, schrieb Philipp Schiffer:
>> Hi!
>>
>> I am trying to write a small script to retrieve such sequences
>> (including headers) from a fasta file that match the headers I listed in
>> another file.
>> The fasta file looks like:
>>   >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
>> AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
>> MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
>>   >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
>> AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
>> MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
>>   >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
>> AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
>> MLHHR.......
>>
>> while the list file is:
>> maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
>> snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
>> maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
>> maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
>>
>> My perl so far would be:
>>
>> #! /usr/bin/perl -w
>>
>> use Bio::SeqIO;
>>
>> my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
>> my $file = shift or die $usage;
>> my $format = shift or die $usage;
>> my $headerlist = shift or die $usage;
>>
>> open LIST, "<$headerlist" or die $!;
>> my $queries =<LIST>; #might be a hash as well, don't know what's better
>>
>> my $fastafile = Bio::SeqIO->new(-file =>  "<$file", -format =>  $format);
>> while (my $fastaseq = $fastafile->next_seq){
>>
>> if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much
>> trying it seems actually to look for matches, but does not find any,
>> still there must be
>>
>> {print $fastaseq->  id,"\n" and print $fastaseq->seq,"\n";}
>> ;
>> };
>>
>> So I am wondering, is this the right way to tackle the problem. Will it
>> work? Should I use another bioPerl module? My thinking is, that the
>> pattern matching operation after the 'if' statement is wrong.
>>
>> Any help highly appreciated!
>>
>> Thanks a lot.
>>
>> Philipp
>>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list