[Bioperl-l] match patterns to retrieve seqs
Philipp Schiffer
philipp.schiffer at googlemail.com
Sun Mar 18 16:49:11 UTC 2012
Hi!
I am trying to write a small script to retrieve such sequences
(including headers) from a fasta file that match the headers I listed in
another file.
The fasta file looks like:
>maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
>snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
>snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
MLHHR.......
while the list file is:
maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
My perl so far would be:
#! /usr/bin/perl -w
use Bio::SeqIO;
my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
my $file = shift or die $usage;
my $format = shift or die $usage;
my $headerlist = shift or die $usage;
open LIST, "<$headerlist" or die $!;
my $queries = <LIST>; #might be a hash as well, don't know what's better
my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
while (my $fastaseq = $fastafile->next_seq){
if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after
much trying it seems actually to look for matches, but does not find
any, still there must be
{print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
;
};
So I am wondering, is this the right way to tackle the problem. Will it
work? Should I use another bioPerl module? My thinking is, that the
pattern matching operation after the 'if' statement is wrong.
Any help highly appreciated!
Thanks a lot.
Philipp
More information about the Bioperl-l
mailing list