[Bioperl-l] Bio::SeqIO; seq->desc() gives back too (!!!) full header

Jason Stajich jason@cgt.mc.duke.edu
Thu, 8 Aug 2002 09:39:57 -0400 (EDT)


FASTA/Pearson format is unstructured.  That means you (the programmer) get
to do the work of figuring out what you want out of the desc/id line.  We
have taken the approach that the first section (whitespace is the
separator) is the id and everything else is the description.
More succinctly (in Perl):

my ($id,$desc) = ($str =~ /^>\s*(\S+)\s*(.*)/);
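
As a minimal sketch of that convention applied to the first part of the
header from the message quoted below (the sub name split_header is only
illustrative, it is not a BioPerl function):

```perl
use strict;
use warnings;

# Split a FASTA header line into (id, description): the id is the first
# whitespace-delimited token after '>', the description is the rest.
# split_header is a hypothetical helper, not part of BioPerl.
sub split_header {
    my ($str) = @_;
    my ($id, $desc) = ($str =~ /^>\s*(\S+)\s*(.*)/);
    return ($id, $desc);
}

my ($id, $desc) = split_header(
    '>gi|15233744|ref|NP_194152.1| (NM_118554) putative protein [Arabidopsis thaliana]'
);
print "id:   $id\n";
print "desc: $desc\n";
```

Everything after the first run of whitespace lands in the description,
which is why any further gi/pir/emb entries on the same line show up there.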

You're only going to get a recapitulation of what is in the header.  We
don't do anything magic to parse it further because that would invariably
break with someone else's scheme.  If you only want the pir number, you can
write a simple regexp.

(This [untested] will find the gi, the optional pir num, and the acc for
the 1st pir entry, if it exists.)

my ($pirgi,$pirnum,$piracc) = ($seq->desc() =~ /(\d+)\|pir\|(\S+)?\|(\S+)/);
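
Here is a small sketch of how that regexp behaves on a plain string, using
the start of the description from the message quoted below.  The sub name
first_pir is made up for illustration, and a literal string stands in for
$seq->desc() so the snippet runs without a sequence file:

```perl
use strict;
use warnings;

# Return (gi, pir number, pir accession) for the first pir entry found in
# a description string, or an empty list if there is none.
# first_pir is a hypothetical helper, not part of BioPerl.
sub first_pir {
    my ($desc) = @_;
    if ($desc =~ /(\d+)\|pir\|(\S+)?\|(\S+)/) {
        return ($1, $2, $3);    # $2 is undef when the middle slot is empty
    }
    return;
}

my $desc = '(NM_118554) putative protein [Arabidopsis thaliana]'
         . 'gi|7487330|pir||T09884 hypothetical protein T22A6.40';

my ($pirgi, $pirnum, $piracc) = first_pir($desc);
if (defined $pirgi) {
    print "gi:  $pirgi\n";
    print "num: ", (defined $pirnum ? $pirnum : '(none)'), "\n";
    print "acc: $piracc\n";
}
```

Because the middle group is optional, check it with defined before
printing it, otherwise an empty pir slot triggers an "uninitialized value"
warning.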

You can also investigate the split function to split on whitespace.
Programming the correct solution depends on what you actually want to
extract from the desc line.
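
For example, a rough sketch of the split approach that builds the kind of
hash asked about below (database tag => list of identifiers).  It assumes
every entry looks like "gi|NUMBER|db|acc|name", as in the example header;
the sub name ids_by_db is hypothetical and other header schemes will need
different rules:

```perl
use strict;
use warnings;

# Collect identifiers from a header string into a hash keyed by database
# tag (ref, pir, emb, ...).  Assumes "gi|NUMBER|db|acc|name" style tokens.
# ids_by_db is a hypothetical helper, not part of BioPerl.
sub ids_by_db {
    my ($header) = @_;
    my %ids;
    for my $token (split /\s+/, $header) {
        # Keep only the part starting at "gi|"; skip tokens without one.
        next unless $token =~ /(gi\|.*)/;
        my @field = split /\|/, $1;
        my ($db, $acc, $name) = @field[2, 3, 4];
        next unless defined $db;
        # The pir entry in the example leaves the acc slot empty and puts
        # its identifier in the following slot, so fall back to that.
        my $id = (defined $acc && length $acc) ? $acc : $name;
        push @{ $ids{$db} }, $id if defined $id && length $id;
    }
    return %ids;
}

my $header = 'gi|15233744|ref|NP_194152.1| (NM_118554) putative protein '
           . '[Arabidopsis thaliana]gi|7487330|pir||T09884 hypothetical '
           . 'protein T22A6.40 - Arabidopsis thalianagi|5051763|emb|CAB45056.1| '
           . '(AL078637) putative protein [Arabidopsis thaliana]';

my %ids = ids_by_db($header);
for my $db (sort keys %ids) {
    print "$db: @{ $ids{$db} }\n";
}
```

With a Bio::Seq object you would pass in something like
$seq->display_id() . ' ' . $seq->desc() instead of the literal string.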

-jason



On Thu, 8 Aug 2002, Benjamin Breu wrote:

> Hi,
>
> thx Jason for your help.
>
> The desc() function prints out the header but there is too much stuff
> in it. I thought it would print only the description, but if there are
> multiple gi numbers for one protein (I'm using NCBI-Fasta (nr)), it
> shows me the description and the following gi, pir, etc. number plus
> their description. See below.
>
> use Bio::SeqIO;
> my $in = Bio::SeqIO->new(-format => 'fasta', -file => 'filename');   # 'filename' = my filename
> while( my $seq = $in->next_seq ) {
>  print  $seq->display_id(), "\n",$seq->desc(), "\n", $seq->seq(), "\n\n";
> }
>
> The output format is as follows:
>
> ID
> description
> sequence
>
> gi|15233744|ref|NP_194152.1|
> (NM_118554) putative protein [Arabidopsis thaliana]gi|7487330|pir||T09884 hypothetical protein T22A6.40 - Arabidopsis thalianagi|5051763|emb|CAB45056.1| (AL078637) putative protein [Arabidopsis thaliana]gi|7269271|emb|CAB79331.1| (AL161561) putative protein [Arabidopsis thaliana]
> MKRSTTDSDLAGDAHNETNKKMKSTEEEEIGFSNLDENLVYEVLKHVDAKTLAMSSCVSKIWHKTAQDERLWELICTRHWTNIGCGQNQLRSVVLALGGFRRLHSLYLWPLSKPNPRARFGKDELKLTLSLLSIRYYEKMSFTKRPLPESK
>
> Is there a problem with the parser, or what options does it need in order to give me all of the gi, pir, etc. numbers when I ask for an ID? That could be a hash with key = database (e.g. dbj, pir) and values = @arrayofnumbers. Is there such a smart little parser, or do I have to spend (a lot of) hours to do this myself?
>
> Thx
>
> Ben

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu