[Bioperl-l] Problems parsing Accession number in FASTA format.

Mon Jan 3 12:18:31 EST 2005

The FASTA parser only sets display_id. It doesn't set the accession  
number, and it doesn't set primary_id either. IMO, this is the correct  
behaviour, because the identifier in FASTA headers can come in all  
sorts of formats.

If what you want is to print the identifier part of the description  
line, print $seq->display_id(). If what you want is to extract the  
accession number, then parse it out from what display_id returns, using  
the format you expect it to be in.
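
As a minimal sketch of that last step, assuming every identifier follows the NCBI "gi|number|db|accession.version|" layout seen in the nr file: parse_ncbi_id is a hypothetical helper written for this illustration, not something BioPerl provides, and the regular expression would need adjusting for other header styles.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split an NCBI-style identifier
# such as "gi|2695847|emb|CAA73704.1|" into its parts.
# Assumes the layout gi|<number>|<db tag>|<accession>[.<version>]|
sub parse_ncbi_id {
    my ($display_id) = @_;
    if ($display_id =~ /^gi\|(\d+)\|(\w+)\|([^|.\s]+)(?:\.(\d+))?/) {
        # version stays undef when the identifier carries no ".N" suffix
        return { gi => $1, db => $2, accession => $3, version => $4 };
    }
    return;    # nothing is returned when the identifier is in another format
}

my $id = parse_ncbi_id('gi|2695847|emb|CAA73704.1|');
print "Sequence accession number: $id->{accession}\n" if $id;
```

Inside the next_seq loop, the same helper would be called as parse_ncbi_id($seq->display_id), since display_id is the only field the FASTA parser fills in.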

	-hilmar

(BTW, technically CAA73704.1 is not the accession: CAA73704 is the
accession and 1 is the version. This just illustrates the point.)

On Monday, January 3, 2005, at 09:00  AM, David García Cortés wrote:

> Hello.
>
> I have the "nr" database in FASTA format (downloaded from the NCBI
> website), and I want to retrieve the accession number of each sequence
> in that database, so I do the following:
>
> my $seqsfich = Bio::SeqIO->new(-file => "nr.fa", -format => 'Fasta');
>
> while (my $seq = $seqsfich->next_seq()) {
>     print STDOUT "Sequence accession number: ", $seq->accession, "\n";
> }
>
> But the results I get are:
>
> Sequence accession number: unknown
> Sequence accession number: unknown
> Sequence accession number: unknown
> Sequence accession number: unknown
> etc...
>
> Here you can see a fragment of the "nr.fa" file:
> >gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYYA
> QSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDISGP
> VATGCLATGFCLPPRPSRGLINLEKL
> >gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQTPGKGLEWVSIISAGGSTYYAP
> SVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMITVTSATPSPPTVFPLMQACCSVD
> VTGPSATGCLATEF
> >gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQAPGKGLEWVSTISYSVNAYYAQ
> SVQGRFTISRDDSNSMLYLQMNSLKTEDSAVYYCARESNFNRFDYWGSGTMVTVTNATPSPPTVFPLMQACCSVDVTGPS
> ATGCLATEF
>
> I suppose the accession numbers are CAA73704.1, CAA73709.1,
> CAA73712.1, etc. (?)
> So, how can I get BioPerl to parse and recognize them?
>
> Thanks in advance.
>
> --
> David García Cortés
> Instituto Nacional de Bioinformática (INB)
> Nodo Computacional GNHC-2 UPC-CIRI
> c/. Jordi Girona 1-3
> Modul C6-E201                   Tel.  : 934 011 650
> E-08034 Barcelona               Fax   : 934 017 014
> Catalunya (Spain)               e-mail: davidg at lsi.upc.edu
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------