[Bioperl-l] Problems parsing Accession number in FASTA format.

David García Cortés davidg at lsi.upc.edu
Mon Jan 3 12:00:21 EST 2005


Hello.

I have the "nr" database in FASTA format (downloaded from the NCBI website), and I want to retrieve the accession number of each sequence in that database. This is what I do:

use Bio::SeqIO;

# Open the FASTA file and print the accession of each sequence
my $seqsfich = Bio::SeqIO->new(-file => "nr.fa", -format => 'Fasta');

while (my $seq = $seqsfich->next_seq()) {
    print STDOUT "Sequence accession number: ", $seq->accession, "\n";
}

But the results I get are:

Sequence accession number: unknown
Sequence accession number: unknown
Sequence accession number: unknown
Sequence accession number: unknown
etc...

Here is a fragment of the "nr.fa" file:

>gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYYA
QSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDISGP
VATGCLATGFCLPPRPSRGLINLEKL
>gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQTPGKGLEWVSIISAGGSTYYAP
SVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMITVTSATPSPPTVFPLMQACCSVD
VTGPSATGCLATEF
>gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQAPGKGLEWVSTISYSVNAYYAQ
SVQGRFTISRDDSNSMLYLQMNSLKTEDSAVYYCARESNFNRFDYWGSGTMVTVTNATPSPPTVFPLMQACCSVDVTGPS
ATGCLATEF

I suppose the accession numbers are CAA73704.1, CAA73709.1, CAA73712.1, etc. (is that right?)
How can I get Bioperl to parse and recognize them?
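
As a fallback, I could split the identifier by hand. Below is a minimal sketch of that idea in plain Perl, assuming every header follows the gi|<number>|<db>|<accession.version>| layout shown above. The helper name accession_from_id is my own invention and not part of Bioperl, and other header layouts would need different handling.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Bioperl function): extract the accession from an
# NCBI-style identifier such as "gi|2695847|emb|CAA73704.1|".
# Assumes the layout gi|<number>|<db>|<accession.version>|
sub accession_from_id {
    my ($id) = @_;
    my @fields = split /\|/, $id;    # trailing empty field is dropped by split
    # Expect at least: "gi", the gi number, a database tag, the accession
    return unless @fields >= 4 && $fields[0] eq 'gi';
    return $fields[3];
}

# Two header lines copied from the nr.fa fragment above
my @headers = (
    '>gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]',
    '>gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]',
);

for my $header (@headers) {
    # The identifier is the first word after ">"
    my ($id) = $header =~ /^>(\S+)/;
    my $acc = accession_from_id($id);
    print "Sequence accession number: ",
        (defined $acc ? $acc : 'unknown'), "\n";
}
```

Inside the Bio::SeqIO loop, the same helper could presumably be given $seq->display_id, which as far as I can tell holds the first word of the header line, but I would prefer a built-in way if there is one.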

Thanks in advance.

--
David García Cortés
Instituto Nacional de Bioinformática (INB)
Nodo Computacional GNHC-2 UPC-CIRI
c/. Jordi Girona 1-3              
Modul C6-E201                   Tel.  : 934 011 650
E-08034 Barcelona               Fax   : 934 017 014
Catalunya (Spain)               e-mail: davidg at lsi.upc.edu



