[Bioperl-l] Problems parsing Accession number in FASTA format.

Mon Jan 3 13:11:05 EST 2005

David,

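The information you need is returned by the display_id() and desc() methods.
For a FASTA header line, display_id() returns the text captured by >(\S+),
and desc() returns the text captured by >\S+\s+(.+). For your first record,
display_id() is therefore gi|2695847|emb|CAA73704.1| and desc() is
immunoglobulin heavy chain [Acipenser baerii].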
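
As a rough illustration of the two patterns, here is a minimal plain-Perl sketch
that applies them to one of your header lines and then splits the ID on "|" to
pull out the accession. The helper names split_fasta_header and
accession_from_id are hypothetical (they are not BioPerl functions), and the
sketch assumes the ID has the "gi|number|db|accession|" layout seen in your
sample; other header layouts would need different handling.

```perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA header line into ID and description
# using the two patterns quoted above.
sub split_fasta_header {
    my ($header) = @_;
    my ($id)   = $header =~ /^>(\S+)/;
    my ($desc) = $header =~ /^>\S+\s+(.+)/;
    return ($id, $desc);
}

# Hypothetical helper: ASSUMES an ID shaped like "gi|number|db|accession|".
# Returns undef when the ID does not have at least four "|"-separated fields.
sub accession_from_id {
    my ($id) = @_;
    return undef unless defined $id;
    my @fields = split /\|/, $id;
    return @fields >= 4 ? $fields[3] : undef;
}

my $header = '>gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]';
my ($id, $desc) = split_fasta_header($header);
my $acc = accession_from_id($id);

print "ID:          $id\n";
print "Description: $desc\n";
print "Accession:   $acc\n";
```

Inside your next_seq() loop, the same idea would be to pass $seq->display_id
to a helper like accession_from_id instead of calling $seq->accession.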

Brian O.

-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of David García Cortés
Sent: Monday, January 03, 2005 12:00 PM
To: bioperl-l at bioperl.org
Subject: [Bioperl-l] Problems parsing Accession number in FASTA format.

Hello.

I have the "nr" database in FASTA format (downloaded from the NCBI website),
and I want to retrieve the accession number of each sequence in that
database, so I do the following:

my $seqsfich = Bio::SeqIO->new(-file => "nr.fa", -format => 'Fasta');

while (my $seq = $seqsfich->next_seq()) {
    print STDOUT "Sequence accession number: ", $seq->accession, "\n";
}

But the results I get are:

Sequence accession number: unknown
Sequence accession number: unknown
Sequence accession number: unknown
Sequence accession number: unknown
etc...

Here is a fragment of the "nr.fa" file:
>gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYYA
QSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDISGP
VATGCLATGFCLPPRPSRGLINLEKL
>gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQTPGKGLEWVSIISAGGSTYYAP
SVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMITVTSATPSPPTVFPLMQACCSVD
VTGPSATGCLATEF
>gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQAPGKGLEWVSTISYSVNAYYAQ
SVQGRFTISRDDSNSMLYLQMNSLKTEDSAVYYCARESNFNRFDYWGSGTMVTVTNATPSPPTVFPLMQACCSVDVTGPS
ATGCLATEF

I suppose the accession numbers are CAA73704.1, CAA73709.1, CAA73712.1,
etc., but I am not sure.
How can I get BioPerl to parse and recognize them?

Thanks in advance.

--
David García Cortés
Instituto Nacional de Bioinformática (INB)
Nodo Computacional GNHC-2 UPC-CIRI
c/. Jordi Girona 1-3
Modul C6-E201                   Tel.  : 934 011 650
E-08034 Barcelona               Fax   : 934 017 014
Catalunya (Spain)               e-mail: davidg at lsi.upc.edu
