[Bioperl-l] Problems parsing Accession number in FASTA format.
David García Cortés
davidg at lsi.upc.edu
Mon Jan 3 12:00:21 EST 2005
Hello.
I have the "nr" database in FASTA format (downloaded from the NCBI website), and I want to retrieve the accession number of each sequence in that database, so I do the following:
my $seqsfich = Bio::SeqIO->new(-file => "nr.fa", -format => 'Fasta');
while (my $seq = $seqsfich->next_seq()) {
    print STDOUT "Sequence accession number: ", $seq->accession, "\n";
}
But the results I get are:
Sequence accession number: unknown
Sequence accession number: unknown
Sequence accession number: unknown
Sequence accession number: unknown
etc...
Here is a fragment of the "nr.fa" file:
>gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYYA
QSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDISGP
VATGCLATGFCLPPRPSRGLINLEKL
>gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQTPGKGLEWVSIISAGGSTYYAP
SVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMITVTSATPSPPTVFPLMQACCSVD
VTGPSATGCLATEF
>gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQAPGKGLEWVSTISYSVNAYYAQ
SVQGRFTISRDDSNSMLYLQMNSLKTEDSAVYYCARESNFNRFDYWGSGTMVTVTNATPSPPTVFPLMQACCSVDVTGPS
ATGCLATEF
I suppose the accession numbers are CAA73704.1, CAA73709.1, CAA73712.1, etc. (is that right?)
So the question is: how can I get Bioperl to parse and recognize them?
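To make clear which field I mean, here is a minimal sketch in plain Perl (no Bioperl involved). It assumes every identifier follows the layout gi|<gi number>|<database>|<accession.version>| seen in the fragment above; headers with any other layout are simply reported as unparsed. The helper name accession_from_id is hypothetical and is not a Bioperl function.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): return the accession.version
# field of an identifier such as "gi|2695847|emb|CAA73704.1|".
# Assumption: the layout is gi|<gi number>|<database>|<accession.version>|.
sub accession_from_id {
    my ($id) = @_;
    my @fields = split /\|/, $id;    # trailing empty field is dropped by split
    return unless @fields >= 4 && $fields[0] eq 'gi';
    return $fields[3];
}

# Two header lines copied from the nr.fa fragment above.
my @headers = (
    '>gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]',
    '>gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]',
);

for my $header (@headers) {
    # The identifier is the first whitespace-delimited token after ">".
    my ($id) = $header =~ /^>(\S+)/;
    my $acc = defined $id ? accession_from_id($id) : undef;
    print "Sequence accession number: ", (defined $acc ? $acc : 'unparsed'), "\n";
}
```

For the two headers above this prints CAA73704.1 and CAA73709.1, which is the value I would like to get from Bioperl for each sequence.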
Thanks in advance.
--
David García Cortés
Instituto Nacional de Bioinformática (INB)
Nodo Computacional GNHC-2 UPC-CIRI
c/. Jordi Girona 1-3
Modul C6-E201 Tel. : 934 011 650
E-08034 Barcelona Fax : 934 017 014
Catalunya (Spain) e-mail: davidg at lsi.upc.edu