[Bioperl-l] Problems parsing Accession number in FASTA format.
Brian Osborne
brian_osborne at cognia.com
Mon Jan 3 13:11:05 EST 2005
David,
The information you need is returned by the display_id() and desc() methods.
For a FASTA header line, display_id() returns the text captured by >(\S+),
and desc() returns the text captured by >\S+\s+(.+).
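
As a rough sketch of what this means for the "nr" headers quoted below: the two patterns are applied to a sample header line by hand, and the accession is then taken from the resulting identifier. The helper names split_header() and accession_from_id() are hypothetical (they are not part of BioPerl), and the sketch assumes the identifier always has the layout gi|<number>|<db>|<accession>|, as in the sample records.

```perl
use strict;
use warnings;

# Hypothetical helper: mimic what display_id() and desc() return for a
# FASTA header line, using the two patterns quoted above.
sub split_header {
    my ($line) = @_;
    my ($id)   = $line =~ /^>(\S+)/;
    my ($desc) = $line =~ /^>\S+\s+(.+)/;
    return ($id, $desc);
}

# Hypothetical helper: extract the accession from an NCBI-style identifier.
# Assumes the layout gi|<gi number>|<db>|<accession>|; returns undef otherwise.
sub accession_from_id {
    my ($id) = @_;
    my @fields = split /\|/, $id;
    return unless @fields >= 4 && $fields[0] eq 'gi';
    return $fields[3];
}

# Sample header taken from the nr.fa fragment in the original question.
my $header = '>gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]';

my ($id, $desc) = split_header($header);
print "display_id: $id\n";
print "desc:       $desc\n";
print "accession:  ", accession_from_id($id), "\n";
```

Inside the Bio::SeqIO loop from the question, the same helper could presumably be called as accession_from_id($seq->display_id) in place of $seq->accession. Headers that do not follow the assumed layout would need their own handling.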
Brian O.
-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of David García Cortés
Sent: Monday, January 03, 2005 12:00 PM
To: bioperl-l at bioperl.org
Subject: [Bioperl-l] Problems parsing Accession number in FASTA format.
Hello.
I have the "nr" database in FASTA format (downloaded from the NCBI website),
and I want to retrieve the accession number of each sequence in that database,
so I do the following:
use Bio::SeqIO;
my $seqsfich = Bio::SeqIO->new(-file => "nr.fa", -format => 'Fasta');
while (my $seq = $seqsfich->next_seq()) {
    print STDOUT "Sequence accession number: ", $seq->accession, "\n";
}
But the results I get are:
Sequence accession number: unknown
Sequence accession number: unknown
Sequence accession number: unknown
Sequence accession number: unknown
etc...
Here you can see a fragment of the "nr.fa" file:
>gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYYA
QSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDISGP
VATGCLATGFCLPPRPSRGLINLEKL
>gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQTPGKGLEWVSIISAGGSTYYAP
SVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMITVTSATPSPPTVFPLMQACCSVD
VTGPSATGCLATEF
>gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain [Acipenser baerii]
MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQAPGKGLEWVSTISYSVNAYYAQ
SVQGRFTISRDDSNSMLYLQMNSLKTEDSAVYYCARESNFNRFDYWGSGTMVTVTNATPSPPTVFPLMQACCSVDVTGPS
ATGCLATEF
I suppose the accession numbers are CAA73704.1, CAA73709.1, CAA73712.1,
etc. (?)
The question is: how can I get BioPerl to parse and recognize them?
Thanks in advance.
--
David García Cortés
Instituto Nacional de Bioinformática (INB)
Nodo Computacional GNHC-2 UPC-CIRI
c/. Jordi Girona 1-3
Modul C6-E201 Tel. : 934 011 650
E-08034 Barcelona Fax : 934 017 014
Catalunya (Spain) e-mail: davidg at lsi.upc.edu