[Bioperl-l] Problems parsing Accession number in FASTA format.
Hilmar Lapp
hlapp at gmx.net
Mon Jan 3 12:18:31 EST 2005
The FASTA parser only sets display_id. It doesn't set the accession
number, and it doesn't set primary_id either. IMO, this is the correct
behaviour, because the identifier in FASTA headers can come in all
sorts of formats.
If what you want is to print the identifier part of the description
line, print $seq->display_id(). If what you want is to extract the
accession number, then parse it out from what display_id returns, using
the format you expect it to be in.
-hilmar
(BTW technically, CAA73704.1 is not the accession - CAA73704 is and 1
is the version; just to illustrate)
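
To make the "parse it out" step concrete, here is a minimal sketch, assuming the identifiers follow the gi|<number>|<db>|<accession>.<version>| layout seen in the nr fragment quoted below. parse_ncbi_id is a hypothetical helper name, not something BioPerl provides, and the pattern would need adjusting for identifiers in other layouts.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split an NCBI-style
# "gi|<number>|<db>|<accession>.<version>|" identifier into its parts.
# Assumes that layout; returns an empty list when the string does not match.
sub parse_ncbi_id {
    my ($id) = @_;
    return unless defined $id;
    # e.g. gi|2695847|emb|CAA73704.1| -> gi number, db tag, accession, version
    if ($id =~ /^gi\|(\d+)\|([a-z]+)\|([A-Za-z0-9_]+)(?:\.(\d+))?\|/) {
        return (gi => $1, db => $2, accession => $3, version => $4);
    }
    return;
}

# The string below is what display_id would hold for the first record.
my %id = parse_ncbi_id('gi|2695847|emb|CAA73704.1|');
print "accession: $id{accession}, version: $id{version}\n";
```

Inside the next_seq() loop you would pass $seq->display_id() to the helper instead of a literal string. If you want the object itself to carry the value, handing the result to $seq->accession_number() should work, since that method is a getter/setter.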
On Monday, January 3, 2005, at 09:00 AM, David García Cortés wrote:
> Hello.
>
> I have the "nr" database in FASTA format (downloaded from the NCBI
> website), and I want to retrieve the accession number of each sequence
> in that database, so I do the following:
>
> my $seqsfich = Bio::SeqIO->new(-file=>"nr.fa", '-format' => 'Fasta');
>
> while (my $seq = $seqsfich->next_seq()) {
>     print STDOUT "Sequence accession number: ", $seq->accession, "\n";
> }
>
> But the results I get are:
>
> Sequence accession number: unknown
> Sequence accession number: unknown
> Sequence accession number: unknown
> Sequence accession number: unknown
> etc...
>
> Here is a fragment of the "nr.fa" file:
>
> >gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYYA
> QSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDISGP
> VATGCLATGFCLPPRPSRGLINLEKL
> >gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQTPGKGLEWVSIISAGGSTYYAP
> SVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMITVTSATPSPPTVFPLMQACCSVD
> VTGPSATGCLATEF
> >gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQAPGKGLEWVSTISYSVNAYYAQ
> SVQGRFTISRDDSNSMLYLQMNSLKTEDSAVYYCARESNFNRFDYWGSGTMVTVTNATPSPPTVFPLMQACCSVDVTGPS
> ATGCLATEF
>
> I suppose the accession numbers are CAA73704.1, CAA73709.1,
> CAA73712.1, etc. (?)
> The question is, how can I get Bioperl to parse and recognize them?
>
> Thanks in advance.
>
> --
> David García Cortés
> Instituto Nacional de Bioinformática (INB)
> Nodo Computacional GNHC-2 UPC-CIRI
> c/. Jordi Girona 1-3
> Modul C6-E201 Tel. : 934 011 650
> E-08034 Barcelona Fax : 934 017 014
> Catalunya (Spain) e-mail: davidg at lsi.upc.edu
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------