[Bioperl-l] Problems parsing Accesion number in FASTA format.
David García Cortés
davidg at lsi.upc.edu
Mon Jan 3 12:59:14 EST 2005
Thank you very much. Now it works!!!! :-)
----- Original Message -----
From: "Jason Stajich" <jason.stajich at duke.edu>
To: "David García Cortés" <davidg at lsi.upc.edu>
Cc: <bioperl-l at bioperl.org>
Sent: Monday, January 03, 2005 6:18 PM
Subject: Re: [Bioperl-l] Problems parsing Accesion number in FASTA format.
>I thought someone was going to centralize this function at some point.
>Right now there is a _get_accession_version function in
>Bio::SearchIO::blast. Perhaps someone would care to make a utility module
>which can export a bunch of useful functions like this?
>
> use Bio::SeqIO;
>
> my $seqsfich = Bio::SeqIO->new(-file=>"nr.fa", '-format' => 'Fasta');
>
> while (my $seq = $seqsfich->next_seq()) {
>     # derive accession and version from the pipe-delimited display id
>     my ($acc,$ver) = &_get_accession_version($seq->display_id);
>     $seq->accession_number($acc);
>     $seq->version($ver);
>     print STDOUT "Sequence accession number: ",
>         $seq->accession_number, "\n";
> }
>
> sub _get_accession_version {
>     my $id = shift;
>
>     # handle case when this is accidentally called as a class method
>     if( ref($id) && $id->isa('Bio::SearchIO') ) {
>         $id = shift;
>     }
>     return undef unless defined $id;
>     my ($acc, $version);
>     if ($id =~ /(gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|(.*)/) {
>         ($acc, $version) = split /\./, $2;
>     } elsif ($id =~ /(pir|prf|pat|gnl)\|(.*)\|(.*)/) {
>         ($acc, $version) = split /\./, $3;
>     } else {
>         # punt, not matching the db's at
>         # ftp://ftp.ncbi.nih.gov/blast/db/README
>         # Database Name                  Identifier Syntax
>         # ============================   ========================
>         # GenBank                        gb|accession|locus
>         # EMBL Data Library              emb|accession|locus
>         # DDBJ, DNA Database of Japan    dbj|accession|locus
>         # NBRF PIR                       pir||entry
>         # Protein Research Foundation    prf||name
>         # SWISS-PROT                     sp|accession|entry name
>         # Brookhaven Protein Data Bank   pdb|entry|chain
>         # Patents                        pat|country|number
>         # GenInfo Backbone Id            bbs|number
>         # General database identifier    gnl|database|identifier
>         # NCBI Reference Sequence        ref|accession|locus
>         # Local Sequence identifier      lcl|identifier
>         $acc=$id;
>     }
>     return ($acc,$version);
> }
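
On the utility-module idea: a minimal sketch of what such a module might look like, using only core Perl so it runs without BioPerl installed. The package name My::SeqIdUtil and the exported name get_accession_version are assumptions made up for illustration; as far as I know no such module exists in BioPerl. The regexes follow the quoted function, and the class-method guard is left out.

```perl
use strict;
use warnings;

# Hypothetical utility package (the name is an assumption, not a BioPerl module).
package My::SeqIdUtil;
use Exporter 'import';
our @EXPORT_OK = qw(get_accession_version);

# Split an NCBI-style identifier into (accession, version).
# Falls back to returning the whole id as the accession when no
# known database tag is found.
sub get_accession_version {
    my ($id) = @_;
    return unless defined $id;
    my ($acc, $version);
    if ($id =~ /(?:gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|(.*)/) {
        # accession is the field right after the database tag
        ($acc, $version) = split /\./, $1;
    } elsif ($id =~ /(?:pir|prf|pat|gnl)\|(.*)\|(.*)/) {
        # for these databases the useful name is the second field
        ($acc, $version) = split /\./, $2;
    } else {
        $acc = $id;
    }
    return ($acc, $version);
}

package main;

# If the package lived in its own file, this line would instead be:
#   use My::SeqIdUtil qw(get_accession_version);
My::SeqIdUtil->import('get_accession_version');

my ($acc, $ver) = get_accession_version('gi|2695847|emb|CAA73704.1|');
print "accession=$acc version=$ver\n";
```

With the first header from the nr.fa fragment this should give accession CAA73704 and version 1. An id without any known database tag comes back unchanged as the accession, with an undefined version.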
>
> On Jan 3, 2005, at 12:00 PM, David García Cortés wrote:
>
>> Hello.
>>
>> I have the "nr" database in FASTA format (downloaded from the NCBI
>> website), and I want to retrieve the accession number of each sequence
>> in that database, so I do the following:
>>
>> my $seqsfich = Bio::SeqIO->new(-file=>"nr.fa", '-format' => 'Fasta');
>>
>> while (my $seq = $seqsfich->next_seq()) {
>> print STDOUT "Sequence accession number: ", $seq->accession, "\n";
>> }
>>
>> But the results I get are:
>>
>> Sequence accession number: unknown
>> Sequence accession number: unknown
>> Sequence accession number: unknown
>> Sequence accession number: unknown
>> etc...
>>
>> Here is a fragment of the "nr.fa" file:
>> >gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]
>> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYYA
>> QSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDISGP
>> VATGCLATGFCLPPRPSRGLINLEKL
>> >gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]
>> MGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQTPGKGLEWVSIISAGGSTYYAP
>> SVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMITVTSATPSPPTVFPLMQACCSVD
>> VTGPSATGCLATEF
>> >gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain [Acipenser baerii]
>> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQAPGKGLEWVSTISYSVNAYYAQ
>> SVQGRFTISRDDSNSMLYLQMNSLKTEDSAVYYCARESNFNRFDYWGSGTMVTVTNATPSPPTVFPLMQACCSVDVTGPS
>> ATGCLATEF
>>
>> I suppose the accession numbers are CAA73704.1, CAA73709.1,
>> CAA73712.1, etc. (?)
>> The question is: how can I get Bioperl to parse and recognize them?
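
To check that guess, it can help to split the identifier on "|" and look at the fields. A minimal sketch, assuming the gi|number|db|accession| layout of the headers quoted above; the helper name split_ncbi_id is hypothetical, and the strict tag/value pairing it relies on does not hold for every NCBI identifier type (pir||entry, for example, has an empty middle field).

```perl
use strict;
use warnings;

# Hypothetical helper: turn "gi|2695847|emb|CAA73704.1|" into
# (tag => value) pairs. Assumes tags and values strictly alternate.
sub split_ncbi_id {
    my ($id) = @_;
    my @fields = split /\|/, $id;    # split drops the trailing empty field
    my %pairs;
    while (@fields >= 2) {
        my ($tag, $value) = splice @fields, 0, 2;
        $pairs{$tag} = $value;
    }
    return %pairs;
}

my $header = 'gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]';

# The identifier is the first whitespace-delimited token of the header.
my ($id) = split ' ', $header;
my %f = split_ncbi_id($id);
print "gi=$f{gi} emb=$f{emb}\n";
```

For this header the emb field is CAA73704.1, i.e. the accession with its version suffix still attached, which matches the guess above.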
>>
>> Thanks in advance.
>>
>> --
>> David García Cortés
>> Instituto Nacional de Bioinformática (INB)
>> Nodo Computacional GNHC-2 UPC-CIRI
>> c/. Jordi Girona 1-3
>> Modul C6-E201 Tel. : 934 011 650
>> E-08034 Barcelona Fax : 934 017 014
>> Catalunya (Spain) e-mail: davidg at lsi.upc.edu
>>
> --
> Jason Stajich
> jason.stajich at duke.edu
> http://www.duke.edu/~jes12/
>
>