[Bioperl-l] Problems parsing Accession number in FASTA format.

Mon Jan 3 12:18:45 EST 2005

I thought someone was going to centralize this function at some point.
Right now there is a _get_accession_version function in  
Bio::SearchIO::blast.  Perhaps someone would care to make a utility  
module which can export a bunch of useful functions like this?

use Bio::SeqIO;

my $seqsfich = Bio::SeqIO->new(-file => "nr.fa", -format => 'Fasta');

while (my $seq = $seqsfich->next_seq()) {
    # display_id holds the first token of the FASTA header line
    my ($acc, $ver) = _get_accession_version($seq->display_id);
    $seq->accession_number($acc);
    $seq->version($ver);
    print STDOUT "Sequence accession number: ", $seq->accession_number, "\n";
}

sub _get_accession_version {
     my $id = shift;

     # handle case when this is accidentally called as a class method
     if( ref($id) && $id->isa('Bio::SearchIO') ) {
         $id = shift;
     }
     return undef unless defined $id;
     my ($acc, $version);
     if ($id =~ /(gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|(.*)/) {
         ($acc, $version) = split /\./, $2;
     } elsif ($id =~ /(pir|prf|pat|gnl)\|(.*)\|(.*)/) {
         ($acc, $version) = split /\./, $3;
     } else {
         # punt, not matching the db's at
         # ftp://ftp.ncbi.nih.gov/blast/db/README
         #Database Name                     Identifier Syntax
         #============================      ========================
         #GenBank                           gb|accession|locus
         #EMBL Data Library                 emb|accession|locus
         #DDBJ, DNA Database of Japan       dbj|accession|locus
         #NBRF PIR                          pir||entry
         #Protein Research Foundation       prf||name
         #SWISS-PROT                        sp|accession|entry name
         #Brookhaven Protein Data Bank      pdb|entry|chain
         #Patents                           pat|country|number
         #GenInfo Backbone Id               bbs|number
         #General database identifier       gnl|database|identifier
         #NCBI Reference Sequence           ref|accession|locus
         #Local Sequence identifier         lcl|identifier
         $acc=$id;
     }
     return ($acc,$version);
}
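
As a rough sketch of the utility-module idea, the same parsing logic could live in a small package that exports the function on request. The package name My::SeqIdUtil and the public name get_accession_version are assumptions made up for illustration; no such module exists in BioPerl as far as this post is concerned. The sample ID 'pir||S12345' is also invented, and only the 'gi|2695847|emb|CAA73704.1|' ID comes from the file quoted below. The sketch has no BioPerl dependency so it can be run on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical utility package (name is an assumption, not a BioPerl module).
{
    package My::SeqIdUtil;
    use Exporter 'import';
    our @EXPORT_OK = qw(get_accession_version);

    # Split an NCBI-style FASTA identifier into (accession, version).
    # Same matching rules as the function posted above.
    sub get_accession_version {
        my $id = shift;
        return unless defined $id;
        my ($acc, $version);
        if ($id =~ /(?:gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|(.*)/) {
            # accession is the field right after the database tag
            ($acc, $version) = split /\./, $1;
        } elsif ($id =~ /(?:pir|prf|pat|gnl)\|(.*)\|(.*)/) {
            # accession is the second field after the database tag
            ($acc, $version) = split /\./, $2;
        } else {
            # unknown layout: fall back to the whole identifier
            $acc = $id;
        }
        return ($acc, $version);
    }
}

# In a real module file this would simply be:
#   use My::SeqIdUtil qw(get_accession_version);
My::SeqIdUtil->import('get_accession_version');

for my $id ('gi|2695847|emb|CAA73704.1|', 'pir||S12345', 'myseq') {
    my ($acc, $version) = get_accession_version($id);
    print "$id => acc=$acc version=",
          (defined $version ? $version : '(none)'), "\n";
}
```

For the first ID this yields accession CAA73704 and version 1; the other two have no version part, so the version stays undefined.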

On Jan 3, 2005, at 12:00 PM, David García Cortés wrote:

> Hello.
>
> I have the "nr" database in FASTA format (downloaded from the NCBI
> website), and I want to retrieve the accession number of each sequence
> in that database, so I do the following:
>
> my $seqsfich  = Bio::SeqIO->new(-file=>"nr.fa", '-format' => 'Fasta');
>
>  while (my $seq = $seqsfich->next_seq()) {
>     print STDOUT "Sequence accession number: ", $seq->accession, "\n";
>    }
>
> But the results I get are:
>
> Sequence accession number: unknown
> Sequence accession number: unknown
> Sequence accession number: unknown
> Sequence accession number: unknown
> etc...
>
> Here is a fragment of the "nr.fa" file:
>> gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIYSGGSSTYYA
> QSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMESCCLSDISGP
> VATGCLATGFCLPPRPSRGLINLEKL
>> gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQTPGKGLEWVSIISAGGSTYYAP
> SVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMITVTSATPSPPTVFPLMQACCSVD
> VTGPSATGCLATEF
>> gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQAPGKGLEWVSTISYSVNAYYAQ
> SVQGRFTISRDDSNSMLYLQMNSLKTEDSAVYYCARESNFNRFDYWGSGTMVTVTNATPSPPTVFPLMQACCSVDVTGPS
> ATGCLATEF
>
> I suppose the accession numbers are CAA73704.1, CAA73709.1,
> CAA73712.1, etc. (?)
> How can I get Bioperl to parse and recognize them?
>
> Thanks in advance.
>
> --
> David García Cortés
> Instituto Nacional de Bioinformática (INB)
> Nodo Computacional GNHC-2 UPC-CIRI
> c/. Jordi Girona 1-3
> Modul C6-E201                   Tel.  : 934 011 650
> E-08034 Barcelona               Fax   : 934 017 014
> Catalunya (Spain)               e-mail: davidg at lsi.upc.edu
>
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/