[Bioperl-l] Problems parsing Accession number in FASTA format.
Jason Stajich
jason.stajich at duke.edu
Mon Jan 3 12:18:45 EST 2005
I thought someone was going to centralize this function at some point.
Right now there is a _get_accession_version function in
Bio::SearchIO::blast. Perhaps someone would care to make a utility
module which can export a bunch of useful functions like this?
my $seqsfich = Bio::SeqIO->new(-file => "nr.fa", -format => 'Fasta');
while (my $seq = $seqsfich->next_seq()) {
    # display_id holds the raw header ID, e.g. gi|2695847|emb|CAA73704.1|
    my ($acc, $ver) = _get_accession_version($seq->display_id);
    $seq->accession_number($acc);
    $seq->version($ver);
    print STDOUT "Sequence accession number: ",
        $seq->accession_number, "\n";
}
sub _get_accession_version {
    my $id = shift;
    # handle case when this is accidentally called as a class method
    if( ref($id) && $id->isa('Bio::SearchIO') ) {
        $id = shift;
    }
    return undef unless defined $id;
    my ($acc, $version);
    if ($id =~ /(gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|(.*)/) {
        ($acc, $version) = split /\./, $2;
    } elsif ($id =~ /(pir|prf|pat|gnl)\|(.*)\|(.*)/) {
        ($acc, $version) = split /\./, $3;
    } else {
        # punt, not matching the db's at
        # ftp://ftp.ncbi.nih.gov/blast/db/README
        # Database Name                  Identifier Syntax
        # ============================   ========================
        # GenBank                        gb|accession|locus
        # EMBL Data Library              emb|accession|locus
        # DDBJ, DNA Database of Japan    dbj|accession|locus
        # NBRF PIR                       pir||entry
        # Protein Research Foundation    prf||name
        # SWISS-PROT                     sp|accession|entry name
        # Brookhaven Protein Data Bank   pdb|entry|chain
        # Patents                        pat|country|number
        # GenInfo Backbone Id            bbs|number
        # General database identifier    gnl|database|identifier
        # NCBI Reference Sequence        ref|accession|locus
        # Local Sequence identifier      lcl|identifier
        $acc = $id;
    }
    return ($acc, $version);
}
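
As a rough sketch of the kind of standalone utility function suggested above, here is one way it might look without any BioPerl dependency. This is not an existing BioPerl API: the name parse_accession_version and the %ACC_FIELD table are made up for illustration. It assumes the identifier syntax from the README table above, skips a leading gi|number pair, and also accepts two-field IDs such as lcl|identifier, which the regexes above do not match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lookup table (not part of BioPerl): for each database tag,
# the index of the field after the tag that carries the accession.
my %ACC_FIELD = (
    (map { ($_ => 0) } qw(gb emb dbj sp pdb bbs ref lcl)),  # tag|ACC|...
    (map { ($_ => 1) } qw(pir prf pat gnl)),                # tag|x|ACC
);

# Hypothetical helper: split an NCBI-style FASTA ID into (accession, version).
# Falls back to the whole ID when the tag is not recognized.
sub parse_accession_version {
    my ($id) = @_;
    return unless defined $id;

    my @fields = split /\|/, $id;
    # Drop a leading "gi|number" pair, as found in the nr headers.
    splice(@fields, 0, 2) if @fields > 2 && $fields[0] eq 'gi';

    my $tag = shift @fields;
    my $raw = $id;
    if (defined $tag && exists $ACC_FIELD{$tag}) {
        my $field = $fields[ $ACC_FIELD{$tag} ];
        $raw = $field if defined $field && length $field;
    }

    # "CAA73704.1" becomes ("CAA73704", "1"); no dot leaves version undef.
    my ($acc, $version) = split /\./, $raw, 2;
    return ($acc, $version);
}

for my $id ('gi|2695847|emb|CAA73704.1|', 'pir||S12345', 'lcl|foo.2') {
    my ($acc, $ver) = parse_accession_version($id);
    printf "%s -> %s (version %s)\n", $id, $acc, defined $ver ? $ver : 'none';
}
```

With BioPerl installed, the same function could be called on $seq->display_id inside the next_seq() loop, in place of _get_accession_version.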
On Jan 3, 2005, at 12:00 PM, David García Cortés wrote:
> Hello.
>
> I have the "nr" database in FASTA format (downloaded from NCBI
> website), and i want to retrieve the accession number of each sequence
> in that database, so I do the following:
>
> my $seqsfich = Bio::SeqIO->new(-file=>"nr.fa", '-format' => 'Fasta');
>
> while (my $seq = $seqsfich->next_seq()) {
> print STDOUT "Sequence accession number: ", $seq->accession, "\n";
> }
>
> But the results I get are:
>
> Sequence accession number: unknown
> Sequence accession number: unknown
> Sequence accession number: unknown
> Sequence accession number: unknown
> etc...
>
> Here you can see a fragment of the "nr.fa" file
> :
> >gi|2695847|emb|CAA73704.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSAYMSWVRQAPGKGLEWVAYIY
> SGGSSTYYA
> QSVQGRFAISRDDSNSMLYLQMNSLKTEDTAVYYCARGGLGWSLDYWGKGTMITVTSATPSPPTVFPLMES
> CCLSDISGP
> VATGCLATGFCLPPRPSRGLINLEKL
> >gi|2695851|emb|CAA73709.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVVKPGESHKLSCKAAGFTFSSYWMGWVRQTPGKGLEWVSIIS
> AGGSTYYAP
> SVEGRFTISRDNSNSMLYLQMNSLKTEDTAMYYCARKPETGSYGNISFEHWGKGTMITVTSATPSPPTVFP
> LMQACCSVD
> VTGPSATGCLATEF
> >gi|2695853|emb|CAA73712.1| immunoglobulin heavy chain [Acipenser baerii]
> MGILTALCIIMTALSSVRSDVVLTESGPAVIKPGESHKLSCKASGFTFSSNNMGWVRQAPGKGLEWVSTIS
> YSVNAYYAQ
> SVQGRFTISRDDSNSMLYLQMNSLKTEDSAVYYCARESNFNRFDYWGSGTMVTVTNATPSPPTVFPLMQAC
> CSVDVTGPS
> ATGCLATEF
>
> I suppose the accession numbers are: CAA73704.1, CAA73709.1,
> CAA73712.1, etc. (?)
> The question is, how can I get Bioperl to parse and recognize them?
>
> Thanks in advance.
>
> --
> David García Cortés
> Instituto Nacional de Bioinformática (INB)
> Nodo Computacional GNHC-2 UPC-CIRI
> c/. Jordi Girona 1-3
> Modul C6-E201 Tel. : 934 011 650
> E-08034 Barcelona Fax : 934 017 014
> Catalunya (Spain) e-mail: davidg at lsi.upc.edu
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/