[Bioperl-l] parsing protein accession numbers and types from fasta headers

Wed Sep 13 14:36:33 UTC 2006

I agree that the non-BioPerl way is probably best, though you can look at
the Flat Database HOWTO for a fast Bioperl-ish way to index a FASTA file,
get the IDs, set primary and secondary accessions, retrieve sequences, etc.

http://www.bioperl.org/wiki/HOWTO:Flat_databases
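
For the non-BioPerl route, here is a minimal sketch of how the header
parsing could be turned into a small lookup table. It assumes NCBI-style
headers of the form ">gi|number|db|field1|field2 description" and picks
fields the same way as Bernd's snippet quoted below. The helper names
(parse_header, index_headers), the file name and the accessions in the
sample data are all made up for illustration; headers from other sources
(IPI, Ensembl, plain UniProt) would need their own patterns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse one NCBI-style header line of the form
#   >gi|<number>|<db>|<field1>|<field2> description
# Returns a hash reference, or undef if the line has a different layout.
sub parse_header {
    my ($line) = @_;
    return unless $line =~ m/^>gi\|(\d+)\|(\w+)\|([^|\s]*)\|(\S*)/;
    my ($gi, $db, $first, $second) = ($1, $2, $3, $4);

    # Same choice of fields as in the quoted snippet:
    # pdb -> both fields joined, sp and pir -> second field, others -> first.
    my $id = $db eq 'pdb'                  ? $first . $second
           : ($db eq 'sp' || $db eq 'pir') ? $second
           :                                 $first;

    return { gi => $gi, db => $db, id => $id };
}

# Hypothetical helper: read the header lines from a filehandle and index
# the parsed records by gi number, remembering which file they came from.
sub index_headers {
    my ($fh, $source) = @_;
    my %by_gi;
    while (my $line = <$fh>) {
        next unless $line =~ /^>/;              # skip sequence lines
        my $rec = parse_header($line) or next;  # skip unrecognised headers
        $rec->{file} = $source;
        $by_gi{ $rec->{gi} } = $rec;
    }
    return \%by_gi;
}

# Made-up sample data, read through an in-memory filehandle.
my $fasta = <<'END';
>gi|1000001|sp|P00001|TEST_HUMAN Made-up protein
MKT
>gi|1000002|pdb|1ABC|A Made-up structure chain
GSH
>gi|1000003|ref|NP_000001.1| Made-up RefSeq entry
MAA
END

open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
my $index = index_headers($fh, 'example.fasta');
close $fh;

for my $gi (sort keys %$index) {
    my $rec = $index->{$gi};
    print join("\t", $gi, $rec->{db}, $rec->{id}, $rec->{file}), "\n";
}
```

Unlike the quoted regex, this pattern is anchored at the start of the line
and does not require whitespace after the last field, so a header without a
description still matches. Checking whether a given gi has, say, a Swiss-Prot
entry is then a hash lookup on the returned index.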

Bio::DB::Fasta is another flat-db interface for accessing large FASTA
databases, and users seem to like it.  It can now handle files larger
than 4 GB.

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Bernd Web
> Sent: Wednesday, September 13, 2006 8:18 AM
> To: Antonio Ramos Fernández
> Cc: bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] parsing protein accession numbers and types
> from fasta headers
> 
> Hi
> 
> I tried to parse this variability and pull out the database types. First
> I capture the DB type in $1, then I extract the ID I need for my purposes.
> Of course it's not *Bio*Perl, but it worked for me ;-)
> 
> if ( m/>gi\|\d+\|(\w+)\|([^\|\s]*)\|(\S*)\s/ ) {
> 	# $1 = database type, $2 and $3 = the two fields that follow it
> 	my $name;
> 	# Equivalent if/elsif form:
> 	# if ($1 eq 'pdb') { $name = $2.$3 }
> 	# elsif ($1 eq 'sp' || $1 eq 'pir') { $name = $3 }
> 	# else { $name = $2 }
> 	SWITCH: {
> 		if ($1 eq 'pdb') { $name = $2.$3; last SWITCH; }
> 		if ($1 eq 'sp' ) { $name = $3; last SWITCH; }
> 		if ($1 eq 'pir') { $name = $3; last SWITCH; }
> 		$name = $2;
> 	}
> }
> 
> bernd
> 
> 
> On 9/13/06, Antonio Ramos Fernández <tniram at hotmail.com> wrote:
> >
> > I'd like to write a script to parse the headers of FASTA-formatted
> > protein databases and get protein accession numbers and identifiers
> > (uniprot, IPI, gi, Refseq, ensembl...). The idea is to build a simple
> > local database that relates the accession number of a protein sequence
> > to all its valid identifiers and to the FASTA files on my system from
> > which they were obtained, or to check, for instance, whether a uniprot
> > accession exists for a given gi. However, the structure of the FASTA
> > header is quite variable depending on the source. Any suggestions?
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >