NCBI fasta format [was: Re: [Bioperl-l] loading data into bioperl-db]

Fri Jun 6 17:07:09 EDT 2003

On Thu, 5 Jun 2003, Hilmar Lapp wrote:

> >  Is there ever a case where
> > Bio::SeqIO::fasta will parse a sequence header like :
> >
> > >gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea
> >
> > and read the namespace, accession, version etc from it?
>
> No. Bioperl itself does not interpret the identifier token, especially
> given the fact that there are plenty of ways in which people convolute
> information here, and that it is relatively simple to apply whatever
> extraction is suitable in 1 or 2 lines of perl.

It seems like we hear this request a lot; I think it's an (almost) valid
newbie expectation that somehow the "gi|123456|db|acc.v|name descr" fasta
header is some kind of universal standard.  I agree that it's an easy
couple of lines of Perl to get right, but maybe we should be trying to do
this for people?  It seems like such an easy thing (I'll wait to be told
otherwise ...).  I agree that there are lots of bastardizations out there
in the wild, but for each of the databases that NCBI "outputs" (gb, emb,
dbj, ref, pir, sp, pdb, etc), there's pretty consistent behaviour for the
fields.
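
As a rough illustration of the extraction in question (sketched in Python
here rather than Perl; the names parse_ncbi_header and NcbiHeader are
hypothetical, and the pattern assumes only the five-field
"gi|number|db|acc.version|name description" layout mentioned above),
something like this might cover the common case and return nothing for
headers that do not fit:

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: gi|<gi>|<db>|<accession>[.<version>]|<name> <description>
# The leading ">" is optional so the raw header line can be passed in directly.
_NCBI_HEADER = re.compile(
    r"^>?gi\|(?P<gi>\d+)"
    r"\|(?P<db>[a-z]+)"
    r"\|(?P<acc>[^|.\s]*)(?:\.(?P<ver>\d+))?"
    r"\|(?P<name>\S*)"
    r"(?:\s+(?P<desc>.*))?$"
)


class NcbiHeader(NamedTuple):
    """Hypothetical container for the fields pulled out of the header."""
    gi: str
    db: str
    accession: str
    version: Optional[int]
    name: str
    description: str


def parse_ncbi_header(line: str) -> Optional[NcbiHeader]:
    """Return the parsed fields, or None if the line does not match the assumed layout."""
    m = _NCBI_HEADER.match(line.strip())
    if m is None:
        return None
    ver = m.group("ver")
    return NcbiHeader(
        gi=m.group("gi"),
        db=m.group("db"),
        accession=m.group("acc"),
        version=int(ver) if ver is not None else None,
        name=m.group("name"),
        description=m.group("desc") or "",
    )


if __name__ == "__main__":
    header = ">gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea"
    print(parse_ncbi_header(header))
```

Any database whose fields deviate from this layout would need its own
handling; the sketch does not attempt that.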

It should make loading up biosql databases from flatfiles a bit easier,
too.

Any lurkers want to write Bio::SeqIO::fasta_ncbi.pm (inheriting from
Bio::SeqIO::fasta)?  I guess we'd have to agree on where in the Seq model
the "db" and any secondary accessions/names would be stored ...

-Aaron