NCBI fasta format [was: Re: [Bioperl-l] loading data into bioperl-db]

Sat Jun 7 11:54:37 EDT 2003

OK, I will volunteer.

Having 'specialised fasta objects' is very useful. Another 'type' of fasta 
that I have needed before is one for handling fasta files with TIGR's 
assembler. I will have a look at the code of the fasta module this evening 
and see what I can do, since I have written fasta modules like this in Python.

However, I think there is another sensible approach. The formatting of the 
definition line is exactly that: formatting. I don't think you need a 
new fasta object just to define the line format; that complicates the 
class hierarchy. Instead one could create a fasta object with

Bio::SeqIO::fasta('ncbi'), and the internals of the class would take care of 
setting and retrieving the data in a sensible way.
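
To make the idea concrete, here is a rough sketch in Python rather than Perl. 
It is not based on the Bio::SeqIO::fasta internals; Header, split_defline, 
parse_ncbi, HEADER_FORMATS and read_header are made-up names for illustration 
only. It assumes the "gi|number|db|accession.version|name description" layout 
quoted below, and falls back to the plain, uninterpreted split for any line 
that does not match. Some databases lay out the fields after the db tag 
differently, so a real version would need per-database handling.

```python
from typing import Callable, Dict, NamedTuple, Optional


class Header(NamedTuple):
    """Hypothetical container for what a definition line can tell us."""
    display_id: str                   # first whitespace-delimited token, untouched
    description: str                  # rest of the definition line
    gi: Optional[str] = None
    db: Optional[str] = None          # namespace tag, e.g. "gb", "emb", "dbj"
    accession: Optional[str] = None
    version: Optional[int] = None
    name: Optional[str] = None


def split_defline(defline: str) -> Header:
    """Default behaviour: split the id token from the description, interpret nothing."""
    token, _, desc = defline.lstrip(">").strip().partition(" ")
    return Header(display_id=token, description=desc.strip())


def parse_ncbi(defline: str) -> Header:
    """Interpret 'gi|N|db|acc.ver|name'; fall back to the plain split otherwise."""
    plain = split_defline(defline)
    fields = plain.display_id.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        return plain
    accession, _, version = fields[3].partition(".")
    return plain._replace(
        gi=fields[1],
        db=fields[2],
        accession=accession,
        version=int(version) if version.isdigit() else None,
        name=fields[4] if len(fields) > 4 and fields[4] else None,
    )


# One reader, several header formats: adding a format means adding an entry
# here, not adding a class to the hierarchy.
HEADER_FORMATS: Dict[str, Callable[[str], Header]] = {
    "plain": split_defline,
    "ncbi": parse_ncbi,
}


def read_header(defline: str, fmt: str = "plain") -> Header:
    """Parse one definition line using the named header format."""
    return HEADER_FORMATS[fmt](defline)
```

Calling read_header(line, "ncbi") on the example header from the quoted 
message would fill in gi, db, accession, version and name, while the default 
format leaves the identifier token alone, as the fasta module does today.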

Of course, we might not want to meddle with the basic fasta class; we could 
instead create another object, Bio::SeqIO::formated_fasta('ncbi'), to which 
we could add more specific types of fasta format.

What would people prefer?

Peter W.

P.S. Sorry for mentioning the p word ;-)

At 04:07 PM 06/06/2003 -0400, Aaron J Mackey wrote:

>On Thu, 5 Jun 2003, Hilmar Lapp wrote:
>
> > >  Is there ever a case where
> > > Bio::SeqIO::fasta will parse a sequence header like :
> > >
> > >> gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea
> > >
> > > and read the namespace, accession, version etc from it?
> >
> > No. Bioperl itself does not interpret the identifier token, especially
> > given the fact that there are plenty of ways in which people convolute
> > information here, and that it is relatively simple to apply whatever
> > extraction is suitable in 1 or 2 lines of perl.
>
>It seems like we hear this request a lot; I think it's an (almost) valid
>newbie expectation that somehow the "gi|123456|db|acc.v|name descr" fasta
>header is some kind of universal standard.  I agree that it's an easy
>couple of lines of Perl to get right, but maybe we should be trying to do
>this for people?  It seems like such an easy thing (I'll wait to be told
>otherwise ...).  I agree that there are lots of bastardizations out there
>in the wild, but for each of the databases that NCBI "outputs" (gb, emb,
>dbj, ref, pir, sp, pdb, etc), there's pretty consistent behaviour for the
>fields.
>
>It should make loading up biosql databases from flatfiles a bit easier,
>too.
>
>Any lurkers want to write Bio::SeqIO::fasta_ncbi.pm (inheriting from
>Bio::SeqIO::fasta) ??  I guess we'd have to agree on where the "db" and
>any secondary accession/names would be stored in which Seq model ...
>
>-Aaron

-------------------------------------
Peter Wilkinson
Bioinformatics Consultant

-------------------------------------