NCBI fasta format [was: Re: [Bioperl-l] loading data into
bioperl-db]
Peter Wilkinson
pwilk at videotron.ca
Sat Jun 7 11:54:37 EDT 2003
OK, I will volunteer.
Having 'specialised fasta objects' is very useful. Another 'type' of fasta
that I have needed before is one for handling fasta files with TIGR's
assembler. I will have a look at the code of the fasta module this evening and
see what I can do, since I have written fasta modules like this in python.
However, I think there is another sensible approach. The formatting of the
definition line is exactly that: formatting. I don't think that you need a
new fasta class just to define the line format; that complicates the
class hierarchy. Instead one could create a fasta object with
Bio::SeqIO::fasta('ncbi'), and the internals of the class would take care of
setting and retrieving the data in a sensible way.
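
As a rough sketch of the "one class, named line format" idea (written in
Python, not Bioperl code): the table DEFLINE_FORMATS, the function
parse_defline and the field names are all hypothetical, and the pattern only
covers the "gi|number|db|acc.v|name descr" layout quoted below, not every
NCBI database variant.

```python
import re

# Hypothetical registry: one pattern per named definition-line style.
# Only the "gi|<gi>|<db>|<accession>.<version>|<name> <description>" layout
# is covered; other styles (e.g. TIGR) would be added as further entries.
DEFLINE_FORMATS = {
    "ncbi": re.compile(
        r"^gi\|(?P<gi>\d+)\|(?P<db>\w+)\|(?P<accession>[^.|\s]+)"
        r"(?:\.(?P<version>\d+))?\|(?P<name>\S*)\s*(?P<description>.*)$"
    ),
}


def parse_defline(line, fmt=None):
    """Split a fasta definition line into fields.

    With fmt=None the identifier token is left uninterpreted (id plus
    description). With a named format, the identifier is broken up
    according to the matching pattern in DEFLINE_FORMATS.
    """
    text = line.lstrip(">").strip()
    if fmt is None:
        ident, _, description = text.partition(" ")
        return {"id": ident, "description": description.strip()}
    match = DEFLINE_FORMATS[fmt].match(text)
    if match is None:
        raise ValueError(f"definition line does not match format {fmt!r}")
    return match.groupdict()


if __name__ == "__main__":
    header = ">gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea"
    print(parse_defline(header))          # plain: id + description only
    print(parse_defline(header, "ncbi"))  # named format: individual fields
```

The point of the design is that supporting a new line format means adding
an entry to the table rather than adding a class.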
Of course, we might not want to meddle with the basic fasta class, in which
case we could create another class, Bio::SeqIO::formated_fasta('ncbi'), where
we could add more specific types of fasta formats.
What would people prefer?
Peter W.
p.s. sorry for mentioning the p word ;-)
At 04:07 PM 06/06/2003 -0400, Aaron J Mackey wrote:
>On Thu, 5 Jun 2003, Hilmar Lapp wrote:
>
> > > Is there ever a case where
> > > Bio::SeqIO::fasta will parse a sequence header like :
> > >
> > >> gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea
> > >
> > > and read the namespace, accession, version etc from it?
> >
> > No. Bioperl itself does not interpret the identifier token, especially
> > given the fact that there are plenty of ways in which people convolute
> > information here, and that it is relatively simple to apply whatever
> > extraction is suitable in 1 or 2 lines of perl.
>
>It seems like we hear this request a lot; I think it's an (almost) valid
>newbie expectation that somehow the "gi|123456|db|acc.v|name descr" fasta
>header is some kind of universal standard. I agree that it's an easy
>couple of lines of Perl to get right, but maybe we should be trying to do
>this for people? It seems like such an easy thing (I'll wait to be told
>otherwise ...). I agree that there are lots of bastardizations out there
>in the wild, but for each of the databases that NCBI "outputs" (gb, emb,
>dbj, ref, pir, sp, pdb, etc), there's pretty consistent behaviour for the
>fields.
>
>It should make loading up biosql databases from flatfiles a bit easier,
>too.
>
>Any lurkers want to write Bio::SeqIO::fasta_ncbi.pm (inheriting from
>Bio::SeqIO::fasta) ?? I guess we'd have to agree on where the "db" and
>any secondary accession/names would be stored in which Seq model ...
>
>-Aaron
-------------------------------------
Peter Wilkinson
Bioinformatics Consultant
-------------------------------------