NCBI fasta format [was: Re: [Bioperl-l] loading data into bioperl-db]
Aaron J Mackey
ajm6q at virginia.edu
Sat Jun 7 13:11:06 EDT 2003
Thanks Peter,
I think Hilmar's points about using a SeqProcessor are valid; we don't
want to maintain 10 different "client" header format parsers for ncbi,
tigr, ensembl, etc etc. I would still argue though that the ncbi header
format is quite stable (I've been parsing it the same way for a few years
now), and, more importantly, is a high-profile, highly used target. It's
such low-hanging fruit that many people coming into BioPerl are surprised
we don't have some kind of readily-available, built-in support for it.
Maybe a combination of ideas would be acceptable? A
Bio::SeqIO::formatted_fasta that (as you implied) would take a parameter
that specified either a builtin SeqProcessor, or a custom SeqProcessor.
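
As a rough sketch of the combined idea above (not existing BioPerl code), a
header "processor" could simply be a code reference, looked up by name for
built-ins or passed in directly for custom formats. The names
parse_ncbi_header, header_processor and %BUILTIN are hypothetical, and the
parser assumes only the "gi|123456|db|acc.v|name descr" layout quoted in this
thread; other NCBI layouts would need extra cases.

```perl
use strict;
use warnings;

# Hypothetical helper: split an NCBI-style defline into its fields.
# Assumes the "gi|<gi>|<db>|<acc>.<ver>|<name> <description>" layout only.
sub parse_ncbi_header {
    my ($line) = @_;
    $line =~ s/^>\s*//;                        # drop the leading '>' if present
    my ($id, $desc) = split /\s+/, $line, 2;   # identifier token, then free text
    return unless defined $id;

    my ($tag, $gi, $db, $acc, $name) = split /\|/, $id;
    return unless defined($acc) && $tag eq 'gi';

    # "CD037498.1" -> accession "CD037498", version "1" (version is optional)
    my ($accession, $version) = $acc =~ /^(.+?)(?:\.(\d+))?$/;
    return unless defined $accession;

    return {
        gi          => $gi,
        namespace   => $db,
        accession   => $accession,
        version     => $version,
        name        => $name,
        description => $desc,
    };
}

# Built-in processors by name; more formats could be registered here.
my %BUILTIN = (ncbi => \&parse_ncbi_header);

# Accept either the name of a built-in processor or a custom code reference.
sub header_processor {
    my ($spec) = @_;
    return $spec if ref $spec eq 'CODE';
    my $code = $BUILTIN{$spec}
        or die "unknown header format '$spec'\n";
    return $code;
}

my $parse  = header_processor('ncbi');
my $fields = $parse->('> gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea');
printf "%s:%s.%s\n", @{$fields}{qw(namespace accession version)};
```

Keeping the processor a plain code reference means the basic fasta parser
would not need to know about any particular header convention; where the
"namespace" and secondary names end up in the Seq model is still an open
question.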
Bouncing ideas,
-Aaron
On Sat, 7 Jun 2003, Peter Wilkinson wrote:
> ok I will volunteer,
>
> having 'specialised fasta objects' is very useful. Another 'type' of fasta
> that I have needed before is one for handling fasta files with TIGR's
> assembler. I will have a look at the code of fasta module this evening and
> see what I can do since I have written fasta modules like this in python.
>
> However I think there is another sensible approach. The formatting of the
> definition line is exactly that, formatting. I don't think that you need a
> new fasta object just for defining the line format; this convolutes the
> class hierarchy. Instead one could create a fasta object with
>
> Bio::SeqIO::fasta('ncbi'), and the internals of the class will take care of
> setting and retrieving the data in a sensible way.
>
> Of course perhaps we might not want to meddle with the basic fasta class,
> and we create another object Bio::SeqIO::formatted_fasta('ncbi'), where we
> could add more specific types of fasta formats.
>
> What would people prefer?
>
> Peter W.
>
> p.s. sorry for mentioning the p word ;-)
>
>
>
> At 04:07 PM 06/06/2003 -0400, Aaron J Mackey wrote:
>
> >On Thu, 5 Jun 2003, Hilmar Lapp wrote:
> >
> > > > Is there ever a case where
> > > > Bio::SeqIO::fasta will parse a sequence header like :
> > > >
> > > >> gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea
> > > >
> > > > and read the namespace, accession, version etc from it?
> > >
> > > No. Bioperl itself does not interpret the identifier token, especially
> > > given the fact that there are plenty of ways in which people convolute
> > > information here, and that it is relatively simple to apply whatever
> > > extraction is suitable in 1 or 2 lines of perl.
> >
> >It seems like we hear this request a lot; I think it's an (almost) valid
> >newbie expectation that somehow the "gi|123456|db|acc.v|name descr" fasta
> >header is some kind of universal standard. I agree that it's an easy
> >couple of lines of Perl to get right, but maybe we should be trying to do
> >this for people? It seems like such an easy thing (I'll wait to be told
> >otherwise ...). I agree that there are lots of bastardizations out there
> >in the wild, but for each of the databases that NCBI "outputs" (gb, emb,
> >dbj, ref, pir, sp, pdb, etc), there's pretty consistent behaviour for the
> >fields.
> >
> >It should make loading up biosql databases from flatfiles a bit easier,
> >too.
> >
> >Any lurkers want to write Bio::SeqIO::fasta_ncbi.pm (inheriting from
> >Bio::SeqIO::fasta) ?? I guess we'd have to agree on where the "db" and
> >any secondary accession/names would be stored in which Seq model ...
> >
> >-Aaron
> >
> >
> >_______________________________________________
> >Bioperl-l mailing list
> >Bioperl-l at portal.open-bio.org
> >http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
> -------------------------------------
> Peter Wilkinson
> Bioinformatics Consultant
>
> -------------------------------------
>
>
--
Aaron J Mackey
Pearson Laboratory
University of Virginia
(434) 924-2821
amackey at virginia.edu