NCBI fasta format [was: Re: [Bioperl-l] loading data
MEC at Stowers-Institute.org
Mon Jun 9 10:14:13 EDT 2003
> ok I will volunteer,
Hooray for Peter!
> having 'specialised fasta objects' are very useful. Another 'type' of
> fasta that I have needed before is one for handling fasta files with
> TIGR's assembler. I will have a look at the code of fasta module this
> evening and see what I can do since I have written fasta modules like
> this in python.
What, however, should we take as the spec for the NCBI deflines that you are going to parse?
How about the one appearing in ftp://ftp.ncbi.nih.gov/blast/db/README ?
Among other things, it specifies this sequence identifier syntax, dependent upon the source database:
  Database Name                    Identifier Syntax
  -------------------------------  -----------------------
  EMBL Data Library                emb|accession|locus
  DDBJ, DNA Database of Japan      dbj|accession|locus
  NBRF PIR                         pir||entry
  Protein Research Foundation      prf||name
  SWISS-PROT                       sp|accession|entry name
  Brookhaven Protein Data Bank     pdb|entry|chain
  GenInfo Backbone Id              bbs|number
  General database identifier      gnl|database|identifier
  NCBI Reference Sequence          ref|accession|locus
  Local Sequence identifier        lcl|identifier
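To make the table concrete, here is a rough sketch of how a single identifier token could be split into named fields. It is in Python rather than Perl purely for illustration; it is not BioPerl code, and the names SYNTAX and parse_identifier, as well as the sample identifiers, are made up for this example. It covers only the database tags listed in the table above.

```python
# Field names for each database tag, copied from the table above.
# None marks a position that the syntax leaves empty (e.g. pir||entry).
SYNTAX = {
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry_name"),
    "pdb": ("entry", "chain"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_identifier(token):
    """Split one 'tag|field|field' token into a dict of named fields."""
    parts = token.split("|")
    tag = parts[0]
    if tag not in SYNTAX:
        raise ValueError(f"unknown database tag: {tag!r}")
    names = SYNTAX[tag]
    values = parts[1:]
    # Pad with empty strings so a missing trailing field does not get lost.
    values += [""] * (len(names) - len(values))
    result = {"db": tag}
    for name, value in zip(names, values):
        if name is not None:
            result[name] = value
    return result


# Hypothetical identifiers, for illustration only.
print(parse_identifier("sp|P00000|EXAMPLE_HUMAN"))
print(parse_identifier("pir||S00000"))
```

A table-driven lookup like this keeps each syntax in one place, so supporting the next tag that comes along would mean adding one entry rather than another regular expression.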
One conceptual problem for me, though: identical sequences can get 'merged' in NCBI nr, resulting in MULTIPLE concatenated deflines (separated by control-A). What to do here? Just use the first? Hmmm.
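One possible answer is to keep every defline and let the caller decide. A minimal sketch of the splitting step, again in Python only for illustration (split_merged_defline is a hypothetical helper, and the header below is invented), assuming the deflines are joined by the control-A character (\x01) and that the identifier is everything up to the first space:

```python
def split_merged_defline(header):
    """Return a list of (identifier, description) pairs, one per defline."""
    # Drop the leading '>' of the FASTA header line, if present.
    header = header.lstrip(">")
    entries = []
    # Control-A (\x01) separates the deflines of merged sequences.
    for defline in header.split("\x01"):
        ident, _, description = defline.strip().partition(" ")
        entries.append((ident, description.strip()))
    return entries


# Invented header with two merged deflines.
header = ">ref|AAA|X first protein\x01sp|BBB|Y second protein"
for ident, description in split_merged_defline(header):
    print(ident, "=>", description)
```

Taking entries[0] gives the "just use the first" behaviour, while the remaining entries could be stored as secondary accessions.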
> ... What would people prefer?
Merged sequences notwithstanding, I would prefer subclassing as Bio::SeqIO::fasta::ncbi.
Hilmar elsewhere argues against rolling this into BioPerl, with the argument:
>>A little SeqProcessor module or a
>>whole collection of them would basically have the same status as a
>>contributed script - use at your own risk, consider it as a starter
>>only, you may need to tweak it, but maybe it works out of the box as it
>>did for me.
I think this is a healthy attitude to maintain toward all of BioPerl. Still, adding NCBI defline parsing to the base install would be a blessing to many and, if done well, would be a good model for how to handle the next syntax that comes rolling in the door....
My 2 cents...
> Peter W.
> p.s. sorry for mentioning the p word ;-)
> At 04:07 PM 06/06/2003 -0400, Aaron J Mackey wrote:
> >On Thu, 5 Jun 2003, Hilmar Lapp wrote:
> > > > Is there ever a case where
> > > > Bio::SeqIO::fasta will parse a sequence header like :
> > > >
> > > > >gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea
> > > >
> > > > and read the namespace, accession, version etc from it?
> > >
> > > No. Bioperl itself does not interpret the identifier token,
> > > especially given the fact that there are plenty of ways in which
> > > people convolute information here, and that it is relatively
> > > simple to apply whatever extraction is suitable in 1 or 2 lines
> > > of perl.
> >It seems like we hear this request a lot; I think it's an (almost)
> >valid newbie expectation that somehow the "gi|123456|db|acc.v|name
> >descr" fasta header is some kind of universal standard. I agree that
> >it's an easy couple of lines of Perl to get right, but maybe we
> >should be trying to do this for people? It seems like such an easy
> >thing (I'll wait to be told otherwise ...). I agree that there are
> >lots of bastardizations out there in the wild, but for each of the
> >databases that NCBI "outputs" (gb, emb, dbj, ref, pir, sp, pdb, etc),
> >there's pretty consistent behaviour for the
> >It should make loading up biosql databases from flatfiles a bit
> >easier,
> >Any lurkers want to write Bio::SeqIO::fasta_ncbi.pm (inheriting from
> >Bio::SeqIO::fasta) ?? I guess we'd have to agree on where the "db"
> >and any secondary accession/names would be stored in which Seq model
> >...
> >Bioperl-l mailing list
> >Bioperl-l at portal.open-bio.org
> Peter Wilkinson
> Bioinformatics Consultant