NCBI fasta format [was: Re: [Bioperl-l] loading data into bioperl-db]

Cook, Malcolm MEC at Stowers-Institute.org
Mon Jun 9 10:14:13 EDT 2003


[Peter Wilkinson]
> ok I will volunteer,

Hooray for Peter!

> having 'specialised fasta objects' is very useful. Another 'type' of fasta
> that I have needed before is one for handling fasta files with TIGR's
> assembler. I will have a look at the code of the fasta module this evening
> and see what I can do, since I have written fasta modules like this in python.

What however should we be taking as the spec for ncbi deflines that you are going to parse?

How about the one appearing in: ftp://ftp.ncbi.nih.gov/blast/db/README ?

which, among other things, specifies this 'sequence identifier syntax', dependent upon the source database:

  Database Name                     Identifier Syntax
  ============================      ========================
  GenBank                           gb|accession|locus
  EMBL Data Library                 emb|accession|locus
  DDBJ, DNA Database of Japan       dbj|accession|locus
  NBRF PIR                          pir||entry
  Protein Research Foundation       prf||name
  SWISS-PROT                        sp|accession|entry name
  Brookhaven Protein Data Bank      pdb|entry|chain
  Patents                           pat|country|number 
  GenInfo Backbone Id               bbs|number 
  General database identifier       gnl|database|identifier
  NCBI Reference Sequence           ref|accession|locus
  Local Sequence identifier         lcl|identifier
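
A minimal sketch of how the table above could drive a parser, written in Python purely for illustration (it is not BioPerl code, and the names NCBI_ID_FIELDS and parse_ncbi_id are hypothetical). The "gi" entry is an assumption: it is not in the README table, but it appears in the example header quoted further down this thread. Unnamed empty slots such as the one in pir||entry are skipped.

```python
# Field names per database tag, following the README table.
# None marks a slot the table leaves empty (e.g. pir||entry).
NCBI_ID_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry_name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
    "gi": ("number",),  # assumption: not in the table, seen in real headers
}


def parse_ncbi_id(identifier):
    """Split a '|'-delimited identifier into (tag, {field: value}) records."""
    parts = identifier.split("|")
    records = []
    i = 0
    while i < len(parts):
        tag = parts[i]
        if tag == "" and i == len(parts) - 1:
            break  # tolerate a single trailing '|'
        if tag not in NCBI_ID_FIELDS:
            raise ValueError(f"unrecognised database tag: {tag!r}")
        names = NCBI_ID_FIELDS[tag]
        values = parts[i + 1 : i + 1 + len(names)]
        values += [""] * (len(names) - len(values))  # pad missing fields
        fields = {n: v for n, v in zip(names, values) if n is not None}
        records.append((tag, fields))
        i += 1 + len(names)
    return records


if __name__ == "__main__":
    header = "> gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea"
    identifier, description = header.lstrip("> ").split(None, 1)
    print(parse_ncbi_id(identifier), description)
```

Anything whose leading tag is not in the table raises an error here; a real module would more likely fall back to treating the whole token as an opaque id.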
 
A conceptual problem for me, though: identical sequences can get 'merged' in NCBI nr, resulting in MULTIPLE concatenated deflines (separated by control-A).  What to do here?  Just use the first?  Hmmm.
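
One possible answer, sketched in Python for illustration only: split the header on control-A, keep every entry, and treat the first as primary and the rest as secondary. The function name split_merged_defline and the sample header are invented for this example; how the secondary entries would map onto a Seq object is left open.

```python
CTRL_A = "\x01"  # separator between merged deflines in nr


def split_merged_defline(header):
    """Return a list of (identifier, description) pairs, one per merged entry."""
    entries = []
    for chunk in header.lstrip(">").split(CTRL_A):
        chunk = chunk.strip()
        if not chunk:
            continue  # ignore empty pieces, e.g. a trailing control-A
        identifier, _, description = chunk.partition(" ")
        entries.append((identifier, description.strip()))
    return entries


if __name__ == "__main__":
    # Made-up header: two entries for the same sequence, joined by control-A.
    merged = ">gi|1|gb|AAA00001.1| first protein\x01gi|2|sp|P00001|TEST_HUMAN second protein"
    primary, *secondary = split_merged_defline(merged)
    print("primary:", primary)
    print("secondary:", secondary)
```

Keeping all entries loses nothing, and a caller who only wants the first can still take element zero.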

> ... What would people prefer?

Merged sequences notwithstanding, I would prefer subclassing as Bio::SeqIO::fasta::ncbi.

Hilmar elsewhere argues against rolling this into BioPerl with the argument:

>>A little SeqProcessor module or a 
>>whole collection of them would basically have the same status as a 
>>contributed script - use at your own risk, consider it as a starter 
>>only, you may need to tweak it, but maybe it works out of the box as it 
>>did for me.

I think this is a healthy attitude to maintain toward all of BioPerl.  And adding NCBI defline parsing into the base install would be a blessing to many and, if done well, would be a good model for how to handle the next syntax that comes rolling in the door....


My 2 cents...

Malcolm Cook


> 
> Peter W.
> 
> p.s. sorry for mentioning the p word ;-)
> 
> 
> 
> At 04:07 PM 06/06/2003 -0400, Aaron J Mackey wrote:
> 
> >On Thu, 5 Jun 2003, Hilmar Lapp wrote:
> >
> > > >  Is there ever a case where
> > > > Bio::SeqIO::fasta will parse a sequence header like :
> > > >
> > > >> gi|30419336|gb|CD037498.1|CD037498 mgsu014xP21f.b Magnaporthe grisea
> > > >
> > > > and read the namespace, accession, version etc from it?
> > >
> > > No. Bioperl itself does not interpret the identifier token, especially
> > > given the fact that there are plenty of ways in which people convolute
> > > information here, and that it is relatively simple to apply whatever
> > > extraction is suitable in 1 or 2 lines of perl.
> >
> >It seems like we hear this request a lot; I think it's an (almost) valid
> >newbie expectation that somehow the "gi|123456|db|acc.v|name descr" fasta
> >header is some kind of universal standard.  I agree that it's an easy
> >couple of lines of Perl to get right, but maybe we should be trying to do
> >this for people?  It seems like such an easy thing (I'll wait to be told
> >otherwise ...).  I agree that there are lots of bastardizations out there
> >in the wild, but for each of the databases that NCBI "outputs" (gb, emb,
> >dbj, ref, pir, sp, pdb, etc), there's pretty consistent behaviour for the
> >fields.
> >
> >It should make loading up biosql databases from flatfiles a bit easier,
> >too.
> >
> >Any lurkers want to write Bio::SeqIO::fasta_ncbi.pm (inheriting from
> >Bio::SeqIO::fasta) ??  I guess we'd have to agree on where the "db" and
> >any secondary accession/names would be stored in which Seq model ...
> >
> >-Aaron
> >
> >
> >_______________________________________________
> >Bioperl-l mailing list
> >Bioperl-l at portal.open-bio.org
> >http://portal.open-bio.org/mailman/listinfo/bioperl-l
> 
> 
> -------------------------------------
> Peter Wilkinson
> Bioinformatics Consultant
> 
> -------------------------------------  
> 
> 
> 


