NCBI fasta format [was: Re: [Bioperl-l] loading data into bioperl-db]

Tue Jun 10 00:46:33 EDT 2003

>
>What however should we be taking as the spec for ncbi deflines that you 
>are going to parse?
>
>How about the one appearing in: ftp://ftp.ncbi.nih.gov/blast/db/README ?
>
>which, among other things, specifies this 'sequence identifier syntax, 
>dependent upon the source database:
>
>   Database Name                     Identifier Syntax
>   ============================      ========================
>   GenBank                           gb|accession|locus
>   EMBL Data Library                 emb|accession|locus
>   DDBJ, DNA Database of Japan       dbj|accession|locus
>   NBRF PIR                          pir||entry
>   Protein Research Foundation       prf||name
>   SWISS-PROT                        sp|accession|entry name
>   Brookhaven Protein Data Bank      pdb|entry|chain
>   Patents                           pat|country|number
>   GenInfo Backbone Id               bbs|number
>   General database identifier       gnl|database|identifier
>   NCBI Reference Sequence           ref|accession|locus
>   Local Sequence identifier         lcl|identifier
>
>conceptual problem for me though:  identical sequences can get 'merged' in 
>ncbi nr, resulting in MULTIPLE concatenated deflines (separated by 
>control-a).  What to do here?  Just use the first?  Hmmm.

Yes, that NCBI doc should be what it's based on. And yes, the merged 
deflines are separated by a control character (control-A, as you say). I am 
not sure what we should do about that list; I cannot see any immediate use 
for keeping the list in the sequence. Perhaps as a first implementation we 
will just drop the list and keep the first annotation.
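
As a rough sketch of that first implementation (written in Python purely for 
illustration; this is not Bioperl code, and the function names are made up 
for the example), splitting a merged header on control-A and keeping only 
the first defline could look like this:

```python
CTRL_A = "\x01"  # control-A, the separator between merged deflines in nr


def split_merged_defline(header):
    """Split one FASTA header line into its individual deflines.

    `header` is the whole header line, with or without the leading '>'.
    Empty pieces (e.g. from a trailing control-A) are dropped.
    """
    if header.startswith(">"):
        header = header[1:]
    return [part.strip() for part in header.split(CTRL_A) if part.strip()]


def first_defline(header):
    """Return only the first defline, discarding the rest of the list."""
    parts = split_merged_defline(header)
    return parts[0] if parts else ""
```

Since the split already produces the full list, keeping it around later 
(for example as extra annotation) would only mean returning 
`split_merged_defline(header)` instead of its first element.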

Can anyone think of a pressing use for the list of definitions?

> > ... What would people prefer?
>
>Merged sequences notwithstanding, I would prefer subclassing as 
>Bio::SeqIO::fasta::ncbi.
>
>Hilmar elsewhere argues against rolling this into BioPerl with the 
>argument:
>
> >>A little SeqProcessor module or a
> >>whole collection of them would basically have the same status as a
> >>contributed script - use at your own risk, consider it as a starter
> >>only, you may need to tweak it, but maybe it works out of the box as it
> >>did for me.

Well, I think that it should be subclassed as well.
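
To make the subclassing idea concrete, here is a minimal sketch in Python 
(only an analogy; it is not the Bio::SeqIO API, and `FastaReader`, 
`NcbiFastaReader`, `parse_header` and `ID_FIELDS` are hypothetical names). 
The generic reader handles the record structure, and the NCBI subclass 
overrides only the header parsing, using the identifier table from the 
README quoted above and keeping just the first of any merged deflines:

```python
# Field names per identifier tag, following the README table.
# None marks a slot the table leaves empty (e.g. pir||entry).
ID_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry_name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


class FastaReader:
    """Generic FASTA reader: yields (header_info, sequence) per record."""

    def __init__(self, lines):
        self.lines = lines  # any iterable of text lines

    def parse_header(self, header):
        # Plain FASTA: first word is the id, the rest is the description.
        ident, _, desc = header.partition(" ")
        return {"id": ident, "description": desc.strip()}

    def __iter__(self):
        header, chunks = None, []
        for line in self.lines:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield self.parse_header(header), "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
        if header is not None:
            yield self.parse_header(header), "".join(chunks)


class NcbiFastaReader(FastaReader):
    """NCBI flavour: only the header parsing differs from the parent."""

    def parse_header(self, header):
        first = header.split("\x01")[0]  # keep the first merged defline
        info = super().parse_header(first)
        info["ids"] = {}
        tokens = info["id"].split("|")
        i = 0
        while i < len(tokens):
            fields = ID_FIELDS.get(tokens[i])
            if fields is None:
                i += 1  # token is not a tag from the table; skip it
                continue
            values = tokens[i + 1 : i + 1 + len(fields)]
            info["ids"][tokens[i]] = {
                name: value for name, value in zip(fields, values) if name and value
            }
            i += 1 + len(fields)
        return info
```

Iterating over `NcbiFastaReader(open_file)` then gives the parsed 
identifiers under `info["ids"]`, while `FastaReader` keeps working 
unchanged for plain FASTA files.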

Peter W.