NCBI fasta format [was: Re: [Bioperl-l] loading data
pwilk at videotron.ca
Tue Jun 10 00:46:33 EDT 2003
>What however should we be taking as the spec for ncbi deflines that you
>are going to parse?
>How about the one appearing in: ftp://ftp.ncbi.nih.gov/blast/db/README ?
>which, among other things, specifies this 'sequence identifier syntax,
>dependent upon the source database:
> Database Name                   Identifier Syntax
> ============================    ========================
> GenBank                         gb|accession|locus
> EMBL Data Library               emb|accession|locus
> DDBJ, DNA Database of Japan     dbj|accession|locus
> NBRF PIR                        pir||entry
> Protein Research Foundation     prf||name
> SWISS-PROT                      sp|accession|entry name
> Brookhaven Protein Data Bank    pdb|entry|chain
> Patents                         pat|country|number
> GenInfo Backbone Id             bbs|number
> General database identifier     gnl|database|identifier
> NCBI Reference Sequence         ref|accession|locus
> Local Sequence identifier       lcl|identifier
>conceptual problem for me though: identical sequences can get 'merged' in
>ncbi nr, resulting in MULTIPLE concatenated deflines (separated by
>control-a). What to do here? Just use the first? Hmmm.
Yes, that NCBI doc should be what it's based on. And yes, the merged deflines
are separated by a control character (control-A). I am not sure what we should
do about that list ... I cannot see any immediate use for keeping the list in
the sequence. Perhaps as a first implementation we will just drop the list and
keep the first annotation.
Can anyone think of a pressing use for the list of definitions?
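To make the "keep the first annotation" idea concrete, here is a rough sketch
in Python (not BioPerl code, and not a proposed implementation). It assumes the
merged deflines are joined by control-A (\x01) and that each defline starts
with one identifier from the README table quoted above. The names
split_defline, parse_identifier and FIELDS are made up for illustration, and
the accessions in the usage lines are invented.

```python
CTRL_A = "\x01"  # separator between merged deflines in nr

# Field names per database tag, taken from the README table quoted above.
# An empty name marks a position the table leaves blank (e.g. pir||entry).
FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": ("", "entry"),
    "prf": ("", "name"),
    "sp": ("accession", "entry_name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def split_defline(header):
    """Split a possibly merged FASTA header into its individual deflines."""
    return header.lstrip(">").split(CTRL_A)


def parse_identifier(defline):
    """Return (tag, fields, description) for a single defline."""
    ident, _, description = defline.partition(" ")
    parts = ident.split("|")
    tag, values = parts[0], parts[1:]
    names = FIELDS.get(tag)
    if names is None:
        raise ValueError(f"unknown database tag: {tag!r}")
    # Pair each value with its field name, skipping the blank positions.
    fields = {name: value for name, value in zip(names, values) if name}
    return tag, fields, description


# Usage: keep only the first defline of a merged header (invented example data).
header = ">gb|AAA12345|LOCUS1 first protein" + CTRL_A + "sp|P12345|ENTRY_HUMAN second protein"
first = split_defline(header)[0]
tag, fields, description = parse_identifier(first)
```

Keeping the whole list would just mean mapping parse_identifier over every
element of split_defline(header) instead of taking element 0. The sketch only
handles the tags in the table; a real nr defline often carries a gi|number
prefix in front of these, which would need extra handling.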
> > ... What would people prefer?
>Merged sequences notwithstanding, I would prefer subclassing, as
>Hilmar elsewhere argues against rolling this into BioPerl with the
> >>A little SeqProcessor module or a
> >>whole collection of them would basically have the same status as a
> >>contributed script - use at your own risk, consider it as a starter
> >>only, you may need to tweak it, but maybe it works out of the box as it
> >>did for me.
Well, I think that it should be subclassed as well.