use of full genbank style ids

Fri Dec 15 09:50:48 UTC 2000

Steve Roels wrote:
> 
> Anyone know if there is a way to force the use of full genbank-style sequence identifiers
> in output files?
> 
> >gi|12345|gb|AC00123.4|HUDDR
> GGCGCGCCG...
> 
> The id used is either "gi" (if fasta format is specified - or no format is specified) or
> "gb|AC00123.4|HUDDR" (if ncbi format is specified).
> 
> In short (and to be more general), what I want is to have everything up to the first
> white-space (i.e. including vertical bars, colons, etc.) to be treated as the id.

This is possible by defining a new format, and it is not a big code change.

'GenBank' is of course the 'wrong' name, as we already use that for the CODATA
format GENBANK files. What you describe is really NCBI's blast version of the
FASTA format. Any suggestions for a format name?

You want to include the GI number in parsing, and also pick up the sequence
version rather than just the accession number.
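
As a rough illustration of what such a format would have to pick out of the
header line, here is a Python sketch. It is not EMBOSS code: the function
name and the dictionary keys are made up for the example, and it assumes the
five-field layout gi|number|db|accession.version|name shown in the quoted
message.

```python
def parse_ncbi_fasta_header(line):
    """Split an NCBI-style FASTA header line into its parts (illustrative only)."""
    # Everything up to the first white-space is treated as the full identifier.
    parts = line.strip().lstrip(">").split(None, 1)
    if not parts:
        return {"full_id": "", "description": ""}
    full_id = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    record = {"full_id": full_id, "description": description}

    # Assumed layout: gi|<number>|<db>|<accession.version>|<name>
    fields = full_id.split("|")
    if len(fields) >= 5 and fields[0] == "gi":
        record["gi"] = fields[1]
        record["db"] = fields[2]
        record["seqversion"] = fields[3]
        # The plain accession is the sequence version without its ".N" suffix.
        record["accession"] = fields[3].split(".", 1)[0]
        record["name"] = fields[4]
    return record


record = parse_ncbi_fasta_header(">gi|12345|gb|AC00123.4|HUDDR")
```

Here record keeps both the full identifier and the separate GI number,
sequence version and accession, so either could be written back out.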

One question: what would you like to use as a filename on output?
Unix will not be happy with those '|' characters in a filename, so we
would normally trim it back to the ID at the end.
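
A minimal Python sketch of that trimming, assuming the ID we want is simply
the last non-empty field between the '|' characters. The function name is
hypothetical and this is not how EMBOSS itself builds output filenames.

```python
def output_basename(full_id):
    """Return the trailing '|'-separated field of an identifier (illustrative only)."""
    # Drop empty fields so that a trailing '|' does not yield an empty name.
    fields = [field for field in full_id.split("|") if field]
    return fields[-1] if fields else ""


basename = output_basename("gi|12345|gb|AC00123.4|HUDDR")
```

An identifier without any '|' characters is returned unchanged, so existing
plain IDs would not be affected by this rule.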

I guess the '|' could also break future extensions to the USA syntax if you
plan to use these IDs in USAs. We already use '|' at the end to pipe
application output (perl style). We could, in theory, use SRS syntax
to offer alternative IDs or accession numbers. For example, SRS accepts
swissprot-id:amic_ecoli|amic_pseae|amic_strpn and returns 3 entries.
Any takers for this syntax in EMBOSS?
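
To make the SRS-style idea concrete, a small Python sketch of how such a
query could be split into a field name and a list of alternative IDs. This
only illustrates the syntax in the example above. It is not SRS or EMBOSS
code, and the function name is invented for the purpose.

```python
def split_alternatives(query):
    """Split 'field:id1|id2|...' into (field, [id1, id2, ...]) (illustrative only)."""
    # The field name is everything before the first ':'.
    field, _, values = query.partition(":")
    # Each '|' separates one alternative; empty pieces are dropped.
    alternatives = [value for value in values.split("|") if value]
    return field, alternatives


field, ids = split_alternatives("swissprot-id:amic_ecoli|amic_pseae|amic_strpn")
```

Note that this sketch silently drops a trailing '|', which is exactly where
the syntax would collide with using '|' at the end to pipe application
output.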

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723