use of full genbank style ids

Fri Dec 15 15:07:44 UTC 2000

Peter et al,

Thanks for the quick reply.  A few more comments follow.

Peter Rice wrote:
> 
> Steve Roels wrote:
> >
> > Anyone know if there is a way to force the use of full genbank-style sequence identifiers
> > in output files?
> >
> > >gi|12345|gb|AC00123.4|HUDDR
> > GGCGCGCCG...
> >
> > The id used is either "gi" (if fasta format is specified - or no format is specified) or
> > "gb|AC00123.4|HUDDR" (if ncbi format is specified).
> >
> > In short (and to be more general), what I want is to have everything up to the first
> > white-space (i.e. including vertical bars,colons,etc) to be treated as the id.
> 
> Possible - by defining a new format. Not a big code change.
> 
> 'GenBank' is of course the 'wrong' name - as we use that for the CODATA
> format GENBANK files. This is really NCBI's blast version of the FASTA
> format. Any suggestions for a format name?

Yes - I should have been more specific.  I meant NCBI's style for fasta header ids.  

Generally:   databasetag|id(|possiblymorealiasesorids)
         and (to be more general still, the ":" equivalent)
             databasetag:id(:possiblymorealiasesorids)

Note too that it could be treated as a two-step process:

(1) To handle ids most generally, have a format that grabs everything up to the first
white space as the id, including delimiters like "|" and ":".  You need not try to extract
the elements of the id (gid, accession, version, locus for genbank fasta ids, for example).
Here you leave interpretation of the full id to the user - you just provide a way to pass
ids intact from the input file to the output file. This would work for those
(hypothetically speaking :) ) who use the ">dbtag|id" or ">dbtag:id" structure for their
own data. As things stand, even if I converted an id like "mydb|12345" to "mydb:12345" and
specified fasta format, the dbtag is dropped in the output, rendering the id in the output
potentially ambiguous.

(2) If desired (I don't really need it), expand on #1 (as the "ncbi" format seems to try to
do) so that, for example, "sp|P12345" would be recognized as a swissprot entry with
accession P12345, and "gi|12345|gb|AC00123.4|HUDDR" would be recognized as a genbank entry
with gid 12345, accession AC00123, version 4, and locus HUDDR. The parsed accessions, locus
names, gids, etc. could then be used in some contexts. But even here, I'd like the option of
putting the full id (including database/source tag) in the output.
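
To make the two steps concrete, here is a rough sketch in Python, purely for
illustration. It is not EMBOSS code: the function names split_header and parse_ncbi_id
are made up, and the field layout in step 2 is assumed from the two examples above
("sp|P12345" and "gi|12345|gb|AC00123.4|HUDDR") and nothing more.

```python
from typing import Dict, Tuple


def split_header(line: str) -> Tuple[str, str]:
    """Step 1: treat everything up to the first whitespace as the id.

    Delimiters such as '|' and ':' are left untouched, so the id can be
    passed intact from input to output.
    """
    body = line[1:] if line.startswith(">") else line
    parts = body.strip().split(None, 1)
    full_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return full_id, description


def parse_ncbi_id(full_id: str) -> Dict[str, str]:
    """Step 2 (optional): pick out fields, assuming the layouts shown above.

    The full id is always kept, so the database/source tag is never lost.
    Ids that do not match the assumed '|' layout are returned unparsed.
    """
    parsed = {"full_id": full_id}
    fields = full_id.split("|")
    # A leading "gi|<number>" pair, if present, comes before the db tag.
    if fields[0] == "gi" and len(fields) >= 2:
        parsed["gi"] = fields[1]
        fields = fields[2:]
    if len(fields) >= 2:
        parsed["dbtag"] = fields[0]
        accession, _, version = fields[1].partition(".")
        parsed["accession"] = accession
        if version:
            parsed["version"] = version
        if len(fields) >= 3 and fields[2]:
            parsed["locus"] = fields[2]
    return parsed
```

With this, split_header(">gi|12345|gb|AC00123.4|HUDDR some text") keeps
"gi|12345|gb|AC00123.4|HUDDR" whole, and an id like "mydb:12345" simply passes through
parse_ncbi_id with only the full id recorded.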

> 
> You want to include the GI number in parsing, and also pick up the sequence
> version rather than just the accession number.
> 
> One question would be - what would you like to use as a filename on output?
> Unix will not be happy with  those '|' characters in a filename, so we
> would normally trim it back to the ID at the end.

I agree this is a problem with the use of "|" as a delimiter.  In naming files, we've
typically resorted to substituting another character (e.g. "_") for "|" to generate files
like:

gi_12345_gb_AC00123.4_HUDDR.output

or more simply (since the aliases/alt_ids are not really needed):

gi_12345.output

The key is to have the full id in the output file, and a unique name for the output file. 
Using just 12345.output for the file, or "12345" in that output file, could be problematic
if you have a sequence with id 12345 from more than one database/source. 
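
The substitution is trivial, but for the record, a minimal sketch in Python (the function
name output_filename and the ".output" suffix default are just for illustration; also
replacing ":" is my own assumption, to cover the "dbtag:id" form as well):

```python
import re


def output_filename(full_id: str, suffix: str = ".output", short: bool = False) -> str:
    """Build a file name from a full id by replacing '|' and ':' with '_'.

    With short=True only the first two fields (database tag and id) are
    kept, which is enough to stay unique across databases/sources.
    """
    fields = [f for f in re.split(r"[|:]", full_id) if f]
    if short:
        fields = fields[:2]
    return "_".join(fields) + suffix
```

So "gi|12345|gb|AC00123.4|HUDDR" gives gi_12345_gb_AC00123.4_HUDDR.output, or
gi_12345.output with short=True, while the id written inside the file stays unchanged.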

> 
> I guess the '|' could also break future extensions to the USA syntax if you
> plan to use these IDs in USAs. We already use '|' at the end to pipe
> application output (perl style). We could, in theory, use SRS syntax
> to offer alternative IDs or accession numbers. For example, SRS accepts
> swissprot-id:amic_ecoli|amic_pseae|amic_strpn and returns 3 entries.
> Any takers for this syntax in EMBOSS?

Obviously, I wouldn't want to wreak havoc with the large body of code already in place or
envisioned.  But if it's a simple new format...

Thanks again,

-Steve

> 
> --
> ------------------------------------------------
> Peter Rice, LION Bioscience Ltd, Cambridge, UK
> peter.rice at uk.lionbioscience.com +44 1223 224723

-- 

*****************************************************************
Steve Roels, Ph.D.                       
Scientist - Computational Biology        Phone: (617) 761-6820
Millennium Pharmaceuticals, Inc.         FAX:   (617) 577-3568
640 Memorial Drive                       Email: roels at mpi.com
Cambridge, MA 02139-4815
*****************************************************************