[EMBOSS] IDs in output

Peter Rice pmr at ebi.ac.uk
Fri Nov 3 13:31:40 UTC 2006


Bernd Web wrote:
> Hi Peter,
> 
> Although I copy pasted, indeed the defline was wrong. It should have been:
> 
> >gi|248166|gb|AAB21972.1| invertase {EC 3.2.1.26} [baker's yeast,
> Peptide Partial, 6 aa, segment 10 of 12]
> ATNTTL
> 
> EMBOSS extracts "AAB21972.1".
> Having the version number is OK since otherwise the sequence is not
> completely defined (AAB21972 could refer to multiple versions).

If you specify -osformat ncbi you should be able to recreate the original 
defline in the EMBOSS output.

> My idea was more related to selecting the GI number as ID to use in
> EMBOSS applications. Now the accession number depends on the format of
> the defline:
> sp ->  Entry Name (not primary accession)

If there is an Entry name EMBOSS will use it.

> ref, emb, gb -> Accession

But EMBL and GenBank now define this as the entry name anyway.

> pdb -> PDB protein name with Chain concatenated to it.

That seems good to me, although we know of a problem when there are more than 
26 chains and the "-a" suffix comes round again.
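
To make the mapping above concrete, here is a minimal Python sketch of the 
rules as they are described in this thread. It is not the EMBOSS parser: the 
function names are made up for illustration, the field layout assumed for 
"sp" and "pdb" IDs (accession|entry name and entry|chain) is an assumption, 
and the plain concatenation used for PDB chains may not match the exact form 
EMBOSS produces.

```python
def split_ncbi_id(defline):
    """Return (gi, db, first, second) from the ID token of an NCBI-style defline."""
    # The ID is the first whitespace-delimited token, without the leading ">".
    stripped = defline.lstrip(">").strip()
    token = stripped.split(None, 1)[0] if stripped else ""
    fields = token.split("|")
    gi = None
    if len(fields) >= 2 and fields[0] == "gi":
        gi = fields[1]
        fields = fields[2:]
    # Pad so that db, first and second always exist.
    db, first, second = (fields + ["", "", ""])[:3]
    return gi, db, first, second


def choose_id(defline):
    """Pick an ID following the rules discussed in the thread (illustrative only)."""
    gi, db, first, second = split_ncbi_id(defline)
    if db == "sp":
        # Assumed layout sp|accession|entry_name: prefer the entry name.
        return second or first
    if db == "pdb":
        # Assumed layout pdb|entry|chain: concatenate the chain to the name.
        return first + second
    if db in ("ref", "emb", "gb"):
        # Accession, including its version suffix if present.
        return first
    # Anything else: fall back to whatever is available.
    return first or db or gi


# The defline quoted earlier in this message:
print(choose_id(">gi|248166|gb|AAB21972.1| invertase {EC 3.2.1.26}"))  # AAB21972.1
```

The GI number is still returned by split_ncbi_id, so a script can choose 
either name without parsing the defline twice.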

> Although I wrote a script to map the names from NCBI deflines to
> EMBOSS names, it could be easy to have the option to use the GI
> number.

Hmmm ... in EMBOSS terms, this counts as yet another sequence format. We could 
make a new output format (-osformat gifasta, for example) that uses the GI as the 
ID. However, it would still use the original sequence name as the filename the 
first time around; only when you read that file back in would it start using the 
GI number as the filename.

We could also make "gifasta" an input format (-sformat gifasta) so that it uses 
the GI number on reading. You would then have to specify the -sformat on the 
command line (or give gifasta::filename as input) because EMBOSS has to choose 
which way to interpret the defline. Does that solve your problem?
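
Until something like that exists, the same effect can be had by rewriting the 
deflines before EMBOSS reads the file. The sketch below is one assumed way of 
doing it in Python, not a description of how a "gifasta" format would be 
implemented: the function names are hypothetical, and keeping the rest of the 
NCBI ID in the description is simply a choice made here so that no information 
is lost.

```python
import re

# Matches ">gi|<digits>|<rest of the ID> <description>".
GI_PATTERN = re.compile(r"^>\s*gi\|(\d+)\|(\S*)\s*(.*)$")


def use_gi_as_id(line):
    """Rewrite one defline so that the GI number becomes the ID."""
    match = GI_PATTERN.match(line)
    if match is None:
        # Sequence lines and deflines without a GI are passed through unchanged.
        return line
    gi, other_ids, description = match.groups()
    # Keep the remaining NCBI ID fields and the description after the new ID.
    kept = " ".join(part for part in (other_ids, description) if part)
    return f">{gi} {kept}".rstrip()


def rewrite_fasta(lines):
    """Yield the lines of a FASTA file with GI numbers promoted to IDs."""
    for line in lines:
        yield use_gi_as_id(line.rstrip("\n"))


record = [
    ">gi|248166|gb|AAB21972.1| invertase {EC 3.2.1.26}\n",
    "ATNTTL\n",
]
for out in rewrite_fasta(record):
    print(out)
```

With the defline rewritten this way, the first token is just the GI number, so 
a plain FASTA reader will pick it up as the ID.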

NCBI regard the ID as the entire string with "|" characters embedded, but that 
is no use when making filenames so we had to choose something.

EMBL does not use GI numbers ... they only appear in GenBank and NCBI files. I 
never liked them, but EMBOSS does try to do whatever the users demand :-)

regards,

Peter


