[EMBOSS] How to find protein sequences in a given genome using CDS information

Wed Feb 4 09:21:37 UTC 2009

Nermin Celik wrote:
> Hi,
> 
> I have the CDS section of a feature table and a genome of an organism.
> Which EMBOSS program will allow me to extract the coding regions defined
> in the CDS file from the genome and then translate them to protein
> sequences?
> 
> Example of CDS file:
> FT   CDS             166..231
> FT                   /systematic_id="ROD00001"
> FT   CDS             313..2775
> FT                   /systematic_id="ROD00011"
> FT   CDS             2778..3707

Ah, that highlights something we meant to fix.

We have the application coderet that, in theory, will read the sequence and 
the feature table and do exactly what you want.

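Unfortunately the original author of coderet used a shortcut: it reads a 
complete sequence database entry and parses the feature table out of that 
entry itself, so it cannot take the sequence and the feature table as 
separate files. Not good.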

However, what you can do is convert your genomic sequence and feature table 
into an EMBL entry:

seqret -feature genomic.fasta -ufo embl::feature.table embl::embl.entry
coderet embl.entry

GenBank entries also work in coderet.
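
If it helps to see what the extract-and-translate step amounts to, here is 
a minimal Python sketch. It is not how coderet is implemented, only an 
illustration under stated assumptions: it handles only simple forward-strand 
locations of the form start..end (no complement() or join(), which are 
silently skipped), it uses the standard genetic code, and the function names 
and the toy sequence are made up for the example.

```python
import re

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumption: only simple locations such as "FT   CDS   166..231" are matched.
CDS_LINE = re.compile(r"^FT\s+CDS\s+(\d+)\.\.(\d+)\s*$")


def parse_simple_cds(feature_lines):
    """Return (start, end) pairs, 1-based inclusive, for simple CDS lines."""
    ranges = []
    for line in feature_lines:
        match = CDS_LINE.match(line)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges


def translate(dna):
    """Translate DNA codon by codon; unrecognised codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def cds_proteins(genome, feature_lines):
    """Slice each simple CDS out of the genome and translate it."""
    proteins = []
    for start, end in parse_simple_cds(feature_lines):
        coding = genome[start - 1:end]  # 1-based inclusive -> Python slice
        proteins.append(translate(coding))
    return proteins


if __name__ == "__main__":
    # Hypothetical 13-base toy sequence with one CDS at positions 3..11.
    toy_genome = "CC" + "ATGGCCTAA" + "GG"
    toy_table = [
        "FT   CDS             3..11",
        'FT                   /systematic_id="TOY00001"',
    ]
    print(cds_proteins(toy_genome, toy_table))
```

Run on the toy data this prints ['MA*']: the feature table coordinates are 
1-based and inclusive, so 3..11 picks out ATGGCCTAA. For real genomes with 
reverse-strand or multi-exon features, the EMBOSS route above is the one to 
use.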

We will be working on coderet to fix this and read feature data normally. 
Any other suggestions for improvements are welcome.

regards,

Peter Rice