[EMBOSS] How to find protein sequences in a given genome using CDS information

Rodrigo Lopez rls at ebi.ac.uk
Wed Feb 4 09:32:15 UTC 2009


Hi Nermin,

To complement Guy's reply: you could also use the EMBLCDS database, which
contains all CDSs in EMBL-Bank (soon to be called ENA, the European
Nucleotide Archive). It is available from the EBI's ftp server under
pub/databases/embl/cds. The identifiers in this database correspond to
the protein_id qualifier in the EMBL-Bank Feature Table, which maps each
CDS to its protein translation. These translations can in turn be
identified in UniProtKB. Please see the README.txt file at:

ftp.ebi.ac.uk/pub/databases/embl/cds/README.txt

for further details.

In addition, depending on the proteome in question, you could have a
look at the integr8 directory on the ftp server:

ftp.ebi.ac.uk/pub/databases/integr8

Here you will find the proteomes of more than 1600 organisms, mainly
bacteria and archaea, but also human, rat, mouse, etc.

R:)


Nermin Celik wrote:
> Hi,
> 
> I have the CDS section of a feature table and a genome of an organism.
> Which EMBOSS program will allow me to extract the coding regions defined
> in the CDS file from the genome and then translate them to protein
> sequences?
> 
> Example of CDS file:
> FT   CDS             166..231
> FT                   /systematic_id="ROD00001"
> FT   CDS             313..2775
> FT                   /systematic_id="ROD00011"
> FT   CDS             2778..3707
> 
> Thank you.
> Nermin
> 
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss
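
For the feature-table fragment quoted above, the extract-and-translate
step can also be sketched in a few lines of plain Python. This is not an
EMBOSS program, only an illustration under stated assumptions: the genome
is already loaded as a single plain string, every CDS location is either
start..end or complement(start..end) (join() and partial locations are
not handled), and the standard genetic code applies. The names parse_cds,
extract and translate are made up for this sketch.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (assumption: the organism uses the standard table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Matches only simple locations: 166..231 or complement(166..231).
CDS_LINE = re.compile(r"^FT\s+CDS\s+(complement\()?(\d+)\.\.(\d+)\)?\s*$")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_cds(ft_text):
    """Yield (start, end, is_reverse) for each simple CDS line."""
    for line in ft_text.splitlines():
        match = CDS_LINE.match(line)
        if match:
            yield int(match.group(2)), int(match.group(3)), match.group(1) is not None


def extract(genome, start, end, is_reverse=False):
    """Return the CDS nucleotides; coordinates are 1-based and inclusive."""
    seq = genome[start - 1:end].upper()
    if is_reverse:
        seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    return seq


def translate(cds):
    """Translate up to the first stop codon; unknown codons become X."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)
```

Typical use would be to loop over parse_cds(feature_table_text) and call
translate(extract(genome, start, end, is_reverse)) for each location.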


