[EMBOSS] Coderet

Peter Rice pmr at ebi.ac.uk
Fri Feb 20 13:13:50 UTC 2004


Hi Sean

Sean.Maceachern at dpi.vic.gov.au wrote:

> I have not done any programming in C++, so I was hoping that someone
> might be able to suggest how I can get the output from coderet to
> resemble that of NCBI's.
> 
> I think it could be done by parsing the sections in BOLD from the first few lines of the feature table.
> 
> LOCUS       NM_000367               2742 bp    mRNA    linear   PRI 31-OCT-2000
> DEFINITION  Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA.
> ACCESSION   NM_000367
> VERSION     NM_000367.1  GI:4507652
> KEYWORDS    .
> SOURCE      Homo sapiens (human)
> 
> 
> i.e.
> 
> 
>>gi|4507652 Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA
> 
> ATGGATGGTACAAGAACTTCACTTGACATTGAAGAGTACTCGGATACTGAGGTACAGAAA
> AACCAAGTACTAACTCTGGAAGAATGGCAAGACAAGTGGGTGAACGGCAAGACTGCTTTT
> 
> 
> Does anyone know if this already exists as a coderet option, or how I would
> be able to modify this in the original script?

It is possible. It would be nice to have a description in the output, but 
we have to be careful where we take it from.

This is a simple REFSEQ entry with only one CDS, but coderet has to also 
work on large bacterial genome contigs.

This means the only thing we can safely take from the top of the entry 
for every CDS is the taxonomy.

However, this CDS does have information in the (true) feature table:

CDS             66..803
                 /gene="TPMT"
                 /EC_number="2.1.1.67"
                 /codon_start=1
                 /product="thiopurine S-methyltransferase"
                 /protein_id="NP_000358.1"
                 /db_xref="GI:4507653"
                 /db_xref="LocusID:7172"
                 /db_xref="MIM:187680"
                 /translation="MDGTRTSL...LYLLTEK"
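
As a rough illustration only (this is not EMBOSS code, and coderet itself 
is written in C), the qualifiers of a CDS feature like the one above could 
be collected with a few lines of Python. The name parse_qualifiers is 
hypothetical, and the sketch assumes every qualifier fits on a single 
line, which does not hold for long /translation values in real entries.

```python
import re

# Hypothetical helper, not part of EMBOSS: collect /key=value qualifiers
# from the text of one feature. Assumes one qualifier per line.
QUALIFIER = re.compile(r'^\s*/(\w+)=(.*)$')

def parse_qualifiers(feature_text):
    qualifiers = {}
    for line in feature_text.splitlines():
        match = QUALIFIER.match(line)
        if not match:
            continue  # skips the "CDS 66..803" location line
        key = match.group(1)
        value = match.group(2).strip().strip('"')
        # A qualifier such as /db_xref can repeat, so keep a list per key.
        qualifiers.setdefault(key, []).append(value)
    return qualifiers

cds = '''CDS             66..803
                 /gene="TPMT"
                 /EC_number="2.1.1.67"
                 /codon_start=1
                 /product="thiopurine S-methyltransferase"
                 /db_xref="GI:4507653"
                 /db_xref="LocusID:7172"
'''
quals = parse_qualifiers(cds)
```

Each key maps to a list because /db_xref appears more than once in this 
feature; /gene and /product would normally have a single value.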

So, for the description line you wanted:

 >gi|4507652 Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA

We can get:

"Homo sapiens" from the SOURCE line (EMBOSS already parses this)

"TPMT" from the /gene qualifier (if present)

"thiopurine S-methyltransferase" from the /product qualifier (if present)

If those qualifiers are missing, we can also try /note= or simply 
use the entry description and some CDS counter:
"CDS 1 from Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA."

The mRNA at the end is tricky ... I would much prefer to use the feature 
type (CDS) because that is what we have.

So you would have a description (in FASTA or NCBI or any other format) of:

 >NM_000367 Homo sapiens thiopurine S-methyltransferase (TPMT), CDS
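
To make the rule concrete, here is a minimal Python sketch of how such a 
description might be assembled. It is not how coderet is implemented. The 
function name cds_description and its arguments are assumptions made for 
illustration, the qualifiers are passed as a plain dict of single string 
values, and treating /note as a stand-in for a missing /product is one 
possible reading of the fallback described above.

```python
# Hypothetical sketch, not EMBOSS code: build a FASTA-style header for one CDS.
def cds_description(accession, organism, qualifiers,
                    entry_description="", cds_number=1):
    # Prefer /product; fall back to /note if /product is absent (assumption).
    product = qualifiers.get("product") or qualifiers.get("note")
    if product:
        gene = qualifiers.get("gene")
        gene_part = f" ({gene})" if gene else ""
        # End with the feature type (CDS) rather than the entry's "mRNA".
        text = f"{organism} {product}{gene_part}, CDS"
    else:
        # Last resort: entry description plus a CDS counter.
        text = f"CDS {cds_number} from {entry_description}"
    return f">{accession} {text}"

header = cds_description(
    "NM_000367",
    "Homo sapiens",
    {"gene": "TPMT", "product": "thiopurine S-methyltransferase"},
)
fallback = cds_description(
    "NM_000367",
    "Homo sapiens",
    {},
    entry_description="Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA.",
    cds_number=1,
)
```

With the TPMT qualifiers this gives the header shown above; with no 
qualifiers at all it falls back to the "CDS 1 from ..." form.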

regards,

Peter Rice



