[EMBOSS] Coderet
Peter Rice
pmr at ebi.ac.uk
Fri Feb 20 13:13:50 UTC 2004
Hi Sean
Sean.Maceachern at dpi.vic.gov.au wrote:
> I have not done any programming in C++, so I was hoping that someone
> might be able to suggest how I can get the output from coderet to
> resemble NCBI's.
>
> I think it could be done by parsing the sections in BOLD from the first few lines of the feature table.
>
> LOCUS NM_000367 2742 bp mRNA linear PRI 31-OCT-2000
> DEFINITION Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA.
> ACCESSION NM_000367
> VERSION NM_000367.1 GI:4507652
> KEYWORDS .
> SOURCE Homo sapiens (human)
>
>
> i.e.:
>
>
>>gi|4507652 Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA
>
> ATGGATGGTACAAGAACTTCACTTGACATTGAAGAGTACTCGGATACTGAGGTACAGAAA
> AACCAAGTACTAACTCTGGAAGAATGGCAAGACAAGTGGGTGAACGGCAAGACTGCTTTT
>
>
> Does anyone know if this already exists as a coderet option, or how I would
> be able to modify this in the original script?
Possible .... it would be nice to have a description in the output, but
we have to be careful where we take it from.
This is a simple RefSeq entry with only one CDS, but coderet also has to
work on large bacterial genome contigs, where the entry description does
not describe any single CDS.
This means the only thing we can safely take from the top of the entry is
the taxonomy.
However, this CDS does have information in the (true) feature table:
     CDS             66..803
                     /gene="TPMT"
                     /EC_number="2.1.1.67"
                     /codon_start=1
                     /product="thiopurine S-methyltransferase"
                     /protein_id="NP_000358.1"
                     /db_xref="GI:4507653"
                     /db_xref="LocusID:7172"
                     /db_xref="MIM:187680"
                     /translation="MDGTRTSL...LYLLTEK"
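To make the idea concrete, here is a rough Python sketch of pulling the qualifiers out of a CDS feature like the one above. This is only an illustration, not how EMBOSS reads feature tables: parse_qualifiers is a made-up helper name, and the sketch assumes each qualifier fits on one line (real entries wrap long values such as /translation over several lines).

```python
import re

# Sample CDS feature copied from the entry above (/translation omitted).
FEATURE = '''\
CDS 66..803
/gene="TPMT"
/EC_number="2.1.1.67"
/codon_start=1
/product="thiopurine S-methyltransferase"
/protein_id="NP_000358.1"
/db_xref="GI:4507653"
/db_xref="LocusID:7172"
/db_xref="MIM:187680"
'''


def parse_qualifiers(feature_text):
    """Return {qualifier: [values]} for single-line /key=value qualifiers.

    Hypothetical helper: assumes no qualifier value spans multiple lines.
    """
    qualifiers = {}
    for line in feature_text.splitlines():
        # Leading whitespace is allowed so indented feature tables also work.
        match = re.match(r'\s*/(\w+)=(.*)$', line)
        if match:
            key = match.group(1)
            value = match.group(2).strip().strip('"')
            # A qualifier such as /db_xref can occur more than once.
            qualifiers.setdefault(key, []).append(value)
    return qualifiers


quals = parse_qualifiers(FEATURE)
print(quals["gene"], quals["product"], quals["db_xref"])
```

Values are kept in lists because a qualifier such as /db_xref can appear several times in one feature.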
So, for the description line you wanted:
>gi|4507652 Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA
We can get:
"Homo sapiens" from the SOURCE line (EMBOSS already parses this)
"TPMT" from the /gene qualifier (if present)
"thiopurine S-methyltransferase" from the /product qualifier (if present)
If those qualifiers are missing, we can also try /note= or simply
use the entry description and a CDS counter:
"CDS 1 from Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA."
The mRNA at the end is tricky ... I would much prefer to use the feature
type (CDS) because that is what we have.
So you would have a description (in FASTA or NCBI or any other format) of:
>NM_000367 Homo sapiens thiopurine S-methyltransferase (TPMT), CDS
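The selection rules described above could be sketched roughly as follows in Python. This is an illustration only and not coderet code: cds_description and its parameters are hypothetical names, and the exact layout of the /note fallback is my own assumption, since only the /gene + /product form and the counter form are spelled out above.

```python
def cds_description(accession, organism, entry_description, qualifiers, cds_number):
    """Build a FASTA-style description line for one CDS.

    Hypothetical helper. `qualifiers` maps qualifier names to single
    string values, e.g. {"gene": "TPMT"}.
    """
    gene = qualifiers.get("gene")
    product = qualifiers.get("product")
    note = qualifiers.get("note")

    if product or gene:
        # Preferred: organism from SOURCE plus /product and /gene.
        parts = [organism]
        if product:
            parts.append(product)
        if gene:
            parts.append("(%s)" % gene)
        text = " ".join(parts) + ", CDS"
    elif note:
        # Assumed layout for the /note fallback.
        text = "%s %s, CDS" % (organism, note)
    else:
        # Last resort: entry description plus a CDS counter.
        text = "CDS %d from %s" % (cds_number, entry_description)
    return ">%s %s" % (accession, text)


line = cds_description(
    "NM_000367",
    "Homo sapiens",
    "Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA.",
    {"gene": "TPMT", "product": "thiopurine S-methyltransferase"},
    1,
)
print(line)
```

The feature type (CDS) is appended rather than the "mRNA" from the entry description, for the reason given above.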
regards,
Peter Rice