[EMBOSS] Coderet

Tue Jan 27 01:17:37 UTC 2004

Thanks for the replies. I had tried a number of the -osformat options
before I posted my original request. What I was hoping to find out was how
to replicate the NCBI output. The current -osformat ncbi option seems to
replicate NCBI's cds output in fasta format only and does not replicate the
descriptor line very well. Since the input usually comes from an NCBI flat
file, it would be good if the output resembled the NCBI output that is
provided on the web (cds links and eutils), so that sequences from both the
net and coderet can be combined in the same fasta file. I have not done any
programming in c++, so I was hoping that someone might be able to suggest
how I can get the output from coderet to resemble NCBI's.

I think it could be done by parsing the GI number on the VERSION line and
the DEFINITION text (shown in bold in my original message) from the first
few lines of the flat file.

LOCUS       NM_000367               2742 bp    mRNA    linear   PRI 31-OCT-2000
DEFINITION  Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA.
ACCESSION   NM_000367
VERSION     NM_000367.1  GI:4507652
KEYWORDS    .
SOURCE      Homo sapiens (human)

i.e.

>gi|4507652 Homo sapiens thiopurine S-methyltransferase (TPMT), mRNA
ATGGATGGTACAAGAACTTCACTTGACATTGAAGAGTACTCGGATACTGAGGTACAGAAA
AACCAAGTACTAACTCTGGAAGAATGGCAAGACAAGTGGGTGAACGGCAAGACTGCTTTT
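
A minimal sketch of that parsing step in Python, assuming the flat file has the header layout shown above (a VERSION line carrying "GI:<number>" and a DEFINITION that may wrap onto indented continuation lines). The function name ncbi_style_header is hypothetical and is not part of EMBOSS or coderet; it only builds the descriptor line, which would then have to replace the one coderet writes.

```python
import re


def ncbi_style_header(flatfile_text):
    """Build a '>gi|NNN definition' line from a GenBank flat-file header.

    Assumption: the header contains a VERSION line with 'GI:<digits>'
    and a DEFINITION line, as in the example in this message.
    """
    gi = None
    definition_parts = []
    in_definition = False
    for line in flatfile_text.splitlines():
        if line.startswith("DEFINITION"):
            in_definition = True
            definition_parts.append(line[len("DEFINITION"):].strip())
        elif in_definition and line.startswith(" "):
            # Indented lines directly after DEFINITION continue it.
            definition_parts.append(line.strip())
        else:
            in_definition = False
            if line.startswith("VERSION"):
                match = re.search(r"GI:(\d+)", line)
                if match:
                    gi = match.group(1)
            elif line.startswith(("FEATURES", "ORIGIN")):
                break  # the header is over, stop reading
    if gi is None:
        raise ValueError("no GI number found on a VERSION line")
    # Drop the trailing full stop that GenBank puts after the definition.
    definition = " ".join(definition_parts).rstrip(".")
    return f">gi|{gi} {definition}"
```

Given the header lines quoted above, this returns the descriptor line shown in the example. Note that it describes the whole entry, so an entry with several CDS features would give every extracted sequence the same line.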

Does anyone know if this already exists as a coderet option, or how I would
be able to modify the original code to do this?

Thanks

Sean MacEachern

David.Bauer at SCHERING.DE on 23/01/2004 06:16:00 PM

To:    henrikki.almusa at helsinki.fi
cc:    emboss at embnet.org, Sean.Maceachern at dpi.vic.gov.au

Subject:    Re: [EMBOSS] Coderet

Hi,

the problem is that -osformat ncbi with coderet creates the NCBI pipe
notation but it does not parse the GI number from the CDS feature.
I think it's a good idea to transfer more tags from the CDS feature into
the ID line of coderet.
I'm not sure if /gene, /protein_id and /product are mandatory for CDS.
But if they are there it would be nice to transfer them into the
description of the extracted cds and/or mRNA sequence.
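
A rough sketch of what transferring those tags could look like, written in Python as a post-processing step and not as a change to coderet itself. It assumes the standard flat-file column layout (feature keys starting in column 6, qualifiers in column 22) and that the GI appears as a /db_xref="GI:..." qualifier on the CDS. The names cds_qualifiers and describe_cds, and the layout of the resulting description, are made up for illustration.

```python
import re


def cds_qualifiers(flatfile_text):
    """Return one dict of qualifier lists per CDS feature (assumed layout)."""
    features = []
    current = None    # qualifiers of the CDS being read, or None
    last_key = None   # qualifier that a continuation line belongs to
    in_features = False
    for line in flatfile_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            break  # ORIGIN or another top-level keyword ends the table
        key, body = line[:21].strip(), line[21:].strip()
        if key:
            # A new feature starts; only CDS features are collected.
            current = {} if key == "CDS" else None
            if current is not None:
                features.append(current)
            last_key = None
        elif current is None or not body:
            continue
        elif body.startswith("/"):
            match = re.match(r"/(\w+)=(.*)", body)
            if match:
                last_key = match.group(1)
                current.setdefault(last_key, []).append(match.group(2).strip('"'))
            else:
                last_key = None  # flag qualifier without a value
        elif last_key:
            # Wrapped value: append to the qualifier read last.
            current[last_key][-1] += " " + body.strip('"')
    return features


def describe_cds(quals):
    """Build a descriptor line from /db_xref, /gene, /protein_id and /product."""
    gi = next((x[3:] for x in quals.get("db_xref", []) if x.startswith("GI:")), None)
    ident = f"gi|{gi}" if gi else "unknown"
    parts = [quals[k][0] for k in ("gene", "protein_id", "product") if k in quals]
    return f">{ident} " + " ".join(parts)
```

Because missing qualifiers are simply skipped, this works whether or not /gene, /protein_id and /product are present on a given CDS.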

David.

Henrikki Almusa <henrikki.almusa at helsinki.fi>
Sent by: owner-emboss at hgmp.mrc.ac.uk
23.01.04 07:36

To:    Sean.Maceachern at dpi.vic.gov.au
cc:    emboss at embnet.org

Subject:    Re: [EMBOSS] Coderet

On Friday 23 January 2004 07:42, Sean.Maceachern at dpi.vic.gov.au wrote:
> Hello,
>
> I am trying to use coderet to extract cds from some genbank flat files. I
> am running into a problem regarding the descriptor line in the output
> fasta files.
>
> eg)
>
> >nm_000367_cds_1
> ATGGATGGTACAAGAACTTCACTTGACATTGAAGAGTACTCGGATACTGAGGTACAGAAA
> AACCAAGTACTAACTCTGGAAGAATGGCAAGACAAGTGGGTGAACGGCAAGACTGCTTTT
>
> I was hoping someone would be able to tell me how I can change the
> descriptor line from the generic output above (nm_000367_cds_1) to
> include the GI : ID from the flat file? I also think it would be a good
> idea if the id could be followed by a definition line to make the output
> more closely resemble the output from NCBI.

You can change the sequence format with the -osformat option (in all emboss
programs which output sequences). Probably the right format is "ncbi". If
it isn't, read the page on the emboss web site under User Documentation ->
Sequence format. That lists all available formats.

Here to help,
--
Henrikki Almusa