[EMBOSS] Compseq DNA/Protein sequence problem

Bernd Web bernd.web at gmail.com
Thu May 17 10:32:38 UTC 2007


Hi,

Regarding compseq, I wonder how to count only the words in the first
reading frame, i.e. non-overlapping words starting at the first base.
For words of length 2, the -frame value can be 0, 1 or 2.
With "AGAGAG" as the sequence and frame 1, the result is 2 times GA.
Frame 2 gives 2 times AG.
But how do I get a count of 3 times AG only? Frame 0 returns a count
of 3 for AG, but also a count of 2 for GA.
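
To make the desired result concrete, here is a minimal Python sketch of the two
counting modes. It is not compseq's implementation: `count_words` and its
`offset` parameter are hypothetical names chosen for illustration, and `offset`
is a 0-based start position, not necessarily what compseq means by -frame.

```python
from collections import Counter


def count_words(seq, word_size, offset=None):
    """Count words of length word_size in seq.

    offset=None: slide one residue at a time (overlapping words).
    offset=k:    start at 0-based position k and step by word_size
                 (non-overlapping words in a single frame).
    """
    start = 0 if offset is None else offset
    step = 1 if offset is None else word_size
    last_start = len(seq) - word_size
    return Counter(
        seq[i:i + word_size] for i in range(start, last_start + 1, step)
    )


seq = "AGAGAG"
overlapping = count_words(seq, 2)          # AG: 3, GA: 2
in_frame = count_words(seq, 2, offset=0)   # AG: 3
shifted = count_words(seq, 2, offset=1)    # GA: 2
```

The overlapping mode reproduces the frame 0 result reported above (3 AG and
2 GA), while `offset=0` gives the 3 AG only that the question asks for.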

I used EMBOSS version 4.1.0 over the web, via EMBOSS Explorer.

regards,
bernd

On 4/23/07, Bernd Web <bernd.web at gmail.com> wrote:
> Hi Anette,
>
> Your Seq1 is being guessed, incorrectly, to be a nucleotide sequence,
> although it is protein. EMBOSS provides boolean qualifiers to state whether
> a sequence is nucleotide or protein; see the EMBOSS help:
>
>  "-sequence" associated qualifiers
>  -snucleotide1       boolean    Sequence is nucleotide
>  -sprotein1          boolean    Sequence is protein
>
> regards,
> bernd
>
> On 4/23/07, Becher, Anette <anette.becher at agresearch.co.nz> wrote:
> > Hi all,
> >
> > I believe I *may* have found a bug in compseq.
> >
> > I have been using compseq to calculate the frequency of amino acids in
> > translated DNA sequences. I find that frequently compseq takes the amino
> > acid sequence to be DNA (they are sequences with an unusual composition,
> > but then I am looking for odd proteins). So instead of the expected
> > output for all amino acids with most being zero, I often get output for
> > A,C,G,T and 'other'. I cannot see an obvious pattern that would explain
> > this behaviour, but maybe you can help.
> >
> > Command line:
> >
> > compseq -seq compseq_bug.in -word 1 -frame 1 -out compseq_bug.out
> >
> > An example input and output file are pasted in below - I can provide
> > many more.
> >
> > It might help if the user could specify whether the input sequence is
> > DNA or protein, rather than the program working it out somehow?
> >
> >
> > Best wishes
> >
> >
> > Anette
> >
> >
> >
> > Here is an example of the problem:
> >
> >
> > >Seq1
> > GSGGGGGSGGRGMGGWGGGRGSGVGGRGWGVG
> >
> >
> > #
> > # Output from 'compseq'
> > #
> > # Only words in frame 1 will be counted.
> > # The Expected frequencies are calculated on the (false) assumption that every
> > # word has equal frequency.
> > #
> > # The input sequences are:
> > #       Seq1
> >
> >
> > Word size       1
> > Total count     31
> >
> > #
> > # Word  Obs Count       Obs Frequency   Exp Frequency   Obs/Exp Frequency
> > #
> > A       0               0.0000000       0.2500000       0.0000000
> > C       0               0.0000000       0.2500000       0.0000000
> > G       20              0.6451613       0.2500000       2.5806452
> > T       0               0.0000000       0.2500000       0.0000000
> >
> > Other   11              0.3548387       0.0000000       10000000000.0000000
> >
> >
> >
> >
> > Here is a similar sequence that works fine:
> >
> >
> > >Seq2
> > VGSEGGGGGRRGEGGGGGGRGGGGGRWEEGAG
> >
> >
> >
> > #
> > # Output from 'compseq'
> > #
> > # Only words in frame 1 will be counted.
> > # The Expected frequencies are calculated on the (false) assumption that every
> > # word has equal frequency.
> > #
> > # The input sequences are:
> > #       Seq2
> >
> >
> > Word size       1
> > Total count     31
> >
> > #
> > # Word  Obs Count       Obs Frequency   Exp Frequency   Obs/Exp Frequency
> > #
> > A       1               0.0322581       0.0476190       0.6774194
> > C       0               0.0000000       0.0476190       0.0000000
> > D       0               0.0000000       0.0476190       0.0000000
> > E       4               0.1290323       0.0476190       2.7096774
> > F       0               0.0000000       0.0476190       0.0000000
> > G       20              0.6451613       0.0476190       13.5483871
> > H       0               0.0000000       0.0476190       0.0000000
> > I       0               0.0000000       0.0476190       0.0000000
> > K       0               0.0000000       0.0476190       0.0000000
> > L       0               0.0000000       0.0476190       0.0000000
> > M       0               0.0000000       0.0476190       0.0000000
> > N       0               0.0000000       0.0476190       0.0000000
> > P       0               0.0000000       0.0476190       0.0000000
> > Q       0               0.0000000       0.0476190       0.0000000
> > R       4               0.1290323       0.0476190       2.7096774
> > S       1               0.0322581       0.0476190       0.6774194
> > T       0               0.0000000       0.0476190       0.0000000
> > U       0               0.0000000       0.0476190       0.0000000
> > V       0               0.0000000       0.0476190       0.0000000
> > W       1               0.0322581       0.0476190       0.6774194
> > Y       0               0.0000000       0.0476190       0.0000000
> >
> > _______________________________________________
> > EMBOSS mailing list
> > EMBOSS at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/emboss
> >
>
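
A possible explanation for the difference between the quoted Seq1 and Seq2,
offered as an observation and not as a description of EMBOSS's actual
type-detection code (which is not shown in this thread): every letter in Seq1
(G, S, R, M, W, V) is also a valid IUPAC nucleotide code, whereas Seq2 contains
E, which is not. The Python sketch below checks this; `non_nucleotide_letters`
is a hypothetical helper, not an EMBOSS function.

```python
# IUPAC nucleotide codes: the four bases, U, and the ambiguity symbols.
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")


def non_nucleotide_letters(seq):
    """Return the letters in seq that are not IUPAC nucleotide codes."""
    return set(seq.upper()) - IUPAC_NUCLEOTIDE


seq1 = "GSGGGGGSGGRGMGGWGGGRGSGVGGRGWGVG"
seq2 = "VGSEGGGGGRRGEGGGGGGRGGGGGRWEEGAG"

print(sorted(non_nucleotide_letters(seq1)))  # [] : nothing rules out DNA
print(sorted(non_nucleotide_letters(seq2)))  # ['E'] : E is not a nucleotide code
```

For sequences like Seq1 that consist only of such letters, stating the type
explicitly with -sprotein1, as suggested in the reply quoted above, avoids
relying on the guess.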


