[EMBOSS] Compseq DNA/Protein sequence problem
Becher, Anette
anette.becher at agresearch.co.nz
Mon Apr 23 20:54:09 UTC 2007
Hi all,
I believe I *may* have found a bug in compseq.
I have been using compseq to calculate the frequency of amino acids in
translated DNA sequences. Compseq frequently treats the amino acid
sequence as DNA (these are sequences with an unusual composition, but
then I am looking for odd proteins). So instead of the expected output
for all amino acids, with most counts being zero, I often get output for
A, C, G, T and 'Other'. I cannot see an obvious pattern that would
explain this behaviour, but maybe you can help.
Command line:
compseq -seq compseq_bug.in -word 1 -frame 1 -out compseq_bug.out
An example input and output file are pasted in below - I can provide
many more.
It might help if the user could specify whether the input sequence is
DNA or protein, rather than the program working it out somehow?
Best wishes
Anette
Here is an example of the problem:
>Seq1
GSGGGGGSGGRGMGGWGGGRGSGVGGRGWGVG
#
# Output from 'compseq'
#
# Only words in frame 1 will be counted.
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
# Seq1
Word size 1
Total count 31
#
# Word   Obs Count   Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
A        0           0.0000000       0.2500000       0.0000000
C        0           0.0000000       0.2500000       0.0000000
G        20          0.6451613       0.2500000       2.5806452
T        0           0.0000000       0.2500000       0.0000000
Other    11          0.3548387       0.0000000       10000000000.0000000
Here is a similar sequence that works fine:
>Seq2
VGSEGGGGGRRGEGGGGGGRGGGGGRWEEGAG
#
# Output from 'compseq'
#
# Only words in frame 1 will be counted.
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
# Seq2
Word size 1
Total count 31
#
# Word   Obs Count   Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
A        1           0.0322581       0.0476190       0.6774194
C        0           0.0000000       0.0476190       0.0000000
D        0           0.0000000       0.0476190       0.0000000
E        4           0.1290323       0.0476190       2.7096774
F        0           0.0000000       0.0476190       0.0000000
G        20          0.6451613       0.0476190       13.5483871
H        0           0.0000000       0.0476190       0.0000000
I        0           0.0000000       0.0476190       0.0000000
K        0           0.0000000       0.0476190       0.0000000
L        0           0.0000000       0.0476190       0.0000000
M        0           0.0000000       0.0476190       0.0000000
N        0           0.0000000       0.0476190       0.0000000
P        0           0.0000000       0.0476190       0.0000000
Q        0           0.0000000       0.0476190       0.0000000
R        4           0.1290323       0.0476190       2.7096774
S        1           0.0322581       0.0476190       0.6774194
T        0           0.0000000       0.0476190       0.0000000
U        0           0.0000000       0.0476190       0.0000000
V        0           0.0000000       0.0476190       0.0000000
W        1           0.0322581       0.0476190       0.6774194
Y        0           0.0000000       0.0476190       0.0000000
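
One possible lead, offered as a guess and not as anything confirmed about
how compseq decides the sequence type: every letter in Seq1 is also a valid
IUPAC nucleotide code (G, S, R, M, W and V are all base or ambiguity codes),
whereas Seq2 contains E, which is not. The short Python sketch below checks
this for the two example sequences. The helper name non_nucleotide_letters
is made up for illustration, and the idea that the type detection depends
on this alphabet is an assumption.

```python
# Standard IUPAC nucleotide alphabet: bases plus ambiguity codes.
# Assumption: compseq's automatic type detection is influenced by whether
# a sequence fits this alphabet. That is not confirmed here.
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")


def non_nucleotide_letters(seq):
    """Return the sorted letters in seq that are not IUPAC nucleotide codes."""
    return sorted(set(seq.upper()) - IUPAC_NUCLEOTIDE)


seq1 = "GSGGGGGSGGRGMGGWGGGRGSGVGGRGWGVG"  # reported as A, C, G, T and Other
seq2 = "VGSEGGGGGRRGEGGGGGGRGGGGGRWEEGAG"  # reported with amino acid letters

print("Seq1:", non_nucleotide_letters(seq1))
print("Seq2:", non_nucleotide_letters(seq2))
```

An empty list means the sequence could be read entirely as ambiguous DNA,
which would be consistent with Seq1 being counted as nucleotides.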