[EMBOSS] seqret messes up on phylip input with duplicate sequence names

Jan Kim jttkim at googlemail.com
Wed Aug 6 20:54:56 UTC 2014


Dear All,

I've run into a somewhat strange problem while using seqret to convert
from phylip to fasta (the default) format. Essentially, when a phylip
file contains multiple sequences with the same name, one of two things
happens: either seqret dumps core, or all sequences, including their
names, get concatenated into a single "EMBOSS_001" sequence. Please see
below for an example that is reproducible for me with EMBOSS 6.4.0 (from
an Ubuntu package) and 6.5.7 (compiled by myself).

I think the issue is not specific to seqret but rather lies in the
sequence reading library. Perhaps some function decides that the input
isn't valid phylip when it encounters the duplicate name, and this
triggers a fallback to reading the entire file as raw sequence data.

As a bit of context, I ran into this because fdnadist mysteriously
produced a 1 x 1 matrix with the row name "EMBOSS_001". It took me quite
a while to figure out that this was triggered by duplicate sequence names,
which I didn't expect to exist in the input. But if I'm allowed this
whinge, an error or warning such as "duplicate sequence name in phylip
input -- giving up" might have directed me to the root of the problem
more quickly. (I can still hope, though, that this email saves someone
else a bit of time hunting down a related issue.)
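
Until such a warning exists, a pre-flight check along these lines would
catch the problem before the file reaches seqret. This is only a minimal
sketch and not part of EMBOSS: it assumes the 10-character name field used
in the example below and one line per sequence, and the helper names
phylip_names and duplicate_names are made up for illustration.

```python
from collections import Counter


def phylip_names(text, name_width=10):
    """Return the sequence names from a phylip file given as a string.

    Assumes a header line "<ntaxa> <nchar>" followed by one line per
    sequence, with the name in the first `name_width` columns.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntaxa = int(lines[0].split()[0])
    return [ln[:name_width].strip() for ln in lines[1:1 + ntaxa]]


def duplicate_names(text):
    """Return a sorted list of names that occur more than once."""
    counts = Counter(phylip_names(text))
    return sorted(name for name, n in counts.items() if n > 1)


# The broken variant of the example: "Beta " replaced with "Alpha".
example = """\
   5   13
Alpha     AACGTGGCCACAT
Alpha     AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC
"""

print(duplicate_names(example))
```

An empty list means no duplicates were found; anything else is worth
fixing before converting the file.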

Best regards, Jan

----- 8< --- reproducible example -----------------------------------------

$ # dnadist.phy is an example input copied from the fdnadist HTML documentation page
$ cat dnadist.phy
   5   13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC
$ seqret dnadist.phy -outseq stdout
Read and write (return) sequences
>Alpha
AACGTGGCCACAT
>Beta
AAGGTCGCCACAC
>Gamma
CAGTTCGCCACAA
>Delta
GAGATTTCCGCCT
>Epsilon
GAGATCTCCGCCC
$ # so if the input is good, the output is good too. Replace "Beta " with
$ # "Alpha", though, so that "Alpha" is a duplicate identifier, and...
$ seqret dnabroken.phy -outseq stdout
Read and write (return) sequences
>Epsilon
GAGATCTCCGCCC
Segmentation fault (core dumped)
$ # the core dump can be "fixed" by adding an empty line to the broken file:
$ echo >> dnabroken.phy 
$ seqret dnabroken.phy -outseq stdout
Read and write (return) sequences
>EMBOSS_001
AlphaAACGTGGCCACATAlphaAAGGTCGCCACACGammaCAGTTCGCCACAADeltaG
AGATTTCCGCCTEpsilonGAGATCTCCGCCC
$ echo $?
0
$ # so seqret has written something and hasn't complained, but the output
$ # is really garbage
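
Since the exit status is 0 even when the output is garbage, the only
safeguard is to inspect the output itself. The following is a rough
sketch, assuming FASTA output and that the expected number of sequences
is known from the phylip header. The names fasta_ids and looks_collapsed
are made up, and the "EMBOSS_" prefix test is just a heuristic based on
the output above.

```python
def fasta_ids(text):
    """Return the identifiers of all records in a FASTA string."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def looks_collapsed(fasta_text, expected_count):
    """Heuristic: True if the record count is wrong or an id looks
    like an automatically generated "EMBOSS_..." name."""
    ids = fasta_ids(fasta_text)
    if len(ids) != expected_count:
        return True
    return any(i.startswith("EMBOSS_") for i in ids)


# The garbage output shown above, for a file declaring 5 sequences.
garbage = (
    ">EMBOSS_001\n"
    "AlphaAACGTGGCCACATAlphaAAGGTCGCCACACGammaCAGTTCGCCACAADeltaG\n"
    "AGATTTCCGCCTEpsilonGAGATCTCCGCCC\n"
)

print(looks_collapsed(garbage, expected_count=5))
```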

-- 
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*

