[EMBOSS] EMBOSS seqret : IntelliGenetics and DOS new lines

Peter biopython at maubp.freeserve.co.uk
Mon Jul 20 15:41:43 UTC 2009


Hi all,

I've just updated my Mac to EMBOSS 6.1.0, and have found an
issue with seqret conversion of IntelliGenetics files. After some
digging, I think this problem relates to having DOS new lines in
a file on Unix (in my case, Mac OS X).

For illustration, I'm using the example file from the EMBOSS
website, saved to disk (using Unix new lines on a Mac):
http://emboss.sourceforge.net/docs/themes/seqformats/ig
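
To double check which new line convention a saved copy really
uses, something like the following Python sketch should do. The
function name count_newlines is just illustrative, it is not part
of EMBOSS or any library:

```python
def count_newlines(data):
    """Count CRLF (DOS), bare LF (Unix) and bare CR (old Mac) line endings.

    Expects the raw bytes of the file, so that no new line
    translation has been applied before counting.
    """
    crlf = data.count(b"\r\n")
    # Every CRLF also contains one LF and one CR, so subtract those.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf, "CR": cr}
```

Read the file in binary mode, e.g. open("emboss_ig.txt", "rb"),
and pass the bytes in. A Unix-style file should report zero for
both CRLF and CR.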

Using EMBOSS 6.0.1, there was a problem:

$ embossversion
Writes the current EMBOSS version number to a file
6.0.1
$ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter
>HSFAU
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaaH-sapiensfaugenebasesH
SFAUctaccattttccctctcgattctatatgtacactcgggacaagttctcctgatcga
aaacggcaaaactaaggccccaagtaggaatgccttagttttcggggttaacaatgatta
acactgagcctcacacccacgcgatgccctcagctcctcgctcagcgctctcaccaacag
ccgtagcccgcagccccgctggacaccggttctccatccccgcagcgtagcccggaacat
ggtagctgccatctttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgc
cccgtcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaaggggcggag
ctaggactgccttgggcggtacaaatagcagggaaccgcgcggtcgctcagcagtgacgt
gacacgcagcccacggtctgtactgacgcgccctcgcttcttcctctttctcgactccat
cttcgcggtagctgggaccgccgttcaggtaagaatggggccttggctggatccgaaggg
cttgtagcaggttggctgcggggtcagaaggcgcggggggaaccgaagaacggggcctgc
tccgtggccctgctccagtccctatccgaactccttgggaggcactggccttccgcacgt
gagccgccgcgaccaccatcccgtcgcgatcgtttctggaccgctttccactcccaaatc
tcctttatcccagagcatttcttggcttctcttacaagccgtcttttctttactcagtcg
ccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccagg
aaacggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccattttcttgtg
ctcttcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcat
gtagcctcactggagggcattgccccggaagatcaagtcgtgctcctggcaggcgcgccc
ctggaggatgaggccactctgggccagtgcggggtggaggccctgactaccctggaagta
gcaggccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgt
ctagtgagtgtggggtgcatagtcctgacagctgagtgtcacacctatggtaatagagta
cttctcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacaca
gacgtccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatccta
gtctggttatcagcttccacactaaaaattaggtcagaccaggccccaaagtgctctata
aattagaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaac
tttgttctcattacctattgggcgcagcttctctttaaaggcttgaattgagaaaagagg
ggttctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacag
gtaaagtccatggttccctggcccgtgctggaaaagtgagaggtcagactcctaaggtga
gtgagagtattagtggtcatggtgttaggactttttttcctttcacagctaaaccaagtc
cctgggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatg
ctaggtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaac
aggagaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgct
ttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgccaactcttaagtct
tttgtaattctggctttctctaataaaaaagccacttagttcagtcatcgcattgtttca
tctttacttgcaaggcctcagggagaggtgtgcttctcgg

i.e. The two sequences have been merged into one, with the
description and name of the second record absorbed into the
sequence data (the "H-sapiensfaugenebasesHSFAU" run above).

Using EMBOSS 6.1.0, the following now works:

$ embossversion
Reports the current EMBOSS version number
6.1.0
$ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter
>HSFAU H.sapiens fau mRNA, 518 bases
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU1 H.sapiens fau 1 gene, 2016 bases
ctaccattttccctctcgattctatatgtacactcgggacaagttctcctgatcgaaaac
ggcaaaactaaggccccaagtaggaatgccttagttttcggggttaacaatgattaacac
tgagcctcacacccacgcgatgccctcagctcctcgctcagcgctctcaccaacagccgt
agcccgcagccccgctggacaccggttctccatccccgcagcgtagcccggaacatggta
gctgccatctttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgccccg
tcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaaggggcggagctag
gactgccttgggcggtacaaatagcagggaaccgcgcggtcgctcagcagtgacgtgaca
cgcagcccacggtctgtactgacgcgccctcgcttcttcctctttctcgactccatcttc
gcggtagctgggaccgccgttcaggtaagaatggggccttggctggatccgaagggcttg
tagcaggttggctgcggggtcagaaggcgcggggggaaccgaagaacggggcctgctccg
tggccctgctccagtccctatccgaactccttgggaggcactggccttccgcacgtgagc
cgccgcgaccaccatcccgtcgcgatcgtttctggaccgctttccactcccaaatctcct
ttatcccagagcatttcttggcttctcttacaagccgtcttttctttactcagtcgccaa
tatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaac
ggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccattttcttgtgctct
tcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcatgtag
cctcactggagggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctgg
aggatgaggccactctgggccagtgcggggtggaggccctgactaccctggaagtagcag
gccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgtctag
tgagtgtggggtgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttc
tcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacacagacg
tccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatcctagtct
ggttatcagcttccacactaaaaattaggtcagaccaggccccaaagtgctctataaatt
agaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaactttg
ttctcattacctattgggcgcagcttctctttaaaggcttgaattgagaaaagaggggtt
ctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacaggtaa
agtccatggttccctggcccgtgctggaaaagtgagaggtcagactcctaaggtgagtga
gagtattagtggtcatggtgttaggactttttttcctttcacagctaaaccaagtccctg
ggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatgctag
gtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaacagga
gaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgt
caacgttgtgcccacctttggcaagaagaagggccccaatgccaactcttaagtcttttg
taattctggctttctctaataaaaaagccacttagttcagtcatcgcattgtttcatctt
tacttgcaaggcctcagggagaggtgtgcttctcgg

i.e. There was a problem with this example file in EMBOSS 6.0.1,
but things look fine in EMBOSS 6.1.0. Great :)

However, if we now convert this input file to use DOS/Windows
newlines, and repeat the test (on Mac OS X, so Unix):

$ embossversion
Reports the current EMBOSS version number
6.1.0
$ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter
 H.sapiens fau mRNA, 518 bases
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
 H.sapiens fau 1 gene, 2016 bases
ctaccattttccctctcgattctatatgtacactcgggacaagttctcctgatcgaaaac
ggcaaaactaaggccccaagtaggaatgccttagttttcggggttaacaatgattaacac
tgagcctcacacccacgcgatgccctcagctcctcgctcagcgctctcaccaacagccgt
agcccgcagccccgctggacaccggttctccatccccgcagcgtagcccggaacatggta
gctgccatctttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgccccg
tcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaaggggcggagctag
gactgccttgggcggtacaaatagcagggaaccgcgcggtcgctcagcagtgacgtgaca
cgcagcccacggtctgtactgacgcgccctcgcttcttcctctttctcgactccatcttc
gcggtagctgggaccgccgttcaggtaagaatggggccttggctggatccgaagggcttg
tagcaggttggctgcggggtcagaaggcgcggggggaaccgaagaacggggcctgctccg
tggccctgctccagtccctatccgaactccttgggaggcactggccttccgcacgtgagc
cgccgcgaccaccatcccgtcgcgatcgtttctggaccgctttccactcccaaatctcct
ttatcccagagcatttcttggcttctcttacaagccgtcttttctttactcagtcgccaa
tatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaac
ggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccattttcttgtgctct
tcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcatgtag
cctcactggagggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctgg
aggatgaggccactctgggccagtgcggggtggaggccctgactaccctggaagtagcag
gccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgtctag
tgagtgtggggtgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttc
tcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacacagacg
tccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatcctagtct
ggttatcagcttccacactaaaaattaggtcagaccaggccccaaagtgctctataaatt
agaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaactttg
ttctcattacctattgggcgcagcttctctttaaaggcttgaattgagaaaagaggggtt
ctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacaggtaa
agtccatggttccctggcccgtgctggaaaagtgagaggtcagactcctaaggtgagtga
gagtattagtggtcatggtgttaggactttttttcctttcacagctaaaccaagtccctg
ggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatgctag
gtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaacagga
gaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgt
caacgttgtgcccacctttggcaagaagaagggccccaatgccaactcttaagtcttttg
taattctggctttctctaataaaaaagccacttagttcagtcatcgcattgtttcatctt
tacttgcaaggcctcagggagaggtgtgcttctcgg

i.e. The ">" and the sequence identifier are missing from every
FASTA header line.
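
For anyone wanting to reproduce this, here is a minimal Python
sketch for making a DOS new line copy of a file. This is only one
way to do it (a tool like unix2dos would do just as well), and
the function names are illustrative:

```python
def to_dos_newlines(data):
    """Return the bytes with every line ending converted to CRLF."""
    # Normalise any existing CRLF to LF first, so that a file which
    # already uses DOS new lines does not end up with CR CR LF.
    unix = data.replace(b"\r\n", b"\n")
    return unix.replace(b"\n", b"\r\n")


def convert_file(src, dst):
    """Write a DOS new line copy of the file src to the path dst."""
    # Binary mode stops Python translating the new lines itself.
    with open(src, "rb") as handle:
        data = handle.read()
    with open(dst, "wb") as handle:
        handle.write(to_dos_newlines(data))
```

e.g. convert_file("emboss_ig.txt", "emboss_ig_dos.txt"), where
the second filename is made up for this example.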

So, it looks like EMBOSS 6.1.0 fixed one problem with
IntelliGenetics files, but there is still an issue when the
input file uses DOS new lines.

Peter C.

P.S. Should I have reported this possible bug via SourceForge?

P.P.S. Back in 2006, I reported a similar data corruption
issue when reading Stockholm/Pfam files with DOS newlines
(SourceForge Bug #1588956, long since fixed). It seems to
me that EMBOSS would benefit from explicit testing of all
the file formats using DOS/Windows newlines when run on
Unix, and vice versa. Does that sound feasible, or just
hopelessly ambitious?
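
To make the suggestion concrete, here is a rough Python sketch
of what one such test might look like. It assumes seqret is on
the PATH, reuses the options from my examples above, and assumes
the Unix and DOS copies of a file should convert to identical
FASTA output. All the function names are made up for this
sketch, and it is not based on how the EMBOSS test suite is
actually organised:

```python
import subprocess


def run_seqret(path, sformat):
    """Convert one file to FASTA with seqret and return stdout as bytes."""
    cmd = ["seqret", "-sequence", path, "-sformat", sformat,
           "-osformat", "fasta", "-auto", "-filter"]
    result = subprocess.run(cmd, stdin=subprocess.DEVNULL,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


def count_headers(fasta):
    """Count the lines starting with '>' in FASTA output (bytes)."""
    return sum(1 for line in fasta.split(b"\n") if line.startswith(b">"))


def newline_problems(unix_out, dos_out):
    """Describe how the DOS-input output differs from the Unix-input one."""
    problems = []
    expected = count_headers(unix_out)
    found = count_headers(dos_out)
    if found != expected:
        problems.append("expected %d '>' header lines, found %d"
                        % (expected, found))
    # Assumption: both inputs should give byte-identical output.
    if dos_out != unix_out:
        problems.append("outputs differ")
    return problems


def check_pair(unix_path, dos_path, sformat):
    """Run seqret on both copies of a file and list any differences."""
    return newline_problems(run_seqret(unix_path, sformat),
                            run_seqret(dos_path, sformat))
```

An empty list from check_pair would mean the two new line styles
were handled the same way. Looping this over one example file
per supported format is the kind of coverage I have in mind.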
