[emboss-dev] EMBOSS seqret FASTQ support

Mon Jul 20 21:46:38 UTC 2009

Hi all,

I've just been having a play with the FASTQ support in seqret from EMBOSS 6.1.0.

This first example is included in Biopython's unit tests, and can be
downloaded here:
http://biopython.org/SRC/biopython/Tests/Quality/solexa_example.fastq
It was taken from http://maq.sourceforge.net/fq_all2std.pl where it is
given as an example of a Solexa (or early Illumina) format FASTQ file,
encoding Solexa scores with an ASCII offset of 64. It can be seen by doing:

$ perl fq_all2std.pl example
...

@SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
+SLXA-B3_649_FC8437_R1_1_1_610_79
YYYYYYYYYYYYYYYYYYWYWYYSU
@SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
+SLXA-B3_649_FC8437_R1_1_1_397_389
YYYYYYYYYWYYYYWWYYYWYWYWW
@SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
+SLXA-B3_649_FC8437_R1_1_1_850_123
YYYYYYYYYYYYYWYYWYYSYYYSY
@SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
+SLXA-B3_649_FC8437_R1_1_1_362_549
YYYYYYYYYYYYYYYYYYWWWWYWY
@SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
+SLXA-B3_649_FC8437_R1_1_1_183_714
YYYYYYYYYYWYYYYWYWWUWWWQQ

I am pleased to say EMBOSS 6.1.0 will read this and convert it into a
standard FASTA file:

$ seqret -sequence solexa_example.fastq -sformat fastq -osformat fasta -filter
>SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
>SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
>SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
>SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
>SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
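
For anyone wanting to sanity check that FASTA output without EMBOSS, the
conversion can be sketched in a few lines of Python. This is only an
illustration: it assumes the simple four-line layout used above (no line
wrapping), and the function name fastq_to_fasta is made up here, it is
not part of EMBOSS or Biopython.

```python
import io


def fastq_to_fasta(handle):
    """Yield one FASTA record (as text) per four-line FASTQ record.

    Assumption: sequence and quality strings are not line wrapped.
    """
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            break  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Not a four-line FASTQ record: %r" % title)
        if len(seq) != len(qual):
            raise ValueError("Sequence/quality length mismatch: %r" % title)
        # Drop the leading "@" and discard the qualities entirely.
        yield ">%s\n%s\n" % (title[1:], seq)


# Usage with the first record of the example file:
example = io.StringIO(
    "@SLXA-B3_649_FC8437_R1_1_1_610_79\n"
    "GATGTGCAATACCTTTGTAGAGGAA\n"
    "+SLXA-B3_649_FC8437_R1_1_1_610_79\n"
    "YYYYYYYYYYYYYYYYYYWYWYYSU\n"
)
print("".join(fastq_to_fasta(example)), end="")
```

Since the qualities are thrown away, the FASTQ variant does not matter
for this particular conversion.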

Or, output as a Sanger style FASTQ file (using PHRED qualities with an ASCII
offset of 33):

$ seqret -sequence solexa_example.fastq -sformat fastq-solexa -osformat fastq-sanger -filter
@SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
+SLXA-B3_649_FC8437_R1_1_1_610_79
::::::::::::::::::8:8::46
@SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
+SLXA-B3_649_FC8437_R1_1_1_397_389
:::::::::8::::88:::8:8:88
@SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
+SLXA-B3_649_FC8437_R1_1_1_850_123
:::::::::::::8::8::4:::4:
@SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
+SLXA-B3_649_FC8437_R1_1_1_362_549
::::::::::::::::::8888:8:
@SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
+SLXA-B3_649_FC8437_R1_1_1_183_714
::::::::::8::::8:88688822
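
The quality strings above can be checked by hand. Solexa scores map to
PHRED scores via Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), after
which the ASCII offset changes from 64 to 33. The following is a minimal
sketch of that mapping, assuming rounding to the nearest integer (how
EMBOSS rounds internally is not confirmed here). The function names are
hypothetical, chosen just for this illustration.

```python
from math import log10


def solexa_to_phred(solexa):
    """Convert one Solexa score to the nearest integer PHRED score."""
    return int(round(10 * log10(10 ** (solexa / 10.0) + 1)))


def solexa_to_sanger(quality_string):
    """Re-encode a Solexa FASTQ quality string (offset 64) as Sanger (offset 33)."""
    return "".join(
        chr(33 + solexa_to_phred(ord(letter) - 64)) for letter in quality_string
    )


# First and last quality strings from the example file:
print(solexa_to_sanger("YYYYYYYYYYYYYYYYYYWYWYYSU"))
print(solexa_to_sanger("YYYYYYYYYYWYYYYWYWWUWWWQQ"))
```

For this example "Y" (Solexa 25) becomes ":" (PHRED 25), "W" (23)
becomes "8", "U" (21) becomes "6", "S" (19) becomes "4" and "Q" (17)
becomes "2". The two scales only diverge noticeably for low scores.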

Converting the same file with Biopython, for example as shown on the
following cookbook page, gives exactly the same output as seqret
(except that Biopython omits the optional repeated title on the
plus lines):
http://www.biopython.org/wiki/Reading_from_unix_pipes

This also agrees with the MAQ script - if you ignore its strange
bug where it adds a "!" to the end of each quality string, see:
http://sourceforge.net/mailarchive/forum.php?thread_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com&forum_name=maq-help

So far so good :)

Was there any particular reason why EMBOSS includes the
redundant second title on the plus lines? I can see that doing
this makes the FASTQ files perhaps slightly more likely to work
with other parsers, but it imposes quite a size penalty.
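
To put a rough number on that size penalty, the sketch below counts the
bytes in one four-line record with and without the repeated title. It
assumes Unix newlines and unwrapped lines; record_size is a hypothetical
helper written for this illustration only.

```python
def record_size(title, length, repeat_title=True):
    """Bytes in one four-line FASTQ record, assuming Unix newlines."""
    title_line = 1 + len(title)  # "@" plus the title
    plus_line = 1 + (len(title) if repeat_title else 0)  # "+" plus optional title
    # Sequence and quality lines are the same length; add four newlines.
    return title_line + length + plus_line + length + 4


title = "SLXA-B3_649_FC8437_R1_1_1_610_79"
with_title = record_size(title, 25, repeat_title=True)
without_title = record_size(title, 25, repeat_title=False)
print(with_title, without_title, "%0.2f" % (with_title / float(without_title)))
```

For the 25 bp reads above with their 32 character titles that is 120
bytes versus 88 bytes per record, so repeating the title makes the file
about a third larger. The relative cost shrinks as reads get longer.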

Peter C.