[emboss-dev] EMBOSS seqret FASTQ support

Mon Jul 20 22:12:29 UTC 2009

Earlier I wrote:
> Hi all,
>
> I've just been having a play with the FASTQ support in seqret from EMBOSS 6.1.0
> ...
> So far so good :)

Could anyone spot a "but" coming up?

Well, here we are - consider the following single Sanger format
FASTQ record (originally from the NCBI SRA, I think SRA000271,
but I would have to double check that).

@071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
+071113_EAS56_0053:1:1:182:712
@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+

I would guess the problem is that the quality line starts with an @,
which a parser can mistake for the start of a new record, so care is
needed. Likewise of course, quality lines can start with a + character
too (although in my quick testing EMBOSS seems happy with these).
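
To make the "care is needed" point concrete, here is a minimal Python sketch of one way to cope: instead of treating every line starting with @ as a new record, read quality lines until their total length matches the sequence length. The function name iter_fastq is made up for this illustration; it is not the EMBOSS code, nor a copy of the Biopython parser, and it assumes sequence lines never start with a + character.

```python
import io


def iter_fastq(handle):
    """Yield (title, sequence, quality) tuples from a FASTQ file handle.

    Hypothetical helper for illustration only. Assumes sequence lines
    never start with '+', and uses the sequence length to decide where
    the quality string ends.
    """
    line = handle.readline()
    while line:
        if not line.strip():
            # Skip blank lines between records.
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError("Expected '@' at start of record: %r" % line)
        title = line[1:].rstrip()

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError("Missing '+' line in record %r" % title)
        seq = "".join(seq_parts)

        # Quality lines: keep reading until we have as many characters
        # as the sequence, whatever those lines happen to start with.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError("Truncated quality in record %r" % title)
            part = line.strip()
            qual_parts.append(part)
            qual_len += len(part)
        if qual_len != len(seq):
            raise ValueError("Quality longer than sequence in %r" % title)

        yield title, seq, "".join(qual_parts)
        line = handle.readline()


# The record from this email, with its quality line starting with '@'.
record = (
    "@071113_EAS56_0053:1:1:182:712\n"
    "ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG\n"
    "+071113_EAS56_0053:1:1:182:712\n"
    "@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+\n"
)
records = list(iter_fastq(io.StringIO(record)))
for title, seq, qual in records:
    print(title, len(seq), len(qual))
```

Counting characters rather than looking for the next @ is what lets this cope with quality lines that begin with either @ or +, including quality strings wrapped over several lines.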

The ASCII code for @ is 64, meaning for a Sanger style file this
is a PHRED quality of 64-33 = 31. Here is what Biopython gives
for the FASTA conversion:

>071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG

And this is what Biopython gives for the QUAL conversion,
showing the PHRED scores as integers:

>071113_EAS56_0053:1:1:182:712
31 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 34 35 40 40
40 40 40 27 4 27 21 5 12 9 8 13 7 9 4 10
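
For anyone wanting to check those numbers without Biopython, the Sanger decoding is just the ASCII code minus 33. A minimal sketch in plain Python (the variable names are mine, purely for illustration):

```python
# Quality line copied from the FASTQ record above.
quality = "@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+"

# Sanger FASTQ stores each PHRED score as a character with ASCII offset 33.
SANGER_OFFSET = 33

scores = [ord(char) - SANGER_OFFSET for char in quality]
print(" ".join(str(score) for score in scores))
```

The first value is 64 - 33 = 31 for the leading @, and the rest should agree with the QUAL output shown above.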

Anyway, EMBOSS doesn't seem to like this example FASTQ record:

$ seqret -sequence tricky_one.fastq -sformat fastq -osformat fasta -filter
Error: Unable to read sequence 'tricky_one.fastq'
Died: seqret terminated: Bad value for '-sequence' with -auto defined

This read is actually one of four records in the following Biopython
test file, in which EMBOSS only seems to find the first record:
http://biopython.org/SRC/biopython/Tests/Quality/tricky.fastq

As described here, this is a hand-modified version of a real NCBI
FASTQ file to showcase several potential gotchas in parsing FASTQ
(including some unlikely to occur in real life - unless someone were
to concatenate FASTQ files from separate sources or something):
http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html#FastqGeneralIterator

In fact, looking at that again now, maybe I should include another
record where the sequence line starts with a "+" as well... maybe
even a record with the quality split over multiple lines, some starting
with @ and some with +. That would be an even better evil test ;)

Regards,

Peter C.