[emboss-dev] [Biopython-dev] Line wrapping in FASTQ output

Thu Jul 23 09:14:52 UTC 2009

On Thu, Jul 23, 2009 at 9:08 AM, Peter Rice<pmr at ebi.ac.uk> wrote:
> Peter C. wrote:
>>
>> Hi Peter R. et al,
>>
>> For Biopython we should be able to cope with any strange line breaks
>> in the sequence and quality lines on input, but for output don't do
>> any line wrapping. I felt this would result in more widely parseable
>> output. I wondered what your thought process was, and if you think
>> it is worth removing the line wrapping on EMBOSS's FASTQ output
>> (or indeed, if you have a good argument to convince me to make
>> Biopython output FASTQ with line wrapping by default).
>
> There is also an issue with making the lines so long that brain-damaged
> parsers (those that read a line in C and fail to check it was a complete
> line) will fail.

You mean a C parser with a finite string buffer (say 100 characters)
which reads things line by line? Yes, that would be a bit brain dead
too. I guess either way could break some parsers out there ;)
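
As a rough illustration of that failure mode, here is a small Python
sketch that imitates a naive C fgets() loop by capping every read at
the buffer size. The function name and the 100 character buffer are
made up for this example and are not taken from any real parser.

```python
import io


def read_lines_fixed_buffer(handle, buffer_size=100):
    """Imitate a naive C fgets() loop (hypothetical helper).

    Each read returns at most buffer_size - 1 characters, and the
    caller never checks whether it received a complete line.
    """
    lines = []
    while True:
        chunk = handle.readline(buffer_size - 1)
        if not chunk:
            break
        lines.append(chunk.rstrip("\n"))
    return lines


# One unwrapped FASTQ record with a 250 bp read (four lines on disk).
record = "@read1\n" + "A" * 250 + "\n+\n" + "I" * 250 + "\n"
lines = read_lines_fixed_buffer(io.StringIO(record))
print(len(lines))  # 8 chunks, not the 4 lines that were written
```

A parser built on such a loop would treat each chunk as a separate
line, so the third "line" it sees is more sequence rather than the
'+' separator.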

> Leaving the line breaks in was deliberate in EMBOSS 6.1.0 to see
> whether any parsers would object.

I see - well I'm not objecting, and neither is the Biopython parser.

> The obvious compromise is to increase the default line length in
> EMBOSS to say 500 so that anyone reading up to 512 characters
> will still be safe. Unfortunately some folk will then assume there will
> never be a line break.

That seems like a bad idea - especially as Roche 454 reads are in the
region of 500+ bp, meaning some would wrap and some wouldn't. Even
using a longer wrap like 1000 would probably just postpone the issue.

If you are going to wrap, something short like 60 seems more sensible
(often used in FASTA files too) given the historical 80 character width
of a terminal window.

People using early Solexa/Illumina machines will only see a single
line, but as Illumina read lengths are already in the range 70 to
100bp, I wonder what the latest Illumina pipelines output (with
regard to wrapping).

> Alternatively, we could truly make everything fit on one line.

That's what Biopython currently does. But you are right - I hadn't
considered brain dead parsers using fixed buffers.

> Or we could double up the fastq outputs with and without line breaks
> (horrible problems with naming the output formats)

I don't like that plan. For Biopython we could have a wrapping setting
available for people who really need to specify this (as we do for
FASTA already), with a sensible default value.
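
A minimal sketch of what such a setting could look like, assuming a
hypothetical format_fastq() helper with a wrap argument. This is not
Biopython's actual API, and the default of no wrapping simply mirrors
the one-line output described above.

```python
def format_fastq(title, seq, qual, wrap=None):
    """Return one FASTQ record as a string (hypothetical helper).

    wrap=None writes the sequence and quality on one line each;
    wrap=N breaks both into lines of at most N characters.
    """
    if len(seq) != len(qual):
        raise ValueError("Sequence and quality lengths differ")
    if wrap is None:
        return "@%s\n%s\n+\n%s\n" % (title, seq, qual)
    if wrap < 1:
        raise ValueError("wrap must be a positive integer or None")

    def chunks(text):
        # Always return at least one (possibly empty) line.
        return [text[i:i + wrap] for i in range(0, len(text), wrap)] or [""]

    return "@%s\n%s\n+\n%s\n" % (
        title,
        "\n".join(chunks(seq)),
        "\n".join(chunks(qual)),
    )


print(format_fastq("read1", "ACGTACGTAC", "IIIIIIIIII"), end="")
print(format_fastq("read1", "ACGTACGTAC", "IIIIIIIIII", wrap=4), end="")
```

Note that with wrapping enabled a quality line may begin with '@' or
'+', so readers of such output cannot rely on the first character of
a line alone.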

> I suspect this one-line thing is a simple attempt to avoid the "quality line
> starting with '@' or '+'" issue.

Could be. I think the fact that @ and + are valid entries in the quality
string is the second most annoying thing about the FASTQ format
(after the lack of a clear format definition from Sanger, and the
resulting variants from Solexa/Illumina etc).
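
One common way round the ambiguity is to stop trusting the first
character of a quality line and instead keep reading quality lines
until their total length matches the sequence length. The following
is only a sketch of that idea under simplifying assumptions (for
example, it ignores any title repeated on the '+' line), and the
function name is made up; it is not the Biopython or EMBOSS parser.

```python
import io


def parse_fastq(handle):
    """Yield (title, sequence, quality) tuples, tolerating line wrapping.

    Hypothetical sketch: quality lines are consumed until they are as
    long as the sequence, so a leading '@' or '+' is not misread.
    """
    line = handle.readline()
    while line:
        if not line.startswith("@"):
            raise ValueError("Expected '@' title line, got %r" % line)
        title = line[1:].rstrip()
        # Sequence lines run up to the '+' separator line.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError("Missing '+' separator line")
        seq = "".join(seq_parts)
        # Quality lines are read by length, not by first character.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError("Truncated quality string")
            part = line.rstrip("\r\n")
            qual_parts.append(part)
            qual_len += len(part)
        if qual_len != len(seq):
            raise ValueError("Quality string longer than sequence")
        yield title, seq, "".join(qual_parts)
        # Skip any blank lines before the next record.
        line = handle.readline()
        while line and not line.strip():
            line = handle.readline()


data = "@r1\nACGT\nACGT\n+\n@III\n+III\n@r2\nAC\n+\n@+\n"
for title, seq, qual in parse_fastq(io.StringIO(data)):
    print(title, seq, qual)
```

Here the quality lines "@III" and "+III" of the first record are read
as quality data because the sequence length says more characters are
still due.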

>> [I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as
>> ideal for an OBF cross project mailing list, something we talked
>> about at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice)
>> were going to look into this?]
>
> Yes indeed I was. Waylaid by the demands of the 6.1.0 EMBOSS release
> but I will get back on to it.

Thanks!

> regards,
>
> Peter

Cheers,

Peter C.