[Bioperl-l] Next-gen modules

Mon Jun 22 20:29:46 UTC 2009

On Jun 22, 2009, at 9:24 AM, Peter wrote:

> On Wed, Jun 17, 2009 at 6:06 PM, Chris Fields wrote:
>> Peter wrote:
>>> Other issues to keep in mind:
>>>
>>> (3) There should be no warning parsing files where the optional
>>> repeated title is missing on the "+" lines (as discussed earlier
>>> on the BioPerl list).
>>
>> Agreed, though we'll have to check the current fastq parser to see
>> if that's currently the case.  I thought that was fixed but maybe not?
>>
>>> (4) When writing FASTQ files should BioPerl omit the optional
>>> repeated title on the "+" line? Biopython omits this as I
>>> understand this to be common practice, and it can make a big
>>> difference to file sizes - especially on short read data from
>>> Solexa/Illumina.
>>
>> Agreed, particularly if it's commonly encountered.
>>
>>> (5) Also test reading and writing files with an optional
>>> description (as well as an identifier) on the "@" (and "+") lines.
>>> See the NCBI SRA for examples, e.g.
>>>
>>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
>>
>> Should be easy enough to implement with a simple regex.
>>
>>> (6) Test reading and writing files where the encoded quality
>>> string starts with a "@" or a "+" character, e.g.
>>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
>>>
>>> Peter
>>
>> Mark, getting all that? ;>
>>
>> chris
>
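
Points (3) to (6) could be covered by something like the sketch below.
It is in Python only for illustration and is not the BioPerl or
Biopython parser. It assumes unwrapped four-line records, and the names
parse_fastq and format_fastq are hypothetical. Reading by position
rather than by marker character is what makes a quality string that
starts with "@" or "+" harmless.

```python
def parse_fastq(lines):
    """Yield (identifier, description, sequence, quality) tuples.

    Assumes each record is exactly four lines (no line wrapping).
    """
    it = iter(lines)
    for line in it:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % line)
        title = line[1:]
        # Identifier is everything up to the first space; the rest is
        # the optional description.
        identifier, _, description = title.partition(" ")
        seq = next(it, "").rstrip("\r\n")
        plus = next(it, "").rstrip("\r\n")
        if not plus.startswith("+"):
            raise ValueError("expected '+' line for %s" % identifier)
        # The repeated title is optional: accept a bare "+" silently,
        # but complain if a repeated title disagrees with the "@" line.
        if plus[1:] and plus[1:] != title:
            raise ValueError("'+' title does not match '@' title")
        # The quality line is taken by position, so it may itself
        # start with "@" or "+".
        qual = next(it, "").rstrip("\r\n")
        if len(qual) != len(seq):
            raise ValueError("sequence/quality length mismatch")
        yield identifier, description, seq, qual


def format_fastq(identifier, description, seq, qual):
    """Write one record, omitting the repeated title on the '+' line."""
    title = identifier + (" " + description if description else "")
    return "@%s\n%s\n+\n%s\n" % (title, seq, qual)
```

Round-tripping a record through format_fastq and parse_fastq should
give back the same four fields, with or without a description.
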
> Another couple of points that I should have remembered earlier,
> related to converting between PHRED scores and Solexa scores.
> On the bright side, with Illumina abandoning the Solexa scores
> in pipeline 1.3+, these issues will go away with time:
>
> (7) If BioPerl will be converting Solexa scores to/from PHRED
> scores as integers automatically (as discussed earlier), make
> sure you round to the nearest whole number (don't just truncate
> with a call to int!). MAQ does this by adding 0.5 before calling
> int (while in Biopython I just use Python's round function).

That can probably be done with sprintf if needed, which avoids a call to
POSIX functions.
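
To make the rounding issue concrete, here is a rough sketch in Python
(illustration only, not BioPerl code). The conversion formulas are the
standard ones relating the two scales; the function names are
hypothetical. Note that adding 0.5 and truncating only rounds correctly
for non-negative values, so for Solexa scores (which can be negative)
the sketch uses floor instead of plain truncation.

```python
import math


def round_half_up(x):
    """Round to the nearest integer; works for negative values too."""
    return int(math.floor(x + 0.5))


def solexa_to_phred(q_solexa):
    """Solexa score -> nearest integer PHRED score."""
    return round_half_up(10 * math.log10(10 ** (q_solexa / 10.0) + 1))


def phred_to_solexa(q_phred):
    """PHRED score -> nearest integer Solexa score.

    Only defined for q_phred > 0; PHRED 0 has no finite Solexa
    equivalent and math.log10 raises ValueError for it.
    """
    return round_half_up(10 * math.log10(10 ** (q_phred / 10.0) - 1))


# Solexa -1 is about PHRED 2.54: rounding gives 3, truncating gives 2.
print(solexa_to_phred(-1))  # 3
# PHRED 2 is about Solexa -2.33: rounding gives -2, whereas
# int(-2.33 + 0.5) would truncate towards zero and give -1.
print(phred_to_solexa(2))   # -2
```
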

> (8) When asked to write out an old Solexa style FASTQ file,
> what will you do if given a standard Sanger FASTQ file (or a
> new Illumina 1.3+ FASTQ file) containing a base with PHRED
> quality zero? This maps to a Solexa quality of minus infinity...
> Right now the development version of Biopython will throw an
> error in this situation, but mapping to the lowest observed
> Solexa score might be reasonable.
>
> Peter

Maybe address this with a warning, followed by assigning the lowest
Solexa score?
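
Roughly what I have in mind, sketched in Python for illustration (not
BioPerl code). The function name is hypothetical, and the floor of -5
is an assumption about the lowest score seen in old Solexa files. With
that floor, PHRED 1, which would otherwise round to -6, is clamped as
well.

```python
import math
import warnings

# Assumption: -5 is the lowest score that appears in Solexa FASTQ files.
MIN_SOLEXA = -5


def phred_to_solexa(q_phred, min_solexa=MIN_SOLEXA):
    """PHRED score -> integer Solexa score, clamped at min_solexa."""
    if q_phred < 0:
        raise ValueError("PHRED scores cannot be negative")
    if q_phred == 0:
        # PHRED 0 maps to minus infinity on the Solexa scale, so warn
        # and fall back to the lowest allowed score.
        warnings.warn(
            "PHRED quality 0 has no Solexa equivalent; using %d"
            % min_solexa
        )
        return min_solexa
    solexa = 10 * math.log10(10 ** (q_phred / 10.0) - 1)
    # Round to nearest (floor handles negative values correctly),
    # then clamp to the lowest allowed score.
    return max(min_solexa, int(math.floor(solexa + 0.5)))
```

Whether to warn once per file or once per base is a separate question;
this sketch leaves that to the caller's warning filters.
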

chris