[Bioperl-l] Next-gen modules

Wed Jun 17 18:01:28 UTC 2009

If we reach a consensus on how/who/what, I will be happy to contribute  
some coding time in the coming days.

Would a good starting point be to add the different formats as named in
Biopython, and to test support for reading/writing them? I could start
playing with that.

regards,

Elia

On 17 Jun 2009, at 18:52, Chris Fields wrote:

> I think this is a top priority for a fall BioPerl release, maybe  
> 1.6.2 (I am planning on a summer 1.6.1 release still).  Made it into  
> a bug report for tracking:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2857
>
> If no one works on this I may take it up after the 1.6.1 release.
>
> chris
>
> On Jun 17, 2009, at 12:13 PM, Mark A. Jensen wrote:
>
>> I'm on the case! (but maybe not in realtime, today!)
>>
>> ----- Original Message -----
>> From: "Chris Fields" <cjfields at illinois.edu>
>> To: "Peter" <biopython at maubp.freeserve.co.uk>
>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Elia Stupka" <e.stupka at ucl.ac.uk>; "Heikki Lehvaslaiho" <heikki at sanbi.ac.za>
>> Sent: Wednesday, June 17, 2009 1:06 PM
>> Subject: Re: [Bioperl-l] Next-gen modules
>>
>>
>>>
>>> On Jun 17, 2009, at 8:25 AM, Peter wrote:
>>>
>>>> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>>>>
>>>>> Elia,
>>>>>
>>>>> As Mark indicated, we recently discussed the lack of support for
>>>>> next-gen on list, at least re: fastq. I may be hit with the same thing
>>>>> in a few months' time myself, and I recall Jason and a few others also
>>>>> mentioning the same. Heikki wrote some code for Illumina FASTQ for SeqIO
>>>>> and related modules, but I don't believe it has been committed to trunk
>>>>> yet, so maybe he can answer.
>>>>>
>>>>> From prior discussions IIRC the issues were:
>>>>>
>>>>> 1) distinguishing the various FASTQ versions (Sanger, Illumina  
>>>>> 1.0, Illumina
>>>>> 1.3) from one another (so maybe some optional validation), and
>>>>
>>>> Following the Python rule of thumb for being explicit, Biopython makes
>>>> the user specify which FASTQ variant is being used. I don't think you
>>>> can do anything else. Any attempted validation would have to be
>>>> heuristic, based on the ASCII characters found, and would risk false
>>>> positive warnings.
>>>
>>> Right; I'm thinking along the same lines. If anything, the most we
>>> would allow is some level of validation, so if there were a degree of
>>> uncertainty about the format one could set a validation flag to check
>>> bounds during the parse and warn if they are exceeded.
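
The optional bounds check described above could look roughly like the following. This is a minimal Python sketch, not BioPerl or Biopython code: the function name `decode_quality`, the `validate` flag, and the variant labels are hypothetical, and the table assumes the usual ASCII offsets and score ranges for the three FASTQ variants (Sanger: offset 33, PHRED 0 to 93; Solexa / Illumina 1.0: offset 64, Solexa -5 to 62; Illumina 1.3: offset 64, PHRED 0 to 62).

```python
import warnings

# Assumed (ASCII offset, lowest score, highest score) for each variant.
# The variant labels are illustrative, not an existing API.
VARIANTS = {
    "sanger": (33, 0, 93),
    "solexa": (64, -5, 62),
    "illumina": (64, 0, 62),
}


def decode_quality(qual, variant, validate=False):
    """Decode a FASTQ quality string into a list of integer scores.

    With validate=True, warn (rather than fail) when any score falls
    outside the range expected for the user-specified variant.
    """
    offset, low, high = VARIANTS[variant]
    scores = [ord(ch) - offset for ch in qual]
    if validate:
        bad = [s for s in scores if s < low or s > high]
        if bad:
            warnings.warn(
                "%d score(s) outside [%d, %d] for variant '%s'; "
                "wrong FASTQ variant?" % (len(bad), low, high, variant)
            )
    return scores
```

The user still names the variant explicitly; the check only flags data that cannot belong to it, so it cannot tell apart files whose characters happen to be valid under more than one variant.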
>>>
>>>>> 2) having a way for the Seq object to either 'know' what format is
>>>>> contained, or we use phred score and convert back and forth from that
>>>>> (I think the latter makes more sense).
>>>>
>>>> I think it could make sense for BioPerl to convert Solexa scores to/from
>>>> PHRED scores on the fly (especially now that Illumina is abandoning
>>>> the Solexa score system). Python style tries to avoid implicit
>>>> conversions, so Biopython doesn't automatically do a conversion from
>>>> Solexa to PHRED scores on parsing (but will on writing if the requested
>>>> output format requires this).
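
For reference, the Solexa/PHRED conversion mentioned above follows from the two definitions Q_phred = -10 log10(p) and Q_solexa = -10 log10(p / (1 - p)). A minimal Python sketch is below; the function names are hypothetical, and clamping at -5 on the way back to Solexa is an assumption (PHRED 0 has no finite Solexa equivalent), not something settled in this thread.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score (float)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Convert a PHRED score to the equivalent Solexa score (float).

    Assumption: results are clamped to a minimum of -5, since PHRED 0
    (error probability 1) would otherwise map to minus infinity.
    """
    if q_phred <= 0:
        return -5.0
    return max(-5.0, 10 * math.log10(10 ** (q_phred / 10) - 1))
```

The two scales diverge only at low quality: Solexa 0 corresponds to roughly PHRED 3, while above about 10 the values are nearly identical.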
>>>>
>>>>> Peter's suggestions also are reasonable, though does Biopython have a
>>>>> separate module for each of these variations? Our version (I believe)
>>>>> mainly varied the conversion within Bio::SeqIO::fastq itself based on
>>>>> the fastq variant passed in as a separate named argument.
>>>>
>>>> Biopython's SeqIO gives the three FASTQ variants their own unique
>>>> names. This format name is a required argument for parsing/writing
>>>> (we don't try and guess the file format from the data contents).
>>>> Internally we have three separate FASTQ parsers/writers although they
>>>> do share code.
>>>
>>> We could easily do the same if others agree. Actually, if we specified
>>> that shorthand for a variant on a format would be designated as
>>> -format => 'format-variant', I think we could easily hack SeqIO to deal
>>> with that by splitting on '-' and passing everything to the constructor
>>> as (-format => 'format', -variant => 'variant'). Very little repeated
>>> code in this case, just an additional named parameter indicating the
>>> format variant (and the SeqIO class can do the type checking on that
>>> within the constructor).
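
The 'format-variant' shorthand proposed above amounts to a split on the first '-' followed by a check of the variant. A minimal sketch of that logic, written in Python for illustration only; `split_format` and the `KNOWN_VARIANTS` table are hypothetical names, not part of SeqIO:

```python
# Hypothetical table of formats and the variants each one accepts.
KNOWN_VARIANTS = {"fastq": {"sanger", "solexa", "illumina"}}


def split_format(spec):
    """Split 'format-variant' into separate format and variant arguments.

    A plain 'format' (no '-') yields variant None. An unrecognised
    variant raises ValueError, standing in for the type checking the
    format-specific class would do in its constructor.
    """
    fmt, sep, variant = spec.lower().partition("-")
    if not sep:
        return {"format": fmt, "variant": None}
    if variant not in KNOWN_VARIANTS.get(fmt, set()):
        raise ValueError("unknown variant '%s' for format '%s'" % (variant, fmt))
    return {"format": fmt, "variant": variant}
```

For example, `split_format("fastq-illumina")` gives the equivalent of (-format => 'fastq', -variant => 'illumina'), so a single fastq class can branch on the variant instead of duplicating the parser.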
>>>
>>>> Other issues to keep in mind:
>>>>
>>>> (3) There should be no warning when parsing files where the optional
>>>> repeated title is missing on the "+" lines (as discussed earlier on the
>>>> BioPerl list).
>>>
>>> Agreed, though we'll have to check the current fastq parser to see if
>>> that's currently the case. I thought that was fixed but maybe not?
>>>
>>>> (4) When writing FASTQ files should BioPerl omit the optional repeated
>>>> title on the "+" line? Biopython omits this as I understand this to be
>>>> common practice, and it can make a big difference to file sizes,
>>>> especially on short read data from Solexa/Illumina.
>>>
>>> Agreed, particularly if it's commonly encountered.
>>>
>>>> (5) Also test reading and writing files with an optional description
>>>> (as well as an identifier) on the "@" (and "+") lines. See the NCBI SRA
>>>> for examples, e.g.
>>>>
>>>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>>>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
>>>
>>> Should be easy enough to implement with a simple regex.
>>>
>>>> (6) Test reading and writing files where the encoded quality string
>>>> starts with a "@" or a "+" character, e.g.
>>>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
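
Points (3) to (6) can be exercised with a small reader/writer pair. The sketch below is Python, not the Bio::SeqIO::fastq implementation, and `parse_fastq` and `format_fastq` are hypothetical names. It assumes that a quality string is complete once it is as long as the sequence, which is what lets it accept quality lines that begin with "@" or "+".

```python
def parse_fastq(lines):
    """Yield (identifier, description, sequence, quality) tuples.

    The "+" line may be bare or may repeat the title; neither case
    produces a warning. Quality is read by length, not by looking for
    the next "@", so a quality string may start with "@" or "+".
    """
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % header)
        title = header[1:]
        ident, _, desc = title.partition(" ")
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                repeated = line[1:]
                break
            seq_parts.append(line)
        else:
            raise ValueError("missing '+' line for %s" % ident)
        if repeated and repeated != title:
            raise ValueError("'+' line does not match title for %s" % ident)
        seq = "".join(seq_parts)
        qual = ""
        while len(qual) < len(seq):
            chunk = next(it, None)
            if chunk is None:
                raise ValueError("truncated quality for %s" % ident)
            qual += chunk
        if len(qual) != len(seq):
            raise ValueError("quality/sequence length mismatch for %s" % ident)
        yield ident, desc, seq, qual


def format_fastq(ident, desc, seq, qual):
    """Format one record, omitting the repeated title on the '+' line."""
    title = ident if not desc else "%s %s" % (ident, desc)
    return "@%s\n%s\n+\n%s\n" % (title, seq, qual)
```

Writing a bare "+" as in `format_fastq` saves one copy of the title per read, which is where the file-size difference in point (4) comes from.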
>>>>
>>>> Peter
>>>
>>> Mark, getting all that? ;>
>>>
>>> chris
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>

---
Senior Lecturer, Bioinformatics
UCL Cancer Institute
Paul O' Gorman Building
University College London
Gower Street
WC1E 6BT
London
UK