[Bioperl-l] Creating a fastq format file?

Chris Fields cjfields at illinois.edu
Mon Apr 27 13:10:04 UTC 2009


This is going within Bio::Seq::Quality, correct?  Does  
Bio::Seq::Quality have a method that indicates what format the quality  
scores are actually in (sanger/illumina/illumina1.3/phred/foo)?  The  
reason I worry about this is quality scores appear inseparable from  
their quality format (ranges vary in length, for instance).

For instance, if I picked a Bio::Seq::Quality out of the blue, could I  
tell which quality format it originated from w/o guessing, and  
similarly could I accurately convert it to another qual format?  To me  
it seems we need something in Bio::Seq::Quality akin to the alphabet()  
method used for sequence data.
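
To make the idea concrete, here is a minimal sketch (in Python rather than Perl, so that it does not look like existing BioPerl API) of a quality object that remembers which format it came from. The names QualityRecord, decode, encode and the OFFSETS table are hypothetical; the offsets are the ones quoted further down in this thread (Sanger is value + 33, Illumina 1.3 is value + 64), and the upper limit of ASCII 126 is an assumption.

```python
from dataclasses import dataclass

# ASCII offsets as stated in this thread; the names are hypothetical.
OFFSETS = {"sanger": 33, "illumina1.3": 64}
MAX_ASCII = 126  # assumption: last printable ASCII character


@dataclass
class QualityRecord:
    scores: list   # numeric scores, independent of any text encoding
    encoding: str  # format the scores were read from, e.g. "sanger"


def decode(quality_line, encoding):
    """Turn a quality string into numeric scores tagged with their origin."""
    offset = OFFSETS[encoding]
    scores = [ord(ch) - offset for ch in quality_line]
    if any(q < 0 for q in scores):
        raise ValueError(f"character below offset {offset} for {encoding}")
    return QualityRecord(scores, encoding)


def encode(record, encoding):
    """Write the numeric scores back out using another format's offset."""
    offset = OFFSETS[encoding]
    if any(q + offset > MAX_ASCII for q in record.scores):
        raise ValueError(f"score too large to encode as {encoding}")
    return "".join(chr(q + offset) for q in record.scores)
```

With the origin stored next to the scores, converting between formats needs no guessing: decode("II5!", "sanger") gives the scores 40, 40, 20, 0, and encoding that record as "illumina1.3" gives "hhT@".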

chris

On Apr 27, 2009, at 4:38 AM, Heikki Lehvaslaiho wrote:

> I convinced myself, at least, to the degree that I wrote the
> range_convert() method, with plenty of tests. I mention this now so
> that no one else needs to start thinking through all the edge values.
> :)
>
> I'll contribute it to the code base once there is a consensus on the
> best way forward.
>
>    -Heikki
>
> 2009/4/27 Heikki Lehvaslaiho <heikki.lehvaslaiho at gmail.com>:
>>> I have tried to summarise this in a central place:
>>> http://en.wikipedia.org/wiki/FASTQ_format
>>
>> Torsten,
>>
>> Thanks for putting this together. Very helpful.
>>
>> Do you have a plan of action?  Let me propose one for BioPerl. It is
>> based on the following assumptions:
>>
>> 1. There is a multitude of different ways of coding quality values
>> out there.
>> 2. Bio::Seq::Quality is agnostic of any quality value range rules
>> 3. The emerging open standard is the Sanger fastq specification
>> 4. Open source programs use the Sanger fastq specs
>>
>>
>> From these it follows that:
>>
>>
>> 1. BioPerl should support Sanger fastq standard
>>
>> 1.1. it already does and there are other SeqIO modules for dealing
>> with other non-fastq formats.
>>
>> 2. BioPerl should offer simple ways of converting between quality
>> range rules
>>
>> 2.1. Have a generic method accessible from Bio::Seq::Quality with
>> preset versions of the method for converting between known variants
>> (Sanger fastq and the two Illumina versions)
>>
>> For example:
>>
>> range_convert ($from_lower, $from_upper, $to_lower, $to_upper, $value)
>>  throw if $value < $from_lower or $value > $from_upper
>>  return $newvalue
>>
>> range_convert_illumina2fastq(), range_convert_fastq2illumina(),
>> range_convert_fastq2phred(),  range_convert_phred2fastq()....
>>
>> (assuming that illumina 1.3 eq phred)
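
A minimal sketch of how the range_convert() proposed above might behave, written in Python so as not to suggest this is the code Heikki wrote. The proposal only fixes the signature and the out-of-range check; the mapping itself (shifting the value by the difference between the two lower bounds) and the extra check against $to_upper are assumptions, as are the ASCII bounds 64..126 and 33..126 used in the preset.

```python
def range_convert(from_lower, from_upper, to_lower, to_upper, value):
    """Map value from one quality range into another.

    Assumption: the conversion is a plain shift that keeps the distance
    from the lower bound; the thread does not specify the mapping.
    """
    if value < from_lower or value > from_upper:
        raise ValueError(f"{value} is outside {from_lower}..{from_upper}")
    new_value = value - from_lower + to_lower
    # Ranges can differ in length, so the shifted value may not fit.
    if new_value > to_upper:
        raise ValueError(f"{new_value} is above target maximum {to_upper}")
    return new_value


def range_convert_illumina2fastq(value):
    """Preset: Illumina 1.3 ASCII code to Sanger ASCII code (assumed bounds)."""
    return range_convert(64, 126, 33, 126, value)
```

For example, range_convert_illumina2fastq(104) maps the Illumina character code for "h" to 73, the Sanger code for "I". Going the other way, a Sanger code of 126 cannot be represented in the shorter Illumina range and raises an error.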
>>
>> 2.2. Bio::SeqIO::Fastq::next_seq methods should convert Illumina
>> qualities into Sanger fastq on the fly
>>
>> 2.2.1. Bio::SeqIO::Fastq::next_seq should either detect the quality
>> value range of the incoming stream automatically or be given a keyword
>> parameter indicating the range.
>>
>> 2.2.2. Bio::SeqIO::Fastq::next_seq should throw an error if it detects
>> a quality value out of range.
>>
>> 2.2.3. Bio::SeqIO::Fastq::write_seq should throw an error if it
>> detects a quality value out of range.
>>
>> 2.2.4. It would be useful but not absolutely necessary for
>> Bio::SeqIO::Fastq::write_seq to be able to write out in Illumina
>> ranges
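
The automatic detection in 2.2.1 could be sketched along these lines (Python, with hypothetical names). It uses the ranges mentioned elsewhere in this thread: Sanger characters start at ASCII 33, the old Illumina scores of -5 and up with offset 64 start at ASCII 59, and Illumina 1.3 scores of 0 and up start at ASCII 64. The shared upper limit of 126 is an assumption. Because the ranges overlap, the sketch reports every format a quality line is consistent with rather than picking one.

```python
# Lowest ASCII code each format can produce (from the thread); the upper
# limit of 126 and all names here are assumptions for illustration.
ASCII_RANGES = {
    "sanger": (33, 126),
    "solexa": (59, 126),       # old Illumina, Q from -5, offset 64
    "illumina1.3": (64, 126),  # Q from 0, offset 64
}


def candidate_encodings(quality_line):
    """Return the formats whose ASCII range covers every character."""
    codes = [ord(ch) for ch in quality_line]
    if not codes:
        return sorted(ASCII_RANGES)
    low, high = min(codes), max(codes)
    return sorted(
        name
        for name, (lo, hi) in ASCII_RANGES.items()
        if lo <= low and high <= hi
    )
```

A line containing "!" can only be Sanger, while a line of high-quality characters such as "hhT@" fits all three formats, which is why a keyword parameter is still needed as a fallback.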
>>
>>
>> What do you think?
>>
>>    -Heikki
>>
>> 2009/4/26 Torsten Seemann <torsten.seemann at infotech.monash.edu.au>:
>>>>> This might be a good place to ask the question: having looked at the
>>>>> fastq.pm page, is the fastq format defined (only) by a "@" header
>>>>> followed by a sequence line and a "+" header followed by a quality
>>>>> line, where the two headers have to agree? Now that Illumina is
>>>>> using phred scaling, are 'Sanger' and 'Illumina' versions the same?
>>>>
>>>> No, they aren't the same. Illumina still encodes the ASCII as
>>>> value + 64 and Sanger as value + 33.
>>>>
>>>
>>> Illumina have now CHANGED how they calculate the quality value
>>> however in the last month or so... Their Q range used to be -5..40
>>> mapped to ASCII 64+, but now they produce Q >= 0 and it is unclear
>>> if they start at 69 or 64 now...
>>>
>>> I have tried to summarise this in a central place:
>>>
>>> http://en.wikipedia.org/wiki/FASTQ_format
>>>
>>> Corrections welcome!
>>>
>>>
>>> --Torsten Seemann
>>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
>>> University, AUSTRALIA
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>>
>>
>> --
>>    -Heikki
>> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
>> cell: +27 (0)714328090
>> Sent from Claremont, WC, South Africa
>>
