[Bioperl-l] Creating a fastq format file?

Heikki Lehvaslaiho heikki.lehvaslaiho at gmail.com
Mon Apr 27 15:53:12 UTC 2009


2009/4/27 Chris Fields <cjfields at illinois.edu>:
> This is going within Bio::Seq::Quality, correct?

Yes.

> Does Bio::Seq::Quality
> have a method that indicates what format the quality scores are actually in
> (sanger/illumina/illumina1.3/phred/foo)?  The reason I worry about this is
> quality scores appear inseparable from their quality format (ranges vary in
> length, for instance).

No method.

> For instance, if I picked a Bio::Seq::Quality out of the blue, could I tell
> which quality format it originated from w/o guessing, and similarly could I
> accurately convert it to another qual format?  To me it seems we need
> something in Bio::Seq::Quality akin to the alphabet() method used for
> sequence data.

The text formats encode the quality values in different ways, but they
are all stored as integer arrays in the object. Converting between
them is relatively easy.
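
To illustrate, here is a minimal sketch in Python rather than BioPerl, so that it stays self-contained. The function names are hypothetical and not part of any BioPerl module; the two offsets are the ones quoted further down this thread (Sanger as value + 33, Illumina as value + 64).

```python
# Offsets as quoted in this thread: Sanger = value + 33, Illumina = value + 64.
OFFSETS = {"sanger": 33, "illumina": 64}


def decode_quality(text, variant):
    """Turn a FASTQ quality string into a list of integer scores."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in text]


def encode_quality(scores, variant):
    """Turn a list of integer scores back into a FASTQ quality string."""
    offset = OFFSETS[variant]
    return "".join(chr(score + offset) for score in scores)


# The integer array is the common representation; only the text differs.
scores = decode_quality("II5", "sanger")
illumina_text = encode_quality(scores, "illumina")
```

This only covers the offset difference. It assumes both variants use the same score scale, which is not true for the older Illumina scores discussed below.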

You are right: quality_format() or even plain format() is needed. The
SeqIO methods creating the objects should be setting it. Warnings for
unset format values should be added in the appropriate places.
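
Roughly what I have in mind, again sketched in Python only to show the idea. QualityRecord and require_format are made-up names for illustration, not an existing or planned BioPerl interface.

```python
import warnings
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QualityRecord:
    """Hypothetical stand-in for a sequence object carrying quality scores."""
    scores: List[int]
    quality_format: Optional[str] = None  # e.g. "sanger" or "illumina"


def require_format(record):
    """Return the record's format tag, warning if the reader never set it."""
    if record.quality_format is None:
        warnings.warn("quality_format is unset; scores cannot be converted safely")
    return record.quality_format
```

A reader would set quality_format when it builds the object, and anything that converts or writes scores would call require_format() first.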

   -Heikki

> chris
>
> On Apr 27, 2009, at 4:38 AM, Heikki Lehvaslaiho wrote:
>
>> I convinced at least myself, to the degree that I wrote the
>> range_convert() method, with plenty of tests. I mention this now so
>> that no one else needs to start thinking through all the edge values.
>> :)
>>
>> I'll contribute it to the code base once there is a consensus on the
>> best way forward.
>>
>>   -Heikki
>>
>> 2009/4/27 Heikki Lehvaslaiho <heikki.lehvaslaiho at gmail.com>:
>>>>
>>>> I have tried to summarise this in a central place:
>>>> http://en.wikipedia.org/wiki/FASTQ_format
>>>
>>> Torsten,
>>>
>>> Thanks for putting this together. Very helpful.
>>>
>>> Do you have a plan of action?  Let me propose one for BioPerl. It is
>>> based on the following assumptions:
>>>
>>> 1. There is a multitude of different ways of encoding quality values
>>> out there.
>>> 2. Bio::Seq::Quality is agnostic of any quality value range rules
>>> 3. The emerging open standard is the Sanger fastq specification
>>> 4. Open source programs use the Sanger fastq specs
>>>
>>>
>>> From these it follows that:
>>>
>>>
>>> 1. BioPerl should support the Sanger fastq standard
>>>
>>> 1.1. it already does and there are other SeqIO modules for dealing
>>> with other non-fastq formats.
>>>
>>> 2. BioPerl should offer simple ways of converting between quality range
>>> rules
>>>
>>> 2.1. Have a generic method accessible from Bio::Seq::Quality with
>>> preset versions of the method for converting between known variants
>>> (Sanger fastq and the two Illumina versions)
>>>
>>> For example:
>>>
>>> range_convert ($from_lower, $from_upper, $to_lower, $to_upper, $value)
>>>  throw if $value < $from_lower or $value > $from_upper
>>>  return $newvalue
>>>
>>> range_convert_illumina2fastq(), range_convert_fastq2illumina(),
>>> range_convert_fastq2phred(),  range_convert_phred2fastq()....
>>>
>>> (assuming that illumina 1.3 eq phred)
>>>
>>> 2.2. Bio::SeqIO::Fastq::next_seq methods should convert Illumina
>>> qualities into Sanger fastq on the fly
>>>
>>> 2.2.1 Bio::SeqIO::Fastq::next_seq should either detect the quality
>>> value range of the incoming stream automatically or be given a keyword
>>> parameter indicating the range.
>>>
>>> 2.2.2. Bio::SeqIO::Fastq::next_seq should throw an error if it detects
>>> a quality value out of range.
>>>
>>> 2.2.3. Bio::SeqIO::Fastq::write_seq should throw an error if it
>>> detects a quality value out of range.
>>>
>>> 2.2.4. It would be useful but not absolutely necessary for
>>> Bio::SeqIO::Fastq::write_seq to be able to write out in Illumina
>>> ranges
>>>
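
A sketch of how the range_convert() idea in 2.1 could look, written in Python for brevity. The proposal above only fixes the argument order and the throw on out-of-range input; the shift-by-lower-bound mapping, the extra check against the target upper bound, and the ASCII bounds used for the presets (33..126 and 64..126) are assumptions made for this illustration, and the actual method may differ.

```python
from functools import partial


def range_convert(from_lower, from_upper, to_lower, to_upper, value):
    """Shift value from one inclusive range into another (assumed mapping)."""
    if value < from_lower or value > from_upper:
        raise ValueError(f"{value} is outside {from_lower}..{from_upper}")
    new_value = value - from_lower + to_lower
    if new_value > to_upper:
        raise ValueError(f"{new_value} does not fit in {to_lower}..{to_upper}")
    return new_value


# Hypothetical presets working on ASCII codes of the quality characters.
range_convert_illumina2fastq = partial(range_convert, 64, 126, 33, 126)
range_convert_fastq2illumina = partial(range_convert, 33, 126, 64, 126)

sanger_code = range_convert_illumina2fastq(ord("h"))  # 'h' is ASCII 104
```

The presets are just the generic function with the four bounds filled in, so adding another variant means adding one more line rather than another conversion routine.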
>>>
>>> What do you think?
>>>
>>>   -Heikki
>>>
>>> 2009/4/26 Torsten Seemann <torsten.seemann at infotech.monash.edu.au>:
>>>>>>
>>>>>> This might be a good place to ask the question: having looked at the
>>>>>> fastq.pm page, is the fastq format defined (only) by a "@" followed by
>>>>>> a sequence line and a "+" header followed by a quality line and the two
>>>>>> headers have to agree? Now that Illumina is using phred scaling, are
>>>>>> 'Sanger' and 'Illumina' versions the same?
>>>>>
>>>>> No they aren't the same, Illumina still encodes the ascii as value + 64
>>>>> and Sanger as value + 33.
>>>>>
>>>>
>>>> Illumina have now CHANGED how they calculate the quality value however
>>>> in the last month or so... Their Q range used to be -5..40 mapped to
>>>> ASCII 64+, but now they produce Q >= 0 and it is unclear if they start
>>>> at 69 or 64 now...
>>>>
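
Given the numbers above, the lowest character seen in a file already narrows things down: with value + 64 and Q >= -5 nothing below ASCII 59 can appear, so anything lower must be a value + 33 file. A minimal Python sketch of that check follows; guess_offset is a hypothetical helper, and the fallback to 64 is only a heuristic, since a Sanger file containing nothing but high scores looks the same.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest character seen."""
    codes = [ord(ch) for text in quality_strings for ch in text]
    if not codes:
        raise ValueError("no quality characters to inspect")
    lowest = min(codes)
    if lowest < 33:
        raise ValueError(f"character code {lowest} is below any known offset")
    if lowest < 59:
        # Too low for value + 64 with Q >= -5, so it must be value + 33.
        return 33
    # Ambiguous in principle; assume value + 64 unless the caller says otherwise.
    return 64
```

Because of the ambiguous case, a keyword parameter that lets the caller state the range explicitly is still worth having next to any automatic detection.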
>>>> I have tried to summarise this in a central place:
>>>>
>>>> http://en.wikipedia.org/wiki/FASTQ_format
>>>>
>>>> Corrections welcome!
>>>>
>>>>
>>>> --Torsten Seemann
>>>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
>>>> University, AUSTRALIA
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>>
>>>
>>> --
>>>   -Heikki
>>> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
>>> cell: +27 (0)714328090
>>> Sent from Claremont, WC, South Africa
>>>






