[Bioperl-l] Next Gen Formats

Phillip San Miguel pmiguel at purdue.edu
Fri Mar 12 14:56:33 UTC 2010


Hi Chris,
   
    Converting back and forth from color space is something that would 
be needed. However, a warning for anyone working with color space data:

    It is a really bad idea to convert raw color space reads into 
sequence. This is because conversion propagates from the key base on the 
left to the right. A single miscalled color *anywhere* in the read will 
put every base farther down on the wrong track. This is analogous to a 
"frame shift", except there are 4 "frames", not 3.
    Meanwhile, the converse is not true: sequence space bases can be 
converted into color space without error propagation. So you want to do 
all your work in color space and convert to real sequence only at the 
end, when your consensus is certain.
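
    To make the asymmetry concrete, here is a minimal Python sketch (not 
taken from any SOLiD tool). It assumes the usual di-base scheme in which, 
with A, C, G, T numbered 0-3, each color is the XOR of the two adjacent 
base codes. The function names, the default key base "T" and the toy 
sequence are illustrative only, and reads containing uncalled colors ('.') 
are not handled.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = XOR of adjacent base codes


def encode(seq, key="T"):
    """Base space -> color space, prefixed with the key base."""
    codes = [BASES.index(b) for b in key + seq]
    colors = "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))
    return key + colors


def decode(csread):
    """Color space -> base space, walking left to right from the key base."""
    prev = BASES.index(csread[0])
    out = []
    for color in csread[1:]:
        prev ^= int(color)        # each base depends on ALL earlier colors
        out.append(BASES[prev])
    return "".join(out)


good = encode("ACGGTCA")
bad = good[:3] + "0" + good[4:]   # miscall the third color only

# Bases that differ after decoding the read with one wrong color.
n_wrong_bases = sum(a != b for a, b in zip(decode(good), decode(bad)))

# Colors that differ after changing one base (G -> A) before encoding.
mutated = encode("ACGATCA")
n_changed_colors = sum(a != b for a, b in zip(good[1:], mutated[1:]))

print(decode(good), decode(bad), n_wrong_bases)
print(good, mutated, n_changed_colors)
```

    In this toy case the single wrong color corrupts every base to its 
right, while the single wrong base changes only the two colors that 
touch it.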

    A little more detail here:

http://seqanswers.com/forums/showthread.php?t=3367

    For people wanting to use a non-color space aware program for 
analysis of color space data, it is possible to use a process called 
"double encoding", where the colors 0, 1, 2, 3 are simply replaced 
with A, C, G, T of a "fake" base space. This is nearly the same as 
working in color space and does not incur the propagation error issues. 
However, it is fraught with the obvious problems: you might later confuse 
the double encoded sequence with true sequence space, with likely 
maddening results. Also, to get the opposite strand of a color space read 
you reverse without complementing. So in double encoded form the top and 
bottom strands are not reverse complements of each other.
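
    A small Python sketch of double encoding and the strand issue, under 
the same XOR assumption for the color scheme. The helper names are 
hypothetical, and mapping an uncalled color '.' to 'N' is my assumption 
rather than a fixed rule. The input here is the color string without the 
key base.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = XOR of adjacent base codes
TO_FAKE = str.maketrans("0123.", "ACGTN")   # '.' -> 'N' is an assumption
FROM_FAKE = str.maketrans("ACGTN", "0123.")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def colors_of(seq):
    """Colors between adjacent bases of a true base-space sequence."""
    codes = [BASES.index(b) for b in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def revcomp(seq):
    """Ordinary reverse complement, valid for true base space only."""
    return seq.translate(COMPLEMENT)[::-1]


def double_encode(colors):
    """Replace colors with letters of a 'fake' base space."""
    return colors.translate(TO_FAKE)


def double_decode(fake):
    """Recover the colors from a double encoded string."""
    return fake.translate(FROM_FAKE)


seq = "ACGGTCA"
top = double_encode(colors_of(seq))
bottom = double_encode(colors_of(revcomp(seq)))

print(top, bottom)
print(bottom == top[::-1])     # opposite strand is a plain reversal
print(bottom == revcomp(top))  # reverse-complementing fake bases is wrong
```

    A program that is not color space aware will apply the ordinary 
reverse complement to the fake bases, which is why strand handling has 
to be done outside it.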

    Finally, Kevin McKernan said that the dual base encoding 
error-detection scheme was technically using "Perforated Convolutional 
Codes" and said these were used on 3G networks. I only mention this in 
case there are some engineering types who might be interested.

Phillip

Chris Fields wrote:
> For the colorspace fasta we could derive a parser just for that based on the current fasta parser.  They could retain their original color space designation (maybe via a meta designation), and possibly convert to sequence calls based on their mapping (if the following link is current):
>
> http://marketing.appliedbiosystems.com/images/Product_Microsites/Solid_Knowledge_MS/pdf/SOLiD_Dibase_Sequencing_and_Color_Space_Analysis.pdf
>
> Did the sequencing facility provide the actual sequence, though, and not just the color calls and qual?  Seems strange to not provide it...
>
> chris
>
> On Mar 12, 2010, at 7:43 AM, Ryan Golhar wrote:
>
>   
>> Direct from sequencing machine
>>
>> ------Original Message------
>> From: Peter
>> Sender: p.j.a.cock at googlemail.com
>> To: golharam at umdnj.edu
>> Cc: Chris Fields
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] Next Gen Formats
>> Sent: Mar 12, 2010 8:26 AM
>>
>> On Fri, Mar 12, 2010 at 1:09 PM, Ryan Golhar <golharam at umdnj.edu> wrote:
>>     
>>> Here is an example of a color-space sequence:
>>>
>>> In one file (something.csfasta):
>>>
>>> >1_30_226_F3
>>> T210320010.200.03.0110320320220212200122200.2220200
>>> >1_30_252_F3
>>> T322220212.133.00.2202322132022202221002011.0011020
>>>
>>> The '.' means the color could not be called
>>>
>>> In another file (something.qual):
>>>
>>> >1_30_226_F3
>>> 4 4 27 17 31 7 24 26 13 -1 10 25 14 -1 26 4 -1 19 9 5 6 14 12 6 9 4 4 7 7 20
>>> 4 4 19 12 12 4 4 12 10 10 5 4 -1 13 16 8 4 15 4 4
>>> >1_30_252_F3
>>> 18 4 19 15 9 4 4 5 4 -1 6 4 5 -1 5 6 -1 9 6 4 4 4 6 4 4 4 4 5 8 4 8 7 4 7 5
>>> 4 4 10 9 12 8 4 -1 6 5 5 4 10 4 12
>>>
>>> The -1 represents those colors that could not be called.
>>>       
>> Now that is funny (using -1). True PHRED scores are defined with a
>> logarithm and can't be negative. A score of zero is normally used in
>> this situation since that maps to a probability of error of 1 (i.e. the
>> read is 100% wrong, or 0% true).
>>
>> Where did these files come from? Direct from a sequencing
>> machine or via some third party script?
>>
>> Peter
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>     
>



