[Bioperl-l] Next Gen Formats
Phillip San Miguel
pmiguel at purdue.edu
Fri Mar 12 14:56:33 UTC 2010
Hi Chris,
Converting back and forth from color space is something that would
be needed. However, a warning for anyone working with color space data:
It is a really bad idea to convert raw color space reads into
sequence. This is because conversion propagates from the key base on the
left to the right. A sequence error *anywhere* in the sequence will
ensure all bases farther down will be converted on the wrong track.
Analogous to a "frame shift" -- except there are 4 "frames", not 3.
Meanwhile, the converse is not true--sequence space bases can be
converted into color space without error propagation. So you want to do
all your work in color space and convert to real sequence only at the
end, when your consensus is certain.
A little more detail here:
http://seqanswers.com/forums/showthread.php?t=3367
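To make the propagation point concrete, here is a minimal Python sketch. It assumes the usual SOLiD dibase mapping, in which a color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names `encode` and `decode` are made up for this illustration and are not part of BioPerl or any other library, and uncalled colors ('.') are not handled.

```python
# Assumption: standard SOLiD dibase mapping, color = XOR of the 2-bit codes
# of adjacent bases with A=0, C=1, G=2, T=3. Names here are illustrative only.
BASES = "ACGT"

def encode(seq):
    """Base space -> color space; the first base is kept as the key base."""
    codes = [BASES.index(b) for b in seq]
    colors = "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))
    return seq[0] + colors

def decode(csread):
    """Color space -> base space. Each base depends on the previous base,
    so one wrong color changes every base to its right. The key base is
    kept in the output here so that decode(encode(s)) == s."""
    prev = BASES.index(csread[0])
    out = [csread[0]]
    for c in csread[1:]:
        prev ^= int(c)  # a '.' (uncalled color) would raise ValueError here
        out.append(BASES[prev])
    return "".join(out)

true_seq = "TACGGTCA"
good_read = encode(true_seq)        # "T3130121"
bad_read = "T3100121"               # third color miscalled (3 -> 0)

print(decode(good_read))            # TACGGTCA
print(decode(bad_read))             # TACCCAGT: every base after the error is wrong

# The converse: one wrong base changes only the two colors that touch it.
print(encode("TACGATCA"))           # T3132321, differs from T3130121 at 2 colors
```

The last line is why aligning or calling consensus in color space is safe: a base difference shows up as two adjacent color changes, while a single isolated color change is most likely a measurement error.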
For people wanting to use a non-color space aware program for
analysis of color space data, it is possible to use a process called
"double encoding", where 0,1,2,3 bases of color space are just replaced
with A, C, G, T of a "fake" base space. This is nearly the same as
working in color space and does not incur the propagation error issues.
However, it is fraught with the obvious problems: you might later confuse
the double-encoded sequence with true sequence space data, with likely
maddening results. Also, to get the opposite strand of color space reads
you reverse without complementing. So top and bottom strands will look
different.
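A small Python sketch of double encoding and the strand issue follows. It assumes the same XOR color mapping as SOLiD dibase encoding (A=0, C=1, G=2, T=3) and leaves the key base out of the color strings. Mapping an uncalled color '.' to 'N' is my own assumption, and all function names are hypothetical.

```python
# Assumptions: colors are XORs of adjacent 2-bit base codes (A=0, C=1, G=2,
# T=3); '.' -> 'N' is an illustrative choice, not part of any specification.
BASES = "ACGT"
TO_FAKE = str.maketrans("0123.", "ACGTN")
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def colors_of(seq):
    """Colors for a base-space sequence (no key base)."""
    codes = [BASES.index(b) for b in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))

def double_encode(colors):
    """Rewrite colors 0,1,2,3 as the 'fake' bases A,C,G,T."""
    return colors.translate(TO_FAKE)

def revcomp(seq):
    """Ordinary base-space reverse complement."""
    return seq.translate(COMPLEMENT)[::-1]

seq = "TACGGTCA"
top = colors_of(seq)                 # "3130121"
bottom = colors_of(revcomp(seq))     # "1210313"

# In color space the opposite strand is the plain reverse, no complement.
print(bottom == top[::-1])           # True

fake_top = double_encode(top)        # "TCTACGC"
fake_bottom = double_encode(bottom)  # "CGCATCT"

# A program that is not color space aware would reverse-complement the fake
# bases and get the wrong answer for the other strand.
print(fake_bottom == fake_top[::-1])      # True
print(fake_bottom == revcomp(fake_top))   # False
```

So any tool run on double-encoded data needs to be told to reverse only, or to search one strand and have the reversed reads supplied separately.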
Finally, Kevin McKernan said that the dual base encoding
error-detection scheme was technically using "Perforated Convolutional
Codes" and said these were used on 3G networks. I only mention this in
case there are some engineering types who might be interested.
Phillip
Chris Fields wrote:
> For the colorspace fasta we could derive a parser just for that based on the current fasta parser. They could retain their original color space designation (maybe via a meta designation), and possibly convert to sequence calls based on their mapping (if the following link is current):
>
> http://marketing.appliedbiosystems.com/images/Product_Microsites/Solid_Knowledge_MS/pdf/SOLiD_Dibase_Sequencing_and_Color_Space_Analysis.pdf
>
> Did the sequencing facility provide the actual sequence, though, and not just the color calls and qual? Seems strange to not provide it...
>
> chris
>
> On Mar 12, 2010, at 7:43 AM, Ryan Golhar wrote:
>
>
>> Direct from sequencing machine
>>
>> ------Original Message------
>> From: Peter
>> Sender: p.j.a.cock at googlemail.com
>> To: golharam at umdnj.edu
>> Cc: Chris Fields
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] Next Gen Formats
>> Sent: Mar 12, 2010 8:26 AM
>>
>> On Fri, Mar 12, 2010 at 1:09 PM, Ryan Golhar <golharam at umdnj.edu> wrote:
>>
>>> Here is an example of a color-space sequence:
>>>
>>> In one file (something.csfasta):
>>>
>>>
>>>> 1_30_226_F3
>>>>
>>> T210320010.200.03.0110320320220212200122200.2220200
>>>
>>>> 1_30_252_F3
>>>>
>>> T322220212.133.00.2202322132022202221002011.0011020
>>>
>>> The '.' means the color could not be called
>>>
>>> In another file (something.qual):
>>>
>>>
>>>> 1_30_226_F3
>>>>
>>> 4 4 27 17 31 7 24 26 13 -1 10 25 14 -1 26 4 -1 19 9 5 6 14 12 6 9 4 4 7 7 20
>>> 4 4 19 12 12 4 4 12 10 10 5 4 -1 13 16 8 4 15 4 4
>>>
>>>> 1_30_252_F3
>>>>
>>> 18 4 19 15 9 4 4 5 4 -1 6 4 5 -1 5 6 -1 9 6 4 4 4 6 4 4 4 4 5 8 4 8 7 4 7 5
>>> 4 4 10 9 12 8 4 -1 6 5 5 4 10 4 12
>>>
>>> The -1 represents those colors that could not be called.
>>>
>> Now that is funny (using -1). True PHRED scores are defined with a
>> logarithm and can't be negative. A score of zero is normally used in
>> this situation since that maps to a probability of error of 1 (i.e. the
>> read is 100% wrong, or 0% true).
>>
>> Where did these files come from? Direct from a sequencing
>> machine or via some third party script?
>>
>> Peter
>>
>>
>> Sent from my Verizon Wireless BlackBerry
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>