[Bioperl-l] Next Gen Formats
Ryan Golhar
golharam at umdnj.edu
Fri Mar 12 13:09:40 UTC 2010
Here is an example of a color-space sequence:
In one file (something.csfasta):
>1_30_226_F3
T210320010.200.03.0110320320220212200122200.2220200
>1_30_252_F3
T322220212.133.00.2202322132022202221002011.0011020
The '.' means the color could not be called.
In another file (something.qual):
>1_30_226_F3
4 4 27 17 31 7 24 26 13 -1 10 25 14 -1 26 4 -1 19 9 5 6 14 12 6 9 4 4 7
7 20 4 4 19 12 12 4 4 12 10 10 5 4 -1 13 16 8 4 15 4 4
>1_30_252_F3
18 4 19 15 9 4 4 5 4 -1 6 4 5 -1 5 6 -1 9 6 4 4 4 6 4 4 4 4 5 8 4 8 7 4
7 5 4 4 10 9 12 8 4 -1 6 5 5 4 10 4 12
The -1 represents those colors that could not be called.
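
To make the pairing concrete, here is a minimal sketch in plain Perl (no BioPerl) that reads the first record above from both files, matches them by ID, and checks that every '.' in the color string lines up with a -1 in the quality list. The helper names parse_records and pair_record are made up for illustration. It assumes both files list reads in the same order, that lines starting with '#' are comments, and that wrapped quality lines belong to the preceding header.

```perl
use strict;
use warnings;

# First record from the example above, as it would appear in each file.
my $csfasta = <<'END';
>1_30_226_F3
T210320010.200.03.0110320320220212200122200.2220200
END

my $qual = <<'END';
>1_30_226_F3
4 4 27 17 31 7 24 26 13 -1 10 25 14 -1 26 4 -1 19 9 5 6 14 12 6 9 4 4 7
7 20 4 4 19 12 12 4 4 12 10 10 5 4 -1 13 16 8 4 15 4 4
END

# Hypothetical helper: split FASTA-like text into [id, body] pairs.
# Continuation lines are joined with a space so wrapped QUAL lines survive.
sub parse_records {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comment and blank lines
        if ($line =~ /^>(\S+)/) {
            push @records, [$1, ''];
        }
        elsif (@records) {
            $records[-1][1] .= ($records[-1][1] eq '' ? '' : ' ') . $line;
        }
    }
    return @records;
}

# Hypothetical helper: combine one csfasta record with its qual record.
sub pair_record {
    my ($seq_rec, $qual_rec) = @_;
    my ($id, $seq) = @$seq_rec;
    die "ID mismatch: $id vs $qual_rec->[0]\n" unless $id eq $qual_rec->[0];

    (my $colors = $seq) =~ s/\s+//g;
    my $primer = substr($colors, 0, 1, '');    # remove and keep the leading base
    my @colors = split //, $colors;
    my @quals  = split ' ', $qual_rec->[1];
    die "Length mismatch for $id\n" unless @colors == @quals;

    # An uncalled color ('.') should carry a quality of -1, and vice versa.
    for my $i (0 .. $#colors) {
        die "Uncalled marker mismatch at position $i of $id\n"
            if ($colors[$i] eq '.') xor ($quals[$i] == -1);
    }
    my @uncalled = grep { $colors[$_] eq '.' } 0 .. $#colors;

    return {
        id       => $id,
        primer   => $primer,
        colors   => \@colors,
        quals    => \@quals,
        uncalled => \@uncalled,
    };
}

my @seq_recs  = parse_records($csfasta);
my @qual_recs = parse_records($qual);
my $read      = pair_record($seq_recs[0], $qual_recs[0]);
printf "%s: primer %s, %d colors, uncalled at 0-based positions %s\n",
    $read->{id}, $read->{primer}, scalar @{ $read->{colors} },
    join(',', @{ $read->{uncalled} });
```

Note that the leading 'T' has no quality value of its own, so there is one score per color call rather than one per character of the sequence line.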
Chris Fields wrote:
> On Mar 12, 2010, at 4:06 AM, Peter wrote:
>
>> On Fri, Mar 12, 2010 at 3:35 AM, Chris Fields <cjfields at illinois.edu> wrote:
>>> Ryan,
>>>
>>> We would have to see example files to get an idea of how feasible it is.
>>> You could possibly use a Bio::SeqIO::fasta and a Bio::SeqIO::qual
>>> stream, and interleave the two somehow. However, BioPerl qual
>>> scores are PHRED-based by default, and I'm not sure how color-space
>>> data would work within that scheme.
>>>
>>> chris
>> Chris,
>>
>> I am under the (possibly mistaken) assumption that PHRED scores
>> are used for SOLiD color space QUAL files - the key issue is each
>> score corresponds to the color call in the color sequence.
>>
>> Ignoring color-space for a moment, are there BioPerl examples
>> of iterating over a pair of sequence-space FASTA and QUAL files?
>> i.e. What you'd get if you had a FASTQ file to iterate over.
>>
>> [I guess Ryan could just merge the color-space FASTA and
>> QUAL into a color-space FASTQ file and iterate over that]
>>
>> Peter
>
> If they're PHRED scores then it should be fine, though we may need to work in a few color-space specific things.
>
> Iterating over pairs is something that has popped up before. For output, in the Bio::SeqIO::fastq module there is code for writing fasta/qual (to two separate streams), where I'm assuming one could do something like:
>
> --------------------------------
> use Bio::SeqIO;
>
> my $in = Bio::SeqIO->new(-format => 'fastq', -file => 'foo.fastq');
> # Both writers use the 'fastq' module, which provides write_fasta and write_qual
> my $out1 = Bio::SeqIO->new(-format => 'fastq', -file => '>foo.fasta');
> my $out2 = Bio::SeqIO->new(-format => 'fastq', -file => '>foo.qual');
>
> while (my $seq = $in->next_seq) {
>     $out1->write_fasta($seq);   # sequence only
>     $out2->write_qual($seq);    # quality scores only
> }
> --------------------------------
>
> Note that all three streams use the 'fastq' format instead of 'fasta' or 'qual'. This should work for those as well; I just haven't tried it myself (it's a bug otherwise).
>
> I'm assuming for input it would be something like:
>
> --------------------------------
> use Bio::SeqIO;
>
> my $in1 = Bio::SeqIO->new(-format => 'fasta', -file => 'foo.fasta');
> my $in2 = Bio::SeqIO->new(-format => 'qual', -file => 'foo.qual');
> my $out = Bio::SeqIO->new(-format => 'fastq', -file => '>foo.fastq');
>
> # assumes the 'qual' parser can join the two streams (untested)
> while (my $seq = $in2->next_seq($in1)) {
>     $out->write_seq($seq);
> }
> --------------------------------
>
> chris
>
>
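
Peter's suggestion of merging the color-space FASTA and QUAL into a FASTQ-style file could be sketched roughly as below, in plain Perl. This is only an illustration under assumptions that are not from the thread: the function name to_csfastq is made up, the scores are taken to be PHRED values written with the Sanger offset of 33, and -1 (no call) is clamped to 0 because a negative score has no printable character at that offset. The primer base is left in the sequence line, so the quality string is one character shorter than the sequence; a strict FASTQ parser would need that handled one way or another.

```perl
use strict;
use warnings;

# Hypothetical helper: build one FASTQ-style record from a color-space read
# and its integer quality scores (one score per color call).
# Assumptions: PHRED+33 (Sanger) encoding, and -1 clamped to 0.
sub to_csfastq {
    my ($id, $cs_seq, @quals) = @_;

    # The first character is the primer base, which has no score of its own.
    die "Expected one score per color call for $id\n"
        unless length($cs_seq) - 1 == @quals;

    my $qual_str = join '', map { chr(($_ < 0 ? 0 : $_) + 33) } @quals;
    return "\@$id\n$cs_seq\n+\n$qual_str\n";
}

# Tiny made-up read: primer base T followed by four color calls, one uncalled.
my $record = to_csfastq('demo_read', 'T01.3', 4, 27, -1, 31);
print $record;
```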