[Bioperl-l] Next-Gen and the next point release - updates

Peter biopython at maubp.freeserve.co.uk
Thu Aug 27 11:55:55 UTC 2009


On Thu, Aug 27, 2009 at 3:52 AM, Chris Fields<cjfields at illinois.edu> wrote:
>
> On Aug 26, 2009, at 4:16 PM, Peter wrote:
>
>> It is looking much better than yesterday - nice work :)
>> However, there are a few rough edges still.
>
> Not unexpected, actually.
>
>> ===========================
>> Evil wrapping
>> ===========================
>> Chris - Did you get the zip file of FASTQ examples I sent off list? One of
>> these was the evil_wrapping.fastq file already in Biopython CVS/git (under
>> a new name). This is intended as a real torture test, with line wrapped
>> quality strings where plenty of the lines start with "+" or "@"
>> characters.
>> Bioperl doesn't like this file at all - but I have not dug into why.
>
> Now fixed; I've saved this as very_tricky.fastq, but it's the same file.

Looks good.

>> ===========================
>> Sanger To Illumina 1.3+
>> ===========================
>> When mapping a Sanger FASTQ file with very high scores to Illumina,
>> these don't get the maximum value imposes (ASCII 126, tidle). e.g.
>
> ...
>
> Yes, I know where that one is going wrong.  Fixed now for bounds for the
> above.  Partly related to the below.

Looks good.

>> ===========================
>> Sanger To Solexa
>> ===========================
>> Likewise when mapping a Sanger FASTQ file with very high scores to
>> Solexa FASTQ, these don't get the maximum value imposes (ASCII 126,
>> tidle). For example,
>> ...
>> i.e. You've mapped the high value scores to "<", ASCII 60, thus Solexa -4
>> (an odd thing to happen - getting the lowest score wouldn't surprise me so
>> much).
>
> This one is fixed, it was the same bounding issue as above.

Yes, the high score truncation looks good.

>> Furthermore, notice that PHRED scores 0 and 1 have both been mapped
>> to "<", ASCII 60, thus Solexa -4, and not ";" ASCII 59 meaning Solexa -5.
>
> The two conversions to solexa are still failing.  I'm not sure but I think
> it's something fairly simple, but I can't work on it until Friday (got too
> many other things on my plate ATM).  If I get stumped I'll post a message.

Actually it's not just PHRED 0 and 1 that look wrong, all of the low scores
are messed up. I could repeat this using the sanger_93.fastq file, but
to avoid email line wrapping here I'm using a smaller example file with
PHRED scores in the range 40 to 0 only:

$ cat sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+
IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!

Biopython:

$ python ./biopython_sanger2solexa.py < sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;

BioPerl SVN (with Chris' latest fixes):

$ ./bioperl_sanger2solexa.pl < sanger_faked.fastq
--------------------- WARNING ---------------------
MSG: Data loss for solexa: following values exceed max 62
0
---------------------------------------------------
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFDCA?=~

The last ten characters are wrong (i.e. PHRED score 0 to 9,
which is precisely the range where the PHRED/Solexa mapping
is non trivial). Also note that data loss warning is misleading (0
is less than 62). Plus you get the exactly same problems with
Illumina to Solexa.

This should narrow it down - the bug is in mapping PHRED
scores (from either Sanger or Illumina 1.3+ files) to the
Solexa encoding.

Peter




More information about the Bioperl-l mailing list