[Bioperl-l] warning: Bio::Index::Fastq;

Peter Cock p.j.a.cock at googlemail.com
Tue Mar 11 23:03:54 UTC 2014


Hi Russell,

The problem is telling apart quality lines which happen to start with
the at sign ('@') from the title line of a new read which must always
start with the '@' sign. A simple regex is not the answer.
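
To illustrate the ambiguity, here is a minimal Python sketch. The record and its read name are made up for illustration; only the quality string is the one quoted further down this thread. A line-by-line pattern match reports the quality line as if it were a second title line:

```python
import re

# A hypothetical four-line FASTQ record whose quality line starts with '@'.
record = [
    "@READ_1",
    "ACGTACGTACGTACGTACGT",
    "+",
    "@DDDDDFFFFFIIIDCBBBB",
]

# The pattern proposed in the thread: '@' followed by an upper-case letter.
title_like = re.compile(r"^@[A-Z]")

# 0-based line numbers that the regex would treat as the start of a record.
hits = [i for i, line in enumerate(record) if title_like.match(line)]
print(hits)
```

Both the real title (line 0) and the quality line (line 3) match, so the pattern alone cannot say where a record begins; you need to know where you are within the record.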

Chris and I (and our co-authors) talked about these kinds of issues
while preparing the NAR paper - which comes with some test files, but
even so we've found more corner cases since (like zero length
sequences, quality strings which start with '0', etc):

http://dx.doi.org/10.1093/nar/gkp1137

For indexing, if you assume the file uses four lines per read, we are
just looking for '@' at the start of every fourth line. That is fast.
What Biopython's SeqIO currently does is slower - it tracks the length
of the sequence and quality, in order that it can index evil FASTQ
with line wrapping (where there are more than four lines per record).
e.g.

https://github.com/biopython/biopython/blob/master/Tests/Quality/tricky.fastq
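
As a rough sketch of the two strategies (this is not the actual Biopython or Bio::Index::Fastq code; the function names and the tiny test inputs are hypothetical), the first function assumes exactly four lines per record, while the second counts sequence and quality characters so that wrapped records are still indexed correctly:

```python
import io


def _read_id(title):
    """Return the first word after '@' on a title line (bytes -> str)."""
    fields = title[1:].split(None, 1)
    return fields[0].decode() if fields else ""


def index_four_line(handle):
    """Map read id -> byte offset, assuming exactly four lines per record."""
    index = {}
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        if not title.startswith(b"@"):
            raise ValueError("expected '@' title line at offset %d" % offset)
        handle.readline()  # sequence
        handle.readline()  # '+' separator
        handle.readline()  # quality (may itself start with '@')
        index[_read_id(title)] = offset
    return index


def index_wrapped(handle):
    """Map read id -> byte offset, allowing wrapped sequence/quality lines."""
    index = {}
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # blank line, e.g. after a zero-length record
        if not title.startswith(b"@"):
            raise ValueError("expected '@' title line at offset %d" % offset)
        # Sequence lines continue until the '+' separator line.
        seq_len = 0
        line = handle.readline()
        while line and not line.startswith(b"+"):
            seq_len += len(line.rstrip(b"\r\n"))
            line = handle.readline()
        if not line:
            raise ValueError("missing '+' line for record at %d" % offset)
        # Quality lines continue until they match the sequence length,
        # regardless of which character each line starts with.
        qual_len = 0
        while qual_len < seq_len:
            line = handle.readline()
            if not line:
                raise ValueError("truncated quality at record %d" % offset)
            qual_len += len(line.rstrip(b"\r\n"))
        if qual_len != seq_len:
            raise ValueError("quality longer than sequence at %d" % offset)
        index[_read_id(title)] = offset
    return index


# Strict four-line input; both quality lines start with '@'.
strict = b"@a\nACGT\n+\n@III\n@b\nAC\n+\n@@\n"
# Wrapped input; quality lines start with '@' and '+'.
wrapped = b"@r1 desc\nACGT\nACGT\n+\n@III\n+III\n@r2\nAC\n+\n@I\n"

strict_index = index_four_line(io.BytesIO(strict))
wrapped_index = index_wrapped(io.BytesIO(wrapped))
print(strict_index)
print(wrapped_index)
```

The four-line version never inspects the quality line, which is why it is fast and why a leading '@' there does no harm. On the wrapped input it would lose track of the record boundaries, which is the case the length-tracking version is meant to cover.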

Peter

On Tue, Mar 11, 2014 at 7:39 PM, Smithies, Russell
<Russell.Smithies at agresearch.co.nz> wrote:
> To cover the possibility of multiple '@' symbols (apparently there is no single word to describe this symbol) shouldn't it be:
>
> if (/^@+[A-Z]/) {
>
> or
>
> if (/^@[@A-Z]+/) {
>
>
> --Russell
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Fields, Christopher J
> Sent: Wednesday, 12 March 2014 8:20 a.m.
> To: Ehsan Habibi
> Cc: BioPerl List; Peter Cock
> Subject: Re: [Bioperl-l] warning: Bio::Index::Fastq;
>
> Wouldn't this fail with a qual string like '@DDDDDFFFFFIIIDCBBBB'? Preferably it should pass the FASTQ torture test suite that Peter set up (which IIRC has a few like this).
>
> chris
>
> On Mar 11, 2014, at 1:33 PM, Ehsan Habibi <eh_mcb at yahoo.com> wrote:
>
>> I changed the following part of the code (Bio/Index/Fastq.pm) and now it works perfect :)
>>      # Main indexing loop
>> while (<$FASTQ>) {
>>  if (/^@[A-Z]/) {
>>
>>
>>
>>
>>
>> On Tuesday, March 11, 2014 5:40 PM, "Fields, Christopher J" <cjfields at illinois.edu> wrote:
>> On Mar 11, 2014, at 11:16 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>>
>> > On Tue, Mar 11, 2014 at 4:01 PM, EpiMAN <eh_mcb at yahoo.com> wrote:
>> >> test <http://bioperl.996286.n3.nabble.com/file/n17374/test>
>> >>
>> >> grep -C 10 "^@@DDDDDFFFFFIIIDCBBBB"
>> >
>> > Very helpful - it does look like it came from a valid FASTQ (the --
>> > lines are where grep cuts each snippet).
>> >
>> > I think I can explain this, see:
>> > https://github.com/bioperl/bioperl-live/commit/b45e2d33984cf283b846d7c146ec9e0e9ebae67f
>> >
>> > Chris' hack handled quality lines starting with a single '@'
>> > sign, but not multiple '@' signs like your examples, which start
>> > with '@@' instead.
>> >
>> > Peter
>>
>>
>> We probably need a stricter index method analogous to the Bio::SeqIO::fastq parser.  I'll see if I can eke out time today to look into it.
>>
>> chris
>>
>>
>>
>
>


