[Bioperl-l] warning: Bio::Index::Fastq;

Tue Mar 11 23:18:37 UTC 2014

My feeling is that we could implement a switch to allow fast parsing if the 4-line convention is used, and a more diligent parser that deals with trickier FASTQ files.  Frankly, I don’t know offhand of any cases where sequencers that output FASTQ wrap lines (maybe Roche 454? but the standard there is SFF, and anyway 454 is going away...).

Even the PacBio and Moleculo data we have seen all uses the 4-line format.

chris

On Mar 11, 2014, at 6:03 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> Hi Russell,
> 
> The problem is telling apart quality lines which happen to start with
> the at sign ('@') from the title line of a new read which must always
> start with the '@' sign. A simple regex is not the answer.
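
To make the point concrete, here is a small Python sketch that runs the three patterns proposed in this thread against a few lines. The read names "@READ_1" and "@read_1" are hypothetical; the two quality strings are the ones quoted in the thread. The Perl patterns behave the same way under Python's re module for these inputs.

```python
import re

# The three title-line patterns suggested in the thread.
PATTERNS = {
    "original": re.compile(r"^@[A-Z]"),
    "one_or_more_at": re.compile(r"^@+[A-Z]"),
    "at_then_class": re.compile(r"^@[@A-Z]+"),
}

# Hypothetical title lines plus the quality strings quoted in the thread.
LINES = {
    "title": "@READ_1",
    "lowercase_title": "@read_1",
    "qual_single_at": "@DDDDDFFFFFIIIDCBBBB",
    "qual_double_at": "@@DDDDDFFFFFIIIDCBBBB",
}


def classify(lines=LINES, patterns=PATTERNS):
    """Return {pattern name: {line name: True if treated as a title}}."""
    return {
        pattern_name: {
            line_name: bool(regex.search(text))
            for line_name, text in lines.items()
        }
        for pattern_name, regex in patterns.items()
    }


if __name__ == "__main__":
    for pattern_name, hits in classify().items():
        print(pattern_name, hits)
```

Every pattern accepts at least one quality line as a title, and all of them reject the lowercase title, so no variant of the regex separates the two kinds of line.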
> 
> Chris and I (and our co-authors) talked about these kinds of issues
> while preparing the NAR paper - which comes with some test files, but
> even so we've found more corner cases since (like zero length
> sequences, quality strings which start with '0', etc):
> 
> http://dx.doi.org/10.1093/nar/gkp1137
> 
> For indexing, if you assume the file uses four lines per read, we are
> only looking for '@' at the start of every fourth line. That is fast.
> What Biopython's SeqIO currently does is slower - it tracks the length
> of the sequence and quality, in order that it can index evil FASTQ
> with line wrapping (where there are more than four lines per record).
> e.g.
> 
> https://github.com/biopython/biopython/blob/master/Tests/Quality/tricky.fastq
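
The length-tracking idea described above can be sketched roughly as follows. This is not Biopython's actual code; the function name index_wrapped_fastq is hypothetical, and the sketch assumes a binary-mode handle and that sequence lines never start with '+'.

```python
def index_wrapped_fastq(handle):
    """Map read ID -> byte offset, tolerating line-wrapped records.

    The quality block is consumed by length (it must match the sequence
    length), so a quality line starting with '@' is never taken for a title.
    """
    index = {}
    offset = 0
    line = handle.readline()
    while line:
        if not line.startswith(b"@"):
            raise ValueError("expected '@' title line at byte %d" % offset)
        record_start = offset
        fields = line[1:].split(None, 1)
        read_id = fields[0].decode("ascii") if fields else ""
        offset += len(line)

        # Sequence lines run until the '+' separator line.
        seq_len = 0
        line = handle.readline()
        while line and not line.startswith(b"+"):
            seq_len += len(line.strip())
            offset += len(line)
            line = handle.readline()
        if not line:
            raise ValueError("record %r has no '+' line" % read_id)
        offset += len(line)

        # Quality lines run until as many characters as the sequence are read.
        qual_len = 0
        while qual_len < seq_len:
            line = handle.readline()
            if not line:
                raise ValueError("record %r is truncated" % read_id)
            qual_len += len(line.strip())
            offset += len(line)
        if qual_len != seq_len:
            raise ValueError("record %r: quality longer than sequence" % read_id)

        index[read_id] = record_start
        line = handle.readline()
        # Skip blank lines: the empty quality line of a zero-length read,
        # or trailing blank lines at the end of the file.
        while line and not line.strip():
            offset += len(line)
            line = handle.readline()
    return index
```

The extra bookkeeping per record is the cost of handling wrapped files, compared with checking only every fourth line.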
> 
> Peter
> 
> On Tue, Mar 11, 2014 at 7:39 PM, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz> wrote:
>> To cover the possibility of multiple '@' symbols (apparently there is no single word to describe this symbol) shouldn't it be:
>> 
>> if (/^@+[A-Z]/) {
>> 
>> or
>> 
>> if (/^@[@A-Z]+/) {
>> 
>> 
>> --Russell
>> 
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Fields, Christopher J
>> Sent: Wednesday, 12 March 2014 8:20 a.m.
>> To: Ehsan Habibi
>> Cc: BioPerl List; Peter Cock
>> Subject: Re: [Bioperl-l] warning: Bio::Index::Fastq;
>> 
>> Wouldn't this fail with a qual string like '@DDDDDFFFFFIIIDCBBBB'?  Preferably it should pass the FASTQ torture test suite that Peter set up (which IIRC has a few like this).
>> 
>> chris
>> 
>> On Mar 11, 2014, at 1:33 PM, Ehsan Habibi <eh_mcb at yahoo.com> wrote:
>> 
>>> I changed the following part of the code (Bio/Index/Fastq.pm) and now it works perfectly :)
>>>     # Main indexing loop
>>> while (<$FASTQ>) {
>>> if (/^@[A-Z]/) {
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tuesday, March 11, 2014 5:40 PM, "Fields, Christopher J" <cjfields at illinois.edu> wrote:
>>> On Mar 11, 2014, at 11:16 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>> 
>>> 
>>>> On Tue, Mar 11, 2014 at 4:01 PM, EpiMAN <eh_mcb at yahoo.com> wrote:
>>>>> test <http://bioperl.996286.n3.nabble.com/file/n17374/test>
>>>>> 
>>>>> grep -C 10 "^@@DDDDDFFFFFIIIDCBBBB"
>>>> 
>>>> Very helpful - it does look like it came from a valid FASTQ (the --
>>>> lines are where grep cuts each snippet).
>>>> 
>>>> I think I can explain this, see:
>>>> https://github.com/bioperl/bioperl-live/commit/b45e2d33984cf283b846d7c146ec9e0e9ebae67f
>>>> 
>>>> Chris' hack handled quality lines starting with a single '@'
>>>> sign, but not multiple '@' signs like your examples, which start
>>>> with '@@'.
>>>> 
>>>> Peter
>>> 
>>> 
>>> We probably need a stricter index method analogous to the Bio::SeqIO::fastq parser.  I'll see if I can eke out time today to look into it.
>>> 
>>> chris
>>> 
>>> 
>>> 
>> 
>> 