[Bioperl-l] warning: Bio::Index::Fastq;
Fields, Christopher J
cjfields at illinois.edu
Tue Mar 11 23:18:37 UTC 2014
My feeling is that we could implement a switch: fast parsing when the 4-line convention is used, and a more diligent parser for trickier FASTQ files. Frankly, I don’t know offhand of any sequencers that emit FASTQ with wrapped lines (maybe Roche 454? But the standard there is SFF, and anyway 454 is going away...).
Even the PacBio and Moleculo data we have seen all use the 4-line format.
chris
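
A rough sketch of what such a switch might look like, written in Python for brevity. This is not the Bio::Index::Fastq code; the function names and the `strict` flag are made up for illustration. The fast path assumes exactly four lines per record; the strict path counts sequence characters and then reads quality lines until the lengths agree, so a quality line starting with '@' is never mistaken for a title.

```python
import io


def index_fastq(handle, strict=False):
    """Map read ID -> byte offset for a FASTQ handle opened in binary mode.

    `strict` is a hypothetical switch: False assumes four lines per record,
    True tracks sequence/quality lengths to cope with wrapped records.
    """
    index = {}
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # skip blank lines, e.g. an empty quality line
        if not title.startswith(b"@"):
            raise ValueError(f"expected '@' title line at byte {offset}")
        fields = title[1:].split(None, 1)
        read_id = fields[0].decode("ascii") if fields else ""
        if strict:
            _skip_wrapped_record(handle, offset)
        else:
            _skip_four_line_record(handle, offset)
        index[read_id] = offset
    return index


def _skip_four_line_record(handle, offset):
    """Fast path: consume sequence, '+' and quality as one line each."""
    handle.readline()  # sequence
    plus = handle.readline()
    handle.readline()  # quality
    if not plus.startswith(b"+"):
        raise ValueError(f"record at byte {offset} is not in four-line format")


def _skip_wrapped_record(handle, offset):
    """Strict path: read quality lines until they match the sequence length."""
    seq_len = 0
    while True:
        line = handle.readline()
        if not line:
            raise ValueError(f"record at byte {offset} has no '+' line")
        if line.startswith(b"+"):
            break
        seq_len += len(line.strip())
    qual_len = 0
    while qual_len < seq_len:
        line = handle.readline()
        if not line:
            raise ValueError(f"record at byte {offset} has truncated quality")
        qual_len += len(line.strip())
    if qual_len != seq_len:
        raise ValueError(f"record at byte {offset}: quality and sequence lengths differ")


# Line-wrapped example whose quality lines start with '@'.
wrapped = io.BytesIO(b"@r1 first\nACGT\nACGT\n+\n@@DD\nIIII\n@r2\nAC\n+\n@I\n")
print(index_fastq(wrapped, strict=True))
```

The fast path raises an error on wrapped input rather than silently mis-indexing it, so a caller could fall back to the strict path when that happens.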
On Mar 11, 2014, at 6:03 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi Russell,
>
> The problem is telling apart quality lines which happen to start with
> the at sign ('@') from the title line of a new read which must always
> start with the '@' sign. A simple regex is not the answer.
>
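
To make the point concrete, here is a small Python check of the three patterns proposed elsewhere in this thread against the two quality strings that were reported. The title line is a made-up read name for illustration; the quality strings are the ones quoted in the thread. Every pattern accepts at least one quality line as if it were a title.

```python
import re

# The three patterns suggested in the thread.
PATTERNS = {
    "^@[A-Z]": re.compile(r"^@[A-Z]"),
    "^@+[A-Z]": re.compile(r"^@+[A-Z]"),
    "^@[@A-Z]+": re.compile(r"^@[@A-Z]+"),
}

LINES = {
    # Hypothetical title line, invented for this example.
    "title": "@HWI-ST123:1:1101:1234:5678 1:N:0:ATCACG",
    # Quality strings reported in the thread.
    "quality starting '@'": "@DDDDDFFFFFIIIDCBBBB",
    "quality starting '@@'": "@@DDDDDFFFFFIIIDCBBBB",
}


def false_positives(pattern):
    """Names of quality lines that the pattern would treat as a title."""
    return [
        name
        for name, line in LINES.items()
        if name != "title" and pattern.search(line)
    ]


for label, pattern in PATTERNS.items():
    print(label, "->", false_positives(pattern))
```

Since a quality line may legally contain any of the characters a title line can start with, no pattern applied to a single line can separate the two; the record position or the sequence length has to be taken into account.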
> Chris and I (and our co-authors) talked about these kinds of issues
> while preparing the NAR paper - which comes with some test files, but
> even so we've found more corner cases since (like zero length
> sequences, quality strings which start with '0', etc):
>
> http://dx.doi.org/10.1093/nar/gkp1137
>
> For indexing, if you assume the file uses four lines per read, we are
> looking for the '@' lines on every fourth line. That is fast.
> What Biopython's SeqIO currently does is slower - it tracks the length
> of the sequence and quality, in order that it can index evil FASTQ
> with line wrapping (where there are more than four lines per record).
> e.g.
>
> https://github.com/biopython/biopython/blob/master/Tests/Quality/tricky.fastq
>
> Peter
>
> On Tue, Mar 11, 2014 at 7:39 PM, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz> wrote:
>> To cover the possibility of multiple '@' symbols (apparently there is no single word to describe this symbol) shouldn't it be:
>>
>> if (/^@+[A-Z]/) {
>>
>> or
>>
>> if (/^@[@A-Z]+/) {
>>
>>
>> --Russell
>>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Fields, Christopher J
>> Sent: Wednesday, 12 March 2014 8:20 a.m.
>> To: Ehsan Habibi
>> Cc: BioPerl List; Peter Cock
>> Subject: Re: [Bioperl-l] warning: Bio::Index::Fastq;
>>
>> Wouldn't this fail with a qual string like '@DDDDDFFFFFIIIDCBBBB'? Preferably it should pass the FASTQ torture test suite that Peter set up (which IIRC has a few like this).
>>
>> chris
>>
>> On Mar 11, 2014, at 1:33 PM, Ehsan Habibi <eh_mcb at yahoo.com> wrote:
>>
>>> I changed the following part of the code (Bio/Index/Fastq.pm) and now it works perfectly :)
>>> # Main indexing loop
>>> while (<$FASTQ>) {
>>> if (/^@[A-Z]/) {
>>>
>>> On Tuesday, March 11, 2014 5:40 PM, "Fields, Christopher J" <cjfields at illinois.edu> wrote:
>>> On Mar 11, 2014, at 11:16 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>>
>>>
>>>> On Tue, Mar 11, 2014 at 4:01 PM, EpiMAN <eh_mcb at yahoo.com> wrote:
>>>>> test <http://bioperl.996286.n3.nabble.com/file/n17374/test>
>>>>>
>>>>> grep -C 10 "^@@DDDDDFFFFFIIIDCBBBB"
>>>>
>>>> Very helpful - it does look like it came from a valid FASTQ (the --
>>>> lines are where grep cuts each snippet).
>>>>
>>>> I think I can explain this, see:
>>>> https://github.com/bioperl/bioperl-live/commit/b45e2d33984cf283b846d7c146ec9e0e9ebae67f
>>>>
>>>> Chris' hack handled quality lines starting with a single '@'
>>>> sign, but not multiple '@' signs like your examples, which start
>>>> with '@@'.
>>>>
>>>> Peter
>>>
>>>
>>> We probably need a stricter index method analogous to the Bio::SeqIO::fastq parser. I'll see if I can eke out time today to look into it.
>>>
>>> chris
>>>
>>>
>>>
>>
>>