[Bioperl-l] Bio::Index::Fastq '@' in qual

Mon Oct 24 16:17:18 UTC 2011

On Mon, Oct 24, 2011 at 5:10 PM, Fields, Christopher J
<cjfields at illinois.edu> wrote:
> On Oct 24, 2011, at 10:10 AM, Peter Cock wrote:
>
>> On Mon, Oct 24, 2011 at 3:58 PM, Sofia Robb <sofia2341 at gmail.com> wrote:
>>> Hi,
>>> I am having problems running Bio::Index::Fastq.  I get the following error
>>> when a quality line begins with '@'.
>>>
>>> ...
>>>
>>> Here is an example of a fastq record that is causing this error. The last
>>> line, which starts with an '@', is actually the qual line.
>>> @5:105:15806:16092:Y
>>> GTGGCGCGGAACAGAGGAGGAATGTTCAGGAGAGGGGGCATGTGTTGTTACCGAGTACTTGGAAACGACG
>>> +
>>> @9;A565:=8B?<E<DEEBEE<E3BB?3??BCCF2<@@=BGGBDB60:64594.81?<B??;3?8-984?
>>>
>>>
>>> I see that Chris has partially addressed this on the mailing list:
>>> http://bioperl.org/pipermail/bioperl-l/2011-January/034481.html
>>>
>>> However, as he pointed out at the time, it appears this may be a fairly
>>> large problem.
>>
>> Have you double checked you have the latest BioPerl with that
>> fix Chris mentioned?
>
> This should be fixed in both CPAN and bioperl-live.  If not let me know.

Good.

>>> My fastq seq and qual lines are always exactly one line each, so I think
>>> that adding a line count and only checking for '@' on lines where
>>> $line_count % 4 == 0 would work, since the header lines are always the
>>> first of every 4 lines (0, 4, 8, etc.).
>>
>> Yes, *if* you can assume that for your data, which is an assumption I
>> wouldn't like to make in a general purpose library like BioPerl (or
>> Biopython).
>
> One could build in an optimization that takes this assumption into account
> when explicitly requested, something worth looking into.  A lot of our short
> read pipelines use the 4-line format.

That's a sensible compromise.
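
For illustration, here is a minimal Python sketch of the opt-in four-line
assumption discussed above. It is not BioPerl or Biopython code; the function
names and the "second" record are hypothetical. Because only the first line of
every group of four is checked for '@', a quality line starting with '@' is
never mistaken for a header.

```python
import io


def index_four_line_fastq(handle):
    """Map record ID -> byte offset, ASSUMING exactly four lines per record."""
    offsets = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        fields = header[1:].split()
        # Only the header and '+' positions are checked; seq and qual lines
        # are accepted whatever character they start with.
        if (not header.startswith(b"@") or not fields
                or not seq or not plus.startswith(b"+") or not qual):
            raise ValueError(f"not a four-line FASTQ record at byte {offset}")
        offsets[fields[0].decode("ascii")] = offset
    return offsets


def fetch_four_line_record(handle, offset):
    """Return the four lines of the record starting at the given byte offset."""
    handle.seek(offset)
    return [handle.readline().rstrip(b"\r\n").decode("ascii") for _ in range(4)]


# Demo: the record from the original report plus one made-up record.
demo = io.BytesIO(
    b"@5:105:15806:16092:Y\n"
    b"GTGGCGCGGAACAGAGGAGGAATGTTCAGGAGAGGGGGCATGTGTTGTTACCGAGTACTTGGAAACGACG\n"
    b"+\n"
    b"@9;A565:=8B?<E<DEEBEE<E3BB?3??BCCF2<@@=BGGBDB60:64594.81?<B??;3?8-984?\n"
    b"@second\nACGT\n+\n@@@@\n"
)
idx = index_four_line_fastq(demo)
record = fetch_four_line_record(demo, idx["second"])
```

The handle must be opened in binary mode so that tell() and seek() deal in
plain byte offsets. This only works if no sequence or quality is ever wrapped,
which is why it should stay an explicit option rather than the default.
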

>>> BioPerl fastq parsing issues aside, is there another tool which allows you
>>> to retrieve arbitrary sequences from a fastq file by sequence ID?
>>> There's one called cdbfasta which looks like it might work. Does anyone
>>> have experience with it?
>>>
>>> Thanks,
>>> sofia
>>> P.S. I am CCing Peter Cock in case BioPython has solved this issue already.
>>> If so, perhaps their solution could be applied here.
>>
>> If you want a Python solution, Biopython's Bio.SeqIO.index (in memory)
>> or Bio.SeqIO.index_db (using SQLite) functions will give you random
>> access by ID to assorted files including FASTQ, even with nasty line
>> wrapping and quality lines starting with @ or +.
>>
>> The Biopython FASTQ indexer basically tracks the state: @ header, seq
>> line(s), + line, or qual line(s). You pay a slight performance hit when
>> building the index over assuming four lines per record, but it is robust
>> to this kind of nasty data.
>>
>> Peter
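
The state tracking described above might look roughly like the following
sketch. This is an illustration written for this thread, not Biopython's
actual implementation, and the function name is hypothetical. It relies on
one assumption: the quality string has the same length as the sequence, so
quality lines are consumed until that many characters have been read,
regardless of whether they start with '@' or '+'.

```python
import io


def index_fastq_by_state(handle):
    """Map record ID -> byte offset, allowing wrapped seq and qual lines."""
    offsets = {}
    offset = handle.tell()
    line = handle.readline()
    while line:
        if not line.strip():
            # Skip blank lines between records.
            offset = handle.tell()
            line = handle.readline()
            continue
        # State 1: '@' header line.
        fields = line[1:].split()
        if not line.startswith(b"@") or not fields:
            raise ValueError(f"expected '@' header at byte {offset}")
        record_id = fields[0].decode("ascii")
        # State 2: sequence line(s), up to the '+' separator line.
        seq_len = 0
        line = handle.readline()
        while line and not line.startswith(b"+"):
            seq_len += len(line.rstrip(b"\r\n"))
            line = handle.readline()
        if not line:
            raise ValueError(f"record {record_id!r} has no '+' line")
        # State 3 and 4: after '+', read quality line(s) until the quality
        # is as long as the sequence. A leading '@' or '+' is just data here.
        qual_len = 0
        while qual_len < seq_len:
            line = handle.readline()
            if not line:
                raise ValueError(f"record {record_id!r} is truncated")
            qual_len += len(line.rstrip(b"\r\n"))
        if qual_len != seq_len:
            raise ValueError(f"record {record_id!r}: seq/qual length mismatch")
        offsets[record_id] = offset
        offset = handle.tell()
        line = handle.readline()
    return offsets


# Demo with made-up records: r1 is wrapped over two lines, and its quality
# lines start with '@' and '+'.
wrapped = io.BytesIO(b"@r1\nACGT\nACGT\n+\n@III\n+III\n@r2\nAC\n+r2\n@+\n")
state_index = index_fastq_by_state(wrapped)
```

Compared with assuming four lines per record, the extra cost is the length
bookkeeping on every line, which matches the slight performance hit mentioned
above.
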
>
> We should really look into a consistent OBDA-like indexing scheme
> that could work cross-Bio*.  Or simply resuscitate OBDA. :)
>
> chris

+1

Our SQLite index is based on OBDA, but replaces the BDB / flat file
index with SQLite3. Also, we're using the Biopython SeqIO format
names, which don't 100% align with BioPerl/EMBOSS/etc.

Peter
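
To make the idea of an SQLite-backed offset index concrete, here is a minimal
sketch using Python's standard sqlite3 module. The table name, column names
and function names are hypothetical and chosen only for illustration; they
are not the schema Biopython uses, and nothing here follows the OBDA
specification.

```python
import sqlite3


def build_offset_db(db_path, offsets):
    """Store a {record ID: byte offset} mapping in an SQLite database."""
    con = sqlite3.connect(db_path)
    with con:  # commit on success, roll back on error
        con.execute(
            "CREATE TABLE IF NOT EXISTS offsets ("
            "key TEXT PRIMARY KEY, offset INTEGER NOT NULL)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO offsets (key, offset) VALUES (?, ?)",
            offsets.items(),
        )
    return con


def lookup_offset(con, key):
    """Return the byte offset stored for a record ID, or raise KeyError."""
    row = con.execute(
        "SELECT offset FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]


# Demo with made-up offsets, using an in-memory database.
con = build_offset_db(":memory:", {"r1": 0, "r2": 26})
r2_offset = lookup_offset(con, "r2")
```

With a file path instead of ":memory:", the index persists between runs, so
the FASTQ file only has to be scanned once.
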