[Bioperl-l] Bio::Index::Fastq '@' in qual
Fields, Christopher J
cjfields at illinois.edu
Mon Oct 24 16:10:38 UTC 2011
On Oct 24, 2011, at 10:10 AM, Peter Cock wrote:
> On Mon, Oct 24, 2011 at 3:58 PM, Sofia Robb <sofia2341 at gmail.com> wrote:
>> Hi,
>> I am having problems running Bio::Index::Fastq. I get the following error
>> when a quality line begins with '@'.
>>
>> ...
>>
>> Here is an example of a fastq record that is causing this error. The last
>> line, which starts with an '@', is actually the qual line.
>> @5:105:15806:16092:Y
>> GTGGCGCGGAACAGAGGAGGAATGTTCAGGAGAGGGGGCATGTGTTGTTACCGAGTACTTGGAAACGACG
>> +
>> @9;A565:=8B?<E<DEEBEE<E3BB?3??BCCF2<@@=BGGBDB60:64594.81?<B??;3?8-984?
>>
>>
>> I see that Chris has partially addressed this on the mailing list:
>> http://bioperl.org/pipermail/bioperl-l/2011-January/034481.html
>>
>> However, as he pointed out at the time, it appears this may be a fairly large
>> problem.
>
> Have you double checked you have the latest BioPerl with that
> fix Chris mentioned?
This should be fixed in both CPAN and bioperl-live. If not, let me know.
>> My fastq seq and qual lines are always only one line each, so I think that
>> adding a line count and only checking for '@' on lines where
>> $line_count % 4 == 0 would work, since the header lines are always the first
>> of every 4 lines (0, 4, 8, etc.).
>
> Yes, *if* you can assume that for your data, which is an assumption I
> wouldn't like to make in a general purpose library like BioPerl (or Biopython).
One could build in an optimization that takes this assumption into account when explicitly requested, something worth looking into. A lot of our short read pipelines use the 4-line format.
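For illustration only, here is a minimal Python sketch of the four-line assumption discussed above. It is not BioPerl or Biopython code; the function names index_fastq_four_line and fetch_record are hypothetical, and it assumes every record is exactly four lines with no blank lines in between. Because only every fourth line is checked for '@', a quality line starting with '@' causes no trouble.

```python
import io


def index_fastq_four_line(handle):
    """Map record ID -> byte offset, assuming exactly 4 lines per record.

    `handle` must be opened in binary mode so offsets are byte offsets.
    """
    index = {}
    offset = 0
    for line_number, line in enumerate(handle):
        # Only lines 0, 4, 8, ... are headers under this assumption.
        if line_number % 4 == 0:
            fields = line[1:].split()
            if not line.startswith(b"@") or not fields:
                raise ValueError(f"expected '@' header at line {line_number}")
            index[fields[0].decode("ascii")] = offset
        offset += len(line)
    return index


def fetch_record(handle, index, record_id):
    """Return the four lines of one record, without line endings."""
    handle.seek(index[record_id])
    return [handle.readline().rstrip(b"\r\n").decode("ascii") for _ in range(4)]


# Invented toy data: both quality lines begin with '@'.
example = io.BytesIO(b"@r1\nACGT\n+\n@III\n@r2\nGG\n+\n@+\n")
example_index = index_fastq_four_line(example)
```

With the toy data, fetch_record(example, example_index, "r2") seeks straight to the second record. If the input has wrapped sequence or quality lines, the header check is likely to fail, or the index may silently be wrong, so this is only safe as an explicitly requested fast path.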
>> BioPerl fastq parsing issues aside, is there another tool which allows you
>> to retrieve arbitrary sequences from a fastq file by sequence ID?
>> There's one called cdbfasta which looks like it might work — does anyone
>> have experience with it?
>>
>> Thanks,
>> sofia
>> P.S. I am CCing Peter Cock in case BioPython has solved this issue already —
>> if so, perhaps their solution could be applied here.
>
> If you want a Python solution, Biopython's Bio.SeqIO.index (in memory)
> or Bio.SeqIO.index_db (using SQLite) functions will give you random
> access by ID to assorted files including FASTQ, even with nasty line
> wrapping and quality lines starting with @ or +.
>
> The Biopython FASTQ indexer basically tracks the state: @ header, seq
> line(s), + line, or qual line(s). You pay a slight performance hit when
> building the index over assuming four lines per record, but it is robust
> to this kind of nasty data.
>
> Peter
We should really look into a consistent OBDA-like indexing scheme that could work cross-Bio*. Or simply resuscitate OBDA. :)
chris