[Bioperl-l] Bio::Index::Fastq '@' in qual

Mon Oct 24 15:10:39 UTC 2011

On Mon, Oct 24, 2011 at 3:58 PM, Sofia Robb <sofia2341 at gmail.com> wrote:
> Hi,
> I am having problems running Bio::Index::Fastq.  I get the following error
> when a quality line begins with '@'.
>
> ...
>
> Here is an example of a fastq record that is causing this error. The last
> line, which starts with an '@', is actually the qual line.
> @5:105:15806:16092:Y
> GTGGCGCGGAACAGAGGAGGAATGTTCAGGAGAGGGGGCATGTGTTGTTACCGAGTACTTGGAAACGACG
> +
> @9;A565:=8B?<E<DEEBEE<E3BB?3??BCCF2<@@=BGGBDB60:64594.81?<B??;3?8-984?
>
>
> I see that Chris has partially addressed this on the mailing list:
> http://bioperl.org/pipermail/bioperl-l/2011-January/034481.html
>
> However as he pointed out at the time, it appears this may be a fairly large
> problem.

Have you double checked you have the latest BioPerl with that
fix Chris mentioned?

> My fastq seq and qual lines are always only one line, so I think that adding
> a line count and only checking for @ in the lines where $line_count % 4 == 0
> would work, since the header lines are always the first of 4 lines: 0, 4, 8,
> etc.

Yes, *if* you can assume that for your data, which is an assumption I
wouldn't like to make in a general purpose library like BioPerl (or Biopython).
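
For illustration only, here is a rough Python sketch of the line-count
idea. It is not BioPerl or Biopython code, and the function names
index_four_line_fastq and fetch_four_line_record are made up for this
example. It assumes every record is exactly four lines and only looks
for '@' on lines where the count modulo 4 is zero:

```python
def index_four_line_fastq(handle):
    """Map read ID -> byte offset, ASSUMING exactly four lines per record.

    `handle` must be a seekable file opened in binary mode.
    """
    index = {}
    line_count = 0
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line_count % 4 == 0:
            if not line.strip():
                continue  # tolerate blank lines between records
            if not line.startswith(b"@"):
                raise ValueError("expected '@' header at byte %d" % offset)
            fields = line[1:].split(None, 1)
            # The ID is the first word after the '@'.
            index[fields[0].decode() if fields else ""] = offset
        # Lines 1-3 of each record are never inspected, so a quality
        # line starting with '@' cannot be mistaken for a header.
        line_count += 1
    return index


def fetch_four_line_record(handle, offset):
    """Return the raw four-line record starting at `offset`."""
    handle.seek(offset)
    return b"".join(handle.readline() for _ in range(4))
```

This copes with a quality line starting with '@', but it silently
produces a wrong index (or raises an error) as soon as a sequence or
quality string is wrapped over more than one line.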

> BioPerl fastq parsing issues aside, is there another tool which allows you
> to retrieve arbitrary sequences from a fastq file by sequence ID?
> There's one called cdbfasta which looks like it might work. Does anyone
> have experience with it?
>
> Thanks,
> sofia
> P.S. I am CCing Peter Cock in case Biopython has solved this issue already.
> If so, perhaps their solution could be applied here.

If you want a Python solution, Biopython's Bio.SeqIO.index (in memory)
or Bio.SeqIO.index_db (using SQLite) functions will give you random
access by ID to assorted files including FASTQ, even with nasty line
wrapping and quality lines starting with @ or +.
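
A minimal usage sketch, assuming Biopython is installed and the file
uses the standard Sanger encoding (format name "fastq"). The helper
name fetch_fastq_record is made up for this example:

```python
def fetch_fastq_record(fastq_path, read_id):
    """Return (SeqRecord, raw bytes) for one read, looked up by ID."""
    # Imported here so this module still loads where Biopython is absent.
    from Bio import SeqIO

    # SeqIO.index scans the file once and keeps ID -> offset in memory.
    index = SeqIO.index(fastq_path, "fastq")
    try:
        record = index[read_id]        # parsed on demand
        raw = index.get_raw(read_id)   # the record exactly as in the file
    finally:
        index.close()
    return record, raw
```

The parsed record exposes the qualities as
record.letter_annotations["phred_quality"]. For files too big to index
in memory, SeqIO.index_db takes an extra first argument naming the
SQLite index file and is otherwise used the same way.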

The Biopython FASTQ indexer basically tracks the state: @ header, seq
line(s), + line, or qual line(s). Building the index this way is slightly
slower than assuming four lines per record, but it is robust to this kind
of nasty data.
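
To make the state-tracking idea concrete, here is a simplified sketch.
It is not the actual Biopython code, and the name index_fastq_by_state
is made up for this example. The key trick is to keep reading quality
lines until their total length matches the sequence length, so an '@' or
'+' at the start of a quality line is never treated as a marker:

```python
def index_fastq_by_state(handle):
    """Map read ID -> byte offset, allowing wrapped seq/qual lines.

    `handle` must be a seekable file opened in binary mode.
    """
    index = {}
    offset = handle.tell()
    line = handle.readline()
    while line:
        if not line.strip():
            # Blank line between records: move on to the next line.
            offset = handle.tell()
            line = handle.readline()
            continue
        # State 1: '@' header line.
        if not line.startswith(b"@"):
            raise ValueError("expected '@' header at byte %d" % offset)
        fields = line[1:].split(None, 1)
        read_id = fields[0].decode() if fields else ""
        # State 2: sequence line(s), up to the '+' separator line.
        seq_len = 0
        while True:
            line = handle.readline()
            if not line:
                raise ValueError("file ended in sequence of %r" % read_id)
            if line.startswith(b"+"):
                break  # State 3: the '+' line
            seq_len += len(line.strip())
        # State 4: quality line(s), until as long as the sequence.
        qual_len = 0
        while qual_len < seq_len:
            line = handle.readline()
            if not line:
                raise ValueError("file ended in quality of %r" % read_id)
            qual_len += len(line.strip())
        if qual_len != seq_len:
            raise ValueError("quality/sequence length mismatch in %r" % read_id)
        index[read_id] = offset
        # The next non-blank line must be the following record's header.
        offset = handle.tell()
        line = handle.readline()
    return index
```

The extra work compared with counting four lines per record is the
length bookkeeping on every sequence and quality line.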

Peter