[Bioperl-l] Bio::Index::Fastq '@' in qual

Mon Oct 24 14:58:13 UTC 2011

Hi,

I am having problems running Bio::Index::Fastq.  I get the following error when a quality line begins with '@'.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: No description line parsed
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:368
STACK: Bio::SeqIO::fastq::next_dataset /usr/share/perl5/Bio/SeqIO/fastq.pm:71
STACK: Bio::SeqIO::fastq::next_seq /usr/share/perl5/Bio/SeqIO/fastq.pm:29
STACK: Bio::Index::AbstractSeq::fetch /usr/share/perl5/Bio/Index/AbstractSeq.pm:147
STACK: Bio::Index::AbstractSeq::get_Seq_by_id /usr/share/perl5/Bio/Index/AbstractSeq.pm:198
STACK: /home_stajichlab/robb/bin/clean_pairs_indexed.pl:68

Here is an example of a fastq record that is causing this error, The last line which starts with an '@'  is actually the qual line.

@5:105:15806:16092:Y
GTGGCGCGGAACAGAGGAGGAATGTTCAGGAGAGGGGGCATGTGTTGTTACCGAGTACTTGGAAACGACG
+
@9;A565:=8B?<E<DEEBEE<E3BB?3??BCCF2<@@=BGGBDB60:64594.81?<B??;3?8-984?

i see that chris has partially addressed this in the mailing list
http://bioperl.org/pipermail/bioperl-l/2011-January/034481.html

However as he pointed out at the time, it appears this may be a fairly large problem.

My fastq seq and qual lines are alway only one line, so I think that adding a line count and only checking for @ in the lines that $line_count%4 ==0  would work since the header lines are always the first of 4 lines , 0,4,8, etc.

But if there are multiple lines of seq and qual i think that the /^+$/ of /^+$id/ can be used to identify the end of the sequence and the number of lines of quality should be equal to the number of lines of sequence

## only for single line seq and qual
my $line_count = 0;
   while (<$FASTQ>) {
       if (/^@/ and  $line_count % 4 == 0) {
           # $begin is the position of the first character after the '@'
           my $begin = tell($FASTQ) - length( $_ ) + 1;
           foreach my $id (&$id_parser($_)) {
               $self->add_record($id, $i, $begin);
               $c++;
           }
       }
       $line_count++;
   }

--
BioPerl fastq parsing issues aside, is there another tool which allows you to retrieve arbitrary sequences from a fastq file by sequence ID?

There's one called cdbfasta which looks like it might work — does anyone have experience with it?

Thanks,
sofia

P.S. I am CCing Peter Cock in case BioPython has solved this issue already — if so, perhaps their solution could be applied here.