[Bioperl-l] Bio::Index::Fastq '@' in qual
Sofia Robb
sofia2341 at gmail.com
Mon Oct 24 14:58:13 UTC 2011
Hi,
I am having problems running Bio::Index::Fastq. I get the following error when a quality line begins with '@'.
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: No description line parsed
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:368
STACK: Bio::SeqIO::fastq::next_dataset /usr/share/perl5/Bio/SeqIO/fastq.pm:71
STACK: Bio::SeqIO::fastq::next_seq /usr/share/perl5/Bio/SeqIO/fastq.pm:29
STACK: Bio::Index::AbstractSeq::fetch /usr/share/perl5/Bio/Index/AbstractSeq.pm:147
STACK: Bio::Index::AbstractSeq::get_Seq_by_id /usr/share/perl5/Bio/Index/AbstractSeq.pm:198
STACK: /home_stajichlab/robb/bin/clean_pairs_indexed.pl:68
Here is an example of a fastq record that is causing this error, The last line which starts with an '@' is actually the qual line.
@5:105:15806:16092:Y
GTGGCGCGGAACAGAGGAGGAATGTTCAGGAGAGGGGGCATGTGTTGTTACCGAGTACTTGGAAACGACG
+
@9;A565:=8B?<E<DEEBEE<E3BB?3??BCCF2<@@=BGGBDB60:64594.81?<B??;3?8-984?
i see that chris has partially addressed this in the mailing list
http://bioperl.org/pipermail/bioperl-l/2011-January/034481.html
However as he pointed out at the time, it appears this may be a fairly large problem.
My fastq seq and qual lines are alway only one line, so I think that adding a line count and only checking for @ in the lines that $line_count%4 ==0 would work since the header lines are always the first of 4 lines , 0,4,8, etc.
But if there are multiple lines of seq and qual i think that the /^+$/ of /^+$id/ can be used to identify the end of the sequence and the number of lines of quality should be equal to the number of lines of sequence
## only for single line seq and qual
my $line_count = 0;
while (<$FASTQ>) {
if (/^@/ and $line_count % 4 == 0) {
# $begin is the position of the first character after the '@'
my $begin = tell($FASTQ) - length( $_ ) + 1;
foreach my $id (&$id_parser($_)) {
$self->add_record($id, $i, $begin);
$c++;
}
}
$line_count++;
}
--
BioPerl fastq parsing issues aside, is there another tool which allows you to retrieve arbitrary sequences from a fastq file by sequence ID?
There's one called cdbfasta which looks like it might work — does anyone have experience with it?
Thanks,
sofia
P.S. I am CCing Peter Cock in case BioPython has solved this issue already — if so, perhaps their solution could be applied here.
More information about the Bioperl-l
mailing list