[Bioperl-l] Next-gen modules

Chris Fields cjfields at illinois.edu
Wed Jul 1 16:46:49 UTC 2009

On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote:


> Peter, the trimming algorithm I use employs a sliding window, as  
> follows:
>   - For each sequence position calculate the mean phred quality  
> score for a
>   window around that position.
>   - Record whether the mean score is above or below a threshold as  
> an array
>   of zeros and ones.
>   - Use a regular expression on the joined array to find the start  
> and end
>   of the good quality sequence(s).
>   - Extract the quality sequence(s) and replace any bases below the  
> quality
>   threshold with N.
>   - Trim any Ns from the ends.
> A refinement would be to weight the scores from positions in the  
> window, but
> this could give a performance hit, and the method seems to work well  
> enough
> as is.
> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!

Just let me know if the qual values match up with what is expected.   
You can also iterate through the data with hashrefs using next_dataset  
(faster than objects).  This is from the fastq tests in core:

$in_qual  = Bio::SeqIO->new(-file     =>  
                             -variant  => 'illumina',
                             -format   => 'fastq');

$qual = $in_qual->next_dataset();

isa_ok($qual, 'HASH');
is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ');
is($qual->{-id}, 'FC12044_91407_8_200_406_24');
is($qual->{-desc}, '');
is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24');

So one could check those values directly and then filter them through  
as needed directly into Bio::Seq::Quality if necessary (note some of  
the key values are constructor args):

my $qualobj = Bio::Seq::Quality->new(%$qual);


More information about the Bioperl-l mailing list