[Bioperl-l] Next-gen modules
giles.weaver at googlemail.com
Wed Jul 1 16:27:22 UTC 2009
Peter, the trimming algorithm I use employs a sliding window, as follows:
- For each sequence position calculate the mean phred quality score for a
window around that position.
- Record whether the mean score is above or below a threshold as an array
of zeros and ones.
- Use a regular expression on the joined array to find the start and end
of the good quality sequence(s).
- Extract the quality sequence(s) and replace any bases below the quality
threshold with N.
- Trim any Ns from the ends.
A refinement would be to weight the scores from positions in the window, but
this could give a performance hit, and the method seems to work well enough
Chris, thanks for committing the fix, I'll give bioperl illumina fastq
parsing a workout soon. Peter, as much as I'd love to help out with
biopython, I'm under too much time pressure right now!
Jonathan, some of the Illumina sequencing adapters are listed at
Adapter sequence typically appears towards the end of the read, though the
latter part of it is often misread as the sequencing quality drops off.
I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I
then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify
real alignments and trim the sequence. This is not the ideal way of doing
things, but it's fast enough, and does seem to work. The adapter sequence
shouldn't be gapped, so I'm sure there is a lot of scope for optimising the
I'll happily share some code once I've got it to the stage where I'm not
embarrassed by it!
2009/7/1 Chris Fields <cjfields at illinois.edu>
> I just committed a fix to FASTQ parsing last night to support read/write
> for Sanger/Solexa/Illumina following the biopython convention; the only
> thing needed is more extensive testing for the quality scores. There are a
> few other oddities with it I intend to address soon, but it appears to be
> The Seq instance iterator actually calls a raw data iterator (hash refs of
> named arguments to the class constructor). That should act as a decent
> filtering step if needed.
> We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we
> can probably reconfigure some of that.
> On Jul 1, 2009, at 2:44 AM, Peter Cock wrote:
> Hi all (BioPerl and Biopython),
>> This is a continuation of a long thread on the BioPerl mailing
>> list, which I have now CC'd to the Biopython mailing list. See:
>> On this thread we have been discussing next gen sequencing
>> tools and co-coordinating things like consistent file format
>> naming between Biopython, BioPerl and EMBOSS. I've been
>> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009,
>> and he will look into setting up a cross project mailing list for
>> this kind of discussion in future.
>> In the mean time, my replies to Giles below cover both BioPerl
>> and Biopython (and EMBOSS). Giles' original email is here:
>> On 6/30/09, Giles Weaver <giles.weaver at googlemail.com> wrote:
>>> I'm developing a transcriptomics database for use with next-gen data, and
>>> have found processing the raw data to be a big hurdle.
>>> I'm a bit late in responding to this thread, so most issues have already
>>> been discussed. One thing that hasn't been mentioned is removal of
>>> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
>>> developed and documented open source software for removal of adapters
>>> (and poor quality sequence) from Illumina reads.
>>> My current Illumina sequence processing pipeline is an unholy mix of
>>> biopython, bioperl, pure perl, emboss and bowtie. Biopython for
>>> the Illumina fastq to Sanger fastq, bioperl to read the quality values,
>>> pure perl to trim the poor quality sequence from each read, and bioperl
>>> with emboss to remove the adapter sequence. I'm aware that the pipeline
>>> contains bugs and would like to simplify it, but at least it does work...
>>> Ideally I'd like to replace as much of the pipeline as possible with
>>> bioperl/bioperl-run, but this isn't currently possible due to both a lack
>>> of features and poor performance. I'm sure the features will come with
>>> time, but the performance is more of a concern to me. ..
>> I gather you would rather work with (Bio)Perl, but since you are
>> already using Biopython to do the FASTQ conversion, you could
>> also use it for more of your pipe line. Our tutorial includes examples
>> of simple FASTQ quality filtering, and trimming of primer sequences
>> (something like this might be helpful for removing adaptors). See:
>> Alternatively, with the new release of EMBOSS this July, you will
>> also be able to do the Illumina FASTQ to Sanger standard FASTQ
>> with EMBOSS, and I'm sure BioPerl will offer this soon too.
>> Regarding trimming bad quality bases (see comments from
>>> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed
>>> pure/bioperl solution to be much faster than a primarily bioperl
>>> based implementation. I found Bio::Seq->subseq(a,b) and
>>> Bio::Seq->subqual(a,b) to be far too slow. My current code trims
>>> ~1300 sequences/second, including unzipping the raw data and
>>> converting it to sanger fastq with biopython. Processing an entire
>>> sequencing run with the whole pipeline takes in the region of 6-12h.
>> There are several ways of doing quality trimming, and it would
>> make an excellent cookbook example (both for BioPerl and
>> Could you go into a bit more detail about your trimming
>> algorithm? e.g. Do you just trim any bases on the right below
>> a certain threshold, perhaps with a minimum length to retain
>> the trimmed read afterwards?
>> Hope this looooong post was of interest to someone!
>> I was interested at least ;)
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
More information about the Bioperl-l