[Bioperl-l] Next-gen modules
Jonathan_Epstein at nih.gov
Wed Jul 1 13:20:50 UTC 2009
I too am interested in these topics. In particular, I would like to
learn more about "sequencing adapter removal," i.e. what these adapters
look like, and what strategies you've employed for finding and removing them.
Giles Weaver wrote:
> I'm developing a transcriptomics database for use with next-gen data, and
> have found processing the raw data to be a big hurdle.
> I'm a bit late in responding to this thread, so most issues have already
> been discussed. One thing that hasn't been mentioned is removal of adapters
> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> developed and documented open source software for removal of adapters (and
> poor quality sequence) from Illumina reads.
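As a rough illustration of what 3' adapter removal involves, here is a minimal Python sketch. It is not the pipeline described in this thread. It assumes exact matching only, and the adapter string and the names find_adapter_start, trim_adapter and min_overlap are placeholders invented for the example (the adapter is not a real Illumina sequence). Real reads contain sequencing errors, so a practical tool would also need to tolerate mismatches.

```python
def find_adapter_start(read, adapter, min_overlap=3):
    """Return the index where the adapter starts in the read, or len(read) if not found."""
    # Case 1: the complete adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return pos
    # Case 2: the read ends partway through the adapter, so a suffix of the
    # read equals a prefix of the adapter. Try the longest overlap first.
    longest = min(len(read), len(adapter) - 1)
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return len(read) - k
    return len(read)


def trim_adapter(seq, qual, adapter, min_overlap=3):
    """Cut the sequence and its quality string at the adapter start."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    cut = find_adapter_start(seq, adapter, min_overlap)
    return seq[:cut], qual[:cut]


# Placeholder adapter, made up for this example only.
ADAPTER = "GATTACAGG"
full = trim_adapter("ACGTACGTGATTACAGGTT", "I" * 19, ADAPTER)
partial = trim_adapter("ACGTACGTGATTA", "I" * 13, ADAPTER)
clean = trim_adapter("ACGTACGT", "I" * 8, ADAPTER)
```

All three calls return ("ACGTACGT", "IIIIIIII"). A small min_overlap removes more adapter fragments but also clips more genuine bases that happen to match the adapter prefix by chance, so the value 3 is only a starting point.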
> My current Illumina sequence processing pipeline is an unholy mix of
> biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting
> the Illumina fastq to Sanger fastq, bioperl to read the quality values, pure
> perl to trim the poor quality sequence from each read, and bioperl with
> emboss to remove the adapter sequence. I'm aware that the pipeline contains
> bugs and would like to simplify it, but at least it does work...
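The Illumina-to-Sanger fastq conversion step mentioned above can be sketched in a few lines of plain Python. This assumes the Illumina 1.3+ encoding, where each quality character is a Phred score plus 64, while Sanger fastq uses Phred plus 33. Older Solexa output uses a different score scale and needs a different conversion, which this sketch does not handle. The function names are made up for the example.

```python
ILLUMINA_OFFSET = 64  # assumed: Illumina 1.3+ style, Phred + 64
SANGER_OFFSET = 33    # Sanger fastq, Phred + 33


def illumina_to_sanger(qual):
    """Re-encode a Phred+64 quality string as Phred+33."""
    converted = []
    for ch in qual:
        score = ord(ch) - ILLUMINA_OFFSET
        if score < 0:
            # A negative score suggests the input is not Phred+64 encoded.
            raise ValueError("character %r is below the Phred+64 range" % ch)
        converted.append(chr(score + SANGER_OFFSET))
    return "".join(converted)


def phred_scores(qual, offset=SANGER_OFFSET):
    """Decode a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


sanger = illumina_to_sanger("hhh@")
scores = phred_scores(sanger)
```

Here sanger is "III!" and scores is [40, 40, 40, 0]. Only the quality line of each fastq record needs converting; the header and sequence lines pass through unchanged.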
> Ideally I'd like to replace as much of the pipeline as possible with
> bioperl/bioperl-run, but this isn't currently possible due to both a lack of
> features and poor performance. I'm sure the features will come with time,
> but the performance is more of a concern to me. I wonder if Bio::Moose might
> be used to alleviate some of the performance issues? Might next-gen modules
> be an ideal guinea pig for Bio::Moose?
> For my purposes the tools that I would love to see supported in
> bioperl/bioperl-run are:
> - next-gen sequence quality parsing (to output phred scores)
> - sequence quality based trimming
> - sequencing adapter removal
> - filtering based on sequence complexity (repeats, entropy etc)
> - bioperl-run modules for bowtie etc.
> Obviously all of these need to be fast!
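For the complexity filtering item in the list above, one simple measure is the Shannon entropy of the base composition of a read. The Python sketch below is only one possible approach, not an existing bioperl feature. The function names and the 1.0 bit cutoff are assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy in bits of the base composition (0.0 for an empty read)."""
    if not seq:
        return 0.0
    total = len(seq)
    entropy = 0.0
    for count in Counter(seq).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def is_low_complexity(seq, min_entropy=1.0):
    """Flag reads whose composition entropy is below the (assumed) cutoff."""
    return shannon_entropy(seq) < min_entropy


homopolymer = shannon_entropy("AAAAAAAA")
balanced = shannon_entropy("ACGTACGT")
flagged = is_low_complexity("AAAAAAAAAT")
```

A homopolymer scores 0 bits and an even mix of four bases scores 2 bits. Single-base composition does not catch every repeat: "ACACACAC" scores exactly 1 bit and passes this cutoff, so counting dinucleotides or longer k-mers instead of single bases would be a natural refinement.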
> I'd love to muck in, but I doubt I'll contribute much before
> Bio::Moose/bioperl6, as the (bio)perl object system gives me nightmares!
> Regarding trimming bad quality bases (see comments from Tristan Lefebure)
> from Solexa/Illumina reads, I did find a mixed pure/bioperl solution to be
> much faster than a primarily bioperl based implementation. I found
> Bio::Seq->subseq(a,b) and Bio::Seq->subqual(a,b) to be far too slow. My
> current code trims ~1300 sequences/second, including unzipping the raw data
> and converting it to Sanger fastq with biopython. Processing an entire
> sequencing run with the whole pipeline takes in the region of 6-12h.
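To make the trimming step concrete, here is a minimal Python sketch of end trimming on plain strings, with no sequence objects involved. It is not the code described above, whose exact rule is not given in this thread. It assumes Sanger (Phred+33) qualities and simply strips bases below a cutoff from both ends. The function name and the cutoff of 20 are assumptions.

```python
def trim_low_quality(seq, qual, min_q=20, offset=33):
    """Strip bases scoring below min_q from both ends of a read."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(ch) - offset for ch in qual]
    start, end = 0, len(scores)
    # Walk in from the 5' end past low-quality bases.
    while start < end and scores[start] < min_q:
        start += 1
    # Walk in from the 3' end past low-quality bases.
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


trimmed = trim_low_quality("ACGTACGT", "!!IIII##")
all_bad = trim_low_quality("ACGT", "!!!!")
```

Here trimmed is ("GTAC", "IIII") and all_bad is ("", ""), so reads that trim to nothing should be dropped by the caller. Working directly on the two strings avoids creating an object for every read.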
> Hope this looooong post was of interest to someone!
> 2009/6/17 Tristan Lefebure <tristan.lefebure at gmail.com>
>> Regarding next-gen sequences and bioperl, following my
>> experience, another issue is bioperl speed. For example, if
>> you want to trim bad quality bases at ends of 1E6 Solexa
>> reads using Bio::SeqIO::fastq and some methods in
>> Bio::Seq::Quality, well, you've got to be patient (but
>> maybe I missed some shortcuts...).
>> A pure perl solution will be between 100 and 1000x faster...
>> Would it be possible to have an ultra-light quality object
>> with few simple methods for next-gen reads?
>> I can contribute some tests if that sounds like an important