[Bioperl-l] Next-gen modules

Mon Jul 6 14:09:21 UTC 2009

Giles Weaver wrote:
> I'm developing a transcriptomics database for use with next-gen data, and
> have found processing the raw data to be a big hurdle.
> 
> I'm a bit late in responding to this thread, so most issues have already
> been discussed. One thing that hasn't been mentioned is removal of adapters
> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> developed and documented open source software for removal of adapters (and
> poor quality sequence) from Illumina reads.

We would like to add this to EMBOSS. Can you describe the method you
would like to use? (I see you currently use a combination of bioperl and
emboss for this.)

> For my purposes the tools that I would love to see supported in
> bioperl/bioperl-run are:
> 
>    - next-gen sequence quality parsing (to output phred scores)
>    - sequence quality based trimming
>    - sequencing adapter removal
>    - filtering based on sequence complexity (repeats, entropy etc)
>    - bioperl-run modules for bowtie etc.

We would like to see these supported in all the Open-Bio Projects and
they are a priority for EMBOSS.

Can you suggest quality filters, trimming methods, adaptor removal
methods, sequence filters and any other applications we could provide in
EMBOSS?
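
To make the request concrete, here is a minimal Python sketch of three of
the operations listed above: 3' quality trimming, adapter removal and an
entropy-based complexity measure. It is only an illustration under stated
assumptions, not the method used by any of the projects mentioned. The
function names, the quality cutoff of 20, the minimum overlap of 5 and the
adapter string in the usage note are all hypothetical, and adapter matching
is exact (no mismatches or indels), which a real tool would need to relax.

```python
import math
from collections import Counter


def trim_low_quality(seq, quals, min_q=20):
    """Trim bases from the 3' end until one with quality >= min_q is found.

    `quals` is a list of integer Phred scores, one per base.
    The cutoff of 20 is an arbitrary illustrative default.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def remove_adapter(seq, adapter, min_overlap=5):
    """Cut the read at the start of the adapter, using exact matching only.

    First look for the full adapter anywhere in the read; failing that,
    look for a prefix of the adapter (at least min_overlap bases long)
    at the very end of the read. The caller should slice the quality
    list to the length of the returned sequence.
    """
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    longest = min(len(adapter) - 1, len(seq))
    for overlap in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:overlap]):
            return seq[:len(seq) - overlap]
    return seq


def shannon_entropy(seq):
    """Shannon entropy in bits per base; low values suggest low complexity."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For example, remove_adapter("ACGTACGTAGATCGG", "AGATCGGAAGAGC") cuts the
read back to "ACGTACGT", because the last seven bases match the start of
the (hypothetical) adapter. A read could then be discarded if its entropy
falls below a threshold chosen for the data set.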

We hope to keep in line with what the other projects do so that EMBOSS,
bioperl, biopython etc. can be used interchangeably in pipelines.

> Obviously all of these need to be fast! .... My
> current code trims ~1300 sequences/second, including unzipping the raw data
> and converting it to sanger fastq with biopython. Processing an entire
> sequencing run with the whole pipeline takes in the region of 6-12h.
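
For readers unfamiliar with the conversion step mentioned above, the sketch
below shows what re-encoding Illumina qualities as Sanger FASTQ qualities
involves. It assumes the Illumina 1.3+ encoding (Phred scores with an ASCII
offset of 64); the older Solexa encoding uses a different score scale and is
not handled here. The function names are hypothetical, and this is not the
code from the pipeline described in the quoted post. Biopython offers the
same conversion through Bio.SeqIO.convert with the "fastq-illumina" and
"fastq" format names.

```python
def illumina_to_sanger_quality(qual_string):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger).

    Assumes Illumina 1.3+ style qualities; raises ValueError if a
    character decodes to a negative score, which suggests the input
    uses a different encoding.
    """
    out = []
    for ch in qual_string:
        q = ord(ch) - 64          # decode the Phred score
        if q < 0:
            raise ValueError("character %r is not valid Phred+64" % ch)
        out.append(chr(q + 33))   # re-encode with the Sanger offset
    return "".join(out)


def phred_scores(qual_string, offset=33):
    """Decode a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual_string]
```

With this encoding, the Illumina quality string "hhh@" becomes "III!" in
Sanger form, which decodes to the Phred scores 40, 40, 40 and 0.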

OK, we will see what speed we can reach.

> Hope this looooong post was of interest to someone!

Very interesting!

regards,

Peter Rice