# [Bioperl-l] Next-gen modules

Wed Jul 8 15:26:54 UTC 2009

```I've just added a sequence adapter removal implementation to the bioperl
think the basic method is sound, but the implementation is ugly.

Performance wise, it currently takes around 80 minutes to remove adapters
from a ~3.2 million read Illumina run. This includes quality trimming and
grouping the sequences to reduce processing time. The quality trimming
(described
earlier in this
takes about 15 minutes, so adapter removal is definitely the bottleneck. I'm
confident that some relatively simple developments in Bioperl and/or EMBOSS
will yield some big performance improvements - if you see my sample code in
the scrapbook you'll understand why!

I've also been experimenting with sequence entropy calculations for
filtering out junk sequence.
I used Mark Jensens code at
http://www.bioperl.org/wiki/Site_entropy_in_an_alignment for inspiration.

Here is my current entropy calculation code:

sub entropy
{
my (\$seq_str, \$word_size) = @_;
my %res_counts;
for (my \$i = 0; \$i <= ((length \$seq_str) - \$word_size); \$i ++)
{
my \$word = substr \$seq_str, \$i, \$word_size;
if (\$word !~ /N/) { \$res_counts{\$word} ++; }
}
#~ print STDERR join (" ", keys %res_counts), "\n";
#~ print STDERR join (" ", values %res_counts), "\n";
my @counts = values %res_counts;
my \$word_count = sum @counts;
map {\$_ /= \$word_count} @counts;
return sum map {-\$_*log2(\$_)} @counts;
}

sub log2
{
my \$n = shift;
return log(\$n)/log(2);
}

I don't know if this does "the right thing", and have yet to determine a
suitable word size and entropy threshold for sequence filtering, so feel
free to comment/test away.

Giles

2009/7/6 Peter Rice <pmr at ebi.ac.uk>

> Giles Weaver wrote:
> > I'm developing a transcriptomics database for use with next-gen data, and
> > have found processing the raw data to be a big hurdle.
> >
> > I'm a bit late in responding to this thread, so most issues have already
> > been discussed. One thing that hasn't been mentioned is removal of
> > from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> > developed and documented open source software for removal of adapters
> (and
> > poor quality sequence) from Illumina reads.
>
> We would like to add this to EMBOSS. Can you describe the method you
> would like to use (I see you currently use a combination of bioperl and
> emboss for this).
>
> > For my purposes the tools that would love to see supported in
> > bioperl/bioperl-run are:
> >
> >    - next-gen sequence quality parsing (to output phred scores)
> >    - sequence quality based trimming
> >    - sequencing adapter removal
> >    - filtering based on sequence complexity (repeats, entropy etc)
> >    - bioperl-run modules for bowtie etc.
>
> We would like to see these supported in all the Open-Bio Projects and
> they are a priority for EMBOSS.
>
> Can you suggest quality filters, trimming methods, adaptor removal
> methods, sequence filters and any other applications we could provide in
> EMBOSS.
>
> We hope to keep in line with what the other projects do so that EMBOSS,
> bioperl, biopython etc. can be used interchangeably in pipelines.
>
> > Obviously all of these need to be fast! .... My
> > current code trims ~1300 sequences/second, including unzipping the raw
> data
> > and converting it to sanger fastq with biopython. Processing an entire
> > sequencing run with the whole pipeline takes in the region of 6-12h.
>
> OK, we will see what speed we can reach.
>
> > Hope this looooong post was of interest to someone!
>
> Very interesting!
>
> regards,
>
> Peter Rice
>

```