[Bioperl-l] Next-gen modules

Wed Jul 8 15:26:54 UTC 2009

I've just added a sequence adapter removal implementation to the bioperl
scrapbook at http://www.bioperl.org/wiki/Removing_sequencing_adapters. I
think the basic method is sound, but the implementation is ugly.

Performance wise, it currently takes around 80 minutes to remove adapters
from a ~3.2 million read Illumina run. This includes quality trimming and
grouping the sequences to reduce processing time. The quality trimming
(described
earlier in this
thread<http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030411.html>)
takes about 15 minutes, so adapter removal is definitely the bottleneck. I'm
confident that some relatively simple developments in Bioperl and/or EMBOSS
will yield some big performance improvements - if you see my sample code in
the scrapbook you'll understand why!

I've also been experimenting with sequence entropy calculations for
filtering out junk sequence.
I used Mark Jensens code at
http://www.bioperl.org/wiki/Site_entropy_in_an_alignment for inspiration.

Here is my current entropy calculation code:

sub entropy
{
    my ($seq_str, $word_size) = @_;
    my %res_counts;
    for (my $i = 0; $i <= ((length $seq_str) - $word_size); $i ++)
    {
        my $word = substr $seq_str, $i, $word_size;
        if ($word !~ /N/) { $res_counts{$word} ++; }
    }
    #~ print STDERR join (" ", keys %res_counts), "\n";
    #~ print STDERR join (" ", values %res_counts), "\n";
    my @counts = values %res_counts;
    my $word_count = sum @counts;
    map {$_ /= $word_count} @counts;
    return sum map {-$_*log2($_)} @counts;
}

sub log2
{
    my $n = shift;
    return log($n)/log(2);
}

I don't know if this does "the right thing", and have yet to determine a
suitable word size and entropy threshold for sequence filtering, so feel
free to comment/test away.

Giles

2009/7/6 Peter Rice <pmr at ebi.ac.uk>

> Giles Weaver wrote:
> > I'm developing a transcriptomics database for use with next-gen data, and
> > have found processing the raw data to be a big hurdle.
> >
> > I'm a bit late in responding to this thread, so most issues have already
> > been discussed. One thing that hasn't been mentioned is removal of
> adapters
> > from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> > developed and documented open source software for removal of adapters
> (and
> > poor quality sequence) from Illumina reads.
>
> We would like to add this to EMBOSS. Can you describe the method you
> would like to use (I see you currently use a combination of bioperl and
> emboss for this).
>
> > For my purposes the tools that would love to see supported in
> > bioperl/bioperl-run are:
> >
> >    - next-gen sequence quality parsing (to output phred scores)
> >    - sequence quality based trimming
> >    - sequencing adapter removal
> >    - filtering based on sequence complexity (repeats, entropy etc)
> >    - bioperl-run modules for bowtie etc.
>
> We would like to see these supported in all the Open-Bio Projects and
> they are a priority for EMBOSS.
>
> Can you suggest quality filters, trimming methods, adaptor removal
> methods, sequence filters and any other applications we could provide in
> EMBOSS.
>
> We hope to keep in line with what the other projects do so that EMBOSS,
> bioperl, biopython etc. can be used interchangeably in pipelines.
>
> > Obviously all of these need to be fast! .... My
> > current code trims ~1300 sequences/second, including unzipping the raw
> data
> > and converting it to sanger fastq with biopython. Processing an entire
> > sequencing run with the whole pipeline takes in the region of 6-12h.
>
> OK, we will see what speed we can reach.
>
> > Hope this looooong post was of interest to someone!
>
> Very interesting!
>
> regards,
>
> Peter Rice
>