[Biopython] Correcting short read errors based on k-mer coverage

Fri Sep 25 16:34:34 UTC 2009

On Fri, Sep 25, 2009 at 5:16 PM, Dan Bolser <dan.bolser at gmail.com> wrote:
>
> 2009/9/25 Peter <biopython at maubp.freeserve.co.uk>:
>>
>> I just tried with a short read file from the NCBI SRA with ~7 million reads
>> of 36bp and k=21. Each 36bp read gives 36 - 21 + 1 = 16 k-mers, so I had
>> ~100 million k-mers in total, and found ~18 million different k-mers.
>> About half occurred only once.
>>
>> My naive code to count the kmers used a Python dictionary (k-mer
>> strings as the keys, integer counts as values). It took about 5 minutes
>> to run and about 1.5 GB of RAM.
>>
>> What sized files are you hoping to run this on? Without knowing that,
>> it is hard to say if this simple dictionary approach will scale well.
>
> To warm up I'd want to try 125 million reads of ~50 bp.

That might still be possible in RAM... just. Are you aware of any public
datasets of that size? An NCBI SRA one for example?
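
For reference, the dictionary approach described above can be sketched
roughly as follows. This is a minimal illustration rather than the exact
script used for the timings quoted; the function name count_kmers and
the plain-string input are assumptions made for the example.

```python
from collections import defaultdict


def count_kmers(reads, k=21):
    """Count every k-mer in an iterable of read sequences (plain strings).

    Hypothetical helper for illustration; keys are k-mer strings and
    values are integer counts, as in the approach described above.
    """
    counts = defaultdict(int)
    for seq in reads:
        seq = seq.upper()
        # A read of length n yields n - k + 1 overlapping k-mers
        # (none at all if the read is shorter than k).
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```

The reads could come from Bio.SeqIO.parse(filename, "fastq"), passing
str(record.seq) for each record. With k=21 a 50bp read gives 30 k-mers,
so 125 million reads would give about 3.75 billion k-mer occurrences;
the memory needed depends on how many of those are distinct.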

> Later I'd want about 100 times more.

Right - that will certainly mean holding everything in memory isn't
going to be an option! A simple SQLite database might work nicely
though.
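
A minimal sketch of what an SQLite-backed counter might look like, using
the sqlite3 module from the standard library. The table name kmers, the
column names kmer and n, and the function name are all assumptions for
the example, and no attempt is made here to tune it for speed.

```python
import sqlite3


def count_kmers_sqlite(reads, k=21, db_path=":memory:"):
    """Count k-mers into an SQLite table instead of a Python dictionary.

    Hypothetical helper: pass a filename as db_path to keep the counts
    on disk rather than in RAM. Returns the open connection.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS kmers "
        "(kmer TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for seq in reads:
        seq = seq.upper()
        kmers = [(seq[i:i + k],) for i in range(len(seq) - k + 1)]
        # Make sure a row exists for each k-mer, then bump its count.
        con.executemany(
            "INSERT OR IGNORE INTO kmers (kmer, n) VALUES (?, 0)", kmers
        )
        con.executemany(
            "UPDATE kmers SET n = n + 1 WHERE kmer = ?", kmers
        )
    con.commit()
    return con
```

Rare k-mers can then be pulled out with a query such as
SELECT kmer FROM kmers WHERE n = 1. Committing once at the end keeps
everything in a single transaction; whether this is fast enough for
billions of k-mers would need testing.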

>> Dan Bolser <dan.bolser at gmail.com> wrote:
>>> In step 2 you take the full reads (ignoring qualities) and look at the
>>> k-mer frequency (average?) at each base. Some bases will have a very
>>> low k-mer frequency, indicating sequencing errors.
>>
>> Are you suggesting following the method of Chaisson et al 2009,
>> described in section "Detecting and error correcting accurate read
>> prefixes" of that paper - or something a little different? That section
>> itself cites several related approaches to read correction.
>
> Yeah, I was thinking of the Chaisson 2009 method. Since then I had a
> couple of other methods brought to my attention on the Velvet mailing
> list:
>
> Efficient frequency-based de novo short-read clustering for error
> trimming in next-generation sequencing.
> Qu W, Hashimoto S, Morishita S.
> Genome Res. 2009 Jul;19(7):1309-15. Epub 2009 May 13.
> PMID: 19439514
> http://www.ncbi.nlm.nih.gov/pubmed/19439514
>
> SHREC: a short-read error correction method.
> Schröder J, Schröder H, Puglisi SJ, Sinha R, Schmidt B.
> Bioinformatics. 2009 Sep 1;25(17):2157-63. Epub 2009 Jun 19.
> PMID: 19542152
> http://www.ncbi.nlm.nih.gov/pubmed/19542152
>
>
> So the result is looking more and more redundant... However, a Python
> one-liner would be awesome!

I doubt a few-line Python script for the whole task will be forthcoming,
although parts of it may be more realistic (e.g. an SQLite-based k-mer
counter).
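
Another part that fits in a few lines is the per-base k-mer frequency
from step 2 quoted above. The sketch below assumes "frequency at each
base" means the average count of all k-mers overlapping that base, which
is only one possible reading and not necessarily what Chaisson et al. do.
The function name and the counts mapping (k-mer string to integer) are
assumptions for the example.

```python
def per_base_kmer_coverage(seq, counts, k=21):
    """Average k-mer count over the k-mers covering each base of a read.

    Hypothetical helper: counts maps k-mer strings to integer counts;
    k-mers missing from counts are treated as count 0.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers < 1:
        return []  # read shorter than k, nothing to report
    kmer_counts = [counts.get(seq[i:i + k], 0) for i in range(n_kmers)]
    coverage = []
    for pos in range(len(seq)):
        # The k-mer starting at i covers bases i .. i + k - 1, so base
        # pos is covered by starts pos - k + 1 .. pos (clipped to range).
        first = max(0, pos - k + 1)
        last = min(pos, n_kmers - 1)
        window = kmer_counts[first:last + 1]
        coverage.append(sum(window) / len(window))
    return coverage
```

Bases whose value falls well below the rest of the read would then be
candidates for sequencing errors; choosing that threshold is the hard
part and is not attempted here.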

This sort of thing (k-mer frequency-based read correction and
trimming) might be of interest to the EMBOSS project, who have
expressed an interest in developing new command line tools
for next-generation sequencing data (e.g. simple quality score
read filtering and trimming).

Peter