[EMBOSS] Trimming illumina short reads based on quality

Zheng Jin Tu ztu at msi.umn.edu
Tue Dec 1 18:38:54 UTC 2009


You will need to find a bioinformatician to do the coding.

I am not sure how to set the correct filter parameters,
but here is some of my experience:

Basically, the script reads both the sequence string and the quality
string from the xxx_qseq.txt file and matches each nucleotide to its
quality character, position by position:

For 454, a quality score of 20 means 99% confidence in the base call
and 40 means 99.99% (Phred scale), if I am correct.

In GA, the quality score is encoded as an ASCII character and can be
read out with the ord function in Perl:

 $qualscore = ord($quality_char) - 64;
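The same decoding can be sketched in Python, whose built-in ord behaves like the Perl one. The function names and the offset parameter are only illustrative, and the error-probability helper assumes the scores are Phred-scaled, which depends on the pipeline version that produced the file.

```python
def decode_quality(quality_string, offset=64):
    """Convert a GA quality string to a list of integer scores.

    offset=64 is the value used in this post; other encodings differ.
    """
    return [ord(ch) - offset for ch in quality_string]


def phred_error_probability(score):
    """Error probability implied by a Phred score: P = 10 ** (-Q / 10).

    Assumption: the score is Phred-scaled.
    """
    return 10 ** (-score / 10)


scores = decode_quality("Bb^`b")
print(scores)  # [2, 34, 30, 32, 34]
```

Under that assumption a score of 20 corresponds to an error probability of 0.01.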

I am not sure what the appropriate cutoff value is. B is quality score 2,
so at the very least we should remove these BBBBB runs. You may also set a
minimum length filter to get rid of reads shorter than, say, 10 nucleotides
after trimming for low quality.

Set another filter on max_score or avg_score, with a threshold like 5; it
will filter out the third and fourth sequences in the lines below.

R0174436        1       8       119     0       1418    0       1       
.GATCTTCTCCTTCACCTCCTCCAGGTCCTTGGTCAGCTCAGCACGCAGAG     
Bb^`bb____bbaaVI_Zbbaba`X_bb`aUbbb`W\\a^\bbT[_Xb]__     0
R0174436        1       8       119     0       991     0       1       
.GCCAATCTGTACTTGTCTTCTTCAGTTCCCACTTTGAATACCGCACAGTC     
BaGT]]bb[]`_]abaIaaaVbb^``abbaM`Ubbb`babaQT]XS_[a[B     0
R0174436        1       8       119     1791    1559    0       1       
A..................................................     
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB     0
R0174436        1       8       119     1791    1997    0       1       
A..................................................     
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB     0
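The max_score and avg_score filters could look roughly like this in Python. The function names are hypothetical, and the threshold of 5 is just the example value mentioned above.

```python
def max_score(qual, offset=64):
    """Highest quality score in the read (0 for an empty string)."""
    return max((ord(ch) - offset for ch in qual), default=0)


def avg_score(qual, offset=64):
    """Mean quality score of the read (0.0 for an empty string)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def passes_score_filter(qual, threshold=5, use_average=False):
    """Keep a read only if its max (or average) score reaches threshold."""
    score = avg_score(qual) if use_average else max_score(qual)
    return score >= threshold


print(passes_score_filter("B" * 51))             # False
print(passes_score_filter("Bb^`bb____bbaaVI_Z"))  # True
```

The all-B reads have a maximum and an average of 2, so either variant rejects them with a threshold of 5.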

Some people also trim poly-T runs in the RNA-seq case, but I am not sure
how many Ts in a row should be removed. Perhaps poly-A runs (AAAA) should
be trimmed as well, for the reverse case?
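For what it is worth, a homopolymer trim can be sketched in Python as below. Since the right run length is an open question, min_run=4 is only a placeholder, and trimming at the 3' end is an assumption too; the function name is made up for this example.

```python
import re


def trim_poly_tail(seq, qual, base="T", min_run=4):
    """Remove a run of `base` at the 3' end if it is at least min_run long.

    Assumptions: the run sits at the 3' end, and min_run=4 is a placeholder.
    The quality string is cut at the same position as the sequence.
    """
    pattern = f"{re.escape(base)}{{{min_run},}}$"  # e.g. "T{4,}$"
    match = re.search(pattern, seq)
    if match is None:
        return seq, qual
    return seq[:match.start()], qual[:match.start()]


print(trim_poly_tail("ACGTTTTTT", "abcdefghi"))       # ('ACG', 'abc')
print(trim_poly_tail("ACGAAAA", "abcdefg", base="A"))  # ('ACG', 'abc')
```

Calling it once with base="T" and once with base="A" covers both cases.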

Finally, it is better to write the output in FASTQ format.

@R0174436:6:83:0:1815#0/1
.GTCAATGCGTTCCACCCCCTCTGGGTAGCCTCCAACATCATGTACGTCGA
+R0174436:6:83:0:1815#0/1
Ba`babbb]bb\Xb]b___V^^aaaa___Z_\_[aaa]babb`X`^b\bbb
@R0174436:6:83:0:506#0/1
.GCAGGAGAAGCATTTTATCTTTGTATTTTCTTCACTGGCAACAACAATGT
+R0174436:6:83:0:506#0/1
BaOW_\I]__a``a_\_J_HU_V_J\a`aa^bab^^]]]Y]^`[`[T]]\^
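A sketch of the qseq-to-FASTQ conversion in Python. It assumes each qseq record is a single tab-separated line with the columns machine, run, lane, tile, x, y, index, read, sequence, quality, filter, and it builds the read name in the same machine:lane:tile:x:y#index/read pattern as the records above. The function name is hypothetical.

```python
def qseq_to_fastq(line):
    """Convert one qseq line to a four-line FASTQ record.

    Assumption: tab-separated columns in the order
    machine, run, lane, tile, x, y, index, read, sequence, quality, filter.
    """
    fields = line.rstrip("\n").split("\t")
    machine, run, lane, tile, x, y, index, read, seq, qual = fields[:10]
    name = f"{machine}:{lane}:{tile}:{x}:{y}#{index}/{read}"
    return f"@{name}\n{seq}\n+{name}\n{qual}\n"


example = "\t".join(
    ["R0174436", "1", "8", "119", "0", "1418", "0", "1", ".GATC", "Bb^`b", "0"]
)
print(qseq_to_fastq(example), end="")
```

The quality characters are copied through unchanged, so any trimming or filtering should happen before the record is written.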

Good luck.

TU

===================================================

On Tue, 1 Dec 2009, David Martin wrote:

> >>> On 12/1/2009 at  3:46 PM, in message <4B153A4C.6000904 at umdnj.edu>, Ryan Golhar <golharam at umdnj.edu> wrote:
> I think virtually every man and his dog who has done anything with Illumina reads has a variety of perl scripts that do this. It depends how you want to do the trimming. Do you want to clip to a specific length, clip on quality (absolute or average over a window) and do you have a minimum length requirement? 
> 
> Do you want to clip 3' and 5' ends or just one? 
> 
> ..d 
> 
> 
> Michael,
> 
> Doesn't Illumina provide tools to do this?  I know with ABI Solid data,
> they have a perl script capable of trimming data based on quality scores.
> 
> Ryan
> 
> 
> michael watson (IAH-C) wrote:
> > Hi
> >
> > I'm sorry if I've not been keeping up to date on what is doubtless a hot topic.
> >
> > Does EMBOSS allow one to trim short reads based on quality data (from a fastq file)?
> >
> > If not, I have read that it is planned - any idea when it will be implemented?
> >
> > Otherwise, alternative suggestions are welcome!
> > 
> 
> 
> 
> David Martin PhD
> College of Life Sciences
> University of Dundee 
> The University of Dundee is a Scottish Registered Charity, No. SC015096.
> 
> 
> 



