[Biopython] still more questions about NGS sequence trimming
    Kiss, Csaba 
    csaba.kiss at lanl.gov
       
    Wed Oct 24 17:20:23 UTC 2012
    
    
  
Thanks, Seb.
That’s a clever usage of regex.
csaba
From: Sebastian Schmeier [mailto:s.schmeier at gmail.com]
Sent: Wednesday, October 24, 2012 11:13 AM
To: Kiss, Csaba
Cc: biopython at lists.open-bio.org
Subject: Re: [Biopython] still more questions about NGS sequence trimming
A very quick and dirty approach for your reject function (I hope I understood correctly) in script form:
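#!/usr/bin/env python
import re
import sys

from Bio import SeqIO

# Matches a run of 9 or more identical bases (uppercase A, C, G, T or N).
HOMOPOLYMER = re.compile(r"A{9,}|C{9,}|G{9,}|T{9,}|N{9,}")

def discard(seq):
    """Return True if seq contains a homopolymer run longer than 8 bp."""
    return HOMOPOLYMER.search(seq) is not None

def main():
    # SeqIO.parse accepts a filename; the "rU" open mode is gone in recent Pythons.
    for record in SeqIO.parse(sys.argv[1], "fasta"):
        if not discard(str(record.seq)):
            SeqIO.write(record, sys.stdout, "fasta")

if __name__ == "__main__":
    sys.exit(main())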
Best,
   Seb
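
The same rejection rule can also be written with a backreference, which avoids listing every base separately and makes the run length a parameter. This is only a sketch: the function name has_homopolymer and its max_run and alphabet parameters are made up for illustration and are not part of Biopython.

```python
import re

def has_homopolymer(seq, max_run=8, alphabet="ACGTN"):
    """Return True if seq has a run of one base longer than max_run.

    Hypothetical helper; the name and parameters are illustrative only.
    """
    # ([ACGTN]) captures a single base; \1{max_run,} then requires at least
    # max_run further copies of that same base, i.e. max_run + 1 in total.
    pattern = "([%s])\\1{%d,}" % (re.escape(alphabet), max_run)
    return re.search(pattern, seq) is not None
```

Like the script above, this match is case-sensitive, so lowercase (soft-masked) sequence would need an upper() call first.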
On Wed, Oct 24, 2012 at 5:49 PM, Kiss, Csaba <csaba.kiss at lanl.gov> wrote:
Hi All!
Thanks for all your help with extracting DNA sequences from sff files. Using Biopython I managed to cut the sequence extraction time from 3 hours to 10 minutes.
Now that I am hooked, I would like to replace mothur with some simple Python functions.
Is there any function in Biopython that looks for homopolymers in DNA sequences? In particular, I want to reject a sequence if it contains a stretch of more than 8 bp of any single nucleotide.
Another function I am looking for is a sliding window function along the quality file. I could use either the fastq file or the fasta/qual file pair.
I could write these functions myself but if they are available, then it would make my life easier.
Thanks
Csaba
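
One possible reading of the sliding-window request above is sketched here: scan the per-base quality scores with a fixed-width window and report where the mean quality first drops below a threshold. The function name and the defaults (window of 10, mean quality of 20) are assumptions chosen for illustration, not values from this thread or from mothur. The function takes a plain list of integers; with Biopython, such a list is available for fastq or qual input as record.letter_annotations["phred_quality"].

```python
def first_low_quality_window(quals, window=10, min_mean=20):
    """Return the start index of the first window whose mean quality is
    below min_mean, or len(quals) if no window fails.

    Hypothetical helper; name and default values are illustrative only.
    """
    if len(quals) < window:
        return len(quals)  # too short to evaluate, so nothing is cut
    total = sum(quals[:window])  # running sum of the current window
    for start in range(len(quals) - window + 1):
        if start > 0:
            # Slide by one: add the base entering, drop the base leaving.
            total += quals[start + window - 1] - quals[start - 1]
        if total < min_mean * window:
            return start
    return len(quals)
```

The returned index can be used as a cut point, for example record[:cut] on a SeqRecord, which should keep the sequence and its quality scores in step.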
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython
    
    