[Biopython] biopython question

Wed Apr 4 18:05:54 UTC 2012

Hi Tychele,

If I understood correctly, you have a list of primers stored in a file
and you want to trim those primer sequences off your fastq sequences,
correct? One way I could think of is to first store the primers in a
list (since they will be used repeatedly to check every single fastq
sequence).

Here's the code:

from Bio import SeqIO

def trim_primers(records, 'primer_file_name'):

    # read the primers
    primer_list = []
    with open('primer_file_name', 'r') as source:
      for line in source:
        primer_list.append(line.strip())

    for record in records:
        # list to check if the sequence begins with any of the primers
        check = [record.seq.startswith(x) for x in primer_list]
        # if any of the primer is present in the beginning of the
sequence, then we trim it off
        if any(check):
            # get index of primer that matches the beginning
            idx = check.index(True)
            len_primer = len(primer_list[idx])
            yield record[len_primer:]
        # otherwise just return the whole record
        else:
            yield record

and then, you can use the function like so:

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_primers(original_reads, 'primer_file_name')
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print "Saved %i reads" % count

I haven't tested the function, but I suppose that should do the trick.

Hope that helps :),
Bow

On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 at jhmi.edu> wrote:
> Hi,
>
> I have a question regarding one of the biopython capabilities. I would like to trim primers off the end of reads in a fastq file and I found wonderful documentation of how to do this on your website as follows:
>
> from Bio import SeqIO
> def trim_primers(records, primer):
>    """Removes perfect primer sequences at start of reads.
>
>    This is a generator function, the records argument should
>    be a list or iterator returning SeqRecord objects.
>    """
>    len_primer = len(primer) #cache this for later
>    for record in records:
>        if record.seq.startswith(primer):
>            yield record[len_primer:]
>        else:
>            yield record
>
> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> print "Saved %i reads" % count
>
>
>
>
> My question is: Is there a way to loop through a primer file for instance instead of looking for only
>
> 'GATGACGGTGT' every primer would be checked and subsequently removed from the start of its respective read.
>
> Primer file structured as:
> GATGACGGTGT
> GATGACGGTGA
> GATGACGGCCT
>
> If you have any suggestions it would be greatly appreciated. Thanks.
>
> Tychele
>
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython