[Biopython] biopython question
Wibowo Arindrarto
w.arindrarto at gmail.com
Wed Apr 4 18:05:54 UTC 2012
Hi Tychele,
If I understood correctly, you have a list of primers stored in a file
and you want to trim those primer sequences off your fastq sequences,
correct? One way I could think of is to first store the primers in a
list (since they will be used repeatedly to check every single fastq
sequence).
Here's the code:
from Bio import SeqIO
def trim_primers(records, 'primer_file_name'):
# read the primers
primer_list = []
with open('primer_file_name', 'r') as source:
for line in source:
primer_list.append(line.strip())
for record in records:
# list to check if the sequence begins with any of the primers
check = [record.seq.startswith(x) for x in primer_list]
# if any of the primer is present in the beginning of the
sequence, then we trim it off
if any(check):
# get index of primer that matches the beginning
idx = check.index(True)
len_primer = len(primer_list[idx])
yield record[len_primer:]
# otherwise just return the whole record
else:
yield record
and then, you can use the function like so:
original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_primers(original_reads, 'primer_file_name')
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print "Saved %i reads" % count
I haven't tested the function, but I suppose that should do the trick.
Hope that helps :),
Bow
On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 at jhmi.edu> wrote:
> Hi,
>
> I have a question regarding one of the biopython capabilities. I would like to trim primers off the end of reads in a fastq file and I found wonderful documentation of how to do this on your website as follows:
>
> from Bio import SeqIO
> def trim_primers(records, primer):
> """Removes perfect primer sequences at start of reads.
>
> This is a generator function, the records argument should
> be a list or iterator returning SeqRecord objects.
> """
> len_primer = len(primer) #cache this for later
> for record in records:
> if record.seq.startswith(primer):
> yield record[len_primer:]
> else:
> yield record
>
> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> print "Saved %i reads" % count
>
>
>
>
> My question is: Is there a way to loop through a primer file for instance instead of looking for only
>
> 'GATGACGGTGT' every primer would be checked and subsequently removed from the start of its respective read.
>
> Primer file structured as:
> GATGACGGTGT
> GATGACGGTGA
> GATGACGGCCT
>
> If you have any suggestions it would be greatly appreciated. Thanks.
>
> Tychele
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
More information about the Biopython
mailing list