[Biopython] biopython question

Mic mictadlo at gmail.com
Fri Apr 6 09:59:36 UTC 2012


Hi Bow,
You can remove duplicates in the input file or create a new output file.
With the following commands you create an output file with no duplicates:

*$ samtools fixmate t.paired.sorted.bam t.paired.sorted.SamFix.bam

*
*$ java -Xmx8g -jar MarkDuplicates.jar REMOVE_DUPLICATES=true
ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024
INPUT=t.paired.sorted.bam OUTPUT=t.paired.sorted.rmdulp.bam
METRICS_FILE=t.paired.sorted.bam.metrics*

Are adapters and fragments the same?

I found the following software for adapter:
** TagDust - eliminate artifactual sequence from NGS data*
*http://www.biomedcentral.com/1471-2164/12/382*
*http://bioinformatics.oxfordjournals.org/content/25/21/2839.full*
** FAR: http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php?*
*title=Main_Page*
** Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic*
** http://code.google.com/p/cutadapt/ *
** https://github.com/vsbuffalo/scythe *
** http://code.google.com/p/biopieces/wiki/find_adaptor*

Thank you in advance.

Cheers,


On Fri, Apr 6, 2012 at 12:20 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com>wrote:

> Hi Mic,
>
> I'm not familiar with picard, but it seems that this program detects
> whole duplicate molecules instead of detecting whether a primer is
> present in sequences (which may or may not be duplicates). Plus, it
> doesn't do any removal ~ it only flags them. So I don't think the two
> are comparable.
>
> cheers,
> Bow
>
> On Fri, Apr 6, 2012 at 04:06, Mic <mictadlo at gmail.com> wrote:
> > What is the difference to remove primer from the fastq file rather to use
> > markDuplicates
> http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
>  on
> > an alignment?
> >
> > Would both ways deliver the same results?
> >
> > Thank you in advance.
> >
> >
> > On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto <
> w.arindrarto at gmail.com>
> > wrote:
> >>
> >> Hi Tychele,
> >>
> >> Glad to hear that and thanks for attaching the code as well :).
> >>
> >> Just one more heads up on the code, the trimming function assumes that
> >> for any record sequence, there is only one matching primer sequence at
> >> most. If by any random chance a sequence begins with two or more
> >> primer sequences, then it will only trim the first primer sequence. So
> >> if you still see some primer sequences left in the trimmed sequences,
> >> this could be the case and you'll need to modify the code.
> >>
> >> However, that seems unlikely ~ the current code should suffice.
> >>
> >> cheers,
> >> Bow
> >>
> >>
> >> On Thu, Apr 5, 2012 at 00:12, Tychele Turner <tturne18 at jhmi.edu> wrote:
> >> > Hi Bow,
> >> >
> >> > Thank you! This works great. I have attached the final code to the
> email
> >> > in case it may benefit others.
> >> >
> >> > Tychele
> >> >
> >> >
> >> > ________________________________________
> >> > From: Wibowo Arindrarto [w.arindrarto at gmail.com]
> >> > Sent: Wednesday, April 04, 2012 2:05 PM
> >> > To: Tychele Turner
> >> > Cc: biopython at biopython.org
> >> > Subject: Re: [Biopython] biopython question
> >> >
> >> > Hi Tychele,
> >> >
> >> > If I understood correctly, you have a list of primers stored in a file
> >> > and you want to trim those primer sequences off your fastq sequences,
> >> > correct? One way I could think of is to first store the primers in a
> >> > list (since they will be used repeatedly to check every single fastq
> >> > sequence).
> >> >
> >> > Here's the code:
> >> >
> >> > from Bio import SeqIO
> >> >
> >> > def trim_primers(records, 'primer_file_name'):
> >> >
> >> >    # read the primers
> >> >    primer_list = []
> >> >    with open('primer_file_name', 'r') as source:
> >> >      for line in source:
> >> >        primer_list.append(line.strip())
> >> >
> >> >    for record in records:
> >> >        # list to check if the sequence begins with any of the primers
> >> >        check = [record.seq.startswith(x) for x in primer_list]
> >> >        # if any of the primer is present in the beginning of the
> >> > sequence, then we trim it off
> >> >        if any(check):
> >> >            # get index of primer that matches the beginning
> >> >            idx = check.index(True)
> >> >            len_primer = len(primer_list[idx])
> >> >            yield record[len_primer:]
> >> >        # otherwise just return the whole record
> >> >        else:
> >> >            yield record
> >> >
> >> > and then, you can use the function like so:
> >> >
> >> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> >> > trimmed_reads = trim_primers(original_reads, 'primer_file_name')
> >> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> >> > print "Saved %i reads" % count
> >> >
> >> > I haven't tested the function, but I suppose that should do the trick.
> >> >
> >> > Hope that helps :),
> >> > Bow
> >> >
> >> >
> >> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 at jhmi.edu>
> wrote:
> >> >> Hi,
> >> >>
> >> >> I have a question regarding one of the biopython capabilities. I
> would
> >> >> like to trim primers off the end of reads in a fastq file and I found
> >> >> wonderful documentation of how to do this on your website as follows:
> >> >>
> >> >> from Bio import SeqIO
> >> >> def trim_primers(records, primer):
> >> >>    """Removes perfect primer sequences at start of reads.
> >> >>
> >> >>    This is a generator function, the records argument should
> >> >>    be a list or iterator returning SeqRecord objects.
> >> >>    """
> >> >>    len_primer = len(primer) #cache this for later
> >> >>    for record in records:
> >> >>        if record.seq.startswith(primer):
> >> >>            yield record[len_primer:]
> >> >>        else:
> >> >>            yield record
> >> >>
> >> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
> >> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
> >> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
> >> >> print "Saved %i reads" % count
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> My question is: Is there a way to loop through a primer file for
> >> >> instance instead of looking for only
> >> >>
> >> >> 'GATGACGGTGT' every primer would be checked and subsequently removed
> >> >> from the start of its respective read.
> >> >>
> >> >> Primer file structured as:
> >> >> GATGACGGTGT
> >> >> GATGACGGTGA
> >> >> GATGACGGCCT
> >> >>
> >> >> If you have any suggestions it would be greatly appreciated. Thanks.
> >> >>
> >> >> Tychele
> >> >>
> >> >>
> >> >> _______________________________________________
> >> >> Biopython mailing list  -  Biopython at lists.open-bio.org
> >> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >> >
> >>
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >
> >
>



More information about the Biopython mailing list