[EMBOSS] EMBOSS Digest, Vol 121, Issue 2
Anandkumar Surendrarao
aksrao at ucdavis.edu
Wed Feb 6 12:17:36 UTC 2019
Thank you, David.
Since I emailed the mailing list, I may have found a solution
independently: please see https://www.biostars.org/p/361700/#362231.
But if I find your solution to be better in some way(s), I'll use it
instead.
Thank you for taking the time to explain your solution in detail, I truly
appreciate it.
Sincerely,
Anand
On Tue, Feb 5, 2019 at 7:01 AM <emboss-request at mailman.open-bio.org> wrote:
> Send EMBOSS mailing list submissions to
> emboss at mailman.open-bio.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://mailman.open-bio.org/mailman/listinfo/emboss
> or, via email, send a message with subject or body 'help' to
> emboss-request at mailman.open-bio.org
>
> You can reach the person managing the list at
> emboss-owner at mailman.open-bio.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of EMBOSS digest..."
>
>
> Today's Topics:
>
> 1. Re: run parameters for inexact matching with EMBOSS needle
> (David Mathog)
> 2. Re: run parameters for inexact matching with EMBOSS needle
> (David Mathog)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 04 Feb 2019 09:57:06 -0800
> From: David Mathog <mathog at caltech.edu>
> To: emboss at mailman.open-bio.org
> Subject: Re: [EMBOSS] run parameters for inexact matching with EMBOSS
> needle
> Message-ID: <1af376ce7957ba022462adfcb19765f2 at saf.bio.caltech.edu>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> On 03-Feb-2019 04:18, Anandkumar Surendrarao wrote:
>
> > - I want to search with FASTA formatted DNA sequence queries that
> > are <=
> > 100nt in length,
> > - against FASTA formatted genomic DNA sequence,
> > - for global end-to-end matches (using Needleman Wunsch type global
> > search, not Smith Waterman type local matching)
> > - but allowing >= 80% identity and >= 80% of query length coverage
>
> Not sure about needle, except that I expect it would be very slow.
>
> However, I made a modified version of Lastal that added command line
> parameters for identity/query/target cutoff for almost exactly the same
> application. It is available here in the "modified programs" directory.
>
> https://sourceforge.net/projects/lemonade-assemble/
>
> There is a Centos 6 64 bit binary there, for any other platform you will
> need to recompile. Lastal does not have NW exactly, but it does have a
> -T parameter which does something which in many cases is similar. -T 0
> is a local alignment, and -T 1 "extends to the end" (starting from the
> local alignment). My use case involved Illumina reads, and what would
> happen is if there was a mismatch at 3 or 4 bp from the end of that read
> in the alignment against the target (in this case mRNA transcripts) -T 0
> would truncate the alignment at the position before the mismatch, so the
> read would not be seen as a full length match. With -T 1 it would pick
> up the extra matching bases. Anyway, let's say you have
>
> reads.fasta # bunch of 100bp reads
> genome.fasta # self explanatory
>
> CPUS=40
> lastdb -P $CPUS -w 10 genome_10 genome.fasta
> lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \
> -f BlastTab genome_10 reads.fasta > results.txt 2>errors.log
>
>
> The lastdb command makes a database but uses only every 10th position.
> This will make the database smaller and the search much faster. These
> positions are only used as seeds, so if the expected match is very good
> -w10 gives the same results as -w1 (default). If the match is expected
> to be poor use -w 1 and expect much longer run times. Set CPUS to match
> your system.
>
> The lastal command looks for 80% identity (-I80), 80% coverage on query
> (-Y80), allows up to 250 overlapping matches (-m250, see below), allows
> offset overlap alignment as
> long as the overlap is at least 50bp (-O50), requires overlap alignments
> to go right
> to the ends (-8 1 -9 1), and -f BlastTab gives a blast tabular format
> output file (except when it is on the other strand the query positions
> are flipped rather than the target). Offset overlap is like this:
>
> -------------------------
> ||||||||||||||| <- -O50 == this must be >= 50bp
> ------------------------------
>
> This only concerns alignments which overlap the ends. The -m parameter
> is a standard lastal command line option:
>
> -m: maximum initial matches per query position (10)
>
> What it means is that if there are a lot of alignments in exactly the
> same place by default one only retrieves a few of them. Increasing the
> value to 250 means many more will be retrieved. Conceivably if there is
> super high coverage you might want to use an even larger number. Best
> to first try a small subset to determine the appropriate -m value for
> your case.
>
> I don't know if it would be faster if the query/database were the other
> way around.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 04 Feb 2019 10:32:20 -0800
> From: David Mathog <mathog at caltech.edu>
> To: emboss at mailman.open-bio.org
> Subject: Re: [EMBOSS] run parameters for inexact matching with EMBOSS
> needle
> Message-ID: <f5553bd8c0b5781ade9bafca9bc40ed2 at saf.bio.caltech.edu>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> On 04-Feb-2019 09:57, David Mathog wrote:
> > lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \
> > -f BlastTab genome_10 reads.fasta > results.txt 2>errors.log
>
> I should add that for my case it was -I93 -J100. The reads were only
> 76bp and
> it was common to have a few bp mismatches on the ends, hence 93%
> identity. -J instead
> of -Y because it was transcripts searching a reads database rather than
> the other way around. 80% is pretty low if the short sequences should
> be present 1:1 in the assembled genome, even for a highly polymorphic
> genome where only one haplotype might be present.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/emboss
>
> ------------------------------
>
> End of EMBOSS Digest, Vol 121, Issue 2
> **************************************
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/emboss/attachments/20190206/76253b85/attachment.html>
More information about the EMBOSS
mailing list