[EMBOSS] EMBOSS Digest, Vol 121, Issue 2

Wed Feb 6 12:17:36 UTC 2019

Thank you, David.

Since I emailed the mailing list, I may have found a solution
independently: please see https://www.biostars.org/p/361700/#362231.

But if I find your solution to be better in some way(s), I'll use it
instead.
Thank you for taking the time to explain your solution in detail, I truly
appreciate it.

Sincerely,
Anand

On Tue, Feb 5, 2019 at 7:01 AM <emboss-request at mailman.open-bio.org> wrote:

> Send EMBOSS mailing list submissions to
>         emboss at mailman.open-bio.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://mailman.open-bio.org/mailman/listinfo/emboss
> or, via email, send a message with subject or body 'help' to
>         emboss-request at mailman.open-bio.org
>
> You can reach the person managing the list at
>         emboss-owner at mailman.open-bio.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of EMBOSS digest..."
>
>
> Today's Topics:
>
>    1. Re: run parameters for inexact matching with EMBOSS needle
>       (David Mathog)
>    2. Re: run parameters for inexact matching with EMBOSS needle
>       (David Mathog)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 04 Feb 2019 09:57:06 -0800
> From: David Mathog <mathog at caltech.edu>
> To: emboss at mailman.open-bio.org
> Subject: Re: [EMBOSS] run parameters for inexact matching with EMBOSS
>         needle
> Message-ID: <1af376ce7957ba022462adfcb19765f2 at saf.bio.caltech.edu>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> On 03-Feb-2019 04:18, Anandkumar Surendrarao wrote:
>
> >    - I want to search with FASTA formatted DNA sequence queries that
> > are <=
> >    100nt in length,
> >    - against FASTA formatted genomic DNA sequence,
> >    - for global end-to-end matches  (using Needleman Wunsch type global
> >    search, not Smith Waterman type local matching)
> >    - but allowing >= 80% identity and >= 80% of query length coverage
>
> Not sure about needle, except that I expect it would be very slow.
>
> However, I made a modified version of Lastal that added command line
> parameters for identity/query/target cutoff for almost exactly the same
> application.  It is available here in the "modified programs" directory.
>
>    https://sourceforge.net/projects/lemonade-assemble/
>
> There is a Centos 6 64 bit binary there, for any other platform you will
> need to recompile. Lastal does not have NW exactly, but it does have a
> -T parameter which does something which in many cases is similar.  -T 0
> is a local alignment, and -T 1 "extends to the end" (starting from the
> local alignment).  My use case involved Illumina reads, and what would
> happen is if there was a mismatch at 3 or 4 bp from the end of that read
> in the alignment against the target (in this case mRNA transcripts) -T 0
> would truncate the alignment at the position before the mismatch, so the
> read would not be seen as a full length match. With -T 1 it would pick
> up the extra matching bases.  Anyway, let's say you have
>
>    reads.fasta   # bunch of 100bp reads
>    genome.fasta  # self explanatory
>
> CPUS=40
> lastdb -P $CPUS -w 10 genome_10 genome.fasta
> lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \
>    -f BlastTab genome_10 reads.fasta > results.txt 2>errors.log
>
>
> The lastdb command makes a database but uses only every 10th position.
> This will make the database smaller and the search much faster.  These
> positions are only used as seeds, so if the expected match is very good
> -w10 gives the same results as -w1 (default).  If the match is expected
> to be poor use -w 1 and expect much longer run times.  Set CPUS to match
> your system.
>
> The lastal command looks for 80% identity (-I80), 80% coverage on query
> (-Y80), allows up to 250 overlapping matches (-m250, see below), allows
> offset overlap alignment as
> long as the overlap is at least 50bp (-O50), requires overlap alignments
> to go right
> to the ends (-8 1 -9 1), and -f BlastTab gives a blast tabular format
> output file (except when it is on the other strand the query positions
> are flipped rather than the target).  Offset overlap is like this:
>
>      -------------------------
>                |||||||||||||||  <- -O50 == this must be >= 50bp
>                ------------------------------
>
> This only concerns alignments which overlap the ends.   The -m parameter
> is a standard lastal command line option:
>
>    -m: maximum initial matches per query position (10)
>
> What it means is that if there are a lot of alignments in exactly the
> same place by default one only retrieves a few of them.  Increasing the
> value to 250 means many more will be retrieved.  Conceivably if there is
> super high coverage you might want to use an even larger number.  Best
> to first try a small subset to determine the appropriate -m value for
> your case.
>
> I don't know if it would be faster if the query/database were the other
> way around.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 04 Feb 2019 10:32:20 -0800
> From: David Mathog <mathog at caltech.edu>
> To: emboss at mailman.open-bio.org
> Subject: Re: [EMBOSS] run parameters for inexact matching with EMBOSS
>         needle
> Message-ID: <f5553bd8c0b5781ade9bafca9bc40ed2 at saf.bio.caltech.edu>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> On 04-Feb-2019 09:57, David Mathog wrote:
> > lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \
> >   -f BlastTab genome_10 reads.fasta > results.txt 2>errors.log
>
> I should add that for my case it was -I93 -J100.  The reads were only
> 76bp and
> it was common to have a few bp mismatches on the ends, hence 93%
> identity.  -J instead
> of -Y because it was transcripts searching a reads database rather than
> the other way around.  80% is pretty low if the short sequences should
> be present 1:1 in the assembled genome, even for a highly polymorphic
> genome where only one haplotype might be present.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/emboss
>
> ------------------------------
>
> End of EMBOSS Digest, Vol 121, Issue 2
> **************************************
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/emboss/attachments/20190206/76253b85/attachment.html>