[EMBOSS] run parameters for inexact matching with EMBOSS needle

David Mathog mathog at caltech.edu
Mon Feb 4 17:57:06 UTC 2019

On 03-Feb-2019 04:18, Anandkumar Surendrarao wrote:

>    - I want to search with FASTA formatted DNA sequence queries that 
> are <=
>    100nt in length,
>    - against FASTA formatted genomic DNA sequence,
>    - for global end-to-end matches  (using Needleman Wunsch type global
>    search, not Smith Waterman type local matching)
>    - but allowing >= 80% identity and >= 80% of query length coverage

Not sure about needle, except that I expect it would be very slow.

However, I made a modified version of Lastal that added command line 
parameters for identity/query/target cutoff for almost exactly the same 
application.  It is available here in the "modified programs" directory.


There is a Centos 6 64 bit binary there, for any other platform you will 
need to recompile. Lastal does not have NW exactly, but it does have a 
-T parameter which does something which in many cases is similar.  -T 0 
is a local alignment, and -T 1 "extends to the end" (starting from the 
local alignment).  My use case involved Illumina reads, and what would 
happen is if there was a mismatch at 3 or 4 bp from the end of that read 
in the alignment against the target (in this case mRNA transcripts) -T 0 
would truncate the alignment at the position before the mismatch, so the 
read would not be seen as a full length match. With -T 1 it would pick 
up the extra matching bases.  Anyway, let's say you have

   reads.fasta   # bunch of 100bp reads
   genome.fasta  # self explanatory

lastdb -P $CPUS -w 10 genome_10 genome.fasta
lastal -P $CPUS -T1 -I80 -Y80 -m250 -O50 -8 1 -9 1 \
   -f BlastTab genome_10 reads.fasta > results.txt 2>errors.log

The lastdb command makes a database but uses only every 10th position.  
This will make the database smaller and the search much faster.  These 
positions are only used as seeds, so if the expected match is very good 
-w10 gives the same results as -w1 (default).  If the match is expected 
to be poor use -w 1 and expect much longer run times.  Set CPUS to match 
your system.

The lastal command looks for 80% identity (-I80), 80% coverage on query 
(-Y80), allows up to 250 overlapping matches (-m250, see below), allows 
offset overlap alignment as
long as the overlap is at least 50bp (-O50), requires overlap alignments 
to go right
to the ends (-8 1 -9 1), and -f BlastTab gives a blast tabular format 
output file (except when it is on the other strand the query positions 
are flipped rather than the target).  Offset overlap is like this:

               |||||||||||||||  <- -O50 == this must be >= 50bp

This only concerns alignments which overlap the ends.   The -m parameter 
is a standard lastal command line option:

   -m: maximum initial matches per query position (10)

What it means is that if there are a lot of alignments in exactly the 
same place by default one only retrieves a few of them.  Increasing the 
value to 250 means many more will be retrieved.  Conceivably if there is 
super high coverage you might want to use an even larger number.  Best 
to first try a small subset to determine the appropriate -m value for 
your case.

I don't know if it would be faster if the query/database were the other 
way around.


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

More information about the EMBOSS mailing list