[Biojava-dev] Near Matches

Osborne, John jko1 at cdc.gov
Sun Sep 14 15:59:20 EDT 2003


Hi Mark,
I actually had to look up Knuth Pratt Morris searches (thanks for the
bioinformatics lesson!) but I don't think it is going to work in my case
because I am solving an easier problem(!) -the query sequence and target
sequence are ALWAYS the same size.  So I am actually just performing the
last step of the KMP pattern matching which is checking each base pair for
an exact match.  I just bail out early if the mismatch threshold isn't met,
I never do ANY pattern sliding so something like a modified Boyer-Moore
search isn't going to help either.  :(

One thing I thought is to check for the longest exact substring within the
25mer, since if it is small (say less than 8 for 2 exact matches) then I
know this won't cause hybridization problems.  I don't think this is going
to be much faster though, and I'm almost finished with my stupid
implementation (which takes about 20 minutes to run) anyway.

 -John
 

-----Original Message-----
From: Schreiber, Mark [mailto:mark.schreiber at agresearch.co.nz]
Sent: Saturday, September 13, 2003 12:49 AM
To: Matthew Pocock
Cc: biojava-dev at biojava.org
Subject: RE: [Biojava-dev] Near Matches


Hi -
 
I'm pretty sure you could hack the KnuthMorrisPrattSearch class to allow it
to tolerate a certain number of missmatches. KMP searches are very fast
although they will slow considerabley if you allow more than a few
missmatches.
 
- Mark

	-----Original Message----- 
	From: Matthew Pocock [mailto:matthew_pocock at yahoo.co.uk] 
	Sent: Fri 12/09/2003 9:07 p.m. 
	To: Schreiber, Mark 
	Cc: biojava-dev at biojava.org 
	Subject: Re: [Biojava-dev] Near Matches
	
	

	Hi John,
	
	I commited code that does a slightly different job - accpets or
rejects
	regions based upon sequence content. However, it will be easy to
write
	an aproximate matcher that does at most n miss-matches. The
algorithms
	that sudgest themselves to me are all O(mn) - not sure if we can do
	better without a math genius, or a good string algorithms book.
	
	Matthew
	
	Schreiber, Mark wrote:
	
	>Hi -
	>
	>I think Matthew checked in something that does what you want a few
days back. Have a look at the list archives, or cvs records for that last
few weeks.
	>
	>- Mark
	>
	>
	>-----Original Message-----
	>From: Osborne, John [mailto:jko1 at cdc.gov]
	>Sent: Friday, 12 September 2003 6:20 a.m.
	>To: biojava-dev at biojava.org
	>Subject: [Biojava-dev] Near Matches
	>
	>
	>Hi,
	>
	>I am looking for a way in Biojava to iterate quickly through a list
of DNA N-mers for sequences that are almost an exact match, like 23 of 25
bases.  The mismatches can occur in ANY position in a sequence.  Other than
iterating through a SymbolList and keeping track of the number of
mismatches, is there a better (read faster) way to do this?  I was thinking
maybe the SuffixTree class, but since sequence order is unimportant it
doesn't see like the right tool for the job.
	>
	>Right now it is going to be a little bit ugly, since I am putting
this into a O(n^2) function with a big n...
	>
	> -John
	>_______________________________________________
	>biojava-dev mailing list
	>biojava-dev at biojava.org
http://biojava.org/mailman/listinfo/biojava-dev
	
>=======================================================================
	>Attention: The information contained in this message and/or
attachments
	>from AgResearch Limited is intended only for the persons or
entities
	>to which it is addressed and may contain confidential and/or
privileged
	>material. Any review, retransmission, dissemination or other use
of, or
	>taking of any action in reliance upon, this information by persons
or
	>entities other than the intended recipients is prohibited by
AgResearch
	>Limited. If you have received this message in error, please notify
the
	>sender immediately.
	
>=======================================================================
	>
	>_______________________________________________
	>biojava-dev mailing list
	>biojava-dev at biojava.org
	>http://biojava.org/mailman/listinfo/biojava-dev
	>
	> 
	>
	
	
	_______________________________________________
	biojava-dev mailing list
	biojava-dev at biojava.org
	http://biojava.org/mailman/listinfo/biojava-dev
	


=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

_______________________________________________
biojava-dev mailing list
biojava-dev at biojava.org
http://biojava.org/mailman/listinfo/biojava-dev


More information about the biojava-dev mailing list