[Biojava-l] Similarity measures for generalized sequences

Mon May 7 10:21:15 UTC 2012

Hi,

I'm looking for general advice on comparing sequences (S). I don't
necessarily mean DNA sequences, but rather sequences such as
"Region A is connected with Region B" (A->B for short), together with
a distance or similarity measure that allows similar sequences or
paths to be identified. The regions are alphanumerically coded, e.g.
"Bed nucleus of the stria terminalis anterior division".
Given are 10^2 to 10^7 different paths. What I am looking for is all
their mutual similarities (e.g., a similarity matrix) and a
multivariate classification, such as a dendrogram based on a
meaningful cluster analysis.

Example
Given:
S1: A->B->C->G
S2: A->B->F->G
S3: A->C->B->G
S4: A->B->D->G

Searched:
Similarity matrix

     S1   S2   S3   S4
S1   ?    ?    ?    ?
S2   ?    ?    ?    ?
S3   ?    ?    ?    ?
S4   ?    ?    ?    ?
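
To make the question concrete, here is a minimal sketch of one candidate
measure: the Levenshtein (edit) distance computed over whole region labels
rather than characters, scaled to a similarity between 0 and 1. The choice
of edit distance, the normalisation by the longer path, and the names
edit_distance, similarity and paths are my own assumptions for
illustration, not an established method for this kind of data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two lists of region labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer path."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# The four example paths, split into region labels.
paths = {
    "S1": "A->B->C->G".split("->"),
    "S2": "A->B->F->G".split("->"),
    "S3": "A->C->B->G".split("->"),
    "S4": "A->B->D->G".split("->"),
}

names = list(paths)
matrix = [[similarity(paths[r], paths[c]) for c in names] for r in names]

print("    " + "  ".join(f"{n:>4}" for n in names))
for name, row in zip(names, matrix):
    print(f"{name:<4}" + "  ".join(f"{v:4.2f}" for v in row))
```

Because labels are compared as whole tokens, long names such as "Bed
nucleus of the stria terminalis anterior division" cost the same as "A".
A full matrix grows quadratically with the number of paths, so this
all-pairs approach would only be practical toward the lower end of the
10^2 to 10^7 range.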

Then I would like to generate a dendrogram based on the similarity measure:

S1--
    |--
S2--   |
       |----
S3--   |
    |--
S4--
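
As a rough illustration of the clustering step, the sketch below builds a
tree by naive average-linkage agglomerative clustering. Everything in it
is an assumption on my side: the distance used (Jaccard distance on the
sets of directed connections in each path), the linkage rule, and the
helper names edges, jaccard_distance and average_linkage.

```python
from itertools import combinations


def edges(path):
    """Set of directed connections (a, b) along a path."""
    return set(zip(path, path[1:]))


def jaccard_distance(p, q):
    """1 - |shared connections| / |all connections| (assumed measure)."""
    ep, eq = edges(p), edges(q)
    union = ep | eq
    return 0.0 if not union else 1.0 - len(ep & eq) / len(union)


def average_linkage(names, dist):
    """Naive agglomerative clustering; returns the tree as nested tuples."""
    def avg(pair):
        (_, a), (_, b) = pair
        return sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))

    # Each cluster is (subtree, list of member names).
    clusters = [(n, [n]) for n in names]
    while len(clusters) > 1:
        # Merge the pair with the smallest average distance;
        # ties are broken by position in the list.
        left, right = min(combinations(clusters, 2), key=avg)
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0]


paths = {
    "S1": "A->B->C->G".split("->"),
    "S2": "A->B->F->G".split("->"),
    "S3": "A->C->B->G".split("->"),
    "S4": "A->B->D->G".split("->"),
}
names = list(paths)
dist = {a: {b: jaccard_distance(paths[a], paths[b]) for b in names}
        for a in names}

tree = average_linkage(names, dist)
print(tree)
```

With this particular measure S1, S2 and S4 share only the connection A->B
and are equally far apart, so the order in which they merge depends on the
tie-breaking; S3 shares no connection with the others and joins last. The
loop recomputes all pairwise averages at every step, so it is only meant
for small examples.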

Thanks a lot for any advice.

Regards,
Oliver