Skip to contents

This vignette explains how proxyC compute the similarity and distance measures.

Notation

x=[xi,xi+1,,xn]y=[yi,yi+1,,yn] \begin{gather} \vec{x} = [x_i, x_{i + 1}, \dots, x_n] \\ \vec{y} = [y_i, y_{i + 1}, \dots, y_n] \end{gather}

The length of the vector n=||x||n = ||\vec{x}||, while |x||\vec{x}| is the absolute values of the elements.

Operations on vectors are element-wise:

z=xyn=||x||=||y||=||z|| \begin{gather} \vec{z} = \vec{x}\vec{y} \\ n = ||\vec{x}|| = ||\vec{y}|| =||\vec{z}|| \end{gather}

Summation of the elements of vectors is written using sigma without specifying the range:

x=i=1nxi \sum{\vec{x}} = \sum_{i=1}^{n}{x_i}

When the elements of the vector is compared with a value in a pair of square brackets, the summation is counting the number of elements that equal (or unequal) to the value:

[x=1]=i=1n[xi=1] \sum{[\vec{x} = 1]} = \sum_{i=1}^{n}{[x_i = 1]}

Similarity Measures

Similarity measures are available in proxyC::simil().

Cosine similarity (“cosine”)

simil=xyx2y2 simil = \frac{\sum{\vec{x}\vec{y}}}{\sqrt{\sum{\vec{x} ^ 2}} \sqrt{\sum{\vec{y} ^ 2}}}

Pearson correlation coefficient (“correlation”)

simil=Cov(x,y)Var(x)Var(y) simil = \frac{Cov(\vec{x},\vec{y})}{Var(\vec{x}) Var(\vec{y})}

Jaccard similarity (“jaccard” and “ejaccard”)

The values of xx and yy are Boolean for “jaccard”.

e=xyw=user-provided weightsimil=exw+ywe \begin{gather} e = \sum{\vec{x} \vec{y}} \\ w = \text{user-provided weight} \\ simil = \frac{e}{\sum{\vec{x} ^ w} + \sum{\vec{y} ^ w} - e} \end{gather}

Fuzzy Jaccard similarity (“fjaccard”)

The values must be 0x1.00 \le x \le 1.0 and 0y1.00 \le y \le 1.0.

simil=min(x,y)max(x,y) simil = \frac{\sum{min(\vec{x}, \vec{y})}}{\sum{max(\vec{x}, \vec{y})}}

Dice similarity (“dice” and “edice”)

The values of xx and yy are Boolean for “dice”.

e=xyw=user-provided weightsimil=2exw+yw \begin{gather} e = \sum{\vec{x} \vec{y}} \\ w = \text{user-provided weight} \\ simil = \frac{2 e}{\sum{\vec{x} ^ w} + \sum{\vec{y} ^ w}} \end{gather}

Hamann similarity (“hamann”)

e=xyn=||x||=||y||u=nesimil=eue+u \begin{gather} e = \sum{\vec{x} \vec{y}} \\ n = ||\vec{x}|| = ||\vec{y}|| \\ u = n - e \\ simil = \frac{e - u}{e + u} \end{gather}

Faith similarity (“faith”)

t=[x=1][y=1]f=[x=0][y=0]n=||x||=||y||simil=t+0.5fn \begin{gather} t = \sum{[\vec{x} = 1][\vec{y} = 1]} \\ f = \sum{[\vec{x} = 0][\vec{y} = 0]} \\ n = ||\vec{x}|| = ||\vec{y}|| \\ simil = \frac{t + 0.5 f}{n} \end{gather}

Simple matching (“matching”)

simil=[x=y] simil = \sum{[\vec{x} = \vec{y}]}

Distance Measures

Similarity measures are available in proxyC::dist(). Smoothing of the vectors can be performed when method is “chisquared”, “kullback”, “jefferys” or “jensen”: the value of smooth will be added to each element of x\vec{x} and y\vec{y}.

Manhattan distance (“manhattan”)

dist=|xy| dist = \sum{|\vec{x} - \vec{y}|}

Canberra distance (“canberra”)

dist=|xy||x|+|y| dist = \frac{|\vec{x} - \vec{y}|}{|\vec{x}| + |\vec{y}|}

Euclidian (“euclidian”)

dist=x2+y2 dist = \sum{\sqrt{\vec{x}^2 + \vec{y}^2}}

Minkowski distance (“minkowski”)

p=user-provided parameterdist=(|xy|p)1p \begin{gather} p = \text{user-provided parameter} \\ dist = \left( \sum{|\vec{x} - \vec{y}| ^ p} \right) ^ \frac{1}{p} \end{gather}

Hamming distance (“hamming”)

dist=[xy] dist = \sum{[\vec{x} \ne \vec{y}]}

The largest difference between values (“maximum”)

dist=maxxy dist = \max{\vec{x} - \vec{y}}

Chi-squared divergence (“chisquared”)

Oij=augmented matrix from x and yEij=matrix of expected count for Oijdist=(OijEij)2Eij \begin{gather} O_{ij} = \text{augmented matrix from } \vec{x} \text{ and } \vec{y} \\ E_{ij} = \text{matrix of expected count for } O_{ij} \\ dist = \sum{\frac{(O_{ij} - E_{ij}) ^ 2}{ E_{ij}}} \end{gather}

Kullback–Leibler divergence (“kullback”)

p=xxq=yydist=qlog2qp \begin{gather} \vec{p} = \frac{\vec{x}}{\sum{\vec{x}}} \\ \vec{q} = \frac{\vec{y}}{\sum{\vec{y}}} \\ dist = \sum{\vec{q} \log_2{\frac{\vec{q}}{\vec{p}}}} \end{gather}

Jeffreys divergence (“jeffreys”)

p=xxq=yydist=qlog2qp+plog2pq \begin{gather} \vec{p} = \frac{\vec{x}}{\sum{\vec{x}}} \\ \vec{q} = \frac{\vec{y}}{\sum{\vec{y}}} \\ dist = \sum{\vec{q} \log_2{\frac{\vec{q}}{\vec{p}}}} + \sum{\vec{p} \log_2{\frac{\vec{p}}{\vec{q}}}} \end{gather}

Jensen-Shannon divergence (“jensen”)

p=xxq=yym=12(p+q)dist=12qlog2qm+12plog2pm \begin{gather} \vec{p} = \frac{\vec{x}}{\sum{\vec{x}}} \\ \vec{q} = \frac{\vec{y}}{\sum{\vec{y}}} \\ \vec{m} = \frac{1}{2} (\vec{p} + \vec{q}) \\ dist = \frac{1}{2} \sum{\vec{q} \log_2{\frac{\vec{q}}{\vec{m}}}} + \frac{1}{2} \sum{\vec{p} \log_2{\frac{\vec{p}}{\vec{m}}}} \end{gather}

References

  • Choi, S., Cha, S., & Tappert, C. C. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43–48.
  • Nielsen, F. (2019). On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means. Entropy, 21(5), 485. https://doi.org/10.3390/e21050485
  • Jain, G., Mahara, T., & Tripathi, K. N. (2020). A Survey of Similarity Measures for Collaborative Filtering-Based Recommender System. In M. Pant, T. K. Sharma, O. P. Verma, R. Singla, & A. Sikander (Eds.), Soft Computing: Theories and Applications (pp. 343–352). Springer. https://doi.org/10.1007/978-981-15-0751-9_32
  • Miyamoto, S. (1990). Hierarchical Cluster Analysis and Fuzzy Sets. In S. Miyamoto (Ed.), Fuzzy Sets in Information Retrieval and Cluster Analysis (pp. 125–188). Springer Netherlands. https://doi.org/10.1007/978-94-015-7887-5_6