This vignette explains how proxyC computes the
similarity and distance measures.
Notation
The length of the vectors $\vec{x}$ and $\vec{y}$ is $n$, while $|\vec{x}|$ is the absolute values of the elements of $\vec{x}$.
Operations on vectors are element-wise:
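As a sketch of this convention (the symbol $\vec{z}$ is introduced here only for illustration), the product of two vectors of length $n$ is read element by element:

$$
\vec{z} = \vec{x}\vec{y} \quad \Longleftrightarrow \quad z_i = x_i y_i \quad (i = 1, \dots, n)
$$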
Summation of the elements of vectors is written using sigma without
specifying the range:
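Written out, this shorthand is assumed to stand for the sum over all $n$ elements of the vector:

$$
\sum \vec{x} = \sum_{i=1}^{n} x_i
$$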
When the elements of a vector are compared with a value inside a pair of
square brackets, the summation counts the number of elements that are
equal (or unequal) to the value:
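For instance, counting the elements equal to 1 can be sketched as follows, where $\mathbb{1}(\cdot)$ is an indicator function introduced here for illustration (1 when the condition holds, 0 otherwise):

$$
\sum [\vec{x} = 1] = \sum_{i=1}^{n} \mathbb{1}(x_i = 1)
$$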
Similarity Measures
Similarity measures are available in proxyC::simil().
Cosine similarity (“cosine”)
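In the notation above, the textbook definition of cosine similarity can be written as below. This is a sketch of the standard formula; how proxyC treats edge cases such as all-zero vectors is not covered here.

$$
\mathrm{simil} = \frac{\sum \vec{x}\vec{y}}{\sqrt{\sum \vec{x}^2} \sqrt{\sum \vec{y}^2}}
$$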
Pearson correlation coefficient (“correlation”)
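The usual definition is the covariance of the two vectors divided by the product of their standard deviations. In the sketch below, $\bar{x}$ and $\bar{y}$ (symbols introduced here) are the means of the elements; the normalizing constant of the covariance and variances cancels, so it does not matter whether $n$ or $n - 1$ is used.

$$
\mathrm{simil} = \frac{\mathrm{Cov}(\vec{x}, \vec{y})}{\sqrt{\mathrm{Var}(\vec{x})} \sqrt{\mathrm{Var}(\vec{y})}}
= \frac{\sum (\vec{x} - \bar{x})(\vec{y} - \bar{y})}{\sqrt{\sum (\vec{x} - \bar{x})^2} \sqrt{\sum (\vec{y} - \bar{y})^2}}
$$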
Jaccard similarity (“jaccard” and “ejaccard”)
The values of $\vec{x}$ and $\vec{y}$ are Boolean for “jaccard”.
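A sketch of the usual definition, assuming an exponent $e$ (a symbol introduced here) that is 1 for “jaccard” and 2 for “ejaccard”, the extended form often called the Tanimoto coefficient. For Boolean values the two coincide because $x_i^2 = x_i$.

$$
\mathrm{simil} = \frac{\sum \vec{x}\vec{y}}{\sum \vec{x}^e + \sum \vec{y}^e - \sum \vec{x}\vec{y}},
\qquad
e = \begin{cases} 1 & \text{for “jaccard”} \\ 2 & \text{for “ejaccard”} \end{cases}
$$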
Fuzzy Jaccard similarity (“fjaccard”)
The values must be $0 \le \vec{x} \le 1$ and $0 \le \vec{y} \le 1$.
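The fuzzy-set form of the Jaccard coefficient is commonly defined with element-wise minimum and maximum in place of intersection and union. The sketch below assumes that definition:

$$
\mathrm{simil} = \frac{\sum \min(\vec{x}, \vec{y})}{\sum \max(\vec{x}, \vec{y})}
$$

For Boolean values this reduces to the ordinary Jaccard similarity.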
Dice similarity (“dice” and “edice”)
The values of $\vec{x}$ and $\vec{y}$ are Boolean for “dice”.
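A sketch of the usual definition, assuming an exponent $e$ (a symbol introduced here) that is 1 for “dice” and 2 for “edice”; for Boolean values the two coincide.

$$
\mathrm{simil} = \frac{2 \sum \vec{x}\vec{y}}{\sum \vec{x}^e + \sum \vec{y}^e},
\qquad
e = \begin{cases} 1 & \text{for “dice”} \\ 2 & \text{for “edice”} \end{cases}
$$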
Hamann similarity (“hamann”)
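For Boolean vectors of length $n$, the Hamann measure as listed by Choi et al. (2010) is the number of matching positions minus the number of mismatching positions, divided by $n$. In the bracket notation above this can be sketched as:

$$
\mathrm{simil} = \frac{\sum [\vec{x} = \vec{y}] - \sum [\vec{x} \ne \vec{y}]}{n}
$$

The value ranges from $-1$ (no positions match) to $1$ (all positions match).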
Faith similarity (“faith”)
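The Faith measure as listed by Choi et al. (2010) gives full weight to positions where both values are 1 and half weight to positions where both are 0. A sketch for Boolean vectors of length $n$, with $t$ and $f$ as helper symbols introduced here:

$$
\begin{aligned}
t &= \sum [\vec{x} = 1][\vec{y} = 1] \\
f &= \sum [\vec{x} = 0][\vec{y} = 0] \\
\mathrm{simil} &= \frac{t + 0.5 f}{n}
\end{aligned}
$$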
Simple matching (“matching”)
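The simple matching coefficient is conventionally the share of positions at which the two vectors agree. A sketch in the bracket notation above, for vectors of length $n$:

$$
\mathrm{simil} = \frac{\sum [\vec{x} = \vec{y}]}{n}
$$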
Distance Measures
Distance measures are available in proxyC::dist().
Smoothing of the vectors can be performed when method is
“chisquared”, “kullback”, “jeffreys” or “jensen”: the value of
smooth will be added to each element of $\vec{x}$ and $\vec{y}$.
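As a sketch, writing $s$ for the value of smooth (a symbol introduced here), the smoothed vectors would be:

$$
\vec{x}' = \vec{x} + s, \qquad \vec{y}' = \vec{y} + s
$$

With $s > 0$, zero elements no longer lead to a division by zero or a logarithm of zero in these four measures.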
Manhattan distance (“manhattan”)
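In the notation above, the standard definition is the sum of the absolute element-wise differences:

$$
\mathrm{dist} = \sum |\vec{x} - \vec{y}|
$$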
Canberra distance (“canberra”)
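The Canberra distance is conventionally a Manhattan distance in which each term is scaled by the magnitudes of the two elements. The sketch below shows that standard form; how terms with a zero denominator are handled is not stated here.

$$
\mathrm{dist} = \sum \frac{|\vec{x} - \vec{y}|}{|\vec{x}| + |\vec{y}|}
$$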
Euclidean distance (“euclidean”)
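In the notation above, the standard definition is the square root of the sum of squared element-wise differences:

$$
\mathrm{dist} = \sqrt{\sum (\vec{x} - \vec{y})^2}
$$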
Minkowski distance (“minkowski”)
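The Minkowski distance generalizes the Manhattan and Euclidean distances through an order $p$. In the sketch below, $p$ is assumed to correspond to the p argument of proxyC::dist():

$$
\mathrm{dist} = \left( \sum |\vec{x} - \vec{y}|^p \right)^{1/p}
$$

Setting $p = 1$ gives the Manhattan distance and $p = 2$ the Euclidean distance.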
Hamming distance (“hamming”)
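The Hamming distance is conventionally the number of positions at which the two vectors differ. In the bracket notation above this can be sketched as:

$$
\mathrm{dist} = \sum [\vec{x} \ne \vec{y}]
$$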
The largest difference between values (“maximum”)
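This measure is commonly known as the Chebyshev distance. A sketch of the standard definition, taking the maximum over the absolute element-wise differences:

$$
\mathrm{dist} = \max |\vec{x} - \vec{y}|
$$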
Chi-squared divergence (“chisquared”)
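One plausible reading, stated here as an assumption rather than as proxyC's confirmed formula, is the chi-squared statistic of the $2 \times n$ table $O$ formed by stacking the two vectors as rows. The symbols $O$ and $E$ are introduced here, and any normalization of the table that proxyC applies beforehand is not shown.

$$
\begin{aligned}
O_{1j} &= x_j, \qquad O_{2j} = y_j \\
E_{ij} &= \frac{\left( \sum_{k} O_{ik} \right) \left( \sum_{k} O_{kj} \right)}{\sum_{k,l} O_{kl}} \\
\mathrm{dist} &= \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\end{aligned}
$$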
Kullback–Leibler divergence (“kullback”)
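The Kullback–Leibler divergence is defined on probability distributions, so the sketch below first normalizes each vector to sum to 1. The symbols $\vec{p}$ and $\vec{q}$ are introduced here; the order of the two arguments and the base of the logarithm used by proxyC are assumptions not confirmed by this text.

$$
\begin{aligned}
\vec{p} &= \frac{\vec{x}}{\sum \vec{x}}, \qquad \vec{q} = \frac{\vec{y}}{\sum \vec{y}} \\
\mathrm{dist} &= \sum \vec{p} \log \frac{\vec{p}}{\vec{q}}
\end{aligned}
$$

The measure is not symmetric: swapping $\vec{p}$ and $\vec{q}$ generally gives a different value.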
Jeffreys divergence (“jeffreys”)
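The Jeffreys divergence is conventionally the sum of the two Kullback–Leibler divergences taken in both directions, which makes it symmetric. The sketch below assumes vectors normalized to sum to 1 ($\vec{p}$ and $\vec{q}$ are symbols introduced here); some authors add a factor of $1/2$, and the logarithm base used by proxyC is not stated in this text.

$$
\begin{aligned}
\vec{p} &= \frac{\vec{x}}{\sum \vec{x}}, \qquad \vec{q} = \frac{\vec{y}}{\sum \vec{y}} \\
\mathrm{dist} &= \sum \vec{p} \log \frac{\vec{p}}{\vec{q}} + \sum \vec{q} \log \frac{\vec{q}}{\vec{p}}
= \sum (\vec{p} - \vec{q}) \log \frac{\vec{p}}{\vec{q}}
\end{aligned}
$$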
Jensen–Shannon divergence (“jensen”)
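The Jensen–Shannon divergence (see Nielsen, 2019) averages the Kullback–Leibler divergences of the two distributions from their midpoint. The sketch below assumes vectors normalized to sum to 1; $\vec{p}$, $\vec{q}$ and $\vec{m}$ are symbols introduced here, and the logarithm base used by proxyC is not stated in this text.

$$
\begin{aligned}
\vec{p} &= \frac{\vec{x}}{\sum \vec{x}}, \qquad \vec{q} = \frac{\vec{y}}{\sum \vec{y}}, \qquad \vec{m} = \frac{1}{2} (\vec{p} + \vec{q}) \\
\mathrm{dist} &= \frac{1}{2} \sum \vec{p} \log \frac{\vec{p}}{\vec{m}} + \frac{1}{2} \sum \vec{q} \log \frac{\vec{q}}{\vec{m}}
\end{aligned}
$$

Unlike the Kullback–Leibler divergence, this measure is symmetric and stays finite even when one vector has zeros where the other does not.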
References
- Choi, S., Cha, S., & Tappert, C. C. (2010). A survey of binary
similarity and distance measures. Journal of Systemics, Cybernetics
and Informatics, 8(1), 43–48.
- Nielsen, F. (2019). On the Jensen–Shannon Symmetrization of
Distances Relying on Abstract Means. Entropy, 21(5), 485. https://doi.org/10.3390/e21050485
- Jain, G., Mahara, T., & Tripathi, K. N. (2020). A Survey of
Similarity Measures for Collaborative Filtering-Based Recommender
System. In M. Pant, T. K. Sharma, O. P. Verma, R. Singla, & A.
Sikander (Eds.), Soft Computing: Theories and Applications
(pp. 343–352). Springer. https://doi.org/10.1007/978-981-15-0751-9_32
- Miyamoto, S. (1990). Hierarchical Cluster Analysis and Fuzzy Sets.
In S. Miyamoto (Ed.), Fuzzy Sets in Information Retrieval and Cluster
Analysis (pp. 125–188). Springer Netherlands. https://doi.org/10.1007/978-94-015-7887-5_6