Fast similarity/distance computation function for large sparse matrices. You
can floor small similarity values to save computation time and storage
space by an arbitrary threshold (min_simil) or rank (rank). You
can specify the number of threads for parallel computing via
options(proxyC.threads).
Usage
simil(
x,
y = NULL,
margin = 1,
method = c("cosine", "correlation", "dice", "edice", "jaccard", "ejaccard", "fjaccard",
"hamann", "faith", "simple matching"),
mask = NULL,
min_simil = NULL,
rank = NULL,
drop0 = FALSE,
diag = FALSE,
use_nan = NULL,
sparse = TRUE,
digits = 14
)
dist(
x,
y = NULL,
margin = 1,
method = c("euclidean", "chisquared", "kullback", "jeffreys", "jensen", "manhattan",
"maximum", "canberra", "minkowski", "hamming"),
mask = NULL,
p = 2,
smooth = 0,
drop0 = FALSE,
diag = FALSE,
use_nan = NULL,
sparse = TRUE,
digits = 14
)
Arguments
- x
a base::matrix or Matrix::Matrix object. Dense matrices are converted to Matrix::CsparseMatrix internally.
- y
if a base::matrix or Matrix::Matrix object is provided, proximity between documents or features in x and y is computed.
- margin
integer indicating the margin of similarity/distance computation: 1 indicates rows and 2 indicates columns.
- method
method to compute similarity or distance
- mask
a pattern matrix created using mask() for masked similarity/distance computation. The shape of the matrix must be the same as the resulting matrix.
- min_simil
the minimum similarity value to be recorded.
- rank
an integer value specifying the top-n highest similarity values to be recorded.
- drop0
if TRUE, removes zero values to make the similarity/distance matrix sparse. It has no effect when dense = TRUE.
- diag
if TRUE, only compute diagonal elements of the similarity/distance matrix; useful when comparing corresponding rows or columns of x and y.
- use_nan
if TRUE, returns NaN if the standard deviation of a vector is zero when method is "correlation", or if all the values in a vector are zero when method is "cosine", "chisquared", "kullback", "jeffreys" or "jensen". Note that use of NaN makes the similarity/distance matrix denser and therefore larger in RAM. If FALSE, returns zero in the same situations as above. If NULL (default), also returns zero but generates a warning.
- sparse
if TRUE, returns a Matrix::sparseMatrix object. When neither min_simil nor rank is used, dense matrices require less space in RAM.
- digits
determines rounding of small values towards zero. Use primarily to correct floating point errors. Rounding is performed in C++ in a similar way as base::zapsmall.
- p
weight for Minkowski distance.
- smooth
adds a fixed value to all the cells to avoid division by zero. Only used when method is "chisquared", "kullback", "jeffreys" or "jensen".
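The effect of min_simil and rank described above can be sketched as follows. This is only an illustration on random data: the matrix size, the density, the threshold of 0.2 and the rank of 3 are arbitrary choices, and use_nan = FALSE is passed only to keep the calls free of warnings.

```r
library(proxyC)
library(Matrix)

set.seed(1)
# Illustrative data: 50 x 20 sparse matrix with about 30% non-zero cells
mt <- rsparsematrix(50, 20, 0.3)

# Column-wise cosine similarity (margin = 2), recording only values >= 0.2
s_min <- simil(mt, margin = 2, method = "cosine",
               min_simil = 0.2, use_nan = FALSE)

# Column-wise cosine similarity, recording only the top-3 values
s_rank <- simil(mt, margin = 2, method = "cosine",
                rank = 3, use_nan = FALSE)

# Both results compare the 20 columns with each other
dim(s_min)
dim(s_rank)

# Non-zero entries that survive the threshold
m_min <- as.matrix(s_min)
kept <- m_min[m_min != 0]
range(kept)
```

Both results are sparse matrices, so the values that were not recorded simply take no storage.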
Details
Available Methods
Similarity:
- cosine: cosine similarity
- correlation: Pearson's correlation
- jaccard: Jaccard coefficient
- ejaccard: the real value version of jaccard
- fjaccard: Fuzzy Jaccard coefficient
- dice: Dice coefficient
- edice: the real value version of dice
- hamann: Hamann similarity
- faith: Faith similarity
- simple matching: the percentage of common elements
Distance:
- euclidean: Euclidean distance
- chisquared: chi-squared distance
- kullback: Kullback–Leibler divergence
- jeffreys: Jeffreys divergence
- jensen: Jensen–Shannon divergence
- manhattan: Manhattan distance
- maximum: the largest difference between values
- canberra: Canberra distance
- minkowski: Minkowski distance
- hamming: Hamming distance
See the vignette for how the similarity and distance are computed:
vignette("measures", package = "proxyC")
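As a quick sanity check of two of the measures listed above, the sketch below compares the package output with the textbook formulas for cosine similarity and Euclidean distance, computed by hand for the first two rows. The data are random and fully dense (density 1) so that no row is all zero; the vignette remains the reference for how each measure is defined in the package.

```r
library(proxyC)
library(Matrix)

set.seed(2)
# Illustrative data: 10 x 5 matrix with every cell non-zero
mt <- rsparsematrix(10, 5, 1)

# Row-wise similarity and distance from the package.
# proxyC::dist is called explicitly because stats::dist has the same name.
s <- as.matrix(simil(mt, margin = 1, method = "cosine", use_nan = FALSE))
d <- as.matrix(proxyC::dist(mt, margin = 1, method = "euclidean"))

# The same quantities from the textbook formulas for rows 1 and 2
a <- as.numeric(mt[1, ])
b <- as.numeric(mt[2, ])
cos_manual <- sum(a * b) / sqrt(sum(a^2) * sum(b^2))
euc_manual <- sqrt(sum((a - b)^2))

all.equal(unname(s[1, 2]), cos_manual)
all.equal(unname(d[1, 2]), euc_manual)
```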
Parallel Computing
The functions perform parallel computing using Intel oneAPI Threading Building Blocks.
The number of threads for parallel computing should be specified via
options(proxyC.threads) before calling the functions. If the value is -1,
all the available threads will be used. Unless the option is used, the
number of threads will be limited by the environment variables
(OMP_THREAD_LIMIT or RCPP_PARALLEL_NUM_THREADS) to comply with CRAN
policy and offer backward compatibility.
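A minimal sketch of setting the option before a call and restoring the previous state afterwards. The value 2 and the matrix size are arbitrary choices for illustration.

```r
library(proxyC)
library(Matrix)

set.seed(3)
mt <- rsparsematrix(100, 100, 0.5)

# options() returns the previous values, so they can be restored later
old <- options(proxyC.threads = 2)

s <- simil(mt, method = "cosine", use_nan = FALSE)

# Restore whatever was set before (including "not set")
options(old)
dim(s)
```

Restoring the option keeps the change local to this computation, which matters when the code runs inside a larger script or package.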
Examples
mt <- Matrix::rsparsematrix(100, 100, 0.01)
simil(mt, method = "cosine")[1:5, 1:5]
#> Warning: x or y has vectors with all zero; consider setting use_nan = TRUE to set these values to NaN or use_nan = FALSE to suppress this warning
#> 5 x 5 sparse Matrix of class "dsTMatrix"
#>
#> [1,] 1 0 0 0 0
#> [2,] 0 1 0 0 0
#> [3,] 0 0 1 0 0
#> [4,] 0 0 0 1 0
#> [5,] 0 0 0 0 1
mt <- Matrix::rsparsematrix(100, 100, 0.01)
dist(mt, method = "euclidean")[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dsTMatrix"
#>
#> [1,] 0.00 0.340000 0.00 0.850000 0.00
#> [2,] 0.34 0.000000 0.34 0.915478 0.34
#> [3,] 0.00 0.340000 0.00 0.850000 0.00
#> [4,] 0.85 0.915478 0.85 0.000000 0.85
#> [5,] 0.00 0.340000 0.00 0.850000 0.00
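A further sketch, using the diag argument to compare corresponding rows of two matrices. The matrices x and y are random illustrative data with the same shape, and the manual calculation uses the textbook cosine formula.

```r
library(proxyC)
library(Matrix)

set.seed(4)
# Two illustrative 5 x 10 matrices with every cell non-zero
x <- rsparsematrix(5, 10, 1)
y <- rsparsematrix(5, 10, 1)

# Only row i of x against row i of y is computed
s <- simil(x, y, margin = 1, method = "cosine", diag = TRUE, use_nan = FALSE)
m <- as.matrix(s)

# Pairwise similarities sit on the diagonal
pairwise <- diag(m)

# Check the first pair against the textbook cosine formula
a <- as.numeric(x[1, ])
b <- as.numeric(y[1, ])
cos_manual <- sum(a * b) / sqrt(sum(a^2) * sum(b^2))
all.equal(unname(pairwise[1]), cos_manual)
```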