Train a Latent Semantic Analysis model (Deerwester et al., 1990) on a quanteda::tokens object.
textmodel_lsa(
  x,
  dim = 50,
  min_count = 5L,
  engine = c("RSpectra", "irlba", "rsvd"),
  weight = "count",
  tolower = TRUE,
  verbose = FALSE,
  ...
)
x: a quanteda::tokens or quanteda::tokens_xptr object.
dim: the size of the word vectors.
min_count: the minimum frequency of the words. Words less frequent than this in x are removed before training.
engine: the engine used to perform the SVD that generates the word vectors.
weight: weighting scheme passed to quanteda::dfm_weight().
tolower: if TRUE, lower-case all the tokens before fitting the model.
verbose: if TRUE, print the progress of training.
...: additional arguments.
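As a rough illustration of the general LSA idea behind these arguments, the sketch below filters rare words from a toy count matrix and reduces it with an SVD. It is a minimal base R sketch, not the package's implementation: the toy matrix, the use of base `svd()` in place of a truncated engine such as RSpectra, and the choice to scale the left singular vectors by the singular values are all assumptions made for illustration.

```r
# Toy word-by-document count matrix (made-up words and counts).
counts <- matrix(
    c(2, 0, 1, 0,
      1, 0, 2, 0,
      0, 3, 0, 1,
      0, 1, 0, 2,
      1, 1, 1, 1,
      1, 0, 0, 0),
    nrow = 6, byrow = TRUE,
    dimnames = list(
        c("berlin", "germany", "paris", "france", "city", "rare"),
        paste0("doc", 1:4)
    )
)

# Drop words whose total frequency is below a threshold (analogous to min_count).
min_count <- 3
counts <- counts[rowSums(counts) >= min_count, , drop = FALSE]

# Full SVD with base R, then keep only the first k singular triplets.
# A truncated solver would compute just these k triplets directly.
k <- 2
dec <- svd(counts)

# One common convention: word vectors are left singular vectors scaled by
# the singular values (an assumption; other conventions exist).
word_vectors <- dec$u[, 1:k, drop = FALSE] %*% diag(dec$d[1:k], k)
rownames(word_vectors) <- rownames(counts)

dim(word_vectors)
```

Here `k` plays the role of `dim`: each remaining word is represented by a vector of length `k`, and the word "rare" is removed before the decomposition because it occurs fewer than `min_count` times.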
Returns a textmodel_wordvector object with the following elements:
a matrix of word vector values.
a matrix of word vector weights.
the frequency of words in x.
the SVD engine used.
the weighting scheme.
the value of min_count.
the concatenator in x.
the command used to execute the function.
the version of the wordvector package.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.
# \donttest{
library(quanteda)
#> Package version: 4.3.1
#> Unicode version: 15.1
#> ICU version: 74.2
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.
library(wordvector)
# pre-processing
corp <- corpus_reshape(data_corpus_news2014)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE) %>%
    tokens_tolower()
# train LSA
lsa <- textmodel_lsa(toks, dim = 50, min_count = 5, verbose = TRUE)
#> Performing SVD into 50 dimensions
#> ...using RSpectra
#> ...complete
# find similar words
head(similarity(lsa, c("berlin", "germany", "france"), mode = "words"))
#> berlin germany france
#> [1,] "berlin" "germany" "france"
#> [2,] "german" "france" "montpellier"
#> [3,] "warsaw" "closer" "paris"
#> [4,] "timmermans" "berlin" "froome"
#> [5,] "germany" "moscovici" "germany"
#> [6,] "lisbon" "tougher" "french"
head(similarity(lsa, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
#> [,1]
#> somali -0.05066583
#> reporters -0.01372003
#> released 0.01725500
#> bail 0.08066897
#> still 0.09478653
#> jailed 0.02431590
head(similarity(lsa, analogy(~ berlin - germany + france)))
#> [,1]
#> [1,] "paris"
#> [2,] "berlin"
#> [3,] "koscielny"
#> [4,] "mans"
#> [5,] "nibali"
#> [6,] "hertha"
# }
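The similarity and analogy queries in the examples can be understood as vector arithmetic followed by a nearest-neighbour search. The sketch below shows that idea in base R on hand-picked toy vectors. It assumes cosine similarity as the measure, which may differ from what similarity() uses internally; the `vec` matrix and the `cosine()` helper are hypothetical and not part of the package.

```r
# Hypothetical 3-dimensional word vectors, hand-picked for illustration.
vec <- rbind(
    berlin  = c(1.0, 0.9, 0.1),
    germany = c(1.0, 0.1, 0.1),
    paris   = c(0.1, 0.9, 1.0),
    france  = c(0.1, 0.1, 1.0)
)

# Cosine similarity between a query vector and every row of a matrix.
cosine <- function(mat, query) {
    drop(mat %*% query) / (sqrt(rowSums(mat^2)) * sqrt(sum(query^2)))
}

# Words ranked by similarity to "berlin"; the query word itself ranks first.
ranked <- names(sort(cosine(vec, vec["berlin", ]), decreasing = TRUE))

# Analogy as vector arithmetic: berlin - germany + france.
query <- vec["berlin", ] - vec["germany", ] + vec["france", ]
sim <- cosine(vec, query)
nearest <- names(sort(sim, decreasing = TRUE))[1]
nearest
```

With these toy vectors the combined query vector coincides with the vector for "paris", so that word is the nearest neighbour. Query words are not excluded from the ranking here, which is consistent with "berlin" appearing among the results in the example output above.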