Train a word2vec model (Mikolov et al., 2013) using a quanteda::tokens object.
a quanteda::tokens or quanteda::tokens_xptr object.
the size of the word vectors.
the architecture of the model; either "cbow" (continuous bag-of-words), "sg" (skip-gram), or "dm" (distributed memory).
the minimum frequency of the words. Words less frequent than
this in x are removed before training.
the size of the word window. Words within this window are considered to be the context of a target word.
the number of iterations in model training.
the initial learning rate.
a trained word2vec model; if provided, its word vectors are updated using x.
if TRUE, negative sampling is used. Otherwise, hierarchical softmax
is used.
the size of negative samples. Only used when use_ns = TRUE.
the rate of sampling of words based on their frequency. Sampling is
disabled when sample = 1.0.
if TRUE, lower-case all the tokens before fitting the model.
if TRUE, the resulting object includes the data supplied as x.
if TRUE, print the progress of training.
additional arguments.
Returns a textmodel_word2vec object with the following elements:
a matrix of word vector values.
a matrix for word vector weights.
the size of the word vectors.
the architecture of the model.
the frequency of words in x.
the size of the word window.
the number of iterations in model training.
the initial learning rate.
the use of negative sampling.
the size of negative samples.
the value of min_count.
the concatenator in x.
the original data supplied as x if include_data = TRUE.
the command used to execute the function.
the version of the wordvector package.
If type = "dm", it trains a doc2vec model but saves only
word vectors to save storage space. textmodel_doc2vec should be
used to access document vectors.
Users can change the number of processors used for parallel computing via
options(wordvector_threads).
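For example, to restrict training to two threads, set the option named above
before calling the function (a minimal sketch, assuming the option takes an
integer thread count):

options(wordvector_threads = 2)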
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. https://arxiv.org/abs/1310.4546.
# \donttest{
library(quanteda)
library(wordvector)
# pre-processing
corp <- data_corpus_news2014
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
padding = TRUE) %>%
tokens_tolower()
# train word2vec
wov <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, sample = 0.001)
# find similar words
head(similarity(wov, c("berlin", "germany", "france"), mode = "words"))
#> berlin germany france
#> [1,] "berlin" "germany" "france"
#> [2,] "german" "frankfurt" "germany"
#> [3,] "germany" "hamburg" "toulouse"
#> [4,] "frankfurt" "stuttgart" "norway"
#> [5,] "amsterdam" "berlin" "kazakhstan"
#> [6,] "warsaw" "france" "belgium"
head(similarity(wov, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
#> [,1]
#> somali 0.06735107
#> reporters 0.17921132
#> released 0.12120764
#> bail 0.21391338
#> still -0.12096084
#> jailed 0.04449241
head(similarity(wov, analogy(~ berlin - germany + france), mode = "words"))
#> [,1]
#> [1,] "france"
#> [2,] "berlin"
#> [3,] "paris"
#> [4,] "amsterdam"
#> [5,] "french"
#> [6,] "strasbourg"
# }