Train a doc2vec model (Le & Mikolov, 2014) using a quanteda::tokens object.
Usage:

textmodel_doc2vec(
  x,
  dim = 50,
  type = c("dm", "dbow"),
  min_count = 5,
  window = 5,
  iter = 10,
  alpha = 0.05,
  model = NULL,
  use_ns = TRUE,
  ns_size = 5,
  sample = 0.001,
  tolower = TRUE,
  include_data = FALSE,
  verbose = FALSE,
  ...
)

Arguments:

x: a quanteda::tokens or quanteda::tokens_xptr object.
dim: the size of the word vectors.

type: the architecture of the model; either "dm" (distributed memory) or "dbow" (distributed bag-of-words).

min_count: the minimum frequency of the words. Words less frequent than this in x are removed before training.

window: the size of the window for context words. Ignored when type = "dbow" because its context window is the entire document (sentence or paragraph).

iter: the number of iterations in model training.

alpha: the initial learning rate.

model: a trained word2vec model; if provided, its word vectors are updated for x.

use_ns: if TRUE, negative sampling is used; otherwise, hierarchical softmax is used.

ns_size: the size of negative samples. Only used when use_ns = TRUE.

sample: the rate of sampling of words based on their frequency. Sampling is disabled when sample = 1.0.

tolower: if TRUE, lower-case all the tokens before fitting the model.

include_data: if TRUE, the resulting object includes the data supplied as x.

verbose: if TRUE, print the progress of training.

...: additional arguments.
Value:

Returns a textmodel_doc2vec object whose values element contains the word and document vectors as matrices. The other elements are the same as in textmodel_word2vec.
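A minimal usage sketch. It assumes the package exporting textmodel_doc2vec is attached, uses quanteda's built-in data_corpus_inaugural corpus, and inspects the returned values element; the exact internal layout of that element may differ from what str() suggests here.

```r
library(quanteda)
# Attach the package that exports textmodel_doc2vec (assumed installed)

# Tokenize a built-in quanteda corpus, dropping punctuation
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# Train a distributed-memory (dm) doc2vec model
mod <- textmodel_doc2vec(toks, dim = 50, type = "dm", iter = 10)

# Word and document vectors are stored as matrices in the values element
str(mod$values)
```

With type = "dbow", the window argument is ignored because each document as a whole serves as the context.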
References:

Le, Q. V., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents (No. arXiv:1405.4053). arXiv. https://doi.org/10.48550/arXiv.1405.4053