Latent Semantic Scaling (LSS) is a word embedding-based semisupervised algorithm for document scaling.
Usage
textmodel_lss(x, ...)

# S3 method for dfm
textmodel_lss(
  x,
  seeds,
  terms = NULL,
  k = 300,
  slice = NULL,
  weight = "count",
  cache = FALSE,
  simil_method = "cosine",
  engine = c("RSpectra", "irlba", "rsvd"),
  auto_weight = FALSE,
  include_data = FALSE,
  group_data = FALSE,
  verbose = FALSE,
  ...
)

# S3 method for fcm
textmodel_lss(
  x,
  seeds,
  terms = NULL,
  w = 50,
  max_count = 10,
  weight = "count",
  cache = FALSE,
  simil_method = "cosine",
  engine = c("rsparse"),
  auto_weight = FALSE,
  verbose = FALSE,
  ...
)
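For orientation, a minimal sketch of the dfm method; the toy corpus, seed words, and the small k are illustrative (keep the default k = 300 on a realistically large corpus):

library("quanteda")
library("LSX")

# toy corpus; in practice supply a large collection of documents
corp <- corpus(c("good nice day", "bad awful news",
                 "nice news today", "awful day indeed"))
dfmt <- dfm(tokens(corp))

# bipolar seed words: positive end scored 1, negative end scored -1
seed <- c(good = 1, nice = 1, bad = -1, awful = -1)

# small k only because the toy dfm is tiny
lss <- textmodel_lss(dfmt, seeds = seed, k = 2)

# score documents by their polarity
predict(lss, newdata = dfmt)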
Arguments
- x
a dfm or fcm created by quanteda::dfm() or quanteda::fcm()
- ...
additional arguments passed to the underlying engine.
- seeds
a character vector or named numeric vector that contains seed words. If seed words contain "*", they are interpreted as glob patterns. See quanteda::valuetype.
- terms
a character vector or named numeric vector that specifies words for which polarity scores will be computed; if a numeric vector, words' polarity scores will be weighted accordingly; if NULL, all the features of quanteda::dfm() or quanteda::fcm() will be used.
- k
the number of singular values requested from the SVD engine. Only used when x is a dfm.
- slice
a number or indices of the components of word vectors used to compute similarity; set slice < k to further truncate the word vectors; useful for diagnosis and simulation.
- weight
weighting scheme passed to quanteda::dfm_weight(). Ignored when engine is "rsparse".
- cache
if TRUE, saves the result of the SVD for the next execution with identical x and settings. Use base::options(lss_cache_dir) to change the location where cache files are saved.
- simil_method
specifies the method used to compute similarity between features. The value is passed to quanteda.textstats::textstat_simil(); "cosine" is used otherwise.
- engine
selects the engine used to factorize x and generate word vectors. Choose from RSpectra::svds(), irlba::irlba(), rsvd::rsvd(), and rsparse::GloVe().
- auto_weight
automatically determine weights to approximate the polarity of terms to seed words. See Details.
- include_data
if TRUE, the fitted model includes the dfm supplied as x.
- group_data
if TRUE, apply dfm_group(x) before saving the dfm.
- verbose
show messages if TRUE.
- w
the size of word vectors. Used only when x is a fcm.
- max_count
passed to x_max in rsparse::GloVe$new(), where co-occurrence counts are capped at this threshold. It should be adjusted according to the size of the corpus. Used only when x is a fcm.
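For the fcm method, a sketch of a GloVe-based fit illustrating w and max_count, assuming the rsparse package is installed; the window size and argument values here are illustrative:

library("quanteda")
library("LSX")

toks <- tokens(corpus(c("good nice day", "bad awful news",
                        "nice news today", "awful day indeed")))
# feature co-occurrence matrix counted within a 5-word window
fcmt <- fcm(toks, context = "window", window = 5, tri = FALSE)

seed <- c(good = 1, nice = 1, bad = -1, awful = -1)

# w sets the size of the word vectors; max_count caps co-occurrence counts
lss_glove <- textmodel_lss(fcmt, seeds = seed, w = 10, max_count = 5)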
Details
Latent Semantic Scaling (LSS) is a semisupervised document scaling method. textmodel_lss() constructs word vectors from user-provided documents (x) and weights words (terms) based on their semantic proximity to seed words (seeds). Seed words are any known polarity words (e.g. sentiment words) that users should choose manually. The required number of seed words is usually 5 to 10 for each end of the scale.
If seeds is a named numeric vector with positive and negative values, a bipolar LSS model is constructed; if seeds is a character vector, a unipolar LSS model. Bipolar models usually perform better in document scaling because both ends of the scale are defined by the user.
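To make the two forms concrete, a sketch of bipolar and unipolar seed specifications, reusing the toy dfmt and small k from the sketch above (the words, including the glob patterns, are illustrative):

# bipolar: named numeric vector with both positive and negative values
seeds_bipolar <- c(good = 1, "nice*" = 1, bad = -1, "awful*" = -1)

# unipolar: a plain character vector defines only one end of the scale
seeds_unipolar <- c("good", "nice")

lss_bi <- textmodel_lss(dfmt, seeds = seeds_bipolar, k = 2)
lss_uni <- textmodel_lss(dfmt, seeds = seeds_unipolar, k = 2)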
A seed word's polarity score computed by textmodel_lss() tends to diverge from the original score given by the user, because it is affected not only by that seed word's own original score but also by the original scores of all the other seed words. If auto_weight = TRUE, the original scores are weighted automatically using stats::optim() to minimize the squared difference between the seed words' computed and original scores. The weighted scores are saved in seed_weighted in the object.
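A sketch of fitting with automatic weighting and inspecting the result, again reusing the toy objects above; the element name seed_weighted follows the description in this section, and coef() is assumed to be the usual coefficient accessor for fitted textmodel_lss objects:

# weight the original seed scores via stats::optim()
lss_aw <- textmodel_lss(dfmt, seeds = seed, k = 2, auto_weight = TRUE)

# weighted seed scores stored in the fitted object
lss_aw$seed_weighted

# computed polarity scores of the seed words, for comparison
coef(lss_aw)[names(seed)]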
Please visit the package website for examples.
References
Watanabe, Kohei. 2020. "Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages." Communication Methods and Measures. doi:10.1080/19312458.2020.1832976.
Watanabe, Kohei. 2017. "Measuring News Bias: Russia's Official News Agency ITAR-TASS' Coverage of the Ukraine Crisis." European Journal of Communication. doi:10.1177/0267323117695735.