Selection of seed words
Evaluating seed words using utility functions
We must define the measurement with seed words to apply LSS. If seed words are not readily available, we must create a list of seed words using thesauruses and glossaries, but some of these words might be used in many different contexts, making them unsuitable as seed words. Good seed words are words that appear only in the contexts of the target concept.
We can evaluate the suitability of seed words by checking their synonyms identified in the corpus: fitting an LSS model with one seed word at a time and checking the words with the highest polarity scores. This repetitive process can be automated using bootstrap_lss().
This function extracts the seed words from a fitted LSS model, fits an LSS model with each seed word internally, and returns their synonyms or similarity scores (a manual sketch of this process is shown below). We use the LSS model fitted in the introduction in this example.
library(LSX)
lss <- readRDS("lss.rds")
print(lss)
#>
#> Call:
#> textmodel_lss(x = dfmt, seeds = seed, k = 300, cache = TRUE,
#> include_data = TRUE, group_data = TRUE)
The model is fitted with the generic sentiment seed words. Their original polarity scores are weighted by the inverse of the number of seed words on each side (1/7 ≈ 0.143) so that the numbers of positive and negative seed words do not need to be equal.
print(lss$seeds)
#> good nice excellent positive fortunate correct
#> 1 1 1 1 1 1
#> superior bad nasty poor negative unfortunate
#> 1 -1 -1 -1 -1 -1
#> wrong inferior
#> -1 -1
print(lss$seeds_weighted)
#> good nice excellent positive fortunate correct
#> 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
#> superior bad nasty poor negative unfortunate
#> 0.1428571 -0.1428571 -0.1428571 -0.1428571 -0.1428571 -0.1428571
#> wrong inferior
#> -0.1428571 -0.1428571
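To see what bootstrap_lss() automates, we could fit an LSS model with a single seed word and inspect its most similar terms. The snippet below is a minimal sketch of that manual procedure, assuming dfmt is the document-feature matrix used in the introduction (as in the call shown above).
# fit an LSS model with only one seed word ("good") and the same number of
# singular values as the full model; `dfmt` is assumed to be available
lss_good <- textmodel_lss(dfmt, seeds = c("good" = 1), k = 300)
head(coef(lss_good), 10) # terms most similar to "good"
Repeating this for every seed word by hand is tedious, which is exactly what bootstrap_lss() does internally.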
Evaluation with synonyms
By default, bootstrap_lss() returns lists of synonyms for the seed words. Each column is a list of words sorted by their similarity to the seed word shown at the top. There are many proper names in the example, but we can see that many words are positive for positive seed words and negative for negative seed words. If a list is a mixture of positive and negative words, the seed word is probably too ambiguous.
bs_term <- bootstrap_lss(lss, mode = "terms")
knitr::kable(head(bs_term, 10))
good | nice | excellent | positive | fortunate | correct | superior | bad | nasty | poor | negative | unfortunate | wrong | inferior |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
good | nice | excellent | positive | fortunate | correct | superior | bad | nasty | poor | negative | unfortunate | wrong | inferior |
gander | downtrend | tact | constructive | packer | bicocca | 2s19 | good | t.v | oppressed | impact | garnett’s | they’re | harjit |
fingalise | approachable | brennan’s | bilateral | accomplishment | seldom | msta-s | really | salaried | working | likens | squabbled | optimists | sajjan |
toast | romani | staffer’s | us-russia | respectfully | i’ve | lisitsyn | fingalise | pokes | class | credit-based | rostik | think | viterra |
brennan’s | sportsman | china-u.s | moscow-kiev | schultze | uh | tagil | goodness | life-changing | nino | evaporate | pergamon | arranges | batons |
fidgety | glenmede | fidgety | hope | vienna-hosted | really | msta’s | gander | sofa | mine-defusing | economy | lapdogs | serezha | prettied |
nexus | smelly | soured | sides | rama | dostoyevsky | hierarchical | lot | phosgene | movement | adverse | untrained | napolitano | pop-culturally |
overdependent | attract | inter-parliamntary | impasse | cop13 | self-regard | tse | there’s | chloroform | racist | dhaka’s | asian-african | stink | subhuman |
duda’s | io-30 | relations | mutual | washington-tehran | grading | cradling | they’re | clogged | piecemeal | alaskans | minders | ex-spook | cocoa |
relations | swirled | nexus | abramovich’s | kovacevski | marble | nizhny | i’m | nit | capitalism | psychotic | polarise | motorsport | cottonseed |
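If a particular seed word looks questionable, we can print more of its synonyms than the table above shows. For example, the column for “fortunate” could be inspected like this (the choice of seed word and the number of entries are only illustrative):
# inspect more synonyms of a single seed word
head(bs_term[, "fortunate"], 30)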
Evaluation with similarity scores
If mode = "coef", the function returns the similarity scores of words for each seed word (the words were sorted by these scores in the lists above). We can use this matrix to evaluate the seed words more systematically if the lists of synonyms are not useful or sufficient.
bs_coef <- bootstrap_lss(lss, mode = "coef")
knitr::kable(head(bs_coef, 10), digits = 3)
good | nice | excellent | positive | fortunate | correct | superior | bad | nasty | poor | negative | unfortunate | wrong | inferior | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
monday | -0.037 | -0.006 | -0.029 | 0.020 | -0.020 | 0.004 | 0.125 | 0.017 | 0.003 | 0.013 | -0.061 | -0.032 | -0.017 | -0.019 |
us | -0.002 | -0.019 | -0.010 | 0.005 | 0.030 | 0.006 | -0.024 | 0.015 | -0.020 | -0.004 | 0.009 | 0.021 | -0.003 | -0.007 |
president | -0.015 | -0.032 | -0.005 | -0.009 | 0.016 | 0.023 | 0.005 | -0.007 | -0.003 | -0.020 | 0.004 | -0.016 | 0.015 | -0.005 |
joe | 0.018 | 0.158 | -0.021 | -0.083 | -0.060 | -0.036 | -0.022 | 0.024 | -0.036 | 0.062 | -0.041 | -0.054 | -0.023 | -0.001 |
biden | 0.018 | -0.007 | -0.027 | 0.068 | 0.048 | -0.051 | 0.003 | -0.056 | -0.023 | 0.026 | 0.024 | 0.012 | 0.023 | -0.004 |
reiterated | 0.015 | 0.062 | 0.098 | 0.216 | -0.034 | 0.148 | 0.000 | -0.024 | -0.021 | -0.119 | 0.013 | -0.104 | -0.020 | -0.041 |
united | -0.002 | 0.018 | -0.100 | 0.037 | 0.092 | -0.009 | -0.015 | -0.023 | 0.018 | -0.028 | 0.002 | 0.043 | 0.019 | -0.036 |
states | -0.010 | -0.008 | 0.093 | -0.068 | 0.049 | -0.028 | -0.023 | 0.005 | -0.033 | 0.032 | 0.025 | -0.056 | -0.026 | 0.020 |
commitment | 0.019 | 0.064 | 0.096 | 0.076 | 0.056 | 0.025 | 0.011 | -0.062 | 0.002 | 0.035 | -0.030 | -0.064 | -0.012 | 0.002 |
diplomacy | 0.183 | 0.133 | 0.090 | 0.201 | 0.045 | 0.101 | 0.086 | 0.079 | 0.045 | 0.039 | -0.033 | 0.019 | 0.186 | -0.057 |
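As noted above, the synonym lists are simply the terms ordered by these scores. For example, sorting the column for “positive” in decreasing order should reproduce the “positive” column of bs_term:
# terms sorted by their similarity to the seed word "positive"
names(sort(bs_coef[, "positive"], decreasing = TRUE))[1:10]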
We can use words with known polarity, such as “russia” and “ukraine”, as anchor words when evaluating seed words. We know that “russia” is more positive than “ukraine” because the corpus is a collection of articles published by the Russian state media.
We can confirm that the differences in similarity scores between the anchor words largely agree with the polarity scores of the seed words. However, “fortunate” and “negative” disagree with the expected differences, suggesting that they are more ambiguous than the other seed words.
dat_seed <- data.frame(seed = lss$seeds, diff = bs_coef["russia",] - bs_coef["ukraine",])
print(dat_seed)
#> seed diff
#> good 1 0.001436750
#> nice 1 0.014494254
#> excellent 1 0.034122232
#> positive 1 0.017660484
#> fortunate 1 -0.003603953
#> correct 1 0.017574761
#> superior 1 0.003759512
#> bad -1 -0.007722325
#> nasty -1 -0.003313554
#> poor -1 -0.008085451
#> negative -1 0.005467496
#> unfortunate -1 -0.090945891
#> wrong -1 -0.008309969
#> inferior -1 -0.043939811
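A simple way to flag ambiguous seed words from this data frame is to list those whose anchor-word difference contradicts the sign of their polarity score; here this picks out “fortunate” and “negative”, consistent with the discussion above.
# seed words whose anchor-word difference disagrees with their polarity
subset(dat_seed, sign(seed) != sign(diff))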
Conclusions
We should make a list of seed words and evaluate them one by one using bootstrap_lss() to create an accurate measurement. However, seed words become much less ambiguous when they are used as a set, so we should not be too nervous about the results of the evaluation. Seed word selection should be motivated primarily by the theoretical framework.