
To apply LSS, we must define the measurement with seed words. If seed words are not readily available, we must create a list of them using thesauri and glossaries, but some of these words might be used in many different contexts, making them unsuitable as seed words. Good seed words are words that appear only in the contexts of the target concepts.

We can evaluate the suitability of seed words by checking their synonyms as identified in the corpus: fit an LSS model with one seed word at a time and inspect the words with the highest polarity scores. This repetitive process can be automated using bootstrap_lss(). The function extracts the seed words from a fitted LSS model, fits an LSS model with each seed word internally, and returns their synonyms or similarity scores. In this example, we use the LSS model fitted in the introduction.
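
The manual version of this check might look like the short sketch below. This is only an illustration, assuming the document-feature matrix `dfmt` and the seed words `seed` from the introduction are available in the session.

library(LSX)
# Manual version of the check that bootstrap_lss() automates: fit an LSS
# model with a single seed word and inspect the highest-scoring words.
# `dfmt` and `seed` are assumed to come from the introduction.
lss_good <- textmodel_lss(dfmt, seeds = seed["good"], k = 300)
head(coef(lss_good), 10)  # words most similar to "good" in this corpus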

library(LSX)
lss <- readRDS("lss.rds")
print(lss)
#> 
#> Call:
#> textmodel_lss(x = dfmt, seeds = seed, k = 300, cache = TRUE, 
#>     include_data = TRUE, group_data = TRUE)

The model is fitted with the generic sentiment seed words. Their original polarity scores are weighted by the inverse of the number of seed words on each side (1/7 ≈ 0.143) so that the two opposing ends carry equal total weight even when they contain unequal numbers of words.

print(lss$seeds)
#>        good        nice   excellent    positive   fortunate     correct 
#>           1           1           1           1           1           1 
#>    superior         bad       nasty        poor    negative unfortunate 
#>           1          -1          -1          -1          -1          -1 
#>       wrong    inferior 
#>          -1          -1
print(lss$seeds_weighted)
#>        good        nice   excellent    positive   fortunate     correct 
#>   0.1428571   0.1428571   0.1428571   0.1428571   0.1428571   0.1428571 
#>    superior         bad       nasty        poor    negative unfortunate 
#>   0.1428571  -0.1428571  -0.1428571  -0.1428571  -0.1428571  -0.1428571 
#>       wrong    inferior 
#>  -0.1428571  -0.1428571
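
The weighting rule described above can be reproduced by hand. A minimal sketch, using only the seed words printed above (an illustration, not the package's internal code):

seed <- lss$seeds
# Divide each polarity score by the number of seed words on its own side
# (here 7), so the positive and negative ends carry equal total weight.
n_pos <- sum(seed > 0)
n_neg <- sum(seed < 0)
weight <- ifelse(seed > 0, seed / n_pos, seed / n_neg)
print(weight)  # should match lss$seeds_weighted above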

Evaluation with synonyms

By default, bootstrap_lss() returns lists of synonyms for the seed words. Each column is a list of words sorted by their similarity to the seed word shown at the top; the first word in each list is the seed word itself, because a word is always most similar to itself. There are many proper names in the example, but we can see that many of the words are positive for positive seed words and negative for negative seed words. If a list is a mixture of positive and negative words, the seed word is probably too ambiguous.

bs_term <- bootstrap_lss(lss, mode = "terms")
knitr::kable(head(bs_term, 10))
|good |nice |excellent |positive |fortunate |correct |superior |bad |nasty |poor |negative |unfortunate |wrong |inferior |
|:----|:----|:---------|:--------|:---------|:-------|:--------|:---|:-----|:----|:--------|:-----------|:-----|:--------|
|good |nice |excellent |positive |fortunate |correct |superior |bad |nasty |poor |negative |unfortunate |wrong |inferior |
|gander |downtrend |tact |constructive |packer |bicocca |2s19 |good |t.v |oppressed |impact |garnett’s |they’re |harjit |
|fingalise |approachable |brennan’s |bilateral |accomplishment |seldom |msta-s |really |salaried |working |likens |squabbled |optimists |sajjan |
|toast |romani |staffer’s |us-russia |respectfully |i’ve |lisitsyn |fingalise |pokes |class |credit-based |rostik |think |viterra |
|brennan’s |sportsman |china-u.s |moscow-kiev |schultze |uh |tagil |goodness |life-changing |nino |evaporate |pergamon |arranges |batons |
|fidgety |glenmede |fidgety |hope |vienna-hosted |really |msta’s |gander |sofa |mine-defusing |economy |lapdogs |serezha |prettied |
|nexus |smelly |soured |sides |rama |dostoyevsky |hierarchical |lot |phosgene |movement |adverse |untrained |napolitano |pop-culturally |
|overdependent |attract |inter-parliamntary |impasse |cop13 |self-regard |tse |there’s |chloroform |racist |dhaka’s |asian-african |stink |subhuman |
|duda’s |io-30 |relations |mutual |washington-tehran |grading |cradling |they’re |clogged |piecemeal |alaskans |minders |ex-spook |cocoa |
|relations |swirled |nexus |abramovich’s |kovacevski |marble |nizhny |i’m |nit |capitalism |psychotic |polarise |motorsport |cottonseed |
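
To focus on a single suspicious seed word, we can extract its column from the result, for example (a brief sketch, assuming the default matrix output of bootstrap_lss()):

# Inspect one seed word's list, e.g. to check whether it mixes positive
# and negative words (a sign that the seed word is ambiguous).
head(bs_term[, "fortunate"], 10)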

Evaluation with similarity scores

If mode = "coef", the function returns the similarity scores of words for each seed word (the words in the lists above were sorted by these scores). We can use this matrix to evaluate the seed words more systematically when the lists of synonyms are not useful or sufficient.

bs_coef <- bootstrap_lss(lss, mode = "coef")
knitr::kable(head(bs_coef, 10), digits = 3)
|           |good |nice |excellent |positive |fortunate |correct |superior |bad |nasty |poor |negative |unfortunate |wrong |inferior |
|:----------|----:|----:|---------:|--------:|---------:|-------:|--------:|---:|-----:|----:|--------:|-----------:|-----:|--------:|
|monday     |-0.037|-0.006|-0.029|0.020|-0.020|0.004|0.125|0.017|0.003|0.013|-0.061|-0.032|-0.017|-0.019|
|us         |-0.002|-0.019|-0.010|0.005|0.030|0.006|-0.024|0.015|-0.020|-0.004|0.009|0.021|-0.003|-0.007|
|president  |-0.015|-0.032|-0.005|-0.009|0.016|0.023|0.005|-0.007|-0.003|-0.020|0.004|-0.016|0.015|-0.005|
|joe        |0.018|0.158|-0.021|-0.083|-0.060|-0.036|-0.022|0.024|-0.036|0.062|-0.041|-0.054|-0.023|-0.001|
|biden      |0.018|-0.007|-0.027|0.068|0.048|-0.051|0.003|-0.056|-0.023|0.026|0.024|0.012|0.023|-0.004|
|reiterated |0.015|0.062|0.098|0.216|-0.034|0.148|0.000|-0.024|-0.021|-0.119|0.013|-0.104|-0.020|-0.041|
|united     |-0.002|0.018|-0.100|0.037|0.092|-0.009|-0.015|-0.023|0.018|-0.028|0.002|0.043|0.019|-0.036|
|states     |-0.010|-0.008|0.093|-0.068|0.049|-0.028|-0.023|0.005|-0.033|0.032|0.025|-0.056|-0.026|0.020|
|commitment |0.019|0.064|0.096|0.076|0.056|0.025|0.011|-0.062|0.002|0.035|-0.030|-0.064|-0.012|0.002|
|diplomacy  |0.183|0.133|0.090|0.201|0.045|0.101|0.086|0.079|0.045|0.039|-0.033|0.019|0.186|-0.057|

We can use words with known polarity, such as “russia” and “ukraine”, as anchor words in evaluating seed words. We know that “russia” is more positive than “ukraine” in this corpus because it is a collection of articles published by the Russian state media.

We can confirm that the differences in similarity scores between the anchor words largely agree with the polarity scores of the seed words. However, “fortunate” and “negative” disagree with the expected differences, suggesting that they are more ambiguous than the other seed words.

dat_seed <- data.frame(seed = lss$seeds, diff = bs_coef["russia",] - bs_coef["ukraine",])
print(dat_seed)
#>             seed         diff
#> good           1  0.001436750
#> nice           1  0.014494254
#> excellent      1  0.034122232
#> positive       1  0.017660484
#> fortunate      1 -0.003603953
#> correct        1  0.017574761
#> superior       1  0.003759512
#> bad           -1 -0.007722325
#> nasty         -1 -0.003313554
#> poor          -1 -0.008085451
#> negative      -1  0.005467496
#> unfortunate   -1 -0.090945891
#> wrong         -1 -0.008309969
#> inferior      -1 -0.043939811
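
The same check can be made programmatic by flagging the seed words whose sign of diff disagrees with their polarity score. A small sketch, using the dat_seed data frame constructed above:

# Flag seed words whose anchor-word difference disagrees in sign with their
# polarity score; this reproduces the "fortunate" and "negative" finding.
subset(dat_seed, sign(diff) != sign(seed))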

Conclusions

We should make a list of seed words and evaluate them one by one using bootstrap_lss() to create an accurate measurement. However, seed words become much less ambiguous when they are used as a set, so we should not be too nervous about the results of the evaluation. Seed word selection should be motivated primarily by the theoretical framework.
