Selection of seed words
Evaluating seed words using utility functions
We must define the measurement with seed words to apply LSS. If seed words are not readily available, we must create a list of seed words using thesauruses and glossaries, but some of these words might be used in many different contexts, making them unsuitable as seed words. Good seed words are words that appear only in the contexts of the target concept.
We can evaluate the suitability of seed words by checking their synonyms identified in the corpus: fitting an LSS model with one seed word at a time and checking the words with the highest polarity scores. This repetitive process can be automated using bootstrap_lss().
This function extracts the seed words from a fitted LSS model, fits an LSS model with each seed word internally, and returns their synonyms or similarity scores (a manual sketch of this process is shown below). We use the LSS model fitted in the introduction in this example.
library(LSX)
lss <- readRDS("lss.rds")
print(lss)
#>
#> Call:
#> textmodel_lss(x = dfmt, seeds = seed, k = 300, cache = TRUE,
#> include_data = TRUE, group_data = TRUE)
The model is fitted with the generic sentiment seed words. Their original polarity scores are weighted by the inverse of the number of seed words on each side (1/7 ≈ 0.143) so that the numbers of positive and negative seed words do not need to be equal.
print(lss$seeds)
#> good nice excellent positive fortunate correct
#> 1 1 1 1 1 1
#> superior bad nasty poor negative unfortunate
#> 1 -1 -1 -1 -1 -1
#> wrong inferior
#> -1 -1
print(lss$seeds_weighted)
#> good nice excellent positive fortunate correct
#> 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
#> superior bad nasty poor negative unfortunate
#> 0.1428571 -0.1428571 -0.1428571 -0.1428571 -0.1428571 -0.1428571
#> wrong inferior
#> -0.1428571 -0.1428571
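To see what bootstrap_lss() automates, we could fit an LSS model with a single seed word and inspect its most similar terms. The snippet below is a minimal sketch of that manual procedure, assuming dfmt is the document-feature matrix used in the introduction (as in the call shown above).
# fit an LSS model with only one seed word ("good") and the same number of
# singular values as the full model; `dfmt` is assumed to be available
lss_good <- textmodel_lss(dfmt, seeds = c("good" = 1), k = 300)
head(coef(lss_good), 10) # terms most similar to "good"
Repeating this for every seed word by hand is tedious, which is exactly what bootstrap_lss() does internally.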
Evaluation with synonyms
By default, bootstrap_lss() returns lists of synonyms for the seed words. Each column is a list of words sorted by their similarity to the seed word shown at the top. There are many proper names in the example, but we can see that many words are positive for positive seed words and negative for negative seed words. If a list is a mixture of positive and negative words, the seed word is probably too ambiguous.
bs_term <- bootstrap_lss(lss, mode = "terms")
knitr::kable(head(bs_term, 10))
good | nice | excellent | positive | fortunate | correct | superior | bad | nasty | poor | negative | unfortunate | wrong | inferior |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
good | nice | excellent | positive | fortunate | correct | superior | bad | nasty | poor | negative | unfortunate | wrong | inferior |
gander | downtrend | tact | constructive | packer | bicocca | 2s19 | good | t.v | oppressed | impact | garnett’s | they’re | harjit |
fingalise | approachable | brennan’s | bilateral | accomplishment | seldom | msta-s | really | salaried | working | likens | squabbled | optimists | sajjan |
toast | romani | staffer’s | us-russia | respectfully | i’ve | lisitsyn | fingalise | pokes | class | credit-based | rostik | think | viterra |
brennan’s | sportsman | china-u.s | moscow-kiev | schultze | uh | tagil | goodness | life-changing | nino | evaporate | pergamon | arranges | batons |
fidgety | glenmede | fidgety | hope | vienna-hosted | really | msta’s | gander | sofa | mine-defusing | economy | lapdogs | serezha | prettied |
nexus | smelly | soured | sides | rama | dostoyevsky | hierarchical | lot | phosgene | movement | adverse | untrained | napolitano | pop-culturally |
overdependent | attract | inter-parliamntary | impasse | cop13 | self-regard | tse | there’s | chloroform | racist | dhaka’s | asian-african | stink | subhuman |
duda’s | io-30 | relations | mutual | washington-tehran | grading | cradling | they’re | clogged | piecemeal | alaskans | minders | ex-spook | cocoa |
relations | swirled | nexus | abramovich’s | kovacevski | marble | nizhny | i’m | nit | capitalism | psychotic | polarise | motorsport | cottonseed |
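If a particular seed word looks questionable, we can print more of its synonyms than the table above shows. For example, the column for “fortunate” could be inspected like this (the choice of seed word and the number of entries are only illustrative):
# inspect more synonyms of a single seed word
head(bs_term[, "fortunate"], 30)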
Evaluation with similarity scores
If mode = "coef", the function returns the similarity scores of words for each seed word (the words were sorted by these scores in the lists above). We can use this matrix to evaluate the seed words more systematically if the lists of synonyms are not useful or sufficient.
bs_coef <- bootstrap_lss(lss, mode = "coef")
knitr::kable(head(bs_coef, 10), digits = 3)
good | nice | excellent | positive | fortunate | correct | superior | bad | nasty | poor | negative | unfortunate | wrong | inferior | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
monday | -0.037 | -0.006 | -0.029 | 0.020 | -0.020 | 0.004 | 0.125 | 0.017 | 0.003 | 0.013 | -0.061 | -0.032 | -0.017 | -0.019 |
us | -0.002 | -0.019 | -0.010 | 0.005 | 0.030 | 0.006 | -0.024 | 0.015 | -0.020 | -0.004 | 0.009 | 0.021 | -0.003 | -0.007 |
president | -0.015 | -0.032 | -0.005 | -0.009 | 0.016 | 0.023 | 0.005 | -0.007 | -0.003 | -0.020 | 0.004 | -0.016 | 0.015 | -0.005 |
joe | 0.018 | 0.158 | -0.021 | -0.083 | -0.060 | -0.036 | -0.022 | 0.024 | -0.036 | 0.062 | -0.041 | -0.054 | -0.023 | -0.001 |
biden | 0.018 | -0.007 | -0.027 | 0.068 | 0.048 | -0.051 | 0.003 | -0.056 | -0.023 | 0.026 | 0.024 | 0.012 | 0.023 | -0.004 |
reiterated | 0.015 | 0.062 | 0.098 | 0.216 | -0.034 | 0.148 | 0.000 | -0.024 | -0.021 | -0.119 | 0.013 | -0.104 | -0.020 | -0.041 |
united | -0.002 | 0.018 | -0.100 | 0.037 | 0.092 | -0.009 | -0.015 | -0.023 | 0.018 | -0.028 | 0.002 | 0.043 | 0.019 | -0.036 |
states | -0.010 | -0.008 | 0.093 | -0.068 | 0.049 | -0.028 | -0.023 | 0.005 | -0.033 | 0.032 | 0.025 | -0.056 | -0.026 | 0.020 |
commitment | 0.019 | 0.064 | 0.096 | 0.076 | 0.056 | 0.025 | 0.011 | -0.062 | 0.002 | 0.035 | -0.030 | -0.064 | -0.012 | 0.002 |
diplomacy | 0.183 | 0.133 | 0.090 | 0.201 | 0.045 | 0.101 | 0.086 | 0.079 | 0.045 | 0.039 | -0.033 | 0.019 | 0.186 | -0.057 |
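As noted above, the synonym lists are simply the terms ordered by these scores. For example, sorting the column for “positive” in decreasing order should reproduce the “positive” column of bs_term:
# terms sorted by their similarity to the seed word "positive"
names(sort(bs_coef[, "positive"], decreasing = TRUE))[1:10]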
We can use words with known polarity, such as “russia” and “ukraine”, as anchor words when evaluating seed words. We know that “russia” is more positive than “ukraine” because the corpus is a collection of articles published by the Russian state media.
We can confirm that the differences in similarity scores between the anchor words largely agree with the polarity scores of the seed words. However, “fortunate” and “negative” disagree with the expected differences, suggesting that they are more ambiguous than the other seed words.
dat_seed <- data.frame(seed = lss$seeds, diff = bs_coef["russia",] - bs_coef["ukraine",])
print(dat_seed)
#> seed diff
#> good 1 0.001436750
#> nice 1 0.014494254
#> excellent 1 0.034122232
#> positive 1 0.017660484
#> fortunate 1 -0.003603953
#> correct 1 0.017574761
#> superior 1 0.003759512
#> bad -1 -0.007722325
#> nasty -1 -0.003313554
#> poor -1 -0.008085451
#> negative -1 0.005467496
#> unfortunate -1 -0.090945891
#> wrong -1 -0.008309969
#> inferior -1 -0.043939811
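A simple way to flag ambiguous seed words from this data frame is to list those whose anchor-word difference contradicts the sign of their polarity score; here this picks out “fortunate” and “negative”, consistent with the discussion above.
# seed words whose anchor-word difference disagrees with their polarity
subset(dat_seed, sign(seed) != sign(diff))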
Conclusions
We should make a list of seed words and evaluate them one by one using bootstrap_lss() to create an accurate measurement. However, seed words become much less ambiguous when they are used as a set, so we should not be too nervous about the results of the evaluation. Seed word selection should be motivated primarily by the theoretical framework.