Seeded LDA

Seeded LDA (Latent Dirichlet Allocation) identifies pre-defined topics in a corpus from a small number of seed words. It is useful in deductive analysis, when you want the topics to correspond to theoretical concepts.

Preparation

We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.

library(seededlda)

#> Warning: package 'quanteda' was built under R version 4.3.3

#> Warning in .recacheSubclasses(def@className, def, env): undefined subclass
#> "ndiMatrix" of class "replValueSp"; definition not updated

#> Warning: package 'proxyC' was built under R version 4.3.3

library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)

#> Document-feature matrix of: 8,063 documents, 58,664 features (99.77% sparse) and 4 docvars.
#>              features
#> docs          reiterated commitment diplomacy worsened reporters urge best
#>   s1092644731          1          1         8        1         2    1    1
#>   s1092643478          0          0         0        0         0    0    0
#>   s1092643372          0          0         0        0         0    0    0
#>   s1092643164          0          0         0        0         0    0    0
#>   s1092641413          0          0         6        0         0    0    0
#>   s1092640142          0          0         0        0         0    0    0
#>              features
#> docs          forward continuing buildup
#>   s1092644731       1          2       2
#>   s1092643478       0          0       0
#>   s1092643372       1          0       0
#>   s1092643164       0          0       0
#>   s1092641413       0          1       0
#>   s1092640142       0          0       0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,654 more features ]

We will use seed words in a dictionary to define the topics.

dict <- dictionary(file = "dictionary.yml")
print(dict)

#> Dictionary object with 5 key entries.
#> - [economy]:
#>   - market*, money, bank*, stock*, bond*, industry, company, shop*
#> - [politics]:
#>   - parliament*, congress*, white house, party leader*, party member*, voter*, lawmaker*, politician*
#> - [society]:
#>   - police, prison*, school*, hospital*
#> - [diplomacy]:
#>   - ambassador*, diplomat*, embassy, treaty
#> - [military]:
#>   - military, soldier*, terrorist*, air force, marine, navy, army

textmodel_seededlda() has no k argument because the number of topics is determined by the number of keys in the dictionary. To speed up the analysis, you can use the distributed algorithm (batch_size = 0.01) and convergence detection (auto_iter = TRUE).

lda_seed <- textmodel_seededlda(dfmt, dict, batch_size = 0.01, auto_iter = TRUE,
                                verbose = TRUE)

#> Fitting LDA with 5 topics
#>  ...initializing
#>  ...using up to 16 threads for distributed computing
#>  ......allocating 81 documents to each thread
#>  ...Gibbs sampling in up to 2000 iterations
#>  ......iteration 100 elapsed time: 7.37 seconds (delta: 0.18%)
#>  ......iteration 200 elapsed time: 13.32 seconds (delta: 0.03%)
#>  ......iteration 300 elapsed time: 19.19 seconds (delta: 0.00%)
#>  ......iteration 400 elapsed time: 25.43 seconds (delta: -0.04%)
#>  ...computing theta and phi
#>  ...complete
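The figures in this output can be related back to the arguments with a little arithmetic. The sketch below uses only base R and the numbers printed above. Both rules in it are assumptions inferred from the log rather than taken from the package source: that the batch is the proportion of documents rounded up, and that sampling stops at the first check where delta falls below zero.

```r
# Values taken from the output above.
n_docs <- 8063
batch_size <- 0.01

# Assumption: the number of documents per thread is the proportion rounded up.
docs_per_batch <- ceiling(n_docs * batch_size)
print(docs_per_batch)  # 81, matching "allocating 81 documents to each thread"

# Deltas (in percent) reported every 100 iterations in the run above.
delta <- c(`100` = 0.18, `200` = 0.03, `300` = 0.00, `400` = -0.04)

# Assumption: sampling stops at the first check with a negative delta.
stop_at <- names(delta)[which(delta < 0)[1]]
print(stop_at)  # "400", the last iteration reported in the log
```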

knitr::kable(terms(lda_seed))

economy	politics	society	diplomacy	military
bank	congress	joins	diplomatic	army
company	trump	police	ambassador	soldiers
money	johnson	journalist	cooperation	dpr
industry	politicians	joined	taiwan	terrorist
markets	hour	author	alliance	air
per	british	rights	lavrov	civilians
companies	parliament	professor	embassy	lpr
banks	lawmakers	talk	diplomat	systems
stream	voters	sean	chinese	missile
nord	democrats	history	turkey	shelling

Seeded LDA with residual topics

Seeded LDA can have both seeded and unseeded topics. If residual = 2, two unseeded topics are added to the model. You can change the name of these topics through options(seededlda_residual_name).

lda_res <- textmodel_seededlda(dfmt, dict, residual = 2, batch_size = 0.01,
                               auto_iter = TRUE, verbose = TRUE)

#> Fitting LDA with 7 topics
#>  ...initializing
#>  ...using up to 16 threads for distributed computing
#>  ......allocating 81 documents to each thread
#>  ...Gibbs sampling in up to 2000 iterations
#>  ......iteration 100 elapsed time: 8.51 seconds (delta: 0.25%)
#>  ......iteration 200 elapsed time: 16.37 seconds (delta: 0.05%)
#>  ......iteration 300 elapsed time: 24.27 seconds (delta: -0.04%)
#>  ...computing theta and phi
#>  ...complete

knitr::kable(terms(lda_res))

economy	politics	society	diplomacy	military	other1	other2
bank	congress	police	diplomatic	army	joins	soviet
company	politicians	azov	ambassador	soldiers	joined	know
money	trump	mediabank	taiwan	terrorist	journalist	say
industry	parliament	civilians	cooperation	air	hour	british
markets	lawmakers	mariupol	alliance	systems	johnson	photo
per	voters	photo	embassy	dpr	talk	it’s
companies	hunter	killed	diplomat	lpr	author	israel
banks	biological	crimes	lavrov	missile	analyst	good
fuel	republicans	human	india	missiles	professor	zelensky
stream	republican	battalion	chinese	grain	spoke	great
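The topic count and the labels of the unseeded topics follow directly from the dictionary and the residual argument. The base R sketch below sets the naming option before a model would be fitted and derives the labels one would expect. The prefix "unseeded" is an arbitrary illustrative value, and the numbering scheme is an assumption based on the "other1" and "other2" columns above.

```r
# Set the label prefix for unseeded topics before fitting the model.
# "unseeded" is an arbitrary value chosen for illustration.
options(seededlda_residual_name = "unseeded")

n_seeded <- 5    # number of keys in the dictionary
n_residual <- 2  # value passed as residual

# Total number of topics reported as "Fitting LDA with 7 topics".
k <- n_seeded + n_residual
print(k)

# Assumption: the prefix is followed by a running number, as in "other1", "other2".
expected <- paste0(getOption("seededlda_residual_name"), seq_len(n_residual))
print(expected)
```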

