Seeded LDA (Latent Dirichlet Allocation) can identify pre-defined topics in the corpus with a small number of seed words. Seeded LDA is useful when you want to match topics with theoretical concepts in deductive analysis.
Preparation
We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.
library(seededlda)
library(quanteda)
corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |>
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 8,063 documents, 58,664 features (99.77% sparse) and 4 docvars.
#> features
#> docs reiterated commitment diplomacy worsened reporters urge best
#> s1092644731 1 1 8 1 2 1 1
#> s1092643478 0 0 0 0 0 0 0
#> s1092643372 0 0 0 0 0 0 0
#> s1092643164 0 0 0 0 0 0 0
#> s1092641413 0 0 6 0 0 0 0
#> s1092640142 0 0 0 0 0 0 0
#> features
#> docs forward continuing buildup
#> s1092644731 1 2 2
#> s1092643478 0 0 0
#> s1092643372 1 0 0
#> s1092643164 0 0 0
#> s1092641413 0 1 0
#> s1092640142 0 0 0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,654 more features ]
We will use seed words in a dictionary to define the topics.
dict <- dictionary(file = "dictionary.yml")
print(dict)
#> Dictionary object with 5 key entries.
#> - [economy]:
#> - market*, money, bank*, stock*, bond*, industry, company, shop*
#> - [politics]:
#> - parliament*, congress*, white house, party leader*, party member*, voter*, lawmaker*, politician*
#> - [society]:
#> - police, prison*, school*, hospital*
#> - [diplomacy]:
#> - ambassador*, diplomat*, embassy, treaty
#> - [military]:
#> - military, soldier*, terrorist*, air force, marine, navy, army
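The contents of dictionary.yml are not shown here, but the same seed words can also be defined directly in R. As a sketch, the block below rebuilds the dictionary printed above from a named list with quanteda's dictionary(); the object name dict_inline is only illustrative.

```r
library(quanteda)

# Seed words copied from the printed dictionary above; each key becomes one topic.
dict_inline <- dictionary(list(
    economy = c("market*", "money", "bank*", "stock*", "bond*",
                "industry", "company", "shop*"),
    politics = c("parliament*", "congress*", "white house", "party leader*",
                 "party member*", "voter*", "lawmaker*", "politician*"),
    society = c("police", "prison*", "school*", "hospital*"),
    diplomacy = c("ambassador*", "diplomat*", "embassy", "treaty"),
    military = c("military", "soldier*", "terrorist*", "air force",
                 "marine", "navy", "army")
))
print(names(dict_inline))
```

Defining the dictionary in code keeps the seed words next to the analysis, while a YAML file is easier to share and edit separately.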
Seeded LDA
textmodel_seededlda() does not have a k argument because it determines the number of topics from the dictionary keys. You can use the distributed algorithm (batch_size = 0.01) and convergence detection (auto_iter = TRUE) to speed up analysis.
lda_seed <- textmodel_seededlda(dfmt, dict, batch_size = 0.01, auto_iter = TRUE,
                                verbose = TRUE)
#> Fitting LDA with 5 topics
#> ...initializing
#> ...using up to 16 threads for distributed computing
#> ......allocating 81 documents to each thread
#> ...Gibbs sampling in up to 2000 iterations
#> ......iteration 100 elapsed time: 7.37 seconds (delta: 0.18%)
#> ......iteration 200 elapsed time: 13.32 seconds (delta: 0.03%)
#> ......iteration 300 elapsed time: 19.19 seconds (delta: 0.00%)
#> ......iteration 400 elapsed time: 25.43 seconds (delta: -0.04%)
#> ...computing theta and phi
#> ...complete
economy | politics | society | diplomacy | military |
---|---|---|---|---|
bank | congress | joins | diplomatic | army |
company | trump | police | ambassador | soldiers |
money | johnson | journalist | cooperation | dpr |
industry | politicians | joined | taiwan | terrorist |
markets | hour | author | alliance | air |
per | british | rights | lavrov | civilians |
companies | parliament | professor | embassy | lpr |
banks | lawmakers | talk | diplomat | systems |
stream | voters | sean | chinese | missile |
nord | democrats | history | turkey | shelling |
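The table above lists the ten highest-ranked terms in each topic. As a minimal, self-contained sketch of how such a table and the per-document topic labels can be extracted, the example below fits a small model on quanteda's built-in data_corpus_inaugural instead of the Sputnik corpus. The two-key dictionary dict_toy and the max_iter value are illustrative assumptions, not settings from this tutorial.

```r
library(quanteda)
library(seededlda)

# Toy data: US inaugural addresses shipped with quanteda (not the Sputnik corpus).
toks_toy <- tokens(data_corpus_inaugural, remove_punct = TRUE,
                   remove_symbols = TRUE, remove_numbers = TRUE)
dfmt_toy <- dfm(toks_toy) |>
    dfm_remove(stopwords("en"))

# Hypothetical seed words chosen only for this illustration.
dict_toy <- dictionary(list(
    economy = c("econom*", "trade", "tax*"),
    military = c("army", "war", "soldier*")
))

# A short run is enough for a toy example.
lda_toy <- textmodel_seededlda(dfmt_toy, dict_toy, max_iter = 200)

# terms() returns a character matrix with one column per topic.
top_terms <- terms(lda_toy, n = 10)
print(top_terms)

# topics() returns the most likely topic of each document as a factor.
doc_topics <- topics(lda_toy)
print(table(doc_topics))
```

Applied to the model in this tutorial, the same calls would be terms(lda_seed) and topics(lda_seed).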
Seeded LDA with residual topics
Seeded LDA can have both seeded and unseeded topics. If residual = 2, two unseeded topics are added to the model. You can change the name of these topics through options(seededlda_residual_name).
lda_res <- textmodel_seededlda(dfmt, dict, residual = 2, batch_size = 0.01,
                               auto_iter = TRUE, verbose = TRUE)
#> Fitting LDA with 7 topics
#> ...initializing
#> ...using up to 16 threads for distributed computing
#> ......allocating 81 documents to each thread
#> ...Gibbs sampling in up to 2000 iterations
#> ......iteration 100 elapsed time: 8.51 seconds (delta: 0.25%)
#> ......iteration 200 elapsed time: 16.37 seconds (delta: 0.05%)
#> ......iteration 300 elapsed time: 24.27 seconds (delta: -0.04%)
#> ...computing theta and phi
#> ...complete
economy | politics | society | diplomacy | military | other1 | other2 |
---|---|---|---|---|---|---|
bank | congress | police | diplomatic | army | joins | soviet |
company | politicians | azov | ambassador | soldiers | joined | know |
money | trump | mediabank | taiwan | terrorist | journalist | say |
industry | parliament | civilians | cooperation | air | hour | british |
markets | lawmakers | mariupol | alliance | systems | johnson | photo |
per | voters | photo | embassy | dpr | talk | it’s |
companies | hunter | killed | diplomat | lpr | author | israel |
banks | biological | crimes | lavrov | missile | analyst | good |
fuel | republicans | human | india | missiles | professor | zelensky |
stream | republican | battalion | chinese | grain | spoke | great |
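To illustrate the residual argument together with the seededlda_residual_name option, the sketch below uses quanteda's built-in data_corpus_inaugural as stand-in data. The seed words, the label "unseeded", and max_iter = 200 are assumptions made for the example, and the option is assumed to be read when the model is fitted. With the default setting, the extra topics are labeled other1 and other2, as in the table above.

```r
library(quanteda)
library(seededlda)

# Toy data: US inaugural addresses shipped with quanteda (not the Sputnik corpus).
dfmt_toy <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
    dfm() |>
    dfm_remove(stopwords("en"))

# Hypothetical seed words chosen only for this illustration.
dict_toy <- dictionary(list(
    economy = c("econom*", "trade", "tax*"),
    military = c("army", "war", "soldier*")
))

# Set the label for residual topics before fitting and keep the old setting.
old_opts <- options(seededlda_residual_name = "unseeded")

# Two seeded topics from the dictionary plus two unseeded (residual) topics.
lda_toy_res <- textmodel_seededlda(dfmt_toy, dict_toy, residual = 2,
                                   max_iter = 200)
print(colnames(terms(lda_toy_res)))

# Restore the previous option value.
options(old_opts)
```

The seeded topics keep their dictionary keys as names, so only the labels of the residual topics are affected by the option.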