
Seeded LDA (Latent Dirichlet Allocation) can identify pre-defined topics in a corpus using a small number of seed words per topic. It is useful when you want to match topics with theoretical concepts in deductive analysis.

Preparation

We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.

library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 8,063 documents, 58,664 features (99.77% sparse) and 4 docvars.
#>              features
#> docs          reiterated commitment diplomacy worsened reporters urge best
#>   s1092644731          1          1         8        1         2    1    1
#>   s1092643478          0          0         0        0         0    0    0
#>   s1092643372          0          0         0        0         0    0    0
#>   s1092643164          0          0         0        0         0    0    0
#>   s1092641413          0          0         6        0         0    0    0
#>   s1092640142          0          0         0        0         0    0    0
#>              features
#> docs          forward continuing buildup
#>   s1092644731       1          2       2
#>   s1092643478       0          0       0
#>   s1092643372       1          0       0
#>   s1092643164       0          0       0
#>   s1092641413       0          1       0
#>   s1092640142       0          0       0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,654 more features ]
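
As an optional check (not part of the original workflow), you can inspect the most frequent remaining features to confirm that stop words and very common words were removed as intended; this only assumes the dfmt object created above.

# most frequent features left after stop-word removal and trimming
topfeatures(dfmt, n = 20)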

We will use seed words in a dictionary to define the topics.

dict <- dictionary(file = "dictionary.yml")
print(dict)
#> Dictionary object with 5 key entries.
#> - [economy]:
#>   - market*, money, bank*, stock*, bond*, industry, company, shop*
#> - [politics]:
#>   - parliament*, congress*, white house, party leader*, party member*, voter*, lawmaker*, politician*
#> - [society]:
#>   - police, prison*, school*, hospital*
#> - [diplomacy]:
#>   - ambassador*, diplomat*, embassy, treaty
#> - [military]:
#>   - military, soldier*, terrorist*, air force, marine, navy, army
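
If you prefer to keep the seed words in your R script rather than in a separate YAML file, the same dictionary can be built directly with dictionary(); the code below simply restates the keys printed above.

# equivalent dictionary defined inline instead of reading dictionary.yml
dict <- dictionary(list(
    economy = c("market*", "money", "bank*", "stock*", "bond*", "industry", "company", "shop*"),
    politics = c("parliament*", "congress*", "white house", "party leader*", "party member*",
                 "voter*", "lawmaker*", "politician*"),
    society = c("police", "prison*", "school*", "hospital*"),
    diplomacy = c("ambassador*", "diplomat*", "embassy", "treaty"),
    military = c("military", "soldier*", "terrorist*", "air force", "marine", "navy", "army")
))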

Seeded LDA

The function textmodel_seededlda() does not have an argument k because the number of topics is determined by the keys of the dictionary. You can speed up the analysis by using the distributed algorithm (batch_size = 0.01) and automatic convergence detection (auto_iter = TRUE).

lda_seed <- textmodel_seededlda(dfmt, dict, batch_size = 0.01, auto_iter = TRUE,
                                verbose = TRUE)
#> Fitting LDA with 5 topics
#>  ...initializing
#>  ...using up to 16 threads for distributed computing
#>  ......allocating 81 documents to each thread
#>  ...Gibbs sampling in up to 2000 iterations
#>  ......iteration 100 elapsed time: 7.37 seconds (delta: 0.18%)
#>  ......iteration 200 elapsed time: 13.32 seconds (delta: 0.03%)
#>  ......iteration 300 elapsed time: 19.19 seconds (delta: 0.00%)
#>  ......iteration 400 elapsed time: 25.43 seconds (delta: -0.04%)
#>  ...computing theta and phi
#>  ...complete
knitr::kable(terms(lda_seed))
economy    politics     society     diplomacy    military
bank       congress     joins       diplomatic   army
company    trump        police      ambassador   soldiers
money      johnson      journalist  cooperation  dpr
industry   politicians  joined      taiwan       terrorist
markets    hour         author      alliance     air
per        british      rights      lavrov       civilians
companies  parliament   professor   embassy      lpr
banks      lawmakers    talk        diplomat     systems
stream     voters       sean        chinese      missile
nord       democrats    history     turkey       shelling
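
Beyond the top terms, you usually also want to know which topic each document belongs to. A minimal sketch, assuming the lda_seed object fitted above: topics() from the seededlda package returns the most likely topic of each document, which can be stored as a document variable for further analysis.

# assign the most likely topic to each document and tabulate the result
dfmt$topic <- topics(lda_seed)
table(dfmt$topic)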

Seeded LDA with residual topics

Seeded LDA can have both seeded and unseeded topics. If residual = 2, two unseeded topics are added to the model. You can change the names of these topics through options(seededlda_residual_name).

lda_res <- textmodel_seededlda(dfmt, dict, residual = 2, batch_size = 0.01, auto_iter = TRUE,
                                verbose = TRUE)
#> Fitting LDA with 7 topics
#>  ...initializing
#>  ...using up to 16 threads for distributed computing
#>  ......allocating 81 documents to each thread
#>  ...Gibbs sampling in up to 2000 iterations
#>  ......iteration 100 elapsed time: 8.51 seconds (delta: 0.25%)
#>  ......iteration 200 elapsed time: 16.37 seconds (delta: 0.05%)
#>  ......iteration 300 elapsed time: 24.27 seconds (delta: -0.04%)
#>  ...computing theta and phi
#>  ...complete
knitr::kable(terms(lda_res))
economy    politics     society    diplomacy    military   other1      other2
bank       congress     police     diplomatic   army       joins       soviet
company    politicians  azov       ambassador   soldiers   joined      know
money      trump        mediabank  taiwan       terrorist  journalist  say
industry   parliament   civilians  cooperation  air        hour        british
markets    lawmakers    mariupol   alliance     systems    johnson     photo
per        voters       photo      embassy      dpr        talk        it’s
companies  hunter       killed     diplomat     lpr        author      israel
banks      biological   crimes     lavrov       missile    analyst     good
fuel       republicans  human      india        missiles   professor   zelensky
stream     republican   battalion  chinese      grain      spoke       great
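
The unseeded topics are labelled other1 and other2 by default. If you want different labels, set the option mentioned above before fitting the model; a minimal sketch, using the arbitrary label "misc" (the residual topics should then appear as misc1 and misc2).

# change the label of residual topics before fitting
options(seededlda_residual_name = "misc")
lda_res2 <- textmodel_seededlda(dfmt, dict, residual = 2, batch_size = 0.01,
                                auto_iter = TRUE)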

References

  • Lu, B., Ott, M., Cardie, C., & Tsou, B. K. (2011). Multi-aspect sentiment analysis with topic models. 2011 IEEE 11th International Conference on Data Mining Workshops, 81–88.
  • Watanabe, K., & Baturo, A. (2023). Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences. Social Science Computer Review. https://doi.org/10.1177/08944393231178605