Introduction to seededlda
The package for semi-supervised topic modeling
seededlda was created mainly for semi-supervised topic modeling, but it can also perform unsupervised topic modeling. On this page, I explain the basic functions of the package using unsupervised LDA (Latent Dirichlet Allocation) as an example; semi-supervised LDA is discussed on a separate page.
Preparation
We use the corpus of Sputnik articles about Ukraine in the examples. In the preprocessing, we remove grammatical words (stopwords("en")), email addresses ("*@*") and words that occur in more than 10% of documents (max_docfreq = 0.1) from the document-feature matrix (DFM).
library(seededlda)
library(quanteda)

# corpus of Sputnik articles about Ukraine
corp <- readRDS("data_corpus_sputnik2022.rds")

# tokenize, removing punctuation, symbols, numbers and URLs
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)

# construct the DFM and remove stopwords, email addresses and frequent words
dfmt <- dfm(toks) |>
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 8,063 documents, 58,664 features (99.77% sparse) and 4 docvars.
#> features
#> docs reiterated commitment diplomacy worsened reporters urge best
#> s1092644731 1 1 8 1 2 1 1
#> s1092643478 0 0 0 0 0 0 0
#> s1092643372 0 0 0 0 0 0 0
#> s1092643164 0 0 0 0 0 0 0
#> s1092641413 0 0 6 0 0 0 0
#> s1092640142 0 0 0 0 0 0 0
#> features
#> docs forward continuing buildup
#> s1092644731 1 2 2
#> s1092643478 0 0 0
#> s1092643372 1 0 0
#> s1092643164 0 0 0
#> s1092641413 0 1 0
#> s1092640142 0 0 0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,654 more features ]
Standard LDA
You can fit LDA on the DFM simply by setting the number of topics to identify (k = 10). When verbose = TRUE, it shows the progress of the inference through the iterations. It takes a long time to fit LDA on a large corpus, but the distributed algorithm will speed up your analysis dramatically (a sketch follows the output below).
lda <- textmodel_lda(dfmt, k = 10, verbose = TRUE)
#> Fitting LDA with 10 topics
#> ...initializing
#> ...Gibbs sampling in 2000 iterations
#> ......iteration 100 elapsed time: 23.99 seconds (delta: 0.13%)
#> ......iteration 200 elapsed time: 52.57 seconds (delta: -0.01%)
#> ......iteration 300 elapsed time: 93.53 seconds (delta: 0.05%)
#> ......iteration 400 elapsed time: 134.12 seconds (delta: -0.01%)
#> ......iteration 500 elapsed time: 174.18 seconds (delta: 0.01%)
#> ......iteration 600 elapsed time: 213.07 seconds (delta: 0.00%)
#> ......iteration 700 elapsed time: 250.18 seconds (delta: -0.01%)
#> ......iteration 800 elapsed time: 288.05 seconds (delta: 0.02%)
#> ......iteration 900 elapsed time: 328.06 seconds (delta: 0.03%)
#> ......iteration 1000 elapsed time: 370.19 seconds (delta: 0.01%)
#> ......iteration 1100 elapsed time: 410.81 seconds (delta: 0.02%)
#> ......iteration 1200 elapsed time: 453.34 seconds (delta: 0.01%)
#> ......iteration 1300 elapsed time: 492.66 seconds (delta: 0.02%)
#> ......iteration 1400 elapsed time: 529.79 seconds (delta: 0.02%)
#> ......iteration 1500 elapsed time: 566.95 seconds (delta: -0.02%)
#> ......iteration 1600 elapsed time: 604.24 seconds (delta: 0.02%)
#> ......iteration 1700 elapsed time: 641.66 seconds (delta: 0.00%)
#> ......iteration 1800 elapsed time: 679.14 seconds (delta: 0.01%)
#> ......iteration 1900 elapsed time: 716.78 seconds (delta: -0.01%)
#> ......iteration 2000 elapsed time: 754.18 seconds (delta: -0.02%)
#> ...computing theta and phi
#> ...complete
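To illustrate how the distributed algorithm mentioned above might be used, here is a minimal sketch. It assumes that your version of textmodel_lda() accepts a batch_size argument to enable the distributed Gibbs sampler; check ?textmodel_lda before relying on it.
# a sketch, assuming textmodel_lda() supports batch_size in your version of seededlda
set.seed(1234)  # fix the RNG state so the random initialization is reproducible
lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)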
Topic terms
Once the model is fit, you can interpret the topics by reading the most salient words in them. terms() returns a matrix in which the words that are most frequent in each topic appear at the top.
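The chunk that produced the table below is not shown in this extract; a minimal sketch using terms() and knitr::kable() would be:
# top 10 words of each topic, rendered as a table
terms(lda, 10) |>
    knitr::kable()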
topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 |
---|---|---|---|---|---|---|---|---|---|
per | azov | joins | iran | taiwan | alliance | india | air | trump | british |
stream | mediabank | joined | biological | chinese | lavrov | grain | dpr | congress | johnson |
fuel | civilians | journalist | french | beijing | finland | indian | systems | hunter | |
nord | video | hour | saudi | interests | sweden | cooperation | missile | democrats | truss |
crude | killed | talk | israel | say | german | africa | lpr | republicans | boris |
companies | mariupol | author | research | soviet | negotiations | sea | missiles | bill | photo |
pipeline | photo | analyst | intelligence | strategic | peskov | trade | zaporozhye | republican | sunak |
imports | human | spoke | activities | know | zelensky | african | plant | election | london |
natural | battalion | mark | space | course | poland | summit | training | senate | pm |
cap | crimes | sean | macron | change | proposals | south | equipment | biden’s | liz |
Document topics
You can also predict the topics of documents using topics(). I recommend extracting the document variables from the DFM stored in the fitted object (lda$data) and saving the predicted topics in that data frame.
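The chunk behind the table below is likewise not shown; a minimal sketch, assuming the corpus carries "date" and "head" (headline) document variables as the table suggests, would be:
dat <- docvars(lda$data)  # document variables from the DFM in the fitted object
dat$topic <- topics(lda)  # most likely topic of each document
knitr::kable(head(dat[, c("date", "topic", "head")], 10))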
date | topic | head |
---|---|---|
2022-01-31 | topic6 | Biden: US Desires Diplomacy But ‘Ready No Matter What Happens’ If Ukraine Tensions Worsen |
2022-01-31 | topic1 | EU Trade Commissioner Says Nord Stream 2 ‘on Pause’ Pending Review of Compliance With European Laws |
2022-01-31 | topic1 | Russian Assets in UK May Be Frozen in Accordance With New Sanctions Bill, Foreign Secretary Says |
2022-01-31 | topic9 | Hunter Biden Was Reportedly Subpoenaed Over Dealings With China a Year Before Presidential Election |
2022-01-31 | topic6 | US Urges UN Security Council to Act on Ukraine Crisis as Russia Denies Invasion Claims |
2022-01-31 | topic6 | UN Security Council Holds Meeting on Ukrainian Crisis |
2022-01-31 | topic5 | The Depth of US Cold War Thinking and Murder of JFK |
2022-01-31 | topic6 | UK PM Johnson Says He Will Tell Putin to ‘Step Back From the Brink’ in Ukraine |
2022-01-31 | topic10 | Kremlin on UK Sanctions Threat: Attack on Russian Businesses Means There Will Be Retaliation |
2022-01-31 | topic6 | Ukraine Crisis Escalation Forces Boris Johnson to Cancel Visit to Japan, Reports Say |
References
- Heinrich, G. (2008). Parameter estimation for text analysis. http://www.arbylon.net/publications/text-est.pdf
- Watanabe, K., & Baturo, A. (2023). Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences. Social Science Computer Review. https://doi.org/10.1177/08944393231178605