seededlda was created mainly for semi-supervised topic modeling, but it can also perform unsupervised topic modeling. On this page, I explain the basic functions of the package using unsupervised LDA (Latent Dirichlet Allocation) as an example; semi-supervised LDA is discussed on a separate page.

Preparation

We use a corpus of Sputnik articles about Ukraine in the examples. In the preprocessing, we remove grammatical words (stopwords("en")), email addresses ("*@*") and words that occur in more than 10% of documents (max_docfreq = 0.1) from the document-feature matrix (DFM).

library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 8,063 documents, 58,541 features (99.77% sparse) and 4 docvars.
#>              features
#> docs          reiterated commitment diplomacy worsened reporters urge best
#>   s1092644731          1          1         8        1         2    1    1
#>   s1092643478          0          0         0        0         0    0    0
#>   s1092643372          0          0         0        0         0    0    0
#>   s1092643164          0          0         0        0         0    0    0
#>   s1092641413          0          0         6        0         0    0    0
#>   s1092640142          0          0         0        0         0    0    0
#>              features
#> docs          forward continuing buildup
#>   s1092644731       1          2       2
#>   s1092643478       0          0       0
#>   s1092643372       1          0       0
#>   s1092643164       0          0       0
#>   s1092641413       0          1       0
#>   s1092640142       0          0       0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,531 more features ]
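
To see what the trimming step does, here is a minimal sketch on a made-up four-document corpus. The texts and the object names txt, dfmt_toy and dfmt_trimmed are illustrative assumptions, not part of the Sputnik data. With max_docfreq = 0.5 and docfreq_type = "prop", a word that appears in every document is dropped, while words that appear in only one document are kept.

```r
library(quanteda)

# Hypothetical toy texts; "ukraine" occurs in all four documents
txt <- c(d1 = "ukraine gas pipeline",
         d2 = "ukraine grain export",
         d3 = "ukraine missile defence",
         d4 = "ukraine peace talks")

dfmt_toy <- dfm(tokens(txt))

# Drop features that occur in more than 50% of the documents
dfmt_trimmed <- dfm_trim(dfmt_toy, max_docfreq = 0.5, docfreq_type = "prop")

docfreq(dfmt_toy)["ukraine"]  # document frequency before trimming
featnames(dfmt_trimmed)       # "ukraine" is no longer among the features
```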

Standard LDA

You can fit LDA on the DFM simply by setting the number of topics to identify (k = 10). When verbose = TRUE, the function shows the progress of the inference through iterations. Fitting LDA on a large corpus takes a long time, but the distributed algorithm can speed up the analysis dramatically.

lda <- textmodel_lda(dfmt, k = 10, verbose = TRUE)
#> Fitting LDA with 10 topics
#>  ...initializing
#>  ...Gibbs sampling in 2000 iterations
#>  ......iteration 100 elapsed time: 31.30 seconds (delta: 0.07%)
#>  ......iteration 200 elapsed time: 59.32 seconds (delta: 0.06%)
#>  ......iteration 300 elapsed time: 87.70 seconds (delta: 0.02%)
#>  ......iteration 400 elapsed time: 116.32 seconds (delta: 0.09%)
#>  ......iteration 500 elapsed time: 144.43 seconds (delta: -0.01%)
#>  ......iteration 600 elapsed time: 172.24 seconds (delta: -0.02%)
#>  ......iteration 700 elapsed time: 200.09 seconds (delta: -0.04%)
#>  ......iteration 800 elapsed time: 228.88 seconds (delta: -0.00%)
#>  ......iteration 900 elapsed time: 259.56 seconds (delta: -0.02%)
#>  ......iteration 1000 elapsed time: 287.79 seconds (delta: 0.02%)
#>  ......iteration 1100 elapsed time: 315.80 seconds (delta: -0.06%)
#>  ......iteration 1200 elapsed time: 348.42 seconds (delta: 0.01%)
#>  ......iteration 1300 elapsed time: 377.44 seconds (delta: 0.01%)
#>  ......iteration 1400 elapsed time: 406.29 seconds (delta: 0.02%)
#>  ......iteration 1500 elapsed time: 434.34 seconds (delta: -0.01%)
#>  ......iteration 1600 elapsed time: 462.74 seconds (delta: 0.04%)
#>  ......iteration 1700 elapsed time: 491.06 seconds (delta: -0.03%)
#>  ......iteration 1800 elapsed time: 519.67 seconds (delta: -0.00%)
#>  ......iteration 1900 elapsed time: 548.42 seconds (delta: 0.02%)
#>  ......iteration 2000 elapsed time: 578.71 seconds (delta: -0.01%)
#>  ...computing theta and phi
#>  ...complete
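
If a full run like the one above is too slow, the following sketch shows how the run time might be reduced. It assumes that the installed version of seededlda provides the batch_size and auto_iter arguments of textmodel_lda() (check ?textmodel_lda). It uses the inaugural address corpus shipped with quanteda instead of the Sputnik corpus so that it runs without the data file. The values of k, max_iter and batch_size are arbitrary choices for illustration.

```r
library(seededlda)
library(quanteda)

# Small built-in corpus as a stand-in for the Sputnik articles
toks_small <- tokens(data_corpus_inaugural, remove_punct = TRUE,
                     remove_symbols = TRUE, remove_numbers = TRUE)
dfmt_small <- dfm(toks_small) |>
    dfm_remove(stopwords("en"))

# batch_size = 0.5 requests batches of half the corpus each (assumed to
# enable the distributed algorithm); auto_iter = TRUE is assumed to stop
# the sampling before max_iter once the model has converged
lda_small <- textmodel_lda(dfmt_small, k = 5, max_iter = 500,
                           batch_size = 0.5, auto_iter = TRUE)

dim(lda_small$theta)  # documents x topics
dim(lda_small$phi)    # topics x features
```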

Topic terms

Once the model is fit, you can interpret the topics by reading their most salient words. terms() returns a matrix that shows the most frequent words in each topic at the top.

knitr::kable(terms(lda))
|topic1    |topic2       |topic3   |topic4     |topic5      |topic6     |topic7      |topic8     |topic9     |topic10     |
|:---------|:------------|:--------|:----------|:-----------|:----------|:-----------|:----------|:----------|:-----------|
|photo     |german       |british  |taiwan     |india       |dpr        |per         |joins      |alliance   |joined      |
|azov      |stream       |johnson  |air        |grain       |lpr        |fuel        |iran       |lavrov     |hour        |
|mediabank |nord         |french   |systems    |indian      |civilians  |crude       |journalist |sweden     |trump       |
|human     |pipeline     |twitter  |missile    |cooperation |shelling   |imports     |israel     |finland    |talk        |
|killed    |commission   |truss    |missiles   |trade       |civilian   |production  |professor  |interests  |hunter      |
|battalion |company      |macron   |training   |turkey      |plant      |cap         |author     |say        |democrats   |
|refugees  |restrictions |boris    |equipment  |africa      |zaporozhye |cost        |saudi      |diplomatic |spoke       |
|police    |assets       |photo    |chinese    |african     |kherson    |costs       |dr         |proposals  |republicans |
|children  |berlin       |zelensky |biological |sea         |regions    |natural     |analyst    |clear      |congress    |
|crimes    |scholz       |leader   |pentagon   |summit      |kiev’s     |electricity |east       |membership |sean        |
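
terms() can also return a different number of words per topic, and the word probabilities behind the ranking are stored in the fitted object. The sketch below assumes that the n argument of terms() and the phi element (a topic-by-word matrix) are available in your version of seededlda. It uses quanteda's built-in inaugural address corpus with arbitrary settings, so the topics are not comparable to those in the table above.

```r
library(seededlda)
library(quanteda)

# Small built-in corpus as a stand-in for the Sputnik articles
dfmt_small <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
    dfm() |>
    dfm_remove(stopwords("en"))
lda_small <- textmodel_lda(dfmt_small, k = 4, max_iter = 200)

# Five top words per topic instead of the default ten
top5 <- terms(lda_small, n = 5)
print(top5)

# Values in phi for the five highest-ranked words of the first topic
phi <- lda_small$phi
head(sort(phi[1, ], decreasing = TRUE), 5)
```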

Document topics

You can also predict the topics of documents using topics(). I recommend extracting the document variables from the DFM stored in the fitted object (lda$data) and saving the topics in the resulting data.frame.

dat <- docvars(lda$data)
dat$topic <- topics(lda)
knitr::kable(head(dat[,c("date", "topic", "head")], 10))
|date       |topic   |head                                                                                                 |
|:----------|:-------|:----------------------------------------------------------------------------------------------------|
|2022-01-31 |topic9  |Biden: US Desires Diplomacy But ‘Ready No Matter What Happens’ If Ukraine Tensions Worsen            |
|2022-01-31 |topic2  |EU Trade Commissioner Says Nord Stream 2 ‘on Pause’ Pending Review of Compliance With European Laws  |
|2022-01-31 |topic9  |Russian Assets in UK May Be Frozen in Accordance With New Sanctions Bill, Foreign Secretary Says     |
|2022-01-31 |topic10 |Hunter Biden Was Reportedly Subpoenaed Over Dealings With China a Year Before Presidential Election  |
|2022-01-31 |topic9  |US Urges UN Security Council to Act on Ukraine Crisis as Russia Denies Invasion Claims               |
|2022-01-31 |topic9  |UN Security Council Holds Meeting on Ukrainian Crisis                                                |
|2022-01-31 |topic9  |The Depth of US Cold War Thinking and Murder of JFK                                                  |
|2022-01-31 |topic9  |UK PM Johnson Says He Will Tell Putin to ‘Step Back From the Brink’ in Ukraine                       |
|2022-01-31 |topic9  |Kremlin on UK Sanctions Threat: Attack on Russian Businesses Means There Will Be Retaliation         |
|2022-01-31 |topic9  |Ukraine Crisis Escalation Forces Boris Johnson to Cancel Visit to Japan, Reports Say                 |
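
Once the topics are stored alongside the document variables, they can be summarised like any other column. This is a minimal sketch on quanteda's built-in inaugural address corpus rather than the Sputnik data. The Party variable belongs to that corpus, and the model settings are arbitrary.

```r
library(seededlda)
library(quanteda)

# Small built-in corpus as a stand-in for the Sputnik articles
dfmt_small <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
    dfm() |>
    dfm_remove(stopwords("en"))
lda_small <- textmodel_lda(dfmt_small, k = 4, max_iter = 200)

# Keep the most likely topic next to the document variables
dat <- docvars(lda_small$data)
dat$topic <- topics(lda_small)

# Number of documents assigned to each topic
table(dat$topic)

# Cross-tabulate the topics with a document variable
table(dat$Party, dat$topic)
```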

References