
seededlda was created mainly for semi-supervised topic modeling, but it can perform unsupervised topic modeling too. On this page, I explain the basic functions of the package using unsupervised LDA (Latent Dirichlet Allocation) as an example; semi-supervised LDA is discussed on a separate page.

Preparation

We use a corpus of Sputnik articles about Ukraine in the examples. In the preprocessing, we remove grammatical words (stopwords("en")), email addresses ("*@*"), and words that occur in more than 10% of documents (max_docfreq = 0.1) from the document-feature matrix (DFM).

library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 8,063 documents, 58,664 features (99.77% sparse) and 4 docvars.
#>              features
#> docs          reiterated commitment diplomacy worsened reporters urge best
#>   s1092644731          1          1         8        1         2    1    1
#>   s1092643478          0          0         0        0         0    0    0
#>   s1092643372          0          0         0        0         0    0    0
#>   s1092643164          0          0         0        0         0    0    0
#>   s1092641413          0          0         6        0         0    0    0
#>   s1092640142          0          0         0        0         0    0    0
#>              features
#> docs          forward continuing buildup
#>   s1092644731       1          2       2
#>   s1092643478       0          0       0
#>   s1092643372       1          0       0
#>   s1092643164       0          0       0
#>   s1092641413       0          1       0
#>   s1092640142       0          0       0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,654 more features ]

Standard LDA

You can fit LDA on the DFM simply by setting the number of topics to identify (k = 10). When verbose = TRUE, it shows the progress of the inference through the iterations. It takes a long time to fit LDA on a large corpus, but the distributed algorithm will speed up your analysis dramatically.

lda <- textmodel_lda(dfmt, k = 10, verbose = TRUE)
#> Fitting LDA with 10 topics
#>  ...initializing
#>  ...Gibbs sampling in 2000 iterations
#>  ......iteration 100 elapsed time: 23.99 seconds (delta: 0.13%)
#>  ......iteration 200 elapsed time: 52.57 seconds (delta: -0.01%)
#>  ......iteration 300 elapsed time: 93.53 seconds (delta: 0.05%)
#>  ......iteration 400 elapsed time: 134.12 seconds (delta: -0.01%)
#>  ......iteration 500 elapsed time: 174.18 seconds (delta: 0.01%)
#>  ......iteration 600 elapsed time: 213.07 seconds (delta: 0.00%)
#>  ......iteration 700 elapsed time: 250.18 seconds (delta: -0.01%)
#>  ......iteration 800 elapsed time: 288.05 seconds (delta: 0.02%)
#>  ......iteration 900 elapsed time: 328.06 seconds (delta: 0.03%)
#>  ......iteration 1000 elapsed time: 370.19 seconds (delta: 0.01%)
#>  ......iteration 1100 elapsed time: 410.81 seconds (delta: 0.02%)
#>  ......iteration 1200 elapsed time: 453.34 seconds (delta: 0.01%)
#>  ......iteration 1300 elapsed time: 492.66 seconds (delta: 0.02%)
#>  ......iteration 1400 elapsed time: 529.79 seconds (delta: 0.02%)
#>  ......iteration 1500 elapsed time: 566.95 seconds (delta: -0.02%)
#>  ......iteration 1600 elapsed time: 604.24 seconds (delta: 0.02%)
#>  ......iteration 1700 elapsed time: 641.66 seconds (delta: 0.00%)
#>  ......iteration 1800 elapsed time: 679.14 seconds (delta: 0.01%)
#>  ......iteration 1900 elapsed time: 716.78 seconds (delta: -0.01%)
#>  ......iteration 2000 elapsed time: 754.18 seconds (delta: -0.02%)
#>  ...computing theta and phi
#>  ...complete
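Gibbs sampling is stochastic, so set a seed before fitting if you need reproducible topics. The sketch below also shows the main tuning arguments; the argument names (max_iter, alpha, beta, batch_size) follow the seededlda documentation as I remember it, so verify them against your installed version before relying on them.

```r
set.seed(1234)  # Gibbs sampling is stochastic; fix the RNG seed for reproducibility

# Sketch of tuning the sampler; argument names are assumptions to check
# against ?textmodel_lda in your installed version of seededlda.
lda <- textmodel_lda(
    dfmt, k = 10,
    max_iter = 2000,   # number of Gibbs sampling iterations
    alpha = 0.5,       # document-topic prior
    beta = 0.1,        # topic-word prior
    batch_size = 0.2,  # fit on 20% batches in parallel (distributed LDA)
    verbose = TRUE
)
```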

Topic terms

Once the model is fit, you can interpret the topics by reading the most salient words in them. terms() shows the words that are most frequent in each topic, with the most frequent at the top of the matrix.

knitr::kable(terms(lda))
|topic1    |topic2    |topic3     |topic4       |topic5    |topic6       |topic7      |topic8     |topic9      |topic10 |
|:---------|:---------|:----------|:------------|:---------|:------------|:-----------|:----------|:-----------|:-------|
|per       |azov      |joins      |iran         |taiwan    |alliance     |india       |air        |trump       |british |
|stream    |mediabank |joined     |biological   |chinese   |lavrov       |grain       |dpr        |congress    |johnson |
|fuel      |civilians |journalist |french       |beijing   |finland      |indian      |systems    |hunter      |twitter |
|nord      |video     |hour       |saudi        |interests |sweden       |cooperation |missile    |democrats   |truss   |
|crude     |killed    |talk       |israel       |say       |german       |africa      |lpr        |republicans |boris   |
|companies |mariupol  |author     |research     |soviet    |negotiations |sea         |missiles   |bill        |photo   |
|pipeline  |photo     |analyst    |intelligence |strategic |peskov       |trade       |zaporozhye |republican  |sunak   |
|imports   |human     |spoke      |activities   |know      |zelensky     |african     |plant      |election    |london  |
|natural   |battalion |mark       |space        |course    |poland       |summit      |training   |senate      |pm      |
|cap       |crimes    |sean       |macron       |change    |proposals    |south       |equipment  |biden’s     |liz     |
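If ten words are not enough to label a topic, terms() accepts n to return more words per topic. A short sketch, assuming the fitted object lda from above:

```r
# Show the 20 most frequent words per topic; n is the number of terms
# returned for each topic (the default is 10).
terms(lda, n = 20)
```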

Document topics

You can also predict the topics of documents using topics(). I recommend extracting the document variables from the DFM saved in the fitted object (lda$data) and adding the topics to the resulting data.frame.

dat <- docvars(lda$data)
dat$topic <- topics(lda)
knitr::kable(head(dat[,c("date", "topic", "head")], 10))
|date       |topic   |head |
|:----------|:-------|:----|
|2022-01-31 |topic6  |Biden: US Desires Diplomacy But ‘Ready No Matter What Happens’ If Ukraine Tensions Worsen |
|2022-01-31 |topic1  |EU Trade Commissioner Says Nord Stream 2 ‘on Pause’ Pending Review of Compliance With European Laws |
|2022-01-31 |topic1  |Russian Assets in UK May Be Frozen in Accordance With New Sanctions Bill, Foreign Secretary Says |
|2022-01-31 |topic9  |Hunter Biden Was Reportedly Subpoenaed Over Dealings With China a Year Before Presidential Election |
|2022-01-31 |topic6  |US Urges UN Security Council to Act on Ukraine Crisis as Russia Denies Invasion Claims |
|2022-01-31 |topic6  |UN Security Council Holds Meeting on Ukrainian Crisis |
|2022-01-31 |topic5  |The Depth of US Cold War Thinking and Murder of JFK |
|2022-01-31 |topic6  |UK PM Johnson Says He Will Tell Putin to ‘Step Back From the Brink’ in Ukraine |
|2022-01-31 |topic10 |Kremlin on UK Sanctions Threat: Attack on Russian Businesses Means There Will Be Retaliation |
|2022-01-31 |topic6  |Ukraine Crisis Escalation Forces Boris Johnson to Cancel Visit to Japan, Reports Say |
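With the topics stored in dat, you can summarize them with ordinary base R functions. A minimal sketch, assuming dat has the date and topic columns created above, that computes monthly topic proportions:

```r
# Aggregate predicted topics by month; dat is the data.frame created above
# with a Date column "date" and a factor column "topic".
dat$month <- format(dat$date, "%Y-%m")
tb <- table(dat$month, dat$topic)
prop <- proportions(tb, margin = 1)  # each row sums to 1
round(head(prop, 3), 2)
```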
