Introduction to seededlda
The package for semi-supervised topic modeling
seededlda was created mainly for semi-supervised topic modeling, but it can also perform unsupervised topic modeling. On this page, I explain the basic functions of the package using unsupervised LDA (Latent Dirichlet Allocation) as an example; semi-supervised LDA is discussed on a separate page.
Preparation
We use the corpus of Sputnik articles about Ukraine in the examples. In the preprocessing, we remove grammatical words (stopwords("en")), email addresses ("*@*") and words that occur in more than 10% of documents (max_docfreq = 0.1) from the document-feature matrix (DFM).
library(seededlda)
library(quanteda)

# corpus of Sputnik articles about Ukraine
corp <- readRDS("data_corpus_sputnik2022.rds")

# tokenize, removing punctuation, symbols, numbers and URLs
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)

# construct the DFM and remove stopwords, email addresses and frequent words
dfmt <- dfm(toks) |>
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 8,063 documents, 58,664 features (99.77% sparse) and 4 docvars.
#> features
#> docs reiterated commitment diplomacy worsened reporters urge best
#> s1092644731 1 1 8 1 2 1 1
#> s1092643478 0 0 0 0 0 0 0
#> s1092643372 0 0 0 0 0 0 0
#> s1092643164 0 0 0 0 0 0 0
#> s1092641413 0 0 6 0 0 0 0
#> s1092640142 0 0 0 0 0 0 0
#> features
#> docs forward continuing buildup
#> s1092644731 1 2 2
#> s1092643478 0 0 0
#> s1092643372 1 0 0
#> s1092643164 0 0 0
#> s1092641413 0 1 0
#> s1092640142 0 0 0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,654 more features ]
Standard LDA
You can fit LDA on the DFM simply by setting the number of topics to identify (k = 10). When verbose = TRUE, it shows the progress of the inference through the iterations. It takes a long time to fit LDA on a large corpus, but the distributed algorithm will speed up your analysis dramatically (a sketch follows the output below).
lda <- textmodel_lda(dfmt, k = 10, verbose = TRUE)
#> Fitting LDA with 10 topics
#> ...initializing
#> ...Gibbs sampling in 2000 iterations
#> ......iteration 100 elapsed time: 23.99 seconds (delta: 0.13%)
#> ......iteration 200 elapsed time: 52.57 seconds (delta: -0.01%)
#> ......iteration 300 elapsed time: 93.53 seconds (delta: 0.05%)
#> ......iteration 400 elapsed time: 134.12 seconds (delta: -0.01%)
#> ......iteration 500 elapsed time: 174.18 seconds (delta: 0.01%)
#> ......iteration 600 elapsed time: 213.07 seconds (delta: 0.00%)
#> ......iteration 700 elapsed time: 250.18 seconds (delta: -0.01%)
#> ......iteration 800 elapsed time: 288.05 seconds (delta: 0.02%)
#> ......iteration 900 elapsed time: 328.06 seconds (delta: 0.03%)
#> ......iteration 1000 elapsed time: 370.19 seconds (delta: 0.01%)
#> ......iteration 1100 elapsed time: 410.81 seconds (delta: 0.02%)
#> ......iteration 1200 elapsed time: 453.34 seconds (delta: 0.01%)
#> ......iteration 1300 elapsed time: 492.66 seconds (delta: 0.02%)
#> ......iteration 1400 elapsed time: 529.79 seconds (delta: 0.02%)
#> ......iteration 1500 elapsed time: 566.95 seconds (delta: -0.02%)
#> ......iteration 1600 elapsed time: 604.24 seconds (delta: 0.02%)
#> ......iteration 1700 elapsed time: 641.66 seconds (delta: 0.00%)
#> ......iteration 1800 elapsed time: 679.14 seconds (delta: 0.01%)
#> ......iteration 1900 elapsed time: 716.78 seconds (delta: -0.01%)
#> ......iteration 2000 elapsed time: 754.18 seconds (delta: -0.02%)
#> ...computing theta and phi
#> ...complete
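To illustrate how the distributed algorithm mentioned above might be used, here is a minimal sketch. It assumes that your version of textmodel_lda() accepts a batch_size argument to enable the distributed Gibbs sampler; check ?textmodel_lda before relying on it.
# a sketch, assuming textmodel_lda() supports batch_size in your version of seededlda
set.seed(1234)  # fix the RNG state so the random initialization is reproducible
lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)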
Topic terms
Once the model is fit, you can interpret the topics by reading the most salient words in them. terms() returns a matrix in which the words that are most frequent in each topic appear at the top.
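The chunk that produced the table below is not shown in this extract; a minimal sketch using terms() and knitr::kable() would be:
# top 10 words of each topic, rendered as a table
terms(lda, 10) |>
    knitr::kable()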
topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 |
---|---|---|---|---|---|---|---|---|---|
per | azov | joins | iran | taiwan | alliance | india | air | trump | british |
stream | mediabank | joined | biological | chinese | lavrov | grain | dpr | congress | johnson |
fuel | civilians | journalist | french | beijing | finland | indian | systems | hunter | |
nord | video | hour | saudi | interests | sweden | cooperation | missile | democrats | truss |
crude | killed | talk | israel | say | german | africa | lpr | republicans | boris |
companies | mariupol | author | research | soviet | negotiations | sea | missiles | bill | photo |
pipeline | photo | analyst | intelligence | strategic | peskov | trade | zaporozhye | republican | sunak |
imports | human | spoke | activities | know | zelensky | african | plant | election | london |
natural | battalion | mark | space | course | poland | summit | training | senate | pm |
cap | crimes | sean | macron | change | proposals | south | equipment | biden’s | liz |
Document topics
You can also predict the topics of documents using topics(). I recommend extracting the document variables from the DFM stored in the fitted object (lda$data) and saving the predicted topics in that data frame.
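The chunk behind the table below is likewise not shown; a minimal sketch, assuming the corpus carries "date" and "head" (headline) document variables as the table suggests, would be:
dat <- docvars(lda$data)  # document variables from the DFM in the fitted object
dat$topic <- topics(lda)  # most likely topic of each document
knitr::kable(head(dat[, c("date", "topic", "head")], 10))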
date | topic | head |
---|---|---|
2022-01-31 | topic6 | Biden: US Desires Diplomacy But ‘Ready No Matter What Happens’ If Ukraine Tensions Worsen |
2022-01-31 | topic1 | EU Trade Commissioner Says Nord Stream 2 ‘on Pause’ Pending Review of Compliance With European Laws |
2022-01-31 | topic1 | Russian Assets in UK May Be Frozen in Accordance With New Sanctions Bill, Foreign Secretary Says |
2022-01-31 | topic9 | Hunter Biden Was Reportedly Subpoenaed Over Dealings With China a Year Before Presidential Election |
2022-01-31 | topic6 | US Urges UN Security Council to Act on Ukraine Crisis as Russia Denies Invasion Claims |
2022-01-31 | topic6 | UN Security Council Holds Meeting on Ukrainian Crisis |
2022-01-31 | topic5 | The Depth of US Cold War Thinking and Murder of JFK |
2022-01-31 | topic6 | UK PM Johnson Says He Will Tell Putin to ‘Step Back From the Brink’ in Ukraine |
2022-01-31 | topic10 | Kremlin on UK Sanctions Threat: Attack on Russian Businesses Means There Will Be Retaliation |
2022-01-31 | topic6 | Ukraine Crisis Escalation Forces Boris Johnson to Cancel Visit to Japan, Reports Say |
References
- Heinrich, G. (2008). Parameter estimation for text analysis. http://www.arbylon.net/publications/text-est.pdf
- Watanabe, K., & Baturo, A. (2023). Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences. Social Science Computer Review. https://doi.org/10.1177/08944393231178605