Distributed LDA
Topic modeling with parallel computing
Source:vignettes/pkgdown/distributed.Rmd
Distributed LDA (Latent Dirichlet Allocation) can dramatically speed
up your analysis by using multiple processors on your computer. The
number of topics is small in this example, but the distributed algorithm
is especially effective at identifying many topics (k > 100)
in a large corpus.
Preparation
We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.
library(seededlda)
library(quanteda)
corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |>
dfm_remove(stopwords("en")) |>
dfm_remove("*@*") |>
dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 8,063 documents, 58,664 features (99.77% sparse) and 4 docvars.
#> features
#> docs reiterated commitment diplomacy worsened reporters urge best
#> s1092644731 1 1 8 1 2 1 1
#> s1092643478 0 0 0 0 0 0 0
#> s1092643372 0 0 0 0 0 0 0
#> s1092643164 0 0 0 0 0 0 0
#> s1092641413 0 0 6 0 0 0 0
#> s1092640142 0 0 0 0 0 0 0
#> features
#> docs forward continuing buildup
#> s1092644731 1 2 2
#> s1092643478 0 0 0
#> s1092643372 1 0 0
#> s1092643164 0 0 0
#> s1092641413 0 1 0
#> s1092640142 0 0 0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,654 more features ]
Distributed LDA
When batch_size = 0.01, the distributed algorithm allocates 1% of the
documents in the corpus to each processor. It uses all available
processors by default, but you can limit their number through
options(seededlda_threads).
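For example, the number of threads can be restricted before fitting; the option name is the seededlda_threads option mentioned above, and the value 4 is chosen purely for illustration:

```r
# Limit distributed LDA to 4 threads (the value 4 is arbitrary)
options(seededlda_threads = 4)

# Fit as before; with batch_size = 0.01, each thread still receives
# 1% of the documents per batch
lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)
```

Restore the default by unsetting the option, e.g. options(seededlda_threads = NULL).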
lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)
#> Fitting LDA with 10 topics
#> ...initializing
#> ...using up to 16 threads for distributed computing
#> ......allocating 81 documents to each thread
#> ...Gibbs sampling in 2000 iterations
#> ......iteration 100 elapsed time: 7.59 seconds (delta: 0.48%)
#> ......iteration 200 elapsed time: 14.32 seconds (delta: 0.02%)
#> ......iteration 300 elapsed time: 20.94 seconds (delta: -0.00%)
#> ......iteration 400 elapsed time: 27.55 seconds (delta: -0.00%)
#> ......iteration 500 elapsed time: 34.13 seconds (delta: -0.01%)
#> ......iteration 600 elapsed time: 40.78 seconds (delta: 0.03%)
#> ......iteration 700 elapsed time: 47.46 seconds (delta: 0.08%)
#> ......iteration 800 elapsed time: 54.06 seconds (delta: -0.10%)
#> ......iteration 900 elapsed time: 61.19 seconds (delta: -0.05%)
#> ......iteration 1000 elapsed time: 69.33 seconds (delta: -0.01%)
#> ......iteration 1100 elapsed time: 77.42 seconds (delta: -0.02%)
#> ......iteration 1200 elapsed time: 85.56 seconds (delta: -0.03%)
#> ......iteration 1300 elapsed time: 93.65 seconds (delta: 0.02%)
#> ......iteration 1400 elapsed time: 101.82 seconds (delta: 0.00%)
#> ......iteration 1500 elapsed time: 110.06 seconds (delta: -0.07%)
#> ......iteration 1600 elapsed time: 118.29 seconds (delta: 0.05%)
#> ......iteration 1700 elapsed time: 126.52 seconds (delta: -0.05%)
#> ......iteration 1800 elapsed time: 134.76 seconds (delta: -0.02%)
#> ......iteration 1900 elapsed time: 143.01 seconds (delta: 0.05%)
#> ......iteration 2000 elapsed time: 151.16 seconds (delta: -0.03%)
#> ...computing theta and phi
#> ...complete
Despite the much shorter execution time, it identifies topic terms very similar to those from standard LDA.
topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 |
---|---|---|---|---|---|---|---|---|---|
hour | india | per | dpr | german | taiwan | photo | joins | systems | lavrov |
spoke | grain | fuel | lpr | stream | trump | azov | johnson | air | alliance |
biological | indian | crude | civilians | nord | mediabank | journalist | finland | interests | |
discussed | cooperation | imports | shelling | french | congress | killed | author | missiles | zelensky |
lines | trade | cap | civilian | pipeline | hunter | human | joined | sweden | say |
lee | africa | production | plant | macron | republicans | police | professor | missile | proposals |
mark | iran | cost | zaporozhye | commission | democrats | refugees | sean | training | diplomatic |
fault | turkey | companies | kherson | france | republican | battalion | jacquie | equipment | negotiations |
research | african | natural | regions | restrictions | election | children | truss | aircraft | course |
joined | south | costs | strikes | company | bill | nazi | university | army | clear |
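The table above is produced from the fitted model with terms(), which returns the most probable terms for each topic; topics() gives each document's most likely topic. A minimal sketch, assuming the lda_dist object fitted in the chunk above:

```r
# Top 10 terms per topic, as shown in the table above
terms(lda_dist, n = 10)

# Most likely topic for each document
head(topics(lda_dist))

# Underlying distributions computed at the end of sampling:
# phi is topics x features, theta is documents x topics
dim(lda_dist$phi)
dim(lda_dist$theta)
```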
Distributed LDA with convergence detection
By default, the algorithm fits LDA with as many as 2000 iterations
for reliable results, but we can reduce the number of iterations using
the convergence detection mechanism to further speed up the analysis.
When auto_iter = TRUE, the algorithm stops inference on
convergence (delta < 0) and returns the result.
lda_auto <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, auto_iter = TRUE,
verbose = TRUE)
#> Fitting LDA with 10 topics
#> ...initializing
#> ...using up to 16 threads for distributed computing
#> ......allocating 81 documents to each thread
#> ...Gibbs sampling in up to 2000 iterations
#> ......iteration 100 elapsed time: 9.00 seconds (delta: 0.38%)
#> ......iteration 200 elapsed time: 16.98 seconds (delta: 0.09%)
#> ......iteration 300 elapsed time: 25.14 seconds (delta: 0.02%)
#> ......iteration 400 elapsed time: 33.20 seconds (delta: -0.06%)
#> ...computing theta and phi
#> ...complete
topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 |
---|---|---|---|---|---|---|---|---|---|
per | india | dpr | joins | trump | azov | bank | lavrov | british | |
fuel | taiwan | air | joined | americans | mediabank | biological | grain | hunter | johnson |
stream | indian | systems | journalist | say | photo | assets | negotiations | photo | truss |
nord | cooperation | missile | hour | congress | human | research | alliance | post | boris |
crude | chinese | lpr | talk | democrats | killed | central | agreement | intelligence | london |
pipeline | finland | missiles | author | it’s | battalion | banks | agreements | video | sunak |
imports | sweden | artillery | analyst | democratic | children | federal | peskov | investigation | liz |
natural | beijing | shelling | mark | republicans | civilians | interest | diplomatic | screenshot | tax |
cap | summit | plant | spoke | know | crimes | companies | proposals | rt | britain |
exports | south | zaporozhye | sean | change | soviet | rate | zelensky | alleged | pm |
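To check that early stopping has not hurt topic quality, you can compare the two fitted models, for example by the overlap of their top terms. This is a rough heuristic sketch, not a formal test, and it assumes both lda_dist and lda_auto from the chunks above:

```r
# Top-10 term matrices (rows = ranks, columns = topics)
trm_dist <- terms(lda_dist, n = 10)
trm_auto <- terms(lda_auto, n = 10)

# For each topic of the auto-stopped model, count the largest number of
# shared top terms with any topic of the full 2000-iteration model;
# values near 10 indicate closely matching topics
overlap <- apply(trm_auto, 2, function(x) {
    max(apply(trm_dist, 2, function(y) length(intersect(x, y))))
})
overlap
```

Note that topic numbering is arbitrary across runs, which is why each topic is matched against its best counterpart rather than compared position by position.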
References
- Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed Algorithms for Topic Models. The Journal of Machine Learning Research, 10, 1801–1828.
- Watanabe, K. (2023). Speed Up Topic Modeling: Distributed Computing and Convergence Detection for LDA, working paper.