Distributed LDA (Latent Dirichlet Allocation) can dramatically speed up your analysis by using multiple processors on your computer. The number of topics is small in this example, but the distributed algorithm is highly effective in identifying many topics (k > 100) in a large corpus.

Preparation

We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.

library(seededlda)
library(quanteda)

corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |> 
    dfm_remove(stopwords("en")) |>
    dfm_remove("*@*") |>
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 8,063 documents, 58,541 features (99.77% sparse) and 4 docvars.
#>              features
#> docs          reiterated commitment diplomacy worsened reporters urge best
#>   s1092644731          1          1         8        1         2    1    1
#>   s1092643478          0          0         0        0         0    0    0
#>   s1092643372          0          0         0        0         0    0    0
#>   s1092643164          0          0         0        0         0    0    0
#>   s1092641413          0          0         6        0         0    0    0
#>   s1092640142          0          0         0        0         0    0    0
#>              features
#> docs          forward continuing buildup
#>   s1092644731       1          2       2
#>   s1092643478       0          0       0
#>   s1092643372       1          0       0
#>   s1092643164       0          0       0
#>   s1092641413       0          1       0
#>   s1092640142       0          0       0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,531 more features ]

Distributed LDA

When batch_size = 0.01, the distributed algorithm (Newman et al., 2009) allocates 1% of the documents in the corpus to each processor. It uses all the available processors by default, but you can limit their number through options(seededlda_threads).
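
For example, a minimal sketch that caps the sampler at four threads before fitting (the value 4 is arbitrary; adjust it to your machine):

options(seededlda_threads = 4)  # use at most 4 threads in textmodel_lda()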

lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)
#> Fitting LDA with 10 topics
#>  ...initializing
#>  ...using up to 16 threads for distributed computing
#>  ......allocating 81 documents to each thread
#>  ...Gibbs sampling in 2000 iterations
#>  ......iteration 100 elapsed time: 7.59 seconds (delta: 0.48%)
#>  ......iteration 200 elapsed time: 14.32 seconds (delta: 0.02%)
#>  ......iteration 300 elapsed time: 20.94 seconds (delta: -0.00%)
#>  ......iteration 400 elapsed time: 27.55 seconds (delta: -0.00%)
#>  ......iteration 500 elapsed time: 34.13 seconds (delta: -0.01%)
#>  ......iteration 600 elapsed time: 40.78 seconds (delta: 0.03%)
#>  ......iteration 700 elapsed time: 47.46 seconds (delta: 0.08%)
#>  ......iteration 800 elapsed time: 54.06 seconds (delta: -0.10%)
#>  ......iteration 900 elapsed time: 61.19 seconds (delta: -0.05%)
#>  ......iteration 1000 elapsed time: 69.33 seconds (delta: -0.01%)
#>  ......iteration 1100 elapsed time: 77.42 seconds (delta: -0.02%)
#>  ......iteration 1200 elapsed time: 85.56 seconds (delta: -0.03%)
#>  ......iteration 1300 elapsed time: 93.65 seconds (delta: 0.02%)
#>  ......iteration 1400 elapsed time: 101.82 seconds (delta: 0.00%)
#>  ......iteration 1500 elapsed time: 110.06 seconds (delta: -0.07%)
#>  ......iteration 1600 elapsed time: 118.29 seconds (delta: 0.05%)
#>  ......iteration 1700 elapsed time: 126.52 seconds (delta: -0.05%)
#>  ......iteration 1800 elapsed time: 134.76 seconds (delta: -0.02%)
#>  ......iteration 1900 elapsed time: 143.01 seconds (delta: 0.05%)
#>  ......iteration 2000 elapsed time: 151.16 seconds (delta: -0.03%)
#>  ...computing theta and phi
#>  ...complete
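
For reference, a sketch of the corresponding non-distributed fit; leaving batch_size at its default should run the standard sequential sampler on the same corpus, which takes considerably longer:

# standard LDA for comparison; no batch_size, so no distributed computing
lda <- textmodel_lda(dfmt, k = 10, verbose = TRUE)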

Despite the much shorter execution time, the distributed algorithm identifies topic terms very similar to those of the standard LDA.

knitr::kable(terms(lda_dist))
|topic1    |topic2     |topic3    |topic4    |topic5      |topic6     |topic7   |topic8    |topic9   |topic10     |
|:---------|:----------|:---------|:---------|:-----------|:----------|:--------|:---------|:--------|:-----------|
|hour      |india      |per       |dpr       |german      |taiwan     |photo    |joins     |systems  |lavrov      |
|spoke     |grain      |fuel      |lpr       |stream      |trump      |azov     |johnson   |air      |alliance    |
|biological|indian     |crude     |civilians |nord        |twitter    |mediabank|journalist|finland  |interests   |
|discussed |cooperation|imports   |shelling  |french      |congress   |killed   |author    |missiles |zelensky    |
|lines     |trade      |cap       |civilian  |pipeline    |hunter     |human    |joined    |sweden   |say         |
|lee       |africa     |production|plant     |macron      |republicans|police   |professor |missile  |proposals   |
|mark      |iran       |cost      |zaporozhye|commission  |democrats  |refugees |sean      |training |diplomatic  |
|fault     |turkey     |companies |kherson   |france      |republican |battalion|jacquie   |equipment|negotiations|
|research  |african    |natural   |regions   |restrictions|election   |children |truss     |aircraft |course      |
|joined    |south      |costs     |strikes   |company     |bill       |nazi     |university|army     |clear       |
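
The fitted object can be used like any other LDA result from the package. As a minimal sketch, topics() assigns each document its most likely topic, which table() can then summarize:

topic <- topics(lda_dist)  # most likely topic for each document
table(topic)               # number of documents per topic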

Distributed LDA with convergence detection

By default, the algorithm fits LDA through as many as 2000 iterations for reliable results, but we can minimize the number of iterations using the convergence detection mechanism (Watanabe, 2023) to further speed up the analysis. When auto_iter = TRUE, the algorithm stops inference on convergence (delta < 0) and returns the result.

lda_auto <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, auto_iter = TRUE,
                          verbose = TRUE)
#> Fitting LDA with 10 topics
#>  ...initializing
#>  ...using up to 16 threads for distributed computing
#>  ......allocating 81 documents to each thread
#>  ...Gibbs sampling in up to 2000 iterations
#>  ......iteration 100 elapsed time: 9.00 seconds (delta: 0.38%)
#>  ......iteration 200 elapsed time: 16.98 seconds (delta: 0.09%)
#>  ......iteration 300 elapsed time: 25.14 seconds (delta: 0.02%)
#>  ......iteration 400 elapsed time: 33.20 seconds (delta: -0.06%)
#>  ...computing theta and phi
#>  ...complete
knitr::kable(terms(lda_auto))
|topic1  |topic2     |topic3    |topic4    |topic5     |topic6   |topic7    |topic8      |topic9       |topic10|
|:-------|:----------|:---------|:---------|:----------|:--------|:---------|:-----------|:------------|:------|
|per     |india      |dpr       |joins     |trump      |azov     |bank      |lavrov      |twitter      |british|
|fuel    |taiwan     |air       |joined    |americans  |mediabank|biological|grain       |hunter       |johnson|
|stream  |indian     |systems   |journalist|say        |photo    |assets    |negotiations|photo        |truss  |
|nord    |cooperation|missile   |hour      |congress   |human    |research  |alliance    |post         |boris  |
|crude   |chinese    |lpr       |talk      |democrats  |killed   |central   |agreement   |intelligence |london |
|pipeline|finland    |missiles  |author    |it’s       |battalion|banks     |agreements  |video        |sunak  |
|imports |sweden     |artillery |analyst   |democratic |children |federal   |peskov      |investigation|liz    |
|natural |beijing    |shelling  |mark      |republicans|civilians|interest  |diplomatic  |screenshot   |tax    |
|cap     |summit     |plant     |spoke     |know       |crimes   |companies |proposals   |rt           |britain|
|exports |south      |zaporozhye|sean      |change     |soviet   |rate      |zelensky    |alleged      |pm     |
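
To compare the quality of the two fits more formally, one option is the package's divergence() function, a sketch assuming it is available in your version of seededlda; higher scores indicate more distinct topics:

divergence(lda_dist)  # topic separation of the full 2000-iteration fit
divergence(lda_auto)  # topic separation of the early-stopped fit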

References

  • Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed Algorithms for Topic Models. The Journal of Machine Learning Research, 10, 1801–1828.
  • Watanabe, K. (2023). Speed Up Topic Modeling: Distributed Computing and Convergence Detection for LDA. Working paper.