Distributed LDA
Topic modeling with parallel computing
Source:vignettes/pkgdown/distributed.Rmd
Distributed LDA (Latent Dirichlet Allocation) can dramatically speed
up your analysis by using multiple processors on your computer. The
number of topics is small in this example, but the distributed algorithm
is especially effective at identifying many topics (k > 100)
in a large corpus.
Preparation
We prepare the Sputnik corpus on Ukraine in the same way as in the introduction.
library(seededlda)
library(quanteda)
corp <- readRDS("data_corpus_sputnik2022.rds")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)
dfmt <- dfm(toks) |>
dfm_remove(stopwords("en")) |>
dfm_remove("*@*") |>
dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 8,063 documents, 58,664 features (99.77% sparse) and 4 docvars.
#> features
#> docs reiterated commitment diplomacy worsened reporters urge best
#> s1092644731 1 1 8 1 2 1 1
#> s1092643478 0 0 0 0 0 0 0
#> s1092643372 0 0 0 0 0 0 0
#> s1092643164 0 0 0 0 0 0 0
#> s1092641413 0 0 6 0 0 0 0
#> s1092640142 0 0 0 0 0 0 0
#> features
#> docs forward continuing buildup
#> s1092644731 1 2 2
#> s1092643478 0 0 0
#> s1092643372 1 0 0
#> s1092643164 0 0 0
#> s1092641413 0 1 0
#> s1092640142 0 0 0
#> [ reached max_ndoc ... 8,057 more documents, reached max_nfeat ... 58,654 more features ]
Distributed LDA
When batch_size = 0.01, the distributed algorithm allocates 1% of the
documents in the corpus to each processor. It uses all available
processors by default, but you can limit their number through
options(seededlda_threads).
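For example, the number of threads can be restricted before fitting; the option name is the seededlda_threads option mentioned above, and the value 4 is chosen purely for illustration:

```r
# Limit distributed LDA to 4 threads (the value 4 is arbitrary)
options(seededlda_threads = 4)

# Fit as before; with batch_size = 0.01, each thread still receives
# 1% of the documents per batch
lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)
```

Restore the default by unsetting the option, e.g. options(seededlda_threads = NULL).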
lda_dist <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, verbose = TRUE)
#> Fitting LDA with 10 topics
#> ...initializing
#> ...using up to 16 threads for distributed computing
#> ......allocating 81 documents to each thread
#> ...Gibbs sampling in 2000 iterations
#> ......iteration 100 elapsed time: 7.59 seconds (delta: 0.48%)
#> ......iteration 200 elapsed time: 14.32 seconds (delta: 0.02%)
#> ......iteration 300 elapsed time: 20.94 seconds (delta: -0.00%)
#> ......iteration 400 elapsed time: 27.55 seconds (delta: -0.00%)
#> ......iteration 500 elapsed time: 34.13 seconds (delta: -0.01%)
#> ......iteration 600 elapsed time: 40.78 seconds (delta: 0.03%)
#> ......iteration 700 elapsed time: 47.46 seconds (delta: 0.08%)
#> ......iteration 800 elapsed time: 54.06 seconds (delta: -0.10%)
#> ......iteration 900 elapsed time: 61.19 seconds (delta: -0.05%)
#> ......iteration 1000 elapsed time: 69.33 seconds (delta: -0.01%)
#> ......iteration 1100 elapsed time: 77.42 seconds (delta: -0.02%)
#> ......iteration 1200 elapsed time: 85.56 seconds (delta: -0.03%)
#> ......iteration 1300 elapsed time: 93.65 seconds (delta: 0.02%)
#> ......iteration 1400 elapsed time: 101.82 seconds (delta: 0.00%)
#> ......iteration 1500 elapsed time: 110.06 seconds (delta: -0.07%)
#> ......iteration 1600 elapsed time: 118.29 seconds (delta: 0.05%)
#> ......iteration 1700 elapsed time: 126.52 seconds (delta: -0.05%)
#> ......iteration 1800 elapsed time: 134.76 seconds (delta: -0.02%)
#> ......iteration 1900 elapsed time: 143.01 seconds (delta: 0.05%)
#> ......iteration 2000 elapsed time: 151.16 seconds (delta: -0.03%)
#> ...computing theta and phi
#> ...complete
Despite the much shorter execution time, it identifies topic terms very similar to those from standard LDA.
topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 |
---|---|---|---|---|---|---|---|---|---|
hour | india | per | dpr | german | taiwan | photo | joins | systems | lavrov |
spoke | grain | fuel | lpr | stream | trump | azov | johnson | air | alliance |
biological | indian | crude | civilians | nord | mediabank | journalist | finland | interests | |
discussed | cooperation | imports | shelling | french | congress | killed | author | missiles | zelensky |
lines | trade | cap | civilian | pipeline | hunter | human | joined | sweden | say |
lee | africa | production | plant | macron | republicans | police | professor | missile | proposals |
mark | iran | cost | zaporozhye | commission | democrats | refugees | sean | training | diplomatic |
fault | turkey | companies | kherson | france | republican | battalion | jacquie | equipment | negotiations |
research | african | natural | regions | restrictions | election | children | truss | aircraft | course |
joined | south | costs | strikes | company | bill | nazi | university | army | clear |
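The table above is produced from the fitted model with terms(), which returns the most probable terms for each topic; topics() gives each document's most likely topic. A minimal sketch, assuming the lda_dist object fitted in the chunk above:

```r
# Top 10 terms per topic, as shown in the table above
terms(lda_dist, n = 10)

# Most likely topic for each document
head(topics(lda_dist))

# Underlying distributions computed at the end of sampling:
# phi is topics x features, theta is documents x topics
dim(lda_dist$phi)
dim(lda_dist$theta)
```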
Distributed LDA with convergence detection
By default, the algorithm fits LDA with as many as 2000 iterations
for reliable results, but we can reduce the number of iterations using
the convergence detection mechanism to further speed up the analysis.
When auto_iter = TRUE, the algorithm stops inference on
convergence (delta < 0) and returns the result.
lda_auto <- textmodel_lda(dfmt, k = 10, batch_size = 0.01, auto_iter = TRUE,
verbose = TRUE)
#> Fitting LDA with 10 topics
#> ...initializing
#> ...using up to 16 threads for distributed computing
#> ......allocating 81 documents to each thread
#> ...Gibbs sampling in up to 2000 iterations
#> ......iteration 100 elapsed time: 9.00 seconds (delta: 0.38%)
#> ......iteration 200 elapsed time: 16.98 seconds (delta: 0.09%)
#> ......iteration 300 elapsed time: 25.14 seconds (delta: 0.02%)
#> ......iteration 400 elapsed time: 33.20 seconds (delta: -0.06%)
#> ...computing theta and phi
#> ...complete
topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 |
---|---|---|---|---|---|---|---|---|---|
per | india | dpr | joins | trump | azov | bank | lavrov | british | |
fuel | taiwan | air | joined | americans | mediabank | biological | grain | hunter | johnson |
stream | indian | systems | journalist | say | photo | assets | negotiations | photo | truss |
nord | cooperation | missile | hour | congress | human | research | alliance | post | boris |
crude | chinese | lpr | talk | democrats | killed | central | agreement | intelligence | london |
pipeline | finland | missiles | author | it’s | battalion | banks | agreements | video | sunak |
imports | sweden | artillery | analyst | democratic | children | federal | peskov | investigation | liz |
natural | beijing | shelling | mark | republicans | civilians | interest | diplomatic | screenshot | tax |
cap | summit | plant | spoke | know | crimes | companies | proposals | rt | britain |
exports | south | zaporozhye | sean | change | soviet | rate | zelensky | alleged | pm |
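To check that early stopping has not hurt topic quality, you can compare the two fitted models, for example by the overlap of their top terms. This is a rough heuristic sketch, not a formal test, and it assumes both lda_dist and lda_auto from the chunks above:

```r
# Top-10 term matrices (rows = ranks, columns = topics)
trm_dist <- terms(lda_dist, n = 10)
trm_auto <- terms(lda_auto, n = 10)

# For each topic of the auto-stopped model, count the largest number of
# shared top terms with any topic of the full 2000-iteration model;
# values near 10 indicate closely matching topics
overlap <- apply(trm_auto, 2, function(x) {
    max(apply(trm_dist, 2, function(y) length(intersect(x, y))))
})
overlap
```

Note that topic numbering is arbitrary across runs, which is why each topic is matched against its best counterpart rather than compared position by position.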
References
- Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed Algorithms for Topic Models. The Journal of Machine Learning Research, 10, 1801–1828.
- Watanabe, K. (2023). Speed Up Topic Modeling: Distributed Computing and Convergence Detection for LDA, working paper.