divergence()
computes the regularized topic divergence scores to help users
find the optimal number of topics for LDA.
Arguments
- x: an LDA model fitted by textmodel_seededlda() or textmodel_lda().
- min_size: the minimum size of topics for the regularized topic divergence. Ignored when regularize = FALSE.
- select: names of topics for which the divergence is computed.
- regularize: if TRUE, returns the regularized divergence.
- newdata: if provided, theta and phi are estimated through fresh Gibbs sampling.
- ...: additional arguments passed to textmodel_lda().
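A minimal usage sketch (not taken from the package documentation): comparing divergence scores across candidate numbers of topics. Here dfm_corp is a hypothetical quanteda document-feature matrix prepared beforehand.

library(seededlda)

ks <- c(5, 10, 15, 20)
scores <- sapply(ks, function(k) {
  lda <- textmodel_lda(dfm_corp, k = k)  # fit an unseeded LDA with k topics
  divergence(lda)                        # regularized divergence score
})
ks[which.max(scores)]  # the k with the highest score is a candidate optimum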
Details
divergence() computes the average Jensen-Shannon divergence
between all pairs of topic vectors in x$phi. The divergence score
is maximized when the chosen number of topics k is optimal (Deveaud et al.,
2014). The regularized divergence penalizes topics smaller than min_size
to avoid fragmentation (Watanabe & Baturo, 2023).
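For illustration only, the unregularized score could be sketched as below. This is not the package's internal code, and js_divergence and avg_divergence are hypothetical helpers; the package's own computation additionally applies the min_size regularization.

# Mean Jensen-Shannon divergence over all pairs of rows of phi,
# where each row is a topic's word distribution.
js_divergence <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) sum(a * log(a / b), na.rm = TRUE)  # KL divergence
  (kl(p, m) + kl(q, m)) / 2
}
avg_divergence <- function(phi) {
  pairs <- combn(nrow(phi), 2)  # all unordered topic pairs
  mean(apply(pairs, 2, function(ij) js_divergence(phi[ij[1], ], phi[ij[2], ])))
}
# avg_divergence(lda$phi)  # lda is a fitted textmodel_lda() object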
References
Deveaud, Romain et al. (2014). "Accurate and Effective Latent Concept Modeling for Ad Hoc Information Retrieval". Document Numérique. doi:10.3166/DN.17.1.61-84.
Watanabe, Kohei & Baturo, Alexander. (2023). "Seeded Sequential LDA: A Semi-supervised Algorithm for Topic-specific Analysis of Sentences". Social Science Computer Review. doi:10.1177/08944393231178605.