Implements unsupervised Latent Dirichlet allocation (LDA). Users can run
Sequential LDA by setting gamma > 0.
Usage
textmodel_lda(
x,
k = 10,
max_iter = 2000,
auto_iter = FALSE,
alpha = 0.5,
beta = 0.1,
gamma = 0,
model = NULL,
batch_size = 1,
verbose = quanteda_options("verbose")
)
Arguments
- x
the dfm on which the model will be fit.
- k
the number of topics.
- max_iter
the maximum number of iterations in Gibbs sampling.
- auto_iter
if TRUE, stops Gibbs sampling on convergence before reaching max_iter. See details.
- alpha
the values to smooth topic-document distribution.
- beta
the values to smooth topic-word distribution.
- gamma
a parameter to determine change of topics between sentences or paragraphs. When gamma > 0, Gibbs sampling of topics for the current document is affected by the previous document's topics.
- model
a fitted LDA model; if provided, textmodel_lda() inherits parameters from the existing model. See details.
- batch_size
splits the corpus into smaller batches (specified as a proportion of the documents) for distributed computing; it is disabled when a single batch includes all the documents (batch_size = 1.0). See details.
- verbose
logical; if TRUE, prints diagnostic information during fitting.
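As a rough illustration of the gamma argument, the sketch below fits the model on sentence-level documents, so that the topics of each sentence can be influenced by those of the preceding one. Splitting the reviews with corpus_reshape() and the value gamma = 0.5 are assumptions made for illustration only, not settings recommended by the package.

```r
library(seededlda)
library(quanteda)

# Assumption: each sentence is treated as one document
corp_sent <- corpus_reshape(head(data_corpus_moviereviews, 100), to = "sentences")
toks <- tokens(corp_sent, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE)
dfmt_sent <- dfm(toks)
dfmt_sent <- dfm_remove(dfmt_sent, stopwords("en"), min_nchar = 2)

set.seed(1234)
# gamma > 0 lets the previous document's topics affect the current one
lda_seq <- textmodel_lda(dfmt_sent, k = 6, max_iter = 200, gamma = 0.5)
terms(lda_seq, 5)
```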
Value
Returns a list of model parameters:
- k
the number of topics.
- last_iter
the number of iterations in Gibbs sampling.
- phi
the distribution of words over topics.
- theta
the distribution of topics over documents.
- words
the raw frequency count of words assigned to topics.
- data
the original input of x.
- call
the command used to execute the function.
- version
the version of the seededlda package.
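The elements listed above can be inspected directly on the returned object. The following is a minimal sketch; the preprocessing mirrors the Examples section, and the layout noted in the comments (topics in rows of phi, documents in rows of theta) is an assumption that is best confirmed with dim() on your own model.

```r
library(seededlda)
library(quanteda)

toks <- tokens(head(data_corpus_moviereviews, 500), remove_punct = TRUE,
               remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks)
dfmt <- dfm_remove(dfmt, stopwords("en"), min_nchar = 2)
dfmt <- dfm_trim(dfmt, max_docfreq = 0.1, docfreq_type = "prop")

set.seed(1234)
lda <- textmodel_lda(dfmt, k = 6, max_iter = 200)

lda$k           # number of topics
lda$last_iter   # iterations actually run
dim(lda$phi)    # assumed layout: topics x words
dim(lda$theta)  # assumed layout: documents x topics
```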
Details
If auto_iter = TRUE, the iteration stops even before max_iter when
delta <= 0. delta measures the change in the number of words whose topics
are updated by the Gibbs sampler; it is computed every 100 iterations and
shown in the verbose message.
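A minimal sketch of early stopping, assuming the same preprocessing as in the Examples section: with auto_iter = TRUE and verbose = TRUE, the progress messages report delta, and last_iter in the returned object shows how many iterations were actually run.

```r
library(seededlda)
library(quanteda)

toks <- tokens(head(data_corpus_moviereviews, 500), remove_punct = TRUE,
               remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks)
dfmt <- dfm_remove(dfmt, stopwords("en"), min_nchar = 2)
dfmt <- dfm_trim(dfmt, max_docfreq = 0.1, docfreq_type = "prop")

set.seed(1234)
# Sampling may stop before max_iter once delta <= 0
lda_auto <- textmodel_lda(dfmt, k = 6, max_iter = 2000,
                          auto_iter = TRUE, verbose = TRUE)
lda_auto$last_iter  # at most max_iter
```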
If batch_size < 1.0, the corpus is partitioned into sub-corpora of
ndoc(x) * batch_size documents for Gibbs sampling in sub-processes, with
synchronization of parameters every 10 iterations. Parallel processing is
more efficient when batch_size is small (e.g. 0.01). The algorithm is the
Approximate Distributed LDA proposed by Newman et al. (2009). Users can
change the number of sub-processes used for parallel computing via
options(seededlda_threads).
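The sketch below shows one way to enable the distributed algorithm. The thread count of 2 and batch_size = 0.1 are arbitrary values chosen for illustration; whether parallel execution is actually used may depend on how the package was built on your system.

```r
library(seededlda)
library(quanteda)

toks <- tokens(head(data_corpus_moviereviews, 500), remove_punct = TRUE,
               remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks)
dfmt <- dfm_remove(dfmt, stopwords("en"), min_nchar = 2)
dfmt <- dfm_trim(dfmt, max_docfreq = 0.1, docfreq_type = "prop")

options(seededlda_threads = 2)  # assumed value; adjust to your machine

# Each batch holds about ndoc(dfmt) * batch_size documents
ndoc(dfmt) * 0.1

set.seed(1234)
lda_par <- textmodel_lda(dfmt, k = 6, max_iter = 500, batch_size = 0.1)
terms(lda_par, 5)
```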
set.seed() should be called immediately before textmodel_lda() or
textmodel_seededlda() to control random topic assignment. If the random
number seed is the same, the serial algorithm produces identical results;
the parallel algorithm produces non-identical results because it
classifies documents in different orders using multiple processors.
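To check reproducibility with the serial algorithm (the default batch_size = 1), one can fit the model twice with the same seed and compare the results. This is only a sketch; the seed value is arbitrary.

```r
library(seededlda)
library(quanteda)

toks <- tokens(head(data_corpus_moviereviews, 500), remove_punct = TRUE,
               remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks)
dfmt <- dfm_remove(dfmt, stopwords("en"), min_nchar = 2)
dfmt <- dfm_trim(dfmt, max_docfreq = 0.1, docfreq_type = "prop")

# Call set.seed() immediately before each fit
set.seed(1234)
lda1 <- textmodel_lda(dfmt, k = 6, max_iter = 200)
set.seed(1234)
lda2 <- textmodel_lda(dfmt, k = 6, max_iter = 200)

# Compare the top terms of the two serial runs
identical(terms(lda1), terms(lda2))
```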
To predict topics of new documents (i.e. out-of-sample), first create a
new LDA model from an existing LDA model passed to model in
textmodel_lda(); second, apply topics() to the new model. The model
argument takes objects created by either textmodel_lda() or
textmodel_seededlda().
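The two steps above can be sketched as follows. The 450/50 split into training and test documents is an arbitrary choice for illustration, and the object names are hypothetical.

```r
library(seededlda)
library(quanteda)

toks <- tokens(head(data_corpus_moviereviews, 500), remove_punct = TRUE,
               remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks)
dfmt <- dfm_remove(dfmt, stopwords("en"), min_nchar = 2)
dfmt <- dfm_trim(dfmt, max_docfreq = 0.1, docfreq_type = "prop")

# Assumed split: first 450 documents for fitting, last 50 as new documents
dfmt_train <- head(dfmt, 450)
dfmt_test <- tail(dfmt, 50)

set.seed(1234)
lda_train <- textmodel_lda(dfmt_train, k = 6, max_iter = 500)

# Step 1: create a new model that inherits parameters from the fitted one
lda_test <- textmodel_lda(dfmt_test, model = lda_train)

# Step 2: predict topics for the new documents
head(topics(lda_test))
```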
References
Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed Algorithms for Topic Models. The Journal of Machine Learning Research, 10, 1801–1828.
Examples
# \donttest{
require(seededlda)
require(quanteda)
corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords("en"), min_nchar = 2) %>%
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
lda <- textmodel_lda(dfmt, k = 6, max_iter = 500) # 6 topics
terms(lda)
#>       topic1     topic2     topic3    topic4   topic5     topic6
#>  [1,] "wild"     "killer"   "fight"   "kevin"  "ship"     "school"
#>  [2,] "daughter" "children" "mars"    "boy"    "spawn"    "girls"
#>  [3,] "west"     "sound"    "space"   "patch"  "computer" "scream"
#>  [4,] "murphy"   "fails"    "mission" "de"     "deep"     "joe"
#>  [5,] "jack"     "number"   "van"     "doctor" "giant"    "teen"
#>  [6,] "brothers" "robin"    "planet"  "legend" "town"     "studio"
#>  [7,] "brother"  "features" "batman"  "eyes"   "earth"    "town"
#>  [8,] "cop"      "cool"     "alien"   "claire" "max"      "julie"
#>  [9,] "eddie"    "dull"     "battle"  "harry"  "disaster" "bob"
#> [10,] "partner"  "credits"  "team"    "cage"   "crew"     "christopher"
topics(lda)
#> cv000_29416.txt cv001_19502.txt cv002_17424.txt cv003_12683.txt cv004_12641.txt
#> topic2 topic5 topic2 topic3 topic2
#> cv005_29357.txt cv006_17022.txt cv007_4992.txt cv008_29326.txt cv009_29417.txt
#> topic3 topic4 topic2 topic4 topic2
#> cv010_29063.txt cv011_13044.txt cv012_29411.txt cv013_10494.txt cv014_15600.txt
#> topic2 topic4 topic3 topic3 topic4
#> cv015_29356.txt cv016_4348.txt cv017_23487.txt cv018_21672.txt cv019_16117.txt
#> topic3 topic2 topic2 topic2 topic6
#> cv020_9234.txt cv021_17313.txt cv022_14227.txt cv023_13847.txt cv024_7033.txt
#> topic5 topic2 topic2 topic4 topic5
#> cv025_29825.txt cv026_29229.txt cv027_26270.txt cv028_26964.txt cv029_19943.txt
#> topic2 topic3 topic1 topic3 topic3
#> cv030_22893.txt cv031_19540.txt cv032_23718.txt cv033_25680.txt cv034_29446.txt
#> topic6 topic2 topic3 topic6 topic4
#> cv035_3343.txt cv036_18385.txt cv037_19798.txt cv038_9781.txt cv039_5963.txt
#> topic4 topic4 topic5 topic2 topic3
#> cv040_8829.txt cv041_22364.txt cv042_11927.txt cv043_16808.txt cv044_18429.txt
#> topic2 topic2 topic6 topic2 topic6
#> cv045_25077.txt cv046_10613.txt cv047_18725.txt cv048_18380.txt cv049_21917.txt
#> topic2 topic5 topic4 topic2 topic5
#> cv050_12128.txt cv051_10751.txt cv052_29318.txt cv053_23117.txt cv054_4101.txt
#> topic2 topic6 topic6 topic2 topic1
#> cv055_8926.txt cv056_14663.txt cv057_7962.txt cv058_8469.txt cv059_28723.txt
#> topic3 topic1 topic3 topic6 topic6
#> cv060_11754.txt cv061_9321.txt cv062_24556.txt cv063_28852.txt cv064_25842.txt
#> topic5 topic2 topic3 topic6 topic2
#> cv065_16909.txt cv066_11668.txt cv067_21192.txt cv068_14810.txt cv069_11613.txt
#> topic6 topic3 topic2 topic2 topic4
#> cv070_13249.txt cv071_12969.txt cv072_5928.txt cv073_23039.txt cv074_7188.txt
#> topic2 topic5 topic2 topic2 topic2
#> cv075_6250.txt cv076_26009.txt cv077_23172.txt cv078_16506.txt cv079_12766.txt
#> topic3 topic5 topic6 topic5 topic5
#> cv080_14899.txt cv081_18241.txt cv082_11979.txt cv083_25491.txt cv084_15183.txt
#> topic2 topic2 topic5 topic2 topic2
#> cv085_15286.txt cv086_19488.txt cv087_2145.txt cv088_25274.txt cv089_12222.txt
#> topic6 topic2 topic2 topic2 topic2
#> cv090_0049.txt cv091_7899.txt cv092_27987.txt cv093_15606.txt cv094_27868.txt
#> topic1 topic5 topic6 topic3 topic1
#> cv095_28730.txt cv096_12262.txt cv097_26081.txt cv098_17021.txt cv099_11189.txt
#> topic6 topic2 topic2 topic2 topic6
#> cv100_12406.txt cv101_10537.txt cv102_8306.txt cv103_11943.txt cv104_19176.txt
#> topic5 topic2 topic5 topic2 topic1
#> cv105_19135.txt cv106_18379.txt cv107_25639.txt cv108_17064.txt cv109_22599.txt
#> topic2 topic1 topic3 topic2 topic5
#> cv110_27832.txt cv111_12253.txt cv112_12178.txt cv113_24354.txt cv114_19501.txt
#> topic5 topic3 topic2 topic2 topic1
#> cv115_26443.txt cv116_28734.txt cv117_25625.txt cv118_28837.txt cv119_9909.txt
#> topic1 topic2 topic4 topic3 topic2
#> cv120_3793.txt cv121_18621.txt cv122_7891.txt cv123_12165.txt cv124_3903.txt
#> topic2 topic2 topic2 topic3 topic5
#> cv125_9636.txt cv126_28821.txt cv127_16451.txt cv128_29444.txt cv129_18373.txt
#> topic2 topic2 topic2 topic3 topic2
#> cv130_18521.txt cv131_11568.txt cv132_5423.txt cv133_18065.txt cv134_23300.txt
#> topic2 topic1 topic2 topic3 topic1
#> cv135_12506.txt cv136_12384.txt cv137_17020.txt cv138_13903.txt cv139_14236.txt
#> topic5 topic2 topic2 topic2 topic4
#> cv140_7963.txt cv141_17179.txt cv142_23657.txt cv143_21158.txt cv144_5010.txt
#> topic2 topic5 topic6 topic2 topic1
#> cv145_12239.txt cv146_19587.txt cv147_22625.txt cv148_18084.txt cv149_17084.txt
#> topic2 topic4 topic6 topic2 topic2
#> cv150_14279.txt cv151_17231.txt cv152_9052.txt cv153_11607.txt cv154_9562.txt
#> topic3 topic4 topic5 topic4 topic2
#> cv155_7845.txt cv156_11119.txt cv157_29302.txt cv158_10914.txt cv159_29374.txt
#> topic6 topic4 topic4 topic5 topic5
#> cv160_10848.txt cv161_12224.txt cv162_10977.txt cv163_10110.txt cv164_23451.txt
#> topic4 topic4 topic1 topic6 topic2
#> cv165_2389.txt cv166_11959.txt cv167_18094.txt cv168_7435.txt cv169_24973.txt
#> topic2 topic2 topic5 topic6 topic3
#> cv170_29808.txt cv171_15164.txt cv172_12037.txt cv173_4295.txt cv174_9735.txt
#> topic2 topic6 topic2 topic2 topic2
#> cv175_7375.txt cv176_14196.txt cv177_10904.txt cv178_14380.txt cv179_9533.txt
#> topic2 topic5 topic5 topic4 topic2
#> cv180_17823.txt cv181_16083.txt cv182_7791.txt cv183_19826.txt cv184_26935.txt
#> topic2 topic6 topic1 topic2 topic3
#> cv185_28372.txt cv186_2396.txt cv187_14112.txt cv188_20687.txt cv189_24248.txt
#> topic2 topic2 topic2 topic2 topic2
#> cv190_27176.txt cv191_29539.txt cv192_16079.txt cv193_5393.txt cv194_12855.txt
#> topic5 topic6 topic4 topic2 topic1
#> cv195_16146.txt cv196_28898.txt cv197_29271.txt cv198_19313.txt cv199_9721.txt
#> topic6 topic6 topic4 topic6 topic2
#> cv200_29006.txt cv201_7421.txt cv202_11382.txt cv203_19052.txt cv204_8930.txt
#> topic3 topic5 topic2 topic6 topic2
#> cv205_9676.txt cv206_15893.txt cv207_29141.txt cv208_9475.txt cv209_28973.txt
#> topic1 topic2 topic4 topic2 topic3
#> cv210_9557.txt cv211_9955.txt cv212_10054.txt cv213_20300.txt cv214_13285.txt
#> topic2 topic1 topic3 topic5 topic2
#> cv215_23246.txt cv216_20165.txt cv217_28707.txt cv218_25651.txt cv219_19874.txt
#> topic1 topic4 topic1 topic2 topic1
#> cv220_28906.txt cv221_27081.txt cv222_18720.txt cv223_28923.txt cv224_18875.txt
#> topic2 topic4 topic2 topic2 topic2
#> cv225_29083.txt cv226_26692.txt cv227_25406.txt cv228_5644.txt cv229_15200.txt
#> topic2 topic2 topic1 topic1 topic2
#> cv230_7913.txt cv231_11028.txt cv232_16768.txt cv233_17614.txt cv234_22123.txt
#> topic2 topic2 topic4 topic6 topic2
#> cv235_10704.txt cv236_12427.txt cv237_20635.txt cv238_14285.txt cv239_29828.txt
#> topic2 topic2 topic1 topic2 topic2
#> cv240_15948.txt cv241_24602.txt cv242_11354.txt cv243_22164.txt cv244_22935.txt
#> topic6 topic2 topic6 topic1 topic2
#> cv245_8938.txt cv246_28668.txt cv247_14668.txt cv248_15672.txt cv249_12674.txt
#> topic2 topic6 topic2 topic3 topic4
#> cv250_26462.txt cv251_23901.txt cv252_24974.txt cv253_10190.txt cv254_5870.txt
#> topic6 topic6 topic2 topic5 topic5
#> cv255_15267.txt cv256_16529.txt cv257_11856.txt cv258_5627.txt cv259_11827.txt
#> topic6 topic2 topic5 topic5 topic1
#> cv260_15652.txt cv261_11855.txt cv262_13812.txt cv263_20693.txt cv264_14108.txt
#> topic3 topic5 topic6 topic2 topic6
#> cv265_11625.txt cv266_26644.txt cv267_16618.txt cv268_20288.txt cv269_23018.txt
#> topic4 topic1 topic4 topic2 topic2
#> cv270_5873.txt cv271_15364.txt cv272_20313.txt cv273_28961.txt cv274_26379.txt
#> topic2 topic6 topic6 topic1 topic4
#> cv275_28725.txt cv276_17126.txt cv277_20467.txt cv278_14533.txt cv279_19452.txt
#> topic6 topic4 topic2 topic2 topic4
#> cv280_8651.txt cv281_24711.txt cv282_6833.txt cv283_11963.txt cv284_20530.txt
#> topic4 topic2 topic5 topic2 topic2
#> cv285_18186.txt cv286_26156.txt cv287_17410.txt cv288_20212.txt cv289_6239.txt
#> topic2 topic2 topic6 topic2 topic3
#> cv290_11981.txt cv291_26844.txt cv292_7804.txt cv293_29731.txt cv294_12695.txt
#> topic5 topic2 topic2 topic3 topic5
#> cv295_17060.txt cv296_13146.txt cv297_10104.txt cv298_24487.txt cv299_17950.txt
#> topic2 topic2 topic5 topic4 topic4
#> cv300_23302.txt cv301_13010.txt cv302_26481.txt cv303_27366.txt cv304_28489.txt
#> topic1 topic1 topic4 topic1 topic1
#> cv305_9937.txt cv306_10859.txt cv307_26382.txt cv308_5079.txt cv309_23737.txt
#> topic3 topic5 topic4 topic1 topic6
#> cv310_14568.txt cv311_17708.txt cv312_29308.txt cv313_19337.txt cv314_16095.txt
#> topic4 topic6 topic6 topic2 topic2
#> cv315_12638.txt cv316_5972.txt cv317_25111.txt cv318_11146.txt cv319_16459.txt
#> topic5 topic5 topic5 topic2 topic2
#> cv320_9693.txt cv321_14191.txt cv322_21820.txt cv323_29633.txt cv324_7502.txt
#> topic3 topic6 topic2 topic2 topic6
#> cv325_18330.txt cv326_14777.txt cv327_21743.txt cv328_10908.txt cv329_29293.txt
#> topic6 topic3 topic2 topic2 topic3
#> cv330_29675.txt cv331_8656.txt cv332_17997.txt cv333_9443.txt cv334_0074.txt
#> topic6 topic2 topic2 topic4 topic2
#> cv335_16299.txt cv336_10363.txt cv337_29061.txt cv338_9183.txt cv339_22452.txt
#> topic5 topic2 topic2 topic3 topic4
#> cv340_14776.txt cv341_25667.txt cv342_20917.txt cv343_10906.txt cv344_5376.txt
#> topic4 topic2 topic6 topic6 topic2
#> cv345_9966.txt cv346_19198.txt cv347_14722.txt cv348_19207.txt cv349_15032.txt
#> topic3 topic1 topic2 topic2 topic2
#> cv350_22139.txt cv351_17029.txt cv352_5414.txt cv353_19197.txt cv354_8573.txt
#> topic2 topic6 topic2 topic2 topic2
#> cv355_18174.txt cv356_26170.txt cv357_14710.txt cv358_11557.txt cv359_6751.txt
#> topic2 topic4 topic2 topic2 topic1
#> cv360_8927.txt cv361_28738.txt cv362_16985.txt cv363_29273.txt cv364_14254.txt
#> topic4 topic3 topic6 topic1 topic3
#> cv365_12442.txt cv366_10709.txt cv367_24065.txt cv368_11090.txt cv369_14245.txt
#> topic5 topic2 topic3 topic3 topic3
#> cv370_5338.txt cv371_8197.txt cv372_6654.txt cv373_21872.txt cv374_26455.txt
#> topic2 topic1 topic4 topic2 topic1
#> cv375_9932.txt cv376_20883.txt cv377_8440.txt cv378_21982.txt cv379_23167.txt
#> topic2 topic1 topic5 topic2 topic6
#> cv380_8164.txt cv381_21673.txt cv382_8393.txt cv383_14662.txt cv384_18536.txt
#> topic3 topic3 topic3 topic1 topic2
#> cv385_29621.txt cv386_10229.txt cv387_12391.txt cv388_12810.txt cv389_9611.txt
#> topic2 topic2 topic2 topic2 topic3
#> cv390_12187.txt cv391_11615.txt cv392_12238.txt cv393_29234.txt cv394_5311.txt
#> topic4 topic2 topic1 topic1 topic4
#> cv395_11761.txt cv396_19127.txt cv397_28890.txt cv398_17047.txt cv399_28593.txt
#> topic1 topic6 topic2 topic2 topic1
#> cv400_20631.txt cv401_13758.txt cv402_16097.txt cv403_6721.txt cv404_21805.txt
#> topic2 topic2 topic4 topic1 topic2
#> cv405_21868.txt cv406_22199.txt cv407_23928.txt cv408_5367.txt cv409_29625.txt
#> topic2 topic6 topic6 topic2 topic5
#> cv410_25624.txt cv411_16799.txt cv412_25254.txt cv413_7893.txt cv414_11161.txt
#> topic6 topic2 topic2 topic2 topic1
#> cv415_23674.txt cv416_12048.txt cv417_14653.txt cv418_16562.txt cv419_14799.txt
#> topic3 topic3 topic4 topic2 topic2
#> cv420_28631.txt cv421_9752.txt cv422_9632.txt cv423_12089.txt cv424_9268.txt
#> topic2 topic3 topic2 topic6 topic5
#> cv425_8603.txt cv426_10976.txt cv427_11693.txt cv428_12202.txt cv429_7937.txt
#> topic2 topic5 topic4 topic6 topic6
#> cv430_18662.txt cv431_7538.txt cv432_15873.txt cv433_10443.txt cv434_5641.txt
#> topic2 topic2 topic2 topic2 topic4
#> cv435_24355.txt cv436_20564.txt cv437_24070.txt cv438_8500.txt cv439_17633.txt
#> topic3 topic3 topic3 topic2 topic4
#> cv440_16891.txt cv441_15276.txt cv442_15499.txt cv443_22367.txt cv444_9975.txt
#> topic5 topic2 topic6 topic2 topic5
#> cv445_26683.txt cv446_12209.txt cv447_27334.txt cv448_16409.txt cv449_9126.txt
#> topic4 topic1 topic2 topic6 topic6
#> cv450_8319.txt cv451_11502.txt cv452_5179.txt cv453_10911.txt cv454_21961.txt
#> topic3 topic4 topic6 topic2 topic3
#> cv455_28866.txt cv456_20370.txt cv457_19546.txt cv458_9000.txt cv459_21834.txt
#> topic2 topic3 topic2 topic1 topic2
#> cv460_11723.txt cv461_21124.txt cv462_20788.txt cv463_10846.txt cv464_17076.txt
#> topic4 topic1 topic1 topic2 topic4
#> cv465_23401.txt cv466_20092.txt cv467_26610.txt cv468_16844.txt cv469_21998.txt
#> topic5 topic2 topic2 topic6 topic5
#> cv470_17444.txt cv471_18405.txt cv472_29140.txt cv473_7869.txt cv474_10682.txt
#> topic2 topic3 topic2 topic5 topic5
#> cv475_22978.txt cv476_18402.txt cv477_23530.txt cv478_15921.txt cv479_5450.txt
#> topic6 topic4 topic2 topic4 topic6
#> cv480_21195.txt cv481_7930.txt cv482_11233.txt cv483_18103.txt cv484_26169.txt
#> topic2 topic3 topic2 topic2 topic2
#> cv485_26879.txt cv486_9788.txt cv487_11058.txt cv488_21453.txt cv489_19046.txt
#> topic2 topic2 topic6 topic2 topic1
#> cv490_18986.txt cv491_12992.txt cv492_19370.txt cv493_14135.txt cv494_18689.txt
#> topic2 topic2 topic2 topic6 topic4
#> cv495_16121.txt cv496_11185.txt cv497_27086.txt cv498_9288.txt cv499_11407.txt
#> topic3 topic4 topic1 topic2 topic4
#> Levels: topic1 topic2 topic3 topic4 topic5 topic6
# }