Implements semisupervised Latent Dirichlet allocation (Seeded LDA). textmodel_seededlda() allows users to specify topics using a seed word dictionary. Users can run Seeded Sequential LDA by setting gamma > 0.

Usage

textmodel_seededlda(
  x,
  dictionary,
  levels = 1,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  residual = 0,
  weight = 0.01,
  max_iter = 2000,
  auto_iter = FALSE,
  alpha = 0.5,
  beta = 0.1,
  gamma = 0,
  batch_size = 1,
  ...,
  verbose = quanteda_options("verbose")
)

Arguments

x

the dfm on which the model will be fit.

dictionary

a quanteda::dictionary() with seed words that define topics.

levels

levels of entities in a hierarchical dictionary to be used as seed words. See also quanteda::flatten_dictionary.

valuetype

see quanteda::valuetype

case_insensitive

see quanteda::valuetype

residual

the number of undefined topics. They are named "other" by default, but the name can be changed via base::options(seededlda_residual_name).

weight

determines the size of pseudo counts given to matched seed words.

max_iter

the maximum number of iterations in Gibbs sampling.

auto_iter

if TRUE, stops Gibbs sampling on convergence before reaching max_iter. See details.

alpha

the value used to smooth the topic-document distribution.

beta

the value used to smooth the topic-word distribution.

gamma

a parameter to determine change of topics between sentences or paragraphs. When gamma > 0, Gibbs sampling of topics for the current document is affected by the previous document's topics.

batch_size

splits the corpus into smaller batches (specified as a proportion of the documents) for distributed computing; batching is disabled when a single batch includes all the documents (batch_size = 1.0). See details.

...

passed to quanteda::dfm_trim to restrict seed words based on their term or document frequency. This is useful when glob patterns in the dictionary match too many words.

verbose

logical; if TRUE print diagnostic information during fitting.
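
The following sketch shows how several of the arguments above might be combined. It assumes the seededlda and quanteda packages are installed and reuses the data_corpus_moviereviews corpus from the Examples. The seed words, the residual label "misc" and all parameter values are illustrative assumptions rather than recommended settings.

```r
library(quanteda)
library(seededlda)

# Build a document-feature matrix; the subset size and trimming
# thresholds are arbitrary choices for illustration.
corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE)
dfmt <- dfm_remove(dfm(toks), stopwords("en"), min_nchar = 2)
dfmt <- dfm_trim(dfmt, max_docfreq = 0.1, docfreq_type = "prop")

# Illustrative seed words; "glob" is the default valuetype, so "*" is a wildcard.
dict <- dictionary(list(space = c("alien*", "planet*", "space"),
                        crime = c("crime*", "murder*", "killer*")))

# Rename the residual topic from the default "other" (hypothetical label "misc").
options(seededlda_residual_name = "misc")

lda <- textmodel_seededlda(
    dfmt, dict,
    residual = 1,      # one unseeded topic in addition to the two seeded topics
    weight = 0.01,     # size of pseudo counts for matched seed words (the default)
    max_iter = 300,    # upper bound on Gibbs sampling iterations
    auto_iter = TRUE,  # allow sampling to stop early on convergence
    min_termfreq = 5   # passed via `...` to dfm_trim() to restrict seed words
)

# Topic labels: seeded topics take their names from the dictionary keys.
colnames(terms(lda, 5))
```

Because the residual label is set through options(), it stays in effect for later calls in the same session until the option is changed again.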

References

Lu, Bin et al. (2011). "Multi-aspect Sentiment Analysis with Topic Models". doi:10.5555/2117693.2119585. Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops.

Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.

Watanabe, Kohei & Baturo, Alexander (2023). "Seeded Sequential LDA: A Semi-supervised Algorithm for Topic-specific Analysis of Sentences". doi:10.1177/08944393231178605. Social Science Computer Review.

See also

keyATM

Examples

# \donttest{
require(seededlda)
require(quanteda)

corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords("en"), min_nchar = 2) %>%
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")

dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        moster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
lda_seed <- textmodel_seededlda(dfmt, dict, residual = 1, min_termfreq = 10,
                                max_iter = 500)
terms(lda_seed)
#>       people     space     moster         war         crime      other     
#>  [1,] "van"      "space"   "dumb"         "fight"     "killer"   "joe"     
#>  [2,] "ship"     "alien"   "fails"        "war"       "murder"   "school"  
#>  [3,] "computer" "planet"  "mean"         "batman"    "wild"     "girls"   
#>  [4,] "max"      "mars"    "eyes"         "robin"     "crime"    "brothers"
#>  [5,] "de"       "mission" "features"     "patch"     "west"     "harry"   
#>  [6,] "damme"    "earth"   "move"         "chris"     "spawn"    "romantic"
#>  [7,] "virus"    "scream"  "hell"         "king"      "murphy"   "boy"     
#>  [8,] "williams" "aliens"  "number"       "club"      "daughter" "kevin"   
#>  [9,] "dog"      "deep"    "short"        "cage"      "eddie"    "lives"   
#> [10,] "giant"    "humans"  "particularly" "emotional" "smith"    "wedding" 
topics(lda_seed)
#> cv000_29416.txt cv001_19502.txt cv002_17424.txt cv003_12683.txt cv004_12641.txt 
#>          moster          people          moster           space          moster 
#> cv005_29357.txt cv006_17022.txt  cv007_4992.txt cv008_29326.txt cv009_29417.txt 
#>           space             war          moster           crime          moster 
#> cv010_29063.txt cv011_13044.txt cv012_29411.txt cv013_10494.txt cv014_15600.txt 
#>          moster           crime             war             war          people 
#> cv015_29356.txt  cv016_4348.txt cv017_23487.txt cv018_21672.txt cv019_16117.txt 
#>           space          moster           space          moster           other 
#>  cv020_9234.txt cv021_17313.txt cv022_14227.txt cv023_13847.txt  cv024_7033.txt 
#>           crime             war          moster          moster             war 
#> cv025_29825.txt cv026_29229.txt cv027_26270.txt cv028_26964.txt cv029_19943.txt 
#>          moster           space          moster          people           space 
#> cv030_22893.txt cv031_19540.txt cv032_23718.txt cv033_25680.txt cv034_29446.txt 
#>             war          people             war          moster           crime 
#>  cv035_3343.txt cv036_18385.txt cv037_19798.txt  cv038_9781.txt  cv039_5963.txt 
#>           crime          moster           space           space           space 
#>  cv040_8829.txt cv041_22364.txt cv042_11927.txt cv043_16808.txt cv044_18429.txt 
#>          moster          moster           other          moster           other 
#> cv045_25077.txt cv046_10613.txt cv047_18725.txt cv048_18380.txt cv049_21917.txt 
#>          moster          moster          moster          moster          moster 
#> cv050_12128.txt cv051_10751.txt cv052_29318.txt cv053_23117.txt  cv054_4101.txt 
#>          moster           other           other          moster          moster 
#>  cv055_8926.txt cv056_14663.txt  cv057_7962.txt  cv058_8469.txt cv059_28723.txt 
#>          moster          moster           space           other           other 
#> cv060_11754.txt  cv061_9321.txt cv062_24556.txt cv063_28852.txt cv064_25842.txt 
#>           space          moster           space           other          moster 
#> cv065_16909.txt cv066_11668.txt cv067_21192.txt cv068_14810.txt cv069_11613.txt 
#>           other             war          moster          moster           other 
#> cv070_13249.txt cv071_12969.txt  cv072_5928.txt cv073_23039.txt  cv074_7188.txt 
#>          moster           space          moster          moster          moster 
#>  cv075_6250.txt cv076_26009.txt cv077_23172.txt cv078_16506.txt cv079_12766.txt 
#>             war           crime           other          people          people 
#> cv080_14899.txt cv081_18241.txt cv082_11979.txt cv083_25491.txt cv084_15183.txt 
#>          moster          moster          moster           space          moster 
#> cv085_15286.txt cv086_19488.txt  cv087_2145.txt cv088_25274.txt cv089_12222.txt 
#>           space          moster          moster          moster          moster 
#>  cv090_0049.txt  cv091_7899.txt cv092_27987.txt cv093_15606.txt cv094_27868.txt 
#>           other           space             war          moster           other 
#> cv095_28730.txt cv096_12262.txt cv097_26081.txt cv098_17021.txt cv099_11189.txt 
#>           other             war          moster           crime          moster 
#> cv100_12406.txt cv101_10537.txt  cv102_8306.txt cv103_11943.txt cv104_19176.txt 
#>           space          moster          people          moster          moster 
#> cv105_19135.txt cv106_18379.txt cv107_25639.txt cv108_17064.txt cv109_22599.txt 
#>          moster           crime           space             war           crime 
#> cv110_27832.txt cv111_12253.txt cv112_12178.txt cv113_24354.txt cv114_19501.txt 
#>          moster           space          moster           other           crime 
#> cv115_26443.txt cv116_28734.txt cv117_25625.txt cv118_28837.txt  cv119_9909.txt 
#>          moster             war          people           crime          moster 
#>  cv120_3793.txt cv121_18621.txt  cv122_7891.txt cv123_12165.txt  cv124_3903.txt 
#>          moster          moster          moster           space          people 
#>  cv125_9636.txt cv126_28821.txt cv127_16451.txt cv128_29444.txt cv129_18373.txt 
#>           crime          moster          moster           space          moster 
#> cv130_18521.txt cv131_11568.txt  cv132_5423.txt cv133_18065.txt cv134_23300.txt 
#>          moster          moster          moster          people          moster 
#> cv135_12506.txt cv136_12384.txt cv137_17020.txt cv138_13903.txt cv139_14236.txt 
#>           space          moster          moster          moster          moster 
#>  cv140_7963.txt cv141_17179.txt cv142_23657.txt cv143_21158.txt  cv144_5010.txt 
#>             war          people           other          moster             war 
#> cv145_12239.txt cv146_19587.txt cv147_22625.txt cv148_18084.txt cv149_17084.txt 
#>          moster          people             war          moster          moster 
#> cv150_14279.txt cv151_17231.txt  cv152_9052.txt cv153_11607.txt  cv154_9562.txt 
#>          people           other          moster             war             war 
#>  cv155_7845.txt cv156_11119.txt cv157_29302.txt cv158_10914.txt cv159_29374.txt 
#>           other          people           crime          moster           space 
#> cv160_10848.txt cv161_12224.txt cv162_10977.txt cv163_10110.txt cv164_23451.txt 
#>          moster           other           crime          people           crime 
#>  cv165_2389.txt cv166_11959.txt cv167_18094.txt  cv168_7435.txt cv169_24973.txt 
#>          moster           space           crime           other           space 
#> cv170_29808.txt cv171_15164.txt cv172_12037.txt  cv173_4295.txt  cv174_9735.txt 
#>          moster           other          moster          moster           space 
#>  cv175_7375.txt cv176_14196.txt cv177_10904.txt cv178_14380.txt  cv179_9533.txt 
#>           space          moster           space          people          moster 
#> cv180_17823.txt cv181_16083.txt  cv182_7791.txt cv183_19826.txt cv184_26935.txt 
#>          moster           other             war          moster          people 
#> cv185_28372.txt  cv186_2396.txt cv187_14112.txt cv188_20687.txt cv189_24248.txt 
#>           other          moster             war          moster          moster 
#> cv190_27176.txt cv191_29539.txt cv192_16079.txt  cv193_5393.txt cv194_12855.txt 
#>          people          moster             war          moster           other 
#> cv195_16146.txt cv196_28898.txt cv197_29271.txt cv198_19313.txt  cv199_9721.txt 
#>           other           other           other           crime          moster 
#> cv200_29006.txt  cv201_7421.txt cv202_11382.txt cv203_19052.txt  cv204_8930.txt 
#>           space          moster          moster           other           space 
#>  cv205_9676.txt cv206_15893.txt cv207_29141.txt  cv208_9475.txt cv209_28973.txt 
#>           crime          moster          moster          moster             war 
#>  cv210_9557.txt  cv211_9955.txt cv212_10054.txt cv213_20300.txt cv214_13285.txt 
#>           space           crime          moster          people          moster 
#> cv215_23246.txt cv216_20165.txt cv217_28707.txt cv218_25651.txt cv219_19874.txt 
#>          moster          people          moster          moster           crime 
#> cv220_28906.txt cv221_27081.txt cv222_18720.txt cv223_28923.txt cv224_18875.txt 
#>          people           crime          moster           space          moster 
#> cv225_29083.txt cv226_26692.txt cv227_25406.txt  cv228_5644.txt cv229_15200.txt 
#>          moster          moster          moster          people          moster 
#>  cv230_7913.txt cv231_11028.txt cv232_16768.txt cv233_17614.txt cv234_22123.txt 
#>          moster          people             war          moster          moster 
#> cv235_10704.txt cv236_12427.txt cv237_20635.txt cv238_14285.txt cv239_29828.txt 
#>          moster           crime          people          moster           crime 
#> cv240_15948.txt cv241_24602.txt cv242_11354.txt cv243_22164.txt cv244_22935.txt 
#>           space          moster           other           other          moster 
#>  cv245_8938.txt cv246_28668.txt cv247_14668.txt cv248_15672.txt cv249_12674.txt 
#>          moster           other          moster          moster           crime 
#> cv250_26462.txt cv251_23901.txt cv252_24974.txt cv253_10190.txt  cv254_5870.txt 
#>           crime             war           space           other          people 
#> cv255_15267.txt cv256_16529.txt cv257_11856.txt  cv258_5627.txt cv259_11827.txt 
#>          moster          people           crime          moster          moster 
#> cv260_15652.txt cv261_11855.txt cv262_13812.txt cv263_20693.txt cv264_14108.txt 
#>           space           crime           other          moster             war 
#> cv265_11625.txt cv266_26644.txt cv267_16618.txt cv268_20288.txt cv269_23018.txt 
#>           other             war          moster          moster          moster 
#>  cv270_5873.txt cv271_15364.txt cv272_20313.txt cv273_28961.txt cv274_26379.txt 
#>          moster          moster          moster           other          people 
#> cv275_28725.txt cv276_17126.txt cv277_20467.txt cv278_14533.txt cv279_19452.txt 
#>           other          moster          moster          moster          moster 
#>  cv280_8651.txt cv281_24711.txt  cv282_6833.txt cv283_11963.txt cv284_20530.txt 
#>           crime          moster          people           other          moster 
#> cv285_18186.txt cv286_26156.txt cv287_17410.txt cv288_20212.txt  cv289_6239.txt 
#>          moster          moster           other          moster           space 
#> cv290_11981.txt cv291_26844.txt  cv292_7804.txt cv293_29731.txt cv294_12695.txt 
#>          people          moster          moster             war          people 
#> cv295_17060.txt cv296_13146.txt cv297_10104.txt cv298_24487.txt cv299_17950.txt 
#>           other          moster          people          people          moster 
#> cv300_23302.txt cv301_13010.txt cv302_26481.txt cv303_27366.txt cv304_28489.txt 
#>          moster           other          people           crime          moster 
#>  cv305_9937.txt cv306_10859.txt cv307_26382.txt  cv308_5079.txt cv309_23737.txt 
#>           space          people          people          moster           crime 
#> cv310_14568.txt cv311_17708.txt cv312_29308.txt cv313_19337.txt cv314_16095.txt 
#>           space           other           crime          moster          moster 
#> cv315_12638.txt  cv316_5972.txt cv317_25111.txt cv318_11146.txt cv319_16459.txt 
#>           crime          moster          moster           other          moster 
#>  cv320_9693.txt cv321_14191.txt cv322_21820.txt cv323_29633.txt  cv324_7502.txt 
#>          moster             war           crime          moster           other 
#> cv325_18330.txt cv326_14777.txt cv327_21743.txt cv328_10908.txt cv329_29293.txt 
#>             war          moster           other          moster           space 
#> cv330_29675.txt  cv331_8656.txt cv332_17997.txt  cv333_9443.txt  cv334_0074.txt 
#>           crime          moster          moster          moster          moster 
#> cv335_16299.txt cv336_10363.txt cv337_29061.txt  cv338_9183.txt cv339_22452.txt 
#>          people             war          moster           space          people 
#> cv340_14776.txt cv341_25667.txt cv342_20917.txt cv343_10906.txt  cv344_5376.txt 
#>             war          moster           other          people          moster 
#>  cv345_9966.txt cv346_19198.txt cv347_14722.txt cv348_19207.txt cv349_15032.txt 
#>             war           crime          moster           crime           space 
#> cv350_22139.txt cv351_17029.txt  cv352_5414.txt cv353_19197.txt  cv354_8573.txt 
#>           space          moster          moster           crime          moster 
#> cv355_18174.txt cv356_26170.txt cv357_14710.txt cv358_11557.txt  cv359_6751.txt 
#>           crime             war          moster           other           crime 
#>  cv360_8927.txt cv361_28738.txt cv362_16985.txt cv363_29273.txt cv364_14254.txt 
#>          moster           crime           other           crime          people 
#> cv365_12442.txt cv366_10709.txt cv367_24065.txt cv368_11090.txt cv369_14245.txt 
#>          people          people           space          people          people 
#>  cv370_5338.txt  cv371_8197.txt  cv372_6654.txt cv373_21872.txt cv374_26455.txt 
#>          moster          moster          moster          moster          moster 
#>  cv375_9932.txt cv376_20883.txt  cv377_8440.txt cv378_21982.txt cv379_23167.txt 
#>          moster          people          people          moster           space 
#>  cv380_8164.txt cv381_21673.txt  cv382_8393.txt cv383_14662.txt cv384_18536.txt 
#>             war             war          moster           crime          moster 
#> cv385_29621.txt cv386_10229.txt cv387_12391.txt cv388_12810.txt  cv389_9611.txt 
#>          moster          moster          moster          moster          people 
#> cv390_12187.txt cv391_11615.txt cv392_12238.txt cv393_29234.txt  cv394_5311.txt 
#>           crime          moster          moster           crime          people 
#> cv395_11761.txt cv396_19127.txt cv397_28890.txt cv398_17047.txt cv399_28593.txt 
#>          people           other          moster          moster          moster 
#> cv400_20631.txt cv401_13758.txt cv402_16097.txt  cv403_6721.txt cv404_21805.txt 
#>           space          moster           other           crime             war 
#> cv405_21868.txt cv406_22199.txt cv407_23928.txt  cv408_5367.txt cv409_29625.txt 
#>          people          moster             war           crime           crime 
#> cv410_25624.txt cv411_16799.txt cv412_25254.txt  cv413_7893.txt cv414_11161.txt 
#>           other           other          moster          moster           space 
#> cv415_23674.txt cv416_12048.txt cv417_14653.txt cv418_16562.txt cv419_14799.txt 
#>           space           space           other          moster           crime 
#> cv420_28631.txt  cv421_9752.txt  cv422_9632.txt cv423_12089.txt  cv424_9268.txt 
#>          moster          people          moster           other          moster 
#>  cv425_8603.txt cv426_10976.txt cv427_11693.txt cv428_12202.txt  cv429_7937.txt 
#>          moster          people             war          moster           other 
#> cv430_18662.txt  cv431_7538.txt cv432_15873.txt cv433_10443.txt  cv434_5641.txt 
#>           other          moster          moster           space          people 
#> cv435_24355.txt cv436_20564.txt cv437_24070.txt  cv438_8500.txt cv439_17633.txt 
#>          moster          moster           space          moster           other 
#> cv440_16891.txt cv441_15276.txt cv442_15499.txt cv443_22367.txt  cv444_9975.txt 
#>           space          moster          people          moster           crime 
#> cv445_26683.txt cv446_12209.txt cv447_27334.txt cv448_16409.txt  cv449_9126.txt 
#>           crime             war          moster             war           other 
#>  cv450_8319.txt cv451_11502.txt  cv452_5179.txt cv453_10911.txt cv454_21961.txt 
#>             war           crime             war          moster             war 
#> cv455_28866.txt cv456_20370.txt cv457_19546.txt  cv458_9000.txt cv459_21834.txt 
#>          moster           space          moster          moster          moster 
#> cv460_11723.txt cv461_21124.txt cv462_20788.txt cv463_10846.txt cv464_17076.txt 
#>          moster           crime          moster          moster           other 
#> cv465_23401.txt cv466_20092.txt cv467_26610.txt cv468_16844.txt cv469_21998.txt 
#>          moster          moster          moster             war          people 
#> cv470_17444.txt cv471_18405.txt cv472_29140.txt  cv473_7869.txt cv474_10682.txt 
#>          moster           space             war          people          people 
#> cv475_22978.txt cv476_18402.txt cv477_23530.txt cv478_15921.txt  cv479_5450.txt 
#>           space          moster          moster             war           crime 
#> cv480_21195.txt  cv481_7930.txt cv482_11233.txt cv483_18103.txt cv484_26169.txt 
#>          moster             war           space          moster             war 
#> cv485_26879.txt  cv486_9788.txt cv487_11058.txt cv488_21453.txt cv489_19046.txt 
#>          moster          moster           other          moster           crime 
#> cv490_18986.txt cv491_12992.txt cv492_19370.txt cv493_14135.txt cv494_18689.txt 
#>           crime          moster          moster             war           other 
#> cv495_16121.txt cv496_11185.txt cv497_27086.txt  cv498_9288.txt cv499_11407.txt 
#>           other          moster           crime          moster          moster 
#> Levels: people space moster war crime other
# }
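
The description notes that Seeded Sequential LDA is run by setting gamma > 0, in which case the topics sampled for a document are affected by the previous document's topics. A minimal sketch of that use follows, assuming the reviews are first split into sentences with quanteda::corpus_reshape() so that consecutive documents are adjacent sentences. The value gamma = 0.5, the seed words and the trimming threshold are arbitrary illustrative choices.

```r
library(quanteda)
library(seededlda)

# Reshape reviews into sentences; document order is preserved, so each
# sentence is preceded by the sentence that came before it in the review.
corp_sent <- corpus_reshape(head(data_corpus_moviereviews, 100), to = "sentences")
toks_sent <- tokens(corp_sent, remove_punct = TRUE, remove_symbols = TRUE,
                    remove_numbers = TRUE)
dfmt_sent <- dfm_remove(dfm(toks_sent), stopwords("en"), min_nchar = 2)
dfmt_sent <- dfm_trim(dfmt_sent, max_docfreq = 0.01, docfreq_type = "prop")

# Illustrative seed words (not taken from the example above).
dict_seq <- dictionary(list(space = c("alien*", "planet*", "space"),
                            crime = c("crime*", "murder*", "killer*")))

# gamma = 0 (the default) fits ordinary Seeded LDA; gamma > 0 makes the
# sampling for each sentence depend on the previous sentence's topics.
lda_seq <- textmodel_seededlda(dfmt_sent, dict_seq, residual = 2,
                               gamma = 0.5, max_iter = 300)

# Most likely topic for the first few sentences, in corpus order.
head(topics(lda_seq), 10)
```

Sentences are much shorter than whole reviews, so some may contain no features after trimming. Inspecting topics() alongside the sentence text is a reasonable first check of the result.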