Implements semisupervised Latent Dirichlet allocation (Seeded LDA). textmodel_seededlda() allows users to specify topics using a seed word dictionary. Users can run Seeded Sequential LDA by setting gamma > 0.
Usage
textmodel_seededlda(
  x,
  dictionary,
  levels = 1,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  residual = 0,
  weight = 0.01,
  max_iter = 2000,
  auto_iter = FALSE,
  alpha = 0.5,
  beta = 0.1,
  gamma = 0,
  batch_size = 1,
  ...,
  verbose = quanteda_options("verbose")
)
Arguments
- x
the dfm on which the model will be fit.
- dictionary
a quanteda::dictionary() with seed words that define topics.
- levels
levels of entities in a hierarchical dictionary to be used as seed words. See also quanteda::flatten_dictionary.
- valuetype
- case_insensitive
- residual
the number of undefined topics. They are named "other" by default, but this can be changed via base::options(seededlda_residual_name).
- weight
determines the size of pseudo counts given to matched seed words.
- max_iter
the maximum number of iterations in Gibbs sampling.
- auto_iter
if TRUE, stops Gibbs sampling on convergence before reaching max_iter. See details.
- alpha
the values to smooth topic-document distribution.
- beta
the values to smooth topic-word distribution.
- gamma
a parameter to determine change of topics between sentences or paragraphs. When gamma > 0, Gibbs sampling of topics for the current document is affected by the previous document's topics.
- batch_size
splits the corpus into smaller batches (specified as a proportion) for distributed computing; it is disabled when a batch includes all the documents (batch_size = 1.0). See details.
- ...
passed to quanteda::dfm_trim to restrict seed words based on their term or document frequency. This is useful when glob patterns in the dictionary match too many words.
- verbose
logical; if TRUE, print diagnostic information during fitting.
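The residual argument and the seededlda_residual_name option described above can be tried out on a toy corpus. The following is a minimal sketch, not part of the package's own examples: the four documents, the dictionary, and the label "misc" are invented for illustration, and the exact labels given to multiple residual topics are not asserted here.

```r
library(seededlda)
library(quanteda)

# Toy documents and seed dictionary (hypothetical, for illustration only)
txt <- c(d1 = "the alien ship left the planet",
         d2 = "soldiers fought in the war",
         d3 = "an alien war on a distant planet",
         d4 = "the soldier returned from the war")
dfmt_toy <- dfm(tokens(txt))
dict_toy <- dictionary(list(space = c("alien", "planet"),
                            war = c("war", "soldier*")))

# Rename the residual topics from the default "other" to "misc"
old_opt <- options(seededlda_residual_name = "misc")

# Two seeded topics plus two undefined (residual) topics
lda_toy <- textmodel_seededlda(dfmt_toy, dict_toy, residual = 2,
                               max_iter = 100)

# Column names show the seeded topics followed by the residual topics
colnames(terms(lda_toy))

# Restore the previous option value
options(old_opt)
```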
Value
The same as textmodel_lda(), with extra elements for dictionary.
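To get a feel for the returned object, the sketch below fits a model on a toy corpus and inspects two of its elements. It assumes the object carries the theta (topics over documents) and phi (words over topics) matrices of a textmodel_lda() fit; the documents and dictionary are invented for illustration.

```r
library(seededlda)
library(quanteda)

# Toy documents and seed dictionary (hypothetical, for illustration only)
txt <- c(d1 = "the alien ship left the planet",
         d2 = "soldiers fought in the war",
         d3 = "an alien war on a distant planet",
         d4 = "the soldier returned from the war")
dfmt_toy <- dfm(tokens(txt))
dict_toy <- dictionary(list(space = c("alien", "planet"),
                            war = c("war", "soldier*")))

lda_toy <- textmodel_seededlda(dfmt_toy, dict_toy, residual = 1,
                               max_iter = 100)

# theta: one row per document, one column per topic
dim(lda_toy$theta)

# phi: one row per topic, one column per feature in the dfm
dim(lda_toy$phi)

# Each document's topic proportions sum to one
rowSums(lda_toy$theta)
```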
References
Lu, Bin et al. (2011). "Multi-aspect Sentiment Analysis with Topic Models". doi:10.5555/2117693.2119585. Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops.
Watanabe, Kohei & Zhou, Yuan. (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.
Watanabe, Kohei & Baturo, Alexander. (2023). "Seeded Sequential LDA: A Semi-supervised Algorithm for Topic-specific Analysis of Sentences". doi:10.1177/08944393231178605. Social Science Computer Review.
Examples
# \donttest{
require(seededlda)
require(quanteda)
corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords("en"), min_nchar = 2) %>%
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        moster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
lda_seed <- textmodel_seededlda(dfmt, dict, residual = 1, min_termfreq = 10,
                                max_iter = 500)
terms(lda_seed)
#> people space moster war crime other
#> [1,] "west" "space" "girls" "war" "killer" "murphy"
#> [2,] "wild" "alien" "scream" "fight" "murder" "joe"
#> [3,] "smith" "planet" "kevin" "team" "crime" "patch"
#> [4,] "daughter" "mars" "rock" "game" "mean" "eddie"
#> [5,] "prison" "batman" "cage" "brother" "number" "brothers"
#> [6,] "disney" "mission" "studio" "jackie" "filmmakers" "romantic"
#> [7,] "voice" "ship" "julie" "son" "dumb" "harry"
#> [8,] "affleck" "earth" "summer" "de" "dull" "daughter"
#> [9,] "cute" "spawn" "jennifer" "king" "mr" "bob"
#> [10,] "sandler" "computer" "george" "kong" "move" "married"
topics(lda_seed)
#> cv000_29416.txt cv001_19502.txt cv002_17424.txt cv003_12683.txt cv004_12641.txt
#> crime space crime people crime
#> cv005_29357.txt cv006_17022.txt cv007_4992.txt cv008_29326.txt cv009_29417.txt
#> space moster people moster crime
#> cv010_29063.txt cv011_13044.txt cv012_29411.txt cv013_10494.txt cv014_15600.txt
#> crime other war war other
#> cv015_29356.txt cv016_4348.txt cv017_23487.txt cv018_21672.txt cv019_16117.txt
#> space crime crime crime other
#> cv020_9234.txt cv021_17313.txt cv022_14227.txt cv023_13847.txt cv024_7033.txt
#> space crime crime crime moster
#> cv025_29825.txt cv026_29229.txt cv027_26270.txt cv028_26964.txt cv029_19943.txt
#> crime crime crime people moster
#> cv030_22893.txt cv031_19540.txt cv032_23718.txt cv033_25680.txt cv034_29446.txt
#> crime people war war moster
#> cv035_3343.txt cv036_18385.txt cv037_19798.txt cv038_9781.txt cv039_5963.txt
#> moster crime space space crime
#> cv040_8829.txt cv041_22364.txt cv042_11927.txt cv043_16808.txt cv044_18429.txt
#> crime crime people crime people
#> cv045_25077.txt cv046_10613.txt cv047_18725.txt cv048_18380.txt cv049_21917.txt
#> crime crime people other war
#> cv050_12128.txt cv051_10751.txt cv052_29318.txt cv053_23117.txt cv054_4101.txt
#> crime moster moster crime crime
#> cv055_8926.txt cv056_14663.txt cv057_7962.txt cv058_8469.txt cv059_28723.txt
#> crime crime people crime people
#> cv060_11754.txt cv061_9321.txt cv062_24556.txt cv063_28852.txt cv064_25842.txt
#> space war space other crime
#> cv065_16909.txt cv066_11668.txt cv067_21192.txt cv068_14810.txt cv069_11613.txt
#> moster war crime crime moster
#> cv070_13249.txt cv071_12969.txt cv072_5928.txt cv073_23039.txt cv074_7188.txt
#> space space war crime crime
#> cv075_6250.txt cv076_26009.txt cv077_23172.txt cv078_16506.txt cv079_12766.txt
#> crime war war space space
#> cv080_14899.txt cv081_18241.txt cv082_11979.txt cv083_25491.txt cv084_15183.txt
#> crime moster crime crime crime
#> cv085_15286.txt cv086_19488.txt cv087_2145.txt cv088_25274.txt cv089_12222.txt
#> moster crime crime crime crime
#> cv090_0049.txt cv091_7899.txt cv092_27987.txt cv093_15606.txt cv094_27868.txt
#> people space war crime other
#> cv095_28730.txt cv096_12262.txt cv097_26081.txt cv098_17021.txt cv099_11189.txt
#> people crime other war moster
#> cv100_12406.txt cv101_10537.txt cv102_8306.txt cv103_11943.txt cv104_19176.txt
#> space crime space crime crime
#> cv105_19135.txt cv106_18379.txt cv107_25639.txt cv108_17064.txt cv109_22599.txt
#> crime crime moster crime moster
#> cv110_27832.txt cv111_12253.txt cv112_12178.txt cv113_24354.txt cv114_19501.txt
#> other space crime other other
#> cv115_26443.txt cv116_28734.txt cv117_25625.txt cv118_28837.txt cv119_9909.txt
#> war other moster war crime
#> cv120_3793.txt cv121_18621.txt cv122_7891.txt cv123_12165.txt cv124_3903.txt
#> crime crime crime space space
#> cv125_9636.txt cv126_28821.txt cv127_16451.txt cv128_29444.txt cv129_18373.txt
#> crime crime crime space crime
#> cv130_18521.txt cv131_11568.txt cv132_5423.txt cv133_18065.txt cv134_23300.txt
#> crime crime crime crime crime
#> cv135_12506.txt cv136_12384.txt cv137_17020.txt cv138_13903.txt cv139_14236.txt
#> space crime crime crime crime
#> cv140_7963.txt cv141_17179.txt cv142_23657.txt cv143_21158.txt cv144_5010.txt
#> space space people crime people
#> cv145_12239.txt cv146_19587.txt cv147_22625.txt cv148_18084.txt cv149_17084.txt
#> crime war war crime crime
#> cv150_14279.txt cv151_17231.txt cv152_9052.txt cv153_11607.txt cv154_9562.txt
#> crime other war other space
#> cv155_7845.txt cv156_11119.txt cv157_29302.txt cv158_10914.txt cv159_29374.txt
#> crime war other crime space
#> cv160_10848.txt cv161_12224.txt cv162_10977.txt cv163_10110.txt cv164_23451.txt
#> crime people other people crime
#> cv165_2389.txt cv166_11959.txt cv167_18094.txt cv168_7435.txt cv169_24973.txt
#> other crime war other space
#> cv170_29808.txt cv171_15164.txt cv172_12037.txt cv173_4295.txt cv174_9735.txt
#> crime war crime crime space
#> cv175_7375.txt cv176_14196.txt cv177_10904.txt cv178_14380.txt cv179_9533.txt
#> people space space other crime
#> cv180_17823.txt cv181_16083.txt cv182_7791.txt cv183_19826.txt cv184_26935.txt
#> moster war moster war people
#> cv185_28372.txt cv186_2396.txt cv187_14112.txt cv188_20687.txt cv189_24248.txt
#> people crime moster crime space
#> cv190_27176.txt cv191_29539.txt cv192_16079.txt cv193_5393.txt cv194_12855.txt
#> space war other people people
#> cv195_16146.txt cv196_28898.txt cv197_29271.txt cv198_19313.txt cv199_9721.txt
#> war other war other crime
#> cv200_29006.txt cv201_7421.txt cv202_11382.txt cv203_19052.txt cv204_8930.txt
#> space crime crime people crime
#> cv205_9676.txt cv206_15893.txt cv207_29141.txt cv208_9475.txt cv209_28973.txt
#> crime crime crime crime war
#> cv210_9557.txt cv211_9955.txt cv212_10054.txt cv213_20300.txt cv214_13285.txt
#> space other crime space crime
#> cv215_23246.txt cv216_20165.txt cv217_28707.txt cv218_25651.txt cv219_19874.txt
#> people other other crime people
#> cv220_28906.txt cv221_27081.txt cv222_18720.txt cv223_28923.txt cv224_18875.txt
#> crime crime crime crime crime
#> cv225_29083.txt cv226_26692.txt cv227_25406.txt cv228_5644.txt cv229_15200.txt
#> crime crime crime other crime
#> cv230_7913.txt cv231_11028.txt cv232_16768.txt cv233_17614.txt cv234_22123.txt
#> crime space other crime crime
#> cv235_10704.txt cv236_12427.txt cv237_20635.txt cv238_14285.txt cv239_29828.txt
#> crime crime other crime crime
#> cv240_15948.txt cv241_24602.txt cv242_11354.txt cv243_22164.txt cv244_22935.txt
#> moster crime moster people crime
#> cv245_8938.txt cv246_28668.txt cv247_14668.txt cv248_15672.txt cv249_12674.txt
#> crime other other space war
#> cv250_26462.txt cv251_23901.txt cv252_24974.txt cv253_10190.txt cv254_5870.txt
#> other war space other people
#> cv255_15267.txt cv256_16529.txt cv257_11856.txt cv258_5627.txt cv259_11827.txt
#> moster space war people crime
#> cv260_15652.txt cv261_11855.txt cv262_13812.txt cv263_20693.txt cv264_14108.txt
#> space space crime crime moster
#> cv265_11625.txt cv266_26644.txt cv267_16618.txt cv268_20288.txt cv269_23018.txt
#> other war crime crime crime
#> cv270_5873.txt cv271_15364.txt cv272_20313.txt cv273_28961.txt cv274_26379.txt
#> crime moster crime other war
#> cv275_28725.txt cv276_17126.txt cv277_20467.txt cv278_14533.txt cv279_19452.txt
#> other moster crime crime crime
#> cv280_8651.txt cv281_24711.txt cv282_6833.txt cv283_11963.txt cv284_20530.txt
#> moster crime crime other crime
#> cv285_18186.txt cv286_26156.txt cv287_17410.txt cv288_20212.txt cv289_6239.txt
#> crime crime people crime people
#> cv290_11981.txt cv291_26844.txt cv292_7804.txt cv293_29731.txt cv294_12695.txt
#> other space crime war space
#> cv295_17060.txt cv296_13146.txt cv297_10104.txt cv298_24487.txt cv299_17950.txt
#> other crime space other moster
#> cv300_23302.txt cv301_13010.txt cv302_26481.txt cv303_27366.txt cv304_28489.txt
#> crime people war other war
#> cv305_9937.txt cv306_10859.txt cv307_26382.txt cv308_5079.txt cv309_23737.txt
#> people space crime crime other
#> cv310_14568.txt cv311_17708.txt cv312_29308.txt cv313_19337.txt cv314_16095.txt
#> moster people crime crime crime
#> cv315_12638.txt cv316_5972.txt cv317_25111.txt cv318_11146.txt cv319_16459.txt
#> space war crime crime crime
#> cv320_9693.txt cv321_14191.txt cv322_21820.txt cv323_29633.txt cv324_7502.txt
#> war moster crime crime other
#> cv325_18330.txt cv326_14777.txt cv327_21743.txt cv328_10908.txt cv329_29293.txt
#> moster crime other crime space
#> cv330_29675.txt cv331_8656.txt cv332_17997.txt cv333_9443.txt cv334_0074.txt
#> other space crime other crime
#> cv335_16299.txt cv336_10363.txt cv337_29061.txt cv338_9183.txt cv339_22452.txt
#> space moster crime space other
#> cv340_14776.txt cv341_25667.txt cv342_20917.txt cv343_10906.txt cv344_5376.txt
#> moster crime people other crime
#> cv345_9966.txt cv346_19198.txt cv347_14722.txt cv348_19207.txt cv349_15032.txt
#> war people crime crime space
#> cv350_22139.txt cv351_17029.txt cv352_5414.txt cv353_19197.txt cv354_8573.txt
#> crime moster crime crime crime
#> cv355_18174.txt cv356_26170.txt cv357_14710.txt cv358_11557.txt cv359_6751.txt
#> crime moster crime moster other
#> cv360_8927.txt cv361_28738.txt cv362_16985.txt cv363_29273.txt cv364_14254.txt
#> crime war war people crime
#> cv365_12442.txt cv366_10709.txt cv367_24065.txt cv368_11090.txt cv369_14245.txt
#> space crime space space crime
#> cv370_5338.txt cv371_8197.txt cv372_6654.txt cv373_21872.txt cv374_26455.txt
#> crime crime crime space war
#> cv375_9932.txt cv376_20883.txt cv377_8440.txt cv378_21982.txt cv379_23167.txt
#> crime other crime crime moster
#> cv380_8164.txt cv381_21673.txt cv382_8393.txt cv383_14662.txt cv384_18536.txt
#> space space crime other crime
#> cv385_29621.txt cv386_10229.txt cv387_12391.txt cv388_12810.txt cv389_9611.txt
#> crime crime crime crime war
#> cv390_12187.txt cv391_11615.txt cv392_12238.txt cv393_29234.txt cv394_5311.txt
#> crime moster crime people war
#> cv395_11761.txt cv396_19127.txt cv397_28890.txt cv398_17047.txt cv399_28593.txt
#> other people crime crime war
#> cv400_20631.txt cv401_13758.txt cv402_16097.txt cv403_6721.txt cv404_21805.txt
#> space crime other other crime
#> cv405_21868.txt cv406_22199.txt cv407_23928.txt cv408_5367.txt cv409_29625.txt
#> crime crime war crime other
#> cv410_25624.txt cv411_16799.txt cv412_25254.txt cv413_7893.txt cv414_11161.txt
#> people crime other crime people
#> cv415_23674.txt cv416_12048.txt cv417_14653.txt cv418_16562.txt cv419_14799.txt
#> space space moster crime other
#> cv420_28631.txt cv421_9752.txt cv422_9632.txt cv423_12089.txt cv424_9268.txt
#> crime people crime people space
#> cv425_8603.txt cv426_10976.txt cv427_11693.txt cv428_12202.txt cv429_7937.txt
#> crime space moster moster war
#> cv430_18662.txt cv431_7538.txt cv432_15873.txt cv433_10443.txt cv434_5641.txt
#> crime crime crime space other
#> cv435_24355.txt cv436_20564.txt cv437_24070.txt cv438_8500.txt cv439_17633.txt
#> war people space crime other
#> cv440_16891.txt cv441_15276.txt cv442_15499.txt cv443_22367.txt cv444_9975.txt
#> space crime war crime space
#> cv445_26683.txt cv446_12209.txt cv447_27334.txt cv448_16409.txt cv449_9126.txt
#> other moster crime moster crime
#> cv450_8319.txt cv451_11502.txt cv452_5179.txt cv453_10911.txt cv454_21961.txt
#> space moster war crime space
#> cv455_28866.txt cv456_20370.txt cv457_19546.txt cv458_9000.txt cv459_21834.txt
#> crime crime crime crime crime
#> cv460_11723.txt cv461_21124.txt cv462_20788.txt cv463_10846.txt cv464_17076.txt
#> war crime people crime other
#> cv465_23401.txt cv466_20092.txt cv467_26610.txt cv468_16844.txt cv469_21998.txt
#> crime crime crime moster space
#> cv470_17444.txt cv471_18405.txt cv472_29140.txt cv473_7869.txt cv474_10682.txt
#> crime war crime space crime
#> cv475_22978.txt cv476_18402.txt cv477_23530.txt cv478_15921.txt cv479_5450.txt
#> moster people crime other war
#> cv480_21195.txt cv481_7930.txt cv482_11233.txt cv483_18103.txt cv484_26169.txt
#> crime space people crime crime
#> cv485_26879.txt cv486_9788.txt cv487_11058.txt cv488_21453.txt cv489_19046.txt
#> crime crime other crime crime
#> cv490_18986.txt cv491_12992.txt cv492_19370.txt cv493_14135.txt cv494_18689.txt
#> crime crime crime moster people
#> cv495_16121.txt cv496_11185.txt cv497_27086.txt cv498_9288.txt cv499_11407.txt
#> crime other other crime space
#> Levels: people space moster war crime other
# }
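The description mentions that setting gamma > 0 runs Seeded Sequential LDA, where topics of the current document are affected by those of the previous one. A minimal sketch of that use follows, assuming the documents are first split into sentences with quanteda's corpus_reshape(). The value gamma = 0.5, the subset of 50 reviews, and the shortened dictionary are arbitrary choices for illustration, not recommendations.

```r
library(seededlda)
library(quanteda)

# Split a small subset of reviews into sentences so that consecutive
# rows of the dfm are consecutive sentences
corp_sent <- corpus_reshape(head(data_corpus_moviereviews, 50),
                            to = "sentences")
toks_sent <- tokens(corp_sent, remove_punct = TRUE, remove_symbols = TRUE,
                    remove_numbers = TRUE)
dfmt_sent <- dfm_remove(dfm(toks_sent), stopwords("en"), min_nchar = 2)

dict_seq <- dictionary(list(space = c("alien", "planet", "space"),
                            war = c("war", "soldier*", "tanks"),
                            crime = c("crime*", "murder", "killer")))

# gamma > 0 switches on the sequential model
lda_seq <- textmodel_seededlda(dfmt_sent, dict_seq, residual = 1,
                               gamma = 0.5, max_iter = 500)

# Most likely topic for the first few sentences
head(topics(lda_seq), 10)
```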