The wordvector package creates word and document vectors using quanteda. It currently supports word2vec (Mikolov et al., 2013), doc2vec (Le & Mikolov, 2014) and latent semantic analysis (Deerwester et al., 1990).

How to install

wordvector is available on CRAN.

install.packages("wordvector")

The latest version is available on GitHub.

remotes::install_github("koheiw/wordvector")

Example

We train the word2vec model on a corpus of news summaries collected from Yahoo News via RSS between 2012 and 2016.

Download data

# download data
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', 
              '~/yahoo-news.RDS', mode = "wb")

Train word2vec

library(wordvector)
library(quanteda)
## Package version: 4.3.1
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
corp <- corpus(dat, text_field = 'text', docid_field = "tid")

# Pre-processing
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>% 
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE)

# Train word2vec
wov <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, verbose = TRUE)
## Training CBOW model with 50 dimensions
##  ...using 16 threads for distributed computing
##  ...initializing
##  ...negative sampling in 10 iterations
##  ......iteration 1 elapsed time: 6.44 seconds (alpha: 0.0455)
##  ......iteration 2 elapsed time: 13.22 seconds (alpha: 0.0408)
##  ......iteration 3 elapsed time: 19.80 seconds (alpha: 0.0363)
##  ......iteration 4 elapsed time: 26.97 seconds (alpha: 0.0317)
##  ......iteration 5 elapsed time: 34.22 seconds (alpha: 0.0270)
##  ......iteration 6 elapsed time: 41.09 seconds (alpha: 0.0224)
##  ......iteration 7 elapsed time: 47.71 seconds (alpha: 0.0178)
##  ......iteration 8 elapsed time: 54.47 seconds (alpha: 0.0131)
##  ......iteration 9 elapsed time: 61.07 seconds (alpha: 0.0085)
##  ......iteration 10 elapsed time: 67.54 seconds (alpha: 0.0041)
##  ...complete
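
The pre-processing step above passes padding = TRUE when removing tokens. The toy example below shows what that option does to a single sentence. The sentence is made up for illustration; the reading that padding keeps words separated by removed tokens from being treated as neighbours during training is an assumption about the intent, not something stated by the package.

```r
library(quanteda)

# A made-up sentence, tokenized the same way as the corpus above
toks <- tokens("The Amazon forests are burning in 2016!", remove_punct = TRUE)

# Without padding, removed tokens vanish and the remaining words become adjacent
as.list(tokens_remove(toks, stopwords("en")))

# With padding, each removed token leaves an empty string in its position,
# so the original distances between the remaining words are preserved
padded <- tokens_remove(toks, stopwords("en"), padding = TRUE)
as.list(padded)
```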

Similarity between word vectors

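similarity() computes cosine similarity between word vectors. With mode = "character", it returns the words most similar to each query word, ordered from most to least similar.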

head(similarity(wov, c("amazon", "forests", "obama", "america", "afghanistan"), 
                mode = "character"))
##      amazon       forests       obama                   america          
## [1,] "amazon"     "forests"     "obama"                 "america"        
## [2,] "rainforest" "herds"       "biden"                 "america-focused"
## [3,] "peat"       "rainforests" "relationship-building" "carolina"       
## [4,] "re-grown"   "farmland"    "kerry"                 "american"       
## [5,] "peatlands"  "rainforest"  "hagel"                 "dakota"         
## [6,] "sunflower"  "forest"      "clinton"               "africa"         
##      afghanistan  
## [1,] "afghanistan"
## [2,] "afghan"     
## [3,] "taliban"    
## [4,] "kabul"      
## [5,] "afghans"    
## [6,] "pakistan"
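
To make the computation concrete, here is a minimal base-R sketch of cosine similarity on a toy matrix. The vector values and the helper names cosine_sim() and nearest() are made up for illustration and are not part of wordvector; the package's own implementation may differ.

```r
# Toy 3-dimensional "word vectors" (hypothetical values, one row per word)
vecs <- rbind(
    amazon     = c(0.9, 0.1, 0.2),
    rainforest = c(0.8, 0.2, 0.1),
    obama      = c(0.1, 0.9, 0.3),
    biden      = c(0.2, 0.8, 0.4)
)

# Cosine similarity between every pair of rows
cosine_sim <- function(m) {
    unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
    unit %*% t(unit)                # dot products of unit vectors
}

# Words ranked by similarity to a query word, most similar first
nearest <- function(m, word) {
    sim <- cosine_sim(m)[, word]
    names(sort(sim, decreasing = TRUE))
}

sim <- cosine_sim(vecs)
nearest(vecs, "amazon")
```

As in the output above, the query word itself comes first because its similarity to itself is 1.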

Arithmetic operations of word vectors

analogy() offers an interface for arithmetic operations on word vectors.

# What is Amazon without forests?
head(similarity(wov, analogy(~ amazon - forests))) 
##      [,1]            
## [1,] "yahoo"         
## [2,] "smash-hit"     
## [3,] "gawker"        
## [4,] "aggregators"   
## [5,] "troll"         
## [6,] "globe-spanning"
# What is to Afghanistan as Obama is to America?
head(similarity(wov, analogy(~ obama - america + afghanistan))) 
##      [,1]         
## [1,] "afghanistan"
## [2,] "karzai"     
## [3,] "afghan"     
## [4,] "taliban"    
## [5,] "obama"      
## [6,] "nato"
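
The arithmetic behind such a query can be sketched in base R: add and subtract the word vectors, then rank all words by cosine similarity to the result. The two-dimensional vectors and the helper analogy_query() are hypothetical and chosen so that the arithmetic is easy to follow; this is not the package's implementation.

```r
# Toy vectors (hypothetical): axis 1 separates the two countries,
# axis 2 separates leaders from countries
vecs <- rbind(
    america     = c( 1, 0),
    obama       = c( 1, 1),
    afghanistan = c(-1, 0),
    karzai      = c(-1, 1)
)

# Sum the "plus" vectors, subtract the "minus" vectors, then rank
# every word by cosine similarity to the resulting query vector
analogy_query <- function(m, plus, minus) {
    query <- colSums(m[plus, , drop = FALSE]) - colSums(m[minus, , drop = FALSE])
    sim <- (m %*% query) / (sqrt(rowSums(m^2)) * sqrt(sum(query^2)))
    names(sort(sim[, 1], decreasing = TRUE))
}

# obama - america + afghanistan
ranked <- analogy_query(vecs, plus = c("obama", "afghanistan"), minus = "america")
ranked
```

The sketch does not exclude the query words from the ranking, which is also why words such as "afghanistan" and "obama" appear in the real output above.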

These examples replicate analogy tasks from the original word2vec paper.

# What is to France as Berlin is to Germany?
head(similarity(wov, analogy(~ berlin - germany + france))) 
##      [,1]        
## [1,] "paris"     
## [2,] "strasbourg"
## [3,] "brussels"  
## [4,] "berlin"    
## [5,] "amsterdam" 
## [6,] "france"
# What is to slowly as quick is to quickly?
head(similarity(wov, analogy(~ quick - quickly + slowly)))
##      [,1]       
## [1,] "uneven"   
## [2,] "stumble"  
## [3,] "backwards"
## [4,] "fades"    
## [5,] "slow"     
## [6,] "upside"