dselivanov/text2vec

Topic modeling guide

Open

#262 opened on May 26, 2018

help wanted

Description

It would be useful to create a comprehensive, practical guide to topic modeling. We now have all the components in place:

  • POS tags and lemmatization - thanks to the udpipe package
  • coherence measures - thanks to Manuel's work
  • fast LDA - thanks to WarpLDA in text2vec
  • fast non-negative matrix factorization - thanks to the rsparse package
  • multi-word phrase extraction - several approaches: text2vec::Collocations, udpipe::as_phrasemachine

Steps

  • find an interesting, non-trivial corpus with a large number of documents
  • demonstrate how to create a tokenizer which only uses particular POS tags
  • create a collocation model on top of that
  • create a document-term matrix using tokens with multi-word expressions
  • fit several topic models (text2vec::LDA, rsparse::WRMF) with different hyperparameters
  • cross-validate / compare them using different coherence metrics
    • demonstrate usage of an external corpus for tcm calculation
    • check how the coherence metrics are correlated (is perplexity correlated with them?)
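
To make the plan concrete, here is a rough sketch of the middle steps (collocations, document-term matrix, several LDA fits, coherence and perplexity). It is only an illustration under several assumptions that are not part of the plan: it uses the small `movie_review` data set shipped with text2vec instead of a large corpus, a plain `word_tokenizer` where a POS-filtering tokenizer (for example one built on udpipe) would go, an intrinsic tcm built from the same documents rather than an external corpus, in-sample perplexity, and arbitrary thresholds and topic counts. The helper `fit_and_score` is a made-up name, and rsparse::WRMF is left out.

```r
library(text2vec)

set.seed(42)
data("movie_review")                       # demo corpus shipped with text2vec (assumption)
txt <- tolower(movie_review$review[1:500])
ids <- movie_review$id[1:500]

# Plain tokenizer as a stand-in; a tokenizer that keeps only
# selected POS tags would be plugged in at this point.
tokens <- word_tokenizer(txt)
it <- itoken(tokens, ids = ids, progressbar = FALSE)

# Learn multi-word expressions, then get an iterator that emits merged tokens.
cc <- Collocations$new(collocation_count_min = 10, pmi_min = 5)
cc$fit(it, n_iter = 2)
it_phrases <- cc$transform(it)

# Document-term matrix over tokens that include the merged phrases.
vocab <- create_vocabulary(it_phrases)
vocab <- prune_vocabulary(vocab, term_count_min = 5, doc_proportion_max = 0.5)
dtm <- create_dtm(it_phrases, vocab_vectorizer(vocab))

# Intrinsic reference co-occurrence: binary, document-level.
# A tcm computed on an external corpus would replace this.
tcm <- crossprod(sign(dtm))

# Hypothetical helper: fit one LDA model and summarise its quality metrics.
fit_and_score <- function(k) {
  lda <- LDA$new(n_topics = k)
  doc_topic <- lda$fit_transform(dtm, n_iter = 50, progressbar = FALSE)
  top_words <- lda$get_top_words(n = 10, lambda = 1)
  coh <- coherence(top_words, tcm,
                   metrics = c("mean_logratio", "mean_npmi"),
                   n_doc_tcm = nrow(dtm))
  c(n_topics = k,
    colMeans(coh, na.rm = TRUE),           # average each metric over topics
    perplexity = perplexity(dtm, lda$topic_word_distribution, doc_topic))
}

# One row per model; columns hold the coherence metrics and perplexity.
results <- do.call(rbind, lapply(c(5, 10), fit_and_score))
print(results)
```

With more models in `results`, something like `cor(results[, -1])` would give a first look at how the coherence metrics and perplexity move together, though two models are far too few to say anything.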

There are already good vignettes in the udpipe package on topic modeling and phrase extraction. They can be used as inspiration.

@manuelbickel @jwijffels anything we can add to the plan above?
