help wanted
Description
It would be useful to create a comprehensive, practical guide to topic modeling. All the components are now in place:
- POS tags and lemmatization, thanks to the `udpipe` package
- coherence measures, thanks to Manuel's work
- fast LDA, thanks to WarpLDA in `text2vec`
- fast non-negative matrix factorization, thanks to the `rsparse` package
- multi-word phrase extraction, with several approaches: `text2vec::Collocations`, `udpipe::as_phrasemachine`
Steps
- find an interesting, non-trivial corpus with a large number of documents
- demonstrate how to create a tokenizer that keeps only particular POS tags
- create a collocation model on top of that
- create a document-term matrix using tokens with multi-word expressions
- fit several topic models (`text2vec::LDA`, `rsparse::WRMF`) with different hyperparameters
- cross-validate / compare them using different coherence metrics
- demonstrate usage of an external corpus for `tcm` calculation
- check how the coherence metrics are correlated (is perplexity correlated with them?)
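
As a starting point for discussion, here is a minimal sketch of the middle of the plan (collocations, document-term matrix, LDA, coherence, perplexity) using only `text2vec`. Several things here are assumptions rather than part of the plan: the `movie_review` data bundled with `text2vec` is just a small stand-in corpus, the hyperparameters and thresholds are arbitrary, the POS-filtering step with `udpipe` and the `rsparse::WRMF` model are left out, and the `tcm` is built from the same corpus instead of an external one.

```r
library(text2vec)

set.seed(42)

# Stand-in corpus: the movie reviews bundled with text2vec (assumption, not the final corpus).
data("movie_review")
n_docs <- 500
tokens <- word_tokenizer(tolower(movie_review$review[seq_len(n_docs)]))
it <- itoken(tokens, progressbar = FALSE)

# Learn multi-word expressions and merge them into single tokens.
# Thresholds are arbitrary illustration values.
cc <- Collocations$new(collocation_count_min = 20, pmi_min = 5)
cc$fit(it, n_iter = 2)
it_phrases <- cc$transform(it)

# Document-term matrix built on tokens that include the learned phrases.
vocab <- create_vocabulary(it_phrases)
vocab <- prune_vocabulary(vocab, term_count_min = 5, doc_proportion_max = 0.2)
dtm <- create_dtm(it_phrases, vocab_vectorizer(vocab))

# Intrinsic document-level co-occurrence counts used as the coherence reference.
# The plan calls for an external corpus here; this is a simplification.
tcm <- crossprod(sign(dtm))

# Fit LDA for a couple of topic counts and collect perplexity and mean coherence.
results <- do.call(rbind, lapply(c(5, 10), function(k) {
  lda <- LDA$new(n_topics = k, doc_topic_prior = 0.1, topic_word_prior = 0.01)
  doc_topic <- lda$fit_transform(dtm, n_iter = 100, progressbar = FALSE)
  top_words <- lda$get_top_words(n = 10, lambda = 1)
  coh <- coherence(top_words, tcm,
                   metrics = c("mean_logratio", "mean_npmi"),
                   n_doc_window = n_docs)
  data.frame(
    n_topics = k,
    perplexity = perplexity(dtm, lda$topic_word_distribution, doc_topic),
    t(colMeans(coh, na.rm = TRUE))  # average each metric over topics
  )
}))

print(results)
```

With a larger grid of models, the columns of `results` could be passed to `cor(..., method = "spearman")` to check whether perplexity and the coherence metrics agree on model ranking.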
The `udpipe` package already has good vignettes on topic modeling and phrase extraction. They can be used as inspiration.
@manuelbickel @jwijffels anything we can add to the plan above?