online-ml/river

Online GapEncoder

Open

#1439 opened on Nov 3, 2023

Labels: Good first issue, New feature

Description

skrub is a wonderful new project related to scikit-learn. You can see Gaël Varoquaux present it here. They have a transformer called GapEncoder: it's a way to embed fuzzy strings. This could be really powerful online, say for classifying Tweets or Twitch messages, where typos are aplenty.
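
To illustrate why a character n-gram representation tolerates typos, here is a minimal sketch in plain Python. It is not how skrub or river implement anything: `char_ngrams` and `cosine` are hypothetical helper names, and the choice of 3-grams is an assumption made for the example.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count the character n-grams of a string (hypothetical helper)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


clean = char_ngrams("streaming")
typo = char_ngrams("streamming")  # one extra letter
other = char_ngrams("keyboard")

print(cosine(clean, typo))   # high: most 3-grams are shared
print(cosine(clean, other))  # 0.0: no 3-gram in common
```

The counts are plain dictionaries keyed by n-gram, so they can be built one message at a time without fixing a vocabulary in advance, which is what an online setting needs.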

We already have a way to do online TF-IDF/count vectorization, but we don't have Gamma-Poisson matrix factorization. It is doable online, though. Once we have it, we could assemble the two into a nice GapEncoder class. See paper here.
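
As a rough sketch of the factorization step, the snippet below infers non-negative activations for a single n-gram count vector against a fixed set of topics, using the classic multiplicative update for the Poisson (KL) objective. It is deliberately simplified and is not the algorithm from the paper: the Gamma prior is omitted, the topic dictionary is a hand-written toy and is never updated (the dictionary update is the harder online part), and `infer_activations` is a hypothetical name.

```python
def infer_activations(counts, topics, n_iter=50):
    """Fit activations h so that sum_k h[k] * topics[k] approximates counts.

    counts: dict mapping n-gram -> observed count.
    topics: list of dicts mapping n-gram -> non-negative weight (assumed given).
    N-grams that no topic covers are ignored in this sketch.
    """
    h = [1.0] * len(topics)
    # Column sums of the dictionary: the denominator of the update.
    totals = [sum(topic.values()) for topic in topics]
    for _ in range(n_iter):
        # Reconstruction of each observed n-gram under the current activations.
        pred = {
            gram: sum(h[k] * topic.get(gram, 0.0) for k, topic in enumerate(topics))
            for gram in counts
        }
        for k, topic in enumerate(topics):
            num = sum(
                topic.get(gram, 0.0) * count / pred[gram]
                for gram, count in counts.items()
                if pred[gram] > 0
            )
            h[k] = h[k] * num / totals[k] if totals[k] > 0 else 0.0
    return h


# Toy topics over 3-grams (made-up values, for illustration only).
topics = [
    {"str": 1.0, "tre": 1.0, "rea": 1.0, "eam": 1.0},
    {"key": 1.0, "eyb": 1.0, "ybo": 1.0},
]
counts = {"str": 1, "tre": 1, "rea": 1, "eam": 1}

print(infer_activations(counts, topics))  # activation mass lands on the first topic
```

In an encoder following river's `learn_one` / `transform_one` convention, a function like this would roughly play the role of `transform_one`, with the activations serving as the embedding of the string.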

This is related to #1412. It may even work well without Gamma-Poisson matrix factorization: for instance, we could use decomposition.LDA, which we already have.
