Calculating Probabilities on Zero-Shot Learning · MaartenGr/BERTopic#2076

(5 comments) (2 reactions) (0 assignees) Python (634 forks)

enhancement, good first issue

Repository metrics

Stars: (5,074 stars)
PR merge metrics: (No PRs merged in the last 30 days)

Description

Have you searched existing issues? 🔎

I have searched and found no existing issues

Describe the bug

Hi everyone, first of all I would like to thank @MaartenGr and all the contributors for this amazing project.

Recently I started looking at BERTopic as a way to classify customer tickets into categories defined in the zeroshot_topic_list parameter. After fitting the model by calling fit_transform, my goal was to obtain, for each document, the probability of it belonging to each topic (both the predefined and the generated ones).

probs is None after fit_transform, as expected and mentioned at the end of this page https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html#example. Therefore, I then called transform to get the probabilities.

Now I've got two questions I would like to answer:

1. Is this the right approach to get the probabilities?
2. Most of the documents have almost the same (high) probability across all the topics. Does this mean the clustering did not fit the data well? What do you suggest?

Again, thank you in advance for your effort. I look forward to contributing as well if needed.

Reproduction

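from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# default_stopwords, custom_stopwords, sentence_model, taxonomy_list,
# global_ticket_descriptions and embeddings are defined elsewhere (not shown in the issue).
topic_model = BERTopic(
  calculate_probabilities = True,
  vectorizer_model = CountVectorizer(stop_words=default_stopwords + custom_stopwords),
  ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True),
  embedding_model = sentence_model,
  min_topic_size = 50,
  zeroshot_topic_list = taxonomy_list,
  zeroshot_min_similarity = .80,
  representation_model = KeyBERTInspired(),
  verbose = True,
)

# With zero-shot topics, fit_transform returns None for the probabilities, so they are discarded here.
topics, _ = topic_model.fit_transform(global_ticket_descriptions, embeddings=embeddings)
# transform is then called on the same documents to get a score per topic.
_, probs = topic_model.transform(global_ticket_descriptions, embeddings=embeddings)
print(probs)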

The output is:

2024-07-09 14:48:15,691 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.
[[0.8463203  0.8880229  0.8636132  ... 0.8376566  0.8344382  0.82069004]
 [0.92492086 0.8977871  0.91309047 ... 0.9031642  0.90693504 0.9101905 ]
 [0.9009018  0.9072951  0.91581297 ... 0.9002963  0.8903491  0.8731015 ]
 ...
 [0.85506856 0.9055494  0.8743948  ... 0.8490986  0.8607102  0.8375366 ]
 [0.8783543  0.88571143 0.8983458  ... 0.8795974  0.87251174 0.8653137 ]
 [0.8480984  0.8751351  0.8680049  ... 0.8383051  0.8333502  0.8458183 ]]
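
The log line above says the values come from the cosine similarity between document and topic embeddings, so they are similarity scores rather than normalized probabilities: rows do not have to sum to 1, and all scores can sit in a narrow, high band. Below is a minimal sketch of how such a matrix arises, assuming only NumPy and using synthetic vectors in place of the real embeddings. The per-row softmax with a temperature is an illustrative way to spread the scores; it is an assumption made here, not something BERTopic is known to apply, and the helper names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(docs, topics):
    """Cosine similarity between every document row and every topic row."""
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    topics_n = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    return docs_n @ topics_n.T


def softmax_rows(sims, temperature=0.05):
    """Turn each row of similarities into a distribution that sums to 1.

    The temperature value is an arbitrary illustrative choice.
    """
    scaled = sims / temperature
    scaled -= scaled.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)


# Synthetic stand-ins: every vector shares a large common component,
# which pushes all cosine similarities toward the same high range.
rng = np.random.default_rng(0)
shared = rng.normal(size=16)
docs = shared + 0.3 * rng.normal(size=(4, 16))
topics = shared + 0.3 * rng.normal(size=(3, 16))

sims = cosine_similarity_matrix(docs, topics)  # shape (4, 3), raw scores
probs = softmax_rows(sims)                     # shape (4, 3), rows sum to 1

print(sims)
print(probs)
```

The ranking of topics within each row is unchanged by the softmax, so the top-scoring topic for a document stays the same; only the spread between scores changes.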

BERTopic Version

0.16.2

Contributor guide

Research direction: Investigate why the zero-shot probabilities are uniformly high by examining the cosine similarity computation between document and topic embeddings. Check whether `zeroshot_min_similarity` is too low or whether `embedding_model` produces similar embeddings for different topics. Also confirm whether `fit_transform` ignores `calculate_probabilities=True` as stated in the documentation, and consider how to correctly use the probabilities returned by `transform`.
Tech stack: python
Domain: machine learning, AI
Issue type: Bug
Difficulty: 2
Estimated time: Half a day
Activity status: Active
Clarity: Fairly clear
Prerequisites: Python, BERTopic
Beginner accessibility: 70
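
One way to start on the research direction is to measure how similar the topic embeddings are to one another. The sketch below assumes a 2-D array with one row per topic; in a fitted model this would presumably be `topic_model.topic_embeddings_`, while synthetic data stands in for it here. The helper name `mean_offdiag_similarity` is hypothetical, and the mean-centering comparison is only a diagnostic idea, not part of BERTopic.

```python
import numpy as np


def mean_offdiag_similarity(embeddings):
    """Average cosine similarity over all pairs of distinct rows."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = sims.shape[0]
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(off_diag.mean())


# Synthetic stand-in for topic embeddings that share a large common component.
rng = np.random.default_rng(1)
shared = rng.normal(size=32)
topic_embeddings = shared + 0.3 * rng.normal(size=(5, 32))

# Similarity between topics as-is.
raw = mean_offdiag_similarity(topic_embeddings)

# Similarity after removing the mean vector shared by all topics.
centered = mean_offdiag_similarity(
    topic_embeddings - topic_embeddings.mean(axis=0)
)

print(f"raw: {raw:.3f}, centered: {centered:.3f}")
```

If the raw value is close to 1 while the centered value is much lower, the topics mostly differ along a small residual direction, which would be consistent with every document scoring similarly high against every topic.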