Calculating Probabilities on Zero-Shot Learning · MaartenGr/BERTopic#2076

5 comments · 2 reactions · 0 assignees · Python · 634 forks

Labels: enhancement, good first issue

Repository metrics

Stars: 5,074
PR merge metrics: No merged PRs in 30d

Description

Have you searched existing issues? 🔎

I have searched and found no existing issues

Describe the bug

Hi everyone, first of all I would like to thank @MaartenGr and all the contributors for this amazing project.

Recently I started looking at BERTopic as a way to classify customer tickets into categories defined in the zeroshot_topic_list parameter. After fitting the model with fit_transform, my goal was to obtain, for each document, the probability of it belonging to each topic (both the predefined and the generated ones).

probs is None after fit_transform, as expected and as mentioned at the end of this page: https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html#example. I therefore called transform to get the probabilities.

Now I've got two questions I would like to answer:

1. Is this the right approach to get the probabilities?
2. Most of the documents have almost the same (high) probability across all topics. Does this mean the clustering did not fit the data well? What do you suggest?

Again, thank you in advance for your effort. I look forward to contributing as well if needed.

Reproduction

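from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# sentence_model, default_stopwords, custom_stopwords, taxonomy_list,
# global_ticket_descriptions and embeddings are defined elsewhere (not shown in this report).
topic_model = BERTopic(
    calculate_probabilities=True,
    vectorizer_model=CountVectorizer(stop_words=default_stopwords + custom_stopwords),
    ctfidf_model=ClassTfidfTransformer(reduce_frequent_words=True),
    embedding_model=sentence_model,
    min_topic_size=50,
    zeroshot_topic_list=taxonomy_list,
    zeroshot_min_similarity=0.80,
    representation_model=KeyBERTInspired(),
    verbose=True,
)

# fit_transform returns probs=None in zero-shot mode, so the second value is discarded.
topics, _ = topic_model.fit_transform(global_ticket_descriptions, embeddings=embeddings)
# transform is called afterwards to obtain the per-topic scores.
_, probs = topic_model.transform(global_ticket_descriptions, embeddings=embeddings)
print(probs)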

The output is:

2024-07-09 14:48:15,691 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.
[[0.8463203  0.8880229  0.8636132  ... 0.8376566  0.8344382  0.82069004]
 [0.92492086 0.8977871  0.91309047 ... 0.9031642  0.90693504 0.9101905 ]
 [0.9009018  0.9072951  0.91581297 ... 0.9002963  0.8903491  0.8731015 ]
 ...
 [0.85506856 0.9055494  0.8743948  ... 0.8490986  0.8607102  0.8375366 ]
 [0.8783543  0.88571143 0.8983458  ... 0.8795974  0.87251174 0.8653137 ]
 [0.8480984  0.8751351  0.8680049  ... 0.8383051  0.8333502  0.8458183 ]]
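
The log line above says the values come from cosine similarity between topic and document embeddings, so each row is a set of similarity scores rather than a distribution that sums to 1. A minimal sketch of one way to inspect such rows follows: it reports the best-scoring topic, the spread between the highest and lowest score, and a temperature-scaled softmax that turns the scores into relative weights. It uses only the entries visible in the first three rows printed above (the columns hidden behind "..." are not available), and both the helper names and the temperature of 0.05 are assumptions chosen for illustration, not part of BERTopic.

```python
import math

# Visible entries of the first three rows printed above; the columns hidden
# behind "..." are unknown, so this is illustrative only.
rows = [
    [0.8463203, 0.8880229, 0.8636132, 0.8376566, 0.8344382, 0.82069004],
    [0.92492086, 0.8977871, 0.91309047, 0.9031642, 0.90693504, 0.9101905],
    [0.9009018, 0.9072951, 0.91581297, 0.9002963, 0.8903491, 0.8731015],
]


def softmax(values, temperature=0.05):
    """Rescale similarity scores into weights that sum to 1 (hypothetical helper)."""
    top = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def summarize(row, temperature=0.05):
    """Return the best topic index, the score spread, and the rescaled weights."""
    best = max(range(len(row)), key=row.__getitem__)
    return {
        "best": best,
        "spread": max(row) - min(row),
        "dist": softmax(row, temperature),
    }


summaries = [summarize(row) for row in rows]
for s in summaries:
    print(s["best"], round(s["spread"], 4), [round(p, 3) for p in s["dist"]])
```

A small spread in every row suggests that the absolute values carry little information and that the ranking within a row is the more useful signal. The softmax does not change that ranking; it only makes the relative differences easier to read.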

BERTopic Version

0.16.2

Contributor guide

Research direction: Investigate why the zero-shot probabilities are uniformly high by examining the cosine similarity computation between document and topic embeddings. Check whether `zeroshot_min_similarity` is too low or whether the embedding model produces similar embeddings for different topics. Also verify whether `calculate_probabilities=True` is ignored during `fit_transform`, as the documentation states, and whether the values returned by `transform` are being interpreted correctly.
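
The check on whether different topics receive similar embeddings could be sketched as below: compute the mean pairwise cosine similarity across topic vectors and compare it with a well-separated baseline. The vectors are hypothetical toy values and the helper names are made up for this sketch. With a fitted model, the topic vectors would presumably come from its `topic_embeddings_` attribute, which is an assumption to verify against the installed BERTopic version.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    pairs = [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


# Hypothetical toy "topic embeddings" that all point in nearly the same direction.
crowded_topics = [
    [1.0, 0.9, 1.1],
    [1.1, 1.0, 0.9],
    [0.9, 1.1, 1.0],
]

# Baseline: mutually orthogonal vectors, i.e. perfectly separated topics.
separated_topics = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

crowded = mean_pairwise_similarity(crowded_topics)
separated = mean_pairwise_similarity(separated_topics)
print(round(crowded, 4), round(separated, 4))
```

If the mean pairwise similarity of the real topic embeddings is close to 1, every document will score highly against every topic regardless of how well the clustering fits, which would be consistent with the matrix reported in the issue.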
Tech stack: python
Domain: machine learning, AI
Issue type: Bug
Difficulty: 2
Estimated time: Half day
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python, BERTopic
Newbie friendliness: 70