MaartenGr/BERTopic

Calculating Probabilities on Zero-Shot Learning

Open

#2076 opened on Jul 9, 2024

enhancement, good first issue

Description

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

Hi everyone, first of all I would like to thank @MaartenGr and all the contributors for this amazing project.

Recently I started using BERTopic to classify customer tickets into categories defined in the zeroshot_topic_list parameter. After fitting the model with fit_transform, my goal was to get, for each document, the probability of it belonging to each topic (both predefined and generated).

probs is None after fit_transform, as expected and as mentioned at the end of this page: https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html#example. I therefore called transform to get the probabilities.

Now I have two questions:

  • Is this the right approach to get the probabilities?
  • Most documents have almost the same (high) probability across all topics. Does this mean the clustering did not fit the data well? What do you suggest?

Again, thank you in advance for your effort. I look forward to contributing as well if needed.

Reproduction

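from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# default_stopwords, custom_stopwords, sentence_model, taxonomy_list,
# global_ticket_descriptions and embeddings are defined elsewhere.
topic_model = BERTopic(
    calculate_probabilities=True,
    vectorizer_model=CountVectorizer(stop_words=default_stopwords + custom_stopwords),
    ctfidf_model=ClassTfidfTransformer(reduce_frequent_words=True),
    embedding_model=sentence_model,
    min_topic_size=50,
    zeroshot_topic_list=taxonomy_list,
    zeroshot_min_similarity=0.80,
    representation_model=KeyBERTInspired(),
    verbose=True,
)

# probs is None after fit_transform in the zero-shot setting, so it is discarded here.
topics, _ = topic_model.fit_transform(global_ticket_descriptions, embeddings=embeddings)
# Call transform on the same documents to obtain the document-topic matrix.
_, probs = topic_model.transform(global_ticket_descriptions, embeddings=embeddings)
print(probs)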

The output is:

2024-07-09 14:48:15,691 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.
[[0.8463203  0.8880229  0.8636132  ... 0.8376566  0.8344382  0.82069004]
 [0.92492086 0.8977871  0.91309047 ... 0.9031642  0.90693504 0.9101905 ]
 [0.9009018  0.9072951  0.91581297 ... 0.9002963  0.8903491  0.8731015 ]
 ...
 [0.85506856 0.9055494  0.8743948  ... 0.8490986  0.8607102  0.8375366 ]
 [0.8783543  0.88571143 0.8983458  ... 0.8795974  0.87251174 0.8653137 ]
 [0.8480984  0.8751351  0.8680049  ... 0.8383051  0.8333502  0.8458183 ]]
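
Judging from the log line, these values appear to be cosine similarities rather than normalized probabilities, which would explain why every row is uniformly high and does not sum to 1. To make the differences between topics easier to read, I tried rescaling a row myself. This is only a minimal sketch: the softmax, the temperature of 0.01, and the helper name sharpen_similarities are my own illustrative assumptions, not something BERTopic provides or does internally.

```python
import math


def sharpen_similarities(row, temperature=0.01):
    """Rescale one row of cosine similarities into weights that sum to 1.

    Assumption: softmax with a small temperature is an illustrative choice,
    not part of BERTopic.
    """
    peak = max(row)
    # Subtract the row maximum before exponentiating for numerical stability.
    weights = [math.exp((s - peak) / temperature) for s in row]
    total = sum(weights)
    return [w / total for w in weights]


# First three values of the first row printed above.
row = [0.8463203, 0.8880229, 0.8636132]
dist = sharpen_similarities(row)

# Index of the most similar topic; the ranking is unchanged by the rescaling.
best = max(range(len(row)), key=row.__getitem__)
print(best, [round(p, 3) for p in dist])
```

The ranking of topics per document stays the same; only the spread between them becomes visible. A smaller temperature exaggerates the gaps further, so the rescaled numbers should not be read as calibrated probabilities.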

BERTopic Version

0.16.2
