MaartenGr/BERTopic

Calculating Probabilities on Zero-Shot Learning

Open

#2,076 创建于 2024年7月9日

在 GitHub 查看
 (5 评论) (2 反应) (0 负责人)Python (634 fork)batch import
enhancementgood first issue

仓库指标

Star
 (5,074 star)
PR 合并指标
 (30 天内没有已合并 PR)

描述

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Desribe the bug

Hi everyone, first of all I would like to thank @MaartenGr and all the contributors for this amazing project.

Recently I started looking at BERTopic as a method to classify some customer tickets into some categories defined intozeroshot_topic_list parameter. After fitting the model by calling fit_transform my goal was to look for each document, the probability of that document of belonging to all the topics (both predefined and generated ones).

probs is None after fit_transform, as expected and mentioned at the end of this page https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html#example. Therefore, I then called transform to get the probabilities.

Now I've got two questions I would like to answer:

  • is this one the right approach to get the probabilities?
  • most of the documents have almost the same (high) probability among all the topics. Does this mean the clustering didn't fit that much the data? What do you suggest?

Again, I would like to thank you for your effort in advance and I look forward to contribute as well if needed.

Reproduction

topic_model = BERTopic(
  calculate_probabilities = True,
  vectorizer_model = CountVectorizer(stop_words=default_stopwords + custom_stopwords),
  ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True),
  embedding_model = sentence_model,
  min_topic_size = 50,
  zeroshot_topic_list = taxonomy_list,
  zeroshot_min_similarity = .80,
  representation_model = KeyBERTInspired(),
  verbose = True,
)

topics, _ = topic_model.fit_transform(global_ticket_descriptions, embeddings=embeddings)
_, probs = topic_model.transform(global_ticket_descriptions, embeddings=embeddings)
print(probs)

The output is:

2024-07-09 14:48:15,691 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.
[[0.8463203  0.8880229  0.8636132  ... 0.8376566  0.8344382  0.82069004]
 [0.92492086 0.8977871  0.91309047 ... 0.9031642  0.90693504 0.9101905 ]
 [0.9009018  0.9072951  0.91581297 ... 0.9002963  0.8903491  0.8731015 ]
 ...
 [0.85506856 0.9055494  0.8743948  ... 0.8490986  0.8607102  0.8375366 ]
 [0.8783543  0.88571143 0.8983458  ... 0.8795974  0.87251174 0.8653137 ]
 [0.8480984  0.8751351  0.8680049  ... 0.8383051  0.8333502  0.8458183 ]]

BERTopic Version

0.16.2

贡献者指南