Calculating Probabilities on Zero-Shot Learning
#2076 opened on Jul 9, 2024
Description
Have you searched existing issues? 🔎
- I have searched and found no existing issues
Describe the bug
Hi everyone, first of all I would like to thank @MaartenGr and all the contributors for this amazing project.
I recently started using BERTopic to classify customer tickets into categories defined in the zeroshot_topic_list parameter. After fitting the model with fit_transform, my goal was to get, for each document, its probability of belonging to each topic (both the predefined and the generated ones).
probs is None after fit_transform, as expected and as mentioned at the end of this page: https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html#example. I therefore called transform to get the probabilities.
I now have two questions:
- Is this the right approach to get the probabilities?
- Most of the documents have almost the same (high) probability across all topics. Does this mean the clustering did not fit the data well? What do you suggest?
Again, thank you in advance for your effort. I look forward to contributing as well if needed.
Reproduction
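from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer

# sentence_model, embeddings, taxonomy_list, the two stopword lists, and
# global_ticket_descriptions are defined elsewhere (not shown here).
topic_model = BERTopic(
    calculate_probabilities=True,
    vectorizer_model=CountVectorizer(stop_words=default_stopwords + custom_stopwords),
    ctfidf_model=ClassTfidfTransformer(reduce_frequent_words=True),
    embedding_model=sentence_model,
    min_topic_size=50,
    zeroshot_topic_list=taxonomy_list,
    zeroshot_min_similarity=.80,
    representation_model=KeyBERTInspired(),
    verbose=True,
)
# probs returned by fit_transform is None here, so transform is called to obtain it.
topics, _ = topic_model.fit_transform(global_ticket_descriptions, embeddings=embeddings)
_, probs = topic_model.transform(global_ticket_descriptions, embeddings=embeddings)
print(probs)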
The output is:
2024-07-09 14:48:15,691 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.
[[0.8463203 0.8880229 0.8636132 ... 0.8376566 0.8344382 0.82069004]
[0.92492086 0.8977871 0.91309047 ... 0.9031642 0.90693504 0.9101905 ]
[0.9009018 0.9072951 0.91581297 ... 0.9002963 0.8903491 0.8731015 ]
...
[0.85506856 0.9055494 0.8743948 ... 0.8490986 0.8607102 0.8375366 ]
[0.8783543 0.88571143 0.8983458 ... 0.8795974 0.87251174 0.8653137 ]
[0.8480984 0.8751351 0.8680049 ... 0.8383051 0.8333502 0.8458183 ]]
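The log line above says the assignments are predicted through cosine similarity, so the matrix most likely holds raw similarity scores rather than values that sum to 1 per document. That would explain why every entry is uniformly high. As a rough sketch only, one way to turn each row into a distribution is a row-wise softmax with a temperature. The helper name `similarities_to_distribution` and the temperature value are hypothetical choices for illustration and are not part of the BERTopic API. The sample values are the first three columns of the first two rows printed above.

```python
import numpy as np


def similarities_to_distribution(sims, temperature=0.05):
    """Row-wise softmax over cosine similarities.

    Hypothetical helper for illustration; not a BERTopic function.
    A smaller temperature spreads near-identical similarities further apart.
    """
    sims = np.asarray(sims, dtype=float)
    # Subtract the row maximum before exponentiating for numerical stability.
    scaled = (sims - sims.max(axis=1, keepdims=True)) / temperature
    weights = np.exp(scaled)
    return weights / weights.sum(axis=1, keepdims=True)


# First three columns of the first two rows from the output above.
sample = np.array([
    [0.8463203, 0.8880229, 0.8636132],
    [0.92492086, 0.8977871, 0.91309047],
])

dist = similarities_to_distribution(sample)
print(dist.round(3))         # each row now sums to 1
print(dist.argmax(axis=1))   # ranking of topics is unchanged by the rescaling
```

The rescaling only changes how the scores are spread, not which topic ranks highest for a document. The temperature is a free parameter that would need tuning against the real data.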
BERTopic Version
0.16.2