How to get topic-probs matrix in bertopic modeling


This Content is from Stack Overflow. Question asked by JJD

I ran BERTopic to get topics for 3,500 documents. How could I get the topic-probs matrix for each document and export them to csv? When I export them, I want to export the identifier of each document too.

I tried two approaches: First, I found topic_model.visualize_distribution(probs[#]) gives the information that I want. But how can I export the topics-probs data for each document to csv?

Second, I found this thread (How to get all docoments per topic in bertopic modeling) can be useful if I can add the column for probabilities to the data frame it generates. Is there any way to do that?

Please share any other approaches that can produce and export the topic-probabilities matrix for all documents.

For your information, this is my BERTopic code. Thank you!

embedding_model = SentenceTransformer('all-mpnet-base-v2')
umap_model = UMAP(n_neighbors=15)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=1,

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
stopwords = list(stopwords.words('english')) + ['http', 'https', 'amp', 'com','agile','agility']
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stopwords)

model1 = BERTopic(
topics, probs = model1.fit_transform(data)


This question is not yet answered, be the first one who answer using the comment. Later the confirmed answer will be published as the solution.

This Question and Answer are collected from stackoverflow and tested by JTuto community, is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.

people found this article helpful. What about you?