[SOLVED] How to get all documents per topic in BERTopic modeling

Issue

This Content is from Stack Overflow. Question asked by JJD

I ran BERTopic to get topics for 3,500 documents. How can I get the topic-probability matrix for the documents and export it to CSV? When exporting, I also want to include the identifier of each document.

I tried two approaches. First, I found that topic_model.visualize_distribution(probs[#]) shows the information I want. But how can I export the topic-probability data for each document to CSV?

Second, I found a thread (How to get all docoments per topic in bertopic modeling) that could be useful if I could add a column of probabilities to the data frame it generates. Is there any way to do that?

Please share any other approaches that can produce and export the topic-probability matrix for all documents.

For your information, this is my BERTopic code. Thank you!

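from bertopic import BERTopic
from hdbscan import HDBSCAN
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# Embedding, dimensionality-reduction, and clustering components
embedding_model = SentenceTransformer('all-mpnet-base-v2')
umap_model = UMAP(n_neighbors=15)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=1,
                        gen_min_span_tree=True,
                        prediction_data=True)

# English stop words plus corpus-specific terms to exclude from topic labels
# (renamed so the list no longer shadows the imported nltk `stopwords` module)
stop_words = stopwords.words('english') + ['http', 'https', 'amp', 'com', 'agile', 'agility']
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stop_words)

# calculate_probabilities=True makes fit_transform return topic probabilities per document
model1 = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    language='english',
    calculate_probabilities=True,
    verbose=True
)
# `data` is the list of document strings
topics, probs = model1.fit_transform(data)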



Solution

I am not an expert, so there are probably more elegant solutions, but I can share what worked for me (as there are no answers yet):

"topics, probs = topic_model.fit_transform(docs_test)" returns the topic assigned to each document (topics), in the same order as the input documents.

Therefore, you can combine this output with the documents, for example in a pandas DataFrame:

import pandas as pd
df = pd.DataFrame({'topic': topics, 'document': docs_test})

Now you can filter this DataFrame by topic to find the documents that belong to it.

topic_0 = df[df.topic == 0]
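
The same idea can be extended to the probabilities asked about in the question. The following is only a minimal sketch, not a definitive implementation. It assumes that the model was fitted with calculate_probabilities=True, so that probs is a 2-D array with one row per document and one column per topic, with column i corresponding to topic i and the outlier topic -1 having no column of its own. The helper name build_topic_prob_frame, the doc_ids list, and the column names are hypothetical and chosen for illustration. Small stand-in values replace the real model output so the example runs without fitting a model.

```python
import numpy as np
import pandas as pd


def build_topic_prob_frame(doc_ids, topics, probs):
    """Return one row per document: its id, assigned topic, and one probability column per topic."""
    probs = np.asarray(probs)
    if probs.ndim != 2:
        # A 1-D array here usually means per-topic probabilities were not computed
        raise ValueError("expected a 2-D (n_docs, n_topics) probability array")
    # Hypothetical column names; assumes column i holds the probability of topic i
    prob_cols = [f"topic_{i}_prob" for i in range(probs.shape[1])]
    frame = pd.DataFrame(probs, columns=prob_cols)
    # Put the identifier and the assigned topic in front of the probability columns
    frame.insert(0, "topic", list(topics))
    frame.insert(0, "doc_id", list(doc_ids))
    return frame


# Stand-in values (assumptions) in place of real fit_transform output
doc_ids = ["a1", "a2", "a3"]
topics = [0, 1, -1]
probs = np.array([[0.8, 0.1], [0.2, 0.7], [0.3, 0.3]])

result = build_topic_prob_frame(doc_ids, topics, probs)

# to_csv returns the CSV text when no path is given;
# pass a file name such as "topic_probabilities.csv" to write a file instead
csv_text = result.to_csv(index=False)
print(csv_text)
```

With the code from the question, topics and probs would come from model1.fit_transform(data), and doc_ids would be your own list of identifiers in the same order as data.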


This question was asked on Stack Overflow by Kaleem and answered by LM-me. It is licensed under the terms of CC BY-SA 2.5 - CC BY-SA 3.0 - CC BY-SA 4.0.
