Issue
This content is from Stack Overflow; the question was asked by Anas Jamshed.
I have a file named "spam.csv" that contains a collection of emails classified as "spam" or "ham" (not spam).
[Screenshot: top 5 rows of the data]
The label (target) associated with each message is in the first column, and the actual email message is in the second column (the other columns are not used).
I want to cluster all the training data into two categories using K-Means clustering. The feature vectors to use for the clustering operation are TF-IDF vectors. After that, I need to:
Display the top 25 tokens from both clusters by way of the following process:
• Store the maximum weight of each token in the TF-IDF vector of the documents labelled ‘ham’
• Store the maximum weight of each token in the TF-IDF vector of the documents labelled ‘spam’
• Create a list of pairings of tokens and weights (ordered by weight in decreasing order)
• Use a for loop to display the top 25 tokens for each class (it doesn't have to be in two columns; a code sketch of this step follows the example output below). For example:
like: 1.0
good: 1.0
get: 1.0
possible: 1.0
part: 1.0
time: 1.0
send: 1.0
anything: 1.0
said: 1.0
thinking: 1.0
remember: 1.0
going: 1.0
many: 1.0
ok: 1.0
also: 1.0
check: 1.0
watch: 1.0
guess: 1.0
many: 1.0
call: 1.0
give: 1.0
able: 1.0
thank: 1.0
want: 1.0
sent: 1.0
ok: 1.0
talk: 1.0
yeah: 1.0
goin: 1.0
okie: 1.0
easy: 1.0
account: 1.0
hi: 1.0
dinner: 1.0
oops: 1.0
came: 1.0
guy: 1.0
bf: 1.0
pocked: 1.0
bold: 1.0
ryder: 1.0
increase: 1.0
face: 0.9664415695501261
leave: 0.9176210873329959
hi: 0.9040872022980894
sac: 0.8852067555763004
pizza: 0.869246093418147
thinkin: 0.8570998869933819
awake: 0.857007861578994
knw: 0.837215393727471
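For reference, here is a minimal sketch of this max-weight process (untested; it assumes a fitted TfidfVectorizer named vectorizer, its matrix tfidf_matrix, a pandas Series of class labels, and scikit-learn 1.0+ for get_feature_names_out):
import numpy as np
# Hypothetical helper: print the top `n` tokens for one class, ranked by the
# maximum TF-IDF weight each token reaches in that class's documents
def top_tokens(tfidf_matrix, labels, vectorizer, target, n=25):
    tokens = vectorizer.get_feature_names_out()
    rows = (labels == target).to_numpy()   # boolean mask for this class
    class_matrix = tfidf_matrix[rows]      # keep only this class's rows
    # column-wise maximum over the sparse matrix -> one weight per token
    max_weights = np.asarray(class_matrix.max(axis=0).todense()).ravel()
    # pair tokens with weights, ordered by weight in decreasing order
    pairs = sorted(zip(tokens, max_weights), key=lambda p: p[1], reverse=True)
    for token, weight in pairs[:n]:
        print(f'{token}: {weight}')
# Example usage (the column name 'label' is an assumption about this spam.csv):
# top_tokens(tfidf_matrix, messages['label'], vectorizer, 'ham')
# top_tokens(tfidf_matrix, messages['label'], vectorizer, 'spam')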
I am trying this code:
#Importing libraries
import pandas as pd
messages = pd.read_csv(r'spam.csv', sep=',', encoding='ISO-8859-1')
#Show first 5 rows
messages.head()
#Import re and NLP packages
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('stopwords')  #make sure the stop word list is available
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  #build the set once, outside the loop
#Create empty list
corpus = []
#Remove stop words and keep only alphabetic tokens
#(assumes the text column has been renamed to 'message')
for i in range(len(messages)):
    #Replace non-alphabetic characters with a space; substituting '' also
    #deletes the spaces, which fuses each whole message into a single token
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    #Re-join with spaces; joining with '' instead merges the stemmed words
    #back into one token per message, which is what yields the single 1.0
    #weight per row shown in the output below
    review = ' '.join(review)
    corpus.append(review)
print(corpus)
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
# Print the sparse TF-IDF matrix as (row, column index)  weight triples
print(tfidf_matrix)
and the output looks like:
(0, 1109) 1.0
(1, 3190) 1.0
(2, 922) 1.0
(3, 4188) 1.0
(4, 2799) 1.0
(5, 940) 1.0
(6, 854) 1.0
(7, 279) 1.0
(8, 4671) 1.0
(9, 1157) 1.0
(10, 2003) 1.0
(11, 3610) 1.0
(12, 4282) 1.0
(13, 2273) 1.0
(14, 1866) 1.0
(15, 4714) 1.0
(16, 3063) 1.0
(17, 825) 1.0
(18, 887) 1.0
(19, 834) 1.0
(20, 2150) 1.0
(21, 2002) 1.0
(22, 3674) 1.0
(23, 25) 1.0
(24, 879) 1.0
I want the top 25 tokens for each class, with their weights, as I described above. How can I do this? Can anyone help me?
Solution
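No confirmed answer has been posted yet. Until one is, here is a minimal, untested sketch of the missing K-Means step. It reuses vectorizer and tfidf_matrix from the question and the hypothetical top_tokens helper sketched above; note that K-Means cluster ids 0 and 1 are arbitrary and do not automatically line up with 'ham' and 'spam':
from sklearn.cluster import KMeans
# Cluster the TF-IDF vectors into two groups; random_state only makes the
# run repeatable, and n_init=10 pins the classic scikit-learn default
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(tfidf_matrix)
# Attach the cluster id to each message, then rank tokens inside each
# cluster with the same max-weight procedure described in the question
messages['cluster'] = cluster_ids
for cluster in (0, 1):
    print(f'Cluster {cluster}:')
    top_tokens(tfidf_matrix, messages['cluster'], vectorizer, cluster)
Passing messages['label'] with targets 'ham' and 'spam' instead of the cluster column would produce the per-class listings shown in the question.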
This question and answer were collected from Stack Overflow and tested by the JTuto community; the content is licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.