Clustering Text Documents using K-Means in Scikit Learn

Clustering text documents is a common task in natural language processing (NLP): documents are grouped by the similarity of their content, without any predefined labels. K-means is a widely used algorithm for this task. This article shows how to cluster text documents with k-means in scikit-learn.

K-means clustering algorithm

K-means is a widely used unsupervised learning algorithm that partitions data points into k groups based on similarity. It works by iteratively assigning each data point to its nearest cluster centroid and then recalculating each centroid as the mean of the points assigned to it, repeating until the assignments stop changing or an iteration limit is reached.
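
To make the assign-then-recompute loop concrete, here is a minimal NumPy sketch. It is an illustration only and not how scikit-learn implements `KMeans`; the function name `kmeans_sketch`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Illustrative k-means loop: assign to nearest centroid, then recompute."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move, so the loop has converged
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups in 2-D (made-up data).
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

The cluster ids themselves are arbitrary: which group ends up as 0 and which as 1 depends on the random starting centroids.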

Preprocessing

Preprocessing covers the steps that prepare raw data for analysis or machine learning. For text, it typically involves cleaning and reformatting the raw strings and then vectorizing them, that is, converting them into numeric features suitable for modeling.

Steps

  1. Load or prepare the dataset [dataset link: https://github.com/PawanKrGunjan/Natural-Language-Processing/blob/main/Sarcasm%20Detection/sarcasm.json]
  2. Preprocess the text, if it is loaded from a file rather than written directly in the code
  3. Vectorize the text using TfidfVectorizer
  4. Reduce the dimensionality using PCA
  5. Cluster the documents
  6. Plot the clusters using matplotlib

Python3

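import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset (sarcasm.json must be in the working directory)
df = pd.read_json('sarcasm.json')

# Use the headline text as the documents to cluster
sentence = df.headline

# Convert the headlines into TF-IDF vectors, dropping English stop words
vectorizer = TfidfVectorizer(stop_words='english')
vectorized_documents = vectorizer.fit_transform(sentence)

# Project the TF-IDF vectors to 2 dimensions, for plotting only
# (PCA needs a dense array, which can use a lot of memory)
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(vectorized_documents.toarray())

# Cluster the full TF-IDF vectors (not the 2-D projection)
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, n_init=5,
                max_iter=500, random_state=42)
kmeans.fit(vectorized_documents)

# Pair each headline with its cluster label
results = pd.DataFrame()
results['document'] = sentence
results['cluster'] = kmeans.labels_

# Show a random sample of 5 rows
print(results.sample(5))

# Plot each cluster in its own color. K-means ids are arbitrary, so the
# names below are not guaranteed to match the dataset's sarcasm labels.
colors = ['red', 'green']
cluster = ['Not Sarcastic', 'Sarcastic']
for i in range(num_clusters):
    plt.scatter(reduced_data[kmeans.labels_ == i, 0],
                reduced_data[kmeans.labels_ == i, 1],
                s=10, color=colors[i],
                label=cluster[i])
plt.legend()
plt.show()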

Output:

                                                document  cluster
16263  study finds majority of u.s. currency has touc...        0
5318   an open and personal email to hillary clinton ...        0
12994        it's not just a muslim ban, it's much worse        0
5395   princeton students confront university preside...        0
24591     why getting married may help people drink less        0

Text clustering using KMeans
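
The PCA step in the code converts the sparse TF-IDF matrix to a dense array with `.toarray()`, which can be memory-hungry for a large vocabulary. A possible alternative, not used in the original code, is `TruncatedSVD`, which accepts sparse input directly. It does not center the data, so the projection is not identical to PCA. The sample documents below are made up.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; in practice this would be the headline column.
sample_docs = [
    "stock market rises on strong earnings",
    "investors cheer stock market rally",
    "team wins football championship game",
    "football fans celebrate game victory",
]
sample_matrix = TfidfVectorizer(stop_words="english").fit_transform(sample_docs)

# TruncatedSVD works on the sparse matrix, so no .toarray() call is needed.
svd = TruncatedSVD(n_components=2, random_state=42)
sample_2d = svd.fit_transform(sample_matrix)
print(sample_2d.shape)
```

The resulting two columns can be passed to `plt.scatter` in the same way as `reduced_data` in the main example.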

Last Updated: 09 Jun, 2023
