Document Clustering: An Unsupervised Approach to Categorizing Textual Data
Document clustering is a powerful technique for grouping documents into distinct categories based on their textual content and semantic similarity. This unsupervised learning method is particularly valuable in fields such as information retrieval and search engines, where the goal is to organize vast amounts of unlabelled data efficiently. In this article, we will explore the process of document clustering using the K-Means algorithm, walk through the necessary preprocessing steps, and visualize the results.
Getting Started with Document Clustering
To group documents based on their content, we will employ the K-Means algorithm. Given that our dataset lacks labels, this is a classic unsupervised learning problem, making K-Means a suitable choice. While other approaches, such as Gaussian Mixture Models or deep learning methods like autoencoders, could also be used, we will focus on K-Means for its simplicity and effectiveness.
For our implementation, we will use Python within a Jupyter Notebook environment, allowing us to combine code, results, and documentation seamlessly. The development will take place in an Anaconda environment, utilizing several key libraries:
- Pandas: For data handling
- Scikit-learn (Sklearn): For machine learning and preprocessing
- Matplotlib: For plotting
- NLTK: For natural language processing algorithms
- BeautifulSoup: To parse text from XML files and clean the data
Parsing the Data
The first step in our clustering process is to parse the data. We will use the xml.etree.ElementTree library to extract the relevant information from an XML file. Specifically, we will focus on the title and description of each document, as these elements are crucial for understanding the semantic content.
import xml.etree.ElementTree as ET
import pandas as pd
from bs4 import BeautifulSoup

def parseXML(xmlfile):
    tree = ET.parse(xmlfile)
    root = tree.getroot()
    titles = []
    descriptions = []
    for item in root.findall('./channel/item'):
        for child in item:
            if child.tag == 'title':
                titles.append(child.text)
            if child.tag == 'description':
                # strip any embedded HTML and normalize whitespace in the description
                soup = BeautifulSoup(str(child.text).encode('utf8', 'ignore'), "lxml")
                strtext = soup.text.replace(u'\xa0', u' ').replace('\n', ' ')
                descriptions.append(strtext)
    return titles, descriptions
bef_titles, bef_descriptions = parseXML('data.source.rss-feeds.xml')
After parsing, we filter out items with very short descriptions, as they can negatively impact the clustering results.
titles = []
descriptions = []
for i in range(len(bef_titles)):
    # keep only items whose descriptions are reasonably long (more than 500 characters)
    if len(bef_descriptions[i]) > 500:
        titles.append(bef_titles[i])
        descriptions.append(bef_descriptions[i])
Tokenizing and Stemming
Next, we need to preprocess the text data by tokenizing it into words and removing morphological affixes. This step helps in reducing the dimensionality of our dataset and focuses on the root forms of words.
import nltk
from nltk.stem.snowball import SnowballStemmer
import re

# nltk.download('punkt')  # required once to download the sentence/word tokenizer models

stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # split the text into sentences, then into word tokens
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # drop tokens that contain no letters (pure numbers and punctuation)
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    # reduce each remaining token to its stem
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems
totalvocab_stemmed = []
for i in descriptions:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)

vocab_frame = pd.DataFrame({'words': totalvocab_stemmed})
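As a quick illustration of what tokenize_and_stem produces (the sentence below is made up and not taken from the dataset), inflected forms of the same word collapse onto a single stem:
sample = "Clustering algorithms cluster documents into clustered groups."
print(tokenize_and_stem(sample))
# expected output, roughly: ['cluster', 'algorithm', 'cluster', 'document', 'into', 'cluster', 'group']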
Vectorizing the Data
Before we can apply the K-Means algorithm, we must convert our text data into a numerical format. The Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer is a popular choice for this task, as it reflects the importance of a word in a document relative to the entire corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

# ignore terms that appear in more than 80% or fewer than 20% of the documents,
# and build features from unigrams, bigrams and trigrams of the stemmed tokens
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)
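A quick shape check (purely for verification) shows how many documents and how many terms survived the frequency thresholds:
print(tfidf_matrix.shape)  # (number of documents, number of terms/n-grams kept)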
Applying K-Means Clustering
Now that we have our TF-IDF matrix, we can proceed with clustering the documents using the K-Means algorithm. We will create five clusters and analyze the results.
from sklearn.cluster import KMeans
num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
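Five clusters is a starting point rather than a rule. One common way to sanity-check the choice of k, sketched here only as an aside and not part of the original pipeline, is the elbow method: fit K-Means for a range of k values and plot the inertia (within-cluster sum of squares), looking for the point where improvements level off.
import matplotlib.pyplot as plt

inertias = []
k_values = range(2, 11)
for k in k_values:
    # a fixed random_state keeps the runs comparable
    inertias.append(KMeans(n_clusters=k, random_state=1).fit(tfidf_matrix).inertia_)

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()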
To inspect the cluster assignments, we can create a DataFrame that pairs each title with its cluster label.
items = {'title': titles, 'description': descriptions, 'cluster': clusters}
frame = pd.DataFrame(items, index=[clusters], columns=['title', 'cluster'])
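To get a sense of what each cluster is actually about, one option (a short sketch, reusing the km and tfidf_vectorizer objects from above) is to rank the vocabulary by its weight in each cluster centroid and print the top terms:
terms = tfidf_vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn versions before 1.0
order_centroids = km.cluster_centers_.argsort()[:, ::-1]  # term indices sorted by centroid weight, per cluster
for i in range(num_clusters):
    top_terms = [terms[ind] for ind in order_centroids[i, :10]]
    print('Cluster %d: %s' % (i, ', '.join(top_terms)))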
Visualization of Clusters
To better understand the clustering results, we can visualize the data using t-SNE (t-Distributed Stochastic Neighbor Embedding), which helps reduce the dimensionality of our data for plotting.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# note: recent scikit-learn versions rename the n_iter argument to max_iter
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
pos = tsne.fit_transform(tfidf_matrix.toarray())
xs, ys = pos[:, 0], pos[:, 1]

df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
groups = df.groupby('label')

fig, ax = plt.subplots(figsize=(16, 8))
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, label=name)
ax.legend(numpoints=1)
plt.show()
The resulting plot provides a visual representation of how well the documents have been clustered. While some overlap may occur, the clusters are generally well-defined.
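For larger corpora, running t-SNE directly on a dense TF-IDF matrix can become slow and memory-hungry. A common workaround, sketched here as an optional refinement rather than part of the pipeline above, is to reduce the matrix with TruncatedSVD first and feed the lower-dimensional representation to t-SNE:
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

svd = TruncatedSVD(n_components=50, random_state=1)  # works directly on the sparse TF-IDF matrix
reduced = svd.fit_transform(tfidf_matrix)
pos = TSNE(n_components=2, perplexity=40).fit_transform(reduced)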
Conclusion and Future Work
In this article, we explored the process of document clustering using the K-Means algorithm. We discussed the importance of data parsing, tokenization, stemming, and vectorization in preparing the data for clustering. While our initial results are promising, there is still room for improvement. Future work could involve optimizing the parameters of the TF-IDF vectorizer, experimenting with different clustering algorithms, or employing advanced techniques like doc2vec or HDBSCAN.
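As a pointer for that future work, a minimal sketch of swapping in HDBSCAN, assuming the standalone hdbscan package is installed (scikit-learn 1.3+ also ships its own implementation), might look like this:
import hdbscan

# HDBSCAN chooses the number of clusters itself and marks outliers with the label -1
hdb = hdbscan.HDBSCAN(min_cluster_size=5, metric='euclidean')
hdb_labels = hdb.fit_predict(tfidf_matrix.toarray())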
By continuing to refine our approach, we can enhance the accuracy and effectiveness of document clustering, paving the way for more sophisticated applications in information retrieval and data analysis.
For those interested in diving deeper into the world of deep learning and machine learning operations, consider exploring resources that cover building, training, deploying, and maintaining deep learning models.