Document Clustering: An Unsupervised Approach to Categorizing Textual Data
Document clustering is a powerful technique for grouping documents into distinct categories based on their textual content and semantic similarity. This unsupervised learning method is particularly valuable in fields such as information retrieval and search engines, where the goal is to organize vast amounts of unlabelled data efficiently. In this article, we will explore the process of document clustering using the K-Means algorithm, walk through the necessary preprocessing steps, and visualize the results.
Getting Started with Document Clustering
To group documents based on their content, we will employ the K-Means algorithm. Given that our dataset lacks labels, this is a classic unsupervised learning problem, making K-Means a suitable choice. While other approaches, such as Gaussian Mixture Models or deep learning methods like autoencoders, could also be used, we will focus on K-Means for its simplicity and effectiveness.
For our implementation, we will use Python within a Jupyter Notebook environment, allowing us to combine code, results, and documentation seamlessly. The development will take place in an Anaconda environment, utilizing several key libraries:
- Pandas: For data handling
- Scikit-learn (Sklearn): For machine learning and preprocessing
- Matplotlib: For plotting
- NLTK: For natural language processing algorithms
- BeautifulSoup: To parse text from XML files and clean the data
Parsing the Data
The first step in our clustering process is to parse the data. We will use the xml.etree.ElementTree library to extract the relevant information from an XML file. Specifically, we will focus on the title and description of each document, as these elements are crucial for understanding the semantic content.
import xml.etree.ElementTree as ET
import pandas as pd
from bs4 import BeautifulSoup

def parseXML(xmlfile):
    tree = ET.parse(xmlfile)
    root = tree.getroot()
    titles = []
    descriptions = []
    for item in root.findall('./channel/item'):
        for child in item:
            if child.tag == 'title':
                titles.append(child.text)
            if child.tag == 'description':
                # strip any embedded HTML and normalize whitespace in the description
                soup = BeautifulSoup(str(child.text).encode('utf8', 'ignore'), "lxml")
                strtext = soup.text.replace(u'\xa0', u' ').replace('\n', ' ')
                descriptions.append(strtext)
    return titles, descriptions
bef_titles, bef_descriptions = parseXML('data.source.rss-feeds.xml')
After parsing, we filter out items with very short descriptions, as they can negatively impact the clustering results.
titles = []
descriptions = []
for i in range(len(bef_titles)):
    # keep only items whose descriptions are reasonably long (more than 500 characters)
    if len(bef_descriptions[i]) > 500:
        titles.append(bef_titles[i])
        descriptions.append(bef_descriptions[i])
Tokenizing and Stemming
Next, we need to preprocess the text data by tokenizing it into words and removing morphological affixes. This step helps in reducing the dimensionality of our dataset and focuses on the root forms of words.
import nltk
from nltk.stem.snowball import SnowballStemmer
import re

# nltk.download('punkt')  # required once to download the sentence/word tokenizer models

stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # split the text into sentences, then into word tokens
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # drop tokens that contain no letters (pure numbers and punctuation)
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    # reduce each remaining token to its stem
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems
totalvocab_stemmed = []
for i in descriptions:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)

vocab_frame = pd.DataFrame({'words': totalvocab_stemmed})
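As a quick illustration of what tokenize_and_stem produces (the sentence below is made up and not taken from the dataset), inflected forms of the same word collapse onto a single stem:
sample = "Clustering algorithms cluster documents into clustered groups."
print(tokenize_and_stem(sample))
# expected output, roughly: ['cluster', 'algorithm', 'cluster', 'document', 'into', 'cluster', 'group']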
Vectorizing the Data
Before we can apply the K-Means algorithm, we must convert our text data into a numerical format. The Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer is a popular choice for this task, as it reflects the importance of a word in a document relative to the entire corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

# ignore terms that appear in more than 80% or fewer than 20% of the documents,
# and build features from unigrams, bigrams and trigrams of the stemmed tokens
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)
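A quick shape check (purely for verification) shows how many documents and how many terms survived the frequency thresholds:
print(tfidf_matrix.shape)  # (number of documents, number of terms/n-grams kept)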
Applying K-Means Clustering
Now that we have our TF-IDF matrix, we can proceed with clustering the documents using the K-Means algorithm. We will create five clusters and analyze the results.
from sklearn.cluster import KMeans
num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
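Five clusters is a starting point rather than a rule. One common way to sanity-check the choice of k, sketched here only as an aside and not part of the original pipeline, is the elbow method: fit K-Means for a range of k values and plot the inertia (within-cluster sum of squares), looking for the point where improvements level off.
import matplotlib.pyplot as plt

inertias = []
k_values = range(2, 11)
for k in k_values:
    # a fixed random_state keeps the runs comparable
    inertias.append(KMeans(n_clusters=k, random_state=1).fit(tfidf_matrix).inertia_)

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()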
To inspect the cluster assignments, we can create a DataFrame that pairs each title with its cluster label.
items = {'title': titles, 'description': descriptions, 'cluster': clusters}
frame = pd.DataFrame(items, index=[clusters], columns=['title', 'cluster'])
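To get a sense of what each cluster is actually about, one option (a short sketch, reusing the km and tfidf_vectorizer objects from above) is to rank the vocabulary by its weight in each cluster centroid and print the top terms:
terms = tfidf_vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn versions before 1.0
order_centroids = km.cluster_centers_.argsort()[:, ::-1]  # term indices sorted by centroid weight, per cluster
for i in range(num_clusters):
    top_terms = [terms[ind] for ind in order_centroids[i, :10]]
    print('Cluster %d: %s' % (i, ', '.join(top_terms)))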
Visualization of Clusters
To better understand the clustering results, we can visualize the data using t-SNE (t-Distributed Stochastic Neighbor Embedding), which helps reduce the dimensionality of our data for plotting.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# note: recent scikit-learn versions rename the n_iter argument to max_iter
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
pos = tsne.fit_transform(tfidf_matrix.toarray())
xs, ys = pos[:, 0], pos[:, 1]

df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
groups = df.groupby('label')

fig, ax = plt.subplots(figsize=(16, 8))
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, label=name)
ax.legend(numpoints=1)
plt.show()
The resulting plot provides a visual representation of how well the documents have been clustered. While some overlap may occur, the clusters are generally well-defined.
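For larger corpora, running t-SNE directly on a dense TF-IDF matrix can become slow and memory-hungry. A common workaround, sketched here as an optional refinement rather than part of the pipeline above, is to reduce the matrix with TruncatedSVD first and feed the lower-dimensional representation to t-SNE:
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

svd = TruncatedSVD(n_components=50, random_state=1)  # works directly on the sparse TF-IDF matrix
reduced = svd.fit_transform(tfidf_matrix)
pos = TSNE(n_components=2, perplexity=40).fit_transform(reduced)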
Conclusion and Future Work
In this article, we explored the process of document clustering using the K-Means algorithm. We discussed the importance of data parsing, tokenization, stemming, and vectorization in preparing the data for clustering. While our initial results are promising, there is still room for improvement. Future work could involve optimizing the parameters of the TF-IDF vectorizer, experimenting with different clustering algorithms, or employing advanced techniques like doc2vec or HDBSCAN.
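As a pointer for that future work, a minimal sketch of swapping in HDBSCAN, assuming the standalone hdbscan package is installed (scikit-learn 1.3+ also ships its own implementation), might look like this:
import hdbscan

# HDBSCAN chooses the number of clusters itself and marks outliers with the label -1
hdb = hdbscan.HDBSCAN(min_cluster_size=5, metric='euclidean')
hdb_labels = hdb.fit_predict(tfidf_matrix.toarray())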
By continuing to refine our approach, we can enhance the accuracy and effectiveness of document clustering, paving the way for more sophisticated applications in information retrieval and data analysis.
For those interested in diving deeper into the world of deep learning and machine learning operations, consider exploring resources that cover building, training, deploying, and maintaining deep learning models.