Glossary term
K-Means Clustering
An unsupervised learning method that partitions data into k clusters by minimizing within-cluster squared distance to centroids.
Definition
methodK-means clustering partitions observations into k clusters by assigning each observation to the nearest centroid and updating centroids to minimize within-cluster squared distance.
K-means is an unsupervised learning algorithm for grouping numerical data. It alternates between assignment and centroid-update steps. The objective is to minimize the sum of squared distances from each point to its assigned cluster centroid. It is widely used for exploratory data analysis, vector quantization, image compression, segmentation, anomaly screening, and feature engineering, but its results depend strongly on k, scaling, initialization, distance metric, and cluster geometry.
K-means clustering divides data into a chosen number of clusters, k. Each cluster is represented by a centroid, usually the mean of the points assigned to that cluster. The algorithm seeks to minimize the within-cluster sum of squared distances:
where C_i is a cluster and \mu_i is its centroid.
Algorithm
The standard Lloyd algorithm has two repeating steps. First, assign each point to the nearest centroid. Second, recompute each centroid as the mean of the points assigned to it. These steps repeat until assignments stop changing, the objective improves only slightly, or an iteration limit is reached.
The method is simple and fast, which makes it attractive for large datasets. It is also easy to explain: each point belongs to the nearest representative centre. This simplicity is why k-means appears in production analytics, preprocessing pipelines, image colour reduction, customer segmentation, fault grouping, document embeddings, and vector quantization.
Assumptions and limits
K-means works best when clusters are compact, roughly spherical, similarly sized, and separable under Euclidean distance. It performs poorly when clusters are elongated, overlapping, nonconvex, strongly imbalanced, or separated by density rather than distance. It is sensitive to feature scaling because a variable with large numerical range can dominate distance.
The objective is not globally convex in both assignments and centroids. Different initial centroids can lead to different local minima. Methods such as k-means++ initialization, multiple restarts, and validation metrics reduce but do not remove this sensitivity.
Choosing k
The number of clusters is an engineering choice, not something k-means discovers automatically. Elbow plots, silhouette scores, gap statistics, cross-validation against downstream performance, or domain constraints can help. In industrial settings, interpretability and actionability often matter more than the mathematically smallest within-cluster distance.
Common mistakes
A common mistake is applying k-means to unscaled features or mixed categorical and numerical data without a suitable representation. Another is interpreting clusters as real physical classes without validation. K-means always returns clusters if asked, even when the data has no meaningful cluster structure. Good use reports preprocessing, distance metric, initialization method, number of restarts, chosen k, validation metric, and whether clusters are stable under perturbation.