Slides adapted from David Sontag.
\[ \textrm{dist}(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert \]
Assign data points to closest cluster center
Change the cluster center to the average of the assigned points
Repeat until convergence
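The steps above (Lloyd's algorithm) can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: initialization here just picks k random data points, and an empty cluster simply keeps its old center.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):          # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                     # centers stopped moving: converged
        centers = new_centers
    return centers, labels
```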
Objective: \(\min_{\mu}\min_{C}\sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2\)
K-means takes an alternating optimization approach: the assignment step and the update step can each only decrease (or leave unchanged) the objective, so the algorithm is guaranteed to converge.
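The monotone decrease can be checked numerically. The sketch below runs the two alternating steps on made-up toy data (two well-separated blobs, with both starting centers deliberately placed inside the first blob) and records the objective after each assignment step:

```python
import numpy as np

# Toy data: two well-separated blobs; a deliberately poor initialization
# puts both starting centers inside the first blob.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])
centers = X[[0, 1]].copy()

objectives = []
for _ in range(10):
    # Assignment step: closest center for each point
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Record the objective: sum of squared distances to assigned centers
    objectives.append((d[np.arange(len(X)), labels] ** 2).sum())
    # Update step: each center moves to the mean of its points
    centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])

print(objectives)  # a non-increasing sequence
```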
The k-means algorithm is a heuristic:
It converges only to a local optimum, which depends on how the problem was set up (e.g., the initialization of the cluster centers):
How should we define “closest” for clusters with multiple elements?
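This choice is the linkage criterion in hierarchical clustering. A small sketch on made-up data, comparing the linkage options scikit-learn provides:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two tight pairs plus a point between them; the linkage
# criterion decides how "closest" is measured between whole clusters.
X = np.array([[0.0, 0.0], [0.3, 0.0],
              [5.0, 0.0], [5.3, 0.0],
              [2.4, 0.0]])

for linkage in ("single", "complete", "average"):
    # single: min pairwise distance, complete: max, average: mean
    labels = AgglomerativeClustering(n_clusters=2,
                                     linkage=linkage).fit_predict(X)
    print(linkage, labels)
```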
Mouse tumor data from Hastie et al.
sklearn.cluster.KMeans
n_clusters: Number of clusters to form
init: Initialization method
n_init: Number of runs with different centroid initializations
max_iter: Maximum number of iterations per run
sklearn.cluster.AgglomerativeClustering
import numpy as np
from matplotlib.pyplot import figure
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
est = KMeans(n_clusters=3)
fig = figure()
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)
est.fit(X)
labels = est.labels_
ax.scatter(X[:, 3], X[:, 0], X[:, 2],
           c=labels.astype(float), edgecolor='k')
# ...