Clustering

Aaron Meyer (based on slides from David Sontag)

Clustering

  • Unsupervised learning
  • Requires data, but no labels
  • Detect patterns e.g. in
    • Gene expression between patient samples
    • Images
    • Really any sets of measurements across samples
  • Useful when you don’t know what you’re looking for
  • But: can be gibberish

Clustering

  • Basic idea: group together similar instances
  • Example: 2D point patterns
Groups of points, circled to be placed within two clusters.

Clustering is subjective!

Applications

  • Clustering patients into groups based on molecular or etiological measurements
  • Cells into groups based on molecular measurements
  • Neuronal signals

Title slide or header image for Breast Cancer clusters.

Heatmap or plot showing clustering of breast cancer patients based on molecular measurements.

What does “similar” mean?

What does “similar” mean?

  • One option: Euclidean distance

\[ \textrm{dist}(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert \]

  • Clustering results are completely dependent on the measure of similarity (or distance) between observations to be clustered
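As a minimal sketch (the values are made up for illustration), the Euclidean distance between two measurement vectors can be computed directly with numpy:

import numpy as np

# Two hypothetical measurement vectors (e.g., three gene measurements in two samples)
x = np.array([2.0, 0.5, 1.3])
y = np.array([1.0, 0.7, 2.1])

# Euclidean distance is the norm of the difference vector
dist = np.linalg.norm(x - y)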

What properties should a distance measure have?

Symmetric

  • \(D(A, B) = D(B, A)\)
  • Otherwise we can say A looks like B but not vice versa

Positivity and self-similarity

  • \(D(A,B) \geq 0\), with \(D(A,B)=0\) if and only if \(A=B\)
  • Otherwise there will be objects that we can’t tell apart

Triangle inequality

  • \(D(A, B) + D(B, C) \geq D(A,C)\)
  • Otherwise one could say “A is like B and B is like C, but A is not like C at all”

Clustering algorithms

  • Partition algorithms (flat)
    • K-means
    • Gaussian mixtures
  • Hierarchical algorithms
    • Bottom up - agglomerative
    • Top down - divisive

Characters from the Simpsons, clustered according to partitioning or agglomerative clustering.

K-means, an iterative clustering algorithm

  • Initialize: Pick K random points as cluster centers
  • Alternate until no assignments change:
    1. Assign data points to the closest cluster center
    2. Change the cluster center to the average of its assigned points

Illustration of K-means initialization: random cluster centers are picked.

K-means, an iterative clustering algorithm

  • Initialize: Pick K random points as cluster centers
  • Alternate until no assignments change:
    1. Assign data points to the closest cluster center
    2. Change the cluster center to the average of its assigned points

Illustration of K-means step: data points assigned to closest centers.
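A minimal numpy sketch of these two alternating steps (the function and variable names are illustrative, not from any library):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick K random data points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the average of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # assignments no longer change
        centers = new_centers
    return labels, centers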

K-means clustering: Example

  • Pick K random points as cluster centers (means)
    • Shown here for K=2
Step 1: K random points are chosen as initial cluster centers.

K-means clustering: Example

Iterative Step 1

Assign data points to the closest cluster center

Step 2: Data points are assigned to the nearest cluster center.

K-means clustering: Example

Iterative Step 2

Change the cluster center to the average of the assigned points

Step 3: Cluster centers are updated to the average of assigned points.

K-means clustering: Example

Repeat until convergence

Step 4: Repeat assignment of points to new centers.

K-means clustering: Example

Step 5: Centers update again.

K-means clustering: Example

Final step: Convergence, assignments do not change.

An example

Animation showing the iterative process of K-means clustering, where centroids move and partition the data until convergence.

https://commons.wikimedia.org/wiki/User:Chire

Properties of K-means algorithm

  • Guaranteed to converge in a finite number of iterations
  • Running time per iteration:
    1. Assign data points to the closest cluster center:
      • O(KN) time
    2. Change the cluster center to the average of its assigned points:
      • O(N) time

K-Means convergence

Objective: \(\min_{\mu}\min_{C}\sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2\)

  1. Fix \(\mu\), optimize \(C\): \[\min_C \sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2 = \min_C \sum_{i=1}^n \lVert x_i - \mu_{x_i} \rVert^2\]
  2. Fix \(C\), optimize \(\mu\): \[\min_{\mu}\sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2\]
    • Taking the partial derivative with respect to \(\mu_i\) and setting it to zero gives \[\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x\]

K-means takes an alternating optimization approach. Each step can only decrease (or leave unchanged) the objective, so the algorithm is guaranteed to converge.
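A small numpy check of this argument on made-up data: starting from an arbitrary assignment, each of the two updates leaves the objective no larger than before.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]
labels = rng.integers(k, size=len(X))

def objective(X, labels, centers):
    # Sum of squared distances from each point to its assigned center
    return np.sum((X - centers[labels]) ** 2)

for _ in range(5):
    j0 = objective(X, labels, centers)
    # Fix mu, optimize C: reassign each point to its nearest center
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    j1 = objective(X, labels, centers)
    # Fix C, optimize mu: set each center to the mean of its cluster
    centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    j2 = objective(X, labels, centers)
    assert j2 <= j1 <= j0  # the objective never increases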

K-means is (somewhat) sensitive to initialization

  • Various heuristic schemes exist for preventing problematic results.
  • None of them are perfect.

Comparison of two different K-means results on the same data, showing how different initializations can lead to different clusterings.
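A hedged sklearn sketch of this sensitivity on synthetic data: single random initializations can land in different local optima (compare the final objective, inertia_, across seeds), and the usual mitigations are k-means++ seeding plus multiple restarts via n_init.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

# One random initialization per seed; the final objective may differ between runs
for seed in range(3):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)

# k-means++ seeding with several restarts (sklearn's default behavior) keeps the best run
best = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print("best of 10:", best.inertia_)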

K-Means Getting Stuck

The local optimum reached depends on how the problem was specified (for example, the initialization):

Example of K-means getting stuck in a local optimum, where the resulting clusters are suboptimal compared to the global optimum.

K-means not able to properly cluster

K-means failing to cluster concentric circles, as it separates them linearly.

Changing the features (distance function) can help

Successful clustering of the concentric circles after transforming the features (e.g., using polar coordinates or kernel method).
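A minimal sklearn sketch of both situations on synthetic concentric circles: K-means on the raw coordinates splits the rings with a straight boundary, while re-expressing each point by its radius makes the two rings separable.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Two concentric rings
X, y_true = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# K-means on the raw (x, y) features: cuts the rings in half
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# K-means on a transformed feature (distance from the origin): recovers the rings
radius = np.linalg.norm(X, axis=1).reshape(-1, 1)
labels_radius = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(radius)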

Agglomerative Clustering

Agglomerative Clustering

  • Agglomerative clustering:
    • First merge very similar instances
    • Incrementally build larger clusters out of smaller clusters
  • Algorithm:
    • Maintain a set of clusters
    • Initially, each instance in its own cluster
    • Repeat:
      • Pick the two closest clusters
      • Merge them into a new cluster
      • Stop when there’s only one cluster left
  • Produces not one clustering, but a family of clusterings represented by a dendrogram

Schematic of agglomerative clustering: initially each point is a cluster, then closest pairs are merged.
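A short scipy sketch of this procedure on made-up data: linkage() performs the repeated merging, and the resulting merge history can be drawn as a dendrogram or cut into a flat clustering.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Repeatedly merge the two closest clusters (here, by average pairwise distance)
Z = linkage(X, method="average")

# The full merge history is a family of clusterings, visualized as a dendrogram
dendrogram(Z)
plt.show()

# Cutting the tree at a chosen number of clusters gives one flat clustering
labels = fcluster(Z, t=2, criterion="maxclust")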

Agglomerative Clustering

How should we define “closest” for clusters with multiple elements?

Illustration of measuring distance between two clusters with multiple points.

Agglomerative Clustering

  • How should we define “closest” for clusters with multiple elements?
  • Many options:
    • Closest pair (single-link clustering)
    • Farthest pair (complete-link clustering)
    • Average of all pairs
  • Different choices create different clustering behaviors
Comparison of single-link, complete-link, and average-link distance measures between clusters.
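These three definitions can be written out directly; a small numpy/scipy sketch on two toy clusters:

import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters of 2D points (toy data)
A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0]])

D = cdist(A, B)        # all pairwise distances between the two clusters
single = D.min()       # closest pair (single-link)
complete = D.max()     # farthest pair (complete-link)
average = D.mean()     # average of all pairs (average-link)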

Agglomerative Clustering

  • How should we define “closest” for clusters with multiple elements?
  • Many options:
    • Closest pair (left)
    • Farthest pair (right)
    • Average of all pairs
  • Different choices create different clustering behaviors
Visual comparison of clustering results using single-link (closest pair) versus complete-link (farthest pair) criteria.

Clustering Behavior

Dendrogram showing the hierarchical clustering of mouse tumor data.

Mouse tumor data from Hastie et al.

Agglomerative Clustering Questions

  • Will agglomerative clustering converge?
    • To a global optimum?
  • Will it always find the true patterns in the data?
  • Do people ever use it?
  • How many clusters to pick?

Reconsidering “hard assignments”?

  • Clusters may overlap
  • Some clusters may be “wider” than others
  • Distances can be deceiving!
Diagram showing overlapping probability distributions, illustrating that clusters can overlap and hard assignments might be misleading.
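Gaussian mixtures, mentioned earlier as a flat clustering method, address this by replacing hard assignments with per-cluster probabilities. A minimal sklearn sketch on synthetic overlapping clusters of different widths:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two overlapping blobs, one wider than the other
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [2.5, 0]],
                  cluster_std=[0.8, 1.6], random_state=0)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignments: each row holds the probability of belonging to each cluster,
# so points in the overlap region are not forced into a single cluster
probs = gm.predict_proba(X)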

Review

Review figures recapping the preceding clustering slides.

Implementation

K-means

sklearn.cluster.KMeans

  • n_clusters: Number of clusters to form
  • init: Initialization method
  • n_init: Number of initializations
  • max_iter: Maximum steps algorithm can take

Agglomerative

sklearn.cluster.AgglomerativeClustering

  • n_clusters: Number of clusters to return
  • metric: Distance metric to use
  • linkage: Linkage criterion to use
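A brief usage sketch of these parameters on the iris data, mirroring the K-means example below; note that in older scikit-learn versions the metric parameter was named affinity.

from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

X, y = datasets.load_iris(return_X_y=True)

# Three clusters, Euclidean distances, average linkage
agg = AgglomerativeClustering(n_clusters=3, metric="euclidean", linkage="average")
labels = agg.fit_predict(X)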

Fisher’s iris dataset

Photograph of Iris flowers (setosa, versicolor, virginica) used in the dataset.

https://en.wikipedia.org/wiki/Iris_flower_data_set

K-means implementation

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Fit K-means with three clusters and keep the cluster assignments
est = KMeans(n_clusters=3)
est.fit(X)
labels = est.labels_

# Project onto the first two principal components for plotting
X_pca = PCA(n_components=2).fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)

PCA plot of the Iris dataset showing the results of K-means clustering. The points are colored by their assigned cluster (teal, purple, yellow) plotted against the first two principal components.

PCA plot of the Iris dataset colored by the true species labels (setosa, versicolor, virginica). This allows comparison with the K-means clustering results.

Review

Further Reading

Review Questions

  1. What is clustering? What does it seek to accomplish?
  2. What is agglomerative clustering? What does the resulting output look like?
  3. What is the difference between single-link, complete-link, and average clustering?
  4. What are three properties that are required of a distance metric? What kind of things can be compared with a distance metric?
  5. What decisions does one have to make before performing k-means clustering?
  6. Describe the process of fitting that k-means performs.
  7. You and your colleague each run k-means with the same settings, and arrive at different results. How could this happen?
  8. What are two cases in which k-means clustering can fail or provide bad results?
  9. How do you determine that your clustering is “good”?