Clustering

Aaron Meyer (based on slides from David Sontag)

Clustering

  • Unsupervised learning
  • Requires data, but no labels
  • Detect patterns e.g. in
    • Gene expression between patient samples
    • Images
    • Really any sets of measurements across samples
  • Useful when you don’t know what you’re looking for
  • But: the results can be gibberish

Clustering

  • Basic idea: group together similar instances
  • Example: 2D point patterns
Groups of points, circled to be placed within two clusters.

What could “similar” mean?

  • One option: Euclidean distance

\[ \textrm{dist}(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert = \sqrt{\sum_j (x_j - y_j)^2} \]

  • Clustering results are completely dependent on the measure of similarity (or distance) between “points” to be clustered

What properties should a distance measure have?

  • Symmetric
    • \(D(A, B) = D(B, A)\)
    • Otherwise we can say A looks like B but not vice versa
  • Positivity and self-similarity
    • \(D(A,B) \geq 0\), and \(D(A,B)=0\) iff \(A=B\)
    • Otherwise there will be objects that we can’t tell apart
  • Triangle inequality
    • \(D(A, B) + D(B, C) \geq D(A,C)\)
    • Otherwise one can say “A is like B, B is like C, but A is not like C at all”
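
As a sanity check, the sketch below verifies all three properties numerically for the Euclidean distance on random points (a minimal sketch; the helper name dist is mine):

import numpy as np

def dist(a, b):
    # Euclidean distance (the norm defined above)
    return np.linalg.norm(a - b)

rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(3, 5))  # three random 5-dimensional points

assert np.isclose(dist(A, B), dist(B, A))      # symmetric
assert dist(A, B) > 0 and dist(A, A) == 0      # positivity and self-similarity
assert dist(A, B) + dist(B, C) >= dist(A, C)   # triangle inequality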

Clustering algorithms

  • Partition algorithms (flat)
    • K-means
    • Gaussian mixtures
  • Hierarchical algorithms
    • Bottom up - agglomerative
    • Top down - divisive

Characters from the Simpsons, clustered according to partitioning or agglomerative clustering.

K-means, an iterative clustering algorithm

  • Initialize: Pick K random points as cluster centers
  • Alternate until no assignments change:
    1. Assign data points to the closest cluster center
    2. Change the cluster center to the average of its assigned points

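The two steps above take only a few lines of NumPy. This is a minimal sketch for illustration (the function name is mine, and empty clusters are not handled), not a production implementation:

import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick K random data points as the cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Step 1: assign each point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the average of its assigned points
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Converged once no center moves (so no assignment can change)
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers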

K-means clustering: Example

  • Pick K random points as cluster centers (means); shown here for K=2
  • Iterative Step 1: Assign data points to the closest cluster center
  • Iterative Step 2: Change the cluster center to the average of the assigned points
  • Repeat until convergence

Another Example

A second K-means run, shown step by step over seven iterations (figures).

Properties of K-means algorithm

  • Guaranteed to converge in a finite number of iterations
  • Running time per iteration:
    1. Assign data points to the closest cluster center:
      • O(KN) time
    2. Change the cluster center to the average of its assigned points:
      • O(N) time

K-Means Convergence

Objective: \(\min_{\mu}\min_{C}\sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2\)

  1. Fix \(\mu\), optimize \(C\): \[\min_C \sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2 = \min_C \sum_{j=1}^n \lVert x_j - \mu_{c(x_j)} \rVert^2\] where \(c(x_j)\) denotes the cluster assigned to point \(x_j\); this is minimized by assigning each point to its closest center
  2. Fix \(C\), optimize \(\mu\): \[\min_{\mu}\sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2\]
    • Taking the partial derivative with respect to \(\mu_i\) and setting it to zero gives \[\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x\]
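
Expanding that step: the objective is a sum of convex quadratics in \(\mu_i\), so

\[ \frac{\partial}{\partial \mu_i} \sum_{x\in C_i} \lVert x - \mu_i \rVert^2 = \sum_{x\in C_i} -2\,(x - \mu_i) = 0 \quad\Rightarrow\quad \lvert C_i \rvert\, \mu_i = \sum_{x\in C_i} x, \]

i.e., the optimal \(\mu_i\) is the centroid of its assigned points.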

K-means takes an alternating optimization approach. Each step can only decrease (or leave unchanged) the objective, and there are finitely many possible partitions, so the algorithm is guaranteed to converge.
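
One way to check this numerically: fix the initialization and replay the same run with an increasing iteration cap, watching the objective (inertia_ is scikit-learn's name for it) never increase. A sketch on the iris data:

from sklearn.cluster import KMeans
from sklearn import datasets

X = datasets.load_iris().data
init = X[:3]  # fix the starting centers so every run replays the same path

# The objective is non-increasing from one iteration to the next
# (scikit-learn may warn that the early runs hit the iteration cap)
for max_iter in range(1, 6):
    est = KMeans(n_clusters=3, init=init, n_init=1, max_iter=max_iter).fit(X)
    print(max_iter, est.inertia_)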

Initialization

The K-means algorithm is a heuristic:

  • Requires initial means
  • It does matter what you pick!
  • What can go wrong?
  • Various schemes exist for preventing this: variance-based split/merge, smarter initialization heuristics (see the sketch below)
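
For instance, scikit-learn's defaults combine k-means++ seeding with multiple restarts, keeping the best of the runs. A sketch comparing that to a single random initialization (lower inertia_ is better):

from sklearn.cluster import KMeans
from sklearn import datasets

X = datasets.load_iris().data

# A single run from one random initialization can land in a poor local optimum
single = KMeans(n_clusters=3, init='random', n_init=1, random_state=0).fit(X)

# k-means++ seeding plus restarts keeps the best of n_init runs
multi = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)

print(single.inertia_, multi.inertia_)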

K-Means Getting Stuck

Local optima depend on how the problem was specified:

K-means not able to properly cluster

Changing the features (distance function) can help

Agglomerative Clustering

  • Agglomerative clustering:
    • First merge very similar instances
    • Incrementally build larger clusters out of smaller clusters
  • Algorithm:
    • Maintain a set of clusters
    • Initially, each instance in its own cluster
    • Repeat:
      • Pick the two closest clusters
      • Merge them into a new cluster
      • Stop when there’s only one cluster left
  • Produces not one clustering, but a family of clusterings represented by a dendrogram
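
SciPy's hierarchy tools build the merge history once and let you cut the dendrogram at any level to recover a flat clustering. A minimal sketch on synthetic data (the two blobs are illustrative, not from the lecture):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Two synthetic 2D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(X, method='average')  # merge history: one row per merge
dendrogram(Z)                     # the whole family of clusterings
plt.show()

labels = fcluster(Z, t=2, criterion='maxclust')  # cut into 2 flat clusters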

Agglomerative Clustering

  • How should we define “closest” for clusters with multiple elements?
  • Many options:
    • Closest pair (single-link clustering)
    • Farthest pair (complete-link clustering)
    • Average of all pairs (average-link clustering)
  • Different choices create different clustering behaviors

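These options map directly onto the linkage parameter of sklearn.cluster.AgglomerativeClustering; a minimal sketch on the iris measurements:

from sklearn.cluster import AgglomerativeClustering
from sklearn import datasets

X = datasets.load_iris().data

# Same merge loop, three definitions of "closest" between clusters
for link in ('single', 'complete', 'average'):
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, labels[:10])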

Clustering Behavior

Mouse tumor data from Hastie et al.

Agglomerative Clustering Questions

  • Will agglomerative clustering converge?
    • To a global optimum?
  • Will it always find the true patterns in the data?
  • Do people ever use it?
  • How many clusters to pick?

Reconsidering “hard assignments”?

  • Clusters may overlap
  • Some clusters may be “wider” than others
  • Distances can be deceiving!
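
These concerns motivate the soft assignments of the Gaussian mixtures listed earlier: every point receives a probability for each cluster, so overlapping or wide clusters can share points. A minimal sketch with scikit-learn:

from sklearn.mixture import GaussianMixture
from sklearn import datasets

X = datasets.load_iris().data

# Soft assignments: each point gets a probability for every cluster
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
resp = gmm.predict_proba(X)
print(resp[:3].round(2))  # each row sums to 1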

Applications

  • Clustering patients into groups based on molecular or etiological measurements
  • Cells into groups based on molecular measurements
  • Neuronal signals

Clustering Patients

Review

Implementation

K-means

sklearn.cluster.KMeans

  • n_clusters: Number of clusters to form
  • init: Initialization method
  • n_init: Number of initializations to run (the best result is kept)
  • max_iter: Maximum number of iterations per run

Agglomerative

sklearn.cluster.AgglomerativeClustering

Implementation - K-means

import numpy as np
from matplotlib.pyplot import figure
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the 3D projection

from sklearn.cluster import KMeans
from sklearn import datasets

# 150 iris samples, 4 features each
iris = datasets.load_iris()
X, y = iris.data, iris.target

est = KMeans(n_clusters=3)

# Direct Axes3D(...) construction is deprecated; add a 3D axis instead
fig = figure()
ax = fig.add_axes([0, 0, .95, 1], projection='3d')
ax.view_init(elev=48, azim=134)

est.fit(X)
labels = est.labels_

# Color each point by its cluster label (np.float was removed from NumPy)
ax.scatter(X[:, 3], X[:, 0], X[:, 2],
           c=labels.astype(float), edgecolor='k')
# ...

Further Reading

K-Means clustering

Review Questions

  1. What is clustering? What does it seek to accomplish?
  2. What is agglomerative clustering? What does the resulting output look like?
  3. What is the difference between single-link, complete-link, and average clustering?
  4. What are three properties that are required of a distance metric? What kind of things can be compared with a distance metric?
  5. What decisions does one have to make before performing k-means clustering?
  6. Describe the process of fitting that k-means performs.
  7. You and your colleague each run k-means with the same settings, and arrive at different results. How could this happen?
  8. What are two cases in which k-means clustering can fail or provide bad results?
  9. How do you determine that your clustering is “good”?