Clustering
Clustering
- Unsupervised learning
- Requires data, but no labels
- Detect patterns e.g. in
- Gene expression between patient samples
- Images
- Really any sets of measurements across samples
- Useful when you don’t know what you’re looking for
- But: the resulting clusters can be gibberish
Clustering
- Basic idea: group together similar instances
- Example: 2D point patterns
Go over what the data input and output are (the output is a discrete cluster assignment).
Clustering is subjective!
Go over the example of the Simpsons: Lisa, Bart, Maggie, Marge, Krusty, Sideshow Bob (several groupings are equally valid).
Applications
- Clustering patients into groups based on molecular or etiological measurements
- Cells into groups based on molecular measurements
- Neuronal signals
- We use clusters because they (hopefully) translate to meaningful differences.
- What if you cluster based on different measurements?
What does “similar” mean?
What does “similar” mean?
- One option: Euclidean distance
\[ \textrm{dist}(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert \]
- Clustering results are completely dependent on the measure of similarity (or distance) between observations to be clustered
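A minimal sketch of computing this distance with NumPy (the vectors below are made-up examples):

```python
import numpy as np

# Two made-up observations (e.g. expression levels of three genes)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

# Euclidean distance is the norm of the difference vector
dist = np.linalg.norm(x - y)
print(dist)  # sqrt((1-2)**2 + (2-0)**2 + (3-3)**2) = sqrt(5) ≈ 2.236
```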
What properties should a distance measure have?
Symmetric
- \(D(A, B) = D(B, A)\)
- Otherwise we could say A looks like B but not vice versa
Positivity and self-similarity
- \(D(A,B) \geq 0\), and \(D(A,B)=0\) iff \(A=B\)
- Otherwise there will be objects that we can’t tell apart
Triangle inequality
- \(D(A, B) + D(B, C) \geq D(A,C)\)
- Otherwise one could say “A is like B, B is like C, but A is not like C at all”
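A quick numeric check of these three properties for the Euclidean distance (the points A, B, C are arbitrary examples):

```python
import numpy as np

def dist(a, b):
    """Euclidean distance between two points."""
    return np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))

A, B, C = [0.0, 0.0], [3.0, 4.0], [6.0, 0.0]

assert dist(A, B) == dist(B, A)                  # symmetry
assert dist(A, B) > 0 and dist(A, A) == 0        # positivity and self-similarity
assert dist(A, B) + dist(B, C) >= dist(A, C)     # triangle inequality
```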
Clustering algorithms
- Partition algorithms (flat)
- K-means
- Gaussian mixtures
- Hierarchical algorithms
- Bottom up - agglomerative
- Top down - divisive
K-means, an iterative clustering algorithm
- Initialize: Pick K random points as cluster centers
- Alternate until no assignments change:
- Assign data points to the closest cluster center
- Change the cluster center to the average of its assigned points
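A minimal from-scratch sketch of this loop in NumPy (the function and variable names are mine, for illustration; in practice you would use scikit-learn's KMeans, shown later):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize: pick K random data points as the cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assign each point to the closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignments changed: converged
        labels = new_labels
        # Move each center to the average of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Usage, for example: `labels, centers = kmeans(np.random.randn(200, 2), k=3)`.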
K-means clustering: Example
- Pick K random points as cluster centers (means)
- Shown here for K=2
K-means clustering: Example
Iterative Step 1
Assign data points to the closest cluster center
The critical step is right here: how do we decide what counts as close, and which center is closer? This depends on the chosen distance measure.
K-means clustering: Example
Iterative Step 2
Change the cluster center to the average of the assigned points
- Can be slow for very large numbers of data points.
- Can use stochastic sampling in some cases.
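One form of such sampling is scikit-learn's MiniBatchKMeans, which updates the centers from small random batches rather than the full dataset at each step; a sketch with placeholder data:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.randn(100_000, 10)   # placeholder data, for illustration only
est = MiniBatchKMeans(n_clusters=3, batch_size=1024, random_state=0)
labels = est.fit_predict(X)        # centers are updated from small random batches
```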
K-means clustering: Example
Repeat until convergence
Properties of K-means algorithm
- Guaranteed to converge in a finite number of iterations
- Running time per iteration:
- Assign data points to the closest cluster center:
- O(KN) time
- Change the cluster center to the average of its assigned points:
- O(N) time
K-Means convergence
Objective: \(\min_{\mu}\min_{C}\sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2\)
- Fix \(\mu\), optimize \(C\): \[\min_C \sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2 = \min_C \sum_{i=1}^n \lVert x_i - \mu_{x_i} \rVert^2\]
- Fix \(C\), optimize \(\mu\): \[\min_{\mu}\sum_{i=1}^k \sum_{x\in C_i} \lVert x - \mu_i \rVert^2\]
- Take the partial derivative with respect to \(\mu_i\) and set it to zero, giving \[\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x\]
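Writing out that derivative step:
\[
\frac{\partial}{\partial \mu_i} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
= -2 \sum_{x \in C_i} (x - \mu_i) = 0
\quad\Longrightarrow\quad
\lvert C_i \rvert \, \mu_i = \sum_{x \in C_i} x .
\]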
K-means takes an alternating optimization approach. Each step never increases the objective, so the algorithm is guaranteed to converge (to a local optimum).
K-means is (somewhat) sensitive to initialization
- Various heuristic schemes exist for preventing problematic results.
- None of them are perfect.
- Also go through cluster identity invariance: relabeling (permuting) the clusters gives an equivalent solution.
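A small sketch of this with scikit-learn: run K-means from several random initializations and compare the objective values reached (synthetic data; for such easy data the runs may all agree, but in general they can differ):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose synthetic groups in 2D
X = np.vstack([rng.normal(loc, 0.6, size=(50, 2)) for loc in ([0, 0], [4, 0], [2, 3])])

for seed in range(3):
    est = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, round(est.inertia_, 1))  # objective value reached from this initialization
```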
K-Means Getting Stuck
Local optima dependent on how the problem was specified:
K-means not able to properly cluster
- Go over the number of clusters.
- Show example data.
- What would the within-cluster distance look like in the ideal case?
Changing the features (distance function) can help
- This is the idea behind the kernel method.
- We will use it again with SVMs!
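One scikit-learn illustration of this idea is spectral clustering, which works in a similarity (affinity) space rather than with the raw Euclidean distance; a sketch on two-moons data (my example, not from the slides):

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

# Two interleaved half-moons: plain K-means with Euclidean distance splits them badly
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)
```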
Agglomerative Clustering
Agglomerative Clustering
- Agglomerative clustering:
- First merge very similar instances
- Incrementally build larger clusters out of smaller clusters
- Algorithm:
- Maintain a set of clusters
- Initially, each instance in its own cluster
- Repeat:
- Pick the two closest clusters
- Merge them into a new cluster
- Stop when there’s only one cluster left
- Produces not one clustering, but a family of clusterings represented by a dendrogram
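A minimal sketch of building such a dendrogram with SciPy (synthetic data; average linkage is an arbitrary choice here):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(10, 2)), rng.normal(5, 1, size=(10, 2))])

# Bottom-up merging; Z records which clusters were merged and at what distance
Z = linkage(X, method="average")
dendrogram(Z)
plt.show()
```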
Agglomerative Clustering
How should we define “closest” for clusters with multiple elements?
Agglomerative Clustering
- How should we define “closest” for clusters with multiple elements?
- Many options:
- Closest pair (single-link clustering)
- Farthest pair (complete-link clustering)
- Average of all pairs
- Different choices create different clustering behaviors
Agglomerative Clustering
- How should we define “closest” for clusters with multiple elements?
- Many options:
- Closest pair (single-link clustering)
- Farthest pair (complete-link clustering)
- Average of all pairs
- Different choices create different clustering behaviors
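A sketch comparing these choices via scikit-learn's linkage parameter (synthetic data, my own example):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(6, 1, size=(30, 2))])

for link in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(link, np.bincount(labels))  # cluster sizes under each linkage choice
```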
Clustering Behavior
Agglomerative Clustering Questions
- Will agglomerative clustering converge?
- To a global optimum?
- Will it always find the true patterns in the data?
- Do people ever use it?
- How many clusters to pick?
Reconsidering “hard assignments”?
- Clusters may overlap
- Some clusters may be “wider” than others
- Distances can be deceiving!
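One alternative to hard assignments is a soft assignment, e.g. the Gaussian mixtures listed earlier under partition algorithms; a scikit-learn sketch with synthetic overlapping groups (my example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping groups with different spreads ("widths")
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(2, 3, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)   # soft assignment: a probability per cluster per point
print(probs[:3].round(2))
```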
Review
Implementation
K-means
sklearn.cluster.KMeans
- n_clusters: Number of clusters to form
- init: Initialization method
- n_init: Number of initializations
- max_iter: Maximum number of steps the algorithm can take
Agglomerative
sklearn.cluster.AgglomerativeClustering
- n_clusters: Number of clusters to return
- metric: Distance metric to use
- linkage: Linkage criterion to use
Fisher’s iris dataset
K-means implementation
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

est = KMeans(n_clusters=3)
est.fit(X)
labels = est.labels_

# Project to two dimensions for plotting (one option; the original elides this step)
X_pca = PCA(n_components=2).fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.show()
```
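For comparison, a sketch of the agglomerative version on the same dataset (not from the original slides; the metric and linkage choices are arbitrary):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# In older scikit-learn versions the metric parameter is named affinity
agg = AgglomerativeClustering(n_clusters=3, metric="euclidean", linkage="average")
labels = agg.fit_predict(X)
```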
Review
Further Reading
Review Questions
- What is clustering? What does it seek to accomplish?
- What is agglomerative clustering? What does the resulting output look like?
- What is the difference between single-link, complete-link, and average clustering?
- What are three properties that are required of a distance metric? What kind of things can be compared with a distance metric?
- What decisions does one have to make before performing k-means clustering?
- Describe the process of fitting that k-means performs.
- You and your colleague each run k-means with the same settings, and arrive at different results. How could this happen?
- What are two cases in which k-means clustering can fail or provide bad results?
- How do you determine that your clustering is “good”?