\(Y\) | \(X_1\) | \(X_2\) |
---|---|---|
?? | 0.1 | 2.1 |
?? | 0.5 | -1.1 |
?? | 0.0 | 3.0 |
?? | -0.1 | -2.0 |
?? | 0.2 | 1.5 |
… | … | … |
Each data point is sampled from a generative process:
\[p(Y_i = y_k) = \theta_k\]
\[P(X_i = x \mid Y = y_k) = N(x \mid \mu_{ik}, \sigma_{ik})\]
\[P(X = \mathbf{x}_j \mid Y = i) = N(\mathbf{x}_j \mid \mu_i, \Sigma_i)\]
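To make the generative process concrete, here is a minimal sampling sketch; the priors, means, and covariances are made-up illustrative values, not estimates from the table above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for a 2-class generative process over (X1, X2)
theta = np.array([0.4, 0.6])                           # class priors P(Y = y_k)
mus = [np.array([0.0, 2.0]), np.array([0.0, -1.5])]    # per-class means
Sigmas = [np.eye(2), np.diag([0.2, 1.0])]              # per-class covariances

def sample_point():
    """Draw Y ~ Categorical(theta), then X | Y = k ~ N(mu_k, Sigma_k)."""
    k = rng.choice(len(theta), p=theta)
    x = rng.multivariate_normal(mus[k], Sigmas[k])
    return k, x

samples = [sample_point() for _ in range(5)]
```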
\[P(X=\mathbf{x}_{j})=\frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}_{j}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}_{j}-\boldsymbol{\mu})\right]\]
\(\Sigma\) is the identity matrix
\[P(X=\mathbf{x}_j) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}_j - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}_j - \boldsymbol{\mu})\right]\]
\(\Sigma\) is a diagonal matrix
\(X_i\) are independent and uncorrelated
\[P(X=\mathbf{x}_j) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}_j - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}_j - \boldsymbol{\mu})\right]\]
\(\Sigma\) is an arbitrary (positive semidefinite) matrix
Eigenvectors specify the rotation/change of basis
Eigenvalues specify the relative elongation
\[P(X=\mathbf{x}_j) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}_j - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}_j - \boldsymbol{\mu})\right]\]
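As a small illustration of the three cases above (made-up parameters; `scipy.stats.multivariate_normal` evaluates exactly the density formula shown):

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.5, -0.3])
mu = np.zeros(2)

# Identity covariance: spherical, equally-scaled contours
p_identity = multivariate_normal.pdf(x, mean=mu, cov=np.eye(2))

# Diagonal covariance: axis-aligned ellipses (independent components)
p_diag = multivariate_normal.pdf(x, mean=mu, cov=np.diag([0.5, 2.0]))

# Full covariance: rotated ellipses; eigenvectors give the rotation,
# eigenvalues the relative elongation along each principal axis
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
p_full = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
```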
Old Faithful Data Set
Combine simple models into a complex model:
\[p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N} (\mathbf{x} | \boldsymbol{\mu}_k, \mathbf{\Sigma}_k)\]
\[\forall k: \pi_k \geq 0\]
\[\sum_{k=1}^K \pi_k = 1\]
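A minimal sketch of evaluating this mixture density at a point, with hypothetical parameters that satisfy the two constraints on the mixing weights:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in 2 dimensions
pi = np.array([0.3, 0.7])
mus = [np.array([2.0, 2.0]), np.array([-1.0, 0.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

assert np.all(pi >= 0) and np.isclose(pi.sum(), 1.0)   # pi_k >= 0, sum to 1

def mixture_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)"""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mus[k], cov=Sigmas[k])
               for k in range(len(pi)))

print(mixture_density(np.array([0.0, 1.0])))
```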
Model data as mixture of multivariate Gaussians
Shown is the posterior probability that a point was generated by the \(i\)’th Gaussian: \(P(Y = i \mid \mathbf{x})\)
\[\mu_{MLE} = \frac{1}{N}\sum_{i=1}^N x_i \quad\quad \sigma_{MLE}^2 = \frac{1}{N}\sum_{i=1}^N (x_i -\hat{\mu})^2\]
The ML estimate for each of the multivariate Gaussians is given by:
\[\mu_{ML}^k = \frac{1}{n} \sum_{j=1}^n \mathbf{x}_j \quad\quad \Sigma_{ML}^k = \frac{1}{n}\sum_{j=1}^n \left(\mathbf{x}_j - \mu_{ML}^k\right) \left(\mathbf{x}_j - \mu_{ML}^k\right)^T\]
The sums are over the \(\mathbf{x}_j\) generated from the \(k\)’th Gaussian
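If we knew which points came from the \(k\)’th Gaussian, these estimates would just be a per-class sample mean and covariance. A minimal numpy sketch (X is an n-by-d array of the points attributed to that Gaussian):

```python
import numpy as np

def gaussian_mle(X):
    """ML mean and covariance for points X (n x d) assumed to come
    from a single multivariate Gaussian."""
    n = X.shape[0]
    mu = X.mean(axis=0)                    # (1/n) * sum_j x_j
    centered = X - mu
    Sigma = centered.T @ centered / n      # (1/n) * sum_j (x_j - mu)(x_j - mu)^T
    return mu, Sigma
```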
\[P(y = k | \mathbf{x}_j) \propto \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}_j - \mu_k)^T \Sigma_k^{-1}(\mathbf{x}_j - \mu_k)\right]P(y = k)\]
Marginal likelihood: \[\prod_{j=1}^m P(\mathbf{x}_j) = \prod_{j=1}^m \sum_{k=1}^K P(\mathbf{x}_j, y = k)\] \[= \prod_{j=1}^m \sum_{k=1}^K \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}_j - \mu_k)^T \Sigma_k^{-1}(\mathbf{x}_j - \mu_k)\right]P(y = k)\]
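For reference, a sketch of evaluating this objective numerically (as a log marginal likelihood, using `scipy.special.logsumexp` for stability); this is the quantity we would like to maximize over the parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_marginal_likelihood(X, pis, mus, Sigmas):
    """sum_j log sum_k P(y=k) N(x_j | mu_k, Sigma_k) for data X (m x d)."""
    # log_probs[j, k] = log P(y=k) + log N(x_j | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(pis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pis))
    ])
    return logsumexp(log_probs, axis=1).sum()
```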
Need to differentiate and solve for \(\mu_k\), \(\Sigma_k\), and \(P(Y=k)\) for \(k = 1 \ldots K\)
There will be no closed-form solution; the gradient is complex, and there are many local optima.
Wouldn’t it be nice if there were a better way?!
Consider the special case of spherical Gaussians with a common, known variance \(\sigma^2\):
\[ \prod_{j=1}^{m}\sum_{k=1}^{K} P(\mathbf{x}_j, Y_{j} = k) \propto \prod_{j=1}^{m}\sum_{k=1}^{K}\exp\left[-\frac{1}{2\sigma^{2}} \| \mathbf{x}_j - \mu_{k} \|^{2}\right] P(Y_{j}=k) \]
Iterate: On the \(t\)’th iteration let our estimates be
\[ \lambda_t = \{ \mu_1^{(t)}, \mu_2^{(t)} ... \mu_K^{(t)}, \Sigma_1^{(t)}, \Sigma_2^{(t)} ... \Sigma_K^{(t)}, p_1^{(t)}, p_2^{(t)} ... p_K^{(t)} \} \]
\(p_k^{(t)}\) is shorthand for estimate of \(P(y=k)\) on \(t\)’th iteration
Compute “expected” classes of all datapoints for each class
\[ P(Y_j = k|x_j, \lambda_t) \propto p_k^{(t)}p(x_j|\mu_k^{(t)}, \Sigma_k^{(t)}) \] Just evaluate a Gaussian at \(x_j\)
Compute weighted MLEs for \(\mu\), \(\Sigma\), and \(p\) given the expected classes above
\[ \mu_k^{(t+1)} = \frac{\sum_j P(Y_j = k|x_j,\lambda_t)x_j}{\sum_j P(Y_j = k|x_j,\lambda_t)} \]
\[ \Sigma_k^{(t+1)} = \frac{\sum_j P(Y_j = k|x_j,\lambda_t) [x_j - \mu_k^{(t+1)}][x_j - \mu_k^{(t+1)}]^T}{\sum_j P(Y_j = k|x_j,\lambda_t)} \]
\[ p_k^{(t+1)} = \frac{\sum_j P(Y_j = k|x_j,\lambda_t)}{m} \]
\(m\) is the number of training examples
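Putting the E-step and M-step together, a minimal numpy sketch of one iteration under the updates above (initialization and convergence checking are omitted):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a Gaussian mixture.
    X: (m, d) data; pis: (K,) mixing weights; mus: list of (d,); Sigmas: list of (d, d)."""
    m, _ = X.shape
    K = len(pis)

    # E-step: responsibilities r[j, k] = P(Y_j = k | x_j, lambda_t)
    r = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])
    r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted ML estimates
    Nk = r.sum(axis=0)                                   # effective count per component
    new_mus = [r[:, k] @ X / Nk[k] for k in range(K)]
    new_Sigmas = []
    for k in range(K):
        centered = X - new_mus[k]
        new_Sigmas.append((r[:, k] * centered.T) @ centered / Nk[k])
    new_pis = Nk / m

    return new_pis, new_mus, new_Sigmas
```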
Special case: spherical Gaussians with known variance \(\sigma^2\), so only the means need to be estimated. On the \(t\)’th iteration let our estimates be \(\lambda_t = \{ \mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)} \}\)
Compute “expected” classes of all datapoints
\[P(Y_j = k|x_j, \mu_1 ... \mu_K) \propto \exp(-\frac{1}{2\sigma^2}\|x_j - \mu_k\|^2)P(Y_j = k)\]
Compute most likely new \(\mu\)s given class expectations
\[\mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k|x_j) x_j}{\sum_{j=1}^{m} P(Y_j = k|x_j)}\]
On the \(t\)’th iteration let our estimates be \(\lambda_t = \{ \mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)} \}\)
Compute “expected” classes of all datapoints; with a uniform prior \(P(Y_j = k) = 1/K\), the prior drops out of the posterior:
\[P(Y_j = k | x_j, \mu_1 \ldots \mu_K) \propto \exp\left(-\frac{1}{2\sigma^2}\|x_j - \mu_k\|^2\right)\]
On the \(t\)’th iteration let our estimates be \(\lambda_t = \{ \mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)} \}\)
Compute most likely new \(\mu\)s given class expectations \[\mu_k = \frac{\sum_{j=1}^m \delta(Y_j=k,x_j)x_j}{\sum_{j=1}^m \delta(Y_j=k,x_j)}\]
\(\delta\) represents hard assignment to “most likely” or nearest cluster
Equivalent to k-means clustering algorithm!
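For comparison, a sketch of the hard-assignment version: replace the soft responsibilities with an indicator for the nearest mean, which gives exactly one k-means (Lloyd) iteration. This sketch assumes every cluster keeps at least one assigned point:

```python
import numpy as np

def kmeans_step(X, mus):
    """One hard-assignment (k-means) iteration.
    X: (m, d) data; mus: (K, d) current cluster means."""
    # "E-step": assign each point to its nearest mean (delta = 1 for that cluster)
    dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (m, K) squared distances
    assign = dists.argmin(axis=1)

    # "M-step": each mean becomes the average of the points assigned to it
    new_mus = np.array([X[assign == k].mean(axis=0) for k in range(mus.shape[0])])
    return new_mus, assign
```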
sklearn.mixture.GaussianMixture implements GMMs within sklearn.
GaussianMixture creates the model object
n_components indicates the number of Gaussians to use
covariance_type is the type of covariance: full, spherical, diag, or tied (tied means all components share the same covariance matrix)
max_iter is the number of EM iterations to use
M.fit(X) fits the model using the EM algorithm
M.predict_proba(X) gives the posterior probability of each component given the data
M.predict(X) predicts the class labels of each data point
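A short usage sketch (the data here are made up purely for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two illustrative 2-D clusters
X = np.vstack([
    rng.normal(loc=[0.0, 2.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, -1.0], scale=1.0, size=(100, 2)),
])

M = GaussianMixture(n_components=2, covariance_type='full', max_iter=100, random_state=0)
M.fit(X)                      # runs EM
labels = M.predict(X)         # hard cluster labels
resp = M.predict_proba(X)     # posterior P(component | x) for each point
print(M.weights_)             # fitted mixing weights
print(M.means_)               # fitted component means
```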