Gaussian Mixture Model

2019-01-31

Statistics

Contents
Gaussian Mixture Model with 2 Mixture Components
Gaussian Mixture Model: General Case

Mixture model이란 여러 하위 모집단(subpopulation)들이 존재하는 모집단(population)을 나타내는 확률 모형이다. 현실의 데이터는 서로 다른 특성을 갖는 여러 하위집단이 있는 경우가 많으므로, mixture model을 이용한 분석은 다양한 분야의 데이터에 적용가능하다. Mixture model은 여러 확률분포의 가중합 형태로 나타나는 mixture distribution을 이용하며, 각 observation이 어떤 하위 집단에서 생성된 것인지에 대한 정보 없이, 전체 집단의 데이터만을 가지고 각 하위 집단(mixture component)들이 어떤 특성을 갖는지를 추론하는 것이 분석의 주 목적이 된다. 여기서 더 나아가 observation들이 어떤 하위 집단에 속하는지, 혹은 observation들이 각 하위 집단에 속할 확률이 어떻게 되는지를 추론한다면, 이는 mixture model을 이용하여 clustering을 수행한 것이 된다.

Gaussian Mixture Model with 2 Mixture Components

그 중에서도 가장 일반적이고 널리 알려진 Gaussian Mixture Model에 대해 소개해보고자 한다. Gaussian Mixture Model은 하위 집단들이 정규(Gaussian) 분포를 따른다고 가정한 mixture model이다. 먼저 아래와 같은 Gaussian Mixture Model의 간단한 예시를 살펴보자.

위 그림의 히스토그램에 나타난 것과 같이, 20개의 observation으로 이루어진 data가 있다고 하자. 이 data는 bi-modal, 즉 쌍봉형태의 모습을 보이기 때문에, 정규분포나 다른 종 모양의 분포로 이 data의 density를 모델링하는 것은 바람직하지 않을 것이다. 따라서 아래와 같이 두 정규분포의 mixture 형태로 이 확률변수를 모델링하겠다.

\[Y_1 \sim N(\mu_1,\sigma_1^2) \text{ , } \enspace Y_2 \sim N(\mu_2,\sigma_2^2)\] \[Y = (1-\Delta)\cdot Y_1 + \Delta \cdot Y_2\] \[\text{where } \Delta \in \{ 0,1 \} \text{ with Pr}(\Delta=1)=\pi \text{ , Pr}(\Delta=0)=1-\pi\]

확률변수 $Y$로부터 sample을 뽑는 과정을 살펴보면 다음과 같다.

$\text{Pr}(\Delta=1)=\pi$인 $\Delta$를 생성한다.
$\Delta=0$이 나왔다면 모델 $1$ ($Y_1$)으로부터, $\Delta=1$이 나왔다면 모델 $2$ ($Y_2$)로부터 sample을 뽑아, 이를 $Y$로부터의 sample로 취한다.

따라서 $Y$의 확률밀도함수, $g_Y(y)$는 다음과 같다. 여기서 $\phi_\theta(x)$는 $\theta=(\mu,\sigma^2)$를 모수로 갖는 정규분포의 확률밀도함수를 의미한다.

\[g_Y(y)=(1-\pi)\phi_{\theta_1}(y)+\pi\phi_{\theta_2}(y)\]

이제 두 개의 gaussian의 mixture 형태를 갖는 모형을 maximum likelihood를 이용하여, 위 data에 fit하고자 한다. 이 때 추정해야할 모수는 아래와 같다.

\[\theta=(\pi,\theta_1, \theta_2)=(\pi,\mu_1, \sigma_1^2,\mu_2, \sigma_2^2)\]

20개의 training data를 $\mathbf{Z}={ Z_1, \cdots , Z_{20} }$라고 하면, likelihood와 log-likelihood는 다음과 같다.

\[L(\theta;\mathbf{Z})=\prod_{i=1}^{20} g_Y(y_i)=\prod_{i=1}^{20} [(1-\pi)\phi_{\theta_1}(y_i)+\pi\phi_{\theta_2}(y_i)]\] \[l(\theta;\mathbf{Z})=\sum_{i=1}^{20} \log [g_Y(y_i)]=\sum_{i=1}^{20} \log [(1-\pi)\phi_{\theta_1}(y_i)+\pi\phi_{\theta_2}(y_i)]\]

Data Augmentation

로그 안에 합이 있는 형태이므로 이를 직접적으로 $(\pi,\mu_1, \sigma_1^2,\mu_2, \sigma_2^2)$에 대해 각각 미분하여, Likelihood가 최대가 되는 모수 값을 직접적으로 찾는 것은 매우 어렵다. 따라서 우리는 training data $\mathbf{Z}$ 외에, 관측되지 않은 잠재 변수(latent variable)을 추가로 고려한 likelihood를 세울 것이다. 그 잠재 변수는 바로 관측된 observation이 $Y_1, Y_2$ 중 어떤 mixture component에서 온 것인지를 알려주는 $\Delta= { \Delta_1, \cdots , \Delta_{20} }$이다. 위에서 Mixture model이 무엇인지 설명할 때, 다음과 같이 서술하였다.

각 observation이 어떤 하위 집단에서 생성된 것인지에 대한 정보 없이, 전체 집단의 데이터만을 가지고 각 하위 집단(mixture component)들이 어떤 특성을 갖는지를 추론하는 것이 분석의 주 목적이 된다.

위 히스토그램과 같은 20개의 observation으로 이루어진 data가 주어졌을 때, 우리는 이 observation들이 각각 $Y_1$에서 온 것인지 혹은 $Y_2$에서 온 것인지 알지 못한다. 다만 이 데이터를 이용하여, $Y$의 구성요소가 되는 두 정규분포 $Y_1,Y_2$의 모수($\mu_1, \sigma_1^2,\mu_2, \sigma_2^2$)와 mixture weight($\pi$)를 추정해서, Mixture component들이 어떤 특성을 갖는지를 추론하고자 하는 것이다.

$\Delta$를 고려하여, 관측된 data $\mathbf{Z}$와 관측되지 않는 잠재 변수 $\Delta$의 joint probability, $P(\mathbf{Z},\Delta;\theta)$를 구하면 아래와 같다.

\[P(\mathbf{Z} \mid \Delta;\theta)=\prod_{i=1}^{20} \phi_{\theta_1}(y_i)^{1-\Delta_i}\phi_{\theta_2}(y_i)^{\Delta_i}\] \[P(\Delta ; \theta)=\prod_{i=1}^{20} (1-\pi)^{1-\Delta_i}\pi^{\Delta_i}\] \[P(\mathbf{Z},\Delta;\theta)=P(\mathbf{Z}\mid \Delta;\theta)P(\Delta ; \theta)=\prod_{i=1}^{20} \phi_{\theta_1}(y_i)^{1-\Delta_i}\phi_{\theta_2}(y_i)^{\Delta_i}(1-\pi)^{1-\Delta_i}\pi^{\Delta_i}\]

Data의 joint probability는 likelihood와 같으므로, 관측된 data $\mathbf{Z}$와 관측되지 않는 잠재 변수 $\Delta$를 고려한 우리의 likelihood와 log-likelihood는 다음과 같다.

\[L(\theta;\mathbf{Z},\Delta)=\prod_{i=1}^{20} \phi_{\theta_1}(y_i)^{1-\Delta_i}\phi_{\theta_2}(y_i)^{\Delta_i}(1-\pi)^{1-\Delta_i}\pi^{\Delta_i}\]

$l(\theta;\mathbf{Z},\Delta)=\sum_{i=1}^{20} \Big[ (1-\Delta_i)\log\phi_{\theta_1}(y_i)+ \Delta_i\log\phi_{\theta_2}(y_i) \Big] + \sum_{i=1}^{20} \Big[ (1-\Delta_i)\log(1-\pi))+ \Delta_i \log{\pi} \Big]$

Expectation-Maximization Algorithm

$\Delta_i$는 실제로 관측이 되지 않는 latent variable이기 때문에, 이 likelihood를 최대화하기 위해서는 EM 알고리즘이라는 방법을 사용한다. EM 알고리즘은 관측되지 않는 변수가 있는 likelihood를 간접적으로 최대화하는 방법이다. 이 EM 알고리즘에 대한 설명은 별도의 포스트에 정리해 두었으며, EM 알고리즘이 익숙하지 않은 독자는 이 포스트를 읽고 다시 여기로 돌아오는 것을 강력하게 추천한다. (Link) 요약하자면, EM 알고리즘은 다음과 같은 작업을 반복적으로 수행한다.

추정하고자 하는 모수 $\theta$의 초기값을 설정한다.
Expectation step : 모수 $\theta$를 한 값으로 고정하고, 그 때의 latent variable의 조건부분포를 이용하여 log-likelihood의 기대값을 구한다.
Maximization step : E-step에서 구한 log-likelihood의 기대값을 이용해 모수 $\theta$의 최적값을 구하고, 이를 새로운 $\theta$값으로 업데이트한다.
이 과정을 반복할 때, $\theta$가 특정 값으로 수렴한다면, 우리는 이를 $\theta$의 추정치로 삼는다.

Expectation Step

$\theta$의 초기값, $\theta^{(0)}$를 다음과 같이 설정한다.

\[\theta^{(0)}=(\pi^{(0)},\mu_1^{(0)}, {\sigma_1^2}^{(0)},\mu_2^{(0)},{\sigma_2^2}^{(0)})\]

위에서 도출한 log-likelihood를 $\theta$의 초기값에 대한 latent variable의 조건부분포, 즉 $P(\Delta \mid \mathbf{Z},\theta=\theta^{(0)})$로 기대값을 취하면 다음과 같다.

\[\mathbb{E}_{\Delta \mid \mathbf{Z},\theta=\theta^{(0)}} \Big[ l(\theta;\mathbf{Z},\Delta) \Big]=\sum_{\Delta} l(\theta;\mathbf{Z},\Delta) \cdot P(\Delta \mid \mathbf{Z},\theta=\theta^{(0)})\] \[=\sum_{i=1}^{20} \Big[ (1-\mathbb{E}_{\Delta \mid \mathbf{Z},\theta=\theta^{(0)}}[\Delta_i]) \cdot \log\phi_{\theta_1}(y_i)+ \mathbb{E}_{\Delta \mid \mathbf{Z},\theta=\theta^{(0)}}[\Delta_i] \cdot \log\phi_{\theta_2}(y_i) \Big]\] \[+ \sum_{i=1}^{20} \Big[ (1-\mathbb{E}_{\Delta \mid \mathbf{Z},\theta=\theta^{(0)}}[\Delta_i]) \cdot \log(1-\pi))+ \mathbb{E}_{\Delta \mid \mathbf{Z},\theta=\theta^{(0)}}[\Delta_i] \cdot \log{\pi} \Big]\] \[=\sum_{i=1}^{20} \Big[ \big( 1-\gamma_i(\theta^{(0)}) \big) \cdot \log\phi_{\theta_1}(y_i)+ \gamma_i(\theta^{(0)}) \cdot \log\phi_{\theta_2}(y_i) \Big]\] \[+ \sum_{i=1}^{20} \Big[ \big( 1-\gamma_i(\theta^{(0)}) \big) \cdot \log(1-\pi))+ \gamma_i(\theta^{(0)}) \cdot \log{\pi} \Big]\] \[\text{where } \enspace \gamma_i(\theta^{(0)})=\mathbb{E}_{\Delta \mid \mathbf{Z},\theta=\theta^{(0)}}[\Delta_i]=Pr(\Delta_i=1 \mid \mathbf{Z},\theta=\theta^{(0)})\]

이 때, $\gamma_i(\theta)$를 observation $i$의 모델 $2$ ($Y_2$)에 대한 responsibility라고 부른다. 이는 parameter 값이 $\theta$일 때, $i$번째 observation이 모델 $2$ ($Y_2$)에서 왔을 확률을 의미한다. 실제로 log-likelihood 속 $\Delta_i$는 관측이 되지 않을 뿐이지, $0$ 또는 $1$의 값을 가져 $Y_1$에서 온 것인지 $Y_2$에서 온 것인지를 나타내 줄 것이다. 이와 같이 어떤 관측치가 어느 class/model에 속하는 것인지에 대해, “속한다 / 속하지 않는다”와 같이 확실하게 여부를 알려주는 것을 hard assignment라고 한다. 이와 반대로, 어느 class/model에 속하는 것인지에 대해, 각 class/model에 속할 확률을 알려주는 것을 soft assignment라고 한다. 따라서 Gaussian Mixture Model에 적용된 EM 알고리즘의 E-step은, hard assignment의 기능을 하는 $\Delta_i$에 기대값을 취함으로써, 각 observation이 어떤 mixture component에 속하는 지에 대한 soft assignment를 수행하는 것이 된다.

$i$번째 observation의 responsibility, $\gamma_i(\theta)$는 $i$번째 observation이 모델 $2$ ($Y_2$)에서 왔을 확률을 의미하기 때문에, 이를 다음과 같이 soft assignment를 수행한다.

\[\hat \gamma_i(\theta^{(0)})=\frac{\pi^{(0)} \cdot \phi_{\theta_2^{(0)}}(y_i)}{(1-\pi^{(0)})\cdot \phi_{\theta_1^{(0)}}(y_i)+\pi^{(0)} \cdot \phi_{\theta_2^{(0)}}(y_i)}\]

Maximization Step

Maximization step에서는 latent variable로 조건부 기대값을 취한 log-likelihood, $Q(\theta \mid \theta^{(0)})$를 최대화하는 $\theta=(\pi,\mu_1, \sigma_1^2,\mu_2, \sigma_2^2)$의 값을 찾고, 이를 새 $\theta$ 값, $\theta^{(1)}$로 업데이트한다. $Q(\theta \mid \theta^{(0)})$를 $\pi,\mu_1, \sigma_1^2,\mu_2, \sigma_2^2$로 각각 미분하여 일계미분조건을 만족시키는 값을 찾으면 된다.

\[\theta^{(1)}=(\pi^{(1)},\mu_1^{(1)}, {\sigma_1^2}^{(1)},\mu_2^{(1)}, {\sigma_2^2}^{(1)})= \text{arg} \max_\theta Q(\theta \mid \theta^{(0)})\]

그 식을 도출한 결과는 아래와 같다.

\[\mu_1^{(1)}=\frac{\sum_{i=1}^{20} (1-\hat \gamma_i)y_i}{\sum_{i=1}^{20} (1-\hat \gamma_i)} \enspace , \enspace \enspace \mu_2^{(1)}=\frac{\sum_{i=1}^{20} \hat \gamma_iy_i}{\sum_{i=1}^{20} \hat \gamma_i}\] \[{\sigma_1^2}^{(1)}=\frac{\sum_{i=1}^{20} (1-\hat \gamma_i)(y_i-\mu_1^{(1)})^2}{\sum_{i=1}^{20} (1-\hat \gamma_i)} \enspace , \enspace \enspace {\sigma_2^2}^{(1)}=\frac{\sum_{i=1}^{20} \hat \gamma_i(y_i-\mu_2^{(1)})^2}{\sum_{i=1}^{20} \hat \gamma_i}\] \[\pi^{(1)} = \frac{\sum_{i=1}^{20} \hat \gamma_i}{20}\]

Result

위 예시에 EM 알고리즘을 수행한 결과는 아래와 같다.

\[\hat \mu_1=4.62\text{ , }\hat \mu_2=1.06\text{ , }\hat \sigma_1^2=0.87\text{ , }\hat \sigma_2^2=0.77\text{ , }\hat \pi=0.546\]

추정된 모수 값을 이용하여 $Y$의 density를 그린 결과는 아래와 같다. 초록색 곡선으로 나타난 것은 각 observation들의 responsibility, 즉 observation이 model 2에 들어갈 확률을 그린 것이다. 즉, responsibility는 20개의 observation에 대하여 각 mixture component(여기서는 model 1,2)에 들어갈 확률을 구한 것인데, 이는 clustering, 그 중에서도 soft clustering을 수행한 것으로 볼 수 있다. Gaussian Mixture Model을 clustering 방법으로 보는 것은, mixture component의 모수에 대한 추정을 하는 것이 목표가 아니라, 각 observation들의 responsibility의 값을 추정하는 것을 목표로 삼고 Gaussian Mixture Model을 수행한 것이다.

다음 그림은 EM 알고리즘의 각 Iteration에 대해 log-likelihood 값을 그래프로 그린 것이다. Iteration을 10회 시행한 것으로도 log-likelihood가 잘 수렴하는 것을 볼 수 있다.

Gaussian Mixture Model: General Case

위의 예시에서는 1변수 정규분포 두 개를 mixture component로 갖는 가장 간단한 경우의 Gaussian Mixture Model에 대해 살펴보았다. 이는 $p$차원의 다변수 정규분포 $K$개를 mixture component로 갖는 Gaussian Mixture Model로 쉽게 확장이 가능하다. 그 경우 모형의 기본 setting은 아래와 같다.

$\mathbf{Z}$는 $p$개의 변수에 대한 observation $N$개로 이루어진 data
$\Delta$ : $K$-ary variable, Mixture component $1,2,\cdots,K$ 중 어디에 속하는지 나타내주는 Indicator 역할을 한다.

\[\Delta_i=\left[ {\begin{array}{c} \Delta_{i,1} \\ \Delta_{i,2} \\ \vdots \\ \Delta_{i,K} \\ \end{array} } \right] \in \Bigg\{ \left[ {\begin{array}{c} 1 \\ 0 \\ \vdots \\ 0 \\ \end{array} } \right] , \left[ {\begin{array}{c} 0 \\ 1 \\ \vdots \\ 0 \\ \end{array} } \right] , \cdots , \left[ {\begin{array}{c} 0 \\ 0 \\ \vdots \\ 1 \\ \end{array} } \right] \Bigg\}\] \[\Delta_{i,j}=1 \iff i \text{ th observation is from } j \text{ th mixture component}\]

$\theta$ : 추정해야 할 모수의 집합

\[\theta=\big\{ \pi_1,\cdots,\pi_K,\mu_1,\cdots,\mu_K,\Sigma_1,\cdots,\Sigma_K \big\}\] \[\pi_j = \text{probability of being generated from } j \text{ th mixture component model}\] \[\mu_j = \text{mean vector of } j \text{ th mixture component model } \in \mathbb{R}^p\] \[\Sigma_j = \text{covariance matrix of } j \text{ th mixture component model } \in \mathbb{R}^{p\times p}\]

$\phi_{\theta_j}(y_i)$ : $\theta_j=(\mu_j,\Sigma_j)$를 모수로 갖는 $p$차원 다변수 정규분포의 density

위의 setting 하에서, Likelihood와 log-likelihood는 다음과 같다.

\[L(\theta;\mathbf{Z},\Delta)=P(\mathbf{Z},\Delta ; \theta)=\prod_{i=1}^{N} \prod_{j=1}^{K} \pi_j^{\Delta_{ij}} \phi_{\theta_j} (\mathbf{y}_i)^{\Delta_{ij}}\] \[l(\theta;\mathbf{Z})=\log L(\theta;\mathbf{Z},\Delta)=\sum_{i=1}^{N} \sum_{j=1}^{K} {\Delta_{ij}} \log [\pi_j \cdot \phi_{\theta_j}(\mathbf{y}_i)]\]

일반적인 case의 Gaussian Mixture Model의 경우도, EM 알고리즘을 이용하여 likelihood를 최대화하는 모수 $\theta$의 값을 간접적으로 찾는다. E-step에서 soft assignment의 역할을 하는 responsibility는 아래와 같다. 이 때 responsibility는 $i$ 번째 observation이 $j$ 번째 mixture component에서 생성되었을 확률을 의미한다.

\[\gamma_{ij}(\theta)=\mathbb{E}_{\Delta \mid \mathbf{Z},\theta=\theta^{(0)}}[\Delta_{ij}]=Pr(\Delta_{ij} =1 \mid \mathbf{Z},\theta=\theta^{(0)}) = \frac{\pi_j \cdot \phi_{\theta_j}(\mathbf{y}_i)}{\sum_{k=1}^{K} \pi_k \cdot \phi_{\theta_k}(\mathbf{y}_i)}\]

Reference

Hastie, T., Tibshirani, R.,, Friedman, J. (2001). The Elements of Statistical Learning. New York, NY, USA: Springer New York Inc..

 Heuristic Derivation of Smoothing Spline ESL: Ch 9. Additive Models, Trees, and Related Methods 