Introduction
Today, I will write about the theory behind kernel K-means. Kernel K-means covers a weak point of K-means; I will explain that weak point and the corresponding strength of kernel K-means. If you have not read it yet, please see the post Theorem of K-means.
I have also implemented kernel K-means; that post is Implement kernel K-means.
Overview
- A weak point of K-means
- Kernel trick
- Kernel K-means
- Algorithm
A weak point of K-means
For example, I prepare the following dataset: points distributed in two concentric circles.
It is impossible for K-means to cluster this dataset correctly, because the data is distributed in the shape of circles. K-means classifies each point according to the Euclidean distance between the point and a prototype, where a prototype is the representative of its class; in K-means, the prototype is the mean vector of the class. Thus, K-means ends up splitting this dataset along a straight boundary instead of separating the two rings.
K-means does not work on a dataset like this.
A dataset that K-means can classify consists of a compact, well-separated mass of points for each class.
Kernel K-means covers this weak point of K-means.
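Since the figures from the original post are images, here is a minimal sketch (my own, assuming numpy; the shapes and noise levels are illustrative) that generates the two kinds of datasets described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_circles(n=200):
    """Two concentric rings: a dataset plain K-means cannot separate."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    radii = np.where(np.arange(n) < n // 2, 1.0, 3.0)  # inner / outer ring
    X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
    return X + rng.normal(scale=0.1, size=X.shape)

def make_blobs(n=200):
    """Two compact masses: a dataset plain K-means handles well."""
    centers = np.array([[0.0, 0.0], [5.0, 5.0]])
    return centers[np.arange(n) % 2] + rng.normal(scale=0.5, size=(n, 2))
```

Plotting the output of each function should reproduce the contrast described above: the blobs cluster correctly under K-means, while the rings do not.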
Kernel Trick
First, I explain the kernel trick.
Suppose the dataset X cannot be separated by a linear hyperplane. Then a map \phi sends the data to a feature space in which it can be separated by a linear hyperplane.
The kernel is defined as follows.
K(x,y) = \phi(x)^T \phi(y)
It is often difficult to compute \phi explicitly, but it is easy to compute K(x,y) directly.
This method is called the kernel trick.
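As a small numerical check of this definition (my own example, not from the post): for the degree-2 polynomial kernel on R^2, the feature map \phi is small enough to write out explicitly, and both sides of the definition agree.

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, y) = (x^T y)^2 on R^2.
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def K(x, y):
    # The kernel is computed directly, without ever forming phi.
    return float(x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
assert np.isclose(K(x, y), phi(x) @ phi(y))  # both equal 121.0
```

For kernels such as the RBF kernel, \phi is infinite-dimensional, so computing K(x,y) directly is the only practical option.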
Kernel K-means
I review the objective function of K-means.
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} ||x_n-\mu_k||^2
Here, \mu_k is the prototype of class k, for k = 1, \dots, K.
r_n follows the 1-of-K coding scheme, and r_{nk} is the k-th element of r_n.
then
\mu_k = \frac{\sum_{n} r_{nk} x_n}{\sum_n r_{nk}}
and
k = \arg \min_{j} || x_n - \mu_{j} || \implies r_{nk} = 1
\text{otherwise} \implies r_{nk} = 0
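These two updates alternate until the assignments stop changing. A minimal sketch of one such step (my own illustration, assuming numpy arrays X of shape (N, d) and prototypes mu of shape (K, d)):

```python
import numpy as np

def kmeans_step(X, mu):
    # Assignment step: r_{nk} = 1 for the nearest prototype (1-of-K coding).
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    R = np.eye(mu.shape[0])[d2.argmin(axis=1)]                # one-hot matrix (N, K)
    # Update step: mu_k = sum_n r_{nk} x_n / sum_n r_{nk}.
    counts = np.maximum(R.sum(axis=0), 1)                     # guard against empty clusters
    return (R.T @ X) / counts[:, None], R
```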
To obtain kernel K-means, I rewrite this objective function in terms of the feature map \phi as follows.
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} ||\phi(x_n)-\mu_k||^2
then
\mu_k = \frac{\sum_{n} r_{nk} \phi(x_n)}{\sum_n r_{nk}}
Thus, the squared distance between \phi(x_n) and the prototype \mu_k is
||\phi(x_n) - \frac{\sum_{m=1}^{N} r_{mk} \phi(x_m)}{\sum_{m=1}^{N} r_{mk}}||^2
= \phi(x_n)^T \phi(x_n) - \frac{2 \sum_{m=1}^{N} r_{mk} \phi(x_n)^T \phi(x_m)}{\sum_{m=1}^{N} r_{mk}} + \frac{\sum_{m=1}^{N} \sum_{l=1}^{N} r_{mk} r_{lk} \phi(x_m)^T \phi(x_l)}{\left( \sum_{m=1}^{N} r_{mk} \right)^2}
Kernel K-means computes every inner product \phi(x_n)^T \phi(x_m) as K(x_n, x_m), so this distance can be evaluated without ever computing \phi itself.
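In code, the whole expansion reduces to operations on the Gram matrix G, with G[n, m] = K(x_n, x_m). A sketch of the distance computation (my own, assuming a precomputed G and a one-hot assignment matrix R):

```python
import numpy as np

def kernel_distances(G, R):
    """G: (N, N) Gram matrix; R: (N, K) one-hot assignments.
    Returns the (N, K) squared feature-space distances above."""
    counts = np.maximum(R.sum(axis=0), 1)       # sum_m r_{mk}, guarding empty clusters
    first = np.diag(G)[:, None]                 # phi(x_n)^T phi(x_n)
    second = 2.0 * (G @ R) / counts             # cross term of the expansion
    third = np.diag(R.T @ G @ R) / counts ** 2  # within-cluster term
    return first - second + third[None, :]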
Algorithm
- Input K, the number of clusters, and make an initial assignment of each point to one of the K clusters (a full Python sketch follows this list).
- For each iteration, up to a maximum number of iterations:
- For n = 1, \dots, N:
- For k = 1, \dots, K:
- Compute the distance between \phi(x_n) and the prototype of class k with the kernel formula above.
- End for k.
- Pick the class k_n \in \{1, 2, \dots, K\} that minimizes the distance between \phi(x_n) and the prototype of class k.
- Assign x_n to class k_n.
- End for n.
- If no assignment changed, stop iterating.
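Putting the pieces together, a minimal end-to-end sketch of this algorithm (my own implementation under an assumed RBF kernel; it is not the exact code from the linked implementation post):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    # Gram matrix of the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_kmeans(X, K, gamma=1.0, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    G = rbf_gram(X, gamma)
    assign = rng.integers(0, K, size=X.shape[0])    # initial random assignments
    for _ in range(max_iter):
        R = np.eye(K)[assign]                       # one-hot matrix (N, K)
        counts = np.maximum(R.sum(axis=0), 1)
        # Squared feature-space distance to each prototype, via the Gram matrix.
        d2 = (np.diag(G)[:, None]
              - 2.0 * (G @ R) / counts
              + (np.diag(R.T @ G @ R) / counts ** 2)[None, :])
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):      # no change: converged
            break
        assign = new_assign
    return assign
```

On the concentric-circles dataset from the first sketch, kernel_kmeans with a suitable gamma should separate the two rings, which plain K-means cannot. Note that the \phi(x_n)^T \phi(x_n) term does not depend on k, so it could be dropped when only the argmin is needed.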
Reference
Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: Spectral Clustering and Normalized Cuts. KDD 2004. http://www.cs.utexas.edu/users/inderjit/public_papers/kdd_spectral_kernelkmeans.pdf