Introduction
I will explain the theory of SVM.
Please also take a look at my implementations of SVM:
Implement linear SVM
Implement kernel SVM
Today, I will explain SVM up to the derivation of its objective function.
Overview
- Generalized linear model
- Explain SVM
- Hard margin
- Soft margin
Generalized linear model
SVM uses a generalized linear model. A generalized linear model is the following function:
$$f(x) = w^T\phi(x) + b$$
$b$ is called the bias.
$$0 = w^T\phi(x) + b$$ is a hyperplane. This hyperplane separates the two classes of $\phi(x)$.
A hyperplane is an n-dimensional plane: if n = 1 the hyperplane is a line, and if n = 2 it is an ordinary plane.
$\phi(x)$ has the effect of converting $x$ into data that can be separated by a line (more generally, a hyperplane).
The image of $\phi(x)$ is shown below.
The left image shows nonlinearly distributed data.
The right image shows linearly separable data.
$\phi(x)$ converts the data in the left image into the data in the right image.
I will treat $w^T \phi(x) + b$ as a line in the feature space.
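As a minimal sketch of this model (the feature map `phi` below is a hypothetical, hand-picked example, not part of the original derivation), the decision function and the sign-based class prediction could look like this in Python:

```python
import numpy as np

def phi(x):
    # Hypothetical feature map: a simple quadratic expansion, purely for illustration.
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

def f(x, w, b):
    # Generalized linear model: f(x) = w^T phi(x) + b
    return w @ phi(x) + b

def predict(x, w, b):
    # The predicted class is read off from the sign of f(x).
    return 1 if f(x, w, b) > 0 else -1
```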
Next, I will explain the objective of SVM.
Explain SVM
We want to build a decision function $f$ such that, for all $x_i \in X$,
$$f(x_i) > 0 \implies y_i = 1 $$
$$f(x_i) < 0 \implies y_i = -1$$
Letting $f(x) = w^T \phi(x) + b$, I will optimize the parameters $w$ and $b$.
However, optimization needs a criterion for what counts as a good boundary. That criterion is the margin. Next, I will explain the hard margin.
Hard margin
What is the margin? I will explain.
Pick the data point that lies nearest to $w^T \phi(x) + b = 0$. The margin is the distance between that point and $w^T \phi(x) + b = 0$.
Look at the following image of the margin in two dimensions.
The length of the green line is the margin. SVM determines $w^T \phi(x) + b = 0$ using only the data points that lie nearest to $w^T \phi(x) + b = 0$. These points are called support vectors.
We choose the parameters $w$ and $b$ by maximizing the margin.
Let the dataset be $X$. For all $x_i \in X$, the distance between $x_i$ and $w^T \phi(x) + b = 0$ is
$$\frac{|w^T \phi(x_i) + b|}{\|w\|}$$
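As a small sketch of this distance formula, assuming the mapped data points $\phi(x_i)$ are stacked row-wise in a matrix `Phi` (a name introduced here only for illustration):

```python
import numpy as np

def distances_to_hyperplane(Phi, w, b):
    # Distance of each phi(x_i) to the hyperplane w^T phi(x) + b = 0:
    # |w^T phi(x_i) + b| / ||w||
    return np.abs(Phi @ w + b) / np.linalg.norm(w)

def margin_and_support_index(Phi, w, b):
    # The margin is the smallest distance; its argmin indexes the nearest point (a support vector).
    d = distances_to_hyperplane(Phi, w, b)
    return d.min(), int(d.argmin())
```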
Now, assume that the data can be completely classified by a linear hyperplane.
The following image shows data that is completely separated by a hyperplane.
The next image shows data that is not.
Thus,
$$f(x_i) > 0 \implies y_i = 1 $$
$$f(x_i) < 0 \implies y_i = -1$$
holds for every data point.
Therefore,
$$\forall i \in N,~~~~y_i(w^T \phi(x_i) + b) > 0$$
and hence
$$\frac{|w^T \phi(x_i) + b|}{\|w\|} = \frac{y_i(w^T \phi(x_i) + b)}{\|w\|}$$
Next, define $i_0$ as follows:
$$i_0 \in \arg\min_{n \in N} \frac{|w^T \phi(x_n) + b|}{\|w\|}$$
Let $M$ be
$$M = y_{i_0}(w^T \phi(x_{i_0}) + b)$$
Because $\forall i \in N,~y_i(w^T \phi(x_i) + b) > 0$, we have $M > 0$.
$\frac{M}{\|w\|}$ is the distance between $w^T \phi(x) + b = 0$ and the data point that lies nearest to it.
The objective function of SVM can then be expressed as follows:
$$\max_{w,b,M} \frac{M}{\|w\|}$$ $$\mathrm{s.t.}~~ \forall i \in N,~ y_i(w^T \phi(x_i) + b) \geq M$$
Here, setting $w^{\star} = \frac{w}{M}$ and $b^{\star} = \frac{b}{M}$, the objective function can be expressed as follows:
$$\max_{w^{\star},b^{\star}} \frac{1}{\|w^{\star}\|}$$
$$\mathrm{s.t.}~~ \forall i \in N,~ y_i ({w^{\star}}^T \phi(x_i) + b^{\star}) \geq 1$$
I now transform this form. Because $\|w^{\star}\| > 0$,
$$\max_{w^{\star},b^{\star}} \frac{1}{\|w^{\star}\|}$$
$$\iff \min_{w^{\star},b^{\star}} \|w^{\star}\|$$
$$\iff \min_{w^{\star},b^{\star}} \|w^{\star}\|^2$$
Therefore, the objective function of (hard margin) SVM is
$$\min_{w,b} \|w\|^2$$
$$\mathrm{s.t.}~~ \forall i \in N,~ y_i (w^T \phi(x_i) + b) \geq 1$$
where I have renamed $w^{\star}$ and $b^{\star}$ back to $w$ and $b$. (A factor of $\frac{1}{2}$ is often attached to $\|w\|^2$ for convenience; it does not change the minimizer.)
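As a rough sketch of how this optimization could be solved numerically (not the method used in this post), here is the hard-margin problem written with the cvxpy modeling library, assuming the mapped data is available as a matrix `Phi` and the labels as a vector `y` in {-1, +1}:

```python
import cvxpy as cp

def hard_margin_svm(Phi, y):
    # Solve: min ||w||^2  s.t.  y_i (w^T phi(x_i) + b) >= 1 for all i.
    # Note: this problem is infeasible if the data is not linearly separable in feature space.
    n, d = Phi.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, Phi @ w + b) >= 1]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value, b.value
```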
We have assumed that the data can be completely separated by a hyperplane. This approach is called the hard margin.
However, this assumption is too strict for real-world data, so the soft margin was invented.
Next, I will explain the soft margin.
Soft margin
I loosen the constraint $\forall i \in N, y_i (w^T \phi(x_i) + b) \geq 1$. The condition is rewritten as follows:
$$\forall i \in N,~ y_i (w^T \phi(x_i) + b) \geq 1 - \epsilon_i$$
If $x_i$ lies beyond its margin boundary (in particular, on the wrong side of $w^T \phi(x) + b = 0$), then $\epsilon_i > 0$.
In the image, $x_5$, $x_8$, and $x_9$ lie beyond $w^T \phi(x) + b = 0$.
The length of the black line is $\epsilon_i$.
I rewrite the objective function:
$$\min_{w,b,\epsilon} \frac{1}{2}\|w\|^2 + C\sum_{i \in N} \epsilon_i$$
$$\mathrm{s.t.}~~ \forall i \in N,~ y_i (w^T \phi(x_i) + b) \geq 1 - \epsilon_i,~~~~ \epsilon_i \geq 0~~ \forall i \in N$$
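A minimal soft-margin counterpart of the sketch above, again using cvxpy and the assumed `Phi` and `y`, with the slack variables $\epsilon_i$ modeled explicitly; the trade-off parameter `C` is explained just below:

```python
import cvxpy as cp

def soft_margin_svm(Phi, y, C=1.0):
    # Solve: min (1/2)||w||^2 + C * sum_i eps_i
    #        s.t. y_i (w^T phi(x_i) + b) >= 1 - eps_i,  eps_i >= 0.
    n, d = Phi.shape
    w = cp.Variable(d)
    b = cp.Variable()
    eps = cp.Variable(n)  # slack variables epsilon_i
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(eps))
    constraints = [cp.multiply(y, Phi @ w + b) >= 1 - eps, eps >= 0]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return w.value, b.value, eps.value
```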
$C$ is called the regularization parameter.
This parameter is a hyperparameter, so we choose it before running the SVM algorithm.
$C$ controls how strongly misclassification is suppressed.
The smaller $C$ is, the smaller the effect of $\sum_{i \in N}\epsilon_i$, so misclassification is more readily tolerated. Conversely, the larger $C$ is, the larger the effect of $\sum_{i \in N}\epsilon_i$, so misclassification is penalized more heavily.
As $C \to \infty$, the soft margin reduces to the hard margin.
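For example, in scikit-learn the same role is played by the `C` argument of `sklearn.svm.SVC`; the tiny dataset below is made up here just to show the call:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up dataset with labels in {-1, +1}.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])

# A small C readily tolerates margin violations; a very large C approaches the hard margin.
soft_clf = SVC(kernel="linear", C=0.1).fit(X, y)
hard_like_clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(soft_clf.predict([[0.1, 0.0]]), hard_like_clf.predict([[0.1, 0.0]]))
```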