Introduction
I will explain the theory of SVM.
Please also take a look at my implementations of SVM:
Implement linear SVM
Implement kernel SVM
Today, I will explain SVM up to the derivation of its objective function.
Overview
- Generalized linear model
- Explain SVM
- Hard margin
- Soft margin
Generalized linear model
SVM uses a generalized linear model. A generalized linear model is the following function:
f(x) = w^T\phi(x) + b
b is called the bias.
0 = w^T\phi(x) + b is a hyperplane. This hyperplane separates the two classes of \phi(x).
A hyperplane is an n-dimensional plane: if n = 1, the hyperplane is a line, and if n = 2, it is an ordinary plane.
\phi(x) has the effect of converting x into data that can be separated by a line.
An image illustrating \phi(x):
The left image shows data that is not linearly separable.
The right image shows linearly separable data.
\phi(x) maps the data in the left image to the data in the right image.
From here on, I will treat w^T \phi(x) + b as a line (a hyperplane) in the feature space.
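As a concrete illustration, here is a minimal NumPy sketch with a hypothetical feature map \phi(x) = (x, x^2): 1-dimensional data that cannot be split by a single threshold becomes linearly separable in the feature space. The data, w, and b below are made up for illustration only.

```python
import numpy as np

# Hypothetical 1-D data: class -1 near the origin, class +1 far from it.
# No single threshold on x separates them, but they become linearly
# separable after the feature map phi(x) = (x, x^2).
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

def phi(x):
    """Example feature map: lift 1-D input x to the 2-D point (x, x^2)."""
    return np.stack([x, x ** 2], axis=-1)

# In the feature space, the line x2 = 2, i.e. w = (0, 1), b = -2,
# separates the two classes.
w, b = np.array([0.0, 1.0]), -2.0
print(np.sign(phi(x) @ w + b))  # [ 1.  1. -1. -1. -1.  1.  1.] -- matches y
```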
Next, I will explain the objective of SVM.
Explain SVM
We want to construct a decision function such that, for all x_i \in X,
f(x_i) > 0 \implies y_i = 1
f(x_i) < 0 \implies y_i = -1
Let f(x) = w^T \phi(x) + b; I will optimize the parameters w and b.
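A minimal sketch of this decision rule (assuming the features \phi(x_i) are already computed and stacked as the rows of a matrix Phi, and that w and b are given):

```python
import numpy as np

def decision_function(Phi, w, b):
    """f(x_i) = w^T phi(x_i) + b for each row phi(x_i) of Phi."""
    return Phi @ w + b

def predict(Phi, w, b):
    """Label rule: y_i = +1 if f(x_i) > 0, y_i = -1 if f(x_i) < 0."""
    return np.where(decision_function(Phi, w, b) > 0, 1, -1)
```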
However, optimization needs a criterion for what a good boundary is. That criterion is the margin. Next, I will explain the hard margin.
Hard margin
What is the margin? I will explain.
Pick the data point nearest to w^T \phi(x) + b = 0. The margin is the distance between that data point and w^T \phi(x) + b = 0.
Look at the following image of the margin in two dimensions.
The length of the green line is the margin. SVM determines w^T \phi(x) + b = 0 based only on the data points nearest to w^T \phi(x) + b = 0. These data points are called support vectors.
We choose the parameters w and b by maximizing the margin.
Let the dataset be X. For each x_i \in X, the distance between x_i and w^T \phi(x) + b = 0 is
\frac{|w^T \phi(x_i) + b|}{||w||}
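A minimal NumPy sketch of this distance, assuming Phi is a matrix whose rows are \phi(x_i):

```python
import numpy as np

def distances_to_hyperplane(Phi, w, b):
    """|w^T phi(x_i) + b| / ||w|| for each row phi(x_i) of Phi."""
    return np.abs(Phi @ w + b) / np.linalg.norm(w)
```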
Now, assume that the data can be completely classified by a linear hyperplane.
This image shows data that is completely separated by a hyperplane.
This image shows data that is not.
Thus,
f(x_i) > 0 \implies y_i = 1
f(x_i) < 0 \implies y_i = -1
holds for every data point.
Therefore,
\forall i \in N,~~~~~~~y_i(w^T \phi(x_i) + b) > 0
It follows that
\frac{|w^T \phi(x_i) + b|}{||w||} = \frac{y_i(w^T \phi(x_i) + b)}{||w||}
Next, let i_0 be defined as follows:
i_0 \in \arg\min_{n \in N} \frac{|w^T \phi(x_n) + b|}{||w||}
Let M be
M = y_{i_0}(w^T \phi(x_{i_0}) + b)
Because \forall i \in N,~y_i(w^T \phi(x_i) + b) > 0, we have M > 0.
M / ||w|| is the distance between w^T \phi(x) + b = 0 and the data point nearest to it.
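A small sketch of computing M and the index i_0 of the nearest point (the support vector), again assuming Phi holds the rows \phi(x_i):

```python
import numpy as np

def functional_margin(Phi, y, w, b):
    """Return M = min_i y_i (w^T phi(x_i) + b) and the index i_0 attaining it."""
    margins = y * (Phi @ w + b)   # y_i (w^T phi(x_i) + b) for every i
    i0 = int(np.argmin(margins))  # index of the nearest point (support vector)
    return margins[i0], i0

# The geometric margin (the actual distance) is then M / np.linalg.norm(w).
```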
The objective function of SVM is expressed as follows:
\max_{w,b,M} \frac{M}{||w||} ~~\mathrm{s.t.}~~ \forall i \in N,~ y_i(w^T \phi(x_i) + b) \geq M
Here, setting w^{\star} = \frac{w}{M} and b^{\star} = \frac{b}{M} (dividing the constraint by M > 0, and noting that \frac{M}{||w||} = \frac{1}{||w / M||}), the objective function is expressed as follows:
\max_{w^{\star},b^{\star}} \frac{1}{||w^{\star}||}
~~\mathrm{s.t.}~~ \forall i \in N,~ y_i ((w^{\star})^T \phi(x_i) + b^{\star}) \geq 1
I convert this form. Because ||w^{\star}|| > 0,
\max_{w^{\star},b^{\star}} \frac{1}{||w^{\star}||}
\iff \min_{w^{\star},b^{\star}} ||w^{\star}||
\iff \min_{w^{\star},b^{\star}} ||w^{\star}||^2
Therefore, the objective function of SVM is
\min_{w,b} ||w||^2
~~\mathrm{s.t.}~~ \forall i \in N,~ y_i (w^T \phi(x_i) + b) \geq 1
where I have renamed w^{\star} and b^{\star} back to w and b.
Here we assume that the data is completely separated by a hyperplane. This method is called the hard margin.
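As an illustration, here is a minimal sketch of solving this hard-margin problem directly with cvxpy on made-up, linearly separable 2-D data (taking \phi as the identity). This is only a sketch of the optimization problem, not the implementation from the linked posts.

```python
import numpy as np
import cvxpy as cp

# Made-up, linearly separable 2-D data (phi is the identity here).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin primal: min ||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i.
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(w)),
    [cp.multiply(y, X @ w + b) >= 1],
)
problem.solve()
print("w =", w.value, "b =", b.value)
```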
However, this assumption is too strict for real-world data, so the soft margin was invented.
Next, I will explain the soft margin.
Soft margin
I loosen the condition \forall i \in N, y_i (w^T \phi(x_i) + b) \geq 1. It is rewritten as follows:
\forall i \in N, y_i (w^T \phi(x_i) + b) \geq 1 - \epsilon_i
If x_i lies beyond w^T \phi(x) + b = 0, then \epsilon_i > 0 holds.
In the image, x_5, x_8, and x_9 lie beyond w^T \phi(x) + b = 0.
The length of the black line corresponds to \epsilon_i.
I rewrite the objective function.
\min_{w,b,\epsilon} \frac{1}{2}||w||^2 + C\sum_{i \in N} \epsilon_i
~~\mathrm{s.t.}~~ \forall i \in N,~ y_i (w^T \phi(x_i) + b) \geq 1 - \epsilon_i,~~ \epsilon_i \geq 0
C is called the regularization parameter.
This parameter is a hyperparameter, so we choose it before running the SVM algorithm.
C adjusts how strongly misclassification is suppressed.
The smaller C is, the smaller the effect of \sum_{i \in N}\epsilon_i, so misclassification is more readily tolerated. On the other hand, the larger C is, the larger the effect of \sum_{i \in N}\epsilon_i.
When C = \infty, it becomes the hard margin.
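For illustration, here is a minimal cvxpy sketch of this soft-margin problem on made-up, overlapping 2-D data (again taking \phi as the identity); the slack variables \epsilon_i and the parameter C appear exactly as in the objective above.

```python
import numpy as np
import cvxpy as cp

# Made-up overlapping 2-D data, so a hard margin would be infeasible.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (30, 2)), rng.normal(1.0, 1.0, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])

C = 1.0                             # regularization parameter, chosen beforehand
w = cp.Variable(2)
b = cp.Variable()
eps = cp.Variable(60, nonneg=True)  # slack variables epsilon_i >= 0

# Soft-margin primal: min (1/2)||w||^2 + C * sum_i epsilon_i
#   s.t.  y_i (w^T x_i + b) >= 1 - epsilon_i  for all i.
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(eps)),
    [cp.multiply(y, X @ w + b) >= 1 - eps],
)
problem.solve()
print("w =", w.value, "b =", b.value)
```

In practice, scikit-learn's svm.SVC(kernel='linear', C=...) solves an equivalent problem, and taking C very large approaches the hard margin.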