Take a two-class dataset as an example: two classes of data are linearly separable if and only if there exists a hyperplane $w^Tx+b=0$ that separates the two classes. The SVM provides a systematic method for separating the data "optimally": the optimal hyperplane for a given training set is the one that gives the maximum geometric margin $\gamma^g$ over all possible hyperplanes. The simplest SVM (the linear SVM) is therefore the linear classifier with the maximum margin.

Given an example $(x_i,d_i)$ with label $d_i \in \{-1,+1\}$, its geometric margin with respect to a hyperplane $(w,b)$ is

$$\gamma_i^g = \frac{d_i(w^Tx_i+b)}{\|w\|}$$

The margin of the whole training set with respect to $(w,b)$ is the smallest per-example margin, $\gamma^g_{w,b} = \min_i \gamma_i^g$, and the optimal hyperplane achieves the maximum of this quantity over all candidate hyperplanes:

$$\gamma = \max\left[\gamma^g_{w_1,b_1},\ \gamma^g_{w_2,b_2},\ \dots\right]$$
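As a small numeric sketch of the definitions above (the data points and the candidate hyperplane are illustrative assumptions, not from the text):

```python
import numpy as np

# Hypothetical 2-D training set with labels d_i in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
d = np.array([1, 1, -1, -1])

# One candidate hyperplane (w, b); any separating hyperplane works here.
w = np.array([1.0, 1.0])
b = 0.0

# Per-example geometric margin: gamma_i = d_i * (w^T x_i + b) / ||w||.
gamma_i = d * (X @ w + b) / np.linalg.norm(w)

# The margin of the training set w.r.t. (w, b) is the smallest per-example margin.
gamma = gamma_i.min()
print(gamma_i)  # all positive => (w, b) separates the data
print(gamma)
```

Repeating this for every candidate $(w,b)$ and keeping the hyperplane with the largest training-set margin is exactly the maximization the equation above describes.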

But why is the margin of the training set important? A larger margin leads to a lower probability of misclassification. That is, we want a hyperplane that not only divides the data correctly, but is also as far as possible from the data points. Then when a new point comes in, even if it lies a little closer to the wrong class than the training points did, it will still fall on the right side of the hyperplane.

How do we find the optimal hyperplane? We can formulate this as a constrained optimization problem: maximizing the margin is equivalent to minimizing $\|w\|$ subject to $d_i(w^Tx_i+b) \ge 1$ for all $i$. Using the KKT conditions, the primal problem can be transformed into its dual. For a data point $x_i$ that is not a support vector: $d_i(w_o^Tx_i + b_o)>1$. For a data point $x_i$ that is a support vector: $d_i(w_o^T x_i + b_o) = 1$.
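The two conditions above can be checked numerically. Below is a minimal sketch on a toy dataset where the optimal hard-margin hyperplane $(w_o, b_o)$ is assumed known in closed form (the data and the hyperplane are illustrative, not from the text):

```python
import numpy as np

# Toy dataset; for these points the optimal hyperplane is assumed to be
# w_o = (1, 0), b_o = 0 (the classes differ only in the first coordinate).
X = np.array([[1.0, 0.0], [3.0, 0.0], [-1.0, 0.0], [-3.0, 1.0]])
d = np.array([1, 1, -1, -1])
w_o = np.array([1.0, 0.0])
b_o = 0.0

# Functional margins d_i * (w_o^T x_i + b_o) under the optimal hyperplane.
margins = d * (X @ w_o + b_o)

# Support vectors lie exactly on the margin boundary: d_i (w_o^T x_i + b_o) = 1;
# all other points satisfy the strict inequality > 1.
is_sv = np.isclose(margins, 1.0)
print(margins)  # -> [1. 3. 1. 3.]
print(is_sv)    # -> [ True False  True False]
```

Only the two points on the boundary are support vectors; the solution $w_o$ is determined by them alone, which is why the interior points could be removed without changing the hyperplane.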

How should we understand the kernel of an SVM? In machine learning, "kernel" usually refers to the kernel trick, a method of using a linear classifier to solve a non-linear problem. In an SVM, if the data is not linearly separable in the input space, we can use kernel functions (linear, polynomial, RBF, sigmoid, …) to implicitly map the data to a high-dimensional feature space. The data then becomes linearly separable in the new feature space.

---------------------- End of article ----------------------