Take a two-class dataset as an example: two classes of data are linearly separable if and only if there exists a hyperplane $w^Tx+b=0$ that separates the two classes. The SVM provides a systematic method for separating the data "optimally": the optimal hyperplane for a given training set is the one that gives **the maximum geometric margin $\gamma^g$ over all possible hyperplanes**. The simplest SVM {linear SVM} is thus the linear classifier with the maximum margin.

Given an example $(x_i,d_i)$, its geometric margin with respect to a hyperplane $(w,b)$ is $\gamma_i^g =\frac{d_i(w^Tx_i+b)}{||w||}$

The geometric margin of a hyperplane over the whole training set is the smallest margin among all examples, $\gamma^g_{w,b} = \min_i \gamma_i^g$, and the optimal hyperplane is the one that maximizes this quantity:

$$\gamma = \max\left[\gamma^g_{w_1,b_1},\ \gamma^g_{w_2,b_2},\ \ldots\right]$$
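A minimal sketch of these two formulas in NumPy, assuming a toy 2-D dataset and two hand-picked candidate hyperplanes (all values are made up for illustration): the margin of each hyperplane is the minimum per-example margin, and the "optimal" one among the candidates is the one with the largest such margin.

```python
import numpy as np

def geometric_margin(w, b, X, d):
    """Geometric margin of each example (x_i, d_i) w.r.t. the hyperplane (w, b)."""
    return d * (X @ w + b) / np.linalg.norm(w)

# Toy 2-D data with labels d_i in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
d = np.array([1, 1, -1, -1])

# Two hand-picked candidate hyperplanes (w, b)
candidates = [(np.array([1.0, 1.0]), 0.0), (np.array([1.0, 0.0]), 0.5)]

# Margin of each hyperplane = min over the examples; pick the max over hyperplanes
margins = [geometric_margin(w, b, X, d).min() for w, b in candidates]
print(margins, "best candidate:", int(np.argmax(margins)))
```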

But why is it important to find the margin of the training set? **A larger margin leads to a lower probability of misclassification**. That is, we want a hyperplane that not only divides the data properly, but is also as far as possible from the data points. Then when a new data point comes in, __even if it is a little closer to the wrong class than the training points, it will still lie on the right side of the hyperplane.__
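A quick continuation of the toy sketch above (same made-up hyperplane $w=[1,1]$, $b=0$): a new positive point that drifts closer to the negative class than any positive training point did is still classified correctly, because the margin leaves room for it.

```python
import numpy as np

# Max-margin candidate from the sketch above
w, b = np.array([1.0, 1.0]), 0.0

# A new positive point, a little closer to the negative class
# than the closest positive training point [2, 2] was
x_new = np.array([1.0, 1.5])
print(np.sign(x_new @ w + b))  # +1: still on the correct side
```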

How to find the optimal hyperplane? We can regard this as a **constrained optimization problem**: maximizing the margin is equivalent to minimizing $||w||$ subject to $d_i(w^Tx_i+b) \geq 1$ for every training example. Use the KKT conditions to transform the primal problem into the dual problem. For a data point $x_i$ that is not a support vector: $d_i(w_o^Tx_i + b_o)>1$. For a data point $x_i$ that is a support vector: $d_i(w_o^T x_i + b_o) = 1$.
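A sketch of checking these two conditions with scikit-learn, assuming the same toy data as above (a large `C` is used to approximate the hard-margin case):

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
d = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, d)
w_o, b_o = clf.coef_[0], clf.intercept_[0]

# d_i (w_o^T x_i + b_o): == 1 for support vectors, > 1 for the rest
print(d * (X @ w_o + b_o))
print(clf.support_)  # indices of the support vectors
```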

How to understand the kernel of SVM? In ML, "kernel" usually refers to the kernel trick, a method of using a linear classifier to solve a non-linear problem. In SVM, if the data is not linearly separable in the input space, we can use a kernel {linear, poly, RBF, sigmoid ….} to implicitly map the data to a high-dimensional feature space. Then the data is linearly separable in the new feature space.
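A minimal sketch of the kernel trick with scikit-learn, assuming the classic concentric-circles dataset (the parameter values here are arbitrary): a linear kernel cannot separate the two rings, while an RBF kernel can.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the input space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# The linear kernel struggles; the RBF kernel separates the rings
print("linear:", linear.score(X, y))
print("rbf:   ", rbf.score(X, y))
```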

As an aside, many factors come into play when comparing the performance of two models, like the distribution of the data, whether the problem is linear or non-linear, the chosen hyper-parameters, and the number of training examples. More concretely for SVM, we need to consider an appropriate kernel {poly, RBF, linear …}, the kernel parameters such as the poly degree, and the regularization penalty. OK, back to SVM~
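One common way to handle these choices is a cross-validated grid search over the kernel, its parameters, and the penalty; a sketch with scikit-learn, reusing the circles data from above (the grid values are arbitrary):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Search over kernel, kernel parameters (poly degree, RBF gamma),
# and the regularization penalty C
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1, 10], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```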