1. Mathematics Background:

Scatter Matrix: In data science tasks, correlation is commonly used to understand the relationship between two variables. A scatter matrix serves as an estimate of the covariance matrix when the covariance cannot be calculated or is costly to calculate.

$$S = \sum_{k=1}^{n}(x_k - m)(x_k - m)^T$$ where $m = \frac{1}{n}\sum_{k=1}^{n}x_k$ is the mean vector of the $n$ samples.
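The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `scatter_matrix` and the toy data are made up for the example, and samples are assumed to be stored as rows of `X`.

```python
import numpy as np

def scatter_matrix(X):
    """Scatter matrix S = sum_k (x_k - m)(x_k - m)^T for the rows x_k of X."""
    m = X.mean(axis=0)   # mean vector m
    D = X - m            # centered samples, shape (n, d)
    return D.T @ D       # equals the sum of the n outer products

# Toy data (hypothetical): n = 4 samples, d = 2 features
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 6.0]])
S = scatter_matrix(X)
print(S)
```

Dividing `S` by `n - 1` gives the usual unbiased sample covariance, which is what `np.cov(X, rowvar=False)` computes.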

The class-specific mean vector: $$\mu_i =\frac{1}{n_i} \sum_{x\in C_{i}}x$$

The class-specific covariance matrix: $$S_i = \frac{1}{n_i}\sum_{x\in C_i}(x-\mu_i)(x - \mu_i)^T$$

The total mean vector (sample): $\mu = \frac{1}{N}\sum_{x} x$, where the sum runs over all $N$ samples

The within-class scatter: $S_w = \sum_{i=1}^{C}\frac{n_i}{N}S_i = \sum_{i=1}^{C}P_iS_i$

The between-class scatter: $S_B = \sum_{i=1}^{C}P_i(\mu_i-\mu)(\mu_i-\mu)^T$

Total covariance: $S_T = S_w + S_B$
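As a sanity check on the definitions above, the following sketch computes $S_w$, $S_B$ and the total covariance on synthetic data and confirms that they add up. The function name `class_scatter` and the random data are assumptions made for illustration; the total covariance is computed with the same $\frac{1}{N}$ normalization as the class covariances.

```python
import numpy as np

def class_scatter(X, y):
    """Return (S_w, S_B, S_T) using the prior-weighted definitions."""
    N, d = X.shape
    mu = X.mean(axis=0)                     # total mean vector
    S_w = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        P = len(Xc) / N                     # P_i = n_i / N
        mu_c = Xc.mean(axis=0)              # class-specific mean mu_i
        D = Xc - mu_c
        S_c = D.T @ D / len(Xc)             # class-specific covariance S_i
        S_w += P * S_c
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += P * (diff @ diff.T)
    D_all = X - mu
    S_T = D_all.T @ D_all / N               # total covariance (1/N normalization)
    return S_w, S_B, S_T

# Synthetic data (hypothetical): 3 classes, 10 samples each, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = np.repeat([0, 1, 2], 10)
X[y == 1] += 2.0
X[y == 2] -= 1.5

S_w, S_B, S_T = class_scatter(X, y)
print(np.allclose(S_T, S_w + S_B))
```

The identity holds exactly (up to floating-point error) because each sample's deviation from $\mu$ splits into its deviation from its class mean plus the class mean's deviation from $\mu$, and the cross terms sum to zero within each class.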

2. LDA:

LDA is a supervised feature-extraction technique that finds a linear combination of the available features to separate the classes. It reduces the dimensionality of the input feature vector while taking the separation between classes into account.

The goal of Linear Discriminant Analysis is to find the most discriminative projection by maximizing the between-class distance (the distance between the centroids of the different classes) and minimizing the within-class distance (the accumulated distance of each instance to the centroid of its class).

Take the two-class case as an example: we look for a projection where samples of the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible. LDA is defined as a linear function $y = w^T x$, and the task is to find the optimal $w^*$. In a multi-class problem, we seek $(C-1)$ projections $[y_1, y_2, \dots, y_{C-1}]$, with $y_i = w_i^T x$.
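For the two-class case, one standard closed-form choice (not derived in these notes, so treat it as an assumption here) is the Fisher direction $w^* \propto S_w^{-1}(\mu_1 - \mu_0)$. The sketch below computes it on synthetic data and compares the ratio of between-class to within-class spread along $w^*$ with the same ratio along the coordinate axes. The names `fisher_direction` and `fisher_ratio` and the data are hypothetical.

```python
import numpy as np

def fisher_direction(X, y):
    """Two-class Fisher direction, proportional to S_w^{-1} (mu_1 - mu_0)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # S_w = sum_i P_i S_i reduces to the pooled scatter divided by N
    S_w = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / len(X)
    w = np.linalg.solve(S_w, mu1 - mu0)
    return w / np.linalg.norm(w)

def fisher_ratio(X, y, w):
    """Squared gap between projected means over the projected within-class scatter."""
    z = X @ w                                # 1-D projection y = w^T x
    z0, z1 = z[y == 0], z[y == 1]
    within = z0.var() * len(z0) + z1.var() * len(z1)
    return (z1.mean() - z0.mean()) ** 2 / within

# Synthetic data (hypothetical): two Gaussian classes with a shared covariance
rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.0], [2.0, 2.0]])
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], cov, size=50),
    rng.multivariate_normal([2.0, 1.0], cov, size=50),
])
y = np.repeat([0, 1], 50)

w = fisher_direction(X, y)
print("ratio along w*:    ", fisher_ratio(X, y, w))
print("ratio along x-axis:", fisher_ratio(X, y, np.array([1.0, 0.0])))
print("ratio along y-axis:", fisher_ratio(X, y, np.array([0.0, 1.0])))
```

The ratio along $w^*$ should be at least as large as along any other direction, which is the sense in which the projection keeps each class tight while pushing the projected means apart.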

Instead of estimating $P(Y|X)$ directly, we estimate $\hat P(X|Y)$ (given the response, the distribution of the inputs) and $\hat P(Y)$ (how likely each category is), then use Bayes' rule to obtain the estimate:

$$\hat P(Y=k | X= x) = \frac{\hat P(X= x|Y=k)\hat P(Y=k)}{\hat P(X=x)}$$

Here $\hat P(X= x|Y=k) = \hat f_k(x)$ is modeled as a multivariate normal distribution, which means the examples within one class are assumed to be normally distributed, and $\hat P(Y=k) = \hat \pi_k$ is the prior probability of the $k$-th category.
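The Bayes' rule estimate can be sketched directly from these pieces. The example below is illustrative only: the function names, the class means and the priors are made up, and it assumes all classes share one covariance matrix, which is the usual LDA setting but is not stated explicitly above.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density f_k(x) with the given mean and covariance."""
    d = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)          # (x - mean)^T cov^{-1} (x - mean)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def posterior(x, means, cov, priors):
    """P(Y = k | X = x) for every class k via Bayes' rule."""
    joint = np.array([gaussian_pdf(x, m, cov) * p for m, p in zip(means, priors)])
    return joint / joint.sum()                        # the sum is P(X = x)

# Hypothetical parameters: two classes, shared identity covariance, equal priors
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
cov = np.eye(2)
priors = [0.5, 0.5]

print(posterior(np.array([1.0, 1.0]), means, cov, priors))  # midpoint between the means
print(posterior(np.array([0.0, 0.0]), means, cov, priors))  # at the first class mean
```

At the midpoint between the two means the posterior is split evenly, and it shifts toward the first class as `x` moves toward that class's mean.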

What is the difference between Naive Bayes and LDA?

LDA operates on continuous-valued features and assumes that the examples within one class are normally distributed. Naive Bayes explicitly models the probability of a class using Bayes' rule and instead assumes that the features are conditionally independent given the class.
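One way to see the difference is to compare the class-conditional densities the two models would use. The sketch below assumes a Gaussian variant of Naive Bayes (an assumption for illustration; Naive Bayes is not restricted to Gaussian features), where the independence assumption turns the density into a product of per-feature normals, i.e. a multivariate normal with a diagonal covariance. All names and numbers are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Univariate normal density, applied element-wise."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def naive_bayes_density(x, mean, var):
    """Product of independent per-feature normals (independence assumption)."""
    return np.prod(normal_pdf(x, mean, var))

def full_gaussian_density(x, mean, cov):
    """Multivariate normal density with a full covariance matrix."""
    d = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

x = np.array([1.0, 1.0])
mean = np.array([0.0, 0.0])
var = np.array([1.0, 2.0])                      # per-feature variances
cov_corr = np.array([[1.0, 0.8], [0.8, 2.0]])   # same variances, correlated features

nb = naive_bayes_density(x, mean, var)
diag = full_gaussian_density(x, mean, np.diag(var))
corr = full_gaussian_density(x, mean, cov_corr)
print(np.isclose(nb, diag), np.isclose(nb, corr))
```

The two densities agree when the covariance is diagonal and diverge once the features are correlated, which is exactly the information the independence assumption discards.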
