**Mathematics Background**:

Scatter Matrix: It is common to use correlation to understand the relation between two variables in data science tasks. A scatter matrix is an estimate of the covariance matrix, used when the covariance cannot be calculated or is costly to calculate.

$$ S = \sum_{k=1}^{n}(x_k - m)(x_k - m)^T$$ where $m$ is the mean vector.
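As a quick sanity check of the formula above, a minimal NumPy sketch (with made-up data; the matrix `X` and its values are just an illustration) computes the scatter matrix and confirms it is the unnormalized sample covariance:

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 features (each row is one x_k).
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 1.0],
              [0.5, 3.0, 0.0],
              [1.5, 2.5, 0.8],
              [2.5, 1.0, 1.2]])

m = X.mean(axis=0)   # mean vector m
D = X - m            # center each sample: x_k - m
S = D.T @ D          # sum_k (x_k - m)(x_k - m)^T

# S is (n-1) times the sample covariance (np.cov normalizes by 1/(n-1)).
assert np.allclose(S, np.cov(X, rowvar=False) * (len(X) - 1))
```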

The class-specific mean vector: $$\mu_i =\frac{1}{n_i} \sum_{x\in C_{i}}x$$

The class-specific covariance matrix $S_i = \frac{1}{n_i}\sum_{x\in C_i}(x-\mu_i)(x - \mu_i)^T$

The total mean vector (sample): $\mu = \frac{1}{N}\sum_{x} x$

The within-class scatter: $S_w = \sum_{i=1}^{C}\frac{n_i}{N}S_i = \sum_{i=1}^{C}P_iS_i$

The between-class scatter: $S_B = \sum_{i=1}^{C}P_i(\mu_i-\mu)(\mu_i-\mu)^T$

**Total covariance**: $S_T = S_w + S_B$
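The decomposition $S_T = S_w + S_B$ can be verified numerically. A minimal NumPy sketch, with made-up two-class Gaussian data (the sample sizes and means are arbitrary), computes each scatter matrix exactly as defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-class data in 2-D (means and sizes are arbitrary).
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
X2 = rng.normal([3.0, 3.0], 1.0, size=(70, 2))
X = np.vstack([X1, X2])
N = len(X)
mu = X.mean(axis=0)                 # total mean vector

Sw = np.zeros((2, 2))
Sb = np.zeros((2, 2))
for Xi in (X1, X2):
    ni = len(Xi)
    Pi = ni / N                     # class prior P_i = n_i / N
    mui = Xi.mean(axis=0)           # class-specific mean mu_i
    Di = Xi - mui
    Si = Di.T @ Di / ni             # class-specific covariance S_i
    Sw += Pi * Si                   # within-class scatter S_w
    d = (mui - mu).reshape(-1, 1)
    Sb += Pi * (d @ d.T)            # between-class scatter S_B

St = (X - mu).T @ (X - mu) / N      # total covariance S_T
assert np.allclose(St, Sw + Sb)     # S_T = S_w + S_B
```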

**LDA**:

LDA is a supervised feature-extraction technique used to find a linear combination of the available features that separates the classes. LDA reduces the dimensionality of the input feature vector **while also taking the inter-class separation into account**.


The goal of Linear Discriminant Analysis is to find the most discriminative projection by **maximizing the between-class distance** {the distance between the centroids of the different classes} and **minimizing the within-class distance** {the accumulated distance of each instance to the centroid of its class}.

**Take an example**: In the two-class case, we look for a projection where samples of the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible. LDA is defined as a linear function $y = w^T x$, and the task is to find the optimum $w^*$. In the multi-class problem, we seek $(C-1)$ projections $[y_1, y_2, \ldots, y_{C-1}]$ with $y_i = w_i^T x$.
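For the two-class case, Fisher's criterion has the closed-form optimum $w^* \propto S_w^{-1}(\mu_1 - \mu_2)$. A minimal NumPy sketch (with made-up two-class data) computes this direction and checks that the projected means separate well relative to the projected spread:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical two-class data (means and sizes are arbitrary).
X1 = rng.normal([0.0, 0.0], 1.0, size=(60, 2))
X2 = rng.normal([4.0, 1.0], 1.0, size=(60, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter (unnormalized; scaling does not change the direction).
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# Fisher's closed-form optimum: w* proportional to Sw^{-1}(mu1 - mu2).
w = np.linalg.solve(Sw, mu1 - mu2)
w /= np.linalg.norm(w)

# Project each sample: y = w^T x. The classes should separate along y.
y1, y2 = X1 @ w, X2 @ w
assert (y1.mean() - y2.mean()) ** 2 > y1.var() + y2.var()
```

In the multi-class case one instead takes the top $C-1$ eigenvectors of $S_w^{-1} S_B$ as the columns of the projection matrix.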

The above is a summary of LDA from the pattern recognition course I took last semester. I also read Stanford's LDA slides; they explain the same idea from a different angle, which is quite interesting.

Instead of estimating $P(Y|X)$ directly, we estimate $\hat P(X|Y)$ {given the response, the distribution of the inputs} and $\hat P(Y)$ {how likely each category is}, then use Bayes' rule to obtain the estimate

$$ \hat P(Y=k | X= x) = \frac{\hat P(X= x|Y=k)\hat P(Y=k)}{\hat P(X=x)}$$

$\hat P(X = x|Y = k) = \hat f_k(x)$ is modeled as a multivariate normal distribution {this means **it assumes the examples within each class are normally distributed**}, and $\hat P(Y = k) = \hat \pi_k$ is the prior probability of the $k$-th category.
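The Bayes-rule estimate above can be sketched directly in NumPy. The parameters below (class means, the shared covariance, and the priors) are made-up stand-ins for fitted values; note that LDA shares one covariance matrix across classes:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density f_k(x)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

# Hypothetical fitted parameters for two classes.
mu = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}
Sigma = np.eye(2)          # LDA: one covariance shared by all classes
pi = {0: 0.4, 1: 0.6}      # priors pi_k

x = np.array([2.5, 2.0])
# Bayes' rule: P(Y=k | X=x) is proportional to f_k(x) * pi_k.
unnorm = {k: gaussian_pdf(x, mu[k], Sigma) * pi[k] for k in (0, 1)}
Z = sum(unnorm.values())   # denominator P(X=x)
posterior = {k: v / Z for k, v in unnorm.items()}
```

Since $x$ lies much closer to $\mu_1$, the posterior mass concentrates on class 1.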

*What is the difference between Naive Bayes and LDA?*

**LDA** operates on **continuous-valued features** and assumes **the examples within each class are normally distributed**. **Naive Bayes** also models the probability of a class via Bayes' rule, but instead makes the assumption that **the features are conditionally independent given the class**.