# Linear Discriminant Analysis (LDA)

Unlike logistic regression, which directly models $\Pr (Y=j \mid X=x)$ using the logistic function, linear discriminant analysis models the distribution of the predictors X separately in each of the response classes (i.e. $\Pr (X=x \mid Y=j)$), and then uses Bayes’ theorem to flip these around into estimates for $\Pr (Y=j \mid X=x)$.

The advantages of the linear discriminant model over logistic regression:

1. When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable.
2. When the distribution of the predictor X is approximately normal, the linear discriminant analysis model is very similar in form to logistic regression. However, when n is small, LDA is more stable than logistic regression.
3. LDA is more popular for multiclass classification problems.

Say we have a classification problem with K classes (K ≥ 2). Let $\pi_k$ represent the overall, or prior, probability that a randomly chosen observation comes from the kth class, and let $f_k(X) \equiv \Pr(X=x \mid Y=k)$ denote the density function of X for an observation that comes from the kth class. According to Bayes’ theorem, we have

$\displaystyle \Pr(Y=k \mid X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K}\pi_l f_l(x)}$

Thus LDA provides us a way to compute $\Pr(Y=k \mid X=x)$ indirectly by plugging in estimates of $\pi_k$ and $f_k(X)$. Once we have $\Pr(Y=k \mid X=x)$ (the posterior probability), we can develop a classifier that approximates the Bayes classifier by assigning an observation to the class for which this probability is largest.
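The Bayes flip above can be sketched in a few lines. This is a minimal one-predictor, two-class example with made-up priors and class means (the numbers are illustrative, not from the text):

```python
import math

# Hypothetical parameters: priors pi_k and shared-variance Gaussian densities f_k.
priors = [0.6, 0.4]             # pi_1, pi_2
means, sigma = [1.0, 3.0], 1.0  # class means mu_k, shared standard deviation

def density(x, mu, sigma):
    """Normal density f_k(x) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def posterior(x):
    """Pr(Y=k | X=x) via Bayes' theorem: pi_k f_k(x) / sum_l pi_l f_l(x)."""
    numerators = [p * density(x, m, sigma) for p, m in zip(priors, means)]
    total = sum(numerators)
    return [num / total for num in numerators]

probs = posterior(2.0)  # x halfway between the means: densities cancel,
                        # so the posterior falls back to the priors
```

Note that at a point equidistant from both means, the densities are equal and the posterior reduces to the priors, which is a quick sanity check for the formula.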

Now the question becomes how we estimate $\pi_k$ and $f_k(X)$. In general, estimating $\pi_k$ is easy if we have a random sample of Ys from the population: we simply compute the fraction of the training observations that belong to the kth class. For $f_k(X)$, however, we need to assume some simple form for these densities. Suppose we assume that $f_k(X)$ is Gaussian (normal). In the one-dimensional setting (i.e. we have only one predictor), the normal density function is

$\displaystyle f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k}\exp\Big(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2\Big)$

And let’s further assume there is a shared variance term across all K classes, that is, $\sigma_1^2=\dots=\sigma_K^2=\sigma^2$. Then we plug the normal density function into Bayes’ formula and get

$\displaystyle \Pr(Y=k \mid X=x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{1}{2\sigma^2}(x-\mu_k)^2\Big)}{\sum_{l=1}^K{\pi_l\frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{1}{2\sigma^2}(x-\mu_l)^2\Big)}}$

Taking the log of the equation above and rearranging the terms, it’s easy to find out that this is equivalent to assigning the observation to the class for which

$\displaystyle \delta_k(x) = x\cdot\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$

is largest. Recall that the Bayes classifier assigns an observation according to the largest posterior probability and has the lowest possible error rate. But in a real-life situation, since we don’t know the parameters $\mu_1,\dots,\mu_K$, $\pi_1,\dots,\pi_K$, and $\sigma^2$, we are actually not able to compute the Bayes classifier. So what linear discriminant analysis (LDA) does is approximate the Bayes classifier by estimating these parameters and calculating the corresponding probabilities. In particular, the following estimates are used:

$\displaystyle \hat{\mu}_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i$
$\displaystyle \hat{\sigma}^2= \frac{1}{n - K}\sum_{k=1}^K \sum_{i:y_i=k} (x_i - \hat{\mu}_k)^2$

The estimate for $\mu_k$ is simply the average of all the training observations from the kth class, while $\hat{\sigma}^2$ can be seen as a weighted average of the sample variances of the K classes. As for $\pi_k$, LDA estimates it using the proportion of the training observations that belong to the kth class ($\hat{\pi}_k = n_k/n$, where $n_k$ is the number of observations in class k).
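These plug-in estimates can be computed directly. Here is a minimal sketch on a made-up one-dimensional training set (class labels are 0-based here):

```python
# Hypothetical training data (x_i, y_i) with K = 2 classes.
xs = [0.8, 1.2, 1.0, 2.9, 3.1, 3.0]
ys = [0,   0,   0,   1,   1,   1]
K, n = 2, len(xs)

n_k    = [ys.count(k) for k in range(K)]                                   # class sizes
mu_hat = [sum(x for x, y in zip(xs, ys) if y == k) / n_k[k] for k in range(K)]
pi_hat = [n_k[k] / n for k in range(K)]                                    # pi_hat_k = n_k / n

# Pooled variance: squared deviations from each class mean, divided by n - K.
sigma2_hat = sum((x - mu_hat[y]) ** 2 for x, y in zip(xs, ys)) / (n - K)
```

Dividing by $n - K$ rather than $n$ makes $\hat{\sigma}^2$ unbiased, since one degree of freedom is spent on each of the K class means.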

Finally, LDA assigns an observation X = x to the class for which

$\displaystyle \hat{\delta}_k(x) = x\cdot\frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k)$

is largest. The term linear comes from the fact that the discriminant functions $\hat{\delta}_k(x)$ are linear functions of x.
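Putting the pieces together, a one-dimensional LDA classifier is just an argmax over the discriminant functions. This sketch assumes the estimates $\hat{\mu}_k$, $\hat{\pi}_k$, and $\hat{\sigma}^2$ have already been computed (the values below are hypothetical):

```python
import math

# Hypothetical fitted parameters.
mu_hat     = [1.0, 3.0]
pi_hat     = [0.6, 0.4]
sigma2_hat = 1.0

def delta(x, k):
    """Discriminant: delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2 sigma^2) + log(pi_k)."""
    return (x * mu_hat[k] / sigma2_hat
            - mu_hat[k] ** 2 / (2 * sigma2_hat)
            + math.log(pi_hat[k]))

def classify(x):
    """Assign x to the class whose discriminant is largest."""
    return max(range(len(mu_hat)), key=lambda k: delta(x, k))
```

With these numbers the decision boundary sits a little above the midpoint of the two means, because the larger prior on class 0 shifts the boundary toward class 1.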

When extending to multiple predictors, the situation is very similar. Now we assume that $X = (X_1,X_2,\dots,X_p)$ is drawn from a multivariate Gaussian (multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix.

All in all, the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance $\sigma^2$, and plugging estimates for these parameters into the Bayes classifier.
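In the multivariate case the discriminant takes the form $\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log \pi_k$, with $\Sigma$ the common covariance matrix. Here is a minimal NumPy sketch with made-up parameters for p = 2 predictors:

```python
import numpy as np

# Hypothetical class-specific mean vectors, common covariance, and priors.
mus   = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
pis   = [0.5, 0.5]

Sigma_inv = np.linalg.inv(Sigma)

def delta(x, k):
    """Multivariate discriminant: x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log(pi_k)."""
    return (x @ Sigma_inv @ mus[k]
            - 0.5 * mus[k] @ Sigma_inv @ mus[k]
            + np.log(pis[k]))

def classify(x):
    """Assign x to the class whose discriminant is largest."""
    return max(range(len(mus)), key=lambda k: delta(x, k))
```

With equal priors and symmetric means, the decision boundary is the hyperplane midway between the two mean vectors; the discriminant is still linear in x, which is why the boundary stays flat.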

Reference:

An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani