Machine learning and statistics are closely related subjects. In this post, I try to provide a clear picture of how they connect.
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is the procedure of finding the values of $\Theta$ for a given data set which make the likelihood function a maximum [1]. Here the likelihood function is simply the probability that the event leading to the given data happens. Take coin flipping as an example: we have observed three heads and two tails in five trials of the same coin. The number of heads in a given number of trials follows a binomial distribution with probability $p$, where $p$ is the probability of a head in a single toss. We are not sure whether the coin is fair, but we want to estimate the most likely $p$ for this coin based on our observations. MLE can help here. The procedure is the following:
- Calculate the likelihood as a function of the distribution parameter $\Theta$. Here $\Theta$ is $p$ in the binomial case.
- Take the negative logarithm of the likelihood function
The benefits of using the negative logarithm are
- avoiding overflow or underflow;
- turning the product of probabilities into a sum of log-probabilities.
- Find the $p$ that minimizes $-\log(\mathcal{L}(p))$.
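Concretely, for the coin example the steps work out as follows (a sketch; the binomial coefficient is a constant and does not affect the minimizer):

$$
\begin{aligned}
\mathcal{L}(p) &= \binom{5}{3} p^{3} (1-p)^{2}, \\
-\log \mathcal{L}(p) &= -3 \log p - 2 \log (1-p) + \text{const}, \\
\frac{d}{dp}\Bigl[-\log \mathcal{L}(p)\Bigr] &= -\frac{3}{p} + \frac{2}{1-p} = 0 \;\Longrightarrow\; \hat{p} = \frac{3}{5}.
\end{aligned}
$$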
The solution is $\hat{p} = \frac{3}{5}$. It does not seem like a fair coin. (For a more careful conclusion, we would need to estimate a confidence interval.)
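This answer can also be checked numerically. Below is a minimal sketch, assuming NumPy and SciPy are available; the function and variable names are only illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(p):
    """Negative log-likelihood for 3 heads and 2 tails (binomial constant dropped)."""
    return -(3 * np.log(p) + 2 * np.log(1 - p))

# Minimize over the open interval (0, 1).
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # approximately 0.6, i.e. p = 3/5
```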
MLE for regression
In statistics, regression with a parametric model assumes the response variable $Y$ follows a certain probability distribution. For example, linear regression assumes $Y$ follows a normal distribution, and logistic regression assumes $Y$ follows a binomial distribution.
The problem setup is the following:
- We have a set of data (observations).
- We make an assumption on the distribution of the response variable.
- We want to find the parameters $W$ under which the set of observations is most likely to happen.
In the general case, let $f(y \mid \Theta)$ be the probability distribution for the response variable, where $\Theta$ is the distribution parameter and is a function of the independent variables $x_i$ and the parameters $W$. For example, in the linear regression case, $\Theta$ is the mean and is given by $\Theta = W^TX$. Given a set of observations (a sample) $D$ with $n$ pairs of $[y_i, x_i]$, the likelihood function is

$$
\mathcal{L}(W) = \prod_{i=1}^{n} f(y_i \mid \Theta(x_i, W)).
$$

The negative logarithm of it is

$$
-\log \mathcal{L}(W) = -\sum_{i=1}^{n} \log f(y_i \mid \Theta(x_i, W)).
$$

The MLE objective function is

$$
\hat{W} = \underset{W}{\operatorname{argmin}} \left[ -\sum_{i=1}^{n} \log f(y_i \mid \Theta(x_i, W)) \right].
$$
Linear regression
In linear regression, we assume a linear relationship with an error term,

$$
Y = W^T X + \xi,
$$

where the intercept term is included by adding a dummy variable in $X$, and the noise $\xi$ follows a Gaussian distribution $N(0, \sigma^2)$.

So the response variable follows a Gaussian distribution,

$$
f(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right),
$$

where $\Theta = w^T x_i$ is the mean and $\sigma$ is the standard deviation of the noise $\xi$.

Following the general MLE objective function,

$$
\hat{w} = \underset{w}{\operatorname{argmin}} \left[ \sum_{i=1}^{n} \frac{(y_i - w^T x_i)^2}{2\sigma^2} + n \log\left(\sqrt{2\pi}\,\sigma\right) \right].
$$

So the MLE becomes a problem of finding the $w$ that minimizes the sum of squared errors,

$$
\hat{w} = \underset{w}{\operatorname{argmin}} \sum_{i=1}^{n} (y_i - w^T x_i)^2.
$$
So now we can see that the objective (loss) function for linear regression originates from MLE with a Gaussian-distributed response variable.
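To make the equivalence concrete, here is a minimal sketch with synthetic data (all names and values are illustrative); minimizing the Gaussian negative log-likelihood and solving ordinary least squares should give the same $w$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2x + Gaussian noise; a dummy column of ones handles the intercept.
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=100)
X = np.column_stack([np.ones_like(x), x])

def neg_log_likelihood(w):
    """Gaussian negative log-likelihood with constant terms in sigma dropped."""
    return np.sum((y - X @ w) ** 2)  # proportional to the sum of squared errors

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle, w_ols)  # the two estimates agree (up to numerical tolerance)
```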
Logistic regression
In logistic regression, the response variable $Y$ is binary and follows a binomial distribution. Let's set the binary values to be -1 and 1 [2]. The parameter of the model is $p$, the probability of $Y = 1$, and $p$ is a function of the independent variables $X$ and parameters $W$. One of the most popular choices is the sigmoid function,

$$
p = h(w^T x), \qquad h(z) = \frac{1}{1 + e^{-z}}.
$$

Therefore, for an observation with $y_i = 1$, we have

$$
P(y_i \mid x_i, w) = h(w^T x_i) = h(y_i w^T x_i),
$$

since $y_i = 1$ implies $h(w^Tx_i) = h(y_iw^Tx_i)$. For an observation with $y_i = -1$, we have

$$
P(y_i \mid x_i, w) = 1 - h(w^T x_i) = h(-w^T x_i) = h(y_i w^T x_i),
$$

since $1 - h(z) = h(-z)$ and $y_i = -1$ implies $h(-w^Tx_i) = h(y_iw^Tx_i)$. So for each observation $y_i$, we can write the probability in a clean way:

$$
P(y_i \mid x_i, w) = h(y_i w^T x_i).
$$

Following the general MLE objective function,

$$
\hat{w} = \underset{w}{\operatorname{argmin}} \left[ -\sum_{i=1}^{n} \log h(y_i w^T x_i) \right].
$$

So the MLE becomes a problem of finding the $w$ that minimizes the objective (loss) function

$$
\hat{w} = \underset{w}{\operatorname{argmin}} \sum_{i=1}^{n} \log\left( 1 + e^{-z_i} \right),
$$
where $z_i = y_iw^Tx_i$. ($x_i$ includes a dummy variable for intercept.)
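Again, a minimal numerical sketch with synthetic $\pm 1$ labels (names and values are illustrative); it minimizes exactly this loss using a numerically stable form of $\log(1 + e^{-z})$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic data: labels in {-1, +1} drawn from a sigmoid model with intercept -0.5, slope 1.5.
x = rng.uniform(-3, 3, size=500)
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x)))
y = np.where(rng.uniform(size=500) < p_true, 1.0, -1.0)
X = np.column_stack([np.ones_like(x), x])

def logistic_loss(w):
    """Sum over observations of log(1 + exp(-z_i)) with z_i = y_i * w^T x_i."""
    z = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -z))  # stable evaluation of log(1 + e^{-z})

w_hat = minimize(logistic_loss, x0=np.zeros(2)).x
print(w_hat)  # roughly [-0.5, 1.5]
```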
Maximum a posteriori estimation
Maximum a posteriori (MAP) estimation is an estimate of an unknown quantity that equals the mode of the posterior distribution, without worrying about the full shape of the posterior. MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. Here we show how MAP leads to regularization in regression models.
MAP for regression
Given a set of data $D$, what is the distribution of the parameters $W$? We can use Bayes' rule,

$$
P(w \mid D) = \frac{P(D \mid w) \, P(w)}{P(D)}.
$$

Here $P(D)$ is independent of $w$, so only the numerator is important for determining $w$.
The problem setup is the following:
- We have observed some data.
- We make an assumption on the distribution of the response variable.
- We make an assumption on the prior distribution of parameter $w$ based on our knowledge.
- We want to find the parameters $w$ that correspond to the maximum of the posterior given the observed data.
In the general case, let $f(Y | w, X)$ be the probability distribution for the response variable and $g(w)$ be the prior distribution of the parameters $w$. Given a set of observations (a sample) $D$ with $n$ pairs of $[y_i, x_i]$, the posterior is

$$
P(w \mid D) \propto \prod_{i=1}^{n} f(y_i \mid w, x_i) \, g(w).
$$

The negative logarithm is

$$
-\log P(w \mid D) = -\sum_{i=1}^{n} \log f(y_i \mid w, x_i) - \log g(w) + \text{const}.
$$

The MAP problem becomes

$$
\hat{w} = \underset{w}{\operatorname{argmin}} \left[ -\sum_{i=1}^{n} \log f(y_i \mid w, x_i) - \log g(w) \right].
$$
MAP for linear regression with a Gaussian prior
Suppose the prior of $w$ is a Gaussian distribution with mean zero and standard deviation $\sigma_w$. This indicates that the values of $w$ are likely to be around zero, and extremely large values in $w$ are very unlikely.
Similar to MLE for linear regression, we insert the probability density functions of the response variable and the parameter prior into the MAP objective function,

$$
\hat{w} = \underset{w}{\operatorname{argmin}} \left[ \sum_{i=1}^{n} \frac{(y_i - w^T x_i)^2}{2\sigma^2} + \sum_{j} \frac{w_j^2}{2\sigma_w^2} + \text{const} \right].
$$

So MAP for linear regression with a Gaussian prior becomes a problem of finding the $w$ that minimizes the sum of squared errors with an $L_2$ regularization,

$$
\hat{w} = \underset{w}{\operatorname{argmin}} \left[ \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \sum_{j} w_j^2 \right],
$$
where $\lambda = \frac{\sigma^2}{\sigma_w^2}$. Thus the objective (loss) function for linear regression with $L_2$ regularization (ridge regression) originates from MAP estimation of $w$ with a Gaussian prior.
The question is how to choose the hyperparameter $\lambda$. Typically, we choose $\lambda$ using cross-validation. In this sense, the hyperparameter is determined by the data.
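As an illustration, here is a minimal sketch of this objective using its closed-form solution (synthetic data, an arbitrary $\lambda$, and the intercept penalized along with the slope for simplicity; in practice $\lambda$ would come from cross-validation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: y = 1 + 2x + Gaussian noise; a dummy column of ones handles the intercept.
x = rng.uniform(-3, 3, size=50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=50)
X = np.column_stack([np.ones_like(x), x])

# Minimizing sum of squared errors + lambda * ||w||^2 has the closed form
# w = (X^T X + lambda * I)^{-1} X^T y.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)  # shrunk toward zero relative to the unregularized least-squares fit
```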
MAP for robust regression with a Laplace prior
We still have the model

$$
Y = W^T X + \xi,
$$

where the intercept is taken into account by including a dummy variable in $X$. Here the noise $\xi$ comes from a Laplace distribution [3] instead of a Gaussian distribution. The Laplace distribution is a heavy-tailed distribution, so extreme values away from the mean are more likely than under a Gaussian distribution, and outliers are handled more appropriately.

With these assumptions, $Y$ is given by

$$
f(y_i \mid x_i, w) = \frac{1}{2b} \exp\left( -\frac{|y_i - w^T x_i|}{b} \right),
$$

where $\Theta = w^T x_i$ is the mean and $b$ is the diversity (scale) of the noise term $\xi$.
We also assume the parameter prior is a Laplace distribution,

$$
g(w_j) = \frac{1}{2 b_w} \exp\left( -\frac{|w_j|}{b_w} \right).
$$

Then the MAP objective function is

$$
\hat{w} = \underset{w}{\operatorname{argmin}} \left[ \sum_{i=1}^{n} \frac{|y_i - w^T x_i|}{b} + \sum_{j} \frac{|w_j|}{b_w} + \text{const} \right].
$$

So MAP for robust regression with a Laplace prior becomes a problem of finding the $w$ that minimizes the sum of absolute errors with an $L_1$ regularization,

$$
\hat{w} = \underset{w}{\operatorname{argmin}} \left[ \sum_{i=1}^{n} |y_i - w^T x_i| + \lambda \sum_{j} |w_j| \right],
$$
where $\lambda = \frac{b}{b_w}$.
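Finally, a minimal numerical sketch of this objective (synthetic data with a few injected outliers, an arbitrary $\lambda$, and a derivative-free optimizer because the absolute values make the objective non-smooth):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Synthetic data: y = 1 + 2x + noise, with a handful of gross outliers added.
x = rng.uniform(-3, 3, size=80)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=80)
y[:5] += 15.0  # outliers that would badly distort a least-squares fit
X = np.column_stack([np.ones_like(x), x])
lam = 0.5

def objective(w):
    """Sum of absolute errors plus an L1 penalty on the parameters."""
    return np.sum(np.abs(y - X @ w)) + lam * np.sum(np.abs(w))

w_hat = minimize(objective, x0=np.zeros(2), method="Nelder-Mead").x
print(w_hat)  # stays close to [1, 2] despite the outliers
```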
_Some further thoughts:
- In the problem setup, we write the probability of the data as if we knew $w$.
- The assumption on the probability distribution is what we believe about the world.
- The choice of MLE or MAP reflects what we believe about how the world works.
- With a Laplace distribution, it is no longer that unbelievable to have a point far away from the line._