
Linear least squares (mathematics)

From Wikipedia, the free encyclopedia


Linear least squares is a mathematical optimization technique used to find an approximate solution to a system of linear equations that has no exact solution. This usually happens when the number of equations (m) is greater than the number of variables (n). (See also linear regression.)

Definition

In mathematical terms, we want to find a solution x for the "equation"

    Ax = b,

where A is a known m-by-n matrix (usually with m > n), x is an unknown n-dimensional parameter vector, and b is a known m-dimensional measurement vector. More precisely, we want to minimize the squared Euclidean norm of the residual Ax − b, that is, the quantity

    \|Ax - b\|^2 = \sum_{i=1}^{m} ([Ax]_i - b_i)^2,

where [Ax]_i denotes the i-th component of the vector Ax. Hence the name "least squares".

Using the fact that the squared norm of v is v^T v, where v^T stands for the transpose of v, rewrite the expression as

    \|Ax - b\|^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - x^T A^T b - b^T A x + b^T b.

The two middle terms b^T (Ax) and (Ax)^T b are equal, and the minimum is found at the zero of the derivative with respect to x,

    \frac{d}{dx} \|Ax - b\|^2 = 2 A^T A x - 2 A^T b = 0.

Therefore the minimizing vector is a solution of the normal equation

    A^T A x = A^T b.

Note that this corresponds to a system of linear equations. The matrix A^T A on the left-hand side is an n-by-n square matrix, which is invertible if A has full column rank (that is, if the rank of A is n). In that case, the solution of the system of linear equations is unique and given by

    x = (A^T A)^{-1} A^T b.

The matrix (A^T A)^{-1} A^T is called the pseudoinverse of A. We cannot use the true matrix inverse of A (that is, A^{-1}), because it does not exist, as A is not a square matrix (m ≠ n).
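
As an illustration of the solution formula above (a minimal sketch, not part of the original article; the matrix A and vector b below are made-up example values), the least squares solution can be computed in Python with NumPy either by solving the normal equation directly or by calling a ready-made routine:

    import numpy as np

    # Made-up overdetermined system: m = 4 equations, n = 2 unknowns.
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    b = np.array([6.0, 5.0, 7.0, 10.0])

    # Solve the normal equation A^T A x = A^T b directly.
    x_normal = np.linalg.solve(A.T @ A, A.T @ b)

    # Same result via the pseudoinverse and via NumPy's least squares solver.
    x_pinv = np.linalg.pinv(A) @ b
    x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

    print(x_normal, x_pinv, x_lstsq)  # all three agree up to rounding

In practice the library routines are preferred over forming (A^T A)^{-1} explicitly, for the conditioning reasons discussed in the next section.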

Computation

If the matrix A^T A has full rank and is well-conditioned, the normal equations can be solved directly by using the Cholesky decomposition A^T A = R^T R, giving

    R^T R x = A^T b,

where R is an upper triangular matrix.
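
A sketch of this approach in Python (my own illustration, not from the article); note that numpy.linalg.cholesky returns the lower-triangular factor L with A^T A = L L^T, so R = L^T:

    import numpy as np
    from scipy.linalg import solve_triangular

    def solve_via_cholesky(A, b):
        """Solve the normal equation A^T A x = A^T b using a Cholesky factorization."""
        AtA = A.T @ A
        Atb = A.T @ b
        L = np.linalg.cholesky(AtA)                   # A^T A = L L^T, i.e. R = L^T
        y = solve_triangular(L, Atb, lower=True)      # forward substitution: R^T y = A^T b
        return solve_triangular(L.T, y, lower=False)  # back substitution:    R x = y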

A slower but more numerically stable method, which avoids forming A^T A explicitly (and which still works for a rank-deficient A if column pivoting is used), is to compute the QR decomposition A = QR. One can then solve

    R x = Q^T b,

where Q is an m-by-n matrix with orthonormal columns and R is an n-by-n upper triangular matrix.
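
A corresponding sketch using NumPy's thin QR factorization (again my own illustration, assuming A has full column rank so that R is invertible):

    import numpy as np
    from scipy.linalg import solve_triangular

    def solve_via_qr(A, b):
        """Least squares solution via the thin QR decomposition A = Q R."""
        Q, R = np.linalg.qr(A)               # Q: m-by-n, orthonormal columns; R: n-by-n, upper triangular
        return solve_triangular(R, Q.T @ b)  # back substitution: R x = Q^T b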

A third alternative is to use the singular value decomposition (SVD). If A = U Σ V* is the singular value decomposition of A, then the pseudoinverse of A is V Σ^+ U*, so

    x = V Σ^+ U* b,

where Σ^+ is the transpose of Σ with every nonzero entry replaced by its reciprocal. This method is the most computationally intensive, but is particularly useful if the matrix A is very ill-conditioned (i.e. if its condition number multiplied by the machine's relative round-off error is appreciably large). In that case, including the smallest singular values in the inversion merely adds numerical noise to the solution. This can be cured with the truncated SVD approach, which gives a more stable and accurate answer by explicitly setting to zero all singular values below a certain threshold, and so ignoring them, before calculating the pseudoinverse.
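
A sketch of the SVD route, including the truncation just described (the relative threshold rcond is an assumption on my part, chosen in the spirit of numpy.linalg.pinv):

    import numpy as np

    def solve_via_svd(A, b, rcond=1e-12):
        """Least squares via the SVD, discarding singular values below rcond * max(s)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
        cutoff = rcond * s.max()
        s_inv = np.zeros_like(s)                           # build Sigma^+ ...
        s_inv[s > cutoff] = 1.0 / s[s > cutoff]            # ... truncating small singular values to zero
        return Vt.T @ (s_inv * (U.T @ b))                  # x = V Sigma^+ U^T b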

Applications

The linear least squares method can be used to find an affine function from R^n to R that best fits a given set of data (see general least squares method). It is widely, and erroneously, thought that the word linear in the term linear regression refers to the linear or affine nature of the fitted function; in fact, it refers to the model being linear in its parameters.

For example,

    y = α + βx + γx^2

is still a linear regression model, for the right-hand side is a linear combination of the parameters α, β, and γ; moreover, the least-squares estimates of those parameters are linear in the vector of observed y-values. In this case it is useful to think of x2 as a new independent variable, formed by modifying the original variable x. However, it is convention to call this a quadratic fit or a 2nd-order polynomial fit.
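
To make this concrete, here is a small sketch (my own illustration, with made-up sample data) that fits y = α + βx + γx^2 by treating 1, x, and x^2 as columns of the matrix A:

    import numpy as np

    # Made-up sample data.
    x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
    y = np.array([ 2.1, 0.9, 1.2, 3.8, 9.1])

    # The model is linear in (alpha, beta, gamma), so build A with columns 1, x, x^2.
    A = np.column_stack([np.ones_like(x), x, x**2])
    alpha, beta, gamma = np.linalg.lstsq(A, y, rcond=None)[0]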

We write the linear function we are trying to find as a 1-by-n matrix x^T (so x is actually a column vector; see also linear transformation).

The set of data consists of m (n + 1)-tuples (x_1, ..., x_n, y). These can be written into an m-by-n matrix A and an m-dimensional vector b, where every tuple corresponds to a row of A, with its y value becoming the corresponding entry of b.

Then, solving

    Ax ≈ b

in the least squares sense yields the function x we seek.
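
As a sketch of this construction (my own illustration; the tuples are made up, with n = 2):

    import numpy as np

    # Made-up data: m = 4 tuples (x1, x2, y).
    data = np.array([[1.0, 2.0, 3.5],
                     [2.0, 0.0, 1.9],
                     [3.0, 1.0, 4.2],
                     [4.0, 5.0, 9.8]])

    A = data[:, :-1]   # each row of A is (x1, ..., xn)
    b = data[:, -1]    # the corresponding y values

    # Least squares solution of Ax ~ b; append a column of ones to A if an
    # affine (intercept) term is wanted, as in the examples above and below.
    x = np.linalg.lstsq(A, b, rcond=None)[0]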

Limitations

The least squares approach relies on the calculation of the pseudoinverse (A^T A)^{-1} A^T. The pseudoinverse is guaranteed to exist for any full-rank matrix A. However, in some cases the matrix A^T A is ill-conditioned; this occurs when the measurements are only marginally related to the estimated parameters. In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. This may occur even when the pseudoinverse itself can be accurately calculated numerically. Various regularization techniques can be applied in such cases, the most common of which is called Tikhonov regularization. If further information about the parameters is known, for example a range of possible values of x, then minimax techniques can also be used to increase the stability of the solution.
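
A sketch of Tikhonov regularization in this setting (my own illustration; the regularization parameter lam is a placeholder that would be tuned in practice):

    import numpy as np

    def solve_tikhonov(A, b, lam=1e-3):
        """Solve the regularized normal equation (A^T A + lam * I) x = A^T b."""
        n = A.shape[1]
        return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)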

Another drawback of the least squares estimator is the fact that it seeks to minimize the norm of the measurement error, Ax − b. In many cases, one is truly interested in obtaining a small error in the parameter x, e.g., a small value of ‖x̂ − x‖, where x̂ denotes the estimate. However, since the true x is unknown, this quantity cannot be directly minimized. If a prior probability on x is known, then a Bayes estimator can be used to minimize the mean squared error, E‖x̂ − x‖^2. The least squares method is often applied when no prior is known. Surprisingly, however, better estimators can be constructed, an effect known as Stein's phenomenon. For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the most common of these is the James-Stein estimator.
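
For reference (not stated in this article), the classical form of the James-Stein estimator for a single observation y ~ N(θ, σ^2 I_m) with known σ^2 and m ≥ 3 shrinks the observation toward zero:

    \hat{\theta}_{\mathrm{JS}} = \left( 1 - \frac{(m - 2)\,\sigma^2}{\|y\|^2} \right) y

Its mean squared error is smaller than that of the unbiased estimate y for every θ when m ≥ 3.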

Example

Consider the points (0, 3), (2, 3), (4, 4), (−1, 2). We seek a solution of the form αx + β = y, that is,

    α·0 + β = 3
    α·2 + β = 3
    α·4 + β = 4
    α·(−1) + β = 2

We can then form the matrix A:

    A = \begin{bmatrix} 0 & 1 \\ 2 & 1 \\ 4 & 1 \\ -1 & 1 \end{bmatrix}

and the vector b:

    b = \begin{bmatrix} 3 \\ 3 \\ 4 \\ 2 \end{bmatrix}

and then

    A^T A = \begin{bmatrix} 21 & 5 \\ 5 & 4 \end{bmatrix}, \qquad A^T b = \begin{bmatrix} 20 \\ 12 \end{bmatrix}.

So, the normal equation is

    \begin{bmatrix} 21 & 5 \\ 5 & 4 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} 20 \\ 12 \end{bmatrix}

and its solution is

    \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \frac{1}{59} \begin{bmatrix} 20 \\ 152 \end{bmatrix},

and the line of best fit is (20/59)x + 152/59 = y.

Plot of points and solution.
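
The numbers above can be checked with a few lines of NumPy (a verification sketch, not part of the original article):

    import numpy as np

    # The four points (x, y) from the example.
    pts = np.array([[0.0, 3.0], [2.0, 3.0], [4.0, 4.0], [-1.0, 2.0]])
    A = np.column_stack([pts[:, 0], np.ones(len(pts))])   # columns: x and the constant 1
    b = pts[:, 1]

    alpha, beta = np.linalg.lstsq(A, b, rcond=None)[0]
    print(alpha, beta)   # approximately 0.3390 = 20/59 and 2.5763 = 152/59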

See also