Deriving Least Squares geometrically

0. Introduction

The workhorse model in econometrics always has been and still is OLS regression, i.e. least-squares approximation. In every introductory econometrics course you will (and should) see a mathematical proof of the well-known formula

\begin{equation} x = \big(A^tA\big)^{-1}A^tb \end{equation}

where $A \in \mathbb{R}^{m \times n}$ with $m > n$ and $\text{rk}(A) = n$, $x \in \mathbb{R}^n$ and $b \in \mathbb{R}^m$. The proof, however, usually builds on matrix calculus by minimizing the (differentiable) expression $\Vert Ax - b \Vert_{2}^{2}$, where $\Vert \cdot \Vert_{2}$ denotes the Euclidean norm.
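As a quick numerical sanity check (a sketch in NumPy with arbitrary random data, not part of the derivation itself), the closed-form expression agrees with a generic least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))  # full column rank with probability 1
b = rng.standard_normal(m)

# closed-form solution x = (A^t A)^{-1} A^t b
x_formula = np.linalg.inv(A.T @ A) @ A.T @ b

# reference solution from NumPy's generic least squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_formula, x_lstsq)
```

(In practice one would use a QR- or SVD-based routine rather than forming the inverse explicitly, which is numerically less stable; the formula is conceptual rather than computational.)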

Linear algebra offers another proof that needs only basic geometric facts, which I will present in the following. My main reference is the numerical linear algebra lecture notes by Prof. F. Wübbeling.

1. Notation and auxiliary results

Let $A \in \mathbb{R}^{m \times n}$. From basic linear algebra, we know that

\begin{align} \ker(A) &:= \{x \in \mathbb{R}^n \, \vert \, Ax = 0\} \\
\text{Im}(A) &:= \{Ax \, \vert \, x \in \mathbb{R}^n\} \\
\text{rk}(A) &:= \dim \text{Im}(A) \end{align}

and that it holds that

\begin{align} \text{rk}(A) &= \text{rk}(A^t) \\ \dim \ker(A) &+ \dim \text{Im}(A) = n \end{align}

Lemma. We have that

\begin{equation} \boxed{\mathbb{R}^{m} = \text{Im}(A) \oplus \ker(A^t)} \qquad\qquad (1) \end{equation}

and

\begin{equation} \boxed{\ker(A^tA) = \ker(A)} \qquad\qquad (2) \end{equation}

Proof. To prove (1), note that for $x \in \ker(A^t) \subset \mathbb{R}^m$ and arbitrary $y \in \mathbb{R}^n$ we have \begin{equation} A^tx = 0 \implies y^tA^tx = 0 \iff \langle A^tx, y \rangle = \langle x, Ay \rangle = 0 \implies x \in \text{Im}(A)^{\perp} \end{equation} and therefore $\ker(A^t) \subset \text{Im}(A)^{\perp}$.

Conversely, for $x \in \text{Im}(A)^{\perp} \subset \mathbb{R}^m$ we have, since $AA^tx \in \text{Im}(A)$,

\begin{equation} 0 = \langle x, AA^tx \rangle = \langle A^tx, A^tx \rangle = \Vert A^tx \Vert_{2}^2 \iff A^tx = 0 \implies x \in \ker(A^t) \end{equation}

by positive definiteness of the norm. Hence, $\text{Im}(A)^{\perp} \subset \ker(A^t)$. In total, we have proved $\ker(A^t) = \text{Im}(A)^{\perp}$, and (1) follows from the orthogonal decomposition $\mathbb{R}^m = \text{Im}(A) \oplus \text{Im}(A)^{\perp}$.

To prove (2), observe that one direction is trivial as \begin{equation} \mathbb{R}^m \ni Ax = 0 \implies A^tAx = A^t0 = 0 \end{equation} and hence, $\ker(A) \subset \ker(A^tA)$. For the other direction, we see immediately that

\begin{equation} A^tAx = 0 \implies 0 = \langle A^tAx, x \rangle = \langle Ax, Ax \rangle = \Vert Ax \Vert_{2}^{2} \iff Ax = 0 \end{equation}

and therefore, $\ker(A^tA) \subset \ker(A)$. In sum, we again have $\ker(A^tA) = \ker(A) \qquad\qquad\qquad\qquad \square$.
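Both identities of the lemma can be observed numerically. The following sketch uses arbitrary illustrative data; the rank-deficient matrix is constructed purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))
A[:, -1] = A[:, 0] + A[:, 1]  # force rank deficiency: rk(A) = 3

# (2): ker(A^t A) = ker(A) -- both kernels have the same dimension ...
rank_A = np.linalg.matrix_rank(A)
assert np.linalg.matrix_rank(A.T @ A) == rank_A

# ... and a kernel vector of A^t A also lies in ker(A)
_, _, Vt = np.linalg.svd(A.T @ A)
z = Vt[-1]  # right singular vector for the (numerically) zero singular value
assert np.allclose(A @ z, 0, atol=1e-6)

# (1): a vector in ker(A^t) is orthogonal to Im(A)
U, _, _ = np.linalg.svd(A)
x = U[:, rank_A]  # left singular vector outside the range of A
assert np.allclose(A.T @ x, 0, atol=1e-10)            # x lies in ker(A^t)
assert abs(x @ (A @ rng.standard_normal(n))) < 1e-10  # x is orthogonal to Im(A)
```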

2. Deriving the least squares formula

Now that we have the necessary foundation, we can derive the least squares formula in the form of the following theorem.

Theorem. Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. Then, $x \in \mathbb{R}^n$ is the least squares solution of the equation $Ax = b$ iff

\begin{equation} \boxed{A^tAx = A^tb} \end{equation}

Furthermore, the solution is unique iff $\text{rk}(A) = n$, in which case $A^tA$ is invertible and $x = (A^tA)^{-1}A^tb$.

Proof. First, we orthogonally decompose $b \in \mathbb{R}^m$ according to the previous lemma into two parts $b_1 \in \text{Im}(A)$ and $b_2 \in \ker(A^t)$. Now let $x \in \mathbb{R}^n$ be arbitrary. Then we have \begin{equation} \Vert Ax - b \Vert_{2}^2 = \Vert Ax - (b_1 + b_2)\Vert_{2}^{2} = \Vert (Ax - b_1) + (-b_{2})\Vert_{2}^{2} \end{equation}

Note that $(Ax - b_{1}) \in \text{Im}(A)$, as $\text{Im}(A)$ is a subspace and thus closed under addition and scalar multiplication. Because the two summands inside the squared norm are orthogonal, i.e. $(Ax - b_1) \perp b_2$, we can apply the Pythagorean theorem to conclude

\begin{equation} \Vert (Ax - b_1) + (-b_{2})\Vert_{2}^{2} = \Vert (Ax - b_1)\Vert_{2}^{2} + \Vert b_{2}\Vert_{2}^{2} \end{equation}

The last expression is minimized exactly when $Ax = b_{1}$; since $b_1 \in \text{Im}(A)$, there exists a $z \in \mathbb{R}^n$ with $Az = b_{1}$, and any such $z$ is a least squares solution. By part (2) of the previous lemma,

\begin{equation} Ax = b_{1} \iff Ax - Az = 0 \iff A(x - z) = 0 \iff 0 = A^tA(x - z) = A^tAx - A^tAz \end{equation}

However, $A^tAz = A^tb_1 = A^t(b_1 + b_2) = A^tb$ since $b_2 \in \ker(A^t)$, and finally,

\begin{equation} Ax = b_{1} \iff A^tAx - A^tb = 0 \iff A^tAx = A^tb \end{equation}

For the uniqueness, assume there are two least squares solutions $x_1, x_2$. Then, clearly

\begin{equation} A^tAx_1 = A^tb \,\land\, A^tAx_2 = A^tb \implies A^tA(x_1 - x_2) = 0 \iff A(x_1 - x_2) = 0 \iff x_1 - x_2 \in \ker(A) \end{equation}

Therefore, $x_1 = x_2 \iff \ker(A) = \{0\} \iff \text{rk}(A) = n$, where the last equivalence is the rank-nullity theorem. $\qquad\qquad\qquad\qquad \square$
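Numerically, the theorem says that any least squares routine must return an $x$ satisfying the normal equations, or equivalently a residual $Ax - b$ lying in $\ker(A^t)$. A short sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 4))  # full column rank with probability 1
b = rng.standard_normal(10)

x, *_ = np.linalg.lstsq(A, b, rcond=None)

# normal equations A^t A x = A^t b hold ...
assert np.allclose(A.T @ A @ x, A.T @ b)

# ... equivalently, the residual Ax - b is orthogonal to every column of A
assert np.allclose(A.T @ (A @ x - b), 0, atol=1e-10)
```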

3. Conclusion

This concludes the geometric derivation of least squares. It highlights that the least squares solution $x$ collapses into the exact solution of the equation $Ax = b$ as soon as $b \in \text{Im}(A)$. If $b$ does not lie in the image of $A$, then $b$ necessarily has a component in $\text{Im}(A)^{\perp}$.
In this case, the least squares solution targets the component of $b$ that does lie in the image of $A$ by projecting $b$ orthogonally onto $b_1 \in \text{Im}(A)$.
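This projection view is easy to verify in code: the fitted values $Ax$ equal $Pb$, where $P = A(A^tA)^{-1}A^t$ is the orthogonal projector onto $\text{Im}(A)$ (a sketch with arbitrary data; $P$ follows directly from the least squares formula):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 2))
b = rng.standard_normal(7)

P = A @ np.linalg.inv(A.T @ A) @ A.T  # orthogonal projector onto Im(A)
x = np.linalg.inv(A.T @ A) @ A.T @ b  # least squares solution

assert np.allclose(A @ x, P @ b)                       # fitted values = b_1
assert np.allclose(P @ P, P)                           # P is idempotent
assert np.allclose(A.T @ (b - P @ b), 0, atol=1e-10)   # b_2 = b - Pb in ker(A^t)
```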
There are several follow-up topics: the closely related pseudo-inverse as the minimum-norm least squares solution, the singular value decomposition (SVD), and nonlinear least squares, e.g. Levenberg-Marquardt. Krylov subspace methods and the iterative solution of linear systems of equations also make exciting topics for exposition (and, I imagine, are widely used in industry).