Deriving Least Squares geometrically
0. Introduction
The workhorse model in econometrics always has been, and still is, OLS regression, i.e. linear least-squares fitting. In every introductory econometrics course you will (and should) see a mathematical proof of the well-known formula
\begin{equation} x = \big(A^tA\big)^{-1}A^tb \end{equation}
where $A \in \mathbb{R}^{m \times n}$ with $m > n$ and $\text{rk}(A) = n$, $x \in \mathbb{R}^n$ and $b \in \mathbb{R}^m$. The proof, however, usually builds on matrix calculus by minimizing the (differentiable) expression $\Vert Ax - b \Vert_{2}^{2}$, where $\Vert \cdot \Vert_{2}$ denotes the Euclidean norm.
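Before diving into the derivation, the formula itself is easy to sanity-check numerically. The following is a minimal sketch (not part of the derivation, using a randomly generated full-rank matrix) comparing the closed-form solution against NumPy's built-in least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))   # random A has full column rank almost surely
b = rng.standard_normal(m)

# The closed-form normal-equations solution x = (A^t A)^{-1} A^t b ...
x_formula = np.linalg.inv(A.T @ A) @ A.T @ b
# ... should agree with the library's least squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_formula, x_lstsq)
```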
Linear algebra offers another proof, one that needs only basic geometric properties and which I will present in the following. My main reference is the numerical linear algebra lecture notes by Prof. F. Wübbeling.
1. Notation and auxiliary results
Let $A \in \mathbb{R}^{m \times n}$. From basic linear algebra, we know that
\begin{align}
\ker(A) &:= \{x \in \mathbb{R}^n \, \vert \, Ax = 0\} \\
\text{Im}(A) &:= \{Ax \, \vert \, x \in \mathbb{R}^n\} \\
\text{rk}(A) &:= \dim \text{Im}(A)
\end{align}
and that it holds that
\begin{align} \text{rk}(A) &= \text{rk}(A^t) \\ \dim \ker(A) + \dim \text{Im}(A) &= n \end{align}
Lemma. We have that
\begin{equation} \boxed{\mathbb{R}^{m} = \text{Im}(A) \oplus \ker(A^t)} \qquad\qquad (1) \end{equation}
and
\begin{equation} \boxed{\ker(A^tA) = \ker(A)} \qquad\qquad (2) \end{equation}
Proof. To prove (1), note that for $x \in \ker(A^t) \subset \mathbb{R}^m$ and every $y \in \mathbb{R}^n$ we have \begin{equation} A^tx = 0 \implies y^tA^tx = 0 \iff \langle A^tx, y \rangle = \langle x, Ay \rangle = 0 \implies x \in \text{Im}(A)^{\perp} \end{equation} and therefore $\ker(A^t) \subset \text{Im}(A)^{\perp}$.
Conversely, for $x \in \text{Im}(A)^{\perp} \subset \mathbb{R}^m$ we have $AA^tx \in \text{Im}(A)$ and hence
\begin{equation} 0 = \langle x, AA^tx \rangle = \langle A^tx, A^tx \rangle = \Vert A^tx \Vert_{2}^2 \iff A^tx = 0 \implies x \in \ker(A^t) \end{equation}
by the definiteness of the norm. Hence, $\text{Im}(A)^{\perp} \subset \ker(A^t)$. In total, we have proved $\ker(A^t) = \text{Im}(A)^{\perp}$, and (1) follows since $\mathbb{R}^m$ always decomposes orthogonally as $\text{Im}(A) \oplus \text{Im}(A)^{\perp}$.
To prove (2), observe that one direction is trivial as \begin{equation} \mathbb{R}^m \ni Ax = 0 \implies A^tAx = A^t0 = 0 \end{equation} and hence, $\ker(A) \subset \ker(A^tA)$. For the other direction, we see immediately that
\begin{equation}
A^tAx = 0 \implies 0 = \langle A^tAx, x \rangle = \langle Ax, Ax \rangle = \Vert Ax \Vert_{2}^{2} \implies Ax = 0
\end{equation}
and therefore, $\ker(A^tA) \subset \ker(A)$. In sum, we again have $\ker(A^tA) = \ker(A) \qquad\qquad\qquad\qquad \square$.
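Both parts of the lemma can be illustrated numerically. The sketch below (an illustration on a random matrix, not a proof) checks that $\text{rk}(A^tA) = \text{rk}(A)$, which follows from (2) via the rank–nullity theorem, and constructs a vector in $\text{Im}(A)^{\perp}$ to confirm it lies in $\ker(A^t)$ as (1) asserts:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))

# (2): ker(A^t A) = ker(A) implies rank(A^t A) = rank(A) by rank-nullity.
assert np.linalg.matrix_rank(A.T @ A) == np.linalg.matrix_rank(A)

# (1): build a vector orthogonal to Im(A) by projecting a random vector
# onto the orthogonal complement of Im(A); it must lie in ker(A^t).
v = rng.standard_normal(m)
Q, _ = np.linalg.qr(A)          # columns of Q form an orthonormal basis of Im(A)
x = v - Q @ (Q.T @ v)           # component of v orthogonal to Im(A)
assert np.allclose(A.T @ x, 0)  # hence x is in ker(A^t)
```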
2. Deriving the least squares formula
Now that we have the necessary foundation, we can derive the least squares formula in the form of the following theorem.
Theorem. Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. Then, $x \in \mathbb{R}^n$ is a least squares solution of the equation $Ax = b$ iff
\begin{equation} \boxed{A^tAx = A^tb} \end{equation}
Furthermore, the solution is unique iff $\text{rk}(A) = n$, in which case it is given by $x = (A^tA)^{-1}A^tb$.
Proof. First, we orthogonally decompose $b \in \mathbb{R}^m$, according to our previous lemma, into two parts $b_1 \in \text{Im}(A)$ and $b_2 \in \ker(A^t)$. Now let $x \in \mathbb{R}^n$ be arbitrary. Then we have \begin{equation} \Vert Ax - b \Vert_{2}^2 = \Vert Ax - (b_1 + b_2)\Vert_{2}^{2} = \Vert (Ax - b_1) + (-b_{2})\Vert_{2}^{2} \end{equation}
Note that $(Ax - b_{1}) \in \text{Im}(A)$, as $\text{Im}(A)$ is, being a subspace, closed under addition and scalar multiplication. Because the two summands inside the squared norm are orthogonal, i.e. $(Ax - b_1) \perp b_2$, we can use the Pythagorean theorem to conclude
\begin{equation} \Vert (Ax - b_1) + (-b_{2})\Vert_{2}^{2} = \Vert (Ax - b_1)\Vert_{2}^{2} + \Vert b_{2}\Vert_{2}^{2} \end{equation}
The last expression becomes minimal for $x = z$ with $Az = b_{1}$; such a $z$ exists precisely because $b_1 \in \text{Im}(A)$. Therefore, $z$ is a least squares solution. But then, part (2) of the previous lemma implies
\begin{equation} Ax = b_{1} \iff Ax = Az \iff A(x - z) = 0 \iff A^tA(x - z) = 0 \iff A^tAx = A^tAz \end{equation}
However, $A^tAz = A^tb_1 = A^t(b_1 + b_2) = A^tb$ since $b_2 \in \ker(A^t)$, and finally,
\begin{equation} Ax = b_{1} \iff A^tAx - A^tb = 0 \iff A^tAx = A^tb \end{equation}
For the uniqueness, assume there are two least squares solutions $x_1, x_2$. Then, by (2),
\begin{equation} A^tAx_1 = A^tb \land A^tAx_2 = A^tb \implies A^tA(x_1 - x_2) = 0 \iff A(x_1 - x_2) = 0 \iff x_1 - x_2 \in \ker(A) \end{equation}
Therefore, $x_1 = x_2 \iff \ker(A) = \{0\} \iff \text{rk}(A) = n \qquad\qquad\qquad\qquad \square$.
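The theorem's claim that the normal-equations solution actually minimizes the residual can also be probed empirically. The sketch below (random data, full column rank assumed) solves $A^tAx = A^tb$ and checks that no nearby perturbation of $x$ achieves a smaller residual:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 10, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Unique solution of the normal equations, since rk(A) = n almost surely.
x = np.linalg.solve(A.T @ A, A.T @ b)
r_min = np.linalg.norm(A @ x - b)

# No random perturbation of x should decrease the residual norm.
for _ in range(100):
    y = x + 0.1 * rng.standard_normal(n)
    assert np.linalg.norm(A @ y - b) >= r_min
```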
3. Conclusion
This concludes the geometric derivation of the least squares formula.
It highlights the fact that the least squares solution $x$ collapses into the exact solution of the equation $Ax = b$ as soon as $b \in \text{Im}(A)$.
If $b$ does not lie in the image of $A$, then $b$ necessarily has a component that is part of $\text{Im}(A)^{\perp}$.
In this case, the least squares solution targets the component of $b$ that does lie in the image of $A$, obtained by projecting $b$ orthogonally onto $\text{Im}(A)$, which yields $b_1$.
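This projection picture can be made concrete: the matrix $P = A(A^tA)^{-1}A^t$ is the orthogonal projector onto $\text{Im}(A)$. A minimal sketch (random data, full column rank assumed) verifying that the least squares fit reproduces $b_1 = Pb$ and that the residual $b_2$ lies in $\ker(A^t)$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 7, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

P = A @ np.linalg.inv(A.T @ A) @ A.T   # orthogonal projector onto Im(A)
b1 = P @ b                             # component of b in Im(A)
b2 = b - b1                            # component in ker(A^t) = Im(A)^perp

x = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(A @ x, b1)          # the fit Ax is exactly the projection b1
assert np.allclose(A.T @ b2, 0)        # the residual is orthogonal to Im(A)
assert np.allclose(P @ P, P)           # P is idempotent, as any projector must be
```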
There are several follow-up topics: the closely related pseudo-inverse as minimum norm least squares solution, the singular value decomposition (SVD) and nonlinear least squares, e.g. Levenberg-Marquardt.
Krylov subspace methods and the iterative solution of linear systems of equations would also make exciting topics for exposition (and are, I imagine, widely used in industry).