Intro
I'm taking a Statistical Inference course this semester. It's been hard, the kind of hard where you think you understand something and then realize you were just pattern-matching to a familiar shape.
We finally got to linear regression, and I thought: finally, something I know. I've seen this in machine learning, I've implemented it, I've debugged it. So when the lecture notes wrote out the RSS and started manipulating it in matrix form, I was nodding along.
Then something happened.
We took the derivative of the RSS with respect to $\boldsymbol{\beta}$, set it to zero, and solved. No learning rate. No iterations. No update rule. Just solved.
I was confused because... I don't now, I was waiting a gradient descent-like solution? Like, doesn't the gradient give us a direction? Don't we then multiply by a learning rate and update? But nope. The notes just algebraically solved for $\tilde{\boldsymbol{\beta}}$ directly. I had gradient descent so wired into my intuition that I couldn't see why you'd ever just stop after one step.
Apparently, the answer is convexity.
The Setup
We have $n$ observations. Each observation $i$ has a feature vector $x_i \in$ ${R}^{k+1}$ (with $x_{i,0} = 1$ for the intercept) and a response $y_i \in \mathbb{R}$. Stacked into matrices:
$$y = X\boldsymbol{\beta} + \mathbf{e}$$
where $X$ is $n \times (k+1)$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_k)^T$, and $\mathbf{e}$ is the noise vector. The residual sum of squares is:
$$\text{RSS}(\boldsymbol{\beta}) = |y - X\boldsymbol{\beta}|_2^2 = (y - X\boldsymbol{\beta})^T(y - X\boldsymbol{\beta})$$
Expanding:
$$\text{RSS}(\boldsymbol{\beta}) = y^Ty - 2\boldsymbol{\beta}^TX^Ty + \boldsymbol{\beta}^TX^TX\boldsymbol{\beta}$$
Look at this carefully. In $\boldsymbol{\beta}$, this is a quadratic, one quadratic term $\boldsymbol{\beta}^TX^TX\boldsymbol{\beta}$, one linear term $-2\boldsymbol{\beta}^TX^Ty$, and a constant $y^Ty$. The shape of the loss landscape is determined entirely by that quadratic term.
The Closed-Form Solution
Since RSS is differentiable, we find its minimum by setting the gradient to zero. The gradient with respect to $\boldsymbol{\beta}$ is:
$$\nabla_{\boldsymbol{\beta}} \text{RSS} = -2X^Ty + 2X^TX\boldsymbol{\beta}$$
Setting this to zero:
$$X^TX\boldsymbol{\beta} = X^Ty$$
These are the normal equations. If $X$ has rank $k+1$ (no perfect multicollinearity among features), then $X^TX$ is invertible, and we get the unique solution directly:
$$\tilde{\boldsymbol{\beta}} = (X^TX)^{-1}X^Ty$$
No iterations. No learning rate. We set the gradient to zero, solve the linear system, done. The reason this works, the reason setting the gradient to zero is guaranteed to give a global minimum and not some saddle point or local minimum, is that RSS is convex.
Why RSS is Convex: The Hessian Proof
A twice-differentiable function is convex if and only if its Hessian is positive semi-definite (PSD) everywhere. Positive semi-definite means: for any non-zero vector $\mathbf{z}$,
$$\mathbf{z}^T H \mathbf{z} \geq 0$$
The Hessian of RSS with respect to $\boldsymbol{\beta}$ is just the matrix of second derivatives. From the gradient $-2X^Ty + 2X^TX\boldsymbol{\beta}$, differentiating again:
$$H = 2X^TX$$
Now we check PSD. Take any $\mathbf{z} \in \mathbb{R}^{k+1}$:
$$\mathbf{z}^T H \mathbf{z} = 2\mathbf{z}^T X^T X \mathbf{z} = 2(X\mathbf{z})^T(X\mathbf{z}) = 2|X\mathbf{z}|_2^2$$
The squared $\ell_2$-norm of any vector is non-negative. So:
$$\mathbf{z}^T H \mathbf{z} \geq 0 \quad \forall \mathbf{z}$$
$H$ is PSD everywhere, which means RSS is convex everywhere. If $X$ has full column rank, $X\mathbf{z} = 0$ only when $\mathbf{z} = 0$, so the inequality is strict, RSS is strictly convex, with a single unique global minimum.
This is the thing I was missing. The landscape isn't just bowl-shaped by coincidence or by the grace of your dataset. It's bowl-shaped by construction, because the loss is a quadratic form in $\boldsymbol{\beta}$ with a PSD coefficient matrix.
Why Other Models Need Gradient Descent
Linear regression is special. The moment you move to logistic regression, the loss is a sum of log-sigmoids, no longer quadratic in the parameters, no closed form. Neural networks compose nonlinearities across layers; the loss landscape becomes non-convex, full of saddle points and local minima. You need iterative methods because there's no algebraic shortcut to the bottom.
The gradient descent intuition I came in with, compute gradient, get a direction, step, repeat, isn't wrong. It's just solving a harder problem. Linear regression's convexity means you can skip directly to where gradient descent would eventually converge anyway, and get there in one shot.
Interactive: The Loss Landscape
I spent one hour getting confused about gradient descent only to learn that linear regression doesn't use it.
But in the and, I learned something.