The fundamental problem in machine learning is that the population joint distribution $\mathcal{D}$ of the target and features is unknown. Instead, we are given only a set of data $$S=\left\lbrace Y_i,X_i: i=1,\ldots,n\right\rbrace.$$ We can compute the losses $\ell(f(X_i),Y_i)$ over all observations $Y_i,X_i$, where $f$ denotes any candidate prediction rule of interest. We call the average loss over the sample $S$ the empirical risk function.
Empirical Risk Function #
$$L_S(f)=\frac{1}{n}\sum_{i=1}^{n}\ell(f(X_i),Y_i),\quad f\in\mathcal{F}.$$
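For concreteness, the following sketch computes $L_S(f)$ for one candidate rule on simulated data; the squared loss, the data-generating process, and the linear rule are assumptions made purely for illustration.

```python
# A minimal sketch of computing the empirical risk L_S(f) for one candidate
# prediction rule f. The squared loss, the simulated data, and the linear
# rule below are illustrative assumptions, not part of the text.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))                              # features X_i
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)  # targets Y_i

def loss(pred, y):
    """Squared loss ell(f(x), y) = (f(x) - y)^2, used here as an example."""
    return (pred - y) ** 2

def f(x):
    """A candidate linear prediction rule f(x) = theta^T x."""
    theta = np.array([0.8, -1.5, 0.0])
    return x @ theta

# Empirical risk: the average loss over the sample S.
L_S = np.mean(loss(f(X), Y))
print(f"empirical risk L_S(f) = {L_S:.3f}")
```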
By minimizing the empirical risk function rather than the population risk function over candidate prediction rules, we obtain the so-called empirical risk minimizer. Regularizing the estimation process often, if not always, improves forecasting performance. The idea is to approximate the population risk $L_{\mathcal{D}}$ by the empirical risk $L_S$ only over an appropriate subspace $\mathcal{H}\subset \mathcal{F}$ and to use the restricted estimator given by $$\widehat{f}\in\{f^{*}\in\mathcal{H}:L_{S}(f^{*})\leq L_{S}(f),~\forall f\in\mathcal{H}\},$$ or, equivalently, $$\widehat{f}=\underset{f\in\mathcal{H}}{\operatorname{argmin}}~L_{S}(f).$$
We trade off estimation variance and bias to obtain prediction rules that perform better in terms of the population risk rather than the empirical risk:
- The variance of our estimator typically increases with the supremum estimation error of the risk function, namely, $$\sup_{f\in\mathcal{H}}\left| L_S(f)-L_{\mathcal{D}}(f)\right|,$$ which becomes larger as the space $\mathcal{H}$ grows.
- On the other hand, shrinking the space $\mathcal{H}$ restricts our choice of candidate prediction rules. The best population risk we can reach is $\inf_{f\in\mathcal{H}} L_{\mathcal{D}}(f)$, so the smallest bias in terms of population risk is $$ \inf_{f\in\mathcal{H}} L_{\mathcal{D}}(f)-\inf_{f\in\mathcal{F}}L_{\mathcal{D}}(f),$$ which can still be large if $\mathcal{H}$ is too small to contain anything close to the ideal prediction rule (a small simulation of this trade-off follows this list).
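The following simulation sketches the trade-off under assumed choices: squared loss, nested polynomial classes playing the role of $\mathcal{H}$, and a large independent sample standing in for the population risk $L_{\mathcal{D}}$.

```python
# A small simulation sketch of the trade-off (all modelling choices here are
# illustrative assumptions): richer classes H (higher-degree polynomials)
# drive the empirical risk down, while the population risk -- approximated
# by the risk on a large independent sample -- can deteriorate.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(50)        # the sample S
x_test, y_test = sample(100_000)     # large sample approximating L_D

for degree in [1, 3, 9, 15]:         # indexes the size of the class H
    coef = np.polyfit(x_train, y_train, degree)   # empirical risk minimizer in H
    emp_risk = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    pop_risk = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: L_S = {emp_risk:.3f}, approx L_D = {pop_risk:.3f}")
```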
Penalized Methods #
Suppose that we can index the functions $f=f(x;\theta)\in\mathcal{H}$ with parameters $\theta\in\Theta\subset\mathbb{R}^q$ such that $$\mathcal{H}=\left\lbrace f(\cdot;\theta)\in\mathcal{F}: C(\theta)\leq b, \theta\in\Theta \right\rbrace,$$ where $C:\Theta\rightarrow [0,\infty)$ is some complexity criterion function and $b\in [0,\infty)\cup \{\infty\}$ is some hyperparameter; one should interpret $b=\infty$ as a relaxation of the constraint $C(\theta)\leq b$.
Using the Lagrange multiplier method, we can obtain $\widehat{f}(x)=f(x;{\widehat{\theta}})$ by minimizing the penalized error $$L_S(\theta)+\lambda\cdot C(\theta),\quad \theta\in\Theta$$ for an appropriate penalty coefficient $\lambda=\lambda(b)>0$ that is in one-to-one correspondence with $b$. The larger the penalty $\lambda$, the smaller $b$ and hence the smaller the set $\mathcal{H}$. In practice, it is easier to tune the hyperparameter $\lambda$ than the complexity bound $b$.
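As one concrete instance, the sketch below minimizes $L_S(\theta)+\lambda\cdot C(\theta)$ with the squared loss and $C(\theta)=\left\|\theta\right\|^2$ (ridge regression); the simulated data and the closed-form solver are illustrative assumptions.

```python
# A sketch of the penalized form L_S(theta) + lambda * C(theta) for one
# concrete instance: squared loss with C(theta) = ||theta||^2 (ridge
# regression). The data and the closed-form solver are illustrative choices.
import numpy as np

rng = np.random.default_rng(2)
n, q = 200, 10
X = rng.normal(size=(n, q))
theta_true = np.zeros(q)
theta_true[:3] = [2.0, -1.0, 0.5]
Y = X @ theta_true + rng.normal(size=n)

def penalized_fit(lam):
    """argmin_theta (1/n) * ||Y - X theta||^2 + lam * ||theta||^2."""
    # First-order condition: (X^T X / n + lam * I) theta = X^T Y / n.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(q), X.T @ Y / n)

for lam in [0.0, 0.1, 1.0, 10.0]:
    theta_hat = penalized_fit(lam)
    # A larger lambda corresponds to a smaller complexity bound b.
    print(f"lambda = {lam:5.1f}: C(theta_hat) = {np.sum(theta_hat**2):.3f}")
```

Note how the fitted complexity $C(\widehat{\theta})$ shrinks as $\lambda$ grows, matching the correspondence between $\lambda$ and the bound $b$.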
Example: Support Vector Machine #
Consider the two-class classification problem with $Y \in \{-1,1\}$, but extend the label set to $\mathcal{Y}=\mathbb{R}$ so that prediction rules may take any real value. Lemma 2.1 in Blanchard et al. (2008) states that the Bayes classifier is also ideal for the hinge loss function, given by
$$\begin{align*}\ell_{\text{H}}(f(x),y)&=\max\{0,1-y\cdot f(x)\}\\&=\phi_{\text{H}}(y\cdot f(x))\end{align*}$$ with the convex function $$\phi_{\text{H}}(x)=\max\{0,1-x\}=:(1-x)_{+}.$$
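A short numeric sketch of the hinge loss, compared with the $0$-$1$ loss it upper-bounds; the margin values and the label below are made up for illustration.

```python
# Numeric sketch of the hinge loss ell_H(f(x), y) = (1 - y * f(x))_+
# against the 0-1 loss; the candidate scores below are made-up values.
import numpy as np

def hinge_loss(score, y):
    """phi_H(y * f(x)) = max(0, 1 - y * f(x))."""
    return np.maximum(0.0, 1.0 - y * score)

def zero_one_loss(score, y):
    """1{sign(f(x)) != y}; the hinge loss is a convex upper bound on it."""
    return (np.sign(score) != y).astype(float)

scores = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # candidate values of f(x)
y = 1.0                                          # true label
print("hinge:", hinge_loss(scores, y))       # [3.  1.5 1.  0.5 0. ]
print("0-1  :", zero_one_loss(scores, y))    # [1. 1. 1. 0. 0.]
```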
Now use the hinge loss function $\ell_{\text{H}}$ and restrict the prediction rules to linear functions of the form $$f(x;\beta,\beta_0)=\beta^Tx+\beta_0,$$ whose complexity is measured through $C(\beta,\beta_0)=\left\|\beta \right\|^2.$ The penalized method then minimizes $$\frac{1}{n}\sum_{i=1}^{n}\left(1-Y_i\left(\beta^TX_i+\beta_0 \right)\right)_++\lambda\left\|\beta \right\|^2$$ over the parameters $\beta$ and $\beta_0$.
This is equivalent to the soft-margin support vector machine algorithm that minimizes $$\lambda \left\|\beta \right\|^2+\frac{1}{n}\sum_{i=1}^{n}\xi_i$$ over the parameters $\beta$, $\beta_0$ and the slack variables $$\{\xi_i:i=1,\ldots,n\}$$ subject to the constraints $$Y_i(\beta^TX_i+\beta_0)\geq 1-\xi_i,\\ \xi_i\geq 0,\quad \forall i=1,\ldots,n.$$
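The sketch below minimizes the penalized hinge objective of this example by subgradient descent and then recovers the slack variables via $\xi_i=(1-Y_i(\beta^TX_i+\beta_0))_+$ to evaluate the equivalent soft-margin objective; the optimizer, the simulated data, and the step-size schedule are assumptions, not part of the formulation.

```python
# A sketch of the penalized hinge objective from the example, minimized by
# plain subgradient descent (the optimizer, data, step size and iteration
# count are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 200, 2, 0.1
X = rng.normal(size=(n, p))
Y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)

beta, beta0 = np.zeros(p), 0.0
for t in range(1, 2001):
    margin = Y * (X @ beta + beta0)
    active = margin < 1                       # points with positive hinge loss
    # Subgradient of (1/n) sum (1 - Y_i(beta^T X_i + beta_0))_+ + lam ||beta||^2.
    g_beta = -(Y[active][:, None] * X[active]).sum(axis=0) / n + 2 * lam * beta
    g_beta0 = -Y[active].sum() / n
    step = 1.0 / (lam * (t + 10))             # a decaying step-size schedule
    beta -= step * g_beta
    beta0 -= step * g_beta0

# Equivalence with the soft-margin formulation: at the optimum the slack
# variables are xi_i = (1 - Y_i(beta^T X_i + beta_0))_+, which satisfy
# both constraints, so the two objectives coincide.
xi = np.maximum(0.0, 1.0 - Y * (X @ beta + beta0))
objective = lam * beta @ beta + xi.mean()
print(f"penalized / soft-margin objective value: {objective:.3f}")
```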