We are interested in forecasting some target variable $Y$ using $d$ features $X_1$,$\ldots$,$X_d$.

Let $Y$ belong to a **label set** $\mathcal{Y}$, and the vector of features $X=(X_1,\ldots,X_d)$ belong to a **domain set** $\mathcal{X}$.

Examples of label sets include:

- A set of only two elements such as $\{0,1\}$ or $\{-1,+1\}$. The binary labels usually represent a YES/NO decision.
- A bounded interval on the real line that contains infinitely many elements, such as $[-M,M]$ for some $M>0$.
- A unbounded interval such as $(-\infty,+\infty)$.
- A set of multivariate labels such as $$\mathcal{Y}=\{(0,0),(0,1),(1,0),(1,1)\}.$$ Consider a bivariate target $Y=(Y_1,Y_2)$ and where $Y_1$ and $Y_2$ are two dummy variables.

Given any realization of $X$, denoted by $x\in\mathcal{X}$, we need to make a prediction $f(x)$ of the target variable $Y$ *before* observing it. Any function $f$ serving this purpose is a prediction rule, formally defined as follows.

Prediction Rule#A prediction rule is a function $f:\mathcal{X}\rightarrow \mathcal{Y}$ that maps the domain set to the label set.

*After* observing the target value $Y$, we quantify the predictive loss by

$$\ell(f(X),Y)$$

where $\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow [0,\infty)$ is some **loss function** specified by the user. A larger loss implies a worse prediction performance, and the loss remains *non-negative* such that
$$ \ell (f(X), Y)\geq \ell (Y,Y)=0.$$

Depending on the label set and the purpose of learning, one may choose different loss functions. Two important examples are the squared loss and zero-one loss functions.

For a real-valued variable or vector $Y$, being continuous or discrete, such that the magnitude of estimation error is essential, we often use the (half) squared loss.

Squared Loss#$$\ell_{2}(\widehat{y},y)=\frac{1}{2}(\widehat{y}-y)^2$$ or, more generally, $$\ell_{2}(\widehat{y},y)=\frac{1}{2}\left\|\widehat{y}-y\right\|^2.$$ The scaling factor $\frac{1}{2}$ is unnecessary, but we keep it to simplify the expression for its derivative.

For a categorical variable $Y\in \mathcal{Y}=\{\mathcal{C}_1,\ldots,\mathcal{C}_K\}$ with no ordering of labels, the squared loss is not particularly meaningful. For example, think of a 4-class variable with $\mathcal{Y}=\{1,2,3,4\}$, but the numbers are just to distinguish different labels. The labels 2 and 4 are equally ‘different’ from 1. If what matters for the prediction is whether the predicted label is correct or not, it is more natural to consider the zero-one loss.

Zero-One Loss#$$\ell_{0-1}(\widehat{y},y)= \begin{cases}1& \widehat{y}\neq y\\0& \widehat{y}= y.\end{cases}$$

Once given the loss function, we want to find an optimal prediction rule. Note that we cannot avoid making errors by always choosing $f(X)=Y$ because we shall consider that $X$ and $Y$ are both random and not perfectly dependent.

Furthermore, when $X$ and $Y$ are treated as random vectors (or random variables), the loss $\ell(f(X),Y)$ also becomes random. To summarize the forecast performance over random events, we often measure a nonrandom error of a prediction rule $f$ through the *population* risk function given by
$$
L_{\mathcal{D}}(f)=\mathbb{E}[\ell(f(X),Y)], \quad f\in\mathcal{F},
$$

where

- $\mathcal{F}$ is a set of candidate prediction rules; if $\mathcal{F}$ is unspecified, take it as large as possible (there are no restrictions).
- $\mathcal{D}$ denotes the joint distribution of the target variable(s) $Y$ and features $X$.
- $\mathbb{E}$ denotes the expectation operator.

Ideal Prediction Rule#For a given distribution $\mathcal{D}$ and a loss function $\ell$, a prediction rule $f^{*}\in\mathcal{F}$ is ideal in class $\mathcal{F}$ if it has the lowest loss: $$L_{\mathcal{D}}(f^{*})\leq L_{\mathcal{D}}(f),\quad\text{for all}~ f\in\mathcal{F}.$$

The ideal prediction rule for the squared loss and zero-one loss are called the regression function and the Bayes classifier respectively.

Regression Function#An ideal prediction rule for the (half) squared loss function is the

regression function$$\mu(x)=\mathbb{E}[Y\mid X=x].$$

Bayes classifier#An ideal prediction rule for the zero-one loss with a categorical target $Y\in\{\mathcal{C}_1,\ldots,\mathcal{C}_K\}$ is the

Bayes classifiergiven by$$\begin{align*}&C^{\operatorname{Bayes}}(x)\\&=\underset{\mathcal{C}_k\in\mathcal{Y}}{\operatorname{argmax}}~\mathbb{P}(Y=\mathcal{C}_k\mid X=x)\end{align*}$$

which chooses the label with the largest posterior probability (break ties arbitrarily).

The ideal prediction rule is not necessarily unique, strictly speaking. Nevertheless, we can usually avoid such identification issues by properly restricting candidate functions. Therefore, unless specified otherwise, we shall assume that the ideal prediction rule is *unique* with an appropriate choice of the space $\mathcal{F}$ of candidate prediction rules.