Additive Regression Models #
Consider a class of base learners $\mathcal{G}$, in particular stumps, such that $g\in\mathcal{G}$ implies $w\cdot g\in\mathcal{G}$ for all constants $w\in (-\infty,\infty)$.
Suppose we want to estimate the regression function $\mu(x)=\mathbb{E}[Y\mid X=x]$ by some prediction rule $f\in\operatorname{span}(\mathcal{G})$ of the additive form $$f(x)=\sum_{m=1}^{M}f_m(x),\quad f_m\in\mathcal{G}$$ for some (large) number $M$, which is treated as a hyperparameter.
Let us take stumps $$\begin{align*}f_m(x)=&g(x;j_m,\theta_{m},c_{m1},c_{m2})\\=&c_{m1}\mathbf{1}[x_{j_m}<\theta_m]+c_{m2}\mathbf{1}[x_{j_m}\geq \theta_m].\end{align*}$$ Then the empirical quadratic risk is given by $$\begin{align*} &L_S(f)\\=&\frac{1}{2n}\sum_{i=1}^{n}(f(X_i)-Y_i)^2\\=&\frac{1}{2n}\sum_{i=1}^{n}\left( \sum_{m=1}^{M}f_m(X_i)-Y_i\right)^2\\=&\frac{1}{2n}\sum_{i=1}^{n}\Bigg[ \sum_{m=1}^{M}\Big(c_{m1}\mathbf{1}[X_{ij_m}<\theta_m]\\&\qquad\qquad+c_{m2}\mathbf{1}[X_{ij_m}\geq\theta_m]\Big)-Y_i\Bigg] ^2.\end{align*}$$
It is difficult to minimize this error function simultaneously with respect to all $4M$ parameters $$\{j_m,\theta_m,c_{m1},c_{m2}:m=1,\ldots,M\}.$$ Even if we are willing to ignore the computational cost, the estimator $\widehat{f}(x)$ may suffer from the curse of dimensionality, meaning that its statistical performance can be poor for large $M$.
Note that the problem can get even worse when considering more complicated base learners such as trees with $11$ terminal nodes.
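For concreteness, here is a minimal NumPy sketch of the stump parameterization above; the function name `stump_predict` and its signature are illustrative choices, not taken from any library.

```python
import numpy as np

def stump_predict(X, j, theta, c1, c2):
    """Evaluate the stump g(x; j, theta, c1, c2) = c1*1[x_j < theta] + c2*1[x_j >= theta]
    row-wise for a data matrix X of shape (n, d)."""
    return np.where(X[:, j] < theta, c1, c2)
```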
Greedy Stagewise Approach #
Least-squares estimation in the naive case $M=1$, however, is easy, as the squared error $$\begin{align*} L_S(f_1)=&\frac{1}{2n}\sum_{i=1}^{n}\Big( c_{11}\mathbf{1}[X_{ij_1}<\theta_1]\\&\qquad+c_{12}\mathbf{1}[X_{ij_1}\geq\theta_1]-Y_i\Big) ^2\end{align*}$$ involves only four parameters $(j_1,\theta_1,c_{11},c_{12})\in \{1,\ldots,d\}\times \mathbb{R}^3$.
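Concretely, for a fixed split $(j_1,\theta_1)$ the optimal constants $c_{11},c_{12}$ are simply the averages of the targets on the two sides of the split, so the four-parameter problem can be solved by scanning all features and all observed thresholds. Below is a minimal sketch, building on `stump_predict` above; the helper name `fit_stump` is again an illustrative choice.

```python
def fit_stump(X, y):
    """Exhaustive least-squares fit of a single stump.

    For every feature j and candidate threshold theta taken from the observed
    feature values, the optimal constants are the averages of y on the two
    sides of the split; the combination with the smallest SSE wins."""
    d = X.shape[1]
    best_params, best_sse = None, np.inf
    for j in range(d):
        for theta in np.unique(X[:, j])[1:]:   # skip the minimum: it leaves the left side empty
            left = X[:, j] < theta
            c1, c2 = y[left].mean(), y[~left].mean()
            sse = ((y[left] - c1) ** 2).sum() + ((y[~left] - c2) ** 2).sum()
            if sse < best_sse:
                best_params, best_sse = (j, theta, c1, c2), sse
    return best_params                          # (j, theta, c1, c2)
```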
This motivates a sequential optimization algorithm. Now let us consider a large $M$, say $M=500$, but assume that all the base learners $f_1,\ldots,f_{M-1}$ are already given, leaving only the last one, $f_M$, to be determined. Then
$$\begin{align*} &\frac{1}{2n}\sum_{i=1}^{n}\left( \sum_{m=1}^{M}f_m(X_i)-Y_i\right)^2\\=& \frac{1}{2n}\sum_{i=1}^{n}\left( f_M(X_i)-\widetilde{Y}_i^{(M-1)}\right)^2 \end{align*}$$ with known residuals $$\widetilde{Y}_i^{(M-1)}=Y_i-\sum_{m=1}^{M-1}f_m(X_i)$$ using the given $f_1,\ldots,f_{M-1}$. Solving for $f_M$ is then as easy as in the case $M=1$.
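As a hedged illustration, the residual step might look as follows, reusing the helpers sketched above together with a hypothetical list `fitted_stumps` holding the parameter tuples of $f_1,\ldots,f_{M-1}$:

```python
# Residuals with respect to the already-given stumps f_1, ..., f_{M-1}.
residuals = y - sum(stump_predict(X, *params) for params in fitted_stumps)
# The last stump is then fitted exactly as in the M = 1 case.
f_M = fit_stump(X, residuals)
```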
This sequential optimization algorithm leads to the least-squares boosting method:
Least-Squares Boosting #
- Initialize $F^{(0)}(x)$ with zero or a constant $F^{(0)}(x)=\bar{Y}$, where $\bar{Y}$ denotes the sample average of the target values.
- For $m = 1~\text{to}~M$ do:
    - Calculate the residuals $$\widetilde{Y}_i^{(m-1)}=Y_i-F^{(m-1)}(X_i)$$ for all $1\leq i\leq n$.
    - Fit a base learner to the residuals: $$g_m=\underset{g\in\mathcal{G}}{\operatorname{argmin}}\frac{1}{2n}\sum_{i=1}^{n}\left( g(X_i)-\widetilde{Y}_i^{(m-1)}\right)^2$$
    - Accumulate the base learners: $$F^{(m)}=F^{(m-1)}+g_m$$
- Output $\widehat{f}(x)=F^{(M)}(x)$.
Least-squares boosting is a stagewise method in the sense that each new base learner does not change the estimates of the earlier ones. It is greedy because it fits the best possible base learner at every step.
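The whole procedure fits into a few lines. The following is a minimal sketch, assuming the `fit_stump` and `stump_predict` helpers from above; the function names and the default $M=500$ are illustrative.

```python
def ls_boost(X, y, M=500):
    """Least-squares boosting with stumps; returns the initial constant
    F^{(0)} and the list of greedily fitted stumps."""
    F0 = y.mean()                                # F^{(0)}(x) = sample average
    current = np.full(len(y), F0)                # current fit F^{(m-1)}(X_i)
    stumps = []
    for m in range(M):
        residuals = y - current                  # tilde Y_i^{(m-1)}
        params = fit_stump(X, residuals)         # greedy choice of g_m
        stumps.append(params)
        current += stump_predict(X, *params)     # F^{(m)} = F^{(m-1)} + g_m
    return F0, stumps

def boost_predict(X, F0, stumps):
    """Evaluate F^{(M)}(x) = F^{(0)} + sum_m g_m(x)."""
    return F0 + sum(stump_predict(X, *params) for params in stumps)
```

New inputs are then scored with `boost_predict(X_new, F0, stumps)`, which simply sums the committee of stumps.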
Regularization #
A notable feature of boosting is that overfitting sets in slowly, since each individual weak learner can change the committee prediction only slightly. After too many steps, however, the method can still overfit. To trade off estimation bias and variance, we can add each new base learner subject to a shrinkage factor, say $\eta\in (0,1]$:
$$F^{(m)}=F^{(m-1)}+f_m=F^{(m-1)}+\eta\cdot g_m.$$
This hyperparameter $\eta$ is called the learning rate.
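In code, only the accumulation step changes. Here is a sketch, again building on the helpers above, with an illustrative default of $\eta=0.1$:

```python
def ls_boost_shrunk(X, y, M=500, eta=0.1):
    """Least-squares boosting with shrinkage, F^{(m)} = F^{(m-1)} + eta * g_m.
    eta = 1 recovers the unregularized version; smaller values slow the fit
    down and typically call for a larger M."""
    F0 = y.mean()
    current = np.full(len(y), F0)
    stumps = []
    for m in range(M):
        residuals = y - current
        j, theta, c1, c2 = fit_stump(X, residuals)
        stumps.append((j, theta, eta * c1, eta * c2))   # store the shrunken stump eta * g_m
        current += stump_predict(X, *stumps[-1])
    return F0, stumps
```

Note that scaling $(c_{m1},c_{m2})$ by $\eta$ is the same as adding $\eta\cdot g_m$, because $\mathcal{G}$ is closed under scalar multiplication, so `boost_predict` from above can evaluate the shrunken model unchanged.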