Boosting
Weak Learner
The idea of boosting is to build a committee learner $f(x)$ by aggregating many weak learners $f_m(x)$. A weak learner should produce estimates better than a trivial prediction rule (such as a constant regression function or random labeling) by exploiting some information from the features. However, we do not require a weak learner to fit an arbitrary regression function or Bayes classifier. Here, we discuss two common choices of weak learners.
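As a concrete sketch, a regression stump (a single-split decision tree) is a typical weak learner. The hypothetical helper below, which is illustrative and not part of the text, fits the split threshold and the two leaf constants of a stump by least squares:

```python
import numpy as np

def fit_stump(x, y):
    """Fit a regression stump c1*1[x < theta] + c2*1[x >= theta]
    to one feature by least squares, scanning candidate thresholds."""
    best = None
    for theta in np.unique(x):
        left, right = y[x < theta], y[x >= theta]
        if len(left) == 0 or len(right) == 0:
            continue  # the split must put data on both sides
        c1, c2 = left.mean(), right.mean()  # least-squares leaf constants
        sse = ((left - c1) ** 2).sum() + ((right - c2) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, theta, c1, c2)
    return best  # (sse, theta, c1, c2)
```

Even such a crude rule beats a constant prediction whenever the feature carries information about $Y$, which is all the weak-learner requirement asks for.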
Least Squares Boosting
Additive Regression Models

Consider a base class $\mathcal{G}$, in particular stumps, such that $g\in\mathcal{G}$ implies that $w\cdot g\in\mathcal{G}$ for all constants $w\in (-\infty,\infty)$. Suppose we want to estimate the regression function $\mu(x)=\mathbb{E}[Y\mid X=x]$ by some prediction rule $f\in\operatorname{span}(\mathcal{G})$ of the additive form given by $$f(x)=\sum_{m=1}^{M}f_m(x),\quad f_m\in\mathcal{G}$$ for some (large) number $M$ as a hyperparameter. Let us take stumps \begin{align*}f_m(x)=&g(x;j_m,\theta_{m},c_{m1},c_{m2})\\=&c_{m1}\mathbf{1}[x_{j_m}<\theta_m]+c_{m2}\mathbf{1}[x_{j_m}\geq \theta_m].\end{align*} Then the empirical quadratic risk is given by \begin{align*} &L_S(f)\\=&\frac{1}{2n}\sum_{i=1}^{n}(f(X_i)-Y_i)^2\\=&\frac{1}{2n}\sum_{i=1}^{n}\left( \sum_{m=1}^{M}f_m(X_i)-Y_i\right)^2\\=&\frac{1}{2n}\sum_{i=1}^{n}\Bigg[ \sum_{m=1}^{M}\Big(c_{m1}\mathbf{1}[X_{ij_m}<\theta_m]\\&\qquad\qquad+c_{m2}\mathbf{1}[X_{ij_m}\geq\theta_m]\Big)-Y_i\Bigg] ^2. \end{align*}

Log Odds

Consider a two-class classification problem with the label set $\mathcal{Y}=\{-1,1\}$. Define the log odds function by $$a(x)=\log\frac{\mathbb{P}(Y=1\mid X=x)}{\mathbb{P}(Y=-1\mid X=x)},$$ which equals the logarithm of the odds ratio $\mathbb{P}(Y=1\mid X=x)/\mathbb{P}(Y=-1\mid X=x)$. By construction, $$a(x)>0 \Leftrightarrow \mathbb{P}(Y=1\mid X=x)>\mathbb{P}(Y=-1\mid X=x)$$ and $$a(x)<0 \Leftrightarrow \mathbb{P}(Y=1\mid X=x)<\mathbb{P}(Y=-1\mid X=x).$$ Therefore, the Bayes classifier equals the sign of the log odds: $$C^{\operatorname{Bayes}}(x)=\operatorname{sign}(a(x)).$$ Note that $a:\mathcal{X}\rightarrow \mathbb{R}$ is a real-valued function that is usually continuous. This brings us back to the unified view of classification and regression problems.

AdaBoost

Additive Logistic Models

Consider a two-class classification problem with $Y\in\{-1,1\}$.
Take a base class $\mathcal{G}$, in particular decision stumps, such that $g\in\mathcal{G}$ implies that $-g\in\mathcal{G}$. We want to estimate the log odds function $a(x)$ by a prediction rule $f\in\operatorname{span}(\mathcal{G})$ of the additive form $$f(x)=\sum_{m=1}^{M}f_m(x)=\sum_{m=1}^{M}\alpha_m g_m(x),$$ where $g_m:\mathcal{X}\rightarrow \{-1,1\}$ are base classifiers or weak classifiers such that $g_m\in\mathcal{G}$, and $\alpha_m>0$ are positive weights. $M$ is some (large) hyperparameter.
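The additive classifier $f(x)=\sum_m \alpha_m g_m(x)$ can be fitted greedily by the standard AdaBoost recursion, in which $\alpha_m=\tfrac{1}{2}\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}$ and the sample weights are exponentially reweighted. Below is a minimal sketch for a single feature with sign-stump base classifiers $g(x)=s\cdot\operatorname{sign}(x-\theta)$; the function names are illustrative, not from the text:

```python
import numpy as np

def fit_weak_clf(x, w, y):
    """Best weighted stump g(x) = s if x >= theta else -s, with s in {-1, 1}."""
    best = None
    for theta in np.unique(x):
        for s in (1.0, -1.0):
            pred = np.where(x >= theta, s, -s)
            err = w[pred != y].sum()  # weighted 0-1 error
            if best is None or err < best[0]:
                best = (err, theta, s)
    return best

def adaboost(x, y, M=10):
    """Greedy fit of f(x) = sum_m alpha_m g_m(x); returns [(alpha, theta, s)]."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # uniform initial sample weights
    model = []
    for _ in range(M):
        err, theta, s = fit_weak_clf(x, w, y)
        err = max(err, 1e-12)  # guard against a perfectly fitting stump
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = np.where(x >= theta, s, -s)
        w = w * np.exp(-alpha * y * pred)  # upweight misclassified points
        w /= w.sum()
        model.append((alpha, theta, s))
    return model

def predict(model, x):
    """Committee vote: the sign of the weighted sum of base classifiers."""
    f = sum(a * np.where(x >= th, s, -s) for a, th, s in model)
    return np.sign(f)
```

Note how the sign of $f$ plays the role of the Bayes classifier $\operatorname{sign}(a(x))$: AdaBoost can be viewed as estimating (a multiple of) the log odds by an additive model and then thresholding it at zero.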