Boosting

Weak Learner: The idea of boosting is to generate a committee learner $f(x)$ by aggregating many weak learners $f_m(x)$. A weak learner should generate estimates better than a trivial prediction rule (such as a constant regression function or random labeling) by exploiting some information from the feature(s). However, we do not require a weak learner to fit an arbitrary regression function or Bayes classifier. Here, we discuss two common choices of weak learners. ...Learn More>
Least Squares Boosting: Additive Regression Models # Consider a base $\mathcal{G}$, in particular stumps, such that $g\in\mathcal{G}$ implies that $w\cdot g\in\mathcal{G}$ for all constants $w\in (-\infty,\infty)$. Suppose we want to estimate the regression function $\mu(x)=\mathbb{E}[Y\mid X=x]$ by some prediction rule $f\in\operatorname{span}(\mathcal{G})$ of the additive form given by $$f(x)=\sum_{m=1}^{M}f_m(x),\quad f_m\in\mathcal{G}$$ for some (large) number $M$ as a hyperparameter. Let us take stumps $$\begin{align*}f_m(x)=&g(x;j_m,\theta_{m},c_{m1},c_{m2})\\=&c_{m1}\mathbf{1}[x_{j_m}<\theta_m]+c_{m2}\mathbf{1}[x_{j_m}\geq \theta_m].\end{align*}$$ Then the empirical qudratic risk is given by $$\begin{align*} &L_S(f)\\=&\frac{1}{2n}\sum_{i=1}^{n}(f(X_i)-Y_i)^2\\=&\frac{1}{2n}\sum_{i=1}^{n}\left( \sum_{m=1}^{M}f_m(X_i)-Y_i\right)^2\\=&\frac{1}{2n}\sum_{i=1}^{n}\Bigg[ \sum_{m=1}^{M}\Big(c_{m1}\mathbf{1}[X_{ij_m}<\theta_m]\\&\qquad\qquad+c_{m2}\mathbf{1}[X_{ij_m}\geq\theta_m]\Big)-Y_i\Bigg] ^2. ...Learn More>
Log Odds: Consider a two-class classification problem with the label set $\mathcal{Y}=\{-1,1\}$. Define the log odds function by $$a(x)=\log\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y=-1|X=x)},$$ which equals to the logarithm of the odds ratio $$\mathbb{P}(Y=1|X=x)/\mathbb{P}(Y=-1|X=x).$$ By construction, $$\begin{align*}&a(x)>0\\&\Leftrightarrow \mathbb{P}(Y=1|X=x)>\mathbb{P}(Y=-1|X=x)\end{align*}$$ and $$\begin{align*}&a(x)<0\\&\Leftrightarrow \mathbb{P}(Y=1|X=x)<\mathbb{P}(Y=-1|X=x).\end{align*}$$ Therefore, the Bayes classifier is equal to the sign of the log odds: $$C^{\operatorname{Bayes}}(x)=\operatorname{sign}(a(x)).$$ Note that $a:\mathcal{X}\rightarrow \mathbb{R}$ is a real-valued function that is usually continuous. This brings us back to the unified view of classification and regression problems. ...Learn More>
AdaBoost: Additive Logistic Models # Consider a two-class classification problem with $Y\in\{-1,1\}$. Take a base $\mathcal{G}$, in particular decision stumps, such that $g\in\mathcal{G}$ implies that $-g\in\mathcal{G}$. We want to estimate the log odds function $a(x)$ using a prediction rule $f\in\operatorname{span}(\mathcal{G})$ obeying an additive form given by $$f(x)=\sum_{m=1}^{M}f_m(x)=\sum_{m=1}^{M}\alpha_m g_m(x),$$ where $g_m:\mathcal{X}\rightarrow \{-1,1\}$ are base classifiers or weak classifiers such that $g_m\in\mathcal{G}$; and $\alpha_m>0$ are positive weights. $M$ is some (large) hyperparameter. ...Learn More>