Consider a two-class classification problem with $Y\in\{-1,1\}$. Take a base $\mathcal{G}$, in particular decision stumps, such that $g\in\mathcal{G}$ implies that $-g\in\mathcal{G}$. We want to estimate the log odds function $a(x)$ using a prediction rule $f\in\operatorname{span}(\mathcal{G})$ obeying an additive form given by

$$f(x)=\sum_{m=1}^{M}f_m(x)=\sum_{m=1}^{M}\alpha_m g_m(x),$$ where

• $g_m:\mathcal{X}\rightarrow \{-1,1\}$ are base classifiers or weak classifiers such that $g_m\in\mathcal{G}$; and
• $\alpha_m>0$ are positive weights.
• $M$ is some (large) hyperparameter.

One may interpret our estimator $f(x)$ as a voting system: each classifier $g_m$ represents a voting machine with weak predictive power, and $\alpha_m$ represents its voting power. The function $f(x)$ is a weighted ensemble of the base classifiers, resulting an estimated Bayes classifier $$h(x)=\operatorname{sign}(f(x))$$ as a weighted majority vote given by $$h(x)=\operatorname{sign}\left(\sum_{m=1}^{M}\alpha_m g_m(x)\right).$$

Greedy Stagewise Approach for Exponential Error #

Suppose we want to estimate the log odds function by minimizing the empirical exponential risk function given by \begin{align*}&L_S(f)\\=&\frac{1}{n}\sum_{i=1}^{n}\exp\left(-\frac{1}{2}f(X_i)\cdot Y_i\right)\\=&\frac{1}{n}\sum_{i=1}^{n}\prod_{m=1}^{M}\exp\left(-\frac{1}{2}f_m(X_i)\cdot Y_i\right)\\=&\frac{1}{n}\sum_{i=1}^{n}\prod_{m=1}^{M}\exp\left(-\frac{1}{2}\alpha_m g_m(X_i)\cdot Y_i\right)\end{align*} where the last two steps are due to the additive form of $f(x)$.

It is hard to optimize this error function over all base functions $g_m$ and weights $\alpha_m$ simultaneously. This is a similar issue as that for the least-squares estimation for additive regression models.

In the same spirit of least-squares boosting, one may use sequential optimization techniques to generate the base learners $f_m$ one-by-one. Let $M\geq 1$ and suppose that the base learners $f_m=\alpha_m g_m$ are given (both $\alpha_m$ and $g_m$ are known) for $1\leq m\leq M-1$ except for the last one $f_M=\alpha_Mg_M$. Then we can rewrite the exponential error

\begin{align*}&\frac{1}{n}\sum_{i=1}^{n}\prod_{m=1}^{M}\exp\left(-\frac{1}{2}\alpha_m g_m(X_i)Y_i\right)\\=&\sum_{i=1}^{n}\widetilde{w}_{i}^{(M)}\exp\left(-\frac{1}{2}\alpha_M g_M(X_i)Y_i\right)\end{align*} as a weighted sum of exponential losses for only one base learner with known weights

$$\widetilde{w}_i^{(M)}=\frac{1}{n}\prod_{m=1}^{M-1}\exp\left(-\frac{1}{2}\alpha_m g_m(X_i)Y_i\right).$$

Note that the weights $\widetilde{w}_i^{(M)}$ are not yet normalized in the sense that they do not sum up to 1 in general. We shall normalize them to satisfy the axioms of probabiliy and define the normalized weights as

$$w_i^{(M)}=\frac{\widetilde{w}_i^{(M)}}{\sum_{i=1}^{n}\widetilde{w}_i^{(M)}},\quad 1\leq i\leq n.$$

By construction, $w_i^{(M)}>0$ for all $1\leq i\leq n$, and

$$\sum_{i=1}^{n}w_i^{(M)}=1.$$

Substituting the weights $\widetilde{w}_{i}^{(M)}$ with $w_i^{(M)}$ yields a slightly different error function

$$L_S^{(M)}(\alpha,g)=\sum_{i=1}^{n}w_{i}^{(M)}\exp\left(-\frac{1}{2}\alpha g(X_i)Y_i\right),$$ whose minimizer is the next base learner given by $$(\alpha_M,g_M)=\underset{\alpha>0, g\in\mathcal{G}}{\operatorname{argmin}}~L_S^{(M)}(\alpha,g)$$

Note that the choice between weights $\widetilde{w}_{i}^{(M)}$ and $w_i^{(M)}$ does not influence the minimizer $(\alpha_M,g_M)$.

Now, using the greedy stagewise approach we obtain the Adaptive Boosting (AdaBoost) algorithm:

• Initialize $F^{(0)}(x)=0$ and weights $w_i^{(1)}=1/n$.
• For $m=1~\text{to}~M$ do:
1. Fit a base regression function $f_m(x)=\alpha_m g_m(x)\in\mathcal{B}$ such that $$(\alpha_m,g_m)=\underset{\alpha>0, g\in\mathcal{G}}{\operatorname{argmin}}~L_S^{(m)}(\alpha,g).$$
2. Accumulate the base learners $$F^{(m)}=F^{(m-1)}+f_m=F^{(m-1)}+\alpha_m g_m.$$ If $\alpha_m=0$, one may terminate the for loop immediately and output $F^{(M)}=F^{(m-1)}$.
3. If $m<M$, update and normalize the weights $w_i^{(m+1)}.$
• Output $\widehat{f}(x)=F^{(M)}(x)$.
Above is a sketch of AdaBoost. We shall explain how to solve each base learner and update the weights in details.

AdaBoost: Solving the Base Learner #

To solve the base learner, one need to use an important identity for binary variables $g(X_i), Y_i\in\{-1,1\}$:

$$g(X_i)\cdot Y_i=\begin{cases}-1 & g(X_i)\neq Y_i\\1&g(X_i)=Y_i\end{cases}.$$

Hence, we can rewrite the exponential loss as

\begin{align*}&\exp\left(-\frac{1}{2}\alpha g(X_i)Y_i\right)\\=&\begin{cases}e^{\frac{1}{2}\alpha}& g(X_i)\neq Y_i\\e^{-\frac{1}{2}\alpha}&g(X_i)=Y_i\end{cases}\\=&e^{\frac{1}{2}\alpha}\mathbf{1}[g(X_i)\neq Y_i]\\&+e^{-\frac{1}{2}\alpha}\mathbf{1}[g(X_i) = Y_i]\end{align*}

Note that $$\mathbf{1}[g(X_i)\neq Y_i]+\mathbf{1}[g(X_i)=Y_i]=1$$ or, equivalently, $$\mathbf{1}[g(X_i)=Y_i]=1-\mathbf{1}[g(X_i)\neq Y_i].$$

Subsituting this back to the last expression of the exponential loss yields that \begin{align*}&\exp\left(-\frac{1}{2}\alpha g(X_i)Y_i\right)\\=&e^{\frac{1}{2}\alpha}\mathbf{1}[g(X_i)\neq Y_i]\\&+e^{-\frac{1}{2}\alpha}(1-\mathbf{1}[g(X_i)\neq Y_i])\\=&\left(e^{\frac{1}{2}\alpha}-e^{-\frac{1}{2}\alpha}\right)\mathbf{1}[g(X_i)\neq Y_i]+e^{-\frac{1}{2}\alpha}\end{align*}

Summing up the losses with the normalized weights $w_i^{(m)}$,

\begin{align*}&\sum_{i=1}^{n}w_i^{(m)}\exp\left(-\frac{1}{2}\alpha g(X_i)Y_i\right)\\=&\left(e^{\frac{1}{2}\alpha}-e^{-\frac{1}{2}\alpha}\right)\underbrace{\sum_{i=1}^{n}w_i^{(m)}\mathbf{1}[g(X_i)\neq Y_i]}_{J_m(g)}+e^{-\frac{1}{2}\alpha},\end{align*}

where $J_m(g)$ is the weighted classification error.

The last expression is important because it allows us to optimize the base classifier $g$ independently:

$$g_m=\underset{g\in\mathcal{G}}{\operatorname{argmin}}~J_m(g).$$ This step can be solved by a standard computer package for fitting stumps, for example.

After solving $g_m$ and obtaining the minimum weighted classification error $$\epsilon_m:=J_m(g_m)=\min \{J_m(g): g\in\mathcal{G}\},$$ the error function become a univariate function of $\alpha$, namely, $$\left(e^{\frac{1}{2}\alpha}-e^{-\frac{1}{2}\alpha}\right)\epsilon_m+e^{-\frac{1}{2}\alpha}.$$ By solving the first-order condition one can obtain $$\alpha_m=\log \frac{1-\epsilon_m}{\epsilon_m}.$$ Note that $\alpha_m\geq 0$ as $\epsilon_m\leq 1/2$ by definition. When $\alpha_m=0$ with $\epsilon_m=1/2$, there will be no update in the ensemble estimator nor in the weights and so we might terminate the algorithm immediately.

In principle, one can re-calculate the weights $w_i^{(m)}$ for every $m$ without storing the weights in previous steps. However, using a recursive algorithm to update the weights slightly each time can save significant time costs.

We start with the improper weights $\widetilde{w}_i^{(m)}$ obeying a recursive relation given by

$$\widetilde{w}_i^{(m+1)}=\widetilde{w}_i^{(m)}\cdot \exp\left(-\frac{1}{2}\alpha_m g_m(X_i)Y_i\right).$$

Using again that for $g_m(X_i), Y_i\in\{-1,1\}$

\begin{align*}g_m(X_i)\cdot Y_i=&\begin{cases}-1 & g_m(X_i)\neq Y_i\\1&g_m(X_i)=Y_i\end{cases}\\=&1-2\cdot \mathbf{1}[g_m(X_i)\neq Y_i]\end{align*},

we can rewrite that

\begin{align*}&\widetilde{w}_i^{(m+1)}\\=&\widetilde{w}_i^{(m)}\cdot \exp\left(-\frac{1}{2}\alpha_m (1-2\cdot \mathbf{1}[g_m(X_i)\neq Y_i])\right)\\=&\widetilde{w}_i^{(m)}\exp(\alpha_m\mathbf{1}[g_m(X_i)\neq Y_i])\cdot e^{-\frac{\alpha_m}{2}}\\=&w_i^{(m)}\exp(\alpha_m\mathbf{1}[g_m(X_i)\neq Y_i])\cdot \delta_m\end{align*} where $$\delta_m=\sum_{i=1}^{n}\widetilde{w}_i^{(m)}\cdot e^{-\frac{\alpha_m}{2}}.$$ The scaling factor $\delta_m>0$ does not depend on $n$ and therefore it will disappear after normalization. In particular,

$$w_i^{(m+1)}=\frac{\widehat{w}_i^{(m)}\cdot \xcancel{\delta_m}}{\sum_{i=1}^{n}\widehat{w}_i^{(m)}\cdot \xcancel{\delta_m}}$$ where $$\widehat{w}_i^{(m)}=w_i^{(m)}\exp(\alpha_m\mathbf{1}[g_m(X_i)\neq Y_i]).$$

In programming, we might not want to create new variables $\widehat{w}_i^{(m)}$ but prefer to update $w_i^{(m)}$ twice in the code:

\begin{align*}w_i^{(m+1)}\leftarrow&~ w_i^{(m)}\exp(\alpha_m\mathbf{1}[g_m(X_i)\neq Y_i]),\\w_i^{(m+1)}\leftarrow&~\frac{w_i^{(m)}}{\sum_{i=1}^{n}w_i^{(m)}}.\end{align*}

The first step stores the values of $\widehat{w}_i^{(m)}$ temporarily in the variables $w_i^{(m)}$, and the second step completes the update of $w_i^{(m)}$.

• Initialize $F^{(0)}(x)=0$ and weights $w_i^{(1)}=1/n$.
• For $m=1~\text{to}~M$ do:
1. Fit a base classifier $$g_m=\underset{g\in\mathcal{G}}{\operatorname{argmin}}~J_m(g)$$ where $J_m$ is the weighted classification error $$J_m(g)=\sum_{i=1}^{n}w_i^{(m)}\mathbf{1}[g(X_i)\neq Y_i].$$
2. Calculate its voting weight $$\alpha_m=\log \frac{1-\epsilon_m}{\epsilon_m}$$ where $\epsilon_m=J_m(g_m)$ denotes the minimum weighted classification error.
3. Accumulate the base learners $$F^{(m)}=F^{(m-1)}+\alpha_m g_m.$$ If $\alpha_m=0$, one may terminate the for loop immediately and output $F^{(M)}=F^{(m-1)}$.
4. If $m<M$, update and normalize the example weights (for fitting the next base): \begin{align*}w_i^{(m+1)}\leftarrow& w_i^{(m)}\exp(\alpha_m\mathbf{1}[g_m(X_i)\neq Y_i])\\w_i^{(m+1)}\leftarrow&\frac{w_i^{(m+1)}}{\sum_{i=1}^{n}w_i^{(m+1)}}\end{align*}. Note that points that are misclassified by one of the base classifiers are given greater weight when used to train the next classifier in the sequence. One can only save the new weights $w_i^{(m+1)}$ to optimize memory usage.
• Output $\widehat{f}(x)=F^{(M)}(x)$.