#### Bagging Reduces Estimation Variance #

Bootstrap aggregating (or, in short, *bagging*) is a computationally intensive method for reducing the variance of unstable estimators. The idea is to aggregate the predictions by a *committee* of unstable estimators rather than using a single one:

- By bootstrapping we mean drawing $n^*$ data points randomly from the original database $$S=\{Y_i,X_i: i=1,\ldots, n\}.$$ We draw **with replacement**, allowing the same data point to occur more than once in the resampled dataset.
- Repeat the step above $M$ times to extract the bootstrap samples $$S_m^{*}=\{Y_i^{*},X_i^{*}: i=1,\ldots, n^{*}\},\quad m=1,\ldots,M.$$ Note that we are **not** generating new data points beyond the original ones; the data sets $S_m^{*}$ are therefore dependent, as they all draw from the same original database $S$.
- Fit a base estimator $\widehat{f}_m(x)$ to every bootstrap sample $S_m^{*}$ and output the ensemble estimator $$\widehat{f}(x)=\frac{1}{M}\sum_{m=1}^{M}\widehat{f}_m(x).$$

**Bagging uses $n^*=n$ unless specified otherwise.**
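The three steps above can be sketched in a few lines of code. This is a generic illustration, not a specific library's API; the `fit` callback is a hypothetical stand-in for fitting any base learner:

```python
import numpy as np

def bagging_predict(X, y, x_new, fit, M=100, rng=None):
    """Bag an unstable base estimator: average its predictions over M bootstrap fits.

    `fit` is a user-supplied function (a hypothetical placeholder) taking a
    bootstrap sample (X_sample, y_sample) and returning a predict(x) callable.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    preds = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)   # draw n* = n indices with replacement
        f_m = fit(X[idx], y[idx])          # fit base estimator to bootstrap sample S*_m
        preds.append(f_m(x_new))           # record f_m(x)
    return np.mean(preds, axis=0)          # ensemble average (1/M) * sum_m f_m(x)
```

Because the bootstrap samples all come from the same $S$, the $M$ fitted estimators are dependent; the averaging nevertheless damps the variability of each individual fit.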

#### Bagging Is Not Boosting #

Bagging and boosting are both ensemble learning methods that aggregate many base learners. However, there are at least five crucial differences:

- Bagging involves *randomization*, while (modern) boosting methods are usually deterministic algorithms.
- Bagging generates base estimators *without* ordering, whereas boosting generates them sequentially.
- Bagging uses *strong* base learners, in contrast to the weak base learners used by boosting; bagging performs very poorly with stumps.
- Bagging does *not* require an additive model.
- Bagging can run *forever*, with $M$ as large as possible, whereas boosting must be stopped early to avoid overfitting (although the overfitting happens slowly).

#### Bagging Is Smoothing #

Consider again our toy example of an unstable estimator. Let $\bar{X}_m^{*}$ denote the $m$-th bootstrap average with $n^*=n$. The bagging estimator for this example is given by

$$\widehat{f}(x)=\frac{1}{M}\sum_{m=1}^{M}\mathbf{1}[\bar{X}_m^*\leq x].$$

By the law of large numbers, **conditioning** on the original data set $S$, for large $M$
$$\widehat{f}(x)\approx \mathbb{P}(\bar{X}^*\leq x\mid S),$$
where the variable $\bar{X}^{*}$ represents a generic bootstrap average that has a conditional mean $\bar{X}$ given $S$.
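This conditional law-of-large-numbers approximation can be illustrated on the toy example. A minimal simulation sketch (the sample size, $M$, and seed are arbitrary choices), evaluated at $x=\bar{X}$ where the conditional probability should be roughly $1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 100, 5000
S = rng.normal(0.0, 1.0, size=n)          # original sample, X_i ~ N(mu, 1) with mu = 0
xbar = S.mean()

# M bootstrap averages, each computed from n* = n draws with replacement
boot_means = S[rng.integers(0, n, size=(M, n))].mean(axis=1)

# Bagged indicator (1/M) * sum_m 1[Xbar*_m <= x], evaluated at x = xbar
f_hat = np.mean(boot_means <= xbar)
# By the conditional LLN, f_hat ≈ P(Xbar* <= xbar | S), which is close to 1/2
```

The bootstrap averages are centered at $\bar{X}$ given $S$, so the fraction falling below $\bar{X}$ hovers near one half, in line with the approximation above.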

If the **bootstrap principle** applies, that is, if our resampling mechanism from a *given* sample $S$ mimics the sampling mechanism of $S$ from the population, then *uniformly* for $t\in\mathbb{R}$
$$\begin{align*}\mathbb{P}(\sqrt{n}(\bar{X}^{*}-\bar{X})\leq t\mid S)&\approx \mathbb{P}(\sqrt{n}(\bar{X}-\mu)\leq t)\\&=\Phi(t),\end{align*}$$
where $\Phi$ denotes the standard normal distribution function. The last equality holds because $\bar{X}\sim N(\mu,1/n)$.

Substituting $t=\sqrt{n}(x-\bar{X})$, $$\begin{align*}\mathbb{P}(\bar{X}^{*}\leq x\mid S)&=\mathbb{P}(\sqrt{n}(\bar{X}^{*}-\bar{X})\leq \sqrt{n}(x-\bar{X})\mid S)\\&\approx\Phi(\sqrt{n}(x-\bar{X})).\end{align*}$$ Now define $$c(x)=\sqrt{n}(x-\mu)$$ and $$Z=\sqrt{n}(\bar{X}-\mu)\sim N(0,1).$$ It follows that, for large $M$, $$\widehat{f}(x)\approx\Phi(\sqrt{n}(x-\bar{X}))=\Phi(c(x)-Z).$$

Comparing with the unstable estimator $$\widehat{f}(x)=\mathbf{1}[\bar{X}\leq x]=\mathbf{1}[Z-c(x)\leq 0],$$ we observe that bagging smooths the indicator function $\mathbf{1}[u\leq 0]$ to be $\Phi(-u)$.

The figure below illustrates the smoothing effect:

- The solid line is the indicator function $\mathbf{1}[u\leq 0]$; and
- The dotted line is the smoothed function $\Phi(-u)$.

The function $\Phi(-u)$ is continuous in $u$ with a bounded derivative such that $$\begin{align*}\left|\frac{d}{du}\Phi(-u)\right|=&\left|-\phi(-u)\right|\\=&\phi(-u)\leq\phi(0)=\frac{1}{\sqrt{2\pi}},\end{align*}$$ where $\phi$ is the standard normal density function given by $$\phi(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}.$$

A small change, say, $\delta$ in $Z$ can only result in a small change in the estimator, in the sense that (by the mean value theorem) $$\sup_{c\in\mathbb{R}}|\Phi(c-(Z+\delta))-\Phi(c-Z)|\leq \frac{|\delta|}{\sqrt{2\pi}},$$ where the upper bound is nonrandom and tends to 0 as $|\delta|\rightarrow 0$.
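The uniform bound can be checked numerically on a grid of $c$ values. A small sketch with illustrative (arbitrarily chosen) values of $Z$ and $\delta$:

```python
import math

def Phi(x):
    # Standard normal CDF expressed via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

Z, delta = 0.3, 0.05                         # illustrative values, not from the text
cs = [i / 100.0 for i in range(-500, 501)]   # grid approximating the sup over c
worst = max(abs(Phi(c - (Z + delta)) - Phi(c - Z)) for c in cs)
bound = abs(delta) / math.sqrt(2.0 * math.pi)
# The worst-case change over the grid never exceeds |delta| / sqrt(2*pi)
```

The supremum is attained near $c=Z+\delta/2$, where the normal density peaks, and sits just below the bound $|\delta|\,\phi(0)$.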

The smoothing effect leads to a significant reduction in the estimation variance. At the most unstable point $x=\bar{X}$ with $c(x)=0$, the unstable estimator has the variance $$\operatorname{var}(\mathbf{1}[Z\leq 0])=\frac{1}{4},$$ whereas the bagging estimator has only a limiting variance (as $M\rightarrow\infty$) of $$\operatorname{var}(\Phi(-Z))=\operatorname{var}(U)=\frac{1}{12},\quad U\sim\operatorname{Unif}(0,1).$$ Below is a plot of their variances as a function of $c$. The solid line is for the unstable estimator $\mathbf{1}[Z-c\leq 0]$ and the dashed line is for the limit of the bagging estimator $\Phi(c-Z)$.
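The factor-of-three variance reduction at $c=0$ can be checked by simulation. A minimal Monte Carlo sketch (the sample size and seed are arbitrary choices):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal(100_000)                 # Z ~ N(0, 1)

# Unstable estimator at the most unstable point c = 0: the indicator 1[Z <= 0]
var_unstable = np.var((Z <= 0).astype(float))    # theoretical value 1/4

# Limit of the bagging estimator at c = 0: Phi(-Z), distributed Unif(0, 1)
Phi = np.vectorize(lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0))))
var_bagged = np.var(Phi(-Z))                     # theoretical value 1/12
```

The simulated variances land near $1/4$ and $1/12$ respectively, matching the calculation above.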