
Ensemble methods


Bagging & Boosting

Bagging and boosting are two ensemble methods in which a set of weak learners is combined to create a strong learner that performs better than any single learner.

In learning, the main causes of error are noise, bias, and variance. The goal of these ensemble methods is to reduce the error coming from bias and variance, and thus produce a more reliable classifier / regressor.

How they work and differ

Both bagging and boosting obtain N learners by training them on new datasets produced by sampling with replacement from the original dataset. They differ at three stages (a short code sketch contrasting the two follows the list):

  1. At the sampling stage
    • Bagging uses equal weights to sample each new dataset
    • Boosting applies unequal weights to the observations
  2. At the training stage
    • Bagging builds each model independently
    • Boosting builds the learners sequentially. Weights are redistributed after each training step, putting more weight on the difficult cases.
  3. At the classification / prediction stage
    • Bagging averages the responses of the N learners
    • Boosting takes a weighted average of the estimates, with different weights assigned to the learners
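
To make these three stages concrete, here is a minimal hand-rolled bagging sketch on a synthetic regression problem (a sketch for illustration only; the names X_toy, y_toy and the chosen settings are assumptions, not part of the original post). The comments note where boosting would differ at each stage.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X_toy = rng.uniform(-1, 1, size=(200, 1))
y_toy = 4 * X_toy[:, 0] ** 2 + 0.5 * rng.randn(200)

n_learners = 25
learners = []
for _ in range(n_learners):
    # Sampling stage: bootstrap with equal weight on every observation.
    # (Boosting would instead up-weight the observations that previous
    #  learners got wrong.)
    idx = rng.choice(len(X_toy), size=len(X_toy), replace=True)
    # Training stage: each model is fit independently of the others.
    # (Boosting would fit the learners sequentially on the reweighted data.)
    learners.append(DecisionTreeRegressor(max_depth=3).fit(X_toy[idx], y_toy[idx]))

# Prediction stage: a plain average over the N learners.
# (Boosting would combine the learners with unequal weights.)
y_hat = np.mean([tree.predict(X_toy) for tree in learners], axis=0)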

How to select the best suited method

This decision depends on the problem at hand. First, I will list the main features of each method.

  1. Bagging:

    • reduce variance
    • handle overfitting
    • independent classifiers
  2. Boosting:

    • reduce variance and bias
    • can overfit
    • sequential classifiers

Both methods try to decrease the variance of a single estimate and thus create a more stable model. However, bagging performs poorly when the individual models are underfitted, because bagging rarely improves on the bias of its base learner.

When the problem is that a single model overfits, boosting will perform worse because it increases the model complexity; bagging is usually the better option in that case.
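
As a rough illustration of this trade-off, the sketch below compares cross-validated errors of a single deep tree (high variance), a bagged ensemble of deep trees, and boosted shallow trees on a synthetic dataset. It assumes scikit-learn's BaggingRegressor and GradientBoostingRegressor; the data and hyperparameters are arbitrary choices for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.uniform(-1, 1, size=(300, 1))
y_demo = 4 * X_demo[:, 0] ** 2 + 0.5 * rng.randn(300)

models = {
    # A single fully grown tree has low bias but high variance.
    "single deep tree": DecisionTreeRegressor(random_state=0),
    # Bagging averages many deep trees, which mainly reduces variance.
    "bagged deep trees": BaggingRegressor(n_estimators=100, random_state=0),
    # Boosting chains many shallow (high-bias) trees, which mainly reduces bias.
    "boosted shallow trees": GradientBoostingRegressor(
        max_depth=2, n_estimators=100, learning_rate=0.1, random_state=0),
}

for name, model in models.items():
    mse = -cross_val_score(model, X_demo, y_demo, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.3f}")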

Next, I will look at one of the most popular boosting methods, the gradient boosting method, and create a toy example to illustrate the concept.

Gradient Boosting Method

Setup: Denote $\{(x_i, y_i)\}_{i=1}^n$ a training set, $L(y, F(x))$ a differentiable loss function, and $M$ the number of iterations.

  1. Initialize the model with a constant value: $F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)$
  2. For $m = 1$ to $M$

    1. Compute the pseudo residuals: $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}, \quad i = 1, \dots, n$$
    2. Fit a base learner (tree) $h_m(x)$ to the pseudo residuals, i.e. to $\{(x_i, r_{im})\}_{i=1}^n$
    3. Compute the multiplier $\gamma_m$ by solving the one-dimensional optimization problem $$\gamma_m = \arg\min_\gamma \sum_{i = 1}^n L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))$$
    4. Update the model: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$
  3. Output $F_M(x)$
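
For squared-error loss $L(y, F) = \frac{1}{2}(y - F)^2$, the pseudo residuals reduce to the ordinary residuals $y_i - F_{m-1}(x_i)$, which is exactly what the toy example below exploits. Here is a minimal sketch of the loop above, assuming squared-error loss and replacing the exact line search for $\gamma_m$ with a fixed shrinkage factor (a common simplification); the helper names gradient_boost_fit / gradient_boost_predict are just for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=50, learning_rate=0.1, max_depth=2):
    # Step 1: for squared-error loss the optimal constant F_0 is the mean of y.
    F0 = y.mean()
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):
        # Step 2.1: pseudo residuals; for L(y, F) = (y - F)^2 / 2 they are y - F.
        r = y - F
        # Step 2.2: fit a base learner (a shallow tree) to the pseudo residuals.
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        # Steps 2.3-2.4: a fixed shrinkage factor (learning_rate) stands in for
        # the exact line-search multiplier gamma_m.
        F = F + learning_rate * h.predict(X)
        trees.append(h)
    return F0, trees

def gradient_boost_predict(F0, trees, X, learning_rate=0.1):
    # Step 3: F_M(x) is the initial constant plus the sum of tree corrections.
    return F0 + learning_rate * sum(tree.predict(X) for tree in trees)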

A toy example:
In [1]:
import numpy as np 

# To plot pretty figures
import matplotlib.pyplot as plt 
%matplotlib inline
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

In [2]:
# generate a toy example
rs = 124
np.random.seed(rs)

# generate X ~ Uniform[-1,1]
X = 2 * np.random.rand(100, 1) - 1

# generate y = 4*X^2 + eps, eps ~ N(0, 0.5^2)
y = 4 * X[:, 0]**2 + 0.5 * np.random.randn(100)

In [3]:
from sklearn.tree import DecisionTreeRegressor

tree1 = DecisionTreeRegressor(max_depth=2, random_state=rs)
# fit the first decision tree regression using the original outcomes y and X
tree1.fit(X, y)

Out[3]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=124, splitter='best')

In [4]:
y2 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=rs)
# fit the 2nd decision tree regression uses residuals as outcomes
tree2.fit(X, y2)

Out[4]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=124, splitter='best')

In [5]:
y3 = y2 - tree2.predict(X)
# fit the last decision tree regression uses the residuals of the 
# 2nd model as outcomes
tree3 = DecisionTreeRegressor(max_depth=2, random_state=rs)
tree3.fit(X, y3)

Out[5]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=124, splitter='best')

In [8]:
# plot sequential residuals and ensemble predictions
def _plot(regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center", fontsize=12)
    plt.axis(axes)

plt.figure(figsize=(11,11))

plt.subplot(321)
_plot([tree1], X, y, axes=[-1.1, 1.1, -1.1, 5.5], label="$f_1(x_1)$", style="g-", data_label="Training set")
plt.ylabel("$y$", fontsize=12, rotation=0)
plt.title("Residuals and tree predictions", fontsize=16)

plt.subplot(322)
_plot([tree1], X, y, axes=[-1.1, 1.1, -1.1, 5.5], label="$f(x_1) = f_1(x_1)$", data_label="Training set")
plt.ylabel("$y$", fontsize=12, rotation=0)
plt.title("Ensemble predictions", fontsize=16)

plt.subplot(323)
_plot([tree2], X, y2, axes=[-1.1, 1.1, -2, 4.5], label="$f_2(x_1)$", style="g-", data_style="k+", data_label="Residuals")

plt.ylabel("$y - f_1(x_1)$", fontsize=12)

plt.subplot(324)
_plot([tree1, tree2], X, y, axes=[-1.1, 1.1, -1.1, 5.5], label="$f(x_1) = f_1(x_1) + f_2(x_1)$")

plt.ylabel("$y$", fontsize=12, rotation=0)

plt.subplot(325)
_plot([tree3], X, y3, axes=[-1.1, 1.1, -2, 4], label="$f_3(x_1)$", style="g-", data_style="k+")

plt.ylabel("$y - f_1(x_1) - f_2(x_1)$", fontsize=12)

plt.xlabel("$x_1$", fontsize=12)

plt.subplot(326)
_plot([tree1, tree2, tree3], X, y, axes=[-1.1, 1.1, -1.1, 6], label="$f(x_1) = f_1(x_1) + f_2(x_1) + f_3(x_1)$")

plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$y$", fontsize=12, rotation=0)

plt.show()

Remarks: as more tree regressors are added to the ensemble, the model can quickly overfit.
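
For reference, the three manually chained trees should be equivalent to scikit-learn's GradientBoostingRegressor with learning_rate=1.0, n_estimators=3 and the default squared-error loss (up to tie-breaking in the splits). Below is a minimal sketch reusing X, y and rs from the cells above; shrinking the learning rate while adding more trees is the usual remedy for the overfitting noted in the remark.

from sklearn.ensemble import GradientBoostingRegressor

# Should reproduce the tree1 + tree2 + tree3 ensemble above.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=rs)
gbrt.fit(X, y)

# A smaller learning rate with more trees (regularization by shrinkage)
# usually generalizes better and counteracts the overfitting.
gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                      learning_rate=0.1, random_state=rs)
gbrt_slow.fit(X, y)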