For 18 days, we have explored most of the core machine learning models, organized into three major families: distance- and density-based models, tree- or rule-based models, and weight-based models.
Up to this point, each article focused on a single model, trained on its own. Ensemble learning changes this perspective completely. It is not a standalone model. Instead, it is a way of combining these base models to build something new.
As illustrated in the diagram below, an ensemble is a meta-model. It sits on top of individual models and aggregates their predictions.

Three learning steps in Machine Learning – Image by author
The simplest form of ensemble learning is voting.
The idea is almost trivial: train several models, take their predictions, and compute the average. If one model is wrong in one direction and another is wrong in the opposite direction, the errors should cancel out. At least, that is the intuition.
On paper, this sounds reasonable. In practice, things are very different.
As soon as you try voting on real models, one fact becomes obvious: voting is not magic. Simply averaging predictions does not guarantee better performance. In many cases, it actually makes things worse.
The reason is simple. When you combine models that behave very differently, you also combine their weaknesses. If the models do not make complementary errors, averaging can dilute useful structure instead of reinforcing it.
To see this clearly, consider a very simple example. Take a decision tree and a linear regression trained on the same dataset. The decision tree captures local, non-linear patterns. The linear regression captures a global linear trend. When you average their predictions, you do not obtain a better model. You obtain a compromise that is often worse than each model taken individually.
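If you want to reproduce this little experiment outside the spreadsheet, here is a minimal Python sketch (the data is synthetic and the setup purely illustrative): a shallow tree and a linear regression are averaged, and the average typically lands somewhere between the two, dragged away from the better model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with local, non-linear structure (illustrative only)
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)   # captures local, non-linear patterns
linear = LinearRegression().fit(X, y)                 # captures the global linear trend

avg = (tree.predict(X) + linear.predict(X)) / 2       # naive voting: a plain average

for name, pred in [("tree", tree.predict(X)),
                   ("linear", linear.predict(X)),
                   ("average", avg)]:
    print(f"{name:8s} training MSE: {np.mean((y - pred) ** 2):.3f}")
```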

Voting in machine learning – all images by author
This illustrates an important point: ensemble learning requires more than averaging. It requires a strategy. A way to combine models that actually improves stability or generalization.
Moreover, if we consider the ensemble as a single model, then it must be trained as such. Simple averaging offers no parameter to adjust. There is nothing to learn, nothing to optimize.
One possible improvement to voting is to assign different weights to the models. Instead of giving each model the same importance, we could try to learn which ones should matter more. But as soon as we introduce weights, a new question appears: how do we train them? At that point, the ensemble itself becomes a model that needs to be fitted.
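To make this concrete, here is one simple possibility among many, sketched in Python (this is not the method used in the rest of the article): the base models' predictions become the inputs of a tiny second-level fit, and the weights come out of an ordinary least-squares problem. The point is simply that the weights themselves have to be learned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Illustrative setup: hold out part of the data just to fit the ensemble weights
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# The ensemble's inputs are the base models' predictions, one column per model
P = np.column_stack([tree.predict(X_val), linear.predict(X_val)])

# Fitting the weights is itself a learning problem (here, ordinary least squares)
w, *_ = np.linalg.lstsq(P, y_val, rcond=None)
print("learned weights:", w.round(3))
```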
This observation leads naturally to more structured ensemble methods.
In this article, we begin with a statistical approach that resamples the training dataset before averaging: bagging.
What is bagging?
The answer is actually hidden in the name itself.
Bagging = Bootstrap + Aggregating.
You can immediately tell that a mathematician or a statistician named it. 🙂
Behind this slightly intimidating word, the idea is extremely simple. Bagging is about doing two things: first, creating many versions of the dataset using the bootstrap, and second, aggregating the results obtained from these datasets.
The core idea is therefore not about changing the model. It is about changing the data.
Bootstrapping means sampling the dataset with replacement. Each bootstrap sample has the same size as the original dataset, but not the same observations. Some rows appear several times. Others disappear.
In Excel, this is very easy to implement and, more importantly, very easy to see.
You start by adding an ID column to your dataset, one unique identifier per row. Then, using the RANDBETWEEN function, you randomly draw row indices. Each draw corresponds to one row in the bootstrap sample. By repeating this process, you generate a full dataset that looks familiar, but is slightly different from the original one.
This step alone already makes the idea of bagging concrete. You can literally see the duplicates. You can see which observations are missing. Nothing is abstract.
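For readers who prefer to check the mechanics in code, here is the same bootstrap draw as a tiny NumPy sketch (the 10-row dataset and the seed are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy dataset with an ID column, in the spirit of the spreadsheet
ids = np.arange(1, 11)                     # 10 rows, IDs 1..10

# One bootstrap sample: draw 10 row IDs with replacement.
# The Excel equivalent is one RANDBETWEEN(1, 10) per row, followed by an INDEX lookup.
drawn = rng.integers(1, 11, size=10)       # IDs between 1 and 10, duplicates allowed
print("drawn IDs  :", drawn)
print("missing IDs:", sorted(set(ids) - set(drawn)))
```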
Below, you can see examples of bootstrap samples generated from the same original dataset. Each sample tells a slightly different story, even though all of them come from the same data.
These alternative datasets are the foundation of bagging.

Dataset generated by author – image by author
Let us now apply bagging to a simple base model: linear regression. Yes, this is probably the first time you have heard of bagging linear regression.
In theory, there is nothing wrong with it. As we said earlier, bagging is an ensemble method that can be applied to any base model. Linear regression is a model, so technically, it qualifies.
In practice, however, you will quickly see that this is not very useful.
But nothing prevents us from doing it. And precisely because it is not very useful, it makes for an excellent learning example. So let us do it.
For each bootstrap sample, we fit a linear regression. In Excel, this is straightforward. We can directly use the LINEST function to estimate the coefficients. Each color in the plot corresponds to one bootstrap sample and its associated regression line.
So far, everything behaves exactly as expected. The lines are close to each other, but not identical. Each bootstrap sample slightly changes the coefficients, and therefore the fitted line.

Bagging of linear regression – image by author
Now comes the key observation.
You may notice that one additional model is plotted in black. This one corresponds to the standard linear regression fitted on the original dataset, without bootstrapping.
What happens when we compare it to the bagged models?
When we average the predictions of all these linear regressions, the final result is still a linear regression. The shape of the prediction does not change. The relationship between the variables remains linear. We did not create a more expressive model.
And more importantly, the bagged model ends up being very close to the standard linear regression trained on the original data.
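Here is a minimal NumPy sketch of the whole experiment (synthetic data, illustrative only): fit one line per bootstrap sample, average the lines, and compare the result with the plain fit on the original data. The gap between the bagged fit and the plain fit stays very small.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 50, 8
x = rng.uniform(0, 10, size=n)
y = 1.5 * x + 3.0 + rng.normal(scale=2.0, size=n)        # a linear trend plus noise

x_grid = np.linspace(0, 10, 101)
boot_preds = np.empty((B, x_grid.size))
for b in range(B):
    idx = rng.integers(0, n, size=n)                      # bootstrap: rows drawn with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)  # plays the role of Excel's LINEST
    boot_preds[b] = slope * x_grid + intercept

bagged = boot_preds.mean(axis=0)                          # aggregating: average the lines

# The "black line": a plain linear regression on the original data
slope0, intercept0 = np.polyfit(x, y, deg=1)
plain = slope0 * x_grid + intercept0

print("max gap between bagged and plain fit:", np.abs(bagged - plain).max())
```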
We can even push the example further by using a dataset with a clearly non-linear structure. In this case, each linear regression fitted on a bootstrap sample struggles in its own way. Some lines tilt slightly upward, others downward, depending on which observations were duplicated or missing in the sample.

Bagging of linear regression – image by author
From a prediction performance point of view, bagging linear regression is not very useful.
However, bootstrapping remains extremely useful for one important statistical notion: estimating the confidence interval of the predictions.
Instead of looking only at the average prediction, we can look at the distribution of predictions produced by all the bootstrapped models. For each input value, we now have many predicted values, one from each bootstrap sample.
A simple and intuitive way to quantify uncertainty is to compute the standard deviation of these predictions. This standard deviation tells us how sensitive the prediction is to changes in the data. A small value means the prediction is stable. A large value means it is uncertain.
This idea works naturally in Excel. Once you have all the predictions from the bootstrapped models, computing their standard deviation is straightforward. The result can be interpreted as a confidence band around the prediction.
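In code, the same computation looks like this (a NumPy sketch with made-up data that is deliberately noisier on one side; the ±2 standard deviation band is just one common convention, the article itself simply uses the standard deviation).

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 60, 30                                             # 60 points, 30 bootstrap samples
x = rng.uniform(0, 10, size=n)
y = 1.2 * x + rng.normal(scale=1.0 + 0.3 * x, size=n)     # noisier on the right-hand side

x_grid = np.linspace(0, 10, 101)
preds = np.empty((B, x_grid.size))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    preds[b] = slope * x_grid + intercept

mean_pred = preds.mean(axis=0)                            # the bagged prediction
std_pred = preds.std(axis=0)                              # spread across the bootstrapped models
lower, upper = mean_pred - 2 * std_pred, mean_pred + 2 * std_pred   # a simple +/- 2 sd band
print("band width near x=0 :", round(upper[0] - lower[0], 3))
print("band width near x=10:", round(upper[-1] - lower[-1], 3))
```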
This is clearly visible in the plot below. The interpretation is straightforward: in regions where the training data is sparse or highly dispersed, the confidence interval becomes wide, as predictions vary significantly across bootstrap samples.
Conversely, where the data is dense, predictions are more stable and the confidence interval narrows.

Now, when we apply this to non-linear data, something becomes very clear. In regions where the linear model struggles to fit the data, the predictions from different bootstrap samples spread out much more. The confidence interval becomes wider.
This is an important insight. Even when bagging does not improve prediction accuracy, it provides valuable information about uncertainty. It tells us where the model is reliable and where it is not.
Seeing these confidence intervals emerge directly from bootstrap samples in Excel makes this statistical concept very concrete and intuitive.

Now we move to decision trees.
The principle of bagging remains exactly the same. We generate multiple bootstrap samples, train one model on each of them, and then aggregate their predictions.
I improved the Excel implementation to make the splitting process more automatic. To keep things manageable in Excel, we restrict the trees to a single split. Building deeper trees is possible, but it quickly becomes cumbersome in a spreadsheet.
Below, you can see two of the bootstrapped trees. In total, I built eight of them by simply copying and pasting formulas, which makes the process straightforward and easy to reproduce.
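The same construction can be sketched in a few lines of Python (synthetic data; scikit-learn's DecisionTreeRegressor with max_depth=1 plays the role of the single-split Excel trees).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, B = 80, 8                                     # 8 bootstrapped stumps, as in the spreadsheet
X = rng.uniform(0, 10, size=(n, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=n)

stumps = []
for b in range(B):
    idx = rng.integers(0, n, size=n)             # bootstrap sample of the rows
    stump = DecisionTreeRegressor(max_depth=1)   # a single split, like the Excel trees
    stumps.append(stump.fit(X[idx], y[idx]))

x_grid = np.linspace(0, 10, 201).reshape(-1, 1)
all_preds = np.array([s.predict(x_grid) for s in stumps])
bagged = all_preds.mean(axis=0)                  # averaging the 8 step functions

print("distinct prediction levels, single stump:", np.unique(all_preds[0]).size)
print("distinct prediction levels, bagged      :", np.unique(bagged.round(8)).size)
```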

Since decision trees are highly non-linear models and their predictions are piecewise constant, averaging their outputs has a smoothing effect.
As a result, bagging naturally smooths the predictions. Instead of sharp jumps created by individual trees, the aggregated model produces more gradual transitions.
In Excel, this effect is very easy to observe. The bagged predictions are clearly smoother than the predictions of any single tree.

Some of you may have already heard of decision stumps, which are decision trees with a maximum depth of one. That is exactly what we use here. Each model is extremely simple. On its own, a stump is a weak learner.
The question here is:
is a collection of decision stumps sufficient when combined with bagging?
We will come back to this later in my Machine Learning “Advent Calendar”.
What about Random Forest?
This is probably one of the favorite models among data scientists.
So why not talk about it here, even in Excel?
In fact, what we have just built is already very close to a Random Forest!
To understand why, recall that Random Forest introduces two sources of randomness: it bootstraps the rows, exactly as we just did, and it also selects a random subset of features at each split.
In our case, however, we only have one feature. That means there is nothing to select from. Feature randomness simply does not apply.
As a result, what we obtain here can be seen as a simplified Random Forest.
Once this concept is clear, extending the idea to multiple features is just an additional layer of randomness, not a new concept.
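As a point of comparison only, here is what the full version looks like with scikit-learn's RandomForestRegressor on made-up data with several features: the bootstrap handles the randomness over rows, and max_features adds the randomness over features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 5))            # several features this time
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# bootstrap=True -> randomness over the rows (what we built by hand in Excel)
# max_features   -> randomness over the features tried at each split (the extra layer)
forest = RandomForestRegressor(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0
)
forest.fit(X, y)
print("prediction at one point:", forest.predict([[5, 5, 5, 5, 5]])[0])
print("feature importances    :", forest.feature_importances_.round(2))
```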
And you may even ask: could we apply the same principle to linear regression and build a kind of "Random Forest" of linear regressions?
Ensemble learning is less about complex models and more about managing instability.
Simple voting is rarely effective. Bagging linear regression changes little and remains mostly pedagogical, though it is useful for estimating uncertainty. With decision trees, however, bagging truly matters: averaging unstable models leads to smoother and more robust predictions.
Random Forest naturally extends this idea by adding extra randomness, without changing the core principle. Seen in Excel, ensemble methods stop being black boxes and become a logical next step.
Thank you for your support of my Machine Learning “Advent Calendar”.
People usually talk a lot about supervised learning, but unsupervised learning is sometimes overlooked, even though it can reveal structure that no label could ever show.
If you want to explore these ideas further, here are three articles that dive into powerful unsupervised models.
Gaussian Mixture Models (GMM): an improved and more flexible version of k-means.
Unlike k-means, GMM allows clusters to stretch, rotate, and adapt to the true shape of the data.
But when do k-means and GMM actually produce different results?
Have a look at this article to see concrete examples and visual comparisons.
Local Outlier Factor (LOF)
A clever method that compares each point’s local density to its neighbors to detect anomalies.
All the Excel files are available through this Ko-fi link. Your support means a lot to me. The price will increase during the month, so early supporters get the best value.

All Excel/Google sheet files for ML and DL