After the article about SVM, the next natural step is Kernel SVM.
At first sight, it looks like a completely different model. The training happens in the dual form, we stop talking about a slope and an intercept, and suddenly everything is about a “kernel”.
In today’s article, I will make the word kernel concrete by visualizing what it really does.
There are many good ways to introduce Kernel SVM. If you have read my previous articles, you know that I like to start from something simple that you already know.
A classic way to introduce Kernel SVM is this: SVM is a linear model. If the relationship between the features and the target is non-linear, a straight line will not separate the classes well. So we create new features. Polynomial regression is still a linear model: we simply add polynomial features (x, x², x³, …). From this point of view, a polynomial kernel creates these polynomial features implicitly, and an RBF kernel can be seen as using an infinite series of polynomial features…
Maybe another day we will follow this path, but today we will take a different one: we start with KDE.
Yes, Kernel Density Estimation.
Let’s get started.
You can use this link to get the Google Sheet.
Kernel trick in Excel – all images by author
I introduced KDE in the article about LDA and QDA, and at that time I said we would reuse it later. This is the moment.
We see the word kernel in KDE, and we also see it in Kernel SVM. This is not a coincidence, there is a real link.
The idea of KDE is simple:
around each data point, we place a small distribution (a kernel).
Then, we add all these individual densities together to obtain a global distribution.
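To make this concrete, here is a minimal Python sketch (the data points and the bandwidth h are hypothetical values, just for illustration): one Gaussian bell per data point, then a sum.

```python
import numpy as np

# a few hypothetical 1D data points
points = np.array([1.0, 1.5, 2.2, 4.0, 4.3, 5.1])
h = 0.5  # bandwidth: controls how wide each bell is

def kde(x, points, h):
    # one Gaussian bell per data point, all summed together
    bells = np.exp(-(x - points) ** 2 / (2 * h ** 2))
    return bells.sum() / (len(points) * h * np.sqrt(2 * np.pi))

# evaluate the global density on a grid
grid = np.linspace(0, 6, 121)
density = np.array([kde(x, points, h) for x in grid])
```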
Keep this idea in mind. It will be the key to understanding Kernel SVM.

KDE in Excel – all images by author
We can also adjust one parameter, the bandwidth, to control how smooth the global density is, from very local to very smooth, as illustrated in the GIF below.

KDE in Excel – all images by author
As you know, KDE is a distance/density-based model, while SVM is a weight-based one, so here we are going to create a link between two models from two different families.
Now we reuse exactly the same idea to build a function around each point, and then this function can be used for classification.
Do you remember that, with weight-based models, the classification task is first a regression task, because the value y is always treated as continuous? We only do the classification part once we have the decision function f(x).
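In code, that last step is nothing more than thresholding the continuous score, as in this small sketch (f_values stands for whatever decision function the model has produced):

```python
import numpy as np

def classify(f_values):
    # regression first: f(x) is a continuous score
    # classification after: threshold at 0 to get labels in {-1, +1}
    return np.where(f_values >= 0, 1, -1)
```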
Someone once asked me why I always use around 10 data points to explain machine learning, saying it is meaningless.
I strongly disagree.
If someone cannot explain how a Machine Learning model works with 10 points (or less) and one single feature, then they do not really understand how this model works.
So this will not be a surprise for you. Yes, I will still use the same very simple dataset that I already used for logistic regression and SVM. I know this dataset is linearly separable, but it is interesting to compare the results of the models.
And I also generated another dataset with data points that are not linearly separable and visualized how the kernelized model works.

Dataset for kernel SVM in Excel – all images by author
Let us now apply the KDE idea to our dataset.
For each data point, we place a bell-shaped curve centered on its x value. At this stage, we do not care about classification yet. We are only doing one simple thing: creating one local bell around each point.
This bell has a Gaussian shape, but here it has a specific name: RBF, for Radial Basis Function.
In the figure below, we can see the RBF (Gaussian) kernel centered on the point x₇.

The name sounds technical, but the idea is actually very simple.
Once you see RBFs as “distance-based bells”, the name stops being mysterious.
How to read this intuitively
The bell reaches its maximum, a value of 1, exactly at x₇.
As x moves away from x₇, the value decreases smoothly toward 0.
Role of γ (gamma)
γ controls the width of the bell: a large γ gives a narrow, very local bell, while a small γ gives a wide, smooth one. So γ plays the same role as the bandwidth in KDE (in an inverse way: a large γ corresponds to a small bandwidth).
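As a small sketch (the center x₇ and the numbers are hypothetical), the bell and the effect of γ look like this:

```python
import numpy as np

def rbf(x, center, gamma):
    # the "bell": 1 at the center, decays smoothly toward 0 with distance
    return np.exp(-gamma * (x - center) ** 2)

x7 = 4.0                     # hypothetical position of point x7
rbf(4.0, x7, gamma=1.0)      # 1.0 exactly at the center
rbf(6.0, x7, gamma=1.0)      # close to 0 far away
rbf(6.0, x7, gamma=0.1)      # a smaller gamma gives a wider bell
```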
At this stage, nothing is combined yet. We are just building the elementary blocks.
On the figures below, you first see the individual bells, each centered on a data point.
Once this is clear, we move to the next step: combining the bells.
This time, each bell is multiplied by its label yi.
As a result, some bells are added and others are subtracted, creating influences in two opposite directions.
This is the first step toward a classification function.
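Written as a sketch (points and labels are hypothetical, with labels in {-1, +1}):

```python
import numpy as np

X = np.array([1.0, 1.5, 2.2, 4.0, 4.3, 5.1])   # hypothetical 1D points
y = np.array([-1, -1, -1, 1, 1, 1])            # labels in {-1, +1}
gamma = 1.0

def score(x):
    # each bell is multiplied by its label:
    # +1 bells push the score up, -1 bells push it down
    return np.sum(y * np.exp(-gamma * (x - X) ** 2))
```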

And in Excel we can see all the components from each data point being added together to get the final score.

This already looks extremely similar to KDE.
But we are not done yet.
We said earlier that SVM belongs to the weight-based family of models. So the next natural step is to introduce weights.
In distance-based models, one major limitation is that all features are treated as equally important when computing distances. Of course, we can rescale features, but this is often a manual and imperfect fix.
Here, we take a different approach.
Instead of simply summing all the bells, we assign a weight to each data point and multiply each bell by this weight.

At this point, the model is still linear, but linear in the space of kernels, not in the original input space.
To make this concrete, we can assume that the coefficients αi are already known and directly plot the resulting function in Excel. Each data point contributes its own weighted bell, and the final score is just the sum of all these contributions.
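Here is the same idea as a Python sketch, with hypothetical, already-known coefficients αᵢ and intercept b:

```python
import numpy as np

X = np.array([1.0, 1.5, 2.2, 4.0, 4.3, 5.1])      # hypothetical points
y = np.array([-1, -1, -1, 1, 1, 1])               # labels in {-1, +1}
alpha = np.array([0.0, 0.8, 0.3, 0.5, 0.6, 0.0])  # hypothetical, already-known weights
b = -0.1                                          # hypothetical intercept
gamma = 1.0

def decision_function(x):
    # f(x) = sum_i alpha_i * y_i * K(x, x_i) + b
    return np.sum(alpha * y * np.exp(-gamma * (x - X) ** 2)) + b

label = np.sign(decision_function(3.0))
```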

If we apply this to a dataset with a non-linearly separable boundary, we clearly see what Kernel SVM is doing: it fits the data by combining local influences, instead of trying to draw a single straight line.

Up to now, we have only talked about the kernel part of the model. We have built bells, weighted them, and combined them.
But our model is called Kernel SVM, not just “kernel model”.
The SVM part comes from the loss function.
And as you may already know, SVM is defined by the hinge loss.
The hinge loss has a very important property.
If a point is correctly classified and lies outside the margin (that is, yᵢ·f(xᵢ) ≥ 1), then its loss is zero.
As a direct consequence, its coefficient αi becomes zero.
Only a few data points remain active.
These points are called support vectors.
So even though we started with one bell per data point, in the final model, only a few bells survive.
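The hinge loss itself is easy to write down. A sketch, where f_x is the value of the decision function at a training point and y_true its label:

```python
def hinge_loss(y_true, f_x):
    # zero as soon as the point is on the right side with a margin of at least 1
    return max(0.0, 1.0 - y_true * f_x)

hinge_loss(+1, 2.3)   # 0.0: correct and beyond the margin -> not a support vector
hinge_loss(+1, 0.4)   # 0.6: inside the margin -> the point stays a support vector
```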
In the example below, you can see that for some points (for instance points 5 and 8), the coefficient αi is zero. These points are not support vectors and do not contribute to the decision function.
Depending on how strongly we penalize violations (through the parameter C), the number of support vectors can increase or decrease.
This is a crucial practical advantage of SVM.
When the dataset is large, storing one parameter per data point can be expensive. Thanks to hinge loss, SVM produces a sparse model, where only a small subset of points is kept.
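You can check this directly with scikit-learn's SVC, for example. In this sketch the data is made up; the point is only that the model stores the support vectors, and that their number changes with C:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical 1D dataset, reshaped to (n_samples, 1)
X = np.array([1.0, 1.5, 2.2, 2.8, 3.5, 4.0, 4.3, 5.1, 5.6, 6.0]).reshape(-1, 1)
y = np.array([-1, -1, -1, -1, 1, -1, 1, 1, 1, 1])

for C in (0.1, 1.0, 10.0):
    model = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)
    # only the support vectors are kept; the other alphas are exactly zero
    print(C, model.n_support_.sum())
```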

If we keep the same kernels but replace the hinge loss with a squared loss, we obtain kernel ridge regression:
Same kernels.
Same bells.
Different loss.
This leads to a very important conclusion:
Kernels define the representation.
The loss function defines the model.
With kernel ridge regression, the model must store all training data points.
Since squared loss does not force any coefficient to zero, every data point keeps a non-zero weight and contributes to the prediction.
In contrast, Kernel SVM produces a sparse solution: only support vectors are stored, all other points disappear from the model.
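To see the contrast in code, here is a sketch with scikit-learn's KernelRidge on the same kind of made-up data: every training point keeps a non-zero dual coefficient.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# hypothetical 1D dataset
X = np.array([1.0, 1.5, 2.2, 2.8, 3.5, 4.0, 4.3, 5.1, 5.6, 6.0]).reshape(-1, 1)
y = np.array([-1, -1, -1, -1, 1, -1, 1, 1, 1, 1])

# same kernel, same bells, different loss: squared loss instead of hinge
krr = KernelRidge(kernel="rbf", gamma=1.0, alpha=1.0).fit(X, y)

# one dual coefficient per training point; none of them is forced to zero
print((krr.dual_coef_ != 0).sum())  # typically equals len(X)
```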

There is an interesting parallel with LASSO.
In linear regression, LASSO uses an L1 penalty on the primal coefficients. This penalty encourages sparsity, and some coefficients become exactly zero.
In SVM, the hinge loss plays a similar role, but on the dual coefficients αᵢ.
Different mechanisms, same effect: only the important parameters survive.
Kernel SVM is not just about kernels.
The result is a model that is both flexible and sparse, which is why SVM remains a powerful and elegant tool.
Tomorrow, we will look at another model that deals with non-linearity. Stay tuned.
All the Excel files are available through this Kofi link. Your support means a lot to me. The price will increase during the month, so early supporters get the best value.

All Excel/Google sheet files for ML and DL