Here we are.
This is the model that motivated me, from the very beginning, to use Excel to better understand Machine Learning.
And today, you are going to see a different explanation of SVM than the one you usually get, which starts from geometry and complicated formulas.
Instead, we will build the model step by step, starting from things we already know.
So maybe this is also the day you finally say “oh, I understand better now.”
One of my main learning principles is simple:
always start from what we already know.
Before SVM, we already studied linear regression, logistic regression, ridge regularization, and gradient descent.
We will use these models and concepts today.
The idea is not to introduce a new model, but to transform an existing one.
We will use a one-feature dataset to explain the SVM.
Yes, I know, this is probably the first time you see someone explain SVM using only one feature.
Why not?
In fact, it is necessary, for several reasons.
For other models, such as linear regression or logistic regression, we usually start with a single feature. We should do the same with SVM, so that we can compare the models properly.
If you build a model with many features and think you understand how it works, but you cannot explain it with just one feature, then you do not really understand it yet.
Using a single feature makes the model easier to visualize, easier to compute by hand, and easier to compare with the other models.
So, we use two datasets that I generated to illustrate the two possible situations a linear classifier can face: a perfectly separable one and a non-separable one.
Can you guess why we use these two datasets here, whereas we usually start with only one?
We also use the label convention -1 and 1 instead of 0 and 1.
Why? We will see later; it is actually an interesting bit of history about how these models are viewed from the GLM and Machine Learning perspectives.

SVM in Excel – All images by author
In logistic regression, before applying the sigmoid, we compute a logit, which we can call f.
This quantity is a linear score that can take any real value, from −∞ to +∞.
Using labels -1 and 1 matches this interpretation naturally.
It emphasizes the sign of the logit, without going through probabilities.
So, we are working with a pure linear model, not within the GLM framework.
There is no sigmoid, no probability, only a linear decision score.
A compact way to express this idea is to look at the quantity:
y(ax + b) = y f(x)
At this point, we are still not talking about SVMs.
We are only making explicit what good classification means in a linear setting.
With this convention, we can write the log-loss for logistic regression directly as a function of the quantity:
y f(x) = y (ax+b)
We can plot this loss as a function of yf(x).
Now, let us introduce a new loss function called the hinge loss.
When we plot the two losses on the same graph, we can see that they are quite similar in shape.
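To make the comparison concrete, here is a minimal Python sketch (the article itself works in Excel) that evaluates both losses at a few values of the margin m = y f(x):

```python
import math

def log_loss(m):
    # logistic loss as a function of the margin m = y * f(x)
    return math.log(1 + math.exp(-m))

def hinge_loss(m):
    # hinge loss: zero as soon as the margin reaches 1
    return max(0.0, 1 - m)

for m in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"m={m:+.1f}  log-loss={log_loss(m):.3f}  hinge={hinge_loss(m):.3f}")
```

Both losses decrease as the margin grows; the hinge loss hits exactly zero at m = 1, while the log-loss only approaches zero asymptotically.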
Do you remember Gini vs. Entropy in Decision Tree Classifiers?
The comparison is very similar here.

In both cases, the idea is to penalize points that are misclassified or classified with too small a margin.
The difference is in how this penalty is applied.
So the goal is not to change what we consider a good or bad classification,
but to simplify the way we penalize it.
One question naturally follows.
Could we also use a squared loss?
After all, linear regression can also be used as a classifier.
But when we do this, we immediately see the problem:
the squared loss keeps penalizing points that are already very well classified.
Instead of focusing on the decision boundary, the model tries to fit exact numeric targets.
This is why linear regression is usually a poor classifier, and why the choice of the loss function matters so much.
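A quick numeric check illustrates the problem, using the same margin convention as above (a small sketch with made-up values):

```python
def hinge_loss(m):
    return max(0.0, 1 - m)

def squared_loss(m):
    # squared loss in the -1/+1 convention: (1 - y*f(x))^2
    return (1 - m) ** 2

# a point with margin 3 is very well classified
print(hinge_loss(3.0))    # 0.0 -> no penalty
print(squared_loss(3.0))  # 4.0 -> still heavily penalized
```

The squared loss is zero only at m = 1 exactly, and grows again beyond it, which is why it pulls the model away from the decision boundary.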

Let us now assume that the model is already trained and look directly at the results.
For both models, we compute exactly the same quantities: the linear score, the predicted class, and the loss value.
This allows a direct, point-by-point comparison between the two approaches.
Although the loss functions are different, the linear scores and the resulting classifications are very similar on this dataset.
For the completely separable dataset, the result is immediate: all points are correctly classified and lie sufficiently far from the decision boundary. As a consequence, the hinge loss is equal to zero for every observation.
This leads to an important conclusion.
When the data is perfectly separable, there is not a unique solution. In fact, there are infinitely many linear decision functions that achieve exactly the same result. We can shift the line, rotate it slightly, or rescale the coefficients, and the classification remains perfect, with zero loss everywhere.
So what do we do next?
We introduce regularization.
Just as in ridge regression, we add a penalty on the size of the coefficients. This additional term does not improve classification accuracy, but it allows us to select one solution among all the possible ones.
So for our dataset, regularization selects the solution with the smallest slope a.

And congratulations, we have just built the SVM model.
We can now just write down the cost function of the two models: Logistic Regression and SVM.
Remember that Logistic Regression can also be regularized, and we still call it Logistic Regression, right?
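As a sketch, the two regularized cost functions can be written side by side in Python (one-feature case, λ-style penalty on the slope only; the data and values passed to these functions are placeholders):

```python
import math

def logistic_cost(a, b, X, y, lam):
    # regularized log-loss: sum(log(1 + exp(-y*f))) + lam * a^2
    fit = sum(math.log(1 + math.exp(-yi * (a * xi + b))) for xi, yi in zip(X, y))
    return fit + lam * a ** 2

def svm_cost(a, b, X, y, lam):
    # regularized hinge loss: sum(max(0, 1 - y*f)) + lam * a^2
    fit = sum(max(0.0, 1 - yi * (a * xi + b)) for xi, yi in zip(X, y))
    return fit + lam * a ** 2
```

The two functions differ in a single line: the data-fit term.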

Now, why does the model’s name include the term “Support Vector”?
If you look at the dataset, you can see that only a few points, for example the ones with values 6 and 10, are enough to determine the decision boundary. These points are called support vectors.
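To see that these two points alone pin down the line, we can solve the two margin-equality equations by hand. This assumes, purely for illustration, that the point at x = 6 has label -1 and the point at x = 10 has label +1:

```python
# The two support vectors at x = 6 and x = 10 determine the boundary.
# Assumption (for illustration): label -1 at x = 6, label +1 at x = 10.
# Setting both margins exactly equal to 1:
#   -1 * (a*6  + b) = 1
#   +1 * (a*10 + b) = 1
# and solving this 2x2 system gives the smallest-slope separator.
a = 2 / (10 - 6)   # a = 0.5
b = 1 - a * 10     # b = -4, so the boundary f(x) = 0 sits at x = 8

assert -1 * (a * 6 + b) == 1.0
assert +1 * (a * 10 + b) == 1.0
```

All the other points lie strictly outside the margin, so moving or removing them does not change the solution.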
At this stage, with the perspective we are using, we cannot identify them directly.
We will see later that another viewpoint makes them appear naturally.
We can do the same exercise with the non-separable dataset; the principle is the same, and nothing changes.
But now we can see that, for certain points, the hinge loss is not zero. In our case below, we can see visually that four points are needed as Support Vectors.

We now train the SVM model explicitly, using gradient descent.
Nothing new is introduced here. We reuse the same optimization logic we already applied to linear and logistic regression.
In many models we studied previously, such as ridge or logistic regression, the objective function is written as:
data-fit loss + λ ∥w∥²
Here, the regularization parameter λ controls the penalty on the size of the coefficients.
For SVMs, the usual convention is slightly different: we place a constant C in front of the data-fit term instead.

Both formulations are equivalent.
They only differ by a rescaling of the objective function.
We keep the parameter C because it is the standard notation used in SVMs. And we will see why we have this convention later.
We work with a linear decision function, and we can define the margin for each point as: m_i = y_i (a x_i + b)
Only observations such that m_i < 1 contribute to the hinge loss.
The subgradients of the objective are as follows, and we can implement them in Excel using logical masks and SUMPRODUCT.

With a learning rate, or step size, η, the gradient descent updates follow the usual formula:

We iterate these updates until convergence.
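The whole loop can be sketched in Python (a minimal illustration; the dataset and hyperparameters are made up, and the objective is written in the C convention as 0.5·a² + C·Σ hinge):

```python
def train_svm_1d(X, y, C=1.0, eta=0.001, n_iter=20000):
    """Subgradient descent on 0.5*a^2 + C * sum(max(0, 1 - y*(a*x + b)))."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        # logical mask: 1 for points violating the margin, i.e. y*(a*x + b) < 1
        mask = [1 if yi * (a * xi + b) < 1 else 0 for xi, yi in zip(X, y)]
        # subgradients; the masked sums mirror SUMPRODUCT in Excel
        grad_a = a - C * sum(m * yi * xi for m, xi, yi in zip(mask, X, y))
        grad_b = -C * sum(m * yi for m, yi in zip(mask, y))
        a -= eta * grad_a
        b -= eta * grad_b
    return a, b

X = [2, 4, 6, 10, 12, 14]
y = [-1, -1, -1, 1, 1, 1]
a, b = train_svm_1d(X, y)
print(a, b)  # close to a = 0.5, b = -4: boundary near x = 8
```

With this separable toy data, the iterates settle near the maximum-margin solution, with only the two innermost points ever activating the mask.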
And, by the way, this training procedure also gives us something very nice to visualize. At each iteration, as the coefficients are updated, the size of the margin changes.
So we can visualize, step by step, how the margin evolves during the learning process.
This figure below shows the same objective function of the SVM model written in two different languages.

On the left, the model is expressed as an optimization problem.
We minimize a combination of two things: a data-fit term (the sum of hinge losses) and a regularization term on the coefficients.
This is the view we have been using so far. It is natural when we think in terms of loss functions, regularization, and gradient descent. It is the most convenient form for implementation and optimization.
On the right, the same model is expressed in a geometric way.
Instead of talking about losses, we talk about margins, separating lines, and constraints.
When the data is perfectly separable, the model looks for the separating line with the largest possible margin, without allowing any violation. This is the hard-margin case.
When perfect separation is impossible, violations are allowed, but they are penalized. This leads to the soft-margin case.
What is important to understand is that these two views are strictly equivalent.
The optimization formulation automatically enforces the geometric constraints: correctly classified points must reach a margin of at least 1, and any violation is penalized.
So this is not two different models, and not two different ideas.
It is the same SVM, seen from two complementary perspectives.
Once this equivalence is clear, the SVM becomes much less mysterious: it is simply a linear model with a particular way of measuring errors and controlling complexity, which naturally leads to the maximum-margin interpretation everyone knows.
From the optimization viewpoint, we can now take a step back and look at the bigger picture.
What we have built is not just “the SVM”, but a general linear classification framework.
A linear classifier is defined by three independent choices: a loss function, a regularization term, and an optimization procedure.
Once this is clear, many models appear as simple combinations of these elements.
In practice, this is exactly what we can do with SGDClassifier in scikit-learn.

From the same viewpoint, we can change the loss function, change the regularization term, or adjust how the two are balanced.
Each choice changes how errors are penalized or how coefficients are controlled, but the underlying model remains the same: a linear decision function trained by optimization.
You may already have heard about the dual form of SVM.
So far, we have worked entirely in the primal form, optimizing the coefficients a and b directly.
The dual form is another way to write the same optimization problem.
Instead of assigning weights to features, the dual form assigns a coefficient, usually called alpha, to each data point.
We will not derive or implement the dual form in Excel, but we can still observe its result.
Using scikit-learn, we can compute the alpha values and verify that only a few of them are non-zero.

What makes the dual form particularly interesting for SVM is that most of the alpha coefficients are exactly zero: only a few data points have a non-zero alpha.
These points are the support vectors.
This behavior is specific to margin-based losses like the hinge loss.
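Here is a sketch of that observation with scikit-learn's SVC, on a made-up one-feature dataset with a linear kernel:

```python
# Inspecting the dual coefficients (alphas) with scikit-learn's SVC.
from sklearn.svm import SVC

X = [[2], [4], [6], [10], [12], [14]]
y = [-1, -1, -1, 1, 1, 1]

model = SVC(kernel="linear", C=1.0).fit(X, y)

print(model.support_)    # indices of the support vectors
print(model.dual_coef_)  # y_i * alpha_i, stored only for the support vectors
```

On this data, only the two innermost points appear in `support_`; all the other alphas are exactly zero and are not even stored.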
Finally, the dual form also explains why SVMs can use the kernel trick.
By working with similarities between data points, we can build non-linear classifiers without changing the optimization framework.
We will see this tomorrow.
In this article, we did not approach SVM as a geometric object with complicated formulas. Instead, we built it step by step, starting from models we already know.
By changing only the loss function, then adding regularization, we naturally arrived at the SVM. The model did not change. Only the way we penalize errors did.
Seen this way, SVM is not a new family of models. It is a natural extension of linear and logistic regression, viewed through a different loss.
We also showed that the hinge loss plays the same role as the log-loss, that regularization selects one solution among infinitely many, and that the optimization and geometric formulations of SVM are strictly equivalent.
Once these links are clear, SVM becomes much easier to understand and to place among other linear classifiers.
In the next step, we will use this new perspective to go further, and see how kernels extend this idea beyond linear models.
All the Excel files are available through this Kofi link. Your support means a lot to me. The price will increase during the month, so early supporters get the best value.

All Excel/Google sheet files for ML and DL