Benders’ Decomposition 101: How to Crack Open a Stochastic Program That’s Too Big to Swallow Whole

towardsdatascience.com

In my first TDS post, I a real-world problem into an integer linear program. In my second, I made that program robust against uncertainty. In my third, I introduced stochastic programming: four principled ways to put uncertainty into the model rather than hand-waving it away.

The third post ended with a promise. I said two-stage recourse models are easy to write down and tidy in theory, but in practice they have an embarrassing problem: as you add more scenarios, the so-called “deterministic equivalent” explodes in size and the solver starts to complain. I name-dropped Benders’ decomposition as the standard fix and then disappeared without showing how it actually works.

This is the post that makes good on that promise.

We’ll start with where the previous post left off, show why the obvious approach falls apart at scale, work out the one observation that saves us, and then walk slowly through the algorithm itself. There’s a bit more math than last time, but nothing exotic; the most advanced tool we’ll use is LP duality, and even that comes packaged as “we can write the same problem in a different way.”

Quick recap: the model we’re trying to solve

In the previous post, our hero was a fashion company in Germany sourcing winter clothing from Bangladesh. Demand ξ is a random variable; production x has to be decided before ξ is known. After demand is revealed, a recourse decision y patches up any shortfall at a higher cost (an emergency batch from Romania, say).

If we assume a discrete distribution with S scenarios — values ξ¹, ξ², …, ξS^S with probabilities p₁, …, pS — the two-stage recourse model collapses into a single big LP, the deterministic equivalent:

One copy of the second stage per scenario, all glued to the same first-stage xx. Hand it to HiGHS or Gurobi, wait, and get a solution. End of story.

Except sometimes the waiting becomes the whole story.

When the deterministic equivalent hits a wall

For small SS, everything is fine. You write the LP, you solve it, you go home. But scenarios tend to multiply faster than your patience:

A continuous demand distribution that you approximate with sample average approximation might need thousands of samples before the optimal xx stabilizes.
A multi-stage model fans out exponentially with the number of stages. Three stages with ten branches each already gives you a thousand leaves; five stages and you’re well into the hundreds of thousands.
In real applications, such as hydropower scheduling, supply-chain planning under demand shocks, or energy dispatch, SS in the tens of thousands isn’t unusual. Hydropower people regularly model scenario trees that are too big to even enumerate, let alone hand to a solver.

The deterministic equivalent dutifully grows with SS. It has one block of constraints per scenario, one set of ysy_s variables per scenario, and one TsxT_sx coupling each block to the shared first-stage xx. If you take a look at the constraint matrix, it has a very particular shape:

This is called block-angular structure. Each scenario contributes its own diagonal block WsW_s, plus a column of TsT_s on the left that ties it to xx. The shared row at the bottom is the first-stage feasibility constraint Ax≥bAx \geq b.

The shape is informative, but for the solver it’s bad news. LP runtime grows superlinearly in problem size. A quick Google search will tell you it’s roughly O(n2.5)O(n^{2.5}) for simplex-style methods and O(n3.5)O(n^{3.5}) for interior-point methods in the worst case, which means doubling the number of scenarios doesn’t double the runtime — it sextuples it. At some point the problem stops fitting in memory entirely, and your friendly solver starts swapping to disk, which is a polite way of saying “this is too much.”

So we need something cleverer than “shove the whole thing into a solver and hope for the best.”

Why naive tricks don’t help

Before we get to the clever idea, let’s rule out the obvious ones.

Idea 1: Just use fewer scenarios. Sometimes this is fine; sometimes the answer changes meaningfully as you add scenarios, and your “optimum” was a coincidence of the sample you happened to draw. The sample average approximation literature is precisely about when you can get away with this and when you can’t.

Idea 2: Solve each scenario separately and average the results. Tempting, but wrong. The whole point of having one xx is that it has to work for all scenarios simultaneously. Solving scenario ss on its own gives you the scenario-specific optimum (the “wait-and-see” solution from the last post), not a first-stage decision you can commit to.

Idea 3: Solve the deterministic problem with mean demand 𝔼[ξ], then deal with the leftovers. This is the EV solution. It can be very wrong, and that wrongness is exactly what VSS measures.

So we can’t shrink the problem, we can’t decouple it, and we can’t average our way around it. We need something that respects the structure of the deterministic equivalent while not solving it as one giant LP.

That something has been around since 1962.

The one observation that saves us

Look at the constraint matrix again:

Now imagine that xx is fixed and pick any value, doesn’t matter which. The first-stage block Ax≥bAx \geq b either holds or it doesn’t. And the rest of the constraints decouple: scenario s’s constraint becomes Wsys≥hs−TsxW_s y_s ≥ h_s − T_s x, which involves only yₛ. The connection between scenarios runs through xx. Fix xx, and the scenarios fall apart into SS independent LPs.

This is the key idea. The variable xx is a complicating variable: if it were known, life would be much easier. Each subproblem is small (the size of one scenario, not all of them), and they’re independent, so you can even solve them in parallel.

The Benders’ decomposition recipe is built around this observation. Roughly:

Guess a first-stage decision xx.
Solve the S scenario subproblems independently, given xx.
Use what the subproblems tell you to update your guess of xx.
Repeat until your guesses stop improving.

The hard part (and the elegant part) is step 3: how do the subproblems tell us anything useful about xx? That’s where LP duality earns its keep.

Benders!

Let’s slow down and build the algorithm one piece at a time.

The value function v(ξˢ, x)

For a fixed first-stage decision xx and a fixed scenario ss, the second-stage subproblem is just a small LP:

Its dual is

This dual has a feature we’ll lean on. Its feasible region Λs={λ≥0:λ⊤Ws≤qs}\Lambda_s = \{\lambda \geq 0 : \lambda^\top W_s \leq q_s\} is a polyhedron, and it doesn’t depend on xx at all. The xx only shows up in the objective. Combined with the LP fact that an optimum is always attained at a vertex of the feasible region, we get a useful rewrite:

where the λks\lambda^s_k are the (finitely many) vertices of Λs\Lambda_s.

That’s a very nice expression to stare at. As a function of xx, v(ξs,x)v(\xi^s, x) is the pointwise maximum of a finite collection of affine functions. That makes it piecewise linear and convex. Picture the upper envelope of a fan of lines.

The expected recourse cost is just a weighted sum of these:

A weighted sum of piecewise-linear convex functions is itself piecewise-linear and convex. So Q(x)Q(x), the thing we’re trying to minimize in the first stage on top of c⊤xc^\top x, is a piecewise-linear convex function of xx. We don’t have an explicit formula for it, because that would require enumerating every vertex of every Λs\Lambda_s, but we know its shape.

That shape is the lever Benders pulls.

Approximating Q with cuts

We don’t know QQ in closed form, but a piecewise-linear convex function has a very forgiving property: at any point x‾\bar{x} where we know the value Q(x‾)Q(\bar x) and a subgradient g∈∂Q(x‾)g ∈ ∂Q(x̄), the linear function

is a global lower bound on QQ. It touches QQ at x‾x̄ and stays underneath everywhere else.

For the recourse cost specifically, the subgradient is something we can compute. If we solve the dual subproblems and obtain optimal dual vertices λk∗s\lambda^s_{k*} for each scenario, a subgradient of QQ at x‾x̄ is

So every time we solve the subproblems at some candidate xτx^\tau, we get a tight affine lower bound on QQ for free. Benders calls each such lower bound an optimality cut.

The plan is now visible. We’re going to build up a sequence of affine lower bounds on QQ, one per iteration, and use them as a stand-in for QQ in the first-stage problem.

The master problem

We replace Q(x)Q(x) with a placeholder variable θ\theta that has to lie above every lower bound we’ve accumulated so far. Iteration tt has the master problem:

Here LL is any safe lower bound on QQ — often L=0L = 0 if recourse costs are non-negative, or something more negative if not — and each (αt,βt)(αₜ, βₜ) is an optimality cut from a previous iteration. The master problem is just a regular LP, much smaller than the deterministic equivalent because it only carries the first-stage variables and a growing pile of cuts.

The candidate solution (xτ,θτ)(x^\tau, θ^\tau) from this master is then fed to the subproblems, which return a fresh cut, which gets added, and we go around again.

The algorithm

Putting it together, the uni-cut Benders algorithm is:

1. Solve the master problem to get (xτ,θτ)(x^\tau, θ^\tau).

2. For each scenario s=1,…,Ss = 1, …, S, solve the dual subproblem at xτx^\tau to get an optimal vertex λτsλˢ_τ and the optimal value v(ξs,xτ)v(ξˢ, x^\tau).

3. Build the optimality cut θ≥αt+βt⊤xθ ≥ α_t + β_t^\top x with

and add it to the master.

4. Stop if the master’s θτθ^\tau already satisfies the new cut at xτx^\tau — that is, if θτ≥αt+βt⊤xτθ^\tau ≥ α_t + β_t^\top x^\tau. Otherwise, go back to step 1.

A few things are quietly remarkable about this loop.

Finiteness. Every cut comes from some vertex λksλ^s_k of some Λs\Lambda_s. Each Λs\Lambda_s has finitely many vertices. So in the worst case, the algorithm enumerates all possible cuts and stops — it cannot loop forever. In practice, it stops much sooner: typically only a small handful of vertices end up mattering.

Decomposition done right. The big shared LP has been replaced by a small master problem plus S tiny, independent subproblems. The subproblems can be solved in parallel. The master grows, but only by one constraint per iteration.

Information, not enumeration. We never write out all the scenarios in one LP. We let the subproblems tell us, through their dual variables, what the master needs to know about them. The cuts are the language they speak.

Uni-cut versus multi-cut

In the recipe above, we aggregate the per-scenario information into a single cut on QQ. The natural alternative is to keep SS separate placeholders θ1,…,θSθ₁, …, θ_S, one per scenario, and add SS separate cuts each iteration. This is the multi-cut variant.

Multi-cut uses more information per iteration, so the master’s approximation of QQ tightens faster. The cost is that you’re adding SS constraints per iteration instead of one, and the master gets unwieldy faster. There’s no universal winner — You and Grossmann (2013) report meaningful speedups from multi-cut on supply-chain models, but it’s case-by-case. A reasonable default is to start with uni-cut and try multi-cut if convergence is slow.

A picture of one iteration

Sometimes the easiest way to absorb the algorithm is to walk through what each iteration accomplishes geometrically.

Imagine Q(x)Q(x) as the true (unknown) piecewise-linear convex function plotted over xx. The master’s current approximation Q^τ(x)\hat Q^\tau(x) is the pointwise maximum of all cuts we’ve added so far — a polyhedral lower bound that sits underneath Q everywhere.

At iteration τ\tau, the master minimizes c⊤x+Q^τ(x)c^\top x + Q̂^\tau (x), and lands on some xτx^\tau where its approximation thinks the optimum is. We then ask the subproblems: “what’s the true QQ at xτx^\tau?” They answer with both a value Q(xτ)Q(x^\tau) and a subgradient (the dual solution). That subgradient gives us a new linear function that touches QQ at xτx^\tau — the cut. We add it; the master’s approximation rises, just enough to be exact at xτx^\tau; and the next iteration’s xτ+1x^{\tau+1} will (usually) be different.

The loop stops when the master’s approximation already predicts the true value at xτx^\tau. Visually, this means our polyhedral lower envelope has caught up with QQ exactly at the point we care about.

You won’t always reach an exact match, because for numerical reasons most implementations stop at a small tolerance, but the geometry is the same.

It’s not a magic tool

So far this has read like an algorithmic free lunch: same answer, much smaller LPs, parallelizable, finite. Let’s be honest about the disadvantages as well.

Slow convergence in the wild. In theory, Benders terminates in finitely many iterations; in practice, “finitely many” can mean quite a lot of iterations, and each one solves an LP. Rahmaniani et al. (2017) is the standard literature review on the dozen-odd tricks people use to speed things up: cut selection, cut removal when old cuts become dominated, trust regions to avoid wildly oscillating xτx^\tau, and so on.

The master problem becomes the bottleneck. It’s not unusual for more than 90% of total runtime to go into solving the (continually growing) master. Common remedies include not solving the master to full optimality on early iterations, removing cuts that no longer help, and exploiting structure when the master is itself a structured LP.

Subproblems can dominate too. Especially when SS is large or each subproblem is itself a hard LP, the second stage takes over. The fix is mostly engineering: parallelize across scenarios, warm-start each subproblem from the previous iteration’s basis, batch them when possible.

The hidden assumption: relatively complete recourse. Everything above quietly assumed that the second-stage subproblem is feasible for any xx the master might propose. That’s the property of (relatively) complete recourse from the previous post: for every feasible first-stage x and every scenario ξξ, there exists a recourse y≥0y ≥ 0 satisfying Wy≥h(ξ)−T(ξ)xWy ≥ h(ξ) − T(ξ)x.

When that property fails, some xτx^\tau will leave the subproblem infeasible — the second-stage LP has no y that works, so the implicit cost is +∞+ \infty, meaning xτx^\tau was never a feasible first-stage choice to begin with. Benders patches this with feasibility cuts, which are derived by solving an auxiliary LP that measures how infeasible the subproblem is and using its dual to produce a hyperplane that slices xτx^\tau out of the master’s feasible region. Without going through the algebra, the structure is the same as optimality cuts: solve a small LP, look at the dual, build a linear constraint, hand it back to the master. In each iteration of Benders, then, you either add an optimality cut (if every subproblem was feasible) or a feasibility cut (if some subproblem was not). The recourse model with no complete-recourse guarantee still works; you just need both kinds of cuts.

It only fully applies to two-stage models. For multi-stage stochastic programs, the natural generalization isn’t just “Benders with more stages.” It’s stochastic dual dynamic programming (SDDP), which uses sampling and approximate value functions to avoid building out the full scenario tree. SDDP is the workhorse in hydropower scheduling, and it deserves, and probably will eventually get, its own post.

Integer first stages get tricky. If xx has to be integer (think capacity decisions, facility location), then the master is an MILP, and Benders’ clean convex theory loses some of its bite. There are extensions, such as generalized Benders, integer L-shaped, branch-and-Benders-cut, etc., but they’re past the scope of this post.

None of this means Benders is bad. It means it’s a method, not a miracle. Used thoughtfully — with the right master/subproblem split, the right cut variant, and the right speedup tricks — it’s the difference between solving a problem and not.

Wrapping up

We started with a problem: stochastic programs are easy to write down but easy to make too big to solve. The deterministic equivalent grows linearly in the number of scenarios, and LP solvers grow superlinearly in problem size, which is a quadratic-or-worse collision course.

Benders’ decomposition takes that block-angular structure seriously. It carves out the complicating first-stage variables, leaves the scenario-decoupled second stage to a swarm of small subproblems, and uses LP duality to ferry information between the two: optimality cuts when the subproblems are feasible, feasibility cuts when they’re not. The result is an iterative scheme that converges in finitely many steps and very often a lot faster than that.

To summarize what’s actually doing the work:

The second-stage value function v(ξ,x)v(ξ, x) is piecewise-linear and convex in xx. So is the expected recourse cost Q(x)Q(x).
A piecewise-linear convex function can be approximated arbitrarily well by a growing collection of affine lower bounds (cuts), each obtained for free from a dual solution.
The first stage doesn’t need to know QQ exactly — just to lower-bound it tightly at the points it cares about. The master/subproblem loop builds up exactly that.
Uni-cut and multi-cut are two ways to bundle the per-scenario information. Feasibility cuts handle the case when subproblems can fail.

If you take one thing away, let it be the block-angular look of the constraint matrix. Whenever you can rewrite an optimization problem so that fixing some variables makes the rest separable, you could try Benders. This idea shows up in network design, scheduling, energy planning, anywhere you have a “hard” set of decisions that bind together otherwise easy ones.

In the next post in this series, we’ll see what to do when even a two-stage Benders is the wrong frame — when the world doesn’t politely split into “decide, observe, correct” but goes “decide, observe, decide, observe, decide…” indefinitely. That’s the territory of SDDP. If you’ve ever wondered how hydropower operators schedule reservoirs over a multi-year horizon with thousands of joint inflow scenarios, the answer is in that next post.

Credits and references

As with the previous post, this one is based on lectures by Dr. Ruben van Beesten (Norwegian University of Science and Technology) from his October 2023 course on Stochastic Programming, which I had the pleasure of attending in Trondheim. The pedagogical buildup comes straight from his slides; any clumsiness in the retelling is mine. Also, all images, except for the thumbnail, in this post are made by me.

Four references worth knowing about:

Benders, J. F. (1962). Partitioning procedures for solving mixed-variables programming problems. Numerische Mathematik, 4(1), 238–252. The original paper. Surprisingly readable.
Van Slyke, R., and Wets, R. (1969). L-shaped linear programs with applications to optimal control and stochastic programming. SIAM Journal on Applied Mathematics, 17(4), 638–663. The stochastic-programming-flavored version; this is where the name “L-shaped method” comes from.
Fengqi You and Ignacio E Grossmann. (2013). Multicut benders decomposition algorithm for process supply chain planning under uncertainty. Annals of Operations Research, 210:191–211.
Rahmaniani, R., Crainic, T. G., Gendreau, M., and Rei, W. (2017). The Benders decomposition algorithm: A literature review. European Journal of Operational Research, 259(3), 801–817. The reference for accelerations, variants, and modern usage.

And the previous posts in the series: Five questions that will help you model integer linear programs better, Four steps to robustify your linear program, and A Gentle Introduction to Stochastic Programming.

Feeds