How to Mathematically Choose the Optimal Bins for Your Histogram

towardsdatascience.com

Have you ever wondered how to choose your bins in a histogram? Did you ever ask yourself whether there are deeper reasons for choices that go beyond that it just looks nice? While histograms are the most fundamental tool for data visualization, setting their resolution is important, especially when the histogram itself is used for further analyses. Histograms are often computed to visualize the density of the data. In this post, we explore the mathematics of density fitting, specifically looking at how bins should shrink as our dataset grows. Inspired by adjacent fields such as perturbation theory in physics and Taylor expansions in mathematics, we will find a rigorous method for constructing densities.

All images are by the author

Background

Approximations

The intuition is simple: the more data you have, the more detail you should be able to see. If you are looking at a sample of ten observations, two or three wide bins are likely all you can afford before your visualization becomes a sparse collection of empty gaps. But if you have ten million observations, those wide bins start to feel like a low-resolution pixelated photograph. You want to “zoom in” by increasing the number of bins. The question, however, is: How exactly should we scale this resolution?

In physics, when we face a system that is too complex to solve exactly, we often turn to Perturbation Theory. In Quantum Electrodynamics (QED), for example, we approximate complex interactions by expanding them in terms of a small coupling constant, like the electron charge e. This “interaction strength” provides a natural hierarchy for our approximations. But for a histogram, what is the analogous “charge”? Is there a fundamental parameter that governs the interaction between our discrete data points and the underlying distribution we are trying to estimate?

Mathematics offers another path: the Taylor Expansion. If we assume the underlying density function is sufficiently smooth (analytical), we can describe it locally using its derivatives. This feels like a promising lead as the higher orders can be demonstrated to vanish. Although we may want to accept a restriction to analytical distributions, it is not clear how this leads to a certain bin size.

Alternatively, we might treat the problem as an Expansion in Basis Functions. Just like we can represent a piece-wise continuous function using a Fourier transform or Legendre polynomials, we could view histogram bins as a set of basis functions. Using such an approach we could approximate the function in terms of L2. But this approach introduces its own set of hurdles. How do we compute the coefficients for these functions efficiently? And more importantly, how do we satisfy the physical constraints of a probability density function? Unlike a general Fourier series, a density function must be strictly positive-definite and normalized to one. We will see in the following that the method obtained from information theory has similar aspects to expanding in basis functions.

Information Theory

Priors & Posteriors

For an introduction to Bayesian statistics or information theory, the reader is referred to (Murphy, 2022). In a Bayesian approach, a model P(X|θ)P(X|\theta) , where X are the observables we want to model and θ\theta are our parameters, also contains a prior distribution 𝑃(𝜃|ℳ) that reflects our belief on the distribution before data was observed. After the data has been observed, we can estimate the posterior distribution P(θ|X)P(\theta | X)

𝑃(𝜃|𝑋) = 𝑃(𝑋|𝜃)𝑃(𝜃|ℳ)/𝑃(𝑋)

This procedure is mathematically elegant because it is 100% safe against overfitting. However, it demands a strict discipline: we are not allowed to choose our model or prior after having seen the data. If we use the data to decide which model structure to use, we break the underlying logic of the inference.

The most-likely model given the data versus model weighting

The quality of a model can be computed by considering its surprisal (see e.g. (Vries, 2026))

log 𝑃(𝑋|ℳ) = −surprisal = accuracy – complexity

Models with an excessive number of parameters (because one may be tempted to include all kind of hypothetical interactions) may achieve an incredible accuracy, but they are “killed” by the penalty of their own complexity. The ideal model isn’t the most detailed one; it is the one that captures the most information with the least amount of unnecessary baggage.

When considering a set of models, one can compute the likelihood of each model in comparison with the models under consideration

𝑃(ℳ𝑖 ∣ 𝑋) ~ 𝑃(𝑋 | ℳ𝑖) 𝑃(ℳ𝑖 )

It is tempting to simply pick the model with the highest probability and move on. But this “winner takes-all” approach carries risks:

Statistical Fluctuations: The data 𝑋 might contain a random fluke that makes a sub-optimal model look temporarily superior.
The Weight of the Crowd: Sometimes, the sum of many “less likely” models actually outweighs the probability of the single “best” model.

Because of this, a more robust path is to carry all models forward, weighting them by their probability. It is important to note that this is not a “mixture” of different truths; we still assume only one model is actually true, but we use the full distribution of possibilities to account for our own uncertainty.

Densities

A density using Bayesian approach

To treat a density as a formal model, we view each of its 𝐾 bins as a parameter. Specifically, we assign a weight wkw_k to each bin, representing the probability of a data point falling into that interval. Because the total probability must sum to one (∑kwk=1\sum_k w_k=1), a density with 𝐾 bins is defined by 𝐾 −1 independent parameters, such models are also called mixtures. In our Bayesian framework, we need to assign a prior to these weights. Given that we are dealing with categorical proportions that must sum to one, the Dirichlet distribution is the mathematically natural choice.

Choosing the Hyperparameters

The Dirichlet distribution is governed by hyperparameters, often denoted as 𝛼. These values represent our “pseudo-counts”—essentially what we believe the density looks like before we
have even seen the first data point. When we assume a flat prior (where the evidence 𝑃(𝑋) is constant), two primary strategies emerge for choosing 𝛼:

𝛼 =1/𝐾 (The Sparse Choice): This is often used when we expect the data to be highly concentrated. It assumes a priori that the majority of bins will be empty, making it a “sparsity-promoting” prior.
𝛼 =1 (The Uniform Choice): Also known as the flat or Laplace prior, this assumes that every possible distribution of weights is equally likely. It essentially adds one “virtual” observation to every bin before the real data arrives.

For the purpose of constructing a standard density, the second choice 𝛼 = 1 is often the most natural. It reflects a neutral starting point where we assume the data is uniformly distributed across the interval until the evidence proves otherwise.

By defining our bins this way, we have transformed the “pixelation” of a density into a rigorous model. We now have a fixed set of parameters (𝐾 − 1 weights) and a clear prior (𝛼 = 1). The next step is to use the data to determine the optimal number of bins 𝐾 by balancing the accuracy of the fit against the complexity of the parameters.

Example

Please look at the data in the figure below:

When fitting with 8 bins we obtain:

What one can see in this density is that the right-most bin is above zero although no data points were present in this bin. This is a result of the Bayesian approach which estimates the believed density based on our prior belief and the data that we observed.

Summarizing, we obtained a density using a Bayesian approach. We defined a prior 𝑃(𝜃) that reflected our expectation for a uniform density. Then we took the data and we computed the posterior 𝑃(𝜃|𝑋) that underlies the resulting density.

Weighted densities

Using the approach of the previous section we can make densities using 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 bins. More bins give a more accurate fit of the data but also introduce additional complexities. As was discussed in the previous section, one can use accuracy and complexity to compute its evidence. When viewing each density as a model, we can compute its likelihood to be true compared to the set of models we are considering. This yields the figure below:

In the previous section it was discussed that one may choose the “best” model which would in this case be the use of 8 bins. However, it is safer to take a weighted sum over all the models. This
yields:

It is important to realize that from a Bayesian perspective this is the best that we can do. Also note that in this graph there is a density present of 1024 bins. Lastly, one can prove that densities of higher orders N will diminish.

Densities with unequal bins

The previously obtained density above looks a bit blocky which originates from the choice of using equal bins. There are other options available such as taking random splits (and compensating the prior for it). This yields the graph below:

Densities with error bars

Now to close off the construction of densities, it may be of interest to visualize our uncertainty in these densities. Although numerically expensive to compute, the expression for computing the standard deviation in the density is remarkably straightforward (F. Pijlman, 2023)

σP(x|X)2=P(x|X)(P(x|x,X)−P(x|X))\sigma_{P(x|X)}^2 = P(x|X) \left( P(x|{x,X}) – P(x|X) \right)

This yields the densities below:

Conclusions

We began with a simple question: Is there a mathematical foundation for choosing the bins in a histogram? As the concept of bins inherently connects data points with densities, we studied how
to choose bins for densities.

Using a Bayesian approach (information theory) one can fit densities without having to worry of overfitting (too many bins showing too much detail). Although one can compute the “best” bin-width, we saw that:

Model weighting allows us to combine multiple resolutions, providing a smoother and more honest representation of the data.
Dirichlet Priors give us a rigorous way to express our initial assumptions about data distribution.

Just as perturbation theory provides a hierarchy for physical interactions, this Bayesian framework provides a hierarchy for data resolution. The resolution scales naturally as more data becomes available. Note that such ideas can also be used when learning models in which one has an expansion in interactions.

The method of combining densities of various resolutions was also explored in case random bins are chosen. This led to smooth histograms which may appear to be more natural for most data
sets.

We also presented the use of standard deviations in histograms. Although the calculation of standard deviations was derived for Bayesian models, its calculation-procedure suggests a wider applicability. As such, it can be for visualizing the remaining uncertainties in densities.

Acknowledgements

The EdgeAI “Edge AI Technologies for Optimised Performance Embedded Processing” project has received funding from Key Digital Technologies Joint Undertaking (KDT JU) under grant agreement No. 101097300. The KDT JU receives support from the European Union’s Horizon Europe research and innovation program and Austria, Belgium, France, Greece, Italy, Latvia, Luxembourg, Netherlands, and Norway.

References

F. Pijlman, J. L. (2023). Variance of Likelihood of Data. https://sitb2023.ulb.be/proceedings/, 34/37.
Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
Vries, B. d. (2026). Active Inference for Physical AI Agents. arXiv.

Bio

Fetze Pijlman is a Principal Scientist at Signify Research in Eindhoven, the Netherlands. His research focus spans probabilistic machine learning, Bayesian inference, and signal processing, with a particular interest in applying these mathematical frameworks to IoT, sensing, and smart systems.

Feeds