Random assignment of treatments in experiments has the amazing tendency to balance out confounders and other covariates across testing groups. This tendency provides a lot of favorable features for analyzing the results of experiments and drawing conclusions. However, randomization tends to balance covariates — it is not guaranteed.
What if randomization doesn’t balance the covariates? Does imbalance undermine the validity of the experiment?
I grappled with this question for some time before I came to a satisfactory conclusion. In this article, I’ll walk you through the thought process I took to understand that experimental validity depends on independence of the covariates and the treatment, not balance.
Here are the specific topics that I'll cover: how randomization tends to balance covariates, why covariates sometimes fail to balance anyway, and why balance is beneficial but not required for a valid experiment.
The Central Limit Theorem (CLT) shows that the mean of a randomly selected sample is approximately normally distributed, with a mean equal to the population mean and a variance equal to the population variance divided by the sample size. This concept is central to our conversation because we are interested in balance, i.e., when the means of our random samples are close. The CLT gives us the distribution of those sample means.
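In symbols, for a population with mean μ and variance σ², the mean X̄ of a random sample of size n is (approximately, for large n) distributed as:

X̄ ~ Normal(μ, σ²/n)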
Because of the CLT, we can think of the mean of a sample the same way we would any other random variable. If you remember back to probability 101, given the distribution of a random variable, we can calculate the probability that an individual draw from that distribution falls within a specific range.
Before we get too theoretical, let's jump into an example to build intuition. Say we want to run an experiment that needs two randomly selected groups of rabbits. We'll assume that an individual rabbit's weight is approximately normally distributed with a mean of 3.5 lbs and a variance of 0.25 lbs² (i.e., a standard deviation of 0.5 lbs).

Hypothetical weight distribution of rabbit population – by author
The simple Python function below calculates the probability that the mean of our random sample of rabbits falls in a specific range, given the population distribution and a sample size:
import numpy as np
from scipy.stats import norm

def normal_range_prob(lower, upper, pop_mean, pop_std, sample_size):
    # By the CLT, the sample mean is ~ Normal(pop_mean, pop_std / sqrt(n))
    sample_std = pop_std / np.sqrt(sample_size)
    upper_prob = norm.cdf(upper, loc=pop_mean, scale=sample_std)
    lower_prob = norm.cdf(lower, loc=pop_mean, scale=sample_std)
    return upper_prob - lower_prob
Let’s say that we would consider two sample means as balanced if they both fall within +/-0.10 lbs of the population mean. Additionally, we’ll start with a sample size of 100 rabbits each. We can calculate the probability of a single sample mean falling in this range using our function like below:

probability of our random sample having a mean between 3.4 and 3.6 pounds – image by author
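Running that same calculation in code, with the population standard deviation of 0.5 lbs (the square root of the 0.25 lbs² variance):

# P(3.4 <= sample mean <= 3.6) for 100 rabbits drawn from Normal(3.5, 0.5**2)
print(normal_range_prob(3.4, 3.6, pop_mean=3.5, pop_std=0.5, sample_size=100))
# ~0.954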
With a sample size of 100 rabbits, we have about a 95% chance of our sample mean falling within 0.1 lbs of the population mean. Because the two groups are sampled independently, we can use the product rule to calculate the probability of two samples being within 0.1 lbs of the population mean by simply squaring the original probability. So, the probability of the two samples being balanced and close to the population mean is about 90% (0.95²). If we had three test groups, the probability of all of them balancing close to the mean would be 0.95³ ≈ 86%.
There are two relationships I want to call out here: (1) as the sample size goes up, the probability of balancing increases, and (2) as the number of test groups increases, the probability of all of them balancing goes down.
The table below shows the probability of all randomly assigned test groups balancing for multiple sample sizes and test group numbers:

Image by author
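The exact grid of sample sizes in the table isn't reproduced here, but a sketch like the one below (with sample sizes and group counts of my own choosing) computes the same quantities using the product rule:

# Probability that ALL test groups' means land within +/-0.1 lbs of 3.5 lbs
for n in [10, 50, 100, 500]:
    p = normal_range_prob(3.4, 3.6, pop_mean=3.5, pop_std=0.5, sample_size=n)
    row = ", ".join(f"{k} groups: {p**k:.2f}" for k in [2, 3, 5])
    print(f"sample size {n}: {row}")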
Here we see that with a sufficiently large sample size, our simulated rabbit weight is very likely to balance, even with 5 test groups. But, with a combination of smaller sample sizes and more test groups, that probability shrinks.
Now that we have an understanding of how randomization tends to balance covariates in favorable circumstances, we’ll jump into a discussion of why covariates sometimes don’t balance out.
Note: In this discussion, we only considered the possibility that covariates balance near the population mean. Hypothetically, they could balance at a location away from the population mean, but that would be very unlikely. We ignored that possibility here, but I wanted to call out that it does exist.
In the previous discussion, we built intuition on why covariates tend to balance out with random assignment. Now we’ll transition to discussing what factors can drive imbalances in covariates across testing groups.
Below are the five reasons I’ll cover:
Covariate balancing is always associated with probabilities and there is never a perfect 100% probability of balancing. Because of this, there is always a chance — even under very good randomization conditions — that the covariates in an experiment won’t balance.
When we have small sample sizes, the variance of our sample-mean distribution is large. That wide distribution makes large differences in the average covariates across test groups much more likely, which is exactly what covariate imbalance is.

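Using the normal_range_prob function from earlier, we can see how quickly the balance probability degrades as the sample size shrinks (same ±0.1 lbs window and rabbit population as before):

# Balance probability (mean within +/-0.1 lbs of 3.5 lbs) vs. sample size
print(normal_range_prob(3.4, 3.6, pop_mean=3.5, pop_std=0.5, sample_size=100))  # ~0.95
print(normal_range_prob(3.4, 3.6, pop_mean=3.5, pop_std=0.5, sample_size=10))   # ~0.47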
Until now, we've also assumed that our treatment groups all have the same sample sizes. There are many circumstances where we will want different sample sizes across treatment groups. For example, we may have a preferred medication for patients with a specific illness, but we also want to test whether a new medication is better. For a test like this, we want to keep most patients on the preferred medication while randomly assigning some patients to the potentially better, but untested, medication. In situations like this, the smaller testing group will have a wider distribution for its sample mean, and therefore a higher probability of a sample mean far from the population mean, which can cause imbalance.
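To illustrate (reusing the rabbit population numbers purely for convenience, and with a hypothetical 900/100 split), the smaller arm's sample mean has a distribution three times wider:

import numpy as np
# Standard deviation of the sample-mean distribution for each arm
print(0.5 / np.sqrt(900))  # ~0.017 lbs for the large, preferred arm
print(0.5 / np.sqrt(100))  # 0.05 lbs for the small, experimental arm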
The CLT correctly identifies that the sample mean of any distribution is normally distributed given a sufficient sample size. However, a sufficient sample size is not the same for all distributions. Extreme distributions require larger samples before the sample mean becomes approximately normal. If a population has covariates with extreme distributions, larger samples will be required for the sample means to behave nicely. If your sample sizes look large but are still too small to compensate for the extreme distributions, you can face the small-sample-size problem from the previous section despite your nominally large sample.

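As a quick illustration (the heavy-tailed lognormal population here is my own choice, not part of the rabbit example), we can simulate how slowly the sample mean's distribution becomes symmetric:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

def sample_mean_skewness(n, reps=10_000):
    # Draw `reps` samples of size n from a heavy-tailed lognormal population
    # and measure how skewed the distribution of the sample means still is
    means = rng.lognormal(mean=0.0, sigma=2.0, size=(reps, n)).mean(axis=1)
    return skew(means)

print(sample_mean_skewness(10))    # still strongly skewed; CLT hasn't kicked in
print(sample_mean_skewness(1000))  # much closer to symmetric and normal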
Ideally, we want all testing groups to have balanced covariates. As the number of testing groups increases, that becomes less and less likely. Even in extreme cases where a single testing group has a 99% chance of being close to the population mean, with 100 groups the probability that all of them land in that range is only 0.99¹⁰⁰ ≈ 37%, and in expectation one group will fall outside it.
While one hundred testing groups may seem pretty extreme, it is not uncommon to have many testing groups. Common experimental designs include multiple factors to be tested, each at various levels. Imagine we are testing the efficacy of different plant nutrients on plant growth. We may want to test 4 different nutrients at 3 different concentration levels each. If this experiment were full factorial (a test group for each possible combination of treatments), we would create 81 (3⁴) test groups.
In our rabbit experiment example, we only discussed a single covariate. In practice, we want all impactful covariates to balance out. Similar to the problem of too many testing groups, each covariate has some probability of not balancing: the more covariates there are, the less likely it is that all of them will balance. For example, if each of ten independent covariates balances with probability 0.95, the probability that all ten balance is 0.95¹⁰ ≈ 60%. We should consider not only the covariates we know are important, but also the unmeasured ones we don't track or even know about. We want those to balance too.
Those are five reasons that we may not see balance in our covariates. It isn’t a comprehensive list, but it is enough for us to have a good grasp of where the problem often comes up. We are now in a good position to start talking about why experiments are valid even if covariates don’t balance.
Balanced covariates have benefits when analyzing the results of an experiment, but they are not required for validity. In this section, we will explore why balance is beneficial, but not necessary for a valid experiment.
When covariates balance across test groups, treatment effect estimates tend to be more precise, with lower variance in the experimental sample.
It is often a good idea to include covariates in the analysis of an experiment. When covariates balance, estimated treatment effects are less sensitive to the inclusion and specification of covariates in the analysis. When covariates do not balance, both the magnitude and interpretation of the estimated treatment effect can depend more heavily on which covariates are included and how they are modeled.
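Here is a minimal sketch of what covariate adjustment can look like, using statsmodels with simulated data (the variable names, coefficients, and noise levels are all invented for illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200

# Simulated experiment: weight_gain depends on diet (treatment) and age (covariate)
df = pd.DataFrame({
    "diet": rng.integers(0, 2, n),    # randomized treatment (0/1)
    "age": rng.normal(12.0, 3.0, n),  # covariate, e.g., age in months
})
df["weight_gain"] = 0.5 * df["diet"] - 0.05 * df["age"] + rng.normal(0, 0.3, n)

# Compare the unadjusted and covariate-adjusted treatment effect estimates
print(smf.ols("weight_gain ~ diet", data=df).fit().params["diet"])
print(smf.ols("weight_gain ~ diet + age", data=df).fit().params["diet"])

With randomized treatment, both estimates target the same true effect (0.5 here), but the adjusted one is typically more precise; under imbalance, the two can diverge more noticeably.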
While balance is ideal, it isn’t required for a valid experiment. Experimental validity is all about breaking the treatment’s dependence on any covariate. If that is broken, then the experiment is valid — correct randomization always breaks the systematic relationship between treatment and all covariates.
Let’s go back to our rabbit example again. If we allowed the rabbits to self-select the diet, there might be factors that impact both weight gain and diet selection. Maybe younger rabbits prefer the higher fat diet and younger rabbits are more likely to gain weight as they grow. Or perhaps there is a genetic marker that makes rabbits more likely to gain weight and more likely to prefer higher fat meals. Self-selection could cause all sorts of confounding issues in the conclusion of our analysis.
If, instead, we randomize the diet assignment, the systematic relationships between diet selection (treatment) and age or genetics (confounders) are broken, and our experimental process is valid. As a result, any remaining association between treatment and covariates is due to chance rather than selection, and causal inference from the experiment is valid.

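A small simulation makes the point concrete (all the numbers here are invented for illustration; the diet has no true effect on weight gain):

import numpy as np

rng = np.random.default_rng(7)
n = 10_000

age = rng.uniform(2, 24, n)  # rabbit age in months

# Self-selection: younger rabbits are more likely to choose the higher fat diet
p_high_fat = 1 / (1 + np.exp(0.3 * (age - 12)))
self_selected = rng.random(n) < p_high_fat

# Randomization: diet assigned by coin flip, independent of age
randomized = rng.random(n) < 0.5

# True model: diet has NO effect; younger rabbits simply gain more weight
weight_gain = 2.0 - 0.05 * age + rng.normal(0, 0.2, n)

for name, diet in [("self-selected", self_selected), ("randomized", randomized)]:
    diff = weight_gain[diet].mean() - weight_gain[~diet].mean()
    print(f"{name}: apparent diet effect = {diff:+.3f} lbs")

The self-selected comparison shows a large spurious "diet effect" driven entirely by age, while the randomized comparison lands close to the true effect of zero.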
While randomization breaks the link between confounders and treatments and makes the experimental process valid, it doesn't guarantee that our experiment won't come to an incorrect conclusion.
Think about simple hypothesis testing from your intro statistics course. We randomly draw a sample from a population to decide whether the population mean differs from a given value. This process is valid, meaning it has well-defined long-run error rates, but bad luck in a single random sample can still cause type I or type II errors. In other words, the approach is sound, even though it does not guarantee a correct conclusion every time.

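We can see those well-defined long-run error rates directly with a simulation (the population parameters reuse the rabbit example; the 5% significance level is the conventional choice):

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)

# Under the null (the true mean really is 3.5), a valid test still rejects ~5% of the time
trials = 10_000
rejections = 0
for _ in range(trials):
    sample = rng.normal(3.5, 0.5, 30)
    if ttest_1samp(sample, popmean=3.5).pvalue < 0.05:
        rejections += 1
print(rejections / trials)  # close to the nominal 0.05 type I error rate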
Randomization in experimentation works the same way. It is a valid approach to causal inference, but that does not mean every individual randomized experiment will yield the correct conclusion. Chance imbalances and sampling variation can still affect results in any individual experiment. The possibility of erroneous conclusions doesn't invalidate the approach.
Randomization tends to balance covariates across treatment groups, but it does not guarantee balance in any single experiment. What randomization guarantees is validity. The systematic relationship between treatment assignment and covariates is broken by design. Covariate balance improves precision, but it is not a prerequisite for valid causal inference. When imbalance occurs, covariate adjustment can mitigate its consequences. The key takeaway is that balance is desirable and helpful, but randomization (not balance) is what makes an experiment valid.