Why Small Data Can Beat Big Data in Causal Questions
Causal models thrive where machine learning struggles

In data science, the mantra has long been: bigger is better. Feed more rows to the model, crank up the GPUs, and let the patterns reveal themselves. In many domains — image recognition, speech processing, natural language — this logic has been transformative. Scale has delivered astonishing accuracy.
But bigger isn’t always better. When we care about cause and effect, less can actually be more! In many cases, “small data” delivers deeper, more reliable insights than a terabyte of correlations.
Consider healthcare: a randomized trial of just a few hundred patients can reveal whether a new drug works. In marketing, an A/B test with a thousand users can identify the impact of a new design. In manufacturing, a handful of controlled production runs can tell you which process change really reduces defects.
As a statistician, I used to shake my head at such small-number studies. But I’ve learned a lot over the past year or two.
These are all small-data causal problems. The goal is not to predict the next outcome but to understand what would have happened under different choices. Answering that question naturally tells you why things happen.
At my own firm, we see the same patterns in finance and sustainability. The datasets are rarely “big,” and the questions are often causal. That’s exactly the terrain where small data beats big data.
Why Most Industries Are Small-Data Causal Problems
If you look across industries, you’ll notice a common paradox: the questions leaders care most about are not “big-data” prediction questions, but “small-data” causal ones.
Healthcare: A randomized trial of a few hundred patients can determine whether a new drug saves lives. The goal isn’t to predict which patient will recover; it’s to establish whether the treatment causes recovery.
Marketing: An A/B test with a thousand website visitors can reveal whether a new design improves click-through rates. You don’t need millions of observations, because the experiment isolates cause directly.
Manufacturing: A handful of controlled production runs can show whether a process change really reduces defects. Here, the leverage comes from controlling variables, not amassing terabytes of sensor data.
Public policy: A pilot program in one city can demonstrate whether a new regulation improves outcomes. If designed well, the insights transfer more broadly.
Across all these cases, the datasets are modest — hundreds or thousands of observations, not billions. Yet the stakes are high, and the answers are actionable.
Finance and sustainability fit this pattern too. Asset managers may want to know whether board diversity improves resilience, or whether investing in renewables enhances returns. These are not prediction problems; they are counterfactuals. And the datasets available are almost never “big.”
The problem is that causal-inference practice is often niche, or siloed within individual industries. There have been efforts to pull these approaches together, but the field is still in its early days. My firm is an active contributor to this emerging space of methodologies.
Why Causal Inference Thrives in Small Data
In the world of prediction, more data is usually the answer. If your model misclassifies cats and dogs, feeding it another million images will improve accuracy. But causality plays by different rules.
Causal inference is less about detecting patterns and more about identifying the underlying mechanisms that generate them.
Take a simple A/B test in marketing. You don’t need millions of participants to conclude that one version of a website drives higher conversions. What you need is a clean comparison where the only difference between groups is the treatment itself. The clarity comes from design, not size.
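To make the design-over-size point concrete, here is a minimal sketch of how such a test can be analyzed with a permutation test. The visitor counts and conversion numbers below are invented for illustration:

```python
import random

# Hypothetical A/B test: 500 visitors per arm (all numbers are invented).
random.seed(0)
control   = [1] * 50 + [0] * 450   # 10% conversion
treatment = [1] * 70 + [0] * 430   # 14% conversion

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

# Permutation test: if the treatment had no effect, the group labels are
# arbitrary, so reshuffling them shows how often a lift this large
# arises by chance alone.
pooled = control + treatment
n = len(treatment)
reps = 10_000
extreme = 0
for _ in range(reps):
    random.shuffle(pooled)
    diff = sum(pooled[:n]) / n - sum(pooled[n:]) / n
    if diff >= observed:
        extreme += 1

p_value = extreme / reps
print(f"observed lift: {observed:.3f}, one-sided p-value: {p_value:.3f}")
```

A thousand visitors, one clean comparison, and a defensible answer; no terabytes required.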
Causal inference relies on three pillars that often make small data sufficient:
Counterfactual framing: The question is not “what usually happens?” but “what would have happened under a different choice?” Such a question can often be answered with fewer but better-structured observations.
Domain knowledge: Causal inference formalizes expert intuition through tools like Directed Acyclic Graphs (DAGs). By specifying which variables influence which, we constrain the problem space, making big data less critical.
Uncertainty quantification: With small samples, techniques like bootstrapping and Bayesian updating let us honestly report confidence intervals. Rather than masking noise under “accuracy,” causal methods acknowledge limits while still offering actionable guidance.
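The bootstrapping idea in that last pillar can be sketched in a few lines. The defect-rate numbers below are invented; the point is that even eight observations per group support an honest confidence interval:

```python
import random
import statistics

# Hypothetical small study: defect rates (%) from a handful of
# production runs before and after a process change (numbers invented).
before = [4.1, 3.8, 5.0, 4.6, 4.3, 4.9, 4.4, 4.7]
after  = [3.2, 3.9, 3.5, 3.0, 3.6, 3.3, 3.8, 3.1]

point_estimate = statistics.mean(after) - statistics.mean(before)

# Bootstrap: resample each group with replacement many times and record
# the difference in means, giving an honest picture of the uncertainty.
random.seed(1)
diffs = []
for _ in range(10_000):
    b = [random.choice(before) for _ in before]
    a = [random.choice(after) for _ in after]
    diffs.append(statistics.mean(a) - statistics.mean(b))

diffs.sort()
lo, hi = diffs[249], diffs[9749]   # 95% percentile interval
print(f"effect: {point_estimate:.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, the process change plausibly reduced defects; if it straddles zero, the honest answer is "we can't tell yet."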
In industries like finance or sustainability, the reality is that datasets are rarely massive. You may have a few dozen firms, a decade of history, and incomplete coverage. Not enough for a neural network — but enough to answer carefully structured causal questions.
Attribution analysis or “driver-based” decomposition might tell you where the numbers came from. But causal inference tells you whether pulling a lever would truly shift the outcome, and that is exactly the kind of insight decision-makers crave.
Where Causal AI Fits In
So where does this leave Causal AI? If causal inference thrives on small datasets, and AI thrives on large ones, what’s the middle ground?
Traditional AI thrives on prediction: feed it massive datasets, and it will detect patterns humans can’t see. This is why it works so well for text, images, or credit scoring. But AI on its own rarely tells us why. It can recommend a product, but it can’t explain whether the recommendation causes a sale.
Causal AI is an emerging hybrid that combines the strengths of both worlds. It layers causal structure on top of machine learning, enabling models that are both more data-efficient and more transferable.
This combination is powerful for several reasons:
Data efficiency: By embedding causal assumptions, causal AI can learn from fewer examples without overfitting. Instead of blindly absorbing every correlation, it focuses on the relationships that matter.
Generalization: A predictive model trained on one dataset often fails when the context shifts. A causal model, by focusing on mechanisms, can adapt better. For example, a causal model of consumer behavior trained in Europe may transfer more effectively to the US than a purely correlative AI model.
Interpretability: Causal AI doesn’t just output probabilities. It can provide answers to “what if?” questions: what if prices changed, what if regulation tightened, what if policies shifted?
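A toy structural causal model illustrates this kind of interventional query. All the equations and coefficients below are made up; the point is that setting a variable (the do-operator) severs its usual causes, which a purely correlational model cannot express:

```python
import random

# Toy structural causal model (coefficients invented): demand depends on
# price and a seasonal factor, and price itself also responds to season.
random.seed(2)

def simulate(do_price=None):
    season = random.gauss(0, 1)
    if do_price is None:
        # Observational regime: price follows its usual causes.
        price = 10 + 2 * season + random.gauss(0, 0.5)
    else:
        # Interventional regime: do(price) cuts the season -> price link.
        price = do_price
    demand = 100 - 3 * price + 4 * season + random.gauss(0, 1)
    return demand

# Interventional query: average demand if we *set* price to 8 vs 12.
n = 20_000
d8  = sum(simulate(do_price=8)  for _ in range(n)) / n
d12 = sum(simulate(do_price=12) for _ in range(n)) / n
print(f"E[demand | do(price=8)] ~ {d8:.1f}, do(price=12) ~ {d12:.1f}")
```

The model answers “what if prices changed?” directly, because the intervention is defined in the structure itself rather than inferred from historical correlations.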
Of course, the field is still young. Today’s causal AI systems remain fragile: get the causal graph wrong, and the results mislead. And most production AI systems remain correlation-driven. But the trajectory is clear: the future isn’t just about scaling data; it’s about embedding intelligence into how we ask questions.
For industries with inherently small datasets — healthcare trials, industrial process improvements, or ESG data — this is especially promising. You don’t need a trillion data points to answer causal questions. You need the right structure, the right assumptions, and perhaps a little help from smarter algorithms.
The Bottom Line: Causality needs depth, not breadth
Big data and AI have transformed how we recognize patterns. They thrive in domains where prediction is the end goal: recommending the next song, spotting fraudulent transactions, or translating a sentence. In those settings, more data almost always improves performance.
In effect, we ask, “what will happen?” But ask instead, “what would happen if we did X?”, and ML models come up empty.
When the task shifts from prediction to intervention, the rules change. What matters most is not scale, but structure. A carefully designed small dataset, analyzed through a causal lens, can answer questions that even the largest predictive model cannot.
This is where causal AI comes in. By combining causal graphs with machine learning, researchers are starting to build models that are both data-efficient and generalizable. They can learn from smaller samples, transfer across contexts, and provide not only forecasts but explanations. It’s still early, but the direction is clear: the future of AI isn’t just bigger, it’s smarter.
For decision-makers across industries, from healthcare to finance, from marketing to manufacturing, the message is the same. Don’t just chase volume. If you want strategic insight, ask causal questions.