The Data Isn’t Dumb—But Your Model Might Be
A sharp look at why bad sustainability insights aren’t always the data’s fault.

Sustainability data has a reputation problem.
Talk to any skeptical analyst or PM, and you’ll hear it:
“The data just isn’t good enough.”
“Too sparse, too noisy, too subjective.”
“Give me something clean, and then I’ll take it seriously.”
It’s a tempting narrative: I’m not ignoring sustainability—the data just isn’t ready.
But there’s a quieter truth that rarely gets acknowledged:
The bottleneck isn’t always the data. Sometimes it’s the model.
Or more precisely: the model’s structure, its assumptions, and the way it digests that data.
Let’s unpack why even the smartest, most expensive dataset can’t save you if your model is conceptually flat—and how to fix that.
Good Data Can’t Save a Dumb Pipeline
Imagine you’ve got a dataset containing company-level sustainability metrics: emissions, energy mix, labor practices, biodiversity impacts, board diversity. You spent a fortune licensing it. Maybe you even scraped it from sustainability reports yourself.
You plug it into your usual pipeline. A regression. A risk model. A dashboard.
The result?
Either nothing looks significant…
Or you get results that feel flimsy, hard to interpret, or just plain wrong.
And you blame the data.
But let’s pause. What if the data is doing its job—signaling complexity, latent structure, causal loops—but your model is flattening all of that into a dumb pipe that can’t capture it?
Garbage-in, garbage-out is real.
But so is: gold-in, mush-out.
The Real Problem: Structural Blindness
Most classical financial models are designed for well-behaved variables:
Linear relationships
Independent features
Short time lags
No endogenous feedback
But sustainability variables often behave badly—in the best possible way:
Emissions are driven by decisions made years ago.
Board diversity may affect culture, which affects retention, which affects long-term performance.
Social risk may look fine… until a scandal breaks and the stock tanks.
These aren’t linear effects. They’re structured systems. And structured systems need structural models.
If your model doesn’t reflect how sustainability affects business fundamentals, the smartest dataset in the world won’t help you. You'll just be projecting confusion through a rigid pipeline.
A Quick Example: Flattening Carbon Risk
Let’s say you want to estimate the financial risk of carbon intensity.
Here’s a flat approach:
from sklearn.linear_model import LinearRegression
# df is assumed to hold one row per company, with both columns populated
model = LinearRegression()
model.fit(X=df[["carbon_intensity"]], y=df["stock_volatility"])
Your model says: “Carbon intensity has no significant effect.”
You conclude: “It’s not financially material.”
But this is lazy modeling.
A more thoughtful analyst might ask:
What else determines carbon intensity? (industry, regulation, production method)
Under what conditions does carbon risk matter? (pricing schemes, consumer sentiment)
Are we looking at immediate risk or long-term transition exposure?
Now you’ve got causal structure. Maybe even counterfactuals.
And suddenly the same data starts yielding insight—not just noise.
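Here is roughly what that looks like in code. It is only a sketch: the extra columns (industry, carbon_price_exposure, production_method) are hypothetical stand-ins for the conditions listed above, and df is the same company-level frame as before.

import statsmodels.formula.api as smf
# Let carbon intensity interact with carbon-price exposure, and control for
# the structural drivers of intensity itself (industry, production method)
conditional_model = smf.ols(
    "stock_volatility ~ carbon_intensity * carbon_price_exposure"
    " + C(industry) + C(production_method)",
    data=df,
).fit()
print(conditional_model.summary())

The interaction term is the point: instead of one flat coefficient, you get an estimate of how carbon risk shifts as the pricing environment changes.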
How Bad Models Flatten Nuance
Let’s name a few common ways traditional modeling pipelines sabotage sustainability insight:
1. Feature Reduction Without Theory
You’ve got 70 sustainability variables. So you throw them into PCA or an automated feature selector.
End result? You lose all interpretability. Now you have “Component 1” instead of “renewable energy mix,” and no clue what it means.
What to do instead: build a structural understanding of which variables matter and why—and keep them explicit.
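As a rough sketch (column names hypothetical), keeping features explicit can be as simple as grouping them by the mechanism they represent:

# Theory-driven feature groups instead of anonymous principal components
transition_risk = ["carbon_intensity", "renewable_energy_mix", "green_capex_share"]
workforce = ["employee_turnover", "safety_incident_rate"]
governance = ["board_diversity", "audit_independence"]
X = df[transition_risk + workforce + governance]  # every column keeps its name and meaning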
2. Over-Control and Collider Bias
You add every variable you can find into the model. Just in case.
But now you’ve blocked causal paths, introduced collider bias, and your coefficient on carbon_intensity is meaningless.
What to do instead: use a causal graph (DAG) to identify valid adjustment sets. Tools like DoWhy or DAGitty can help.
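A minimal DoWhy sketch, under an assumed toy graph in which industry confounds both variables and a hypothetical esg_rating is a downstream collider you should leave alone:

from dowhy import CausalModel
# The graph encodes the assumptions: industry is a confounder,
# esg_rating is a collider (caused by both treatment and outcome)
causal_model = CausalModel(
    data=df,
    treatment="carbon_intensity",
    outcome="stock_volatility",
    graph="digraph { industry -> carbon_intensity; industry -> stock_volatility; "
          "carbon_intensity -> stock_volatility; carbon_intensity -> esg_rating; "
          "stock_volatility -> esg_rating; }",
)
estimand = causal_model.identify_effect()  # industry ends up as the back-door adjustment set
estimate = causal_model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)

DoWhy adjusts for industry and leaves esg_rating out, which is exactly the discipline the kitchen-sink regression lacks.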
3. Treating Latent Constructs as Observables
Sustainability often shows up through latent variables—like trust, resilience, adaptability, public reputation.
You can’t measure these directly. But you can model them with structural equation modeling (SEM), Bayesian nets, or proxy variable techniques.
What to do instead: acknowledge what’s hidden—and find smart ways to estimate it, rather than ignoring it.
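For example, a latent “reputation” construct could be written in semopy’s lavaan-style syntax; treat this as a sketch, with hypothetical proxy columns:

import semopy
# "reputation" is never observed directly: it is estimated from proxies,
# then enters the structural equation for volatility
spec = """
reputation =~ media_sentiment + employee_review_score + customer_nps
stock_volatility ~ reputation + carbon_intensity
"""
sem_model = semopy.Model(spec)
sem_model.fit(df)
print(sem_model.inspect())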
4. Ignoring Time Lags and Feedbacks
Many sustainability effects are slow-burning. A risk builds over quarters or years—then suddenly erupts.
If your model only looks at contemporaneous correlations, you’ll miss the slow fuse.
What to do instead: use lagged variables, recursive models, or even system dynamics to represent delays and loops.
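In practice that can start as simply as building explicit lags, assuming a panel with one row per company-quarter (column names hypothetical):

# Sort the panel, then lag each slow-burning variable by a chosen horizon
df = df.sort_values(["company", "quarter"])
df["carbon_intensity_lag4"] = df.groupby("company")["carbon_intensity"].shift(4)  # one year back
df["controversy_score_lag8"] = df.groupby("company")["controversy_score"].shift(8)  # two years back
# The lagged columns then enter the model alongside contemporaneous values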
Building Smarter Pipelines: What Actually Helps
So what makes a model capable of using “difficult” data like sustainability metrics well?
There are three key principles.
1. Structure Before Estimation
Before you run anything, ask:
What’s the theory?
What causes what?
What are the feedback loops?
Sketch a causal DAG. Decide on your adjustment set. Then model.
Structure filters confusion before it hits your estimators.
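Writing the DAG down does not need special tooling. Here is a toy sketch in networkx, where every edge is an assumption, not a fact; the back-door check at the end needs a recent NetworkX release:

import networkx as nx
# Assumed structure: industry and regulation drive both carbon intensity and volatility
dag = nx.DiGraph([
    ("industry", "carbon_intensity"),
    ("industry", "stock_volatility"),
    ("regulation", "carbon_intensity"),
    ("regulation", "stock_volatility"),
    ("carbon_intensity", "stock_volatility"),
])
# Back-door check: remove the treatment's outgoing edge, then test whether
# {industry, regulation} d-separates treatment from outcome
backdoor = dag.copy()
backdoor.remove_edge("carbon_intensity", "stock_volatility")
print(nx.is_d_separator(backdoor, {"carbon_intensity"}, {"stock_volatility"},
                        {"industry", "regulation"}))  # True -> valid adjustment set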
2. Fit Models to Mechanisms, Not Just Metrics
Ask yourself: what’s the mechanism behind this variable?
If board diversity affects retention, which affects innovation, which affects valuation, then don’t jam it into a univariate regression. Model the chain.
Use:
Chain-based regression
SEM or path models
Counterfactual frameworks
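As a crude stand-in for a full path model, the chain-based option can be sketched as one regression per link (column names hypothetical):

import statsmodels.formula.api as smf
# Estimate each assumed link of the mechanism separately
link_retention = smf.ols("retention ~ board_diversity", data=df).fit()
link_innovation = smf.ols("innovation_index ~ retention", data=df).fit()
link_valuation = smf.ols("valuation ~ innovation_index", data=df).fit()
for link in (link_retention, link_innovation, link_valuation):
    print(link.params)

Even this naive version tells you which link is weak, which a single board-diversity-to-valuation regression never can.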
3. Use the Right Tool for the Data Type
Here’s a cheat sheet:
Categorical sustainability ratings
→ Traditional: One-hot encoding
→ Smarter alternative: Causal graphs + d-separation logic
Event-driven shocks (e.g. climate regulations, scandals)
→ Traditional: Dummy variables
→ Smarter alternative: Event studies + regime-switch models
Qualitative disclosures (e.g. reports, statements)
→ Traditional: Often ignored
→ Smarter alternative: NLP-based embeddings + topic modeling
Feedback-heavy systems (e.g. climate risk, social instability)
→ Traditional: OLS regression
→ Smarter alternative: System dynamics or agent-based models
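To make one row concrete: the NLP-plus-topic-modeling option can start as small as this sketch, assuming df has a hypothetical free-text disclosure_text column:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# Turn raw disclosure text into topic mixtures that can sit next to numeric features
counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(df["disclosure_text"])
topic_mix = LatentDirichletAllocation(n_components=10, random_state=0).fit_transform(counts)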
Don’t blame the data. Use the right lens to see it clearly.
Final Thought: Dumb Model, Dumb Questions
At the end of the day, your model determines what questions you can even ask.
A dumb model asks:
“Does carbon intensity affect returns?” (flat yes/no)
A smart model asks:
“Under what conditions does carbon risk matter most?”
“Which mechanisms translate emissions into volatility?”
“How robust are these relationships under different policy or consumer sentiment scenarios?”
It’s the same data.
But the questions are better—because the model is better.
And that, ultimately, is the point:
The data isn’t dumb.
But if your model ignores structure, theory, and time—it just might be.