How to Tell If a Policy Actually Worked: A Step-by-Step Causal Inference Example in Python
Going beyond correlation—using DoWhy to evaluate the real impact of a sustainability intervention.
Let’s start with a confession.
When I left physics to enter the world of sustainable finance, I expected to trade elegance for complexity. I thought I’d be stepping out of a world governed by symmetry and conservation laws into one full of noise, uncertainty, and politics. And I was right—sort of.
But what I didn’t expect was this: how little causal reasoning actually shows up in sustainability work.
People track metrics. People build dashboards. People speak in KPIs and frameworks and heatmaps. But when you ask a simple question—“Did this actually work?”—you get a shrug. Or worse: a beautifully styled PowerPoint that answers a different question entirely.
This is a technical post. But it’s also a deeply personal one. Because I’ve seen, again and again, how much effort goes into sustainability strategies that have no feedback loops. How easy it is to conflate correlation with impact. And how dangerous that is when the stakes are planetary.
So today, I want to show you—step by step—how to use Python to test whether a policy actually had a causal effect. I’ll walk you through a real-world-inspired example using the DoWhy library. It’s part tutorial, part therapy—for anyone who’s ever asked “Are we just pretending this works?”
Setting the Scene: Water Use and Wishful Thinking
Let’s say you’re working with a government agency that introduced a national water conservation policy in 2019. They claim it reduced industrial water use significantly.
You’re skeptical. Not because you’re cynical, but because you’ve been burned before. You've seen charts with celebratory dips that turned out to be weather-related. You’ve seen “impact evaluations” that didn’t control for GDP fluctuations. You’ve seen the lipstick, and you want the pig.
Here’s what you have:
Monthly data from 2016 to 2023
A water_use column, measured in cubic meters
A binary policy column: 0 before 2019, 1 after
Three controls: temperature, gdp, and rainfall
Your question:
Did the policy cause a reduction in water use, controlling for other factors?
We’re not just looking for a drop. We want to know if the drop was caused by the policy, not by externalities or luck.
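If you’d like to follow along without the agency’s data, here’s a minimal sketch that fabricates a stand-in dataset with the same columns. Everything in it is invented for illustration: the coefficients, the noise levels, and a built-in “true” policy effect of –72.5 (chosen to match the number we’ll recover later). You’ll also need DoWhy installed (pip install dowhy), plus pandas, seaborn, and matplotlib.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2016-01-01", "2023-12-01", freq="MS")  # monthly observations
n = len(dates)

# Confounders: seasonal temperature, a slow GDP trend, noisy rainfall
temperature = 15 + 10 * np.sin(2 * np.pi * dates.month / 12) + rng.normal(0, 2, n)
gdp = np.linspace(100, 130, n) + rng.normal(0, 3, n)
rainfall = rng.gamma(shape=2.0, scale=30.0, size=n)

# The policy switches on in January 2019
policy = (dates >= "2019-01-01").astype(int)

# Outcome: driven by the confounders plus a true policy effect of -72.5
water_use = (5000 + 20 * temperature + 8 * gdp - 2 * rainfall
             - 72.5 * policy + rng.normal(0, 40, n))

pd.DataFrame({
    "date": dates, "water_use": water_use, "policy": policy,
    "temperature": temperature, "gdp": gdp, "rainfall": rainfall,
}).to_csv("water_policy_data.csv", index=False)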
Step 1: Load and Visualize the Data
We start simple. Plot water usage over time:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the monthly data, parsing the date column as datetimes
df = pd.read_csv("water_policy_data.csv", parse_dates=["date"])

# Color by policy status so the before/after split is visible
sns.lineplot(data=df, x="date", y="water_use", hue="policy")
plt.title("Water Use Over Time (Before and After Policy)")
plt.show()
At first glance, things look promising. Water use drops off after 2019. But… is that because of the policy?
Any physicist—or really, any good analyst—knows that patterns are cheap. What we care about is explanation.
Step 2: Build a Causal Graph
Enter DoWhy, a causal inference library that helps formalize your intuition into something testable.
from dowhy import CausalModel

# policy is the treatment, water_use the outcome; the three controls
# are declared as common causes (confounders) of both
model = CausalModel(
    data=df,
    treatment="policy",
    outcome="water_use",
    common_causes=["temperature", "gdp", "rainfall"],
)

# Render the implied causal graph (needs pygraphviz or similar)
model.view_model()
This gives you a causal DAG, which you can visualize using pygraphviz. It roughly looks like this:
temperature ─┐
gdp ─────────┼──→ water_use ←── policy
rainfall ────┘
(in the full DAG DoWhy draws, each of the three also gets an arrow into policy; that’s what makes them confounders rather than mere covariates)
This diagram tells us: yes, water use is influenced by policy—but also by the economy, climate, and seasonality. These are the confounders we’ll adjust for.
This may sound obvious. But I’ve reviewed corporate sustainability reports that skipped this step entirely—jumping straight from “we implemented a change” to “look at the great results!”
Don’t be that report.
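A side note while we’re here: common_causes is the quick route, and DoWhy builds the graph for you. If you’d rather write the arrows down yourself, CausalModel also accepts an explicit graph. Here’s a sketch, assuming a DoWhy install that can parse DOT strings (it leans on pygraphviz); it encodes the same structure as the common_causes shortcut above:

causal_graph = """
digraph {
    temperature -> policy; temperature -> water_use;
    gdp -> policy; gdp -> water_use;
    rainfall -> policy; rainfall -> water_use;
    policy -> water_use;
}
"""

model = CausalModel(
    data=df,
    treatment="policy",
    outcome="water_use",
    graph=causal_graph,
)

The payoff comes when your domain knowledge says an arrow shouldn’t exist: delete it from the string, and the estimand DoWhy derives will change accordingly.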
Step 3: Identify the Estimand
This is where things get elegant. We’re asking: What statistical expression corresponds to the causal effect we want to estimate?
identified_estimand = model.identify_effect()
print(identified_estimand)
DoWhy replies with something like:
“Backdoor criterion satisfied. Estimand expression: E[Y|do(policy)] = E[Y|policy, X]”
Translation: If we adjust for the known confounders (temperature, gdp, rainfall), we can estimate the effect of the policy as if it were randomized.
In other words: physics logic applied to social messiness. Feels like home.
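For the formula-minded, here is the backdoor adjustment spelled out in the same notation, with Y standing for water_use and X for the confounder set:

E[Y | do(policy = p)] = Σ_x E[Y | policy = p, X = x] · P(X = x)

The causal effect is this quantity at policy = 1 minus the same quantity at policy = 0: compare treated and untreated months within each stratum of temperature, GDP, and rainfall, then average the contrasts over how common each stratum is.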
Step 4: Estimate the Effect
Now we use linear regression to actually compute the effect:
estimate = model.estimate_effect(
identified_estimand,
method_name="backdoor.linear_regression"
)
print(f"Causal effect: {estimate.value}")
Let’s say the output is –72.5. That means the policy caused a reduction of 72.5 cubic meters of water use per month, on average.
Not just correlated. Not just observed. Causally driven.
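One caveat before celebrating: a point estimate alone can’t tell you whether –72.5 is signal or noise. DoWhy can attach a significance test and confidence intervals at estimation time; here’s a sketch, with argument names as in recent DoWhy releases (check your installed version):

estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    test_significance=True,
    confidence_intervals=True,
)
print(f"Causal effect: {estimate.value:.1f}")
print(f"95% CI: {estimate.get_confidence_intervals()}")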
You could stop here. But let’s not.
Step 5: Try to Break It
The best part of DoWhy is that it encourages you to stress-test your own conclusions. As a physicist, I respect this deeply. Truth doesn’t fear scrutiny.
Let’s run a placebo test:
refutation = model.refute_estimate(
identified_estimand,
estimate,
method_name="placebo_treatment_refuter"
)
print(refutation)
Here, the refuter swaps the real policy variable for random noise and re-runs the estimate. If that placebo “effect” comes out just as large, your original result was probably spurious. If the placebo effect collapses toward zero while yours holds, you’ve got something solid.
You can also try, as sketched below:
Adding a random confounder
Subsetting the data
Testing with unobserved biases
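Here’s a sketch of the first two, using DoWhy’s built-in refuter names (subset_fraction is an arbitrary choice for illustration):

# A random common cause should leave the estimate essentially unchanged
ref_random = model.refute_estimate(
    identified_estimand, estimate,
    method_name="random_common_cause",
)
print(ref_random)

# Re-estimating on random subsets of the data probes stability
ref_subset = model.refute_estimate(
    identified_estimand, estimate,
    method_name="data_subset_refuter",
    subset_fraction=0.8,
)
print(ref_subset)

The third has a refuter too, add_unobserved_common_cause, which takes extra arguments describing how strongly the hypothetical hidden confounder acts on treatment and outcome.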
We do this kind of thing all the time at Wangari. Not because we distrust the data—but because we respect it.
Step 6: Sanity Check with Propensity Score Matching
Want a second opinion? Use another estimator:
estimate_psm = model.estimate_effect(
identified_estimand,
method_name="backdoor.propensity_score_matching"
)
print(f"PSM Estimate: {estimate_psm.value}")
If both estimators give similar results, your findings are robust. If they don’t… go back and check assumptions.
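“Similar” deserves a number. A crude agreement check, where the 10% tolerance is an arbitrary threshold for this example, not a standard:

# Relative disagreement between the two estimators
rel_gap = abs(estimate.value - estimate_psm.value) / abs(estimate.value)
print(f"Relative gap between estimators: {rel_gap:.1%}")
if rel_gap > 0.10:  # arbitrary threshold, for illustration only
    print("Estimators disagree; revisit the assumptions.")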
That’s science. That’s integrity.
The Bottom Line: From Storytelling to Signal
If there’s one thing I’ve learned in this strange journey from particle physics to sustainable finance, it’s this:
Most sustainability metrics are designed to look good—not to explain what actually happened.
And the cost of that—economically, environmentally, morally—is staggering.
We spend billions on ESG initiatives with no clear sense of whether they work.
We build dashboards instead of feedback loops.
We celebrate correlations without checking causes.
But here’s the good news: we can do better. Tools like DoWhy make it easier than ever to bring rigorous, testable thinking into ESG work. You don’t need perfect data. You just need better questions.
This is what we’re building at Wangari: models that don’t just report but reveal. Tools that don’t just track, but test. Frameworks that invite humility, not just performance.
We’re not here to criticize sustainability. We’re here to make it smarter.
Because knowing what works isn’t just a luxury—it’s the only way forward.
If you're experimenting with similar methods—or want to—you’re not alone. We're figuring this out too. And I'd love to hear how you're making the invisible visible in your own work.