Missing Data? Use Explainable AI to Fill the Gaps (Correctly)
Random Forests, XGBoost, and Neural Networks for interpolation without the black box problem
I used to work as a scientist in theoretical particle physics. It was super data-heavy, but I generated most of that data myself with Monte Carlo simulations. If a data point was missing, I just generated a new one.
These days, I work with real-world data in finance. I process tons of data points that companies produce and figure out what they mean for a company’s ability to make money.
You might be surprised to hear this, but even bluechip companies can be incredibly sloppy about reporting consistent data. This complicates my job—and I know that for many other data professionals out there the situation is even worse.
If you’ve ever built a model from imperfect data and found that it only returned garbage, then you know exactly what I mean. Garbage-in-garbage-out is true. But incomplete-in-garbage-out is true, too. Sadly.
Many data scientists respond to this challenge with simple tools. Techniques like mean imputation and forward fill are easy to implement. But these techniques are just band-aids, not a cure. In many cases, they improve models but still lead to distorted results.
Machine learning techniques often do a better job at interpolating missing data. However, accuracy is not the only important criterion here. You’re essentially inventing new data, so you want to make sure that it is as transparent and as reproducible as possible.
Many machine learning techniques are black boxes. They’re not transparent at all, and often they’re not even that reproducible either.
There are techniques, however, that help you get around this. We’ll look into various classic ML techniques for data interpolation, and we’ll cover how you can sprinkle SHAP and LIME on top of them for explainability.
Why Non-ML Approaches Fail
Traditional approaches are easy to implement and to explain. Nevertheless, they are often not enough to interpolate missing data points. Below are some of the most common ones and their drawbacks.
Mean/Median Imputation
In this scenario, missing values are replaced with the mean or median of the available data. This is okay for datasets in which there are not many missing values—but it’s not that great otherwise because it flattens variability, introduces unrealistic values, and fails for time series.
For example, if I study three companies, of which one emits 50,000 tonnes of carbon per year and another emits 2,000, I cannot just assume that the third would emit the average, i.e., 26,000 tonnes. For all we know, the third company might be emitting a million tonnes and the dataset would still be perfectly fine!
In addition, time-series data is rarely stationary, and assuming that it is leads to nonsense. If company A emitted 60,000 tonnes of carbon last year and 50,000 this year, does that mean it will emit 55,000 tonnes next year? Why not 40,000?
In other words, mean and median imputation sound like a good idea in theory, but in practice they have a very limited scope of application.
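The variance-flattening problem is easy to see in a few lines of pandas. This is a minimal sketch with made-up emissions figures (the company numbers are hypothetical, echoing the example above):

```python
import numpy as np
import pandas as pd

# Hypothetical annual carbon emissions (tonnes) for five companies;
# two values are missing.
emissions = pd.Series([50_000.0, 2_000.0, np.nan, 120_000.0, np.nan])

# Mean imputation: every gap gets the same value.
imputed = emissions.fillna(emissions.mean())

# The filled series has a smaller standard deviation than the observed
# values alone -- variability gets flattened by construction.
print(imputed.std() < emissions.std())  # True
```

Every imputed point sits exactly at the mean, so the spread of the data can only shrink, never grow, no matter what the true missing values were.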
Linear & Polynomial Interpolation
Similar things can be said about linear and polynomial interpolation. Here, we try to bridge the gap between two points with a linear or polynomial function.
The problem with this is that it assumes smooth, predictable trends. While it can be better than mean imputation, it still isn’t that great. In the real world, data is often jumpy and volatile—no amount of polynomials will reflect this accurately.
In addition, polynomials often overfit with small datasets. They also can be highly divergent; extrapolate too far out and you’ll get completely nonsensical results.
Moreover, neither of these methods is particularly explainable, and neither provides uncertainty metrics or confidence intervals.
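To make the smoothness assumption concrete, here is a small sketch using pandas’ built-in interpolation on a hypothetical revenue series with one missing year:

```python
import numpy as np
import pandas as pd

# Hypothetical yearly revenue (in millions); 2021 is missing.
revenue = pd.Series([100.0, np.nan, 140.0], index=[2020, 2021, 2022])

# Linear interpolation draws a straight line between the neighbours.
filled = revenue.interpolate(method="linear")
print(filled[2021])  # 120.0
```

The fill is exactly the midpoint, which is only right if the business grew at a perfectly steady pace. If 2021 was actually a crash year, the straight line quietly erases that volatility.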
Forward Fill & Backward Fill
Another approach is filling missing values with the last known value (forward fill) or the next known value (backward fill). This fails for similar reasons as mean/median imputation: variability is flattened, and volatility is completely ignored.
Worse, this technique assumes that there’s no change in values over time. This is wrong in many cases. For example, copying forward Apple’s 2020 iPhone sales for 2021 would be wildly inaccurate.
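A two-line sketch shows the frozen-in-time assumption directly (the sales figures are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical unit sales per year; 2021 is missing.
sales = pd.Series([100.0, np.nan, 90.0], index=[2020, 2021, 2022])

print(sales.ffill()[2021])  # 100.0 -- assumes nothing changed after 2020
print(sales.bfill()[2021])  # 90.0  -- assumes the drop happened a year early
```

Neither fill has any justification beyond adjacency: both simply copy a neighbour and pretend the intervening year held still.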
Regression-Based Imputation
Finally, the most sophisticated fix on this list predicts missing values with a basic regression model fitted on the other available variables.
Even this approach is doomed in many cases though. It assumes linear relationships between variables, when the reality is usually a lot more complex.
In addition, this imputation does not capture causality. Just because some supporting variable exhibits a linear correlation doesn’t imply that there’s cause and effect at play.
Like with polynomial interpolations, this technique is also prone to overfitting on small datasets.
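As a minimal sketch of regression-based imputation (all numbers hypothetical), here we predict a company’s missing emissions from its revenue with scikit-learn’s `LinearRegression`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: revenue (billions) and emissions (tonnes);
# one emissions value is missing.
df = pd.DataFrame({
    "revenue": [10.0, 20.0, 30.0, 40.0],
    "emissions": [1_000.0, 2_100.0, np.nan, 3_900.0],
})

# Fit on the complete rows only.
known = df.dropna()
model = LinearRegression().fit(known[["revenue"]], known["emissions"])

# Fill the gap with the model's prediction.
mask = df["emissions"].isna()
df.loc[mask, "emissions"] = model.predict(df.loc[mask, ["revenue"]])
```

The filled value sits exactly on the fitted line. That bakes in two assumptions the text warns about: that the relationship is linear, and that the correlation is stable enough to act like a causal link. With four data points, the fit is also trivially easy to overfit.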
The big problem with all these techniques is that they only work in fairly simple use cases. Machine learning offers better ways to interpolate missing data—but some adjustments need to be made to make it transparent and explainable.