Building an AI-Powered Investment Machine
Identifying winners and losers using current and future techniques

The financial world has come a long way since its depiction in The Wolf of Wall Street. Far from relying on low-tech and less-than-ethical practices for making money, workers in finance are realizing that in today’s world you get ahead by being smart and ethical.
Even if people try to be unethical, they can’t get very far these days. Sustainability-related measures are baked into financial regulation and into many processes within financial institutions and the corporates they bankroll.
Due to regulatory and public pressure, corporates are increasingly disclosing their sustainability-related performance alongside their financial metrics. This has resulted in a mountain of corporate sustainability data that is waiting to be exploited.
The problem is that this data is scattered all over the place and even today is not fully standardized. Historical data—which one also needs for investment analyses—is often hardly standardized at all. It varies not only from sector to sector and company to company but even from year to year within that same company in terms of its methodology, structure and coverage.
This means that data gathering and cleaning is a necessary but superhuman task.
The obvious solution to this is AI. Except, of course, that building AI capabilities at scale is far from obvious. My team is busy implementing such pipelines, and it’s a lot of work. Our system is called esgGPT because it’s a cool word and we fancied it, but it can process financial and other data, too, alongside ESG or sustainability data.
For the sake of transparency and in the hopes that this might empower others to build their own capabilities, we’ll detail below where we’re at. We’ll talk about gathering and cleaning financial and non-financial data using AI. We’ll also talk more about where the magic happens, which is in our algorithms: With the help of AI, they uncover the connections between sustainability and finance, and ultimately calculate what this means for the value of a company.
Using AI to Gather and Process (Non-)Financial Data
Before we proceed to analyzing any data, we need to find, extract, clean, and organize it. Traditional ESG analysis is rarely even this data-driven because the task would be so tedious manually. Instead, it usually relies on processing a few recent short reports and then calling it a day.
Doing things the traditional way is a mistake because one misses out on crucial insights. For example, we would never have found out how women in management drive profits in male-dominated sectors if we’d contented ourselves with reading a few recent reports and relying on a few anecdotes.
Instead we rigorously combed through the data we gathered and made sure we kept only statistically significant and robust results. As a result, we have produced insights that can help companies make many millions more than they would have otherwise.
Doing this manually is mind-numbing work, though. It will keep you up until midnight and beyond; I’m speaking from experience. Working so close to the data pays off, though, because it makes sustainability data a usable input for financial valuation models.
The first challenge is obtaining the data, extracting it, and cleaning it. This applies both to financial and non-financial (i.e. mostly sustainability-related) data, but typically is much more challenging for the latter. Nevertheless we’ll be covering both below, because we’ll need both later on.
Finding Public Data
To find public data, we scrape corporate websites and those of official regulatory bodies like the SEC. Why would you need AI for this, you may ask?
Well, even financial reporting documents are not always as organized as you’d like them to be. Usually they’d be called “Annual Report,” but sometimes they’ll be called something different. With sustainability, there’s no agreed-upon nomenclature at all. It gets worse when you’re searching on corporate websites, as opposed to sites of regulatory bodies.
Hard-coding all naming variants for documents and all website URLs to find these documents is close to impossible. We are therefore building Natural Language Processing (NLP) capabilities to automatically find the right documents. Looking ahead, we also plan to monitor sites in order to retrieve new documents as soon as they become available.
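To make the naming problem concrete, here is a minimal keyword baseline in Python. It is not the esgGPT implementation and it is not an NLP model: the pattern lists and the `classify_link` helper are hypothetical, chosen only to show how quickly hard-coded rules run out of road.

```python
import re

# Hypothetical keyword patterns. A real system would learn these from
# labeled examples instead of hard-coding them.
REPORT_PATTERNS = {
    "financial": re.compile(
        r"annual report|form 10-k|financial statements?|registration document",
        re.IGNORECASE,
    ),
    "sustainability": re.compile(
        r"sustainab|\besg\b|non-financial|corporate responsibility|\bcsr\b|climate",
        re.IGNORECASE,
    ),
}


def classify_link(link_text):
    """Return the first report type whose pattern matches the link text, else None."""
    for report_type, pattern in REPORT_PATTERNS.items():
        if pattern.search(link_text):
            return report_type
    return None


links = [
    "Annual Report 2023 (PDF)",
    "Our ESG Databook",
    "Planet & People Review",  # a sustainability report with an unconventional title
    "Careers",
]
for text in links:
    print(text, "->", classify_link(text))
```

The third link is the interesting one: a human reader would likely recognize it as a sustainability report, but no keyword matches, so the rule-based approach misses it. Cases like this are the motivation for a learned classifier.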
Locating Critical Information
Once one has found the right documents, the work has hardly begun. These statements are often 400+ pages long—frankly, nobody has the time to read through all of that.
As far as financial statements go, one can get pretty far with the three fundamental statements, that is, the balance sheet, income statement, and cash flow statement. Sometimes more detailed information might be necessary, but to a first approximation this is enough.
For sustainability data, it is a bit more complex. Generally speaking, there is textual and numerical data. We give priority to numerical data because it allows us to run calculations. That being said, we are also planning to process textual data.
At this moment in time, we collect all the sustainability data we can get about a company, as long as we can be confident at first sight that it is reliable. (If it turns out that it is only partially reliable, further adjustments need to be made during cleaning.)
To locate the right sections within documents, we use text summarization models and table recognition AIs.
Data Extraction
Once we have identified the data we are interested in within the documents we have obtained, we need to extract it. Often, these documents come as PDFs. We want just a few hundred of the most important datapoints per fiscal year, stored in a comma-separated values (CSV) file.
This means that we need Optical Character Recognition (OCR) techniques alongside NLP to extract numbers, labels, and units from reports. Table recognition AIs are used again to keep the structure of tabular data intact. We use NLP and entity recognition AI to understand different data labels. For example, “CO2 Scope 1” is the same as “Scope 1 Carbon Emissions” and should therefore be entered in the same row of data.
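The label-matching idea can be illustrated with a small rule-based stand-in. This is a sketch under stated assumptions, not the entity recognition model described above: the `SYNONYMS` and `FILLER` tables and the `label_key` helper are hypothetical and would need to be far larger in practice.

```python
import re

# Hypothetical lookup tables for illustration only.
SYNONYMS = {"co2": "carbon"}            # map variant tokens to one canonical token
FILLER = {"emissions", "emission"}      # tokens that carry no distinguishing meaning here


def label_key(label):
    """Reduce a label to an order-independent set of canonical tokens."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    tokens = [SYNONYMS.get(token, token) for token in tokens]
    return frozenset(token for token in tokens if token not in FILLER)


def group_by_label(rows):
    """Group (label, value) pairs whose labels reduce to the same key."""
    grouped = {}
    for label, value in rows:
        grouped.setdefault(label_key(label), []).append((label, value))
    return grouped


rows = [
    ("CO2 Scope 1", 1200.0),
    ("Scope 1 Carbon Emissions", 1180.0),
    ("Scope 2 Carbon Emissions", 950.0),
]
for key, members in group_by_label(rows).items():
    print(sorted(key), members)
```

With these tables, the first two labels land in the same group while the Scope 2 label stays separate. A learned model generalizes to labels the tables have never seen, which is what a rule-based version cannot do.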
Giving this data to an LLM directly also lets us query that LLM for more targeted analyses, for which a small number of datapoints is sufficient. We’re still working on making such an LLM virtually error-free, and on allowing it to output more data with a single query in order to save energy and computing costs.
Data Cleaning
No matter how well you extract your data, there will almost always be some formatting issues and faults. Data cleaning is easily the most time-consuming step of this whole process. Doing this manually for 15 years’ worth of data from a single company can easily take a couple of working days.
We use anomaly detection to flag and fix errors. We use fixed standards for converting varying units, currencies, and formats into a single schema. Predictive AI will be used to fill missing values using historical data and peer benchmarks obtained from previous extractions, but as we write this piece we have not implemented it yet.
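As an illustration of these two cleaning steps, the sketch below converts units to a single target and flags outliers with a modified z-score based on the median absolute deviation. It is a simple statistical stand-in, not the anomaly detection used in esgGPT; the unit table, the function names, and the sample values are assumptions made for the example. It only flags suspicious values and does not attempt to fix them.

```python
import statistics

# Conversion factors to tonnes, using standard metric prefixes.
TO_TONNES = {"kg": 0.001, "t": 1.0, "kt": 1_000.0, "Mt": 1_000_000.0}


def to_tonnes(value, unit):
    """Convert a mass value in a known unit to tonnes."""
    return value * TO_TONNES[unit]


def flag_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    which a single extreme value cannot distort much.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [False] * len(values)  # all values (nearly) identical
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]


# Made-up emissions series in tonnes; the fourth value looks like a
# figure that was reported in kilograms but labeled as tonnes.
series = [1200, 1250, 1180, 1220000, 1210]
print(flag_anomalies(series))
```

The median-based score is chosen over a mean-based one because a single unit error of this size would inflate the mean and standard deviation enough to hide itself.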
With historical data, adjustments sometimes need to be made, due to differences in methodology or due to faulty claims that might be labeled as greenwashing. At this point in time, we omit such unreliable datapoints—but we plan to implement more sophisticated AI-powered adjustments to make some of them usable.
Data Merging
An individual financial or sustainability-related statement typically contains only two or three years’ worth of data. We often need over a decade’s worth.
We therefore need to aggregate the data that we extracted from several statements from different years. In an ideal case, merging these datasets would be a matter of a couple of lines of code. In practice, differences in labels, number formats, and so on can still exist even after cleaning. We therefore use AI and fuzzy logic to allow for such deviations and harmonize them across the merged dataset.
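A minimal sketch of such a tolerant merge, using plain fuzzy string matching from the Python standard library rather than the AI and fuzzy logic mentioned above. The data layout (`{label: {year: value}}`), the `merge_statements` helper, the cutoff, and the figures are all assumptions for illustration.

```python
from difflib import get_close_matches


def merge_statements(base, new, cutoff=0.8):
    """Merge two {label: {year: value}} dicts.

    A label in `new` that is nearly identical to an existing label is
    folded into that label; otherwise it is added as a new row.
    """
    merged = {label: dict(years) for label, years in base.items()}
    for label, years in new.items():
        match = get_close_matches(label, list(merged), n=1, cutoff=cutoff)
        target = match[0] if match else label
        merged.setdefault(target, {}).update(years)
    return merged


# Made-up values from two overlapping statements.
report_2022 = {"Total revenue": {2021: 100.0, 2022: 110.0}}
report_2023 = {
    "Total revenues": {2022: 110.0, 2023: 125.0},  # same metric, slightly different label
    "Net income": {2023: 12.0},
}
print(merge_statements(report_2022, report_2023))
```

One caveat on this design: character-level similarity would also happily merge "Scope 1 emissions" with "Scope 2 emissions", since the two labels differ by a single character. Any real pipeline needs a guard for digits and other small but meaningful differences.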
AI-Powered Investment Models
Once the data is structured and cleaned, AI can help generate investment insights that arise between financial and sustainability-related datapoints.
Simple linear correlation studies can often be carried out without AI; for more complex relationships, however, AI becomes quite necessary. It also helps pin down causal relationships (e.g., between the presence of women in management and a company’s profit).
The goal is to automate the identification of long-term winners and losers based on financial and sustainability-related factors. That being said, one does not need to wait decades: investment returns can show up pretty early. Returns from trading on weather patterns can show within six months, and social factors often peak between three and five years.
Correlations and Causations
Going deeper than linear relationships, time-series machine learning can help find lagged effects of ESG on stock returns, credit risk, and profitability. Causal inference libraries such as DoWhy or EconML help estimate whether sustainability-related factors cause improved financial performance.
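The simplest version of a lagged-effect search is a correlation computed at several time shifts. The sketch below does this in plain Python on synthetic data. It is a baseline for intuition only, not the time-series machine learning or the causal inference described above, and a high lagged correlation by itself says nothing about causation. The function names and the data are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def lagged_correlations(driver, outcome, max_lag):
    """Correlate driver[t] with outcome[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]
        y = outcome[lag:]
        result[lag] = pearson(x, y)
    return result


# Synthetic yearly series: the outcome repeats the driver two periods later.
driver = [1, 3, 2, 5, 4, 6, 8, 7]
outcome = [0, 0, 1, 3, 2, 5, 4, 6]

correlations = lagged_correlations(driver, outcome, max_lag=3)
best_lag = max(correlations, key=correlations.get)
print(best_lag, correlations)
```

On this toy data the strongest correlation appears at a lag of two periods, because the outcome was constructed that way. With real data, each extra lag shortens the overlapping sample, so long lags on short histories deserve extra skepticism.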
Ultimately, our esgGPT system should be able to predict risk-adjusted returns more accurately than traditional models because it takes sustainability-related factors into account more rigorously. This is still some way off, though, because we’re still working on aggregating enough data to train it.
AI for Corporate Valuations
Traditional valuation models like Discounted Cash Flow (DCF) and multiples fail to fully account for the impact of corporate sustainability. DCF models can be adjusted by using the most important correlation coefficients between variables that go directly into the model (such as profits or the cost of debt) and variables that influence them (such as women in management or carbon emissions).
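To show where such an adjustment would enter, here is a textbook DCF with a Gordon-growth terminal value, plus a deliberately simple linear adjustment of one input. This is a sketch, not the esgGPT valuation model: the `adjust_input` helper, the sensitivity of -0.005, and all cash flows are hypothetical numbers, and a real sensitivity would have to be estimated from data.

```python
def dcf_value(cash_flows, discount_rate, terminal_growth):
    """Present value of explicit cash flows plus a Gordon-growth terminal value."""
    if discount_rate <= terminal_growth:
        raise ValueError("discount rate must exceed terminal growth")
    explicit = sum(
        cf / (1 + discount_rate) ** t
        for t, cf in enumerate(cash_flows, start=1)
    )
    terminal = cash_flows[-1] * (1 + terminal_growth) / (discount_rate - terminal_growth)
    return explicit + terminal / (1 + discount_rate) ** len(cash_flows)


def adjust_input(base_value, sensitivity, factor_change):
    """Shift a model input linearly: base + sensitivity * change in the driver."""
    return base_value + sensitivity * factor_change


cash_flows = [100.0, 100.0, 100.0]  # made-up forecast
base_rate = 0.10

# Hypothetical assumption: one unit of improvement in a sustainability
# factor lowers the discount rate by half a percentage point.
adjusted_rate = adjust_input(base_rate, sensitivity=-0.005, factor_change=1.0)

print(dcf_value(cash_flows, base_rate, terminal_growth=0.0))
print(dcf_value(cash_flows, adjusted_rate, terminal_growth=0.0))
```

With flat cash flows and zero growth, the base case reduces to a perpetuity, so the value is the cash flow divided by the rate. That makes it easy to check by hand and to see how sensitive the valuation is to even a small shift in the discount rate.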
By taking the correlations and the historical trajectories of different variables into account, AI can help build a more differentiated view of how much a company is worth. In addition to this, complex scenario analyses are easier to conduct with AI algorithms (I did this in my PhD, albeit for dark matter physics and not for corporate valuations). Risk models can become more sophisticated and more robust, too.
Building Winning Portfolios
This part is still further into the future for us, but one could further use AI to dynamically adjust portfolio weightings based on ESG risk-reward tradeoffs. Using textual sustainability data and real-time sentiment analysis from news data might help inform a winning strategy and generate more alpha.
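As a toy illustration of what adjusting weightings could mean, the sketch below tilts a set of base weights toward holdings with higher scores and renormalizes. It is not a strategy from this article and not investment advice: the `tilt_weights` helper, the `strength` parameter, and the scores are hypothetical, and the sketch ignores risk, turnover, and transaction costs entirely.

```python
def tilt_weights(base_weights, scores, strength=0.5):
    """Tilt base weights toward higher-scoring holdings and renormalize to sum to 1."""
    mean_score = sum(scores.values()) / len(scores)
    raw = {
        name: weight * max(0.0, 1.0 + strength * (scores[name] - mean_score))
        for name, weight in base_weights.items()
    }
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all tilted weights are zero")
    return {name: value / total for name, value in raw.items()}


# Made-up two-asset portfolio with made-up scores.
base = {"A": 0.5, "B": 0.5}
scores = {"A": 1.0, "B": -1.0}
print(tilt_weights(base, scores))
```

Rerunning the function whenever scores change is the simplest form of dynamic adjustment. The `max(0.0, ...)` clamp keeps the sketch long-only, which is a design choice rather than a requirement.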
The Bottom Line: Building the Investment Machine of the Future
When you have a hammer, everything looks like a nail. As a result, I’m wary of using AI everywhere I can just for the sake of it.
That being said, I think it’s abundantly clear that with the type of work that Wangari is doing we’ll be using AI at every step. Our esgGPT is still somewhat in its infancy. It will, however, grow to make many millions of dollars for our clients. What at the moment is highly specialized semi-automated work will become a seamless integration.
Investors and asset managers will get their insights within minutes. This will allow them to generate more alpha than they ever have before. We’re living through a time when we simultaneously have more access to alternative data than ever, and also have the AI tools to process it.
Losing out on this opportunity would be a bad idea. Not only would the best companies not get the capital they deserve; investors and their clients would also be missing out on remarkable returns.
One Smart Question
The correct answer will be revealed in next Tuesday’s edition.
Answer to last week’s One Smart Question: About 65 percent of investors consider sustainability-related factors when making investment decisions.
Wangari’s Curated Reads
If your sustainability work is just compliance, you are cooked, says Matthew in a recent piece. In a fun coincidence, the two of us ended up talking about the same topic from slightly different angles on the same day last week. I was talking about how finance people are encroaching on sustainability-related turf because they view it basically as a compliance exercise. Matthew argued that this stance is outright dangerous, and unnecessary with the upcoming simplification of CSRD. Sustainability professionals ought to get back to delivering business value.
Batteries: how cheap can they get? It’s quite crucial that we get the answer to this right because battery cost is the biggest limiting factor in everything from cars to industrial power plants. Auke writes that batteries might be becoming ridiculously cheap. He has solid data and calculations to back this up. Auke concludes that the future is an open and secure grid around local demand response, almost like the Internet. It’s a compelling stance, and one that deserves to be heard more.
Another author keeps bringing out incredibly interesting pieces that explain our built surroundings. If you ever wondered Why Skyscrapers Became Glass Boxes, the answer is a mixture of cost, investor risk aversion, and tenants prioritizing heavily used features over architectural details like an exterior wall. This is interesting because it mirrors the way we’ve been doing a lot of business in the past decades: Prioritize making our direct clients happy (in this case tenants), and let the community bear the cost of any externalities (in this case, ugly buildings). It’s about time we changed this!