Engineering the Edge
Building tactical features that turn raw data into winning predictions.
What does your model actually see when it looks at your data?
Numbers, yes — but not meaning. Patterns, but not purpose.
A model doesn’t truly “see” until we teach it how.
That act — the translation of intuition into numerical sight — is called feature engineering.
In Soccer Analytics with Machine Learning (O’Reilly, co-authored with Haipeng Gao), we explore this idea through one of the most unpredictable systems imaginable: the game of soccer. A sport where randomness and structure collide, where the ball can bounce anywhere — and yet patterns persist if you know how to look.
Teaching a model to see soccer isn’t all that different from teaching it to understand markets, weather, or credit risk. In every case, the challenge is the same: helping algorithms perceive the world as we do — in context, through time, and with a sense of quality.
Perception begins with preparation
Before a model can perceive, we have to give it something worth perceiving.
Raw data — passes, goals, corners, shots — is like a live broadcast before commentary. There’s movement but no narrative.
Our job as analysts is to create the lens through which the algorithm observes that movement.
In practice, that means cleaning, aligning, and encoding events so the data has structure.
We turn timestamps into sequences.
We group players into teams, matches into contexts.
We convert strings into categories and categories into integers.
The goal is not just to make the data machine-readable, but meaningful.
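Those conversions are one-liners in pandas. A minimal sketch with an invented event frame (the column names are ours, not from a real feed):

```python
import pandas as pd

events = pd.DataFrame({
    "timestamp": ["2023-07-20 18:03", "2023-07-20 17:55", "2023-07-20 18:10"],
    "event_type": ["shot", "pass", "goal"],
})

# Timestamps into sequences: parse, then sort into match order.
events["timestamp"] = pd.to_datetime(events["timestamp"])
events = events.sort_values("timestamp").reset_index(drop=True)

# Strings into categories, and categories into integers.
events["event_type"] = events["event_type"].astype("category")
events["event_code"] = events["event_type"].cat.codes
```

The category dtype keeps the human-readable labels around while giving the model the integer codes it needs.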
A simple preprocessing pipeline in pandas already begins that transformation:
matches['goal_diff'] = matches['home_goals'] - matches['away_goals']
matches['result'] = (matches['goal_diff'] > 0).astype(int)
It’s trivial, but it tells the model what success looks like — something every learning system, human or machine, needs before it can improve.
Temporal vision — learning from momentum
The first way a model learns to “see” is through time.
Raw snapshots can’t capture momentum, fatigue, or rhythm.
That’s where rolling features come in.
In the book, we introduce a feature called form_5 — a five-match rolling average of goal differences.
# Five-match rolling mean of goal difference, computed per team.
matches['form_5'] = (
    matches.groupby('team')['goal_diff']
    .rolling(window=5)
    .mean()
    .reset_index(level=0, drop=True)
)
That single column teaches the model something essential: teams evolve.
It’s no longer predicting outcomes in isolation, but interpreting trajectories.
Short windows detect bursts of form; longer ones represent identity.
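The same pattern extends to any horizon. A quick sketch on a toy frame (the data is invented; the column names mirror the ones above):

```python
import pandas as pd

# Toy match log for one team: goal difference per match, in order played.
matches = pd.DataFrame({
    "team": ["A"] * 8,
    "goal_diff": [1, -1, 2, 0, 3, -2, 1, 1],
})

# Short window catches bursts of form; the longer window smooths toward identity.
for w in (3, 5):
    matches[f"form_{w}"] = (
        matches.groupby("team")["goal_diff"]
        .rolling(window=w)
        .mean()
        .reset_index(level=0, drop=True)
    )
```

Comparing `form_3` against `form_5` for the same team is often more informative than either column alone.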
When you apply this thinking to finance, it becomes moving averages or rolling volatilities — the same attempt to see trend instead of noise.
Temporal vision transforms prediction into pattern recognition. It’s how a machine begins to develop “memory.”
Spatial vision — understanding context
Every event happens somewhere, and “somewhere” changes everything.
A home match is not the same as an away match.
A rainy midweek fixture after three consecutive games carries a different energy than a sun-lit opener.
In data terms, context is as fundamental as content.
We encode it through variables like is_home, rest_days, and opponent_strength.
matches['is_home'] = (matches['venue'] == 'home').astype(int)
# Days since the team's previous match; 7 as a neutral default for the opener.
matches['rest_days'] = (
    matches.groupby('team')['date'].diff().dt.days.fillna(7)
)
A few lines of code, and suddenly the model has spatial and situational awareness.
It can tell whether performance came from comfort or exhaustion.
In our experiments, rest-related features consistently outranked possession, shot volume, and passing accuracy. Fatigue mattered more than flair.
Once we quantified that, the model started predicting matches the way coaches think — probabilistically, through circumstance.
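The short_rest flag that appears in the final feature list can be derived from rest_days. A plausible sketch, noting that the 4-day threshold is our assumption, not necessarily the book's:

```python
import pandas as pd

matches = pd.DataFrame({
    "team": ["A", "A", "A"],
    "date": pd.to_datetime(["2023-07-01", "2023-07-04", "2023-07-12"]),
})
matches["rest_days"] = matches.groupby("team")["date"].diff().dt.days.fillna(7)

# Flag congested schedules; 4 days is an illustrative cutoff.
matches["short_rest"] = (matches["rest_days"] < 4).astype(int)
```

A binary flag like this lets tree-based models split cleanly on "congested vs. rested" instead of learning the threshold from scratch.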
Spatial vision is what lets a model feel the weather of the data.
Qualitative vision — how, not just what
A team that takes 20 bad shots can look stronger in raw stats than one that takes 5 good ones.
But anyone watching knows better.
To teach a model that distinction, we need quality features — representations of how actions occur.
In soccer, that means Expected Goals, or xG: the probability that a shot becomes a goal based on distance and angle.
Even a simple handcrafted version conveys crucial nuance:
import numpy as np

def shot_quality(distance, angle):
    # Toy xG: value decays with distance and with how tight the angle is.
    return np.exp(-0.1 * distance) * np.cos(angle)

shots['xG'] = shot_quality(shots['distance'], shots['angle'])
Aggregating by team gives us a measure of offensive efficiency:
xg_features = shots.groupby(['match_id', 'team'])['xG'].sum().reset_index()
matches = matches.merge(xg_features, on=['match_id', 'team'], how='left')
Now the model sees not just activity, but intent.
It learns that five dangerous shots can outweigh twenty speculative ones — a form of vision that translates seamlessly to finance (expected vs. realized return) or sustainability (potential vs. observed impact).
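With the toy shot_quality model above, the arithmetic bears that out. A quick check (the distances in meters and angles in radians are invented for illustration):

```python
import numpy as np

def shot_quality(distance, angle):
    # Same toy model as in the text: closer, more central shots score higher.
    return np.exp(-0.1 * distance) * np.cos(angle)

# Five dangerous chances: close range, nearly central.
good = 5 * shot_quality(8.0, 0.2)
# Twenty speculative efforts: long range, tight angle.
speculative = 20 * shot_quality(30.0, 1.0)
```

Despite a four-to-one shot deficit, the five good chances carry the larger total expected value.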
Quality features are the model’s way of learning to judge — to distinguish signal from quantity.
Relational vision — who you beat matters
Once we teach the model time, space, and quality, there’s one more dimension left: relationship.
A team’s true strength is defined not just by how often it wins, but by whom it beats.
In the book, we use ranking systems like Elo and PageRank to encode that relational intelligence.
Elo works iteratively, updating ratings after each match: beating a strong team increases your rating more than beating a weak one.
from elote import EloCompetitor

teams = {name: EloCompetitor() for name in matches['team'].unique()}
for _, row in matches.iterrows():
    home, away = teams[row['home_team']], teams[row['away_team']]
    if row['result'] == 1:
        home.beat(away)   # home win
    else:
        away.beat(home)   # draw or away win (collapsed by the binary target)
PageRank goes further, treating victories as directional links in a network:
import networkx as nx

G = nx.DiGraph()
for _, row in matches.iterrows():
    # Losers "point to" winners, so PageRank prestige accumulates at winners.
    G.add_edge(row['loser'], row['winner'])
pagerank = nx.pagerank(G)
These methods don’t just count wins — they value who you won against.
Relational features give the model social intelligence: it learns hierarchy, prestige, and influence.
That’s as useful in soccer as it is in investment networks or supplier risk analysis.
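Whichever library handles the bookkeeping, the Elo update itself is two lines of math. A from-scratch sketch of the standard formulation (the K-factor of 32 is the conventional default, not a value from the book):

```python
def elo_expected(r_a, r_b):
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Return updated ratings; score_a is 1 for a win, 0.5 draw, 0 loss."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# An underdog (1400) beating a favourite (1600) gains far more
# than the 16 points it would earn for beating an equal opponent.
new_a, new_b = elo_update(1400, 1600, 1.0)
```

The update is zero-sum: whatever the winner gains, the loser gives up.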
When the model finally sees
The first time we ran a full feature-engineered model on Women’s World Cup data, we expected traditional metrics — possession, shots, passing — to dominate.
They didn’t.
The top predictors were rest days, xG difference, and Elo rating.
The model wasn’t dazzled by volume; it was grounded in context.
That was the moment we realized it had started to “see.”
It wasn’t mimicking coaches or betting algorithms.
It was interpreting — connecting form, fatigue, and quality into a coherent picture of reality.
In financial terms, that’s the shift from raw tick data to informed signals; in environmental modeling, from raw temperature to climate regime.
Vision in data science is simply structure meeting sense.
Transferable lessons — seeing across domains
Though our examples come from soccer, the principles hold anywhere.
| Type of Vision | Soccer Example | Finance/Other Example |
| --- | --- | --- |
| Temporal | Rolling form (recent goal difference) | Moving averages, rolling volatility |
| Spatial | Home/away, rest days | Market regime, liquidity conditions |
| Qualitative | Expected goals (xG) | Expected returns, adjusted metrics |
| Relational | Elo/PageRank team strength | Network centrality, peer influence |
Each is a way of encoding perception.
Together they form a hierarchy of awareness — the process by which models evolve from blind calculators to informed analysts.
Building a model that thinks tactically
Once these features are ready, training becomes straightforward.
The heavy lifting — the thinking — has already happened upstream.
from sklearn.ensemble import GradientBoostingClassifier

features = [
    'form_5', 'is_home', 'rest_days', 'short_rest',
    'xG', 'elo_rating', 'pagerank_score'
]
X = matches[features]
y = matches['result']  # 1 = win, 0 = not win

model = GradientBoostingClassifier()
model.fit(X, y)
When we visualized feature importances, we saw a clear ranking:
1. Rest days
2. xG difference
3. Opponent strength
4. Recent form
In other words: context first, then quality, then history.
That hierarchy mirrors how great analysts — and great athletes — think.
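The ranking can be read straight off the fitted estimator. A self-contained sketch on synthetic data (the feature names and the planted signal are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
features = ["rest_days", "xG", "form_5", "is_home"]
X = rng.normal(size=(300, len(features)))

# Synthetic target in which rest_days (column 0) drives the outcome most.
y = (2 * X[:, 0] + 0.5 * X[:, 1]
     + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Pair each feature with its importance and sort, strongest first.
ranking = sorted(zip(features, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```

On real match data, the same two lines at the end produce the kind of ranking described above.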
The Bottom Line: Seeing is understanding
Teaching a model to see is really about teaching ourselves to describe the world precisely.
Each engineered feature is a hypothesis, each transformation a small act of interpretation.
When we craft features, we decide what “matters.”
When we test them, we confront whether our beliefs hold up.
That’s why feature engineering is not just preprocessing — it’s epistemology. It’s how data becomes understanding.
And that’s the real edge.
Not the algorithm, not the architecture — but the perception we bring to the problem.
Once a model sees clearly, prediction becomes almost secondary.
The insight is the goal.
A rough and unedited early release of our book is out now! Check it out here.