How to Build Zero-Hallucination AI
Or why deterministic pipelines, structured prompts, and validation guardrails are the only responsible path forward

Large Language Models are reshaping the financial services industry. They can summarize dense regulatory documents, draft client-facing reports, and explain complex accounting movements in plain English — tasks that previously required hours of skilled analyst time. But they come with a fundamental flaw that is unacceptable in a regulated environment: they hallucinate.
An LLM hallucination is not a random glitch. It is a structural property of how these models work. They are trained to predict the next most probable token in a sequence, not to retrieve verified facts. When the model encounters a gap in its knowledge — an obscure regulatory reference, a specific numeric figure, a recent policy change — it fills that gap with whatever sounds statistically plausible. The output is fluent, confident, and potentially entirely wrong [1].
In most consumer applications, this is an annoyance. In finance, it is a liability. A fabricated figure in a quarterly commentary, a misquoted discount rate in an actuarial report, or an invented regulatory citation in a compliance document can trigger audit failures, regulatory sanctions, and material financial loss. Google’s Bard erased roughly $100 billion of Alphabet’s market capitalisation in a single afternoon after hallucinating a fact about the James Webb Space Telescope during a live demo [1]. Air Canada was held legally responsible for a refund policy its chatbot invented [1]. The question is not whether your LLM will hallucinate. Research suggests it will do so in anywhere from 3% to 41% of finance-related queries [1]. The question is whether you have built a system that catches it before it causes harm.
This article presents a practical, production-oriented architecture for doing exactly that.
Why Finance Cannot Tolerate Probabilistic Outputs
Before examining the solution, it is worth being precise about the problem. The financial services industry is built on a principle that is fundamentally at odds with how LLMs operate: every number must be traceable to a source.
Under frameworks like IFRS 17 for insurance contracts, or the FCA’s Consumer Duty in the UK, firms are not just expected to produce accurate outputs — they are expected to demonstrate how those outputs were produced. An audit trail is not optional; it is a regulatory requirement. An LLM that generates a plausible-sounding explanation of a CSM movement, drawing on its training data rather than the firm’s actual figures, fails this requirement entirely, even if the explanation happens to be correct.
The challenge, then, is not to find a more accurate LLM. It is to design a system in which the LLM’s role is so tightly constrained that hallucination becomes structurally impossible.
The Architecture: Deterministic First, Language Model Last
The core principle of a hallucination-free pipeline is simple: the LLM should never be asked to know anything. It should only be asked to say something, based on facts that have already been computed and verified by deterministic code.
This inverts the typical approach, where an LLM is given a broad question and trusted to retrieve and reason over relevant information. Instead, the pipeline separates the computation layer from the communication layer entirely.
```
deterministic calculations
          ↓
structured metrics
          ↓
template / rule generation
          ↓
LLM summarization layer
          ↓
validation checks
```

Each stage has a clearly defined responsibility, and no stage is permitted to introduce information that has not been verified by the stage before it.
| Stage | Responsibility | Hallucination Risk |
| -------------------------- | -------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| Deterministic calculations | Compute all financial metrics from raw data using auditable code | None — purely mathematical |
| Structured metrics | Package outputs into a typed, validated data structure (e.g., a Python dict or Pydantic model) | None — data is already verified |
| Template / rule generation | Populate a prompt template with the structured metrics, providing explicit context and constraints | None — the prompt is constructed programmatically |
| LLM summarization | Generate human-readable narrative based *only* on the inputs provided in the prompt | Minimal — the LLM cannot introduce external facts if the prompt is well-constrained |
| Validation checks | Parse the LLM's output and verify that all numeric claims match the source metrics | Catches any residual errors before the output is used |

The key insight is that the LLM is not a reasoning engine in this pipeline. It is a prose generator. Its job is to turn a structured set of verified facts into a coherent paragraph of English. That is a task it performs extremely well, and it is a task that does not require it to know anything beyond what it has been explicitly told.
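The "structured metrics" stage can be sketched with a standard-library dataclass; the article's suggestion of a Pydantic model would add richer validation on top. The field set, figures, and reconciliation tolerance below are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ifrs17Metrics:
    """Typed carrier for verified figures; nothing here is computed by an LLM."""

    csm_opening: float
    csm_closing: float
    csm_delta: float

    def __post_init__(self) -> None:
        # Cross-check the delta against the balances it was derived from,
        # so an inconsistent payload can never reach the prompt stage.
        if abs((self.csm_closing - self.csm_opening) - self.csm_delta) > 1e-6:
            raise ValueError("csm_delta does not reconcile with opening/closing balances")


metrics = Ifrs17Metrics(
    csm_opening=42_300_000.0,
    csm_closing=40_800_000.0,
    csm_delta=-1_500_000.0,
)
```

Because the container is frozen and self-checking, any code path that tries to hand the LLM unreconciled numbers fails loudly before a prompt is ever built.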
Technical Deep Dive: A Minimal Pipeline for IFRS 17 Commentary
To make this concrete, consider the challenge of generating automated commentary on insurance liability movements under IFRS 17. This is a genuinely difficult reporting task. The standard is technically demanding, requiring insurers to track the Contractual Service Margin (CSM) — the unearned profit deferred over the life of a group of contracts — alongside discount rate sensitivities, risk adjustments, and experience variances [2]. Explaining these movements in plain language for a board report or investor disclosure is exactly the kind of task where an LLM can add real value, provided it is given the right inputs.
The pipeline begins with a deterministic calculation function:
```python
import pandas as pd


def compute_ifrs17_metrics(df: pd.DataFrame) -> dict:
    """
    Compute key IFRS 17 metrics from a portfolio DataFrame.

    All calculations are deterministic and auditable.
    Returns a structured dictionary of verified figures.
    """
    csm_opening = df["csm_opening"].sum()
    csm_closing = df["csm_closing"].sum()
    csm_delta = csm_closing - csm_opening

    discount_opening = df["discount_opening"].sum()
    discount_closing = df["discount_closing"].sum()
    discount_delta = discount_closing - discount_opening

    return {
        "csm_opening": csm_opening,
        "csm_closing": csm_closing,
        "csm_delta": csm_delta,
        "discount_opening": discount_opening,
        "discount_closing": discount_closing,
        "discount_delta": discount_delta,
    }


metrics = compute_ifrs17_metrics(df)
```

This function is entirely deterministic. Given the same input DataFrame, it will always produce the same output. It can be unit-tested, version-controlled, and audited. No LLM is involved at this stage.
The structured output is then injected into a tightly constrained prompt template:
```python
prompt = f"""
You are an IFRS 17 reporting assistant. Your task is to explain the change in
insurance liability for the current reporting period. Use only the figures
provided below. Do not introduce any external information, assumptions, or
estimates. Do not perform any calculations.

Reporting period metrics:
- CSM opening balance: {metrics['csm_opening']:,.0f}
- CSM closing balance: {metrics['csm_closing']:,.0f}
- CSM net change: {metrics['csm_delta']:,.0f}
- Discount rate impact (opening): {metrics['discount_opening']:,.0f}
- Discount rate impact (closing): {metrics['discount_closing']:,.0f}
- Discount rate net change: {metrics['discount_delta']:,.0f}

Write a two-paragraph commentary suitable for inclusion in a board report.
"""

response = llm(prompt)
```

Notice the explicit constraints in the prompt: “use only the figures provided below,” “do not introduce any external information,” “do not perform any calculations.” These instructions are not just good practice — they are the primary mechanism by which hallucination is prevented at the LLM layer. The model is given no latitude to improvise.
The LLM’s output might then read something like:
“During the reporting period, the Contractual Service Margin decreased by $1,500,000, moving from an opening balance of $42,300,000 to a closing balance of $40,800,000. This reduction reflects the release of deferred profit as insurance services were provided to policyholders over the period, consistent with the Group’s coverage pattern.

“The discount rate contributed a net positive movement of $500,000 to the liability measurement. The increase in the discount rate applied to the liability for remaining coverage reduced the present value of future fulfilment cash flows, partially offsetting the CSM release and resulting in a net decrease in total insurance contract liabilities for the period.”
This commentary is accurate, traceable, and audit-ready — because every number in it came from a deterministic calculation, not from the LLM’s training data.
Guardrails: The Final Line of Defence
Even with a well-constrained prompt, a production system should implement automated validation checks before the LLM’s output is used. Think of these as the equivalent of a compiler’s type checker: the code may look correct, but you still run the tests.
Numeric verification is the most straightforward guardrail. A simple regular expression or number-extraction routine can parse the LLM’s output and compare every figure against the source metrics dictionary. Any discrepancy — even a rounding difference — should trigger a flag for human review. This check is fast, cheap, and catches the most common class of residual error.
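A minimal sketch of that check, assuming the metrics dictionary produced earlier in the pipeline. Figures are compared on rounded absolute values so that a "$1,500,000" in prose matches a stored delta of -1,500,000:

```python
import re


def extract_figures(text: str) -> set:
    """Pull numeric figures (with thousands separators) out of the commentary."""
    return {float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}


def unmatched_figures(commentary: str, metrics: dict) -> set:
    """Return figures in the commentary that match no source metric.

    Comparison is on rounded absolute values, so a negative delta matches its
    positive presentation in prose. Small incidental numbers (e.g. the '17'
    in 'IFRS 17') would need a whitelist in a production version.
    """
    allowed = {round(abs(v)) for v in metrics.values()}
    return {round(f) for f in extract_figures(commentary)} - allowed


metrics = {"csm_opening": 42_300_000, "csm_closing": 40_800_000, "csm_delta": -1_500_000}
clean = "The CSM decreased by $1,500,000, from $42,300,000 to $40,800,000."
bad = "The CSM decreased by $1,600,000."
```

Here `unmatched_figures(clean, metrics)` is empty, while the `bad` commentary surfaces the stray 1,600,000; any non-empty result should route the output to human review rather than publication.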
Structured outputs take this a step further. Rather than generating free prose and then parsing it, the LLM can be instructed to return a JSON object with a predefined schema. A Pydantic model can then validate the output programmatically, ensuring that all required fields are present, all numeric values fall within expected ranges, and the output is machine-readable for downstream systems. This approach is particularly valuable in automated pipelines where the commentary feeds directly into a reporting database or document generation system.
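A stdlib-only sketch of that pattern (a Pydantic model would express the same schema more declaratively); the field names and the range rule are illustrative assumptions:

```python
import json

# Required fields and their accepted types (illustrative schema).
SCHEMA = {"csm_opening": (int, float), "csm_closing": (int, float), "commentary": (str,)}


def validate_llm_output(raw: str) -> dict:
    """Parse the LLM's JSON reply and enforce the schema before downstream use."""
    data = json.loads(raw)  # malformed JSON raises here
    for field, types in SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(data[field], types):
            raise ValueError(f"wrong type for field: {field}")
    if data["csm_closing"] < 0:  # example of an out-of-range check
        raise ValueError("csm_closing outside expected range")
    return data


good = '{"csm_opening": 42300000, "csm_closing": 40800000, "commentary": "CSM fell."}'
```

A rejected payload never reaches the reporting database; the failure itself becomes part of the audit trail.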
Retrieval-Augmented Generation (RAG) addresses a different class of risk: the need to reference external documents such as regulatory guidance, internal policies, or prior-period disclosures. Rather than allowing the LLM to draw on its training data — which may be outdated, incomplete, or simply wrong — a RAG system retrieves the relevant passages from a pre-approved, version-controlled knowledge base and injects them directly into the prompt [3]. The LLM is then constrained to cite only what it has been given. This is particularly important for regulatory commentary, where a fabricated reference to a non-existent guidance note could have serious consequences.
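A toy illustration of the retrieval contract, using naive keyword overlap in place of an embedding-based retriever; the document IDs and passages are hypothetical:

```python
def retrieve(query: str, knowledge_base: dict, top_k: int = 2) -> list:
    """Rank pre-approved passages by keyword overlap with the query.

    A production system would use embeddings, but the contract is identical:
    only retrieved, version-controlled text ever enters the prompt.
    """
    q_terms = set(query.lower().split())
    ranked = sorted(
        knowledge_base.items(),
        key=lambda item: len(q_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return [f"[{doc_id}] {text}" for doc_id, text in ranked[:top_k]]


# Hypothetical approved knowledge base keyed by document ID.
kb = {
    "IFRS17-B119": "the csm release reflects services provided in the period",
    "POLICY-DR-04": "discount rates are set from the approved yield curve",
}
passages = retrieve("csm release for services provided", kb, top_k=1)
```

The retrieved passages, with their IDs, are then injected into the prompt alongside an instruction to cite only those IDs, making fabricated references detectable by a simple string check.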
Agent validators represent the most sophisticated layer of the stack. A second LLM agent, operating independently of the primary commentary agent, can be tasked with reviewing the output against the source data and a set of validation rules. This agent is not asked to generate prose; it is asked to answer a binary question: “Does this commentary accurately reflect the provided metrics?” If the answer is no, the output is rejected and the primary agent is asked to regenerate. This pattern mirrors the four-eyes principle that is already standard practice in financial controls.
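The regenerate-on-rejection loop can be sketched as below; `generate` and `validate` are placeholder callables standing in for two independent LLM calls:

```python
def validated_commentary(generate, validate, metrics: dict, max_attempts: int = 3) -> str:
    """Four-eyes loop: one agent drafts, an independent agent answers the
    binary question 'does this reflect the metrics?' with YES or NO."""
    for _ in range(max_attempts):
        draft = generate(metrics)
        if validate(draft, metrics).strip().upper() == "YES":
            return draft
    # Persistent rejection is an escalation event, not a silent retry forever.
    raise RuntimeError("validator rejected all drafts; escalate to human review")


# Stub callables standing in for real LLM calls.
draft_ok = validated_commentary(
    generate=lambda m: "CSM decreased by 1,500,000.",
    validate=lambda d, m: "YES",
    metrics={"csm_delta": -1_500_000},
)
```

Constraining the validator to a single-token verdict keeps the second agent cheap and makes its output trivially machine-parseable.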
| Guardrail | What It Catches | Complexity |
| ------------------------------ | -------------------------------------------------------------- | ---------- |
| Numeric verification | Figures in the output that do not match source metrics | Low |
| Structured outputs | Missing fields, out-of-range values, malformed responses | Low–Medium |
| Retrieval-Augmented Generation | Fabricated regulatory references, outdated policy citations | Medium |
| Agent validators | Semantic inaccuracies, logical inconsistencies, missed context | High |

A production system does not need to implement all four simultaneously. For most automated commentary use cases, numeric verification and structured outputs provide sufficient coverage. RAG becomes essential when the commentary must reference external documents. Agent validators are most valuable in high-stakes workflows — board-level reporting, regulatory submissions — where the cost of an error is highest.
Governance and Auditability: The Non-Negotiable Layer
Technical guardrails are necessary but not sufficient. A truly production-ready system must also address the governance requirements that financial regulators increasingly impose on AI-assisted workflows.
Every output generated by the pipeline should carry a full audit trail: the version of the calculation code used, the input data hash, the prompt template version, the LLM model identifier and temperature setting, and the results of each validation check. This metadata should be stored alongside the output and be retrievable on demand. When an auditor asks “how was this commentary generated?”, the answer should be a complete, reproducible record — not “the AI wrote it.”
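One way to capture that record with stdlib tooling; the version tags and model identifier below are placeholders, not references to a real deployment:

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """Reproducibility metadata stored alongside every generated commentary."""

    calc_code_version: str
    input_data_hash: str
    prompt_template_version: str
    model_id: str
    temperature: float
    validation_passed: bool
    generated_at: str


def hash_input(raw_csv: str) -> str:
    """Stable hash of the raw input so the exact data can be re-identified later."""
    return hashlib.sha256(raw_csv.encode()).hexdigest()


record = AuditRecord(
    calc_code_version="v1.4.2",                 # hypothetical release tag
    input_data_hash=hash_input("csm_opening,csm_closing\n42300000,40800000"),
    prompt_template_version="ifrs17-commentary-v3",  # hypothetical template tag
    model_id="example-model",                   # placeholder identifier
    temperature=0.0,
    validation_passed=True,
    generated_at=datetime.now(timezone.utc).isoformat(),
)
audit_json = json.dumps(asdict(record))  # persisted next to the output itself
```

With this record stored, "how was this commentary generated?" is answered by replaying the same code version against the same hashed input.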
Firms operating under the FCA’s Consumer Duty or equivalent frameworks should also consider how they will handle cases where the validation checks fail. A clear escalation path — from automated rejection, to human review, to senior sign-off — should be defined before the system goes live. The governance framework is as important as the technical architecture.
The Bottom Line: Benefits Without the Risks
The promise of LLMs in finance is real. The ability to generate clear, accurate, audit-ready commentary at scale — across thousands of contracts, portfolios, or client accounts — represents a genuine step change in operational efficiency. But that promise can only be realised if the architecture is designed from the ground up to prevent hallucination, not merely to detect it after the fact.
The pipeline described in this article — deterministic calculations feeding structured metrics into a constrained prompt, with validation guardrails at the output layer — provides a practical, implementable path to that goal. It is not a theoretical framework; it is a pattern that can be built, tested, and deployed with standard Python tooling and a well-chosen LLM API.
The key discipline is one of role clarity. Deterministic code computes. Structured data carries. Prompt templates constrain. The LLM narrates. Validators verify. When each component does only its job, the system as a whole becomes trustworthy — and that is the only acceptable standard for AI in finance.


