Architecting a Chatbot that Builds Brand Trust
Every model hallucinates. The architecture keeps it off your customer.
This is part two of a series on customer service chatbots. Part one can be read here.
Every generative model hallucinates. GPT-5, Claude Opus 4.7, Gemini 3 Pro: all of them, at meaningfully non-zero rates on the kinds of factual questions a customer-service deployment gets asked every minute. Stanford’s RegLab measured general-purpose chatbots hallucinating between 58% and 82% of the time on legal queries. The newer reasoning-class models on the Vectara benchmark exceed a 10% hallucination rate even on summarization, which is supposed to be the easy task. A prompt that says “do not make things up” is fighting the basic mechanic of these systems: they predict plausible next tokens, not verified facts.
This is not a property you can engineer away. Smarter models hallucinate too. Better prompts move the rate at the margins; they cannot eliminate it. More training data has diminishing returns. The architecture is what keeps the hallucination from reaching your customer.
Part 1 of this series identified the trap. Two customer-service deployments, same deflection metric, opposite outcomes. Klarna’s team read its chatbot dashboard as labor savings, cut seven hundred agents, and watched the company’s valuation fall two-thirds from its peak. Ingka read the same kind of dashboard as a map of where consultative humans were now needed and built a €1.3 billion advisory channel out of the queries the bot couldn’t handle. The technology was the same. The strategy was what each company did with the queries the chatbot couldn’t answer.
The diagnostic doesn’t tell you what to build. Most readers of Part 1 will recognize a deployment they know in the failure modes: the Air Canada confabulated refund policy, the Chevy $1 Tahoe, the DPD insult-haiku, the Deloitte fabricated citations. They ask the operational question: what do I build today so that doesn’t happen here?
This is that piece. The answer is layered defense, and the layers are not new or proprietary. They are the practitioner consensus that has emerged across the eighteen months since the British Columbia tribunal told Air Canada that its chatbot’s confident statements belonged to its brand. What is new is that the consensus is now sharp enough to be written as architecture rather than vibes.
The four-layer defense
Layer One: The architecture starts with a constraint that vendors will resist: your AI does not go live against a real customer until it has been evaluated against your own historical tickets.
The first layer is pre-launch evaluation against your archive. Take the most recent one thousand to ten thousand resolved tickets in your support system, run the AI against them, and score the AI’s responses against what your human agents actually wrote. Set a ship threshold somewhere in the 95-99% range; if accuracy on your data falls below it, the agent doesn’t ship. A model that scores 98% on a generic benchmark may score 70% on your specific business, because every business has a different vocabulary, collection of policies, and product taxonomy. The diagnostic for any vendor selling you an agent is straightforward: run it against a hundred of my historical tickets and walk me through every failure, response by response. A vendor unwilling to do that is selling deflection, not resolution.
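A minimal sketch of that gate in Python, to make the shape concrete. Everything here is an assumption for illustration: the `Ticket` fields, the `answer` callable standing in for the AI under test, and the `matches` callable standing in for whatever scoring method you trust (human graders, a rubric, a judge model). The 0.95 default is simply the low end of the range above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Ticket:
    ticket_id: str
    customer_message: str
    agent_reply: str  # what the human agent actually wrote


def evaluate_against_archive(
    tickets: List[Ticket],
    answer: Callable[[str], str],         # hypothetical: the AI under evaluation
    matches: Callable[[str, str], bool],  # hypothetical: your scoring method
    ship_threshold: float = 0.95,
) -> Tuple[float, List[Tuple[str, str, str]], bool]:
    """Replay archived tickets and return (accuracy, failures, ship_ok)."""
    if not tickets:
        raise ValueError("need at least one archived ticket")
    failures = []
    for t in tickets:
        ai_reply = answer(t.customer_message)
        if not matches(ai_reply, t.agent_reply):
            # Keep every failure so it can be walked through response by response.
            failures.append((t.ticket_id, ai_reply, t.agent_reply))
    accuracy = 1 - len(failures) / len(tickets)
    return accuracy, failures, accuracy >= ship_threshold
```

The returned failure list matters as much as the accuracy number: it is the artifact a vendor should be willing to walk through with you.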
Layer Two: The second layer is a QA AI second-pass on every response the customer-facing AI produces. A different model from a different model family, with a different prompt, a different objective, and access to the same source content, reads the response before the customer sees it. The reasoning is simple: a model that hallucinated a response is unlikely to detect its own hallucination, because the hallucination was plausible enough to produce in the first place. A second model with adversarial intent and the same source material is meaningfully better at noticing when an answer doesn’t match its grounding. Not perfect. Meaningfully better. The product diagnostic: ask to see the QA layer in the interface, ask whether you can see, per response, why the QA AI approved or rejected it, and ask whether you can tune the thresholds. If quality control isn’t a visible artifact in the product, the layer doesn’t exist.
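One way to picture the second pass is as a gate that sits between the draft and the customer. The sketch below is illustrative only: `Verdict`, `qa_gate`, and the 0.8 threshold are invented names and values, and `toy_number_judge` is a deliberately crude stand-in for the second model (it only checks that numbers in the draft appear in the sources). In a real deployment the `judge` callable would wrap a call to a different model family.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    grounded_score: float  # 0.0 to 1.0, as reported by the QA model
    reason: str            # kept per response so the decision is auditable


Judge = Callable[[str, List[str]], Verdict]


def qa_gate(draft: str, sources: List[str], judge: Judge,
            threshold: float = 0.8) -> Tuple[Optional[str], Verdict]:
    """Return (draft, verdict) if approved, (None, verdict) if rejected."""
    verdict = judge(draft, sources)
    approved = verdict.grounded_score >= threshold
    return (draft if approved else None), verdict


def toy_number_judge(draft: str, sources: List[str]) -> Verdict:
    """Stand-in judge: every number in the draft must appear in a source."""
    pattern = r"\d+(?:\.\d+)?"
    source_numbers = set(re.findall(pattern, " ".join(sources)))
    unsupported = [n for n in re.findall(pattern, draft) if n not in source_numbers]
    if unsupported:
        return Verdict(0.0, f"numbers not found in sources: {unsupported}")
    return Verdict(1.0, "all numbers in the draft appear in the sources")
```

The tunable `threshold` and the stored `reason` correspond to the two things the product diagnostic asks to see: adjustable thresholds and a per-response explanation.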
Layer Three: The third layer is what would have made Chris Bakke’s $1 Tahoe architecturally impossible. Every transactional action your AI is allowed to take maps to a typed tool definition with policy-bounded parameters. When the AI decides to issue a refund, it does not generate the string “I’ll refund you $76,000.” It produces a structured tool call where the amount is bounded by the order total and your refund policy, and the reason code is from an enumerated list. The AI proposes parameters; the system validates them against typed constraints before any commitment is shown to the customer. Refunds, cancellations, order edits, subscription changes, address updates: every commercial action goes through the tool layer, and no free-text path to a transaction exists. Bakke’s prompt injection worked because the bot wrote the legally binding offer as free-text prose. In a Layer-3 deployment the bot cannot make that commitment, because the validation layer rejects the parameter before the customer ever sees a response.
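Here is a minimal sketch of what “the AI proposes, the system validates” could look like for a refund. The names (`Order`, `RefundCall`, `validate_refund`), the reason codes, and the $500 policy cap are all hypothetical; the point is only that the amount and reason arrive as parameters and are checked against typed bounds before anything is committed.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from enum import Enum


class RefundReason(Enum):
    # Enumerated reason codes; these values are illustrative.
    DAMAGED = "damaged"
    NOT_DELIVERED = "not_delivered"
    WRONG_ITEM = "wrong_item"


@dataclass(frozen=True)
class Order:
    order_id: str
    total: Decimal
    already_refunded: Decimal = Decimal("0")


@dataclass(frozen=True)
class RefundCall:
    order_id: str
    amount: Decimal
    reason: RefundReason


class PolicyViolation(Exception):
    """Raised when model-proposed parameters fall outside policy."""


def validate_refund(order: Order, amount: str, reason: str,
                    policy_cap: Decimal = Decimal("500.00")) -> RefundCall:
    """Turn model-proposed parameters into a validated call, or raise."""
    try:
        reason_code = RefundReason(reason)
    except ValueError:
        raise PolicyViolation(f"unknown reason code: {reason!r}") from None
    try:
        value = Decimal(amount)
    except InvalidOperation:
        raise PolicyViolation(f"amount is not a number: {amount!r}") from None
    if not value.is_finite() or value <= 0:
        raise PolicyViolation("amount must be a positive, finite number")
    refundable = order.total - order.already_refunded
    if value > refundable:
        raise PolicyViolation(f"amount {value} exceeds refundable balance {refundable}")
    if value > policy_cap:
        raise PolicyViolation(f"amount {value} exceeds policy cap {policy_cap}")
    return RefundCall(order.order_id, value, reason_code)
```

Under these assumptions, a proposed refund of 76000 on a 120.00 order raises `PolicyViolation` before any customer-facing text is produced, and a free-text reason is rejected because it is not a member of the enum.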
Layer Four: The fourth layer is human fallback with full context. Confidence thresholds are explicit. When the AI’s confidence drops below threshold, or the QA AI rejects a response, or the conversation hits any of a defined set of escalation triggers, a human picks up. The AI does not guess to protect its resolution rate. This is the failure mode Klarna’s deployment ran headfirst into. A bot that resolves 95% of tickets sounds great until you find that 30% of the “resolutions” were the customer giving up. Layer 4 measures resolution, not deflection. A conversation is resolved when the customer’s actual issue is solved, validated, and confirmed, not when the customer stops talking. When the AI cannot confidently resolve, it hands off with full context: conversation history, what the AI considered, why it escalated, what data it pulled, what action (if any) it was about to take. The human agent picks up with everything they need.
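The three escalation conditions above can be written as one explicit check. This is a sketch under stated assumptions: the trigger names, the 0.75 threshold, and the function name are invented for illustration, and how a confidence score is obtained is left to your stack.

```python
from typing import List, Set

# Hypothetical trigger names; define your own set.
ESCALATION_TRIGGERS = {"legal_threat", "chargeback", "explicit_human_request"}


def escalation_reasons(confidence: float, qa_approved: bool,
                       triggers_hit: Set[str], threshold: float = 0.75) -> List[str]:
    """Return every reason to hand off; an empty list means the AI may answer."""
    reasons = []
    if confidence < threshold:
        reasons.append(f"confidence {confidence:.2f} below threshold {threshold:.2f}")
    if not qa_approved:
        reasons.append("QA pass rejected the draft")
    # Only triggers on the defined list count; sorted for stable output.
    for trigger in sorted(triggers_hit & ESCALATION_TRIGGERS):
        reasons.append(f"trigger: {trigger}")
    return reasons
```

Returning the reasons rather than a bare yes or no means the “why it escalated” part of the handoff comes for free.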
The rate progression Richpanel (an AI e-commerce customer-service platform) reports across its tenants’ pre-launch evaluations looks like this: ungrounded LLMs hallucinate at 15-30% on customer-service queries; naive RAG brings that to 5-10%; the QA AI second-pass gets it to 2-4%; deterministic tool execution closes the transactional surface and drops the residual rate to 1-2%; all four layers together, with structured human fallback catching what the others miss, run under 1%.
These are vendor numbers: Richpanel grading its own work. The next section is where that fact stops being a footnote.
The metrics that hide containment
The architecture is what you build. The KPI framework is what you measure. The most consequential distinction in the whole stack is the difference between resolution rate and deflection rate.
Resolution rate measures conversations the AI resolved end-to-end without requiring a human. Deflection rate measures queries that never reached a human, which includes FAQ views, article clicks, and abandoned conversations. A platform can show a 90% deflection rate paired with a 40% true resolution rate when many customers are simply being redirected rather than helped. Intercom’s Fin team (Fin is one of the larger AI customer-service agents on the market) framed it this way in their May 2026 KPI piece: “Tells you about volume reduction. Tells you nothing about customer outcomes.”
The metric most vendors would prefer you not ask about is the reopen rate: the percentage of “resolved” conversations where the customer contacts support again about the same issue within 24-48 hours. Fin again: “A high resolution rate paired with a high reopen rate is functionally a containment rate in better packaging.” That is the containment-fatigue concept from Part 1, named operationally. It is the single number that exposes what Klarna’s dashboard could not see at minute four.
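To see how the three numbers diverge, here is a small sketch that computes them from the same conversation log. The `Conversation` fields are assumptions about what your export contains; in particular, `ai_resolved` presumes you have some way to confirm the issue was actually solved, which is the hard part.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Conversation:
    reached_human: bool
    ai_resolved: bool  # solved end-to-end by the AI and confirmed by the customer
    reopened: bool     # same issue came back within the reopen window


def support_rates(conversations: List[Conversation]) -> Dict[str, float]:
    total = len(conversations)
    if total == 0:
        raise ValueError("no conversations to score")
    # Deflection counts anything that never reached a human, abandonment included.
    deflected = sum(not c.reached_human for c in conversations)
    resolved = [c for c in conversations if c.ai_resolved]
    reopened = sum(c.reopened for c in resolved)
    return {
        "deflection_rate": deflected / total,
        "resolution_rate": len(resolved) / total,
        "reopen_rate": reopened / len(resolved) if resolved else 0.0,
    }
```

Feed it ten conversations where nine never reach a human but only four are confirmed resolved, and the same log yields a 90% deflection rate next to a 40% resolution rate.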
Three more measurement layers complete the framework. Quality metrics should be AI-scored across 100% of conversations rather than the 2-8% a typical post-conversation CSAT (customer satisfaction) survey covers, because CSAT skews toward extreme experiences and consistently underestimates AI performance (people rate bots harshly regardless of outcome). Operational metrics cover automation rate (share of total team workload handled end-to-end, not share of conversations touched); cost per resolution ($0.50-$1.84 for AI versus $6-$8 for human, meaningful only when paired with quality data, because a cheap resolution that generates a repeat contact costs more than a slightly costlier one that solves the problem); and escalation rate and quality. Business impact metrics cover CSAT delta split by AI-only versus hybrid versus human-only (blending the three makes AI’s contribution impossible to isolate); repeat contact rate (the trailing indicator of containment fatigue); and cost savings and time to value.
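The cost-per-resolution caveat is easy to check with arithmetic. The sketch below assumes, for simplicity, that every repeat contact is handled by a human at the human cost; the 30% and 5% repeat-contact rates are illustrative figures, not measurements, while the dollar amounts are the ends of the ranges quoted above.

```python
def effective_cost_per_resolution(ai_cost: float, repeat_contact_rate: float,
                                  human_cost: float) -> float:
    """AI cost plus the expected cost of a human handling the repeat contact."""
    if not 0.0 <= repeat_contact_rate <= 1.0:
        raise ValueError("repeat_contact_rate must be between 0 and 1")
    return ai_cost + repeat_contact_rate * human_cost


# Cheap AI resolution, but 30% of customers come back (assumed rate).
cheap = effective_cost_per_resolution(0.50, 0.30, 8.00)
# Pricier AI resolution, but only 5% come back (assumed rate).
better = effective_cost_per_resolution(1.84, 0.05, 8.00)
```

Under those assumptions the $0.50 resolution works out to roughly $2.90 all-in and the $1.84 resolution to roughly $2.24, which is the sense in which the cheaper number is the more expensive one.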
One measurement risk is unique to AI deployments and deserves naming directly: when a vendor’s own AI grades its own work, and especially when that grading triggers per-resolution billing, the vendor is paid more for every conversation its own model marks as resolved. The most editorially honest source on the problem is Fin themselves, who name the risk in their own piece. The diagnostic questions for any vendor whose dashboard you are about to act on: How is resolution defined? Does the metric include deflections? What is the reopen rate paired with the resolution rate? Can you audit the underlying classifications? If the answers don’t survive that audit, the resolution rate on your slide is decorative.
The handoff is a context problem
The fourth layer’s structured human fallback is where the deepest practitioner failure mode lives. I’ve sat in too many post-mortems where an escalation landed on a human agent’s screen as a raw transcript dump and the agent ended up asking the customer the same three questions the bot had already asked. The customer reads that as the company not listening. Most teams architect for the handoff as if it were a routing problem. It is a context problem. The framing comes from BlueTweak’s Radu Dumitrescu, writing in April: “If the agent doesn’t inherit the full story, the system has already failed, regardless of how accurate the routing logic is.” Three independent practitioner sources (BlueTweak, eesel AI, Pedowitz Group) reach the same conclusion within months of each other, which is usually how you know the field has worked out an answer.
The five elements of a handoff that doesn’t destroy trust: a structured context package, not a transcript dump (issue summary, customer data, intent, sentiment, prior attempts, what the AI was about to do and why); intent-aware transfer design, because complexity, emotion, and risk each determine a different transfer mode; an agent-first interface at the point of connection, so the agent sees the request, prior attempts, and likely next step before saying anything (fragmented information forces the agent to ask clarifying questions the customer believes they have already answered); customer expectation management during the transfer (the AI announces the handoff, confirms agent visibility, sets a realistic wait expectation); and closed-loop learning, treating every escalation as data on what should and shouldn’t be automated next.
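The first element, the structured context package, is the most directly buildable. A minimal sketch follows; the field names track the list above, but the class, its defaults, and the `agent_brief` layout are assumptions for illustration rather than any vendor’s schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HandoffContext:
    issue_summary: str
    customer_id: str
    intent: str
    sentiment: str
    prior_attempts: List[str] = field(default_factory=list)
    pending_action: Optional[str] = None  # what the AI was about to do
    escalation_reason: str = ""           # why it handed off

    def agent_brief(self) -> str:
        """Render what the agent should see before saying anything."""
        lines = [
            f"Issue: {self.issue_summary}",
            f"Customer: {self.customer_id} | intent: {self.intent} | sentiment: {self.sentiment}",
            f"Why escalated: {self.escalation_reason or 'not recorded'}",
            f"AI was about to: {self.pending_action or 'nothing'}",
            "Already tried:",
        ]
        lines += [f"  - {a}" for a in self.prior_attempts] or ["  - nothing yet"]
        return "\n".join(lines)
```

The “Already tried” block is the part that keeps the agent from re-asking questions the customer has already answered.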
Voice and chat are structurally different in ways most teams underestimate. Chat is forgiving: transcript available, customer can queue without losing the thread, agent can review before engaging. Voice operates under real-time constraints: silence and hesitation are immediately felt, warm transfers with whisper briefings replace the natural transcript, dead-air management becomes its own design problem, and SIP performance matters in ways chat doesn’t. A handoff framework written for chat will fail in voice.
Brand voice as system infrastructure
The DPD case from Part 1 is the brand-voice failure mode at its purest. A frustrated customer asked the chatbot to insult the brand, and the chatbot complied: a swear-filled haiku and a paragraph about DPD being the worst delivery company in the world. A competent brand-voice system would have refused. DPD’s deployment did not have a competent brand-voice system. Most deployments today do not.
The reason is structural. Generative AI is trained on the average of the internet, which means ungrounded outputs default toward that average regardless of operator skill. Contentstack’s Kevin Thomas calls the result “perfectly serviceable, grammatically accurate prose that could have been generated by any company in your sector.” Three failure modes recur. Tone drift: gradual shift toward internet-average phrasing that sounds professional but doesn’t sound like the brand. Terminology substitution: proprietary terms replaced by industry-standard generics. Perspective loss: thought-leadership content loses the brand’s actual point of view because the model generates consensus, not opinion. The damage is cumulative and quiet. Nobody publishes one AI post that destroys a brand. They publish dozens of slightly off-brand pieces that gradually dilute what made the brand recognizable in the first place.
The fix is system-level constraints attached to the CMS, not to the operator. Brand Kit, Knowledge Vault, Voice Profile: the same constraints applied to every AI interaction regardless of who is at the keyboard. If the brand avoids superlatives, the system avoids superlatives. If the brand uses specific product terminology, the system uses those terms exactly. A junior content editor using AI-assisted drafting produces output that matches the same voice standards as a senior strategist, because both are operating inside the same guardrails. At scale, that consistency is what keeps the brand from sounding like a different company on every channel. The human review effort shifts from production to quality control, targeting what AI consistently gets wrong: subtle irony, audience-segment cultural context, the emotional calibration that makes a thought-leadership voice sound like a person rather than a committee summary.
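As a rough illustration of “the same constraints regardless of who is at the keyboard”, here is a sketch of a check that could run on every AI draft. `VoiceProfile` and `voice_violations` are hypothetical names, not Contentstack’s API, and a real system would constrain generation as well as lint the output; this only shows the two rules named above (no superlatives, exact product terminology).

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VoiceProfile:
    banned_words: List[str] = field(default_factory=list)
    # Maps a generic phrase to the brand term that must be used instead.
    required_terms: Dict[str, str] = field(default_factory=dict)


def voice_violations(text: str, profile: VoiceProfile) -> List[str]:
    """Return every way the draft breaks the shared voice profile."""
    problems = []
    for word in profile.banned_words:
        if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
            problems.append(f"banned word: {word}")
    for generic, brand_term in profile.required_terms.items():
        if re.search(rf"\b{re.escape(generic)}\b", text, flags=re.IGNORECASE):
            problems.append(f"use '{brand_term}' instead of '{generic}'")
    return problems
```

Because the profile lives with the system rather than in any one prompt, a junior editor and a senior strategist get the same violations flagged on the same draft.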
The hallucination floor
All four sections sit on top of an academic claim that makes the architecture argument unimpeachable. AI hallucinations are a categorically distinct form of misinformation, generated without intent, belief, or epistemic awareness, and the intervention frameworks designed for intentional human communicators (fact-checking, accuracy nudges, alignment strategies) do not address them. The argument is Anqi Shao’s, in the Harvard Kennedy School Misinformation Review last August. The supply-side risk has multiple layers (training data biases, training process opacity, gatekeeping gaps); the demand-side risk is that confident tone and authoritative styling invite shallow processing and trust.
This is why each layer in the four-layer defense addresses a different supply-side vulnerability. Layer 1 catches training-data-gap failures specific to the deploying organization (your business has a vocabulary the base model didn’t see enough of). Layer 2 catches gatekeeping failures (the model didn’t know it didn’t know). Layer 3 prevents free-text confabulation from reaching the customer in transactional contexts (the dollar amount and the policy code are typed, not narrated). Layer 4 catches what the first three miss (the conversation the model shouldn’t have been holding in the first place).
The architecture exists because the hallucination floor exists. Lower the floor (better models, more data, better prompts), and the layered defense still earns its keep at the residual rate, which is non-zero by construction.
What to do today
Six moves you can run starting today. Three diagnostic, three architectural.
Track reopen rate at week three, not resolution at minute four. Reopen rate at week three is the metric most vendors would prefer you not ask about, because it exposes containment masquerading as resolution. The vendor will resist. Put it in the contract anyway.
Cluster what the AI can’t handle. The 53% of queries Billie escalated at Ingka was the strategic information; the 47% it handled was the cost saving. Treat the chatbot as a market-research instrument for the consultative human roles your organization didn’t know it needed before the rollout told it.
Run a pre-launch eval against your own historical tickets. A 95% generic-benchmark score does not transfer to a 95% score on your specific business. If the vendor will not run against a hundred of your past resolved tickets and walk you through every failure, response by response, walk away. They are selling deflection, not resolution.
Move every free-text commercial commitment to a typed tool call. Refund amounts, cancellations, address updates, subscription changes: none of these should pass through the LLM’s prose layer. The $1 Tahoe is architecturally impossible if free-text dollar amounts cannot reach the customer.
Design the handoff as a context package, not a transcript dump. Structured intent, sentiment, prior attempts, AI reasoning chain. The bridge between the AI end and the human end of the barbell is where the trust value gets created or destroyed.
Treat brand voice as system-level infrastructure. Persistent constraints attached to the CMS, not per-prompt instructions. The DPD haiku happens to deployments that didn’t make this architectural choice.
Two things are happening across these six moves. The first three are structural transparency: published thresholds (reopen rate at week three), publishable failure clusters (what the AI couldn’t handle), publishable pre-launch eval results. The second three are adversarial resilience: closing the free-text path to a transaction, packaging context for the human, treating brand voice as a system-level refusal rather than a per-prompt instruction. Free-text commercial commitments are an attack surface; typed tool calls with validation constraints close that surface architecturally.
The architecture is publicly known. The question is whether yours runs it.