How to test and monitor customer support chatbots in banking (every one is exploitable)
Milton Leal, a lead applied AI researcher at TELUS Digital, tested 24 AI models configured as banking customer support assistants.
Every single one was exploitable.
Success rates ranged from 1% to over 64%, depending on the model and attack type. Some chatbots did something Leal calls "refusal but engagement." They said "I cannot help with that" and then immediately disclosed proprietary information anyway.
The chatbot knows it should not share the information. It says so.
Then it shares it.
If your bank runs an AI customer support agent, it is almost certainly vulnerable to something similar. The question is whether you find the vulnerability before a bad actor does, or before the CFPB does.
Why your bank's support chatbot is a regulated system
Most articles about banking chatbots focus on deflection rates and cost savings. Almost none mention that your AI support agent became a regulated compliance system the moment it started talking to customers.
The CFPB has been explicit about this since 2023. Customer support chatbots must meet the same consumer protection standards as human agents.
A chatbot that gives wrong information about an APR is not a bug. It is a compliance violation.
A chatbot that fails to recognize a customer invoking their dispute rights is not a feature gap. It is a federal law violation.
Their data backs up the concern. Approximately 98 million Americans interacted with bank chatbots in 2022. That number is projected to hit 110.9 million by 2026.
And the customer experience? 80% of people who used a bank's chatbot left feeling more frustrated. 78% needed a human afterward anyway.
That is not a technology problem. That is a testing problem.
What makes banking support chatbots uniquely hard to test
A customer support chatbot at a retail company handles returns and shipping questions. If it gets something wrong, someone waits an extra day for a refund.
A banking support chatbot handles balance inquiries, fee disputes, loan questions, fraud alerts, and payment issues. If it gets something wrong, a customer could miss a payment deadline, lose dispute rights, or get quoted the wrong interest rate on a loan.
The regulatory exposure is also different. Every customer interaction with a bank's AI support agent falls under federal consumer financial protection laws.
Regulation E governs electronic fund transfers and disputes. Regulation Z covers credit disclosures.
The CFPB found that banking chatbots regularly fail to recognize when a customer is invoking these rights. The chatbot treats a formal dispute as a general complaint and never opens the required investigation.
That turns a technology failure into legal liability.
A testing framework for banking support AI
Here are five layers. Almost every bank does layer one. Almost nobody does layer three.
Layer 1: Functional intent testing
Can your chatbot correctly identify what the customer wants? Balance check, dispute, payment question, fraud report, fee inquiry, account update.
Measure recognition rate and fallback rate. If your chatbot says "I did not understand" more than 15% of the time, your training data has gaps.
Also test backend integration. Does the chatbot pull the right account information? The biggest performance gaps between chatbot platforms show up in how they connect to core banking systems, not in conversation quality.
Spend a week here. Move on.
Layer 2: Regulatory compliance testing
This is the layer that separates banking from every other industry.
Build a test library covering every regulation that touches customer support conversations.
Fair lending. Does the chatbot give different quality responses depending on signals like zip code, name, or account profile? The CFPB requires lenders to explain adverse decisions and test for discrimination.
Disclosure accuracy. When a customer asks about their rate, fees, or terms, does the chatbot respond with the exact legally required disclosures?
Not approximately correct. Exactly correct. One wrong APR statement is a violation.
Dispute recognition. When a customer says "I did not authorize this charge," does your chatbot recognize that as a formal dispute under Regulation E? Or does it treat it as a general complaint?
Privacy. Does the chatbot ever surface information from one customer's session during another? Does it retain conversation data beyond what your privacy policy permits?
Build these tests with compliance. Not engineering.
Engineers know how the system works. Compliance knows how the law works.
Layer 3: Adversarial security testing
This is where all 24 of Leal's banking chatbots failed. And this is the layer most banks skip entirely.
You need people actively trying to break your support chatbot. Not threat models. Actual attacks.
Prompt injection: can someone trick the chatbot into ignoring its instructions? Leal found chatbots that disclosed proprietary eligibility criteria through simple conversational manipulation. A fraud ring could use those criteria to reverse-engineer your approval processes.
Social engineering: can someone convince the chatbot to perform actions on an account by impersonating the holder? Test this against your authentication enforcement.
Information extraction: can a bad actor pull out your internal fee structures, policies, or decision logic through indirect questions? The "refusal but engagement" pattern means the chatbot says no while leaking the information.
Run adversarial testing monthly. Attack techniques evolve fast. What your chatbot resisted in January might exploit it in March.
Layer 4: Multi-turn customer support conversations
Real banking support conversations are rarely one question.
A customer asks about a charge on their statement. Then asks about the fee associated with it.
Then they want to dispute it. Then they ask how long the investigation takes.
That is four turns, and each depends on the previous. The dispute changes the conversation from informational to regulatory. The chatbot needs to recognize that shift.
Does yours?
Test conversations that span 8-12 turns and gradually escalate in complexity. Also test handoffs: when the chatbot transfers to a human agent, does the agent see the full conversation?
A customer who repeats their entire issue after a transfer is a CFPB complaint waiting to happen.
Layer 5: Peak-load and degradation testing
Banking support volumes spike predictably. End-of-month payment rushes. Market volatility events.
Tax season alone can overwhelm anything your test environment simulated.
Under load, chatbots degrade. Response times increase. Answers get truncated or less accurate.
NVIDIA's 2026 financial services survey found that 65% of financial companies now actively use AI, up from 45% the prior year. More AI handling more customer interactions means more load-dependent failure modes.
Test with realistic peak scenarios. A chatbot that gives compliant answers at normal load and non-compliant answers under stress is a ticking clock.
Monitoring your support chatbot in production
Testing catches what you predicted. Monitoring catches what you did not.
The daily dashboard
Containment rate: what percentage of support issues get resolved without a human? Banking chatbots handle 60-75% of initial inquiries industry-wide.
If your rate drops suddenly, the chatbot is struggling with something new. If it spikes, the chatbot might be resolving things it should not.
Compliance flags: how often do conversations touch regulated topics? Rates, fees, disputes, eligibility.
Every one of these needs to be accurate. Flag them. Sample them.
Grade them against your current disclosures and policies.
Sentiment trajectory: does customer satisfaction change over the course of a conversation? Someone who starts neutral and ends frustrated is a complaint risk.
Escalation topics: what triggers handoffs? These reveal either training gaps to fix or conversations that genuinely need a human.
Real-time compliance scanning
Build rules that flag conversations the moment they hit high-risk topics.
Any mention of rates or fees: verify the numbers match current disclosures. Any mention of a dispute: verify the chatbot follows Regulation E or Z procedures. Any question about eligibility: verify fair lending compliance.
The FCA has signaled that formal guidance on audit trails and human-in-the-loop requirements is coming in 2026. Build the infrastructure now.
Retrofitting compliance is always more expensive.
Drift detection
Your chatbot was trained on data that becomes stale the moment you deploy it. Interest rates change. Products get updated.
Fee structures shift. Compliance requirements evolve.
Run weekly accuracy checks against your 50 highest-risk support question types. Compare responses to current policies.
Dataiku's analysis found that financial services' biggest AI challenge is closing the gap between testing performance and production performance. Drift monitoring is how you close it.
The Klarna warning
I keep thinking about Klarna.
In 2023, the fintech company replaced roughly 700 customer service agents with an AI chatbot. By early 2024, it handled 75% of customer chats in 35+ languages.
Then they reversed course and started rehiring humans.
The AI produced "lower quality" output. Customers got stuck in loops. The chatbot handled simple questions but failed at nuance, refunds, and the judgment calls that keep customers loyal.
Klarna's mistake was not deploying AI for customer support. It was deploying without building the testing and monitoring infrastructure to catch when quality dropped.
Banking is higher stakes than payment processing. If Klarna could not make it work without proper testing, what makes any bank think they can?
The regulations that already apply
The EU AI Act classifies credit scoring, fraud prevention, and AML systems as high-risk AI. If you serve EU customers, mandatory bias testing, documentation, and human oversight apply to your support chatbot.
This is already law.
In the US, the OCC, Fed, and FDIC require model risk management programs covering documentation, independent validation, and performance monitoring for all AI models.
Your customer support chatbot is an AI model.
The CFPB treats chatbot failures as consumer protection violations. If your chatbot fails to understand a customer and they suffer as a result, the institution is responsible. Not the vendor.
Only 29% of banking customers report being satisfied with chatbot support. That is the lowest satisfaction rating of any digital banking channel.
Frequently asked questions
How often should banks test their customer support chatbots?
Functional regression testing should run continuously. Compliance testing should happen with every model update and whenever regulations change.
Adversarial security testing should happen monthly. NVIDIA's survey found 65% of financial companies actively use AI. The testing cadence needs to match deployment pace.
What happens when a banking chatbot gives a customer wrong information?
Regulators treat it the same as a human agent giving wrong information.
The CFPB can take enforcement action for unfair, deceptive, or abusive acts. A chatbot that misquotes an APR, misstates a fee, or fails to recognize a dispute creates immediate regulatory exposure.
The institution is liable. Not the vendor.
Do regulators specifically address AI customer support in banking?
Yes. The CFPB published a dedicated report on chatbots in consumer finance. The OCC requires model risk management for AI.
The FCA is releasing guidance on audit trails and human oversight in 2026. Regulatory attention is increasing.
What is the biggest compliance risk with banking support chatbots?
Inaccurate disclosures. A chatbot that states the wrong rate, misrepresents a fee, or gives wrong information about dispute rights triggers immediate regulatory exposure.
The CFPB documented that chatbots regularly fail to recognize when customers invoke federal dispute rights. This turns a formal investigation into a lost complaint.
Is adversarial testing really necessary?
All 24 banking chatbots tested by TELUS Digital's Milton Leal were exploitable. Including models built on OpenAI, Anthropic, and Google.
Extracted information included proprietary eligibility criteria that fraud rings could weaponize. If you skip adversarial testing, you are betting your chatbot is more secure than every model Leal tested.
How do I justify testing costs to leadership?
Two numbers. First: 80% of customers who use bank chatbots leave more frustrated. That is a retention problem.
Second: a single CFPB enforcement action costs more than years of testing. Show them Klarna, where cutting support costs with AI led to quality problems that forced them to rehire the humans they let go.
Your support chatbot is a compliance system on day one
89% of financial companies report that AI increases revenue. Nearly 100% plan to maintain or increase AI spending.
But deploying a customer support chatbot without proper testing is not saving money. It is accumulating regulatory debt.
Build the five-layer testing framework. Staff it with compliance experts, not just engineers. Monitor daily.
Assume your chatbot is exploitable until proven otherwise. And ask the question Klarna had to answer too late: when your support AI fails, will you know before your customers do?

A researcher tested 24 banking customer support chatbots and all were exploitable. Here's how to test and monitor your bank's AI support agent