How to Build an AI Customer Support Quality System (2026 Guide)
Verizon replaced 13,000 support workers with AI and now customers can't get correct answers or reach a human. Fix: build 50 golden tickets, set a 4.0 CSAT floor, cap bot turns at 3, and run regression tests every Monday before you ship anything.
Verizon's AI Support Is Failing. Here's How to Not Be Verizon.
The Verizon Warning
A Verizon customer asked why her iPad couldn't connect to the internet. The AI told her it was because her service address was still her old address. She pointed out that a Verizon service technician on the same block had the same problem. The bot responded: "You know what, that is an excellent catch, and you are 100% right."
That's not a hallucination. That's a system with no eval harness.
Verizon CEO Dan Schulman told Bloomberg Tech in June 2026 that AI will replace "a large percentage" of customer service. The company uses Anthropic's Mythos and Google's Gemini. They laid off 13,000 workers. They're cutting $9 billion in costs.
And customers can't reach a human.
Reddit user "Hot_Saguaro" caught the bot giving ChatGPT-style responses. Former employees say Verizon's biggest selling point — no offshore call centers — is gone. PhoneArena ran the headline: "Verizon wants to do more of what customers hate."
This is what happens when you ship AI support without a quality system. Here's how to build one.
Step 1: Build Your Golden Ticket Set
A golden ticket is a known question-answer pair that your AI must get right every time. Think of it like a unit test for customer support.
Start with 50. Pull them from your actual ticket history. Pick the 20 most common questions, 15 edge cases that trip up new hires, 10 policy-sensitive scenarios (refunds, cancellations, billing disputes), and 5 adversarial prompts (customers who are angry, confused, or testing the system).
Each golden ticket has four parts:
- Customer input (exact phrasing, including typos and slang)
- Expected answer (the correct response, verified by your best support rep)
- Pass/fail criteria (what makes the answer right — specific facts, tone, actions taken)
- Escalation flag (should this go to a human? yes or no)
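One way to hold those four parts in code is a small record plus a checker. This is a minimal sketch, not a full grader: the field names, the `passes` helper, and the sample ticket are all made up for illustration, and the substring check only covers "specific facts". Tone and actions taken would still need a human or a rubric-based grader.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GoldenTicket:
    """One known question-answer pair. All field names are illustrative."""
    ticket_id: str
    customer_input: str                     # exact phrasing, typos and slang included
    expected_answer: str                    # verified by your best support rep
    required_facts: Tuple[str, ...] = ()    # pass/fail: phrases the reply must contain
    forbidden_claims: Tuple[str, ...] = ()  # pass/fail: phrases the reply must not contain
    should_escalate: bool = False           # escalation flag


def passes(ticket: GoldenTicket, reply: str, escalated: bool) -> bool:
    """Crude check: required phrases present, forbidden ones absent, escalation matches."""
    text = reply.lower()
    has_facts = all(fact.lower() in text for fact in ticket.required_facts)
    no_forbidden = not any(claim.lower() in text for claim in ticket.forbidden_claims)
    return has_facts and no_forbidden and escalated == ticket.should_escalate


# Hypothetical ticket modeled on the iPad story; the wording is invented.
ipad_ticket = GoldenTicket(
    ticket_id="GT-001",
    customer_input="why cant my ipad connect to the internet??",
    expected_answer="There is a known outage in your area and we are working on it.",
    required_facts=("outage",),
    forbidden_claims=("service address",),
)

print(passes(ipad_ticket, "We have a known outage near you.", escalated=False))
print(passes(ipad_ticket, "Your service address is still your old address.", escalated=False))
```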
Verizon's iPad connectivity answer would've failed a golden ticket instantly. The expected answer would reference a known outage or tower issue. The bot blamed it on an address change. A 30-second test would catch that.
Run every golden ticket before you go live. Run them again every time you update your prompt, your knowledge base, or your model.
Step 2: Set Containment-vs-CSAT Thresholds
Containment rate is the percentage of conversations your AI resolves without a human. It's the number every CFO wants high.
CSAT is the number every customer wants high.
These two metrics fight each other. Push containment too high and you get Verizon: customers trapped in loops with no way out. Push CSAT too high by routing everything to humans and you've built an expensive phone tree.
Here are the thresholds I'd set for a v1 launch:
- Containment target: 60-70%. Starlink's Grok Voice hit 70% auto-closure, but that's a simpler product with fewer edge cases. Start conservative.
- CSAT floor: 4.0 out of 5.0. If your weekly CSAT drops below 4.0, pause the AI on the ticket categories dragging it down.
- Escalation ceiling: No more than 3 bot turns before offering a human. Verizon's customers report being stuck in loops. Three turns. Then offer the exit.
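Those three thresholds are easy to turn into checks you can run on a week of conversations. The sketch below assumes each conversation is a dict with `category`, `escalated`, and `csat` keys. That shape is an assumption for illustration, not a format any particular helpdesk exports.

```python
from collections import defaultdict

CONTAINMENT_TARGET = (0.60, 0.70)  # v1 target band
CSAT_FLOOR = 4.0                   # weekly floor, out of 5.0
MAX_BOT_TURNS = 3                  # offer a human once the bot has taken this many turns


def containment_rate(conversations):
    """Share of conversations resolved without a human."""
    if not conversations:
        return 0.0
    contained = sum(1 for c in conversations if not c["escalated"])
    return contained / len(conversations)


def categories_to_pause(conversations):
    """Ticket categories whose mean CSAT for the week is below the floor."""
    scores = defaultdict(list)
    for c in conversations:
        if c.get("csat") is not None:  # skip conversations with no survey response
            scores[c["category"]].append(c["csat"])
    return sorted(cat for cat, s in scores.items() if sum(s) / len(s) < CSAT_FLOOR)


def must_offer_human(bot_turns):
    """True once the bot has used up its turns."""
    return bot_turns >= MAX_BOT_TURNS


# Invented sample week.
week = [
    {"category": "billing", "escalated": False, "csat": 3.0},
    {"category": "billing", "escalated": True, "csat": 4.0},
    {"category": "shipping", "escalated": False, "csat": 5.0},
    {"category": "shipping", "escalated": False, "csat": 4.0},
    {"category": "shipping", "escalated": True, "csat": None},
]

print(containment_rate(week))     # 3 of 5 conversations stayed with the bot
print(categories_to_pause(week))  # billing averages 3.5, below the floor
```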
Openreach in the UK went from an NPS below zero to a 4.7 Trustpilot rating across 300,000 reviews after building proactive AI with NiCE Cognigy. They also cut missed appointments by a third and reduced inbound contact volume by 33%. That's what happens when containment and satisfaction are balanced, not when one is sacrificed for the other.
Track both numbers weekly. Plot them on the same chart. If they diverge, something broke.
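"Diverge" needs a definition before you can alert on it. One possible reading, and it is only an assumption, is that containment rises while CSAT falls for two week-over-week steps in a row. A sketch with invented numbers:

```python
def diverging(history, min_steps=2):
    """True if containment rose and CSAT fell on each of the last `min_steps` weekly steps."""
    if len(history) < min_steps + 1:
        return False  # not enough weeks to judge
    recent = history[-(min_steps + 1):]
    steps = zip(recent, recent[1:])
    return all(
        later["containment"] > earlier["containment"] and later["csat"] < earlier["csat"]
        for earlier, later in steps
    )


# Invented weekly log: containment climbing while satisfaction slides.
history = [
    {"week": "W1", "containment": 0.60, "csat": 4.4},
    {"week": "W2", "containment": 0.66, "csat": 4.2},
    {"week": "W3", "containment": 0.72, "csat": 3.9},
]

print(diverging(history))
```

Two steps is an arbitrary choice. A single bad week can be noise, so pick whatever window matches your ticket volume.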
Step 3: Write Explicit Escalation Rules
"Complex queries will be routed to AI-assisted human employees," Schulman said at Bloomberg Tech. Verizon customers say they're still mostly dealing with robots.
That gap between policy and execution is where trust dies. You need written escalation rules that your AI follows without exception.
Here's a starter set:
1. Billing disputes over $50: Human. Always.
2. Account cancellation requests: Human. Always.
3. Customer uses profanity twice: Human. Always.
4. AI confidence below 70% on the answer: Human.
5. Customer explicitly asks for a human: Human. Immediately. No "let me try to help first."
6. Same question asked three different ways: Human. The customer isn't getting what they need.
7. Any action that changes the customer's plan or charges: Human confirmation required.
Build these as hard-coded rules in your agent's logic, not as suggestions in the prompt. Prompts get ignored. Logic gates don't.
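Here is roughly what "logic gates" could look like. It is a sketch under assumptions: the `TurnState` fields are hypothetical, and it takes for granted that something upstream (an intent classifier, a profanity filter, a confidence score) has already filled them in. The explicit request for a human is checked first so nothing else can delay it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnState:
    """Snapshot of the conversation at the current turn. Field names are illustrative."""
    intent: str = "other"                 # e.g. "billing_dispute", "cancellation"
    dispute_amount: float = 0.0           # dollars
    profanity_count: int = 0
    ai_confidence: float = 1.0            # 0.0 to 1.0
    asked_for_human: bool = False
    phrasings_of_same_question: int = 1   # how many ways the customer has asked it
    changes_plan_or_charges: bool = False


def escalation_reason(s: TurnState) -> Optional[str]:
    """Return why this turn must go to a human, or None if the bot may continue."""
    if s.asked_for_human:
        return "customer asked for a human"
    if s.intent == "billing_dispute" and s.dispute_amount > 50:
        return "billing dispute over $50"
    if s.intent == "cancellation":
        return "cancellation request"
    if s.profanity_count >= 2:
        return "profanity twice"
    if s.ai_confidence < 0.70:
        return "confidence below 70%"
    if s.phrasings_of_same_question >= 3:
        return "same question asked three ways"
    if s.changes_plan_or_charges:
        return "plan or charge change needs human confirmation"
    return None


print(escalation_reason(TurnState(intent="billing_dispute", dispute_amount=120.0)))
print(escalation_reason(TurnState()))
```

Run this check in your orchestration code before the model's reply is sent, so the model never gets a vote on whether a rule applies.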
Quant and IBM's Ava agent at Fortitude Re resolves 84% of calls and raised first-call resolution from 71% to 86%. That system was built with explicit workflow boundaries covering policies, claims, payments, and documentation. The AI knows what it can and can't do.
Step 4: Run Weekly Regression Tests
Models change. Anthropic and OpenAI push updates without telling you. Your knowledge base gets edited by someone on the team. A new product launches and nobody updates the FAQ.
Any of these can break your AI overnight.
Weekly regression testing catches it. Here's the cadence:
Every Monday morning:
- Run all 50 golden tickets against your live agent.
- Score each one: pass, partial, fail.
- Log the results in a spreadsheet or dashboard. Date-stamped.
- Compare to last week. Any ticket that passed last week and failed this week gets flagged immediately.
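The week-over-week comparison is a few lines of code once results are stored as a mapping from ticket ID to score. This sketch assumes that shape, and it also flags a drop from pass to partial, which is a stricter reading than "failed". Loosen it if that is too noisy for you.

```python
from datetime import date


def pass_rate(results):
    """Fraction of tickets scored 'pass'."""
    if not results:
        return 0.0
    return sum(1 for score in results.values() if score == "pass") / len(results)


def find_regressions(last_week, this_week):
    """Ticket IDs that passed last week but did not pass this week."""
    return sorted(
        ticket_id
        for ticket_id, score in this_week.items()
        if last_week.get(ticket_id) == "pass" and score != "pass"
    )


def weekly_report(run_date, last_week, this_week):
    """Date-stamped row you could append to a spreadsheet or dashboard."""
    return {
        "date": run_date.isoformat(),
        "pass_rate": pass_rate(this_week),
        "regressions": find_regressions(last_week, this_week),
    }


# Invented results for three tickets.
last_week = {"GT-001": "pass", "GT-002": "pass", "GT-003": "partial"}
this_week = {"GT-001": "pass", "GT-002": "fail", "GT-003": "pass"}

print(weekly_report(date(2026, 1, 5), last_week, this_week))
```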
What to test beyond golden tickets:
- Retrieval accuracy: Ask 10 questions where the answer lives in your knowledge base. Did the AI pull the right doc? Did it cite it correctly?
- Policy compliance: Ask 5 questions about refunds, cancellations, or sensitive topics. Did the AI follow your rules?
- Tone consistency: Run 5 angry-customer scenarios. Did the AI stay calm? Did it apologize without admitting fault?
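The retrieval check is the easiest of the three to automate. A sketch, assuming your agent exposes some function that returns ranked document IDs for a question: `fake_retrieve` below is a stand-in for that, and the doc IDs are invented. Policy and tone checks usually need a rubric or a human reviewer, so they are left out here.

```python
def retrieval_accuracy(cases, retrieve, top_k=3):
    """Return (hit rate, missed questions) for (question, expected_doc_id) pairs."""
    if not cases:
        return 0.0, []
    hits = 0
    misses = []
    for question, expected_doc in cases:
        if expected_doc in retrieve(question)[:top_k]:
            hits += 1
        else:
            misses.append(question)
    return hits / len(cases), misses


# Stand-in retriever and index, purely illustrative.
FAKE_INDEX = {
    "how do i reset my router": ["kb-router-reset", "kb-outages"],
    "what is your refund window": ["kb-shipping"],
}


def fake_retrieve(question):
    return FAKE_INDEX.get(question.lower(), [])


cases = [
    ("How do I reset my router", "kb-router-reset"),
    ("What is your refund window", "kb-refunds"),
]

accuracy, misses = retrieval_accuracy(cases, fake_retrieve)
print(accuracy, misses)
```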
The OlaBench research out of ACL 2026 found that even GPT-5.2 and Gemini 3 Pro fall short on real customer service benchmarks, scoring 70.58 and 70.84 respectively against OlaMind's 83.64. Models aren't good enough on their own. Testing is how you close the gap.
Skip regression tests and you'll find out about failures from your customers. That's the Verizon path.
Step 5: Build the Rollback Plan Before You Need It
Verizon has no public rollback plan. Customers are angry. Reddit threads are piling up. The press is writing headlines like "Verizon wants to do more of what customers hate."
Before you launch AI support, write down exactly what happens when things go wrong.
Your rollback checklist:
- Trigger: CSAT drops below 3.5 for two consecutive days, or golden ticket pass rate drops below 80%.
- Action: Route all tickets in the failing category back to humans within 1 hour.
- Communication: Pre-written message to affected customers: "We're routing you to a specialist to make sure you get the help you need."
- Post-mortem: Within 48 hours, identify which tickets failed, why, and what changed.
- Re-launch criteria: Golden ticket pass rate back above 90% and two days of shadow-mode testing before going live again.
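The trigger and re-launch criteria are worth encoding so nobody has to argue about them at 2 a.m. A minimal sketch, assuming you have daily CSAT averages and the latest golden ticket pass rate on hand. The function names and inputs are hypothetical.

```python
CSAT_TRIGGER = 3.5           # roll back if daily CSAT is below this...
BAD_DAYS_REQUIRED = 2        # ...for this many consecutive days
PASS_RATE_TRIGGER = 0.80     # or if golden ticket pass rate falls below this
RELAUNCH_PASS_RATE = 0.90    # pass rate must be back above this to re-launch
SHADOW_DAYS_REQUIRED = 2     # plus this many days of shadow-mode testing


def should_roll_back(daily_csat, golden_pass_rate):
    """daily_csat is a list of daily averages, oldest first."""
    recent = daily_csat[-BAD_DAYS_REQUIRED:]
    csat_breach = len(recent) == BAD_DAYS_REQUIRED and all(d < CSAT_TRIGGER for d in recent)
    return csat_breach or golden_pass_rate < PASS_RATE_TRIGGER


def can_relaunch(golden_pass_rate, shadow_days):
    """Both re-launch criteria must hold."""
    return golden_pass_rate > RELAUNCH_PASS_RATE and shadow_days >= SHADOW_DAYS_REQUIRED


print(should_roll_back([4.1, 3.4, 3.2], golden_pass_rate=0.95))  # two bad days in a row
print(can_relaunch(golden_pass_rate=0.92, shadow_days=2))
```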
This isn't pessimism. It's how you build trust. Cialdini's research on influence shows that trust is the hardest thing to earn and the easiest to destroy. Verizon is destroying it at scale right now.
StoryPros builds AI agents with eval systems baked in from day one. Not as an afterthought. Not after the backlash. The best AI support systems are boring. They just work. And they work because someone built the safety net before the tightrope walk.
FAQ
How do you test the quality of an AI customer support agent?
Build a golden ticket set — 50 known question-answer pairs pulled from real ticket history. Run them against your agent weekly. Score each one pass, partial, or fail. Track results over time. Any regression from the previous week gets investigated immediately. Combine this with retrieval accuracy checks, policy compliance tests, and tone consistency scenarios for full coverage.
How do you make sure AI customer service actually improves CSAT?
Set a CSAT floor (4.0 out of 5.0 is a reasonable v1 target) and measure it weekly by ticket category. When a category drops below the floor, pause AI handling for that category and route to humans. Openreach went from NPS below zero to a 4.7 Trustpilot rating by balancing AI containment with proactive customer communication, not by maximizing automation at the expense of satisfaction.
Has anyone built regression testing for LLM-based chatbots?
Yes. The OlaBench framework from ACL 2026 evaluates AI customer service across retrieval-augmented generation, workflow-based systems, and agentic settings, measuring capability, safety, and latency. In practice, you don't need an academic framework. A spreadsheet with 50 golden tickets, run every Monday and compared week over week, catches most regressions before your customers do.
What's a good containment rate for AI customer service?
60-70% for a v1 launch. Starlink's Grok Voice auto-closes 70% of calls, and Fortitude Re's Ava agent resolves 84%, but both were built with explicit workflow boundaries and escalation rules. Pushing for 90%+ containment without those guardrails is how you end up like Verizon, where customers can't reach a human and your Trustpilot reviews become a PR problem.
How do you set up escalation rules for an AI support agent?
Write them as hard-coded logic, not prompt instructions. Minimum rules: billing disputes over $50 go to a human, cancellation requests go to a human, customers who ask for a human get one immediately, and any conversation where the same question is asked three different ways gets escalated. The AI should know what it can't do and act on that without hesitation.
What containment rate should AI customer service aim for at launch?
60-70% containment is the right target for a v1 AI support launch. Starlink's Grok Voice hits 70% auto-closure and Fortitude Re's Ava agent resolves 84%, but both were built with hard-coded escalation rules. Pushing past 70% without those guardrails traps customers in loops.
How do you stop AI customer service from failing the way Verizon's did?
Build a golden ticket set of 50 known question-answer pairs before you ship anything. Run all 50 every Monday and flag any ticket that passed last week but fails this week. Set a CSAT floor of 4.0 out of 5.0 and pause AI on any ticket category that drops below it.
When should an AI support bot escalate to a human agent?
Escalate immediately for billing disputes over $50, cancellation requests, and any customer who explicitly asks for a human. Also escalate when the same question is asked three different ways, when AI confidence drops below 70%, or after 3 bot turns without resolution. Write these as hard-coded logic, not prompt instructions.