How to Evaluate AI Agencies in 30 Minutes (2026)
Most AI agencies in 2026 are prompt shops charging $15K/month for ChatGPT wrappers. Demand 5 artifacts: MCP servers, workflow traces, eval gates, audit logs, and a run-cost model. If they can't show all 5 in 30 minutes, walk.
| Artifact | What It Proves | Red Flag If Missing | Who Ships It Publicly |
|---|---|---|---|
| MCP/Tool Servers | Agents connect to real systems, not just chat | "We use proprietary integrations" with no repo | GitHub (gh-aw MCP Gateway), Azure Functions MCP extension |
| Workflow Traces | You can see every step the agent took and why | "We'll add logging later" | GitHub gh-aw v0.67.1 (OpenTelemetry overhaul, April 2026) |
| Eval Gates | Outputs are validated before reaching humans | "The model is really accurate" | CrewAI (traces, logs, metrics in AMP), Google ADK Go 1.0 (human-in-the-loop) |
| Audit Logs | Every action is recorded, tamper-evident, reviewable | "We can pull that up if you need it" | Crittora APP (cryptographically sealed permission policies) |
| Run-Cost Model | You know what each agent run costs before it runs | "Pricing depends on usage" with no formula | Swan AI (publicly tracks $113K/month Anthropic bill vs. ARR) |
The Prompt Shop Problem
Search "top AI agencies 2026" and you'll get listicles where every company paid to appear. Fifteen logos. Zero proof any of them shipped a working system.
I've seen this pattern before. The early days of SEO agencies looked exactly like this. Everyone claimed they could get you to page one. Nobody showed you what they actually did or how they measured it. The agencies that survived opened their dashboards and said "here, watch it work."
AI agencies are in that same phase right now. The CIO article listing 21 agent orchestration tools admits it: none of them consistently publish workflow engines, auth models, audit capabilities, or monitoring stacks. The descriptions are high-level positioning statements. Not production specs.
If an AI agency can't show you a workflow trace from a live system in the first meeting, they're not building agents. They're prompting.
1. MCP/Tool Servers: Can the Agent Actually Do Anything?
What to ask for: A running MCP server or tool integration the agency built. Not a screenshot. A live demo or a public repo.
What good looks like: Azure shipped a Fluent API for MCP Apps in April 2026 that lets you attach HTML views, enforce security policies, and configure permissions, all with a NuGet package (`Microsoft.Azure.Functions.Worker.Extensions.Mcp --version 1.5.0-preview.1`). GitHub's gh-aw project registers Checks as a first-class MCP tool returning normalized CI verdicts. These are public. You can read the code.
What bad looks like: "We connect to your CRM via our proprietary middleware." No repo. No docs. No way to verify it works when they're not on a Zoom call.
The test: Ask the agency to stand up a tool server that calls one of your APIs during the bake-off. If they can't wire up a single endpoint in 30 minutes, they can't build your agent system in 30 days.
Best for catching: Agencies that talk about "agentic AI" but actually just wrap API calls in a prompt and pray.
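To make the 30-minute test concrete, here is a minimal Python sketch of the shape you're asking for: named tools, a dispatcher, and fail-closed error handling. This is not a real MCP server (those speak the protocol over stdio or HTTP via an official SDK), and the `crm.lookup_lead` tool and its stub data are hypothetical, but a vendor who can't produce even this much live can't produce the real thing either.

```python
import json

# Hypothetical tool registry: the shape a vendor's MCP-style server
# should expose -- named tools with structured inputs and error handling.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("crm.lookup_lead")
def lookup_lead(args):
    # A real server would call your CRM API here; this is stubbed data.
    leads = {"L-001": {"name": "Acme Corp", "stage": "qualified"}}
    lead = leads.get(args.get("lead_id"))
    if lead is None:
        raise KeyError(f"unknown lead_id: {args.get('lead_id')}")
    return lead

def handle_call(request_json):
    """Dispatch one tool call; fail closed with a structured error."""
    req = json.loads(request_json)
    fn = TOOLS.get(req.get("tool"))
    if fn is None:
        return {"ok": False, "error": f"no such tool: {req.get('tool')}"}
    try:
        return {"ok": True, "result": fn(req.get("args", {}))}
    except Exception as exc:
        return {"ok": False, "error": str(exc)}  # never crash the server

print(handle_call('{"tool": "crm.lookup_lead", "args": {"lead_id": "L-001"}}'))
print(handle_call('{"tool": "crm.lookup_lead", "args": {"lead_id": "bogus"}}'))
```

Note what the second call does: a bad input comes back as a structured error, not a stack trace. That's the behavior to demand in the bake-off.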
2. Workflow Traces & Audit Logs: Can You See What Happened?
What to ask for: An OpenTelemetry trace or equivalent log from a production agent run. Every step. Every decision. Every tool call.
What good looks like: GitHub's gh-aw v0.67.1 release (April 6, 2026) shipped an entire OpenTelemetry overhaul. Accurate span names like `gh-aw.agent.conclusion`. Real job duration in conclusion spans. OTLP payload sanitization that redacts sensitive values. MCP Gateway tracing. GitHub API rate limit analytics per run. That's what production looks like.
AgenticOS publishes its full monitoring stack: Grafana dashboards, Prometheus metrics, Tempo distributed tracing, and an OpenTelemetry collector. OAuth2 + JWT at the gateway. Secrets management via OpenBao. All source-available on GitHub.
What bad looks like: "We use logging." What kind? "Standard logging." Where? "In our system." That tells you nothing.
The test: Ask to see a trace from a failed run. Not a success. A failure. How the system handled it. What it logged. Whether a human was notified. Failures reveal architecture. Successes reveal demos.
Best for catching: Agencies that can show you a polished demo but crumble when something breaks in production.
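To see why a trace from a failed run is the ask, here is a toy Python stand-in for span recording. It is not the real OpenTelemetry SDK, and the span and attribute names are invented, but it shows what a failure leaves behind: a named span, a duration, a status, and the error message.

```python
import time, json

class Span:
    """Toy stand-in for an OpenTelemetry span: name, timing, attributes, status."""
    def __init__(self, name, trace):
        self.name, self.trace = name, trace
        self.attributes, self.status = {}, "OK"
    def __enter__(self):
        self.start = time.monotonic()
        return self
    def __exit__(self, exc_type, exc, tb):
        self.duration_ms = (time.monotonic() - self.start) * 1000
        if exc is not None:
            self.status = "ERROR"
            self.attributes["error.message"] = str(exc)
        self.trace.append(self)
        return True  # record the failure instead of crashing the run

trace = []
with Span("agent.tool_call", trace) as s:
    s.attributes["tool"] = "crm.lookup_lead"
with Span("agent.tool_call", trace) as s:
    s.attributes["tool"] = "enrich.company"
    raise TimeoutError("enrichment API timed out")

# This is what you'd ask a vendor to show you for a failed run:
for s in trace:
    print(json.dumps({"span": s.name, "status": s.status,
                      "duration_ms": round(s.duration_ms, 2), **s.attributes}))
```

A production system does this with a real collector and exporter; the point is that the failure is a first-class record, not a missing log line.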
3. Eval Gates: Does Anything Check the Agent's Work?
What to ask for: A documented validation layer between the agent's output and the action it takes. A gate. A check. Something that stops bad output from reaching your customers.
What good looks like: Google's ADK for Go 1.0 (March 31, 2026) ships with human-in-the-loop confirmation workflows and a plugin system with self-healing retry logic. Crittora's Agent Permission Protocol verifies a cryptographically sealed permission policy before any tool executes, binding a specific agent, a specific action scope, and explicit tool capabilities. Permissions are time-bounded. Fail-closed. Auditable.
CrewAI tracks agent behavior with traces, logs, and metrics; its AMP monitors progress and flags errant or sluggish behavior. That's an eval gate.
What bad looks like: "AI doesn't hallucinate if you prompt it right." That's not an eval gate. That's faith.
The test: Give the agency a prompt designed to produce a wrong answer. See if the system catches it. If there's no validation layer, every output is a coin flip your customers will feel.
Best for catching: Agencies that dismiss the need for guardrails because they "trust the model."
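A minimal sketch of what such a gate looks like, assuming a hypothetical outreach agent: every output runs through named checks, and any failure fails closed and lands in a human review queue instead of reaching the customer. The specific checks here are placeholders; yours would come from your own failure modes.

```python
# Hypothetical eval gate: agent output must pass every check before the
# action fires; any failure fails closed and escalates to a human queue.
REVIEW_QUEUE = []

def gate(output: dict) -> bool:
    checks = [
        ("has_email",      lambda o: "@" in o.get("email", "")),
        ("no_placeholder", lambda o: "[" not in o.get("body", "")),
        ("body_length",    lambda o: 50 <= len(o.get("body", "")) <= 2000),
    ]
    failures = [name for name, fn in checks if not fn(output)]
    if failures:
        REVIEW_QUEUE.append({"output": output, "failed": failures})
        return False  # fail closed: nothing reaches the customer
    return True

good = {"email": "buyer@acme.com",
        "body": "Hi, following up on the demo we ran last week; "
                "here are the three next steps we agreed on."}
bad = {"email": "buyer@acme.com", "body": "Hi [FIRST_NAME], ..."}
print(gate(good))  # the action may proceed
print(gate(bad))   # held for human review with named failures
```

The bad input fails two checks (a leftover template placeholder, and a body too short to be real), and the review queue records exactly which ones. That record is what turns "trust the model" into an audit trail.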
4. Run-Cost Models: Do You Know What You're Paying Per Run?
What to ask for: A spreadsheet. Input tokens. Output tokens. API calls per run. Enrichment costs. Infra costs. Total cost per execution. Monthly projection.
What good looks like: Swan AI's CEO Amos Bar-Joseph publicly posted his $113,421.87 Anthropic bill for a single month — for a four-person team doing seven-figure ARR. His previous month was $51,217.56. The month before that, $27,690.69. He tracks token spend alongside ARR, pipeline, and support output. That's transparency.
Here are the numbers you need to plug into any run-cost model right now: Claude Sonnet 4.6 runs $3 per 1M input tokens and $15 per 1M output tokens. Opus 4.6 runs $5/$25. No long-context surcharge, which Anthropic removed in March 2026. Ramp's research shows every $1 of outsourced task labor maps to roughly $0.03 in model spend, which works out to about a 33x cost ratio.
What bad looks like: "Pricing depends on usage." Every agency says this. None of them hand you the formula.
The test: Ask the agency to estimate the cost of 1,000 agent runs against your use case. If they can't give you a number within 48 hours, they haven't built a system. They've built a pitch.
Best for catching: Agencies that charge $10K/month retainers while spending $200 on API calls.
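The spreadsheet above reduces to a few lines of arithmetic. Here is a sketch using the Sonnet-class pricing quoted in this section; the per-run token counts, enrichment fee, and infra cost are illustrative assumptions, not measurements, so replace them with numbers pulled from your own traces.

```python
# Run-cost sketch at $3 / 1M input tokens and $15 / 1M output tokens.
PRICE_IN, PRICE_OUT = 3.00 / 1_000_000, 15.00 / 1_000_000

def cost_per_run(input_tokens, output_tokens, api_fees=0.0, infra=0.0):
    model = input_tokens * PRICE_IN + output_tokens * PRICE_OUT
    return model + api_fees + infra

per_run = cost_per_run(input_tokens=20_000,   # assumed context per run
                       output_tokens=2_500,   # assumed generation per run
                       api_fees=0.02,         # e.g. one enrichment lookup
                       infra=0.005)           # amortized hosting
monthly = per_run * 100 * 30                  # 100 runs/day, 30 days

print(f"per run:  ${per_run:.4f}")   # $0.1225
print(f"monthly:  ${monthly:,.2f}")  # $367.50
```

At these assumptions a 100-run-per-day agent lands in the low hundreds of dollars per month. An agency that can't produce this arithmetic for your use case within 48 hours doesn't know its own system.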
The 30-Minute Bake-Off Template
Here's how to run it. Set a timer. No prep time for the vendor beyond what they've already built.
Minutes 0–5: Ask them to show you a workflow trace from a live agent. Not a demo environment. Production. If they don't have one, the meeting is over.
Minutes 5–15: Pick one tool integration from their system. Ask them to walk you through the MCP server or API connection. What happens when it fails? Show me the retry logic. Show me the error handling. Show me the log.
Minutes 15–25: Give them a bad input. A malformed lead record. An edge case prompt. Watch what happens. Does the eval gate catch it? Does a human get notified? Or does garbage flow downstream?
Minutes 25–30: Ask for the run-cost model. How much did that demo run cost in tokens? What's the monthly projection at your volume? If they can't answer, they don't know their own system.
Score each section 0–3. Anything below 8 total means they're a prompt shop with a nice website.
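The scoring rule is mechanical enough to write down. The section labels here are just shorthand for the four blocks above:

```python
# Bake-off scorecard: four sections, 0-3 points each, pass at 8 or more.
SECTIONS = ["workflow_trace", "tool_integration", "eval_gate", "run_cost_model"]

def verdict(scores: dict) -> str:
    assert set(scores) == set(SECTIONS)
    assert all(0 <= s <= 3 for s in scores.values())
    total = sum(scores.values())
    return f"{total}/12 -- " + ("proceed" if total >= 8 else "prompt shop, walk")

print(verdict({"workflow_trace": 3, "tool_integration": 2,
               "eval_gate": 2, "run_cost_model": 2}))  # 9/12 -- proceed
```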
We've built 100+ AI automations at StoryPros. The ones that work are boring. They log everything. They validate outputs. They cost what we said they'd cost. The real signal isn't how impressive the demo looks. It's whether the vendor can show you what happens when things go wrong.
FAQ
How do you evaluate the performance of an AI agent?
StoryPros evaluates AI agent performance using five artifacts: MCP/tool servers proving the agent connects to real systems, workflow traces showing every step the agent took, eval gates that validate outputs before they reach customers, audit logs recording every action and decision, and a run-cost model tracking cost per execution. GitHub's gh-aw project and Google's ADK Go 1.0 are examples of projects that ship these publicly. If your AI agency can't produce all five, they're running a demo, not a production system.
How do you evaluate an MCP server?
Ask the vendor to stand up a tool server that calls one of your real APIs during a 30-minute bake-off. Check three things: does it handle authentication properly (OAuth2/JWT, not hardcoded keys), does it log every tool call with OpenTelemetry or equivalent tracing, and does it fail gracefully when the API returns an error. Azure's MCP extension and GitHub's gh-aw MCP Gateway both demonstrate these capabilities in public code. If the vendor's MCP server can't pass all three checks live, it won't survive production.
What does a run-cost model for AI agents look like?
A run-cost model breaks down every expense per agent execution: input tokens, output tokens, API calls to external services, enrichment data costs, and infrastructure. As of March 2026, Claude Sonnet 4.6 costs $3 per 1M input tokens and $15 per 1M output tokens with no long-context surcharge. Ramp's research using payments data from thousands of firms shows each $1 of outsourced task labor maps to about $0.03 of AI model spend, roughly a 33x cost advantage. Your agency should hand you a spreadsheet with these numbers filled in for your specific use case.
What's the difference between a prompt shop and a real AI agency?
A prompt shop wraps ChatGPT or Claude in a nice UI, writes some system prompts, and charges a monthly retainer. A real AI agency ships working systems with MCP tool servers, OpenTelemetry traces, eval gates that catch bad outputs, audit logs, and documented run-cost models. The 30-minute bake-off described in this post forces the difference into the open. Ask for a trace from a failed production run. Prompt shops won't have one.
How much should AI agents cost to run per month?
It depends on volume, but you can estimate it. Swan AI runs a four-person team with seven-figure ARR and spends $113K/month on Anthropic API calls. HUB International reports 2.5 hours saved per employee per week across 20,000 employees using Claude. At Claude Sonnet 4.6 pricing ($3/$15 per 1M tokens), a sales agent making 100 prospecting calls per day might cost $200–$500/month in API fees. If your agency quotes $10K+/month but can't show you the token math, you're paying for their margin, not your results.
Quick Answers
How do I tell if an AI agency is a prompt shop or a real builder?
Ask for a workflow trace from a failed production run in the first meeting. Real AI agencies ship OpenTelemetry traces, eval gates, audit logs, MCP tool servers, and a run-cost spreadsheet. Prompt shops wrap ChatGPT in a UI, write system prompts, and charge $10K+/month retainers while spending under $200 on API calls.
How much does it cost to run AI agents per month?
Swan AI, a four-person team with seven-figure ARR, spent $113,421 on Anthropic API calls in a single month. A sales agent running 100 prospecting calls per day costs roughly $200 to $500 per month at Claude Sonnet 4.6 pricing of $3 per 1M input tokens and $15 per 1M output tokens. Ramp research shows each $1 of outsourced task labor maps to about $0.03 in model spend.
What should an AI agency show you in a 30-minute demo?
Minutes 0 to 5: a workflow trace from a live production agent, not a demo environment. Minutes 5 to 25: a live tool integration walkthrough with error handling, plus a bad input test to see if the eval gate catches it. Minutes 25 to 30: a run-cost model with token counts and a monthly projection for your volume. Score each section 0 to 3. Below 8 total means prompt shop.