How to Vet Any AI Agency Selling Computer-Use Automation (2026 Checklist)
Most AI agencies will pitch computer-use clickbots with admin-level access to your systems. IBM's 2025 Cost of a Data Breach Report found that 97% of organizations reporting an AI-related breach lacked proper AI access controls, and that heavy use of unsanctioned "shadow AI" added about $670K to the average breach. Demand a permission matrix, replayable traces, and a live kill-switch demo before signing.
The 10-Point Checklist to Vet Any AI Agency Selling Computer-Use Automation
We Already Made This Mistake Once
In 2024, a large bank launched 200 UiPath RPA bots. One ERP vendor pushed a routine UI refresh. 180 bots broke overnight.
Two weeks of engineering triage later, they were back online — until the next portal change.
That's the dirty secret of RPA. The industry still claims 30% annual growth, but anyone who has run Selenium-style automation in production for 18+ months knows the failure mode: quarterly maintenance often exceeds the original build cost.
Now the same cycle is starting again with a shinier label. On May 19, Anthropic launched MCP Tunnels and Self-Hosted Sandboxes for Claude Managed Agents. The same week, Microsoft moved computer-use in Copilot Studio to GA across every commercial Power Platform geography.
Every AI agency on LinkedIn is about to pitch you "computer-use automation." Most of them will hand you a bot that screenshots your CRM, clicks through the UI, and calls it AI. That's not an integration. That's a very expensive macro.
Here's how to spot it before you sign.
Steps 1–4: Lock Down Permissions First
1. Least privilege access. Ask the vendor: "What specific permissions does your agent need, and why?" If the answer is "admin access to your CRM," walk away. The ICML 2026 TRAP benchmark tested six frontier models and found agents are susceptible to prompt injection in 25% of tasks on average — 43% for DeepSeek-R1. An over-privileged agent that gets injected doesn't just fail. It fails with the keys to everything.
Evidence to demand: A written permission matrix showing every system the agent touches, what access level it needs, and what it can't do.
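A permission matrix is easier to enforce when it also exists as data the agent runtime checks. The sketch below is one hypothetical way to encode it: the system names, actions, and the `is_allowed` helper are illustrative assumptions, not any vendor's actual API. The point is the deny-by-default rule, where anything not listed is refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    system: str         # system the agent touches
    actions: frozenset  # actions explicitly permitted on that system
    reason: str         # why the agent needs them (the "and why?" answer)


# Hypothetical matrix: one row per system, nothing implied beyond these rows.
PERMISSION_MATRIX = [
    Grant("crm", frozenset({"read"}), "look up account status"),
    Grant("billing", frozenset({"read", "create_draft_invoice"}),
          "prepare invoices for human review"),
]


def is_allowed(system: str, action: str, matrix=PERMISSION_MATRIX) -> bool:
    """Deny by default: only explicitly listed (system, action) pairs pass."""
    return any(g.system == system and action in g.actions for g in matrix)
```

With this matrix, `is_allowed("crm", "read")` passes, while `is_allowed("crm", "delete")` and any call against an unlisted system are refused.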
2. Credential isolation. The agent's credentials should never be the same ones your team uses. Anthropic's MCP Tunnels architecture gets this right: a lightweight gateway makes a single outbound encrypted connection, and credentials never cross the public internet. Ask your vendor if their agent uses shared credentials or dedicated service accounts.
Red flag: "We'll just use your Salesforce login."
3. Scoped API tokens over UI sessions. If the vendor's agent logs into your tools through a browser like a human, it's a clickbot. Real integrations use scoped API tokens with defined read/write permissions. Anthropic's Claude Compliance API, announced alongside 28 security integrations (CrowdStrike, Okta, Palo Alto Networks, Datadog, and others), gives programmatic access to conversation content and activity logs. That's what real auditability looks like.
Ask: "Does your agent authenticate via API or browser session?"
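To make "scoped token" concrete, here is a minimal sketch of the check a well-behaved integration performs before each call. The `ScopedToken` type, the scope strings such as `crm:contacts:read`, and the `authorize` function are assumptions made up for illustration; real platforms define their own scope names and token formats.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScopedToken:
    subject: str          # dedicated service account, not a human login
    scopes: frozenset     # e.g. {"crm:contacts:read"}
    expires_at: datetime  # short-lived tokens limit the blast radius


def authorize(token: ScopedToken, required_scope: str, now=None) -> bool:
    """Allow the call only if the token is unexpired and carries the scope."""
    now = now or datetime.now(timezone.utc)
    if now >= token.expires_at:
        return False
    return required_scope in token.scopes
```

A token holding only `crm:contacts:read` cannot write or delete, no matter what the agent is tricked into attempting. A browser session has no equivalent boundary: it can do whatever the logged-in user can do.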
4. Network boundary controls. Where does the agent's compute actually run? Anthropic's Self-Hosted Sandboxes (now in public beta) split the architecture: orchestration stays on Anthropic's infra, tool execution runs on yours. Files, credentials, and environment variables never leave your boundary.
Evidence to demand: An architecture diagram showing data flow boundaries. If they can't produce one, they haven't thought about it.
Steps 5–7: Make Every Action Traceable
5. Replayable traces. Every action the agent takes should be logged in a format you can replay and audit. Not "we have logs." Actual step-by-step traces showing what the agent saw, what it decided, and what it did. Anthropic's Managed Agents now surface session events in the Claude Console so developers can trace what an agent learned and where it came from.
Ask: "Can you show me a trace replay of a completed task from a current client?" If they can't demo this in the first meeting, find a different vendor. That's a StoryPros rule: if your AI vendor can't show you a working demo in week 1, move on.
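For a sense of what "replayable" means in practice, here is a rough sketch of a trace stored as one JSON object per line. The `TraceStep` fields and the `replay` helper are hypothetical and not the format of any particular product. What matters is that each step records what the agent saw, decided, and did, and that a missing step is detected rather than silently skipped.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TraceStep:
    step: int
    observation: str  # what the agent saw
    decision: str     # what it decided
    action: str       # what it did


def dump_trace(steps) -> str:
    """Serialize steps as JSON Lines, one step per line."""
    return "\n".join(json.dumps(asdict(s)) for s in steps)


def replay(jsonl: str):
    """Rebuild the steps in order; raise if the sequence has a gap."""
    steps = [TraceStep(**json.loads(line))
             for line in jsonl.splitlines() if line.strip()]
    steps.sort(key=lambda s: s.step)
    for expected, s in enumerate(steps, start=1):
        if s.step != expected:
            raise ValueError(f"trace gap: expected step {expected}, got {s.step}")
    return steps
```

If a vendor's "logs" cannot be loaded back into an ordered, gap-free sequence like this, they are logs, not traces.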
6. Immutable audit logs. Logs that can be edited aren't logs. They're fiction. Anthropic's memory system tracks all changes with a detailed audit log: which agent, which session, what changed. You can roll back to an earlier version or redact content from history. Your vendor's system should offer the same change tracking, with one condition: any rollback or redaction must itself appear as a logged event, never as a silent edit.
Contract clause to add: "Vendor shall maintain immutable, timestamped audit logs of all agent actions, accessible to Client within 24 hours of request."
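One common way to make a log tamper-evident is hash chaining, where each record commits to the one before it. The sketch below is a generic illustration, not a description of how any named vendor stores logs, and the `append` and `verify` names are my own. Chaining makes edits detectable rather than impossible, so it is usually paired with write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, entry: dict) -> str:
    """Hash the entry together with the previous record's hash."""
    payload = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append(log: list, entry: dict) -> None:
    """Add a record that commits to everything logged before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "prev": prev, "hash": _digest(prev, entry)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or removed record breaks it."""
    prev = GENESIS
    for rec in log:
        if rec["prev"] != prev or rec["hash"] != _digest(prev, rec["entry"]):
            return False
        prev = rec["hash"]
    return True
```

Changing a single field in an old record makes `verify` return `False`, which is the property to ask a vendor to demonstrate.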
7. Integration with your existing monitoring. The agent's activity should flow into whatever SIEM or monitoring stack you already run. Anthropic just connected Claude to 28 security platforms: Datadog, CrowdStrike, Zscaler, Netskope, Wiz, and more. Your vendor should pipe agent activity into your existing dashboards without a custom project.
Red flag: "We have our own monitoring dashboard." Translation: your security team will never look at it.
Steps 8–10: Build the Kill Switch Before You Need It
8. Idempotency. If the agent runs a task twice, does it create duplicate records? Send duplicate emails? Charge a customer twice? The Frontiers in Computer Science survey on agentic AI privacy failures found that chained reasoning across tools causes unintended billing actions — duplicate charges, incorrect invoice data — through tool-logic failures.
Test to run: Trigger the same workflow three times in a row. Check for duplicates in your CRM or billing system. If you find any, the agent isn't production-ready.
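The usual defense is an idempotency key: derive a stable key from the request, and return the stored result if that key has been seen before. The following is a minimal sketch with an in-memory dictionary standing in for a durable store; `InvoiceProcessor` and its fields are hypothetical names for illustration only.

```python
import hashlib
import json


class InvoiceProcessor:
    def __init__(self):
        self._results = {}  # in-memory stand-in for a durable key store
        self.charges = []   # side effects that actually happened

    @staticmethod
    def idempotency_key(workflow: str, payload: dict) -> str:
        """Same workflow and payload always produce the same key."""
        raw = json.dumps({"workflow": workflow, "payload": payload},
                         sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def charge(self, customer_id: str, invoice_id: str, amount_cents: int):
        key = self.idempotency_key(
            "charge", {"customer": customer_id, "invoice": invoice_id})
        if key in self._results:  # retry: return the earlier receipt
            return self._results[key]
        receipt = {"invoice": invoice_id, "amount_cents": amount_cents,
                   "charge_no": len(self.charges) + 1}
        self.charges.append(receipt)
        self._results[key] = receipt
        return receipt
```

Calling `charge` three times for the same customer and invoice leaves exactly one entry in `charges`, which is the outcome the three-run test should produce in a real system.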
9. Human-in-the-loop approvals for high-impact actions. The MDPI simulation study on tool-using agent security found that human approvals "sharply reduce high-impact actions and exports." But approvals degrade under habituation — your team will start rubber-stamping after a few weeks. The fix: tier your approvals. Low-risk actions (reading data, generating reports) run automatically. High-risk actions (sending emails, modifying records, processing payments) require a human click.
Ask: "Which actions require human approval, and can I configure the thresholds?"
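Tiered approvals can be sketched as a small gate in front of every action. The tier assignments below mirror the examples in step 9, but the action names, the `run_action` function, and the `approve` callback are assumptions for illustration. Unknown actions are treated as high-risk so that new capabilities do not bypass review by default.

```python
from enum import Enum


class Risk(Enum):
    LOW = 1
    HIGH = 2


# Hypothetical tiering; a real deployment would make this configurable.
ACTION_RISK = {
    "read_data": Risk.LOW,
    "generate_report": Risk.LOW,
    "send_email": Risk.HIGH,
    "modify_record": Risk.HIGH,
    "process_payment": Risk.HIGH,
}


def run_action(action: str, execute, approve) -> str:
    """Run low-risk actions directly; high-risk ones need a human yes."""
    risk = ACTION_RISK.get(action, Risk.HIGH)  # unknown actions count as high
    if risk is Risk.HIGH and not approve(action):
        return "blocked"
    execute()
    return "executed"
```

Keeping the high-risk tier small is what fights habituation: reviewers see few prompts, and each one matters.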
10. Emergency kill switch. Can you shut the agent down in 30 seconds? Not "submit a ticket." Not "contact support." A button, a command, an API call that stops everything immediately. If the agent is clicking through your billing system at 3 AM and something breaks, you need to pull the plug before it processes 500 bad invoices.
Evidence to demand: A live demo of the kill switch. Time it.
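At its simplest, a kill switch is a flag the agent loop checks before every action. The sketch below uses Python's `threading.Event` so the flag can be tripped from another thread, such as an API handler or a console button. `KillSwitch` and `run_queue` are hypothetical names, and a production version would also need to revoke credentials and halt any action already in flight.

```python
import threading


class KillSwitch:
    """Thread-safe stop flag that an operator can trip at any time."""

    def __init__(self):
        self._stop = threading.Event()

    def trip(self) -> None:
        self._stop.set()

    def tripped(self) -> bool:
        return self._stop.is_set()


def run_queue(actions, switch: KillSwitch) -> list:
    """Execute actions in order, checking the switch before each one."""
    done = []
    for act in actions:
        if switch.tripped():
            break  # stop immediately; remaining actions never run
        done.append(act())
    return done
```

Because the check happens before each action, the worst case after tripping the switch is one action in progress, not 500 queued invoices.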
The 60-Second Test That Exposes a Clickbot
Here's the fastest way to know what you're actually buying.
Ask the agency to show you how the agent authenticates. A real API integration makes direct calls: scoped tokens, defined permissions, sub-second response times. A clickbot renders a browser, screenshots the page, processes the image, decides what to click, then clicks. That's 3-5 seconds per action versus 200-400ms for an API call.
Microsoft's own Copilot Studio data shows CUA-style agents cutting legacy ERP data entry from 4 minutes to 35 seconds per item. Impressive — but a direct API integration does the same thing in under a second.
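The per-action figures above compound quickly over a batch. As a back-of-the-envelope check, the snippet below multiplies them out for a hypothetical job of 500 sequential actions. The batch size and the assumption of strictly sequential execution are mine, not measurements.

```python
def batch_seconds(n_actions: int, low_ms: int, high_ms: int):
    """Total time range in seconds for n sequential actions."""
    return n_actions * low_ms / 1000, n_actions * high_ms / 1000


# Per-action latencies from the article: 3-5 s (clickbot), 200-400 ms (API).
clickbot = batch_seconds(500, 3000, 5000)  # (1500.0, 2500.0) seconds
api = batch_seconds(500, 200, 400)         # (100.0, 200.0) seconds
```

Under those assumptions the clickbot needs roughly 25 to 42 minutes for work an API integration finishes in about 2 to 3 minutes.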
The right tool depends on the job. Government portal with no API? Computer-use is the only option. Salesforce? HubSpot? Any modern tool with an API? If someone pitches you a clickbot for that, they're either lazy or they don't know what they're doing.
Most AI agency failures come down to the same problem: they start with technology instead of strategy. They connect APIs — or worse, screenshot UIs — without ever asking what the buyer needs or how the workflow actually works. The AI is the delivery mechanism. The strategy is the product.
At StoryPros, we build AI agents that use APIs when they exist and computer-use only when they don't. We don't sell you the shiny thing. We sell you the thing that works.
FAQ
How do I secure AI agents that use computer-use automation?
Start with least-privilege permissions: give the agent only the access it needs for specific tasks, never admin access. Use dedicated service accounts instead of shared credentials. IBM's 2025 Cost of a Data Breach Report found that 97% of organizations reporting an AI-related breach lacked proper AI access controls, and that heavy use of unsanctioned "shadow AI" added about $670,000 to the average breach. StoryPros recommends a tiered approval system where high-risk actions (payments, record deletion, outbound emails) require human sign-off.
How do I tell the difference between a clickbot and a real AI integration?
Ask the vendor how their agent authenticates. If it uses scoped API tokens, it's a real integration. If it logs in through a browser session like a human user, it's a clickbot. Check latency too: API calls return in 200-400ms, while computer-use agents take 3-5 seconds per action because they screenshot and process each screen. Both have valid use cases, but a vendor selling clickbots for API-enabled tools like Salesforce or HubSpot is selling you the wrong thing.
What is idempotency and why does it matter for AI automation?
Idempotency means running the same operation multiple times produces the same result: no duplicate records, no double charges, no repeated emails. It matters because AI agents retry failed tasks. If a network timeout causes an agent to re-run a billing workflow and process the same invoice twice, that's real money lost. Test for this by triggering the same workflow three times in a row and checking for duplicates.
What are replayable traces in AI agent auditing?
Replayable traces are step-by-step logs showing exactly what an AI agent saw, decided, and did during a task. You can replay them in sequence to reconstruct the agent's full decision path. Anthropic's Claude Managed Agents surface session events in the Claude Console for exactly this purpose. When vetting an AI agency, ask for a trace replay demo from a real project — if they can't show one, their system likely doesn't support proper auditing.
Which is better for automation — API integrations or computer-use agents?
Neither is universally better. API integrations are faster (sub-second vs. 3-5 seconds per action), more reliable, and easier to audit. Computer-use agents are the right choice when no API exists: legacy ERPs, government portals, proprietary systems with no way in. Microsoft's Copilot Studio data shows computer-use agents cutting government portal filings from 90 minutes to 8 minutes. That's a valid use case. The mistake is using computer-use when a clean API is sitting right there.