95% accuracy sounds impressive until you realise it means 1 in every 20 actions fails.
LLMs are not deterministic systems. Given the same input, they can produce different outputs. This is a fundamental property of how they work, not a bug to be fixed. It has profound implications for agent reliability.
Traditional software is predictable. If a function works once, it works every time (assuming the same inputs and environment). You write unit tests, integration tests, and end-to-end tests. If tests pass, you deploy with confidence.
AI agents do not work this way. An agent might handle a refund request correctly 95 times out of 100, then fail on the 96th for no obvious reason. The prompt was the same, the context was similar, but the output was wrong. This is not a failure of engineering -- it is the nature of probabilistic systems.
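The arithmetic gets worse for multi-step tasks. A quick sketch makes it concrete (assuming, for illustration, that steps fail independently, which real agents only approximate):

```python
# How per-action reliability compounds across a multi-step task.
# Assumption for illustration: each step succeeds independently.
per_step = 0.95
for steps in (1, 5, 10, 20):
    print(f"{steps} steps: {per_step ** steps:.3f} end-to-end success rate")
```

At 95% per step, a 10-step task succeeds end to end only about 60% of the time, which is why per-action accuracy alone tells you little about whole-task reliability.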
What this means for your architecture:
You need evaluation frameworks, not just unit tests
Unit tests check that code works. Evaluation frameworks measure agent performance across many examples. You need test sets with hundreds or thousands of real-world scenarios, and you need to measure success rate, failure modes, and edge case handling. This is closer to machine learning evaluation than traditional software testing.
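A minimal evaluation harness can be sketched as follows. The `run_agent` callable, the scenario format, and the `tag`/`check` fields are hypothetical stand-ins for your own agent and test set; the point is that you measure success rate and bucket failures across many scenarios rather than asserting a single expected output:

```python
from collections import Counter

def evaluate(run_agent, scenarios):
    """Run the agent over a scenario set; report success rate and failure modes."""
    failures = Counter()
    passed = 0
    for scenario in scenarios:
        output = run_agent(scenario["input"])
        if scenario["check"](output):        # per-scenario success criterion
            passed += 1
        else:
            failures[scenario["tag"]] += 1   # bucket failures by scenario type
    return {
        "success_rate": passed / len(scenarios),
        "failure_modes": dict(failures),
    }

# Toy usage: a fake "agent" (str.upper) against two hand-written scenarios.
scenarios = [
    {"input": "refund", "check": lambda o: o == "REFUND", "tag": "refund"},
    {"input": "cancel", "check": lambda o: o == "wrong", "tag": "cancel"},
]
report = evaluate(str.upper, scenarios)
```

In practice the scenario set would hold hundreds or thousands of real-world cases, and the report would feed a dashboard rather than a variable.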
Failures will happen -- design for graceful degradation
When an agent fails, what happens? Does the system crash? Does it retry with a different approach? Does it escalate to a human? Your architecture must handle failures explicitly. Fallback mechanisms, retry logic, and human escalation paths are not optional -- they are core infrastructure requirements.
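The retry-then-fallback-then-escalate chain above can be sketched in a few lines. Everything here is a hypothetical hook (`run_agent`, `fallback`, `escalate_to_human`); the shape of the control flow is what matters:

```python
def handle(task, run_agent, fallback, escalate_to_human, max_retries=2):
    """Try the agent, retry on failure, fall back, then escalate to a human."""
    for _attempt in range(max_retries + 1):
        try:
            result = run_agent(task)
            if result is not None:           # treat None as a soft failure
                return result
        except Exception:
            pass                             # swallow the error and retry
    result = fallback(task)                  # cheaper or simpler strategy
    if result is not None:
        return result
    return escalate_to_human(task)           # last resort: human in the loop

# Toy usage: an agent that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky_agent(task):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "processed: " + task

result = handle("refund request", flaky_agent,
                fallback=lambda t: None,
                escalate_to_human=lambda t: "escalated: " + t)
```

A production version would add backoff, structured logging of each attempt, and distinct handling for retryable versus non-retryable errors, but the escalation path itself should be this explicit.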
Monitoring is different for agents than for traditional software
Traditional monitoring tracks uptime, error rates, and performance. Agent monitoring tracks success rate (percentage of tasks completed correctly), failure mode distribution (what breaks and why), drift over time (is accuracy degrading?), and edge case discovery (what scenarios have we not seen before?). These metrics require different instrumentation and analysis.
Prompt engineering is infrastructure, not configuration
Changing a prompt can change agent behaviour as much as changing code. Prompts should be version-controlled, tested, and deployed with the same rigour as code changes. Ad-hoc prompt tweaks in production are a recipe for unpredictable behaviour and debugging nightmares.
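One way to enforce this, sketched below: keep prompts in version control and pin each one by content hash, so production fails loudly if a prompt drifts from its reviewed version. The prompt text, names, and 12-character fingerprint length are all illustrative:

```python
import hashlib

# Prompts live alongside code in version control, keyed by a versioned name.
PROMPTS = {
    "refund_v2": "You are a support agent. Decide whether this refund request is valid.",
}

def prompt_fingerprint(name):
    """Short content hash used to pin a prompt to its reviewed text."""
    return hashlib.sha256(PROMPTS[name].encode("utf-8")).hexdigest()[:12]

# Pinned at review/deploy time; in practice this would live in config, not here.
PINNED = {"refund_v2": prompt_fingerprint("refund_v2")}

def get_prompt(name):
    """Refuse to serve a prompt whose text no longer matches the pinned hash."""
    if prompt_fingerprint(name) != PINNED[name]:
        raise RuntimeError(f"prompt {name!r} drifted from its pinned version")
    return PROMPTS[name]
```

The same discipline extends naturally: run the evaluation suite against any prompt change before updating the pin, exactly as you would run tests before merging code.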
The implication: if your technical team is not building evaluation frameworks, monitoring failure modes, and treating prompts as versioned infrastructure, they are not ready to deploy agents in production. Ask them how they measure reliability. If the answer is vague or focused on uptime rather than success rate, you have work to do.