
AI • Engineering • Strategy

AI Agent Architecture: What Non-Technical Founders Need to Know Before Building

AI agents are everywhere, but most startups get the architecture wrong. A practical guide to agent orchestration patterns, reliability challenges, build vs buy decisions, and what your technical team should be telling you.

Mike Tempest • 11 min read

Every startup founder has heard about AI agents by now. They automate customer support, handle data entry, book meetings, process refunds. The promise is compelling: software that acts autonomously, making decisions and taking actions without constant human oversight.

The problem is that most founders -- and many technical teams -- do not understand how agent architecture actually works. They treat agents like chatbots with extra steps, or they overcomplicate simple problems with multi-agent orchestration frameworks they do not need.

Both approaches waste time and money. One leaves you with brittle systems that break in production. The other burns through runway building infrastructure that already exists.

This guide explains what you actually need to know about AI agent architecture as a non-technical founder. Not theory. Not hype. Just the practical decisions that determine whether your agent systems work reliably or fail expensively.

What AI Agents Actually Are (vs Chatbots)

The distinction matters more than most founders realise.

A chatbot generates text. An AI agent takes actions. That is the fundamental difference, and it changes everything about how you build, test, and deploy these systems.

When ChatGPT answers a question, it produces text. If the answer is wrong, the consequence is limited -- a user gets bad information, they ask again or move on. When an AI agent processes a refund, books a meeting, or updates a customer record, a wrong action has real consequences. Money moves. Calendars change. Data gets overwritten.

This is why agent architecture is fundamentally different from chatbot architecture. Agents need:

Tool access and execution

Agents do not just generate text. They call APIs, query databases, send emails, update CRMs. They need controlled access to tools, with guardrails to prevent destructive actions. A chatbot can hallucinate harmlessly. An agent that hallucinates while connected to your payment system is a disaster.

Multi-step reasoning and planning

Agents break complex tasks into steps: understand the request, gather information, decide on an action, execute it, verify success. Each step can fail. Your architecture needs to handle partial completion, retries, and rollback when things go wrong.
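The step loop above can be sketched in outline. This is a toy illustration with hypothetical step functions and simple retry handling, not a production implementation -- real systems add rollback and escalation on top:

```python
# Toy sketch of an agent step loop: each step can fail, so we retry a
# bounded number of times and report partial completion rather than
# crashing. All names are illustrative.

def run_agent_task(steps, max_retries=2):
    """Run (name, step) pairs in order; retry each up to max_retries.

    Returns partial-completion info on failure so the caller can decide
    whether to roll back or escalate to a human.
    """
    completed = []
    for name, step in steps:
        for attempt in range(max_retries + 1):
            try:
                step()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    return {"status": "failed", "failed_step": name,
                            "completed": completed}
    return {"status": "success", "completed": completed}
```

The point of the sketch is the return value: an agent task is not simply done or not done -- it can be partially complete, and the architecture has to represent that.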

State management across conversations

A chatbot can be stateless -- each message is independent. Agents maintain context across multiple interactions, remember what they have already done, and pick up where they left off. This requires persistent state storage and context management that chatbots do not need.

Audit trails and explainability

When an agent takes an action, you need to know what it did, why it did it, and how to reverse it if necessary. This is critical for debugging, compliance, and user trust. Your architecture must log every decision and action in a queryable format.

The implication for founders: if your technical team is treating agent development like chatbot development, you have a problem. The infrastructure requirements are different, the testing approach is different, and the failure modes are different. Make sure your team understands this distinction before they start building. It matters even more if you have no technical co-founder -- see our guide on how to build an AI product without a technical co-founder.

Three Agent Architecture Patterns

Each pattern solves different problems. Most startups pick the wrong one.

There are three common patterns for agent architecture. Your choice depends on task complexity, reliability requirements, and how much control you need over agent behaviour.

Pattern 1: Single-Agent (Simple but Fragile)

One LLM handles everything. You give it tools, a prompt, and let it decide what to do. This is the simplest architecture and works well for straightforward tasks with clear success criteria.

When it works: Customer support queries, data lookup tasks, simple automations where the scope is narrow and consequences of failure are low.

When it breaks: Complex multi-step workflows, tasks requiring domain expertise, situations where one mistake early in the process cascades into bigger failures. Single agents struggle with ambiguity and tend to hallucinate when outside their training distribution.

Cost: Low upfront, but failure costs can be high if the agent makes wrong decisions. Token costs scale linearly with usage.

Pattern 2: Multi-Agent Orchestration (Specialised Agents with Router)

Multiple specialised agents, each handling a specific domain, coordinated by a router agent that directs tasks to the right specialist. Think of it as a team of experts rather than one generalist.

When it works: Complex domains where different tasks need different expertise. For example, a customer service system might have separate agents for billing, technical support, and account management, with a router deciding which agent handles each query.

When it breaks: Small-scale problems where the coordination overhead exceeds the benefit. If you only have three types of tasks and they rarely overlap, multi-agent orchestration adds complexity without value. Also fails when tasks do not decompose cleanly into separate domains.

Cost: Higher token usage (router plus specialist agents) and increased infrastructure complexity. Only worth it when specialisation genuinely improves accuracy or reliability.
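The router pattern can be sketched minimally like this -- a keyword stub stands in for what would normally be an LLM intent classifier, and the specialist handlers are placeholders (all names are illustrative):

```python
# Toy sketch of multi-agent routing: a router classifies each query's
# intent and dispatches it to a specialist handler. In production the
# classifier would be an LLM call; here it is a keyword stub.

SPECIALISTS = {
    "billing": lambda q: f"[billing agent] handling: {q}",
    "technical": lambda q: f"[technical agent] handling: {q}",
    "account": lambda q: f"[account agent] handling: {q}",
}

def classify_intent(query: str) -> str:
    """Stand-in for an LLM intent classifier."""
    q = query.lower()
    if any(word in q for word in ("invoice", "refund", "charge")):
        return "billing"
    if any(word in q for word in ("error", "bug", "crash")):
        return "technical"
    return "account"

def route(query: str) -> str:
    return SPECIALISTS[classify_intent(query)](query)
```

Notice the overhead the article warns about: every query now pays for a routing decision before any specialist does useful work, which is why this pattern only earns its keep when specialisation genuinely improves accuracy.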

Pattern 3: Human-in-the-Loop Hybrid (Best for Regulated Sectors)

Agents draft actions but humans approve before execution. The agent does the cognitive work -- gathering information, analysing options, proposing solutions -- but a human makes the final decision.

When it works: Regulated industries (fintech, healthtech, legaltech) where autonomous actions carry compliance risk. High-stakes decisions where the cost of error is unacceptable. Early-stage systems where you are still learning what can go wrong.

When it breaks: High-volume, low-stakes tasks where human approval creates a bottleneck. If you need to approve 1,000 actions per day, human-in-the-loop does not scale. Also inappropriate when the value proposition is full automation -- users expect instant action, not "we will review and get back to you."

Cost: Lower risk but higher operational overhead. You still need humans in the loop, so you are not eliminating labour costs, just augmenting human productivity. Best for compliance-heavy environments where the cost of failure exceeds the cost of human oversight.
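A toy sketch of the approval workflow at the heart of this pattern: the agent proposes actions, which queue as pending, and only a human approval triggers execution. The class and status names are invented for illustration:

```python
# Toy sketch of human-in-the-loop: agents propose, humans approve,
# and only approved actions execute. Names are illustrative.

from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    execute: callable
    status: str = "pending"  # pending -> approved -> executed, or rejected

class ApprovalQueue:
    def __init__(self):
        self.actions = []

    def propose(self, description, execute):
        action = ProposedAction(description, execute)
        self.actions.append(action)
        return action

    def approve(self, action):
        action.status = "approved"
        result = action.execute()  # side effects happen only here
        action.status = "executed"
        return result

    def reject(self, action):
        action.status = "rejected"
```

The design choice worth noting: the side-effecting callable is stored but never invoked until approval, so the agent's cognitive work and the consequential action are structurally separated.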

Most founders pick multi-agent orchestration because it sounds sophisticated.

Start with the simplest architecture that solves your problem. A single-agent system with good prompt engineering often outperforms a poorly designed multi-agent system. Add complexity only when the pain of not having it is clear and measurable. You can always evolve from simple to complex. Going the other direction is much harder. This mirrors the broader principle of choosing the right tech stack -- start simple, add complexity when proven necessary.

The Reliability Problem: LLMs Are Probabilistic

95% accuracy sounds impressive until you realise it means 1 in 20 actions fail.

LLMs are not deterministic systems. Given the same input, they can produce different outputs. This is a fundamental property of how they work, not a bug to be fixed. It has profound implications for agent reliability.

Traditional software is predictable. If a function works once, it works every time (assuming the same inputs and environment). You write unit tests, integration tests, and end-to-end tests. If tests pass, you deploy with confidence.

AI agents do not work this way. An agent might handle a refund request correctly 95 times out of 100, then fail on the 96th for no obvious reason. The prompt was the same, the context was similar, but the output was wrong. This is not a failure of engineering -- it is the nature of probabilistic systems.

What this means for your architecture:

You need evaluation frameworks, not just unit tests

Unit tests check that code works. Evaluation frameworks measure agent performance across many examples. You need test sets with hundreds or thousands of real-world scenarios, and you need to measure success rate, failure modes, and edge case handling. This is closer to machine learning evaluation than traditional software testing.
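A minimal sketch of what an evaluation harness measures -- success rate plus a breakdown of failure modes over a test set. The agent and grading functions here are stand-ins for real components; a real harness would run hundreds or thousands of cases:

```python
# Toy sketch of an evaluation harness: run the agent over a test set,
# grade each output, and report success rate and failure-mode counts.
# The grading function returns None on success or a failure-mode label.

from collections import Counter

def evaluate(agent, test_cases):
    """test_cases: list of (task_input, grade) pairs."""
    failures = Counter()
    for task_input, grade in test_cases:
        output = agent(task_input)
        mode = grade(output)
        if mode is not None:
            failures[mode] += 1
    total = len(test_cases)
    success_rate = (total - sum(failures.values())) / total
    return {"success_rate": success_rate, "failure_modes": dict(failures)}
```

This is the shape of the data your team should be able to show you on demand: not "it works", but a number and a ranked list of what breaks.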

Failures will happen -- design for graceful degradation

When an agent fails, what happens? Does the system crash? Does it retry with a different approach? Does it escalate to a human? Your architecture must handle failures explicitly. Fallback mechanisms, retry logic, and human escalation paths are not optional -- they are core infrastructure requirements.
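The fallback chain described here can be sketched as an ordered list of handlers with a human escalation path at the end. Handler names are illustrative -- in practice "primary" might be an expensive model and "fallback" a cheaper or rule-based one:

```python
# Toy sketch of graceful degradation: try handlers in order, degrade to
# the next one on failure, and escalate to a human queue rather than
# crashing. All names are illustrative.

def handle_with_fallback(task, handlers, escalate):
    """handlers: ordered list of (name, fn); fn raises on failure.
    escalate: called only if every handler fails."""
    for name, fn in handlers:
        try:
            return {"handled_by": name, "result": fn(task)}
        except Exception:
            continue  # degrade to the next handler in the chain
    return {"handled_by": "human", "result": escalate(task)}
```

The key property: there is no code path where the task simply vanishes. Every failure ends somewhere explicit -- another handler or a human.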

Monitoring is different for agents than for traditional software

Traditional monitoring tracks uptime, error rates, and performance. Agent monitoring tracks success rate (percentage of tasks completed correctly), failure mode distribution (what breaks and why), drift over time (is accuracy degrading?), and edge case discovery (what scenarios have we not seen before?). These metrics require different instrumentation and analysis.

Prompt engineering is infrastructure, not configuration

Changing a prompt can change agent behaviour as much as changing code. Prompts should be version-controlled, tested, and deployed with the same rigour as code changes. Ad-hoc prompt tweaks in production are a recipe for unpredictable behaviour and debugging nightmares.
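One way to make "prompts as versioned infrastructure" concrete: a registry keyed by prompt name and version, with the deployed version pinned explicitly rather than edited in place. In practice these would live in version control and ship through review; the prompt text below is invented for illustration:

```python
# Toy sketch of prompt versioning: every prompt has an explicit version,
# and the deployed version is pinned in one place -- changed via review,
# never tweaked ad hoc in production. Prompt text is illustrative.

PROMPTS = {
    ("refund_triage", "v1"): "Classify this refund request: {request}",
    ("refund_triage", "v2"): ("Classify this refund request and explain "
                              "your reasoning step by step: {request}"),
}

DEPLOYED = {"refund_triage": "v2"}  # the single source of truth

def get_prompt(name: str, request: str) -> str:
    version = DEPLOYED[name]
    return PROMPTS[(name, version)].format(request=request)
```

With this structure, rolling back a bad prompt change is a one-line diff, and every evaluation run can record exactly which prompt version it measured.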

The implication: if your technical team is not building evaluation frameworks, monitoring failure modes, and treating prompts as versioned infrastructure, they are not ready to deploy agents in production. Ask them how they measure reliability. If the answer is vague or focused on uptime rather than success rate, you have work to do.

Build vs Buy for Agent Infrastructure

Do not waste months building orchestration that already exists.

One of the most common mistakes I see: startups building agent orchestration frameworks from scratch. They spend 3-6 months building routing logic, state management, tool execution, and error handling. Then they discover that LangChain, CrewAI, AutoGen, or other frameworks already do this, often better.

The build vs buy decision for agents is straightforward: buy infrastructure, build domain logic.

What you should buy or use as open-source:

Agent orchestration frameworks

LangChain for single agents, CrewAI or AutoGen for multi-agent systems. These handle routing, state management, and tool execution. Do not build this yourself.

Vector databases for retrieval

Pinecone, Weaviate, or Qdrant for semantic search and context retrieval. If your agents need to search documentation or past interactions, use existing vector database infrastructure.

Observability and logging platforms

LangSmith, Weights & Biases, or Helicone for tracking agent performance, token usage, and failure modes. These tools are purpose-built for LLM observability and save months of custom development.

Evaluation frameworks

LangChain evaluation, Braintrust, or Promptfoo for systematic testing of agent performance across test sets. Building your own evaluation infrastructure is expensive and rarely better than what exists.

What you should build:

Domain-specific tools and integrations

Your agents need to interact with your APIs, your database, your CRM. These integrations are specific to your business. Build wrappers that give agents safe, constrained access to your systems.

Business logic and rules

Agents need to understand your business rules, compliance requirements, and edge cases. This knowledge is your competitive advantage. Build prompts, guardrails, and validation logic that encode your domain expertise.

Workflows and escalation paths

How do tasks flow through your system? When does an agent escalate to a human? What happens when something fails? These workflows are specific to your operations. Build them on top of existing frameworks rather than from scratch.

Custom evaluation datasets

Generic evaluation frameworks exist, but your test scenarios are unique. Build a library of edge cases, failure examples, and success criteria drawn from your actual customer interactions.

The principle is simple: invest engineering effort where you have unique requirements or competitive advantage. Everything else, buy or use open-source. Your technical team should be spending 80% of their time on domain logic and 20% on infrastructure integration, not the other way around.

Ask your technical team: "What are we building that we could buy?"

If the answer includes agent orchestration, state management, or logging infrastructure, you are probably wasting runway. These are solved problems. Focus your engineering budget on what makes your product unique, not on reinventing infrastructure that already exists. See build vs buy framework for non-technical founders.

Metrics Founders Should Demand from Their Technical Team

You cannot manage what you do not measure.

Most non-technical founders do not know what metrics to ask for when their team builds AI agents. They hear "the agent is working well" and assume success. Then they discover in production that "working well" meant "works most of the time" and the failure cases are expensive.

Here are the five metrics you should demand, regardless of technical background:

1. Success rate (percentage of tasks completed correctly)

The most important metric. What percentage of agent actions achieve the intended outcome? This should be measured across a representative test set, not cherry-picked examples. Aim for 95%+ for production systems, higher for high-stakes applications. Track this over time -- if it degrades, you have a problem.

2. Failure mode categories (what breaks and why)

Success rate tells you how often things fail. Failure mode analysis tells you why. Common categories: hallucination (agent made up information), tool failure (API error or timeout), context limitation (task required information the agent did not have), ambiguity (task was unclear or underspecified). Understanding failure modes lets you prioritise fixes.

3. Cost per agent action (token usage adds up)

Every agent action consumes tokens -- input tokens for context and output tokens for responses. This cost is variable and scales with usage. Track the average cost per action and multiply by expected volume to understand your cost structure. A customer support agent that costs £0.50 per interaction might be viable at 100 queries per day, uneconomical at 10,000.

4. Latency percentiles (p50, p95, p99 response times)

Median latency (p50) tells you typical performance. p95 and p99 tell you worst-case performance, which matters for user experience. An agent with 2-second median latency but 30-second p99 latency will frustrate users. Track all three percentiles. If p99 is unacceptably high, you need architectural changes or timeout handling.
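The percentile arithmetic is simple to sketch -- here with the nearest-rank method and invented latency samples that show how a single slow outlier dominates p99 while leaving the median untouched:

```python
# Toy sketch of latency percentiles (nearest-rank method).
# The sample latencies are invented to illustrate the p50-vs-p99 gap.

import math

def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_s = [1.8, 2.1, 2.0, 2.4, 1.9, 2.2, 30.5, 2.3, 2.0, 2.1]
p50 = percentile(latencies_s, 50)  # 2.1 -- typical experience looks fine
p99 = percentile(latencies_s, 99)  # 30.5 -- the tail is what users complain about
```

One slow response in ten is invisible in the median and catastrophic in the tail, which is exactly why all three percentiles belong on the dashboard.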

5. Human escalation rate (how often agents need human intervention)

What percentage of tasks require human escalation because the agent could not complete them? This metric reveals the practical automation rate. If you aimed for 80% automation but your escalation rate is 40%, you are only achieving 60% automation. Track escalation reasons -- they point to where your agents need improvement.

Your technical team should be tracking these metrics from day one of agent development, not waiting until production. If they cannot provide this data on demand, they do not have the instrumentation needed to deploy agents reliably.

Scaling and Token Economics

Agent costs scale with usage, unlike traditional SaaS. Model your unit economics differently.

Traditional SaaS has mostly fixed costs. Once you build the software, serving an additional customer costs nearly nothing. Your gross margin improves as you scale.

AI agents work differently. Every action consumes tokens, and tokens cost money. Your marginal cost per customer or per action is non-zero. This changes your unit economics fundamentally.

What this means in practice:

Gross margin is lower than traditional SaaS

If each customer interaction costs £0.20 in token fees and you charge £50 per month for unlimited usage, your margin depends on usage patterns. High-usage customers might cost you more than they pay. You need usage-based pricing, consumption limits, or very high pricing to maintain healthy margins.
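The break-even arithmetic from this example is worth making explicit -- at £50 per month flat pricing and £0.20 per interaction, a customer becomes unprofitable past 250 interactions per month:

```python
# The margin arithmetic from the example above: flat monthly pricing
# against a variable per-interaction token cost. Figures are from the
# worked example, not real pricing.

PRICE_PER_MONTH = 50.00       # flat subscription, GBP
COST_PER_INTERACTION = 0.20   # token cost per interaction, GBP

def monthly_margin(interactions: int) -> float:
    return PRICE_PER_MONTH - COST_PER_INTERACTION * interactions

break_even = PRICE_PER_MONTH / COST_PER_INTERACTION  # 250 interactions/month
```

A light user at 100 interactions leaves £30 of margin; a heavy user at 300 costs you £10 a month. This is the calculation that pushes agent products towards usage-based pricing or consumption caps.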

Costs scale with usage, not just customer count

A customer who uses your agent 10 times per day costs 10x more than a customer who uses it once per day. This makes revenue forecasting harder and creates unit economics challenges. Track not just customer acquisition cost (CAC) and lifetime value (LTV), but cost per action and actions per customer. These metrics determine profitability.

Model choice has massive impact on economics

GPT-4 is more capable but costs significantly more than GPT-3.5. Claude Opus is more accurate but more expensive than Claude Haiku. Your choice of model determines both your quality and your cost structure. For high-volume, low-complexity tasks, a cheaper model might deliver acceptable quality at much lower cost. For high-stakes tasks, a more expensive model might be worth it. Run cost-quality trade-off analysis before committing to a model.

Prompt optimisation is a margin lever

Shorter prompts consume fewer tokens. More efficient prompts that achieve the same result in fewer interactions reduce costs. Caching frequently used context (if your provider supports it) can reduce input token costs substantially. Treat prompt optimisation as engineering work that directly impacts gross margin.

The implication for founders: do not assume AI agent products will have SaaS-like margins. Model your unit economics carefully, including variable token costs, and price accordingly. Usage-based pricing often makes more sense than flat subscriptions for agent-powered products. For a broader view of where AI fits into your product strategy, see AI strategy without a CTO.

Track cost per customer per month, not just customer count

Your CFO should be monitoring token costs as closely as cloud infrastructure costs. If you do not have visibility into per-customer token consumption, you cannot model profitability accurately. Build this instrumentation early.

The Regulatory Angle: Fintech, HealthTech, LegalTech

Regulators require audit trails, explainability, and human override. Design for this from the start.

If you are building AI agents in a regulated industry -- fintech (FCA), healthtech (MHRA), legaltech (SRA) -- you have additional requirements beyond reliability and cost. Regulators demand transparency, auditability, and human accountability.

Three core regulatory requirements for agent systems:

1. Audit trails (who made what decision when)

Every agent action must be logged with sufficient detail to reconstruct what happened, why, and who (or what) was responsible. This includes input context, the decision-making process, actions taken, and outcomes. You need persistent storage, queryable logs, and retention policies that match regulatory requirements (often 5-7 years).

Architecture requirement: Structured logging of all agent decisions and actions, stored in a tamper-evident format. Consider write-once storage or blockchain-based audit logs for high-compliance environments.
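A minimal sketch of the tamper-evident idea: a hash chain in which each log entry commits to the previous one, so any edit to history breaks verification. This is illustrative only -- a real system would add persistent write-once storage and proper key management:

```python
# Toy sketch of a tamper-evident audit trail: each entry stores a hash
# over (previous hash + record), so altering any historical record
# breaks the chain on verification. Names are illustrative.

import hashlib
import json

def append_entry(log, record):
    """record: dict describing actor, action, context, and outcome."""
    prev_hash = log[-1]["hash"] if log else "genesis"
    body = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    log.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})
    return log

def verify_chain(log):
    prev_hash = "genesis"
    for entry in log:
        body = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

The property regulators care about falls out directly: you cannot quietly rewrite what an agent did, because every later entry's hash depends on it.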

2. Explainability (why did the agent do this)

Regulators do not accept "the AI decided" as an explanation. You need to explain why the agent took a specific action in terms a non-technical regulator can understand. This is harder than it sounds -- LLMs are black boxes, and extracting interpretable reasoning requires architectural design.

Architecture requirement: Chain-of-thought prompting (ask the agent to explain its reasoning), decision trees or rule-based fallbacks for high-stakes actions, and human-readable summaries of agent logic. Do not rely on post-hoc explanations -- build explainability into the agent's workflow.

3. Human override (ability to stop and reverse agent actions)

Regulators require that humans can intervene, override, or reverse agent decisions. Fully autonomous systems with no human control are often non-compliant. You need mechanisms for human review, approval workflows for high-risk actions, and rollback capabilities when agents make mistakes.

Architecture requirement: Approval workflows for high-stakes actions, manual override mechanisms in the agent interface, and rollback or reversal processes for completed actions. This often means implementing the human-in-the-loop hybrid pattern discussed earlier.

These requirements are not optional extras you add later. They are foundational architectural decisions. Trying to bolt on audit trails, explainability, and human override after building a fully autonomous agent is expensive and often impossible.

If you are in a regulated industry, start with the assumption that every agent action will be audited and challenged. Design accordingly. The human-in-the-loop hybrid pattern is often the safest approach until you have proven reliability and regulatory acceptance.

Work with your legal and compliance team from the start

Do not wait until technical architecture is locked in to involve legal and compliance. They need to review your audit trail design, explainability mechanisms, and human override processes before you deploy. Discovering compliance issues after launch is vastly more expensive than designing for compliance upfront. See technical leadership in regulated startups.

The Bottom Line

AI agents are not magic. They are probabilistic systems that require different architecture, testing, and operational practices than traditional software. Most startups underestimate this difference and either build fragile systems or waste months on infrastructure that already exists.

The key principles:

  • Understand that agents take actions, not just generate text -- this changes everything
  • Start with the simplest architecture that solves your problem
  • LLMs are probabilistic -- design for failure, not perfection
  • Buy infrastructure (orchestration, evaluation, observability), build domain logic
  • Track success rate, failure modes, cost per action, latency, and escalation rate
  • Model unit economics differently -- token costs scale with usage
  • If you are in a regulated industry, design for audit trails, explainability, and human override from day one

The goal is not to build the most sophisticated agent architecture. The goal is to build a system that reliably delivers value to customers at acceptable cost and risk. Everything else is secondary.

Ask your technical team the right questions. Demand the right metrics. Make sure they are not reinventing infrastructure that already exists. That is how you build agent systems that actually work.

Building AI agents and need technical guidance?

I work with non-technical founders as a Fractional CPTO, helping you make the right architectural decisions for AI agent systems. Start with a free strategy day to review your approach, evaluate build vs buy decisions, and plan your technical roadmap.

Frequently Asked Questions

What is the difference between an AI chatbot and an AI agent?

A chatbot generates text responses to user input. An AI agent takes autonomous actions -- booking meetings, updating databases, sending emails, processing refunds. Chatbots are conversational interfaces. Agents are decision-making systems that act on your behalf. The distinction matters because agents need reliability guarantees, audit trails, and failure handling that chatbots do not.

Should we build our agent infrastructure or buy existing tools?

Buy infrastructure, build domain logic. Tools like LangChain, CrewAI, and AutoGen handle orchestration, routing, and multi-agent coordination. You should not spend months building what already exists. Your competitive advantage is in the domain-specific logic -- understanding your industry, your customer workflows, and your business rules. Build that. Buy everything else.

How reliable do AI agents need to be for production use?

It depends on the consequences of failure. A customer support agent that occasionally gets something wrong might be acceptable with human oversight. A compliance agent that files regulatory submissions cannot fail. You need to define your acceptable error rate based on business impact, then design evaluation frameworks and fallback mechanisms to hit that target. 95% accuracy sounds good until you realise it means 1 in 20 actions fail.

What metrics should we track for AI agent performance?

Five core metrics: success rate (percentage of tasks completed correctly), failure mode categories (what breaks and why), cost per agent action (token usage adds up), latency percentiles (p50, p95, p99 response times), and human escalation rate (how often agents need human intervention). These tell you whether your agents are reliable, cost-effective, and improving over time.

Do AI agents work in regulated industries like fintech or healthtech?

Yes, but with specific requirements. Regulators like the FCA, MHRA, and SRA demand audit trails (who made what decision when), explainability (why did the agent do this), and human override (ability to stop and reverse agent actions). You need to architect these capabilities from the start. Trying to bolt them on later is expensive and often impossible. If you are in a regulated sector, assume every agent action will be audited and design accordingly.

Mike Tempest

Fractional CPTO

Mike works with funded startups as a Fractional CPTO, helping non-technical founders make the right technical decisions for AI products. He has built and advised on agent systems across fintech, healthtech, and SaaS, with particular focus on reliability, compliance, and unit economics.

Learn more about Mike