Key Takeaways
- The harness, not the model, decides if agent state is portable, durable, and recoverable.
- Agent memory is runtime state. Treat it as architecture from the start, not a plugin added later.
- The real risk is state lock-in. Switching frameworks should not mean losing the agent’s operating history.
- TiDB Cloud Zero gives you an independent, MySQL-compatible state substrate any harness can connect to. mem9 sits on top as a managed memory API. drive9 covers files and artifacts.
For the first wave of AI applications, the model was the only architectural question that mattered. Which model reasons better. Whose code is cleaner. Whose tool calls are more reliable. Whose context window is largest. Which costs less per task.
That conversation still matters. It is just no longer enough.
The systems attracting serious infrastructure attention now are not just models. They are harnessed systems: Models surrounded by tools, memory, context management, sandboxes, files, permissions, recovery logic, evaluators, and feedback loops. A harness turns a model into a tool-using, long-running agent. Once that happens, the hardest question changes: Where does the agent’s runtime state live, and who controls it?
Two situations from production make the cost of getting this wrong concrete.
You ship a working agent on LangGraph. Three months later, a stronger model lands and your team wants to switch frameworks to take advantage of it. The agent’s memory (conversation history, user preferences, task checkpoints) lives inside LangGraph’s internal state format. Switching means one of two things: Rewrite the memory layer to match the new harness, or start every user back at turn zero.
Or: your coding agent runs a four-hour task and gets interrupted at hour two. On restart, it has no record of what it already tried, which files it modified, or what it decided to skip. It starts over and repeats the same failed paths.
In both cases, the problem is not the model. It is where the agent’s state lived.
What is an AI Agent Harness?
An AI agent harness is the application layer around a model. It decides what the model can see, which tools it can call, what prior work enters context, how tool outputs are represented, where intermediate artifacts go, when memory is retrieved, and how the agent resumes after interruption.
OpenAI describes the Codex harness as the core agent loop and execution logic: The part that coordinates user input, model inference, tool calls, tool outputs, conversation history, and context window management. Anthropic’s work on agent evaluations makes a similar distinction: when you evaluate an agent, you are evaluating the model and the harness together.
That distinction matters:
- A model produces tokens.
- A harness produces behavior.
The model is where reasoning happens. The harness is where reasoning becomes work. Reading files, invoking tools, calling MCP servers, testing outputs, writing summaries, deciding what carries forward. Stronger models do not eliminate the harness. They raise the ceiling on what the harness can coordinate.
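To make that division of labor concrete, here is a minimal sketch of a single harness turn in Python. It is not any particular framework's API; `call_model`, `run_tool`, `summarize`, and the `memory` object are hypothetical stand-ins for whatever your stack provides.

```python
# A minimal harness-loop sketch. call_model(), run_tool(), summarize(), and the
# memory object are hypothetical stand-ins, not any specific framework's API.

def run_turn(task: str, memory, tools: list) -> str:
    # Before inference: the harness decides what prior state enters context.
    prior = memory.retrieve(task, limit=20)
    messages = (
        [{"role": "system", "content": "You are a long-running coding agent."}]
        + [{"role": "assistant", "content": note} for note in prior]
        + [{"role": "user", "content": task}]
    )

    while True:
        response = call_model(messages, tools=tools)  # one model call: tokens out
        if response.tool_call is None:
            break  # the model has finished reasoning for this turn

        # The harness turns tokens into work: it executes the tool, decides how
        # the output is represented, and decides what gets stored for later.
        output = run_tool(response.tool_call)
        messages.append({"role": "tool", "content": summarize(output)})
        memory.store(task=task, tool=response.tool_call.name, result=output)

    return response.text
```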
Long-Running Agents Create State Whether You Plan for It or Not
Every useful harness produces state. Some of it is obvious:
- Conversation history.
- Retrieved context.
- Tool outputs and execution logs.
- Task plans and generated files.
- User preferences and memory records.
Some of it is less obvious, and this is where teams usually get caught:
- What the agent already tried, and what it abandoned.
- What assumptions were true at the time of a decision.
- What was compressed out of context.
- Which artifacts belong to which decision.
- What should be visible to the next session.
In demos, that state lives wherever it is easiest to put it. Markdown files, JSON blobs, SQLite, a vector store, a hidden harness directory, a provider-managed thread, a progress file in the repo. For early prototypes, that is often fine. The problem starts when temporary state becomes runtime state.
Anthropic’s writing on long-running agents shows the failure mode clearly. Their coding agents needed progress files, feature lists, git history, initialization scripts, and structured handoff artifacts so that each new session could understand what happened before. Without that, each context window began with too little usable memory of the previous one. The issue was not model intelligence. The runtime did not preserve enough usable state.
That is the shift production teams are now facing.
Why Memory Cannot Be Added to an AI Agent as a Plugin
Memory is not a plugin you bolt on after the rest of the harness is working. It sits in the path between the system and the model on every turn. Before the model acts, the harness decides what to retrieve, summarize, filter, compress, and include. After the model acts, the harness decides what to store, update, discard, or expose to future runs.
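One way to picture that placement is as a pair of hooks around every model call. The sketch below is illustrative only; the class, method names, and policies are invented, and the actual retrieval, compression, and storage decisions are exactly where real harnesses differ.

```python
# Illustrative only: the class, methods, and thresholds are invented, not a real library.

class MemoryLayer:
    def __init__(self, store):
        self.store = store  # any durable backend: SQL table, file, managed API

    def before_turn(self, task: str, budget_tokens: int) -> list:
        # Before the model acts: decide what to retrieve, filter, compress, and include.
        candidates = self.store.search(task)
        relevant = [m for m in candidates if m.score > 0.5]
        return compress_to_budget(relevant, budget_tokens)

    def after_turn(self, task: str, result, keep: bool) -> None:
        # After the model acts: decide what to store, update, or discard.
        # Whatever is written here is what future runs will be able to see.
        if keep:
            self.store.write(task=task, summary=summarize(result))
```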
LangChain’s analysis of harness architecture makes a related point: Memory is tightly coupled to how the harness manages context and state. If memory lives inside a closed harness or proprietary API, developers lose control of the state that makes their agent useful.
But once you see memory this way, the word starts to feel too narrow. The harness is not just managing memory. It is managing runtime state: Session state, task history, user profiles, permissions, retrieval metadata, tool outputs, workspace files, generated artifacts, recovery checkpoints.
That is why “just add a vector database” is an incomplete answer. Vector search is useful for semantic recall. Agent state is not purely semantic. A production agent also needs:
- Exact filters and chronology. What happened, in what order, at what time.
- Ownership and transactions. Which session wrote which record, and what depends on what.
- Permissions and auditability. Who can read which records, and who did read them.
- Structured querying. Inspection and reporting across history, users, and tools.
The tier most often missing is structured, queryable, control-plane state: Task history, permissions, user profiles, audit trails, and recovery checkpoints. These are the records that make an agent’s operating history inspectable and portable across harnesses.
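A small, self-contained sketch of that control-plane tier is below. It uses sqlite3 only so it runs anywhere; the table and column names are illustrative, and in production the same SQL would live in a shared, MySQL-compatible substrate.

```python
# Self-contained sketch of structured, queryable agent state.
# sqlite3 is used only so this runs anywhere; the schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE agent_events (
        id          INTEGER PRIMARY KEY,
        session_id  TEXT NOT NULL,
        user_id     TEXT NOT NULL,
        tool        TEXT NOT NULL,
        input       TEXT,
        output      TEXT,
        created_at  TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute(
    "INSERT INTO agent_events (session_id, user_id, tool, input, output) "
    "VALUES (?, ?, ?, ?, ?)",
    ("sess-42", "user-7", "run_tests", "pytest -q", "2 failed, 38 passed"),
)

# Exact filters and chronology: what happened, in what order, for whom.
rows = conn.execute(
    "SELECT created_at, tool, output FROM agent_events "
    "WHERE user_id = ? ORDER BY created_at",
    ("user-7",),
).fetchall()
print(rows)
```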
The Real Risk Is Agent State Lock-In, Not Open vs. Closed
It is tempting to frame the harness debate as open source versus closed source. That framing is too simple. The deeper issue is state lock-in.
For CTOs, architectural tech leads, and platform engineers, the practical questions are:
- Can we inspect what the agent stored?
- Can we query memory across users, sessions, tools, and workspaces?
- Can we migrate state from one harness to another?
- Can we separate model choice from memory ownership?
- Can we keep using the state if we change model providers?
If the answer is no, the harness is not only orchestrating the agent. It is becoming the system of record for the agent’s operating history. That may be acceptable for early experiments. It is much harder to accept once agents participate in engineering, support, research, data operations, or customer-facing workflows.
Tool-Calling AI Agents Make State Governance Visible
Once agents call external tools at scale, the platform needs clear answers for identity, authorization, audit trails, policy enforcement, and backend persistence. Agent workloads are shaped by runtime inputs, not just static code, which means infrastructure-level controls matter.
A better design treats state as a first-class substrate beneath the harness. Which agent called which tool, under which identity, with which input, visible to which future run. Without that record, debugging and compliance both become guesswork.
This is also where the MCP server pattern shows its limits. An MCP server that proxies a tool call without writing a durable record of who called what and why will work in a demo and break under audit. Backend persistence is not optional once tool calls become part of how real work gets done.
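What "durable record" means in practice can be sketched as a wrapper around every tool call, so the audit row is written in the same path as the call and survives even when the tool fails. The function, table, and connection here are hypothetical (DB-API style, qmark placeholders), not an MCP SDK API.

```python
# Hypothetical wrapper: tool_fn and the tool_audit table are illustrative, not an MCP SDK API.
import json, time

def audited_tool_call(conn, agent_id: str, identity: str, tool_name: str, args: dict, tool_fn):
    # Record the attempt before executing, so interrupted calls still leave a trace.
    cur = conn.execute(
        "INSERT INTO tool_audit (agent_id, identity, tool, args, status, started_at) "
        "VALUES (?, ?, ?, ?, 'started', ?)",
        (agent_id, identity, tool_name, json.dumps(args), time.time()),
    )
    audit_id = cur.lastrowid
    try:
        result = tool_fn(**args)
        status, output = "ok", json.dumps(result)[:10_000]
    except Exception as exc:
        status, output = "error", repr(exc)
        raise
    finally:
        conn.execute(
            "UPDATE tool_audit SET status = ?, output = ?, finished_at = ? WHERE id = ?",
            (status, output, time.time(), audit_id),
        )
        conn.commit()
    return result
```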
What Memory Benchmarks Like LoCoMo Confirm
Memory benchmarks like LoCoMo confirm empirically what production teams already report: Memory quality depends heavily on how the agent manages context and uses tools, not only on the retrieval mechanism. LoCoMo evaluates agents on multi-session conversations averaging 300 turns and 9K tokens over up to 35 sessions, and the systems that win on it are not the ones with the biggest context window. They are the ones with the most disciplined state management.
The implication is direct. Improving agent memory means improving the harness, and the harness needs a durable, queryable, portable place to keep the state it manages.
What an Independent AI Agent State Substrate Looks Like
The answer is not to make every agent prototype heavy. It is to make the state explicit earlier.
A useful state substrate has four properties:
- Fast to create. Provisioning measured in seconds, not weeks.
- Queryable with standard SQL. So memory can be inspected, joined, and filtered like any other data.
- Independent from any single harness. Switching frameworks does not require rebuilding state.
- Accessible from standard MySQL-compatible drivers. So any language, any harness, any tool can connect.
The point is portability. The harness can evolve, the model can change, and the state the agent accumulated stays in a place that is inspectable and queryable.
This is what TiDB Cloud Zero is designed to be in the agent state stack: An instant, MySQL-compatible SQL substrate that any harness can connect to and migrate away from without data loss if the architecture changes. For agents that need persistent memory across sessions, mem9 sits on top of that substrate as a managed memory API for coding agents and custom tools. Where agents also produce files, artifacts, and documents, drive9 extends the stack to cover those too.
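In practice, "any harness can connect" means a standard MySQL driver and plain SQL. Below is a hedged sketch using PyMySQL; the host, credentials, and schema are placeholders for illustration, not actual TiDB Cloud Zero connection details.

```python
# Hedged sketch: host, credentials, and schema are placeholders, not real
# connection details. Any MySQL-compatible driver works the same way.
import pymysql

conn = pymysql.connect(
    host="your-cluster-host.example.com",  # placeholder
    user="agent_app",
    password="REPLACE_ME",                 # placeholder
    database="agent_state",
)

with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS agent_memory (
            id         BIGINT AUTO_INCREMENT PRIMARY KEY,
            session_id VARCHAR(64) NOT NULL,
            kind       VARCHAR(32) NOT NULL,   -- e.g. 'preference', 'checkpoint', 'note'
            content    TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            KEY idx_session (session_id, created_at)
        )
    """)
    cur.execute(
        "INSERT INTO agent_memory (session_id, kind, content) VALUES (%s, %s, %s)",
        ("sess-42", "note", "user prefers squash merges"),
    )
conn.commit()
```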
No single layer solves the whole problem. Treating state as a separable substrate is the architectural decision worth making before the harness becomes the system of record.
The two failure modes from the opening illustrate this directly. In the framework migration case: If memory lives in a standard, queryable substrate rather than inside the harness’s internal format, it does not need to be rebuilt when the harness changes. It stays legible and portable. In the interrupted-task case: A durable state layer means the agent’s progress is a record it can read on restart, not session history it has to reconstruct from scratch.
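For the interrupted-task case, the resume path is just a query: read the most recent checkpoint for the session and continue from it rather than from turn zero. Again a sketch; the `agent_memory` table and `kind` values carry over from the previous snippet and are assumptions, not a fixed schema.

```python
# Sketch of resume-on-restart, reusing the hypothetical agent_memory table above.
import json

def load_checkpoint(conn, session_id: str) -> dict:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM agent_memory "
            "WHERE session_id = %s AND kind = 'checkpoint' "
            "ORDER BY created_at DESC LIMIT 1",
            (session_id,),
        )
        row = cur.fetchone()
    # No checkpoint means a genuinely new task; otherwise, pick up where we left off:
    # which files were touched, what was tried, what was skipped.
    return json.loads(row[0]) if row else {"files_touched": [], "attempts": [], "skipped": []}
```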
When a Dedicated Agent State Substrate Becomes Necessary
Not every agent needs this on day one. A dedicated state substrate becomes important when the agent crosses one or more of these lines:
- It must resume across sessions or machines.
- Memory must follow users across devices.
- Multiple agents need shared context.
- MCP tools need backend persistence.
- Tool outputs need to be queried later.
- The agent must survive a model or harness change.
- Provider-managed memory has become a lock-in concern.
If none of those apply, local files and a simple store may be enough. Once one applies, the architecture debt starts compounding.
The Question to Ask Before Choosing Your Next AI Agent Harness
As the harness becomes the application layer, it needs an independent state substrate to match. Before choosing the next agent framework or MCP architecture, ask “What state will this agent create, and will we still control it three months from now?”
- If the state lives only inside a closed API, you may move quickly but lose portability.
- If it lives only in local files, you may preserve visibility but lose durability.
- If it lives in an independent, queryable substrate, the harness can evolve, the model can change, and the agent’s operating history remains under your control.
That is what an AI agent harness actually needs beyond a model. Not more glue, but a state layer the builder can own.
If you are ready to test what an independent state layer looks like in practice, TiDB Cloud Zero spins up an instant, MySQL-compatible SQL substrate you can use today for agent memory, tool outputs, and retrieval prototypes. No signup required to start.
FAQ
What is an AI agent harness?
An AI agent harness is the application-layer code around a model that decides what the model sees, which tools it can call, how tool outputs are handled, when memory is retrieved, and how the agent resumes after interruption. The model produces tokens. The harness turns those tokens into work by coordinating tools, context, and state across many turns.
What is the difference between a model and a harness?
A model is a single inference engine. It takes a prompt and produces a response. A harness is the surrounding system that runs many model calls in sequence, manages conversation history and tool outputs, decides what to remember, and recovers from interruption. Stronger models do not replace the harness. They raise the ceiling on what the harness can coordinate.
What is agent state lock-in?
Agent state lock-in is what happens when an agent’s runtime state (memory, task history, user preferences, tool outputs) lives inside a specific harness or provider’s internal format. Switching to a different framework or model provider then requires either rebuilding the memory layer or starting users back at turn zero. The cost compounds as the agent accumulates more operating history.
Why isn’t a vector database enough for agent memory?
Vector search handles semantic recall, but agents also need exact filtering on metadata, ownership and transactional guarantees, permissions and audit trails, and structured queries across users, sessions, and tools. A vector-only store cannot answer questions like “what did this user request yesterday” or “which agent called which tool under whose identity.” Production agent state needs both vector and SQL semantics in the same backend.
What does an independent agent state substrate need to provide?
Four properties: Fast provisioning so prototypes are not blocked, standard SQL so memory is inspectable and joinable, independence from any single harness so frameworks can change without losing state, and standard driver support (such as MySQL compatibility) so any language or tool can connect. The point is portability across harness, model, and provider changes.