A lot of agent demos are misleading in one specific way: they hide time.
A model receives a prompt, calls a tool or two, produces an answer, and the system looks capable. But real agent work rarely happens in one clean burst. It unfolds over time. The agent has to survive partial failures, retries, interruptions, changing context, and multi-step execution. Once that happens, the problem stops looking like prompt engineering and starts looking like workflow engineering.
That is the real shift: as agents move from one-shot responses to ongoing work, durability becomes part of the core architecture.

Why long-running work changes the engineering problem
The moment work becomes long-running, a different class of failure appears.
A tool call times out. A shell command only partly succeeds. An API call fails after earlier steps already changed state. A user interrupts the task and comes back later. The system needs to resume from a meaningful checkpoint instead of starting over blindly.
This is the gap between a system that can produce a good answer once and a system that can make progress reliably.
The paper "Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned" is useful here because it does not describe execution as a single model response. It describes shell integration, task state, tool behavior, and explicit completion signals as part of the system itself.
Source URL: https://arxiv.org/html/2603.05344v1
That framing matters because long-running systems fail in ways that one-shot demos do not show. The problem is not just whether the model can reason. The problem is whether the surrounding system can preserve progress when the world is messy.
One-shot generation is not durable execution
A one-shot workflow can look stable simply because it has not been stressed.
If a model reads input, produces output, and exits, many important system questions remain hidden. What happens if the third step fails after the first two succeeded? What happens if the same task is retried? What happens if a human pauses the process and returns later? What happens if downstream state has changed while the task was waiting?
These are not edge cases. They are normal production conditions.
This is why durable execution matters. Durability means the system can preserve state, resume from checkpoints, retry safely, and recover without losing the integrity of the workflow.
A system that works only when nothing goes wrong is not durable. It is lucky.
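To make the "third step fails after the first two succeeded" case concrete, here is a minimal sketch of checkpointed execution. It is an illustration under stated assumptions, not how any particular framework does it: the names `run_workflow`, `load_checkpoint`, and `save_checkpoint` are hypothetical, state is assumed to be JSON-serializable, and a local file stands in for whatever durable store a real system would use.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Tuple

# A step is a (name, function) pair. The function takes the accumulated
# data dict and returns an updated one. All names here are hypothetical.
Step = Tuple[str, Callable[[Dict], Dict]]


def load_checkpoint(path: str) -> Dict:
    """Return saved workflow state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"completed": [], "data": {}}


def save_checkpoint(path: str, state: Dict) -> None:
    """Write state to a temp file, then rename it over the target.

    The rename avoids leaving a half-written checkpoint if the process
    dies in the middle of the write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_workflow(steps: List[Step], path: str) -> Dict:
    """Run steps in order, skipping any that a previous run already finished."""
    state = load_checkpoint(path)
    for name, fn in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run; do not repeat it
        state["data"] = fn(state["data"])  # may raise; checkpoint stays intact
        state["completed"].append(name)
        save_checkpoint(path, state)  # persist progress after every step
    return state
```

If the third step raises, the checkpoint still records the first two. Calling `run_workflow` again with the same path resumes at the third step instead of starting over. The sketch does not cover a crash between a step's side effect and its checkpoint write, which is where idempotent steps become necessary.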
What durability actually means
In practice, durability usually includes some combination of:
- persistent workflow state
- explicit checkpoints
- retries and backoff
- resumability after interruption
- safe replay or idempotent recovery paths
- traces and logs for inspection
- human checkpoints for correction or approval
These are not implementation details that sit outside the AI system. They shape whether an agent can do real work over time.
Durability is what allows a system to move from “the model produced something plausible” to “the workflow completed safely and can be inspected, resumed, or retried when needed.”
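Two items from the list above, retries with backoff and idempotent recovery, are easy to sketch together. This is a rough illustration under assumptions rather than a reference implementation: `retry_with_backoff` and `IdempotentExecutor` are hypothetical names, the in-memory dict stands in for a durable result store, and the caller is assumed to supply a stable key per logical operation.

```python
import random
import time
from typing import Callable, Dict, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or attempts run out.

    The delay doubles on each failure, with random jitter added so that
    many retrying clients do not all wake up at the same moment.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    raise RuntimeError("unreachable")


class IdempotentExecutor:
    """Run each keyed operation at most once and replay its stored result.

    Assumption: a real system would keep this ledger in durable storage.
    A dict is used here only to keep the sketch self-contained.
    """

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    def run(self, key: str, fn: Callable[[], object]) -> object:
        if key in self._results:
            return self._results[key]  # replay; do not repeat the side effect
        result = fn()
        self._results[key] = result
        return result
```

A caller might wrap a flaky operation as `executor.run("charge-order-42", lambda: retry_with_backoff(charge))`, where the key and `charge` are made up for illustration. In practice `retry_on` should be narrowed to errors that are actually transient, since retrying a permanent failure only delays the report.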
Public systems are already treating this as infrastructure
The open-source landscape is useful here because it shows what builders are actually investing in.
Restate’s AI examples emphasize durable execution, resilience, retries, persistence, and long-running workflow behavior.
Source URL: https://github.com/restatedev/ai-examples
That matters because it shows durability being treated as a first-class systems concern rather than as cleanup after the fact.
Dapr Agents reflects a similar mindset. The project brings together workflow orchestration, messaging, state, and telemetry around agent execution.
Source URL: https://github.com/dapr/dapr-agents
LangGraph is another clear signal. Its model is explicitly stateful and graph-oriented, which makes long-running workflow structure visible rather than implicit.
Source URL: https://github.com/langchain-ai/langgraph
Even outside AI-branded tooling, durable workflow engines such as Conductor point in the same direction: once workflows become meaningful, resilience and resumability stop being optional.
Source URL: https://github.com/conductor-oss/conductor
Seen together, these systems suggest a broader pattern. Public implementations are converging on the idea that serious agent execution needs workflow memory, retries, replay-aware behavior, and explicit state transitions.
Why this is a harness problem
It is easy to describe failures in long-running tasks as model failures. Sometimes they are. But often the model is only one part of the story.
A stronger model does not automatically decide:
- when to checkpoint
- how to persist state
- how to retry safely
- how to resume after interruption
- how to surface partial progress
- how to let humans inspect or redirect the run
Those are harness choices.
This is why the center of gravity keeps moving outward from the model itself. As models improve, the surrounding execution structure becomes easier to notice. The more you expect from an agent, the more visible durability becomes.
A good harness does not just help an agent start. It helps the agent continue.
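One way a harness can make these choices explicit is to model a run as a small state machine with an inspectable history. The sketch below is illustrative only: the `Status` values, the `ALLOWED` transition table, and the `Run` class are assumptions made for this example and are not taken from any of the systems mentioned above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class Status(Enum):
    RUNNING = "running"
    WAITING_FOR_HUMAN = "waiting_for_human"
    COMPLETED = "completed"
    FAILED = "failed"


# Which moves the harness permits. This table is a design choice made
# for the example, not a standard.
ALLOWED: Dict[Status, Set[Status]] = {
    Status.RUNNING: {Status.WAITING_FOR_HUMAN, Status.COMPLETED, Status.FAILED},
    Status.WAITING_FOR_HUMAN: {Status.RUNNING, Status.FAILED},
    Status.COMPLETED: set(),          # terminal
    Status.FAILED: {Status.RUNNING},  # a failed run may be retried
}


@dataclass
class Run:
    status: Status = Status.RUNNING
    history: List[str] = field(default_factory=list)

    def transition(self, new: Status, note: str = "") -> None:
        """Move to a new status, rejecting moves the table does not allow."""
        if new not in ALLOWED[self.status]:
            raise ValueError(
                f"illegal transition {self.status.value} -> {new.value}"
            )
        # Record every move so a human can see how the run got here.
        self.history.append(f"{self.status.value} -> {new.value}: {note}")
        self.status = new
```

A run that pauses for approval would call `run.transition(Status.WAITING_FOR_HUMAN, "needs approval")` and later `run.transition(Status.RUNNING, "approved")`. The model never decides whether that pause exists. The harness does.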
Bottom line
The difference between a convincing demo and a dependable agent often comes down to whether the workflow can survive time.
That is why durability, state, and recovery are not secondary engineering polish. They are part of the core architecture of serious AI systems.
Once an agent is expected to work across long tasks, interruptions, retries, and shifting context, workflow durability becomes a competitive differentiator.
In the next part of this series, I will zoom out from individual failure modes to a broader pattern: the recurring architecture choices that keep appearing across public harness-oriented systems.