Tool Resilience & Recovery¶
Cogent's resilience layer sits between the agent loop and every tool call.
It provides two independently controllable tiers of recovery, configured
through a single flat ResilienceConfig dataclass.
Two-Tier Model¶
| Tier | Who decides? | When it runs |
|---|---|---|
| Systematic retry | Developer | Retries happen automatically before the LLM ever sees the error |
| Intelligent retry | LLM | Error context is returned to the agent loop; the model chooses what to do next |
Both tiers are active inside agent.run() and any programmatic tool call.
Quick Start¶
```python
from cogent import Agent
from cogent.agent.resilience import ResilienceConfig

# Default: 3 retries, exponential-jitter backoff, hand off to LLM on exhaustion
agent = Agent(name="Bot", model=model, tools=[...])

# Explicit config
agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential_jitter",
        on_exhaustion="ask_agent",
    ),
)
```
Systematic Retry¶
The framework retries the failing tool call mechanically using the configured backoff schedule. The LLM is not consulted during this tier.
```python
agent = Agent(
    name="ReliableBot",
    model=model,
    tools=[flaky_api],
    observer="detailed",  # see retry events in output
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential",
        base_delay=1.0,
        max_delay=30.0,
    ),
)
result = await agent.run("Fetch the report")

# Inspect retry events after the run
errors = result.events_of("tool.error")
print(f"Tool was retried {len(errors)} time(s)")
```
Backoff Strategies¶
| Value | Behaviour |
|---|---|
| "exponential_jitter" | base_delay × 2^(attempt-1) plus random jitter (default) |
| "exponential" | base_delay × 2^(attempt-1), no jitter |
| "linear" | base_delay × attempt |
| "fixed" | Constant base_delay between retries |
| "none" | No delay |
Strategy values are case-insensitive strings or RetryStrategy enum members.
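The schedule each strategy produces can be sketched in plain Python. This is an illustrative model of the formulas in the table above, not Cogent's actual implementation, and the exact jitter formula (uniform over `[0, jitter_factor × delay]`) is an assumption:

```python
import random

def backoff_delay(strategy: str, attempt: int, base_delay: float = 1.0,
                  max_delay: float = 60.0, jitter_factor: float = 0.25) -> float:
    """Illustrative delay schedule for a 1-based retry attempt."""
    if strategy == "none":
        return 0.0
    if strategy == "fixed":
        delay = base_delay
    elif strategy == "linear":
        delay = base_delay * attempt
    else:  # "exponential" and "exponential_jitter"
        delay = base_delay * 2 ** (attempt - 1)
        if strategy == "exponential_jitter":
            # assumed jitter: add up to jitter_factor x delay at random
            delay += random.uniform(0, jitter_factor * delay)
    return min(delay, max_delay)

# With base_delay=1.0, "exponential" yields 1.0, 2.0, 4.0, ... capped at max_delay
print([backoff_delay("exponential", n) for n in (1, 2, 3)])
```

Jitter spreads out the retry times of concurrent callers, which is why the jittered variant is the default.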
Retryable vs. Non-Retryable Errors¶
By default the policy retries on ConnectionError, TimeoutError, OSError,
and any exception whose message contains common transient-error patterns
("rate limit", "503", "too many requests", etc.).
It does not retry on ValueError, TypeError, PermissionError,
KeyError, or messages matching auth patterns ("401", "api key",
"unauthorized", etc.).
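The classification described above can be approximated with a small predicate. This is a hypothetical sketch for intuition only; the real rules live inside Cogent's resilience layer, and the pattern lists here are just the examples quoted above. Note that PermissionError must be checked before OSError, since it is an OSError subclass:

```python
# Assumed pattern lists, taken from the examples in the prose above
TRANSIENT_PATTERNS = ("rate limit", "503", "too many requests")
AUTH_PATTERNS = ("401", "api key", "unauthorized")
NON_RETRYABLE_TYPES = (ValueError, TypeError, PermissionError, KeyError)

def is_retryable(exc: Exception) -> bool:
    """Sketch of the default retryable/non-retryable split."""
    msg = str(exc).lower()
    # Non-retryable types and auth-style messages win first:
    # PermissionError is an OSError subclass, so order matters.
    if isinstance(exc, NON_RETRYABLE_TYPES):
        return False
    if any(p in msg for p in AUTH_PATTERNS):
        return False
    # Transient network/OS failures are retried.
    if isinstance(exc, (ConnectionError, TimeoutError, OSError)):
        return True
    # Otherwise fall back to message matching.
    return any(p in msg for p in TRANSIENT_PATTERNS)
```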
Intelligent Retry (on_exhaustion="ask_agent")¶
When systematic retries are exhausted, on_exhaustion="ask_agent" feeds the
error context (tool name, args, error message, attempt count) back into the
agent loop. The LLM then decides what to do: try a different tool, reformulate
the arguments, or explain why it cannot proceed.
```python
agent = Agent(
    name="SearchBot",
    model=model,
    tools=[web_search, cached_search, local_index],
    observer="progress",
    resilience=ResilienceConfig(
        max_retries=0,  # fail on first error; let the LLM cascade
        on_exhaustion="ask_agent",
    ),
    instructions=(
        "Try web_search first, then cached_search, then local_index. "
        "When a tool fails, immediately try the next one."
    ),
)
result = await agent.run("Find information about the framework")
```
on_exhaustion="raise" propagates the last exception instead of handing off
to the LLM. Use this when you want hard failures rather than LLM-driven
recovery.
Per-Tool Overrides¶
Override any ResilienceConfig field for specific tools:
```python
resilience = ResilienceConfig(
    max_retries=3,
    strategy="exponential_jitter",
    tool_overrides={
        "payment_api": {"max_retries": 0},  # never retry payment calls
        "flaky_search": {"max_retries": 5, "base_delay": 0.5},
        "slow_report": {"timeout_seconds": 300.0},  # 5-minute timeout
    },
)
```
Each override is a flat dict with any subset of ResilienceConfig fields.
Timeout¶
Timeout applies to each individual attempt. A timed-out call raises
TimeoutError, which is retryable by default.
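The per-attempt behaviour can be illustrated with asyncio.wait_for, which is the standard way to bound a single coroutine call: each retry attempt gets a fresh timeout budget rather than sharing one overall deadline. This is a sketch of the semantics, not Cogent's internals:

```python
import asyncio

async def call_with_timeout(tool, timeout_seconds, *args, **kwargs):
    """Wrap one attempt; a retry would call this again with a fresh budget."""
    if timeout_seconds is None:
        return await tool(*args, **kwargs)  # timeout disabled
    # Raises TimeoutError on expiry (asyncio.TimeoutError on Python < 3.11),
    # which a retry policy like the one above treats as retryable.
    return await asyncio.wait_for(tool(*args, **kwargs), timeout_seconds)

async def slow_tool():
    await asyncio.sleep(0.05)
    return "done"
```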
Observing Retries¶
Pass a string level to observer for inline output:
```python
agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    observer="progress",  # shows retry events as they happen
    resilience=ResilienceConfig(max_retries=3),
)
```
Retry events appear grouped inside the decision-tree block alongside the call and final outcome:
```
[Bot] [tool-decision]
  flaky_api {query='data'}
  [tool-error] (retry 1/3) ConnectionError: upstream timeout
  [tool-error] (retry 2/3) ConnectionError: upstream timeout
  [tool-result] {status: 'ok', data: ...}
```
Or inspect after the run:
```python
result = await agent.run("...")
retries = result.events_of("tool.retry")  # one event per failed attempt
for evt in retries:
    print(evt.data["tool_name"], evt.data["attempt"], evt.data["error"])
```
ResilienceConfig Reference¶
| Field | Type | Default | Description |
|---|---|---|---|
| max_retries | int | 3 | Retry attempts after the first failure. 0 = no retry. |
| strategy | str \| RetryStrategy | "exponential_jitter" | Backoff strategy. |
| base_delay | float | 1.0 | Base delay in seconds. |
| max_delay | float | 60.0 | Delay cap in seconds. |
| jitter_factor | float | 0.25 | Jitter multiplier (exponential_jitter only). |
| on_exhaustion | "raise" \| "ask_agent" | "ask_agent" | Behaviour when all retries are spent. |
| timeout_seconds | float \| None | 60.0 | Per-call timeout. None disables. |
| tool_overrides | dict[str, dict] | {} | Per-tool field overrides. |
Migration from < 1.18.0¶
The following APIs were removed in 1.18.0:
| Removed | Replacement |
|---|---|
| ResilienceConfig(retry_policy=RetryPolicy(...)) | ResilienceConfig(max_retries=3, strategy="exponential") |
| ResilienceConfig.aggressive() | ResilienceConfig(max_retries=5, base_delay=0.5) |
| ResilienceConfig.fast_fail() | ResilienceConfig(max_retries=0, on_exhaustion="raise") |
| ResilienceConfig.balanced() | ResilienceConfig() (default) |
| CircuitBreaker | Remove; use on_exhaustion="ask_agent" instead |
| FallbackRegistry | Remove; register fallback tools directly and let the LLM cascade |
| RecoveryAction | Remove; on_exhaustion covers the supported modes |
See Also¶
- examples/resilience/tool_resilience.py — Live demos of both recovery tiers
- docs/observability.md — Observer levels and event inspection
- docs/tool-building.md — Creating tools