Tool Resilience & Recovery

Cogent's resilience layer sits between the agent loop and every tool call. It provides two independently controllable tiers of recovery, configured through a single flat ResilienceConfig dataclass.

Two-Tier Model

| Tier | Who decides? | When it runs |
|---|---|---|
| Systematic retry | Developer | Retries happen automatically before the LLM ever sees the error |
| Intelligent retry | LLM | Error context is returned to the agent loop; the model chooses what to do next |

Both tiers are active inside agent.run() and any programmatic tool call.
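The division of labour between the tiers can be sketched in plain Python. This is an illustrative sketch, not Cogent's actual source; the function name and the error-context keys here are invented for the example:

```python
# Tier 1 retries mechanically; tier 2 hands the error context to the LLM.

def call_with_resilience(tool, args, max_retries=3):
    """Return the tool result, or an error-context dict for the agent loop."""
    last_error = None
    for attempt in range(1, max_retries + 2):  # first call + retries
        try:
            return {"status": "ok", "result": tool(**args)}
        except Exception as exc:               # tier 1: systematic retry
            last_error = exc
    # tier 2: retries exhausted — surface the context to the agent loop
    return {
        "status": "error",
        "tool_name": getattr(tool, "__name__", str(tool)),
        "args": args,
        "error": str(last_error),
        "attempts": max_retries + 1,
    }
```

Tier 1 runs without any model involvement; only the final error context ever reaches the LLM.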


Quick Start

from cogent import Agent
from cogent.agent.resilience import ResilienceConfig

# Default: 3 retries, exponential-jitter backoff, hand off to LLM on exhaustion
agent = Agent(name="Bot", model=model, tools=[...])

# Explicit config
agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential_jitter",
        on_exhaustion="ask_agent",
    ),
)

Systematic Retry

The framework retries the failing tool call mechanically using the configured backoff schedule. The LLM is not consulted during this tier.

agent = Agent(
    name="ReliableBot",
    model=model,
    tools=[flaky_api],
    observer="detailed",       # see retry events in output
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential",
        base_delay=1.0,
        max_delay=30.0,
    ),
)
result = await agent.run("Fetch the report")

# Inspect retry events after the run
errors = result.events_of("tool.error")
print(f"Tool was retried {len(errors)} time(s)")

Backoff Strategies

| Value | Behaviour |
|---|---|
| "exponential_jitter" | base_delay × 2^(attempt-1) plus random jitter (default) |
| "exponential" | base_delay × 2^(attempt-1), no jitter |
| "linear" | base_delay × attempt |
| "fixed" | Constant base_delay between retries |
| "none" | No delay |

Strategy values are case-insensitive strings or RetryStrategy enum members.
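Assuming the formulas in the table, the delay schedule can be sketched as follows. This is a hedged approximation; the real implementation may clamp or sample jitter differently:

```python
import random

def backoff_delay(strategy, attempt, base_delay=1.0, max_delay=60.0,
                  jitter_factor=0.25):
    """Delay before retry number `attempt` (1-based), capped at max_delay."""
    if strategy == "none":
        return 0.0
    if strategy == "fixed":
        delay = base_delay
    elif strategy == "linear":
        delay = base_delay * attempt
    else:  # "exponential" and "exponential_jitter"
        delay = base_delay * 2 ** (attempt - 1)
        if strategy == "exponential_jitter":
            delay += random.uniform(0, delay * jitter_factor)
    return min(delay, max_delay)
```

For example, with the defaults, "exponential" yields 1, 2, 4, 8, … seconds until the max_delay cap kicks in.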

Retryable vs. Non-Retryable Errors

By default the policy retries on ConnectionError, TimeoutError, OSError, and any exception whose message contains common transient-error patterns ("rate limit", "503", "too many requests", etc.).

It does not retry on ValueError, TypeError, PermissionError, KeyError, or messages matching auth patterns ("401", "api key", "unauthorized", etc.).
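A classifier matching this description might look like the sketch below; the pattern lists are abbreviated here, and the real policy's lists are longer. Note that PermissionError is an OSError subclass, so the non-retryable check must run first:

```python
TRANSIENT_PATTERNS = ("rate limit", "503", "too many requests")
AUTH_PATTERNS = ("401", "api key", "unauthorized")
NON_RETRYABLE_TYPES = (ValueError, TypeError, PermissionError, KeyError)

def is_retryable(exc):
    """Rough approximation of the default retry classification."""
    msg = str(exc).lower()
    if isinstance(exc, NON_RETRYABLE_TYPES):   # checked before OSError below,
        return False                           # since PermissionError is an OSError
    if any(p in msg for p in AUTH_PATTERNS):
        return False
    if isinstance(exc, (ConnectionError, TimeoutError, OSError)):
        return True
    return any(p in msg for p in TRANSIENT_PATTERNS)
```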


Intelligent Retry (on_exhaustion="ask_agent")

When systematic retries are exhausted, on_exhaustion="ask_agent" feeds the error context (tool name, args, error message, attempt count) back into the agent loop. The LLM then decides what to do: try a different tool, reformulate the arguments, or explain why it cannot proceed.

agent = Agent(
    name="SearchBot",
    model=model,
    tools=[web_search, cached_search, local_index],
    observer="progress",
    resilience=ResilienceConfig(
        max_retries=0,             # fail on first error — let the LLM cascade
        on_exhaustion="ask_agent",
    ),
    instructions=(
        "Try web_search first, then cached_search, then local_index. "
        "When a tool fails, immediately try the next one."
    ),
)
result = await agent.run("Find information about the framework")

on_exhaustion="raise" propagates the last exception instead of handing off to the LLM. Use this when you want hard failures rather than LLM-driven recovery.
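The two modes can be summarised in a small sketch; the function and the message format are illustrative, not Cogent internals:

```python
def handle_exhaustion(mode, tool_name, last_exc, attempts):
    """What each on_exhaustion mode does with the final error."""
    if mode == "raise":
        raise last_exc  # hard failure propagated to the caller
    # "ask_agent": package the context as a message the LLM can act on
    return (f"Tool '{tool_name}' failed after {attempts} attempt(s): "
            f"{type(last_exc).__name__}: {last_exc}. "
            "Consider a different tool or different arguments.")
```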


Per-Tool Overrides

Override any ResilienceConfig field for specific tools:

resilience = ResilienceConfig(
    max_retries=3,
    strategy="exponential_jitter",
    tool_overrides={
        "payment_api": {"max_retries": 0},            # never retry payment calls
        "flaky_search": {"max_retries": 5, "base_delay": 0.5},
        "slow_report":  {"timeout_seconds": 300.0},   # 5-minute timeout
    },
)

Each override is a flat dict with any subset of ResilienceConfig fields.
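Assuming flat field-by-field replacement, override resolution can be sketched with dataclasses.replace. The Config class below is a minimal stand-in for the example, not the real ResilienceConfig:

```python
from dataclasses import dataclass, field, replace

@dataclass
class Config:
    max_retries: int = 3
    base_delay: float = 1.0
    timeout_seconds: float = 60.0
    tool_overrides: dict = field(default_factory=dict)

def config_for(cfg, tool_name):
    """Resolve the effective config for one tool: base fields, then overrides."""
    return replace(cfg, **cfg.tool_overrides.get(tool_name, {}))
```

Fields not mentioned in an override keep their base values.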


Timeout

ResilienceConfig(
    max_retries=3,
    timeout_seconds=30.0,   # per-call timeout; None disables it
)

Timeout applies to each individual attempt. A timed-out call raises TimeoutError, which is retryable by default.
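A per-attempt timeout of this shape can be sketched with asyncio.wait_for; this is an assumption about the mechanism, not Cogent's actual code:

```python
import asyncio

async def call_with_timeout(tool, timeout_seconds=None, **kwargs):
    """Run one attempt; each retry would get the full timeout budget again."""
    if timeout_seconds is None:  # None disables the timeout entirely
        return await tool(**kwargs)
    # wait_for raises TimeoutError, which the default policy retries
    return await asyncio.wait_for(tool(**kwargs), timeout=timeout_seconds)
```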


Observing Retries

Pass a string level to observer for inline output:

agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    observer="progress",   # shows retry events as they happen
    resilience=ResilienceConfig(max_retries=3),
)

Retry events appear grouped inside the decision-tree block alongside the call and final outcome:

[Bot] [tool-decision]
  flaky_api {query='data'}
    [tool-error] (retry 1/3) ConnectionError: upstream timeout
    [tool-error] (retry 2/3) ConnectionError: upstream timeout
    [tool-result] {status: 'ok', data: ...}

Or inspect after the run:

result = await agent.run("...")
retries = result.events_of("tool.retry")   # one event per failed attempt
for evt in retries:
    print(evt.data["tool_name"], evt.data["attempt"], evt.data["error"])

ResilienceConfig Reference

| Field | Type | Default | Description |
|---|---|---|---|
| max_retries | int | 3 | Retry attempts after first failure. 0 = no retry. |
| strategy | str \| RetryStrategy | "exponential_jitter" | Backoff strategy. |
| base_delay | float | 1.0 | Base delay in seconds. |
| max_delay | float | 60.0 | Delay cap in seconds. |
| jitter_factor | float | 0.25 | Jitter multiplier (exponential_jitter only). |
| on_exhaustion | "raise" \| "ask_agent" | "ask_agent" | Behaviour when all retries are spent. |
| timeout_seconds | float \| None | 60.0 | Per-call timeout. None disables. |
| tool_overrides | dict[str, dict] | {} | Per-tool field overrides. |
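For reference, the table above corresponds field-for-field to a dataclass of this shape (a sketch of the defaults; the real class is ResilienceConfig in cogent.agent.resilience):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ResilienceConfigSketch:
    max_retries: int = 3
    strategy: str = "exponential_jitter"   # or a RetryStrategy enum member
    base_delay: float = 1.0
    max_delay: float = 60.0
    jitter_factor: float = 0.25            # exponential_jitter only
    on_exhaustion: str = "ask_agent"       # or "raise"
    timeout_seconds: Optional[float] = 60.0  # None disables the timeout
    tool_overrides: dict = field(default_factory=dict)
```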

Migration from < 1.18.0

The following APIs were removed in 1.18.0:

| Removed | Replacement |
|---|---|
| ResilienceConfig(retry_policy=RetryPolicy(...)) | ResilienceConfig(max_retries=3, strategy="exponential") |
| ResilienceConfig.aggressive() | ResilienceConfig(max_retries=5, base_delay=0.5) |
| ResilienceConfig.fast_fail() | ResilienceConfig(max_retries=0, on_exhaustion="raise") |
| ResilienceConfig.balanced() | ResilienceConfig() (default) |
| CircuitBreaker | Remove; use on_exhaustion="ask_agent" instead |
| FallbackRegistry | Remove; register fallback tools directly and let the LLM cascade |
| RecoveryAction | Remove; on_exhaustion covers the supported modes |

See Also