Tool Resilience & Recovery

Cogent's resilience layer sits between the agent loop and every tool call. It provides two independently controllable tiers of recovery, configured through a single flat ResilienceConfig dataclass.

Two-Tier Model

| Tier | Who decides? | When it runs |
|---|---|---|
| Systematic retry | Developer | Retries happen automatically before the LLM ever sees the error |
| Intelligent retry | LLM | Error context is returned to the agent loop; the model chooses what to do next |

Both tiers are active inside agent.run() and any programmatic tool call.
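The division of labour between the tiers can be sketched in plain Python. This is an illustrative sketch, not Cogent's actual source; the function name and the error-context keys here are invented for the example:

```python
# Tier 1 retries mechanically; tier 2 hands the error context to the LLM.

def call_with_resilience(tool, args, max_retries=3):
    """Return the tool result, or an error-context dict for the agent loop."""
    last_error = None
    for attempt in range(1, max_retries + 2):  # first call + retries
        try:
            return {"status": "ok", "result": tool(**args)}
        except Exception as exc:               # tier 1: systematic retry
            last_error = exc
    # tier 2: retries exhausted — surface the context to the agent loop
    return {
        "status": "error",
        "tool_name": getattr(tool, "__name__", str(tool)),
        "args": args,
        "error": str(last_error),
        "attempts": max_retries + 1,
    }
```

Tier 1 runs without any model involvement; only the final error context ever reaches the LLM.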


Quick Start

from cogent import Agent
from cogent.agent.resilience import ResilienceConfig

# Default: 3 retries, exponential-jitter backoff, hand off to LLM on exhaustion
agent = Agent(name="Bot", model=model, tools=[...])

# Explicit config
agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential_jitter",
        on_exhaustion="ask_agent",
    ),
)

Systematic Retry

The framework retries the failing tool call mechanically using the configured backoff schedule. The LLM is not consulted during this tier.

agent = Agent(
    name="ReliableBot",
    model=model,
    tools=[flaky_api],
    observer="detailed",       # see retry events in output
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential",
        base_delay=1.0,
        max_delay=30.0,
    ),
)
result = await agent.run("Fetch the report")

# Inspect retry events after the run
errors = result.events_of("tool.error")
print(f"Tool was retried {len(errors)} time(s)")

Backoff Strategies

| Value | Behaviour |
|---|---|
| "exponential_jitter" | base_delay × 2^(attempt-1) plus random jitter (default) |
| "exponential" | base_delay × 2^(attempt-1), no jitter |
| "linear" | base_delay × attempt |
| "fixed" | Constant base_delay between retries |
| "none" | No delay |

Strategy values are case-insensitive strings or RetryStrategy enum members.
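Assuming the formulas in the table, the delay schedule can be sketched as follows. This is a hedged approximation; the real implementation may clamp or sample jitter differently:

```python
import random

def backoff_delay(strategy, attempt, base_delay=1.0, max_delay=60.0,
                  jitter_factor=0.25):
    """Delay before retry number `attempt` (1-based), capped at max_delay."""
    if strategy == "none":
        return 0.0
    if strategy == "fixed":
        delay = base_delay
    elif strategy == "linear":
        delay = base_delay * attempt
    else:  # "exponential" and "exponential_jitter"
        delay = base_delay * 2 ** (attempt - 1)
        if strategy == "exponential_jitter":
            delay += random.uniform(0, delay * jitter_factor)
    return min(delay, max_delay)
```

For example, with the defaults, "exponential" yields 1, 2, 4, 8, … seconds until the max_delay cap kicks in.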

Retryable vs. Non-Retryable Errors

By default the policy retries on ConnectionError, TimeoutError, OSError, and any exception whose message contains common transient-error patterns ("rate limit", "503", "too many requests", etc.).

It does not retry on ValueError, TypeError, PermissionError, KeyError, or messages matching auth patterns ("401", "api key", "unauthorized", etc.).
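A classifier matching this description might look like the sketch below; the pattern lists are abbreviated here, and the real policy's lists are longer. Note that PermissionError is an OSError subclass, so the non-retryable check must run first:

```python
TRANSIENT_PATTERNS = ("rate limit", "503", "too many requests")
AUTH_PATTERNS = ("401", "api key", "unauthorized")
NON_RETRYABLE_TYPES = (ValueError, TypeError, PermissionError, KeyError)

def is_retryable(exc):
    """Rough approximation of the default retry classification."""
    msg = str(exc).lower()
    if isinstance(exc, NON_RETRYABLE_TYPES):   # checked before OSError below,
        return False                           # since PermissionError is an OSError
    if any(p in msg for p in AUTH_PATTERNS):
        return False
    if isinstance(exc, (ConnectionError, TimeoutError, OSError)):
        return True
    return any(p in msg for p in TRANSIENT_PATTERNS)
```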


Intelligent Retry (on_exhaustion="ask_agent")

When systematic retries are exhausted, on_exhaustion="ask_agent" feeds the error context (tool name, args, error message, attempt count) back into the agent loop. The LLM then decides what to do: try a different tool, reformulate the arguments, or explain why it cannot proceed.

agent = Agent(
    name="SearchBot",
    model=model,
    tools=[web_search, cached_search, local_index],
    observer="progress",
    resilience=ResilienceConfig(
        max_retries=0,             # fail on first error — let the LLM cascade
        on_exhaustion="ask_agent",
    ),
    instructions=(
        "Try web_search first, then cached_search, then local_index. "
        "When a tool fails, immediately try the next one."
    ),
)
result = await agent.run("Find information about the framework")

on_exhaustion="raise" propagates the last exception instead of handing off to the LLM. Use this when you want hard failures rather than LLM-driven recovery.
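The two modes can be summarised in a small sketch; the function and the message format are illustrative, not Cogent internals:

```python
def handle_exhaustion(mode, tool_name, last_exc, attempts):
    """What each on_exhaustion mode does with the final error."""
    if mode == "raise":
        raise last_exc  # hard failure propagated to the caller
    # "ask_agent": package the context as a message the LLM can act on
    return (f"Tool '{tool_name}' failed after {attempts} attempt(s): "
            f"{type(last_exc).__name__}: {last_exc}. "
            "Consider a different tool or different arguments.")
```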


Per-Tool Overrides

Override any ResilienceConfig field for specific tools:

resilience = ResilienceConfig(
    max_retries=3,
    strategy="exponential_jitter",
    tool_overrides={
        "payment_api": {"max_retries": 0},            # never retry payment calls
        "flaky_search": {"max_retries": 5, "base_delay": 0.5},
        "slow_report":  {"timeout_seconds": 300.0},   # 5-minute timeout
    },
)

Each override is a flat dict with any subset of ResilienceConfig fields.
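Assuming flat field-by-field replacement, override resolution can be sketched with dataclasses.replace. The Config class below is a minimal stand-in for the example, not the real ResilienceConfig:

```python
from dataclasses import dataclass, field, replace

@dataclass
class Config:
    max_retries: int = 3
    base_delay: float = 1.0
    timeout_seconds: float = 60.0
    tool_overrides: dict = field(default_factory=dict)

def config_for(cfg, tool_name):
    """Resolve the effective config for one tool: base fields, then overrides."""
    return replace(cfg, **cfg.tool_overrides.get(tool_name, {}))
```

Fields not mentioned in an override keep their base values.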


Timeout

ResilienceConfig(
    max_retries=3,
    timeout_seconds=30.0,   # per-call timeout; None disables it
)

Timeout applies to each individual attempt. A timed-out call raises TimeoutError, which is retryable by default.
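A per-attempt timeout of this shape can be sketched with asyncio.wait_for; this is an assumption about the mechanism, not Cogent's actual code:

```python
import asyncio

async def call_with_timeout(tool, timeout_seconds=None, **kwargs):
    """Run one attempt; each retry would get the full timeout budget again."""
    if timeout_seconds is None:  # None disables the timeout entirely
        return await tool(**kwargs)
    # wait_for raises TimeoutError, which the default policy retries
    return await asyncio.wait_for(tool(**kwargs), timeout=timeout_seconds)
```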


Observing Retries

Pass a string level to observer for inline output:

agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    observer="progress",   # shows retry events as they happen
    resilience=ResilienceConfig(max_retries=3),
)

Retry events appear grouped inside the decision-tree block alongside the call and final outcome:

[Bot] [tool-decision]
  flaky_api {query='data'}
    [tool-error] (retry 1/3) ConnectionError: upstream timeout
    [tool-error] (retry 2/3) ConnectionError: upstream timeout
    [tool-result] {status: 'ok', data: ...}

Or inspect after the run:

result = await agent.run("...")
retries = result.events_of("tool.retry")   # one event per failed attempt
for evt in retries:
    print(evt.data["tool_name"], evt.data["attempt"], evt.data["error"])

ResilienceConfig Reference

| Field | Type | Default | Description |
|---|---|---|---|
| max_retries | int | 3 | Retry attempts after first failure. 0 = no retry. |
| strategy | str \| RetryStrategy | "exponential_jitter" | Backoff strategy. |
| base_delay | float | 1.0 | Base delay in seconds. |
| max_delay | float | 60.0 | Delay cap in seconds. |
| jitter_factor | float | 0.25 | Jitter multiplier (exponential_jitter only). |
| on_exhaustion | "raise" \| "ask_agent" | "ask_agent" | Behaviour when all retries are spent. |
| timeout_seconds | float \| None | 60.0 | Per-call timeout. None disables. |
| tool_overrides | dict[str, dict] | {} | Per-tool field overrides. |
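For reference, the table above corresponds field-for-field to a dataclass of this shape (a sketch of the defaults; the real class is ResilienceConfig in cogent.agent.resilience):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ResilienceConfigSketch:
    max_retries: int = 3
    strategy: str = "exponential_jitter"   # or a RetryStrategy enum member
    base_delay: float = 1.0
    max_delay: float = 60.0
    jitter_factor: float = 0.25            # exponential_jitter only
    on_exhaustion: str = "ask_agent"       # or "raise"
    timeout_seconds: Optional[float] = 60.0  # None disables the timeout
    tool_overrides: dict = field(default_factory=dict)
```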

Migration from < 1.18.0

The following APIs were removed in 1.18.0:

| Removed | Replacement |
|---|---|
| ResilienceConfig(retry_policy=RetryPolicy(...)) | ResilienceConfig(max_retries=3, strategy="exponential") |
| ResilienceConfig.aggressive() | ResilienceConfig(max_retries=5, base_delay=0.5) |
| ResilienceConfig.fast_fail() | ResilienceConfig(max_retries=0, on_exhaustion="raise") |
| ResilienceConfig.balanced() | ResilienceConfig() (default) |
| CircuitBreaker | Remove; use on_exhaustion="ask_agent" instead |
| FallbackRegistry | Remove; register fallback tools directly and let the LLM cascade |
| RecoveryAction | Remove; on_exhaustion covers the supported modes |

See Also