Errors & retries

What happens when a model errors, a tool raises, a network drops, or an agent stalls — what retries run silently and what you see in the transcript.

Things fail. A model 429s, a tool raises, a websocket drops. The agent runtime has a narrow set of retries it runs silently and a narrow set of failure modes that surface directly in the transcript. This page maps both.

Silent LLM retries

Transient LLM errors trigger a backoff loop with jitter. The agent retries the same call; the step budget is not consumed.

Retried: RateLimitError, Timeout, APIConnectionError, APIConnectionTimeoutError, ServiceUnavailableError, InternalServerError, BadGatewayError, generic APIError.

Not retried: BadRequestError, AuthenticationError, ContextWindowExceededError. The last one triggers overflow recovery instead.

Defaults, configurable on the Agent config:

Setting	Default
`backoff_max_tries`	8
`backoff_max_time`	300 s
`backoff_base_factor`	1.0
`backoff_jitter`	`True`

Wait time is base_factor * 2**attempt plus uniform jitter in [0, base_factor]. Each attempt emits a GenerationRetry event and surfaces in the transcript as a system line:

RateLimitError — retrying in 4s (attempt 3/8): provider is rate-limiting your key

If the retries are exhausted, the turn ends with stop_reason="error" and a final GenerationError row.
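The wait-time formula above can be sketched in a few lines. This is an illustration only, not the runtime's code: the function name `backoff_wait` is hypothetical, and the sketch assumes the exponent is counted from zero, which matches the sample line (attempt 3 waits about 4 s) but is not confirmed by this page.

```python
import random
from typing import Optional


def backoff_wait(
    attempt: int,
    base_factor: float = 1.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before the next retry.

    `attempt` is assumed to be zero-based (an assumption, see lead-in).
    """
    source = rng if rng is not None else random
    # Exponential part: base_factor * 2**attempt
    wait = base_factor * 2 ** attempt
    if jitter:
        # Uniform jitter in [0, base_factor]
        wait += source.uniform(0, base_factor)
    return wait


# Deterministic schedule with jitter disabled, for the first four retries.
schedule = [backoff_wait(a, jitter=False) for a in range(4)]
print(schedule)
```

With jitter enabled, each value lands somewhere between the deterministic wait and that wait plus base_factor.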

Tool-call failures

There are no tool-level retries. When a tool raises, one of two things happens.

Caught exceptions (the default, unless the tool overrides with catch=) become a structured error result the agent sees on its next step. The agent typically adapts: it corrects its arguments, picks a different tool, or gives up gracefully.

Uncaught exceptions (tools that opt out with catch=False or a narrower exception list) abort the whole turn. The transcript shows a ToolError row labeled with the tool’s display name and the exception message; stop_reason becomes "error". Send a follow-up prompt to continue, or let the agent try again in a new turn.
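The difference between the two paths can be sketched as a small dispatcher. Everything here is hypothetical (`ToolResult`, `call_tool`, and the tuple form of `catch` are stand-ins, not the Dreadnode API); the point is only that a caught exception becomes data for the agent, while an uncaught one propagates and ends the turn.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple, Type


@dataclass
class ToolResult:
    ok: bool
    content: Any


def call_tool(
    fn: Callable[..., Any],
    args: Dict[str, Any],
    catch: Tuple[Type[BaseException], ...] = (Exception,),
) -> ToolResult:
    """Run a tool; exceptions listed in `catch` become an error result.

    An empty tuple stands in for catch=False: nothing is caught,
    so any exception propagates to the caller.
    """
    try:
        return ToolResult(ok=True, content=fn(**args))
    except catch as exc:
        return ToolResult(ok=False, content=f"{type(exc).__name__}: {exc}")


def divide(a: float, b: float) -> float:
    return a / b


# Caught: the agent would see this result on its next step.
caught = call_tool(divide, {"a": 1, "b": 0})

# Uncaught: a narrower catch list lets ZeroDivisionError escape.
try:
    call_tool(divide, {"a": 1, "b": 0}, catch=(ValueError,))
    escaped = False
except ZeroDivisionError:
    escaped = True
```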

Stop reasons

Every turn ends with one of four stop_reason values. The transcript describes the first three; the fourth is rarer but worth knowing.

Reason	When it happens
`finished`	Clean completion — the agent stopped because it was done
`max_steps_reached`	Step budget exhausted in autonomous mode. Surfaces as “reached the maximum number of steps. Send a follow-up message to continue”
`error`	An exception propagated past the agent loop: an uncaught tool exception, a non-retryable model error, or exhausted retries
`stalled`	The model returned assistant text with no tool calls while stop conditions were configured and none fired — the agent “ran out of ideas” without hitting its completion criterion

stalled only fires when the agent is running with explicit stop conditions (typically a configured goal or finish tool). A default interactive session won’t produce it.

Network drops

LLM responses are not streamed incrementally: Dreadnode uses non-streamed acompletion calls. A mid-call network drop surfaces as an APIConnectionError and falls into the retry loop; the whole call is reissued from scratch.

The TUI-to-runtime connection is streamed. If it drops, the client reconnects and resubscribes using the last sequence number it saw. If the server’s ring buffer has rolled past that sequence, the session is marked stale in the session browser and the context bar shows “replay gap after N”. The badge is informational; there is no automatic backfill. The next event you see may skip over some history, but the transcript as stored on the platform is intact.
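The resubscribe behaviour can be pictured with a bounded buffer keyed by sequence number. The sketch below is a hypothetical model (`EventBuffer` and `replay_after` are invented names, not the server's implementation): it replays whatever is still buffered and reports a gap when events after the client's last sequence have already been evicted.

```python
from collections import deque
from typing import Any, List, Tuple


class EventBuffer:
    """Fixed-size ring buffer of (sequence, payload) pairs."""

    def __init__(self, capacity: int) -> None:
        # deque with maxlen drops the oldest entry once full.
        self._events: deque = deque(maxlen=capacity)
        self._next_seq = 1

    def append(self, payload: Any) -> int:
        seq = self._next_seq
        self._events.append((seq, payload))
        self._next_seq += 1
        return seq

    def replay_after(self, last_seen: int) -> Tuple[List[Tuple[int, Any]], bool]:
        """Return (events newer than last_seen, whether some were lost)."""
        events = [(s, p) for s, p in self._events if s > last_seen]
        oldest = self._events[0][0] if self._events else self._next_seq
        # A gap exists if the oldest buffered event is not the one
        # immediately following what the client last saw (or older).
        gap = oldest > last_seen + 1
        return events, gap


buf = EventBuffer(capacity=3)
for name in ["a", "b", "c", "d", "e"]:
    buf.append(name)  # sequences 1..5; only 3, 4, 5 remain buffered

recent, recent_gap = buf.replay_after(3)  # client is nearly caught up
stale, stale_gap = buf.replay_after(1)    # event 2 has been evicted
```

In this model the second client receives events 3 through 5 and a gap flag, which corresponds to the stale badge described above.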

Cancelling a turn

Esc walks the escape ladder. When the agent is busy, the final step cancels the in-flight turn: the local asyncio task is cancelled and a cancel request is sent to the runtime, which cancels the task wrapping the model call. Ctrl+Q does the same thing — press once to cancel, twice within three seconds to quit. (Ctrl+C is reserved for copying selected text; see Selecting and copying text.)

An in-flight tool call is force-marked errored in the session state when the turn is cancelled. The agent sees the error on resume if you send another message.
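The cancellation path relies on standard asyncio task cancellation. The following sketch uses only the standard library; `slow_tool`, `run_turn`, and the `state` dictionary are hypothetical stand-ins for the runtime's turn and session state, shown to illustrate how an in-flight call can be marked errored before the cancellation propagates.

```python
import asyncio
from typing import Dict


async def slow_tool() -> str:
    # Stand-in for a long-running tool or model call.
    await asyncio.sleep(60)
    return "done"


async def run_turn(state: Dict[str, str]) -> str:
    state["tool_call"] = "running"
    try:
        return await slow_tool()
    except asyncio.CancelledError:
        # Mark the in-flight call as errored, then re-raise so the
        # task still finishes in the cancelled state.
        state["tool_call"] = "errored"
        raise


async def main() -> Dict[str, str]:
    state: Dict[str, str] = {}
    task = asyncio.create_task(run_turn(state))
    await asyncio.sleep(0)  # yield once so the turn starts running
    task.cancel()           # what a cancel request might trigger
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state


def demo() -> Dict[str, str]:
    return asyncio.run(main())
```

Re-raising CancelledError after the cleanup keeps the task in its cancelled state, which is the behaviour asyncio expects from well-behaved coroutines.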

Distinct error types in the transcript

Different error sources render with different titles so you can tell them apart:

Title	Source
`generation`	Model call errored. Body includes provider-classified error type; auth-key bodies are split out so you don’t paste a key into a chat
`tool-name`	A tool raised. Title is the tool’s display label, body is the exception message
`agent`	The agent loop itself threw — rare
`runtime`	The runtime process errored. A 401 triggers re-authentication

All render the same way — a ✗ marker, the title, and the body.