Agent reconnection and job recovery
When the orchestrator restarts (planned upgrade, crash, or resource pressure), agents lose their WebSocket connection but continue executing in-flight jobs. This page documents the recovery protocol that lets jobs survive orchestrator restarts: grace periods, per-job recovery timers, inFlightJobs reporting, log continuity with gap markers, startup recovery, and failure modes.
Overview
Section titled “Overview”When an agent disconnects, each in-flight job enters a recovering state with a per-job grace timer. If the agent reconnects before the timer expires, the job is restored and continues streaming logs (with a gap marker covering the disconnected window). If the timer fires, the job is permanently failed with a specific error message. This lets jobs survive orchestrator restarts cleanly — without recovery, every disconnect would mark every dispatched job as failed even though the agent is still executing.
Key design decisions:
- Grace period is auto-derived: 2x the max reconnection delay (default 60s max delay = 120s grace)
- Per-job timers, not a global sweep — each job gets its own deadline from when its agent disconnected
- No separate configuration knob for the grace period (less for operators to think about)
recoveringis a real DB column value indispatch_queue— operators can query it directly
Recovery protocol
Section titled “Recovery protocol”Sequence diagram
Section titled “Sequence diagram”Step-by-step flow
Section titled “Step-by-step flow”-
Agent disconnect detection: The orchestrator detects the WebSocket
closeevent for a registered agent. -
Disconnect triage by agent kind:
Dispatcher.onAgentDisconnect()first checks whether the agent was spawned by a scaler backend.- Static agents (a static agent can reconnect after a network blip) get per-job recovery timers — for each in-flight job the dispatcher transitions
dispatch_queue.statusfromdispatchedtorecovering, creates asetTimeouttimer with the grace period (2xmaxReconnectDelayMs), and stores recovery metadata (agentId,runId,timer,disconnectedAt). The remaining steps describe this path. - Scaler-managed agents are single-use — the scaler destroys them on disconnect, so reconnection is impossible and a recovery window cannot succeed. Their in-flight jobs are triaged instead: a job that never reached
runningis immediately requeued for another agent (bumpingdispatch_queue.dispatch_attempts), while a job that had already started executing is failed fast (steps may have external side effects, so silently re-running it is not safe). See Scaler-managed agent disconnect under failure modes.
- Static agents (a static agent can reconnect after a network blip) get per-job recovery timers — for each in-flight job the dispatcher transitions
-
Agent reconnection: The agent detects the WS close and starts exponential backoff reconnection (1s base, 1.5x multiplier, jitter, 60s max). When the new orchestrator is ready, the agent connects.
-
inFlightJobs reporting: The agent sends
agent.registerwith an optionalinFlightJobsarray containing{ jobId, runId }entries for each currently executing job. -
Orchestrator reconciliation: The
reconcileInFlightJobshelper processes each reported job:- Calls
dispatcher.reconcileRecovery(jobId, agentId)which:- Claims the recovery timer (cancels the timeout)
- Transitions
dispatch_queue.statusfromrecoveringback todispatched - Restores in-memory
agentJobs/jobToAgenttracking
- Increments the agent’s
activeJobscount in the registry - Emits a structured log with
recovery_duration,agent_id,job_id - Sends
job.status: runningto updateexecution_jobsstate
- Calls
-
Log replay: The agent flushes its buffered events and log lines with original timestamps, preceded by a gap marker.
-
Job completion: The job continues normally. When it finishes, the agent sends
job.status(success/failed) and the orchestrator updates the execution state and check run as usual.
See
packages/orchestrator/src/agent/dispatcher.tsfor recovery timer management,packages/orchestrator/src/ws/agent-handler.tsforreconcileInFlightJobs, andpackages/agent/src/ws/orchestrator-client.tsfor inFlightJobs reporting.
Execution state machine
Section titled “Execution state machine”The execution state machine includes 11 states, with recovering being the state relevant to agent reconnection.
State diagram
Section titled “State diagram”The full 11-state execution state machine (showing the recovering state relevant to agent reconnection, plus cancelling, held, and waiting states added by other features):
Transition table
Section titled “Transition table”| From State | Event | To State | Trigger |
|---|---|---|---|
pending | ENQUEUE | queued | Job added to dispatch queue |
pending | CANCEL | cancelled | Cancelled before dispatch |
pending | SKIP | skipped | Rule evaluation skips the job |
pending | HOLD | held | Protection rule: reviewer required |
pending | WAIT | waiting | Protection rule: wait timer |
queued | START | running | Agent begins execution |
queued | FAIL | failed | Queue error or agent failure |
queued | CANCEL | cancelled | Cancelled while queued |
running | SUCCEED | success | Job completed successfully |
running | FAIL | failed | Job execution failed |
running | CANCEL | cancelled | Job cancelled during execution |
running | CANCEL_GRACEFUL | cancelling | Graceful cancellation with cleanup hooks |
running | RECOVER | recovering | Agent disconnected (orchestrator restart) |
recovering | START | running | Agent reconnected, job reclaimed |
recovering | FAIL | failed | Grace period expired (recovery timeout) |
recovering | CANCEL | cancelled | Job cancelled during recovery |
cancelling | CANCEL_FORCE | cancelled | Force cancel after grace period |
cancelling | COMPLETE | cancelled | Cleanup hooks completed |
cancelling | FAIL | failed | Cleanup hook failure |
held | APPROVE | queued | Reviewer approved |
held | REJECT | cancelled | Reviewer rejected |
held | EXPIRE | cancelled | Hold expiry exceeded |
held | CANCEL | cancelled | Cancelled while held |
waiting | TIMER_DONE | queued | Wait timer completed |
waiting | CANCEL | cancelled | Cancelled while waiting |
Key properties
Section titled “Key properties”recoveringis non-terminal:TERMINAL_STATESremains['success', 'failed', 'cancelled', 'skipped']. Jobs inrecoveringcan still transition torunning,failed, orcancelled.- RECOVER only from
running: Prevents double-recover. A job already inrecoveringcannot receive RECOVER again. - START reused for reconnection: The
recovering -> runningtransition reuses the existing START event rather than introducing a new RECONNECT event. cancellingsupports graceful shutdown: Running jobs can entercancellingvia CANCEL_GRACEFUL, allowing cleanup hooks to run before the finalcancelledstate.heldandwaitingare non-terminal: These states support environment protection rules (reviewer gates and wait timers).
See
packages/engine/src/state-machine/types.tsandpackages/engine/src/state-machine/machine.tsfor the implementation.
Log continuity
Section titled “Log continuity”Gap markers
Section titled “Gap markers”When the agent reconnects and flushes buffered messages, it inserts a gap marker as a log line before the replayed content. The gap marker gives operators full context about the outage.
Format:
--- Orchestrator offline for 45s. Replaying 12 buffered events and 238 buffered log lines. ---If buffer overflow occurred during the outage:
--- Orchestrator offline for 120s. Replaying 0 buffered events and 5000 buffered log lines. 2847 log lines dropped due to buffer overflow. ---Buffer limits
Section titled “Buffer limits”| Buffer | Max Size | Purpose |
|---|---|---|
| Event buffer | 5,000 | Protocol messages (heartbeats, etc.) via send() |
| Log buffer | 10,000 | Log lines from step execution via streamLog() |
Overflow handling
Section titled “Overflow handling”When a buffer reaches capacity:
- The oldest item is dropped (
RingBuffer.shift()) - A
droppedCountcounter is incremented - On flush, the gap marker includes the exact count of dropped items
resetDroppedCount()is called after the gap marker is sent
All buffered messages carry their original timestamps from when the agent produced them, not the replay time.
See
packages/shared/src/ring-buffer.tsfordroppedCounttracking, andpackages/agent/src/ws/orchestrator-client.tsfor gap marker insertion and buffer flush.
Startup recovery
Section titled “Startup recovery”When the orchestrator starts (or restarts), it must handle jobs from the previous instance that are still in the database.
Recovery flow on startup
Section titled “Recovery flow on startup”-
Orphaned
recoveringjobs:StaleRunDetector.cleanupOrphanedRecoveryJobs()finds any jobs left inrecoveringstate from a previous orchestrator instance. These are permanently failed with the message: “Job failed: orchestrator restarted during recovery (recovery state lost)”. This runs before the first stale detection scan. -
Dispatched jobs from previous instance: After cleanup, the startup scan queries
dispatch_queue WHERE status = 'dispatched'. For each:- The job is transitioned to
recovering - A new recovery timer is created with
agentId = 'unknown'(the previous orchestrator’s in-memory agent mapping is lost) - If the agent reconnects and claims the job, it is restored normally
- If the timer expires, the job is permanently failed
- The job is transitioned to
-
Interaction with stale detection: The stale run detector queries for
status = 'running'jobs. Since recovering jobs are inrecoveringstate (notrunning), they are naturally excluded from stale detection scans.
See
packages/orchestrator/src/stale-detector/stale-run-detector.tsforcleanupOrphanedRecoveryJobs(), andpackages/orchestrator/src/orchestrator-core.tsfor the startup recovery scan.
Failure modes
Section titled “Failure modes”Grace period expiry
Section titled “Grace period expiry”| Aspect | Behavior |
|---|---|
| Trigger | Recovery timer fires (agent did not reconnect within grace period) |
| Job outcome | Permanently failed |
| Error message | ”Job failed: agent disconnected and did not reconnect within the recovery window” |
| DB transition | recovering -> failed (via markFailedIfRecovering — optimistic concurrency) |
| Partial logs | Any logs received before the outage are preserved in the execution report |
| Scaler | onJobFailedPermanently callback notifies the scaler to avoid spinning up replacements |
Scaler-managed agent disconnect
Section titled “Scaler-managed agent disconnect”Scaler-managed agents are single-use and destroyed on disconnect, so they never enter the recovery window. Their in-flight jobs are triaged immediately:
| Aspect | Behavior |
|---|---|
| Never-started job | Requeued to pending (bumps dispatch_queue.dispatch_attempts) and immediately re-dispatched to another agent, or handed to the scaler so a fresh agent is spawned bound to it |
| Started job | Failed fast — “Job failed: scaler-managed agent disconnected mid-execution”; not re-run, since steps may have external side effects |
| Attempt cap | A job re-dispatched more than MAX_DISPATCH_ATTEMPTS (5) times is failed permanently instead of requeued again; expires_at is the time-based backstop |
| No dead wait | There is no 2-minute recovery timeout for these agents — the outcome (requeue or fail) is decided at disconnect |
Orchestrator crash during recovery
Section titled “Orchestrator crash during recovery”| Aspect | Behavior |
|---|---|
| Trigger | Orchestrator crashes while recovery timers are active |
| Job outcome | recovering state persists in DB until next startup |
| Startup cleanup | cleanupOrphanedRecoveryJobs fails all orphaned recovering jobs on next startup |
| Agent behavior | Agent continues executing, buffers messages, and reconnects to the new instance |
Agent crash during recovery
Section titled “Agent crash during recovery”| Aspect | Behavior |
|---|---|
| Trigger | Agent crashes while the orchestrator is waiting for reconnection |
| Job outcome | Recovery timer expires, job permanently failed |
| Alternative | If agent restarts with no in-flight jobs, it registers fresh (no recovery needed) |
| Stale detection | If the agent never reconnects, the stale run detector handles cleanup after timers |
Race between timer and reconnection
Section titled “Race between timer and reconnection”| Aspect | Behavior |
|---|---|
| Trigger | Agent reconnects at nearly the same moment the recovery timer fires |
| Resolution | Optimistic concurrency via claimRecovery |
| Timer path | markFailedIfRecovering: UPDATE WHERE status = 'recovering' — 0 rows if already claimed |
| Reconnect path | claimRecovery deletes from recoveringJobs Map + clearTimeout atomically |
| Guarantee | Exactly one path succeeds — the Map serves as the coordination point |
Observability
Section titled “Observability”Structured log fields
Section titled “Structured log fields”Recovery events are logged with the following structured fields for ELK dashboards and alerting:
| Field | Type | Description |
|---|---|---|
recovery_duration | number | Milliseconds between disconnect and reconnect |
agent_id | string | Agent that reconnected |
job_id | string | Job that was recovered |
run_id | string | Execution run ID |
buffered_messages_count | number | Messages buffered during outage |
Database queries
Section titled “Database queries”Operators can query the dispatch_queue table to monitor recovery state:
-- Count jobs currently in recoverySELECT count(*) FROM dispatch_queue WHERE status = 'recovering';
-- Find recovering jobs with their ageSELECT id, run_id, created_at, now() - created_at AS ageFROM dispatch_queueWHERE status = 'recovering'ORDER BY created_at;
-- Check recent recovery timeoutsSELECT id, run_id, error_message, updated_atFROM dispatch_queueWHERE status = 'failed' AND error_message LIKE '%recovery timeout%'ORDER BY updated_at DESCLIMIT 10;ELK dashboard suggestions
Section titled “ELK dashboard suggestions”- Recovery rate: Count log events with message “Job recovered from agent reconnection” per time window
- Average recovery duration: Aggregate
recovery_durationfield from recovery log events - Recovery timeouts: Count events with message containing “recovery timeout exceeded”
- Buffer overflow frequency: Count gap markers that include “dropped due to buffer overflow”
See also
Section titled “See also”- Reconnection and Event Buffering — WebSocket reconnection behavior and message buffering
- State Machine — Full execution state machine reference
- Job Execution Lifecycle — Agent job lifecycle from dispatch to cleanup
- Stale Detection — Stale run detector behavior
- Orchestrator Configuration — Orchestrator settings