Skip to content

Reconnection and event buffering

Both the orchestrator (connecting upstream to KiCI) and the agent (connecting to the orchestrator) implement automatic reconnection with exponential backoff and event buffering. This page documents the agent↔orchestrator connection in full.

The agent connects outbound to the orchestrator’s WebSocket endpoint. When agent authentication is enabled (agentAuth: 'token'), the agent must first authenticate with a PSK token before registering. The agent sends agent.register and waits for a register.ack response from the orchestrator before transitioning to the registered state.

See packages/agent/src/ws/orchestrator-client.ts for the OrchestratorClient implementation.

Connection states: disconnected -> connecting -> authenticating (if token auth) -> registering -> registered

Key behaviors:

  • When token auth is enabled: first message must be auth.request within 5 seconds (code 4002 AUTH_TIMEOUT), then agent.register within 10 seconds
  • When auth is disabled: first message must be agent.register within 10 seconds (code 4002 AUTH_TIMEOUT)
  • Orchestrator sends register.ack with confirmed configuration after registration
  • Agent transitions to registered state upon receiving register.ack
  • Heartbeats start after register.ack is received
  • Event and log buffers flush after register.ack is received
  • Status messages (job.status, step.status) use sendDirect() (bypasses both buffers)
  • Log lines go through streamLog() into a dedicated LogBuffer (separate from the protocol EventBuffer)

Both layers use the same exponential backoff algorithm with jitter, implemented as a shared getReconnectDelay() function in @kici-dev/shared.

Parameters:

ParameterValue
Initial delay1,000 ms
Multiplier1.5x per attempt
Jitter0-50% additional randomness
Maximum delay60,000 ms (60 seconds)
Maximum attemptsUnlimited (reconnects indefinitely)

Formula:

delay = min(baseDelay * multiplier^attempts * (1.0 + random * 0.5), maxDelay)

Where random is uniformly distributed between 0 and 1.

Example progression:

AttemptBase delayWith jitter (range)
01,000 ms1,000 - 1,500 ms
11,500 ms1,500 - 2,250 ms
22,250 ms2,250 - 3,375 ms
33,375 ms3,375 - 5,062 ms
57,593 ms7,593 - 11,390 ms
1057,665 ms57,665 - 60,000 ms
11+60,000 ms60,000 ms (capped)

The reconnect counter resets to 0 on successful upstream authentication or agent registration.

See packages/shared/src/reconnect-delay.ts for the shared implementation.

Reconnection triggers:

  • WebSocket close event (unless intentional disconnect)
  • WebSocket error event (closes the connection, then close triggers reconnect)
  • Auth failure (server closes connection, reconnect scheduled)

Reconnection does not trigger on:

  • Intentional disconnect (graceful shutdown via disconnect())
  • Normal closure code 1000 initiated by the client

During disconnection, messages are buffered in memory and flushed in order on reconnect. Both layers use the same EventBuffer pattern: a wrapper around RingBuffer from @kici-dev/shared with bounded capacity and oldest-first overflow.

The orchestrator’s upstream connection buffers execution events and log chunks during disconnection (10,000-message ring buffer, oldest-first overflow), flushes them after the upstream auth handshake completes, and bypasses the buffer for protocol ACKs that must be sent only when the link is up.

OrchestratorClient buffers (Agent to Orchestrator)

Section titled “OrchestratorClient buffers (Agent to Orchestrator)”

The agent uses two separate buffers:

EventBuffer (protocol messages):

PropertyValue
Maximum size5,000 messages
Overflow behaviorOldest message dropped (via RingBuffer)
Flush triggerAfter register.ack received
Bypass mechanismsendDirect() for status messages (not buffered)
Message types bufferedjob.heartbeat and other protocol messages

LogBuffer (log lines):

PropertyValue
Maximum size10,000 lines
Overflow behaviorOldest line dropped (via RingBuffer)
Flush triggerAfter register.ack received
Message types bufferedLog lines from step execution

Status messages (job.status, step.status) use sendDirect() and bypass both buffers. Log lines go through streamLog() into the dedicated LogBuffer, not the EventBuffer.

See packages/agent/src/ws/event-buffer.ts for the implementation.

When the buffer reaches its maximum size:

  1. The oldest message is removed by the RingBuffer (FIFO eviction)
  2. The new message is appended
  3. The add() method returns false to indicate overflow

This is a deliberate trade-off: recent messages are more valuable than old ones (a log line from 30 seconds ago is more useful than one from 5 minutes ago).

Both layers use periodic heartbeats to detect stale connections. The client sends heartbeats; the server monitors them.

The orchestrator sends a 30-second heartbeat upstream to KiCI; if the upstream side stops seeing heartbeats it marks the connection unhealthy at 90 seconds and closes it with code 4004 (HEARTBEAT_TIMEOUT) at 180 seconds.

ParameterValueSource
Send interval30 secondsOrchestratorClient.heartbeatIntervalMs
MonitorOrchestrator agent handlerupdateHeartbeat() in registry

The orchestrator updates the agent’s lastHeartbeatAt timestamp in the agent registry when a heartbeat is received. The agent handler uses this for connection health tracking.

See packages/orchestrator/src/ws/agent-handler.ts for heartbeat handling.

Both layers enforce timeouts on initial messages after connection:

LayerPhaseTimeoutRequired messageClose code
OrchestratorAuth (if token)5 secondsauth.request4002 (AUTH_TIMEOUT)
OrchestratorRegistration10 secondsagent.register4002 (AUTH_TIMEOUT)

If the required message is not received within the timeout, the server closes the connection. The client detects the close and schedules a reconnect via the exponential backoff algorithm.

When a connection is re-established after a disconnection:

Agent reconnects to Orchestrator:

  1. New WebSocket connection opened
  2. If token auth enabled: agent sends auth.request with PSK token, orchestrator validates and responds with auth.success
  3. Agent sends agent.register with the same agentId, labels, and maxConcurrency
  4. Orchestrator updates the registry entry (re-registration), sends register.ack
  5. Agent resets reconnect counter, starts heartbeat
  6. Event buffer flushed (all buffered log.chunk messages sent)
  7. In-progress jobs on the agent continue executing — status updates resume on reconnect
AspectBehavior
DetectionOrchestrator receives WebSocket close event
BufferingExecution events and log chunks buffered (up to 10,000)
ReconnectAutomatic via exponential backoff
Webhook impactWebhook deliveries during downtime return HTTP 500 to the provider; provider retry policy applies
Data loss riskIf buffer overflows during extended outage, oldest events dropped
AspectBehavior
DetectionAgent receives WebSocket close event
BufferingLog chunks and events buffered (up to 5,000 events, 10,000 log lines) with gap markers on flush
ReconnectAutomatic via exponential backoff
In-progress jobsEnter recovering state with per-job recovery timers (grace period = 2x max reconnect delay). Agent reconnects with in-flight job list, orchestrator reconciles and restores tracking
Status updatesBuffered with gap markers. On reconnect, gap marker inserted showing outage duration and buffered message counts, then buffer flushed
Recovery timeoutIf agent does not reconnect within grace period (default 120s), jobs permanently failed with error: “Job failed: agent lost during orchestrator restart (recovery timeout exceeded)“
Upstream impactThe orchestrator drops its upstream connection during the restart and re-authenticates on startup.
AspectBehavior
DetectionHeartbeat timeout triggers close (90s unhealthy, 180s close)
BufferingBoth sides buffer messages during partition
RecoveryAuto-reconnect after partition resolves
Duration limitNo hard limit — reconnects indefinitely with backoff capped at 60s
AspectBehavior
DetectionOrchestrator receives WebSocket close event for the agent
Job fateJobs enter recovering state with per-job recovery timers. If agent reconnects within grace period (2x max reconnect delay), jobs resume. If not, jobs fail with timeout error
RecoveryAgent reports in-flight jobs via inFlightJobs on agent.register. Orchestrator reconciles against DB state, cancels timers, restores tracking
Timeout behaviorGrace period expiry permanently fails jobs with: “Job failed: agent lost during orchestrator restart (recovery timeout exceeded)”. Scaler notified to avoid spinning up replacements
Log continuityBuffered logs replayed with gap marker showing outage duration and buffer stats. Original timestamps preserved

See packages/orchestrator/src/agent/dispatcher.ts (onAgentDisconnect, reconcileRecovery) for the recovery timer behavior and Agent Reconnection and Job Recovery for the full protocol.

AspectBehavior
CauseAPI key rotated, revoked, or expired during disconnection
Detectionauth.failure message received from the upstream tier
BehaviorConnection closed, reconnect scheduled
ResolutionReconnect will continue retrying indefinitely (operator must fix credentials)
Buffer impactMessages remain buffered during auth failure retry loop