Skip to content

Dashboard, metrics & wire format

This page documents the WebSocket messages that flow on the agent↔orchestrator and orchestrator↔orchestrator (peer-to-peer) channels, plus the wire format invariants every tier obeys.

These messages enable job-level concurrency control. When an agent discovers a job belongs to a concurrency group (from the workflow definition), it reports to the orchestrator, which decides the action.

Authoritative source: packages/engine/src/protocol/messages/orchestrator-agent.ts

Direction: Agent -> Orchestrator

Agent reports that a job belongs to a concurrency group, requesting permission to proceed.

FieldTypeRequiredDescription
type"job.concurrency.report"YesMessage discriminator
messageIdstringYesUnique message ID
runIdstringYesExecution run ID
jobIdstringYesJob ID within the run
groupstringYesConcurrency group name

Direction: Orchestrator -> Agent

Orchestrator acknowledges concurrency report with action to take.

FieldTypeRequiredDescription
type"job.concurrency.ack"YesMessage discriminator
requestIdstringYesCorrelation ID from the report
actionenumYesOne of: proceed, wait, cancel
reasonstringNoHuman-readable reason for the action
runIdstringNoRun identifier echoed back on unsolicited slot-release wake-ups (see note)
jobIdstringNoJob identifier echoed back on unsolicited slot-release wake-ups (see note)

The orchestrator sends job.concurrency.ack in two situations: (1) as a direct response to a job.concurrency.reportrequestId correlates to the report and action is proceed, wait, or cancel; and (2) as an unsolicited follow-up after the agent received wait. When a slot in the concurrency group is released (by completion, cancellation, or superseding), the orchestrator picks the FIFO-next queued waiter and sends { action: 'proceed' } to the agent still parked on its second waitForConcurrencyAck call. runId and jobId are present in this case so the agent can sanity-check the wake-up matches its job.

These messages support periodic metrics push from agents to the orchestrator. The orchestrator aggregates agent metrics and exposes them via its Prometheus scrape endpoint.

Authoritative source: packages/engine/src/protocol/messages/orchestrator-agent.ts

Direction: Agent -> Orchestrator

Periodic metrics push from agent to orchestrator. Each metric includes its type, value, labels, and optional histogram buckets.

FieldTypeRequiredDescription
type"agent.metrics"YesMessage discriminator
messageIdstringYesUnique message ID
agentIdstringYesAgent identifier
metricsMetricEntry[]YesArray of metric data points
timestampnumberYesUnix timestamp (milliseconds)

Each MetricEntry:

FieldTypeRequiredDescription
namestringYesMetric name
typeenumYesOne of: counter, histogram, gauge, upDownCounter
valuenumberNoMetric value (for counters and gauges)
labelsRecord<string, string>NoMetric labels
buckets[{ le: number, count: number }]NoHistogram bucket boundaries and counts
countnumberNoHistogram observation count
sumnumberNoHistogram observation sum

Authoritative source: packages/engine/src/protocol/messages/orchestrator-agent.tsagentMetricsSchema

Authentication messages for the agent <-> orchestrator WebSocket connection.

Authoritative source: packages/engine/src/protocol/messages/orchestrator-agent.ts

Auth request sent by agent to orchestrator when connecting.

FieldTypeRequiredDescription
type"auth.request"YesMessage discriminator
tokenstringYesAuthentication token
protocolVersionnumberYesProtocol version (positive integer)

Auth success response sent by orchestrator to agent.

FieldTypeRequiredDescription
type"auth.success"YesMessage discriminator
connectionIdstringYesAssigned connection ID

Auth failure response sent by orchestrator to agent.

FieldTypeRequiredDescription
type"auth.failure"YesMessage discriminator
reasonstringYesHuman-readable failure reason

Authoritative source: packages/engine/src/protocol/messages/orchestrator-agent.tsagentAuthFailureSchema

Generic request-response envelope for typed API calls over the agent WebSocket. New methods are registered in AgentApiRegistry on the orchestrator side — no protocol schema changes are needed per method.

Authoritative source: packages/engine/src/protocol/messages/orchestrator-agent.tsagentApiRequestSchema, agentApiResponseSchema

API request sent by agent to orchestrator (e.g., infrastructure.list).

FieldTypeRequiredDescription
type"agent.api.request"YesMessage discriminator
requestIdstringYesUUID for correlating the response
methodstringYesDot-namespaced method name (e.g., infrastructure.list)
paramsRecord<string, unknown>NoMethod-specific parameters (defaults to {})

API response sent by orchestrator to agent.

FieldTypeRequiredDescription
type"agent.api.response"YesMessage discriminator
requestIdstringYesMatches the original request’s requestId
resultunknownNoMethod result (present on success)
errorstringNoError description (present on failure)

Join messages enable zero-knowledge cluster bootstrap. A new orchestrator sends a join.request with a join token, and an existing orchestrator responds with an AES-256-GCM encrypted config bundle. The Platform relay sees only the token’s cleartext routing part and opaque ciphertext — zero knowledge of customer configuration.

Authoritative source: packages/engine/src/protocol/messages/join.ts

New orchestrator -> existing orchestrator (via Platform relay or direct peer)

Section titled “New orchestrator -> existing orchestrator (via Platform relay or direct peer)”

Sent by a new orchestrator to request cluster config from an existing orchestrator.

FieldTypeRequiredDescription
type"join.request"YesMessage discriminator
messageIdstringNoCorrelation ID for Platform relay routing (injected by Platform to match response to correct joiner)
tokenstringYesFull join token: kici_join_v1.<base64url_routing>.<secret_hex>

Existing orchestrator -> new orchestrator (via Platform relay or direct peer)

Section titled “Existing orchestrator -> new orchestrator (via Platform relay or direct peer)”

Response from an existing orchestrator with an encrypted config bundle or error.

FieldTypeRequiredDescription
type"join.response"YesMessage discriminator
messageIdstringNoCorrelation ID echoed from join.request for relay routing
successbooleanYesWhether the join was successful
encryptedBundlestringNoBase64-encoded AES-256-GCM encrypted config bundle (on success)
errorstringNoError message (on failure)

The Browser ↔ Platform protocol (browser-side WebSocket auth, log subscriptions, status fan-out) is documented in the internal docs.

Orchestrator <-> Orchestrator messages (peer-to-peer)

Section titled “Orchestrator <-> Orchestrator messages (peer-to-peer)”

This layer carries cluster coordination messages between orchestrator instances via direct WebSocket connections on /ws/peer. These messages never transit the Platform tier — see Communication Topology.

Authoritative source: packages/engine/src/protocol/messages/peer.ts

Peer authentication uses a 4-message ECDH handshake. The initiator (connecting orchestrator) sends peer.hello, the responder (receiving orchestrator) replies with peer.hello.response, then the initiator sends an encrypted peer.auth.request, and the responder replies with an encrypted peer.auth.response. All auth material is transmitted over the ECDH-encrypted channel.

Sent by the initiator (connecting orchestrator) to start the ECDH handshake. Provides the initiator’s ephemeral ECDH public key and a nonce.

FieldTypeRequiredDescription
type"peer.hello"YesMessage discriminator
ephemeralPublicKeystringYesInitiator’s ephemeral X25519 public key (base64, DER SPKI)
noncestringYesBase64-encoded 32-byte random nonce (HKDF salt)

Sent by the responder (receiving orchestrator) in response to peer.hello. Provides the responder’s ephemeral ECDH public key. After this exchange, both sides derive a shared session key.

FieldTypeRequiredDescription
type"peer.hello.response"YesMessage discriminator
ephemeralPublicKeystringYesResponder’s ephemeral X25519 public key (base64, DER SPKI)

Sent by the connecting orchestrator after the ECDH handshake. Encrypted with the shared session key. Contains either a join token (first connection) or an HMAC credential proof (subsequent connections). The receiving orchestrator must respond within 15 seconds or the connection is closed.

FieldTypeRequiredDescription
type"peer.auth.request"YesMessage discriminator
instanceIdstringYesSender’s cluster instance ID
protocolVersionnumberYesProtocol version
tokenstringConditionalJoin token (first connection only)
proofstringConditionalHMAC proof of credential ownership (subsequent connections)
softwareVersionstringNoSoftware version of the connecting peer (for version compat check)
roleenumNoRole of the connecting peer: coordinator or worker

One of token or proof must be present.

Response to a peer authentication request. Encrypted with the shared session key.

FieldTypeRequiredDescription
type"peer.auth.response"YesMessage discriminator
acceptedbooleanYesWhether authentication succeeded
instanceIdstringNoResponder’s cluster instance ID
sessionCredentialstringNoIssued credential for future connections (first join)
rolestringNoConfirmed role of the peer
reasonstringNoRejection reason (if not accepted)
softwareVersionstringNoSoftware version of the coordinator (for version compat check)
agentsPeerAgentSummary[]NoResponder’s connected agent inventory (present when accepted=true)
scalerCapacityScalerCapacitySummary[]NoResponder’s scaler capacity (present when accepted=true)
capabilitiesPeerCapabilitiesNoResponder’s feature capabilities (present when accepted=true)

Sent every 30 seconds by each peer. Carries agent inventory, scaler capacity, and Raft consensus state. This is the primary mechanism for routing decisions — the coordinator uses peer heartbeats to determine which peers can handle a job’s label requirements.

FieldTypeRequiredDescription
type"peer.heartbeat"YesMessage discriminator
instanceIdstringYesSender’s cluster instance ID
termnumberYesCurrent Raft term
leaderIdstring or nullYesKnown Raft leader’s instance ID
drainingbooleanYesWhether the sender is gracefully shutting down
agentsPeerAgentSummary[]YesConnected agent inventory
capabilitiesPeerCapabilitiesYesFeature flags (S3 log access, log routing override)
scalerCapacityScalerCapacitySummary[]NoOn-demand backend capacity for routing decisions
configVersionnumberNoShared config version for sync detection
registryVersionnumberNoRegistry version for cross-orchestrator registration sync
timestampnumberYesUnix timestamp (milliseconds)
hostnamestringNoMachine hostname (os.hostname())
osReleasestringNoOS kernel release (os.release())
totalMemoryMbnumberNoTotal system memory in MiB
memoryUsedMbnumberNoUsed memory in MiB
memoryAvailableMbnumberNoAvailable memory in MiB
cpuCountnumberNoNumber of logical CPUs
uptimeSecondsnumberNoSystem uptime in seconds
nodeVersionstringNoNode.js version
runningAsUserstring or nullNoUsername of the OS user running the orchestrator
runningAsUidnumber or nullNoUID of the OS user running the orchestrator
versionstringNoOrchestrator version (e.g., "0.0.1")

PeerAgentSummary:

FieldTypeRequiredDescription
agentIdstringYesAgent identifier
labelsstring[]YesCapability labels
activeJobsnumberYesCurrently running jobs
maxConcurrencynumberYesMaximum concurrent jobs
platformstringYesOS platform (e.g., linux)
archstringYesCPU architecture (e.g., x64, arm64)

PeerCapabilities:

FieldTypeRequiredDescription
s3LogAccessbooleanYesWhether this peer has direct S3 log access
logRoutingOverrideenumNoOne of: direct, coordinator (overrides default routing)

ScalerCapacitySummary:

FieldTypeRequiredDescription
namestringNoScaler backend name (e.g., stg-worker-bare-metal)
typestringNoScaler backend type (e.g., bare-metal, container)
labelSetsstring[][]YesLabel sets this backend provisions
maxAgentsnumberYesMaximum agents for this backend
activeCountnumberYesCurrent active agent count

Raft leader election vote request. Sent by candidates during elections.

FieldTypeRequiredDescription
type"raft.vote.request"YesMessage discriminator
termnumberYesCandidate’s current term
candidateIdstringYesCandidate’s instance ID
lastLogIndexnumberYesCandidate’s last log index
lastLogTermnumberYesTerm of candidate’s last log entry

Response to a Raft vote request.

FieldTypeRequiredDescription
type"raft.vote.response"YesMessage discriminator
termnumberYesVoter’s current term
voteGrantedbooleanYesWhether the vote was granted
voterIdstringYesVoter’s instance ID

Raft leader heartbeat (no log entries — KiCI uses Raft for leader election only, not log replication).

FieldTypeRequiredDescription
type"raft.append.entries"YesMessage discriminator
termnumberYesLeader’s current term
leaderIdstringYesLeader’s instance ID

Sent by the coordinator to a peer when no local agent can handle a job. Contains the full resolved job configuration so the peer can dispatch without re-resolving.

FieldTypeRequiredDescription
type"job.reroute"YesMessage discriminator
messageIdstringYesUnique message ID for ACK correlation
jobIdstringYesPre-allocated job ID. The sending coordinator allocates it before reroute so its execution_runs / execution_jobs rows reference the same id the receiving peer will dispatch under.
runIdstringYesExecution run identifier
deliveryIdstringYesOriginal webhook delivery ID
routingKeystringYesProvider routing key
event, actionstring, string or nullYesWebhook event type and action
payloadRecord<string, unknown>YesFull webhook payload
jobNamestringYesJob to execute
workflowNamestringYesWorkflow containing the job
runsOnLabelsstring[][]YesLabel sets the job requires
excludeLabelsstring[]NoLabels that the dispatched agent must NOT have
triedConnectionsstring[]YesInstance IDs already tried (loop prevention)
maxHopsnumberYesMaximum allowed hops (default: 3)
coordinatorIdstringYesInstance ID of the run coordinator
jobConfigRecord<string, unknown>NoResolved job config (steps, rules, matrix, etc.)
repoUrlstringNoRepository clone URL
refstringNoGit ref (branch name)
shastringNoCommit SHA
providerstringNoProvider type (e.g., github)
providerContextRecord<string, unknown>NoProvider-specific context (e.g., installationId)
sourceTarUrlstringNoPre-signed .kici/ source tarball download URL (cache hit)
sourceTarHashstringNoWorkflow contentHash (used for drift verification, not tarball)
depsUrlstringNoPre-signed dependency tarball URL (cache hit)
depsHashstringNoSHA-256 of the dependency tarball bytes
cloneTokenstringNoPre-resolved clone token for workers without provider credentials
encryptedSecretsstringNoEncrypted secrets envelope (AES-256-GCM with session key)
encryptedNamespacedSecretsstringNoEncrypted namespaced secrets envelope
requestIdstringNoTrace ID for distributed tracing
traceIdstringNoAdditional trace context

Acknowledgment of a reroute request. The coordinator tries the next peer if rejected.

FieldTypeRequiredDescription
type"job.reroute.ack"YesMessage discriminator
messageIdstringYesID of the job.reroute being acknowledged
acceptedbooleanYesWhether the peer accepted the job
reasonstringNoRejection reason (if not accepted)

Sent by the worker orchestrator back to the coordinator as the agent reports job-level and step-level state changes. The coordinator uses this to drive its run-level state machine, update GitHub check runs, and track per-step progress.

The kind discriminator decides which ExecutionTracker call the receiver makes:

  • kind: "job"onJobStatus(runId, jobId, state, ...) — drives run transitions (running → success/failed/…). stepIndex/stepName are unused for this kind.
  • kind: "step"onStepStatus(runId, jobId, stepIndex, stepName, state, ...) — persists execution_steps rows.
FieldTypeRequiredDescription
type"job.progress"YesMessage discriminator
kind"job" | "step"YesWhether this update is a job-level or step-level transition
runIdstringYesExecution run identifier
jobIdstringYesJob identifier (matches the pre-allocated jobId from job.reroute)
jobNamestringYesHuman-readable job name
stepIndexnumberYesZero-based step index (used only when kind="step")
stepNamestringYesHuman-readable step name (used only when kind="step")
stateenumYesFull ExecutionJobStatus for kind="job"; the step-state subset (running/success/failed/skipped) for kind="step"
timestampnumberYesUnix timestamp (milliseconds)
dataRecord<string, unknown>NoOptional state-specific data

Sent by the coordinator to a worker to cancel a rerouted job. Used for fail-fast propagation and user-initiated cancellation.

FieldTypeRequiredDescription
type"peer.job.cancel"YesMessage discriminator
runIdstringYesExecution run identifier
jobIdstringNoSpecific job to cancel (omit for all jobs in run)
reasonstringYesHuman-readable cancellation reason
forcebooleanNoWhen true, force-cancel immediately without waiting for hooks

Graceful shutdown announcement. Peers remove the sender from their registry immediately upon receiving this message.

FieldTypeRequiredDescription
type"peer.leaving"YesMessage discriminator
instanceIdstringYesInstance ID of the leaving peer
termnumberYesCurrent Raft term for leader identification

Log and cache relay (coordinator-worker topology)

Section titled “Log and cache relay (coordinator-worker topology)”

Log chunk relay from worker to coordinator. Batched log lines from agent execution, forwarded when the worker does not have direct S3 log access.

FieldTypeRequiredDescription
type"peer.log.chunk"YesMessage discriminator
runIdstringYesExecution run ID
jobIdstringYesJob ID within the run
stepIndexnumberYesZero-based step index
linesLogLineEntry[]YesArray of log line entries

Each LogLineEntry:

FieldTypeRequiredDescription
textstringYesLog line text
timestampnumberYesUnix timestamp (milliseconds)
streamenumNoOne of: stdout, stderr (defaults to stdout)

Authoritative source: packages/engine/src/protocol/messages/peer.tspeerLogChunkSchema

Cache upload request from worker to coordinator. Worker agent needs an upload URL for cache storage.

FieldTypeRequiredDescription
type"peer.cache.upload.request"YesMessage discriminator
messageIdstringYesUnique message ID
runIdstringYesExecution run ID
jobIdstringYesJob ID within the run
cacheTypeenumYesOne of: bundle, deps
hashstringYesContent hash for cache keying
sizeBytesnumberYesSize of the artifact in bytes

Authoritative source: packages/engine/src/protocol/messages/peer.tspeerCacheUploadRequestSchema

Cache upload response from coordinator to worker. Pre-signed upload URL for the worker to upload directly to object storage.

FieldTypeRequiredDescription
type"peer.cache.upload.response"YesMessage discriminator
messageIdstringYesUnique message ID
runIdstringYesExecution run ID
jobIdstringYesJob ID within the run
uploadUrlstringYesPre-signed URL for direct S3 upload

Authoritative source: packages/engine/src/protocol/messages/peer.tspeerCacheUploadResponseSchema

Config reload request forwarded from one orchestrator to a specific peer when an operator calls POST /admin/config/reload with a target parameter. The receiving peer executes a local reload and replies with peer.config.reload.response.

FieldTypeRequiredDescription
type"peer.config.reload"YesMessage discriminator
messageIdstringYesUnique message ID
drainbooleanNoWhether to drain in-flight work before reloading

Authoritative source: packages/engine/src/protocol/messages/peer.tspeerConfigReloadSchema

Response from the target peer carrying the reload result fields.

FieldTypeRequiredDescription
type"peer.config.reload.response"YesMessage discriminator
messageIdstringYesUnique message ID
successbooleanYesWhether the reload succeeded
versionnumberNoNew config version after reload
errorsstring[]NoError messages if reload failed
restartRequiredstring[]NoConfig fields that require a restart to take effect
fieldsChangedstring[]NoConfig fields that were changed

Authoritative source: packages/engine/src/protocol/messages/peer.tspeerConfigReloadResponseSchema

Broadcast to every peer when an agent token is revoked so each peer can close its own in-flight agent WS connections authenticated by that token. The originating peer kicks its local connections first (via the admin revoke route), then fans this message out over the encrypted peer mesh.

FieldTypeRequiredDescription
type"peer.agent-token.revoke"YesMessage discriminator
tokenIdstringYesThe agent_tokens.id whose in-flight agent WS must be kicked
senderInstanceIdstringYesOriginating peer’s instance ID (for cross-cluster log correlation)

Authoritative source: packages/engine/src/protocol/messages/peer.tspeerAgentTokenRevokeSchema

Forwarded by a worker to the coordinator that owns the run when the worker’s scaler emits a provisioning event correlated to a queued job (e.g. a failed agent spawn). Workers have no database, so they cannot persist provisioning failures themselves — the coordinator’s ExecutionTracker writes the event to the provisioning log and the dispatch queue’s last-error column.

FieldTypeRequiredDescription
type"scaler.event"YesMessage discriminator
runIdstringYesExecution run ID
jobIdstringYesJob ID within the run
agentIdstringYesThe scaler-managed agent ID the event is about
eventTypeenumYesScaler event type (a ScalerEventType enum member)
detailstringYesHuman-readable detail, including any captured spawn stderr tail
timestampMsnumberYesEvent timestamp in epoch milliseconds

Authoritative source: packages/engine/src/protocol/messages/peer.tspeerScalerEventSchema

See Multi-Orchestrator Architecture for rerouting protocol, loop prevention, and failure modes.

These messages support CLI-initiated test runs (kici run remote). The CLI connects to the orchestrator via WebSocket and triggers a test execution using a fixture payload.

Authoritative source: packages/engine/src/protocol/messages/test-run.ts

Initiate a test run using a fixture payload.

FieldTypeRequiredDescription
type"test.trigger"YesMessage discriminator
messageIdstringYesUnique message ID
fixtureIdstringYesFixture identifier
eventTestEventYesSimulated event (type, action, payload)
uploadIdstringNoUpload ID for bundle transfer
routingKeystringYesProvider routing key
secretsRecord<string, string>NoSecrets to pass to the test run
workflowNamestringNoSpecific workflow to test (all if omitted)

TestEvent: { type: string, action?: string, targetBranch: string, sourceBranch?: string, payload: Record<string, unknown>, changedFiles?: string[] }

Cancel a running test execution.

FieldTypeRequiredDescription
type"test.cancel"YesMessage discriminator
messageIdstringYesUnique message ID
runIdstringYesRun ID to cancel

Response indicating whether the test trigger was accepted.

FieldTypeRequiredDescription
type"test.trigger.response"YesMessage discriminator
messageIdstringYesID of the test.trigger being responded
runIdstringYesExecution run ID
observeUrlstringYesURL to observe the run via WebSocket
statusenumYesOne of: accepted, rejected
reasonstringNoRejection reason (if rejected)

Response to a test cancellation request.

FieldTypeRequiredDescription
type"test.cancel.response"YesMessage discriminator
messageIdstringYesID of the test.cancel being responded
statusenumYesOne of: cancelled, not_found, already_complete

These messages support real-time observation of test runs via WebSocket. The CLI subscribes to a running execution and receives status, log, step, and completion updates.

Authoritative source: packages/engine/src/protocol/messages/observe.ts

Subscribe to observe a running test execution.

FieldTypeRequiredDescription
type"observe.subscribe"YesMessage discriminator
runIdstringYesExecution run ID to observe
tokenstringYesAuth token for the observe session
lastSeenSequencenumberNoResume from sequence (for reconnections)

Run-level status update pushed to the observer.

FieldTypeRequiredDescription
type"observe.status"YesMessage discriminator
runIdstringYesExecution run ID
statusenumYesOne of: pending, running, success, failed, cancelled, cancelling
jobNamestringNoJob name (when status relates to a specific job)
timestampnumberYesUnix timestamp (milliseconds)

Log chunk pushed to the observer.

FieldTypeRequiredDescription
type"observe.log"YesMessage discriminator
runIdstringYesExecution run ID
jobIdstringYesJob ID within the run
jobNamestringYesHuman-readable job name
stepIndexnumberYesZero-based step index
stepNamestringYesHuman-readable step name
linesstring[]YesArray of log output lines
sequencenumberYesSequence number for ordering
timestampnumberYesUnix timestamp (milliseconds)

Step lifecycle event pushed to the observer.

FieldTypeRequiredDescription
type"observe.step"YesMessage discriminator
runIdstringYesExecution run ID
jobIdstringYesJob ID within the run
jobNamestringYesHuman-readable job name
stepNamestringYesHuman-readable step name
stateenumYesOne of: pending, running, success, failed, skipped
durationMsnumberNoStep duration in milliseconds (on completion)
timestampnumberYesUnix timestamp (milliseconds)

Run completion summary pushed to the observer.

FieldTypeRequiredDescription
type"observe.complete"YesMessage discriminator
runIdstringYesExecution run ID
statusenumYesOne of: success, failed, cancelled
summaryCompletionSummaryYesRun summary with duration and per-job status

CompletionSummary: { totalDurationMs: number, jobs: Array<{ name: string, status: string, durationMs?: number }> }

KiCI defines custom close codes in the 4000-4999 range (reserved for application use by RFC 6455) alongside standard codes.

CodeConstantMeaningSent By
1001WS_CLOSE_GOING_AWAYServer shutdown or browser navigating awayServer
4001WS_CLOSE_UNAUTHORIZEDClient failed authenticationServer
4002WS_CLOSE_AUTH_TIMEOUTAuth timeout expiredServer
4003WS_CLOSE_INVALID_MESSAGEInvalid or unparseable message receivedServer
4004WS_CLOSE_HEARTBEAT_TIMEOUTHeartbeat timed out (180 seconds)Server
4005WS_CLOSE_PROTOCOL_ERRORProtocol-level errorServer
4006WS_CLOSE_INTERNAL_ERRORUnexpected internal server errorServer
4010WS_CLOSE_AGENT_AUTH_FAILEDAgent token authentication failedServer
4011WS_CLOSE_CLUSTER_NAME_CONFLICTReserved constant; no Platform code path emits this today. Platform accepts N connected orchestrators per (org_id, cluster_name) and the dashboard listing dedupes by cluster name
4020WS_CLOSE_PLAN_LIMITOrganization has reached its plan limitServer
4030WS_CLOSE_RUN_NOT_FOUNDRequested run was not foundServer
4031WS_CLOSE_DISPATCH_ACK_TIMEOUTA dispatched job went unacknowledged past its deadline; the orchestrator requeues the job and disconnects the unresponsive agentServer

Authoritative source: packages/engine/src/ws/close-codes.ts

All protocol messages are validated at runtime using Zod discriminated unions. Each WebSocket layer has direction-specific schemas that prevent parsing messages from the wrong direction.

Orchestrator-Agent layer:

  • Incoming (Orchestrator receives): agentToOrchestratorMessageSchema — parses agent.register, agent.status, job.status, job.reject, job.ack, log.chunk, step.status, job.heartbeat, agent.log, job.concurrency.report, config.ack, cache.upload.request, cache.upload.complete, cache.user.restore.request, cache.user.save.request, cache.user.save.complete, provenance.upload.request, provenance.upload.complete, event.emit, agent.api.request, agent.metrics, auth.request, fleet.bundle.chunk, fleet.bundle.error, step.approval-request
  • Outgoing (Orchestrator sends): orchestratorToAgentMessageSchema — validates job.dispatch, job.cancel, register.ack, job.concurrency.ack, cache.upload.response, cache.user.restore.response, cache.user.save.response, provenance.upload.response, event.emit.response, agent.api.response, auth.success, auth.failure, fleet.logs.request, step.approval-resolved

Peer-to-peer layer:

  • Bidirectional: peerToPeerMessageSchema / peerFromPeerMessageSchema — parses peer.hello, peer.hello.response, peer.auth.request, peer.auth.response, peer.heartbeat, job.reroute, job.reroute.ack, job.progress, peer.job.cancel, raft.vote.request, raft.vote.response, raft.append.entries, peer.log.chunk, peer.cache.upload.request, peer.cache.upload.response, peer.config.reload, peer.config.reload.response, peer.leaving, peer.agent-token.revoke, scaler.event

The upstream layers (orchestrator↔KiCI and browser↔KiCI) have their own discriminated unions.

Usage example:

import { agentToOrchestratorMessageSchema } from '@kici-dev/engine';
// Parse and validate an incoming message
const message = agentToOrchestratorMessageSchema.parse(JSON.parse(rawData));
// TypeScript narrows the type based on the 'type' discriminator
switch (message.type) {
case 'agent.register':
handleRegister(message.agentId, message.labels);
break;
case 'job.status':
handleStatus(message.runId, message.jobId, message.status);
break;
}

KiCI propagates a requestId (UUIDv4) through the entire webhook processing pipeline for end-to-end observability. The trace ID enables correlating all log lines, database operations, and protocol messages belonging to a single webhook event.

  1. Origin: Generated at webhook ingestion — either by KiCI (for WS-relayed webhooks) or by the orchestrator’s direct HTTP endpoint (for independent/hybrid mode)
  2. Pipeline propagation: Carried through AsyncLocalStorage (ALS) during synchronous webhook processing: webhook.relay -> processWebhook -> trigger matching -> execution start -> job dispatch
  3. Queue persistence: When no agent is immediately available, the requestId is persisted in the dispatch_queue PostgreSQL table alongside the job payload
  4. Queue restoration: When a queued job is drained (agent becomes available), the requestId is read from the database and restored into a new ALS scope via requestContext.run()
  5. Agent dispatch: Included in the job.dispatch message so the agent can log it during execution
  6. Completion callbacks: Stored in the ExecutionTracker’s in-memory RunState at execution start, then passed back to onExecutionComplete and onStepStatusForward callbacks. The callback handlers in server.ts and standalone.ts restore ALS context via requestContext.run() so that check run updates and upstream forwarding are logged with the correct trace ID.

The app.service field identifies which KiCI tier produced a log line (values: platform, orchestrator, agent). Set process-wide at startup via setServiceName() from @kici-dev/shared. This field is independent of container naming and works in all deployment models (containerized, bare-metal, Firecracker). For forwarded agent logs flowing through orchestrator stdout, the service field is preserved from the agent’s original JSON output.