Dashboard, metrics & wire format
This page documents the WebSocket messages that flow on the agent↔orchestrator and orchestrator↔orchestrator (peer-to-peer) channels, plus the wire format invariants every tier obeys.
Concurrency protocol messages
These messages enable job-level concurrency control. When an agent discovers a job belongs to a concurrency group (from the workflow definition), it reports to the orchestrator, which decides the action.
Authoritative source:
packages/engine/src/protocol/messages/orchestrator-agent.ts
job.concurrency.report
Direction: Agent -> Orchestrator
Agent reports that a job belongs to a concurrency group, requesting permission to proceed.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "job.concurrency.report" | Yes | Message discriminator |
| messageId | string | Yes | Unique message ID |
| runId | string | Yes | Execution run ID |
| jobId | string | Yes | Job ID within the run |
| group | string | Yes | Concurrency group name |
job.concurrency.ack
Direction: Orchestrator -> Agent
Orchestrator acknowledges concurrency report with action to take.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "job.concurrency.ack" | Yes | Message discriminator |
| requestId | string | Yes | Correlation ID from the report |
| action | enum | Yes | One of: proceed, wait, cancel |
| reason | string | No | Human-readable reason for the action |
| runId | string | No | Run identifier echoed back on unsolicited slot-release wake-ups (see note) |
| jobId | string | No | Job identifier echoed back on unsolicited slot-release wake-ups (see note) |
Note: the orchestrator sends job.concurrency.ack in two situations:
- As a direct response to a job.concurrency.report. requestId correlates to the report and action is proceed, wait, or cancel.
- As an unsolicited follow-up after the agent received wait. When a slot in the concurrency group is released (by completion, cancellation, or superseding), the orchestrator picks the FIFO-next queued waiter and sends { action: 'proceed' } to the agent still parked on its second waitForConcurrencyAck call. runId and jobId are present in this case so the agent can sanity-check that the wake-up matches its job.
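The agent-side reaction to an ack can be sketched as a small pure function. This is only an illustration under stated assumptions: the NextStep type and the nextStepForAck name are hypothetical and are not taken from the KiCI source. The message shape follows the field table for job.concurrency.ack.

```typescript
type ConcurrencyAction = "proceed" | "wait" | "cancel";

// Shape of job.concurrency.ack as listed in the field table.
interface JobConcurrencyAck {
  type: "job.concurrency.ack";
  requestId: string;
  action: ConcurrencyAction;
  reason?: string;
  runId?: string; // echoed on unsolicited slot-release wake-ups
  jobId?: string; // echoed on unsolicited slot-release wake-ups
}

// Hypothetical result type for this sketch: what the agent does next.
type NextStep = "start-job" | "keep-waiting" | "abort-job" | "ignore";

// Decide how to react to an ack for the job identified by (runId, jobId).
function nextStepForAck(
  ack: JobConcurrencyAck,
  current: { runId: string; jobId: string },
): NextStep {
  // Sanity check: when the ack echoes runId/jobId, they must match our job.
  if (ack.runId !== undefined && ack.runId !== current.runId) return "ignore";
  if (ack.jobId !== undefined && ack.jobId !== current.jobId) return "ignore";

  switch (ack.action) {
    case "proceed":
      return "start-job";
    case "wait":
      return "keep-waiting"; // park until the follow-up ack arrives
    case "cancel":
      return "abort-job";
  }
}
```

Ignoring a mismatched wake-up is an assumption of this sketch; a real agent might log or close the connection instead.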
Agent metrics push messages
These messages support periodic metrics push from agents to the orchestrator. The orchestrator aggregates agent metrics and exposes them via its Prometheus scrape endpoint.
Authoritative source:
packages/engine/src/protocol/messages/orchestrator-agent.ts
agent.metrics
Direction: Agent -> Orchestrator
Periodic metrics push from agent to orchestrator. Each metric includes its type, value, labels, and optional histogram buckets.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "agent.metrics" | Yes | Message discriminator |
| messageId | string | Yes | Unique message ID |
| agentId | string | Yes | Agent identifier |
| metrics | MetricEntry[] | Yes | Array of metric data points |
| timestamp | number | Yes | Unix timestamp (milliseconds) |
Each MetricEntry:
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Metric name |
| type | enum | Yes | One of: counter, histogram, gauge, upDownCounter |
| value | number | No | Metric value (for counters and gauges) |
| labels | Record<string, string> | No | Metric labels |
| buckets | [{ le: number, count: number }] | No | Histogram bucket boundaries and counts |
| count | number | No | Histogram observation count |
| sum | number | No | Histogram observation sum |
Authoritative source:
packages/engine/src/protocol/messages/orchestrator-agent.ts (agentMetricsSchema)
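As an illustration of the agent.metrics shape, the sketch below builds one push message containing a counter and a histogram. The metric names, IDs, and values are made up, and buildAgentMetrics is a hypothetical helper. The tables do not say whether bucket counts are cumulative, so the numbers here should not be read as a statement about that.

```typescript
type MetricType = "counter" | "histogram" | "gauge" | "upDownCounter";

interface MetricEntry {
  name: string;
  type: MetricType;
  value?: number; // counters and gauges
  labels?: Record<string, string>;
  buckets?: { le: number; count: number }[]; // histograms only
  count?: number; // histogram observation count
  sum?: number; // histogram observation sum
}

interface AgentMetrics {
  type: "agent.metrics";
  messageId: string;
  agentId: string;
  metrics: MetricEntry[];
  timestamp: number; // Unix milliseconds
}

// Hypothetical helper: wrap a batch of metric entries in the message envelope.
function buildAgentMetrics(
  agentId: string,
  messageId: string,
  metrics: MetricEntry[],
  nowMs: number,
): AgentMetrics {
  return { type: "agent.metrics", messageId, agentId, metrics, timestamp: nowMs };
}

// Illustrative payload; every name and number below is invented.
const exampleMetricsPush = buildAgentMetrics(
  "agent-1",
  "msg-0001",
  [
    { name: "jobs_started_total", type: "counter", value: 12, labels: { os: "linux" } },
    {
      name: "step_duration_seconds",
      type: "histogram",
      buckets: [{ le: 1, count: 3 }, { le: 5, count: 7 }],
      count: 7,
      sum: 14.5,
    },
  ],
  1700000000000,
);
```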
Agent authentication messages
Authentication messages for the agent <-> orchestrator WebSocket connection.
Authoritative source:
packages/engine/src/protocol/messages/orchestrator-agent.ts
auth.request (agent)
Auth request sent by agent to orchestrator when connecting.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "auth.request" | Yes | Message discriminator |
| token | string | Yes | Authentication token |
| protocolVersion | number | Yes | Protocol version (positive integer) |
auth.success (agent)
Auth success response sent by orchestrator to agent.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "auth.success" | Yes | Message discriminator |
| connectionId | string | Yes | Assigned connection ID |
auth.failure (agent)
Auth failure response sent by orchestrator to agent.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "auth.failure" | Yes | Message discriminator |
| reason | string | Yes | Human-readable failure reason |
Authoritative source:
packages/engine/src/protocol/messages/orchestrator-agent.ts (agentAuthFailureSchema)
Agent private API messages
Generic request-response envelope for typed API calls over the agent WebSocket. New methods are registered in AgentApiRegistry on the orchestrator side; no protocol schema changes are needed per method.
Authoritative source:
packages/engine/src/protocol/messages/orchestrator-agent.ts (agentApiRequestSchema, agentApiResponseSchema)
agent.api.request
API request sent by agent to orchestrator (e.g., infrastructure.list).
| Field | Type | Required | Description |
|---|---|---|---|
| type | "agent.api.request" | Yes | Message discriminator |
| requestId | string | Yes | UUID for correlating the response |
| method | string | Yes | Dot-namespaced method name (e.g., infrastructure.list) |
| params | Record<string, unknown> | No | Method-specific parameters (defaults to {}) |
agent.api.response
API response sent by orchestrator to agent.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "agent.api.response" | Yes | Message discriminator |
| requestId | string | Yes | Matches the original request’s requestId |
| result | unknown | No | Method result (present on success) |
| error | string | No | Error description (present on failure) |
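Because the envelope is generic, the caller has to match each agent.api.response to its request by requestId. A minimal sketch of that correlation follows. The PendingApiCalls class and its callback style are hypothetical and only illustrate the idea; they are not the KiCI client implementation.

```typescript
interface AgentApiRequest {
  type: "agent.api.request";
  requestId: string;
  method: string;
  params?: Record<string, unknown>;
}

interface AgentApiResponse {
  type: "agent.api.response";
  requestId: string;
  result?: unknown; // present on success
  error?: string; // present on failure
}

type ApiCallback = (error: string | null, result: unknown) => void;

// Hypothetical helper: tracks in-flight requests by requestId.
class PendingApiCalls {
  private readonly pending = new Map<string, ApiCallback>();

  // Build the request envelope and remember who is waiting for the answer.
  request(
    requestId: string,
    method: string,
    params: Record<string, unknown>,
    callback: ApiCallback,
  ): AgentApiRequest {
    this.pending.set(requestId, callback);
    return { type: "agent.api.request", requestId, method, params };
  }

  // Route a response to its caller; returns false for unknown requestIds.
  handleResponse(response: AgentApiResponse): boolean {
    const callback = this.pending.get(response.requestId);
    if (callback === undefined) return false;
    this.pending.delete(response.requestId);
    if (response.error !== undefined) callback(response.error, undefined);
    else callback(null, response.result);
    return true;
  }
}
```

A production client would also need a timeout for requests that never receive a response; that is left out here.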
Join messages
Join messages enable zero-knowledge cluster bootstrap. A new orchestrator sends a join.request with a join token, and an existing orchestrator responds with an AES-256-GCM encrypted config bundle. The Platform relay sees only the token’s cleartext routing part and opaque ciphertext, so it has zero knowledge of customer configuration.
Authoritative source:
packages/engine/src/protocol/messages/join.ts
New orchestrator -> existing orchestrator (via Platform relay or direct peer)
join.request
Sent by a new orchestrator to request cluster config from an existing orchestrator.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "join.request" | Yes | Message discriminator |
| messageId | string | No | Correlation ID for Platform relay routing (injected by Platform to match response to correct joiner) |
| token | string | Yes | Full join token: kici_join_v1.<base64url_routing>.<secret_hex> |
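The three-part token layout can be checked with a small parser. This is a sketch under assumptions: the function name is hypothetical, and the accepted character sets (unpadded base64url for the routing part, upper- or lower-case hex for the secret) are guesses based on the format string, not confirmed by the schema.

```typescript
interface ParsedJoinToken {
  routing: string; // base64url routing part (the cleartext the relay may read)
  secretHex: string; // secret part, not needed for routing
}

const JOIN_TOKEN_PREFIX = "kici_join_v1";

// Split kici_join_v1.<base64url_routing>.<secret_hex> into its parts.
// Returns null when the token does not have that layout.
function parseJoinToken(token: string): ParsedJoinToken | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [prefix = "", routing = "", secretHex = ""] = parts;
  if (prefix !== JOIN_TOKEN_PREFIX) return null;
  // ASSUMPTION: unpadded base64url alphabet for the routing part.
  if (!/^[A-Za-z0-9_-]+$/.test(routing)) return null;
  // ASSUMPTION: the secret is plain hex in either case.
  if (!/^[0-9a-fA-F]+$/.test(secretHex)) return null;
  return { routing, secretHex };
}
```

Splitting on "." is safe here because neither the base64url alphabet nor hex contains a dot.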
Existing orchestrator -> new orchestrator (via Platform relay or direct peer)
join.response
Response from an existing orchestrator with an encrypted config bundle or error.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "join.response" | Yes | Message discriminator |
| messageId | string | No | Correlation ID echoed from join.request for relay routing |
| success | boolean | Yes | Whether the join was successful |
| encryptedBundle | string | No | Base64-encoded AES-256-GCM encrypted config bundle (on success) |
| error | string | No | Error message (on failure) |
The Browser ↔ Platform protocol (browser-side WebSocket auth, log subscriptions, status fan-out) is documented in the internal docs.
Orchestrator <-> Orchestrator messages (peer-to-peer)
This layer carries cluster coordination messages between orchestrator instances via direct WebSocket connections on /ws/peer. These messages never transit the Platform tier; see Communication Topology.
Authoritative source:
packages/engine/src/protocol/messages/peer.ts
Authentication
Peer authentication uses a 4-message ECDH handshake. The initiator (connecting orchestrator) sends peer.hello, the responder (receiving orchestrator) replies with peer.hello.response, then the initiator sends an encrypted peer.auth.request, and the responder replies with an encrypted peer.auth.response. All auth material is transmitted over the ECDH-encrypted channel.
peer.hello
Sent by the initiator (connecting orchestrator) to start the ECDH handshake. Provides the initiator’s ephemeral ECDH public key and a nonce.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.hello" | Yes | Message discriminator |
| ephemeralPublicKey | string | Yes | Initiator’s ephemeral X25519 public key (base64, DER SPKI) |
| nonce | string | Yes | Base64-encoded 32-byte random nonce (HKDF salt) |
peer.hello.response
Sent by the responder (receiving orchestrator) in response to peer.hello. Provides the responder’s ephemeral ECDH public key. After this exchange, both sides derive a shared session key.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.hello.response" | Yes | Message discriminator |
| ephemeralPublicKey | string | Yes | Responder’s ephemeral X25519 public key (base64, DER SPKI) |
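The key exchange in peer.hello and peer.hello.response can be sketched with Node's built-in crypto module. The X25519 keys, the base64 DER SPKI encoding, and the nonce as HKDF salt come from the field tables. The HKDF hash (SHA-256), the info string, the 32-byte output length, and all function names are assumptions made for this sketch, so the derived key will not necessarily match what KiCI derives.

```typescript
import {
  createPublicKey,
  diffieHellman,
  generateKeyPairSync,
  hkdfSync,
  randomBytes,
} from "node:crypto";
import type { KeyObject } from "node:crypto";

// Generate an ephemeral X25519 key pair; the public half is base64 DER SPKI.
function newEphemeralKey(): { privateKey: KeyObject; publicKeyB64: string } {
  const { privateKey, publicKey } = generateKeyPairSync("x25519");
  const der = publicKey.export({ type: "spki", format: "der" });
  return { privateKey, publicKeyB64: der.toString("base64") };
}

// 32-byte random nonce, base64-encoded, sent in peer.hello.
function newNonce(): string {
  return randomBytes(32).toString("base64");
}

// Derive the shared session key from our private key and the peer's public key.
function deriveSessionKey(
  myPrivateKey: KeyObject,
  peerPublicKeyB64: string,
  nonceB64: string,
): Buffer {
  const peerPublicKey = createPublicKey({
    key: Buffer.from(peerPublicKeyB64, "base64"),
    format: "der",
    type: "spki",
  });
  const sharedSecret = diffieHellman({ privateKey: myPrivateKey, publicKey: peerPublicKey });
  const salt = Buffer.from(nonceB64, "base64");
  // ASSUMPTION: SHA-256, this info string, and a 32-byte key are illustrative.
  return Buffer.from(hkdfSync("sha256", sharedSecret, salt, "example-peer-session", 32));
}
```

Both sides call deriveSessionKey with their own private key, the other side's public key, and the initiator's nonce, and arrive at the same bytes.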
peer.auth.request
Sent by the connecting orchestrator after the ECDH handshake. Encrypted with the shared session key. Contains either a join token (first connection) or an HMAC credential proof (subsequent connections). The receiving orchestrator must respond within 15 seconds or the connection is closed.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.auth.request" | Yes | Message discriminator |
| instanceId | string | Yes | Sender’s cluster instance ID |
| protocolVersion | number | Yes | Protocol version |
| token | string | Conditional | Join token (first connection only) |
| proof | string | Conditional | HMAC proof of credential ownership (subsequent connections) |
| softwareVersion | string | No | Software version of the connecting peer (for version compat check) |
| role | enum | No | Role of the connecting peer: coordinator or worker |
One of token or proof must be present.
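The token-or-proof rule can be expressed as a small check on the decrypted request. The authModeOf function and the AuthMode type are hypothetical. How a request carrying both fields is treated is not specified here; this sketch gives the token precedence as an assumption.

```typescript
interface PeerAuthRequest {
  type: "peer.auth.request";
  instanceId: string;
  protocolVersion: number;
  token?: string; // join token, first connection only
  proof?: string; // HMAC proof, subsequent connections
  softwareVersion?: string;
  role?: "coordinator" | "worker";
}

type AuthMode = "join-token" | "credential-proof";

// Classify the request, or return null when neither credential is present.
function authModeOf(request: PeerAuthRequest): AuthMode | null {
  // ASSUMPTION: if both fields are set, the token wins.
  if (request.token !== undefined) return "join-token";
  if (request.proof !== undefined) return "credential-proof";
  return null; // violates "one of token or proof must be present"
}
```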
peer.auth.response
Response to a peer authentication request. Encrypted with the shared session key.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.auth.response" | Yes | Message discriminator |
| accepted | boolean | Yes | Whether authentication succeeded |
| instanceId | string | No | Responder’s cluster instance ID |
| sessionCredential | string | No | Issued credential for future connections (first join) |
| role | string | No | Confirmed role of the peer |
| reason | string | No | Rejection reason (if not accepted) |
| softwareVersion | string | No | Software version of the coordinator (for version compat check) |
| agents | PeerAgentSummary[] | No | Responder’s connected agent inventory (present when accepted=true) |
| scalerCapacity | ScalerCapacitySummary[] | No | Responder’s scaler capacity (present when accepted=true) |
| capabilities | PeerCapabilities | No | Responder’s feature capabilities (present when accepted=true) |
Inventory & consensus
peer.heartbeat
Sent every 30 seconds by each peer. Carries agent inventory, scaler capacity, and Raft consensus state. This is the primary mechanism for routing decisions: the coordinator uses peer heartbeats to determine which peers can handle a job’s label requirements.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.heartbeat" | Yes | Message discriminator |
| instanceId | string | Yes | Sender’s cluster instance ID |
| term | number | Yes | Current Raft term |
| leaderId | string or null | Yes | Known Raft leader’s instance ID |
| draining | boolean | Yes | Whether the sender is gracefully shutting down |
| agents | PeerAgentSummary[] | Yes | Connected agent inventory |
| capabilities | PeerCapabilities | Yes | Feature flags (S3 log access, log routing override) |
| scalerCapacity | ScalerCapacitySummary[] | No | On-demand backend capacity for routing decisions |
| configVersion | number | No | Shared config version for sync detection |
| registryVersion | number | No | Registry version for cross-orchestrator registration sync |
| timestamp | number | Yes | Unix timestamp (milliseconds) |
| hostname | string | No | Machine hostname (os.hostname()) |
| osRelease | string | No | OS kernel release (os.release()) |
| totalMemoryMb | number | No | Total system memory in MiB |
| memoryUsedMb | number | No | Used memory in MiB |
| memoryAvailableMb | number | No | Available memory in MiB |
| cpuCount | number | No | Number of logical CPUs |
| uptimeSeconds | number | No | System uptime in seconds |
| nodeVersion | string | No | Node.js version |
| runningAsUser | string or null | No | Username of the OS user running the orchestrator |
| runningAsUid | number or null | No | UID of the OS user running the orchestrator |
| version | string | No | Orchestrator version (e.g., "0.0.1") |
PeerAgentSummary:
| Field | Type | Required | Description |
|---|---|---|---|
| agentId | string | Yes | Agent identifier |
| labels | string[] | Yes | Capability labels |
| activeJobs | number | Yes | Currently running jobs |
| maxConcurrency | number | Yes | Maximum concurrent jobs |
| platform | string | Yes | OS platform (e.g., linux) |
| arch | string | Yes | CPU architecture (e.g., x64, arm64) |
PeerCapabilities:
| Field | Type | Required | Description |
|---|---|---|---|
| s3LogAccess | boolean | Yes | Whether this peer has direct S3 log access |
| logRoutingOverride | enum | No | One of: direct, coordinator (overrides default routing) |
ScalerCapacitySummary:
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | No | Scaler backend name (e.g., stg-worker-bare-metal) |
| type | string | No | Scaler backend type (e.g., bare-metal, container) |
| labelSets | string[][] | Yes | Label sets this backend provisions |
| maxAgents | number | Yes | Maximum agents for this backend |
| activeCount | number | Yes | Current active agent count |
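To make the routing use of heartbeats concrete, the sketch below filters peers by the agent inventory they last reported. It rests on several assumptions that the tables do not confirm: an agent satisfies a job when it carries every label of at least one label set, draining peers are skipped, and an agent needs a free slot (activeJobs below maxConcurrency). Scaler capacity is ignored, and PeerView and both function names are hypothetical.

```typescript
interface PeerAgentSummary {
  agentId: string;
  labels: string[];
  activeJobs: number;
  maxConcurrency: number;
  platform: string;
  arch: string;
}

// Hypothetical projection of the heartbeat fields this sketch needs.
interface PeerView {
  instanceId: string;
  draining: boolean;
  agents: PeerAgentSummary[];
}

// ASSUMPTION: one fully matched label set is enough.
function agentMatches(agent: PeerAgentSummary, runsOnLabels: string[][]): boolean {
  return runsOnLabels.some((labelSet) =>
    labelSet.every((label) => agent.labels.includes(label)),
  );
}

// Instance IDs of peers that report a matching agent with a free slot.
function peersAbleToRun(peers: PeerView[], runsOnLabels: string[][]): string[] {
  return peers
    .filter((peer) => !peer.draining)
    .filter((peer) =>
      peer.agents.some(
        (agent) => agent.activeJobs < agent.maxConcurrency && agentMatches(agent, runsOnLabels),
      ),
    )
    .map((peer) => peer.instanceId);
}
```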
raft.vote.request
Raft leader election vote request. Sent by candidates during elections.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "raft.vote.request" | Yes | Message discriminator |
| term | number | Yes | Candidate’s current term |
| candidateId | string | Yes | Candidate’s instance ID |
| lastLogIndex | number | Yes | Candidate’s last log index |
| lastLogTerm | number | Yes | Term of candidate’s last log entry |
raft.vote.response
Response to a Raft vote request.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "raft.vote.response" | Yes | Message discriminator |
| term | number | Yes | Voter’s current term |
| voteGranted | boolean | Yes | Whether the vote was granted |
| voterId | string | Yes | Voter’s instance ID |
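For orientation, the textbook Raft rule for answering a vote request is sketched below using the two message shapes above. This is the standard algorithm, not a copy of KiCI's election code; VoterState and handleVoteRequest are hypothetical names, and how KiCI populates the log fields is not covered here.

```typescript
interface RaftVoteRequest {
  type: "raft.vote.request";
  term: number;
  candidateId: string;
  lastLogIndex: number;
  lastLogTerm: number;
}

interface RaftVoteResponse {
  type: "raft.vote.response";
  term: number;
  voteGranted: boolean;
  voterId: string;
}

// Hypothetical local state of the voter.
interface VoterState {
  instanceId: string;
  currentTerm: number;
  votedFor: string | null;
  lastLogIndex: number;
  lastLogTerm: number;
}

// Standard Raft vote handling; mutates the voter state in place.
function handleVoteRequest(state: VoterState, request: RaftVoteRequest): RaftVoteResponse {
  // A newer term always resets our vote.
  if (request.term > state.currentTerm) {
    state.currentTerm = request.term;
    state.votedFor = null;
  }
  // The candidate's log must be at least as up to date as ours.
  const logIsCurrent =
    request.lastLogTerm > state.lastLogTerm ||
    (request.lastLogTerm === state.lastLogTerm && request.lastLogIndex >= state.lastLogIndex);
  // At most one vote per term.
  const voteIsFree = state.votedFor === null || state.votedFor === request.candidateId;
  const voteGranted = request.term === state.currentTerm && voteIsFree && logIsCurrent;
  if (voteGranted) state.votedFor = request.candidateId;
  return {
    type: "raft.vote.response",
    term: state.currentTerm,
    voteGranted,
    voterId: state.instanceId,
  };
}
```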
raft.append.entries
Raft leader heartbeat (no log entries; KiCI uses Raft for leader election only, not log replication).
| Field | Type | Required | Description |
|---|---|---|---|
| type | "raft.append.entries" | Yes | Message discriminator |
| term | number | Yes | Leader’s current term |
| leaderId | string | Yes | Leader’s instance ID |
Job rerouting
job.reroute
Sent by the coordinator to a peer when no local agent can handle a job. Contains the full resolved job configuration so the peer can dispatch without re-resolving.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "job.reroute" | Yes | Message discriminator |
| messageId | string | Yes | Unique message ID for ACK correlation |
| jobId | string | Yes | Pre-allocated job ID. The sending coordinator allocates it before reroute so its execution_runs / execution_jobs rows reference the same id the receiving peer will dispatch under. |
| runId | string | Yes | Execution run identifier |
| deliveryId | string | Yes | Original webhook delivery ID |
| routingKey | string | Yes | Provider routing key |
| event, action | string, string or null | Yes | Webhook event type and action |
| payload | Record<string, unknown> | Yes | Full webhook payload |
| jobName | string | Yes | Job to execute |
| workflowName | string | Yes | Workflow containing the job |
| runsOnLabels | string[][] | Yes | Label sets the job requires |
| excludeLabels | string[] | No | Labels that the dispatched agent must NOT have |
| triedConnections | string[] | Yes | Instance IDs already tried (loop prevention) |
| maxHops | number | Yes | Maximum allowed hops (default: 3) |
| coordinatorId | string | Yes | Instance ID of the run coordinator |
| jobConfig | Record<string, unknown> | No | Resolved job config (steps, rules, matrix, etc.) |
| repoUrl | string | No | Repository clone URL |
| ref | string | No | Git ref (branch name) |
| sha | string | No | Commit SHA |
| provider | string | No | Provider type (e.g., github) |
| providerContext | Record<string, unknown> | No | Provider-specific context (e.g., installationId) |
| sourceTarUrl | string | No | Pre-signed .kici/ source tarball download URL (cache hit) |
| sourceTarHash | string | No | Workflow contentHash (used for drift verification, not tarball) |
| depsUrl | string | No | Pre-signed dependency tarball URL (cache hit) |
| depsHash | string | No | SHA-256 of the dependency tarball bytes |
| cloneToken | string | No | Pre-resolved clone token for workers without provider credentials |
| encryptedSecrets | string | No | Encrypted secrets envelope (AES-256-GCM with session key) |
| encryptedNamespacedSecrets | string | No | Encrypted namespaced secrets envelope |
| requestId | string | No | Trace ID for distributed tracing |
| traceId | string | No | Additional trace context |
job.reroute.ack
Acknowledgment of a reroute request. The coordinator tries the next peer if rejected.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "job.reroute.ack" | Yes | Message discriminator |
| messageId | string | Yes | ID of the job.reroute being acknowledged |
| accepted | boolean | Yes | Whether the peer accepted the job |
| reason | string | No | Rejection reason (if not accepted) |
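The retry-on-rejection behaviour, together with the triedConnections and maxHops fields of job.reroute, can be sketched as follows. The helper names are hypothetical, and counting one hop per entry in triedConnections is an assumption of this sketch; the actual accounting lives in the rerouting implementation.

```typescript
// The two loop-prevention fields carried by job.reroute.
interface RerouteRouting {
  triedConnections: string[];
  maxHops: number;
}

// Pick the next peer to try, or null when the hop budget is spent
// or every candidate has already been tried.
function pickNextPeer(routing: RerouteRouting, candidates: string[]): string | null {
  // ASSUMPTION: each tried instance counts as one hop.
  if (routing.triedConnections.length >= routing.maxHops) return null;
  const tried = new Set(routing.triedConnections);
  return candidates.find((instanceId) => !tried.has(instanceId)) ?? null;
}

// Record an attempt without mutating the original routing state.
function markTried(routing: RerouteRouting, instanceId: string): RerouteRouting {
  return { ...routing, triedConnections: [...routing.triedConnections, instanceId] };
}
```

After a job.reroute.ack with accepted set to false, the coordinator would call markTried for the rejecting peer and then pickNextPeer again.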
Progress & cancel
job.progress
Sent by the worker orchestrator back to the coordinator as the agent reports job-level and step-level state changes. The coordinator uses this to drive its run-level state machine, update GitHub check runs, and track per-step progress.
The kind discriminator decides which ExecutionTracker call the receiver makes:
- kind: "job" → onJobStatus(runId, jobId, state, ...): drives run transitions (running → success/failed/…). stepIndex and stepName are unused for this kind.
- kind: "step" → onStepStatus(runId, jobId, stepIndex, stepName, state, ...): persists execution_steps rows.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "job.progress" | Yes | Message discriminator |
| kind | "job" or "step" | Yes | Whether this update is a job-level or step-level transition |
| runId | string | Yes | Execution run identifier |
| jobId | string | Yes | Job identifier (matches the pre-allocated jobId from job.reroute) |
| jobName | string | Yes | Human-readable job name |
| stepIndex | number | Yes | Zero-based step index (used only when kind="step") |
| stepName | string | Yes | Human-readable step name (used only when kind="step") |
| state | enum | Yes | Full ExecutionJobStatus for kind="job"; the step-state subset (running/success/failed/skipped) for kind="step" |
| timestamp | number | Yes | Unix timestamp (milliseconds) |
| data | Record<string, unknown> | No | Optional state-specific data |
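The dispatch on kind can be sketched as below. The ProgressTracker interface is a narrowed, hypothetical stand-in for ExecutionTracker: the real onJobStatus and onStepStatus take further arguments that are not listed on this page, so only the documented leading parameters appear.

```typescript
interface JobProgress {
  type: "job.progress";
  kind: "job" | "step";
  runId: string;
  jobId: string;
  jobName: string;
  stepIndex: number; // meaningful only when kind is "step"
  stepName: string; // meaningful only when kind is "step"
  state: string;
  timestamp: number;
  data?: Record<string, unknown>;
}

// Hypothetical, narrowed view of the tracker used by the receiver.
interface ProgressTracker {
  onJobStatus(runId: string, jobId: string, state: string): void;
  onStepStatus(
    runId: string,
    jobId: string,
    stepIndex: number,
    stepName: string,
    state: string,
  ): void;
}

// Route one job.progress message to the matching tracker call.
function applyProgress(tracker: ProgressTracker, message: JobProgress): void {
  if (message.kind === "job") {
    // Step fields are ignored for job-level updates.
    tracker.onJobStatus(message.runId, message.jobId, message.state);
  } else {
    tracker.onStepStatus(
      message.runId,
      message.jobId,
      message.stepIndex,
      message.stepName,
      message.state,
    );
  }
}
```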
peer.job.cancel
Sent by the coordinator to a worker to cancel a rerouted job. Used for fail-fast propagation and user-initiated cancellation.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.job.cancel" | Yes | Message discriminator |
| runId | string | Yes | Execution run identifier |
| jobId | string | No | Specific job to cancel (omit for all jobs in run) |
| reason | string | Yes | Human-readable cancellation reason |
| force | boolean | No | When true, force-cancel immediately without waiting for hooks |
Graceful shutdown
peer.leaving
Graceful shutdown announcement. Peers remove the sender from their registry immediately upon receiving this message.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.leaving" | Yes | Message discriminator |
| instanceId | string | Yes | Instance ID of the leaving peer |
| term | number | Yes | Current Raft term for leader identification |
Log and cache relay (coordinator-worker topology)
peer.log.chunk
Log chunk relay from worker to coordinator. Batched log lines from agent execution, forwarded when the worker does not have direct S3 log access.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.log.chunk" | Yes | Message discriminator |
| runId | string | Yes | Execution run ID |
| jobId | string | Yes | Job ID within the run |
| stepIndex | number | Yes | Zero-based step index |
| lines | LogLineEntry[] | Yes | Array of log line entries |
Each LogLineEntry:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Log line text |
| timestamp | number | Yes | Unix timestamp (milliseconds) |
| stream | enum | No | One of: stdout, stderr (defaults to stdout) |
Authoritative source:
packages/engine/src/protocol/messages/peer.ts (peerLogChunkSchema)
peer.cache.upload.request
Cache upload request from worker to coordinator. Worker agent needs an upload URL for cache storage.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.cache.upload.request" | Yes | Message discriminator |
| messageId | string | Yes | Unique message ID |
| runId | string | Yes | Execution run ID |
| jobId | string | Yes | Job ID within the run |
| cacheType | enum | Yes | One of: bundle, deps |
| hash | string | Yes | Content hash for cache keying |
| sizeBytes | number | Yes | Size of the artifact in bytes |
Authoritative source:
packages/engine/src/protocol/messages/peer.ts (peerCacheUploadRequestSchema)
peer.cache.upload.response
Cache upload response from coordinator to worker. Carries a pre-signed URL that the worker uses to upload directly to object storage.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.cache.upload.response" | Yes | Message discriminator |
| messageId | string | Yes | Unique message ID |
| runId | string | Yes | Execution run ID |
| jobId | string | Yes | Job ID within the run |
| uploadUrl | string | Yes | Pre-signed URL for direct S3 upload |
Authoritative source:
packages/engine/src/protocol/messages/peer.ts (peerCacheUploadResponseSchema)
Config reload (per-instance targeting)
peer.config.reload
Config reload request forwarded from one orchestrator to a specific peer when an operator calls POST /admin/config/reload with a target parameter. The receiving peer executes a local reload and replies with peer.config.reload.response.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.config.reload" | Yes | Message discriminator |
| messageId | string | Yes | Unique message ID |
| drain | boolean | No | Whether to drain in-flight work before reloading |
Authoritative source:
packages/engine/src/protocol/messages/peer.ts (peerConfigReloadSchema)
peer.config.reload.response
Response from the target peer carrying the reload result fields.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.config.reload.response" | Yes | Message discriminator |
| messageId | string | Yes | Unique message ID |
| success | boolean | Yes | Whether the reload succeeded |
| version | number | No | New config version after reload |
| errors | string[] | No | Error messages if reload failed |
| restartRequired | string[] | No | Config fields that require a restart to take effect |
| fieldsChanged | string[] | No | Config fields that were changed |
Authoritative source:
packages/engine/src/protocol/messages/peer.ts (peerConfigReloadResponseSchema)
Agent-token revoke fan-out
peer.agent-token.revoke
Broadcast to every peer when an agent token is revoked so each peer can close its own in-flight agent WS connections authenticated by that token. The originating peer kicks its local connections first (via the admin revoke route), then fans this message out over the encrypted peer mesh.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "peer.agent-token.revoke" | Yes | Message discriminator |
| tokenId | string | Yes | The agent_tokens.id whose in-flight agent WS must be kicked |
| senderInstanceId | string | Yes | Originating peer’s instance ID (for cross-cluster log correlation) |
Authoritative source:
packages/engine/src/protocol/messages/peer.ts (peerAgentTokenRevokeSchema)
Scaler provisioning events
scaler.event
Forwarded by a worker to the coordinator that owns the run when the worker’s scaler emits a provisioning event correlated to a queued job (e.g. a failed agent spawn). Workers have no database, so they cannot persist provisioning failures themselves. Instead, the coordinator’s ExecutionTracker writes the event to the provisioning log and the dispatch queue’s last-error column.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "scaler.event" | Yes | Message discriminator |
| runId | string | Yes | Execution run ID |
| jobId | string | Yes | Job ID within the run |
| agentId | string | Yes | The scaler-managed agent ID the event is about |
| eventType | enum | Yes | Scaler event type (a ScalerEventType enum member) |
| detail | string | Yes | Human-readable detail, including any captured spawn stderr tail |
| timestampMs | number | Yes | Event timestamp in epoch milliseconds |
Authoritative source:
packages/engine/src/protocol/messages/peer.ts (peerScalerEventSchema)
See Multi-Orchestrator Architecture for rerouting protocol, loop prevention, and failure modes.
Test run messages
These messages support CLI-initiated test runs (kici run remote). The CLI connects to the orchestrator via WebSocket and triggers a test execution using a fixture payload.
Authoritative source:
packages/engine/src/protocol/messages/test-run.ts
CLI -> Orchestrator
test.trigger
Initiate a test run using a fixture payload.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "test.trigger" | Yes | Message discriminator |
| messageId | string | Yes | Unique message ID |
| fixtureId | string | Yes | Fixture identifier |
| event | TestEvent | Yes | Simulated event (type, action, payload) |
| uploadId | string | No | Upload ID for bundle transfer |
| routingKey | string | Yes | Provider routing key |
| secrets | Record<string, string> | No | Secrets to pass to the test run |
| workflowName | string | No | Specific workflow to test (all if omitted) |
TestEvent: { type: string, action?: string, targetBranch: string, sourceBranch?: string, payload: Record<string, unknown>, changedFiles?: string[] }
test.cancel
Cancel a running test execution.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "test.cancel" | Yes | Message discriminator |
| messageId | string | Yes | Unique message ID |
| runId | string | Yes | Run ID to cancel |
Orchestrator -> CLI
test.trigger.response
Response indicating whether the test trigger was accepted.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "test.trigger.response" | Yes | Message discriminator |
| messageId | string | Yes | ID of the test.trigger being responded to |
| runId | string | Yes | Execution run ID |
| observeUrl | string | Yes | URL to observe the run via WebSocket |
| status | enum | Yes | One of: accepted, rejected |
| reason | string | No | Rejection reason (if rejected) |
test.cancel.response
Response to a test cancellation request.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "test.cancel.response" | Yes | Message discriminator |
| messageId | string | Yes | ID of the test.cancel being responded to |
| status | enum | Yes | One of: cancelled, not_found, already_complete |
Observer messages
These messages support real-time observation of test runs via WebSocket. The CLI subscribes to a running execution and receives status, log, step, and completion updates.
Authoritative source:
packages/engine/src/protocol/messages/observe.ts
CLI -> Orchestrator
observe.subscribe
Subscribe to observe a running test execution.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "observe.subscribe" | Yes | Message discriminator |
| runId | string | Yes | Execution run ID to observe |
| token | string | Yes | Auth token for the observe session |
| lastSeenSequence | number | No | Resume from sequence (for reconnections) |
Orchestrator -> CLI
observe.status
Run-level status update pushed to the observer.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "observe.status" | Yes | Message discriminator |
| runId | string | Yes | Execution run ID |
| status | enum | Yes | One of: pending, running, success, failed, cancelled, cancelling |
| jobName | string | No | Job name (when status relates to a specific job) |
| timestamp | number | Yes | Unix timestamp (milliseconds) |
observe.log
Log chunk pushed to the observer.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "observe.log" | Yes | Message discriminator |
| runId | string | Yes | Execution run ID |
| jobId | string | Yes | Job ID within the run |
| jobName | string | Yes | Human-readable job name |
| stepIndex | number | Yes | Zero-based step index |
| stepName | string | Yes | Human-readable step name |
| lines | string[] | Yes | Array of log output lines |
| sequence | number | Yes | Sequence number for ordering |
| timestamp | number | Yes | Unix timestamp (milliseconds) |
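The sequence field of observe.log pairs with lastSeenSequence on observe.subscribe to support reconnects. A client-side sketch follows. ObserveCursor is a hypothetical class, and it assumes sequence numbers increase strictly within a run, which the tables imply but do not state.

```typescript
interface ObserveSubscribe {
  type: "observe.subscribe";
  runId: string;
  token: string;
  lastSeenSequence?: number;
}

// Hypothetical helper: remembers the highest sequence seen for one run.
class ObserveCursor {
  private readonly runId: string;
  private readonly token: string;
  private lastSeen: number | undefined = undefined;

  constructor(runId: string, token: string) {
    this.runId = runId;
    this.token = token;
  }

  // Record a received observe.log sequence.
  // Returns false for chunks that were already seen (e.g. replayed after a resume).
  accept(sequence: number): boolean {
    // ASSUMPTION: sequences increase strictly within a run.
    if (this.lastSeen !== undefined && sequence <= this.lastSeen) return false;
    this.lastSeen = sequence;
    return true;
  }

  // Build the subscribe message for the first connection or a reconnect.
  subscribeMessage(): ObserveSubscribe {
    const message: ObserveSubscribe = {
      type: "observe.subscribe",
      runId: this.runId,
      token: this.token,
    };
    if (this.lastSeen !== undefined) message.lastSeenSequence = this.lastSeen;
    return message;
  }
}
```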
observe.step
Step lifecycle event pushed to the observer.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "observe.step" | Yes | Message discriminator |
| runId | string | Yes | Execution run ID |
| jobId | string | Yes | Job ID within the run |
| jobName | string | Yes | Human-readable job name |
| stepName | string | Yes | Human-readable step name |
| state | enum | Yes | One of: pending, running, success, failed, skipped |
| durationMs | number | No | Step duration in milliseconds (on completion) |
| timestamp | number | Yes | Unix timestamp (milliseconds) |
observe.complete
Run completion summary pushed to the observer.
| Field | Type | Required | Description |
|---|---|---|---|
| type | "observe.complete" | Yes | Message discriminator |
| runId | string | Yes | Execution run ID |
| status | enum | Yes | One of: success, failed, cancelled |
| summary | CompletionSummary | Yes | Run summary with duration and per-job status |
CompletionSummary: { totalDurationMs: number, jobs: Array<{ name: string, status: string, durationMs?: number }> }
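To show how the `CompletionSummary` shape might be consumed, here is a small sketch that renders an `observe.complete` message as text and derives a process exit code. Both helper functions and the exit-code mapping are assumptions made for illustration, not documented CLI behavior.

```typescript
// Shapes taken from the observe.complete table and the CompletionSummary line above.
interface CompletionSummary {
  totalDurationMs: number;
  jobs: Array<{ name: string; status: string; durationMs?: number }>;
}

interface ObserveComplete {
  type: 'observe.complete';
  runId: string;
  status: 'success' | 'failed' | 'cancelled';
  summary: CompletionSummary;
}

// Hypothetical renderer: one header line plus one line per job.
function renderCompletion(message: ObserveComplete): string[] {
  const header = `run ${message.runId}: ${message.status} in ${message.summary.totalDurationMs}ms`;
  const jobLines = message.summary.jobs.map((job) => {
    // durationMs is optional, so fall back to a placeholder when it is absent.
    const duration = job.durationMs === undefined ? 'n/a' : `${job.durationMs}ms`;
    return `  ${job.name}: ${job.status} (${duration})`;
  });
  return [header, ...jobLines];
}

// ASSUMPTION: treating anything other than success as a non-zero exit code.
function exitCodeFor(finalStatus: ObserveComplete['status']): number {
  return finalStatus === 'success' ? 0 : 1;
}

const sampleCompletion: ObserveComplete = {
  type: 'observe.complete',
  runId: 'run-123',
  status: 'failed',
  summary: {
    totalDurationMs: 5300,
    jobs: [
      { name: 'build', status: 'success', durationMs: 4200 },
      { name: 'deploy', status: 'skipped' },
    ],
  },
};

const completionReport = renderCompletion(sampleCompletion);
const completionExitCode = exitCodeFor(sampleCompletion.status);
```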
WebSocket close codes
KiCI defines custom close codes in the 4000-4999 range (reserved by RFC 6455 for private use between applications) alongside standard codes.
| Code | Constant | Meaning | Sent By |
|---|---|---|---|
| 1001 | WS_CLOSE_GOING_AWAY | Server shutdown or browser navigating away | Server |
| 4001 | WS_CLOSE_UNAUTHORIZED | Client failed authentication | Server |
| 4002 | WS_CLOSE_AUTH_TIMEOUT | Auth timeout expired | Server |
| 4003 | WS_CLOSE_INVALID_MESSAGE | Invalid or unparseable message received | Server |
| 4004 | WS_CLOSE_HEARTBEAT_TIMEOUT | Heartbeat timed out (180 seconds) | Server |
| 4005 | WS_CLOSE_PROTOCOL_ERROR | Protocol-level error | Server |
| 4006 | WS_CLOSE_INTERNAL_ERROR | Unexpected internal server error | Server |
| 4010 | WS_CLOSE_AGENT_AUTH_FAILED | Agent token authentication failed | Server |
| 4011 | WS_CLOSE_CLUSTER_NAME_CONFLICT | Reserved constant; no Platform code path emits this today. Platform accepts N connected orchestrators per (org_id, cluster_name) and the dashboard listing dedupes by cluster name | — |
| 4020 | WS_CLOSE_PLAN_LIMIT | Organization has reached its plan limit | Server |
| 4030 | WS_CLOSE_RUN_NOT_FOUND | Requested run was not found | Server |
| 4031 | WS_CLOSE_DISPATCH_ACK_TIMEOUT | A dispatched job went unacknowledged past its deadline; the orchestrator requeues the job and disconnects the unresponsive agent | Server |
Authoritative source:
packages/engine/src/ws/close-codes.ts
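A client typically branches on the close code to decide whether reconnecting makes sense. The sketch below uses local copies of a few constants from the table. The retry policy itself is an assumption chosen for illustration and is not taken from KiCI's source.

```typescript
// Local copies of some codes from the table above; the real constants are
// defined in packages/engine/src/ws/close-codes.ts.
const WS_CLOSE_GOING_AWAY = 1001;
const WS_CLOSE_UNAUTHORIZED = 4001;
const WS_CLOSE_HEARTBEAT_TIMEOUT = 4004;
const WS_CLOSE_INTERNAL_ERROR = 4006;
const WS_CLOSE_RUN_NOT_FOUND = 4030;

// RFC 6455 reserves 4000-4999 for private use between applications.
function isApplicationCloseCode(code: number): boolean {
  return code >= 4000 && code <= 4999;
}

// ASSUMPTION: which codes are worth retrying is an illustrative policy,
// not documented KiCI behavior.
const RETRYABLE_CLOSE_CODES: ReadonlySet<number> = new Set([
  WS_CLOSE_GOING_AWAY,
  WS_CLOSE_HEARTBEAT_TIMEOUT,
  WS_CLOSE_INTERNAL_ERROR,
]);

function shouldReconnect(code: number): boolean {
  return RETRYABLE_CLOSE_CODES.has(code);
}

// Transient conditions reconnect; auth and not-found failures do not.
const reconnectDecisions = [
  WS_CLOSE_GOING_AWAY,
  WS_CLOSE_UNAUTHORIZED,
  WS_CLOSE_RUN_NOT_FOUND,
].map(shouldReconnect);
```

The intuition behind this policy is that retrying after an authentication or not-found close would fail the same way again, whereas a shutdown or heartbeat timeout may be transient.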
Validation
All protocol messages are validated at runtime using Zod discriminated unions. Each WebSocket layer has direction-specific schemas that prevent parsing messages from the wrong direction.
Orchestrator-Agent layer:
- Incoming (Orchestrator receives): agentToOrchestratorMessageSchema parses agent.register, agent.status, job.status, job.reject, job.ack, log.chunk, step.status, job.heartbeat, agent.log, job.concurrency.report, config.ack, cache.upload.request, cache.upload.complete, cache.user.restore.request, cache.user.save.request, cache.user.save.complete, provenance.upload.request, provenance.upload.complete, event.emit, agent.api.request, agent.metrics, auth.request, fleet.bundle.chunk, fleet.bundle.error, step.approval-request
- Outgoing (Orchestrator sends): orchestratorToAgentMessageSchema validates job.dispatch, job.cancel, register.ack, job.concurrency.ack, cache.upload.response, cache.user.restore.response, cache.user.save.response, provenance.upload.response, event.emit.response, agent.api.response, auth.success, auth.failure, fleet.logs.request, step.approval-resolved
Peer-to-peer layer:
- Bidirectional: peerToPeerMessageSchema / peerFromPeerMessageSchema parse peer.hello, peer.hello.response, peer.auth.request, peer.auth.response, peer.heartbeat, job.reroute, job.reroute.ack, job.progress, peer.job.cancel, raft.vote.request, raft.vote.response, raft.append.entries, peer.log.chunk, peer.cache.upload.request, peer.cache.upload.response, peer.config.reload, peer.config.reload.response, peer.leaving, peer.agent-token.revoke, scaler.event
The upstream layers (orchestrator↔KiCI and browser↔KiCI) have their own discriminated unions.
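The point of direction-specific schemas can be illustrated without Zod. The sketch below is a hand-rolled stand-in, not the real schemas: it checks only the `type` discriminator against a per-direction allow-list, using a small subset of the message types listed above. The `acceptsType` helper is hypothetical.

```typescript
// Small subsets of the message types listed above, for illustration only.
const AGENT_TO_ORCHESTRATOR_TYPES: ReadonlySet<string> = new Set([
  'agent.register',
  'job.status',
  'job.concurrency.report',
]);
const ORCHESTRATOR_TO_AGENT_TYPES: ReadonlySet<string> = new Set([
  'job.dispatch',
  'job.cancel',
  'job.concurrency.ack',
]);

// Hand-rolled stand-in for a direction-specific schema: a frame is accepted
// only if it is a JSON object whose 'type' belongs to the given direction.
function acceptsType(allowed: ReadonlySet<string>, rawFrame: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawFrame);
  } catch {
    return false; // unparseable frames are rejected outright
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return false;
  }
  const discriminator = (parsed as { type?: unknown }).type;
  return typeof discriminator === 'string' && allowed.has(discriminator);
}

const dispatchFrame = JSON.stringify({ type: 'job.dispatch' });

// A job.dispatch frame is valid on the orchestrator -> agent direction only.
const orchestratorAcceptsDispatch = acceptsType(AGENT_TO_ORCHESTRATOR_TYPES, dispatchFrame);
const agentAcceptsDispatch = acceptsType(ORCHESTRATOR_TO_AGENT_TYPES, dispatchFrame);
```

A real discriminated union also validates every field of the matched variant; this sketch demonstrates only the direction check.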
Usage example:
```typescript
import { agentToOrchestratorMessageSchema } from '@kici-dev/engine';

// Parse and validate an incoming message
const message = agentToOrchestratorMessageSchema.parse(JSON.parse(rawData));

// TypeScript narrows the type based on the 'type' discriminator
switch (message.type) {
  case 'agent.register':
    handleRegister(message.agentId, message.labels);
    break;
  case 'job.status':
    handleStatus(message.runId, message.jobId, message.status);
    break;
}
```

Request tracing model
KiCI propagates a requestId (UUIDv4) through the entire webhook processing pipeline for end-to-end observability. The trace ID enables correlating all log lines, database operations, and protocol messages belonging to a single webhook event.
requestId lifecycle
- Origin: Generated at webhook ingestion, either by KiCI (for WS-relayed webhooks) or by the orchestrator’s direct HTTP endpoint (for independent/hybrid mode)
- Pipeline propagation: Carried through AsyncLocalStorage (ALS) during synchronous webhook processing: webhook.relay -> processWebhook -> trigger matching -> execution start -> job dispatch
- Queue persistence: When no agent is immediately available, the requestId is persisted in the dispatch_queue PostgreSQL table alongside the job payload
- Queue restoration: When a queued job is drained (agent becomes available), the requestId is read from the database and restored into a new ALS scope via requestContext.run()
- Agent dispatch: Included in the job.dispatch message so the agent can log it during execution
- Completion callbacks: Stored in the ExecutionTracker’s in-memory RunState at execution start, then passed back to the onExecutionComplete and onStepStatusForward callbacks. The callback handlers in server.ts and standalone.ts restore ALS context via requestContext.run() so that check run updates and upstream forwarding are logged with the correct trace ID.
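The persist-then-restore steps above can be sketched with Node's built-in `AsyncLocalStorage`. This is a simplified illustration under stated assumptions: the in-memory array stands in for the dispatch_queue table, and `requestContext`, `enqueue`, and `drain` are local stand-ins rather than KiCI's actual implementations.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

interface RequestContext {
  requestId: string;
}

// Local stand-in for the requestContext mentioned above.
const requestContext = new AsyncLocalStorage<RequestContext>();

function currentRequestId(): string | undefined {
  return requestContext.getStore()?.requestId;
}

// ASSUMPTION: an in-memory array stands in for the dispatch_queue table.
interface QueuedJob {
  jobId: string;
  requestId: string;
}
const dispatchQueue: QueuedJob[] = [];

// Queue persistence: copy the requestId out of ALS next to the job payload.
function enqueue(jobId: string): void {
  const requestId = currentRequestId();
  if (requestId === undefined) {
    throw new Error('enqueue called outside a request scope');
  }
  dispatchQueue.push({ jobId, requestId });
}

// Queue restoration: open a new ALS scope per job with the stored requestId.
function drain(handle: (jobId: string) => void): void {
  for (const job of dispatchQueue.splice(0)) {
    requestContext.run({ requestId: job.requestId }, () => handle(job.jobId));
  }
}

// Origin: a UUIDv4 is generated at ingestion and the job is queued in its scope.
const originRequestId = randomUUID();
requestContext.run({ requestId: originRequestId }, () => enqueue('job-1'));

// Later, outside the original scope, draining restores the same requestId.
const idsSeenDuringDrain: Array<string | undefined> = [];
drain(() => idsSeenDuringDrain.push(currentRequestId()));
```

The drain runs outside the original `run()` scope, so without the stored value `currentRequestId()` would return `undefined` there; that gap is why the ID is persisted alongside the job.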
app.service field
The app.service field identifies which KiCI tier produced a log line (values: platform, orchestrator, agent). It is set process-wide at startup via setServiceName() from @kici-dev/shared. This field is independent of container naming and works in all deployment models (containerized, bare-metal, Firecracker). For forwarded agent logs flowing through orchestrator stdout, the service field is preserved from the agent’s original JSON output.
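The behavior described above can be sketched as follows. Every name here is a local stand-in: `setServiceNameSketch` mimics the role of `setServiceName()` without claiming its real signature, and writing the field as a flat `"app.service"` JSON key is an assumption about the log layout.

```typescript
type ServiceName = 'platform' | 'orchestrator' | 'agent';

// Process-wide value, set once at startup.
let configuredService: ServiceName | undefined;

// Local stand-in for setServiceName() from @kici-dev/shared; the real
// signature may differ.
function setServiceNameSketch(service: ServiceName): void {
  configuredService = service;
}

// ASSUMPTION: the field is emitted as a flat "app.service" key.
function formatLogLine(msg: string): string {
  return JSON.stringify({ 'app.service': configuredService, msg });
}

// Forwarding: a line that is already JSON with an app.service field passes
// through untouched, so the agent's original value is preserved.
function forwardAgentLine(rawLine: string): string {
  try {
    const parsed: unknown = JSON.parse(rawLine);
    if (typeof parsed === 'object' && parsed !== null && 'app.service' in parsed) {
      return rawLine;
    }
  } catch {
    // Not JSON: fall through and wrap it as an orchestrator log line.
  }
  return formatLogLine(rawLine);
}

setServiceNameSketch('orchestrator');
const ownLogLine = formatLogLine('dispatching job');
const forwardedLogLine = forwardAgentLine('{"app.service":"agent","msg":"step started"}');
```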