Multi-orchestrator clustering
KiCI supports running multiple orchestrators in a cluster. Clustering enables high availability (HA) through redundancy, cross-architecture job routing (e.g., x64 + ARM64 agents), and dedicated coordinator topologies for large deployments.
Cluster components (Raft consensus, peer registry, health endpoints) are always initialized. A single-orchestrator deployment works unchanged — Raft dormant mode self-elects as leader immediately with zero overhead. No extra configuration is needed for single-orch deployments.
Overview
Section titled “Overview”In a cluster, each orchestrator is a full-featured instance that can both receive webhooks and execute jobs. When a webhook arrives, the receiving orchestrator becomes the run coordinator for that webhook event. The coordinator matches triggers against the lock file, claims jobs it can dispatch to its own local agents, and reroutes remaining jobs to peers with matching capacity.
Key capabilities:
- HA pair — two identical orchestrators behind the Platform round-robin for redundancy
- Cross-architecture pools — x64 and ARM64 orchestrators with different scalers, same routing key
- Dedicated coordinator — one orchestrator with no scalers (coordinator-only) handles webhook processing while others handle execution
- Automatic peer discovery — in Platform/hybrid modes, the Platform matchmaker tells orchestrators about peers
- Raft leader election — one orchestrator is elected leader for cluster-wide operations like orphan run recovery
How many clusters?
Section titled “How many clusters?”Most orgs need one cluster. A single cluster spans architectures and
hardware shapes through multiple scalers — x64 containers, ARM64 containers,
bare-metal, Firecracker, GPU pools — and routes each job to a matching scaler
by its runs-on labels. To add capacity or a new agent type (an ARM64 pool, a
GPU pool), add a scaler, not a cluster: the x64 + ARM64 pool recipe below
runs one cluster across two architectures, and the
auto-scaler overview covers the available backends.
High availability also lives inside one cluster: redundancy comes from running multiple coordinators that share one PostgreSQL (the HA pair recipe below), not from standing up more clusters.
Reserve a separate cluster for a genuine isolation boundary — a distinct team that needs its own dashboard write policy, environments, secrets, and sources. A separate cluster is not the default for every environment, region, or architecture.
Every additional cluster carries real setup cost: its own sources and webhook registration, its own join-token bootstrap, its own (possibly divergent) configuration, and a separate dashboard surface to operate. Prefer fewer clusters with more scalers until an isolation boundary forces a split.
Prerequisites
Section titled “Prerequisites”Shared API key (coordinators only)
Section titled “Shared API key (coordinators only)”All coordinator orchestrators in a cluster authenticate to the Platform using the same API key (KICI_PLATFORM_TOKEN). This is by design — a cluster is a single operator’s deployment, and the API key is scoped to that operator’s organization. Each coordinator connects independently to the Platform, but they all belong to the same org and share webhook sources, runs, and dashboard data.
Workers do not need a Platform token. Workers connect only to their coordinator via P2P WebSocket (KICI_CLUSTER_COORDINATOR_URL) and never talk to the Platform relay. They have no database, no S3 credentials, and no API key — see coordinator/worker deployment for details.
Different operators (customers) must use separate API keys and organizations. Each API key is bound to a single organization on the Platform, so cross-tenant traffic is rejected at the relay before it reaches any orchestrator.
Shared PostgreSQL (mandatory)
Section titled “Shared PostgreSQL (mandatory)”All orchestrators in a cluster must share the same PostgreSQL database. This is a hard requirement — join tokens, peer credentials, Raft state, execution tracking, webhook secrets, cluster metadata, and the dispatch queue all live in this shared database.
The shared database ensures:
- Join token consumption is atomic across coordinators (DB transactions prevent races)
- Peer credentials are globally visible, so any coordinator can validate a reconnecting peer
- Cluster identity (
cluster_idincluster_meta) is consistent — at startup, each coordinator validates it reads the samecluster_id, preventing accidental misconfiguration where separate clusters point at the same database. When S3 storage is configured, the validation also checks a<prefix>/.kici-cluster-idsentinel object so two clusters can safely share a physical bucket if they use distinctKICI_STORAGE_PREFIXvalues. See cluster identity in multi-orchestrator design for the full sentinel contract - Credential tracking records which coordinator last validated each peer (
last_validated_by), providing operational visibility into connection routing
Do not run separate PostgreSQL instances per orchestrator in a cluster. This will cause split-brain: each orchestrator would generate a different cluster_id, tokens consumed on one would not be visible to others, and peer credentials would be siloed.
Peer authentication (join tokens)
Section titled “Peer authentication (join tokens)”Peer-to-peer WebSocket connections use an ECDH key exchange followed by join token or credential authentication. The first orchestrator (coordinator) starts normally. Additional peers join using a one-time join token that authenticates them and issues a persistent credential for subsequent connections.
Network connectivity
Section titled “Network connectivity”Orchestrators need to reach each other via WebSocket for direct peer connections. Set KICI_CLUSTER_ADDRESS to a reachable host:port for each orchestrator. Without a reachable address, peers cannot establish direct connections.
In Platform/hybrid modes, orchestrators that cannot reach each other directly fall back to relay through the Platform tier.
Configuration reference
Section titled “Configuration reference”Cluster configuration uses the KICI_CLUSTER_* environment variable prefix.
| Environment Variable | Default | Required | Description |
|---|---|---|---|
KICI_CLUSTER_JOIN_TOKEN | — | First join only | One-time join token for authenticating with the cluster. Used only on the first connection; a persistent credential is issued after successful authentication. |
KICI_CLUSTER_CREDENTIAL_FILE | ~/.kici/peer-credential | — | Path to store/load the persistent peer credential. After first join, the orchestrator uses this credential for all subsequent connections. |
KICI_CLUSTER_INSTANCE_ID | random UUID | — | Unique identifier for this orchestrator instance. Auto-generated if not set. |
KICI_CLUSTER_ADDRESS | — | When peers set | This orchestrator’s reachable address (e.g., ws://10.0.0.1:4000). Required when KICI_CLUSTER_PEERS is set. |
KICI_CLUSTER_PEERS | — | Multi-orch independent | Comma-separated list of peer addresses (e.g., ws://10.0.0.2:4000,ws://10.0.0.3:4000). Only needed for multi-orchestrator independent mode (Platform/hybrid uses automatic peer discovery). |
KICI_CLUSTER_RAFT_ELECTION_TIMEOUT_MIN_MS | 5000 | — | Minimum Raft election timeout (ms). |
KICI_CLUSTER_RAFT_ELECTION_TIMEOUT_MAX_MS | 10000 | — | Maximum Raft election timeout (ms). |
KICI_CLUSTER_RAFT_HEARTBEAT_MS | 2000 | — | Raft leader heartbeat interval (ms). |
KICI_CLUSTER_PEER_HEARTBEAT_INTERVAL_MS | 30000 | — | Peer inventory heartbeat interval (ms). |
KICI_CLUSTER_PEER_MAX_RECONNECT_DELAY_MS | 60000 | — | Maximum delay between peer reconnect attempts (ms). |
KICI_CLUSTER_ROLE | coordinator | For workers | Cluster role: coordinator (default, full orchestrator) or worker (delegated execution only). See Coordinator/worker deployment. |
KICI_CLUSTER_COORDINATOR_URL | — | For workers | WebSocket URL of the coordinator’s peer endpoint (e.g., ws://coordinator:4000/ws/peer). Required when KICI_CLUSTER_ROLE=worker. |
KICI_CLUSTER_PEER_STALE_TIMEOUT_MS | 60000 | — | Timeout in ms after which a peer with no heartbeat is considered stale. |
KICI_CLUSTER_AUTO_ROTATE_CREDENTIALS | false | — | When true, enables automatic rotation of peer credentials. Boolean. |
KICI_CLUSTER_TRUSTED_PROXIES | — | Behind reverse proxy | Comma-separated list of trusted proxy IPs or CIDR ranges (e.g., 10.0.0.0/8,172.16.0.0/12). When set, the peer handler extracts the real client IP from X-Forwarded-For instead of using the socket IP. Required for correct rate limiting when orchestrators are behind a load balancer or reverse proxy. |
Mode-specific requirements
Section titled “Mode-specific requirements”- Single-orchestrator (any mode): No cluster env vars needed. Cluster components initialize in dormant mode automatically.
- Multi-orchestrator Platform/hybrid mode: The first orchestrator starts normally. Additional peers need a join token created by
kici-admin peer create-token. The Platform matchmaker handles peer discovery automatically. - Multi-orchestrator independent mode:
KICI_CLUSTER_JOIN_TOKEN(for peers),KICI_CLUSTER_ADDRESS, andKICI_CLUSTER_PEERSare required because there is no Platform matchmaker for peer discovery.
Peer authentication flow
Section titled “Peer authentication flow”Peer authentication uses ECDH (X25519) key exchange to establish an encrypted channel before any credentials are transmitted.
First join (with token)
Section titled “First join (with token)”- Start the coordinator — the first orchestrator in the cluster starts normally with
KICI_SECRET_KEYset - Create a join token on the coordinator:
Terminal window kici-admin peer create-token --role coordinator# or for a worker peer:kici-admin peer create-token --role worker - Start the peer with the join token:
Terminal window KICI_CLUSTER_JOIN_TOKEN=kici_join_v1.xxx.yyy - ECDH handshake — the peer and coordinator exchange ephemeral X25519 public keys via
peer.hello/peer.hello.responsemessages - Encrypted auth — the peer sends the join token in an encrypted
peer.auth.request. The coordinator validates the token and responds with an encryptedpeer.auth.responsecontaining a session credential - Credential persistence — the peer saves the issued credential to
KICI_CLUSTER_CREDENTIAL_FILE(default:~/.kici/peer-credential) - Token consumed — the join token is marked as used and cannot be reused
Subsequent connections (with credential)
Section titled “Subsequent connections (with credential)”After the first join, the orchestrator uses its persisted credential file for all subsequent connections. No join token is needed:
- The orchestrator loads the credential from
KICI_CLUSTER_CREDENTIAL_FILE - ECDH handshake establishes an encrypted channel
- The orchestrator sends an HMAC proof of the credential (never the credential itself)
- The coordinator verifies the HMAC proof using
timingSafeEqual - Connection authenticated — the peer resumes normal operation
Security properties
Section titled “Security properties”- No cleartext auth material — all authentication happens over the ECDH-encrypted channel
- One-time tokens — join tokens are consumed after use
- HMAC credential proof — credentials are never sent over the wire; only an HMAC proof is transmitted
- Rate limiting — failed authentication attempts are rate-limited (5 attempts per IP within 60 seconds)
- Post-auth encryption — all subsequent messages (heartbeats, reroutes, Raft) use the ECDH-derived session key
Deployment recipes
Section titled “Deployment recipes”HA pair
Section titled “HA pair”Two identical orchestrators sharing the same PostgreSQL, scalers, and routing key. Provides redundancy: if one goes down, the other continues processing webhooks.
+---> Orchestrator A (coordinator) Platform ----+ | (round | | robin) +-----+ | +---> Orchestrator B (peer) joins via tokenStep 1: Start Orchestrator A (coordinator):
KICI_MODE=platformKICI_SECRET_KEY=<64-char-hex-key>KICI_CLUSTER_ADDRESS=ws://orchestrator-a:4000KICI_DATABASE_URL=postgres://user:pass@shared-db:5432/kiciKICI_SCALER_CONFIG_PATH=/etc/kici/scalers.yamlStep 2: Create a join token on Orchestrator A:
kici-admin peer create-token --role coordinatorStep 3: Start Orchestrator B with the token:
KICI_MODE=platformKICI_SECRET_KEY=<64-char-hex-key>KICI_CLUSTER_JOIN_TOKEN=kici_join_v1.xxx.yyyKICI_CLUSTER_ADDRESS=ws://orchestrator-b:4000KICI_DATABASE_URL=postgres://user:pass@shared-db:5432/kiciKICI_SCALER_CONFIG_PATH=/etc/kici/scalers.yamlAfter the first successful connection, Orchestrator B saves its credential and no longer needs the join token. Subsequent restarts use the persisted credential automatically.
x64 + ARM64 pool
Section titled “x64 + ARM64 pool”Two orchestrators with different scalers targeting different architectures. Same routing key so both receive webhooks. Jobs are automatically routed to the orchestrator with agents matching the job’s runs-on labels.
+---> Orchestrator A Platform ----+ | scalers: x64 containers (round | | agents: [self-hosted, linux, x64] robin) +-----+ | +---> Orchestrator B scalers: ARM64 containers agents: [self-hosted, linux, arm64]Orchestrator A (x64, coordinator):
KICI_MODE=platformKICI_SECRET_KEY=<64-char-hex-key>KICI_CLUSTER_ADDRESS=ws://orch-x64:4000KICI_SCALER_CONFIG_PATH=/etc/kici/scalers-x64.yamlOrchestrator B (ARM64, peer — first start):
KICI_MODE=platformKICI_SECRET_KEY=<64-char-hex-key>KICI_CLUSTER_JOIN_TOKEN=<token-from-orch-a>KICI_CLUSTER_ADDRESS=ws://orch-arm64:4000KICI_SCALER_CONFIG_PATH=/etc/kici/scalers-arm64.yamlWhen a webhook arrives at Orchestrator A and the job requires arm64 labels, the coordinator reroutes it to Orchestrator B (which has ARM64 agents). The coordinator still handles all check run reporting.
Dedicated coordinator + workers
Section titled “Dedicated coordinator + workers”One orchestrator with no scalers acts as the coordinator: it processes webhooks, matches triggers, creates check runs, and reroutes all jobs to worker orchestrators. Workers have scalers and agents but delegate check run reporting to the coordinator.
+---> Coordinator (no scalers) Platform ----+ | handles: webhook processing, check runs (round | | robin) +-----+---> Worker A (x64 scalers) | handles: job execution only | +---> Worker B (ARM64 scalers) handles: job execution onlyCoordinator:
KICI_MODE=platformKICI_SECRET_KEY=<64-char-hex-key>KICI_CLUSTER_ADDRESS=ws://coordinator:4000# No KICI_SCALER_CONFIG_PATH -- this orchestrator has no agentsWorkers (first start):
KICI_MODE=platformKICI_SECRET_KEY=<64-char-hex-key>KICI_CLUSTER_JOIN_TOKEN=<token-from-coordinator>KICI_CLUSTER_ADDRESS=ws://worker-a:4000 # or worker-bKICI_SCALER_CONFIG_PATH=/etc/kici/scalers.yamlCreate separate join tokens for each worker using kici-admin peer create-token --role worker.
In this topology, the coordinator always reroutes jobs since it has no local agents. Workers accept rerouted jobs and dispatch to their agents. The coordinator tracks job progress from workers and updates GitHub check runs.
Storage configuration (SharedConfig)
Section titled “Storage configuration (SharedConfig)”S3 recommendation for multi-orchestrator pools
Section titled “S3 recommendation for multi-orchestrator pools”Multi-orchestrator pools strongly recommend shared S3 log storage so any pool member can serve historical logs (the Platform’s leastLoaded() router can pick any orch in the pool to handle a dashboard request). When a second orchestrator joins a pool, the Platform compares its s3LogAccess flag against the existing members and logs a warning if they disagree, but does not reject the connection — this supports the coordinator/worker topology where the coordinator has S3 access and the workers do not (workers reroute jobs but do not serve logs directly).
The s3LogAccess flag is true whenever storage.type=s3 is configured (the same S3 backend underpins both the cache storage and the log storage). When it is false, the orchestrator falls back to a filesystem LogStorage rooted at ${webhookPayloadDir ?? '/var/lib/kici/cache'}/logs — logs and webhook payloads are still stored, just on the local disk of the orch that ingested the run. Source-tarball / dep-tarball caching is disabled in this mode (every run recompiles).
Single-orchestrator deployments can use filesystem storage with no penalty — the same orch ingests, executes, and serves logs.
The s3LogAccess field is sent in the source.register message (and in peer heartbeats), so the Platform records it per connection and surfaces it on the diagnostics page.
SharedConfig storage migration
Section titled “SharedConfig storage migration”Storage settings can be supplied either via env vars on each orchestrator or via the SharedConfig system in the database. Both paths land in the same storage.* config field.
Env vars and the SharedConfig fields they populate:
| Env var | SharedConfig field |
|---|---|
KICI_STORAGE_TYPE | storage.type |
KICI_STORAGE_BUCKET | storage.bucket |
KICI_STORAGE_PREFIX | storage.prefix |
KICI_STORAGE_REGION | storage.region |
KICI_STORAGE_ENDPOINT | storage.endpoint |
KICI_STORAGE_EXTERNAL_ENDPOINT | storage.externalEndpoint |
KICI_STORAGE_UPLOAD_ENDPOINT | storage.uploadEndpoint |
KICI_STORAGE_FORCE_PATH_STYLE | storage.forcePathStyle |
KICI_STORAGE_LOG_BUCKET | storage.logBucket |
Migration path:
- Move storage settings from env vars to the SharedConfig in the database
- Remove the env vars from your deployment
- Restart the orchestrators
The orchestrator bridges env vars into config.storage at startup. Once migrated to SharedConfig, all orchestrators in the cluster share the same storage configuration automatically.
Note: storage is a restart-required field. Changes to storage config detected during hot-reload emit a warning but are not applied until the orchestrator is restarted.
Cluster join tokens
Section titled “Cluster join tokens”Join tokens enable zero-knowledge cluster bootstrap. A new orchestrator can join an existing cluster with a single command — no manual config copying required.
Creating a join token
Section titled “Creating a join token”Via CLI (recommended):
kici-admin peer create-token --role coordinator# or for a worker:kici-admin peer create-token --role workerVia API:
curl -X POST https://orchestrator:4000/api/v1/admin/join-tokens \ -H "Authorization: Bearer $ADMIN_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "orgId": "my-org", "routingKey": "github:12345", "role": "coordinator", "expiryMs": 3600000 }'Response:
{ "token": "kici_join_v1.eyJvcmdJZCI6Im15LW9yZyIsInJvdXRpbmdLZXkiOiJnaXRodWI6MTIzNDUiLCJleHBpcnkiOjE3MTY5MjAwMDB9.a1b2c3d4...", "expiresAt": "2026-03-20T08:00:00.000Z"}Endpoint: POST /api/v1/admin/join-tokens
| Field | Type | Required | Description |
|---|---|---|---|
| orgId | string | Yes | Organization ID for the target pool |
| routingKey | string | Yes | Routing key (e.g., github:12345) |
| role | string | No | Peer role: coordinator (default) or worker |
| expiryMs | number | No | Token validity duration in ms (default: 1h) |
Requires token.manage RBAC permission.
Using a join token
Section titled “Using a join token”Set the token as an environment variable before starting the orchestrator:
KICI_CLUSTER_JOIN_TOKEN=kici_join_v1.xxx.yyyThe orchestrator authenticates with the token on its first connection. After successful authentication, a persistent credential is issued and saved to the credential file. The join token is consumed (one-time use) and no longer needed.
Token security
Section titled “Token security”- Tokens are one-time use — consumed after successful join
- Token hashes (not plaintext) are stored in the database
- Auth material is transmitted over an ECDH-encrypted channel (never in cleartext)
- Tokens carry a role (
coordinatororworker) enforced at join time
Credential management
Section titled “Credential management”After the initial join, peers authenticate using persistent credentials stored in a local file.
Credential file
Section titled “Credential file”The credential file is stored at ~/.kici/peer-credential by default (configurable via KICI_CLUSTER_CREDENTIAL_FILE). It contains:
- Instance ID
- Credential hash
- Role (coordinator or worker)
- Coordinator URL
- Issued timestamp
The file is created with 0600 permissions (owner read/write only).
Managing peers
Section titled “Managing peers”List active peers:
kici-admin peer listRevoke a peer:
kici-admin peer revoke --instance-id <id>The revoked peer’s credential is invalidated. The peer must re-join with a new token.
Revoke all peers:
kici-admin peer revoke-all --confirmAll peer credentials are invalidated. All peers must re-join with new tokens. Use this for emergency security responses.
Re-joining after revocation
Section titled “Re-joining after revocation”If a peer’s credential is revoked:
- Create a new join token:
kici-admin peer create-token --role coordinator - Set the token on the revoked peer:
KICI_CLUSTER_JOIN_TOKEN=<new-token> - Restart the peer
- The peer authenticates with the new token and receives a new credential
Monitoring
Section titled “Monitoring”Health endpoints
Section titled “Health endpoints”Three cluster HTTP endpoints are always mounted:
| Endpoint | Description |
|---|---|
GET /cluster/health | Overall cluster health: status (healthy/degraded/unhealthy), role, term, leader, peer count, agent count, active runs |
GET /cluster/peers | Per-peer details: instance ID, connection state, agent count, draining status, capabilities |
GET /cluster/runs | Active execution runs with job routing summary |
Health status logic:
- healthy — Raft leader exists AND all peers connected (or single-node with no peers)
- degraded — Raft leader exists and majority of peers connected, but some peers disconnected
- unhealthy — no Raft leader OR fewer than majority of nodes connected
Example health check:
curl -s http://orchestrator:4000/cluster/health | jq .{ "status": "healthy", "instanceId": "orch-a-abc123", "role": "leader", "term": 3, "leaderId": "orch-a-abc123", "peerCount": 1, "connectedPeers": 1, "agentCount": 4, "activeRuns": 2}Prometheus metrics
Section titled “Prometheus metrics”Cluster operations emit standard Prometheus metrics alongside existing orchestrator metrics. Relevant metrics include peer connection counts, job reroute totals, Raft election events, and orphan recovery counts.
Webhook secret management
Section titled “Webhook secret management”Webhook secrets live in the orchestrator’s shared PostgreSQL database under scoped_secrets (the encrypted PgSecretStore), keyed by source ID. The sources table holds the source’s metadata (routing key, provider config) and joins to the secret on __source__/<sourceId>. Use kici-admin source to manage them — never write to scoped_secrets directly.
How secrets work
Section titled “How secrets work”- Operators register sources via the CLI:
kici-admin source add github --app-id <id> --webhook-secret <secret> --private-key-file <path>. The CLI inserts a row intosourcesand writes the webhook secret intoscoped_secretsviaPgSecretStore.setSecret(). - The orchestrator sends
source.registerto the Platform on connect, advertising the routing key (e.g.,github:12345). The webhook secret is never sent over the wire — the message carries only the routing key and provider type. - When a webhook arrives at the Platform, the Platform sends a
webhook.relay.start/webhook.relay.chunksequence to the orchestrator that owns the routing key. The orchestrator reads the secret fromPgSecretStoreand verifies the HMAC signature locally, then ACKs back to the Platform with the verification outcome. - The Platform tier never holds webhook secret material — verification is delegated to the orchestrator on every inbound delivery.
Rotating a webhook secret
Section titled “Rotating a webhook secret”Each source carries exactly one active webhook secret. To rotate without dropping deliveries:
- Update the provider first (e.g., GitHub App settings) to use the new secret. Provider-side rotation typically tolerates a brief window where in-flight deliveries still carry the old signature, but the new secret takes effect on the next delivery.
- Update the orchestrator-side secret:
This rewrites
Terminal window kici-admin source update <routingKey> --webhook-secret <new-secret>scoped_secretsin a single transaction. Inflightwebhook.relayverifications may briefly fail during the swap; the provider will retry. - After the rotation settles, operate as normal. There is no second-row “two-secret” mode — the source carries one secret at a time.
For a coordinated zero-downtime rotation, drain webhook traffic at the load balancer for the few seconds between provider and orchestrator updates.
Rate limiting behind reverse proxies
Section titled “Rate limiting behind reverse proxies”Peer authentication rate limiting tracks failed attempts by IP address. When orchestrators sit behind a reverse proxy or load balancer, all connections appear to come from the proxy’s IP, which can cause legitimate peers to be rate-limited when a single bad actor triggers the limit.
Set KICI_CLUSTER_TRUSTED_PROXIES to the proxy’s IP or CIDR range so the peer handler extracts the real client IP from the X-Forwarded-For header:
KICI_CLUSTER_TRUSTED_PROXIES=10.0.0.0/8,172.16.0.0/12Without this setting, rate limiting uses the socket IP (the proxy), which may incorrectly block all peers behind the same proxy.
Troubleshooting
Section titled “Troubleshooting”Peer authentication failed
Section titled “Peer authentication failed”Symptom: Peer connections fail with “Invalid join token” or “Invalid credential” in logs.
Checks:
- Join token expired — tokens expire after 1 hour by default. Create a new one with
kici-admin peer create-token - Token already consumed — join tokens are one-time use. Create a new one for each peer
- Credential revoked — if the credential was revoked via
kici-admin peer revoke, the peer needs a new join token - Rate limited — after 5 failed auth attempts within 60 seconds, the IP is temporarily blocked. Wait and retry
Peers not connecting
Section titled “Peers not connecting”Symptom: /cluster/peers shows 0 connected peers despite multiple orchestrators running.
Checks:
- Verify
KICI_CLUSTER_ADDRESSis set and reachable from peer orchestrators - In independent mode, verify
KICI_CLUSTER_PEERSlists all peer addresses - Check firewall rules allow WebSocket connections on the orchestrator port
- In Platform mode, verify both orchestrators connect to the same Platform relay and register the same routing key
Stale peers
Section titled “Stale peers”Symptom: /cluster/peers shows peers as connected but their last heartbeat is old.
The peer heartbeat interval is 30 seconds by default. If heartbeats stop, the peer may have crashed or lost network connectivity. The peer will be marked as disconnected after the connection closes.
Orphan runs
Section titled “Orphan runs”Symptom: Execution runs stuck in “running” state after an orchestrator crash.
The Raft leader runs periodic orphan recovery (every 60 seconds). It detects runs whose coordinator orchestrator is no longer connected and finalizes them. If no leader is elected, orphan recovery cannot run — check /cluster/health to verify a leader exists.
Job routing limits
Section titled “Job routing limits”Rerouted jobs carry a hop counter to prevent infinite routing loops. If a job exceeds the maximum hop count, it fails instead of being rerouted again.
| Limit | Value | Description |
|---|---|---|
| Maximum hops | 3 | Jobs rerouted more than 3 times are failed to prevent routing loops |
| ACK timeout | 15s | Time for a peer to acknowledge receipt of a rerouted job |
Jobs not rerouting
Section titled “Jobs not rerouting”Symptom: Jobs fail with “No orchestrator in cluster has matching agents” even though a peer has agents.
Checks:
- Verify the peer is connected (
/cluster/peers) - Verify the peer’s agents have matching labels (labels are shared via heartbeat)
- Verify the peer’s agents have available capacity (not at max concurrency)
- Verify the peer is not draining
See also
Section titled “See also”- Multi-orchestrator architecture — coordinator/worker model, Raft consensus, peer communication, and the rerouting protocol behind this deployment guide.
- Coordinator/worker architecture — how the dedicated-coordinator topology splits webhook processing from job execution.