Monitoring & tracing
Execution metrics show a 24-hour snapshot of your CI pipeline health:
- Total runs in the last 24h.
- Success rate across those runs.
- Average duration of completed runs.
- Active jobs — queued and running combined.
Use these to spot trends in failure rates or unusual queue buildup.
The infrastructure section shows a hierarchical tree of orchestrators, their scaler pools, and connected agents.
Each orchestrator row displays three identifiers plus host metadata:
- orchestrator: (bold) is the instance ID set via KICI_CLUSTER_INSTANCE_ID, or the first 8 chars of the connection ID if none is set.
- conn: (dimmed) is the first 8 chars of the Platform WebSocket connection ID, shown only when an explicit instance ID is present.
- host: (right side) is the system hostname.
- OS metadata: CPU, memory, uptime, plus connection status.
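The identifier fallback described above can be sketched as a small function. This is an illustration of the display rule only, not KiCI's implementation; the function name and the returned dictionary shape are hypothetical.

```python
from typing import Optional


def orchestrator_identifiers(connection_id: str, instance_id: Optional[str] = None) -> dict:
    """Return the identifiers an orchestrator row would display.

    Hypothetical helper that mirrors the documented fallback rule.
    """
    short_conn = connection_id[:8]  # first 8 chars of the connection ID
    if instance_id:
        # Explicit instance ID: it is the primary label, and conn: is shown too.
        return {"orchestrator": instance_id, "conn": short_conn}
    # No instance ID: the short connection ID becomes the primary label,
    # so a separate conn: value would be redundant and is omitted.
    return {"orchestrator": short_conn, "conn": None}
```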
Click or hover the info icon for registration details: webhook sources (parsed from routing keys), deployment mode (platform / hybrid / independent), configured scaler backends, S3 log access status, connection timestamp, Node.js version, and OS information.
Below each orchestrator:
- Scaler rows: pool type, active/max agent counts, labels, and safe configuration details via a popover. Scaler rows whose backend spawns on the orchestrator’s own host (bare-metal, Firecracker, and container pools using a local runtime socket) also show that spawning host as a kici:host:<hostname> badge directly on the row, even while idle. Pools that provision elsewhere (a container pool pointed at a remote runtime) omit it.
- Agent rows: ID, platform, host, memory usage (color-coded badges), uptime, and labels, including auto-generated kici: labels such as the kici:host:<hostname> routing label, shown as badges below each agent.
- Stateful agents section: standalone agents not managed by any scaler. Expand the section to see each standalone agent with its own kici:host:<hostname> label.
Every label badge is clickable — click one to copy its value to the clipboard.
Why this count can differ from billing. The badge in the section header counts every orchestrator node in your topology — coordinator + each peer in a Raft cluster + each standalone — because the goal here is operational visibility. The billing page counts orchestrator connections: only coordinators and standalones open a direct Platform WebSocket; peers gossip through their coordinator. A 4-connection org running two 3-node clusters and two standalones shows 8 nodes here and 4 connections on billing — both correct, answering different questions.
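The arithmetic behind the two counts can be checked with a short sketch. The function name and the topology representation (a list of cluster sizes plus a standalone count) are assumptions made for illustration.

```python
def count_nodes_and_connections(cluster_sizes, standalones):
    """Return (nodes shown on this page, connections shown on billing).

    cluster_sizes: node count of each Raft cluster (coordinator plus peers).
    standalones:   number of standalone orchestrators.
    """
    # Every node is counted for operational visibility.
    nodes = sum(cluster_sizes) + standalones
    # Only one coordinator per cluster, plus each standalone, opens a
    # direct Platform WebSocket.
    connections = len(cluster_sizes) + standalones
    return nodes, connections


# Two 3-node clusters and two standalones, as in the example above.
print(count_nodes_and_connections([3, 3], 2))  # (8, 4)
```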
Each orchestrator node and each persistent agent shows its running version next to the latest published version.
When a node runs a version older than the latest published one, an update available badge appears. Opening it reveals a two-line copy-paste command to run on that host:
npm install -g @kici-dev/<package>@<latest>
kici-admin <component> upgrade --version <latest> --yes
The dashboard never mutates anything — it only reveals the command for you to run. Ephemeral (scaler-spawned) agents are excluded, since the scaler recreates them with the current version.
Shows the OS user the orchestrator process is running as (username and UID).
Root is expected for Firecracker scalers (which require root for VM management). For bare-metal scalers, a non-privileged user is safer — running as root there usually indicates a misconfiguration.
Badge color codes:
- Red: root + bare-metal scaler.
- Green: root + Firecracker scaler.
- Neutral: any non-root user.
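As a rough sketch, the color rule reads like a small decision function. The function name and the scaler type strings are hypothetical, and the source does not say what is shown for root combined with other scaler types, so that case is returned as "unspecified" here.

```python
def user_badge_color(uid: int, scaler_type: str) -> str:
    """Map the orchestrator's OS user and scaler type to a badge color.

    Illustrative only: assumes one scaler type per orchestrator and
    treats UID 0 as root.
    """
    if uid != 0:
        return "neutral"  # any non-root user
    if scaler_type == "bare-metal":
        return "red"      # root is usually a misconfiguration here
    if scaler_type == "firecracker":
        return "green"    # root is expected for VM management
    return "unspecified"  # not covered by the documented color codes
```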
The secret backends section shows the health status of all registered secret backends (PostgreSQL and Vault).
Each card displays:
- Backend name and type.
- Health status and last error.
- Sync interval and scope count.
Action buttons:
- Sync — trigger an immediate scope refresh.
- Test — verify connectivity to the backend.
Orphaned connections are WebSocket connection records in the database that no longer have a corresponding live socket on any Platform instance. They typically occur after ungraceful Platform restarts or network partitions.
The orphan sweeper automatically cleans up connections older than 6 minutes (2× the maximum expected heartbeat interval).
A high orphan count may indicate Platform instance instability — investigate before the sweeper masks the problem.
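The sweeper's eligibility rule can be sketched as follows. This is not the actual sweeper: the record fields (`id`, `last_seen`) are hypothetical, age is assumed to be measured from the last heartbeat, and the 3-minute maximum heartbeat interval is inferred from "6 minutes = 2×".

```python
from datetime import datetime, timedelta, timezone

# Inferred from the text: 6 minutes is 2x the maximum expected heartbeat interval.
MAX_HEARTBEAT_INTERVAL = timedelta(minutes=3)
ORPHAN_THRESHOLD = 2 * MAX_HEARTBEAT_INTERVAL


def sweepable(connections, live_ids, now):
    """Return IDs of connection records with no live socket that are
    older than the orphan threshold."""
    return [
        c["id"]
        for c in connections
        if c["id"] not in live_ids and now - c["last_seen"] > ORPHAN_THRESHOLD
    ]
```

Records younger than the threshold are left alone so that a connection that merely missed one heartbeat is not removed prematurely.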
The Raft role badge on each orchestrator row is explained inline (hover the badge). Full design — election parameters, dormant single-orchestrator mode, state persistence — lives in Multi-orchestrator architecture § Raft consensus.
KiCI provides distributed tracing across all three tiers (Platform, orchestrator, agent) using structured log fields. Every webhook event is assigned a unique trace ID at ingestion, which propagates through the entire pipeline and appears in all related log lines.
Distributed tracing
Trace fields
All KiCI services emit structured JSON logs. Grafana Alloy (running on every host that produces KiCI logs) parses each line and pushes it to Loki with low-cardinality labels (env, host, service, instance) plus per-line structured metadata carrying the trace fields below. Inside LogQL queries the fields are addressed via | json | <field>="...".
| Loki field | Raw JSON field | Generated at | Description |
|---|---|---|---|
| requestId | requestId | Platform | UUID assigned when a webhook is received. Traces the event through all three tiers. |
| runId | runId | Orchestrator | UUID assigned when a workflow run is dispatched. Groups all jobs in a single run. |
| routingKey | routingKey | Platform | Webhook routing key (e.g., github:42). Present on webhook-related log lines. |
| jobId | jobId | Orchestrator | Job identifier. Present during job execution on orchestrator and agent. |
| traceId | traceId | Multiple | OTel trace ID linking spans across tiers. Always present alongside spanId. |
| spanId | spanId | Multiple | OTel span ID for the current operation. |
| service | service | Orchestrator | Originating service for forwarded logs (e.g., agent). Only present on logs forwarded through orchestrator stdout. |
These fields are in addition to the standard log fields: level, message, error, stack.
Tier identification: The service Loki label is the canonical “which tier produced this line?” answer (platform, orchestrator, agent, postgres-orch, etc.). For forwarded agent logs that appear inside the orchestrator’s stdout, the parsed JSON also carries an inner service: 'agent' field; query both with {service="orchestrator"} | json | service="agent" if you need to disambiguate.
Trace lifecycle
1. Webhook received (Platform): A requestId UUID is generated and associated with all log lines for this webhook. The webhook response body includes { "requestId": "..." } for client-side correlation.
2. Webhook relayed (Platform -> Orchestrator): The requestId is included in the webhook.relay WebSocket message. The orchestrator picks it up and continues the trace.
3. Triggers matched (Orchestrator): The orchestrator evaluates triggers against the lock file. All log lines during matching carry the requestId.
4. Job dispatched (Orchestrator): A runId UUID is generated for the workflow run. Both requestId and runId are included in the job.dispatch WebSocket message to agents.
5. Job executed (Agent): The agent prints a trace header once at job start: Run: <runId> | Trace: <requestId>. All subsequent log lines carry both IDs.
6. Check run updated (Orchestrator): GitHub Check Run summaries include both Trace and Run IDs, allowing operators to copy-paste into the Logs trace Grafana dashboard’s requestId variable.
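If you collect agent output outside a log store, the trace header from the job-execution step can be pulled apart with a regular expression. A minimal sketch, assuming the header appears exactly in the documented `Run: <runId> | Trace: <requestId>` form and that neither ID contains whitespace; the function name is hypothetical.

```python
import re

# Matches the header the agent prints once at job start.
TRACE_HEADER = re.compile(r"Run: (?P<run_id>\S+) \| Trace: (?P<request_id>\S+)")


def parse_trace_header(line: str):
    """Return (run_id, request_id) if the line holds a trace header, else None."""
    match = TRACE_HEADER.search(line)
    if match is None:
        return None
    return match["run_id"], match["request_id"]
```

The returned request ID is the value to paste into a requestId query; the run ID scopes the search to one workflow run.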
Log levels
- Info: milestone events (webhook received, webhook relayed, triggers matched, job dispatched, job started, job completed).
- Debug: internal operations (trigger matching details, lock file fetch, rule evaluation, matrix expansion, step start/end).
Querying your logs
KiCI services emit structured JSON logs to stdout. The trace fields in those logs are your join keys for correlating a single webhook or run across the orchestrator and agent. Ship the logs to whatever store you already run (Loki, Elasticsearch, CloudWatch, …) and query by these fields. The examples below use LogQL; adapt the label selectors to your own log shipper.
Tracing a webhook end-to-end
The webhook response includes the requestId. Query your log store for every line carrying it across all tiers:
{job="kici"} | json | requestId="<requestId>"
Sorted ascending, this returns Platform → orchestrator → agent log lines for the full lifecycle of a webhook event. (The {job="kici"} selector is illustrative; the label set depends on how you ship logs.)
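The same join works without a log store, for example on a file of captured stdout. The sketch below filters raw JSON log lines by requestId and keeps them in input order; the function name is hypothetical, and only the requestId field from the trace fields table is assumed to exist.

```python
import json


def lines_for_request(raw_lines, request_id):
    """Return parsed log records whose requestId matches, in input order."""
    matched = []
    for raw in raw_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stray startup output
        if isinstance(record, dict) and record.get("requestId") == request_id:
            matched.append(record)
    return matched
```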
Filtering by tier
Add the service label to scope to one tier:
{service="orchestrator"} | json | requestId="<requestId>"
Labels are filtered before the | json parser runs, so this is materially faster than parsing JSON for every line and then dropping the wrong tier.
Tracing a workflow run
Use the parsed runId field to see all jobs in a single workflow run:
{service="orchestrator"} | json | runId="<runId>"
Finding errors by tier
{service="agent"} |~ "\"level\":\"error\""
The regex match (|~) on the raw JSON line is the cheapest way to filter on level. Loki 3.x also auto-promotes level, but the line filter works directly on the raw line text, so no JSON parsing is required. Add | json afterwards if you need to filter on additional structured-metadata fields.
Agent log paths
Agent logs reach your log store through two paths:
- Scaler-managed agents: The orchestrator captures container stdout/stderr via the scaler’s log capture, parses each line, and re-emits it to its own stdout. Your log shipper reads it as part of the orchestrator’s journald / file stream and labels it service=orchestrator. The parsed JSON then carries an inner service: 'agent' field identifying the original source.
- WS-based agents: Stateful or external agents send agent.log messages over WebSocket. The orchestrator forwards these to stdout with service: 'agent' in the parsed JSON, following the same pattern. Native systemd agents (e.g., kici-stateful-agent.service) ship via journald directly with the service=agent label.
To find every forwarded agent log line regardless of which path it took:
{service="agent"} | json
Finding jobs by routing key
{service="orchestrator"} | json | routingKey="github:<app-id>" | jobId!=""
The trailing jobId!="" filter keeps only the lines emitted during job execution (where the orchestrator and agent both populate jobId), dropping the upstream Platform webhook-receipt lines that share the routing key.
Prometheus metrics
KiCI exposes Prometheus metrics from three services:
| Service | Mode | Endpoint | Metric prefix |
|---|---|---|---|
| Platform | Scraped by Prometheus | {base-path}/metrics (port 10142) | kici_ |
| Orchestrator | Scraped by Prometheus | /metrics (port 10143) | kici_orch_ |
| Agent | Scraped directly or pushed via WebSocket | /metrics (port 8080) + orchestrator /metrics | kici_agent_ |
Agent metrics push
Agents expose a local /metrics endpoint (default port 8080) for direct Prometheus scraping. In addition, they push metrics every ~30 seconds via the agent.metrics WebSocket message to the orchestrator. The orchestrator’s agent metrics aggregator collects these and exposes them on its own /metrics endpoint with an agent_id label distinguishing each agent’s contributions.
Metrics are retained for one scrape interval after an agent disconnects, then cleaned up automatically.
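The push-and-retain behaviour can be pictured with a toy aggregator. This is a sketch under stated assumptions, not the orchestrator's aggregator: the class, its method names, and the metric name used in the usage note are hypothetical, and time is passed in explicitly as seconds so the retention rule is easy to follow.

```python
class AgentMetricsAggregator:
    """Keeps the latest pushed metrics per agent, keyed by agent_id."""

    def __init__(self, scrape_interval_s: float):
        self.scrape_interval_s = scrape_interval_s
        self._latest = {}        # agent_id -> most recent pushed metrics
        self._disconnected = {}  # agent_id -> time of disconnect (seconds)

    def push(self, agent_id: str, metrics: dict) -> None:
        self._latest[agent_id] = dict(metrics)
        self._disconnected.pop(agent_id, None)  # a push means the agent is back

    def disconnect(self, agent_id: str, now: float) -> None:
        if agent_id in self._latest:
            self._disconnected[agent_id] = now

    def snapshot(self, now: float) -> dict:
        """Return {(metric_name, agent_id): value}, dropping agents that have
        been disconnected for longer than one scrape interval."""
        for agent_id, since in list(self._disconnected.items()):
            if now - since > self.scrape_interval_s:
                self._latest.pop(agent_id, None)
                del self._disconnected[agent_id]
        return {
            (name, agent_id): value
            for agent_id, metrics in self._latest.items()
            for name, value in metrics.items()
        }
```

For example, after `push("a1", {"kici_agent_example": 2})` and `disconnect("a1", now=10)` with a 15-second scrape interval, a snapshot at `now=20` still includes the agent, while a snapshot at `now=30` does not.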
Common issues
| Problem | Cause | Solution |
|---|---|---|
| Prometheus can’t reach orchestrator | Container networking: localhost inside the Prometheus container doesn’t reach the host | Use host.containers.internal:{port} in your scrape target |
| Agent metrics missing from orchestrator /metrics | Agent not connected or hasn’t pushed yet | Wait ~30s for the next push interval; check WS connection status |
| Prometheus target shows “down” | Service not running or wrong port/path | Verify the scrape target matches the actual service port/path |
Health endpoints
All three tiers expose health endpoints for monitoring:
Orchestrator
| Endpoint | Description |
|---|---|
| /health | Basic liveness check |
| /ready | Readiness check (database connected) |
| /metrics | Prometheus metrics (prefix: kici_orch_) |
| /cluster/health | Cluster health: status, role, term, leader, peers, agents |
| /cluster/peers | Per-peer details: instance ID, connection state, agents |
| /cluster/runs | Active execution runs with job routing summary |
Agent
| Endpoint | Description |
|---|---|
| /health | Basic liveness check |
| /ready | Readiness check (connected to orchestrator) |
| /metrics | Prometheus metrics (prefix: kici_agent_) |
Platform
| Endpoint | Description |
|---|---|
| /health | Basic liveness check |
| /ready | Readiness check (database connected) |
| /metrics | Prometheus metrics (prefix: kici_) |
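A simple external probe for the liveness and readiness endpoints might look like the sketch below. The response codes are not documented here, so treating HTTP 200 as "healthy" is an assumption; the function names are hypothetical, and the base URL in the usage note uses the orchestrator port from the Prometheus table.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with a non-2xx status
    except (urllib.error.URLError, OSError):
        return None      # connection refused, DNS failure, timeout


def probe(base_url: str, fetch=http_status) -> dict:
    """Check /health and /ready; assumes HTTP 200 means the check passed."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in ("/health", "/ready")}
```

Calling `probe("http://localhost:10143")` would check a local orchestrator; the `fetch` parameter lets you substitute a stub in tests.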
Event delivery DLQ runbook
The orchestrator’s event router is at-least-once: events that fail to dispatch are retried with exponential backoff (5 attempts by default, full jitter, capped at 5 min). When all attempts are exhausted the event lands in the DLQ (kici_events.dlq_at IS NOT NULL) and is surfaced via:
- Prometheus: kici_orch_event_dlq_depth (gauge), kici_orch_event_dlq_total (counter), kici_orch_event_lease_expirations_total (counter; node crash signal).
- Logs: {service="orchestrator"} | json | message="Event moved to DLQ". Every DLQ admission is logged with the event id, name, and last error.
- CLI: kici-admin event-dlq list / count / retry / discard.
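To make the retry budget concrete, here is a minimal in-process sketch of exponential backoff with full jitter. It only illustrates the technique: the real router persists attempts in the database, the 1-second base delay is an assumption (the source states only the attempt count and the 5-minute cap), and the function names are hypothetical.

```python
import random

MAX_ATTEMPTS = 5   # default from the text
CAP_S = 300.0      # 5 min cap from the text
BASE_S = 1.0       # assumed base delay; not stated in the source


def backoff_delay(attempt: int, base_s: float = BASE_S, cap_s: float = CAP_S) -> float:
    """Full jitter: pick uniformly between 0 and the capped exponential ceiling."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return random.uniform(0.0, ceiling)


def dispatch_with_retry(dispatch, sleep, max_attempts: int = MAX_ATTEMPTS):
    """Call dispatch() up to max_attempts times.

    Returns (True, None) on success, or (False, last_error) once the budget
    is exhausted, which is the point where an event would move to the DLQ.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            dispatch()
            return True, None
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))  # no sleep after the final attempt
    return False, last_error
```

Full jitter spreads retries from many failing events across the whole delay window, which avoids synchronized retry bursts against a recovering backend.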
Triage steps
1. Confirm the alert. Check the kici_orch_event_dlq_depth gauge (e.g. on a Grafana dashboard if you’ve imported KiCI’s, or via your own Prometheus). If the ingress rate is 0 and only depth > 0, the events are old: not an urgent pager-class issue, but still triage them so the depth doesn’t accumulate forever (DLQ rows are NOT cleaned up by TTL, by design).
2. Inspect the events. kici-admin event-dlq list --limit 20 prints the recent DLQ rows with eventName, dlqReason, attempts, lastError, and the source routing key. The lastError is the truncated message from the final failing dispatch, usually enough to identify the offending workflow.
3. Cross-check your logs. For more context on an offending dispatch: {service="orchestrator"} | json | eventId="<id>" returns every line tagged with the event id, including the retry sequence and the original handler exception.
4. Fix the root cause. A single event in the DLQ usually means a workflow handler is consistently failing, not a transient backend blip (transients are absorbed by the retry budget). Find the workflow in the run list, fix the handler, redeploy.
5. Decide retry vs discard.
   - Retry once a fix is deployed: kici-admin event-dlq retry <eventId> clears the DLQ flag, resets attempts, and re-publishes pg_notify so a healthy orchestrator picks it up immediately.
   - Discard if the event is no longer relevant (e.g. the workflow that should have processed it was deleted): kici-admin event-dlq discard <eventId>.
6. Watch the depth recover. The kici_orch_event_dlq_depth gauge should drop to 0 within a minute of the last retry. If new events keep landing in the DLQ after the fix, the fix is incomplete; go back to step 4.
When kici_orch_event_lease_expirations_total is climbing
That counter increments every time an orchestrator’s dispatch lease ages out without the holder finalising it, which is the canonical signal that an orchestrator process crashed mid-dispatch. Healthy clusters keep this counter flat.
- A small handful per day across a large fleet: probably a node restart for routine maintenance (rolling deploy, OOM killer). Note and move on.
- Steady non-zero rate: investigate the orchestrator instance whose claimed_by value appears in the expired-lease log lines; its process is dying or stuck. Check Loki for crash traces, OOM kills, or container restart events.
The leader-only retry scanner releases expired leases automatically; events held by a crashed node are re-dispatched within leaseDurationMs + retryScanIntervalMs (default 60 s + 10 s = 70 s worst case).
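The worst-case bound follows from two delays stacking: the lease must first expire, and the scanner may have just finished a pass when it does. A small sketch of the arithmetic, using the defaults quoted above; the function name is hypothetical.

```python
LEASE_DURATION_MS = 60_000       # default leaseDurationMs from the text
RETRY_SCAN_INTERVAL_MS = 10_000  # default retryScanIntervalMs from the text


def worst_case_redispatch_ms(lease_duration_ms: int = LEASE_DURATION_MS,
                             retry_scan_interval_ms: int = RETRY_SCAN_INTERVAL_MS) -> int:
    """Upper bound on the time before an event held by a crashed node is retried.

    Worst case: the node crashes right after taking the lease (full lease
    duration remains) and the lease expires just after a scan pass (a full
    scan interval passes before the next one).
    """
    return lease_duration_ms + retry_scan_interval_ms


print(worst_case_redispatch_ms() / 1000)  # 70.0
```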
See also
- Architecture: data flows (trace propagation across tiers)
- Agent configuration (agent deployment settings)
- Orchestrator configuration (orchestrator deployment settings)