Monitoring & tracing

Execution metrics show a 24-hour snapshot of your CI pipeline health:

  • Total runs in the last 24h.
  • Success rate across those runs.
  • Average duration of completed runs.
  • Active jobs — queued and running combined.

Use these to spot trends in failure rates or unusual queue buildup.

The infrastructure section shows a hierarchical tree of orchestrators, their scaler pools, and connected agents.

Each orchestrator row displays three identifiers plus host metadata:

  • orchestrator: (bold) — the instance ID set via KICI_CLUSTER_INSTANCE_ID, or the first 8 chars of the connection ID if none is set.
  • conn: (dimmed) — first 8 chars of the Platform WebSocket connection ID, shown only when an explicit instance ID is present.
  • host: (right side) — system hostname.
  • OS metadata: CPU, memory, uptime, plus connection status.
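
The identifier rules above can be sketched as a small function. This is only an illustration of the display logic as described; the function name and the returned dictionary shape are hypothetical, not the dashboard's actual code.

```python
from typing import Optional


def orchestrator_row_ids(connection_id: str, instance_id: Optional[str] = None) -> dict:
    """Return the identifiers an orchestrator row would display (hypothetical helper)."""
    short_conn = connection_id[:8]  # first 8 chars of the Platform WebSocket connection ID
    if instance_id:
        # Explicit KICI_CLUSTER_INSTANCE_ID: it is the bold label, and the
        # short connection ID is shown dimmed next to it.
        return {"orchestrator": instance_id, "conn": short_conn}
    # No instance ID: the short connection ID becomes the bold label and the
    # separate conn field is hidden.
    return {"orchestrator": short_conn, "conn": None}
```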

Click or hover the info icon for registration details: webhook sources (parsed from routing keys), deployment mode (platform / hybrid / independent), configured scaler backends, S3 log access status, connection timestamp, Node.js version, and OS information.

Below each orchestrator:

  • Scaler rows: pool type, active/max agent counts, labels, and safe configuration details via a popover. Scaler rows whose backend spawns on the orchestrator’s own host — bare-metal, Firecracker, and container pools using a local runtime socket — also show that spawning host as a kici:host:<hostname> badge directly on the row, even while idle. Pools that provision elsewhere (a container pool pointed at a remote runtime) omit it.
  • Agent rows: ID, platform, host, memory usage (color-coded badges), uptime, and labels — including auto-generated kici: labels such as the kici:host:<hostname> routing label — shown as badges below each agent.
  • Stateful agents section: standalone agents not managed by any scaler. Expand the section to see each standalone agent with its own kici:host:<hostname> label.

Every label badge is clickable — click one to copy its value to the clipboard.

Why this count can differ from billing. The badge in the section header counts every orchestrator node in your topology — coordinator + each peer in a Raft cluster + each standalone — because the goal here is operational visibility. The billing page counts orchestrator connections: only coordinators and standalones open a direct Platform WebSocket; peers gossip through their coordinator. A 4-connection org running two 3-node clusters and two standalones shows 8 nodes here and 4 connections on billing — both correct, answering different questions.
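
The arithmetic behind the two counts can be checked with a short sketch. The function name and its inputs are made up for illustration; it only encodes the rule stated above (every node counts here, while only coordinators and standalones open a Platform connection).

```python
def count_nodes_and_connections(cluster_sizes: list, standalones: int) -> tuple:
    """Return (nodes shown in the infrastructure header, connections shown on billing)."""
    # Every node is counted for operational visibility: coordinator + peers + standalones.
    nodes = sum(cluster_sizes) + standalones
    # Only one coordinator per cluster, plus each standalone, opens a Platform WebSocket.
    connections = len(cluster_sizes) + standalones
    return nodes, connections


# Two 3-node clusters and two standalones, as in the example above.
print(count_nodes_and_connections([3, 3], 2))  # (8, 4)
```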

Each orchestrator node and each persistent agent shows its running version next to the latest published version.

When a node runs a version older than the latest published one, an update available badge appears. Opening it reveals a two-line copy-paste command to run on that host:

  • npm install -g @kici-dev/<package>@<latest>
  • kici-admin <component> upgrade --version <latest> --yes

The dashboard never mutates anything — it only reveals the command for you to run. Ephemeral (scaler-spawned) agents are excluded, since the scaler recreates them with the current version.

The OS user badge shows which user the orchestrator process is running as (username and UID).

Root is expected for Firecracker scalers (which require root for VM management). For bare-metal scalers, a non-privileged user is safer — running as root there usually indicates a misconfiguration.

Badge color codes:

  • Red: root + bare-metal scaler.
  • Green: root + Firecracker scaler.
  • Neutral: any non-root user.
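
As a sketch, the color rules above reduce to a small lookup. The function name and the scaler-type strings are assumptions; the text does not say what is shown for root combined with other scaler types, so this sketch returns None in that case rather than guessing.

```python
from typing import Optional


def user_badge_color(is_root: bool, scaler_type: str) -> Optional[str]:
    """Map (user, scaler type) to a badge color; names are illustrative only."""
    if not is_root:
        return "neutral"  # any non-root user
    if scaler_type == "bare-metal":
        return "red"  # root on bare metal usually indicates a misconfiguration
    if scaler_type == "firecracker":
        return "green"  # root is expected for VM management
    return None  # combination not covered by the rules above
```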

The secret backends section shows the health status of all registered secret backends (PostgreSQL and Vault).

Each card displays:

  • Backend name and type.
  • Health status and last error.
  • Sync interval and scope count.

Action buttons:

  • Sync — trigger an immediate scope refresh.
  • Test — verify connectivity to the backend.

Orphaned connections are WebSocket connection records in the database that no longer have a corresponding live socket on any Platform instance. They typically occur after ungraceful Platform restarts or network partitions.

The orphan sweeper automatically cleans up connections older than 6 minutes (2× the maximum expected heartbeat interval).
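
The threshold arithmetic can be sketched as follows. The 3-minute maximum heartbeat interval is inferred from "6 minutes = 2× the interval", and measuring age from the last heartbeat is an assumption; the real sweeper may track age differently.

```python
from datetime import datetime, timedelta, timezone

# Inferred from the text: 6 min = 2 x maximum expected heartbeat interval.
MAX_HEARTBEAT_INTERVAL = timedelta(minutes=3)
ORPHAN_THRESHOLD = 2 * MAX_HEARTBEAT_INTERVAL


def is_sweepable(last_heartbeat: datetime, now: datetime) -> bool:
    """True when a connection record is older than the sweep threshold (assumed rule)."""
    return now - last_heartbeat > ORPHAN_THRESHOLD


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_sweepable(now - timedelta(minutes=7), now))  # True
print(is_sweepable(now - timedelta(minutes=5), now))  # False
```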

A high orphan count may indicate Platform instance instability — investigate before the sweeper masks the problem.

The Raft role badge on each orchestrator row is explained inline (hover the badge). Full design — election parameters, dormant single-orchestrator mode, state persistence — lives in Multi-orchestrator architecture § Raft consensus.

KiCI provides distributed tracing across all three tiers (Platform, orchestrator, agent) using structured log fields. Every webhook event is assigned a unique trace ID at ingestion, which propagates through the entire pipeline and appears in all related log lines.

All KiCI services emit structured JSON logs. Grafana Alloy (running on every host that produces KiCI logs) parses each line and pushes it to Loki with low-cardinality labels (env, host, service, instance) plus per-line structured metadata carrying the trace fields below. Inside LogQL queries the fields are addressed via | json | <field>="...".

| Loki field | Raw JSON field | Generated at | Description |
|------------|----------------|--------------|-------------|
| requestId | requestId | Platform | UUID assigned when a webhook is received. Traces the event through all three tiers. |
| runId | runId | Orchestrator | UUID assigned when a workflow run is dispatched. Groups all jobs in a single run. |
| routingKey | routingKey | Platform | Webhook routing key (e.g., github:42). Present on webhook-related log lines. |
| jobId | jobId | Orchestrator | Job identifier. Present during job execution on orchestrator and agent. |
| traceId | traceId | Multiple | OTel trace ID linking spans across tiers. Always present alongside spanId. |
| spanId | spanId | Multiple | OTel span ID for the current operation. |
| service | service | Orchestrator | Originating service for forwarded logs (e.g., agent). Only present on logs forwarded through orchestrator stdout. |

These fields are in addition to the standard log fields: level, message, error, stack.

Tier identification: The service Loki label is the canonical “which tier produced this line?” answer (platform, orchestrator, agent, postgres-orch, etc.). For forwarded agent logs that appear inside the orchestrator’s stdout, the parsed JSON also carries an inner service: 'agent' field — query both with {service="orchestrator"} | json | service="agent" if you need to disambiguate.

  1. Webhook received (Platform): A requestId UUID is generated and associated with all log lines for this webhook. The webhook response body includes { "requestId": "..." } for client-side correlation.

  2. Webhook relayed (Platform -> Orchestrator): The requestId is included in the webhook.relay WebSocket message. The orchestrator picks it up and continues the trace.

  3. Triggers matched (Orchestrator): The orchestrator evaluates triggers against the lock file. All log lines during matching carry the requestId.

  4. Job dispatched (Orchestrator): A runId UUID is generated for the workflow run. Both requestId and runId are included in the job.dispatch WebSocket message to agents.

  5. Job executed (Agent): The agent prints a trace header once at job start: Run: <runId> | Trace: <requestId>. All subsequent log lines carry both IDs.

  6. Check run updated (Orchestrator): GitHub Check Run summaries include both Trace and Run IDs, allowing operators to copy-paste into the Logs trace Grafana dashboard’s requestId variable.
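
The ID hand-off in the steps above can be sketched like this. Only the field names requestId and runId, the job.dispatch message name, and the trace header format come from the text; the message shape and helper names are hypothetical.

```python
import uuid


def new_id() -> str:
    """Generate a UUID string, as used for requestId (Platform) and runId (orchestrator)."""
    return str(uuid.uuid4())


def build_job_dispatch(request_id: str, run_id: str, job_id: str) -> dict:
    """Hypothetical job.dispatch payload carrying both trace IDs to the agent."""
    return {"type": "job.dispatch", "requestId": request_id, "runId": run_id, "jobId": job_id}


def trace_header(message: dict) -> str:
    """Format the header the agent prints once at job start."""
    return f"Run: {message['runId']} | Trace: {message['requestId']}"


request_id = new_id()  # step 1: assigned when the webhook is received
run_id = new_id()      # step 4: assigned when the workflow run is dispatched
print(trace_header(build_job_dispatch(request_id, run_id, "build")))
```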

Trace events are logged at two levels:

  • Info: milestone events (webhook received, webhook relayed, triggers matched, job dispatched, job started, job completed).
  • Debug: internal operations (trigger matching details, lock file fetch, rule evaluation, matrix expansion, step start/end).

KiCI services emit structured JSON logs to stdout. The trace fields in those logs are your join keys for correlating a single webhook or run across the orchestrator and agent — ship the logs to whatever store you already run (Loki, Elasticsearch, CloudWatch, …) and query by these fields. The examples below use LogQL; adapt the label selectors to your own log shipper.

The webhook response includes the requestId. Query your log store for every line carrying it across all tiers:

{job="kici"} | json | requestId="<requestId>"

Sorted ascending, this returns Platform → orchestrator → agent log lines for the full lifecycle of a webhook event. (The {job="kici"} selector is illustrative — the label set depends on how you ship logs.)
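
If you only have raw log files rather than a log store, the same correlation can be approximated locally. The sketch below filters JSON log lines on the requestId field; the helper name is hypothetical, and it keeps input order rather than sorting, since the timestamp field name is not specified here.

```python
import json


def lines_for_request(lines, request_id: str) -> list:
    """Return parsed log records whose requestId matches, in input order."""
    matched = []
    for raw in lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stack-trace fragments
        if isinstance(record, dict) and record.get("requestId") == request_id:
            matched.append(record)
    return matched


sample = [
    '{"level":"info","message":"webhook received","requestId":"abc"}',
    "not json",
    '{"level":"info","message":"job dispatched","requestId":"abc","runId":"r1"}',
    '{"level":"info","message":"webhook received","requestId":"other"}',
]
print([r["message"] for r in lines_for_request(sample, "abc")])
# ['webhook received', 'job dispatched']
```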

Add the service label to scope to one tier:

{service="orchestrator"} | json | requestId="<requestId>"

Labels are filtered before the | json parser runs, so this is materially faster than parsing JSON for every line and then dropping the wrong tier.

Use the parsed runId field to see all jobs in a single workflow run:

{service="orchestrator"} | json | runId="<runId>"

To find error-level lines from agents:

{service="agent"} |~ "\"level\":\"error\""

The regex match (|~) on the raw JSON line is a cheap way to filter on level: it is a line filter applied to the raw log text before any parser stage, so no JSON parsing is required. (Loki 3.x also auto-promotes level, but this filter does not depend on that.) Add | json afterwards if you need to filter on additional structured-metadata fields.

Agent logs reach your log store through two paths:

  • Scaler-managed agents: The orchestrator captures container stdout/stderr via the scaler’s log capture, parses each line, and re-emits it to its own stdout. Your log shipper reads it as part of the orchestrator’s journald / file stream and labels it service=orchestrator. The parsed JSON then carries an inner service: 'agent' field identifying the original source.
  • WS-based agents: Stateful or external agents send agent.log messages over WebSocket. The orchestrator forwards these to stdout with service: 'agent' in the parsed JSON, following the same pattern. Native systemd agents (e.g., kici-stateful-agent.service) ship via journald directly with the service=agent label.

To find every forwarded agent log line regardless of which path it took:

{service="agent"} | json

To scope to job-execution lines for a single routing key:

{service="orchestrator"} | json | routingKey="github:<app-id>" | jobId!=""

The trailing jobId!="" filter keeps only the lines emitted during job execution (where the orchestrator and agent both populate jobId), dropping the upstream Platform webhook-receipt lines that share the routing key.

KiCI exposes Prometheus metrics from three services:

| Service | Mode | Endpoint | Metric prefix |
|---------|------|----------|---------------|
| Platform | Scraped by Prometheus | {base-path}/metrics (port 10142) | kici_ |
| Orchestrator | Scraped by Prometheus | /metrics (port 10143) | kici_orch_ |
| Agent | Scraped directly or pushed via WebSocket | /metrics (port 8080) + orchestrator /metrics | kici_agent_ |

Agents expose a local /metrics endpoint (default port 8080) for direct Prometheus scraping. In addition, they push metrics every ~30 seconds via the agent.metrics WebSocket message to the orchestrator. The orchestrator’s agent metrics aggregator collects these and exposes them on its own /metrics endpoint with an agent_id label distinguishing each agent’s contributions.

Metrics are retained for one scrape interval after an agent disconnects, then cleaned up automatically.
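
The aggregation and retention behavior described above can be sketched as follows. This is a toy model under stated assumptions, not the orchestrator's implementation: the class, its methods, and the metric name in the usage example are hypothetical, and only the agent_id label and the "one scrape interval after disconnect" rule come from the text.

```python
class AgentMetricsAggregator:
    """Toy aggregator: keeps the latest pushed sample per agent."""

    def __init__(self, scrape_interval_s: float):
        self.scrape_interval_s = scrape_interval_s
        self._samples = {}          # agent_id -> {metric_name: value}
        self._disconnected_at = {}  # agent_id -> time of disconnect (seconds)

    def push(self, agent_id: str, metrics: dict) -> None:
        """Record an agent.metrics push; a push also cancels pending cleanup."""
        self._samples[agent_id] = dict(metrics)
        self._disconnected_at.pop(agent_id, None)

    def disconnect(self, agent_id: str, now: float) -> None:
        if agent_id in self._samples:
            self._disconnected_at[agent_id] = now

    def render(self, now: float) -> str:
        """Render Prometheus-style lines, dropping agents gone for over one interval."""
        for agent_id, gone_at in list(self._disconnected_at.items()):
            if now - gone_at > self.scrape_interval_s:
                del self._samples[agent_id]
                del self._disconnected_at[agent_id]
        lines = []
        for agent_id in sorted(self._samples):
            for name, value in sorted(self._samples[agent_id].items()):
                lines.append(f'{name}{{agent_id="{agent_id}"}} {value}')
        return "\n".join(lines)
```

For example, after `agg.push("a1", {"kici_agent_jobs_running": 2})` (a made-up metric name), `agg.render(now=0)` yields `kici_agent_jobs_running{agent_id="a1"} 2`.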

| Problem | Cause | Solution |
|---------|-------|----------|
| Prometheus can’t reach orchestrator | Container networking: localhost inside the Prometheus container doesn’t reach the host | Use host.containers.internal:{port} in your scrape target |
| Agent metrics missing from orchestrator /metrics | Agent not connected or hasn’t pushed yet | Wait ~30s for the next push interval; check WS connection status |
| Prometheus target shows “down” | Service not running or wrong port/path | Verify the scrape target matches the actual service port/path |

All three tiers expose health endpoints for monitoring:

Orchestrator:

| Endpoint | Description |
|----------|-------------|
| /health | Basic liveness check |
| /ready | Readiness check (database connected) |
| /metrics | Prometheus metrics (prefix: kici_orch_) |
| /cluster/health | Cluster health: status, role, term, leader, peers, agents |
| /cluster/peers | Per-peer details: instance ID, connection state, agents |
| /cluster/runs | Active execution runs with job routing summary |

Agent:

| Endpoint | Description |
|----------|-------------|
| /health | Basic liveness check |
| /ready | Readiness check (connected to orchestrator) |
| /metrics | Prometheus metrics (prefix: kici_agent_) |

Platform:

| Endpoint | Description |
|----------|-------------|
| /health | Basic liveness check |
| /ready | Readiness check (database connected) |
| /metrics | Prometheus metrics (prefix: kici_) |

The orchestrator’s event router is at-least-once: events that fail to dispatch are retried with exponential backoff and full jitter (5 attempts by default, delay capped at 5 min). When all attempts are exhausted the event lands in the DLQ (kici_events.dlq_at IS NOT NULL) and is surfaced via:

  • Prometheus: kici_orch_event_dlq_depth (gauge), kici_orch_event_dlq_total (counter), kici_orch_event_lease_expirations_total (counter — node crash signal).
  • Logs: {service="orchestrator"} | json | message="Event moved to DLQ" — every DLQ admission is logged with the event id, name, and last error.
  • CLI: kici-admin event-dlq list / count / retry / discard.
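
The retry policy described at the start of this section (exponential backoff with full jitter, capped at 5 minutes, 5 attempts by default) can be sketched as below. The 1-second base delay and the choice of one delay per attempt are assumptions for illustration; the actual base and scheduling are not stated here.

```python
import random

MAX_ATTEMPTS = 5  # default attempt budget from the text
CAP_S = 300.0     # 5-minute cap from the text
BASE_S = 1.0      # assumption: base delay is not documented here


def retry_delay_s(attempt: int, rng=random) -> float:
    """Full jitter: pick uniformly between 0 and the capped exponential ceiling."""
    ceiling = min(CAP_S, BASE_S * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_schedule(rng=random) -> list:
    """One jittered delay per attempt (assumed); after the last, the event goes to the DLQ."""
    return [retry_delay_s(attempt, rng) for attempt in range(MAX_ATTEMPTS)]


print(retry_schedule(random.Random(0)))
```

Full jitter spreads retries across the whole window, which avoids many failed events retrying at the same instant.
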

When the DLQ depth is non-zero, triage as follows:

  1. Confirm the alert. Check the kici_orch_event_dlq_depth gauge (e.g. on a Grafana dashboard if you’ve imported KiCI’s, or via your own Prometheus). If the ingress rate (growth of kici_orch_event_dlq_total) is 0 and only depth > 0, the events are old: not an urgent pager-class issue, but still triage them so the depth doesn’t accumulate forever (DLQ rows are NOT cleaned up by TTL, by design).
  2. Inspect the events. kici-admin event-dlq list --limit 20 prints the recent DLQ rows with eventName, dlqReason, attempts, lastError, and the source routing key. The lastError is the truncated message from the final failing dispatch — usually enough to identify the offending workflow.
  3. Cross-check your logs. For more context on an offending dispatch: {service="orchestrator"} | json | eventId="<id>" returns every line tagged with the event id, including the retry sequence and the original handler exception.
  4. Fix the root cause. A single event in the DLQ usually means a workflow handler is consistently failing — not a transient backend blip (transients are absorbed by the retry budget). Find the workflow in the run list, fix the handler, redeploy.
  5. Decide retry vs discard.
    • Retry once a fix is deployed: kici-admin event-dlq retry <eventId> — clears the DLQ flag, resets attempts, and re-publishes pg_notify so a healthy orchestrator picks it up immediately.
    • Discard if the event is no longer relevant (e.g. the workflow that should have processed it was deleted): kici-admin event-dlq discard <eventId>.
  6. Watch the depth recover. The kici_orch_event_dlq_depth gauge should drop to 0 within a minute of the last retry. If new events keep landing in the DLQ after the fix, the fix is incomplete — go back to step 4.

When kici_orch_event_lease_expirations_total is climbing

That counter increments every time an orchestrator’s dispatch lease ages out without the holder finalising it — the canonical signal that an orchestrator process crashed mid-dispatch. Healthy clusters keep this counter flat.

  • A small handful per day across a large fleet: probably routine node restarts (a rolling deploy) or an occasional OOM kill. Note and move on.
  • Steady non-zero rate: investigate the orchestrator instance whose claimed_by value appears in the expired-lease log lines — its process is dying or stuck. Check Loki for crash traces, OOM kills, or container restart events.

The leader-only retry scanner releases expired leases automatically; events held by a crashed node are re-dispatched within leaseDurationMs + retryScanIntervalMs (default 60 s + 10 s = 70 s worst case).
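
The worst-case bound is simple addition: a lease must first expire, and then the scanner must come around to notice it. The sketch below uses the default values quoted above; the function names and the expiry comparison are illustrative assumptions.

```python
LEASE_DURATION_MS = 60_000       # default leaseDurationMs
RETRY_SCAN_INTERVAL_MS = 10_000  # default retryScanIntervalMs


def worst_case_redispatch_ms(lease_ms: int = LEASE_DURATION_MS,
                             scan_ms: int = RETRY_SCAN_INTERVAL_MS) -> int:
    """Upper bound on the delay before an event held by a crashed node is re-dispatched."""
    return lease_ms + scan_ms


def lease_expired(claimed_at_ms: int, now_ms: int,
                  lease_ms: int = LEASE_DURATION_MS) -> bool:
    """Assumed expiry rule: the lease has aged past its duration without being finalised."""
    return now_ms - claimed_at_ms >= lease_ms


print(worst_case_redispatch_ms() / 1000)  # 70.0
```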