Stale run detection and failure marking
KiCI automatically detects and marks stale CI runs — jobs that stop responding because an agent died, a network partition occurred, or a container was killed. This prevents runs from remaining stuck in an “in progress” state indefinitely.
Stale detection is always on by default with sensible defaults. No opt-in or configuration is required for basic operation.
How it works
Section titled “How it works”KiCI uses a two-tier detection model:
-
Orchestrator tier — Detects stale agents. Agents send periodic heartbeats for each running job. If no heartbeat arrives within the stale threshold, the job is marked as
timed_out_stale. The detector also catches dispatched jobs that were never acknowledged by an agent (dispatch sent but no heartbeat ever received). -
Platform tier — Detects stale orchestrators. If an orchestrator disconnects from the Platform relay and does not reconnect within 5 minutes (default), all active runs for that orchestrator’s routing keys are marked as
timed_out_stale. On Platform startup, an immediate orphan scan marks any running runs with no active orchestrator connection as stale without waiting for the threshold.
Detection flow
Section titled “Detection flow”Agent --heartbeat (60s)--> Orchestrator --scan (60s)--> DBPlatform --WS monitor--> disconnect tracking --scan (60s)--> DBAgents send a job.heartbeat message every 60 seconds (configurable) for each running job. The orchestrator persists the heartbeat timestamp in execution_jobs.last_heartbeat_at. The stale detector scans the DB periodically and marks jobs whose last heartbeat is older than the stale threshold.
Configuration
Section titled “Configuration”All stale detection parameters are configured via environment variables on the orchestrator.
| Variable | Default | Description |
|---|---|---|
KICI_STALE_DETECTOR_SCAN_INTERVAL_MS | 60000 | How often the detector scans for stale runs (milliseconds) |
KICI_STALE_DETECTOR_THRESHOLD_MULTIPLIER | 2 | Multiplier applied to heartbeat interval for the stale threshold |
KICI_JOB_HEARTBEAT_INTERVAL_MS | 60000 | How often agents send per-job heartbeats (milliseconds) |
Derived stale threshold
Section titled “Derived stale threshold”The stale threshold is computed as:
stale threshold = KICI_JOB_HEARTBEAT_INTERVAL_MS x KICI_STALE_DETECTOR_THRESHOLD_MULTIPLIERWith defaults: 60,000ms x 2 = 120,000ms (2 minutes).
This means a job is considered stale if no heartbeat has been received for 2 minutes.
Platform tier behaviour (informational)
Section titled “Platform tier behaviour (informational)”The Platform tier also performs stale-orchestrator detection: when an orchestrator disconnects from the Platform relay and does not reconnect within ~5 minutes, runs whose routing keys belong to that orchestrator are marked timed_out_stale. Operators of self-deployed orchestrators do not configure the Platform tier — its thresholds are managed by the SaaS Platform. If you operate the Platform itself (internal docs only), see internal Platform docs.
Tuning guidance
Section titled “Tuning guidance”Faster detection
Section titled “Faster detection”Lower the heartbeat interval for faster stale detection at the cost of more DB writes:
KICI_JOB_HEARTBEAT_INTERVAL_MS=30000 # 30s heartbeatKICI_STALE_DETECTOR_THRESHOLD_MULTIPLIER=2 # 60s detectionKICI_STALE_DETECTOR_SCAN_INTERVAL_MS=30000 # 30s scanThis configuration detects stale jobs within approximately 60 seconds.
More tolerant detection
Section titled “More tolerant detection”Increase the multiplier if you experience false positives due to network blips or high-latency connections:
KICI_STALE_DETECTOR_THRESHOLD_MULTIPLIER=3 # 180s detection with default 60s heartbeatDefault recommendation
Section titled “Default recommendation”The default configuration (2 minutes detection time) is suitable for most CI workloads. It provides a good balance between detection speed and tolerance for temporary network issues.
Metrics
Section titled “Metrics”KiCI exposes the following Prometheus metrics for stale run monitoring:
| Metric | Type | Description |
|---|---|---|
kici_orch_stale_runs_detected_total | Counter | Total number of stale runs detected and marked as failed |
kici_orch_stale_detection_duration_seconds | Histogram | Time between job becoming stale and detection (seconds) |
kici_orch_stale_runs_current | Gauge | Current number of stale runs detected in last scan |
kici_platform_stale_orchestrators_total | Counter | Total orchestrators detected as permanently disconnected |
kici_platform_stale_runs_failed_total | Counter | Total execution runs failed due to orchestrator staleness |
Example alerts
Section titled “Example alerts”# Alert when stale jobs are detected- alert: KiCIStaleJobsDetected expr: rate(kici_orch_stale_runs_detected_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: 'KiCI is detecting stale jobs' description: 'Stale job detection rate: {{ $value }}/s'Failure behavior
Section titled “Failure behavior”GitHub check runs
Section titled “GitHub check runs”When a job becomes stale, KiCI updates its GitHub check run with the timed_out conclusion. This appears in the GitHub UI as a timeout indicator, distinct from a regular failure.
Database status
Section titled “Database status”Stale jobs are marked with the timed_out_stale status in the execution_jobs table. This is a distinct value from failed, allowing you to query specifically for stale timeouts:
SELECT * FROM execution_jobs WHERE status = 'timed_out_stale';When all jobs in a run reach a terminal state (including timed_out_stale), the execution_runs row is also marked as completed with status = 'failed' (since a stale timeout is treated as a failure for overall run status).
Held run expiry
Section titled “Held run expiry”The stale detector also handles expiring overdue held runs (runs awaiting environment approval that exceed their hold expiry timeout). When a HeldRunStore is configured, each scan calls expireOverdue() to transition expired pending holds to expired status, cancelling the associated jobs. See Environments for details on hold expiry configuration.
Orphaned recovery jobs
Section titled “Orphaned recovery jobs”On startup, the stale detector cleans up jobs stuck in recovering state from a previous orchestrator instance. Recovery timers are in-memory and lost on restart, so any jobs still in recovering state are immediately failed with a descriptive error message.
Crash recovery
Section titled “Crash recovery”If the orchestrator itself crashes and restarts, the stale detector runs an immediate scan on startup to find any jobs that became stale during the downtime. This ensures no orphaned running jobs survive a restart.
The completion logic includes a DB-fallback path that works even when in-memory state is empty (post-restart), so execution_runs.status is correctly updated even after a crash.
Queue timeout expiry
Section titled “Queue timeout expiry”Jobs waiting in the dispatch queue (no matching agent available) are automatically expired after a configurable timeout. When a job expires:
- The orchestrator marks it as
timed_out_stalein both the localexecution_jobstable and the Platform execution projection - The run is checked for completion (if all jobs are terminal, the run is marked
failed) - An
orchestrator.job.queue_expiredinfrastructure event is emitted for dashboard visibility
Configuration
Section titled “Configuration”The queue timeout is configurable via the admin CLI (persisted in the cluster DB) or environment variable:
# Via admin CLI (recommended -- persisted, survives restarts, shared across cluster)kici-admin config set queue.timeoutMs 7200000 # 2 hourskici-admin config set queue.timeoutMs 0 # indefinite (no expiry)kici-admin config reload
# Via environment variable (override, takes precedence over DB config)KICI_QUEUE_TIMEOUT_MS=7200000| Setting | Default | Description |
|---|---|---|
queue.timeoutMs | 3600000 (1 hour) | How long a job can wait in the dispatch queue before expiring. 0 = indefinite. |
Platform safety-net GC
Section titled “Platform safety-net GC”The Platform tier runs a safety-net garbage collector as part of its periodic stale scan (every 60 seconds). This catches jobs that the orchestrator failed to expire — e.g., if the orchestrator crashed during its cleanup cycle.
The safety net uses 2x the orchestrator’s reported queue timeout as its threshold. The orchestrator reports its queue.timeoutMs value to the Platform via the source.register protocol message on every connection. For orchestrators that don’t report a timeout (older versions), the Platform falls back to a 2-hour threshold.
The 2x margin ensures the orchestrator-side expiry always takes priority. The Platform safety net only fires if the orchestrator genuinely failed to clean up.
Late updates from the orchestrator will overwrite the Platform-side expiry status (the Platform uses an upsert on job.status.forward messages), so there is no race condition risk.
Stale check run cleanup
Section titled “Stale check run cleanup”When an orchestrator dies permanently, the Platform tier marks the associated execution_runs as timed_out_stale in its database. It also collects cleanup metadata (repository, SHA, workflow name, job names) for each affected run and queues it in memory, keyed by routing key.
When a replacement orchestrator reconnects and sends source.register for the same routing keys, the Platform forwards the pending cleanup entries via the stale.checkrun.cleanup protocol message. The orchestrator then discovers the stuck GitHub check runs via checks.listForRef and updates them to a timed_out conclusion, so they no longer appear as “in_progress” on GitHub.
If no replacement orchestrator reconnects (e.g., the customer decommissions the instance), GitHub will eventually mark these check runs as stale on its own (after several hours), or the user can re-push to trigger a fresh CI run.