Stale run detection architecture
This document describes the architecture of KiCI’s stale run detection system, which automatically detects and marks CI jobs that have stopped responding.
Two-tier model
KiCI uses a two-tier detection model that covers both agent-level and orchestrator-level failures:

    Tier 1 (Orchestrator):
      Agent --job.heartbeat (60s)--> Orchestrator WS handler
          |
          v
      execution_jobs.last_heartbeat_at
          |
      StaleRunDetector --scan (60s)--> DB
          |
          v
      Mark timed_out_stale + GitHub check run + force-terminate

    Tier 2 (Upstream):
      Orchestrator --WS connection--> KiCI
          |
      WS disconnect event
          |
          v
      Stale-orch detector --scan after threshold
          |
          v
      Mark execution_runs timed_out_stale

Why two tiers
A single detection tier cannot cover all failure modes:
- Agent dies: The orchestrator detects this via missing heartbeats (Tier 1).
- Orchestrator dies: Only the upstream tier can detect this because it holds the WS connection (Tier 2).
- Network partition: Depending on where the partition occurs (between agent and orchestrator, or between orchestrator and upstream), either tier may detect the failure.
Heartbeat flow
Agent side
The agent sends a job.heartbeat message every 60 seconds (configurable via KICI_JOB_HEARTBEAT_INTERVAL_MS) for each running job. The heartbeat timer starts when the job enters the running state and is cleaned up in a finally block to handle all exit paths (success, failure, cancellation).
Heartbeats use the buffered send() path (not sendDirect()) so they queue during temporary orchestrator disconnects and flush on reconnect.
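The buffering behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical `BufferedSender` class and a `transmit` callback standing in for the real socket write; neither name comes from KiCI, and the actual send() implementation may differ.

```typescript
// Hypothetical sketch: BufferedSender and transmit are illustrative names,
// not KiCI's actual API. Only the message shape comes from the document.
interface JobHeartbeat {
  type: 'job.heartbeat';
  runId: string;
  jobId: string;
  timestamp: number;
}

class BufferedSender<T> {
  private queue: T[] = [];
  private connected = false;
  private transmit: (msg: T) => void;

  constructor(transmit: (msg: T) => void) {
    this.transmit = transmit;
  }

  // Buffered path: write immediately when connected, otherwise queue.
  send(msg: T): void {
    if (this.connected) {
      this.transmit(msg);
    } else {
      this.queue.push(msg);
    }
  }

  onDisconnect(): void {
    this.connected = false;
  }

  // On reconnect, flush queued messages in their original order.
  onConnect(): void {
    this.connected = true;
    const pending = this.queue;
    this.queue = [];
    for (const msg of pending) {
      this.transmit(msg);
    }
  }

  get pendingCount(): number {
    return this.queue.length;
  }
}

function makeHeartbeat(runId: string, jobId: string, now: number): JobHeartbeat {
  return { type: 'job.heartbeat', runId, jobId, timestamp: now };
}
```

Because each heartbeat carries its own timestamp, a heartbeat that was queued during a disconnect still reports the time it was produced, not the time it was flushed.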
The job.heartbeat schema is lightweight (no messageId):
    {
      type: 'job.heartbeat',
      runId: string,
      jobId: string,
      timestamp: number
    }

Orchestrator side
The orchestrator handles job.heartbeat messages in the agent WS handler and calls executionTracker.updateJobHeartbeat(), which updates execution_jobs.last_heartbeat_at with an optimistic WHERE status='running' clause (completed jobs are safely skipped).
An initial last_heartbeat_at is set when the job enters the running state via onJobStatus(). This ensures newly-running jobs have a heartbeat timestamp before the first periodic heartbeat arrives (60 seconds later).
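As a rough illustration of the guarded update, the sketch below models execution_jobs as an in-memory array. The `JobRow` shape and the free-standing `updateJobHeartbeat` function are assumptions made for the example; the real method runs a SQL UPDATE through executionTracker.

```typescript
// In-memory stand-in for the execution_jobs table. Field names are
// assumptions; only the statuses shown appear in this document.
type JobStatus = 'running' | 'failed' | 'cancelled' | 'timed_out_stale';

interface JobRow {
  runId: string;
  jobId: string;
  status: JobStatus;
  lastHeartbeatAt: number | null;
}

// Mirrors: UPDATE execution_jobs SET last_heartbeat_at = :ts
//          WHERE run_id = :runId AND job_id = :jobId AND status = 'running'
// Returns the number of rows updated, so callers can tell a skipped update apart.
function updateJobHeartbeat(
  rows: JobRow[],
  runId: string,
  jobId: string,
  timestamp: number,
): number {
  let updated = 0;
  for (const row of rows) {
    if (row.runId === runId && row.jobId === jobId && row.status === 'running') {
      row.lastHeartbeatAt = timestamp;
      updated += 1;
    }
  }
  return updated;
}
```

A heartbeat that arrives after the job has reached a terminal state matches no rows, so it cannot alter a finished job.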
Separate from WS heartbeat
The per-job heartbeat (60s) is separate from the WebSocket-level heartbeat (30s). The WS heartbeat monitors connection-level health (handled by AgentHeartbeatMonitor), while the job heartbeat monitors job-level liveness (handled by StaleRunDetector).
Detection flow
StaleRunDetector runs a periodic scan at a configurable interval (default 60s). Each scan consists of three sub-scans (A, B, C) followed by two post-scan steps (E, D):
Sub-scan A: stale running jobs (heartbeat present)
    SELECT ej.*, er.workflow_name, er.repo_identifier, er.sha, er.provider, er.provider_context
    FROM execution_jobs ej
    INNER JOIN execution_runs er ON er.run_id = ej.run_id
    WHERE ej.status = 'running'
      AND ej.last_heartbeat_at IS NOT NULL
      AND ej.last_heartbeat_at < :threshold

Finds running jobs where the last heartbeat is older than the stale threshold (default: 2 minutes).
Sub-scan B: stale running jobs (NULL heartbeat)
    WHERE ej.status = 'running'
      AND ej.last_heartbeat_at IS NULL
      AND ej.created_at < :threshold

Catches jobs that started before the heartbeat system existed, or where the initial heartbeat was never set. Falls back to created_at as the staleness indicator.
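Sub-scans A and B apply the same threshold to different columns. The sketch below expresses both as a single in-memory filter; the `RunningJob` shape and the `findStaleJobs` name are assumptions for illustration, and the real detector issues the SQL queries shown above.

```typescript
// Illustrative row shape; timestamps are epoch milliseconds.
interface RunningJob {
  jobId: string;
  status: string;
  lastHeartbeatAt: number | null;
  createdAt: number;
}

// Returns jobs considered stale at time `now`.
// Sub-scan A: heartbeat present and older than the threshold.
// Sub-scan B: heartbeat NULL, so created_at is used instead.
function findStaleJobs(
  jobs: RunningJob[],
  now: number,
  staleThresholdMs: number,
): RunningJob[] {
  const threshold = now - staleThresholdMs;
  return jobs.filter((job) => {
    if (job.status !== 'running') return false;
    const reference = job.lastHeartbeatAt ?? job.createdAt;
    return reference < threshold;
  });
}
```

With the default two-minute threshold, a job whose last heartbeat is three minutes old is returned, while one that reported a minute ago is not.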
Sub-scan C: stale dispatched queue entries
    SELECT id, run_id, job_name FROM dispatch_queue
    WHERE status = 'dispatched' AND created_at < :threshold

Detects jobs that were dispatched to an agent but never acknowledged (the agent died after dispatch, before starting execution). The dispatch_queue entry is marked failed and the corresponding execution_jobs row is marked timed_out_stale, so dispatch_queue failures propagate to the execution tables.
Post-scan E: held run expiry
If a HeldRunStore is configured, the scan calls heldRunStore.expireOverdue() to cancel held jobs whose hold expiry deadline has passed. This is a separate concern from heartbeat-based staleness: held runs are waiting for reviewer approval, not executing.
Post-scan D: batch run completion
After all sub-scans and held run expiry, the detector collects the unique runId values of all affected jobs and calls completeRunIfAllJobsTerminal() for each. This checks whether all jobs in the run have reached a terminal state and, if so, updates the execution_runs row.
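The batching step amounts to de-duplicating run IDs before the completion check. A minimal sketch follows, assuming a synchronous `completeRun` callback in place of the real completeRunIfAllJobsTerminal() (which is likely asynchronous); the `AffectedJob` shape is also hypothetical.

```typescript
// Hypothetical shape for a job touched by any sub-scan or by hold expiry.
interface AffectedJob {
  runId: string;
  jobId: string;
}

// Calls the completion check once per distinct run, in first-seen order,
// and returns the run IDs that were checked.
function completeAffectedRuns(
  affected: AffectedJob[],
  completeRun: (runId: string) => void,
): string[] {
  const runIds = Array.from(new Set(affected.map((job) => job.runId)));
  for (const runId of runIds) {
    completeRun(runId);
  }
  return runIds;
}
```

A run with several stale jobs in the same scan is therefore checked once, not once per job.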
Status model
timed_out_stale as distinct DB status
timed_out_stale is stored as a distinct value in execution_jobs.status, not just as an error_message on a failed row. This design enables:
- Direct querying: WHERE status = 'timed_out_stale' without parsing error messages.
- Separate metrics: Stale timeouts are counted independently from regular failures.
- GitHub distinction: Maps to the timed_out conclusion (not failure).
Relationship to engine state machine
The engine’s 11-state ExecutionState type (the pure-function state machine) is not modified. Instead, timed_out_stale is included in the engine’s ExecutionJobStatus Zod enum and TERMINAL_JOB_STATES set (defined in packages/engine/src/protocol/messages/execution-status.ts), so it is recognized as a terminal state for run completion logic across all tiers.
Overall run status
timed_out_stale is treated as failed when computing overall run status:
- Any timed_out_stale job -> run status is failed
- failed takes precedence over cancelled
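These two precedence rules can be written as a small pure function. In the sketch below, the `success` status name and the `computeRunStatus` function are assumptions for illustration; only failed, cancelled, and timed_out_stale are named in this document.

```typescript
// 'success' is an assumed name for a job or run that passed.
type TerminalJobStatus = 'success' | 'failed' | 'cancelled' | 'timed_out_stale';
type OverallRunStatus = 'success' | 'failed' | 'cancelled';

// Expects only terminal job statuses as input.
function computeRunStatus(jobStatuses: TerminalJobStatus[]): OverallRunStatus {
  // timed_out_stale counts as a failure, and failure outranks cancellation.
  if (jobStatuses.some((s) => s === 'failed' || s === 'timed_out_stale')) {
    return 'failed';
  }
  if (jobStatuses.some((s) => s === 'cancelled')) {
    return 'cancelled';
  }
  return 'success';
}
```

Checking for failures first is what gives failed its precedence over cancelled.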
Failure flow
When a stale job is detected:
- Optimistic UPDATE: SET status='timed_out_stale' WHERE status='running'
- Log warning: Includes runId, jobId, agentId, staleDurationMs
- Metric increment: staleRunsDetectedTotal and staleDetectionDurationSeconds
- In-memory sync: executionTracker.updateInMemoryJob() (no redundant DB write)
- Forward terminal status to Platform: executionTracker.forwardJobTerminalStatus() keeps the Platform execution_jobs projection in sync
- Emit infrastructure event: executionTracker.emitInfraEvent() with orchestrator.job.stale_detected for the dashboard timeline
- Cancel in-progress steps: executionTracker.cancelStepsForJob() removes stale running indicators from the dashboard
- Check run update: checkRunReporter.updateJobStatus() with timed_out conclusion
- Force-terminate agent: scalerManager.onAgentDisconnected() + dispatcher.onAgentDisconnect()
- Run completion check: completeRunIfAllJobsTerminal() for each affected run (batched)
Crash recovery
On orchestrator startup, StaleRunDetector.start() runs an immediate scan before starting the periodic interval. This catches any jobs that became stale while the orchestrator was down.
DB-Fallback in completeRunIfAllJobsTerminal
completeRunIfAllJobsTerminal() has two code paths:
- Path A (normal): In-memory RunStateMap available. Checks isRunComplete(), computes status from in-memory job states, updates DB, fires callback.
- Path B (crash recovery): In-memory Map is empty (post-restart). Queries execution_jobs from DB, checks if all are terminal, computes status from DB rows, updates execution_runs, fires callback.
Path B is critical for crash recovery: after a restart, in-memory state is empty, but the DB contains jobs that were marked timed_out_stale by the immediate startup scan. Without Path B, execution_runs.status would never be updated from running to failed.
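The choice between the two paths can be sketched as follows. This is a simplified model, not the real method: `completeRunSketch`, the `loadJobStatusesFromDb` callback, the use of a plain Map in place of RunStateMap, and the `success` status name are all assumptions, and the DB update and callback are omitted.

```typescript
// Illustrative status sets; 'success' is an assumed name for a passing job.
type JobState = 'running' | 'success' | 'failed' | 'cancelled' | 'timed_out_stale';
type FinalRunStatus = 'success' | 'failed' | 'cancelled';

const TERMINAL_STATES: ReadonlySet<JobState> = new Set<JobState>([
  'success', 'failed', 'cancelled', 'timed_out_stale',
]);

interface Completion {
  source: 'memory' | 'db';
  status: FinalRunStatus;
}

// Returns null while any job is still non-terminal (or no jobs are known).
function statusFromJobs(jobs: JobState[]): FinalRunStatus | null {
  if (jobs.length === 0 || !jobs.every((s) => TERMINAL_STATES.has(s))) return null;
  if (jobs.some((s) => s === 'failed' || s === 'timed_out_stale')) return 'failed';
  if (jobs.some((s) => s === 'cancelled')) return 'cancelled';
  return 'success';
}

function completeRunSketch(
  runId: string,
  inMemoryRuns: Map<string, JobState[]>,
  loadJobStatusesFromDb: (runId: string) => JobState[],
): Completion | null {
  const cached = inMemoryRuns.get(runId);
  // Path A when the run is tracked in memory, Path B (DB fallback) otherwise.
  const source = cached !== undefined ? 'memory' : 'db';
  const jobs = cached ?? loadJobStatusesFromDb(runId);
  const status = statusFromJobs(jobs);
  return status === null ? null : { source, status };
}
```

After a restart the Map is empty, so every run marked by the startup scan resolves through the `db` branch.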
Dispatch queue propagation
When a stale dispatch queue entry is found:
- dispatch_queue.status -> failed
- execution_jobs.status -> timed_out_stale (for the matching run_id + job_name)
- executionTracker.updateInMemoryJob() keeps in-memory state consistent
- The affected run_id is added to the batch completion check
This guarantees a failed dispatch queue entry never leaves the corresponding execution_jobs row in a non-terminal state.
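The propagation can be sketched against in-memory stand-ins for the two tables. The row shapes and the `failStaleDispatches` name are assumptions for illustration; the real detector performs SQL updates and also syncs in-memory tracker state.

```typescript
// Illustrative row shapes; timestamps are epoch milliseconds.
interface DispatchRow {
  id: number;
  runId: string;
  jobName: string;
  status: string;
  createdAt: number;
}

interface ExecutionJobRow {
  runId: string;
  jobName: string;
  status: string;
}

// Marks stale dispatched entries as failed, propagates timed_out_stale to the
// matching execution job, and returns the run IDs needing a completion check.
function failStaleDispatches(
  queue: DispatchRow[],
  jobs: ExecutionJobRow[],
  now: number,
  staleThresholdMs: number,
): Set<string> {
  const threshold = now - staleThresholdMs;
  const affectedRunIds = new Set<string>();
  for (const entry of queue) {
    if (entry.status !== 'dispatched' || entry.createdAt >= threshold) continue;
    entry.status = 'failed';
    for (const job of jobs) {
      if (job.runId === entry.runId && job.jobName === entry.jobName) {
        job.status = 'timed_out_stale';
      }
    }
    affectedRunIds.add(entry.runId);
  }
  return affectedRunIds;
}
```

Updating both rows in the same pass is what keeps a failed queue entry from leaving its execution job in a non-terminal state.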
Queue timeout expiry
Jobs waiting in the dispatch queue are subject to a configurable timeout (queue.timeoutMs, default 1 hour). The cleanup scheduler runs every 60 minutes and calls JobQueue.markExpired(), which:
- SELECTs pending jobs with expires_at < now()
- UPDATEs those rows to expired status
- Returns ExpiredJobInfo[] (id, runId, jobName)
The cleanup function then follows the same pattern as the stale run detector:
- Updates execution_jobs.status to timed_out_stale with error message "Queue timeout expired (job was never dispatched to an agent)"
- Calls executionTracker.updateInMemoryJob() and forwardJobTerminalStatus()
- Emits orchestrator.job.queue_expired infrastructure event
- Checks run completion via completeRunIfAllJobsTerminal()
The forwardJobTerminalStatus() call sends a job.status.forward message upstream, keeping the dashboard’s view of execution_jobs in sync.
The orchestrator also reports its queue.timeoutMs in the source.register protocol message so the upstream tier can run a safety-net GC of stale queued jobs.
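The select-then-update behavior of markExpired() described earlier in this section can be modeled in memory as below. The `QueueRow` shape, the numeric `id`, and the free-standing function are assumptions; only the ExpiredJobInfo fields (id, runId, jobName) come from the document.

```typescript
// Illustrative queue row; expiresAt is epoch milliseconds.
interface QueueRow {
  id: number;
  runId: string;
  jobName: string;
  status: 'pending' | 'dispatched' | 'expired' | 'failed';
  expiresAt: number;
}

interface ExpiredJobInfo {
  id: number;
  runId: string;
  jobName: string;
}

// Marks pending rows whose deadline has passed as expired and reports them,
// so the caller can update execution_jobs and check run completion.
function markExpired(queue: QueueRow[], now: number): ExpiredJobInfo[] {
  const expired: ExpiredJobInfo[] = [];
  for (const row of queue) {
    if (row.status === 'pending' && row.expiresAt < now) {
      row.status = 'expired';
      expired.push({ id: row.id, runId: row.runId, jobName: row.jobName });
    }
  }
  return expired;
}
```

Rows that were already dispatched are left alone here; those are covered by the stale run detector's dispatch queue scan instead.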
Race conditions
Optimistic concurrency
The primary race condition is between the stale detector and a late-arriving agent completion:

    UPDATE execution_jobs SET status='timed_out_stale'
    WHERE run_id=:runId AND job_id=:jobId AND status='running'

If the agent completes the job between the SELECT and the UPDATE, the WHERE status='running' clause prevents overwriting the legitimate completion. The stale detector checks numUpdatedRows and skips further processing if it is 0 (the job was already completed).
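The race can be reproduced with an in-memory model of the guarded UPDATE. The `TrackedJob` shape, the `markStaleIfRunning` function, and the `success` status are assumptions for illustration; the return value plays the role of numUpdatedRows.

```typescript
// In-memory stand-in for an execution_jobs row.
interface TrackedJob {
  runId: string;
  jobId: string;
  status: string;
}

// Compare-and-set: only transitions rows that are still 'running'.
// Returns the number of rows changed.
function markStaleIfRunning(rows: TrackedJob[], runId: string, jobId: string): number {
  let updated = 0;
  for (const row of rows) {
    if (row.runId === runId && row.jobId === jobId && row.status === 'running') {
      row.status = 'timed_out_stale';
      updated += 1;
    }
  }
  return updated;
}

// Returns true when the detector should continue with the failure flow.
function shouldProcessStaleJob(rows: TrackedJob[], runId: string, jobId: string): boolean {
  return markStaleIfRunning(rows, runId, jobId) > 0;
}
```

If the agent's completion lands first, the row no longer has status running, the update changes nothing, and the detector leaves the job alone.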
No redundant writes
StaleRunDetector does not call executionTracker.onJobStatus() because that method writes to the DB (status, timestamps, etc.), which would be redundant with the detector’s own UPDATE. Instead, it calls updateInMemoryJob() (in-memory only) and completeRunIfAllJobsTerminal() (handles run completion separately).
Workflow timeout = orchestrator run deadline
A workflow-level timeout (see Timeouts) caps the whole run’s wall-clock time across all jobs. Because a run’s jobs can span multiple agents, this cap cannot be enforced agent-side. It is a run-level deadline owned by the orchestrator and enforced by a periodic scanner modeled on the stale-run detector.
- Persisted at run creation. When a run starts, the orchestrator reads the workflow’s timeout from the lock file and persists it as execution_runs.workflow_timeout_ms. A run with no workflow timeout has a null column and is never deadline-enforced.
- Periodic deadline scan. A WorkflowDeadlineDetector runs on the same interval as the stale-run detector. Each scan finds non-terminal runs (pending/running/cancelling) with a non-null workflow_timeout_ms whose started_at + workflow_timeout_ms is in the past, time-bounded to the last 24 hours to avoid re-scanning ancient history.
- Canonical cancel with a distinct reason. Each overdue run is driven through the same run-cancellation path that a user-initiated kici runs cancel uses: queued dispatch rows are cancelled, job.cancel is sent to the agents running the run’s jobs, and the run’s failure reason is stamped with the distinct workflow_timeout reason so the dashboard labels the run “timed out” rather than a generic cancel. The run reaches cancelled.
- Crash recovery. Like the stale-run detector, the deadline detector runs an immediate scan on startup, catching runs that blew their deadline while the orchestrator was down.
This complements the job-level timeout, which is agent-enforced: the agent arms a job wall-clock deadline in the forked workflow runner and, on breach, aborts the job and reports it failed with the distinct job_timeout reason. Job and workflow timeouts are independent caps — each enforces only its own scope.
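The periodic deadline scan for workflow timeouts can be sketched as an in-memory filter. Several details here are assumptions: the `RunRow` shape, the `findOverdueRuns` name, the terminal status names, and in particular that the 24-hour bound applies to started_at (the document does not say which column is bounded).

```typescript
// Illustrative statuses; the three non-terminal ones come from the document.
type RunState = 'pending' | 'running' | 'cancelling' | 'cancelled' | 'failed' | 'success';

interface RunRow {
  runId: string;
  status: RunState;
  startedAt: number; // epoch milliseconds
  workflowTimeoutMs: number | null; // null means no workflow timeout
}

const NON_TERMINAL_RUN_STATES: ReadonlySet<RunState> = new Set<RunState>([
  'pending', 'running', 'cancelling',
]);

// Assumption: the 24-hour bound is applied to started_at.
const LOOKBACK_MS = 24 * 60 * 60 * 1000;

function findOverdueRuns(runs: RunRow[], now: number): RunRow[] {
  return runs.filter(
    (run) =>
      NON_TERMINAL_RUN_STATES.has(run.status) &&
      run.workflowTimeoutMs !== null &&
      run.startedAt + run.workflowTimeoutMs < now &&
      run.startedAt >= now - LOOKBACK_MS,
  );
}
```

Runs without a workflow timeout never match, which corresponds to the null column never being deadline-enforced.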