Cancel behavior

This document describes the operator-facing configuration and monitoring aspects of KiCI’s cancel system.

The grace period is the time between sending SIGTERM and sending SIGKILL to a running step when a job is cancelled.
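The SIGTERM-then-SIGKILL escalation described above can be sketched roughly as follows. This is an illustrative sketch, not the agent's actual code: `StepProcess`, `Scheduler`, and `terminateWithGrace` are hypothetical names, and the timer is injected so the logic stays independent of any particular runtime.

```typescript
// Hypothetical minimal view of a running step process (not KiCI's real type).
interface StepProcess {
  kill(signal: "SIGTERM" | "SIGKILL"): void;
  hasExited(): boolean;
}

// Schedules `fn` to run after `ms` milliseconds and returns a function
// that cancels the pending call.
type Scheduler = (fn: () => void, ms: number) => () => void;

// Send SIGTERM immediately, then SIGKILL once the grace period elapses if
// the process is still alive. The returned function cancels the pending
// SIGKILL and should be called when the process exits on its own.
function terminateWithGrace(
  proc: StepProcess,
  gracePeriodSeconds: number,
  schedule: Scheduler,
): () => void {
  proc.kill("SIGTERM");
  return schedule(() => {
    if (!proc.hasExited()) {
      proc.kill("SIGKILL");
    }
  }, gracePeriodSeconds * 1000);
}
```

In a Node.js environment, the scheduler could plausibly be a thin wrapper around `setTimeout` and `clearTimeout`.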

| Setting | Default | Scope | Description |
| --- | --- | --- | --- |
| SDK gracePeriod | 30s | Per-job | Set by workflow authors in the SDK (optional) |
| Agent max grace period | 30s | Per-agent | Hardcoded upper bound (fork-based backends) |

The agent enforces the lower of the two values. If a workflow author sets gracePeriod: 60 but the agent max is 30, the effective grace period is 30 seconds.

Workflow authors set the grace period per-job:

job('deploy', {
  runsOn: 'linux',
  gracePeriod: 60, // seconds
  steps: [
    /* ... */
  ],
});

The maximum grace period depends on the scaler backend:

| Backend | Max grace period | Configurable |
| --- | --- | --- |
| Fork-based (bare-metal, Firecracker) | 30s | No |
| Container (Docker/Podman) | 10s | No |

Both values are hardcoded. The effective grace period for a job is Math.min(jobGracePeriod, backendMaxGracePeriod).
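As a worked example of the `Math.min` rule, the sketch below computes the effective grace period per backend. The limits and the 30s default are copied from the tables in this document; the names `Backend`, `BACKEND_MAX_GRACE_SECONDS`, and `effectiveGracePeriod` are hypothetical and do not come from the KiCI codebase.

```typescript
// Hypothetical backend labels for illustration.
type Backend = "fork" | "container";

// Hardcoded backend limits in seconds, as listed in the backend table.
const BACKEND_MAX_GRACE_SECONDS: Record<Backend, number> = {
  fork: 30,
  container: 10,
};

// Default applied when the workflow author omits gracePeriod (30s per the settings table).
const DEFAULT_GRACE_SECONDS = 30;

// Effective grace period = min(job setting, backend maximum), in seconds.
function effectiveGracePeriod(backend: Backend, jobGracePeriod?: number): number {
  const requested = jobGracePeriod ?? DEFAULT_GRACE_SECONDS;
  return Math.min(requested, BACKEND_MAX_GRACE_SECONDS[backend]);
}
```

For instance, `effectiveGracePeriod("fork", 60)` yields 30, matching the example earlier in this section, while a job with no `gracePeriod` on a container backend is capped at 10.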

After the step exits (via SIGTERM/SIGKILL or normal completion), lifecycle hooks run. Each hook has a timeout:

| Setting | Default | Scope | Description |
| --- | --- | --- | --- |
| Hook timeout | 5 minutes | Per-hook | Maximum time for a single hook to complete |
| SDK timeout | 5 minutes | Per-hook | Set by workflow authors in the SDK |

Workflow authors can customize per-hook:

cleanup: {
  run: async (ctx) => { /* ... */ },
  timeout: 10 * 60 * 1000, // 10 minutes in ms
},

The maximum time a cancel operation can take is:

total_cancel_time = gracePeriod + hookTimeout

With defaults: 30s + 5min = 5 minutes 30 seconds maximum.

Operators should account for this when setting stale detection timeouts. A job in cancelling state for longer than total_cancel_time is stuck.
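The budget formula and the stuck check can be sketched as below. This is a hedged illustration for operators writing their own monitoring: the function names are hypothetical, and both inputs are taken in milliseconds even though the SDK's `gracePeriod` is specified in seconds.

```typescript
// Defaults from this document: 30s grace period, 5 minute hook timeout.
const DEFAULT_GRACE_PERIOD_MS = 30 * 1000;
const DEFAULT_HOOK_TIMEOUT_MS = 5 * 60 * 1000;

// total_cancel_time = gracePeriod + hookTimeout (both in milliseconds here).
function totalCancelTimeMs(
  gracePeriodMs: number = DEFAULT_GRACE_PERIOD_MS,
  hookTimeoutMs: number = DEFAULT_HOOK_TIMEOUT_MS,
): number {
  return gracePeriodMs + hookTimeoutMs;
}

// A job is considered stuck once it has been in the cancelling state for
// longer than the cancel budget.
function isStuckCancelling(
  cancellingSince: Date,
  now: Date,
  budgetMs: number = totalCancelTimeMs(),
): boolean {
  return now.getTime() - cancellingSince.getTime() > budgetMs;
}
```

With the defaults, `totalCancelTimeMs()` is 330000 ms, which is the 5 minutes 30 seconds quoted above. A job that customizes its hook timeout to 10 minutes would need a correspondingly larger stale detection threshold.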

Force cancel bypasses the grace period and all hooks:

  1. SIGKILL sent immediately to the running step process
  2. All hooks (onCancel, cleanup, onSuccess, onFailure) are skipped
  3. If hooks are currently running (during a graceful cancel), they are killed via SIGKILL
  4. The job transitions directly to cancelled

Force cancel is available through:

  • Dashboard: click the “Force cancel” button (appears after initial graceful cancel)
  • CLI: kici runs cancel <runId> --force, or a second Ctrl+C during interactive mode
  • API: POST /api/v1/runs/:runId/cancel { "force": true }

Use force cancel when:

  • A step is stuck in a blocking syscall that ignores SIGTERM
  • A hook is stuck (e.g., network timeout during cleanup)
  • You need immediate termination regardless of cleanup
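For scripted use of the API route listed above, a request can be assembled along these lines. Only the path and the `{ "force": true }` body come from this document; the base URL, the `Content-Type` header, and the `buildForceCancelRequest` helper are assumptions, and any authentication your deployment requires is not shown.

```typescript
// Plain description of an HTTP request, independent of any HTTP client.
interface CancelRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build a force-cancel request for POST /api/v1/runs/:runId/cancel.
// `baseUrl` is an assumed deployment address, for example "https://kici.example.com".
function buildForceCancelRequest(baseUrl: string, runId: string): CancelRequest {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    method: "POST",
    url: `${root}/api/v1/runs/${encodeURIComponent(runId)}/cancel`,
    headers: { "Content-Type": "application/json" }, // assumed header
    body: JSON.stringify({ force: true }),
  };
}
```

The returned object can be handed to whichever HTTP client is already in use, with authentication headers added as needed.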

A job in cancelling state beyond total_cancel_time indicates a problem.

Query for jobs stuck in cancelling state:

SELECT j.job_id, j.run_id, j.status, j.updated_at,
       EXTRACT(EPOCH FROM (NOW() - j.updated_at)) AS stuck_seconds
FROM execution_jobs j
WHERE j.status = 'cancelling'
  AND j.updated_at < NOW() - INTERVAL '10 minutes'
ORDER BY j.updated_at;

Common causes of stuck jobs:

| Cause | Symptom | Resolution |
| --- | --- | --- |
| Agent disconnected during cancel | No status update | Stale detection will time out the job |
| Hook infinite loop | Hook running forever | Force cancel via kici runs cancel <runId> --force |
| Process ignoring SIGTERM | Step not exiting | Wait for grace period expiry (SIGKILL) |
| Network partition | Agent can’t report status | Reconnection or stale timeout |

To resolve a stuck job manually:

  1. First, try force cancel via the CLI: kici runs cancel <runId> --force (or the equivalent HTTP call POST /api/v1/runs/:runId/cancel { "force": true })
  2. If the agent is disconnected, the stale run detection system will mark the job as timed_out_stale
  3. Check agent logs for errors during the cancel sequence

Firecracker VM-based jobs handle cancellation the same way as container and bare-metal jobs. The agent runs inside the VM and handles the SIGTERM/SIGKILL/hooks sequence. There is no special VM-level shutdown handling: the VM is destroyed after the agent reports the final job status.

Any organization member can cancel runs (Member, Admin, or Owner role). Cancelling is considered non-destructive since it doesn’t modify code or configuration.

The cancelled_by field on the run record stores the ID of the user who initiated the cancel. The dashboard shows “Cancelled by @username” on the run detail page.

Queued runs (waiting in a concurrency group) can be cancelled before they start executing. Since no step is running, the cancel is immediate:

  • No grace period
  • No hooks execute (no step context to run them in)
  • The run transitions directly to cancelled
  • The queue slot is freed for the next waiting run
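The three cancel paths described in this document (graceful, force, and queued) differ only in which phases they run. The sketch below summarizes that as a small decision function. It is a reading aid under stated assumptions: `RunState`, `CancelPlan`, and `planCancel` are hypothetical names, and the real state machine may model this differently.

```typescript
// Hypothetical run states relevant to cancellation.
type RunState = "queued" | "running";

// Which phases a cancel request goes through before the run is cancelled.
interface CancelPlan {
  sendsSigkillImmediately: boolean;
  usesGracePeriod: boolean;
  runsHooks: boolean;
}

function planCancel(state: RunState, force: boolean): CancelPlan {
  if (state === "queued") {
    // No step is running: nothing to signal and no hooks to run.
    return { sendsSigkillImmediately: false, usesGracePeriod: false, runsHooks: false };
  }
  if (force) {
    // Force cancel: SIGKILL right away, all hooks skipped.
    return { sendsSigkillImmediately: true, usesGracePeriod: false, runsHooks: false };
  }
  // Graceful cancel: SIGTERM, grace period, SIGKILL if needed, then hooks.
  return { sendsSigkillImmediately: false, usesGracePeriod: true, runsHooks: true };
}
```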

Source: packages/agent/src/execution/job-runner.ts, packages/engine/src/state-machine/