Auto-scaler: operations
This page covers how the auto-scaler behaves at runtime and how to operate it. For the YAML fields you set, see Common configuration and the per-backend pages.
Deployment topology
The orchestrator’s own deployment mode (running on bare metal vs running inside a container) is orthogonal to which scaler backend it can use. There is no code-level check that detects whether the orchestrator is containerised, and nothing restricts the backend list based on it. What each backend needs is host-level access to the resources it provisions, regardless of how the orchestrator process is packaged.
So a containerised orchestrator (e.g., the customer-deployable quay.io/kici-dev/kici-orchestrator image under Podman or Docker) can drive any of the three backends — but only if its container is granted the access that backend requires. A default, unprivileged container does not have that access; you must pass it through explicitly.
| Backend | What it provisions | What a containerised orchestrator must be granted |
|---|---|---|
| Container | sibling containers | The container runtime socket bind-mounted in (/var/run/docker.sock or the Podman socket), or a remote daemon via host: tcp://…. This is the common, supported deployment. |
| Bare-metal | host child processes | The agent binary must be reachable on a mounted host path (e.g. --volume /opt/kici:/opt/kici), with binaryPath pointing at the mounted location. Spawned agents then inherit the container’s namespaces — they do not land on the host unless the container shares the host PID/network namespaces. |
| Firecracker | KVM microVMs | --device /dev/kvm, --device /dev/net/tun, a writable mount for chrootBaseDir (default /srv/jailer), and the capability to run ip / chown (either as root inside the container or via requireSudo: true with passwordless sudo). In practice this means --privileged or a carefully curated capability + device set. |
Practical guidance:
- Container backend in a container is the standard customer deployment and needs only the socket (or a remote host).
- Bare-metal backend in a container is unusual. Because child processes inherit the orchestrator container’s filesystem and namespaces, “bare-metal” agents launched from inside a container are really container-local processes, not host processes, which defeats the usual reason to pick bare-metal (host hardware / GPU access). If you need agents on the host, run the orchestrator on the host.
- Firecracker backend in a container is technically possible but operationally fragile: nested KVM access, TAP device management, and jailer chroot all want host-level privileges. The supported and documented path is to run a Firecracker-backed orchestrator on the bare-metal host; see the Firecracker setup guide, whose prerequisites assume the orchestrator process has direct /dev/kvm and networking access.
The bottom line: the answer is “yes, with the right passthrough,” but only the container backend is a natural fit for a containerised orchestrator. For bare-metal and Firecracker, run the orchestrator on the host.
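As a rough illustration of the table above, the following Python sketch reports which host resources are not visible to the orchestrator process for a given backend. It is not part of KiCI: the names `REQUIRED_PATHS` and `missing_access` are hypothetical, the paths are the defaults and examples quoted on this page, and checking that a path exists is only a coarse proxy for having the access the backend needs.

```python
import os
from typing import Callable, Dict, List

# Paths quoted in the table above; adjust them to your deployment.
# Assumption: "the path exists" is only a rough stand-in for "access granted".
REQUIRED_PATHS: Dict[str, List[str]] = {
    "container": ["/var/run/docker.sock"],   # or the Podman socket
    "bare-metal": ["/opt/kici/kici-agent"],  # example binaryPath
    "firecracker": ["/dev/kvm", "/dev/net/tun", "/srv/jailer"],
}


def missing_access(backend: str,
                   exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """Return the required paths that this process cannot see."""
    return [path for path in REQUIRED_PATHS[backend] if not exists(path)]


for backend in REQUIRED_PATHS:
    missing = missing_access(backend)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{backend}: {status}")
```

Run inside the orchestrator container, a check like this would flag a forgotten `--device /dev/kvm` or socket mount before the first spawn attempt. A container backend that talks to a remote daemon does not need the local socket, so a missing socket is not necessarily a problem in that setup.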
Multi-scaler setup
For complex deployments, split scaler definitions across multiple files using the scalers.d/ directory pattern (inspired by Linux daemon conventions like conf.d/).
Directory structure

    /etc/kici/
      scalers.yaml              # Main config (version, globalMaxAgents, defaults)
      scalers.d/
        container-linux.yaml    # Container scaler for Linux agents
        gpu-machines.yaml       # Bare-metal scaler for GPU machines

Main config
The main config provides version, global settings, and optionally its own scalers:

    version: 1
    globalMaxAgents: 50
    defaults:
      resources:
        memory: '2g'
        cpus: 2

Additional scaler files
Files in scalers.d/ contain only scalers arrays. They are loaded alphabetically and merged into the main config:

scalers.d/container-linux.yaml:

    scalers:
      - name: container-linux
        type: container
        maxAgents: 20
        labelSets:
          - labels: ['linux', 'container']
            image: 'ghcr.io/myorg/kici-agent:latest'

scalers.d/gpu-machines.yaml:

    scalers:
      - name: gpu-machines
        type: bare-metal
        maxAgents: 3
        labelSets:
          - labels: ['linux', 'gpu', 'cuda']
            binaryPath: '/opt/kici/kici-agent'

Label-set overlap detection
At startup, the scaler validates that no label set appears in more than one scaler backend across all files. If an overlap is detected, the orchestrator rejects the configuration and exits.

    Error: Label set [linux,container] overlaps between scalers "container-team-a" and "container-team-b"

Overlapping label sets within the same scaler are allowed (the first match is used). Only cross-scaler overlaps are rejected.
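The merge and overlap rules can be sketched in a few lines of Python operating on already-parsed YAML. This is an illustration, not the orchestrator's code: `merge_scalers` and `check_overlaps` are hypothetical names, the sketch assumes the main config's own scalers come before those from scalers.d/, and it treats "overlap" as two scalers declaring the same label set after normalization.

```python
from typing import Dict, List, Tuple


def normalize(labels: List[str]) -> Tuple[str, ...]:
    """Lowercase, deduplicate, and sort a label set."""
    return tuple(sorted({label.lower() for label in labels}))


def merge_scalers(main: dict, extra_files: Dict[str, dict]) -> List[dict]:
    """Append scalers from scalers.d/ files, visiting files alphabetically."""
    merged = list(main.get("scalers", []))
    for filename in sorted(extra_files):
        merged.extend(extra_files[filename].get("scalers", []))
    return merged


def check_overlaps(scalers: List[dict]) -> None:
    """Raise ValueError if two different scalers declare the same label set."""
    owner: Dict[Tuple[str, ...], str] = {}
    for scaler in scalers:
        name = scaler["name"]
        for label_set in scaler["labelSets"]:
            key = normalize(label_set["labels"])
            first = owner.setdefault(key, name)
            if first != name:  # the same set inside one scaler is allowed
                raise ValueError(
                    f"Label set [{','.join(key)}] overlaps between scalers "
                    f'"{first}" and "{name}"'
                )


main = {"version": 1, "globalMaxAgents": 50}
extra = {
    "gpu-machines.yaml": {"scalers": [
        {"name": "gpu-machines", "labelSets": [{"labels": ["linux", "gpu", "cuda"]}]},
    ]},
    "container-linux.yaml": {"scalers": [
        {"name": "container-linux", "labelSets": [{"labels": ["linux", "container"]}]},
    ]},
}
scalers = merge_scalers(main, extra)
check_overlaps(scalers)  # no exception: the two label sets differ
print([s["name"] for s in scalers])  # ['container-linux', 'gpu-machines']
```

Because files are visited in sorted order, the resulting scaler order depends on file names, not on the order in which they were created.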
Label matching
Labels use subset matching semantics: a job’s runsOn labels must all be present in the scaler’s label set, but the scaler can have additional labels. This is consistent with peer routing in cluster mode. Label sets are normalized (sorted alphabetically, deduplicated, lowercased) before comparison. When multiple backends match, the one with the smallest label set wins (most specific match).
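The "smallest label set wins" rule can be illustrated with a short Python sketch. The function name `pick_label_set` is hypothetical, and the tie-breaking behaviour (first candidate in list order) is an assumption; this page does not say how ties between equally sized label sets are resolved.

```python
from typing import List, Optional, Set


def normalize(labels: List[str]) -> Set[str]:
    """Lowercase and deduplicate labels; using a set makes order irrelevant."""
    return {label.lower() for label in labels}


def pick_label_set(runs_on: List[str],
                   label_sets: List[List[str]]) -> Optional[List[str]]:
    """Return the smallest label set that contains every runsOn label."""
    wanted = normalize(runs_on)
    candidates = [ls for ls in label_sets if wanted <= normalize(ls)]
    if not candidates:
        return None
    # Assumption: on a tie, the first candidate in list order is returned.
    return min(candidates, key=lambda ls: len(normalize(ls)))


label_sets = [
    ["linux", "container", "node20"],
    ["linux", "container"],
    ["linux", "gpu", "cuda"],
]
print(pick_label_set(["Linux"], label_sets))            # ['linux', 'container']
print(pick_label_set(["linux", "node20"], label_sets))  # ['linux', 'container', 'node20']
print(pick_label_set(["macos"], label_sets))            # None
```

A job asking only for `linux` matches all three sets here, and the two-label set is chosen because it carries the fewest extra labels.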
Auto-injected labels
Every spawned agent ends up with a set of internal labels added automatically. Some are injected by the scaler backend via the KICI_LABELS environment variable when the agent is spawned; others are added by the agent itself at registration time based on its own environment. The final label set (as seen by the orchestrator after registration) includes:
- kici:os:<platform> is the host OS (e.g., kici:os:linux). Added by the agent at registration from its own os.platform(). For matching purposes, the scaler uses the actual host platform for bare-metal entries and always uses linux for container and Firecracker entries.
- kici:arch:<arch> is the CPU architecture (e.g., kici:arch:x64, kici:arch:arm64). Added by the agent at registration from os.arch().
- kici:agent:<backend-type> is the backend type (e.g., kici:agent:container, kici:agent:bare-metal, kici:agent:firecracker). Injected by the scaler into KICI_LABELS.
- kici:scaler:<scaler-name> is the scaler entry name (e.g., kici:scaler:linux-containers). Injected by the scaler into KICI_LABELS.
- kici:host:<hostname> is the hostname of the machine running the agent (e.g., kici:host:host-1). Added by the agent at registration from os.hostname(). Useful for routing jobs to already-registered agents, but not usable as a scaling target (the scaler cannot predict the hostname of agents it has not yet spawned).
- kici:role:<role> is added once per active role (e.g., kici:role:builder, kici:role:init-runner). Injected by the scaler into KICI_LABELS. By default (when roles is not set), all known role labels are injected. When roles is an empty array [], no role labels are injected. When specific roles are listed, only those role labels are injected.
These labels use the reserved kici: prefix namespace. You do not need to include them in your labelSets configuration — they are added automatically. However, you can reference them in job runsOn arrays to target specific backend types or scaler entries (e.g., runsOn: ['linux', 'kici:agent:firecracker']).
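To make the split between scaler-injected and agent-added labels concrete, here is a small Python sketch. The function names are hypothetical, the list of known roles contains only the two roles named on this page (the real list may be longer), and the sketch returns plain lists because the exact encoding of KICI_LABELS is not described here.

```python
from typing import List, Optional


def scaler_labels(backend_type: str, scaler_name: str,
                  roles: Optional[List[str]],
                  known_roles: List[str]) -> List[str]:
    """Labels the scaler injects at spawn time."""
    # roles unset (None) -> every known role; [] -> none; otherwise as listed.
    active = known_roles if roles is None else roles
    return ([f"kici:agent:{backend_type}", f"kici:scaler:{scaler_name}"]
            + [f"kici:role:{role}" for role in active])


def agent_labels(os_platform: str, arch: str, hostname: str) -> List[str]:
    """Labels the agent adds about its own environment at registration."""
    return [f"kici:os:{os_platform}", f"kici:arch:{arch}", f"kici:host:{hostname}"]


KNOWN_ROLES = ["builder", "init-runner"]  # assumption: only the roles named above

internal = (scaler_labels("container", "linux-containers", None, KNOWN_ROLES)
            + agent_labels("linux", "x64", "host-1"))
for label in internal:
    print(label)
```

The three-way behaviour of `roles` (unset, empty, explicit) is the part most easily misconfigured: an empty array is not the same as leaving the field out.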
Matching rules
- A job with runsOn: ["linux", "container"] matches a label set ["container", "linux"] (order does not matter).
- A job with runsOn: ["linux"] matches ["linux", "container"] (job labels are a subset of the scaler’s labels).
- A job with runsOn: ["linux", "container", "node20"] does NOT match ["linux", "container"] (job requires labels the scaler does not have).
- Labels are case-insensitive: ["Linux"] matches ["linux"] (normalization lowercases all labels).
Examples
| Job runsOn | Scaler Label Set | Match? |
|---|---|---|
| ["linux", "container"] | ["container", "linux"] | Yes |
| ["linux"] | ["linux", "container"] | Yes |
| ["linux", "container", "node20"] | ["linux", "container"] | No |
| ["linux", "container"] | ["linux", "container"] | Yes |
| ["gpu", "cuda"] | ["cuda", "gpu"] | Yes |
| ["macos"] | ["macos", "darwin", "bare-metal"] | Yes |
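The rows of the examples table can be reproduced with a minimal subset check. This is a sketch of the matching semantics as described on this page, with a hypothetical function name `matches`; it is handy for testing your own runsOn values against a label set before editing the config.

```python
from typing import List


def matches(runs_on: List[str], label_set: List[str]) -> bool:
    """True when every job label appears in the scaler label set (case-insensitive)."""
    job = {label.lower() for label in runs_on}
    offered = {label.lower() for label in label_set}
    return job <= offered


rows = [
    (["linux", "container"], ["container", "linux"]),
    (["linux"], ["linux", "container"]),
    (["linux", "container", "node20"], ["linux", "container"]),
    (["linux", "container"], ["linux", "container"]),
    (["gpu", "cuda"], ["cuda", "gpu"]),
    (["macos"], ["macos", "darwin", "bare-metal"]),
]
for runs_on, label_set in rows:
    print(runs_on, label_set, "Yes" if matches(runs_on, label_set) else "No")
```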
What happens when no match is found
If a job’s labels do not match any scaler’s label sets and no static agent with matching labels is connected, the job is queued locally as a fallback (with queued-no-backend status) while the cluster coordinator attempts peer rerouting. If no peer in the cluster can handle the labels either, the job ultimately fails with a no-backend error. In single-orchestrator deployments (no cluster peers), the queued fallback job will remain pending until it times out.
Config reload (SIGHUP)
Send SIGHUP to the orchestrator process to reload the scaler configuration without a restart:

    kill -HUP $(pidof node)
    # or
    kill -HUP $(cat /var/run/kici-orchestrator.pid)

Reload process
The reload follows a three-stage validation process:
- Load: Re-reads the YAML config file (and scalers.d/ directory if configured).
- Validate overlaps: Checks for label-set overlaps across scalers. If any overlap is found, the reload is rejected.
- Validate backends: Each backend validates its new label sets (e.g., container checks that all label sets have image fields). If any backend rejects, the reload is rejected.
Only after all three stages pass is the new configuration applied. On failure, the current configuration is kept and an error is logged.
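The all-or-nothing behaviour of the three stages can be sketched as follows. This is an illustration under assumptions, not the orchestrator's implementation: `try_reload` and `container_requires_image` are hypothetical names, each stage is modelled as a callable that raises on failure, and the log line reuses the message quoted in the troubleshooting section.

```python
from typing import Callable, List


def try_reload(current: dict,
               load: Callable[[], dict],
               check_overlaps: Callable[[dict], None],
               backend_checks: List[Callable[[dict], None]]) -> dict:
    """Return the new config if all three stages pass, else the current one."""
    try:
        candidate = load()             # stage 1: load
        check_overlaps(candidate)      # stage 2: validate overlaps
        for check in backend_checks:   # stage 3: validate backends
            check(candidate)
    except Exception as exc:
        print(f"Config reload failed, keeping current config: {exc}")
        return current
    return candidate


def container_requires_image(config: dict) -> None:
    """Example backend check: every container label set needs an image."""
    for scaler in config.get("scalers", []):
        if scaler.get("type") != "container":
            continue
        for label_set in scaler.get("labelSets", []):
            if "image" not in label_set:
                raise ValueError(f"scaler {scaler['name']}: label set without image")


old = {"scalers": []}
broken = {"scalers": [
    {"name": "c", "type": "container", "labelSets": [{"labels": ["linux"]}]},
]}
active = try_reload(old, lambda: broken, lambda cfg: None, [container_requires_image])
print(active is old)  # True: the rejected config was not applied
```

Nothing is swapped in until every stage has passed, so a partially valid file never takes effect.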
What changes on reload
- Global max agents limit
- Per-scaler label sets and their properties (image, resources, env)
- Per-scaler orchestratorUrl (URL for spawned agents to connect back to)
- Per-scaler roles (which internal job types the scaler handles)
- Warm pool sizes and timeouts
What does NOT change on reload
- Backend types (cannot change a container scaler to bare-metal)
- Active agents (running agents are not affected; changes apply to new spawns)
Monitoring
The scaler exposes Prometheus metrics with the kici_orch_scaler_ prefix, available on the orchestrator’s /metrics endpoint.
Metrics reference
| Metric | Type | Labels | Description |
|---|---|---|---|
| kici_orch_scaler_config_reloads_total | Counter | result | Config reload attempts (attempted, success, failed) |
| kici_orch_scaler_cpus_used | Gauge | scaler, machinePool | Current CPU reservations summed by scaler / pool. scaler="__global__" is the orchestrator-wide total; pool rows reflect the ledger |
| kici_orch_scaler_memory_bytes_used | Gauge | scaler, machinePool | Current memory reservations (bytes) summed by scaler / pool. Same label semantics as kici_orch_scaler_cpus_used |
| kici_orch_scaler_spawn_refusals_total | Gauge | (none) | Cumulative count of spawn requests refused due to resource caps (maxAgents, resourceCap, globalResourceCap, machinePool) |
| kici_orch_scaler_spawn_failures_total | Counter | backend, bound | Spawn failures where the backend accepted the request but the agent never came up (missing binary, unpullable image, boot failure). bound is true for a job-bound spawn, false for a warm-pool spawn |
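For a quick check without a Prometheus server, the text served on /metrics can be read directly. The Python sketch below is a deliberately simple parser for the Prometheus text format (it does not handle escaped quotes or braces inside label values); `read_metric` is a hypothetical helper, and the sample values are made up for illustration.

```python
import re
from typing import Dict

SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def read_metric(text: str, name: str, labels: Dict[str, str]) -> float:
    """Sum the samples of `name` whose labels include all of `labels`."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#"):  # skip HELP / TYPE comments
            continue
        match = SAMPLE_LINE.match(line)
        if not match or match.group("name") != name:
            continue
        found = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        if all(found.get(key) == value for key, value in labels.items()):
            total += float(match.group("value"))
    return total


# Made-up sample of what the endpoint might return.
sample = """\
# TYPE kici_orch_scaler_config_reloads_total counter
kici_orch_scaler_config_reloads_total{result="attempted"} 4
kici_orch_scaler_config_reloads_total{result="success"} 3
kici_orch_scaler_config_reloads_total{result="failed"} 1
kici_orch_scaler_spawn_refusals_total 7
"""
print(read_metric(sample, "kici_orch_scaler_config_reloads_total", {"result": "failed"}))  # 1.0
print(read_metric(sample, "kici_orch_scaler_spawn_refusals_total", {}))                    # 7.0
```

In practice the `text` argument would come from an HTTP GET against the orchestrator's /metrics endpoint.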
Suggested alert rules

    # Config reload failures
    - alert: KiCIScalerConfigReloadFailed
      expr: increase(kici_orch_scaler_config_reloads_total{result="failed"}[1h]) > 0
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: 'Scaler config reload failed'

Troubleshooting
Backend-specific troubleshooting lives on each backend page: Container, Bare-metal, Firecracker.
Label mismatch
Symptom: Jobs fail with a no-backend error, or stay queued with queued-no-backend status until they time out.
Cause: Job runsOn labels are not a subset of any scaler label set.
Solution: Check your workflow’s runsOn labels against the scaler config. Remember: all job labels must be present in the scaler’s label set (subset matching). Labels are normalized (sorted, deduplicated, lowercased) before comparison.

    # Check what label sets are configured
    grep -A2 "labels:" /etc/kici/scalers.yaml

Config reload rejected
Symptom: SIGHUP sent but config does not change. Error in logs: Config reload failed, keeping current config.
Cause: The new config has validation errors (label-set overlap, missing required fields, invalid values).
Solution: Check orchestrator logs for the specific validation errors. Fix the config and send SIGHUP again. The current config remains active during failed reloads.
Example configurations
These examples combine multiple scaler backends. For single-backend examples, see the Container, Bare-metal, and Firecracker pages.
Mixed: container + bare-metal GPU

    # Container for standard workloads, bare-metal for GPU workloads
    version: 1
    globalMaxAgents: 25

    defaults:
      resources:
        memory: '2g'
        cpus: 2

    scalers:
      - name: container-standard
        type: container
        maxAgents: 20
        labelSets:
          - labels: ['linux', 'container']
            image: 'ghcr.io/myorg/kici-agent:latest'
          - labels: ['linux', 'container', 'node20']
            image: 'ghcr.io/myorg/kici-agent-node20:latest'
            resources:
              memory: '4g'
              cpus: 4

      - name: gpu-machines
        type: bare-metal
        maxAgents: 3
        labelSets:
          - labels: ['linux', 'gpu', 'cuda']
            binaryPath: '/opt/kici/kici-agent'
            resources:
              memory: '16g'
              cpus: 8

Production: multi-scaler with warm pools
Using scalers.d/ directory for team-managed configs:

Main config (scalers.yaml):

    version: 1
    globalMaxAgents: 100

    defaults:
      resources:
        memory: '2g'
        cpus: 2

scalers.d/ file for container-standard:

    scalers:
      - name: container-standard
        type: container
        maxAgents: 40
        warmPool:
          enabled: true
          size: 5
          idleTimeoutSeconds: 300
        labelSets:
          - labels: ['linux', 'container']
            image: 'ghcr.io/myorg/kici-agent:latest'
          - labels: ['linux', 'container', 'node20']
            image: 'ghcr.io/myorg/kici-agent-node20:latest'
            resources:
              memory: '4g'
              cpus: 4

scalers.d/ file for container-heavy:

    scalers:
      - name: container-heavy
        type: container
        maxAgents: 10
        warmPool:
          enabled: true
          size: 2
          idleTimeoutSeconds: 600
        labelSets:
          - labels: ['linux', 'heavy']
            image: 'ghcr.io/myorg/kici-agent-heavy:latest'
            resources:
              memory: '8g'
              cpus: 8
            containerSocket: true # WARNING: See security section

scalers.d/ file for gpu-machines:

    scalers:
      - name: gpu-machines
        type: bare-metal
        maxAgents: 5
        labelSets:
          - labels: ['linux', 'gpu', 'cuda']
            binaryPath: '/opt/kici/kici-agent'
            resources:
              memory: '32g'
              cpus: 16