Execution isolation architecture
Deep-dive on the agent-code isolation model: how KiCI separates customer workflow code from agent-internal resources.
Problem
In earlier KiCI versions, the agent executed customer workflow steps in-process using eval() or dynamic import within the agent’s own V8 isolate. This created critical security issues:
- Environment access: Customer code could read process.env, exposing KICI_ORCHESTRATOR_URL, KICI_DATABASE_URL, KICI_PLATFORM_TOKEN, WEBHOOK_SECRET, and other agent credentials
- Filesystem access: Customer code could read agent configuration files, TLS certificates, and other sensitive host files
- Network access: Customer code could connect to internal services (orchestrator WebSocket, database, MMDS metadata endpoint)
- Process interference: Customer code could call process.exit(), modify global state, or interfere with the agent’s event loop
Solution
The agent now delegates all customer code execution to an ExecutionSandbox, a separate process (or container) that receives only explicitly allowed data.
Core components
- ExecutionSandbox interface (types.ts): the lifecycle contract setup -> executeJob -> teardown, with abort available at any time
- Workflow Runner (workflow-runner.ts): Standalone Node.js entry point that runs inside the sandbox. Handles git clone, dependency install, compile, and step execution
- IPC Protocol (ipc-protocol.ts): Typed message protocol between agent and runner
- Environment Sanitizer (env-sanitizer.ts): Allowlist-based environment variable filtering
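The lifecycle contract can be pictured as a small TypeScript interface. The sketch below is illustrative only: the method names follow the lifecycle listed above, but the request and result shapes (`SketchJobRequest`, `SketchJobResult`) and the `runJob` helper are assumptions, not the contents of types.ts.

```typescript
// Hypothetical shapes for illustration; the real definitions live in types.ts.
interface SketchJobRequest {
  jobId: string;
}

interface SketchJobResult {
  status: "success" | "failed" | "cancelled";
}

// Lifecycle contract: setup -> executeJob -> teardown, with abort available at any time.
interface ExecutionSandboxSketch {
  setup(): Promise<void>;
  executeJob(request: SketchJobRequest): Promise<SketchJobResult>;
  teardown(): Promise<void>;
  abort(): Promise<void>;
}

// Drives one job through the lifecycle and always tears the sandbox down,
// even when setup or execution throws.
async function runJob(
  sandbox: ExecutionSandboxSketch,
  request: SketchJobRequest,
): Promise<SketchJobResult> {
  try {
    await sandbox.setup();
    return await sandbox.executeJob(request);
  } finally {
    await sandbox.teardown();
  }
}
```

Putting teardown in a `finally` block is one way to make sure a sandbox is never left behind after a failed job; whether the agent structures it this way is not stated in the source.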
Architecture diagram
    Agent Process                        Sandbox (child process / container / VM)
    +----------------------------+       +----------------------------------+
    |                            |       |                                  |
    |  JobRunner                 |       |  WorkflowRunner                  |
    |    |                       |       |    |                             |
    |    +-- detectExecutionMode |       |    +-- git clone                 |
    |    |                       |       |    |                             |
    |    +-- createSandbox()     |       |    +-- dep restore (tarball or   |
    |    |     |                 |       |    |     inline npm ci)          |
    |    |     +-- setup()       |       |    +-- register TS loader hook   |
    |    |     |                 |  IPC  |    |                             |
    |    |     +-- executeJob() -------->|    +-- import workflow .ts       |
    |    |     |                 |       |    |                             |
    |    |     |  onStepStatus <---------|    +-- evaluate rules            |
    |    |     |  onLogLine    <---------|    |                             |
    |    |     |                 |       |    +-- execute steps             |
    |    |     +-- teardown()    |       |    |     |                       |
    |    |                       |       |    |     +-- step.run(ctx)       |
    |    +-- forward to          |       |    |     |     (native zx $)     |
    |        orchestrator WS     |       |    |     +-- stream logs         |
    |                            |       |    |                             |
    +----------------------------+       +----------------------------------+
                                         Sanitized env only (allowlist + user env)
                                         No KICI_*, KICI_DATABASE_URL,
                                         KICI_PLATFORM_TOKEN, WEBHOOK_SECRET
                                         Secrets via IPC, not env
IPC protocol
The runner and agent communicate via a typed message protocol with two transport modes.
Message types
Runner to Agent (RunnerToAgentMessage):
| Type | Fields | Description |
|---|---|---|
| ready | (none) | Runner initialized, ready for execute command |
| step.start | stepIndex, stepName, step_type? | Step began executing |
| step.complete | stepIndex, status, durationMs, error?, outputs?, step_type?, secretsAccessed? | Step finished |
| log.line | stepIndex, line | Single log line from step |
| job.complete | status, stepResults[], error?, outputs?, secretOutputs? | All steps done, final results |
| event.emit | requestId, eventName, payload, target? | Request to emit a custom event from a step |
| concurrency.report | group | Report evaluated concurrency group key |
Agent to Runner (AgentToRunnerMessage):
| Type | Fields | Description |
|---|---|---|
| execute | request: JobExecutionRequest | Start job execution |
| abort | force? | Cancel the running job (force skips hooks) |
| event.emit.response | requestId, deliveryId?, error? | Confirm or reject a custom event emit request |
| concurrency.ack | action, reason? | Concurrency gate result (proceed/wait/cancel) |
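A typed protocol like this maps naturally onto a TypeScript discriminated union keyed on the message type. The following is a partial sketch covering five of the runner-to-agent messages: the field names come from the table above, while the field types and the `describeMessage` helper are assumptions and may differ from ipc-protocol.ts.

```typescript
// Field names follow the message table; the field types are assumptions.
type RunnerToAgentMessageSketch =
  | { type: "ready" }
  | { type: "step.start"; stepIndex: number; stepName: string; step_type?: string }
  | { type: "step.complete"; stepIndex: number; status: string; durationMs: number; error?: string }
  | { type: "log.line"; stepIndex: number; line: string }
  | { type: "job.complete"; status: string; stepResults: unknown[]; error?: string };

// Summarizes a message for logging; the switch narrows the union by its `type` tag,
// so each branch only sees the fields that belong to that message.
function describeMessage(msg: RunnerToAgentMessageSketch): string {
  switch (msg.type) {
    case "ready":
      return "runner ready";
    case "step.start":
      return `step ${msg.stepIndex} (${msg.stepName}) started`;
    case "step.complete":
      return `step ${msg.stepIndex} finished: ${msg.status} in ${msg.durationMs}ms`;
    case "log.line":
      return `[step ${msg.stepIndex}] ${msg.line}`;
    case "job.complete":
      return `job finished: ${msg.status} (${msg.stepResults.length} steps)`;
  }
}
```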
Dual transport mode
The protocol supports two transport mechanisms, selected at runner startup:
Fork IPC (bare-metal and Firecracker backends):
- Uses the Node.js IPC channel created by child_process.fork()
- Messages sent with process.send() and received via process.on('message')
- No manual framing or parsing: Node.js serializes each message on the IPC channel
- The runner detects fork mode via typeof process.send === 'function'
Stdio JSON-lines (container backend):
- Uses stdin/stdout via docker exec
- Each message is a single JSON object terminated by a newline
- Agent sends execute request on stdin, runner writes messages on stdout
- zx $.verbose=false and $.quiet=false suppress command echoing while allowing output to flow to stdout/stderr for step output capture
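The two transports can be sketched as a mode check plus a pair of JSON-lines helpers. This is a minimal illustration under assumptions: `detectTransport`, `encodeLine`, and `decodeLines` are hypothetical names, and the process object is passed in as a parameter so the sketch runs without a real child process.

```typescript
type TransportMode = "fork-ipc" | "stdio-jsonl";

// Fork mode is detected by the presence of a send function, mirroring
// the typeof process.send === 'function' check described above.
function detectTransport(proc: { send?: unknown }): TransportMode {
  return typeof proc.send === "function" ? "fork-ipc" : "stdio-jsonl";
}

// Encodes one message as a single JSON object terminated by a newline.
// JSON.stringify escapes embedded newlines, so one message is always one line.
function encodeLine(message: object): string {
  return JSON.stringify(message) + "\n";
}

// Splits buffered output into complete messages and returns any trailing
// partial line so the caller can prepend it to the next chunk.
function decodeLines(buffered: string): { messages: unknown[]; rest: string } {
  const parts = buffered.split("\n");
  const rest = parts.pop() ?? "";
  const messages = parts
    .filter((part) => part.trim() !== "")
    .map((part) => JSON.parse(part) as unknown);
  return { messages, rest };
}
```

Returning the unparsed remainder matters for the stdio transport because a stream chunk can end in the middle of a message.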
Lifecycle sequence
    Agent                          Runner
      |                              |
      |-- fork/exec ----------------→|
      |                              |-- initialize
      |                              |-- detect IPC mode
      |←--- ready -------------------|
      |---- execute {request} ------→|
      |                              |-- restore .kici/ source (tarball)
      |                              |-- restore deps (tarball or npm ci)
      |                              |-- register TS loader hook
      |                              |-- import workflow .ts
      |                              |-- verify contentHash (drift guard)
      |                              |-- evaluate rules
      |                              |
      |                              |-- for each step:
      |←--- step.start --------------|
      |←--- log.line ----------------|  (many)
      |←--- log.line ----------------|
      |←--- step.complete -----------|
      |                              |
      |←--- job.complete ------------|
      |                              |-- exit(0)
Abort sequence
    Agent                          Runner
      |                              |
      |---- abort ------------------→|  (IPC message)
      |                              |-- attempt graceful shutdown
      |   ... 10s grace period ...   |
      |---- SIGTERM ----------------→|  (if still running)
      |   ... 5s more ...            |
      |---- SIGKILL ----------------→|  (force kill)
Environment sanitization
Environment sanitization operates at two tiers:
- Orchestrator tier — The orchestrator sanitizes the environment when spawning agent processes (bare-metal and container backends). This prevents orchestrator secrets (KICI_DATABASE_URL, GITHUB_PRIVATE_KEY, S3 credentials) from reaching agents.
- Agent tier — The agent sanitizes the environment when spawning sandbox processes (customer code). This prevents agent credentials (KICI_ORCHESTRATOR_URL, KICI_AGENT_ID) from reaching customer code.
Shared constants (single source of truth)
Environment allowlist constants are defined in @kici-dev/engine (packages/engine/src/env/environment-allowlist.ts) and imported by both the orchestrator and agent:
- ALLOWED_SYSTEM_VARS: System variables safe to pass downstream (PATH, HOME, USER, etc.)
- AGENT_REQUIRED_KICI_VARS: KICI variables the agent needs (set explicitly, not copied from process.env)
- KICI_AGENT_ENV_PREFIX: The KICI_AGENT_ENV_ prefix constant for operator-controlled forwarding
This eliminates duplication between tiers and prevents drift.
Allowlist approach
The environment sanitizer uses an explicit allowlist: only named system variables pass through. Unlike a blocklist, this means that adding new variables to the host can never accidentally leak them.
The allowlist (ALLOWED_SYSTEM_VARS) contains:
    export const ALLOWED_SYSTEM_VARS = [
      'PATH', // Command execution
      'HOME', // User home directory
      'USER', // Current user
      'SHELL', // User shell
      'LANG', // Locale
      'LC_ALL', // Locale override
      'TERM', // Terminal type
      'TMPDIR', // Temp directory
      'NODE_PATH', // Node module resolution
      'TZ', // Timezone
    ] as const;
KICI_AGENT_ENV_ prefix forwarding
Operators can set KICI_AGENT_ENV_-prefixed variables on the orchestrator host. The orchestrator strips the prefix and passes the variable to spawned agents:
    KICI_AGENT_ENV_HTTP_PROXY=http://proxy:3128 -> HTTP_PROXY=http://proxy:3128
    KICI_AGENT_ENV_NO_PROXY=localhost -> NO_PROXY=localhost
All three backends honor this mechanism with identical precedence rules; only the transport differs. Bare-metal merges the variables into the spawned process’s env map, container assembles a flat env array that Docker/Podman feeds to the container, and Firecracker writes the merged map per-key into MMDS under meta-data/kici-env/. The Firecracker backend additionally enforces a per-VM 32 KiB byte budget (to stay within Firecracker’s ~51 KiB MMDS data store cap) and rejects keys that aren’t POSIX-safe identifiers; both filters log a warning and skip the offending var without aborting the spawn.
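The forwarding rule, together with the two Firecracker-side filters, can be sketched as one function. This is an assumption-laden illustration rather than the orchestrator's code: `forwardPrefixedEnv` is a hypothetical name, and counting the budget as key bytes plus value bytes in UTF-8 is a guess at how the 32 KiB cost is measured. The other backends apply only the prefix stripping.

```typescript
const PREFIX = "KICI_AGENT_ENV_";
const POSIX_KEY = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Strips the prefix from matching host variables. Keys that are not POSIX-safe
// identifiers, or that would push the running total past the byte budget,
// are skipped rather than failing the whole spawn.
function forwardPrefixedEnv(
  hostEnv: Record<string, string | undefined>,
  byteBudget: number = 32 * 1024,
): { env: Record<string, string>; skipped: string[] } {
  const env: Record<string, string> = {};
  const skipped: string[] = [];
  const encoder = new TextEncoder();
  let used = 0;
  for (const [name, value] of Object.entries(hostEnv)) {
    if (!name.startsWith(PREFIX) || value === undefined) continue;
    const key = name.slice(PREFIX.length);
    // Assumed cost model: UTF-8 bytes of key plus value.
    const cost = encoder.encode(key).length + encoder.encode(value).length;
    if (!POSIX_KEY.test(key) || used + cost > byteBudget) {
      skipped.push(key); // the real backend logs a warning at this point
      continue;
    }
    env[key] = value;
    used += cost;
  }
  return { env, skipped };
}
```

For example, a host environment containing `KICI_AGENT_ENV_HTTP_PROXY` yields an `HTTP_PROXY` entry, while unprefixed variables such as `PATH` are ignored by this step.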
Orchestrator-tier precedence (bare-metal backend)
The bare-metal backend constructs the agent’s environment in four layers:
- System allowlist (from orchestrator process.env), lowest precedence
- KICI_AGENT_ENV_ forwarded (prefix stripped from orchestrator process.env)
- Explicit KICI_* agent vars (KICI_ORCHESTRATOR_URL, KICI_AGENT_ID, etc., set to known values)
- scalers.yaml env: (label-set configuration), highest precedence
Container-tier precedence
The container backend builds a flat environment array. Docker/Podman uses last-value-wins for duplicate keys:
- Explicit KICI_* agent vars (KICI_ORCHESTRATOR_URL, etc.)
- KICI_AGENT_ENV_ forwarded (prefix stripped)
- scalers.yaml env: entries, last in the array, highest precedence
Firecracker-tier precedence
The Firecracker backend builds the merged env on the orchestrator and ships it to the VM via MMDS (system vars are kernel-provided inside the VM, not orchestrator-supplied):
- KICI_AGENT_ENV_ forwarded (prefix stripped from orchestrator process.env)
- scalers.yaml env: (label-set configuration), highest precedence
The merged map is written to MMDS as a nested object under meta-data/kici-env/<KEY> (one MMDS key per env var, value stored verbatim). Inside the VM, the rootfs /init script lists the directory, GETs each value, and emits export KEY='value' lines into a sourceable temp file (POSIX single-quote escaping handles values with quotes). Two safety filters apply on the orchestrator side: keys must match [A-Za-z_][A-Za-z0-9_]* and the cumulative byte cost must stay under 32 KiB; otherwise the var is skipped with a firecracker-backend warning log.
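The quoting rule used when emitting export lines is worth spelling out. In the source this logic lives in the rootfs /init shell script; the TypeScript below is only an illustration of the same POSIX single-quote escaping, with hypothetical function names.

```typescript
// Wraps a value in single quotes. An embedded single quote is emitted as '\''
// (close the quote, add a backslash-escaped quote, reopen the quote), which is
// the standard POSIX shell technique.
function shellSingleQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Builds one sourceable line for a key that already passed the identifier check.
function exportLine(key: string, value: string): string {
  return `export ${key}=${shellSingleQuote(value)}`;
}
```

Inside single quotes the shell performs no expansion, so values containing `$`, backticks, or double quotes pass through unchanged once the single quotes themselves are handled.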
Agent-tier precedence (sandbox)
The agent’s buildSanitizedEnv() constructs the sandbox environment using the 7-layer merge documented in Environments architecture. The simplified view:
- System allowlist (from agent process.env), lowest precedence
- User env (from workflow config / orchestrator-provided), overrides system
- Job env (from SDK env property), overrides user env
Secrets are NOT injected as environment variables. They flow through IPC and are accessed via ctx.secrets.get() and ctx.secrets.has(). Users can explicitly inject a secret into process.env by calling ctx.secrets.expose('KEY'), but this is opt-in.
This ensures:
- User vars can customize system defaults (e.g., custom PATH)
- Agent credentials are never included regardless of variable name
- Secrets never leak into environment variables unless explicitly exposed by user code
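The simplified three-layer view can be expressed in a few lines. This sketch is not the real buildSanitizedEnv(), which performs a 7-layer merge; the function name, parameter shapes, and the locally redeclared allowlist are assumptions made so the example is self-contained.

```typescript
// Local copy of the allowlist for illustration; the real constant is shared
// from @kici-dev/engine.
const ALLOWED = [
  "PATH", "HOME", "USER", "SHELL", "LANG",
  "LC_ALL", "TERM", "TMPDIR", "NODE_PATH", "TZ",
] as const;

function buildSanitizedEnvSketch(
  agentEnv: Record<string, string | undefined>,
  userEnv: Record<string, string> = {},
  jobEnv: Record<string, string> = {},
): Record<string, string> {
  // Copy only allowlisted names out of the agent's environment.
  const system: Record<string, string> = {};
  for (const name of ALLOWED) {
    const value = agentEnv[name];
    if (value !== undefined) system[name] = value;
  }
  // Later spreads win: system allowlist < user env < job env.
  return { ...system, ...userEnv, ...jobEnv };
}
```

Because the agent's environment is only ever read through the allowlist, a credential such as KICI_PLATFORM_TOKEN cannot reach the result unless a user or job layer sets a variable of that name itself.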
What gets excluded
Any variable not in the allowlist is stripped, including:
- KICI_ORCHESTRATOR_URL, KICI_AGENT_ID, KICI_LABELS: agent config
- KICI_DATABASE_URL: database credentials
- KICI_PLATFORM_TOKEN: Platform authentication
- WEBHOOK_SECRET: webhook signature keys
- AWS_*, DOCKER_*: infrastructure credentials (unless explicitly passed as user env)
- Any other variable present in the agent’s environment
Per-backend details
Container backend (ContainerSandbox)
- Creates a disposable Docker/Podman container per job, kept alive with sleep infinity
- Workflow runner bind-mounted read-only at /opt/kici/workflow-runner.js
- Workspace bind-mounted read-write at /workspace
- IPC via dockerode exec API with demultiplexed stdin/stdout streams
- Container labels (kici-sandbox, kici-job-id) for orphan cleanup
- Optional keepFailed flag preserves containers for debugging
- Uses buildRequest() from fork-runner.ts for consistent dispatch-to-request mapping
Bare-metal backend (BareMetalSandbox)
- Forks workflow runner via child_process.fork() with IPC channel
- Environment sanitized via buildSanitizedEnv(): only allowlisted vars
- Optional bubblewrap (bwrap) wrapping: child_process.spawn('bwrap', [...args, node, runner]) with stdio IPC fd
  - Read-only system mounts, writable workspace at /workspace
  - PID and IPC namespace isolation (--unshare-pid, --unshare-ipc)
  - Network isolation via --unshare-net (loopback only, no external connectivity)
  - Die-with-parent and new-session for lifecycle safety
- Without bwrap: credential isolation only (full filesystem/network access)
- stderr captured (last 20 lines) for crash diagnostics
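As a rough illustration of the bubblewrap wrapping, the argument list could be assembled like this. The flags are standard bwrap options, but the specific mount list is an assumption and `buildBwrapArgs` is a hypothetical helper; a real host usually needs additional read-only mounts or symlinks (for example for /lib and /bin) before Node.js will start.

```typescript
// Assembles bwrap arguments: read-only system mounts, a writable workspace,
// namespace isolation, lifecycle flags, then the command to run.
function buildBwrapArgs(workspaceDir: string, nodePath: string, runnerPath: string): string[] {
  return [
    "--ro-bind", "/usr", "/usr",          // read-only system mounts (illustrative subset)
    "--ro-bind", "/etc", "/etc",
    "--ro-bind", runnerPath, runnerPath,  // make the runner script visible, read-only
    "--proc", "/proc",
    "--dev", "/dev",
    "--bind", workspaceDir, "/workspace", // writable workspace
    "--chdir", "/workspace",
    "--unshare-pid",                      // PID namespace isolation
    "--unshare-ipc",                      // IPC namespace isolation
    "--unshare-net",                      // loopback only, no external connectivity
    "--die-with-parent",                  // sandbox dies if the agent dies
    "--new-session",                      // detach from the controlling terminal
    nodePath,
    runnerPath,
  ];
}
```

The resulting array corresponds to the `[...args, node, runner]` argument passed to child_process.spawn('bwrap', ...) in the list above.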
Firecracker backend (FirecrackerSandbox)
- Thin defense-in-depth wrapper around the same fork mechanism
- VM provides real isolation (separate kernel, rootfs, network)
- Sandbox adds environment sanitization inside the VM
- Prevents customer code from accessing MMDS metadata endpoint
- VM lifecycle (start/stop) managed by scaler backend, not the sandbox
- setup() is a no-op (VM already running when agent starts)
- teardown() kills child process only (VM shutdown is scaler’s job)
Key design decisions
One process per job
Each job gets exactly one sandbox process (or container). The runner handles the entire lifecycle: clone, install, compile, execute all steps. This avoids:
- Per-step process creation overhead
- State loss between steps (working directory, installed dependencies)
- Complex IPC multiplexing
StepContext reconstruction (not serialization)
The StepContext object (which provides $ from zx, log, env, inputs, workflow, job, matrix) is reconstructed natively inside the runner process, not serialized across the IPC boundary. This is critical because:
- zx’s $ cannot be serialized (it holds process references)
- The runner creates a fresh zx instance with initZx() and $.verbose=false
- Logger, env, and inputs are constructed from the execution request data
- Customer code gets a fully functional StepContext with native zx shell execution
zx runs natively in the sandbox
zx (the shell execution library) runs inside the sandbox process, not in the agent. This means:
- Shell commands execute with the sanitized environment
- zx’s $ has no access to agent credentials
- Command echoing is suppressed: $.verbose=false and $.quiet=false
- Command output goes through log.line IPC messages back to the agent
Sandbox types are self-contained
The sandbox type hierarchy (ExecutionSandbox, JobExecutionOptions, JobExecutionResult, SandboxStepResult) is independent from the agent’s older StepResult types. This ensures:
- Clean separation of concerns
- No accidental coupling between sandbox and agent internals
- The sandbox module can evolve independently
Sandbox is the only execution path
The sandbox is the only execution path for customer workflow code. There is no in-process execution route, which eliminates the risk of running untrusted step bodies outside the isolation boundary.
Network isolation
Network isolation prevents customer workflow code from accessing internal infrastructure (orchestrator, database, object storage, MMDS metadata). Each backend uses a different mechanism suited to its isolation model.
Container backend
- Containers are attached to a dedicated bridge network (kici-agent-net, subnet 172.30.0.0/16)
- Per-container nftables rules using ip saddr matching block traffic to RFC1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and cloud metadata endpoints (169.254.0.0/16)
- Internet access is allowed via NAT masquerade through the bridge gateway
- Rules are applied per-container during spawn and cleaned up during destroy
- Per-label-set networkPolicy.allowlist adds CIDR exceptions to the RFC1918 block; denyAll blocks all non-allowlisted outbound traffic
Firecracker backend
- Each VM has a dedicated TAP device attached to a shared bridge (kici-br0)
- Per-VM nftables rules keyed on the VM’s source IP block RFC1918 and metadata traffic (the TAP is enslaved to the bridge, so forwarded traffic carries the bridge as its input interface; source IP is the per-VM match that holds on the routed path)
- Internet access is allowed via NAT masquerade through the bridge gateway
- MMDS metadata endpoint (169.254.169.254) is additionally protected via in-VM iptables and host-side MMDS clearing (see Register/Config ACK Protocol below)
- Per-label-set networkPolicy.allowlist adds CIDR exceptions; denyAll blocks all non-allowlisted outbound traffic
Bare-metal backend
- When bubblewrap (bwrap) is enabled, --unshare-net creates a network namespace with only the loopback interface
- Customer workflow code has zero external network access (no internet, no local services)
- This is intentionally strict: bare-metal is for trusted environments only, and full isolation is simpler and more secure than selective nftables blocking
- Without bwrap: no network isolation (full host network access)
Blocked traffic summary
| Target | Reason |
|---|---|
| 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 | Prevent access to local services (Postgres, orchestrator HTTP) |
| 169.254.0.0/16 | Prevent SSRF-style attacks on cloud provider metadata services |
| Gateway exception | Allows internet access via NAT masquerade |
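The blocking policy in the table can be modeled as a small decision function. Real enforcement happens in nftables, not in TypeScript; this sketch only mirrors the decision, and it assumes that allowlist CIDRs take precedence over every blocked range (the source states this only for the RFC1918 block). All function names are hypothetical.

```typescript
// Converts a dotted-quad IPv4 address into an unsigned 32-bit number.
function ipToInt(ip: string): number {
  const parts = ip.split(".");
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  return parts.reduce((acc, p) => acc * 256 + Number(p), 0);
}

// True when `ip` falls inside `cidr` (for example "10.0.0.0/8").
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`invalid CIDR: ${cidr}`);
  }
  const size = 2 ** (32 - bits); // number of addresses in the block
  return Math.floor(ipToInt(ip) / size) === Math.floor(ipToInt(base) / size);
}

// RFC1918 ranges plus the link-local range used by cloud metadata services.
const BLOCKED_RANGES = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "169.254.0.0/16"];

function isEgressAllowed(dest: string, allowlist: string[] = [], denyAll: boolean = false): boolean {
  if (allowlist.some((cidr) => inCidr(dest, cidr))) return true; // explicit exception
  if (denyAll) return false; // everything not allowlisted is blocked
  return !BLOCKED_RANGES.some((cidr) => inCidr(dest, cidr));
}
```

Under this model a public address is reachable by default, 10.1.2.3 is reachable only when a covering CIDR such as 10.1.0.0/16 is allowlisted, and with denyAll set nothing outside the allowlist is reachable.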
Register/config ACK protocol
The registration handshake between agent and orchestrator delivers configuration securely and triggers MMDS clearing for Firecracker VMs.
Handshake sequence
    Agent                      Orchestrator
      |                           |
      |--- agent.register ------->|  (agentId, labels, maxConcurrency)
      |                           |  (register in AgentRegistry)
      |<--- register.ack ---------|  (config: agentId, labels, scalerManaged)
      |                           |
      | (block MMDS if FC mode)   |
      |--- config.ack ----------->|
      |                           |  (clear MMDS if Firecracker backend)
      |                           |
      |  Ready for job dispatch   |
Message details
agent.register (agent to orchestrator): The agent sends its labels, max concurrent job count, and optional agent ID.
register.ack (orchestrator to agent): The orchestrator confirms the registration and sends back the agent’s confirmed configuration: agentId, labels, and scalerManaged flag.
config.ack (agent to orchestrator): The agent confirms it received and applied the register.ack config. For Firecracker/scaler-managed agents, this is sent after blocking MMDS access via iptables.
MMDS clearing flow
For Firecracker VMs, the config.ack triggers MMDS data clearing:
- Agent receives register.ack with confirmed config
- Agent detects it is scaler-managed (via KICI_SCALER_MANAGED env var or scalerManaged flag)
- Agent blocks MMDS: iptables -A OUTPUT -d 169.254.169.254 -j DROP
- Agent sends config.ack to orchestrator
- Orchestrator’s ScalerManager.onConfigAck() identifies the Firecracker backend via managedAgentIndex
- Orchestrator calls FirecrackerScalerBackend.clearAgentMmds(), which uses the Firecracker API to clear MMDS data
This dual-sided protection (agent blocks + orchestrator clears) ensures customer workflow code cannot read orchestrator credentials from MMDS even if one side fails.
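The agent-side ordering is the important part: MMDS is blocked before config.ack is sent, so the orchestrator never clears data while the endpoint is still reachable from the VM. A minimal sketch of that ordering follows, with hypothetical names (`RegisterAckSketch`, `onRegisterAck`) and injected callbacks standing in for the real iptables call and WebSocket send; the callbacks are synchronous here purely for brevity.

```typescript
interface RegisterAckSketch {
  agentId: string;
  labels: string[];
  scalerManaged: boolean;
}

interface HandshakeDeps {
  blockMmds: () => void;     // stands in for: iptables -A OUTPUT -d 169.254.169.254 -j DROP
  sendConfigAck: () => void; // stands in for sending config.ack to the orchestrator
}

// Handles register.ack on the agent: block MMDS first (scaler-managed agents only),
// then acknowledge so the orchestrator can clear the MMDS data.
function onRegisterAck(ack: RegisterAckSketch, deps: HandshakeDeps): void {
  if (ack.scalerManaged) {
    deps.blockMmds();
  }
  deps.sendConfigAck();
}
```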
Backward compatibility
The agent waits for a register.ack response from the orchestrator before transitioning to the registered state. If the orchestrator does not send register.ack (e.g., due to a bug or version mismatch), the agent remains in the registering state until the connection is closed and reconnection is attempted.