CSA implements multi-layer resource isolation and memory-aware scheduling to prevent OOM kills and system instability when running AI tools.
AI CLI tools can consume significant memory (1-3 GB per instance). CSA prevents resource exhaustion through:
- Pre-flight checks -- verify sufficient memory before launching
- P95 estimation -- use historical data to predict memory needs
- Resource sandbox -- cgroup/rlimit isolation per tool process
- Peak tracking -- monitor and record actual memory usage
- Global concurrency slots -- limit concurrent tool instances
CSA uses a defense-in-depth approach with three isolation mechanisms:
| Layer | Mechanism | Isolation | Availability |
|---|---|---|---|
| 1 | cgroup v2 (systemd user scope) | Memory + PID limits | Linux with systemd |
| 2 | setrlimit (RLIMIT_AS, NPROC) | Address space + process count | POSIX systems |
| 3 | RSS monitor (background thread) | Peak memory tracking | All platforms |
CSA probes the host at startup and caches the result:

```rust
pub enum ResourceCapability {
    CgroupV2,  // Best: cgroup v2 + systemd user scope
    Setrlimit, // Fallback: POSIX setrlimit
    None,      // No isolation available
}
```

Detection order:
- Check that `/sys/fs/cgroup/cgroup.controllers` exists (cgroup v2 unified hierarchy)
- Check that `systemd-run --user --scope` is functional
- Fall back to `setrlimit` if available
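The detection order above can be sketched as a small decision function. The enum mirrors the `ResourceCapability` shown earlier; the boolean probe results stand in for the real filesystem and `systemd-run` checks (this helper is illustrative, not CSA's actual API):

```rust
// Sketch of CSA's capability detection order. The booleans stand in for
// the real probes (path check, systemd-run trial); the enum mirrors the
// ResourceCapability from the docs.
#[derive(Debug, PartialEq)]
enum ResourceCapability {
    CgroupV2,
    Setrlimit,
    None,
}

fn detect(cgroup_v2_mounted: bool, systemd_scope_works: bool, setrlimit_available: bool) -> ResourceCapability {
    if cgroup_v2_mounted && systemd_scope_works {
        ResourceCapability::CgroupV2 // best: kernel-enforced limits
    } else if setrlimit_available {
        ResourceCapability::Setrlimit // POSIX fallback
    } else {
        ResourceCapability::None
    }
}

fn main() {
    // Linux with working systemd user scopes gets full cgroup isolation.
    assert_eq!(detect(true, true, true), ResourceCapability::CgroupV2);
    // cgroup v2 mounted but no usable systemd scope -> rlimit fallback.
    assert_eq!(detect(true, false, true), ResourceCapability::Setrlimit);
    assert_eq!(detect(false, false, false), ResourceCapability::None);
}
```

Note that cgroup v2 alone is not enough: both probes must pass, since CSA drives the cgroup through a systemd transient scope.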
```toml
# ~/.config/cli-sub-agent/config.toml or .csa/config.toml
[resources]
enforcement_mode = "BestEffort"  # Required | BestEffort | Off
memory_max_mb = 4096             # Per-tool memory limit
memory_swap_max_mb = 0           # Swap limit (cgroup only)
pids_max = 256                   # Process count limit

# Per-tool overrides
[tools.codex.resources]
memory_max_mb = 3072
enforcement_mode = "Required"
```

| Mode | Behavior |
|---|---|
| Required | Fail if preferred sandbox is unavailable |
| BestEffort | Use best available mechanism, warn if degraded |
| Off | Disable sandbox entirely |
When cgroup v2 is available, CSA creates a systemd transient scope for each tool process:
- RAII cleanup via `CgroupScopeGuard` (scope removed on drop)
- Memory limits enforced by the kernel
- PID limits prevent fork bombs
- Orphan cleanup: `cleanup_orphan_scopes()` removes stale `csa-*.scope` units
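The RAII pattern behind `CgroupScopeGuard` can be sketched with a generic drop guard. The struct and field names here are illustrative, not CSA's actual API; the real guard tears down the systemd scope, while this sketch injects the cleanup action so the pattern is visible in isolation:

```rust
// Minimal RAII drop-guard sketch (names are illustrative, not CSA's API).
// The real CgroupScopeGuard removes the systemd transient scope; here the
// cleanup action is injected as a closure.
use std::cell::Cell;

struct ScopeGuard<F: FnMut()> {
    cleanup: F,
}

impl<F: FnMut()> Drop for ScopeGuard<F> {
    fn drop(&mut self) {
        // Runs on every exit path (early return, ? propagation, unwind),
        // so the common paths cannot leave a stale scope behind.
        (self.cleanup)();
    }
}

fn main() {
    let cleaned = Cell::new(false);
    {
        let _guard = ScopeGuard { cleanup: || cleaned.set(true) };
        // ... tool process would run inside the scope here ...
        assert!(!cleaned.get()); // scope still alive while guard lives
    } // guard dropped -> scope "removed"
    assert!(cleaned.get());
}
```

Drop-based cleanup still misses hard kills (SIGKILL, power loss), which is why the separate `cleanup_orphan_scopes()` pass for stale `csa-*.scope` units exists.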
When cgroup is unavailable, CSA uses `pre_exec` to set:

- `RLIMIT_AS` -- virtual address space limit
- `RLIMIT_NPROC` -- max processes per user

These are combined with `setsid()` in a single `pre_exec` closure for atomicity.
CSA maintains a rolling window of the last 20 peak memory measurements
per tool in `usage_stats.toml`:

```toml
[history]
gemini-cli = [1024, 1152, 1088, 1920, 1200, ...]
codex = [2048, 2304, 2176, 2560, ...]
```

The P95 (95th percentile) is used instead of the average because it:
- Accounts for occasional high-usage spikes
- Provides a conservative estimate
- Avoids skew from outliers
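A P95 over the rolling window can be sketched with the nearest-rank method (the doc doesn't specify CSA's exact percentile formula, so nearest-rank is an assumption here):

```rust
// Nearest-rank P95 over the rolling history window (sketch; CSA's exact
// percentile method is assumed, not documented here).
fn p95(history: &[u64]) -> Option<u64> {
    if history.is_empty() {
        return None;
    }
    let mut sorted = history.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: smallest value such that >= 95% of samples are <= it.
    // Integer math avoids float rounding at exact multiples.
    let rank = (sorted.len() * 95 + 99) / 100; // ceil(n * 0.95)
    Some(sorted[rank - 1])
}

fn main() {
    // 19 samples between 1010 and 1190 MB, plus one extreme outlier.
    let mut history: Vec<u64> = (1..=19u64).map(|i| 1000 + i * 10).collect();
    history.push(9000);
    // With 20 samples, rank = ceil(20 * 0.95) = 19 -> 19th smallest value.
    // The single outlier is ignored, while the high end of normal usage
    // (not the mean, which the outlier would inflate) drives the estimate.
    assert_eq!(p95(&history), Some(1190));
    assert_eq!(p95(&[]), None);
}
```

With the same data, the mean would be pulled to roughly 1490 MB by the single 9000 MB spike; P95 stays at the top of the normal range, which is exactly the conservative-but-not-skewed behavior described above.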
```text
required  = min_free_memory_mb + P95_estimate(tool)
available = physical_free + swap_free
if available < required:
    abort with OOM risk message
```
Priority chain for estimates:

- P95 from historical data (if >= 1 run exists)
- Initial estimate from config (`resources.initial_estimates`)
- Hardcoded fallback: 500 MB
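The estimate chain and the pre-flight rule above can be sketched together (function and parameter names are illustrative, not the `csa-resource` API):

```rust
// Sketch of the estimate priority chain and pre-flight check.
// Names are illustrative, not csa-resource's actual API.
fn estimate_mb(history_p95: Option<u64>, configured_initial: Option<u64>) -> u64 {
    history_p95
        .or(configured_initial)
        .unwrap_or(500) // hardcoded fallback from the chain above
}

fn preflight_ok(min_free_mb: u64, est_mb: u64, physical_free_mb: u64, swap_free_mb: u64) -> bool {
    let required = min_free_mb + est_mb;
    let available = physical_free_mb + swap_free_mb;
    available >= required // abort with an OOM-risk message otherwise
}

fn main() {
    // Numbers from the codex worked example in this doc:
    // P95 2560 MB, min_free 4096 MB, 6144 MB physical + 2048 MB swap free.
    let est = estimate_mb(Some(2560), Some(4096));
    assert_eq!(est, 2560); // history wins over the configured initial
    assert!(preflight_ok(4096, est, 6144, 2048)); // 6656 <= 8192 -> PASS
    // Cold start with neither history nor config -> 500 MB fallback.
    assert_eq!(estimate_mb(None, None), 500);
}
```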
Worked example:

```text
Tool: codex
P95 estimate: 2560 MB (from 20 runs)
min_free_memory_mb: 4096
Required:  4096 + 2560 = 6656 MB
Available: 8192 MB (physical: 6144 + swap: 2048)
6656 < 8192 -> PASS
```
The `csa-resource` crate provides `MemoryMonitor` for runtime tracking:

- Get child process PID after spawn
- Background task samples RSS every 500 ms via the `sysinfo` crate
- Peak RSS tracked via `Arc<AtomicU64>` (lock-free)
- Monitoring stops when process exits
- Peak value recorded to `usage_stats.toml`
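The lock-free peak tracking in the list above boils down to `fetch_max` on a shared `AtomicU64`. In this sketch a plain thread with synthetic RSS samples stands in for the real 500 ms `sysinfo` sampling task:

```rust
// Sketch of lock-free peak tracking via Arc<AtomicU64>::fetch_max.
// The thread and sample values stand in for the real 500 ms sysinfo loop.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let peak = Arc::new(AtomicU64::new(0));
    let sampler_peak = Arc::clone(&peak);

    let sampler = thread::spawn(move || {
        // Each sample can only raise the stored value, so sampler and
        // reader never need a lock; a monotonic max is race-free.
        for rss_mb in [1024u64, 1152, 1920, 1200] {
            sampler_peak.fetch_max(rss_mb, Ordering::Relaxed);
        }
    });

    sampler.join().unwrap();
    // The peak (1920) survives even though the last sample was lower.
    assert_eq!(peak.load(Ordering::Relaxed), 1920);
}
```

Because the monitor only polls every 500 ms, a spike shorter than one interval can be missed, which is the `+/- 500ms` accuracy bound in the table below.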
| Metric | Value |
|---|---|
| CPU overhead | < 0.1% (1 syscall per 500ms) |
| Memory overhead | < 1 MB |
| Peak detection accuracy | +/- 500ms |
CSA limits how many instances of each tool can run simultaneously:
```toml
[tools.codex]
max_concurrent = 5

[tools.claude-code]
max_concurrent = 3
```

Implemented via flock-based slot files under `~/.local/state/cli-sub-agent/slots/`:

```text
slots/
+-- codex-0.lock
+-- codex-1.lock
+-- codex-2.lock
+-- ...
```
When all slots are occupied:

- Default: fail with "no slots available"
- `--wait`: block until a slot becomes free
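The scan-and-claim loop over slot files can be sketched as follows. CSA uses `flock(2)`, which the Rust standard library does not expose portably, so this sketch substitutes exclusive file creation (`create_new`) purely to show the loop shape; unlike `flock`, these stand-in files must be removed explicitly and do not self-release if the holder crashes:

```rust
// Slot-scan sketch. The real implementation flock(2)s the slot files;
// create_new is a stand-in here (no auto-release on crash) so the
// scan-and-claim logic is runnable with only the standard library.
use std::fs::OpenOptions;
use std::path::{Path, PathBuf};

fn try_acquire_slot(dir: &Path, tool: &str, max_concurrent: u32) -> Option<PathBuf> {
    for i in 0..max_concurrent {
        let slot = dir.join(format!("{tool}-{i}.lock"));
        // First slot we can claim exclusively wins.
        if OpenOptions::new().write(true).create_new(true).open(&slot).is_ok() {
            return Some(slot);
        }
    }
    None // all slots occupied -> fail, or block and retry with --wait
}

fn main() {
    let dir = std::env::temp_dir().join(format!("csa-slots-demo-{}", std::process::id()));
    std::fs::create_dir_all(&dir).unwrap();

    let a = try_acquire_slot(&dir, "codex", 2);
    let b = try_acquire_slot(&dir, "codex", 2);
    let c = try_acquire_slot(&dir, "codex", 2);
    assert!(a.is_some() && b.is_some());
    assert!(c.is_none()); // both slots taken -> "no slots available"

    std::fs::remove_dir_all(&dir).unwrap();
}
```

`flock` is the better primitive for the real system precisely because the kernel drops the lock when the holding process exits, so a crashed tool cannot permanently consume a slot.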
Path: `~/.local/state/csa/{project_path}/usage_stats.toml`
Retention: last 20 records per tool (FIFO).

Statistics are written atomically: write to a `.tmp` file, then `rename()`
(atomic on POSIX). This prevents corruption from concurrent writes.
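The write-to-tmp-then-rename pattern is short enough to show in full (helper name is illustrative):

```rust
// Sketch of atomic stats persistence: write the full contents to a
// sibling .tmp file, then rename over the target. rename(2) within one
// filesystem is atomic, so readers see either the old file or the new
// one, never a half-written mix.
use std::fs;
use std::path::Path;

fn write_atomic(path: &Path, contents: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, contents)?; // a crash here leaves only the .tmp behind
    fs::rename(&tmp, path)      // the final name only ever sees a complete file
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join(format!("usage_stats-demo-{}.toml", std::process::id()));
    write_atomic(&path, "[history]\ncodex = [2048]\n")?;
    assert_eq!(fs::read_to_string(&path)?, "[history]\ncodex = [2048]\n");
    fs::remove_file(&path)?;
    Ok(())
}
```

For durability across power loss one would also `sync_all()` the tmp file before the rename; the pattern as shown already covers the concurrent-writer corruption case the doc describes.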
Until P95 data is available, CSA uses configured initial estimates:
| Tool | Recommended Initial (MB) |
|---|---|
| gemini-cli | 1024 |
| codex | 4096 |
| opencode | 1536 |
| claude-code | 4096 |
```rust
use csa_resource::{ResourceGuard, ResourceLimits};

// Create guard
let mut guard = ResourceGuard::new(limits, &stats_path);

// Pre-flight check
guard.check_availability("codex")?;

// After process completes
let peak_mb = monitor.stop().await;
guard.record_usage("codex", peak_mb);
```

`pipeline_sandbox.rs` in cli-sub-agent calls `resolve_sandbox_options()`
to determine sandbox configuration for each tool execution. Telemetry
is recorded via `SandboxInfo` in session state.
`AcpConnection::spawn_sandboxed()` applies resource isolation to ACP
processes, combining cgroup/rlimit enforcement with ACP session management.
| Problem | Solution |
|---|---|
| Frequent OOM-prevention errors | Lower `min_free_memory_mb` or close other apps |
| Tool crashes despite passing pre-flight | Expected for the ~5% of runs above the P95 estimate; estimates adapt as new peaks are recorded |
| Initial runs always fail | Lower `initial_estimates` in config |
| `usage_stats.toml` corrupt | Delete it (it will regenerate; history is lost) |
| Memory shows 0 MB peak | Process exited before the first 500 ms sample |
| "No sandbox capability" warning | Install systemd or ensure `setrlimit` is available |
- Configuration -- `[resources]` section reference
- Architecture -- process model and isolation
- ACP Transport -- ACP sandbox integration