Tiny Scheduler

Tiny Scheduler is a user-space, SLURM-like job scheduler for a single shared host belonging to a small research group. It is aimed at groups that know SLURM from HPC clusters but cannot install it on their group-allocated resources.

The package mirrors the SLURM user experience through the commands tsbatch, tsqueue, and tscancel, which can optionally be aliased to their SLURM counterparts as drop-in replacements.

No root privileges are necessary for installation, and all state is stored in a configurable directory (default: /scratch/$HOSTNAME/tiny-scheduler).

Note that this project is intentionally minimal and trades robustness for simplicity. To work within the limits of user space, it lacks privilege separation, isolation, log integrity, and protection against resource abuse, among other basic security and architectural features.

Installation

Queue Administrator

Designate one member of the group as the queue administrator to complete the installation steps.

Create the scheduler root and install the binaries.

./install.sh \
   --bin-dir=/your/shared/bin \
   --lock-dir=/your/shared/locks \
   --cpu-limit=0.5 \
   --sched-root=/scratch/$(hostname)/tiny-scheduler

Start the scheduler daemon.

tiny_sched_daemon.sh start
tiny_sched_daemon.sh status

Research Group Members

After the Queue Administrator has installed the scripts, other research group members add the bin directory to their PATH.

# For default installation path
echo 'export PATH="/shared/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc

# For zsh users
echo 'export PATH="/shared/bin:$PATH"' >> ~/.zshrc && source ~/.zshrc

# For custom installation path (replace with actual path used during installation)
echo 'export PATH="/your/custom/bin/path:$PATH"' >> ~/.bashrc && source ~/.bashrc

With the bin directory on your PATH, you can then run:

# Legacy immediate run interface
run_job.sh 8 no python3 my_script.py

# Queued SLURM-like interface (namespaced)
tsbatch my_job.sh
tsqueue
tscancel <jobid>

Optional: Harden Permissions with ACLs

If your filesystem supports POSIX ACLs (setfacl, getfacl) you can lock down scheduler metadata so only a designated administrator and a research group have access while removing world permissions. Example:

sudo ./install.sh \
   --bin-dir=/shared/bin \
   --lock-dir=/shared/locks \
   --sched-root=/scratch/$(hostname)/tiny-scheduler \
   --admin-user=alice \
   --research-group=quantlab \
   --secure-acl

Effects (best‑effort, skipped if setfacl missing):

Remove world perms (o-rwx) from bin and scheduler root.
Admin user: rwx everywhere; research group: rx on binaries, rwx on queue/logs/tmp/lock/state.
Config file: admin rw, group r (640 + ACL), world none.
Default ACLs applied so new files inherit group/admin rights.

If any ACL operation fails (unsupported FS), install continues with a warning.
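
As a rough illustration of the effects listed above, the following sketch applies the same kind of lockdown to a single directory using chmod and setfacl. The function name harden_dir is hypothetical and this is not the code that install.sh runs; it only relies on the standard setfacl options -m (modify the access ACL) and -d (operate on the default ACL).

```bash
# Hypothetical helper: best-effort lockdown of one directory.
harden_dir() {
  local dir="$1" admin="$2" group="$3"

  # Remove world permissions regardless of ACL support.
  chmod o-rwx "$dir"

  if ! command -v setfacl >/dev/null 2>&1; then
    echo "warning: setfacl not found, skipping ACLs" >&2
    return 0
  fi

  # Access ACL for the directory itself, then a default ACL so that
  # newly created files inherit the same admin/group rights.
  setfacl -m "u:$admin:rwx,g:$group:rwx" "$dir" 2>/dev/null &&
    setfacl -d -m "u:$admin:rwx,g:$group:rwx" "$dir" 2>/dev/null ||
    echo "warning: ACL update failed on $dir, continuing" >&2
  return 0
}
```

For example, harden_dir /scratch/myhost/tiny-scheduler/queue alice quantlab would mirror the admin and group names used in the install command above; the path here is only an example.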

Scheduler Scripts

All are pure Bash; no external dependencies beyond standard GNU utilities (awk, sed, flock or bash emulation via file descriptor locking).

tsbatch – Submit a job script with #SBATCH directives (queues job).
tiny_sched_daemon.sh – Background dispatcher (start|stop|status|run).
tsqueue – Show pending/running/completed jobs (simple format).
tsout – Show resolved stdout/stderr file paths and existence for job(s).
tscancel – Cancel pending or running job (supports array elements).
tiny_sched_worker.sh – Internal per‑job runner.
tiny_sched_common.sh – Internal shared helpers.
tiny_sched_env.sh – Optional environment script to alias sbatch→tsbatch etc. on specific host(s).

Supported #SBATCH Directives

Implemented (parsed & acted upon):

--job-name=NAME / -J NAME
--cpus-per-task=INT
--mem=####[M|G] (stored only, not enforced yet)
--time=HH:MM:SS (best‑effort soft limit; process killed after limit)
--array=1-10 or --array=1,2,5 (basic arrays; each task becomes separate meta/jobid_task)
--output=PATH (supports %j job id, %A array parent id, %a task id; default slurm-%j.out)
--error=PATH (default slurm-%j.err)

Accepted (ignored / stored only for future): --partition, --qos, --mail-type, --mail-user.
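
To make the directive handling more concrete, here is a minimal sketch of how #SBATCH lines could be pulled out of a job script. The function name list_sbatch_directives is hypothetical, and stopping at the first non-comment line follows SLURM's own sbatch behavior; whether tsbatch does exactly the same is an assumption.

```bash
# Hypothetical helper: print the option part of each "#SBATCH" line,
# one per line, stopping at the first real command in the script.
list_sbatch_directives() {
  awk '
    /^#SBATCH[ \t]/ { sub(/^#SBATCH[ \t]+/, ""); print; next }
    /^#/ || /^[ \t]*$/ { next }   # skip other comments and blank lines
    { exit }                      # first command ends the directive header
  ' "$1"
}
```

Running list_sbatch_directives my_job.sh on a script like the one in the Examples section would print one option per line, such as --job-name=testpi.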

Not yet supported, and likely never will be: array stride (1-10:2), max concurrent (1-100%5), dependency graph, accounting, memory enforcement, multi‑node.
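
The supported array forms (ranges such as 1-10 and lists such as 1,2,5) can be expanded into task ids with a few lines of shell. This is only a sketch under the assumptions stated in this section: the name expand_array_spec is hypothetical, and stride (:) and concurrency (%) suffixes are rejected because they are listed as unsupported.

```bash
# Hypothetical helper: expand "1-3" or "1,2,5" into one task id per line.
expand_array_spec() {
  local spec="$1" part lo hi i
  # Split on commas; each part is either a single id or LO-HI.
  for part in $(printf '%s' "$spec" | tr ',' ' '); do
    case $part in
      *:*|*%*)
        echo "unsupported array spec: $part" >&2
        return 1 ;;
      *-*)
        lo=${part%-*}
        hi=${part#*-}
        i=$lo
        while [ "$i" -le "$hi" ]; do
          echo "$i"
          i=$((i + 1))
        done ;;
      *)
        echo "$part" ;;
    esac
  done
}
```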

Job Environment Variables

Inside the job execution environment we export a subset of SLURM variables: SLURM_JOB_ID, SLURM_JOB_NAME, SLURM_CPUS_PER_TASK, SLURM_MEM_PER_NODE, SLURM_SUBMIT_DIR, SLURM_JOB_NODELIST, SLURM_JOB_NUM_NODES=1, SLURMD_NODENAME.

Arrays additionally receive: SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID.
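
A job script can read these variables with fallbacks so that it also runs when started by hand outside the scheduler. The snippet below is an illustrative sketch; the function name describe_job_env and the fallback values are assumptions, not part of the project.

```bash
# Hypothetical helper for use inside a job script: summarize the
# scheduler-provided environment, with defaults for manual runs.
describe_job_env() {
  local job_id="${SLURM_JOB_ID:-local}"
  local task_id="${SLURM_ARRAY_TASK_ID:-0}"
  local cpus="${SLURM_CPUS_PER_TASK:-1}"
  local submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
  echo "job $job_id task $task_id using $cpus cores in $submit_dir"
}
```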

Directory Layout

Default root: /scratch/$HOSTNAME/tiny-scheduler (override with env TJS_ROOT or --sched-root at install time).

$TJS_ROOT/
   config              # CPU_FRACTION, POLL_INTERVAL, DEFAULT_MEM_MB
   state/last_job_id   # monotonic counter
   queue/
      pending/          # job_<id>.meta + .script
      running/
      completed/
      failed/
   logs/               # daemon/worker logs
   lock/global.lock    # flock coordination
   tmp/                # transient files
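
To show how the state/last_job_id counter and lock/global.lock from the layout above might work together, here is a minimal sketch of allocating a job id under a lock. The function name next_job_id is hypothetical and the real scripts may coordinate differently; flock is used when available and skipped otherwise, so the sketch is only safe against concurrent callers when flock exists.

```bash
# Hypothetical helper: atomically bump $root/state/last_job_id and print it.
next_job_id() {
  local root="$1"
  mkdir -p "$root/state" "$root/lock"
  (
    # Hold an exclusive lock on file descriptor 9 for the read-modify-write.
    if command -v flock >/dev/null 2>&1; then
      flock 9
    fi
    last=$(cat "$root/state/last_job_id" 2>/dev/null || echo 0)
    next=$(( ${last:-0} + 1 ))
    echo "$next" > "$root/state/last_job_id"
    echo "$next"
  ) 9>>"$root/lock/global.lock"
}
```

The lock is released automatically when the subshell exits and descriptor 9 is closed.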

Examples

my_job.sh:

#!/usr/bin/env bash
#SBATCH --job-name=testpi
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
#SBATCH --mem=2G
#SBATCH --output=pi_%j.out
#SBATCH --array=1-3

echo "Task $SLURM_ARRAY_TASK_ID running on $(hostname) with $SLURM_CPUS_PER_TASK cores"
python3 compute_pi.py --method montecarlo --seed $SLURM_ARRAY_TASK_ID

Submit & monitor:

tsbatch my_job.sh      # prints Submitted batch job <ID>
tsqueue                # shows <ID>_1, <ID>_2, <ID>_3 tasks
tscancel <ID>_2        # cancel one array element

Outputs: pi_<jobid>.out per element (unless %a token used).
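
The filename tokens can be thought of as plain text substitutions. The sketch below shows one way to expand %j, %A, and %a; the function name expand_output_pattern is hypothetical, and which id the scheduler passes for %j on array tasks is left to the caller here because the exact behavior is an assumption.

```bash
# Hypothetical helper: substitute %j, %A and %a in an --output/--error pattern.
expand_output_pattern() {
  local pattern="$1" jobid="$2" array_parent="$3" task="$4"
  printf '%s\n' "$pattern" |
    sed -e "s/%j/$jobid/g" -e "s/%A/$array_parent/g" -e "s/%a/$task/g"
}
```

For instance, a pattern of pi_%A_%a.out keeps one file per array element regardless of how %j is resolved.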

tsqueue Output Format

Columns: JOBID NAME STATE CPUS TIME SUBMIT. (Time = seconds since start for RUNNING.)

States: PENDING, RUNNING, COMPLETED, FAILED, CANCELLED (subset of SLURM).

For completed / failed / cancelled jobs the TIME column is a fixed runtime (end_time - start_time), not a growing counter.
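
The TIME rule described above amounts to a small case distinction on the job state. This is a sketch with a hypothetical function name, job_time_seconds, taking epoch timestamps; showing 0 for jobs that have not started is an assumption.

```bash
# Hypothetical helper: compute the TIME column from epoch timestamps.
job_time_seconds() {
  local state="$1" start="$2" end="$3" now="${4:-$(date +%s)}"
  case $state in
    RUNNING)
      echo $((now - start)) ;;        # grows while the job runs
    COMPLETED|FAILED|CANCELLED)
      echo $((end - start)) ;;        # fixed runtime once finished
    *)
      echo 0 ;;                       # assumption: not started yet
  esac
}
```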

tsout Helper

Inspect output/error resolution (helpful when you cannot locate an expected .out file):

tsout 42        # stdout/stderr for job 42
tsout 42 42_3   # job 42 plus array task 3

Output columns: JOBID[_TASK] TYPE PATH EXISTS SIZE_BYTES. TYPE is OUT or ERR. EXISTS = YES/NO.

Retention / Purging Completed Jobs

Completed / failed jobs remain indefinitely (metadata + a copy of the original script) until you purge them. Output/error files written via --output/--error live wherever your job ran (typically your submit directory) and are never deleted by the scheduler.

Automatic (Background) Purging

You can enable lightweight periodic purging by exporting environment variables before starting the daemon:

export TJS_PURGE_AGE_DAYS=14        # Remove metadata whose end_time older than 14 days
export TJS_PURGE_KEEP_LAST=200      # Always retain most recent 200 regardless of age
export TJS_PURGE_INTERVAL_SEC=1800  # Run purge pass at most every 30 minutes (default 900)
tiny_sched_daemon.sh start

Set either or both of TJS_PURGE_AGE_DAYS and TJS_PURGE_KEEP_LAST. If neither is set, no automatic removal occurs. Purge affects only metadata/script copies, never user output files.
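
The interaction of the age and keep-last settings can be sketched as a filter over finished jobs. The function below is illustrative only: the name select_purge_candidates and the input format (one "jobid end_time" pair per line on stdin, end_time in epoch seconds) are assumptions, not the daemon's actual implementation.

```bash
# Hypothetical helper: print job ids that a purge pass would remove.
# Usage: select_purge_candidates AGE_DAYS KEEP_LAST NOW < "jobid end_time" lines
# Pass an empty string for a limit that is not set.
select_purge_candidates() {
  local age_days="$1" keep_last="$2" now="$3"

  # Neither limit set: nothing is purged.
  if [ -z "$age_days" ] && [ -z "$keep_last" ]; then
    return 0
  fi

  # Newest first, so the first KEEP_LAST rows are always retained.
  sort -k2,2nr | awk -v age="$age_days" -v keep="${keep_last:-0}" -v now="$now" '
    NR <= keep { next }
    age == "" || $2 < now - age * 86400 { print $1 }
  '
}
```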

To reclaim space manually, use the tspurge helper:

# Dry run: show what would be removed (keep newest 200)
tspurge --keep-last=200 --dry-run

# Purge everything older than 14 days, keeping the most recent 100 regardless of age
tspurge --age-days=14 --keep-last=100

# Remove all historical metadata (dangerous):
tspurge --keep-last=0

Rules:

Only metadata (job_*.meta) and the internal script copies are removed.
User output files (e.g. tstest.out) are not touched.
Relative --output / --error paths are resolved against the submit directory (mirrors SLURM), so look there for your files.

If you do not see your expected .out file, verify:

The job used #SBATCH --output=... (or default slurm-<jobid>.out).
You are in the original submit directory (check SLURM_SUBMIT_DIR recorded in the meta file under $TJS_ROOT/queue/completed).
The worker had permission to create the file (parent directory writable).
The job didn't exit before producing output (inspect stderr / meta exit_code).

Resource Model

Cores: total usable cores = CPU_FRACTION * nproc (rounded down, at least 1).
Scheduler packs jobs FIFO provided sum(running req_cpus + new req_cpus) ≤ allowed.
Memory is not enforced yet (advisory only in metadata).
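
The core budget and the FIFO admission test above can be written out as follows. This is a minimal sketch with hypothetical function names (allowed_cores, can_start); awk handles the fractional multiplication because shell arithmetic is integer-only.

```bash
# Hypothetical helper: floor(CPU_FRACTION * nproc), but never less than 1.
allowed_cores() {
  local fraction="$1" total="$2"
  awk -v f="$fraction" -v n="$total" \
    'BEGIN { c = int(f * n); if (c < 1) c = 1; print c }'
}

# Hypothetical helper: succeed if a job requesting $2 cores fits, given
# $1 cores already in use by running jobs and a budget of $3 cores.
can_start() {
  local running="$1" requested="$2" allowed="$3"
  [ $((running + requested)) -le "$allowed" ]
}
```

On a 16-core host with CPU_FRACTION=0.5, allowed_cores 0.5 16 yields 8, so a 4-core job can start while 4 cores are in use but not while 6 are.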

Cancellation

tscancel <id> cancels a single job (or array element <id>_<task>). Running jobs receive a SIGTERM; if a job traps or ignores the signal, the final kill may depend on the worker exiting.
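
For jobs that ignore SIGTERM, a common pattern is to wait for a grace period and then escalate to SIGKILL. The sketch below illustrates that pattern only; the function name terminate_with_grace and the escalation step are assumptions and not a description of what tscancel currently does.

```bash
# Hypothetical helper: send SIGTERM, wait up to GRACE seconds, then SIGKILL.
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was needed.
terminate_with_grace() {
  local pid="$1" grace="${2:-5}" waited=0

  # If the signal cannot be delivered, the process is already gone.
  kill -TERM "$pid" 2>/dev/null || return 0

  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```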

Desired Improvements

Array stride & %concurrency limits.
Memory enforcement (via cgroups, ulimit?).
sacct-style historical reporting.

Security & Permissions

All jobs execute as the submitting user. Ensure the $TJS_ROOT directory is group‑writable if multiple users share the scheduler; a typical setup is mode 775 with the group sticky (setgid) bit.

Job scripts can write to home directories transparently; only shared scheduler metadata is centralized.

Recovery Procedure

If the scheduler enters an inconsistent state (stale locks, corrupted metadata, or daemon failure), follow these steps:

Stop the daemon (if running):
```
tiny_sched_daemon.sh stop
```

Clear stale locks:

rm -f "$TJS_ROOT"/lock/*
# Or if using legacy lock directory:
rm -f /shared/locks/*

Restart the daemon:
```
tiny_sched_daemon.sh start
```

Redeploy scripts (if binaries are corrupted or missing):

./install.sh --bin-dir=/your/bin --lock-dir=/your/locks --sched-root=/your/root

Note: Job output files (*.out, *.err) written to user directories are unaffected by scheduler recovery. Pending jobs may need to be resubmitted after a full reset.

Running as a persistent process (user-level)

This project does not include a system-wide service and assumes no administrator (no sudo) will install or manage services. You can run the scheduler as a long-running user process — for example via systemd --user — but without administrator assistance a user service will normally start only when you log in and will not be started automatically by the system after a cold boot.

If you want persistence across reboots without sudo, the repository includes a small heartbeat/health-check helper (see contrib/heartbeat.sh) which can be run from a machine that remains reachable (or from a remote host you control) to verify the daemon is alive and optionally re-establish it over an SSH tunnel. The helper assumes you (the user) have passwordless SSH or a persistent key-based connection available to the remote host.

Running with systemd --user (best-effort, no root required)

You may still use systemd --user to supervise the daemon for convenience. Copy contrib/tiny_sched_daemon.user.service to ~/.config/systemd/user/ and enable it; it will start when you log in.

Example:

mkdir -p ~/.config/systemd/user
cp contrib/tiny_sched_daemon.user.service ~/.config/systemd/user/tiny_sched_daemon.service
# optional: create ~/.config/tiny_sched_daemon.env with TJS_ROOT or other env vars
systemctl --user daemon-reload
systemctl --user enable --now tiny_sched_daemon.service
systemctl --user status tiny_sched_daemon.service

Note: without administrative loginctl enable-linger the user service will not automatically start at boot when the user is not logged in. If you need that behavior and can obtain admin help later, enabling linger is an option, but it is not required for basic operation.

Remote heartbeat / restart helper (no sudo required)

If you cannot rely on system-level persistence, contrib/heartbeat.sh provides a simple approach for monitoring and remotely restarting the scheduler using SSH. The basic idea:

A remote machine (or another machine you control) runs the heartbeat script periodically.
The heartbeat attempts to contact the scheduler process on the host (example: check an HTTP health endpoint if you provide one, or test a PID file, or attempt to run a lightweight tiny_sched_daemon.sh status).
If the check fails, the script can attempt to SSH into the host and start the daemon as the same user (using key-based auth).

See contrib/heartbeat.sh for a lightweight implementation and usage notes.
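
The check-then-restart idea can be sketched in a few lines. This is not the contents of contrib/heartbeat.sh; the function names remote_run and heartbeat_once are hypothetical, and the sketch assumes that tiny_sched_daemon.sh status exits non-zero when the daemon is not running and that the script is on the remote PATH.

```bash
# Default remote runner: key-based, non-interactive SSH.
remote_run() {
  ssh -o BatchMode=yes -o ConnectTimeout=10 "$@"
}

# Hypothetical helper: one heartbeat pass against HOST.
# An alternative runner can be passed as the second argument.
heartbeat_once() {
  local host="$1" run="${2:-remote_run}"

  if "$run" "$host" tiny_sched_daemon.sh status >/dev/null 2>&1; then
    echo "ok"
  elif "$run" "$host" tiny_sched_daemon.sh start >/dev/null 2>&1; then
    echo "restarted"
  else
    echo "unreachable"
  fi
}
```

Calling heartbeat_once from cron or a loop on the monitoring machine would give a periodic check; the second argument exists mainly so the logic can be exercised without a real SSH connection.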

Files added to this repo

The contrib/ directory contains:

contrib/tiny_sched_daemon.user.service — example systemd --user unit for convenience
contrib/heartbeat.sh — a small SSH-based heartbeat/remote-restart helper script (no sudo required)
