Tiny Scheduler is a user-space, SLURM-like job scheduler for a single, shared host belonging to small research groups. The scheduler is tailored for groups that may be familiar with SLURM on HPC but for whatever reason cannot install it on group-allocated resources.
The package mirrors the SLURM user experience via commands tsbatch, tsqueue, tscancel, which can be optionally aliased for drop-in replacement.
No root privileges are necessary for installation, and all state is stored in a configurable directory (default: /scratch/$HOSTNAME/tiny-scheduler).
Note that this project is intentionally minimal and trades robustness for simplicity. To work within the limitations of user space, it lacks privilege separation, isolation, log integrity, and protection against resource abuse, among other basic security and architectural features.
Designate one member of the group as the queue administrator to complete the installation steps.
- Create the scheduler root and install the binaries.
  ./install.sh \
    --bin-dir=/your/shared/bin \
    --lock-dir=/your/shared/locks \
    --cpu-limit=0.5 \
    --sched-root=/scratch/$(hostname)/tiny-scheduler
- Start the scheduler daemon.
  tiny_sched_daemon.sh start
  tiny_sched_daemon.sh status
After the queue administrator has installed the scripts, other research group members add the bin directory to their PATH.
# For default installation path
echo 'export PATH="/shared/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
# For zsh users
echo 'export PATH="/shared/bin:$PATH"' >> ~/.zshrc && source ~/.zshrc
# For custom installation path (replace with actual path used during installation)
echo 'export PATH="/your/custom/bin/path:$PATH"' >> ~/.bashrc && source ~/.bashrc
Add bin directory to your PATH then:
# Legacy immediate run interface
run_job.sh 8 no python3 my_script.py
# Queued SLURM-like interface (namespaced)
tsbatch my_job.sh
tsqueue
tscancel <jobid>
If your filesystem supports POSIX ACLs (setfacl, getfacl) you can lock down scheduler metadata so only a designated administrator and a research group have access while removing world permissions. Example:
sudo ./install.sh \
--bin-dir=/shared/bin \
--lock-dir=/shared/locks \
--sched-root=/scratch/$(hostname)/tiny-scheduler \
--admin-user=alice \
--research-group=quantlab \
--secure-acl
Effects (best-effort, skipped if setfacl missing):
- Remove world perms (o-rwx) from bin and scheduler root.
- Admin user: rwx everywhere; research group: rx on binaries, rwx on queue/logs/tmp/lock/state.
- config file: admin rw, group r (640 + ACL), world none.
- Default ACLs applied so new files inherit group/admin rights.
If any ACL operation fails (unsupported FS), install continues with a warning.
All commands are pure Bash, with no external dependencies beyond standard GNU utilities (awk, sed, flock or bash emulation via file descriptor locking).
- tsbatch: Submit a job script with #SBATCH directives (queues job).
- tiny_sched_daemon.sh: Background dispatcher (start|stop|status|run).
- tsqueue: Show pending/running/completed jobs (simple format).
- tsout: Show resolved stdout/stderr file paths and existence for job(s).
- tscancel: Cancel pending or running job (supports array elements).
- tiny_sched_worker.sh: Internal per-job runner.
- tiny_sched_common.sh: Internal shared helpers.
- tiny_sched_env.sh: Optional environment script to alias sbatch → tsbatch etc. on specific host(s).
Implemented (parsed & acted upon):
- --job-name=NAME / -J NAME
- --cpus-per-task=INT
- --mem=####[M|G] (stored only, not enforced yet)
- --time=HH:MM:SS (best-effort soft limit; process killed after limit)
- --array=1-10 or --array=1,2,5 (basic arrays; each task becomes separate meta/jobid_task)
- --output=PATH (supports %j job id, %A array parent id, %a task id; default slurm-%j.out)
- --error=PATH (default slurm-%j.err)
Accepted (ignored / stored only for future): --partition, --qos, --mail-type, --mail-user.
Not yet supported, and likely never will be: array stride (1-10:2), max concurrent (1-100%5), dependency graph, accounting, memory enforcement, multi‑node.
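As a rough illustration of how the directives listed above might be read from a job script, here is a minimal Bash sketch. The function names `parse_sbatch_directives` and `time_to_seconds` are hypothetical and are not taken from the Tiny Scheduler sources; the real parser may behave differently (for example in how it treats short options or malformed lines).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the option part of every "#SBATCH ..." line
# in a job script, one option per line.
parse_sbatch_directives() {
  local script=$1 line
  local re='^#SBATCH[[:space:]]+(.*)$'
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line =~ $re ]]; then
      printf '%s\n' "${BASH_REMATCH[1]}"
    fi
  done < "$script"
}

# Hypothetical helper: convert an HH:MM:SS limit into seconds.
time_to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  # Force base 10 so fields such as "08" are not rejected as invalid octal.
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
```

For example, `parse_sbatch_directives my_job.sh` would list options such as `--job-name=testpi`, and `time_to_seconds 00:05:00` gives the limit in seconds.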
Inside the job execution environment we export a subset of SLURM variables:
SLURM_JOB_ID, SLURM_JOB_NAME, SLURM_CPUS_PER_TASK, SLURM_MEM_PER_NODE,
SLURM_SUBMIT_DIR, SLURM_JOB_NODELIST, SLURM_JOB_NUM_NODES=1, SLURMD_NODENAME.
Arrays additionally receive: SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID.
Default root: /scratch/$HOSTNAME/tiny-scheduler (override with env TJS_ROOT or --sched-root at install time).
$TJS_ROOT/
  config              # CPU_FRACTION, POLL_INTERVAL, DEFAULT_MEM_MB
  state/last_job_id   # monotonic counter
  queue/
    pending/          # job_<id>.meta + .script
    running/
    completed/
    failed/
  logs/               # daemon/worker logs
  lock/global.lock    # flock coordination
  tmp/                # transient files
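The layout above could be created, and the monotonic counter advanced under the global lock, roughly as follows. This is a sketch under stated assumptions: `init_layout` and `next_job_id` are hypothetical names, the `config` file is not written here, and the real scripts may fall back to a different locking scheme when flock(1) is unavailable.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create the directory layout under a scheduler root.
init_layout() {
  local root=$1
  mkdir -p "$root"/state "$root"/logs "$root"/lock "$root"/tmp \
           "$root"/queue/{pending,running,completed,failed}
  # Start the counter at 0 if it does not exist yet.
  [[ -f $root/state/last_job_id ]] || echo 0 > "$root/state/last_job_id"
}

# Hypothetical helper: hand out the next job id from state/last_job_id.
next_job_id() {
  local root=$1 id
  (
    # Serialize on lock/global.lock when flock(1) is available.
    if command -v flock >/dev/null 2>&1; then
      flock 9
    fi
    id=$(( $(<"$root/state/last_job_id") + 1 ))
    echo "$id" > "$root/state/last_job_id"
    echo "$id"
  ) 9>>"$root/lock/global.lock"
}
```

Holding the lock for the whole read-increment-write sequence is what keeps two concurrent submissions from receiving the same id.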
my_job.sh:
#!/usr/bin/env bash
#SBATCH --job-name=testpi
#SBATCH --cpus-per-task=4
#SBATCH --time=00:05:00
#SBATCH --mem=2G
#SBATCH --output=pi_%j.out
#SBATCH --array=1-3
echo "Task $SLURM_ARRAY_TASK_ID running on $(hostname) with $SLURM_CPUS_PER_TASK cores"
python3 compute_pi.py --method montecarlo --seed "$SLURM_ARRAY_TASK_ID"
Submit & monitor:
tsbatch my_job.sh # prints Submitted batch job <ID>
tsqueue # shows <ID>_1, <ID>_2, <ID>_3 tasks
tscancel <ID>_2 # cancel one array element
Outputs: pi_<jobid>.out per element (unless %a token used).
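To make the filename tokens concrete, the following sketch expands `%j`, `%A`, and `%a` in an `--output` pattern. The function name `expand_output_pattern` is hypothetical, and which value the scheduler actually substitutes for `%j` in an array task is an assumption left to the caller here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: expand SLURM-style filename tokens.
#   $1 pattern, $2 job id, $3 array parent id (optional), $4 task id (optional)
expand_output_pattern() {
  local pattern=$1 job_id=$2 array_id=${3:-} task_id=${4:-}
  # The backslash keeps % literal; an unescaped leading % can anchor the match.
  pattern=${pattern//\%j/$job_id}
  pattern=${pattern//\%A/$array_id}
  pattern=${pattern//\%a/$task_id}
  printf '%s\n' "$pattern"
}
```

With this sketch, a pattern such as `pi_%A_%a.out` yields a distinct file per array element, whereas a pattern that omits `%a` depends on the job id alone.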
Columns: JOBID NAME STATE CPUS TIME SUBMIT. (Time = seconds since start for RUNNING.)
States: PENDING, RUNNING, COMPLETED, FAILED, CANCELLED (subset of SLURM).
For completed / failed / cancelled jobs the TIME column is a fixed runtime (end_time - start_time), not a growing counter.
Inspect output/error resolution (helpful when you cannot locate an expected .out file):
tsout 42 # stdout/stderr for job 42
tsout 42 42_3 # job 42 plus array task 3
Output columns: JOBID[_TASK] TYPE PATH EXISTS SIZE_BYTES.
TYPE is OUT or ERR. EXISTS = YES/NO.
Completed / failed jobs remain indefinitely (metadata + a copy of the original script) until you purge them. Output/error files written via --output/--error live wherever your job ran (typically your submit directory) and are never deleted by the scheduler.
You can enable lightweight periodic purging by exporting environment variables before starting the daemon:
export TJS_PURGE_AGE_DAYS=14 # Remove metadata whose end_time is older than 14 days
export TJS_PURGE_KEEP_LAST=200 # Always retain most recent 200 regardless of age
export TJS_PURGE_INTERVAL_SEC=1800 # Run purge pass at most every 30 minutes (default 900)
tiny_sched_daemon.sh start
Set either or both of TJS_PURGE_AGE_DAYS and TJS_PURGE_KEEP_LAST. If neither is set, no automatic removal occurs. Purge affects only metadata/script copies, never user output files.
To reclaim space, use the tspurge helper:
# Dry run: show what would be removed (keep newest 200)
tspurge --keep-last=200 --dry-run
# Purge everything older than 14 days, keeping the most recent 100 regardless of age
tspurge --age-days=14 --keep-last=100
# Remove all historical metadata (dangerous):
tspurge --keep-last=0
Rules:
- Only metadata (job_*.meta) and the internal script copies are removed.
- User output files (e.g. tstest.out) are not touched.
- Relative --output/--error paths are resolved against the submit directory (mirrors SLURM), so look there for your files.
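The interaction between the age and keep-last settings can be sketched as a selection pass over the metadata files. This is an illustration only: `select_purge_candidates` is a hypothetical name, it uses file modification time as a stand-in for the `end_time` recorded in the metadata (whose format is not shown here), and it relies on GNU `find -printf`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list job_*.meta files that a purge would remove.
#   $1 directory, $2 keep-last count (0 = keep none), $3 age in days (0 = no age filter)
select_purge_candidates() {
  local dir=$1 keep_last=${2:-0} age_days=${3:-0}
  local now cutoff n=0 mtime path
  now=$(date +%s)
  cutoff=$(( now - age_days * 86400 ))
  # Input lines are "<mtime> <path>", newest first.
  while read -r mtime path; do
    n=$(( n + 1 ))
    if (( n <= keep_last )); then
      continue   # always retain the newest N entries
    fi
    if (( age_days > 0 && ${mtime%.*} >= cutoff )); then
      continue   # not old enough to purge
    fi
    printf '%s\n' "$path"
  done < <(find "$dir" -maxdepth 1 -name 'job_*.meta' -printf '%T@ %p\n' | sort -rn)
}
```

Printing candidates rather than deleting them mirrors the `--dry-run` idea: the list can be reviewed first and then passed to `rm` if it looks right.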
If you do not see your expected .out file, verify:
- The job used #SBATCH --output=... (or default slurm-<jobid>.out).
- You are in the original submit directory (check SLURM_SUBMIT_DIR recorded in the meta file under $TJS_ROOT/queue/completed).
- The worker had permission to create the file (parent directory writable).
- The job didn't exit before producing output (inspect stderr / meta exit_code).
- Cores: total usable cores = CPU_FRACTION * nproc (rounded down, at least 1).
- Scheduler packs jobs FIFO provided sum(running req_cpus + new req_cpus) ≤ allowed.
- Memory is not enforced yet (advisory only in metadata).
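The core budget and the admission check described above can be written out as a small sketch. The function names `allowed_cores` and `can_start` are hypothetical; the dispatcher's actual implementation is not shown in this document.

```bash
#!/usr/bin/env bash
# Hypothetical helper: usable cores = floor(CPU_FRACTION * total), at least 1.
allowed_cores() {
  local fraction=$1 total=$2
  awk -v f="$fraction" -v n="$total" \
    'BEGIN { c = int(f * n); if (c < 1) c = 1; print c }'
}

# Hypothetical helper: succeed if a job requesting $2 cores fits,
# given $1 cores already in use and a budget of $3 cores.
can_start() {
  local running=$1 requested=$2 allowed=$3
  (( running + requested <= allowed ))
}
```

A dispatcher loop might call `allowed=$(allowed_cores "$CPU_FRACTION" "$(nproc)")` once per pass and then test the job at the head of the pending queue with `can_start`.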
tscancel <id> cancels a single job (or an array element, <id>_<task>). Running jobs receive SIGTERM; if a job traps or ignores the signal, the final kill may depend on the worker exiting.
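Because cancellation is delivered as SIGTERM, a job script can cooperate by trapping the signal, cleaning up, and exiting promptly. The sketch below shows one way to do that in Bash; `run_job_body`, `on_term`, and the `CLEANUP_MARKER` file are hypothetical names used only for illustration.

```bash
#!/usr/bin/env bash
# Sketch of a job body that cleans up and exits when it receives SIGTERM.
# CLEANUP_MARKER is a hypothetical file used only to show the handler ran.
CLEANUP_MARKER=${CLEANUP_MARKER:-/tmp/job_cleanup.marker}

on_term() {
  echo "cleaned" > "$CLEANUP_MARKER"   # stand-in for real cleanup work
  exit 143                              # 128 + 15 (SIGTERM)
}

run_job_body() {
  trap on_term TERM
  while true; do
    # Sleep in the background and wait on it: bash defers a trap until the
    # current foreground command finishes, but `wait` is interrupted at once.
    sleep 1 &
    wait $!
  done
}
```

In a real job script, `run_job_body` would be replaced by the actual workload, with long-running child processes started in the background and waited on so that the trap can fire without delay.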
- Array stride & %concurrency limits.
- Memory enforcement (via cgroups, ulimit?).
- sacct-style historical reporting.
All jobs execute as the submitting user. If multiple users share the scheduler, ensure the $TJS_ROOT directory is group-writable; typical mode is 775 / group sticky.
Job scripts can write to home directories transparently; only shared scheduler metadata is centralized.
If the scheduler enters an inconsistent state (stale locks, corrupted metadata, or daemon failure), follow these steps:
- Stop the daemon (if running):
  tiny_sched_daemon.sh stop
- Clear stale locks:
  rm -f "$TJS_ROOT"/lock/*
  # Or, if using the legacy lock directory:
  rm -f /shared/locks/*
- Restart the daemon:
  tiny_sched_daemon.sh start
- Redeploy scripts (if binaries are corrupted or missing):
  ./install.sh --bin-dir=/your/bin --lock-dir=/your/locks --sched-root=/your/root
Note: Job output files (*.out, *.err) written to user directories are unaffected by scheduler recovery. Pending jobs may need to be resubmitted after a full reset.
This project does not include a system-wide service and assumes no administrator (no sudo) will install or manage services. You can run the scheduler as a long-running user process — for example via systemd --user — but without administrator assistance a user service will normally start only when you log in and will not be started automatically by the system after a cold boot.
If you want persistence across reboots without sudo, the repository includes a small heartbeat/health-check helper (see contrib/heartbeat.sh) which can be run from a machine that remains reachable (or from a remote host you control) to verify the daemon is alive and optionally re-establish it over an SSH tunnel. The helper assumes you (the user) have passwordless SSH or a persistent key-based connection available to the remote host.
You may still use systemd --user to supervise the daemon for convenience. Copy contrib/tiny_sched_daemon.user.service to ~/.config/systemd/user/ and enable it; it will start when you log in.
Example:
mkdir -p ~/.config/systemd/user
cp contrib/tiny_sched_daemon.user.service ~/.config/systemd/user/tiny_sched_daemon.service
# optional: create ~/.config/tiny_sched_daemon.env with TJS_ROOT or other env vars
systemctl --user daemon-reload
systemctl --user enable --now tiny_sched_daemon.service
systemctl --user status tiny_sched_daemon.service
Note: without administrative loginctl enable-linger the user service will not automatically start at boot when the user is not logged in. If you need that behavior and can obtain admin help later, enabling linger is an option, but it is not required for basic operation.
If you cannot rely on system-level persistence, contrib/heartbeat.sh provides a simple approach for monitoring and remotely restarting the scheduler using SSH. The basic idea:
- A remote machine (or another machine you control) runs the heartbeat script periodically.
- The heartbeat attempts to contact the scheduler process on the host (example: check an HTTP health endpoint if you provide one, or test a PID file, or attempt to run a lightweight tiny_sched_daemon.sh status).
- If the check fails, the script can attempt to SSH into the host and start the daemon as the same user (using key-based auth).
See contrib/heartbeat.sh for a lightweight implementation and usage notes.
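The check-then-restart idea can be sketched in a few lines of Bash. This is not the contents of contrib/heartbeat.sh: `heartbeat_once` and `remote` are hypothetical names, and the sketch assumes that `tiny_sched_daemon.sh status` exits non-zero when the daemon is not running.

```bash
#!/usr/bin/env bash
# Hypothetical helper: one heartbeat pass.
#   $1 is the name of a command or function that runs its arguments on the
#   scheduler host, for example:
#     remote() { ssh -o BatchMode=yes "$SCHED_HOST" "$@"; }
heartbeat_once() {
  local run=$1
  if "$run" tiny_sched_daemon.sh status >/dev/null 2>&1; then
    echo "alive"
  elif "$run" tiny_sched_daemon.sh start >/dev/null 2>&1; then
    echo "restarted"
  else
    echo "unreachable"
    return 1
  fi
}
```

Passing the remote runner as an argument keeps the check independent of the transport, so the same function can be exercised locally before it is scheduled (for instance from cron on the monitoring machine).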
The contrib/ directory contains:
- contrib/tiny_sched_daemon.user.service: example systemd --user unit for convenience
- contrib/heartbeat.sh: a small SSH-based heartbeat/remote-restart helper script (no sudo required)