Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager used in High Performance Computing (HPC) environments.
Slurm is responsible for allocating access to compute resources, scheduling and monitoring jobs, and managing the queue of pending work. It allows many users to share a cluster efficiently and fairly.
| Term | Description |
|---|---|
| Job | A computation task submitted to Slurm |
| Node | A physical or virtual compute machine |
| Partition | A queue or group of nodes |
| Allocation | Resources assigned to a job |
| Time Limit | Maximum allowed runtime |
| GRES | Generic RESources (e.g., GPUs) |
First, use SSH to connect to the login node `portal.slurm.cpe.kmutt.ac.th`:

```
ssh <username>@cpe.kmutt.ac.th@portal.cpe.kmutt.ac.th
```

After logging in, the following commands are available:
| Command | Description |
|---|---|
| `sinfo` | Show cluster status, partitions, and node availability |
| `scontrol show nodes` | Display detailed node information |
| `sbatch job.sbatch` | Submit a batch job script |
| `srun <command>` | Run a job interactively |
| `squeue` | View running and pending jobs |
| `squeue -u <username>` | View jobs of a specific user |
| `scontrol show job <jobid>` | Show detailed job information |
| `scancel <jobid>` | Cancel a job |
| `scancel -u <username>` | Cancel all jobs of a user |
| `sacct` | View completed job history |
| `sacct -j <jobid>` | View accounting details for a specific job |
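For interactive work, `srun` can request resources and open a shell directly on a compute node. A minimal sketch, assuming the `poc` partition from the example below (the resource numbers are illustrative):

```shell
# Request 1 task with 2 CPUs for 10 minutes on the poc partition,
# and open an interactive shell on the allocated node (--pty attaches a terminal).
srun --partition=poc --ntasks=1 --cpus-per-task=2 --time=00:10:00 --pty bash
```

When the allocation is granted, the prompt moves to the compute node; exiting the shell releases the resources.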
Example `job.sbatch`:

```bash
#!/bin/bash
#SBATCH --job-name=test-job   # Job name shown in squeue
#SBATCH --partition=poc       # Partition (queue) to submit to
#SBATCH --ntasks=1            # Number of tasks
#SBATCH --cpus-per-task=4     # CPU cores per task
#SBATCH --mem=8G              # Memory for the job
#SBATCH --time=00:30:00       # Time limit (HH:MM:SS)
#SBATCH --output=job.out      # Standard output file
#SBATCH --error=job.err       # Standard error file

echo "Hello Slurm"
hostname
```
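Since Slurm exposes GPUs through GRES, a GPU job adds a `--gres` request to the same script structure. A hedged sketch, assuming the cluster defines a `gpu` GRES type (the partition name here is only illustrative — check `sinfo` for the actual GPU partition):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=poc       # illustrative; use the cluster's GPU partition
#SBATCH --gres=gpu:1          # request one GPU via GRES
#SBATCH --time=00:30:00

# nvidia-smi lists the GPU(s) visible inside the allocation
nvidia-smi
```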
| State | Meaning |
|---|---|
| PENDING | Waiting for resources |
| RUNNING | Job is running |
| COMPLETED | Finished successfully |
| FAILED | Job failed |
| CANCELLED | Job cancelled |
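These states appear in the output of `squeue` and `sacct`. To check the final state of a finished job, `sacct`'s `--format` option selects which fields to display, for example:

```shell
# Show ID, name, state, elapsed time, and exit code for a specific job
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,ExitCode
```

A `State` of `COMPLETED` with `ExitCode` `0:0` indicates the job finished successfully.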