SLURM Guide
Quick Disclaimers
1. This is a non-exhaustive guide
This guide lists the commands and arguments you are most likely to need; odds are, you won't need anything else. If you do need information on arguments or commands not listed here, consult the official SLURM documentation.
2. Resource caps
You can only request up to a total of 12 GPUs. Please note that, unless otherwise specified, GPU requests are per node: requesting 2 nodes and 2 A100s will result in a total of 4 A100s.
3. The Turing Cluster is a shared resource
It is important to remember that the Turing Cluster is shared by its users, so be mindful of the resources you request. With that said, we would rather you request more resources and finish your job quickly than request fewer resources that are held up for a longer period of time. Keep in mind, however, that the more resources you request, and the more powerful those resources are, the longer your wait time will be. Experiment to find a request size that satisfies all of your requirements.
Terminal Commands
The following commands are custom-built for your use on the Turing Cluster. Please note that these will not work on other HPC clusters:
- idle_gpus: lists all GPUs, the total number of each, and how many are free
- scluster: lists available nodes, features, and memory per node on the cluster; memory is reported in MB
- slurm_viewer: brings up a terminal GUI with filterable information on your job history
The following commands are provided by SLURM by default. As such, additional information can be found on the official SLURM documentation:
- sbatch <submission file>: queues a submission script (.sh) to be executed with your specified resources.
- squeue --me: view all active jobs you queued, showing job id, partition, runtime, and the node each is running on (or whether it is still queued).
- scancel <job id>: cancel a job using its id.
- seff <job id>: reports CPU usage for a specified job, its runtime, and memory usage in MB.
- sacct: reports recent jobs you ran, their status, and some resource information.
- scontrol show config: reports the SLURM and cluster config. Information here might be essential for projects that are sensitive to hardware/kernel configurations.
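For example, a typical submit-and-monitor sequence looks like the sketch below (the script name and job id are hypothetical):

sbatch train.sh      # queue the job; SLURM prints the assigned job id
squeue --me          # check whether it is running or still queued
seff 123456          # after it finishes, check how efficiently it used its resources
scancel 123456       # or cancel it early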
scontrol
scontrol is capable of doing much more than just reporting the cluster's config, but most of its uses are for administrators. It is mentioned here because projects that are sensitive to signals, kernel configurations, and specific hardware information will want to view what limitations the cluster imposes.
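For example, to check a single setting rather than scanning the full config (KillWait, the delay SLURM allows between SIGTERM and SIGKILL at the time limit, is one you may care about later in this guide):

scontrol show config | grep -i killwait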
Submission File Arguments
The following arguments are a large subset of those SLURM provides for sbatch. While they are likely the most common ones you will need, your project may require more specific constraints. If you find yourself asking "If only there was a way to do this...", check the official argument list for sbatch. Note that some arguments are only executable by administrators, in which case this is specified in the argument's description.
Essential Arguments
- -N <count>: number of compute nodes requested (usually each node has a maximum of 8 GPUs)
- --mem=<n><unit>: specify the maximum RAM (n) your program will be allocated. When the unit is omitted, the default is MB. Valid unit specifications are k, g, t (KB, GB, TB).
- -o <filename>.out: the name of the output file; contains all print statements
- -e <filename>.err: the name of the error file; contains all debug, warning, and error statements
- -p <partition>: partition to submit your job to; one of academic, short, or long
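Put together, a minimal submission script using only the essential arguments might look like the sketch below (the filenames, memory size, and partition are illustrative):

#!/bin/bash
#SBATCH -N 1         # one compute node
#SBATCH --mem=16g    # 16 GB of RAM
#SBATCH -o myjob.out # output file
#SBATCH -e myjob.err # error file
#SBATCH -p short     # partition

python3 your_python_script.py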
Requesting CPU Cores
While technically not essential, please note that your program will most likely run into errors if you do not explicitly request CPUs or a number of processes.
- --cpus-per-task=<n>: request n CPU cores per process. This is the simplest and safest method of requesting CPUs. Incompatible with --cpus-per-gpu.
- --cpus-per-gpu=<n>: request n CPU cores per GPU. Best practice for computationally expensive AI/ML tasks. Incompatible with --cpus-per-task.
- --ntasks <n>: informs SLURM to launch a MAXIMUM of n processes at the start of your job. By default, this also allocates 1 CPU core per task. Without specifying --ntasks, the default behavior of your job is to allocate 1 process per compute node. --cpus-per-task changes this default behavior, but this argument is still required by some other arguments.
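For instance, a job that launches 4 processes with 2 cores each might request (the counts are illustrative):

#SBATCH --ntasks 4          # up to 4 processes
#SBATCH --cpus-per-task=2   # 2 CPU cores per process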
Optional Arguments
Requesting GPUs
There are a few ways to request GPUs in SLURM, so make sure to choose what works best for you. Please note that you can request a maximum of 12 GPUs total for your job. Each of the following arguments lets you optionally specify the name of the GPU, e.g. A100. If omitted, you will receive the specified count of GPUs, allocated with the following priority:
- GPUs that are immediately free
- of the free GPUs, allocate the least powerful
Please note that if you are on the academic partition, you only have access to A30 GPUs.
- --gres=<resource>:<type>:<count>: request a generic resource (currently only GPUs) of a specified type (A100, L40S, etc.). This will be applied to all nodes.
- --gpus=<type>:<count>: request GPUs of a specific type and count. This will be applied to all nodes.
- --gpus-per-node=<type>:<count>: request that each node has a specified count of GPUs. Multiple GPU types can be requested via a comma-separated list. Incompatible with --gpus-per-task.
- --gpus-per-task=<type>:<count>: request a count of GPUs per process. Incompatible with --gpus-per-node. Requires you to explicitly use --ntasks.
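For example, the following sketch requests 2 A100s on each of 2 nodes, 4 GPUs in total (the GPU type and counts are illustrative):

#SBATCH -N 2
#SBATCH --gpus-per-node=A100:2   # 2 A100s per node, 4 total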
Other Optional Arguments
- --constraint=<constraints>: request only nodes that have specific features (features can be viewed by running scluster). A more detailed list of specifications can be found in the official argument list for sbatch.
- -J "<job name>": assign a specific name to your job. If omitted, the job name will be the same as your submission script's.
- -t D-HH:MM:SS: time limit for your job. We recommend you set one to avoid early timeouts. While SLURM accepts multiple formats for this argument, we recommend this one as it is the least ambiguous.
- --begin=<time>: notify SLURM to queue the job but not run it until a specified time. The list of acceptable time formats is extensive and can be found in the official argument list for sbatch.
- --chdir=<directory>: set the working directory of the submission script to another directory before execution.
- --dependency=<dependencies>: prevent the job from running UNTIL the specified jobs have finished. The formatting for this argument is complex, but it is explained in the official argument list for sbatch.
- --requeue: flag the job as eligible for requeueing. Note that this does not automatically requeue your job for all incomplete runs.
Utility Bash Functions
These functions/commands are completely optional, but may help you with various debugging and logging tasks.
Debugging GPU Issues
gpu_debug: tool to identify possible issues you are having with your job (may not be perfectly accurate)
- Makes broad assumptions about your submission, but is a great first-step tool to figure out what is going wrong if you are receiving GPU errors.
- Should be included inside of your SLURM script as an additional process. Note that this is another custom-built terminal command that will only work on the Turing Cluster.
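A sketch of one way to include it, assuming it can simply be launched as an additional background process alongside your workload (check its actual invocation on the cluster):

gpu_debug &                     # assumption: run the diagnostic as a second process
python3 your_python_script.py   # your main workload
wait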
Responding to Signals with trap
trap <function/command> <signal>: When included in your SLURM script, this allows you to trigger functions or commands when a signal is broadcast by the kernel.
For example, you can use SIGTERM to clean up after a job ends (removing temporary files, moving logs to another directory, etc.), to requeue your job, or to checkpoint. If you are running a job that has a chance of going over the time limit, we recommend you write your own bash function that performs cleanup and checkpointing when it receives SIGTERM.
SIGTERM
SIGTERM, or TERM, flags a process for termination. It is broadcast 30 seconds before the process dies due to reaching the time limit or being manually terminated with scancel.
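For example, a cleanup handler might look like the sketch below (the directory variables are hypothetical, and the work must fit inside the 30-second window):

cleanup() {
    rm -rf "$TMP_DIR"        # remove temporary files (hypothetical variable)
    mv ./*.log "$LOG_DIR"/   # move logs to another directory (hypothetical variable)
}
trap cleanup TERM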
Linux Signals
Other Linux signals exist and can be trapped as per the need of your project.
You can broadcast these signals at a specified time before your job reaches the time limit using the --signal argument. It supports some options beyond the scope of this guide, but more information is available in the official argument list for sbatch.
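For example, to ask SLURM to send SIGTERM 60 seconds before the time limit (the 60 is illustrative):

#SBATCH --signal=TERM@60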
Monitoring Live Output
To view real-time output from a file:
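For example, with tail (the output filename is illustrative):

tail -f myjob.out   # streams new lines as they are written to the file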
Monitoring Job Runtime
You can re-run a status check at a fixed interval with watch, where the interval is the time between updates in seconds. If -n is omitted, the default is every 2 seconds.
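A minimal sketch wrapping squeue in watch (the 5-second interval is illustrative):

watch -n 5 squeue --me   # refresh your job list every 5 seconds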
Premade Bash Functions
Auto Requeuing
Auto-requeueing requires a few extra SLURM arguments, which are included below.
#SBATCH --requeue # Marks job as eligible for requeueing
#SBATCH --signal=TERM@3 # Send a signal (in this case TERM, shorthand for SIGTERM) 3 seconds before the time limit is reached. This sends SIGTERM earlier than it would otherwise arrive, extending the time period between SIGTERM and the final SIGKILL
trap 'scontrol requeue $SLURM_JOB_ID; exit 0' TERM
You must include wait at the end of your script for this to work: run your workload in the background so the shell is sitting in wait and can catch the signal immediately.
You can also include #SBATCH --open-mode=append in your submission script to append to your output files instead of overwriting them when requeueing occurs.
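Putting the pieces together, a minimal sketch of an auto-requeueing script (the Python command stands in for your workload):

#!/bin/bash
#SBATCH --requeue
#SBATCH --signal=TERM@3
#SBATCH --open-mode=append

trap 'scontrol requeue $SLURM_JOB_ID; exit 0' TERM

python3 your_python_script.py &   # run the workload in the background...
wait                              # ...so the shell can catch TERM and requeue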
Monitoring GPU Utilization
monitor_gpus() {
    # Write the CSV header along with an initial sample
    nvidia-smi --query-gpu=timestamp,uuid,name,utilization.gpu,memory.used,memory.total --format=csv > "gpu_log${SLURM_JOB_ID}.csv"
    # Keep sampling while the monitored process (PID passed as $1) is alive
    while kill -0 "$1" 2>/dev/null; do
        nvidia-smi --query-gpu=timestamp,uuid,name,utilization.gpu,memory.used,memory.total --format=csv,noheader >> "gpu_log${SLURM_JOB_ID}.csv"
        sleep "$2"   # $2 is the sampling interval in seconds
    done
}
python3 your_python_script.py & # run the Python script in the background
PID=$! # capture the process ID of the most recently backgrounded process
monitor_gpus "$PID" 2 # pass the PID and a sampling interval (in seconds; 2 here) to monitor_gpus
wait # wait until all background processes finish
Filename Patterns
SLURM allows you to include patterns in your output filenames. The following is not an exhaustive list, as some of the patterns refer to concepts not covered in this guide (such as running jobs in an array). If you are interested in the full list, please consult the official documentation for sbatch.
- %%: include the '%' character in the filename.
- %j: the job's ID.
- ${SLURM_JOB_ID}: same as %j, but for bash functions and lines other than the arguments.
- %n: the node ID the job is running on. If using multiple nodes, this will result in multiple output files.
- %r: restart count of the running job. May be useful for debugging purposes.
- %S: SLUID of the running job. Unlike the job ID, this will change on requeuing.
- %t: task ID relative to the current job. This will create a separate output file per process. Note that this ID may change per run.
- %u: username of the person who queued the job.
- %x: the job name.
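For example, to name output files after the job name and id (producing something like training_run_123456.out):

#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err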