SLURM Guide
Quick Disclaimers
1. This is a non-exhaustive guide
This guide lists the commands and arguments you are most likely to need; odds are, you won't need anything else. If you do need information on arguments or commands not listed here, consult the official SLURM documentation.
2. Resource caps
You can only request up to a total of 12 GPUs. Please note that, unless otherwise specified, GPU requests are per node: requesting 2 nodes and 2 A100s will result in a total of 4 A100s.
3. The Turing Cluster is a shared resource
It is important to remember that the Turing Cluster is shared by its users, so be mindful of the resources you request. With that said, we would rather you request more resources and finish your job quickly than request fewer resources that are held up for a longer period of time. Keep in mind, however, that the more resources you request, and the more powerful those resources are, the longer your wait time will be. Experiment to find a request size that satisfies all of your requirements.
Terminal Commands
The following commands are custom-built for your use on the Turing Cluster. Please note that these will not work on other HPC clusters:
- idle_gpus: lists all GPUs, the total number of each, and how many are free
- scluster: lists available nodes, features, and memory per node on the cluster; memory is reported in MB
- slurm_viewer: brings up a terminal GUI with filterable information on your job history
The following commands are provided by SLURM by default. As such, additional information can be found on the official SLURM documentation:
- sbatch <submission file>: queues a submission script (.sh) to be executed with your specified resources.
- squeue --me: view all active jobs you queued, showing job id, partition, runtime, and the node each is running on (or whether it is still queued).
- scancel <job id>: cancel a job using its id.
- seff <job id>: reports CPU usage for a specified job, its runtime, and memory usage in MB.
- sacct: reports recent jobs you ran, their status, and some resource information.
- scontrol show config: reports the SLURM and cluster config. Information here might be essential for projects that are sensitive to hardware/kernel configurations.
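For example, a typical submit-and-monitor sequence looks like the sketch below (the script name and job id are hypothetical):

sbatch train.sh      # queue the job; SLURM prints the assigned job id
squeue --me          # check whether it is running or still queued
seff 123456          # after it finishes, check how efficiently it used its resources
scancel 123456       # or cancel it early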
scontrol
scontrol is capable of doing much more than just reporting the cluster's config, but most of its uses are for administrators. It is mentioned here because projects that are sensitive to signals, kernel configurations, and specific hardware information will want to view what limitations the cluster imposes.
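For example, to check a single setting rather than scanning the full config (KillWait, the delay SLURM allows between SIGTERM and SIGKILL at the time limit, is one you may care about later in this guide):

scontrol show config | grep -i killwait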
Submission File Arguments
The following arguments are a large subset of those SLURM provides for sbatch. While they are likely the most common ones you will need, your project may require more specific constraints. If you find yourself asking "If only there was a way to do this...", check the official argument list for sbatch. Note that some arguments are only executable by administrators, in which case this is specified in the argument's description.
Essential Arguments
- -N <count>: number of compute nodes requested (usually each node has a maximum of 8 GPUs)
- --mem=<n><unit>: specify the maximum RAM (n) your program will be allocated. When the unit is omitted, the default is MB. Valid unit specifications are k, g, t (KB, GB, TB).
- -o <filename>.out: the name of the output file; contains all print statements
- -e <filename>.err: the name of the error file; contains all debug, warning, and error statements
- -p <partition>: partition to submit your job to; one of academic, short, or long
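Put together, a minimal submission script using only the essential arguments might look like the sketch below (the filenames, memory size, and partition are illustrative):

#!/bin/bash
#SBATCH -N 1         # one compute node
#SBATCH --mem=16g    # 16 GB of RAM
#SBATCH -o myjob.out # output file
#SBATCH -e myjob.err # error file
#SBATCH -p short     # partition

python3 your_python_script.py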
Requesting CPU Cores
While technically not essential, please note that your program will most likely run into errors if you do not explicitly request CPUs or a number of processes.
- --cpus-per-task=<n>: request n CPU cores per process. This is the simplest and safest method of requesting CPUs. Incompatible with --cpus-per-gpu.
- --cpus-per-gpu=<n>: request n CPU cores per GPU. Best practice for computationally expensive AI/ML tasks. Incompatible with --cpus-per-task.
- --ntasks <n>: informs SLURM to launch a MAXIMUM of n processes at the start of your job. By default, this also allocates 1 CPU core per task. Without specifying --ntasks, the default behavior of your job is to allocate 1 process per compute node. --cpus-per-task changes this default behavior, but this argument is still required by some other arguments.
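For instance, a job that launches 4 processes with 2 cores each might request (the counts are illustrative):

#SBATCH --ntasks 4          # up to 4 processes
#SBATCH --cpus-per-task=2   # 2 CPU cores per process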
Optional Arguments
Requesting GPUs
There are a few ways to request GPUs in SLURM, so make sure to choose what works best for you. Please note that you can request a maximum of 12 GPUs total for your job. Each of the following arguments lets you optionally specify the name of the GPU, e.g. A100. If omitted, you will receive the specified count of GPUs, allocated with the following priority:
- GPUs that are immediately free
- of the free GPUs, allocate the least powerful
Please note that if you are on the academic partition, you only have access to A30 GPUs.
- --gres=<resource>:<type>:<count>: request a generic resource (currently only GPUs) of a specified type (A100, L40S, etc.). This will be applied to all nodes.
- --gpus=<type>:<count>: request GPUs of a specific type and count. This will be applied to all nodes.
- --gpus-per-node=<type>:<count>: request that each node has a specified count of GPUs. Multiple GPU types can be requested via a comma-separated list. Incompatible with --gpus-per-task.
- --gpus-per-task=<type>:<count>: request a count of GPUs per process. Incompatible with --gpus-per-node. Requires you to explicitly use --ntasks.
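For example, the following sketch requests 2 A100s on each of 2 nodes, 4 GPUs in total (the GPU type and counts are illustrative):

#SBATCH -N 2
#SBATCH --gpus-per-node=A100:2   # 2 A100s per node, 4 total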
Other Optional Arguments
- --constraint=<constraints>: request only nodes that have specific features (features can be viewed by running scluster). A more detailed list of specifications can be found in the official argument list for sbatch.
- -J "<job name>": assign a specific name to your job. If omitted, the job name will be the same as your submission script's.
- -t D-HH:MM:SS: time limit for your job. We recommend you set one to avoid early timeouts. While SLURM accepts multiple formats for this argument, we recommend this one as it is the least ambiguous.
- --begin=<time>: notify SLURM to queue the job but not run it until a specified time. The list of acceptable time formats is extensive and can be found in the official argument list for sbatch.
- --chdir=<directory>: set the working directory of the submission script to another directory before execution.
- --dependency=<dependencies>: prevent the job from running UNTIL the specified jobs have finished. The formatting for this argument is complex, but it is explained in the official argument list for sbatch.
- --requeue: flag the job as eligible for requeueing. Note that this does not automatically requeue your job for all incomplete runs.
Utility Bash Functions
These functions/commands are completely optional, but may help you with various debugging and logging tasks.
Debugging GPU Issues
gpu_debug: tool to identify possible issues you are having with your job (may not be perfectly accurate)
- Makes broad assumptions about your submission, but is a great first-step tool to figure out what is going wrong if you are receiving GPU errors.
- Should be included inside of your SLURM script as an additional process. Note that this is another custom-built terminal command that will only work on the Turing Cluster.
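A sketch of one way to include it, assuming it can simply be launched as an additional background process alongside your workload (check its actual invocation on the cluster):

gpu_debug &                     # assumption: run the diagnostic as a second process
python3 your_python_script.py   # your main workload
wait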
Responding to Signals with trap
trap <function/command> <signal>: When included in your SLURM script, this allows you to trigger functions or commands when a signal is broadcast by the kernel.
For example, you can use SIGTERM to clean up after a job ends (removing temporary files, moving logs to another directory, etc.), to requeue your job, or to checkpoint. If you are running a job that has a chance of going over the time limit, we recommend you write your own bash function that performs cleanup and checkpointing when it receives SIGTERM.
SIGTERM
SIGTERM, or TERM, flags a process for termination. It is broadcast 30 seconds before the process dies due to reaching the time limit or being manually terminated with scancel.
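For example, a cleanup handler might look like the sketch below (the directory variables are hypothetical, and the work must fit inside the 30-second window):

cleanup() {
    rm -rf "$TMP_DIR"        # remove temporary files (hypothetical variable)
    mv ./*.log "$LOG_DIR"/   # move logs to another directory (hypothetical variable)
}
trap cleanup TERM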
Linux Signals
Other Linux signals exist and can be trapped as per the need of your project.
You can broadcast these signals at a specified time before your job reaches the time limit using the --signal argument. It supports some options beyond the scope of this guide, but more information is available in the official argument list for sbatch.
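For example, to ask SLURM to send SIGTERM 60 seconds before the time limit (the 60 is illustrative):

#SBATCH --signal=TERM@60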
Monitoring Live Output
To view real-time output from a file:
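For example, with tail (the output filename is illustrative):

tail -f myjob.out   # streams new lines as they are written to the file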
Monitoring Job Runtime
You can re-run a status check at a fixed interval with watch, where the interval is the time between updates in seconds. If -n is omitted, the default is every 2 seconds.
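A minimal sketch wrapping squeue in watch (the 5-second interval is illustrative):

watch -n 5 squeue --me   # refresh your job list every 5 seconds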
Premade Bash Functions
Auto Requeuing
Auto-requeueing requires a few extra SLURM arguments, which are included below.
#SBATCH --requeue # Marks job as eligible for requeueing
#SBATCH --signal=TERM@3 # Send a signal (in this case TERM, shorthand for SIGTERM) 3 seconds before the time limit is reached. This sends SIGTERM earlier than it would otherwise arrive, extending the time period between SIGTERM and the final SIGKILL
trap 'scontrol requeue $SLURM_JOB_ID; exit 0' TERM
You must include wait at the end of your script for this to work: run your workload in the background so the shell is sitting in wait and can catch the signal immediately.
You can also include #SBATCH --open-mode=append in your submission script to append to your output files instead of overwriting them when requeueing occurs.
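Putting the pieces together, a minimal sketch of an auto-requeueing script (the Python command stands in for your workload):

#!/bin/bash
#SBATCH --requeue
#SBATCH --signal=TERM@3
#SBATCH --open-mode=append

trap 'scontrol requeue $SLURM_JOB_ID; exit 0' TERM

python3 your_python_script.py &   # run the workload in the background...
wait                              # ...so the shell can catch TERM and requeue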
Monitoring GPU Utilization
monitor_gpus() {
    # Write the CSV header along with an initial sample
    nvidia-smi --query-gpu=timestamp,uuid,name,utilization.gpu,memory.used,memory.total --format=csv > "gpu_log${SLURM_JOB_ID}.csv"
    # Keep sampling while the monitored process (PID passed as $1) is alive
    while kill -0 "$1" 2>/dev/null; do
        nvidia-smi --query-gpu=timestamp,uuid,name,utilization.gpu,memory.used,memory.total --format=csv,noheader >> "gpu_log${SLURM_JOB_ID}.csv"
        sleep "$2"   # $2 is the sampling interval in seconds
    done
}
python3 your_python_script.py & # run the Python script in the background
PID=$! # capture the process ID of the most recently backgrounded process
monitor_gpus "$PID" 2 # pass the PID and a sampling interval (in seconds; 2 here) to monitor_gpus
wait # wait until all background processes finish
Filename Patterns
SLURM allows you to include patterns in your output filenames. The following is not an exhaustive list, as some of the patterns refer to concepts not covered in this guide (such as running jobs in an array). If you are interested in the full list, please consult the official documentation for sbatch.
- %%: include the '%' character in the filename.
- %j: the job's ID.
- ${SLURM_JOB_ID}: same as %j, but for bash functions and lines other than the arguments.
- %n: the node ID the job is running on. If using multiple nodes, this will result in multiple output files.
- %r: restart count of the running job. May be useful for debugging purposes.
- %S: SLUID of the running job. Unlike the job ID, this will change on requeuing.
- %t: task ID relative to the current job. This will create a separate output file per process. Note that this ID may change per run.
- %u: username of the person who queued the job.
- %x: the job name.
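For example, to name output files after the job name and id (producing something like training_run_123456.out):

#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err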