FAQ

How do I run my jobs?

All jobs must be submitted to run on the compute nodes using Slurm. There are many different ways to interact with Slurm:

  • Use sbatch to submit a job script (e.g. sbatch jobscript.sh)
  • Start an interactive shell session on a compute node with the sinteractive command
  • Obtain a job allocation using salloc and then submit jobs to these resources using srun
  • Construct a Slurm job array

The most important step in all of these cases is your resource request for compute, memory, and other hardware resources (e.g. GPUs).
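
As a minimal sketch of such a resource request, the job script below asks for one task with 4 CPU cores, 8 GB of memory, and a one-hour time limit. The job name and all of the values are placeholders chosen for illustration, not recommendations for any particular workload.

```shell
#!/bin/bash
#SBATCH --job-name=example      # hypothetical job name
#SBATCH --nodes=1               # run on a single node
#SBATCH --ntasks=1              # one task (process)
#SBATCH --cpus-per-task=4       # CPU cores for that task (placeholder value)
#SBATCH --mem=8G                # memory per node (placeholder value)
#SBATCH --time=01:00:00         # wall-time limit, HH:MM:SS (placeholder value)

# Slurm sets SLURM_CPUS_PER_TASK inside a job; fall back to 1 elsewhere
# so the script can also be tried outside the scheduler.
ncpus="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(hostname) with ${ncpus} CPU core(s)"
```

Saved as jobscript.sh, this would be submitted with sbatch jobscript.sh.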

Why isn’t my job running?

There are many reasons why your job does not start running immediately. The most common reasons are:

  • The cluster is busy, and the resources you requested are not available
  • You requested a very specific or limited resource (e.g. a specific GPU) that is only available on a few nodes
  • You requested a combination of resources that together are very specific or limited (e.g. many CPUs, a large amount of memory, and also GPUs)

You can use the scontrol show job JOBID command, where JOBID is your pending job ID number, to get additional information, including:

  • What node your job is scheduled to run on
  • When your job is expected to start
  • What resources you requested, in the event that you made a mistake in your job submission
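
The scontrol output is long, so it can help to filter it down to the fields listed above. The helper below is a hypothetical convenience function (summarize_job is not a Slurm command) that keeps only the job state, pending reason, start time, node lists, and resource fields. It assumes the key=value layout that scontrol show job normally prints, and the sample input is a made-up excerpt.

```shell
# summarize_job: hypothetical helper. Reads "scontrol show job" output on
# stdin and prints one matching key=value pair per line.
summarize_job() {
  grep -oE '(JobState|Reason|StartTime|[A-Za-z]*NodeList|[A-Za-z]*TRES)=[^ ]+'
}

# Demonstration with a made-up excerpt. On the cluster you would run:
#   scontrol show job JOBID | summarize_job
printf '%s\n' \
  'JobId=12345 JobName=example' \
  '   JobState=PENDING Reason=Resources Dependency=(null)' \
  '   StartTime=Unknown EndTime=Unknown Deadline=N/A' | summarize_job
```

A Reason of Resources or Priority usually just means the job is waiting its turn; an unexpected value in the resource fields is a hint that the submission itself contains a mistake.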

The amount of resources varies across the different compute nodes, including the type of CPU, number of CPU cores, amount of memory, and availability of GPUs. If you want to decrease the amount of wait time for your job submission, make sure you are requesting the minimum amount of resources (CPU/RAM/GPU) required.

User X has 20 jobs running, while I only submitted one job and it is pending. Why does user X get to run 20 jobs while my job waits?

After reviewing the answers to the previous question, the following are some additional reasons why your job is pending while other jobs are running:

  • User X is running 20 small CPU jobs, which are easy to schedule in between the "gaps" of other running jobs
  • Your job requested a large number of CPUs or a large amount of RAM, making it difficult to schedule immediately if the cluster is busy (which it usually is)

There are many reasons why your job may not start immediately, and these factors are constantly changing. The best strategy for moving your jobs through the queue as quickly as possible is to use the short partition, request the minimum resources your job requires, checkpoint frequently, and use multiple job submissions.
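
One way to combine these suggestions is a job array of small, short jobs. The sketch below is illustrative only: the partition name short is taken from the text above, while the array range, time limit, memory value, and input file naming are assumptions that you would replace with your own.

```shell
#!/bin/bash
#SBATCH --partition=short       # the short partition mentioned above
#SBATCH --array=0-9             # ten independent array tasks (placeholder range)
#SBATCH --cpus-per-task=1       # keep each task small so it fits into gaps
#SBATCH --mem=2G                # placeholder value
#SBATCH --time=00:30:00         # placeholder value

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 so the
# script can also be tried outside the scheduler.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
input="input_${task_id}.dat"    # hypothetical input file naming scheme
echo "Array task ${task_id} would process ${input}"
```

Each array task is scheduled independently, so some tasks can start while others are still pending.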

Why does my GPU code say there are no CUDA devices available?

The most likely cause is that you did not request a GPU in your resource request. Make sure to include --gres=gpu:X in your Slurm script, where X is the number of GPUs you need per node. For example, add the line:

#SBATCH --gres=gpu:2
to indicate that your job will use 2 GPUs per node.

Additionally, to make use of the CUDA drivers, your Slurm script must include the line

module load cuda
written after your #SBATCH lines to load the proper drivers.
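
Put together, a GPU job script might look like the sketch below. The CPU and time values are placeholders, and the final checks are optional diagnostics. The statement that Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs is an assumption that depends on how the cluster's GPU resources are configured.

```shell
#!/bin/bash
#SBATCH --gres=gpu:2            # request 2 GPUs per node
#SBATCH --cpus-per-task=4       # placeholder value
#SBATCH --time=01:00:00         # placeholder value

# Load CUDA after the #SBATCH lines. The guard lets the script run even
# where the module command is unavailable.
if command -v module >/dev/null 2>&1; then
  module load cuda || echo "could not load the cuda module" >&2
fi

# Assumption: Slurm exports CUDA_VISIBLE_DEVICES when GPUs are allocated.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Visible GPUs: ${gpus}"

# List the GPUs the job can see, if the NVIDIA tool is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --list-gpus || true
fi
```

If the script reports no visible GPUs inside a job, the --gres line is the first thing to check.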

My job runs more slowly on the cluster than it does on my laptop/workstation/server. Why is the cluster so slow?

From a purely hardware perspective, this is highly unlikely. If your job is running more slowly than you expect, the most common reasons are:

  • Incorrect/inadequate resource request (you only requested 1 CPU core, or the default amount of RAM)
  • Your code is doing something that you don’t know about

A combination of these two factors is almost always the reason for jobs running more slowly on the cluster than on a different resource.

One example is MATLAB code running on the cluster versus a modern laptop. MATLAB implicitly performs some operations in parallel (e.g. matrix operations), spawning multiple threads to do so. If your resource request on the cluster only includes 1 or 2 CPU cores, then no matter how many threads MATLAB spawns, they will all be confined to the cores you were assigned.

This can lead to a situation where a user’s MATLAB code uses all 4 CPU cores on their laptop, but only 1 CPU core on the cluster, which gives the impression that the cluster is slower than the laptop. If you request 4 or 8 CPU cores in your resource request, MATLAB can run these operations in parallel.
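
A quick way to see how many cores a job actually has is to print them from inside the job script. The sketch below requests 4 cores for a single task and then reports what was granted. The values are placeholders, and the claim that nproc reflects the job's core limit assumes the cluster confines jobs to their allocated cores.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4       # cores available to a multithreaded program
#SBATCH --mem=8G                # placeholder value

# Slurm sets SLURM_CPUS_PER_TASK inside a job; default to 1 elsewhere.
allocated="${SLURM_CPUS_PER_TASK:-1}"
echo "Cores requested for this task: ${allocated}"

# nproc reports the cores this process is allowed to use, which is usually
# the allocation rather than the total core count of the node.
if command -v nproc >/dev/null 2>&1; then
  echo "Cores visible to this shell: $(nproc)"
fi
```

If the reported number is 1, a threaded application has only a single core to share among all of its threads.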

Tip

It is important to understand what the code or application you are using is actually doing when you run your job.

I went from one node to multiple nodes for my job, and it runs more slowly.

There are two major reasons why a job can run more slowly when it is run on multiple nodes compared to one.

  • The job tasks are tightly coupled, and the overhead of communicating over the network exceeds the benefit from using additional hardware. In this case, the best option might be to use a larger job on a single node. Alternatively, you can investigate exactly where the performance bottleneck lies, and potentially improve that.
  • Your job is not actually running on multiple nodes, and Slurm allocated the CPUs unevenly. For example, a 64-core job on two nodes (-N 2 -n 64) could be allocated as 1 CPU on one node and 63 on the other. If your job was running solely on the node where 1 CPU was allocated, it would be significantly slower. The solution is twofold. First, use --ntasks-per-node rather than --ntasks to ensure that tasks are evenly distributed across the nodes. Second, make sure that your code uses a mechanism to run across all allocated nodes, rather than leaving them idle.
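
The even allocation described in the second point can be sketched as follows, using the same 64-task, two-node example. The time limit is a placeholder, and the hostname count at the end is just one simple way to confirm where tasks were placed.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32    # 32 tasks on each node, 64 in total
#SBATCH --time=01:00:00         # placeholder value

# Slurm sets these variables inside a job; the defaults mirror the request
# above so the script can also be tried outside the scheduler.
nodes="${SLURM_JOB_NUM_NODES:-2}"
per_node="${SLURM_NTASKS_PER_NODE:-32}"
total=$((nodes * per_node))
echo "Expecting ${total} tasks: ${per_node} on each of ${nodes} nodes"

# Inside a job, launch one hostname per task and count tasks per node.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
  srun hostname | sort | uniq -c
fi
```

If the count shows all tasks on a single host, the application is not being launched across the allocation.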

I accidentally deleted some files or folders and need to recover them! How can I recover my files?

Fortunately, Turing automatically takes daily and hourly snapshots of your directories in order to prevent loss of important data. You can recover your files by following these steps after signing in to Turing:

  1. Navigate to the .snapshot folder by using the following command: cd .snapshot
  2. Access the backup folder that would most likely still contain your accidentally deleted files using one of the following commands: cd hourly_ or cd daily_ followed by the date of the backup (e.g. hourly_2024-03-14_15_00_00_UTC)
  3. You are now in a snapshot of your home directory and can copy your files back into your current home directory using the following command: cp <name_of_files_or_folders> ~/<name_of_destination_folder>. For example, to copy the file test.sh from the current snapshot folder to the scratch folder in your home directory, use: cp test.sh ~/scratch
  4. Return to your Turing home directory using cd ~ to access your files again!
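
The copy step above can also be wrapped in a small function that checks the file exists in the snapshot before copying. This is only a sketch: restore_from_snapshot is a hypothetical helper name, not a command provided on Turing, and the snapshot folder in the usage comment is the example name from step 2.

```shell
# restore_from_snapshot: hypothetical helper.
# Arguments: snapshot directory, file or folder name, destination directory.
restore_from_snapshot() {
  local snapshot_dir="$1" name="$2" dest="$3"
  if [ ! -e "${snapshot_dir}/${name}" ]; then
    echo "not found in snapshot: ${snapshot_dir}/${name}" >&2
    return 1
  fi
  # Create the destination if needed, then copy recursively and preserve
  # timestamps and permissions.
  mkdir -p "$dest" && cp -Rp "${snapshot_dir}/${name}" "${dest}/"
}

# Usage on the cluster, with the example snapshot name from step 2:
#   restore_from_snapshot ~/.snapshot/hourly_2024-03-14_15_00_00_UTC test.sh ~/scratch
```

Copying, rather than moving, leaves the snapshot untouched, so the same file can be restored again if needed.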

How do I check on the dynamic resource use of a running job?

You can use the command srun --nodelist=$(squeue -j <job_id> -h -o %N) --pty /bin/bash to open a shell on a node that is currently running a job, which is useful for resource monitoring.
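
As a complementary sketch, Slurm's sstat command can report the resource use of a running job without opening a shell on the node. The helper below is hypothetical (monitor_job is not a Slurm command), and the sstat fields shown are just one reasonable selection.

```shell
# monitor_job: hypothetical helper. Argument: the ID of one of your running jobs.
monitor_job() {
  local jobid="$1"
  if ! command -v sstat >/dev/null 2>&1; then
    echo "Slurm commands not found; run this on the cluster" >&2
    return 1
  fi
  # Node(s) the job is running on, same squeue format as the command above.
  squeue -j "$jobid" -h -o '%N'
  # Average CPU time and peak memory for every step of the job.
  sstat --allsteps -j "$jobid" --format=JobID,AveCPU,MaxRSS
}

# Usage on the cluster:
#   monitor_job <job_id>
```

Comparing MaxRSS with the memory you requested is a quick way to see whether the request can be reduced.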

When I SSH to Turing, I receive an "Operation timed out" error and cannot connect (macOS)

If you are using a Mac and are having trouble connecting to Turing, it may be due to an update to the macOS Wi-Fi settings. Please check if your computer has these Wi-Fi settings enabled, and disable them using the following instructions.