Slurm Commands

Here is a quick reminder on Slurm and how to use it.

One-Line Commands

View the Running Jobs

$ squeue
   JOBID  PARTITION             NAME     USER   ST       TIME     NODES    NODELIST(REASON)
   26529      batch       train_BERT      joe    R    1:24:52         1              cnode1
   26530      batch     train_ResNet    maria    R    2:12:56         1              cnode2
   26531      batch      update_plex    maria    R    2:14:20         1              cnode3
   26532      batch    download_data    janet    R    2:16:17         1              cnode7
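
A couple of variants that are often handy for narrowing the output:

$ squeue -u $USER     # show only your own jobs
$ squeue -p gpu       # show only jobs in the gpu partition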

View Nodes and Partitions

$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*          up 8-00:00:00      9    mix cnode[1-9]
batch*          up 8-00:00:00      3   idle cnode[10-12]
gpu             up 8-00:00:00      9    mix cnode[1-9]
gpu             up 8-00:00:00      3   idle cnode[10-12]
rtx             up 8-00:00:00      4   idle rnode[1-4]
reservations    up   infinite      1  down* nvidia-dgx1
reservations    up   infinite     10    mix beauty,cnode[1-9]
reservations    up   infinite      7   idle cnode[10-12],rnode[1-4]
overflow        up   infinite      1  down* nvidia-dgx1
overflow        up   infinite     10    mix beauty,cnode[1-9]
overflow        up   infinite      7   idle cnode[10-12],rnode[1-4]
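
sinfo likewise accepts a partition filter, and a node-oriented view when you want per-node details:

$ sinfo -p gpu        # only the gpu partition
$ sinfo -N -l         # one line per node, long format with CPU/memory details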

Submitting a Job

You can either submit a batch job with sbatch, or open an interactive shell and run your job from there.

Running a Job with sbatch

First, create the shell script that will be submitted:

#!/bin/bash
# run.sh
#SBATCH --job-name  {job_name}         # E.g. train_model
#SBATCH --nodelist  {node_name}        # E.g. cnode1
#SBATCH --partition {partition_name}   # E.g. gpu
#SBATCH --gpus      {number_of_gpus}   # E.g. 1
#SBATCH --time      {max_time}         # E.g. 08-00
#SBATCH --ntasks    {number_of_cpus}   # E.g. 16
#SBATCH --mem       {ram_memory}       # E.g. 32G

# Enable the SCL Python 3.6 in this shell, then activate your virtual environment
source scl_source enable rh-python36
source /home/user/envs/myenv/bin/activate

# Copy data into your /tmp folder (read the note below); -r in case the dataset is a directory
mkdir -p /tmp/user
cp -r /home/user/data/my_dataset /tmp/user/my_dataset

# Copy your script
cp my_experiment.py /tmp/user

# Change into that directory
cd /tmp/user

# Run your script
python my_experiment.py
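
For concreteness, here is a sketch of the header filled in with the example values from the comments above (the job name, node, and resource amounts are illustrative):

#!/bin/bash
#SBATCH --job-name  train_model
#SBATCH --nodelist  cnode1
#SBATCH --partition gpu
#SBATCH --gpus      1
#SBATCH --time      08-00
#SBATCH --ntasks    16
#SBATCH --mem       32G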

Sometimes a job is deployed on a different node from the one you submitted it from. For example, if you are on cnode0, the job may be deployed on cnode1; you can tell which node you are on from your terminal prompt, which shows user@cnode0. Importantly, /tmp is local to each node: cnode0's /tmp directory is not the same as cnode1's /tmp. Repeatedly fetching a dataset over the network from another machine is slow, so it is helpful to copy the dataset into the local /tmp directory first.
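
By the same logic, anything your script writes under /tmp stays on that compute node, so copy results back before the job ends. A minimal sketch, assuming your script writes its output to a hypothetical /tmp/user/results directory:

# At the end of run.sh: copy results back to the shared home directory
cp -r /tmp/user/results /home/user/results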

Finally, you can run the job with

sbatch run.sh
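
sbatch prints the ID of the submitted job, which you can then use with squeue and scancel (the job ID below is illustrative):

$ sbatch run.sh
Submitted batch job 26533
$ squeue -j 26533    # check the job's status
$ scancel 26533      # cancel the job if needed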

Below is the same script as the one above, but using the shortened flags

#!/bin/bash
# run.sh
#SBATCH -J       {job_name}         # E.g. train_model
#SBATCH -w       {node_name}        # E.g. cnode1
#SBATCH -p       {partition_name}   # E.g. gpu
#SBATCH -G       {number_of_gpus}   # E.g. 1
#SBATCH -t       {max_time}         # E.g. 08-00
#SBATCH -n       {number_of_cpus}   # E.g. 16
#SBATCH --mem    {ram_memory}       # E.g. 32G

# Enable the SCL Python 3.6 in this shell, then activate your virtual environment
source scl_source enable rh-python36
source /home/user/envs/myenv/bin/activate

# Copy data into your /tmp folder (read the note below); -r in case the dataset is a directory
mkdir -p /tmp/user
cp -r /home/user/data/my_dataset /tmp/user/my_dataset

# Copy your script
cp my_experiment.py /tmp/user

# Change into that directory
cd /tmp/user

# Run your script
python my_experiment.py
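
Note that options passed to sbatch on the command line override the #SBATCH directives in the script, so you can reuse one script with different resources. For example:

$ sbatch -p gpu -G 1 --mem 64G run.sh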

Run a Job with srun

This is how you run a job interactively. Much of the above is repeated, only from an interactive shell.

First, you need to get into an interactive shell. The --pty flag tells srun to execute the task in pseudo-terminal mode:

user@cnode0:~$ srun -J jobname -p gpu -G 1 -w cnode7 -n 10 --mem 16G --pty bash
# or srun -J jobname -p batch -G 1 -w cnode7 -n 10 --mem 16G --pty bash
user@cnode7:~$
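
Since the example requests a GPU with -G 1, you can confirm one was actually allocated before starting (nvidia-smi is assumed to be available on the GPU nodes):

user@cnode7:~$ nvidia-smi    # should list the GPU(s) assigned to this job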

Then you can run your jobs as you normally would. E.g.

user@cnode7:~$ mkdir -p /tmp/user
user@cnode7:~$ cp -r /home/user/data/my_dataset /tmp/user
user@cnode7:~$ source /home/user/envs/myenv/bin/activate
user@cnode7:~$ cp my_experiment.py /tmp/user
user@cnode7:~$ cd /tmp/user
user@cnode7:~$ python my_experiment.py
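
When you are finished, exit the shell to end the job and release the allocated resources:

user@cnode7:~$ exit
user@cnode0:~$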

Additional Resources

Below is an example script written by Jim Kinney and Robert Tweedy:

#!/bin/bash

# This is an example SBATCH script "slurm_example_script.sh"
# For all available options, see the 'sbatch' manpage.
#
# Note that all SBATCH commands must start with a #SBATCH directive;
# to comment out one of these you must add another # at the beginning of the line.
# All #SBATCH directives must be declared before any other commands appear in the script.
#
# Once you understand how to use this file, you can remove these comments to make it
# easier to read/edit/work with/etc. :-)

### (Recommended)
### Name the project in the batch job queue
#SBATCH -J ExampleName

### (Optional)
### If you'd like to give a bit more information about your job, you can
### use the command below.
#SBATCH --comment='A comment/brief descriptive name of your job goes here.'

### (REQUIRED)
### Select the queue (also called "partition") to use. The available partitions for your
### use are visible using the 'sinfo' command.
### You must specify 'gpu' or another partition to have access to the system GPUs.
#SBATCH -p batch

### (REQUIRED for GPU, otherwise do not specify)
### If you select a GPU queue, you must also use the command below to select the number of GPUs
### to use. Note that you're limited to 1 GPU per job as a maximum on the basic GPU queue.
### If you need to use more than 1, contact [email protected] to schedule a multi-gpu test for
### access to the multi-gpu queue.
###
### If you need a specific type of GPU, you can prefix the number with the GPU's type like
### so: "#SBATCH -G turing:1". The available types of GPUs as of 04/16/2020 are:
### turing (12 total)
### titan (only 1; requesting this GPU may result in a delay in your job starting)
### pascal (4 total; using this GPU requires that your code handle being pre-empted/stopped at any
###        time, as there are certain users with priority access to these GPUs).
### volta (8 total) - You must use the 'dgx-only' partition to select these GPUs.
### rtx (4 total) - NVidia Quadro RTX 6000. You must use the 'rtx' or 'overflow' partitions to select these GPUs.
##SBATCH -G 1

### (REQUIRED) if you don't want your job to end after 8 hours!
### If you know your job needs to run for a long time or will finish up relatively
### quickly then set the command below to specify how long your job should take to run.
### This may allow it to start running sooner if the cluster is under heavy load.
### Your job will be held to the value you specify, which means that it will be ended
### if it should go over the limit. If you're unsure of how long your job will take to run, it's
### better to err on the longer side as jobs can always finish earlier, but they can't extend their
### requested time limit to run longer.
###
### The format can be "minutes", "hours:minutes:seconds", "days-hours", or "days-hours:minutes:seconds".
### By default, jobs will run for 8 hours if this isn't specified.
#SBATCH -t 8:0:0


### (optional) Output and error file definitions. To have all output in a file named
### "slurm-<jobID>.out" just remove the two SBATCH commands below. Specifying the -e parameter
### will split the stdout and stderr output into different files.
### The %A is replaced with the job's ID.
#SBATCH -o file-%A.out
#SBATCH -e file-%A.err

### You can specify the number of nodes, number of cpus/threads, and amount of memory per node
### you need for your job. We recommend specifying only memory unless you know you need a
### specific number of nodes/threads, as you will be automatically allocated a reasonable
### amount of threads based on the memory amount requested.

### (REQUIRED)
### Request 4 GB of RAM - You should always specify some value for this option, otherwise
###                       your job's available memory will be limited to a default value
###                       which may not be high enough for your code to run successfully.
###                       This value is for the amount of RAM per computational node.
#SBATCH --mem 4G

### (optional)
### Request 4 cpus/threads - Specify a value for this function if you know your code uses
###                          multiple CPU threads when running and need to override the
###                          automatic allocation of threads based on your memory request
###                          above. Note that this value is for the TOTAL number of threads
###                          available to the job, NOT threads per computational node! Also note
###                          that Matlab is limited to using up to 15 threads per node due to
###                          licensing restrictions imposed by the Matlab software.
##SBATCH -n 4

### (optional)
### Request 2 cpus/threads per task - This differs from the "-n" parameter above in that it
###                                   specifies how many threads should be allocated per job
###                                   task. By default, this is 1 thread per task. Set this
###                                   parameter if you need to dedicate multiple threads to a
###                                   single program/application rather than running multiple
###                                   separate applications which require a single thread each.
##SBATCH -c 2

### (very optional; leave as '1' unless you know what you're doing)
### Request 1 node - Only specify a value other than 1 for this option when you know that
###                  your code will run across multiple systems concurrently. Otherwise
###                  you're just wasting resources that could be used by others.
#SBATCH -N 1

### (optional)
### This is to send you emails of job status
### See the manpage for sbatch for all the available options.
#SBATCH --mail-user=[email protected]
#SBATCH --mail-type=ALL

### Your actual commands to start your code go below this area. If you need to run anything in
### the SCL python environments that's more complex than a simple Python script (as in, if you
### have to do some other setup in the shell environment first for your code), then you should
### write a wrapper script that does all the necessary steps and then run it like in the below
### example:
###
### scl enable rh-python36 '/home/mynetid/my_wrapper_script.sh'
###
### Otherwise, you're probably not running everything you think you are in the SCL environment.
hostname
echo 'Hello world!' > test.txt
scl enable rh-python36 'python my_script.py' > my_script_output.txt
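
Once a job has been submitted, you can inspect it with scontrol and sacct (the job ID below is illustrative):

$ scontrol show job 26533                                      # full details of a pending/running job
$ sacct -j 26533 --format=JobID,JobName,State,Elapsed,MaxRSS   # accounting info, including finished jobs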