USER GUIDE - Slurm - A Batch Scheduler Introduction

Slurm is a combined batch scheduler, resource manager, and billing system that uses Slurm accounts to allow users with a login to the High Performance Computing clusters to run their jobs for a fee. For many researchers this fee is covered by the University of Michigan Research Computing Package. This document describes the process for submitting and running jobs under the Slurm Workload Manager on the current High Performance Computing clusters (Great Lakes, Armis2, and Lighthouse).

The batch scheduler and resource manager work together to run jobs on an HPC cluster.

  • WHEN: The batch scheduler, sometimes called a workload manager, is responsible for finding and allocating the resources that fulfill the job’s request at the soonest available time. A job will start sooner if the batch script requests only what the job actually needs (i.e., it does not ask for more resources than necessary).
  • HOW WELL: When a job is scheduled to run, the scheduler instructs the resource manager to launch the application(s) across the job’s allocated resources; this is also known as “running the job”. The job will run best if the batch script requests everything the job needs to complete (i.e., the job’s processes can finish within the resources requested).
  • HOW MUCH: When a job has completed, you can check the funds remaining in your account on the Research Management Portal (RMP).
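As a concrete illustration of a batch script that requests resources from the scheduler, the sketch below writes out a minimal script and checks it for shell syntax errors. The account name, partition, and job name are hypothetical placeholders, not values from this guide; the `#SBATCH` directives shown are standard Slurm options.

```shell
#!/bin/bash
# Write a minimal Slurm batch script to a file.
# NOTE: the account and partition names below are hypothetical placeholders.
cat > myjob.sbat <<'EOF'
#!/bin/bash
#SBATCH --job-name=example         # name shown in the queue
#SBATCH --account=example_account  # Slurm account charged for the job
#SBATCH --partition=standard       # partition (queue) to submit to
#SBATCH --nodes=1                  # request a single compute node
#SBATCH --ntasks-per-node=1        # one task on that node
#SBATCH --cpus-per-task=1          # one core for the task
#SBATCH --mem=1g                   # memory for the whole job
#SBATCH --time=00:10:00            # wall-clock limit (HH:MM:SS)

# The commands below run on the allocated compute node.
echo "Running on $(hostname)"
EOF

# Check the script for shell syntax errors (sbatch is not needed for this).
bash -n myjob.sbat && echo "syntax OK"
```

On a cluster where Slurm is installed, this script would be handed to the scheduler with `sbatch myjob.sbat`, and the queue could then be monitored with `squeue --user="$USER"`.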

Cluster Basics and Terminology

An HPC cluster is made up of a number of compute nodes, each with a complement of processors, memory, and, on some nodes, GPUs. The user submits jobs that specify the application(s) they want to run along with a description of the computing resources needed to run the application(s).

Cluster Schematic: The architecture of a High Performance Computing Cluster. 

account (relates to Slurm): A group of users, with a chargeable shortcode and optional limits, configured in Slurm to provide a way to run jobs for a fee on a cluster.

batch: Instructions run from a file without user interaction; typically referred to as running 'in batch'.

batch script: A text file containing a set of commands that can be run in batch by a shell or a job scheduler.

Core: A processing unit within a computer chip.

CPU: The chip in a node that performs computations.

GPU: A graphics processing unit (GPU) is a specialized processor designed to efficiently perform the calculations used to generate computer graphics; on HPC clusters, GPUs are also used to accelerate general-purpose computation.

job: A set of text commands in a file, executed in batch by the cluster scheduler.

login: The way you access the cluster, typically with your uniqname and Level-1 password.

node: A physical machine in a cluster, including login, compute, and transfer nodes

  • Login nodes: The login nodes are where users can log in, edit files, view job results, and submit new jobs. Login nodes are a shared resource and should not be used to run application workloads; resource limits are enforced on them.
  • Head node: The head node is the computer where users land when they log in to the cluster. This is where they edit scripts, compile code, and submit jobs to the scheduler.
  • Data transfer node: The data transfer node is available for moving data to and from the cluster.
  • Compute node: The compute nodes are the computers where jobs run. To run jobs on the compute nodes, users access a head node and schedule their program to run on the compute nodes once the requested resources are available.

Node Geometry: The physical and logical arrangement of the resources a job requests (such as the number of nodes, tasks per node, cores per task, and memory) across the available resources of a cluster. The geometry should be chosen for what the job is trying to accomplish, since it can influence the overall performance, efficiency, and scalability of computational tasks.
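As a sketch of how node geometry is expressed in practice, the script below writes out the `#SBATCH` directives that describe a job's layout and checks the file for syntax errors. The numbers and the application name are hypothetical examples, not recommendations from this guide; the directives themselves are standard Slurm options.

```shell
#!/bin/bash
# Write a batch script whose #SBATCH directives describe the job's node
# geometry; the values below are hypothetical examples, not recommendations.
cat > geometry.sbat <<'EOF'
#!/bin/bash
#SBATCH --nodes=2                  # spread the job across 2 compute nodes
#SBATCH --ntasks-per-node=4        # 4 tasks (processes) on each node
#SBATCH --cpus-per-task=2          # 2 cores per task (e.g., for threading)
#SBATCH --mem-per-cpu=2g           # memory scaled per allocated core
#SBATCH --time=01:00:00            # wall-clock limit (HH:MM:SS)

# 2 nodes x 4 tasks x 2 cores = 16 cores in total for this job.
srun ./my_mpi_program              # hypothetical application launch line
EOF

# Validate the shell syntax without needing Slurm installed.
bash -n geometry.sbat && echo "syntax OK"
```

A different geometry for the same total core count (for example, 1 node with 8 tasks) can perform very differently, since communication between tasks on the same node is faster than communication across nodes.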