General
=======

A central component of every HPC system is the resource manager (RM), sometimes also referred to as a batch system. There are quite a number of such systems out there, commercial, free and open source, or a mixture of both. They all try to solve a similar problem, but they are not compatible with each other. Some notable examples are:

+ PBS
+ Sun/Oracle Grid Engine
+ Torque/PBSpro
+ Condor
+ LSF
+ SLURM

We have been using a resource manager called Torque for many years now and it has worked quite well. Unfortunately, the open source part of the project is no longer well maintained, and the lack of proper GPU support led us to switch to SLURM. We will gradually switch from Torque to SLURM (2020), hence you will find documentation and example commands for both systems.

Queues/Partitions
-----------------

The RM usually manages a number of queues (called partitions in Slurm), each of which can hold thousands of jobs waiting to be executed. Once you have prepared a job (:ref:`torque_jobs`, :ref:`slurm/slurm_jobs`), you can place it in a queue. All jobs of all users go into the same central queue(s):

.. image:: ../img/queue.svg
   :width: 100%

The scheduler part of the RM uses configurable, priority-based algorithms to decide which job to pick from the queue and on which node to start it. For Torque specifically, the scheduler implements fair-share scheduling for every user over a window of 7 days. Another global objective of the scheduler (Torque or SLURM) is to maximize resource utilization while simultaneously ensuring that every job will eventually start. Assume that in the image above jobs on the right-hand side have been submitted earlier than those on the left-hand side. It is entirely possible that the next feasible job in the queue is not a green one but a blue or red one. The decision depends on the amount of resources a job requires and on the amount of resources the job's owner has used in the last 7 days.

Each queue has a different set of parameters and resource targets:

**Torque**

+ ``default``
+ ``longwall`` for jobs that need more than 36 hours
+ ``testing`` for very short jobs only

**Slurm**

+ ``short`` (default)
+ ``long`` for jobs that need more than 24 hours
+ ``test`` for short jobs up to 1 hour
+ ``gpu`` for jobs that need a GPU

---------

There are **four important resources** used for accounting, reservations and scheduling:

1. CPU cores
2. Amount of physical memory
3. Time
4. Generic resources (gres, usually a GPU)

A job is a piece of code that requires a combination of those resources to run correctly. Thus you can request each of those resources separately. For instance, a computational problem *might* consist of the following:

A
   10,000 single-threaded jobs, each running only a couple of minutes with a memory footprint of 100MB.

B
   20 jobs where each can use as many local processors as possible, requiring 10GB of memory each, with an unknown or varying running time.

C
   A single job that is able to utilize the whole cluster at once, using a network layer such as the Message Passing Interface (MPI).

All of the above requirements need to be expressed in a job description so the RM knows how many resources to acquire. This is especially important for large jobs when there are many other, smaller jobs in the queue that need to be actively held back so the larger jobs do not starve. The batch system constantly re-partitions all of the cluster resources to maintain efficiency and fairness.
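As an illustration of how such a job description looks in practice, the following Torque script is a minimal sketch of a single job along the lines of scenario *B* above. The queue, resource values, job name and the ``./my_program`` executable are assumptions made for this example and have to be adapted to your own workload.

.. code-block:: bash

   #!/bin/bash
   #PBS -N example-job            # hypothetical job name
   #PBS -q default                # target queue
   #PBS -l nodes=1:ppn=8          # 1 node with 8 cores
   #PBS -l mem=10gb               # 10 GB of physical memory
   #PBS -l walltime=24:00:00      # maximum running time

   cd "$PBS_O_WORKDIR"            # run in the directory qsub was called from
   ./my_program                   # hypothetical executable

Such a script is submitted with ``qsub``; ``qstat`` shows the state of your queued and running jobs, and ``qstat -q`` lists the available queues.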
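A Slurm job description for the same kind of request might look like the sketch below. Again, the partition, resource values, job name and the ``./my_program`` executable are illustrative assumptions only.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=example-job   # hypothetical job name
   #SBATCH --partition=short        # target partition
   #SBATCH --cpus-per-task=8        # CPU cores for this job
   #SBATCH --mem=10G                # physical memory
   #SBATCH --time=12:00:00          # maximum running time

   ./my_program                     # hypothetical executable

Such a script is submitted with ``sbatch``; ``squeue`` shows queued and running jobs, and ``sinfo`` lists the available partitions.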
.. important:: The need for GPU scheduling is the reason we are switching from Torque to SLURM. If you want to submit CUDA jobs, you **have** to use SLURM.
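For CUDA jobs, a Slurm script would additionally request a GPU through the generic resources (gres) mechanism and target the ``gpu`` partition. The sketch below assumes that a single GPU is sufficient and that ``./my_cuda_program`` stands in for your own executable; the remaining resource values are placeholders.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --partition=gpu        # partition for jobs that need a GPU
   #SBATCH --gres=gpu:1           # request one GPU (generic resource)
   #SBATCH --cpus-per-task=4      # CPU cores
   #SBATCH --mem=8G               # physical memory
   #SBATCH --time=12:00:00        # maximum running time

   ./my_cuda_program              # hypothetical CUDA executable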