General
=======

A central component of every HPC system is the resource manager (RM),
sometimes also referred to as a batch system. There are quite a number of systems
out there: commercial, free and open source, or a mixture of both. They all try
to solve a similar problem, but they are not compatible with each other. Some
notable examples are:

+ PBS
+ Sun/Oracle Grid Engine
+ Torque/PBSpro
+ Condor
+ LSF
+ SLURM

We have been using a resource manager called Torque for many years and it has
worked quite well. Unfortunately, the open source part of the project is no
longer maintained very well, and the lack of proper GPU support led us to
switch to SLURM. We will gradually migrate from Torque to SLURM (2020), hence
you will find documentation and example commands for both systems.

Queues/Partitions
-----------------

The RM usually manages a number of queues (called partitions in Slurm), which
can hold thousands of jobs waiting for execution. Once you have prepared a job
(:ref:`torque_jobs`, :ref:`slurm_jobs`) you can place it in a queue. All jobs
from all users go into the same central queue(s):

.. image:: ../img/queue.svg
   :width: 100%
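
For example, a prepared job script (here called ``job.sh``, a placeholder
name) is handed to the RM with the respective submit command and then appears
in the queue listing:

.. code-block:: bash

   # Torque
   qsub job.sh
   qstat -u $USER

   # Slurm
   sbatch job.sh
   squeue -u $USER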

The scheduler part of the RM uses different, configurable priority-based
algorithms to decide which job to pick out of that queue and start on some
node. For Torque specifically, the scheduler implements fair-share scheduling
for every user over a window of 7 days. Another global objective for the
scheduler (Torque or SLURM) is to maximize resource utilization while
simultaneously ensuring that every job will start eventually.

Assume that in the image above jobs on the right-hand side have been submitted
earlier than those on the left-hand side. It is entirely possible that the next
feasible job in that queue is not a green one but a blue or red one. The
decision depends on the amount of resources a job requires and the amount of
resources the corresponding job owner has used in the last 7 days.
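
On the Slurm side these factors can be inspected directly. A quick sketch
using standard Slurm tools (the exact output depends on the site
configuration):

.. code-block:: bash

   # priority factors (age, fair-share, ...) of pending jobs
   sprio -l

   # recorded fair-share usage for your associations
   sshare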

Each queue has different parameter sets and resource targets:

**Torque**

+ ``default``
+ ``longwall`` for jobs that need more than 36 hours
+ ``testing`` for very short jobs only

**Slurm**

+ ``short`` (default)
+ ``long`` for jobs that need more than 24 hours
+ ``test`` for short jobs of up to 1 hour
+ ``gpu`` for jobs that need a GPU
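
A queue/partition other than the default one is selected at submission time,
for example (``job.sh`` again being just a placeholder):

.. code-block:: bash

   # Torque: submit to the longwall queue
   qsub -q longwall job.sh

   # Slurm: submit to the long partition, or to gpu with one GPU
   sbatch -p long job.sh
   sbatch -p gpu --gres=gpu:1 job.sh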

Resources
---------

There are **four important resources** used for accounting, reservations and scheduling:

1. CPU cores
2. Amount of physical memory
3. Time
4. Generic resources (gres, usually a GPU)
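
In Slurm terms these translate roughly into the following request options
(Torque uses an equivalent ``-l`` resource list); the values below are purely
illustrative:

.. code-block:: bash

   # 1. cores   2. memory   3. time   4. generic resource (one GPU)
   sbatch --cpus-per-task=4 --mem=8G --time=02:00:00 --gres=gpu:1 job.sh

   # roughly the same request with Torque (which lacks proper GPU support)
   qsub -l nodes=1:ppn=4,mem=8gb,walltime=02:00:00 job.sh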

A job is a piece of code that requires a combination of those resources to run
correctly. Thus you can request each of those resources separately. For
instance, a computational problem *might* consist of the following:

A
   10,000 single-threaded jobs, each running only a couple of minutes with a
   memory footprint of 100 MB.

B
   20 jobs, each of which can use as many local processors as possible,
   requires 10 GB of memory, and has an unknown or varying running time.

C
   A single job that is able to utilize the whole cluster at once, using a
   network layer such as the Message Passing Interface (MPI).


All of the above requirements need to be represented in a job description so
that the RM knows how many resources to acquire. This is especially important
for large jobs when there are a lot of other, smaller jobs in the queue that
need to be actively held back so the larger jobs won't starve. The batch system
is constantly partitioning all of the cluster resources to maintain optimal
efficiency and fairness.
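
As a minimal sketch, scenario A could be expressed as a Slurm job array; the
script and input names are placeholders, and very large arrays may be capped
by the site configuration:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --partition=short
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=100M
   #SBATCH --time=00:10:00
   #SBATCH --array=1-1000

   # each array task processes exactly one input file
   ./process_input data/input_${SLURM_ARRAY_TASK_ID}.dat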


.. important::

    The need for GPU scheduling is the reason we are switching from Torque to
    SLURM. If you want to submit CUDA jobs, you **have** to use SLURM.