General
=======

A central component of every HPC system is the resource manager (RM),
sometimes also referred to as a batch system. There are quite a number of such
systems out there: commercial, free and open source, or a mixture of both.
They all solve a similar problem, but they are not compatible with each other.
Some notable examples are:

+ PBS
+ Sun/Oracle Grid Engine
+ Torque/PBSpro
+ Condor
+ LSF
+ SLURM

We have been using a resource manager called Torque for many years, and it has
worked quite well. Unfortunately, the open source part of the project is no
longer well maintained, and the lack of proper GPU support led us to switch to
SLURM. We are gradually migrating from Torque to SLURM (2020), and hence you
will find documentation and example commands for both systems.

Queues/Partitions
-----------------

The RM usually manages a number of queues (called partitions in Slurm), which
can hold thousands of jobs awaiting execution. Once you have prepared a job
(:ref:`torque_jobs`, :ref:`slurm/slurm_jobs`), you can place it inside a
queue. The jobs of all users go into the same central queue(s):

.. image:: ../img/queue.svg
   :width: 100%
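To get a feel for the central queue, you can list its current contents from any login node. As a sketch, using the standard listing commands of both systems:

```shell
# Torque: list all jobs currently known to the resource manager
qstat

# Slurm: list all jobs in all partitions
squeue

# Slurm: show only your own jobs ($USER expands to your login name)
squeue -u "$USER"
```

Both commands show, among other things, each job's state (queued/pending, running) and which node it was assigned to once it starts.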

The scheduler part of the RM uses different, configurable priority-based
algorithms to decide which job to pick out of that queue and start on some
node. For Torque specifically, the scheduler implements fair-share scheduling
for every user over a window of 7 days. Another global objective for the
scheduler (Torque or SLURM) is to maximize resource utilization while
simultaneously ensuring that every job will start eventually.

Assume that in the image above, jobs on the right-hand side were submitted
earlier than those on the left. It is entirely possible that the next feasible
job in that queue is not a green one but a blue or red one. The decision
depends on the amount of resources a job requires and the amount of resources
the corresponding job owner has used in the last 7 days.
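On the SLURM side, the factors behind these decisions can be inspected directly. A minimal sketch using SLURM's standard priority and fair-share reporting tools:

```shell
# Show the factors (age, fair-share, ...) that make up each pending job's priority
sprio

# Show your fair-share usage relative to other users ($USER is your login name)
sshare -u "$USER"
```

A user who has consumed a lot of resources recently will see a lower fair-share value, and their pending jobs a correspondingly lower priority.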

Each queue has a different set of parameters and resource targets:

**Torque**

+ ``default``
+ ``longwall`` for jobs that need more than 36 hours
+ ``testing`` for very short jobs only

**Slurm**

+ ``short`` (default)
+ ``long`` for jobs that need more than 24 hours
+ ``test`` for very short jobs up to 1 hour
+ ``gpu`` for jobs that need a GPU
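A queue or partition is selected at submission time. As an illustration (the job script name ``myjob.sh`` is hypothetical):

```shell
# Torque: submit a job script to the longwall queue
qsub -q longwall myjob.sh

# Slurm: submit the same script to the long partition
sbatch -p long myjob.sh

# Slurm: request the gpu partition for CUDA jobs
sbatch -p gpu myjob.sh
```

If no queue or partition is given, the job lands in the default (``default`` on Torque, ``short`` on Slurm).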

Resources
---------

There are **four important resources** used for accounting, reservations and scheduling:

1. CPU cores
2. Amount of physical memory
3. Time
4. Generic resources (gres, usually a GPU)

A job is a piece of code that requires a combination of those resources to run
correctly, and you can request each of those resources separately. For
instance, a computational problem *might* consist of the following:

A
   10,000 single-threaded jobs, each running only a couple of minutes with a memory footprint of 100MB.

B
   20 jobs, each able to use as many local processors as possible, requiring 10GB of memory each, with an unknown or varying running time.

C
   A single job that is able to utilize the whole cluster at once using a network
   layer such as the Message Passing Interface (MPI).


All of the above requirements need to be expressed in a job description so
the RM knows how many resources to acquire. This is especially important for
large jobs when there are a lot of other, smaller jobs in the queue that need
to be actively held back so the larger jobs won't starve. The batch system
constantly partitions all of the cluster's resources to maintain optimal
efficiency and fairness.
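As a sketch of what such a job description can look like, a SLURM batch script requesting all four resource types might read as follows (the job name and payload are hypothetical):

```shell
#!/bin/bash
# Hypothetical SLURM job script requesting all four resource types.
#SBATCH --job-name=example        # job name (hypothetical)
#SBATCH --cpus-per-task=2         # 1. CPU cores
#SBATCH --mem=10G                 # 2. physical memory
#SBATCH --time=02:00:00           # 3. wall-clock time limit (2 hours)
#SBATCH --gres=gpu:1              # 4. generic resource: one GPU
#SBATCH --partition=gpu           # GPU jobs go to the gpu partition

./my_program                      # the actual payload (hypothetical)
```

The ``#SBATCH`` lines are read by ``sbatch`` at submission time; the scheduler will only start the job on a node that can satisfy all four requests simultaneously.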


.. important::

    The need for GPU scheduling is the reason we are switching from Torque to
    SLURM. If you want to submit CUDA jobs, you **have** to use SLURM.