Commit 5e820c31 authored by Michael Krause's avatar Michael Krause 🎉

Merge branch 'slurm'

parents 1787d2cd efad9306
Pipeline #6518 failed with stage in 7 seconds
......@@ -69,14 +69,12 @@ List of Contents
introduction/data
.. toctree::
:maxdepth: 1
:caption: Batch System
:maxdepth: 3
:caption: Resource Manager
pbs/torque
pbs/jobs
pbs/commands
pbs/resources
pbs/dependencies
rm/general
rm/slurm
rm/torque
.. toctree::
:maxdepth: 1
......
......@@ -20,11 +20,12 @@ Tardis
Some technical facts:
+ **832** Intel® Xeon® E5-2670 CPU cores (no HT) inside 48 Dell m6x0 blade servers
+ 7 dedicated nodes housing **24** Nvidia GPUs
+ **R**:sub:`max` = 9.9 TFlops, **R**:sub:`peak` = 14.8 TFlops
+ **8.32TB** total amount of memory
+ **10.6TB** total amount of memory
+ **32TB** of attached NFS storage for software
+ **747TB** of BeeGFS storage for user homes
+ fully-connected 100GbE
+ **747 TB** of attached BeeGFS storage
+ **10GB/s** fully-connected Ethernet
Workflows
......@@ -52,7 +53,7 @@ might want to use your machine as well. But most importantly it is **slow**.
With the Tardis you can login from your laptop or workstation with SSH (see:
:doc:`login`) to a single head node called ``tardis``. On that node users can
prepare and test their code and analysis and then submit it to a queue (see:
:doc:`../pbs/torque`). Jobs will then **eventually** be submitted to one of the
:ref:`torque`). Jobs will then **eventually** be submitted to one of the
computing nodes to get a guaranteed set of processing resources. Afterwards
users can collect the results and copy them back to the file servers.
......
#!/bin/sh
WATCHSOURCE="**/*.rst *.rst"
shopt -s globstar
WATCHSOURCE="**/*.rst"
MAKEFUNC=makeindex
function makeindex() {
......@@ -11,7 +12,6 @@ function makeindex() {
while true ; do
inotifywait -e modify $WATCHSOURCE -qq
WATCHSOURCE="**/*.rst *.rst"
$MAKEFUNC
done
Torque
======
We are using a cluster environment called `Torque`_. There is a large number of
similar systems with different sets of tools, implementation styles and
licenses. Many of them are somewhat similar to the original Portable Batch
System (**PBS**), developed by NASA.
There are **3 main components** to Torque:
1. Main Server accepting queueing and scheduling commands
2. Separate Scheduling System that handles resource allocation policies
3. Machine Oriented Mini-Servers that handle the processing on the nodes themselves
As a user you are only going to interact with the main server, using a set of
commands, most importantly ``qsub``.
Queues
------
Torque manages a number of queues that can hold thousands of jobs that are
subject to execution. Once you have prepared a job (:ref:`Jobs`) you can place it
inside a queue. All jobs of all users go to the same central
queue(s):
.. image:: ../img/queue.svg
:width: 100%
The scheduler uses a fair share sliding window algorithm to decide which job to
pick out of that queue and start on some node. Assume that in the image
above the jobs on the right hand side have been submitted earlier than those on the
left side. It is absolutely possible that the next feasible job in that queue
is not a green one but a blue or red one. The decision depends on the amount of
resources a job requires and the amount of resources the corresponding job
owner has used in the last 7 days.
Each queue has different parameter sets and resource targets. On the Tardis there are the following queues:
+ ``default``
+ ``longwall`` for jobs that need more than 36 hours
+ ``testing`` for very short jobs only
+ ``gpu`` for jobs that need a GPU
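For illustration only, a minimal sketch of how a queue is chosen at submission time (``job.pbs`` is a placeholder job script):
.. code-block:: bash

    # submit to the default queue
    qsub job.pbs

    # explicitly target the longwall queue for jobs running longer than 36 hours
    qsub -q longwall job.pbs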
Resources
---------
There are **3** important resources used for accounting, reservations and scheduling:
1. CPU cores
2. Amount of physical Memory
3. Time
A job is a piece of code that requires a combination of those 3 resources to run
correctly. Thus you can request each of those resources separately. For
instance, a computational problem *might* consist of the following:
A
10,000 single-threaded jobs, each running only a couple of minutes with a memory footprint of 100 MB.
B
20 jobs, each able to use as many local processors as possible, requiring 10 GB of memory each, with an unknown or varying running time.
C
A single job that is able to utilize the whole cluster at once using MPI.
All of the above requirements need to be represented in a job description so
Torque knows how many resources to acquire. This is especially important for
large jobs when there are lots of other, smaller jobs in the queue that need
to be actively held back so the larger jobs won't starve. The batch system is
constantly partitioning all of the cluster resources to maintain optimal
efficiency and fairness.
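As a rough sketch, such requirements could be expressed at submission time with the ``-l`` resource list (the values below are placeholders, not recommendations):
.. code-block:: bash

    # scenario A: one of many small, short, single-threaded jobs
    qsub -l nodes=1:ppn=1,mem=100mb,walltime=00:10:00 job.pbs

    # scenario B: a multi-threaded job with a larger memory footprint
    qsub -l nodes=1:ppn=8,mem=10gb,walltime=48:00:00 job.pbs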
.. _Torque: http://www.adaptivecomputing.com/products/open-source/torque
General
=======
A main component of every HPC system is the resource manager (RM),
sometimes also referred to as a batch system. There are quite a number of systems
out there: commercial, free and open source, or a mixture of both. They all try
to solve a similar problem but they are not compatible with each other. Some
notable examples are:
+ PBS
+ Sun/Oracle Grid Engine
+ Torque/PBSpro
+ Condor
+ LSF
+ SLURM
We have been using a resource manager called Torque for many years now and it
has worked quite well. Unfortunately, the open source part of the project isn't
maintained very well anymore, and the lack of proper GPU support led us to switch
to SLURM. We will gradually switch from Torque to SLURM (2020), and hence you
will find documentation and example commands for both systems.
Queues/Partitions
-----------------
The RM usually manages a number of queues (they are called partitions in
Slurm) that can hold thousands of jobs that are subject to execution. Once
you have prepared a job (:ref:`torque_jobs`, :ref:`slurm_jobs`) you can place it
inside a queue. All jobs of all users go to the same central
queue(s):
.. image:: ../img/queue.svg
:width: 100%
The scheduler part of the RM uses different, configurable priority-based
algorithms to decide which job to pick out of that queue and start on some
node. For Torque specifically, the scheduler implements fair share scheduling
for every user over a window of 7 days. Another global objective for the
scheduler (Torque or SLURM) is to maximize resource utilization while
simultaneously ensuring that every job will start eventually.
Assume that in the image above the jobs on the right hand side have been submitted
earlier than those on the left side. It is absolutely possible that the next
feasible job in that queue is not a green one but a blue or red one. The
decision depends on the amount of resources a job requires and the amount of
resources the corresponding job owner has used in the last 7 days.
Each queue has different parameter sets and resource targets:
**Torque**
+ ``default``
+ ``longwall`` for jobs that need more than 36 hours
+ ``testing`` for very short jobs only
**Slurm**
+ ``short`` (default)
+ ``long`` for jobs that need more than 24 hours
+ ``test`` for short jobs up to 1 hour
+ ``gpu`` for jobs that need a GPU
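As a minimal sketch, targeting a specific queue or partition at submission time looks like this (``job.pbs`` and ``job.slurm`` are placeholder job scripts):
.. code-block:: bash

    # Torque: submit to the longwall queue
    qsub -q longwall job.pbs

    # Slurm: submit to the long partition
    sbatch -p long job.slurm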
Resources
---------
There are **four important resources** used for accounting, reservations and scheduling:
1. CPU cores
2. Amount of physical memory
3. Time
4. Generic resources (gres, usually a GPU)
A job is a piece of code that requires a combination of those resources to run
correctly. Thus you can request each of those resources separately. For
instance, a computational problem *might* consist of the following:
A
10,000 single-threaded jobs, each running only a couple of minutes with a memory footprint of 100 MB.
B
20 jobs, each able to use as many local processors as possible, requiring 10 GB of memory each, with an unknown or varying running time.
C
A single job that is able to utilize the whole cluster at once using a network
layer such as the Message Passing Interface (MPI).
All of the above requirements need to be represented in a job description so
the RM knows how many resources to acquire. This is especially important for
large jobs when there are lots of other, smaller jobs in the queue that need
to be actively held back so the larger jobs won't starve. The batch system is
constantly partitioning all of the cluster resources to maintain optimal
efficiency and fairness.
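As a hedged sketch, scenario B above could be expressed like this in both systems (all values are placeholders):
.. code-block:: bash

    # Torque: 8 cores, 10 GB of memory, 48 hours of walltime
    qsub -l nodes=1:ppn=8,mem=10gb,walltime=48:00:00 job.pbs

    # Slurm: the same request expressed as sbatch options
    sbatch --cpus-per-task 8 --mem 10G --time 48:00:00 job.slurm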
.. important::
The need for GPU scheduling is the reason we are switching from Torque to
SLURM. If you want to submit CUDA jobs, you **have** to use SLURM.
SLURM
=====
`SLURM`_ is the resource manager that replaced Torque on the Tardis in 2020. It
is similar to Torque in its main concepts, but the commands and syntax differ
a little. We switched because it is much more flexible than Torque, actively
maintained, and supports sophisticated GPU scheduling. We will gradually move
nodes from Torque to SLURM to motivate everyone to familiarize themselves with the new system.
**Users familiar with Torque** can check out the :ref:`slurm_transition` for a quick start:
.. include:: slurm/transition.rst
.. include:: slurm/jobs.rst
.. include:: slurm/commands.rst
.. include:: slurm/resources.rst
.. include:: slurm/gpus.rst
.. _SLURM: https://slurm.schedmd.com/
Commands
--------
Submitting
+++++++++++
**Non-Interactively**
Add a single job to the default queue:
.. code-block:: bash
sbatch job.slurm
Submit the same job with some resource requests and a name:
.. code-block:: bash
sbatch --cpus-per-task 2 --mem 8G --job-name test job.slurm
Submit a job to the gpu partition, requesting 2 GPUs on a single node:
.. code-block:: bash
sbatch -p gpu --gres gpu:2 job.slurm
Wrap bash commands into a job on the fly:
.. code-block:: bash
sbatch --wrap "module load R ; Rscript main.R"
**Interactively/Blocking**
Quick interactive, dual-core shell in the test partition:
.. code-block:: bash
srun -p test -c2 --pty bash
Querying
++++++++
You can use ``squeue`` to get all information about queued or running jobs. This
example limits the output to jobs belonging to the user `krause`:
.. code-block:: bash
[krause@master ~] squeue -u krause
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
110996 short test krause R 0:12 1 ood-43
110997 gpu job.slur krause R 0:08 1 gpu-4
110995 short job.slur krause R 0:15 1 ood-43
As you can see there are 3 jobs: two of them are in the default partition
(**short**) and one has been sent to the gpu partition. They are all in the
running (R) state (ST) and have been running for a couple of seconds (TIME).
**Squeue** is very powerful and its output can be arbitrarily configured using
format strings. Check out ``squeue -o all`` and have a look at the manpage with
``man squeue``.
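For instance, a small format-string sketch (the letters are standard ``squeue`` format specifiers):
.. code-block:: bash

    # job id, partition, name, state, elapsed time and node list/reason
    squeue -u $USER -o "%.10i %.9P %.20j %.2t %.10M %R"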
To get live metrics from a running job you can use ``sstat``.
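A minimal ``sstat`` sketch (using one of the job ids from the listing above; batch jobs usually report their usage under the ``.batch`` step):
.. code-block:: bash

    sstat -j 110995.batch -o JobID,MaxRSS,AveCPU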
To look up historical (accounting) data there is ``sacct``. Again, all output columns can be configured. Example:
.. code-block:: bash
[krause@master ~] sacct -o JobID,ReqMEM,MaxRSS,CPU,Exit
JobID ReqMem MaxRSS CPUTime ExitCode
------------ ---------- ---------- ---------- --------
110973 4Gc 936K 00:00:08 0:0
110974 4Gc 936K 00:00:00 0:0
110976 4Gc 944K 00:00:03 0:0
Deleting
++++++++
You can cancel a specific job by running ``scancel JOBID`` or all of
your jobs at once with ``scancel -u $USER``. This is a bit
different from Torque as there is no special **all** placeholder.
Instead you just ask the system to cancel all jobs matching your
username. Of course it's not possible to accidentally cancel other
users' jobs.
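For example, using the job ids from the ``squeue`` listing above:
.. code-block:: bash

    # cancel a single job
    scancel 110996

    # cancel all of your own jobs
    scancel -u $USER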
Using GPUs
----------
With the rollout of SLURM we introduced a number of dedicated nodes with two
flavors of Nvidia GPUs attached to them, to be used with CUDA-enabled code.
Right now we have these nodes available:
======== =============== ====== ===== =========
Nodename GPU Type Memory Count Partition
======== =============== ====== ===== =========
gpu-1 GTX 1080 TI 12 GB 2 test
-------- --------------- ------ ----- ---------
gpu-2 GTX 1080 8 GB 3 gpu
-------- --------------- ------ ----- ---------
gpu-3 GTX 1080 8 GB 3 gpu
-------- --------------- ------ ----- ---------
gpu-4 Quadro RTX 5000 16 GB 4 gpu
-------- --------------- ------ ----- ---------
gpu-5 Quadro RTX 5000 16 GB 4 gpu
-------- --------------- ------ ----- ---------
gpu-6 Quadro RTX 5000 16 GB 4 gpu
-------- --------------- ------ ----- ---------
gpu-7 Quadro RTX 5000 16 GB 4 gpu
======== =============== ====== ===== =========
Both the 12GB 1080 TI and the 8GB 1080 are grouped under the name **1080**. The
short name for the more powerful Quadro cards is **rtx5k**.
To request any GPU, you can use ``-p gpu --gres gpu:1`` or ``-p test --gres
gpu:1`` if you want to test things. The ``gres`` parameter is very flexible and
also allows you to request a specific GPU group (**1080** or **rtx5k**).
For example, to request 2 Geforce 1080 cards, use ``--gres gpu:1080:2``. This will
effectively hide all other GPUs and grant exclusive usage of the requested devices.
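For example (``job.slurm`` is a placeholder job script):
.. code-block:: bash

    # any single GPU in the gpu partition
    sbatch -p gpu --gres gpu:1 job.slurm

    # two Geforce 1080 cards, requested by group name
    sbatch -p gpu --gres gpu:1080:2 job.slurm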
You can use the ``nvidia-smi`` tool in an interactive job or the node-specific
charts to get an idea of the device's utilization.
Any code that supports CUDA up to version 10.1 should work out of the box; that includes Python's pygpu or Matlab's GPU-enabled libraries.
.. note::
It is also possible to pass a requested GPU into a **singularity
container**. You have to pass the ``--nv`` flag to any
singularity calls, however.
Example: Request an interactive job (``srun --pty``) with 4 cores, 8 GB of memory and a single card from the rtx5k group. Instead of ``/bin/bash`` we use the shell from a Singularity container and tell Singularity to prepare an Nvidia environment with ``singularity shell --nv``:
.. code::
srun --pty -p gpu --gres gpu:rtx5k:1 -c 4 --mem 8gb \
singularity shell --nv /data/container/unofficial/fsl/fsl-6.0.3.sif
Singularity> hostname
gpu-4
Singularity> nvidia-smi
Tue Jul 14 18:38:14 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74 Driver Version: 418.74 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro RTX 5000 Off | 00000000:3B:00.0 Off | Off |
| 33% 28C P8 10W / 230W | 0MiB / 16095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
+-----------------------------------------------------------------------------+
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Singularity>
.. _slurm_jobs:
Example Jobs
------------
SLURM's job concepts exceed those of Torque, and this document only tries to
cover the ones that match Torque. For a comprehensive list, check out the official `slurm
documentation`_.
Simple Jobs
+++++++++++
The simplest job file just consists of a list of shell commands to be executed.
In that case it is equivalent to a shell script. Note that, in contrast to
Torque, SLURM job files have to start with a hash-bang (``#!``) line.
Example ``simple_job.job``
.. code-block:: bash
#!/bin/bash
./run_simulation.py
To submit this job run:
.. code-block:: bash
sbatch simple_job.job
Now that alone is rarely sufficient. In most cases you are going to
need some resource requests and a state variable, as you are very likely to
submit multiple similar jobs. It is possible to add SLURM parameters (see
:ref:`slurm_resources`) inside the job file.
Example ``job_with_resources.job``
.. code-block:: bash
#!/bin/bash
#SBATCH --job-name myjob
#SBATCH --partition gpu
#SBATCH --time 24:0:0
#SBATCH --cpus-per-task 2
#SBATCH --mem 32GB
#SBATCH --gres gpu:1
#SBATCH --mail-type NONE
#SBATCH --workdir .
./run_simulation.py
This would create a job called "myjob" in the GPU partition that needs 24 hours
of running time, 32 GB of RAM, a single GPU of any type, and 2 processors. It
will not send any e-mails and will start in the current directory.
Interactive Jobs
++++++++++++++++
Sometimes it may be useful to get a quick shell on one of the compute nodes.
Before submitting hundreds or thousands of jobs you might want to run some
simple checks to ensure all the paths are correct and the software is loading
as expected. Although you can usually run these tests on the login node itself,
there are cases when this is dangerous, for example when your tests quickly
require lots of memory. In that case you should move those tests to one of the
compute nodes:
.. code-block:: bash
srun --pty bash
This will submit a job that requests a shell. The submission will block until
the job gets scheduled. Note that we do not use ``sbatch`` but the similar
command ``srun``. The fundamental distinction here is that ``srun`` blocks
until the command or job gets scheduled, while ``sbatch`` puts the job into the
queue and returns immediately. The parameter ``--pty`` allocates a pseudo-terminal to the program so input/output works as expected.
When there are lots of jobs in the queue, scheduling might take some time.
To speed things up you can submit to the ``test`` partition, which only allows jobs
with a very short running time. Example:
.. code-block:: bash
srun -p test --pty /bin/bash
Other useful examples are:
.. code-block:: bash
# get a quick R session with 2 cores in the test partition
srun -p test -c 2 --pty R
# Start the most recent Matlab with 32GB
# two bash commands need to be passed to bash -c
srun --mem 32g --pty bash -c 'module load matlab ; matlab'
# Start some python script with 1 GPU and 2 cores
srun -p gpu --gres gpu -c 2 python3 main.py
.. _slurm_job_wrappers:
Job Wrappers
++++++++++++
Usually users want to collect a number of jobs into batches and submit them with one command. There are a number of approaches to do that. The most straightforward way is to use a minimal ``submit.sh`` shell script that could look a bit like this:
.. code-block:: bash
#!/bin/bash
for sub in $(seq -w 1 15) ; do
echo '#!/bin/bash' > job.slurm
echo "#SBATCH --job-name main_$sub" >> job.slurm
echo "#SBATCH --cpus 2" >> job.slurm
echo "python main.py $sub" >> job.slurm
sbatch job.slurm
rm -f job.slurm
done
This can be condensed into a single line with the ``--wrap`` option to
sbatch. Here SLURM will create a job file on the fly, add the #!-line and
append the wrapped string to that file. This syntax is used a lot in
the examples in this document.
.. code-block:: bash
for sub in $(seq -w 1 15) ; do
sbatch -c 2 -J main_$sub --wrap "python3 main.py $sub"
done
.. _slurm documentation: https://slurm.schedmd.com/quickstart.html
.. _slurm_resources:
Resources
---------
This is a list of common SLURM options. You can either use these options
directly with ``sbatch``/``srun`` or add them as meta-parameters in a job file.
In the latter case those options need the prefix ``#SBATCH`` and must be stated
in the first section of the file, before the actual commands. The complete list
can be found in ``man sbatch``.
``#SBATCH --job-name job-name``
Sets the name of the job. This is mostly useful when submitting lots of
similar jobs in a loop.
``#SBATCH --time 24:0:0``
Sets the expected maximum running time for the job. When a job **exceeds**
this limit it will **be terminated**.
``#SBATCH --mem 10GB``
Sets another resource requirement: memory. Exceeding this value is even
more critical than exceeding the running time, as you might interfere with other
jobs on the node. Therefore the job will be **terminated as well**.
``#SBATCH --cpus-per-task 2``
Requests 2 CPUs for the job. This only makes sense if your code is
multi-threaded and can actually utilize the cores.
``#SBATCH --workdir project/data``
Sets the working directory of the job. Every time a job gets started it
will spawn a shell on some node. To initially jump to some directory use
this option. *Otherwise* the first command of your job should always be ``cd
project/data``.
``#SBATCH --output /home/mpib/krause/logs/slurm-%j.out``
Specifies the location where SLURM will save the job's log file. By default
(different from Torque) the *stdout* and *stderr* streams are merged together
into this output file. The ``%j`` variable will be replaced with the SLURM
job id. To save the error stream to a separate file, use ``--error``. If
you do not specify ``--output`` or ``--error``, the (combined) log is
stored in the current location. Another difference to Torque is that
the log file is available right away and contents are streamed
into it during the lifecycle of the job (you can follow incoming data with
``tail -f slurm.out``).
``#SBATCH --output /dev/null``
To discard all standard output, use the special file ``/dev/null``.
``#SBATCH --mail-user krause[,knope,dwyer]``
Send an e-mail to a single user or a list of users for some configured mail
types (see below).
``#SBATCH --mail-type NONE,[OTHER,EVENT,TYPES]``
+ **NONE** default (no mail)
+ **BEGIN** send an e-mail when the job starts
+ **FAIL** send an e-mail when the job has failed
+ **END** send an e-mail when the job has finished
Check out ``man sbatch`` for more mail types.
``#SBATCH --dependency=afterok:Job-Id[:Job2-Id...]``
This will add a dependency to the current job. It will only be started or
tagged as startable when another job with id *Job-Id* has finished
successfully. You can provide more than one id using a colon as
a separator (see the sketch after this list).
``#SBATCH --gres gpu:1 --partition gpu``
Request a single GPU of any kind. It's also necessary to specify
a different partition using the ``--partition/-p`` flag.
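As a sketch of the dependency option in practice, assuming two hypothetical job scripts ``step1.slurm`` and ``step2.slurm``:
.. code-block:: bash

    # --parsable makes sbatch print just the job id instead of the usual message
    jid=$(sbatch --parsable step1.slurm)

    # this job only starts once the first one has finished successfully
    sbatch --dependency=afterok:$jid step2.slurm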
.. _slurm_transition:
Torque Transition Table
-----------------------
================================================= =====
Torque Slurm
================================================= =====
**Submit**
---------------------------------------------------------
``qsub job.pbs`` ``sbatch job.sh``
``echo "Rscript foo.R" | qsub`` ``sbatch --wrap "Rscript foo.R"``
``qsub -I -q testing`` ``srun -p test --pty /bin/bash``
------------------------------------------------- -----
**Query**
---------------------------------------------------------
``qstat`` ``squeue``
``qstat -r`` ``squeue -u $USER --states R``
``qstat -q`` ``sinfo -s``
------------------------------------------------- -----