Commit cbc8e484 authored by Michael Krause 🎉

RM: add slurm version of torque doc

parent 03be694f
@@ -54,8 +54,8 @@ List of Contents
:caption: Resource Manager
rm/general
rm/slurm
rm/torque
.. toctree::
General
=======
A main component of every HPC system is called the resource manager (RM),
sometimes also referred to as a batch system. There are quite a number of systems
out there, commercial, free and open source, or a mixture of both. They all try
to solve a similar problem but they are not compatible with each other. Some
notable examples are:
+ PBS
+ Sun/Oracle Grid Engine
+ Torque/PBSpro
+ Condor
+ LSF
+ SLURM
@@ -17,17 +17,17 @@ notable examples are:
We have been using a resource manager called Torque for many years now and it
worked quite well. Unfortunately the open source part of the project isn't
maintained very well anymore and the lack of proper GPU support led us to switch
to SLURM. We will gradually switch from Torque to SLURM (2020) and hence you
will find documentation and example commands for both systems.
Queues/Partitions
-----------------
The RM usually manages a number of queues (they are called partitions in
Slurm) and they can hold thousands of jobs that are subject to execution. Once
you have prepared a job (:ref:`torque_jobs`, :ref:`slurm_jobs`) you can place
it inside a queue. All jobs of all users are going to the same central
queue(s):
.. image:: ../img/queue.svg
:width: 100%
@@ -37,7 +37,7 @@ algorithms to decide what job to pick out of that queue and start it on some
node. For Torque specifically, the scheduler implements fair share scheduling
for every user over a window of 7 days. Another global objective for the
scheduler (Torque or SLURM) is to maximize resource utilization while
simultaneously ensuring that every job will start eventually.
Assume that in the image above jobs on the right hand side have been submitted
earlier than those on the left side. It is absolutely possible that the next
@@ -45,18 +45,23 @@ feasible job in that queue is not a green one but a blue or red one. The
decision depends on the amount of resources a job requires and the amount of
resources the corresponding job owner has used in the last 7 days.
Each queue has different parameter sets and resource targets:
**Torque**
+ ``default``
+ ``longwall`` for jobs that need more than 36 hours
+ ``testing`` for very short jobs only
+ ``gpu`` for jobs that need a GPU
**Slurm**
+ ``short`` (default)
+ ``long`` for jobs that need more than 24 hours
+ ``test`` for short jobs up to 1 hour
+ ``gpu`` for jobs that need a GPU
Resources
---------
There are **four important resources** used for accounting, reservations and scheduling:
1. CPU cores
2. Amount of physical memory
SLURM
=====
`SLURM`_ is the resource manager that replaced Torque on the Tardis in 2020. It
is similar to Torque in its main concepts, but the commands and syntax differ
a little. We switched because it is much more flexible than Torque, actively
maintained, and supports sophisticated GPU scheduling. We will gradually move
nodes from Torque to SLURM to motivate everyone to become familiar with the
new system.
.. important::
**Users familiar with Torque** can check out the :ref:`slurm_transition` for a quick start.
.. include:: slurm/jobs.rst
.. include:: slurm/commands.rst
.. include:: slurm/resources.rst
.. include:: slurm/transition.rst
.. _SLURM: https://slurm.schedmd.com/
Commands
--------
Submitting
+++++++++++
**Non-Interactively**
Add a single job to the default queue:
.. code-block:: bash
sbatch job.slurm
Submit the same job with some resource requests and a name:
.. code-block:: bash
sbatch --cpus-per-task 2 --mem 8G --job-name test job.slurm
Submit a job to the gpu partition, requesting 2 gpus on a single node:
.. code-block:: bash
sbatch -p gpu --gres gpu:2 job.slurm
Wrap bash commands into a job on the fly:
.. code-block:: bash
sbatch --wrap "module load R ; Rscript main.R"
**Interactively/Blocking**
Quick interactive, dual-core shell in the test partition:
.. code-block:: bash
srun -p test -c2 --pty bash
Querying
++++++++
You can use ``squeue`` to get all information about queued or running jobs. This
example limits the output to jobs belonging to the user `krause`:
.. code-block:: bash
[krause@master ~] squeue -u krause
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
110996 short test krause R 0:12 1 ood-43
110997 gpu job.slur krause R 0:08 1 gpu-4
110995 short job.slur krause R 0:15 1 ood-43
As you can see there are 3 jobs, two of them in the default partition
(**short**) and one that has been sent to the gpu partition. They are all in
the running (R) state (ST) and have been running for a couple of seconds
(TIME).
``squeue`` is very powerful and its output can be arbitrarily configured using
format strings. Check out ``squeue -o all`` and have a look at the manpage with
``man squeue``.
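As a sketch, a custom format string selecting job id, partition, name, state,
elapsed time and node (or pending reason) could look like this (the exact
columns are of course up to you):
.. code-block:: bash
squeue -u $USER -o "%.10i %.10P %.20j %.3t %.10M %R"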
To look up historical (accounting) data there is ``sacct``. Again, all output columns can be configured. Example:
.. code-block:: bash
[krause@master ~] sacct -o JobID,ReqMEM,MaxRSS,CPU,Exit
JobID ReqMem MaxRSS CPUTime ExitCode
------------ ---------- ---------- ---------- --------
110973 4Gc 936K 00:00:08 0:0
110974 4Gc 936K 00:00:00 0:0
110976 4Gc 944K 00:00:03 0:0
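If you are after a single (finished) job you can also pass its id; a small
sketch, using column names from ``man sacct``:
.. code-block:: bash
sacct -j 110995 -o JobID,JobName,State,Elapsed,MaxRSS,ExitCode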
Deleting
++++++++
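Jobs are removed from the queue (or killed, if they are already running) with
``scancel``. A couple of sketches, reusing the job ids from the ``squeue``
example above:
.. code-block:: bash
# cancel a single job by its id
scancel 110996
# cancel all of your own jobs
scancel -u $USER
# cancel only your jobs in the gpu partition
scancel -u $USER -p gpu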
.. _slurm_jobs:
Example Jobs
------------
We mentioned job files as parameters to sbatch in the general section. They are
a convenient way of collecting job properties without clobbering the command
line. It's also useful to programmatically create a job description and capture
it in a file.
SLURM job concepts exceed those of Torque and this document only tries to
cover the matching ones. For a comprehensive list, check out the official
`slurm documentation`_.
Simple Jobs
+++++++++++
@@ -21,17 +19,20 @@ Example ``simple_job.job``
.. code-block:: bash
#!/bin/bash
cd project/
./run_simulation.py
To submit this job run:
.. code-block:: bash
sbatch simple_job.job
Now that is rarely sufficient. In most cases you are going to
need some resource requests and a state variable as you are very likely to
submit multiple similar jobs. It is possible to add SLURM parameters (see
:ref:`slurm_resources`) inside the job file.
Example ``job_with_resources.job``
@@ -46,15 +47,14 @@ Example ``job_with_resources.job``
#SBATCH --mem 32GB
#SBATCH --gres gpu:1
#SBATCH --mail-type NONE
#SBATCH --chdir .
./run_simulation.py
This would create a job called "myjob" in the GPU partition that needs 24 hours
of running time, 32GB of RAM, a single GPU of any type, and 2 processors. It
will not send any e-mails and will start in the current directory.
Interactive Jobs
++++++++++++++++
@@ -63,19 +63,20 @@ Sometimes it may be useful to get a quick shell on one of the compute nodes.
Before submitting hundreds or thousands of jobs you might want to run some
simple checks to ensure all the paths are correct and the software is loading
as expected. Although you can usually run these tests on the login node itself,
there are cases when this is dangerous, for example when your tests quickly
require lots of memory. In that case you should move those tests to one of the
compute nodes:
.. code-block:: bash
srun --pty bash
This will submit a job that requests a shell. The submission will block until
the job gets scheduled. Note that we do not use ``sbatch`` here, but the
similar command ``srun``. The fundamental distinction is that ``srun`` blocks
until the command or job gets scheduled, while ``sbatch`` puts the job into
the queue and returns immediately. The parameter ``--pty`` allocates
a pseudo-terminal to the program so input/output works as expected.
When there are lots of jobs in the queue the scheduling might take some time.
To speed things up you can submit to the testing queue which only allows jobs
@@ -83,8 +84,22 @@ with a very short running time: Example:
.. code-block:: bash
srun -p test --pty /bin/bash
Other useful examples are:
.. code-block:: bash
# get a quick R session with 2 cores in the test partition
srun -p test -c 2 --pty R
# Start the most recent Matlab with 32GB
# two bash commands need to be passed to bash -c
srun --mem 32g --pty bash -c 'module load matlab ; matlab'
# Start some python script with 1 GPU and 2 cores
srun -p gpu --gres gpu -c 2 python3 main.py
.. _slurm_job_wrappers:
@@ -92,6 +107,29 @@ with a very short running time: Example:
Job Wrappers
++++++++++++
Usually users want to collect a number of jobs into batches and submit them
with one command. There are a number of approaches to do that. The most
straightforward way is to use a minimal ``submit.sh`` shell script that could
look a bit like this:
.. code-block:: bash
#!/bin/bash
for sub in $(seq -w 1 15) ; do
echo '#!/bin/bash' > job.slurm
echo "#SBATCH --job-name main_$sub" >> job.slurm
echo "#SBATCH --cpus 2" >> job.slurm
echo "python main.py $sub" >> job.slurm
sbatch job.slurm
rm -f job.slurm
done
This can be condensed down into a single line with the ``--wrap`` option to
sbatch:
.. code-block:: bash
for sub in $(seq -w 1 15) ; do \
sbatch -c 2 -J main_$sub --wrap "python3 main.py $sub" ;\
done
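If you prefer a real job file over ``--wrap`` but want to avoid the temporary
file, a here document can serve the same purpose, since ``sbatch`` reads the
job script from standard input when no file name is given. A minimal sketch:
.. code-block:: bash
#!/bin/bash
for sub in $(seq -w 1 15) ; do
sbatch <<EOF
#!/bin/bash
#SBATCH --job-name main_$sub
#SBATCH --cpus-per-task 2
python3 main.py $sub
EOF
done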
.. _slurm documentation: https://slurm.schedmd.com/quickstart.html
@@ -3,4 +3,74 @@
Resources
---------
This is a list of common SLURM options. You can either use these options
directly with ``sbatch``/``srun`` or add them as meta-parameters in a job file.
In the latter case those options need the prefix ``#SBATCH`` and must be stated
in the first section of the file before the actual commands. The complete list
can be found in ``man sbatch``.
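For example, the following two are equivalent ways of requesting a one day
time limit and 10GB of memory for a hypothetical job file ``job.slurm``:
.. code-block:: bash
# ad hoc, on the command line
sbatch --time 24:0:0 --mem 10GB job.slurm
# or as meta-parameters at the top of job.slurm
# #SBATCH --time 24:0:0
# #SBATCH --mem 10GB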
``#SBATCH --job-name job-name``
Sets the name of the job. This is mostly useful when submitting lots of
similar jobs in a loop.
``#SBATCH --time 24:0:0``
Sets the expected maximum running time for the job. When a job **exceeds**
this limit it will **be terminated**.
``#SBATCH --mem 10GB``
Sets another resource requirement: memory. Exceeding this value in a job is
even more critical than exceeding the running time, as you might interfere
with other jobs on the node. Therefore it needs to be **terminated as well**.
``#SBATCH --cpus-per-task 2``
Requests 2 CPUs for the job. This only makes sense if your code is
multi-threaded and can actually utilize the cores.
``#SBATCH --gres gpu:1 --partition gpu``
Asks for a single GPU of any kind. It's also necessary to specify
a different partition. The ``gres`` parameter is very flexible and you can
also specify the GPU type (``1080`` or ``rtx5k``). For example, to request
two GeForce 1080 cards, use ``--gres gpu:1080:2``. This will effectively hide
all other GPUs on the system. You can test it out by running ``nvidia-smi``
in an interactive job.
``#SBATCH --chdir project/data``
Sets the working directory of the job. Every time a job gets started it
will spawn a shell on some node. To initially jump to some directory use
this option. *Otherwise* the first command of your job should always be ``cd
project/data``.
``#SBATCH --output /home/mpib/krause/logs/slurm-%j.out``
Specifies the location where SLURM will save the job's log file. By default
(different to Torque) the *stdout* and *stderr* streams are merged together
into this output file. The ``%j`` variable will be replaced with the SLURM
job id. To save the error stream to a separate file, use ``--error``. If
you do not specify ``--output`` or ``--error``, the (combined) log is
stored in the current working directory. Another difference to Torque is
that the log file is available right away and its content is streamed into
it during the lifecycle of the job (you can follow incoming data with
``tail -f slurm.out``).
``#SBATCH --output /dev/null``
To discard all standard output log use the special file ``/dev/null``.
``#SBATCH --mail-user krause[,knope,dwyer]``
Send an e-mail to a single user or a list of users for some configured mail
types (see below).
``#SBATCH --mail-type NONE,[OTHER,EVENT,TYPES]``
+ **NONE** default (no mail)
+ **BEGIN** send an e-mail when the job starts
+ **FAIL** send an e-mail when the job has failed
+ **END** send an e-mail when the job has finished
Check out ``man sbatch`` for more mail types.
``#SBATCH --dependency=afterok:Job-Id[:Job2-Id...]``
This will add a dependency to the current job. It will only be started or
tagged as startable when another job with id *Job-Id* finished
successfully. You can provide more than one id using a colon as
a separator.
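As a sketch of how such a chain can be built from the shell: ``--parsable``
makes ``sbatch`` print the job id of the submitted job, which can then be fed
into the ``--dependency`` option of the next submission
(``preprocess.slurm`` and ``analyze.slurm`` are just placeholder job files):
.. code-block:: bash
# submit the first job and remember its id
jid=$(sbatch --parsable preprocess.slurm)
# the second job only starts once the first one finished successfully
sbatch --dependency=afterok:$jid analyze.slurm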
.. _slurm_transition:
Torque Transition Table
-----------------------
=================================== =====
Torque                              Slurm
=================================== =====
**Submit**
-----------------------------------------
``qsub job.pbs``                    ``sbatch job.sh``
``echo "Rscript foo.R" | qsub``     ``sbatch --wrap "Rscript foo.R"``
``qsub -I -q testing``              ``srun -p test --pty /bin/bash``
----------------------------------- -----
**Query**
-----------------------------------------
``qstat``                           ``squeue``
``qstat -r``                        ``squeue -u $USER --states R``
``qstat -q``                        ``sinfo -s``
----------------------------------- -----
**Manage**
-----------------------------------------
``qdel 1234``                       ``scancel 1234``
``qdel all``                        ``scancel -u $USER``
``qalter -l walltime=1:0:0 1234``   ``scontrol update jobid=1234 TimeLimit=1:0:0``
=================================== =====