
.. _slurm_jobs:

Example Jobs
------------

SLURM offers more job concepts than Torque, and this document tries to cover
the ones that have a Torque counterpart. For a comprehensive list, check out
the official `slurm documentation`_.

Simple Jobs
+++++++++++

The simplest job file just consists of a list of shell commands to be executed.
In that case it is equivalent to a shell script. Note that, in contrast to
Torque, SLURM jobs have to start with a hash-bang (``#!``) line.

Example ``simple_job.job``

.. code-block:: bash

   #!/bin/bash
   ./run_simulation.py


To submit this job run:

.. code-block:: bash

   sbatch simple_job.job


That is rarely sufficient. In most cases you will need some resource requests
and, since you are very likely to submit multiple similar jobs, a state
variable that tells them apart. It is possible to add SLURM parameters (see
:ref:`slurm_resources`) inside the job file.

Example ``job_with_resources.job``

.. code-block:: bash


   #!/bin/bash
   #SBATCH --job-name myjob
   #SBATCH --partition gpu
   #SBATCH --time 24:0:0
   #SBATCH --cpus-per-task 2
   #SBATCH --mem 32GB
   #SBATCH --gres gpu:1
   #SBATCH --mail-type NONE
   #SBATCH --chdir .

   ./run_simulation.py


This would create a job called "myjob" in the GPU partition that needs 24 hours
of running time, 32GB of RAM, a single GPU of any type, and 2 processors. It
will not send any e-mails and will start in the current directory.
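
The state variable mentioned above can be handed to a job in several ways. The
following is a minimal sketch of one of them, assuming that arguments placed
after the job file name on the ``sbatch`` command line reach the script as
``$1``, ``$2`` and so on. The file name ``job_with_argument.job`` and the
subject IDs are hypothetical. On machines without ``sbatch`` the loop only
prints the command it would run.

.. code-block:: bash

   #!/bin/bash
   # Write a hypothetical job file that reads its subject ID from the first argument
   cat > job_with_argument.job <<'EOF'
   #!/bin/bash
   #SBATCH --job-name myjob
   #SBATCH --time 1:0:0
   #SBATCH --cpus-per-task 2

   # $1 is the first argument given after the job file name
   ./run_simulation.py "$1"
   EOF

   # Submit one job per subject (illustrative IDs)
   for sub in 01 02 03 ; do
       if command -v sbatch > /dev/null ; then
           sbatch job_with_argument.job "$sub"
       else
           # no SLURM available here: show the command instead
           echo "would run: sbatch job_with_argument.job $sub"
       fi
   done

Keeping the variable part on the command line means the job file itself stays
identical for all submissions.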

Interactive Jobs
++++++++++++++++

Sometimes it may be useful to get a quick shell on one of the compute nodes.

Before submitting hundreds or thousands of jobs you might want to run some
simple checks to ensure all the paths are correct and the software is loading
as expected. Although you can usually run these tests on the login node itself,
there are cases when this is dangerous, for example when your tests quickly
require lots of memory. In that case you should move those tests to one of the
compute nodes:

.. code-block:: bash

   srun --pty bash

This will submit a job that requests a shell. Note that we do not use
``sbatch``, but the similar command ``srun``. The fundamental distinction here
is that ``srun`` blocks until the command or job gets scheduled and then runs
it in the foreground, while ``sbatch`` puts the job into the queue and returns
immediately. The parameter ``--pty`` allocates a pseudo-terminal to the program
so input/output works as expected.

When there are lots of jobs in the queue, scheduling might take some time.
To speed things up you can submit to the testing queue, which only allows jobs
with a very short running time:

.. code-block:: bash

    srun -p test --pty /bin/bash

Other useful examples are:


.. code-block:: bash

   # get a quick R session with 2 cores in the test partition
   srun -p test -c 2 --pty R

   # Start the most recent Matlab with 32GB
   # two bash commands need to be passed to bash -c
   srun --mem 32g --pty bash -c 'module load matlab ; matlab'

   # Start some python script with 1 GPU and 2 cores
   srun -p gpu --gres gpu -c 2 python3 main.py


.. _slurm_job_wrappers:

Job Wrappers
++++++++++++

Usually users want to collect a number of jobs into batches and submit them
with one command. There are a number of approaches to do that. The most
straightforward way is to use a minimal ``submit.sh`` shell script that could
look a bit like this:

.. code-block:: bash

    #!/bin/bash

    for sub in $(seq -w 1 15) ; do
        echo '#!/bin/bash'                    > job.slurm
        echo "#SBATCH --job-name main_$sub"  >> job.slurm
        echo "#SBATCH --cpus-per-task 2"     >> job.slurm
        echo "python3 main.py $sub"          >> job.slurm
        sbatch job.slurm
        rm -f job.slurm
    done


This can be condensed down into a single line with the ``--wrap`` option to
sbatch:

.. code-block:: bash

   for sub in $(seq -w 1 15) ; do\
     sbatch -c 2 -J main_$sub --wrap "python3 main.py $sub" ;\
   done
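
A third variant, sketched here under the assumption that ``sbatch`` reads the
job script from standard input when no file name is given, avoids the temporary
``job.slurm`` file while still allowing several ``#SBATCH`` lines. The helper
``make_job`` is a hypothetical name introduced for this example. On machines
without ``sbatch`` the loop only prints the generated scripts.

.. code-block:: bash

   #!/bin/bash

   # Print the job script for one subject to stdout (hypothetical helper)
   make_job() {
       local sub=$1
       cat <<EOF
   #!/bin/bash
   #SBATCH --job-name main_$sub
   #SBATCH --cpus-per-task 2
   python3 main.py $sub
   EOF
   }

   for sub in $(seq -w 1 15) ; do
       if command -v sbatch > /dev/null ; then
           # pipe the generated script straight into sbatch
           make_job "$sub" | sbatch
       else
           # no SLURM available here: just show what would be submitted
           make_job "$sub"
       fi
   done

Because the script text is produced by a function, it can be inspected with
``make_job 07`` before anything is submitted.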

.. _slurm documentation: https://slurm.schedmd.com/quickstart.html