R
=

R is a programming language and interpreter for statistical computing tasks.
It is open source and therefore very well supported on Linux. We follow the R
release of Debian Linux stable, but other versions of R are available through
the :doc:`../environment/modules` system.

.. code-block:: none

   [krause@master ~] module avail R
   --- /opt/environment/modules ---
   R/2           R/3.3         R/3.5         R/3.6         Rlibs/3.3-fix
   R/2.15        R/3.4         R/3.5.1       R/3.6.0
   R/3           R/3.4.4       R/3.5.3       R/3.6.2


Packages
--------

Some packages are already installed globally; every other dependency you can
just go ahead and install on your own. All the nodes and the master node share
your home directory, so you only need to install a package once and it will be
available to your jobs.
Note that you should install packages on the master node, as development files
are usually not distributed to the execution nodes.
Also, if you change the minor or major version of R (3.5.x to 3.6.x or
3.x to 4.x), it's necessary to rebuild your packages.

.. code-block:: none

   [krause@master] module load R/3.6
   [krause@master] R --quiet
   > install.packages('rmarkdown', repos='http://ftp5.gwdg.de/pub/misc/cran/')
   Installing package into ‘/mnt/beegfs/home/krause/R/x86_64-pc-linux-gnu-library/3.6’
   (as ‘lib’ is unspecified)
   trying URL 'http://ftp5.gwdg.de/pub/misc/cran/src/contrib/rmarkdown_2.1.tar.gz'
   ...
   * DONE (rmarkdown)

   The downloaded source packages are in
   	‘/tmp/RtmpgDCCKY/downloaded_packages’
   >
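
After switching to a new minor or major version of R, you can rebuild
everything in your personal library in one go. A minimal sketch, using the
same CRAN mirror as above:

.. code-block:: r

   # reinstall all packages in your library that were built
   # under an earlier (minor or major) version of R
   update.packages(checkBuilt=TRUE, ask=FALSE,
                   repos='http://ftp5.gwdg.de/pub/misc/cran/')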



Loop Refactoring
-----------------

There are some interesting packages for batch systems on CRAN, such as
`batchjobs`_, but most R jobs we've seen on the cluster are simple enough to
be parallelized with some small modifications. Whenever you are running
simulations or generic loops with a lot of repetitions, you can apply a
simple pattern.

Suppose your general code structure is something like this:

.. code-block:: r

   library(somelib)
   source("mylib")

   n <- 1000
   result <- c()

   for( i in seq(1,n) ){
       result <- append(result, simulation(i))
   }

   write(result, file="output")

You obviously need to parallelize the loop somehow. Every intermediate result
can be calculated independently, so a natural idea is to have ``simulation(i)``
run on ``n`` nodes in parallel. The only tricky part is to split the loop and
then merge the result variables, for instance by writing each result to a
separate file. You could remove the loop and create a new file
**simulation_loop_body.R** containing only the loop's body:

.. code-block:: r

   library(somelib)
   source("mylib")

   i <- as.integer( commandArgs(TRUE)[1] )
   result <- simulation(i)
   write(result, file=paste0("output_", i))

We are reading the loop index as a parameter to the script with ``commandArgs()``.
The external loop can then be constructed with a job array (see :ref:`job_array`).

.. code-block:: bash

   echo 'Rscript simulation_loop_body.R $PBS_ARRAYID' | qsub -N simulation -d. -t 1-1000

This will create 1000 jobs with ``$PBS_ARRAYID`` holding the current index
that will be passed to the R script shown above.
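
Once all array jobs have finished, the per-job output files need to be merged
again. A minimal sketch of such a collect step, assuming each ``simulation(i)``
returned numeric values written to the ``output_`` files above:

.. code-block:: r

   # collect the per-job results back into a single vector
   n <- 1000
   result <- unlist(lapply(seq(1,n), function(i) {
       scan(paste0("output_", i), quiet=TRUE)
   }))
   write(result, file="output")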

Parallel Vector Operations
--------------------------

You can also use the multi-core nodes of the tardis to run a single
multi-core aware job with the ``parallel`` package. You *have to make sure*
that the number of cores you request matches the number you register within
the script.

If you submit a job with :samp:`-l nodes=1:ppn=16`, you can safely use 16
cores in the script.

Example: :samp:`qsub -I -l nodes=1:ppn=16`

.. code-block:: r

    library(parallel)
    library(gmp)

    # use the number of cores allocated by PBS, fall back to 1
    cores <- Sys.getenv("PBS_NUM_PPN")
    cores <- if (cores == '') 1 else as.numeric(cores)

    # 16 copies of the Mersenne number 2^521 - 1 to factorize
    data <- matrix(as.bigz(2**521)-1, 16)
    mclapply(data, factorize, mc.cores=cores)

Timings:

.. code-block:: r

    > system.time(mclapply(data, factorize, mc.cores=1))
       user  system elapsed
     99.578   0.000  99.788

    > system.time(mclapply(data, factorize, mc.cores=cores))
       user  system elapsed
    107.442   0.052   7.20

Alternatively you can use ``%dopar%`` and ``foreach``:

.. code-block:: r

    library(doMC)
    library(gmp)

    cores <- Sys.getenv("PBS_NUM_PPN")
    cores <- if (cores == '') 1 else as.numeric(cores)
    # register the allocated cores with the doMC parallel backend
    registerDoMC(cores)

    foreach(i=seq(1,16)) %dopar% {
        factorize(matrix(as.bigz(2**521)-1))
    }


If you want to unroll the vector operation to the whole cluster, have a look
at the `batchjobs`_ package.
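
A minimal sketch using `batchjobs`_, reusing the ``simulation()`` function
from the loop example above (the registry id and the 1000 indices are
placeholders):

.. code-block:: r

   library(BatchJobs)
   source("mylib")

   # the registry is a directory that tracks all jobs and their results
   reg <- makeRegistry(id="simulation")
   # one job per loop index, equivalent to the job array above
   batchMap(reg, simulation, seq(1,1000))
   submitJobs(reg)

   # later, once all jobs have finished:
   waitForJobs(reg)
   result <- unlist(loadResults(reg))
   write(result, file="output")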

.. _batchjobs: https://cran.r-project.org/web/packages/BatchJobs/index.html