R
=

R is a programming language and interpreter for statistical computing tasks.
It is open source and therefore very well supported on Linux. We follow the R
release of Debian Linux stable, but other versions of R are available through
the :doc:`../environment/modules` system.

.. code-block:: none

   [krause@master ~] module avail R
   --- /opt/environment/modules ---
   R/2           R/3.3         R/3.5         R/3.6         Rlibs/3.3-fix
   R/2.15        R/3.4         R/3.5.1       R/3.6.0
   R/3           R/3.4.4       R/3.5.3       R/3.6.2


Packages
--------

Some packages are already installed globally; every other dependency you can
just go ahead and install on your own. All the nodes and the master node share
your home directory, so you only need to install a package once and it will be
available to your jobs.
Note that you should install packages on the master node, as development files
are usually not distributed to the execution nodes.
Also, if you change the minor or major version of R (3.5.x to 3.6.x or
3.x to 4.x), it's necessary to rebuild your packages.

.. code-block:: none

   [krause@master] module load R/3.6
   [krause@master] R --quiet
   > install.packages('rmarkdown', repos='http://ftp5.gwdg.de/pub/misc/cran/')
   Installing package into ‘/mnt/beegfs/home/krause/R/x86_64-pc-linux-gnu-library/3.6’
   (as ‘lib’ is unspecified)
   trying URL 'http://ftp5.gwdg.de/pub/misc/cran/src/contrib/rmarkdown_2.1.tar.gz'
   ...
   * DONE (rmarkdown)

   The downloaded source packages are in
   	‘/tmp/RtmpgDCCKY/downloaded_packages’
   >
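
After switching to a new minor or major version of R, you can rebuild
everything in your personal library in one go. A minimal sketch, using the
same CRAN mirror as above:

.. code-block:: r

   # reinstall all packages in your library that were built
   # under an earlier (minor or major) version of R
   update.packages(checkBuilt=TRUE, ask=FALSE,
                   repos='http://ftp5.gwdg.de/pub/misc/cran/')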



Loop Refactoring
-----------------

There are some interesting packages for batch systems on CRAN, such as
`batchjobs`_, but most R jobs we've seen on the cluster are simple enough to
be parallelized with some small modifications. Whenever you are running
simulations or generic loops with a lot of repetitions, you can apply a
simple pattern.

Suppose your general code structure is something like this:

.. code-block:: r

   library(somelib)
   source("mylib")

   n <- 1000
   result <- c()

   for( i in seq(1,n) ){
       result <- append(result, simulation(i))
   }

   write(result, file="output")

You obviously need to parallelize the loop somehow. Every intermediate result
can be calculated independently, so a natural idea is to have ``simulation(i)``
run on ``n`` nodes in parallel. The only tricky part is to split the loop and
then merge the result variables, for instance by writing each result to a
separate file. You could remove the loop and create a new file
**simulation_loop_body.R** containing only the loop's body:

.. code-block:: r

   library(somelib)
   source("mylib")

   i <- as.integer( commandArgs(TRUE)[1] )
   result <- simulation(i)
   write(result, file=paste0("output_", i))

We are reading the loop index as a parameter to the script with ``commandArgs()``.
The external loop can then be constructed with a job array (see :ref:`job_array`).

.. code-block:: bash

   echo 'Rscript simulation_loop_body.R $PBS_ARRAYID' | qsub -N simulation -d. -t 1-1000

This will create 1000 jobs with ``$PBS_ARRAYID`` holding the current index
that will be passed to the R script shown above.
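
Once all array jobs have finished, the per-job output files need to be merged
again. A minimal sketch of such a collect step, assuming each ``simulation(i)``
returned numeric values written to the ``output_`` files above:

.. code-block:: r

   # collect the per-job results back into a single vector
   n <- 1000
   result <- unlist(lapply(seq(1,n), function(i) {
       scan(paste0("output_", i), quiet=TRUE)
   }))
   write(result, file="output")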

Parallel Vector Operations
--------------------------

You can also use the multi-core nodes of the tardis to run a single
multi-core aware job with the ``parallel`` package. You *have to make sure*
that the number of cores you request matches the number you register within
the script.

If you submit a job with :samp:`-l nodes=1:ppn=16`, you can safely use 16
cores in the script.

Example: :samp:`qsub -I -l nodes=1:ppn=16`

.. code-block:: r

    library(parallel)
    library(gmp)

    # use the number of cores allocated by PBS, fall back to 1
    cores <- Sys.getenv("PBS_NUM_PPN")
    cores <- if (cores == '') 1 else as.numeric(cores)

    # 16 copies of the Mersenne number 2^521 - 1 to factorize
    data <- matrix(as.bigz(2**521)-1, 16)
    mclapply(data, factorize, mc.cores=cores)

Timings:

.. code-block:: r

    > system.time(mclapply(data, factorize, mc.cores=1))
       user  system elapsed
     99.578   0.000  99.788

    > system.time(mclapply(data, factorize, mc.cores=cores))
       user  system elapsed
    107.442   0.052   7.20

Alternatively you can use ``%dopar%`` and ``foreach``:

.. code-block:: r

    library(doMC)
    library(gmp)

    cores <- Sys.getenv("PBS_NUM_PPN")
    cores <- if (cores == '') 1 else as.numeric(cores)
    # register the allocated cores with the doMC parallel backend
    registerDoMC(cores)

    foreach(i=seq(1,16)) %dopar% {
        factorize(matrix(as.bigz(2**521)-1))
    }


If you want to unroll the vector operation to the whole cluster, have a look
at the `batchjobs`_ package.
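
A minimal sketch using `batchjobs`_, reusing the ``simulation()`` function
from the loop example above (the registry id and the 1000 indices are
placeholders):

.. code-block:: r

   library(BatchJobs)
   source("mylib")

   # the registry is a directory that tracks all jobs and their results
   reg <- makeRegistry(id="simulation")
   # one job per loop index, equivalent to the job array above
   batchMap(reg, simulation, seq(1,1000))
   submitJobs(reg)

   # later, once all jobs have finished:
   waitForJobs(reg)
   result <- unlist(loadResults(reg))
   write(result, file="output")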

.. _batchjobs: https://cran.r-project.org/web/packages/BatchJobs/index.html