How to Use Parallel Processing in R

What is parallel processing?

Parallel processing is a computational method that uses multiple processors, or 'cores', to solve computationally intensive problems much faster than is achievable on a single core. By breaking a problem down into subsections and solving those simultaneously, parallel processing can drastically reduce the runtime of certain programs. However, some programs cannot benefit from parallel computing.

What programs can I use parallel processing with?

Not all programs are capable of using parallel computing. For instance, any program in which each step relies on the results of the previous steps cannot take advantage of it, because the steps have to run in order. Parallel computing is most useful for programs that run many calculations independently of each other, so that the work can be divided evenly among the available cores. To use parallel processing, a program must first be "parallelized" so that its work can be divided among the allocated cores.
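As a small illustration of the distinction above, the sketch below contrasts independent work with step-dependent work in plain R. The variable names and the recurrence are hypothetical examples chosen for illustration only; nothing here runs in parallel yet.

```r
# Independent work: each input is handled without reference to the others,
# so these calls could be handed to different cores.
inputs <- 1:6
squares <- sapply(inputs, function(x) x^2)

# Dependent work: each value is computed from the previous one,
# so the steps have to run one after another.
n_steps <- 6
x <- numeric(n_steps)
x[1] <- 0.5
for (step in 2:n_steps) {
  x[step] <- 3.2 * x[step - 1] * (1 - x[step - 1])
}
```

The first computation is the kind that Rslurm or doParallel can split up; the second cannot be divided this way because no step can start before the previous one finishes.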

For R programs, there are two main ways to do this: Rslurm and doParallel. Rslurm is recommended for most tasks, but both allow for parallel processing of R code.

How do I use Rslurm?

Rslurm is an R library (loaded in R as rslurm) that allows users to run R jobs through the Slurm Workload Manager. As a result, Rslurm allows you to manage your R jobs on the Carleton Research Users Group (CRUG) cluster. It is designed for parallel processing and will only work with parallelized programs. One advantage of Rslurm is that it uses the Slurm manager, which means that users don't need to reserve cluster space beforehand and can instead add their jobs to a queue. A job runs once it reaches the front of the queue and enough cores are available for it. For more information on this process, read the CRUG documentation.

To use Rslurm, you must first create a function and a data frame of parameter values that the function can be applied to. In the data frame, each column must correspond to a specific parameter of the function, and each row must correspond to a separate function call. Rslurm works by dividing the data frame into multiple segments and then running the function on those segments simultaneously on multiple cores. This is done using the slurm_apply(f, params) command, where f is the function and params is the data frame. The slurm_apply command returns a slurm job object, which can be passed to other commands later. slurm_apply can take other parameters as well, including the number of cores to use.
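The workflow above might look roughly like the following sketch. The function sim_mean, its parameter values, and the job name are hypothetical. nodes and cpus_per_node are slurm_apply arguments that set how many nodes and how many cores per node are requested, and get_slurm_out and cleanup_files are rslurm functions for collecting results and removing temporary files. The submission is wrapped in a check so the script also runs on a machine without Slurm; on the cluster you would normally just call library(rslurm).

```r
# Hypothetical function: one call per row of the parameter data frame
sim_mean <- function(mu, sigma) {
  draws <- rnorm(1000, mean = mu, sd = sigma)
  c(sample_mean = mean(draws), sample_sd = sd(draws))
}

# Column names match the function's parameter names; each row is one call
params <- data.frame(mu = 1:10, sigma = seq(0.1, 1, length.out = 10))

# Try a single row locally before submitting anything
local_check <- do.call(sim_mean, as.list(params[1, ]))

# Submit only where rslurm and the Slurm 'sbatch' command are available
if (requireNamespace("rslurm", quietly = TRUE) && nzchar(Sys.which("sbatch"))) {
  sjob <- rslurm::slurm_apply(sim_mean, params,
                              jobname = "sim_mean_demo",
                              nodes = 2, cpus_per_node = 2)
  # Waits for the job to finish, then combines the results into a table
  results <- rslurm::get_slurm_out(sjob, outtype = "table")
  # Remove the temporary files the job created
  rslurm::cleanup_files(sjob)
}
```

Testing one row locally first is a cheap way to catch errors in the function before the job waits in the queue.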

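To see the current status of a job, use the print_job_status(slr_job) command, where slr_job is a slurm job object returned by slurm_apply. In recent releases of the rslurm package, the equivalent function is get_job_status(slr_job).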

A single call to a function can also be run using the slurm_call(f, params) command, where params is a named list of the function's arguments.
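A minimal sketch of a single submitted call and a status check is shown below. The function fit_line, its arguments, and the job name are hypothetical. The sketch assumes a recent rslurm release in which get_job_status is available; the submission is again guarded so the script runs on machines without Slurm.

```r
# Hypothetical function: fit a straight line to simulated data
fit_line <- function(n, slope) {
  x <- seq_len(n)
  y <- slope * x + rnorm(n)
  coef(lm(y ~ x))
}

# Arguments for the single call, as a named list
call_args <- list(n = 500, slope = 2)

# Check the call locally before submitting it
local_fit <- do.call(fit_line, call_args)

if (requireNamespace("rslurm", quietly = TRUE) && nzchar(Sys.which("sbatch"))) {
  sjob <- rslurm::slurm_call(fit_line, params = call_args,
                             jobname = "fit_line_demo")
  # Report whether the job is queued, running, or completed
  # (assumes a release of rslurm that provides get_job_status)
  print(rslurm::get_job_status(sjob))
  # Wait for the job and fetch the function's return value
  result <- rslurm::get_slurm_out(sjob)
  rslurm::cleanup_files(sjob)
}
```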

For another explanation of how to use Rslurm, documentation of all relevant Rslurm functions, and examples of code you can use, please visit this link.

How do I use doParallel?

doParallel allows you to run some jobs in R over multiple processing cores in CRUG. doParallel is a “parallel back-end” for the foreach R package, so it lets you execute foreach loops in parallel. doParallel will not work for R jobs that are not written as foreach loops; if you need to run such jobs in parallel, refer to the Rslurm section of this page.

You MUST have both doParallel and foreach loaded in your code to use their functionality! However, loading the doParallel library is not enough on its own. You must call the package's registerDoParallel function before trying to use doParallel with your job, specifying either the number of cores or the cluster you would like to use (these are doParallel's workers). For example, registerDoParallel(4) would cause your job to run over four cores, and registerDoParallel(c1) would run the job over a previously defined cluster c1 (for example, one created with makeCluster from the parallel package). If you do not give registerDoParallel an argument, the result depends on your operating system: on Windows, you automatically get three workers, while Unix-like systems give you a number of workers equal to about half your total available cores.
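The registration step described above can be sketched as follows. The loop body and the choice of two workers are hypothetical; the cluster is created with makeCluster from the parallel package and released with stopCluster when the work is done. Calling registerDoParallel(2) instead of passing a cluster would be the core-count variant.

```r
library(foreach)
library(doParallel)

# Create a cluster of two workers and register it as the foreach back-end
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# %dopar% sends the iterations to the registered workers;
# .combine = c collects the results into a single vector
squares <- foreach(i = 1:8, .combine = c) %dopar% {
  i^2
}

# Number of workers currently registered
n_workers <- getDoParWorkers()

# Release the workers when finished
parallel::stopCluster(cl)
```

If no back-end has been registered, foreach runs %dopar% loops sequentially and issues a warning, which is an easy way to notice that registerDoParallel was skipped.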

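You may also set or change the number of cores after calling registerDoParallel by using the options command: options(cores = 2) would change your job to run on two cores. This setting takes effect when registerDoParallel was called without an argument on a Unix-like system; a core count or cluster passed to registerDoParallel takes precedence. You can check how many workers your job is currently using with getDoParWorkers().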
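A short sketch of changing and checking the worker count is given below. It assumes a Unix-like system and a call to registerDoParallel with no argument; on Windows, where a three-worker cluster is created instead, the cores option may not change the count. The variable names are hypothetical.

```r
library(foreach)
library(doParallel)

# Register the back-end with the default number of workers
registerDoParallel()
default_workers <- getDoParWorkers()

# Ask for two cores, then check the worker count again
options(cores = 2)
workers_after <- getDoParWorkers()

# Clean up any cluster that was created implicitly (relevant on Windows)
stopImplicitCluster()
```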

For another explanation of how to use doParallel along with examples of code and documentation about the rest of the package's functionality, please visit this link.

What if I need help using parallel processing?

If you need assistance with doParallel, Rslurm or the concept of parallel processing, please contact Mike Tie.