How to Use Parallel Processing in R
What is parallel processing?
Parallel processing is a computational method that uses multiple processors, or 'cores', to solve computationally intensive problems faster than is achievable on a single core. By breaking a problem down into subproblems and solving those simultaneously, parallel processing can drastically reduce the runtime of certain programs.
What programs can I use parallel processing with?
Not all programs can take advantage of parallel computing. For instance, a program in which each step relies on the results of all the previous steps cannot be parallelized, because no step can start until the one before it has finished. Parallel computing is most useful for programs that run many calculations independently of each other, so that the work can be divided evenly among the available cores. To use parallel processing, a program must first be "parallelized" so that its task can be divided among the allocated cores.
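As a rough illustration of the distinction above, the following sketch uses R's built-in parallel package (chosen only for illustration; the rest of this page covers Rslurm and doParallel). The first computation is made of independent calls and can be spread across workers. The second is a recurrence in which each value needs the previous one, so it has to run in order. The worker count of 2 and the toy calculations are hypothetical.

```r
library(parallel)  # ships with base R

# Independent work: each square is computed without reference to the others,
# so the calls can be split across workers.
cl <- makeCluster(2)  # 2 workers is an arbitrary choice for this sketch
squares <- unlist(parLapply(cl, 1:8, function(i) i^2))
stopCluster(cl)

# Dependent work: each value needs the one before it, so the loop
# must run step by step and gains nothing from extra cores.
x <- numeric(8)
x[1] <- 1
for (i in 2:8) {
  x[i] <- 2 * x[i - 1] + 1
}
```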
For R programs, there are two main ways to do this: Rslurm and doParallel. We recommend Rslurm for most tasks, but both packages allow for parallel processing of R code.
How do I run RStudio on the cluster?
Start XQuartz
Open the terminal window
Run the command: ssh -Y command.dmz.carleton.edu
Sign in with your password.
Run the command: QMLSCENE_DEVICE=softwarecontext rstudio
This will probably print some errors, but RStudio should still open. If RStudio opens but the window is entirely white, check that you included the -Y flag in your ssh command. If the commands are correct but RStudio fails to open at all, contact Mike Tie for help.
How do I use Rslurm?
Rslurm is an R library that allows users to run R jobs through the Slurm Workload Manager. As a result, Rslurm allows you to manage your R jobs in the Carleton Research Users Group Cluster (or CRUG). It is designed for parallel processing and will only work with parallelized programs. One advantage of Rslurm is that it uses the Slurm manager, which means that users don't need to reserve cluster space beforehand and can instead add their jobs to a queue. A queued job starts once it is released from the queue and enough cores are available for it. For more information on this process, read the CRUG documentation.
To use Rslurm, you must first create a function and a data frame of parameter values to which the function can be applied. In the data frame, each column must correspond to a parameter of the function (the column names must match the argument names), and each row must correspond to a separate function call. Rslurm works by dividing the data frame into multiple segments and running the function on each segment simultaneously across multiple cores. This is done using the slurm_apply(f, params) command, where f is the function and params is the data frame. The slurm_apply command returns a slurm job object that can be passed to other commands later. slurm_apply can take other arguments as well, including the number of nodes used and the number of CPUs per node.
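A minimal sketch of the setup described above. The names pars, ftest, and sketch_job, the column names, and the node and CPU counts are all hypothetical. The Map call only illustrates locally what "one function call per row" means; it is not how Rslurm runs the work. The submission itself is guarded so that the sketch does nothing on a machine without the rslurm package or Slurm's sbatch command.

```r
# One row per function call; column names match ftest's argument names.
pars <- data.frame(par_m = c(0, 5, 10), par_sd = c(1, 2, 3))

# Hypothetical function: draw a sample and summarize it.
ftest <- function(par_m, par_sd) {
  samp <- rnorm(1000, par_m, par_sd)
  c(s_m = mean(samp), s_sd = sd(samp))
}

# Local illustration only: call ftest once per row of pars.
local_res <- Map(ftest, par_m = pars$par_m, par_sd = pars$par_sd)

# Submit through Slurm only where rslurm and sbatch are both available.
if (requireNamespace("rslurm", quietly = TRUE) && nzchar(Sys.which("sbatch"))) {
  sjob <- rslurm::slurm_apply(ftest, pars, jobname = "sketch_job",
                              nodes = 2, cpus_per_node = 2)
}
```

Here nodes and cpus_per_node are the optional slurm_apply arguments mentioned above; suitable values depend on the cluster and on how many rows the data frame has.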
To see the current status of your job, use the print_job_status command. To return the results of the job, use the get_slurm_out command. A single instance of a function can also be run using the slurm_call command. After a job has finished, any unnecessary temporary files can be cleaned up with the cleanup_files command. Slurm jobs are best canceled from the terminal window using Slurm's scancel command (for example, scancel --name=jobname, or scancel followed by the job ID), rather than the cancel_slurm command provided by Rslurm.
For a more detailed explanation of how to use Rslurm, notation about all relevant Rslurm functions, and examples of code you can use, please visit the Rslurm documentation.
Rslurm Example:
Here's what the example is doing at important lines of code:
1) We import the rslurm library.
4) We create a data frame pars of parameters with a thousand rows.
7) We create a function ftest that takes in the parameters stored in the data frame.
11) We run the slurm_apply command, which takes in the function ftest, the parameters, a jobname, and how many nodes we're using. It returns a slurm job object, which we pass to subsequent functions.
12) We print the job status with print_job_status, which displays as a table in the console.
13) We get the result of the job with get_slurm_out and store it as res.
15) We clean up the extra files our job created using the cleanup_files command.
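The steps above can be sketched as the following listing, laid out so that each numbered step falls on the matching line. This is a reconstruction under assumptions, not the original listing: only the names pars, ftest, and res and the functions called come from the walkthrough, while the column names, the body of ftest, the job name "test1", nodes = 2, and the variable sjob1 are hypothetical. It needs the rslurm package and access to a Slurm cluster, and newer rslurm releases may expect get_job_status in place of print_job_status, so check your installed version.

```r
library(rslurm)  # requires the rslurm package and a Slurm cluster

# Data frame of parameters: one column per argument, one row per call
pars <- data.frame(par_m = seq(-10, 10, length.out = 1000),
                   par_sd = seq(0.1, 10, length.out = 1000))
# Function applied to each row; argument names match the column names
ftest <- function(par_m, par_sd) {
  samp <- rnorm(10^6, par_m, par_sd)
  c(s_m = mean(samp), s_sd = sd(samp))
}
sjob1 <- slurm_apply(ftest, pars, jobname = "test1", nodes = 2)
print_job_status(sjob1)               # status table in the console
res <- get_slurm_out(sjob1, "table")  # collect the results as a data frame

cleanup_files(sjob1)
```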
How do I use doParallel?
doParallel allows you to run some jobs in R over multiple processing cores in CRUG. doParallel is a “parallel back-end” for the foreach R package, so it lets you execute for loops written with foreach in parallel. doParallel will not work for R jobs that are not based on for loops; if you need to run such jobs in parallel, refer to the Rslurm section of this page.
You will need to do more than simply import the doParallel library. You must call the package's registerDoParallel function before trying to use doParallel with your job, specifying either the number of cores or the cluster you would like to use (these are doParallel's workers). For example, the code registerDoParallel(4) would cause your job to run over four cores, and registerDoParallel(c1) would run the job over the previously defined cluster c1. If you do not give registerDoParallel an argument, the result depends on what computer system you're using: on Windows, you will automatically get three workers, while Unix-like systems will give you a number of workers equal to about half your total available cores.
You may also assign or change the number of cores after using registerDoParallel with the options command. For example, options(cores=2) would change your job to run on two cores. You can check how many workers your job is currently using with getDoParWorkers().
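A short sketch of the two registration forms described above, with getDoParWorkers() used to confirm the worker count after each. The worker count of 2, the variable names, and the square-root loop are hypothetical; c1 mirrors the cluster name used in the text.

```r
library(doParallel)

# Form 1: pass a worker count directly.
registerDoParallel(2)
n_direct <- getDoParWorkers()
stopImplicitCluster()  # releases the implicit cluster that form 1 creates on Windows

# Form 2: create a cluster object first, then register it.
c1 <- makeCluster(2)
registerDoParallel(c1)
n_cluster <- getDoParWorkers()

# A foreach loop with %dopar% now runs on the registered workers.
roots <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)

stopCluster(c1)
```

Registering an explicit cluster object makes it clear which workers are in use and lets you shut them down with stopCluster when the job is done.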
For another explanation of how to use doParallel along with examples of code and documentation about the rest of the package's functionality, please visit this link.
doParallel Example:
Here's what the example is doing at important lines of code:
1) We import the doParallel library
4/5) We create the R data and set a large number of trials for testing
Note: The middle part here (7-15) measures baseline performance on one core, so that we can benchmark the speed-up from parallel processing on multiple cores.
17) We create a cluster over which to run our job in parallel
18) We register the cluster with doParallel
21) We make our for loop to run R commands on the data
27) We stop the cluster when our for loop is finished
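The steps above can be sketched as the following listing, laid out so that each numbered step falls on the matching line. This is a reconstruction under assumptions, not the original listing: the iris data, the bootstrap regression inside the loops, the variable names, the trial count of 200 (kept small so the sketch runs quickly), and the worker count of 2 are all hypothetical.

```r
library(doParallel)

# Test data: two iris species, and the number of bootstrap trials
x <- iris[which(iris[, 5] != "setosa"), c(1, 5)]
trials <- 200

# Baseline: the same loop on one core, using %do% instead of %dopar%
stime <- system.time({
  r_seq <- foreach(icount(trials), .combine = cbind) %do% {
    ind <- sample(100, 100, replace = TRUE)
    fit <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
    coefficients(fit)
  }
})[3]
cat("sequential seconds:", stime, "\n")

cl <- makeCluster(2)    # worker count of 2 is an assumption
registerDoParallel(cl)

ptime <- system.time({
  r_par <- foreach(icount(trials), .combine = cbind) %dopar% {
    ind <- sample(100, 100, replace = TRUE)
    fit <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
    coefficients(fit)
  }
})[3]
stopCluster(cl)
```

Comparing stime with ptime gives a rough sense of the speed-up. With so few trials, the cost of starting the workers can outweigh the gain, so a larger trial count is needed to see a real benefit.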
What if I need help using parallel processing or I have other questions?
If you need assistance with doParallel, Rslurm, the concept of parallel processing, or other related questions, please contact Mike Tie.