Carleton Research Users Group Cluster

What is the CRUG Cluster?

One of the goals of Carleton’s Computational Research Users Group is to create a shared, powerful, and expandable computation cluster that is usable by as many of our users as possible and funded by grants and faculty startup funds. Our users include faculty and students from all departments, and their needs are diverse.

To meet these needs, the college used NSF grant money to purchase a large computer (56 cores, 768 GB RAM, and 19 TB disk). The system is set up as a VMware ESXi server, so that we can spin up individual, one-of-a-kind servers when needed. In practice, as much of the system as possible is dedicated to running a SLURM server and compute node. SLURM is a workload manager that allows users to submit jobs to a queue; those jobs are then sent off to be run on compute nodes. VMware gives us the flexibility of running unique Linux or Windows virtual machines, and SLURM gives us the ability to add compute nodes when additional funding is found. Note that while all of the cores and RAM are available, the majority of the disk space is set aside for redundancy.

Why do we need such a system?

Historically, our faculty have purchased research systems with funding from grants or startup funds. These systems were “siloed”, meaning that the systems were used only by one faculty member and possibly their students and were off limits to the rest of the campus. Many of these systems were idle for much of the year, which seems wasteful when you consider that users without grants or startup funding didn’t have access to large computational systems.

How do we fund and maintain such a system now and into the future?

Our answer is to leverage grant funding as much as possible and, when appropriate, to use the startup funding of new hires who have large computational needs. Faculty who have funds and are interested in the shared system are asked to test out the current system; if it meets their needs, they have the option to “buy into it.” These faculty work with our technical staff (Mike Tie, Bruce Duffy, and Randy Hoffner) to purchase new hardware in a way that expands or updates the current system while simultaneously maximizing the benefits to the new faculty member. Faculty who have “bought into the system” will have priority use on the hardware that they have purchased. For example, a faculty member who purchases a new compute node will be able to reserve that node when they need it; this guarantees that they have the resources they need to do their research when they need them (which is what the grant or startup funds are intended for). Other users will have access to all idle processor time on nodes that haven't been reserved.

How do I use the system?

The system is designed to take advantage of the SLURM workload manager. We expect that the majority of jobs will use multiple cores through some type of parallel processing; roughly 95% of our workloads appear to be embarrassingly parallel, meaning that they split into independent pieces that do not need to communicate with one another. Users “ssh” into command.dmz.carleton.edu and submit SLURM jobs through the Linux command line. An example of using SLURM can be found at https://wiki.carleton.edu/pages/viewpage.action?pageId=57837534. Some useful SLURM commands can be found at https://wiki.carleton.edu/display/carl/Useful+slurm+commands. If you are new to this, please contact our technical staff for help.
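As an illustration of the embarrassingly parallel pattern, the following Python sketch shows one common way to divide independent work items among the tasks of a SLURM job array. It assumes the job is submitted as an array job with zero-based task numbering, in which case SLURM sets the SLURM_ARRAY_TASK_ID and SLURM_ARRAY_TASK_COUNT environment variables for each task. The function name items_for_task and the input file names are hypothetical and are not part of the cluster's configuration.

```python
import os


def items_for_task(items, task_id, task_count):
    """Return the items handled by one array task (round-robin split)."""
    if task_count < 1 or not 0 <= task_id < task_count:
        raise ValueError("task_id must satisfy 0 <= task_id < task_count")
    return items[task_id::task_count]


def main():
    # Hypothetical work list: ten independent input files.
    inputs = [f"sample_{i:02d}.dat" for i in range(10)]

    # Set by SLURM for array jobs; the defaults let the script run outside SLURM.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

    for name in items_for_task(inputs, task_id, task_count):
        # Placeholder for the real computation on one input.
        print(f"task {task_id} would process {name}")


if __name__ == "__main__":
    main()
```

If such a script were submitted with an array range of 0-3 (for example via sbatch --array=0-3), each of the four tasks would work on a disjoint subset of the inputs, and no task would need to wait for another.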

What if I have a job that only needs one core or if I want to run a graphical application?

Yes, we support this too. You can run graphical jobs by X forwarding them to your desktop; Mathematica and the RStudio desktop client are good examples. However, please limit these types of jobs to one or two cores. If you need more than two cores, submit your jobs through SLURM so that the load is distributed to the compute nodes. If that isn’t an option, you can reserve a compute node via SLURM and then use it to run your graphical app; please contact our technical staff for help.

More information about X forwarding can be found at: X Session Forwarding for Windows and X Session Forwarding for OSX and Linux.

Who gets access to the system?

Faculty have priority. If the system has idle cycles, it can then be used by students working with faculty on research projects, and after that by students who want to use it for coursework.

Is there a time limit for jobs?

At the moment there is no official time limit; however, if you want to run a job that you anticipate taking longer than two weeks, please contact Mike Tie before submitting the job.

Eventually, there will be two SLURM queues: one for jobs expected to finish in 48 hours or less, and one for jobs expected to take less than two weeks. If your jobs can’t finish in two weeks, you will need to write your code so that it saves its state and can be restarted. At some point there may also be a queue for students. As the intention is to give all users access, anyone hogging or abusing the system will have their jobs terminated.
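One way to make a long computation restartable is to write a checkpoint file periodically and resume from it on the next run. The Python sketch below is a minimal illustration under stated assumptions: the state is small enough to store as JSON, the work is a simple loop, and the names save_state, load_state, and run, as well as the stop_after parameter used to simulate an interruption, are hypothetical. Real research codes will need to save whatever state their own algorithm requires.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the checkpoint atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Sum 0..n_steps-1, checkpointing periodically.

    stop_after simulates the job being killed after that many steps.
    """
    state = load_state(path)
    steps_done = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and steps_done >= stop_after:
            return state  # work since the last checkpoint is lost
        state["total"] += state["next_step"]  # placeholder for real work
        state["next_step"] += 1
        steps_done += 1
        if state["next_step"] % checkpoint_every == 0:
            save_state(path, state)
    save_state(path, state)
    return state
```

Calling run again with the same checkpoint path picks up from the last saved step, so a job that is stopped partway through repeats at most checkpoint_every steps of work.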

What if my program needs more cores, memory, disk space, or GPUs than the cluster currently has?

If you need more of the above, please find funding! We will happily add the new resources, and you will have priority use on the resources that you add to the system.

Is my data backed up?

Unfortunately, NO. You are responsible for backing up your own work! Please help fund a backup solution.

Is there regularly scheduled down time?

Yes; we reserve the last Monday of every month to patch and reboot all of the systems; it is critically important that we keep the systems patched.
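If you want to plan long jobs around the maintenance window, the date of the last Monday of a month can be computed with Python's standard library. This is only a convenience sketch; the function name last_monday is hypothetical, and the technical staff remain the authority on the actual schedule.

```python
import calendar
from datetime import date, timedelta


def last_monday(year, month):
    """Return the date of the last Monday in the given month."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    end_of_month = date(year, month, last_day)
    # date.weekday() is 0 for Monday, so it equals the days elapsed since Monday.
    return end_of_month - timedelta(days=end_of_month.weekday())
```

For example, last_monday(2019, 7) returns July 29, 2019.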

What other compute nodes are there in the cluster?

We are gradually moving over the old cluster nodes. The old cluster was funded by a Howard Hughes Medical Institute (HHMI) grant in 2012 and consists of four computers, each of which has 16 cores and 128 GB of RAM. All of these nodes will be moved to SLURM by 12/15/19.

The following is a graphical representation of how we envision the initial system will be configured:

Note as of June 27, 2019: The command and compute nodes are drastically smaller than planned, and only one of the HHMI nodes has been converted to a SLURM compute node. There is also a very large VM running, summer18.dmz.carleton.edu; this node was initially set up as a place for people to work while SLURM was being configured. The current plan is to dramatically reduce the size (RAM and core count) of summer18 on July 29, 2019 and to transition users to command.dmz.carleton.edu. summer18 will be removed by the end of 2019.

How do I reserve a system that I funded and have priority access on?

Talk with Mike Tie. He can remove your node from the list of available SLURM nodes.

How would I get additional disk space?

With help from ITS, you would purchase 5 x 10 TB drives for dtn.carleton.edu; these drives would then be exported to all the cluster nodes for your use. The data transfer node (dtn) is set up so that drives have to be installed in groups of 5; as a result, the minimum size is 50 TB. If you don’t need that much space, please share it with the rest of the cluster. As of July 27, 2019, that 50 TB costs $1700. Note that the system has some redundancy built into it, so only 40 TB will be visible to the user.
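The arithmetic behind these figures can be written out as a short Python sketch. The numbers are the ones quoted above; the function name storage_summary is hypothetical, and the assumption that cost and usable space scale linearly with the number of drive groups is ours, not a quoted price.

```python
DRIVES_PER_GROUP = 5        # drives must be installed in groups of 5
DRIVE_SIZE_TB = 10          # size of each drive
USABLE_TB_PER_GROUP = 40    # space visible to the user after redundancy
COST_PER_GROUP_USD = 1700   # price quoted as of July 27, 2019


def storage_summary(groups=1):
    """Summarize raw space, usable space, and cost for a number of drive groups.

    Assumes linear scaling with the number of groups (an assumption).
    """
    raw_tb = groups * DRIVES_PER_GROUP * DRIVE_SIZE_TB
    usable_tb = groups * USABLE_TB_PER_GROUP
    cost_usd = groups * COST_PER_GROUP_USD
    return {
        "raw_tb": raw_tb,
        "usable_tb": usable_tb,
        "cost_usd": cost_usd,
        "usd_per_usable_tb": cost_usd / usable_tb,
    }
```

For a single group this gives 50 TB raw, 40 TB usable, and $42.50 per usable TB at the quoted price.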

As stated earlier, we are not currently backing up the system, but a secondary dtn (dtn2.dmz.carleton.edu) is currently available as a potential backup solution. Just like dtn.dmz.carleton.edu, drives need to be purchased in blocks of 5; for $1700, we could add drives to dtn2 and back up data from the cluster to it. Ideally, you would keep your data in the cluster and backups on a separate server.

GPUs?

As of July 27, 2019, none of the compute nodes are equipped with GPUs. Please help us find funding.

What if I want to run a Windows application, some other version of Linux, or Mac OSX?

We can spin up small virtual machines for unique versions of Linux or for Windows apps; please contact one of our technical staff. Unfortunately, Mac OSX is not supported in a virtual environment.

How do I get software installed?

Speak with our technical staff.

Other questions?

Speak with Mike Tie.