This article describes the research computing cluster designed for high-performance computational research. If you are interested in high-speed networking support or research-related storage, please see this article for more information.
What is the CRUG Cluster?
One of the goals of Carleton’s Computational Research Users Group is to create a shared, powerful, and expandable computation cluster that is usable by as many of our users as possible and funded by grants and faculty startup funds. Our users include faculty and students from all departments, and their needs are diverse.
To meet these needs, the college used NSF grant money to purchase a large computer (56 cores, 768 GB of RAM, and 19 TB of disk). The system is set up as a VMware ESXi server, so that we can spin up individual, one-of-a-kind servers when needed. In practice, as much of the system as possible is dedicated to running a Slurm server and compute node. Slurm is a workload manager that allows users to submit jobs to a queue; those jobs are then sent off to run on compute nodes. VMware gives us the flexibility of running unique Linux or Windows virtual machines, and Slurm gives us the ability to add computation nodes when additional funding is found. Note that while all of the cores and RAM are available, the majority of the disk space is set aside for redundancy.
...
How do we fund and maintain such a system now and into the future?
Our answer is to leverage grant funding as much as possible and, when appropriate, to use startup funding of new hires who have large computational needs. Faculty with funds and who are interested in the shared system are asked to test out the current system; if it meets their needs, they have the option to “buy into it.” These faculty work with our technical staff (Mike Tie, Bruce Duffy, and Randy Hoffner) to purchase new hardware in a way that expands or updates the current system while simultaneously maximizing the benefits to the new faculty member. Faculty who have “bought into the system” will have priority use on the hardware that they have purchased. For example, a faculty member who purchases a new compute node will be able to reserve that node when they need it; this guarantees that they have the resources they need to do their research when they need it (which is what the grant or startup funds are intended for). Other users will have access to all idle processor time on nodes that haven't been reserved.
...
The system is designed to take advantage of the Slurm workload manager. We expect that the majority of jobs will take advantage of multiple cores through some type of parallel processing; about 95% of our needs seem to be embarrassingly parallel. Users ssh into command.dmz.carleton.edu and submit Slurm jobs through the Linux command line. If you want to use the system to handle R jobs, specific documentation can be found here. An example of using Slurm can be found at https://wiki.carleton.edu/pages/viewpage.action?pageId=57837534. Some useful Slurm commands can be found at https://wiki.carleton.edu/display/carl/Useful+Slurm+commands.
To run a job on command:
Use an sftp or scp client to upload your code and data to command.dmz.carleton.edu. (If you are looking for a graphical interface for these commands, we recommend WinSCP for Windows users and Cyberduck for Mac users.)
If your code requires a compiled executable (C, C++, Java), compile your code on command.dmz.carleton.edu.
Submit your code to the compute nodes using a command such as srun or sbatch (for examples, see https://wiki.carleton.edu/pages/viewpage.action?pageId=57837534 ).
Note: command and all of the compute nodes share the same network based file system.
Note: running commands on command.dmz.carleton.edu does eat up CPU time. Please limit your CPU usage on this machine; run your code on the compute nodes via Slurm!
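Tying the steps above together, here is a minimal sketch of a Slurm batch script. The resource values, file names, and the program line are hypothetical placeholders; adjust them to your own job. Save it as, say, example.sh and submit it with `sbatch example.sh`:

```shell
#!/bin/bash
#SBATCH --job-name=example        # name shown in the queue
#SBATCH --ntasks=1                # a single task
#SBATCH --cpus-per-task=4         # cores for that task
#SBATCH --mem=4G                  # memory for the job
#SBATCH --time=01:00:00           # wall-clock limit (HH:MM:SS)
#SBATCH --output=example_%j.out   # %j expands to the job ID

# Replace this line with your actual program.
msg="running on $(hostname) with ${SLURM_CPUS_PER_TASK:-unknown} cores"
echo "$msg"
```

You can check on the job with `squeue -u yourusername`; the job's output lands in example_<jobid>.out when it finishes.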
If you want a CRUG account or are new to this, please contact Mike Tie of our technical staff for help.
What if I have a job that only needs one core or if I want to run a graphical application?
Yes, we support this too. You can run graphical jobs by X-forwarding them to your desktop; Mathematica and the RStudio desktop client are good examples of this. However, please limit these types of jobs to one or two cores. If you need more than two cores, you need to submit your jobs through Slurm, so that the load gets distributed to the compute nodes. If that isn’t an option, you can reserve a compute node via Slurm and then use it to run your graphical app; please contact our technical staff for help.
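As a sketch (the username is a placeholder), an X-forwarded session might look like the following. The `ssh -Y` step runs on your local machine, which needs an X server of its own (for example, XQuartz on a Mac):

```shell
# From your local machine (requires a local X server):
#   ssh -Y yourusername@command.dmz.carleton.edu
# Once logged in, confirm that the X display is actually forwarded:
if [ -n "$DISPLAY" ]; then
    msg="X forwarding active on display $DISPLAY"
else
    msg="no X display; reconnect with ssh -Y"
fi
echo "$msg"
# Then launch the graphical app in the background, e.g.:
#   rstudio &
```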
...
Faculty have priority, but if the system has idle cycles, it can be used by students working with faculty on research projects. Students wanting to use CRUG for course work have last priority.
Is there a time limit for jobs?
...
Eventually, there will be two Slurm queues: one for jobs expected to finish in 48 hours or less, and one for jobs expected to take less than two weeks. If your job can’t finish in two weeks, you will need to write your code in such a way as to save its state so that you can restart it. At some point there may also be a queue for students. Because the intention is to give all users access, anyone hogging or abusing the system will have their jobs terminated.
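The save-and-restart requirement can be sketched as follows; the state file name and the loop are illustrative stand-ins for your own program's checkpointing. The job records its progress after each unit of work, so if it is killed at the time limit, resubmitting the same script resumes from the checkpoint instead of starting over:

```shell
#!/bin/bash
# Minimal checkpoint/restart sketch. checkpoint.txt and TOTAL are
# hypothetical stand-ins for your own state file and workload.
STATE=checkpoint.txt
TOTAL=100

start=0
[ -f "$STATE" ] && start=$(cat "$STATE")   # resume if a checkpoint exists

for ((i = start; i < TOTAL; i++)); do
    :   # ... one restartable unit of work goes here ...
    echo $((i + 1)) > "$STATE"             # record completed units
done

echo "finished $TOTAL units"
rm -f "$STATE"                             # clean up after a complete run
```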
What if my program needs more cores, memory, disk space, or gpus than the cluster currently has?
If you need more of any of the above, please find funding! We will happily add the new resources, and you will have priority use on the resources that you add to the system.
...
We are gradually moving over the old cluster nodes. The old cluster was funded by a Howard Hughes Medical Institute grant in 2012 and consists of four computers, each with 16 cores and 128 GB of RAM. All of these nodes will be moved to Slurm by 12/15/19.
The following is a graphical representation of how we envision the initial system will be configured:
...
Note as of June 27, 2019: The command and compute nodes are drastically smaller, and only one of the HHMI nodes has been converted to a Slurm compute node. There is also a very large VM running, summer18.dmz.carleton.edu; this node was initially set up as a place for people to work while Slurm was configured. The current plan is to dramatically reduce the size (RAM and core count) of summer18 on July 29, 2019, and to transition users to command.dmz.carleton.edu. summer18 will be removed by the end of 2019.
...
Talk with Mike Tie. He can remove your node from the list of available Slurm nodes.
How would I get additional disk space?
With help from ITS, you would purchase 5 × 10 TB drives for dtn.carleton.edu; these drives would then be exported to all of the cluster nodes for your use. The data transfer node (dtn) is set up so that drives must be installed in groups of five; as a result, the minimum size is 50 TB. If you don’t need that much space, please share it with the rest of the cluster. As of July 27, 2019, that 50 TB costs $1,700. Note that the system has some redundancy built into it, so only 40 TB of the 50 TB raw capacity will be visible to the user.
...
As of July 27, 2019, none of the compute nodes are equipped with GPUs. Please help us find funding.
What if I want to run a Windows application, some other version of Linux, or macOS?
We can set up small virtual machines for unique versions of Linux or for Windows apps; please contact one of our technical staff. Unfortunately, macOS is not supported in a virtual environment.
How do I get software installed?
Speak with our technical staff.
Other questions?
Speak with Mike Tie.