What is the CRUG Cluster?
One of the goals of Carleton’s Computational Research Users Group is to create a shared, powerful, and expandable computation cluster that is usable by as many of our users as possible and funded by grants and faculty startup funds. Our users include faculty and students from all departments, and their needs are diverse.
To meet these needs, the college used NSF grant money to purchase a large computer (56 cores, 768 GB of RAM, and 19 TB of disk). The system is set up as a VMware ESXi server so that we can spin up individual, one-of-a-kind servers when needed. In practice, as much of the system as possible is dedicated to running a Slurm server and compute node. Slurm is a workload manager that allows users to submit jobs to a queue; those jobs are then sent off to run on compute nodes. VMware gives us the flexibility of running unique Linux or Windows virtual machines, and Slurm gives us the ability to add computation nodes when additional funding is found. Note that while all of the cores and RAM are available, the majority of the disk space is set aside for redundancy.
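If you want to see how the cluster presents itself through Slurm, the standard query commands below work on any Slurm installation; run them from a login session on the cluster:

    # List the partitions (queues) and nodes that Slurm knows about
    sinfo

    # Show all jobs currently waiting in the queue or running
    squeue

    # Show detailed hardware information for every compute node
    scontrol show node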
...
How do we fund and maintain such a system now and into the future?
Our answer is to leverage grant funding as much as possible and, when appropriate, to use the startup funding of new hires who have large computational needs. Faculty who have funds and are interested in the shared system are asked to test out the current system; if it meets their needs, they have the option to “buy into it.” These faculty work with our technical staff (Mike Tie, Bruce Duffy, and Randy Hoffner) to purchase new hardware in a way that expands or updates the current system while simultaneously maximizing the benefit to the new faculty member. Faculty who have “bought into the system” have priority use of the hardware they have purchased. For example, a faculty member who purchases a new compute node will be able to reserve that node when they need it; this guarantees that they have the resources to do their research when they need them (which is what the grant or startup funds are intended for). Other users will have access to all idle processor time on nodes that haven’t been reserved.
...
The system is designed to take advantage of the Slurm workload manager. We expect that the majority of jobs will take advantage of multiple cores through some type of parallel processing; roughly 95% of our needs seem to be embarrassingly parallel. Users ssh into command.dmz.carleton.edu and submit Slurm jobs through the Linux command line. An example of using Slurm can be found at https://wiki.carleton.edu/pages/viewpage.action?pageId=57837534. Some useful Slurm commands can be found at https://wiki.carleton.edu/display/carl/Useful+Slurm+commands. If you are new to this, please contact our technical staff for help.
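As a concrete illustration, here is a minimal batch script of the kind you would submit with sbatch. The job name, resource numbers, and program name are only placeholders; see the wiki pages above for examples tailored to our cluster:

    #!/bin/bash
    #SBATCH --job-name=example      # a label for your job in the queue
    #SBATCH --ntasks=1              # run a single task
    #SBATCH --cpus-per-task=4      # ask for 4 cores on one node
    #SBATCH --mem=8G                # ask for 8 GB of RAM
    #SBATCH --time=01:00:00         # wall-clock limit of 1 hour

    # Everything below runs on the compute node Slurm assigns you.
    ./my_analysis                   # replace with your own program

Save this as, say, example.sbatch, submit it with “sbatch example.sbatch”, and then “squeue” will show it waiting in the queue or running.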
...
Yes, we support this too. You can run graphical jobs by X-forwarding them to your desktop; Mathematica and the RStudio desktop client are good examples of this. However, please limit your core usage for these types of jobs to one or two cores. If you need more than two cores, then you need to submit your jobs through Slurm so that the load gets distributed to the compute nodes. If that isn’t an option, you can reserve a compute node via Slurm and then use it to run your graphical app; please contact our technical staff for help.
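For reference, a typical X-forwarding session looks roughly like the following (your desktop must be running an X server; on Windows or macOS you may need to install one first, and “username” is of course a placeholder):

    # Connect with X forwarding enabled (-Y trusts the remote display)
    ssh -Y username@command.dmz.carleton.edu

    # Launch the graphical application; its windows appear on your desktop
    mathematica &

If you have reserved a compute node through Slurm, “srun --x11 --pty bash” is one way to get an interactive shell there with X forwarding, though this depends on how the cluster’s Slurm installation is configured; ask the technical staff if it doesn’t work.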
...
Faculty have priority, but if the system has idle cycles, it can be used by students working with faculty on research projects as well. Students wanting to use CRUG for course work have last priority.
Is there a time limit for jobs?
...
Eventually, there will be two Slurm queues: one for jobs expected to finish in 48 hours or less, and one for jobs intended to take less than two weeks. If your job can’t finish in two weeks, you will need to write your code in such a way that it saves its state so that you can restart it. At some point there may also be a queue for students. Since the intention is to give all users access, anyone hogging or abusing the system will have their jobs terminated.
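One common pattern for jobs that outlive a queue’s time limit is to have Slurm warn the job shortly before it is killed, write a checkpoint, and requeue. The sketch below is one way to do that; the --signal, --requeue, and scontrol requeue features are standard Slurm, but the program name, the stop.flag convention, and the checkpoint file are made up, and your program has to do the actual state saving and restoring itself:

    #!/bin/bash
    #SBATCH --job-name=long-job
    #SBATCH --time=48:00:00         # fits in the 48-hour queue
    #SBATCH --signal=B:USR1@300     # send USR1 to this script 5 minutes before timeout
    #SBATCH --requeue               # allow the job to be put back in the queue

    # When the warning signal arrives, tell the program to checkpoint,
    # then requeue this job so it can continue in a later run.
    trap 'touch stop.flag; scontrol requeue $SLURM_JOB_ID' USR1

    # The program must watch for stop.flag, save its state to
    # checkpoint.dat, and resume from checkpoint.dat if one exists.
    ./my_long_computation --checkpoint checkpoint.dat &
    wait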
What if my program needs more cores, memory, disk space, or GPUs than the cluster currently has?
If you need more of any of the above, please find funding! We will happily add the new resources, and you will have priority use of the resources that you add to the system.
...
We are gradually moving over the old cluster nodes. The old cluster was funded by a Howard Hughes Medical Institute grant in 2012 and consists of four computers, each of which has 16 cores and 128 GB of RAM. All of these nodes will be moved to Slurm by 12/15/19.
The following is a graphical representation of how we envision the initial system being configured:
...
Note as of June 27, 2019: The command and compute nodes are drastically smaller than shown, and only one of the HHMI nodes has been converted to a Slurm compute node. There is also a very large VM running, summer18.dmz.carleton.edu; this node was initially set up as a place for people to work while Slurm was configured. The current plan is to dramatically reduce the size (RAM and core count) of summer18 on July 29, 2019 and to transition users to command.dmz.carleton.edu. summer18 will be removed by the end of 2019.
...
Talk with Mike Tie. He can remove your node from the list of available Slurm nodes.
How would I get additional disk space?
With help from ITS, you would purchase 5 x 10 TB drives for dtn.carleton.edu; these drives would then be exported to all the cluster nodes for your use. The data transfer node (dtn) is set up so that drives have to be installed in groups of five; as a result, the minimum purchase is 50 TB. If you don’t need that much space, please share it with the rest of the cluster. As of July 27, 2019, that 50 TB costs $1700. Note that the system has some redundancy built into it, so only 40 TB will be visible to the user; roughly one drive’s worth of capacity in each group of five goes to redundancy.
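Once the drives are installed and exported, you can check the space available from any cluster node with standard tools; the mount point below is a made-up example and will depend on how ITS exports the storage:

    # Show the size, used, and free space of the exported storage
    # (/research is a hypothetical mount point)
    df -h /research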
...