Rocks GPU Clusters

Introducing Rocks GPU clusters!


This week I would like to talk about Rocks and how great it is for building clusters. Not only is it good for computational clusters or grid endpoints, but it is also a highly scalable and extensible way to deploy, manage and upgrade clusters.

As you may know already, some of the fastest computers in the world are cluster computers.

Let’s start by giving a brief definition of what a cluster is: a cluster is a computer system comprising two or more computers ("nodes") connected by a high-speed network. Because work and services can be spread across nodes, a cluster can achieve higher availability, reliability and scalability than an individual computer. With the increasing adoption of GPUs in high-performance computing (HPC), NVIDIA GPUs are becoming part of some of the world’s most powerful supercomputers and clusters.

By now you might be wondering how Rocks can actually help you with this, so…

What’s that Rocks thing again?

Rocks is an open-source distribution of Linux (nowadays based on CentOS) that uses a system of Python-based scripts and a MySQL database to manage a cluster of nodes. As I mentioned before, it enables end users to easily build computational clusters, grid endpoints and tiled-display visualization walls. It includes many tools (such as MPI) which are not part of CentOS but are integral components that turn a group of computers into a cluster.

Thanks to its extensibility, installations can be customized with additional software packages at install time by using special user-supplied CDs (called "Roll CDs"). These "Rolls" extend the system by integrating seamlessly and automatically into the management and packaging mechanisms of the base software, which greatly simplifies the installation and configuration of large numbers of computers. Since the system is designed to manage every aspect of cluster administration, hundreds of researchers and enthusiasts from around the world have used Rocks to deploy their own clusters. You can get the latest release and learn more about the distribution here: http://www.rocksclusters.org. The Rocks cluster management software is maintained independently of the chosen Linux installation.

The starting point for any Rocks cluster is the initial installation of the head node. Rocks uses a system of templates written in XML to dynamically generate RHEL/CentOS kickstart installation scripts for each cluster node type, and it manages DHCP, DNS and PXE for the cluster.
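As a rough illustration, here is a minimal sketch of what one of these XML node files might look like. The package name and post-install command are assumptions chosen for illustration, following the general shape of Rocks node files (such as an extend-compute.xml kept in the front end's site-profiles tree), not something taken from this article:

```xml
<?xml version="1.0" standalone="no"?>
<kickstart>

  <!-- Extra package to install on every compute node (illustrative) -->
  <package>emacs</package>

  <post>
    <!-- Shell commands folded into the generated kickstart's post-install phase -->
    logger "site-specific compute node setup ran"
  </post>

</kickstart>
```

When the node installs, Rocks merges fragments like this into the kickstart file it generates, so the customization survives node reinstalls.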

*A note for our sysadmin friends:* it is important to note that none of the typical DHCP, DNS, PXE or Kickstart files should be managed by editing them directly. Rocks provides a system of ".local" files for adding manual entries.
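For example, rather than hand-editing a generated DNS zone file (which Rocks would overwrite), an extra record can live in a ".local" fragment that Rocks preserves when it regenerates the configuration. The hostname and address below are made up for illustration:

```
; Illustrative contents of a ".local" DNS fragment for the cluster's domain
; (this record is merged into the generated zone file by Rocks)
storage01    A    10.1.255.250
```

The same idea applies to the other generated services: put your manual additions in the corresponding ".local" file and let Rocks fold them in.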

GPU Clu… wait, what!?

Yes, my fellow reader, there are multiple motivating reasons for building a GPU-based research cluster. Some of these are:

  • Get a feel for production systems and performance estimates

  • Port your applications to GPUs and distributed computing (using CUDA-aware MPI for example)

  • Tune GPU and CPU load balancing for your application

  • Use the cluster as a development platform

  • Early experience means increased readiness

  • The investment is relatively small for a research prototype cluster

But as with everything in this binary universe, start with the very basics we should (Yoda style), which means choosing the right hardware for the task. I will divide the process into two steps for an easier understanding of what composes a node.

  1. Node hardware details. This is the specification of the machine (node) for your cluster. Each node has the following components:

  • A CPU from any vendor (AMD, Intel…)

  • A motherboard with the following PCI Express connections:

2x PCIe x16 third-generation connections (for Tesla GPUs, for example)

1x PCIe x8 connection (if you choose to use an InfiniBand HCA card)

  • 2 available network ports

  • A minimum of 16–24 GB of DDR3 RAM (it is good to have more RAM in the system)

  • A power supply unit (PSU) with an ample power rating. The total power needed includes the power drawn by the CPU, the GPUs and the other components in the system

  • Secondary storage (HDD, or preferably SSD) sized to your needs (highly recommended)

  • GPU boards tend to be wide enough to cover two physically adjacent PCIe slots, so make sure the PCIe x16 and x8 slots are spaced far enough apart on the motherboard that you can fit at least 2 dual-width PCIe x16 GPUs and 1 PCIe x8 network card.

  2. Choose the right form factor for the GPUs. Once you decide your machine specs you should also decide which GPU models to consider for your system; the form factor of the GPU is an important consideration. NVIDIA, for example, has some great GPUs based on the Kepler microarchitecture which are quite affordable, offer outstanding performance and are very energy efficient.
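To size the power supply mentioned above, a simple approach is to add up the rated power (TDP) of each component and leave some headroom so the PSU never runs at its limit. Here is a minimal sketch in Python; the wattage figures and the headroom factor are illustrative assumptions, not vendor specifications:

```python
def psu_watts(cpu_tdp, gpu_tdp, n_gpus, other=150, headroom=1.3):
    """Estimate the PSU rating (in watts) for one cluster node.

    cpu_tdp, gpu_tdp -- rated power per part, in watts (assumed values)
    n_gpus           -- number of GPUs in the node
    other            -- rough allowance for motherboard, RAM, disks and fans
    headroom         -- safety margin so the PSU is not run at full load
    """
    total = cpu_tdp + n_gpus * gpu_tdp + other
    return int(total * headroom)

# Example: one 130 W CPU plus two 235 W GPUs (illustrative numbers)
print(psu_watts(cpu_tdp=130, gpu_tdp=235, n_gpus=2))  # → 975
```

So for this hypothetical node you would look for a PSU rated around 1000 W; adjust the inputs to match the actual parts you pick.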

Conclusion

GPU clusters are becoming very popular and are nowadays affordable enough for enthusiasts like me to build their own. Although it might seem like a lot of work, once you set yourself the goal of building one it is a very enjoyable and rewarding task. In a future article I’ll provide a more in-depth, step-by-step tutorial on how to build one from scratch, so stay tuned and until the next one!



Article by: Matias Radzinski at Bixlabs, Uruguay