Build a Beowulf cluster without disks to optimize cost and reliability, and simplify software maintenance.
The FedoraLiveCD Project allows anyone to create a custom bootable CD or PXE boot image with little effort. For large HPC systems, this greatly simplifies the creation of diskless compute nodes, leading to higher reliability and lower costs when designing your cluster environment. The network and CPU overhead for a diskless setup are minimal, and the compute nodes will run entirely from an initial ramdisk, so they will exhibit very good I/O for normal OS disk operations.
The cluster I've designed is set up for MPI-based computation. The master node runs a queue system where jobs can be submitted and farmed out to the compute nodes to run within the allotted resources. Because my compute nodes are diskless, the goal is to produce a simple, streamlined operating system with only the libraries and utilities needed for the nodes to interact with the master job scheduler. Software needed by jobs (such as the MPI libraries) can be shared via NFS from the master node. The compute nodes simply have a kernel and the basic libraries needed to start a job. User account information can be shared via a local LDAP service running on the master node or by any method you already have available in your environment.
To prepare a diskless cluster, your master node will need some amount of reasonably fast local disk storage and at least 10/100 Ethernet, preferably gigabit Ethernet. Your diskless nodes will need Ethernet hardware that can PXE boot from a network interface; most modern hardware supports this. These nodes will need to be on the same physical subnet, or you will have to configure your dhcpd service to respond to or relay requests between subnets. Your diskless nodes also should have enough physical memory (RAM) to hold the OS image plus enough room to run your programs; a few gigabytes of RAM should be sufficient if you keep your OS image simple.
For the rest of this article, I assume your cluster is based on a Red Hat-derived distribution, because the livecd-tools utilities are Fedora-specific. I'm going to demonstrate an environment where all of the cluster nodes communicate with the master on a private Ethernet subnet.
Your boot server needs to run just two services for diskless booting: DHCP and TFTP. dnsmasq can stand in for both, but I demonstrate separate DHCP and TFTP services because that's how I set up my own cluster. For convenience, you may choose to install bind or another DNS server to make communication between nodes friendlier. A local rpm repository, shared via Apache or another Web service, is also a convenient way to deploy custom rpm files quickly.
First, install the DHCP server, TFTP server and syslinux packages via yum:
yum -y install dhcp tftp-server syslinux
Create the file /etc/dhcpd.conf. In it, you need to define your subnet and a pxeclients class that simply locates the bootable pxelinux image on disk. You also need a host definition for each diskless node that associates the node's bootable MAC address with the static IP you assign to it. I also chose to include the host-name option, so that my diskless hosts will know a name other than localhost.localdomain once they are booted.
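A minimal dhcpd.conf along these lines gives the general idea; the subnet, addresses and MAC shown here are only examples and should match your own network:

ddns-update-style none;
allow booting;
allow bootp;

subnet 192.168.1.0 netmask 255.255.255.0 {
    option routers 192.168.1.1;
    option subnet-mask 255.255.255.0;
}

class "pxeclients" {
    match if substring(option vendor-class-identifier, 0, 9) = "PXEClient";
    next-server 192.168.1.1;
    filename "pxelinux.0";
}

host node01 {
    hardware ethernet 00:16:3E:00:10:01;
    fixed-address 192.168.1.101;
    option host-name "node01";
}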
Next, you need to enable the TFTP dæmon. Red Hat systems launch TFTP via xinetd; I simply set disable = no in the /etc/xinetd.d/tftp config file and started xinetd. If you have multiple network interfaces on your master node, you can bind TFTP to one of them by using the bind directive in that file.
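The relevant parts of /etc/xinetd.d/tftp end up looking roughly like this; the bind address shown is an example for a master whose internal interface is 192.168.1.1:

service tftp
{
    socket_type     = dgram
    protocol        = udp
    wait            = yes
    user            = root
    server          = /usr/sbin/in.tftpd
    server_args     = -s /tftpboot
    disable         = no
    bind            = 192.168.1.1
}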
Once configured, both services should be added to the default runlevel and started:
chkconfig dhcpd on
chkconfig xinetd on
service dhcpd start
service xinetd start
Now for the fun part—creating the OS image. RPMForge hosts a version of the livecd-tools package, which can be installed via yum:
yum install livecd-tools
The live CD tools require a Red Hat kickstart file. Templates can be found via Google and as part of the livecd-tools package, and anaconda generates a template kickstart on any freshly installed system as /root/anaconda-ks.cfg in root's home directory.
Of particular interest here are the %packages and the %post sections. In %packages, you can choose exactly which programs you need or want installed on the initial ramdisk image and available to the OS at boot. I recommend choosing as little as you can in order to keep the initrd small and streamlined. In %post, you can add any shell commands you need in order to customize your compute nodes further—for example, by editing config files for needed services. The example kickstart provided here works with a RHEL- or CentOS 5.5-based distribution.
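As a rough illustration, the pieces you're most likely to tailor look something like this; the package list and post commands here are only a sketch, not the full example kickstart:

network --device eth0 --bootproto dhcp

%packages --excludedocs
@core
kernel
openssh-server
openssh-clients
nfs-utils
ntp
%end

%post
# start only what the nodes actually need
chkconfig sshd on
chkconfig ntpd on
%end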
If you review my example kickstart file, you'll notice that I've specified DHCP as the boot protocol for the network on each of the compute nodes. Because the dhcpd service already knows about the Ethernet MAC address of my diskless compute nodes, the nodes simply will re-request an IP address during boot and be reassigned the same one. Remember that no unique information is stored on the node's OS image, so using DHCP is the easiest way to assign IPs to each diskless node.
One special situation to note: because the compute nodes are diskless, each time SSH starts on a node, it generates a new set of host keys. When the node reboots, it generates yet another set, leading to an impossible-to-maintain situation for SSH users. To solve this, I generated a template set of host keys that I deploy to each of my diskless compute nodes via an rpm file. To build your own version of this rpm, you need to create a spec file (see the example) and copy the host keys from /etc/ssh to the location specified by BuildRoot in the spec file. The rpmbuild command generates the rpm, which can then be placed in a local yum repository and pulled into the image by adding its name to the %packages section of your kickstart:
rpmbuild -bb sshkeys.spec
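A skeleton of sshkeys.spec might look like the following; the names, version and key file list are illustrative, and the keys themselves must already be copied under the BuildRoot path before building:

Name: sshkeys
Version: 1.0
Release: 1
Summary: Common SSH host keys for the diskless compute nodes
Group: System Environment/Base
License: GPL
BuildArch: noarch
BuildRoot: /var/tmp/%{name}-buildroot

%description
Installs a pre-generated set of SSH host keys so that every diskless
node presents the same host identity.

%files
%attr(0600,root,root) /etc/ssh/ssh_host_rsa_key
%attr(0644,root,root) /etc/ssh/ssh_host_rsa_key.pub
%attr(0600,root,root) /etc/ssh/ssh_host_dsa_key
%attr(0644,root,root) /etc/ssh/ssh_host_dsa_key.pub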
By setting up SSH with the same host key on each node, I've defeated some of the security of SSH by allowing the possibility of man-in-the-middle attacks between my master node and compute nodes. However, in my cluster environment where compute nodes communicate on a private and dedicated channel and do not have a direct connection to the outside world, this shouldn't be a problem.
Another idea that might simplify your SSH environment is to enable host-based SSH authentication, so users don't have to generate private and public keys while on your cluster. The root account is deliberately hardened against host-based authentication, so you'll either have to work around that security measure or set up SSH public/private key pairs for the root account on your new cluster. Normal users should have no problems with host-based SSH authentication, as long as UIDs are consistent across the entire cluster.
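If you do go the host-based route, the OpenSSH pieces involved are roughly these; treat this as a sketch and adjust the host list for your own cluster:

# in /etc/ssh/sshd_config on every node
HostbasedAuthentication yes

# in /etc/ssh/ssh_config on every node
HostbasedAuthentication yes
EnableSSHKeysign yes

# then list the cluster hostnames in /etc/ssh/shosts.equiv and make
# sure each host's public key appears in /etc/ssh/ssh_known_hosts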
Once your kickstart has been customized to your liking, the rest of the setup is simple. Just run the livecd-creator script to generate an ISO image, then use the livecd-iso-to-pxeboot script to convert it into something TFTP can serve.
When compiling the OS image, some active dæmons may interfere with the build process. In particular, SELinux must be permissive or disabled, and if you use the name service cache dæmon (nscd), you may need to disable it temporarily while the build runs or risk a corrupted image:
setenforce 0
service nscd stop
livecd-creator --config=nodes-ks.cfg --fslabel=Compute_nodes
livecd-iso-to-pxeboot Compute_nodes.iso
rsync -av tftpboot/ /tftpboot/
service nscd start
I've chosen to write all of this into a handy shell script that creates the image and cleans up any temporary files for me.
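A wrapper along these lines will do the job; the cleanup at the end and the re-enabling of SELinux are just one way to handle it:

#!/bin/bash
# rebuild the compute-node OS image and publish it to the TFTP root
setenforce 0
service nscd stop
livecd-creator --config=nodes-ks.cfg --fslabel=Compute_nodes
livecd-iso-to-pxeboot Compute_nodes.iso
rsync -av tftpboot/ /tftpboot/
service nscd start
# re-enable SELinux enforcement if you normally run enforcing
setenforce 1
# remove the intermediate ISO and the PXE staging directory
rm -rf Compute_nodes.iso tftpboot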
Once the files have been copied to tftpboot, it's time to boot a compute node. If all goes well, the diskless client will request a DHCP address, and your DHCP server will respond with an IP and the location of the TFTP server and image to download. The client then should connect to the TFTP server, download the image and launch the OS you just created.
Problems with the PXE boot process can be diagnosed by using any network protocol analyzer, such as Wireshark. Once the image is loaded and the kernel is alive, you should see the normal boot process on the screen of the diskless compute node.
As noted before, specialized user-level software (such as the MPI libraries in my case) can be distributed to your nodes via standard NFS shares. On your NFS server (it can be the same as your master node), simply define a new share in /etc/exports and enable NFS:
chkconfig nfs on
service nfs start
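For example, a read-only export of a shared software tree and a read-write export for home directories could be defined in /etc/exports like this, where the paths and subnet are examples; run exportfs -ra after editing the file:

/opt/cluster  192.168.1.0/255.255.255.0(ro)
/home         192.168.1.0/255.255.255.0(rw,sync)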
Your nodes then need to mount the share, either through an entry in their local fstab or by some other method, such as autofs.
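Because any changes made on a running node vanish at reboot, the natural place to add such an entry is the kickstart's %post section; an fstab line along these lines will do, where the server name, paths and mount options are examples:

master:/opt/cluster   /opt/cluster   nfs   ro,hard,intr   0 0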
User home directories can be shared via NFS or via a high-performance, cluster-based filesystem, such as PVFS2 or Lustre. NFS is reliable when disk I/O is light or mostly read-only, but it breaks down when your code performs heavy, simultaneous I/O or works with very large numbers of files.
Please keep in mind that any customizations of the environment on the diskless nodes are not maintained between reboots. In fact, it's perfectly okay to cold-reset a diskless node; the OS image cannot be corrupted the way it could be on a local disk. This simplifies troubleshooting strange node problems: if a reboot doesn't clear the problem (and no other diskless nodes show the same symptom), it's almost certainly a hardware fault. This alone can save hours of time when working with a large cluster.