X11 Cluster Rendering

One method to run multi-screen VR environments, or to perform multi-pipe 3D rendering in general, is to use cluster rendering. In cluster rendering, a cluster of independent computers, each with their own graphics card, work together to create a synchronized image across multiple screens. Setting up and running a rendering cluster takes a little additional effort compared to setting up a "regular" cluster because X11, itself a distributed display system, gets involved.

The basic idea behind cluster rendering is that a graphics application is started by the user on a predetermined head node; once started, this application instance replicates itself onto the render nodes using remote program execution, for example via ssh. The application instances on the render nodes will then render to their node's own graphics card, while the head node synchronizes the entire cluster. In X11 lingo this means that each application instance, after being started remotely from the head node, connects to the local X11 server for rendering. (An alternative approach, where a single application instance on the head node opens remote connections to all render nodes' X11 servers is possible, but highly inefficient.)
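Conceptually, the replication step amounts to something like the following shell commands, shown here only as an illustration with a hypothetical application name and path; Vrui performs the equivalent internally, with its own version-dependent command-line arguments for the slave instances:
ssh render1 "cd /home/user/MyVruiApp && DISPLAY=:0.0 ./bin/MyVruiApp" &
ssh render2 "cd /home/user/MyVruiApp && DISPLAY=:0.0 ./bin/MyVruiApp" &
Setting DISPLAY=:0.0 is what makes each remote instance render to its own node's local X11 server instead of tunneling images back to the head node.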

General Cluster Setup

This section describes the recommended cluster setup for Vrui applications. Other setups are possible, but the one described here minimizes administrative and software maintenance effort. The following sections describe the details of enabling cluster rendering via X11.

The "optimal" Vrui cluster consists of a set of binary compatible computers running the same version of Linux. The cluster contains one head node that users log in to and run applications, and one dedicated render node for each display in the VR environment. The render nodes have graphics cards of the same manufacturer/model to ensure consistency of the generated images; the head node has a (lower-powered) graphics card to allow user logins and a control window. Optimally, render nodes do not allow direct graphical login and run no display or window managers (see Section "Dedicated Rendering Cluster"). The cluster is connected via a dedicated broadcast-capable network, such as 1Gb/s Ethernet, with the head node connected to the outside and doubling as a firewall. The head node should have an internal name (such as head) for its IP address on the cluster network, and the render nodes should have simple names (such as render1, ..., render<n>). The /etc/hosts files on all nodes should contain the cluster-network names of all other nodes. The head node contains a large hard drive to store application data exported to the render nodes via NFS. All users' home directories are shared between the cluster, either stored on the head node's hard drive or an external file server (in which case the head node has to forward connections from the render nodes to the file server). Any tracking systems or VR input devices are connected directly to the head node.

Enabling Transparent Remote Login

Shared home directories and a shared data directory on the head node ensure that users can treat the cluster like a single multi-pipe computer for the purposes of developing and running Vrui applications. A user can log in to the head node, compile applications, and run them on the cluster directly from the command line. Since the current implementation of Vrui uses ssh to replicate applications from the head node to the render nodes, users have to set up password-less remote login from the head node to each render node. This can be achieved by generating ssh keys using the ssh-keygen program (the procedure below works for ssh protocol version 2 with the OpenSSH package):
ssh-keygen -t dsa
This creates a DSA key pair. ssh-keygen will ask for a file name for the generated key and for a pass phrase. The file name should not be changed from the provided default. The pass phrase can be used to provide additional security, but can be left empty.
cp ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys
This copies the created public key to the list of keys authorized for password-less remote login. Since the user's home directory is shared with the render nodes, the render nodes will see the proper key file when the user logs in from the head node.
Note: Storing key pairs on an NFS-mounted directory is not particularly secure; a malicious intruder might intercept NFS packets and snoop the keys. For optimal security, keys should only be stored on local hard drives (and protected by a pass phrase), and the .ssh directories in users' shared home directories should be symlinked to local per-user directories (such as /var/ssh/<username>/ssh). More information about this can be found on the web.
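As a rough sketch of that more secure setup, using a hypothetical user jdoe and the per-user directory named in the note above, one would create a node-local key directory on every cluster node and replace the shared ~/.ssh with a symbolic link to it:
# On each cluster node, as root: create a node-local, user-owned key directory.
mkdir -p /var/ssh/jdoe/ssh
chown jdoe /var/ssh/jdoe/ssh
chmod 700 /var/ssh/jdoe/ssh
# As the user: move the existing shared ~/.ssh aside and link to the local directory.
mv ~/.ssh ~/.ssh-shared-backup
ln -s /var/ssh/jdoe/ssh ~/.ssh
With this layout, the private key is generated (or copied) into the head node's local directory, and the public key is appended to authorized_keys in each render node's local directory.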
Set up ssh configuration
To improve Vrui's start-up speed, X11 forwarding should be disabled when logging into any of the render nodes. This is best controlled via ssh's configuration file, ~/.ssh/config. The file should contain a line listing the names (or a wildcard name) of all render nodes following the keyword Host, and directly afterwards a line containing the keywords ForwardX11 no. More details can be found in the man page for ssh_config. Here is an example ssh configuration file:
Host render*
ForwardX11 no
Host *
ForwardX11 yes
This sample file disables X11 forwarding for the render nodes (named render<index>), but enables X11 forwarding for any other remote host.
ssh <render node name>
This command should log the user in to the given render node, without asking for a password.
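To check all render nodes in one go (assuming four nodes named render1 through render4), a quick loop such as the following should print each node's host name without any password prompt:
for node in render1 render2 render3 render4; do
  ssh -o BatchMode=yes "$node" hostname
done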

Non-optimal Clusters

The scenario above describes the optimal Vrui cluster, but Vrui runs on a wide variety of setups. It supports heterogeneous clusters if the users' build systems are set up properly, i.e., executables/libraries for the involved architectures end up in different directories, and each node adds the proper directories to the search path on login. For example, each application's project directory could contain a single exe directory for executables, which in turn contains one directory per architecture, such as x86, x86_64, powerpc, etc., holding the executables for the respective platform. Users' login scripts then set up the search path based on the local architecture, for example by setting setenv PATH ./exe/`uname -m` (using csh). Applications can then be started from their project directories, and each cluster node will load the proper executable. The same mechanism applies to library search paths, plug-in directories, and object file directories used during the build. Of course, users have to log in to one cluster node of each architecture and build the applications for each architecture independently. Although untested, this mechanism should also work for clusters that run different operating systems, as long as the underlying philosophy and file system layout are compatible enough (such as between Linux, *BSD, Mac OS X, and other Unix-like OSs).
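For users whose login shell is bash rather than csh, an equivalent login-script fragment might look like this, assuming a hypothetical project directory ~/MyVruiApp with per-architecture exe and lib subdirectories:
# Select per-architecture executables and libraries based on the local machine type.
ARCH=`uname -m`
export PATH=$HOME/MyVruiApp/exe/$ARCH:$PATH
export LD_LIBRARY_PATH=$HOME/MyVruiApp/lib/$ARCH:$LD_LIBRARY_PATH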

Heterogeneous graphics cards in the cluster nodes are handled by Vrui itself. Well-written software queries at run-time which OpenGL extensions the local graphics card supports, and falls back to less capable rendering paths if an extension is not available. This ensures that an application will run on a heterogeneous cluster, but the displayed images might look different from node to node. Cards of different performance will cause the entire cluster to run as slowly as the slowest node, due to Vrui's cluster-wide per-frame synchronization.
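Once X11 access for remote processes has been opened as described in the following sections, a quick way to compare the render nodes' graphics capabilities from the head node is glxinfo (shown here for two example nodes):
for node in render1 render2; do
  echo "=== $node ==="
  ssh "$node" "DISPLAY=:0.0 glxinfo | grep -E 'OpenGL (vendor|renderer|version)'"
done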

Many "ad-hoc" clusters, such as users running applications on two regular computers with side-by-side monitors, do not have a dedicated cluster network or a dedicated head node or render nodes. A shared network requires that Vrui's communications component uses UDP multicast instead of UDP broadcast. This can simply be achieved by setting an IP address from the multicast region in the configuration file. Not having a head node does not require additional setup, but is not optimally supported by the current Vrui implementation. Most notably, a user will not be able to move the application's mouse cursor between screens. How to enable distributed X11 rendering on non-dedicated render nodes is described in Section "Non-dedicated Rendering Cluster."

X11 Security Problems

The problem with cluster rendering under X11 as described above is that X11 by default does not allow a remotely started application to open a connection to the local X11 server. This is for security reasons; if remote applications could connect, a remotely logged-in user could potentially "hijack" a local user's display and perform mischief such as logging keystrokes, recording passwords, etc. This security feature must be disabled before X11 cluster rendering becomes possible. There are basically two scenarios: a dedicated rendering cluster whose render nodes never allow direct graphical logins, and a non-dedicated cluster whose render nodes double as regular desktop computers; both are described in the following sections. For security (and performance) reasons, the first scenario is by far the better of the two, especially when using the subnet/firewall approach. However, sometimes reality gets in the way and the second scenario has to be used.
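The mechanism in question is the X server's host-based access control, managed with the xhost utility. Running xhost without arguments on a render node's console shows the current state; on an untouched installation it typically reports that only authorized clients may connect:
xhost
# Typical output before the setup described below:
#   access control enabled, only authorized clients can connect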

Dedicated Rendering Cluster

On a dedicated cluster, the optimal solution is to prohibit users from logging in directly by simply not running a display manager (such as gdm, kdm, etc.). By not running a window manager either, the X11 server on the render nodes can be set up to show either a static background image or an application's display window in fullscreen mode. The easiest way to achieve this is to create a new X11 server configuration. The following directions apply to Red Hat Linux (and derivatives) and Fedora Core, but should work with no or minor changes on other Linux distributions.
  1. Create a new directory /etc/X11/nologin.
  2. Inside /etc/X11/nologin, create a text file EmptyCursor with the following contents:
    #define Emptycursor_width 16
    #define Emptycursor_height 16
    static unsigned char Emptycursor_bits[] = {
       0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
       0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
       0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
    
    This file defines an invisible mouse cursor to effectively hide the X11 default mouse cursor (yes, it's that difficult).
  3. Inside /etc/X11/nologin, create an executable text file XServer with the following contents:
    #!/bin/sh
    
    # Allow X11 connections from remote processes on the local machine:
    /usr/bin/xhost +localhost
    
    # Disable the screen saver:
    /usr/bin/xset s off
    /usr/bin/xset -dpms
    
    # Hide the mouse cursor:
    /usr/bin/xsetroot -cursor /etc/X11/nologin/EmptyCursor /etc/X11/nologin/EmptyCursor
    
    # Run the X11 root application:
    exec /usr/bin/xlogo
    
This file will serve as a replacement for a display manager and will be started whenever X11 is restarted. Using xlogo as the root application will display the X11 logo across the entire screen while no application is running; this is only an example, and almost any X11 program can be used as the root application.
    The /usr/bin/xhost +localhost line tells the X11 server to allow local connections from remotely started applications, which is a prerequisite for cluster rendering. The /usr/bin/xset s off and /usr/bin/xset -dpms lines tell the X11 server to disable the screen saver and screen suspend; since dedicated render nodes usually have neither keyboards nor mice, users would only see a black screen otherwise.
  4. Modify /etc/X11/prefdm to use the replacement display manager instead of the regular one. Right before the selection process starts in the file, usually before the language file is sourced, insert the following lines:
    # Start replacement display manager to disable direct graphical login:
    exec xinit /etc/X11/nologin/XServer
    
    This ignores the rest of the file and always starts the replacement display manager.
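To activate the new configuration, restart the X11 session on the render node, for example by rebooting, or (on SysV-init based Red Hat-style systems, as root) by briefly dropping out of the graphical runlevel and back:
telinit 3
telinit 5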

Non-dedicated Rendering Cluster

If the rendering cluster's render nodes also have to serve as regular desktop computers where users can log in directly at the console, the setup becomes a little more complicated. Local X11 connections from remote processes still have to be enabled, but for security reasons, and to avoid opening a cluster application window while someone is using the computer, they have to be disabled whenever a user is logged in at the console. The best way to achieve this is to modify the display manager's script files to switch the X11 server's behavior when users log in and out of the console. The following description assumes that the display manager is gdm; the process for other display managers is probably very similar.
  1. Find the display manager's root configuration directory. This is usually either /etc/X11/gdm or /etc/gdm. The root configuration directory contains subdirectories Init, PreSession, and PostSession.
  2. The Init subdirectory contains scripts the display manager executes when it is first started. Init should contain file(s) for the local display(s) (:0, :1, ...), or only a Default file. Add the following lines to the end of either the local display file(s), or before the exit 0 line of the Default file if the local display file(s) do not exist:
    # Allow X11 connections from remote processes on the local machine:
    /usr/bin/xhost +localhost
    
    # Disable the screen saver:
    /usr/bin/xset s off
    /usr/bin/xset -dpms
    
    These lines will ensure that the X11 server is open for cluster rendering right after a restart.
  3. Add the following lines at the beginning of the PreSession/Default file, after the search path is set:
    # Prohibit X11 connections from remote processes on the local machine:
    /usr/bin/xhost -localhost
    
    # Enable the screen saver:
    /usr/bin/xset s on
    /usr/bin/xset +dpms
    
    These lines will ensure that no cluster rendering application can open a window while a user is logged in locally.
  4. Add the following lines at the end of the PostSession/Default file, before the final exit 0:
    # Allow X11 connections from remote processes on the local machine:
    /usr/bin/xhost +localhost
    
    # Disable the screen saver:
    /usr/bin/xset s off
    /usr/bin/xset -dpms
    
    These lines will re-enable cluster rendering after the local user has logged out.
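Whether the toggling works can be checked from the head node (render1 is used as an example node name): the command below should report open access while nobody is logged in at that node's console, and closed access while a desktop session is active there.
ssh render1 "DISPLAY=:0.0 xset q > /dev/null 2>&1 && echo 'X11 access open' || echo 'X11 access closed'"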

An alternative method is to have users manually log in to each render node, and issue /usr/bin/xhost +localhost directly from the node's console. This is obviously not desirable for clusters with larger numbers of nodes.