Kinect Hacking

The Kinect is an accessory for Microsoft's Xbox game console. It contains an array of microphones, an active-sensing depth camera using structured light, and a color camera. The Kinect is intended to be used as a controller-free game controller, tracking the body or bodies of one or more players in its field of view.

The motivation for this project was to convert the Kinect into a 3D camera by combining the depth and color image streams received from the device, and projecting them back out into 3D space in such a way that real 3D objects inside the cameras' field of view are recreated virtually, at their proper sizes (see Figures 1 and 2).

Figure 1: Video of a user interacting with a pre-recorded life-size "holographic" avatar of himself in the KeckCAVES virtual reality environment.
Figure 2: Video showing the Augmented Reality Sandbox, which uses a Kinect camera to capture a 3D model of the surface of sand in a sandbox, and a projector to project a dynamic topographic map back onto the sand.

Kinect Sensors

The Kinect contains a regular color camera, sending images of 640*480 pixels 30 times a second. It also contains an active-sensing depth camera using a structured light approach (using what appears to be an infrared LED laser and a micromirror array), which also sends (depth) images of 640*480 pixels 30 times a second (although it appears that not every pixel is sampled on every frame).

What Makes The Kinect Special?

It is important to understand the difference between 3D cameras like the Kinect on one hand, regular (2D) cameras on the other hand, and so-called "3D cameras" -- actually, stereoscopic 2D cameras -- on the third hand (ouch).

Kinect vs Regular 2D Camera

Any camera, 2D or otherwise, works by projecting 3D objects (or people...), which you can think of as collections of 3D points in 3D space, onto a 2D imaging plane (the picture) along straight lines going through the camera's optical center point (the lens). Normally, once 3D objects are projected to a 2D plane that way, it is impossible to go back and reconstruct the original 3D objects. While each pixel in a 2D image defines a line from that pixel through the lens back out into 3D space, and while the original 3D point that generated the pixel must lie somewhere on that line, the distance that 3D point "traveled" along its line is lost in projection. There are approaches to estimate that distance for many pixels in an image by using multiple images or good old guesswork, but they have their limitations.

A 3D camera like a Kinect provides the missing bit of information necessary for 3D reconstruction. For each 2D pixel on the image plane, it not only records that pixel's color, i.e., the color of the original 3D point, but also that 3D point's distance along its projection line. There are multiple technologies to sense this depth information, but the details are not really relevant. The important part is that now, by knowing a 2D pixel's projection line and a distance along that projection line, it is possible to project each pixel back out into 3D space, which effectively reconstructs the originally captured 3D object(s). This reconstruction, which can only contain one side of an object (the one facing the camera), creates a so-called facade. By combining facades from multiple calibrated 3D cameras, one can even generate more complete 3D reconstructions.
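The reprojection step described above can be sketched in a few lines of C++. This is a minimal sketch assuming an ideal pinhole camera without lens distortion, not the project's actual code. The names Intrinsics, Point3, and unproject are hypothetical, and the distance is assumed to be already converted to metric units along the pixel's projection line (raw Kinect depth values would need a per-device calibration first).

```cpp
#include <cmath>

// Hypothetical pinhole camera parameters; real values would come from
// a per-device calibration, not from this sketch.
struct Intrinsics {
    double fx, fy; // focal lengths, in pixels
    double cx, cy; // principal point (where the optical axis hits the image), in pixels
};

struct Point3 {
    double x, y, z;
};

// Projects pixel (u, v) back out into 3D camera space, given the measured
// distance of the original 3D point along the pixel's projection line.
Point3 unproject(const Intrinsics& k, double u, double v, double distance) {
    // Direction of the line from the optical center through the pixel
    double dx = (u - k.cx) / k.fx;
    double dy = (v - k.cy) / k.fy;
    double dz = 1.0;

    // Scale the direction so that the resulting point lies exactly
    // `distance` away from the optical center
    double length = std::sqrt(dx * dx + dy * dy + dz * dz);
    double scale = distance / length;
    return Point3{dx * scale, dy * scale, dz * scale};
}
```

With made-up intrinsics such as Intrinsics{580.0, 580.0, 320.0, 240.0}, the center pixel (320, 240) at a distance of 2 maps to the point (0, 0, 2) straight ahead of the camera. Applying the same function to every pixel of a depth image, and coloring each resulting point from the matching color pixel, yields the facade.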

Kinect vs So-Called "3D Camera"

There exist stereoscopic cameras on the market, which are usually advertised as "3D cameras." This is somewhat misleading. A stereoscopic camera, which can typically be recognized by having two lenses next to each other, does not capture 3D images, but rather two 2D images from slightly different viewpoints. If these two images are shown to a viewer, where the viewer's left eye sees the image captured through the left lens, and the right eye the other one, the viewer's brain will merge the so-called stereo pair into a full 3D image. The main difference is that the actual 3D reconstruction does not happen in the camera, but in the viewer's brain. As a result, images captured from these cameras are "fixed." Since they are not really 3D, they can only be viewed from the exact viewpoint from which they were originally taken. Real 3D pictures, on the other hand, can be viewed from any viewpoint, since that simply involves rendering the reconstructed 3D objects using a different perspective.

While it is possible to convert stereo pairs into true 3D images using computer vision approaches (so-called depth-from-stereo methods), those do not work very well in practice.

Project Goals

The goal of this project was to develop the software necessary to connect an unmodified, off-the-shelf Kinect device to a regular computer, and use it as a 3D camera for a variety of 3D graphics and virtual reality applications. The software is implemented as a set of applications based on the Vrui VR toolkit, and additionally as a Vrui vislet to facilitate using the 3D video stream received from a Kinect with all existing Vrui VR applications.

Project Details

The software is composed of several classes wrapping aspects of the underlying libusb library into an exception-safe C++ framework, classes encapsulating control of the Kinect's tilt motor and color and depth cameras, and a class encapsulating the operations necessary to reproject a combined depth and color video stream into 3D space. It also contains several utility applications, including a simple calibration utility.

This software is based on the reverse engineering work of Hector Martin Cantero (marcan42 on Twitter and YouTube). I didn't use any of his code, only the "magic incantations" that need to be sent to the Kinect to enable the cameras and start streaming. Those incantations were essential, because I don't own an Xbox myself, so I couldn't snoop its USB protocol. Thanks Hector!

The Kinect driver code and the 3D reconstruction code are entirely written from scratch in C++, using my own Vrui VR toolkit for 3D rendering management and interaction.

Kinect Models

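As of 12/14/2013, there exist four different models of the Kinect: model 1414, Kinect-for-Xbox 1473, Kinect-for-Windows 1517, and the second-generation Kinect-for-Xbox-One.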

The main difference between model 1414 on the one hand, and models 1473 and 1517 on the other hand, is the arrangement of the Kinect's sub-devices (camera, microphone array, tilt motor, internal USB hub). In model 1414, the camera was the "main" device, and had the Kinect's serial number attached to it. In models 1473 and 1517, the serial number is now attached to the microphone device, and the camera device's serial number is bogus.

The second-generation Kinect-for-Xbox-One, or Kinect v2 for short, is a completely new device that has very little in common with the first-generation Kinect-for-Xbox-360. Most importantly, its 3D sensing capabilities are no longer based on stereo reconstruction from a pattern of projected laser dots, but on measuring the time-of-flight of photons from an emitter on the device, to surfaces in the environment, and back to the device's IR camera. It does this in a very clever way that's worthwhile writing up in detail at some unspecified point in the future.

Library Troubles

The rearrangement of sub-devices in models 1473 and 1517 caused major trouble for my Kinect package, because it uses device serial numbers to associate intrinsic calibration parameters with camera devices. Each camera's physical layout is ever so slightly different, requiring different calibration parameters for each individual device. This was easy to handle with model 1414, because after opening a camera device on the user's request, the software could query the camera's serial number, and then load the appropriate calibration data. With the new models, this is no longer the case. Now, after opening a camera device, the software somehow has to figure out which microphone device is in the same enclosure, get that microphone device's serial number, and then load camera calibration data.

If there's only one Kinect connected to the computer, that's easy. But if there are multiple Kinects on the same computer, it's not so easy. The libusb USB library presents all devices connected to all buses as a flat list, without representing the bus topology, i.e., which devices are plugged into which hubs, etc. Fortunately, there is a fork of the USB library, libusbx, which added API calls to query bus topology. This way, if given a camera device, the software can find the camera device's parent, which is the Kinect's internal hub, and then find the single microphone device that's connected to the same hub.
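The sibling lookup described above can be illustrated with a small, hardware-free sketch. It is an assumption-laden model, not the package's actual code: the UsbNode struct, the Kind enum, and the enclosureSerial function are hypothetical, and the device tree is a plain vector so the example runs without a Kinect attached. With a topology-aware libusb, the parent of a device would instead be obtained from the library's libusb_get_parent call.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the kinds of USB devices found in a Kinect
enum class Kind { Hub, Camera, Microphone, Motor };

// One entry of a (hypothetical) USB device tree
struct UsbNode {
    Kind kind;
    int parent;         // index of the parent hub in the device list, -1 if none
    std::string serial; // serial number reported by the device
};

// Returns the serial number of the microphone device that shares a parent
// hub with the given camera device, or an empty string if there is none.
std::string enclosureSerial(const std::vector<UsbNode>& devices, std::size_t camera) {
    int hub = devices[camera].parent;
    if (hub < 0) {
        return ""; // camera is not behind a hub, so no sibling can be found
    }
    for (const UsbNode& device : devices) {
        if (device.kind == Kind::Microphone && device.parent == hub) {
            return device.serial;
        }
    }
    return "";
}
```

The returned serial number would then serve as the key for loading the camera's calibration data. Because the search is restricted to the camera's own parent hub, it stays unambiguous even when several Kinects are connected to the same computer.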

The practical upshot is that support for Kinect-for-Xbox 1473 and Kinect-for-Windows 1517 relies on the libusbx library being installed on the host system, to provide the topology query calls. Alas, not all Linux distributions ship libusbx by default. It is of course possible to install libusbx manually, but the libusbx developers decided to use the same library name as libusb, i.e., the libusbx library file is called libusb-1.0. It probably seemed like a good idea at the time, but it's a real headache when having to separate a local install of libusbx from a system-wide install of libusb.

The USB library binding is handled by the underlying Vrui package, which attempts to auto-detect the presence of topology query calls in the libusb library. However, auto-detection can fail when the system-wide libusb library does not have topology calls, but a locally-installed one does. In those cases, a user will have to edit the "BuildRoot/Packages.System" file in Vrui's build directory, find the definitions for LIBUSB1, manually set the base directory for libusbx (e.g., /usr/local), and then manually set the include and library directories (e.g., /usr/local/include and /usr/local/lib, respectively). It might also be necessary to set up an rpath link flag pointing to the directory containing the libusbx library file. After making these changes, Vrui will have to be rebuilt from scratch, starting with "make clean".


Pages In This Section

Movies showing 3D video streams from the Kinect, and how they can be integrated into other 3D graphics or VR software.
Download page for the current and several older releases of the Kinect 3D Video Capture project, released under the GNU General Public License.

