This page gives more detailed descriptions of the methods involved in tracking the Wiimote in six degrees of freedom (position and orientation).
The Wiimote's IR camera can detect and track up to four infrared light sources. The camera has a built-in image processor that analyzes the raw camera image, identifies bright spots, and computes their (x, y) positions and approximate radius on the camera's image plane. These values can be queried by the host computer as part of the Wiimote's Bluetooth report stream.
Given a custom IR beacon with four LEDs at known positions (in a non-planar arrangement, e.g., a tetrahedron), it is possible to derive the camera's -- and hence the Wiimote's -- position and orientation relative to the beacon based on the (x, y) positions of the beacon's LEDs on the camera's image plane (see Figure 1). The IR tracking beacon is described in more detail on its own page.
|Figure 1: Projection of a custom IR beacon with four LEDs in a non-planar arrangement onto the Wiimote camera's image plane. The (x, y) positions of the four LEDs depend on the (known) absolute positions of the beacon LEDs in 3D space, the (known) position of the camera's focus point relative to its image plane, and the (unknown) position and orientation of the camera in 3D space. This figure places the camera's focus point behind its image plane for clarity; in reality, the focus point is the center of the camera's lens in front of the image plane. This does not affect the projection equations.|
Intrinsic Camera Parameters
The internals of the Wiimote's IR camera, e.g., its focal length, are crucial in expressing the camera's projection equations. The following values are called the camera's intrinsic parameters:
- Pixel size
- The width and height of each pixel on the camera's sensor in physical coordinate units, e.g., millimeters.
- Focal length
- The orthogonal distance of the camera's focus point (center of its lens) from the image plane in physical coordinate units.
- Center of projection
- The position of the orthogonal projection of the camera's focus point onto its image plane in physical coordinate units. This 2D coordinate can be combined with the camera's focal length to express the focus point position as a 3D point.
For more precise computations, one can also consider the camera's radial distortion and the skew angle between the image sensor's pixel rows and columns, but the Wiimote's built-in IR camera seems to already compensate for those.
The intrinsic camera parameters have to be measured carefully, because they have a large influence on the projection equations used for 6-DOF tracking. This is normally achieved by recording the measured (x, y) positions of an IR beacon from multiple known positions and orientations, and finding the set of parameters that best describe the measurements. This process usually has to be performed only once, since the intrinsic camera parameters do not change when the camera is moved.
Now, in practice, it is extremely difficult to measure the focal length, center of projection, and pixel size in any physical units without direct access to the camera's internals. However, when looking at the projection equations, it turns out that these values only appear as ratios of one another. That means one can arbitrarily set, say, the pixel size to (1.0, 1.0) (assuming square pixels), and express the focal length and projection center in units of pixels instead of in physical units. One set of values that seems to work quite well is pixel size = (1.0, 1.0), focal length = 1280, center of projection = (512, 384). These values might be different for different Wii controllers.
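The claim that the intrinsic parameters only enter the projection as ratios can be checked with a short sketch. The "physical" parameter values below (8 µm pixels, 10.24 mm focal length) are made-up, chosen so that the focal-length-to-pixel-size ratio is exactly 1280:

```python
# Pin-hole projection of a camera-space point (ax, ay, az) with pixel size
# (psx, psy), focal length fl, and center of projection (cx, cy); this is the
# same form as the projection equations derived below.
def project(a, psx, psy, fl, cx, cy):
    ax, ay, az = a
    return (ax * fl / (ay * psx) + cx, az * fl / (ay * psy) + cy)

point = (25.0, 500.0, -40.0)  # an arbitrary point in front of the camera

# Hypothetical "physical" parameters: 8 um pixels, 10.24 mm focal length:
phys = project(point, 0.008, 0.008, 10.24, 512.0, 384.0)
# Normalized parameters: pixel size (1.0, 1.0), focal length 1280 pixels:
norm = project(point, 1.0, 1.0, 1280.0, 512.0, 384.0)
# Both sets have fl/psx = fl/psy = 1280, so the projections are identical.
```

Since only the ratios fl/psx and fl/psy matter, both parameter sets yield the same pixel coordinates, which is why pixel size can be fixed at (1.0, 1.0) without loss of generality.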
Extrinsic Camera Parameters
While the intrinsic camera parameters describe the camera's internal configuration and do not change during tracking, the extrinsic parameters describe how the entire camera system is positioned in 3D space. In a tracking application, the extrinsic parameters are unknown and derived by solving the camera's projection equations based on the observed positions of a known arrangement of LEDs. Since the camera (and the Wiimote it is attached to) moves as a rigid body in space, its extrinsic parameters can be described as a 3D position (three degrees of freedom) and a 3D orientation (another three degrees of freedom). For convenience, we define the Wiimote's coordinate system such that its origin coincides with the camera's focus point (the center of the lens), its x axis points right, its y axis points forward, and its z axis points up. Due to the way the camera is affixed to the Wiimote, this implies that the camera's image plane is spanned by the x and z axes, with the horizontal pixel direction corresponding to x, and the vertical pixel direction corresponding to z.
While a 3D position can trivially be represented as an (x, y, z) position in some coordinate system, 3D orientations are not so obvious. There are three widely used representations for 3D orientations, each with their own advantages and drawbacks.
- Angle triplets
- A 3D orientation can be described as three consecutive rotations around three given axes, usually following the aircraft system of using azimuth (yaw), elevation (pitch), and roll. While angle triplets -- often referred to as "Euler angles" -- match the three degrees of freedom of 3D orientations, they are inefficient for point transformations as they require multiple evaluations of trigonometric functions, are difficult to manipulate, and have ambiguity problems ("gimbal lock").
- Unit quaternions
- Quaternions are a four-dimensional analogue of complex numbers and are represented as 4-tuples (x, y, z, w). By coincidence(?), unit quaternions, i.e., quaternions where x*x + y*y + z*z + w*w = 1, are equivalent to 3D rotations, and 3D orientations are equivalent to 3D rotations applied to a known initial orientation (the identity orientation). In other words, every unit quaternion represents a unique 3D orientation, and each 3D orientation is represented by exactly two unit quaternions, which are negatives of each other. The relationship between unit quaternions and 3D rotations is much more "linear" than for angle triplets, and there is no gimbal lock. The main advantage of quaternions is that their arithmetic closely matches 3D rotations, i.e., the quaternion corresponding to a concatenation of two 3D rotations is the product of the two rotations' quaternions. Quaternions are almost as efficient for point transformations as 3x3 matrices, and there is only one additional constraint between a unit quaternion's four parameters, namely the unit length formula shown previously.
- Orthogonal 3x3 matrices
- 3D rotations are a subclass of 3D linear transformations, which in turn are equivalent to 3x3 matrices. Hence, each 3D rotation can be expressed as a 3x3 matrix. Furthermore, 3D rotations are exactly equivalent to the subclass of orthogonal 3x3 matrices, i.e., matrices where all column vectors have unit length and are pairwise orthogonal to each other. As with quaternions, concatenation of 3D rotations is equivalent to matrix multiplication. Furthermore, point transformations are most efficiently expressed as a product between a matrix and a vector. The main drawback of using 3x3 matrices to represent 3D orientations is that a 3x3 matrix has nine parameters, and orthogonal 3x3 matrices have six additional constraints between those parameters, namely the unit length and orthogonality constraints mentioned above.
Of the three alternatives, unit quaternions offer the best compromise between efficiency of representation and number of unknown parameters/additional constraints. As a result, the optimal representation for a camera's extrinsic parameters is a position p = (px, py, pz) and an orientation represented as a unit quaternion O = (ox, oy, oz, ow) with ox*ox + oy*oy + oz*oz + ow*ow = 1.
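The correspondence between quaternion multiplication and rotation concatenation can be illustrated with a minimal sketch (the (x, y, z, w) component order matches the convention used here; the two sample rotations are arbitrary 90-degree rotations):

```python
import math

def qmul(q1, q2):
    """Quaternion product; corresponds to concatenating the two rotations."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return (w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2,
            w1*w2 - x1*x2 - y1*y2 - z1*z2)

def qrotate(q, v):
    """Rotate 3D point v by unit quaternion q via q * v * q^-1."""
    x, y, z, w = q
    r = qmul(q, (v[0], v[1], v[2], 0.0))
    px, py, pz, _ = qmul(r, (-x, -y, -z, w))
    return (px, py, pz)

# 90-degree rotations about the z axis and the x axis:
qz = (0.0, 0.0, math.sin(math.pi/4), math.cos(math.pi/4))
qx = (math.sin(math.pi/4), 0.0, 0.0, math.cos(math.pi/4))

# Applying qz first and then qx gives the same result as rotating once by
# the product qmul(qx, qz):
v = (1.0, 0.0, 0.0)
step_by_step = qrotate(qx, qrotate(qz, v))
combined = qrotate(qmul(qx, qz), v)
```

Note the order in the product: rotating by qz first and qx second corresponds to the quaternion qmul(qx, qz), mirroring matrix multiplication.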
Given a point a = (ax, ay, az) in a local coordinate system with origin p = (px, py, pz) and orientation O = (ox, oy, oz, ow) with ox*ox + oy*oy + oz*oz + ow*ow = 1, the point's position in the global coordinate system a'' can be computed as follows:
- Orientation transformation: a' = (ax', ay', az'), where ax' = rz*oy - ry*oz + rw*ox + rx*ow, ay' = rx*oz - rz*ox + rw*oy + ry*ow, and az' = ry*ox - rx*oy + rw*oz + rz*ow, with rx = oy*az - oz*ay + ow*ax, ry = oz*ax - ox*az + ow*ay, rz = ox*ay - oy*ax + ow*az, and rw = ox*ax + oy*ay + oz*az.
- Position transformation: a'' = (ax'', ay'', az'') where ax'' = ax' + px, ay'' = ay' + py, and az'' = az' + pz.
To transform a point from global coordinates into a local coordinate system defined by a position p and an orientation O, one first computes a position transformation by -p = (-px, -py, -pz), and then an orientation transformation by O-1 = (ox, oy, oz, -ow) using the above formulae.
Projection Equations
Given a camera's intrinsic and extrinsic parameters as defined above, the (x, y) position of a 3D point's projection onto the camera's image plane can be expressed by first transforming the 3D point into the camera's local coordinate system, and then projecting it onto the image plane from the camera's focus point. In other words, one first transforms the point a in global coordinates (ax, ay, az) to the point a'' in the camera's local coordinate system using the inverse point transformation formulae from the previous section, and then projects onto the image plane assuming that the focus point is at the local origin, and the image plane is spanned by the local coordinate system's x and z axes (according to the convention defined above). Given a 3D point a'' = (ax'', ay'', az'') in camera coordinates, camera pixel size s = (sx, sy), and focus point f = (fx, fy, fz), the projection's horizontal and vertical pixel coordinates (x, y) are x = (ax''*fz)/(ay''*sx) + fx and y = (az''*fz)/(ay''*sy) + fy.
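As a minimal sketch, the transformation and projection formulas above translate directly into code (the sample pose and the intrinsic values in the usage line are illustrative, with the intrinsics taken from the hedged estimates given earlier):

```python
def to_global(a, p, o):
    # Local -> global: orientation transformation followed by position
    # transformation, exactly as in the formulas above.
    ax, ay, az = a
    ox, oy, oz, ow = o
    rx = oy*az - oz*ay + ow*ax
    ry = oz*ax - ox*az + ow*ay
    rz = ox*ay - oy*ax + ow*az
    rw = ox*ax + oy*ay + oz*az
    a1x = rz*oy - ry*oz + rw*ox + rx*ow
    a1y = rx*oz - rz*ox + rw*oy + ry*ow
    a1z = ry*ox - rx*oy + rw*oz + rz*ow
    return (a1x + p[0], a1y + p[1], a1z + p[2])

def to_local(a, p, o):
    # Global -> local: translate by -p, then rotate by O^-1 = (ox, oy, oz, -ow).
    shifted = (a[0] - p[0], a[1] - p[1], a[2] - p[2])
    return to_global(shifted, (0.0, 0.0, 0.0), (o[0], o[1], o[2], -o[3]))

def project(a_cam, ps, f):
    # Pin-hole projection of a camera-space point; ps = (psx, psy) is the
    # pixel size, and f = (fx, fy, fz) is the focus point, i.e., the center
    # of projection (fx, fy) combined with the focal length fz.
    ax, ay, az = a_cam
    return ((ax*f[2]) / (ay*ps[0]) + f[0], (az*f[2]) / (ay*ps[1]) + f[1])

# A point straight ahead of the camera projects to the center of projection:
center = project((0.0, 400.0, 0.0), (1.0, 1.0), (512.0, 384.0, 1280.0))
# center == (512.0, 384.0)
```

Transforming a point to local coordinates and back to global coordinates (or vice versa) is the identity, which makes a convenient sanity check for an implementation.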
Solving for Extrinsic Parameters
The projection equations imply that each pair of a 3D point at a known position in global coordinates and its projection onto the image plane defines two non-linear equations in seven unknowns, the camera's extrinsic parameters. If four 3D points and their projections are known, this results in nine non-linear equations for seven unknowns (eight from the four point pairs, and one from the quaternion's unit-length condition). In principle, it would be sufficient to know the projected positions of three known 3D points (leading to seven equations for seven unknowns); however, three points in 3D space are always planar, and the resulting system is very unstable. The full (overdetermined) system of nine non-linear equations can be solved using a variety of methods; one approach that works very well in practice is based on a Levenberg-Marquardt minimization method. The basic idea is, given an estimate of the camera's extrinsic parameters, to compute the residual distance between the predicted projected point positions and the actual positions reported by the camera, and then to minimize that residual iteratively. The benefit of the Levenberg-Marquardt method is that it converges from very poor initial estimates, and, being an iterative method, it performs very well for tracking applications where the unknown parameters change slowly over time.
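The Levenberg-Marquardt approach can be sketched as follows. This is not the actual implementation: the beacon geometry, the synthetic test pose, and the simple damping schedule are all illustrative assumptions, and the Jacobian is approximated by finite differences for brevity.

```python
import numpy as np

# Made-up non-planar beacon geometry (millimeters, roughly 500 mm in front
# of the origin along y), plus the hedged intrinsic estimates from above:
BEACON = np.array([[-50.0, 500.0, -30.0],
                   [ 50.0, 500.0, -30.0],
                   [  0.0, 540.0,  50.0],
                   [  0.0, 460.0,  20.0]])
PIXEL_SIZE = (1.0, 1.0)
FOCUS = (512.0, 384.0, 1280.0)  # (center x, center y, focal length) in pixels

def rotate(v, q):
    # Rotate vector v by unit quaternion q = (ox, oy, oz, ow), following the
    # orientation transformation formulas given earlier.
    ox, oy, oz, ow = q
    rx = oy*v[2] - oz*v[1] + ow*v[0]
    ry = oz*v[0] - ox*v[2] + ow*v[1]
    rz = ox*v[1] - oy*v[0] + ow*v[2]
    rw = ox*v[0] + oy*v[1] + oz*v[2]
    return np.array([rz*oy - ry*oz + rw*ox + rx*ow,
                     rx*oz - rz*ox + rw*oy + ry*ow,
                     ry*ox - rx*oy + rw*oz + rz*ow])

def predict(theta):
    # Predicted LED projections for extrinsics theta = (px, py, pz, ox, oy, oz, ow).
    p, q = theta[:3], theta[3:]
    qinv = np.array([q[0], q[1], q[2], -q[3]])
    out = []
    for a in BEACON:
        c = rotate(a - p, qinv)  # global -> camera coordinates
        out.append([c[0]*FOCUS[2]/(c[1]*PIXEL_SIZE[0]) + FOCUS[0],
                    c[2]*FOCUS[2]/(c[1]*PIXEL_SIZE[1]) + FOCUS[1]])
    return np.array(out)

def residuals(theta, observed):
    # Eight reprojection residuals plus the unit-quaternion constraint.
    q = theta[3:]
    return np.append((predict(theta) - observed).ravel(), q @ q - 1.0)

def solve_pose(observed, theta0, iters=100):
    # Minimal Levenberg-Marquardt loop with a finite-difference Jacobian.
    theta, lam = theta0.astype(float), 1e-3
    cost = np.sum(residuals(theta, observed)**2)
    for _ in range(iters):
        r = residuals(theta, observed)
        J = np.empty((r.size, theta.size))
        for j in range(theta.size):
            d = np.zeros_like(theta)
            d[j] = 1e-6
            J[:, j] = (residuals(theta + d, observed) - r) / 1e-6
        A = J.T @ J
        step = np.linalg.solve(A + lam*np.diag(np.diag(A))
                               + 1e-12*np.eye(theta.size), -J.T @ r)
        new_cost = np.sum(residuals(theta + step, observed)**2)
        if new_cost < cost:   # accept the step, reduce damping
            theta, cost, lam = theta + step, new_cost, lam*0.3
        else:                 # reject the step, increase damping
            lam *= 10.0
    return theta, cost

# Synthetic test: camera at (10, -20, 5), rotated 0.2 rad about the z axis.
theta_true = np.array([10.0, -20.0, 5.0, 0.0, 0.0, np.sin(0.1), np.cos(0.1)])
observed = predict(theta_true)
theta_est, cost = solve_pose(observed, np.array([0., 0., 0., 0., 0., 0., 1.]))
```

With noise-free synthetic observations, the loop recovers the camera position essentially exactly (the quaternion only up to sign, since q and -q represent the same orientation).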
One problem with the described approach to derive the camera's extrinsic parameters is that it relies on an association between the target points in 3D space and their projections onto the camera's image plane. However, the camera only reports up to four bright spots it detects in its image, with no particular order and no association to the LEDs on the IR beacon. This means that the first step in solving the non-linear equations is to determine which bright spot on the camera corresponds to which LED. The brute-force approach would consider any permutation of associations and take the one that yields the smallest residual value; due to the Levenberg-Marquardt method's computational complexity, however, this approach would require a very fast computer to generate results at real-time rates.
The currently implemented approach uses a tracking method to maintain point matches while all four LEDs are detected by the camera -- the new position and orientation are predicted based on the current estimate of linear and angular velocities, and predicted target point projections are matched with camera observations on a nearest-neighbor basis -- and uses the Wiimote's linear accelerometer measurements to create an initial match if not all four LEDs were visible in the previous frame, or no good match could be found. An improved tracking method could use the linear accelerometer measurements to compute better estimates of linear and angular velocity to better match LEDs and observations across frames.
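The matching step itself is cheap once predicted projections are available: unlike running a full Levenberg-Marquardt solve per association, comparing predicted and observed spot positions over all 4! = 24 permutations costs almost nothing. A brute-force sketch (the pixel coordinates below are made-up values):

```python
import itertools

def match_spots(predicted, observed):
    """Associate unordered camera spots with LEDs: try all permutations and
    keep the one with the smallest total squared pixel distance. Returns a
    tuple m with m[i] = index of the observed spot matching LED i."""
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(len(observed))):
        cost = sum((predicted[i][0] - observed[j][0])**2 +
                   (predicted[i][1] - observed[j][1])**2
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

# Predicted LED projections (made-up pixel values) and the same spots as the
# camera might report them: in scrambled order, with a little noise:
pred = [(400.0, 300.0), (600.0, 310.0), (510.0, 450.0), (505.0, 200.0)]
spots = [(509.0, 449.0), (401.0, 301.0), (504.0, 201.0), (601.0, 309.0)]
matching = match_spots(pred, spots)
```

When the pose prediction is good, the minimizing permutation is exactly the nearest-neighbor assignment described above; the permutation search only matters when two predictions land close together.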