
MITSUBISHI ELECTRIC RESEARCH LABORATORIES

http://www.merl.com

Computer vision for computer games

W. T. Freeman, K. Tanaka, J. Ohta, K. Kyuma

TR96-35 October 1996

Abstract
The appeal of computer games may be enhanced by vision-based user inputs. The high speed
and low cost requirements for near-term, mass-market game applications make system design
challenging. The response time of the vision interface should be less than a video frame time
and the interface should cost less than 50 Dollars U.S. We meet these constraints with algorithms
tailored to particular hardware. We have developed a special detector, called the artificial retina
chip, which allows for fast, on-chip image processing. We describe two algorithms, based on
image moments and orientation histograms, which exploit the capabilities of the chip to provide
interactive response to the player’s hand or body positions at 10 msec frame time and at low-cost.
We show several possible game interactions.

2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part
without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include
the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of
the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or
republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All
rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 1996


201 Broadway, Cambridge, Massachusetts 02139
W. T. Freeman†, K. Tanaka‡, J. Ohta‡ and K. Kyuma‡

† MERL, a Mitsubishi Electric Research Lab., 201 Broadway, Cambridge, MA 02139 USA. E-mail: freeman@merl.com
‡ Mitsubishi Electric, Advanced Technology R&D Center, 8-1-1, Tsukaguchi-Honmachi, Amagasaki City, Hyogo 661, Japan
1 Introduction

Computer games are a popular consumer electronics item. The game players find it captivating to interact with games via joysticks, buttons, trackballs, or wired gloves. They may find it even more engaging to interact through natural, unencumbered hand or body motions. A computer vision-based user interface could provide these capabilities. Computer games represent a possible mass-market application for computer vision.

These applications present unique challenges. The 33 msec delay time of NTSC video is too slow; the system should respond within 10 msec or less to avoid noticeable delays. The system must be low-cost, roughly comparable with existing game user input components, which cost several tens of dollars, and it should be robust. To be successful, the vision interface should add some new dimension to the game itself.

Some features of computer games lessen the design difficulties. In most games, the user can exploit immediate visual feedback to reach the desired effect. If the player is leaning to make a turn in the game, and he (or she) sees that he isn't turning enough, he can lean more. The structure of the games provides a context which can allow dramatic, appropriate responses from simple visual measurements. The vision system may merely track the position of the visual center of mass of the player, but the game can turn that into running, jumping or crouching, depending on position and game context.

There has been much recent work on computer vision analysis of faces and gestures (e.g. [3, 12, 4, 2]). The focus of this research has been on high performance algorithms, not cost optimization. While some systems perform at 10 or 30 Hz on workstation or Pentium systems, this is still too slow and far too expensive for the game requirements described above. Near-term game applications require simpler algorithms.

Computer game applications represent a niche for high speed, low cost vision systems. We developed hardware and algorithms for a vision system aimed at this niche. Our approach has been to combine simple algorithms with a flexible and inexpensive detector/processing module.

2 The Hardware

We have developed an image detector which allows programmable on-chip processing. By analogy with the fast, low-level processing that occurs in the eye, we call the detector the artificial retina (AR) chip [10]. Figure 1 shows the elements of the AR chip: a 2-D array of variable sensitivity photodetection cells (VSPC), a random access scanner for sensitivity control, and an output multiplexer [7]. The VSPC consists of a pn photo-diode and a differential amplifier which allows for high detection sensitivity of either positive or negative polarity. This structure also realizes nondestructive readout of the image, essential for the image processing. The detector arrays can range in resolution from 32x32 to 256x256 pixels; for this paper we assume a 32x32 detector array.

The image processing of the artificial retina can be expressed as a matrix equation. In Fig. 1, the input image projected onto the chip is the weight matrix W. All VSPC's have three electrodes. A direction sensitivity electrode, connected along rows, yields the sensitivity control vector, S. The VSPC sensitivities can be set to one of (+1, 0, -1) at each row. An output electrode is connected along columns, yielding an output photocurrent which is the vector product, J = WS. The third electrode is used to reset the accumulated photo-carriers. This hardware can sense the raw image and execute simple linear operations such as local derivatives and image projections.

We have integrated this detector/processor chip into an inexpensive AR module, which contains a low-resolution (32x32) A.R. detector chip, support and interface electronics, and a 16 bit 1 MHz microprocessor. The module is 8 x 4 x 3 cm and is inexpensive enough to cost only several tens of dollars.

Figure 1: Schematic structure of artificial retina chip.
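The row-weighted column readout described above can be sketched in software. The following is a minimal numpy simulation, not the device interface: the scene plays the role of the weight matrix W, and a per-row sensitivity vector S with entries in {+1, 0, -1} selects which rows contribute to each column's photocurrent (written here as W^T S so that the output is indexed by column).

```python
import numpy as np

# Hypothetical simulation of the AR chip's sensing step.
rng = np.random.default_rng(0)
W = rng.random((32, 32))          # stand-in for the image projected on the chip

S_sum = np.ones(32)               # all rows at +1: J gives the column sums
J = W.T @ S_sum                   # photocurrent per output column
assert np.allclose(J, W.sum(axis=0))

# A +1/-1 pattern on adjacent rows yields a vertical difference (a crude
# y derivative), one of the "simple linear operations" the chip supports.
S_diff = np.zeros(32)
S_diff[10], S_diff[11] = +1.0, -1.0
J_diff = W.T @ S_diff             # row 10 minus row 11, per column
assert np.allclose(J_diff, W[10] - W[11])
```

Setting S to all ones yields exactly the vertical projection V(x) used by the moment algorithm of Section 3.1, which is why that projection comes essentially for free on the detector.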

3 Algorithms

Our goal is to infer useful information about the position, size, orientation, or configuration of the player's body or hands. We seek fast, reliable algorithms for the inexpensive AR processor module.

We have chosen two algorithms. One uses image moments to calculate an equivalent rectangle for the current image. Another uses orientation histograms to select the body pose from a menu of templates. The first exploits the image projection capabilities of the AR module; the second uses its ability to quickly calculate x and y derivatives.
3.1 Image Moments

Image moments [9, 1] provide useful summaries of global image information, and have been applied to shape analysis and other tasks, often for binary images. The moments involve sums over all pixels, and so are robust against small pixel value changes. Within the structure of a computer game, they can also provide sufficient information for the computer to reliably interpret control inputs from the user's body position. Characteristics of the AR chip allow fast calculation of these moments.

If I(x, y) is the image intensity at position (x, y), then the image moments, up to second order, are:

\[ M_{00} = \sum_x \sum_y I(x,y), \qquad M_{10} = \sum_x \sum_y x\, I(x,y), \qquad M_{01} = \sum_x \sum_y y\, I(x,y), \]
\[ M_{11} = \sum_x \sum_y xy\, I(x,y), \qquad M_{20} = \sum_x \sum_y x^2\, I(x,y), \qquad M_{02} = \sum_x \sum_y y^2\, I(x,y). \tag{1} \]

We can find the position, \(x_c, y_c\), orientation \(\theta\), and dimensions \(l_1\) and \(l_2\) of an equivalent rectangle which has the same moments as those measured in the image [9]. Those values give a measure of the hand's position, orientation, and aspect ratio. We have:

\[ x_c = \frac{M_{10}}{M_{00}}, \qquad y_c = \frac{M_{01}}{M_{00}}. \tag{2} \]

Define the intermediate variables a, b, and c:

\[ a = \frac{M_{20}}{M_{00}} - x_c^2, \qquad b = 2\left(\frac{M_{11}}{M_{00}} - x_c y_c\right), \qquad c = \frac{M_{02}}{M_{00}} - y_c^2. \tag{3} \]

We have (c.f. [9]):

\[ \theta = \frac{\arctan\!\left(b,\, a - c\right)}{2}, \tag{4} \]

and

\[ l_1 = \sqrt{\frac{(a + c) + \sqrt{b^2 + (a - c)^2}}{2}}, \qquad l_2 = \sqrt{\frac{(a + c) - \sqrt{b^2 + (a - c)^2}}{2}}. \tag{5} \]

The extracted parameters are independent of the overall image intensity.

Figure 2 shows an image of a hand, as it could appear within a "playing area" of a game machine, and the corresponding equivalent rectangle calculated for each image from the image moments. Note that the rectangle accurately reflects four variables: the x-y position of the hand, its orientation, and its width. In the context of a game, these image moment measurements provide rich opportunities for control. Figures 3 and 4 illustrate two example games based on equivalent rectangle measurements, for car racing and skateboard games.

Figure 2: 32x32 input images of user's hand, and the equivalent rectangle having the same first and second order moments as those of the image. X-Y position, orientation, and projected width are measured from the rectangle. (Projected height is also measured, but with the hand extending off the picture as shown here, height is redundant with the vertical position of the center of mass.)
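Equations (1)-(5) can be collected into a short routine. The following is a sketch, not the paper's implementation; the row/column axis convention and the use of numpy's two-argument arctan2 for eq. (4) are our assumptions.

```python
import numpy as np

def equivalent_rectangle(I):
    """Equivalent-rectangle parameters (eqs. 1-5) of a 2-D intensity image."""
    ys, xs = np.mgrid[0:I.shape[0], 0:I.shape[1]]
    M00 = I.sum()
    xc = (xs * I).sum() / M00                   # eq. (2)
    yc = (ys * I).sum() / M00
    a = (xs**2 * I).sum() / M00 - xc**2         # eq. (3)
    b = 2.0 * ((xs * ys * I).sum() / M00 - xc * yc)
    c = (ys**2 * I).sum() / M00 - yc**2
    theta = 0.5 * np.arctan2(b, a - c)          # eq. (4)
    root = np.sqrt(b**2 + (a - c)**2)
    l1 = np.sqrt(((a + c) + root) / 2.0)        # eq. (5)
    l2 = np.sqrt(((a + c) - root) / 2.0)
    return xc, yc, theta, l1, l2

# A bright axis-aligned block recovers its own center of mass:
I = np.zeros((32, 32))
I[10:20, 5:15] = 1.0
xc, yc, theta, l1, l2 = equivalent_rectangle(I)
assert abs(xc - 9.5) < 1e-6 and abs(yc - 14.5) < 1e-6
```

Because a, b, and c are normalized by M00, the returned parameters do not change if the image is scaled by a constant, matching the intensity-independence claim above.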

Figure 3: Left: 32x32 pixel input images of player's hand. Middle: a rectangle having the equivalent image moments as the input image. The x-y position, orientation, and length and width of this rectangle represent game control parameters. Right: Possible game responses to hand inputs. Car orientation and position follow the hand; the hand's apparent width controls the throttle.

Figure 4: Left: Input images for a skateboarding game. Middle: equivalent rectangle for each image (after background subtraction and clipping to positive values) with crosshairs superimposed for reference. Right: Possible game response to rectangle states. Horizontal position determines right/left steering. Jumps and crouches can be inferred from vertical position and rectangle height.
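The mapping from rectangle state to game controls described in the Figure 3 caption is simple enough to sketch. This toy function is entirely hypothetical: the gains, ranges, and normalization constants are invented for illustration, not taken from the paper.

```python
# Hypothetical mapping from equivalent-rectangle state to driving-game
# controls, in the spirit of Figure 3; all constants are invented.
def car_controls(xc, theta, width, img_w=32.0, max_width=20.0):
    steering = theta                        # car orientation follows hand angle
    position = 2.0 * xc / img_w - 1.0       # lateral position mapped to [-1, 1]
    throttle = min(width / max_width, 1.0)  # wider apparent hand -> more throttle
    return steering, position, throttle

s, p, t = car_controls(xc=16.0, theta=0.0, width=10.0)
assert (s, p, t) == (0.0, 0.0, 0.5)
```

The point of the game structure, as noted in the Introduction, is that the player closes the loop visually, so even a crude mapping like this can feel responsive.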
3.1.1 Fast Calculation of Image Moments

The image moments can be calculated quickly from three projections of the image [9], and we exploit that fact to speed up the processing with the AR chip. Let the vertical, horizontal, and diagonal projections be

\[ V(x) = \sum_y I(x,y), \tag{6} \]
\[ H(y) = \sum_x I(x,y), \tag{7} \]

and

\[ D(t) = \sum_s I\!\left(\frac{t-s}{\sqrt{2}}, \frac{t+s}{\sqrt{2}}\right). \tag{8} \]

Then the image moments are [9]:

\[ M_{00} = \sum_x V(x), \qquad M_{10} = \sum_x x\,V(x), \qquad M_{20} = \sum_x x^2\,V(x), \]
\[ M_{01} = \sum_y y\,H(y), \qquad M_{02} = \sum_y y^2\,H(y), \qquad M_{11} = \sum_t t^2 D(t) - \frac{M_{20}}{2} - \frac{M_{02}}{2}. \tag{9} \]

The horizontal and vertical image projections can be performed on the artificial retina detector, saving processor time. The savings depend on the image resolution and micro-processor speed. For the 1 MHz micro-processor of the AR module and 32x32 resolution, the calculations would take 20 msec per image on the microprocessor alone, but only 10 msec per image using the microprocessor and artificial retina chip. The on-chip processing of the artificial retina chip brings the algorithm into the range in speed and cost needed for today's games.
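The projection trick of eqs. (6)-(9) can be checked numerically. In this sketch everything is computed in software; on the AR module the V and H projections would come from the chip. The diagonal-projection step relies on the identity that, with t = (x + y)/sqrt(2), the sum of t^2 D(t) equals (M20 + 2 M11 + M02)/2, which isolates M11.

```python
import numpy as np

def moments_from_projections(I):
    """Moments up to second order from the three projections of eqs. (6)-(9)."""
    h, w = I.shape
    V = I.sum(axis=0)                        # V(x): project out y
    H = I.sum(axis=1)                        # H(y): project out x
    x, y = np.arange(w), np.arange(h)
    M00 = V.sum()
    M10, M20 = (x * V).sum(), (x**2 * V).sum()
    M01, M02 = (y * H).sum(), (y**2 * H).sum()
    # sum_t t^2 D(t) with t = (x + y)/sqrt(2), evaluated pixelwise:
    ys, xs = np.mgrid[0:h, 0:w]
    t2 = ((xs + ys) / np.sqrt(2.0))**2
    M11 = (t2 * I).sum() - M20 / 2.0 - M02 / 2.0
    return M00, M10, M01, M11, M20, M02

# Agreement with the direct double sums of eq. (1):
rng = np.random.default_rng(1)
I = rng.random((32, 32))
M00, M10, M01, M11, M20, M02 = moments_from_projections(I)
ys, xs = np.mgrid[0:32, 0:32]
assert np.isclose(M11, (xs * ys * I).sum())
```

Only M11 genuinely needs the diagonal projection; the other five moments reduce to 1-D sums over V and H, which is what makes the on-chip projections halve the per-frame time.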

3.2 Orientation histograms

The artificial retina chip can also quickly calculate horizontal and vertical derivatives, which indicate local orientation. Histograms of local orientation can be used for fast pattern recognition [11, 6]. We show that these algorithms can distinguish poses of the human figure, and present computation times for the AR module.

Others have developed algorithms to identify the pose of a figure [8, 12, 5], but those algorithms do not meet our speed and cost requirements. The orientation histogram algorithm has limitations: the arms should not cross the body, and we have not studied recognition choosing from among a very large set of possible poses. However, the algorithm operates at the rate and cost level necessary for computer games.

To use orientation histograms for recognition, we first calculate the spatial derivatives of the image, \(g_x\) and \(g_y\). At positions where the image contrast, \(g_x^2 + g_y^2\), is above a pre-set threshold, we compute the orientation from \(\theta = \arctan(g_x, g_y)\). (This can be pre-computed for all pairs \((g_x, g_y)\) with a look-up-table accessed by the micro-processor.) We then divide orientation into bins (e.g. 10° per bin) and calculate the orientation histogram over some region. Measuring the Euclidean distance [6], or mutual information [11], between the orientation histograms of different images provides a measure of similarity which is contrast and position independent.

A global orientation histogram of a figure would average too much spatial information to infer pose. We divide the image of the figure into two or four sub-images, compute the orientation histogram independently for each one, and stack the resulting two or four histograms into one large feature vector. For the images shown, we only used the orientation histograms corresponding to the upper two quadrants. We calculate a feature vector for each image, and compare the distance from the feature vector to a set of training images. These distances can be used to select the template corresponding to the nearest pose, or to interpolate arm or leg positions between those of the templates.

Figure 6 illustrates that this method can work even for the 32x32 pixel images. Four test images are shown along with the closest matching pose from several different data sets of poses. The algorithm correctly picked out the closest pose in each case. Figure 7 shows the small test set of possible poses from which the algorithm chose.

Figure 5: Three image projections determine the image moments. The horizontal and vertical projections can be performed on the artificial retina detector itself, approximately doubling the throughput.

Figure 6: Test images, and closest match to each from a set of 10 poses (Figure 7). The algorithm correctly picked out the most similar pose from each set to that of each test image.

Figure 8 shows a hypothetical game application of this coarse body position measurement: a jet flying game. The tilt of the plane in the game reflects the positions of the user's arms; some arm configurations can correspond to game commands, such as eject.

3.2.1 Fast calculation of orientation histograms

The AR module can speed up the calculation of the orientation histograms. Figure 9 shows the processing steps: image derivatives are calculated on the detector; background subtraction and orientation histograms are calculated at the microprocessor. For this algorithm, and 32x32 resolution, the expected timing is roughly the same as before: 20 msec if all calculations are performed on the micro-processor, and 10 msec if the x and y derivatives are calculated on the AR chip and the rest calculated on the microprocessor.

4 Summary

User interface for computer games represents a low-end niche in the spectrum of computer vision algorithms. The speed and cost constraints are severe, necessitating simple algorithms. Yet the interactivity and structure of the games allow for a rich response even with those simple algorithms.

We have developed an artificial retina module, which combines a detector having on-chip image processing with a microprocessor. We devised two algorithms tuned to the AR module, suitable for computer game applications. One is based on image moments, and the other on orientation histograms. These algorithms respond to the user's hand or body positions, within 10 msec, with hardware that will cost several tens of dollars. We are developing computer games with these algorithms and this hardware.
Figure 8: Left: sample input images for flying game. Middle: orientation images, from which orientation histograms are calculated. Right: corresponding game action.

Figure 7: Test images for Fig. 6. Three people made ten poses, with four poses in common. Backgrounds were removed by subtraction. That can leave a ghost of the background inside the figure, but the effects of such residuals were negligible in the recognition performance.
References

[1] D. H. Ballard and C. M. Brown, editors. Computer Vision. Prentice Hall, 1982.

[2] D. Beymer and T. Poggio. Face recognition from one example view. In Proc. 5th Intl. Conf. on Computer Vision, pages 500-507. IEEE, 1995.

[3] M. Bichsel, editor. International Workshop on Automatic Face- and Gesture-Recognition. IEEE Computer Society, 1995.

[4] M. J. Black and Y. Yacoob. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In Proc. 5th Intl. Conf. on Computer Vision, pages 374-381. IEEE, 1995.

[5] J. S. E. Hunter and R. Jain. Posture estimation in reduced-model gesture input systems. In M. Bichsel, editor, Intl. Workshop on Automatic Face- and Gesture-Recognition, pages 290-295, Zurich, Switzerland, 1995. Dept. of Computer Science, University of Zurich, CH-8057.

[6] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. In M. Bichsel, editor, Intl. Workshop on Automatic Face- and Gesture-Recognition, Zurich, Switzerland, 1995. Dept. of Computer Science, University of Zurich, CH-8057.

[7] E. Funatsu, Y. Nitta, M. Miyake, T. Toyoda, K. Hara, H. Yagi, J. Ohta, and K. Kyuma. SPIE, 2597(283), 1995.

[8] D. M. Gavrila and L. S. Davis. Towards 3-d model-based tracking and recognition of human movement: a multi-view approach. In M. Bichsel, editor, Intl. Workshop on Automatic Face- and Gesture-Recognition, pages 272-277, Zurich, Switzerland, 1995. Dept. of Computer Science, University of Zurich, CH-8057.

[9] B. K. P. Horn. Robot Vision. MIT Press, 1986.

[10] K. Kyuma, E. Lange, J. Ohta, A. Hermanns, B. Banish, and M. Oita. Nature, 372(197), 1994.

[11] R. K. McConnell. Method of and apparatus for pattern recognition. U.S. Patent No. 4,567,610, Jan. 1986.

[12] A. P. Pentland. Smart rooms. Scientific American, 274(4):68-76, 1996.
Figure 9: Processing steps in the orientation histogram algorithm. Top to bottom: Original image; x and y derivatives, calculated on the artificial retina detector; derivatives with the background image subtracted; orientation angle calculated for each pixel above a contrast threshold. Histograms of the orientations of the upper two 16x16 blocks of the orientation image were concatenated into one feature vector, which was compared with existing prototypes for recognition.
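The pipeline of Figure 9 can be sketched end to end. This is an illustrative reconstruction, not the module's firmware: the bin count, contrast threshold, simple-difference derivatives, and numpy's (y, x) arctan2 argument order are all our choices, and background subtraction is omitted.

```python
import numpy as np

def orientation_feature(I, thresh=1e-3, bins=36):
    """Feature vector from per-quadrant orientation histograms (a sketch)."""
    gx = I[:, 1:] - I[:, :-1]            # x derivative (simple difference)
    gy = I[1:, :] - I[:-1, :]            # y derivative
    gx, gy = gx[:-1, :], gy[:, :-1]      # crop both to a common shape
    h, w = gx.shape
    feats = []
    # histogram the upper two quadrants separately, as in the figures shown
    for sl in (np.s_[:h // 2, :w // 2], np.s_[:h // 2, w // 2:]):
        qx, qy = gx[sl], gy[sl]
        mask = qx**2 + qy**2 > thresh    # keep only high-contrast pixels
        ang = np.arctan2(qy[mask], qx[mask])
        hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
        feats.append(hist / max(hist.sum(), 1))   # contrast-independent
    return np.concatenate(feats)

def closest_pose(test, templates):
    """Index of the template nearest in Euclidean distance [6]."""
    return int(np.argmin([np.linalg.norm(test - t) for t in templates]))

# Horizontal vs. vertical stripes produce distinguishable features:
I1 = np.zeros((32, 32)); I1[::2, :] = 1.0
I2 = np.zeros((32, 32)); I2[:, ::2] = 1.0
f1, f2 = orientation_feature(I1), orientation_feature(I2)
assert closest_pose(f1, [f1, f2]) == 0
assert closest_pose(f2, [f1, f2]) == 1
```

Normalizing each quadrant histogram by its total count gives the contrast independence noted in Section 3.2; position independence within a quadrant follows because the histogram discards pixel coordinates.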
