Vrbook
VIRTUAL REALITY
Steven M. LaValle
University of Illinois
13 Frontiers
13.1 Touch and Proprioception
13.2 Smell and Taste
13.3 Robotic Interfaces
13.4 Brain-Machine Interfaces
Preface

The Rebirth of Virtual Reality

Virtual reality (VR) is a powerful technology that promises to change our lives unlike any other. When our senses are artificially stimulated, our bodies are tricked into accepting another version of reality. VR is like a waking dream that could take place in a magical cartoon-like world, or could transport us to another part of the Earth or universe. It is the next step along a path that includes many familiar media, from paintings to movies to video games. We can even socialize with people inside of new worlds, where both the people and the worlds could be real or artificial.

At the same time, VR bears the stigma of unkept promises. The hype and excitement have often far exceeded the delivery of VR experiences to match them, especially for people without access to expensive laboratory equipment. This was particularly painful in the early 1990s when VR seemed poised to enter mainstream use but failed to catch on (outside of some niche markets). Decades later, we are witnessing an exciting rebirth. The latest technological components, mainly arising from the smartphone industry, have enabled high-resolution, low-cost, portable VR headsets to provide compelling VR experiences. This has mobilized leading technology companies to invest billions of US dollars into growing a VR ecosystem that includes art, entertainment, enhanced productivity, and social networks. At the same time, a new generation of technologists is entering the field with fresh ideas. Online communities of hackers and makers, along with college students around the world, are excitedly following the rapid advances in VR and are starting to shape it by founding new companies, working to improve the technology, and making new kinds of experiences.

The Intended Audience

The book is growing out of material for an overwhelmingly popular undergraduate course on VR that I introduced at the University of Illinois in 2015 (with hardware support from Oculus/Facebook). I have never in my life seen students so excited to take a course. We cannot offer enough slots to come even close to meeting the demand. Therefore, the primary target of this book is undergraduate students around the world. This book would be an ideal source for starting similar VR courses at other universities. Although most of the interested students have been computer scientists, the course at Illinois has attracted students from many disciplines, such as psychology, music, kinesiology, engineering, medicine, and economics. Students in these other fields come with the most exciting project ideas because they can see how VR has the potential to radically alter their discipline. To make the course accessible to students with such diverse backgrounds, I have made the material as self-contained as possible. There is no assumed background in software development or advanced mathematics. If prospective readers have at least written some scripts before and can remember how to multiply matrices together, they should be ready to go.

In addition to use by students who are studying VR in university courses, the book is also targeted at developers in industry, hobbyists on the forums, and researchers in academia. The book appears online so that it may serve as a convenient reference for all of these groups. To provide further assistance, there are also accompanying materials online, including lecture slides (prepared by Anna Yershova) and recorded lectures (provided online for free by NPTEL of India).

Why Am I Writing This Book?

I enjoy teaching and research, especially when I can tie the two together. I have been a professor and have taught university courses for two decades. Robotics has been my main field of expertise; however, in 2012, I started working at Oculus VR a few days after its Kickstarter campaign. I left the university and became their head scientist, working on head tracking methods, perceptual psychology, health and safety, and numerous other problems. I was struck by how many new challenges arose during that time because engineers and computer scientists (myself included) did not recognize human perception problems that were disrupting our progress. I became convinced that for VR to succeed, perceptual psychology must permeate the design of VR systems. As we tackled some of these challenges, the company rapidly grew in visibility and influence, eventually being acquired by Facebook for $2 billion in 2014. Oculus VR is largely credited with stimulating the rebirth of VR in the consumer marketplace.

I quickly returned to the University of Illinois with a new educational mission: teach a new generation of students, developers, and researchers the fundamentals of VR in a way that fuses perceptual psychology with engineering. Furthermore, this book focuses on principles that do not depend heavily on the particular technology of today. The goal is to improve the reader’s understanding of how VR systems work, what limitations they have, and what can be done to improve them. One important component is that even though technology rapidly evolves, humans who use it do not. It is therefore crucial to understand how our sensory systems function, especially when matched with artificial stimulation. The intent is to provide a useful foundation as the technology evolves. In many cases, open challenges remain. The book does not provide the solutions to them, but instead provides the background to begin researching them.
Online Materials

The entire book is posted online at:

http://vr.cs.uiuc.edu/

along with pointers to additional materials, such as lecture videos and slides.

Suggested Use

This text may be used for a one-semester course by spending roughly one week per chapter, with the exception of Chapter 3, which may require two weeks. The book can also be used to augment other courses such as computer graphics, interfaces, and game development. Selected topics may also be drawn for a short course or seminar series.

Depending on the technical level of the students, the mathematical concepts in Chapter 3 might seem too oppressive. If that is the case, students may be advised to skim over it and jump to subsequent chapters. They can understand most of the later concepts without the full mathematical details of Chapter 3. Nevertheless, understanding these concepts will enhance their comprehension throughout the book and will also make them more comfortable with programming exercises.

Lab Component

We currently use Oculus Rift and HTC Vive headsets on PCs with graphics cards that were specifically designed for VR. Development on many more platforms will soon become feasible for this course, including Samsung Gear VR, HTC Vive, Google Daydream, and Microsoft HoloLens. For software, almost all students develop VR projects using Unity 3D. Alternatives may be Unreal Engine and CryENGINE, depending on the students' level of technical coding skills. Unity 3D is the easiest because knowledge of C++ and associated low-level concepts is unnecessary. Students with strong programming and computer graphics skills may instead want to develop projects “from scratch”, but be aware that implementation times may be longer.

Acknowledgments

I am very grateful to many students and colleagues who have given me extensive feedback and advice in developing this text. It evolved over many years through the development and teaching at the University of Illinois at Urbana-Champaign (UIUC), starting in early 2015. The biggest thanks goes to Anna Yershova, who has also taught the virtual reality course at the University of Illinois and collaborated on the course development. We also worked side by side at Oculus VR from its earliest days.

I am grateful to the College of Engineering and Computer Science Department at the University of Illinois for their support of the course. Furthermore, Oculus/Facebook has generously supported the lab with headset donations. I am also grateful to the Indian Institute of Technology (IIT) Madras in Chennai, India for their hospitality and support during a short version of the course. Finally, I am extremely grateful to the hundreds of students who have served as test subjects for the course and book while it was under development. Their endless enthusiasm and questions helped shape this material.

Among many helpful colleagues, I especially thank Ian Bailey, Don Greenberg, Paul MacNeilage, Betty Mohler, Aaron Nichols, Yury Petrov, Dan Simons, and Richard Yao for their helpful insights, explanations, suggestions, feedback, and pointers to materials.

I sincerely thank the many more people who have given me corrections and comments on early drafts of this book. This includes Frank Dellaert, Blake J. Harris, Katie Mimnough, Peter Newell, Matti Pouke, Yingying (Samara) Ren, Matthew Romano, Killivalavan Solai, Karthik Srikanth, Jiang Tin, David Tranah, Ilija Vukotic, and Kan Zi Yang.

Steve LaValle
Urbana, Illinois, U.S.A.
2 S. M. LaValle: Virtual Reality
Chapter 1
Introduction
1.1 What Is Virtual Reality?

Virtual reality (VR) technology is evolving rapidly, making it precarious to define VR in terms of specific devices that may fall out of favor in a year or two. In this book, we are concerned with fundamental principles that are less sensitive to particular technologies and therefore survive the test of time. Our first challenge is to consider what VR actually means, in a way that captures the most crucial aspects in spite of rapidly changing technology. The concept must also be general enough to encompass what VR is considered today and what we envision for its future.

We start with two representative examples that employ current technologies: 1) A human having an experience of flying over virtual San Francisco by flapping his own wings (Figure 1.1); 2) a mouse running on a freely rotating ball while exploring a virtual maze that appears on a projection screen around the mouse (Figure 1.2). We want our definition of VR to be broad enough to include these examples and many more, which are coming in Section 1.2. This motivates the following.

Figure 1.1: In the Birdly experience from the Zurich University of the Arts, the user, wearing a VR headset, flaps his wings while flying over virtual San Francisco. A motion platform and fan provide additional sensory stimulation. The figure on the right shows the stimulus presented to each eye.
Figure 1.3: (a) We animals assign neurons as place cells, which fire when we return to specific locations. This figure depicts the spatial firing patterns of eight place cells in a rat brain as it runs back and forth along a winding track (figure by Stuart Layton). (b) We even have grid cells, which fire in uniform, spatially distributed patterns, apparently encoding location coordinates (figure by Torkel Hafting).

3. Artificial sensory stimulation: Through the power of engineering, one or more senses of the organism become hijacked, and their ordinary inputs are replaced by artificial stimulation.

4. Awareness: While having the experience, the organism seems unaware of the interference, thereby being “fooled” into feeling present in a virtual world. This unawareness leads to a sense of presence in an altered or another world. It is accepted as being natural.

Testing the boundaries The examples shown in Figures 1.1 and 1.2 clearly fit the definition. Anyone donning a modern VR headset and having a perceptual experience should also be included. How far does our VR definition allow one to stray from the most common examples? Perhaps listening to music through headphones should be included. What about watching a movie at a theater? Clearly, technology has been used in the form of movie projectors and audio systems to provide artificial sensory stimulation. Continuing further, what about a portrait or painting on the wall? The technology in this case involves paints and a canvas. Finally, we might even want reading a novel to be considered as VR. The technologies are writing and printing. The stimulation is visual, but does not seem as direct as a movie screen and audio system. We will not worry too much about the precise boundary of our VR definition. Good arguments could be made either way about some of these border cases. They nevertheless serve as a good point of reference for historical perspective, which is presented in Section 1.3.

Figure 1.4: A VR thought experiment: The brain in a vat, by Gilbert Harman in 1973. (Figure by Alexander Wivel.)

Who is the fool? Returning to the VR definition above, the idea of “fooling” an organism might seem fluffy or meaningless; however, this can be made surprisingly concrete using research from neurobiology. When an animal explores its environment, neural structures composed of place cells are formed that encode spatial information about its surroundings [234, 238]; see Figure 1.3(a). Each place cell is activated precisely when the organism returns to a particular location that is covered by it. Although less understood, grid cells even encode locations in a manner similar to Cartesian coordinates [222] (Figure 1.3(b)). It has been shown that these neural structures may form in an organism even while it is having a VR experience [2, 44, 112]. In other words, our brains may form place cells for places that are not real! This is a clear indication that VR is fooling our brains, at least partially. At this point, you may wonder whether reading a novel that meticulously describes an environment that does not exist will cause place cells to be generated.

We also cannot help wondering whether we are always being fooled, and some greater reality has yet to reveal itself to us. This problem has intrigued the greatest philosophers over many centuries. One of the oldest instances is the Allegory of the Cave, presented by Plato in Republic. In this, Socrates describes the perspective of people who have spent their whole lives chained to a cave wall. They face a blank wall and only see shadows projected onto the wall as people pass by. He explains that the philosopher is like one of the cave people being finally freed from the cave to see the true nature of reality, rather than only observing it through projections. This idea has been repeated and popularized throughout history, and also connects deeply with spirituality and religion. In 1641, René Descartes hypothesized the idea of an evil demon who has directed his entire effort
at deceiving humans with the illusion of the external physical world. In 1973, Gilbert Harman introduced the idea of a brain in a vat (Figure 1.4), which is a thought experiment that suggests how such an evil demon might operate. This is the basis of the 1999 movie The Matrix. In that story, machines have fooled the entire human race by connecting their brains to a convincing simulated world, while harvesting their real bodies. The lead character Neo must decide whether to face the new reality or take a memory-erasing pill that will allow him to comfortably live in the simulation.

Terminology regarding various “realities” The term virtual reality dates back to German philosopher Immanuel Kant [333], although its use did not involve technology. Its modern use was popularized by Jaron Lanier in the 1980s. Although it is already quite encompassing, several competing terms related to VR are in common use at present. The term virtual environments predates widespread usage of VR and is presently preferred by most university researchers [108]. It is typically considered to be synonymous with VR; however, we emphasize in this book that the perceived environment could be a captured “real” world just as well as a completely synthetic world. Thus, the perceived environment need not seem “virtual”. Augmented reality (AR) refers to systems in which most of the visual stimuli are propagated directly through glass or cameras to the eyes, and some additional structures appear to be superimposed onto the user’s world. The term mixed reality (MR) is sometimes used to refer to an entire spectrum that encompasses VR, AR, and normal reality. More recently, the term VR/AR/MR has even been used to refer to all forms. Telepresence refers to systems that enable users to feel like they are somewhere else in the real world; if they are able to control anything, such as a flying drone, then teleoperation is an appropriate term. For our purposes, virtual environments, AR, mixed reality, telepresence, and teleoperation will all be considered as perfect examples of VR. The most important idea of VR is that the user’s perception of reality has been altered through engineering, rather than whether the environment they believe they are in seems more “real” or “virtual”. Thus, perception engineering could be a reasonable term as well.

Unfortunately, the name virtual reality itself seems to be self-contradictory, which is a philosophical problem that was rectified in [34] by proposing the alternative term virtuality. While acknowledging all of these issues, we will nevertheless continue onward with the term virtual reality. The following distinction, however, will become important: The real world refers to the physical world that contains the user at the time of the experience, and the virtual world refers to the perceived world as part of the targeted VR experience.

Interactivity Most VR experiences involve another crucial component: interaction. Does the sensory stimulation depend on actions taken by the organism? If the answer is “no”, then the VR system is called open-loop; otherwise, it is closed-loop. In the case of closed-loop VR, the organism has partial control over the stimulation, which could vary as a result of body motions, including eyes, head, hands, or legs. Other possibilities include voice commands, heart rate, body temperature, and skin conductance (are you sweating?).

First- vs. Third-person If you are reading this book, then you most likely want to develop VR systems or experiences. Pay close attention to this next point! When a scientist designs an experiment for an organism, as shown in Figure 1.2, then the separation is clear: The laboratory subject (organism) has a first-person experience, while the scientist is a third-person observer. The scientist carefully designs the VR system as part of an experiment that will help to resolve a scientific hypothesis. For example, how does turning off a few neurons in a rat’s brain affect its navigation ability? On the other hand, when engineers or developers construct a VR system or experience, they are usually targeting themselves and people like them. They feel perfectly comfortable moving back and forth between being the “scientist” and the “lab subject” while evaluating and refining their work. As you will learn throughout this book, this is a bad idea! The creators of the experience are heavily biased by their desire for it to succeed without having to redo their work. They also know what the experience is supposed to mean or accomplish, which provides a strong bias in comparison to a fresh subject. To complicate matters further, the creator’s body will physically and mentally adapt to whatever flaws are present so that they may soon become invisible. We have seen these kinds of things before. For example, it is hard to predict how others will react to your own writing. Also, it is usually harder to proofread your own writing than that of others. In the case of VR, these effects are much stronger and yet elusive to the point that you must force yourself to pay attention to them. Take great care when hijacking the senses that you have trusted all of your life. This will most likely be uncharted territory for you.

More real than reality? How “real” should the VR experience be? It is tempting to try to make it match our physical world as closely as possible. This is referred to in Section 10.1 as the universal simulation principle: Any interaction mechanism in the real world can be simulated in VR. Our brains are most familiar with these settings, thereby making it seem most appropriate. This philosophy has dominated the video game industry at times, for example, in the development of highly realistic first-person shooter (FPS) games that are beautifully rendered on increasingly advanced graphics cards. In spite of this, understand that extremely simple, cartoon-like environments can also be effective and even preferable. Examples appear throughout history, as discussed in Section 1.3.

As a VR experience creator, think carefully about the task, goals, or desired effect you want to have on the user. You have the opportunity to make the experience better than reality. What will they be doing? Taking a math course? Experiencing a live theatrical performance? Writing software? Designing a house? Maintaining a long-distance relationship? Playing a game? Having a meditation and relaxation session? Traveling to another place on Earth, or in the universe? For each of these, think about how the realism requirements might vary. For
example, consider writing software in VR. We currently write software by typing into windows that appear on a large screen. Note that even though this is a familiar experience for many people, it was not even possible in the physical world of the 1950s. In VR, we could simulate the modern software development environment by convincing the programmer that she is sitting in front of a screen; however, this misses the point that we can create almost anything in VR. Perhaps a completely new interface will emerge that does not appear to be a screen sitting on a desk in an office. For example, the windows could be floating above a secluded beach or forest. Furthermore, imagine how a debugger could show the program execution trace.

Synthetic vs. captured Two extremes exist when constructing a virtual world as part of a VR experience. At one end, we may program a synthetic world, which is completely invented from geometric primitives and simulated physics. This is common in video games, and such virtual environments were assumed to be the main way to experience VR in earlier decades. At the other end, the world may be captured using modern imaging techniques. For viewing on a screen, the video camera has served this purpose for over a century. Capturing panoramic images and videos and then seeing them from any viewpoint in a VR system is a natural extension. In many settings, however, too much information is lost when projecting the real world onto the camera sensor. What happens when the user changes her head position and viewpoint? More information should be captured in this case. Using depth sensors and SLAM (Simultaneous Localization And Mapping) techniques, a 3D representation of the surrounding world can be captured and maintained over time as it changes. It is extremely difficult, however, to construct an accurate and reliable representation, unless the environment is explicitly engineered for such capture (for example, a motion capture studio).

As humans interact, it becomes important to track their motions, which is an important form of capture. What are their facial expressions while wearing a VR headset? Do we need to know their hand gestures? What can we infer about their emotional state? Are their eyes focused on me? Synthetic representations of ourselves called avatars enable us to interact and provide a level of anonymity, if desired in some contexts. The attentiveness or emotional state can be generated synthetically. We can also enhance our avatars by tracking the motions and other attributes of our actual bodies. A well-known problem is the uncanny valley, in which a high degree of realism has been achieved in an avatar, but its appearance makes people feel uneasy. It seems almost right, but the small differences are disturbing. There is currently no easy way to make ourselves appear to others in a VR experience exactly as we do in the real world, and in most cases, we might not want to.

Health and safety Although the degree of required realism may vary based on the tasks, one requirement remains invariant: The health and safety of the users. Unlike simpler media such as radio or television, VR has the power to overwhelm the senses and the brain, leading to fatigue or sickness. This phenomenon has been studied under the heading simulator sickness for decades; in this book we will refer to adverse symptoms from VR usage as VR sickness. Sometimes the discomfort is due to problems in the VR hardware and low-level software; however, in most cases, it is caused by a careless VR developer who misunderstands or disregards the side effects of the experience on the user. This is one reason why human physiology and perceptual psychology are large components of this book. To develop comfortable VR experiences, you must understand how these factor in. In many cases, fatigue arises because the brain appears to work harder to integrate the unusual stimuli being presented to the senses. In some cases, inconsistencies with prior expectations, and outputs from other senses, even lead to dizziness and nausea.

Another factor that leads to fatigue is an interface that requires large amounts of muscular effort. For example, it might be tempting to move objects around in a sandbox game by moving your arms around in space. This quickly leads to fatigue and an avoidable phenomenon called gorilla arms, in which people feel that the weight of their extended arms is unbearable. One way to avoid it is to follow the principle of the computer mouse: it may be possible to execute large, effective motions in the virtual space by small, comfortable motions of a controller. Over long periods of time, the brain will associate the motions well enough for it to seem realistic while also greatly reducing fatigue.

1.2 Modern VR Experiences

The modern era of VR was brought about by advances in display, sensing, and computing technology from the smartphone industry. From Palmer Luckey’s 2012 Oculus Rift design to simply building a case for smartphones [121, 239, 306], the world has quickly changed as VR headsets are mass produced and placed into the hands of a wide variety of people. This trend is similar in many ways to the home computer and web browser revolutions; as more people have access to the technology, the set of things they do with it substantially broadens.

This section gives you a quick overview of what people are doing with VR today, and provides a starting point for searching for similar experiences on the Internet. Here, we can only describe the experiences in words and pictures, which is a long way from the appreciation gained by experiencing them yourself. This printed medium (a book) is woefully inadequate for fully conveying the medium of VR. Perhaps this is how it felt in the 1890s to explain in a newspaper what a movie theater was like! If possible, it is strongly recommended that you try many VR experiences yourself to form first-hand opinions and spark your imagination to do something better.

Video games People have dreamed of entering their video game worlds for decades. By 1982, this concept was already popularized by the Disney movie Tron.
Figure 1.5: (a) Valve’s Portal 2 demo for the HTC Vive headset is a puzzle-solving experience in a virtual world. (b) The Virtuix Omni treadmill for walking through first-person shooter games. (c) Lucky’s Tale for the Oculus Rift maintains a third-person perspective as the player floats above his character. (d) In the Dumpy game from DePaul University, the player appears to have a large elephant trunk. The purpose of the game is to enjoy this unusual embodiment by knocking things down with a swinging trunk.

Figure 1.5 shows several video game experiences in VR. Most gamers currently want to explore large, realistic worlds through an avatar. Figure 1.5(a) shows Valve’s Portal 2, which is a puzzle-solving adventure game developed for the HTC Vive headset. Figure 1.5(b) shows an omnidirectional treadmill peripheral that gives users the sense of walking while they slide their feet in a dish on the floor. These two examples give the user a first-person perspective of their character. By contrast, Figure 1.5(c) shows Lucky’s Tale, which instead yields a comfortable third-person perspective as the user seems to float above the character that she controls. Figure 1.5(d) shows a game that contrasts with all the others in that it was designed to specifically exploit the power of VR.

Figure 1.6: Oculus Story Studio produced Emmy-winning Henry, an immersive short story about an unloved hedgehog who hopes to make a new friend, the viewer.

Figure 1.9: A simple VR experience that presents Google Street View images through a VR headset: (a) A familiar scene in Paris. (b) Left and right eye views are created inside the headset, while also taking into account the user’s looking direction.

Figure 1.10: Jaunt captured a panoramic video of Paul McCartney performing Live and Let Die, which provides a VR experience in which users feel as if they are on stage with the rock star.

Figure 1.11: Examples of robotic avatars: (a) The DORA robot from the University of Pennsylvania mimics the user’s head motions, allowing him to look around in a remote world while maintaining a stereo view (panoramas are monoscopic). (b) The Plexidrone, a low-cost flying robot that is designed for streaming panoramic video.

Figure 1.12: Virtual societies develop through interacting avatars that meet in virtual worlds that are maintained on a common server. A snapshot from Second Life is shown here.
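
The caption of Figure 1.9 notes that the eye views are created inside the headset from the user's looking direction. As a rough illustration only, the following Python sketch maps a looking direction to a pixel in a panoramic image. It assumes the panorama is stored in an equirectangular layout, which the text does not state, and the function name `direction_to_pixel` and its parameters are hypothetical. Since a single panorama is monoscopic, both eyes would sample the same image in this sketch.

```python
import math


def direction_to_pixel(yaw, pitch, width, height):
    """Map a looking direction to (column, row) in an equirectangular panorama.

    yaw:   rotation about the vertical axis in radians; 0 looks at the image
           center and positive values turn to the right.
    pitch: elevation in radians; 0 is the horizon and +pi/2 is straight up.
    The equirectangular layout is an assumption made for this sketch.
    """
    # Wrap yaw into [-pi, pi) so that full turns land on the same column.
    yaw = (yaw + math.pi) % (2.0 * math.pi) - math.pi
    # Clamp pitch to the valid range [-pi/2, pi/2].
    pitch = max(-math.pi / 2.0, min(math.pi / 2.0, pitch))
    # Longitude maps linearly to columns, latitude linearly to rows.
    u = (yaw / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - pitch / math.pi) * height
    # Convert to integer pixel indices, staying inside the image.
    col = min(int(u), width - 1)
    row = min(int(v), height - 1)
    return col, row
```

Looking straight ahead selects the center of the image, and looking straight up or down selects the top or bottom row. A real viewer would evaluate such a mapping for every display pixel and interpolate between panorama pixels rather than picking a single one.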
Figure 1.13: In Clouds Over Sidra, film producer Chris Milk offers a first-person perspective on the suffering of Syrian refugees.

view on a screen but can now be experienced through VR. Groups of people could spend time together in these spaces for a variety of reasons, including common special interests, educational goals, or simply an escape from ordinary life.

Empathy The first-person perspective provided by VR is a powerful tool for causing people to feel empathy for someone else’s situation. The world continues to struggle with acceptance and equality for others of different race, religion, age, gender, sexuality, social status, and education, while the greatest barrier to progress is that most people cannot fathom what it is like to have a different identity. Figure 1.13 shows a VR project sponsored by the United Nations to yield feelings of empathy for those caught up in the Syrian crisis of 2015. Some of us may have compassion for the plight of others, but it is a much stronger feeling to understand their struggle because you have been there before. Figure 1.14 shows a VR system that allows men and women to swap bodies. Through virtual societies, many more possibilities can be explored. What if you were 10 cm shorter than everyone else? What if you taught your course with a different gender? What if you were the victim of racial discrimination by the police? Using VR, we can imagine many “games of life” where you might not get as far without being in the “proper” group.

Figure 1.14: Students in Barcelona made an experience where you can swap bodies with the other gender. Each person wears a VR headset that has cameras mounted on its front. Each therefore sees the world from the approximate viewpoint of the other person. They were asked to move their hands in coordinated motions so that they see their new body moving appropriately.

Education In addition to teaching empathy, the first-person perspective could revolutionize many areas of education. In engineering, mathematics, and the sciences, VR offers the chance to visualize geometric relationships in difficult concepts or data that are hard to interpret. Furthermore, VR is naturally suited for practical training because skills developed in a realistic virtual environment may transfer naturally to the real environment. The motivation is particularly high if the real environment is costly to provide or poses health risks. One of the earliest and most common examples of training in VR is flight simulation (Figure 1.15). Other examples include firefighting, nuclear power plant safety, search-and-rescue, military operations, and medical procedures.

Beyond these common uses of VR, perhaps the greatest opportunities for VR education lie in the humanities, including history, anthropology, and foreign language acquisition. Consider the difference between reading a book on the Victorian era in England and being able to roam the streets of 19th-century London, in a simulation that has been painstakingly constructed by historians. We could even visit an ancient city that has been reconstructed from ruins (Figure 1.16). Fascinating possibilities exist for either touring physical museums through a VR interface or scanning and exhibiting artifacts directly in virtual museums. These examples fall under the heading of digital heritage.

Virtual prototyping In the real world, we build prototypes to understand how a proposed design feels or functions. Thanks to 3D printing and related technologies, this is easier than ever. At the same time, virtual prototyping enables
Figure 1.15: A flight simulator used by the US Air Force (photo by Javier Garcia).
The user sits in a physical cockpit while being surrounded by displays that show
the environment.
Figure 1.18: A heart visualization system based on images of a real human heart.
This was developed by the Jump Trading Simulation and Education Center and
the University of Illinois.
Figure 1.19: The Microsoft Hololens uses advanced see-through display technology
to superimpose graphical images onto the ordinary physical world, as perceived
by looking through the glasses.
designers to inhabit a virtual world that contains their prototype (Figure 1.17).
They can quickly interact with it and make modifications. They also have op-
portunities to bring clients into their virtual world so that they can communicate
their ideas. Imagine you want to remodel your kitchen. You could construct a
model in VR and then explain to a contractor exactly how it should look. Virtual
prototyping in VR has important uses in many businesses, including real estate,
architecture, and the design of aircraft, spacecraft, cars, furniture, clothing, and
medical instruments.
Health care Although health and safety are challenging VR issues, the tech-
nology can also help to improve our health. There is an increasing trend toward
distributed medicine, in which doctors train people to perform routine medical pro-
cedures in remote communities around the world. Doctors can provide guidance
through telepresence, and also use VR technology for training. In another use of
VR, doctors can immerse themselves in 3D organ models that were generated from
medical scan data (Figure 1.18). This enables them to better plan and prepare for
a medical procedure by studying the patient’s body shortly before an operation.
They can also explain medical options to the patient or his family so that they
may make more informed decisions. In yet another use, VR can directly provide
therapy to help patients. Examples include overcoming phobias and stress disor-
ders through repeated exposure, improving or maintaining cognitive skills in spite
of aging, and improving motor skills to overcome balance, muscular, or nervous
system disorders. VR systems could also one day improve longevity by enabling
aging people to virtually travel, engage in fun physical therapy, and overcome
loneliness by connecting with family and friends through an interface that makes
them feel present and included in remote activities.
Augmented and mixed reality In many applications, it is advantageous for
users to see the live, real world with some additional graphics superimposed to
enhance its appearance; see Figure 1.19. This has been referred to as augmented
reality or mixed reality (both of which we consider to be part of VR in this book).
By placing text, icons, and other graphics into the real world, the user could
leverage the power of the Internet to help with many operations such as navigation,
social interaction, and mechanical maintenance. Many applications to date are
targeted at helping businesses to conduct operations more efficiently. Imagine a
factory environment in which workers see identifying labels above parts that need
to be assembled, or they can look directly inside of a machine to determine potential
replacement parts.
These applications rely heavily on advanced computer vision techniques, which
must identify objects, reconstruct shapes, and identify lighting sources in the real
world before determining how to draw virtual objects that appear to be naturally
embedded. Achieving a high degree of reliability becomes a challenge because
Figure 1.20: Nintendo Pokemon Go is a networked game that allows users to
imagine a virtual world that is superimposed onto the real world. They can see
Pokemon characters only by looking “through” their smartphone screen.
Figure 1.21: (a) Epic Games created a wild roller coaster ride through a virtual
living room. (b) A guillotine simulator was made by Andre Berlemont, Morten
Brunbjerg, and Erkki Trummal. Participants were hit on the neck by friends as
the blade dropped, and they could see the proper perspective as their heads rolled.
outward-facing camera to a standard screen inside of the headset. Pass-through
displays overcome current see-through display problems, but instead suffer from
latency, optical distortion, color distortion, and limited dynamic range.
Figure 1.22: (a) A 30,000-year-old painting from the Bhimbetka rock shelters
in India (photo by Archaeological Survey of India). (b) An English painting from
around 1470 that depicts John Ball encouraging Wat Tyler rebels (unknown artist).
(c) A painting by Hans Vredeman de Vries in 1596. (d) An impressionist painting
by Claude Monet in 1874.
Figure 1.23: This 1878 Horse in Motion motion picture by Eadweard Muybridge
was created by evenly spacing 24 cameras along a track and triggering them by
trip wire as the horse passes. The animation was played on a zoopraxiscope, which
was a precursor to the movie projector, but was mechanically similar to a record
player.
paintings, such as the one shown in Figure 1.22(a) from 30,000 years ago. Figure
1.22(b) shows a painting from the European Middle Ages. Similar to the cave
painting, it relates to military conflict, a fascination of humans regardless of the
era or technology. There is much greater detail in the newer painting, leaving
less to the imagination; however, the drawing perspective is comically wrong.
Some people seem short relative to others, rather than being farther away. The
rear portion of the fence looks incorrect. Figure 1.22(c) shows a later painting in
which the perspective has been meticulously accounted for, leading to a beautiful
palace view that requires no imagination for us to perceive it as “3D”. By the 19th
century, many artists had grown tired of such realism and started the controversial
impressionist movement, an example of which is shown in Figure 1.22(d). Such
paintings leave more to the imagination of the viewer, much like the earlier cave
paintings.
Moving pictures Once humans were content with staring at rectangles on the
wall, the next step was to put them into motion. The phenomenon of stroboscopic
apparent motion is the basis for what we call movies or motion pictures today.
Flipping quickly through a sequence of pictures gives the illusion of motion, even
at a rate as low as two pictures per second. Above ten pictures per second,
the motion even appears to be continuous, rather than perceived as individual
pictures. One of the earliest examples of this effect is the race horse movie created
by Eadweard Muybridge in 1878 at the request of Leland Stanford (yes, that one!);
see Figure 1.23.
Motion picture technology quickly improved, and by 1896, a room full of spec-
tators in a movie theater screamed in terror as a short film of a train pulling into
a station convinced them that the train was about to crash into them (Figure
1.24(a)). There was no audio track. Such a reaction seems ridiculous to anyone
who has been to a modern movie theater. As audience expectations increased,
so did the degree of realism produced by special effects. In 1902, viewers were
1.3. HISTORY REPEATS 25 26 S. M. LaValle: Virtual Reality
Figure 1.24: A progression of special effects: (a) Arrival of a Train at La Ciotat
Station, 1896. (b) A Trip to the Moon, 1902. (c) The movie 2001, from 1968. (d)
Gravity, 2013.
Figure 1.25: A progression of cartoons: (a) Emile Cohl, Fantasmagorie, 1908. (b)
Mickey Mouse in Steamboat Willie, 1928. (c) The Clone Wars Series, 2003. (d)
South Park, 1997.
inspired by A Trip to the Moon (Figure 1.24(b)), but by 2013, an extremely
high degree of realism seemed necessary to keep viewers believing (Figures 1.24(c)
and 1.24(d)).
At the same time, motion picture audiences have been willing to accept lower
degrees of realism. One motivation, as for paintings, is to leave more to the imag-
ination. The popularity of animation (also called anime or cartoons) is a prime
example (Figure 1.25). Even within the realm of animations, a similar trend has
emerged as with motion pictures in general. Starting from simple line drawings in
1908 with Fantasmagorie (Figure 1.25(a)), greater detail appears in 1928 with the
introduction of Mickey Mouse (Figure 1.25(b)). By 2003, animated films achieved
a much higher degree of realism (Figure 1.25(c)); however, excessively simple ani-
mations have also enjoyed widespread popularity (Figure 1.25(d)).
Toward convenience and portability Further motivations for accepting lower
levels of realism are cost and portability. As shown in Figure 1.26, families were
willing to gather in front of a television to watch free broadcasts in their homes,
even though they could go to theaters and watch high-resolution, color, panoramic,
and 3D movies at the time. Such tiny, blurry, black-and-white television sets seem
comically intolerable with respect to our current expectations. The next level
of portability is to carry the system around with you. Thus, the progression is
from: 1) having to go somewhere to watch it, to 2) being able to watch it in your
home, to 3) being able to carry it anywhere. Whether pictures, movies, phones,
computers, or video games, the same progression continues. We can therefore
expect the same for VR systems. At the same time, note that the gap is closing
between these levels: The quality we expect from a portable device is closer than
ever before to the version that requires going somewhere to experience it.
Video games Motion pictures yield a passive, third-person experience, in con-
trast to video games, which are closer to a first-person experience by allowing us
to interact with them. Recall from Section 1.1 the differences between open-loop
and closed-loop VR. Video games are an important step toward closed-loop VR,
whereas motion pictures are open-loop. As shown in Figure 1.27, we see the same
Figure 1.26: Although movie theaters with large screens were available, families
were also content to gather around television sets that produced a viewing quality
that would be unbearable by current standards, as shown in this photo from 1958.
Figure 1.27: A progression of video games: (a) Atari’s Pong, 1972. (b) Nintendo’s
Donkey Kong, 1981. (c) id Software’s Doom, 1993. (d) Ubisoft’s Assassin’s Creed
Unity, 2014. (e) Rovio Entertainment’s Angry Birds, 2009. (f) Markus “Notch”
Persson’s Minecraft, 2011.
trend from simplicity to improved realism and then back to simplicity. The earliest
games, such as Pong and Donkey Kong, left much to the imagination. First-person
shooter (FPS) games such as Doom gave the player a first-person perspective and
launched a major campaign over the following decade toward higher quality graph-
ics and realism. Assassin’s Creed shows a typical scene from a modern, realistic
video game. At the same time, wildly popular games have emerged by focusing on
simplicity. Angry Birds looks reminiscent of games from the 1980s, and Minecraft
allows users to create and inhabit worlds composed of coarse blocks. Note that
reduced realism often leads to simpler engineering requirements; in 2015, an ad-
vanced FPS game might require a powerful PC and graphics card, whereas simpler
games would run on a basic smartphone. Repeated lesson: Don’t assume that more
realistic is better!
Beyond staring at a rectangle The concepts so far are still closely centered
on staring at a rectangle that is fixed on a wall. Two important steps come next:
1) Presenting a separate picture to each eye to induce a “3D” effect. 2) Increasing
the field of view so that the user is not distracted by the stimulus boundary.
One way our brains infer the distance of objects from our eyes is by stereopsis.
Information is gained by observing and matching features in the world that are
visible to both the left and right eyes. The differences between their images on
the retina yield cues about distances; keep in mind that there are many more
such cues, which are explained in Section 6.1. The first experiment that showed
the 3D effect of stereopsis was performed in 1838 by Charles Wheatstone in a
system called the stereoscope (Figure 1.28(a)). By the 1930s, a portable version
became a successful commercial product known to this day as the View-Master
(Figure 1.28(b)). Pursuing this idea further led to Sensorama, which added motion
pictures, sound, vibration, and even smells to the experience (Figure 1.28(c)). An
unfortunate limitation of these designs is requiring that the viewpoint is fixed with
respect to the picture. If the device is too large, then the user’s head also becomes
fixed in the world. An alternative has been available in movie theaters since the
1950s. Stereopsis was achieved when participants wore special glasses that select
a different image for each eye using polarized light filters. This popularized 3D
movies, which are viewed the same way in the theaters today.
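The stereopsis cue described above, in which differences between the left and right images indicate distance, can be illustrated with a minimal sketch. It assumes an idealized pair of parallel pinhole cameras standing in for the two eyes, which is a simplification and not a model of human vision taken from this book. The function name and the numbers are hypothetical; under this assumed model, depth equals focal length times baseline divided by disparity.

```python
def depth_from_disparity(focal_length_px: float,
                         baseline_m: float,
                         disparity_px: float) -> float:
    """Distance in meters to a point seen by two parallel pinhole cameras.

    Assumed model: depth = focal_length * baseline / disparity, where
    disparity is the horizontal shift (in pixels) of the same feature
    between the left and right images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Hypothetical values: an 800-pixel focal length and a 64 mm baseline.
    for disparity in (64.0, 32.0, 8.0):
        depth = depth_from_disparity(800.0, 0.064, disparity)
        print(f"disparity {disparity:5.1f} px -> depth {depth:.2f} m")
```

Smaller disparities correspond to more distant points, so in this simplified model the cue becomes less informative as objects move farther away.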
Another way to increase the sense of immersion and depth is to increase the
field of view. The Cinerama system from the 1950s offered a curved, wide field
of view that is similar to the curved, large LED (Light-Emitting Diode) displays
offered today (Figure 1.28(d)). Along these lines, we could place screens all around
us. This idea led to one important family of VR systems called the CAVE, which
was introduced in 1992 at the University of Illinois [49] (Figure 1.29(a)). The user
enters a room in which video is projected onto several walls. The CAVE system
also offers stereoscopic viewing by presenting different images to each eye using
polarized light and special glasses. Often, head tracking is additionally performed
to allow viewpoint-dependent video to appear on the walls.
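As a rough illustration of why larger or closer screens increase the field of view, the following sketch computes the horizontal angle subtended by a flat screen. It assumes the viewer looks at the screen head-on from a point centered in front of it; curved screens such as Cinerama and multi-wall systems such as the CAVE are not modeled. The function name and the example dimensions are hypothetical.

```python
import math


def horizontal_fov_deg(screen_width_m: float, viewing_distance_m: float) -> float:
    """Horizontal angle (degrees) subtended by a flat screen.

    Assumes the viewer is centered in front of the screen, so each half of
    the screen spans atan((width / 2) / distance).
    """
    if screen_width_m <= 0 or viewing_distance_m <= 0:
        raise ValueError("width and distance must be positive")
    return math.degrees(2.0 * math.atan(screen_width_m / (2.0 * viewing_distance_m)))


if __name__ == "__main__":
    # Hypothetical setups: the same 2 m wide screen viewed from three distances.
    for distance in (4.0, 1.0, 0.5):
        print(f"{distance} m away -> {horizontal_fov_deg(2.0, distance):.1f} degrees")
```

Under this assumption a flat screen can never reach 180 degrees, which is one motivation for wrapping displays around the viewer.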
Figure 1.30: Second Life was introduced in 2003 as a way for people to socialize
through avatars and essentially build a virtual world to live in. Shown here is the
author giving a keynote address at the 2014 Opensimulator Community Confer-
ence. The developers build open source software tools for constructing and hosting
such communities of avatars with real people behind them.
ing, chat rooms, and emoticons. This was a precursor to many community-based,
on-line learning systems, such as the Khan Academy and Coursera. The largest
amount of online social interaction today occurs through Facebook apps, which
involve direct communication through text along with the sharing of pictures,
videos, and links.
Returning to VR, we can create avatar representations of ourselves and “live”
together in virtual environments, as is the case with Second Life and Opensim-
ulator (Figure 1.30). Without being limited to staring at rectangles, what kinds of soci-
eties will emerge with VR? Popular science fiction novels have painted a thrilling,
yet dystopian future of a world where everyone prefers to interact through VR
[47, 95, 308]. It remains to be seen what the future will bring.
As the technologies evolve over the years, keep in mind the power of simplicity
when making a VR experience. In some cases, maximum realism may be im-
portant; however, leaving much to the imagination of the users is also valuable.
Although the technology changes, one important invariant is that humans are still
designed the same way. Understanding how our senses, brains, and bodies work
is crucial to understanding the fundamentals of VR systems.
Further reading
Each chapter of this book concludes with pointers to additional, related literature that
might not have been mentioned in the preceding text. Numerous books have been
written on VR. A couple of key textbooks that precede the consumer VR revolution
are Understanding Virtual Reality by W. R. Sherman and A. B. Craig, 2002 [288] and
3D User Interfaces by D. A. Bowman et al., 2005 [31]. Books based on the current
technology include [135, 183]. For a survey of the concept of reality, see [348]. For
recent coverage of augmented reality that is beyond the scope of this book, see [280].
A vast amount of research literature has been written on VR. Unfortunately, there
is a considerable recognition gap in the sense that current industry approaches to con-
sumer VR appear to have forgotten the longer history of VR research. Many of the
issues being raised today and methods being introduced in industry were well addressed
decades earlier, albeit with older technological components. Much of the earlier work
remains relevant today and is therefore worth studying carefully. An excellent starting
place is the Handbook of Virtual Environments, 2015 [108], which contains dozens of re-
cent survey articles and thousands of references to research articles. More recent works
can be found in venues that publish papers related to VR. Browsing through recent
publications in these venues may be useful: IEEE Virtual Reality (IEEE VR), IEEE In-
ternational Symposium on Mixed and Augmented Reality (ISMAR), ACM SIGGRAPH
Conference, ACM Symposium on Applied Perception, ACM SIGCHI Conference, IEEE
Symposium on 3D User Interfaces, Journal of Vision, Presence: Teleoperators and Vir-
tual Environments.
Chapter 2
[Figure: diagram with labels Tracking, Organism, VR Hardware, Stimulation]
2.1. HARDWARE 37 38 S. M. LaValle: Virtual Reality
[Figure 2.2 diagram labels: Configuration Control, Natural Stimulation, Sense Organ, Neural Pathways, World]
Figure 2.2: Under normal conditions, the brain (and body parts) control the con-
figuration of sense organs (eyes, ears, fingertips) as they receive natural stimulation
from the surrounding, physical world.
[Figure: diagram labels: Configuration Control, World]
Figure 2.4: If done well, the brain is “fooled” into believing that the virtual world
is in fact the surrounding physical world and natural stimulation is resulting from
it.
three DOFs correspond to possible ways the object could be rotated; in other
words, exactly three independent parameters are needed to specify how the object
is oriented. These are called yaw, pitch, and roll, and are covered in Section 3.2.
As an example, consider your left ear. As you rotate your head or move your
body through space, the position of the ear changes, as well as its orientation.
This yields six DOFs. The same is true for your right eye, but it is also capable of
rotating independently of the head. Keep in mind that our bodies have many more
degrees of freedom, which affect the configuration of our sense organs. A tracking
system may be necessary to determine the position and orientation of each sense
organ that receives artificial stimuli, which will be explained shortly.
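To make the six DOFs concrete, here is a minimal sketch of a pose record with three position parameters and three orientation parameters (yaw, pitch, roll). The class name, the axis assignment (yaw about the y axis, pitch about the x axis, roll about the z axis), and the composition order are assumptions made for this illustration; Section 3.2 is the place to look for the conventions actually used in the book.

```python
import math
from dataclasses import dataclass


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


@dataclass
class Pose:
    """Six numbers: position (x, y, z) and orientation (yaw, pitch, roll) in radians."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0

    def rotation(self):
        """3x3 rotation matrix under the assumed convention R_yaw * R_pitch * R_roll."""
        cy, sy = math.cos(self.yaw), math.sin(self.yaw)
        cp, sp = math.cos(self.pitch), math.sin(self.pitch)
        cr, sr = math.cos(self.roll), math.sin(self.roll)
        r_yaw = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]    # about y (assumed)
        r_pitch = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]  # about x (assumed)
        r_roll = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]   # about z (assumed)
        return matmul(matmul(r_yaw, r_pitch), r_roll)

    def transform(self, point):
        """Rotate a point given in the body frame, then translate it into the world frame."""
        r = self.rotation()
        t = (self.x, self.y, self.z)
        return tuple(sum(r[i][k] * point[k] for k in range(3)) + t[i]
                     for i in range(3))


if __name__ == "__main__":
    # Hypothetical head pose: moved to (1, 2, 3) and turned a quarter turn in yaw.
    head = Pose(x=1.0, y=2.0, z=3.0, yaw=math.pi / 2)
    print(head.transform((0.0, 0.0, -1.0)))
```

A tracking system would, in effect, estimate these six values for each sense organ of interest.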
An abstract view Figure 2.2 illustrates the normal operation of one of our sense
organs without interference from VR hardware. The brain controls its configura-
tion, while the sense organ converts natural stimulation from the environment into
neural impulses that are sent to the brain. Figure 2.3 shows how it appears in a
VR system. The VR hardware contains several components that will be discussed
Figure 2.5: In a surround-sound system, the aural displays (speakers) are world-
fixed while the user listens from the center.
Figure 2.6: Using headphones, the displays are user-fixed, unlike the case of a
surround-sound system.
shortly. A Virtual World Generator (VWG) runs on a computer and produces
“another world”, which could take many forms, such as a pure simulation of a
synthetic world, a recording of the real world, or a live connection to another part
of the real world. The human perceives the virtual world through each targeted
sense organ using a display, which emits energy that is specifically designed to
mimic the type of stimulus that would appear without VR. The process of con-
verting information from the VWG into output for the display is called rendering.
In the case of human eyes, the display might be a smartphone screen or the screen
of a video projector. In the case of ears, the display is referred to as a speaker.
(A display need not be visual, even though this is the common usage in everyday
life.) If the VR system is effective, then the brain is hopefully “fooled” in the sense
shown in Figure 2.4. The user should believe that the stimulation of the senses
is natural and comes from a plausible world, being consistent with at least some
past experiences.
Aural: world-fixed vs. user-fixed Recall from Section 1.3 the trend of having
to go somewhere for an experience, to having it in the home, and then finally to
having it be completely portable. To understand these choices for VR systems
and their implications for technology, it will be helpful to compare a simpler case:
Audio or aural systems.
Figure 2.5 shows the speaker setup and listener location for a Dolby 7.1 Sur-
round Sound theater system, which could be installed at a theater or a home family
room. Seven speakers distributed around the room periphery generate most of the
sound, while a subwoofer (the “1” of the “7.1”) delivers the lowest frequency com-
ponents. The aural displays are therefore world-fixed. Compare this to a listener
wearing headphones, as shown in Figure 2.6. In this case, the aural displays are
user-fixed. Hopefully, you have already experienced settings similar to these many
times.
What are the key differences? In addition to the obvious portability of head-
phones, the following quickly come to mind:
• In the surround-sound system, the generated sound (or stimulus) is far away
  from the ears, whereas it is quite close for the headphones.
• One implication of the difference in distance is that much less power is needed
  for the headphones to generate an equivalent perceived loudness level com-
  pared with distant speakers.
• Another implication based on distance is the degree of privacy afforded to
  the wearer of headphones. A surround-sound system at high volume levels
  could generate a visit by angry neighbors.
• Wearing electronics on your head could be uncomfortable over long periods
  of time, causing a preference for surround sound over headphones.
• Several people can enjoy the same experience in a surround-sound system
  (although they cannot all sit in the optimal location). Using headphones,
  they would need to split the audio source across their individual headphones
  simultaneously.
• They are likely to have different costs, depending on the manufacturing
  difficulty and available component technology. At present, headphones are
Figure 2.8: Two examples of haptic feedback devices. (a) The Geomagic Phan-
tom allows the user to feel strong resistance when poking into a virtual object
with a real stylus. A robot arm provides the appropriate forces. (b) Some game
controllers occasionally vibrate.
Figure 2.9: Inertial measurement units (IMUs) have gone from large, heavy me-
chanical systems to cheap, microscopic MEMS circuits. (a) The LN-3 Inertial
Navigation System, developed in the 1960s by Litton Industries. (b) The internal
structures of a MEMS gyroscope, for which the total width is less than 1 mm.
is used. Due to the plummeting costs, an array of large-panel displays may alter-
natively be employed. For headsets, a smartphone display can be placed close to
the eyes and brought into focus using one magnifying lens for each eye. Screen
manufacturers are currently making custom displays for VR headsets by leverag-
ing the latest LED display technology from the smartphone industry. Some are
targeting one display per eye with frame rates above 90 Hz and over two megapixels
per eye. Reasons for this are explained in Chapter 5.
New display technologies that are customized for VR are rapidly emerging. Di-
rect retinal stimulation is provided by using pico projector technology, including
DLP (Digital Light Processing), LCD (Liquid Crystal Display), and LCoS (Liq-
uid Crystal on Silicon). Products that have used this technology include Google
Glass, Microsoft Hololens, and Avegant Glyph. To address comfort issues such as
vergence-accommodation mismatch (see Section 5.4), more exotic display technolo-
gies have been prototyped but await the ability to rival other display technologies
in terms of pixel density, field of view, frame rate, manufacturability, and cost.
The two main families are light-field displays [74, 161, 198] and multi-focal-plane
displays [4, 127, 187].
Now imagine displays for other sense organs. Sound is displayed to the ears
using classic speaker technology. Bone conduction methods may also be used,
which vibrate the skull and propagate the waves to the inner ear; this method
appeared in Google Glass. Chapter 11 covers the auditory part of VR in detail.
For the sense of touch, there are haptic displays. Two examples are pictured in
Figure 2.8. Haptic feedback can be given in the form of vibration, pressure, or
temperature. More details on displays for touch, and even taste and smell, appear
in Chapter 13.
Sensors Consider the input side of the VR hardware. A brief overview is given
here; Chapter 9 covers sensors and tracking systems in detail. For visual and
auditory body-mounted displays, the position and orientation of the sense organ
must be tracked by sensors to appropriately adapt the stimulus. The orientation
part is usually accomplished by an inertial measurement unit or IMU. The main
component is a gyroscope, which measures its own rate of rotation; the rate is
referred to as angular velocity and has three components. Measurements from the
gyroscope are integrated over time to obtain an estimate of the cumulative change
in orientation. The resulting error, called drift error, would gradually grow unless
other sensors are used. To reduce drift error, IMUs also contain an accelerometer
and possibly a magnetometer. Over the years, IMUs have gone from existing only
as large mechanical systems in aircraft and missiles to being tiny devices inside
of smartphones; see Figure 2.9. Due to their small size, weight, and cost, IMUs
can be easily embedded in wearable devices. They are one of the most important
enabling technologies for the current generation of VR headsets and are mainly
used for tracking the user’s head orientation.
Digital cameras provide another critical source of information for tracking sys-
tems. Like IMUs, they have become increasingly cheap and portable due to the
smartphone industry, while at the same time improving in image quality. Cameras
enable tracking approaches that exploit line-of-sight visibility. The idea is to iden-
tify features or markers in the image that serve as reference points for a moving
object or a stationary background. Such visibility constraints severely limit the
possible object positions and orientations. Standard cameras passively form an
image by focusing the light through an optical system, much like the human eye.
Once the camera calibration parameters are known, an observed marker is known
Figure 2.10: (a) The Microsoft Kinect sensor gathers both an ordinary RGB image
and a depth map (the distance away from the sensor for each pixel). (b) The depth
is determined by observing the locations of projected IR dots in an image obtained
from an IR camera.
Figure 2.11: Two headsets that create a VR experience by dropping a smartphone
into a case. (a) Google Cardboard works with a wide variety of smartphones. (b)
Samsung Gear VR is optimized for one particular smartphone (in this case, the
Samsung S6).
to lie along a ray in space. Cameras are commonly used to track eyes, heads,
hands, entire human bodies, and any other objects in the physical world. One
of the main challenges at present is to obtain reliable and accurate performance
without placing special markers on the user or objects around the scene.
As opposed to standard cameras, depth cameras work actively by projecting
light into the scene and then observing its reflection in the image. This is typically
done in the infrared (IR) spectrum so that humans do not notice; see Figure 2.10.
In addition to these sensors, we rely heavily on good old mechanical switches
and potentiometers to create keyboards and game controllers. An optical mouse
is also commonly used. One advantage of these familiar devices is that users can
rapidly input data or control their characters by leveraging their existing training.
A disadvantage is that the devices might be hard to find or interact with if users’ faces
are covered by a headset.
Computers A computer executes the virtual world generator (VWG). Where
should this computer be? Although unimportant for world-fixed displays, the
location is crucial for body-fixed displays. If a separate PC is needed to power the
system, then fast, reliable communication must be provided between the headset
and the PC. This connection is currently made by wires, leading to an awkward
tether; current wireless speeds are not sufficient. As you have noticed, most of the
needed sensors exist on a smartphone, as well as a moderately powerful computer.
Therefore, a smartphone can be dropped into a case with lenses to provide a VR
experience with little added cost (Figure 2.11). The limitation, though, is that
the VWG must be simpler than in the case of a separate PC so that it runs on less-
powerful computing hardware. In the near future, we expect to see wireless, all-in-
one headsets that contain all of the essential parts of smartphones for delivering
VR experiences. These will eliminate unnecessary components of smartphones
(such as the additional case), and will instead have customized optics, microchips,
and sensors for VR.
In addition to the main computing systems, specialized computing hardware
may be utilized. Graphics processing units (GPUs) have been optimized for
quickly rendering graphics to a screen and they are currently being adapted to
handle the specific performance demands of VR. Also, a display interface chip
converts an input video into display commands. Finally, microcontrollers are fre-
quently used to gather information from sensing devices and send it to the
main computer using standard protocols, such as USB.
To conclude with hardware, Figure 2.12 shows the hardware components for
the Oculus Rift DK2, which became available in late 2014. In the lower left corner,
you can see a smartphone screen that serves as the display. Above that is a circuit
board that contains the IMU, display interface chip, a USB driver chip, a set of
chips for driving LEDs on the headset for tracking, and a programmable micro-
controller. The lenses, shown in the lower right, are placed so that the smartphone
screen appears to be “infinitely far” away, but nevertheless fills most of the field of
view of the user. The upper right shows flexible circuits that deliver power to IR
LEDs embedded in the headset (they are hidden behind IR-transparent plastic).
A camera is used for tracking, and its parts are shown in the center.
2.2 Software
From a developer’s standpoint, it would be ideal to program the VR system by
providing high-level descriptions and having the software determine automatically
all of the low-level details. In a perfect world, there would be a VR engine, which
serves a purpose similar to the game engines available today for creating video
games. If the developer follows patterns that many before her have implemented
already, then many complicated details can be avoided by simply calling functions
from a well-designed software library. However, if the developer wants to try
something original, then she would have to design the functions from scratch.
This requires a deeper understanding of the VR fundamentals, while also being
familiar with lower-level system operations.

Figure 2.12: Disassembly of the Oculus Rift DK2 headset (image by ifixit).

Unfortunately, we are currently a long way from having fully functional,
general-purpose VR engines. As applications of VR broaden, specialized VR engines are
also likely to emerge. For example, one might be targeted for immersive
cinematography while another is geared toward engineering design. Which components will
become more like part of a VR “operating system” and which will become
higher-level “engine” components? Given the current situation, developers will likely be
implementing much of the functionality of their VR systems from scratch. This
may involve utilizing a software development kit (SDK) for particular headsets
that handles the lowest-level operations, such as device drivers, head tracking,
and display output. Alternatively, they might find themselves using a game
engine that has been recently adapted for VR, even though it was fundamentally
designed for video games on a screen. This can avoid substantial effort at first,
but then may be cumbersome when someone wants to implement ideas that are
not part of standard video games.

What software components are needed to produce a VR experience? Figure
2.13 presents a high-level view that highlights the central role of the Virtual World
Generator (VWG). The VWG receives inputs from low-level systems that indicate
what the user is doing in the real world. A head tracker provides timely estimates
of the user’s head position and orientation. Keyboard, mouse, and game controller
events arrive in a queue, ready to be processed. The key role of the VWG
is to maintain enough of an internal “reality” so that renderers can extract the
information they need to calculate outputs for their displays.

Figure 2.13: The Virtual World Generator (VWG) maintains another world, which
could be synthetic, real, or some combination. From a computational perspective,
the inputs are received from the user and his surroundings, and appropriate views
of the world are rendered to displays.

Virtual world: real vs. synthetic At one extreme, the virtual world could be
completely synthetic. In this case, numerous triangles are defined in a 3D space,
along with material properties that indicate how they interact with light, sound,
forces, and so on. The field of computer graphics addresses computer-generated
images from synthetic models, and it remains important for VR; see Chapter 7. At
the other extreme, the virtual world might be a recorded physical world that was
captured using modern cameras, computer vision, and Simultaneous Localization
and Mapping (SLAM) techniques; see Figure 2.14. Many possibilities exist between
the extremes. For example, camera images may be taken of a real object, and
then mapped onto a synthetic object in the virtual world. This is called texture
mapping, a common operation in computer graphics; see Section 7.2.
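To make the data flow of Figure 2.13 concrete, the following is a minimal sketch of one VWG update step: a head-tracker estimate and a queue of input events go in, and the information a visual renderer needs comes out. All names here (HeadPose, VirtualWorld, vwg_step) are hypothetical and chosen only for illustration; they do not correspond to any particular SDK or engine.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HeadPose:
    position: tuple      # (x, y, z), assumed to be in meters
    orientation: tuple   # assumed unit quaternion (w, x, y, z)


@dataclass
class VirtualWorld:
    # The internal "reality" kept by the VWG; here only the head pose
    # and a record of the input events that have been processed.
    head: HeadPose = field(
        default_factory=lambda: HeadPose((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)))
    processed_events: list = field(default_factory=list)


def vwg_step(world, head_estimate, events):
    """Apply the latest tracker estimate, drain the event queue,
    and return what a visual renderer would need for this frame."""
    world.head = head_estimate
    while events:
        world.processed_events.append(events.popleft())
    return {"viewpoint": world.head.position,
            "orientation": world.head.orientation}


world = VirtualWorld()
events = deque(["key:W", "button:A"])
view = vwg_step(world, HeadPose((0.0, 1.6, 0.0), (1.0, 0.0, 0.0, 0.0)), events)
print(view["viewpoint"])
```

A real VWG would also advance physics and serve several renderers per frame; the sketch only shows the order in which inputs are consumed.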
Figure 2.14: Using both color and depth information from cameras, a 3D model
of the world can be extracted automatically using Simultaneous Localization and
Mapping (SLAM) techniques. Figure from [128].

Matched motion The most basic operation of the VWG is to maintain a
correspondence between user motions in the real world and the virtual world; see Figure
2.15. In the real world, the user’s motions are confined to a safe region, which we
will call the matched zone. Imagine the matched zone as a place where the real
and virtual worlds perfectly align. One of the greatest challenges is the mismatch
of obstacles: What if the user is blocked in the virtual world but not in the real
world? The reverse is also possible. In a seated experience, the user sits in a chair
while wearing a headset. The matched zone in this case is a small region, such as
one cubic meter, in which users can move their heads. Head motions should be
matched between the two worlds. If the user is not constrained to a seat, then
the matched zone could be an entire room or an outdoor field. Note that safety
becomes an issue because the user might spill a drink, hit walls, or fall into pits
that exist only in the real world, but are not visible in the virtual world. Larger
matched zones tend to lead to greater safety issues. Users must make sure that
the matched zone is cleared of dangers in the real world, or the developer should
make them visible in the virtual world.

Figure 2.15: A matched zone is maintained between the user in the real world
and his representation in the virtual world. The matched zone could be moved in
the virtual world by using an interface, such as a game controller, while the user
does not correspondingly move in the real world.

Which motions from the real world should be reflected in the virtual world?
This varies among VR experiences. In a VR headset that displays images to the
eyes, head motions must be matched so that the visual renderer uses the correct
viewpoint in the virtual world. Other parts of the body are less critical, but may
become important if the user needs to perform hand-eye coordination or looks at
other parts of her body and expects them to move naturally.

User Locomotion In many VR experiences, users want to move well outside
of the matched zone. This motivates locomotion, which means moving oneself in
the virtual world, while this motion is not matched in the real world. Imagine you
want to explore a virtual city while remaining seated in the real world. How should
this be achieved? You could pull up a map and point to where you want to go,
with a quick teleportation operation sending you to the destination. A popular
option is to move oneself in the virtual world by operating a game controller,
mouse, or keyboard. By pressing buttons or moving knobs, your self in the virtual
world could be walking, running, jumping, swimming, flying, and so on. You could
also climb aboard a vehicle in the virtual world and operate its controls to move
yourself. These operations are certainly convenient, but often lead to sickness
because of a mismatch between your balance and visual senses. See Sections 2.3,
10.2, and 12.3.

Physics The VWG handles the geometric aspects of motion by applying the
appropriate mathematical transformations. In addition, the VWG usually
implements some physics so that as time progresses, the virtual world behaves like the
real world. In most cases, the basic laws of mechanics should govern how objects
move in the virtual world. For example, if you drop an object, then it should
accelerate to the ground due to gravitational force acting on it. One important
component is a collision detection algorithm, which determines whether two or
more bodies are intersecting in the virtual world. If a new collision occurs, then
an appropriate response is needed. For example, suppose the user pokes his head
through a wall in the virtual world. Should the head in the virtual world be
stopped, even though it continues to move in the real world? To make it more
complex, what should happen if you unload a dump truck full of basketballs into a
busy street in the virtual world? Simulated physics can become quite challenging,
and is a discipline in itself. There is no limit to the complexity. See Section 8.3
for more about virtual-world physics.

In addition to handling the motions of moving objects, the physics must also
take into account how potential stimuli for the displays are created and propagate
through the virtual world. How does light propagate through the environment?
How does light interact with the surfaces in the virtual world? What are the
sources of light? How do sound and smells propagate? These correspond to
rendering problems, which are covered in Chapters 7 and 11 for visual and audio
cases, respectively.

Networked experiences In the case of a networked VR experience, a shared
virtual world is maintained by a server. Each user has a distinct matched zone.
Their matched zones might overlap in the real world, but one must then be careful
so that they avoid unwanted collisions. Most often, these zones are disjoint and
distributed around the Earth. Within the virtual world, user interactions,
including collisions, must be managed by the VWG. If multiple users are interacting in a
social setting, then the burdens of matched motions may increase. As users meet
each other, they could expect to see eye motions, facial expressions, and body
language; see Section 10.4.

Developer choices for VWGs To summarize, a developer could start with
a basic Software Development Kit (SDK) from a VR headset vendor and then
build her own VWG from scratch. The SDK should provide the basic drivers and
an interface to access tracking data and make calls to the graphical rendering
libraries. In this case, the developer must build the physics of the virtual world from
scratch, handling problems such as avatar movement, collision detection, lighting
models, and audio. This gives the developer the greatest amount of control and
ability to optimize performance; however, it may come in exchange for a difficult
implementation burden. In some special cases, it might not be too difficult. For
example, in the case of the Google Street viewer (recall Figure 1.9), the “physics”
is simple: The viewing location needs to jump between panoramic images in a
comfortable way while maintaining a sense of location on the Earth. In the case of
telepresence using a robot, the VWG would have to take into account movements
in the physical world. Failure to handle collision detection could result in a broken
robot (or human!).

At the other extreme, a developer may use a ready-made VWG that is
customized to make a particular VR experience by choosing menu options and writing
high-level scripts. Examples available today are OpenSimulator, Vizard by
WorldViz, Unity 3D, and Unreal Engine by Epic Games. The latter two are game engines
that were adapted to work for VR, and are by far the most popular among current
VR developers. The first one, OpenSimulator, was designed as an open-source
alternative to Second Life for building a virtual society of avatars. As already
stated, using such higher-level engines makes it easy for developers to make a VR
experience in little time; however, the drawback is that it is harder to make highly
original experiences that were not imagined by the engine builders.

2.3 Human Physiology and Perception

Our bodies were not designed for VR. By applying artificial stimulation to the
senses, we are disrupting the operation of biological mechanisms that have taken
hundreds of millions of years to evolve in a natural environment. We are also
providing input to the brain that is not exactly consistent with all of our other life
experiences. In some instances, our bodies may adapt to the new stimuli. This
could cause us to become unaware of flaws in the VR system. In other cases, we
might develop heightened awareness or the ability to interpret 3D scenes that were
once difficult or ambiguous. Unfortunately, there are also many cases where our
bodies react with increased fatigue or headaches, partly because the brain is working
harder than usual to interpret the stimuli. Finally, the worst case is the onset of
VR sickness, which typically involves symptoms of dizziness and nausea.

Perceptual psychology is the science of understanding how the brain converts
sensory stimulation into perceived phenomena. Here are some typical questions
that arise in VR and fall under this umbrella:

• How far away does that object appear to be?
• How much video resolution is needed to avoid seeing pixels?
• How many frames per second are enough to perceive motion as continuous?
• Is the user’s head appearing at the proper height in the virtual world?
• Where is that virtual sound coming from?
• Why am I feeling nauseated?
• Why is one experience more tiring than another?
• What is presence?

To answer these questions and more, we must understand several things: 1) basic
physiology of the human body, including sense organs and neural pathways, 2)
the key theories and insights of experimental perceptual psychology, and 3) the
Figure 2.18: A typical neuron receives signals through dendrites, which interface
to other neurons. It outputs a signal to other neurons through axons.

86 billion neurons. Around 20 billion are devoted to the part of the brain called
the cerebral cortex, which handles perception and many other high-level functions
such as attention, memory, language, and consciousness. It is a large sheet of
neurons around three millimeters thick and is heavily folded so that it fits into our
skulls. In case you are wondering where we lie among other animals, a roundworm,
fruit fly, and rat have 302, 100 thousand, and 200 million neurons, respectively.
An elephant has over 250 billion neurons, which is more than us!

Only mammals have a cerebral cortex. The cerebral cortex of a rat has around
20 million neurons. Cats and dogs are at 300 and 160 million, respectively. A
gorilla has around 4 billion. A type of dolphin called the long-finned pilot whale
has an estimated 37 billion neurons in its cerebral cortex, which is roughly twice
as many as in the human cerebral cortex; however, scientists claim this does not
imply superior cognitive abilities [220, 273].

Another important factor in perception and overall cognitive ability is the
interconnection between neurons. Imagine an enormous directed graph, with the
usual nodes and directed edges. The nucleus or cell body of each neuron is a
node that does some kind of “processing”. Figure 2.18 shows a neuron. The
dendrites are essentially input edges to the neuron, whereas the axons are output
edges. Through a network of dendrites, the neuron can aggregate information
from numerous other neurons, which themselves may have aggregated information
from others. The result is sent to one or more neurons through the axon. For a
connected axon-dendrite pair, communication occurs in a gap called the synapse,
where electrical or chemical signals are passed along. Each neuron in the human
brain has on average about 7000 synaptic connections to other neurons, which
results in about $10^{15}$ edges in our enormous brain graph!

Hierarchical processing Upon leaving the sense-organ receptors, signals
propagate among the neurons to eventually reach the cerebral cortex. Along the way,
hierarchical processing is performed; see Figure 2.19. Through selectivity, each
receptor responds to a narrow range of stimuli, across time, space, frequency, and
so on. After passing through several neurons, signals from numerous receptors
are simultaneously taken into account. This allows increasingly complex patterns
to be detected in the stimulus. In the case of vision, feature detectors appear in
the early hierarchical stages, enabling us to detect features such as edges, corners,
and motion. Once in the cerebral cortex, the signals from sensors are combined
with anything else from our life experiences that may become relevant for
making an interpretation of the stimuli. Various perceptual phenomena occur, such
as recognizing a face or identifying a song. Information or concepts that appear
in the cerebral cortex tend to represent a global picture of the world around us.
Surprisingly, topographic mapping methods reveal that spatial relationships among
receptors are maintained in some cases among the distribution of neurons. Also,
recall from Section 1.1 that place cells and grid cells encode spatial maps of familiar
environments.

Figure 2.19: The stimulus captured by receptors works its way through a
hierarchical network of neurons. In the early stages, signals are combined from multiple
receptors and propagated upward. At later stages, information flows
bidirectionally.

Proprioception In addition to information from senses and memory, we also
use proprioception, which is the ability to sense the relative positions of parts of our
bodies and the amount of muscular effort being involved in moving them. Close
your eyes and move your arms around in an open area. You should have an idea of
where your arms are located, although you might not be able to precisely reach out
and touch your fingertips together without using your eyes. This information is so
important to our brains that the motor cortex, which controls body motion, sends
signals called efference copies to other parts of the brain to communicate what
motions have been executed. Proprioception is effectively another kind of sense.
Continuing our comparison with robots, it corresponds to having encoders on joints
or wheels, to indicate how far they have moved. One interesting implication of
proprioception is that you cannot tickle yourself because you know where your
fingers are moving; however, if someone else tickles you, then you do not have
access to their efference copies. The lack of this information is crucial to the
tickling sensation.

Fusion of senses Signals from multiple senses and proprioception are being
processed and combined with our experiences by our neural structures
throughout our lives. In ordinary life, without VR or drugs, our brains interpret these
combinations of inputs in coherent, consistent, and familiar ways. Any attempt to
interfere with these operations is likely to cause a mismatch among the data from
our senses. The brain may react in a variety of ways. It could be the case that we
are not consciously aware of the conflict, but we may become fatigued or develop
a headache. Even worse, we could develop symptoms of dizziness or nausea. In
other cases, the brain might react by making us so consciously aware of the con-
flict that we immediately understand that the experience is artificial. This would
correspond to a case in which the VR experience is failing to convince people that
they are present in a virtual world. To make an effective and comfortable VR
experience, trials with human subjects are essential to understand how the brain
reacts. It is practically impossible to predict what would happen in an unknown
scenario, unless it is almost identical to other well-studied scenarios.
One of the most important examples of bad sensory conflict in the context of
VR is vection, which is the illusion of self motion. The conflict arises when your
vision sense reports to your brain that you are accelerating, but your balance sense
reports that you are motionless. As people walk down the street, their balance and
vision senses are in harmony. You might have experienced vection before, even
without VR. If you are stuck in traffic or stopped at a train station, you might have
felt as if you are moving backwards while seeing a vehicle in your periphery that is
moving forward. In the 1890s, Amariah Lake constructed an amusement park ride
that consisted of a swing that remains at rest while the entire room surrounding
the swing rocks back-and-forth (Figure 2.20). In VR, vection is caused by the
locomotion operation described in Section 2.2. For example, if you accelerate
yourself forward using a controller, rather than moving forward in the real world,
then you perceive acceleration with your eyes, but not your vestibular organ. For
strategies to alleviate this problem, see Section 10.2.
Figure 2.21: The most basic psychometric function. For this example, as the
stimulus intensity is increased, the percentage of people detecting the phenomenon
increases. The point along the curve that corresponds to 50 percent indicates a
critical threshold or boundary in the stimulus intensity. The curve above
corresponds to the cumulative distribution function of the error model (often assumed
to be Gaussian).

Figure 2.22: Stevens’ power law (2.1) captures the relationship between the
magnitude of a stimulus and its perceived magnitude. The model is a power-law
curve, and the exponent depends on the stimulus type.

perception of stationarity.

• The left and right eye views are swapped.
• Objects appear to one eye but not the other.
• One eye view has significantly more latency than the other.
• Straight lines are slightly curved due to uncorrected warping in the optical
system.

This disconnect between the actual stimulus and one’s perception of the stimulus
leads to the next topic.

Psychophysics Psychophysics is the scientific study of perceptual phenomena
that are produced by physical stimuli. For example, under what conditions would
someone call an object “red”? The stimulus corresponds to light entering the eye,
and the perceptual phenomenon is the concept of “red” forming in the brain. Other
examples of perceptual phenomena are “straight”, “larger”, “louder”, “tickles”,
and “sour”. Figure 2.21 shows a typical scenario in a psychophysical experiment.
As one parameter is varied, such as the frequency of a light, there is usually a
range of values for which subjects cannot reliably classify the phenomenon. For
example, there may be a region where they are not sure whether the light is
red. At one extreme, they may consistently classify it as “red” and at the other
extreme, they consistently classify it as “not red”. For the region in between,
the probability of detection is recorded, which corresponds to the frequency with
which it is classified as “red”. Section 12.4 will discuss how such experiments are
designed and conducted.

Stevens’ power law One of the best-known results from psychophysics is
Stevens’ power law, which characterizes the relationship between the magnitude of
a physical stimulus and its perceived magnitude [311]. The hypothesis is that a
power-law relationship occurs over a wide range of sensory systems and stimuli:

$$p = c\,m^{x} \tag{2.1}$$

in which

• m is the magnitude or intensity of the stimulus,
• p is the perceived magnitude,
• x relates the actual magnitude to the perceived magnitude, and is the most
important part of the equation, and
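As a small numerical illustration of (2.1), read as p = c m^x, the sketch below evaluates the perceived magnitude for a few exponents. The function name perceived_magnitude is hypothetical, c is assumed here to be a positive scaling constant, and the exponent values are illustrative placeholders rather than measured values for any particular sense.

```python
def perceived_magnitude(m, x, c=1.0):
    """Evaluate p = c * m**x for a stimulus magnitude m >= 0.

    c is assumed to be a positive scaling constant; x is the exponent
    that depends on the stimulus type.
    """
    if m < 0:
        raise ValueError("stimulus magnitude must be nonnegative")
    return c * m ** x


# Doubling the stimulus multiplies the perceived magnitude by 2**x,
# regardless of c. The exponents below are illustrative only.
for x in (0.5, 1.0, 2.0):
    ratio = perceived_magnitude(2.0, x) / perceived_magnitude(1.0, x)
    print(f"x = {x}: doubling m scales p by {ratio:.3f}")
```

With x < 1 the perceived magnitude grows more slowly than the stimulus, with x = 1 it grows proportionally, and with x > 1 it grows faster.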
Chapter 3
Section 2.2 introduced the Virtual World Generator (VWG), which maintains the
geometry and physics of the virtual world. This chapter covers the geometry part,
which is needed to make models and move them around. The models could include
the walls of a building, furniture, clouds in the sky, the user’s avatar, and so on.
Section 3.1 covers the basics of how to define consistent, useful models. Section
3.2 explains how to apply mathematical transforms that move them around in the
virtual world. This involves two components: Translation (changing position) and
rotation (changing orientation). Section 3.3 presents the best ways to express and
manipulate 3D rotations, which are the most complicated part of moving models.
Section 3.4 then covers how the virtual world appears if we try to “look” at it from
a particular perspective. This is the geometric component of visual rendering,
which is covered in Chapter 7. Finally, Section 3.5 puts all of the transformations
together, so that you can see how to go from defining a model to having it appear
in the right place on the display.

If you work with high-level engines to build a VR experience, then most of
the concepts from this chapter might not seem necessary. You might need only to
select options from menus and write simple scripts. However, an understanding of
the basic transformations, such as how to express 3D rotations or move a camera
viewpoint, is essential to making the software do what you want. Furthermore,
if you want to build virtual worlds from scratch, or at least want to understand
what is going on under the hood of a software engine, then this chapter is critical.

3.1 Geometric Models

We first need a virtual world to contain the geometric models. For our purposes,
it is enough to have a 3D Euclidean space with Cartesian coordinates. Therefore,
let $\mathbb{R}^3$ denote the virtual world, in which every point is represented as a triple
of real-valued coordinates: $(x, y, z)$. The coordinate axes of our virtual world are
shown in Figure 3.1. We will consistently use right-handed coordinate systems in
this book because they represent the predominant choice throughout physics and
engineering; however, left-handed systems appear in some places, with the most
notable being Microsoft’s DirectX graphical rendering library. In these cases, one
of the three axes points in the opposite direction in comparison to its direction in a
right-handed system. This inconsistency can lead to hours of madness when
writing software; therefore, be aware of the differences and their required conversions if
you mix software or models that use both. If possible, avoid mixing right-handed
and left-handed systems altogether.

Figure 3.1: Points in the virtual world are given coordinates in a right-handed
coordinate system in which the y axis is pointing upward. The origin (0, 0, 0) lies
at the point where axes intersect. Also shown is a 3D triangle, which is defined by its
three vertices, each of which is a point in $\mathbb{R}^3$.

Geometric models are made of surfaces or solid regions in $\mathbb{R}^3$ and contain an
infinite number of points. Because representations in a computer must be finite,
models are defined in terms of primitives in which each represents an infinite set
of points. The simplest and most useful primitive is a 3D triangle, as shown in
Figure 3.1. A planar surface patch that corresponds to all points “inside” and on
the boundary of the triangle is fully specified by the coordinates of the triangle
vertices:

$$((x_1, y_1, z_1), (x_2, y_2, z_2), (x_3, y_3, z_3)). \tag{3.1}$$

To model a complicated object or body in the virtual world, numerous
triangles can be arranged into a mesh, as shown in Figure 3.2. This provokes many
important questions:

1. How do we specify how each triangle “looks” whenever viewed by a user in
VR?
2. How do we make the object “move”?
3. If the object surface is sharply curved, then should we use curved primitives,
rather than trying to approximate the curved object with tiny triangular
patches?
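As a minimal sketch of the triangle representation in (3.1) and of the handedness issue discussed above, the code below stores a triangle as three coordinate triples and converts it between a right-handed and a left-handed system. It assumes that the two systems differ only in the direction of the z axis; which axis is reversed depends on the software being mixed, so treat that choice, and the names flip_z and convert_handedness, as illustrative assumptions. Vertex ordering and surface normals may also need attention after such a conversion and are not handled here.

```python
from typing import Tuple

Point = Tuple[float, float, float]
Triangle = Tuple[Point, Point, Point]


def flip_z(p: Point) -> Point:
    """Negate the z coordinate (assumed to be the axis that differs)."""
    x, y, z = p
    return (x, y, -z)


def convert_handedness(tri: Triangle) -> Triangle:
    """Convert every vertex of a triangle between the two systems.

    The conversion is its own inverse: applying it twice restores
    the original coordinates.
    """
    a, b, c = tri
    return (flip_z(a), flip_z(b), flip_z(c))


# A triangle given by its three vertices, as in (3.1).
tri: Triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 2.0))
converted = convert_handedness(tri)
print(converted[2])
```

Applying the same conversion to every model at the boundary between two libraries, and nowhere else, is one way to keep the rest of the code in a single convention.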
Data structures Consider listing all of the triangles in a file or memory array. If
the triangles form a mesh, then most or all vertices will be shared among multiple
triangles. Storing the coordinates of each triangle separately then repeats every
shared vertex, which is clearly a waste of space. Another issue is that we will frequently
want to perform operations on the model. For example, after moving an object,
can we determine whether it is in collision with another object (covered in Section
8.3)? A typical low-level task might be to determine which triangles share a
common vertex or edge with a given triangle. This might require linearly searching
through the triangle list to determine whether they share a vertex or two. If there
are millions of triangles, which is not uncommon, then it would cost too much to
perform this operation repeatedly.

For these reasons and more, geometric models are usually encoded in clever
data structures. The choice of the data structure should depend on which
operations will be performed on the model. One of the most useful and common is the
doubly connected edge list, also known as half-edge data structure [54, 224]. See
Figure 3.3. In this and similar data structures, there are three kinds of data elements:
faces, edges, and vertices. These represent two, one, and zero-dimensional parts,
respectively, of the model. In our case, every face element represents a triangle.
Each edge represents the border of one or two triangles, without duplication. Each
vertex is shared between one or more triangles, again without duplication. The

Inside vs. outside Now consider the question of whether the object interior
is part of the model (recall Figure 3.2). Suppose the mesh triangles fit together
perfectly so that every edge borders exactly two triangles and no triangles intersect
unless they are adjacent along the surface. In this case, the model forms a complete
barrier between the inside and outside of the object. If we were to hypothetically
fill the inside with a gas, then it could not leak to the outside. This is an example
of a coherent model. Such models are required if the notion of inside or outside
is critical to the VWG. For example, a penny could be inside of the dolphin, but
not intersecting with any of its boundary triangles. Would this ever need to be
detected? If we remove a single triangle, then the hypothetical gas would leak
out. There would no longer be a clear distinction between the inside and outside
of the object, making it difficult to answer the question about the penny and the
dolphin. In the extreme case, we could have a single triangle in space. There is
clearly no natural inside or outside. At another extreme, the model could be as bad
as polygon soup, which is a jumble of triangles that do not fit together nicely and
could even have intersecting interiors. In conclusion, be careful when constructing
models so that the operations you want to perform later will be logically clear.
If you are using a high-level design tool, such as Blender or Maya, to make your
models, then coherent models will be automatically built.
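The two ideas above, storing shared vertices once and checking whether every edge borders exactly two triangles, can be sketched with a simple indexed mesh and an edge table. This is deliberately simpler than a doubly connected edge list, and the function names are hypothetical. The check covers only the edge condition of a coherent model; it does not detect triangles that intersect away from their shared edges.

```python
from collections import defaultdict

# Each vertex is stored once; triangles refer to vertices by index.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
triangles = [(0, 2, 1), (0, 1, 3), (1, 2, 3), (0, 3, 2)]  # a tetrahedron


def edge_key(u, v):
    """Undirected edge, stored with the smaller vertex index first."""
    return (min(u, v), max(u, v))


def edge_to_triangles(tris):
    """Map each undirected edge to the indices of the triangles it borders."""
    table = defaultdict(list)
    for t, (a, b, c) in enumerate(tris):
        for u, v in ((a, b), (b, c), (c, a)):
            table[edge_key(u, v)].append(t)
    return table


def neighbors(t, tris, table):
    """Triangles sharing an edge with triangle t, without a linear search."""
    a, b, c = tris[t]
    found = set()
    for u, v in ((a, b), (b, c), (c, a)):
        found.update(table[edge_key(u, v)])
    found.discard(t)
    return found


def every_edge_borders_two(tris):
    """Edge condition for a coherent model: no gaps and no extra faces."""
    return all(len(ts) == 2 for ts in edge_to_triangles(tris).values())


table = edge_to_triangles(triangles)
print(sorted(neighbors(0, triangles, table)))
print(every_edge_borders_two(triangles))       # closed tetrahedron
print(every_edge_borders_two(triangles[:-1]))  # one triangle removed
```

Building the edge table takes one pass over the triangle list, after which each neighbor query touches only three table entries. Removing a triangle leaves three edges with a single bordering triangle, which is the "leak" described above.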
Why triangles? Continuing upward through the questions above, triangles are used because they are the simplest for algorithms to handle, especially if implemented in hardware. GPU implementations tend to be biased toward smaller representations so that a compact list of instructions can be applied to numerous model parts in parallel. It is certainly possible to use more complicated primitives, such as quadrilaterals, splines, and semi-algebraic surfaces [88, 123, 221]. This could lead to smaller model sizes, but often comes at the expense of greater computational cost for handling each primitive. For example, it is much harder to determine whether two spline surfaces are colliding, in comparison to two 3D triangles.

Stationary vs. movable models There will be two kinds of models in the virtual world, which is embedded in $\mathbb{R}^3$:

• Stationary models, which keep the same coordinates forever. Typical examples are streets, floors, and buildings.
• Movable models, which can be transformed into various positions and orientations. Examples include vehicles, avatars, and small furniture.

Motion can be caused in a number of ways. Using a tracking system (Chapter 9), the model might move to match the user's motions. Alternatively, the user might operate a controller to move objects in the virtual world, including a representation of himself. Finally, objects might move on their own according to the laws of physics in the virtual world. Section 3.2 will cover the mathematical operations that move models to their desired places, and Chapter 8 will describe velocities, accelerations, and other physical aspects of motion.

Choosing coordinate axes One often neglected point is the choice of coordinates for the models, in terms of their placement and scale. If these are defined cleverly at the outset, then many tedious complications can be avoided. If the virtual world is supposed to correspond to familiar environments from the real world, then the axis scaling should match common units. For example, (1, 0, 0) should mean one meter to the right of (0, 0, 0). It is also wise to put the origin (0, 0, 0) in a convenient location. Commonly, y = 0 corresponds to the floor of a building or sea level of a terrain. The location of x = 0 and z = 0 could be in the center of the virtual world so that it nicely divides into quadrants based on sign. Another common choice is to place it in the upper left when viewing the world from above so that all x and z coordinates are nonnegative. For movable models, the location of the origin and the axis directions become extremely important because they affect how the model is rotated. This should become clear in Sections 3.2 and 3.3 as we present rotations.

Viewing the models Of course, one of the most important aspects of VR is how the models are going to "look" when viewed on a display. This problem is divided into two parts. The first part involves determining where the points in the virtual world should appear on the display. This is accomplished by viewing transformations in Section 3.4, which are combined with other transformations in Section 3.5 to produce the final result. The second part involves how each part of the model should appear after taking into account lighting sources and surface properties that are defined in the virtual world. This is the rendering problem, which is covered in Chapter 7.

3.2 Changing Position and Orientation

Suppose that a movable model has been defined as a mesh of triangles. To move it, we apply a single transformation to every vertex of every triangle. This section first considers the simple case of translation, followed by the considerably more complicated case of rotations. By combining translation and rotation, the model can be placed anywhere, and at any orientation in the virtual world $\mathbb{R}^3$.

Translations Consider the following 3D triangle,
\[
((x_1, y_1, z_1), (x_2, y_2, z_2), (x_3, y_3, z_3)), \tag{3.2}
\]
in which its vertex coordinates are expressed as generic constants.
Let $x_t$, $y_t$, and $z_t$ be the amounts by which we would like to change the triangle's position, along the x, y, and z axes, respectively. The operation of changing position is called translation, and it is given by
\[
\begin{aligned}
(x_1, y_1, z_1) &\mapsto (x_1 + x_t,\ y_1 + y_t,\ z_1 + z_t) \\
(x_2, y_2, z_2) &\mapsto (x_2 + x_t,\ y_2 + y_t,\ z_2 + z_t) \\
(x_3, y_3, z_3) &\mapsto (x_3 + x_t,\ y_3 + y_t,\ z_3 + z_t),
\end{aligned}
\tag{3.3}
\]
in which $a \mapsto b$ denotes that a becomes replaced by b after the transformation is applied. Applying (3.3) to every triangle in a model will translate all of it to the desired location. If the triangles are arranged in a mesh, then it is sufficient to apply the transformation to the vertices alone. All of the triangles will retain their size and shape.

Relativity Before the transformations become too complicated, we want to caution you about interpreting them correctly. Figures 3.4(a) and 3.4(b) show an example in which a triangle is translated by $x_t = -8$ and $y_t = -7$. The vertex coordinates are the same in Figures 3.4(b) and 3.4(c). Figure 3.4(b) shows the case we have intended to cover so far: The triangle is interpreted as having moved in the virtual world. However, Figure 3.4(c) shows another possibility: The coordinates of the virtual world have been reassigned so that the triangle is closer to the origin. This is equivalent to having moved the entire world, with the triangle being the only part that does not move. In this case, the translation is applied to the coordinate axes, but it is negated. When we apply more general transformations,
this extends so that transforming the coordinate axes results in an inverse of the transformation that would correspondingly move the model. Negation is simply the inverse in the case of translation.

Figure 3.4 panels: (a) Original object; (b) Object moves; (c) Origin moves.
Figure 3.4: Every transformation has two possible interpretations, even though the math is the same. Here is a 2D example, in which a triangle is defined in (a). We could translate the triangle by $x_t = -8$ and $y_t = -7$ to obtain the result in (b). If we instead wanted to hold the triangle fixed but move the origin up by 8 in the x direction and 7 in the y direction, then the coordinates of the triangle vertices change the exact same way, as shown in (c).

Thus, we have a kind of "relativity": Did the object move, or did the whole world move around it? This idea will become important in Section 3.4 when we want to change viewpoints. If we were standing at the origin, looking at the triangle, then the result would appear the same in either case; however, if the origin moves, then we would move with it. A deep perceptual problem lies here as well. If we perceive ourselves as having moved, then VR sickness might increase, even though it was the object that moved. In other words, our brains make their best guess as to which type of motion occurred, and sometimes get it wrong.

Getting ready for rotations How do we make the wheels roll on a car? Or turn a table over onto its side? To accomplish these, we need to change the model's orientation in the virtual world. The operation that changes the orientation is called rotation. Unfortunately, rotations in three dimensions are much more complicated than translations, leading to countless frustrations for engineers and developers. To improve the clarity of 3D rotation concepts, we first start with a simpler problem: 2D linear transformations.
Consider a 2D virtual world, in which points have coordinates (x, y). You can imagine this as a vertical plane in our original, 3D virtual world. Now consider a generic two-by-two matrix
\[
M = \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix} \tag{3.4}
\]
in which each of the four entries could be any real number. We will look at what happens when this matrix is multiplied by the point (x, y), when it is written as a column vector.
Performing the multiplication, we obtain
\[
\begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} x' \\ y' \end{bmatrix}, \tag{3.5}
\]
in which $(x', y')$ is the transformed point. Using simple algebra, the matrix multiplication yields
\[
\begin{aligned}
x' &= m_{11} x + m_{12} y \\
y' &= m_{21} x + m_{22} y.
\end{aligned}
\tag{3.6}
\]
Using notation as in (3.3), M is a transformation for which $(x, y) \mapsto (x', y')$.

Applying the 2D matrix to points Suppose we place two points (1, 0) and (0, 1) in the plane. They lie on the x and y axes, respectively, at one unit of distance from the origin (0, 0). Using vector spaces, these two points would be the standard unit basis vectors (sometimes written as $\hat{\imath}$ and $\hat{\jmath}$). Watch what happens if we substitute them into (3.5):
\[
\begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \end{bmatrix}
=
\begin{bmatrix} m_{11} \\ m_{21} \end{bmatrix} \tag{3.7}
\]
and
\[
\begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}
\begin{bmatrix} 0 \\ 1 \end{bmatrix}
=
\begin{bmatrix} m_{12} \\ m_{22} \end{bmatrix}. \tag{3.8}
\]
These special points simply select the column vectors of M. What does this mean? If M is applied to transform a model, then each column of M indicates precisely how each coordinate axis is changed.
Figure 3.5 illustrates the effect of applying various matrices M to a model. Starting with the upper left, the identity matrix does not cause the coordinates to change: $(x, y) \mapsto (x, y)$. The second example causes a flip as if a mirror were placed at the y axis. In this case, $(x, y) \mapsto (-x, y)$. The second row shows examples of scaling. The matrix on the left produces $(x, y) \mapsto (2x, 2y)$, which doubles the size. The matrix on the right only stretches the model in the y direction, causing an aspect ratio distortion. In the third row, it might seem that the matrix on the left produces a mirror image with respect to both x and y axes. This is true, except that the mirror image of a mirror image restores the original. Thus, this corresponds to the case of a 180-degree ($\pi$ radians) rotation, rather than a mirror image. The matrix on the right produces a shear along the x direction: $(x, y) \mapsto (x + y, y)$. The amount of displacement is proportional to y. In the bottom row, the matrix on the left shows a skew in the y direction. The final matrix might at first appear to cause more skewing, but it is degenerate. The two-dimensional shape collapses into a single dimension when M is applied: $(x, y) \mapsto (x + y, x + y)$. This corresponds to the case of a singular matrix, which means that its columns are not linearly independent (they are in fact identical). A matrix is singular if and only if its determinant is zero.
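The effect of a two-by-two matrix on points, as in (3.6), is easy to try numerically. The following is a minimal sketch in plain Python; the function names `apply` and `det` and the storage of M as a pair of rows are choices made for this illustration. The matrices are the ones whose mappings are stated in the discussion of Figure 3.5.

```python
def apply(M, p):
    """Multiply the 2x2 matrix M (a pair of rows) by the point p = (x, y)."""
    (m11, m12), (m21, m22) = M
    x, y = p
    return (m11 * x + m12 * y, m21 * x + m22 * y)


def det(M):
    """Determinant of a 2x2 matrix; zero means the matrix is singular."""
    (m11, m12), (m21, m22) = M
    return m11 * m22 - m12 * m21


identity = ((1, 0), (0, 1))
mirror = ((-1, 0), (0, 1))       # (x, y) -> (-x, y)
rotate_180 = ((-1, 0), (0, -1))  # (x, y) -> (-x, -y)
x_shear = ((1, 1), (0, 1))       # (x, y) -> (x + y, y)
singular = ((1, 1), (1, 1))      # (x, y) -> (x + y, x + y)

# The points (1, 0) and (0, 1) select the columns of M.
first_column = apply(x_shear, (1, 0))   # (1, 0)
second_column = apply(x_shear, (0, 1))  # (1, 1)
```

Applying `singular` to any set of points places them all on the line y = x, which is the collapse into one dimension described above, and `det(singular)` evaluates to 0.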
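Figure 3.5 panels (recoverable): Identity $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$; Mirror $\begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}$; Rotate 180 $\begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}$; x-shear $\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$.
Figure 3.5: Eight different matrices applied to transform a square face. These examples nicely cover all of the possible cases, in a qualitative sense.

Only some matrices produce rotations The examples in Figure 3.5 span the main qualitative differences between various two-by-two matrices M. Two of them were rotation matrices: the identity matrix, which is 0 degrees of rotation, and the 180-degree rotation matrix. Among the set of all possible M, which ones are valid rotations? We must ensure that the model does not become distorted. This is achieved by ensuring that M satisfies the following rules:

1. No stretching of axes.
2. No shearing.
3. No mirror images.

If none of these rules is violated, then the result is a rotation.
To satisfy the first rule, the columns of M must have unit length:
\[
m_{11}^2 + m_{21}^2 = 1 \quad \text{and} \quad m_{12}^2 + m_{22}^2 = 1. \tag{3.9}
\]
To satisfy the second rule, the coordinate axes must remain perpendicular, which means that the columns of M have a zero dot product:
\[
m_{11} m_{12} + m_{21} m_{22} = 0. \tag{3.10}
\]
The shearing transformations in Figure 3.5 violate this rule, which clearly causes right angles in the model to be destroyed.
Satisfying the third rule requires that the determinant of M is positive. After satisfying the first two rules, the only possible remaining determinants are 1 (the normal case) and −1 (the mirror-image case). Thus, the rule implies that:
\[
\det \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix} = m_{11} m_{22} - m_{12} m_{21} = 1. \tag{3.11}
\]
The first rule means that each column of M lies on the unit circle $x^2 + y^2 = 1$, which is commonly parameterized by an angle $\theta$ as
\[
x = \cos\theta \quad \text{and} \quad y = \sin\theta. \tag{3.12}
\]
Instead of x and y, we use the notation of the matrix components. Let $m_{11} = \cos\theta$ and $m_{21} = \sin\theta$. Substituting this into M from (3.4) yields
\[
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \tag{3.13}
\]
in which $m_{12}$ and $m_{22}$ are determined by the second and third rules.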
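The three rules can be tested directly on the entries of a matrix. The sketch below builds the matrix in (3.13) and checks a few of the Figure 3.5 examples against the rules. The function names and the tolerance `tol` are assumptions of this illustration, and the no-shearing test used here (a zero dot product between the two columns) is this sketch's reading of the second rule.

```python
import math


def rotation(theta):
    """The 2D rotation matrix of (3.13), stored as a pair of rows."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))


def is_rotation(M, tol=1e-9):
    """Check the three rules: no stretching, no shearing, no mirror image."""
    (m11, m12), (m21, m22) = M
    unit_columns = (abs(m11 ** 2 + m21 ** 2 - 1) < tol
                    and abs(m12 ** 2 + m22 ** 2 - 1) < tol)
    no_shear = abs(m11 * m12 + m21 * m22) < tol
    no_mirror = abs(m11 * m22 - m12 * m21 - 1) < tol
    return unit_columns and no_shear and no_mirror


mirror = ((-1, 0), (0, 1))
rotate_180 = ((-1, 0), (0, -1))
x_shear = ((1, 1), (0, 1))
double_size = ((2, 0), (0, 2))
```

With these definitions, `is_rotation(rotation(theta))` holds for any angle, `rotate_180` passes, and `mirror`, `x_shear`, and `double_size` each fail one of the rules.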
Combining rotations Each of (3.16), (3.17), and (3.18) provides a single DOF of rotations. The yaw, pitch, and roll rotations can be combined sequentially to attain any possible 3D rotation:
general matrix transform M, we apply the matrix inverse $M^{-1}$ (if it exists). This is often complicated to calculate. Fortunately, inverses are much simpler for our cases of interest. In the case of a rotation matrix R, the inverse is equal to the transpose $R^{-1} = R^T$.² To invert the homogeneous transform matrix (3.21), it is tempting to write
\[
\begin{bmatrix}
 & & & -x_t \\
 & R^T & & -y_t \\
 & & & -z_t \\
0 & 0 & 0 & 1
\end{bmatrix}. \tag{3.23}
\]
This will undo both the translation and the rotation; however, the order is wrong. Remember that these operations are not commutative, which implies that order must be correctly handled. See Figure 3.8. The algebra for very general matrices (part of noncommutative group theory) works out so that the inverse of a product of matrices reverses their order:
\[
(ABC)^{-1} = C^{-1} B^{-1} A^{-1}. \tag{3.24}
\]
This can be seen by putting the inverse next to the original product:
\[
A B C C^{-1} B^{-1} A^{-1}. \tag{3.25}
\]
In this way, C cancels with its inverse, followed by B and its inverse, and finally A and its inverse. If the order were wrong, then these cancellations would not occur.
The matrix $T_{rb}$ (from (3.21)) applies the rotation first, followed by translation. Applying (3.23) undoes the rotation first and then translation, without reversing the order. Thus, the inverse of $T_{rb}$ is
\[
\begin{bmatrix}
 & & & 0 \\
 & R^T & & 0 \\
 & & & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & -x_t \\
0 & 1 & 0 & -y_t \\
0 & 0 & 1 & -z_t \\
0 & 0 & 0 & 1
\end{bmatrix}. \tag{3.26}
\]
The matrix on the right first undoes the translation (with no rotation). After that, the matrix on the left undoes the rotation (with no translation).

² Recall that to transpose a square matrix, we simply swap the i and j indices, which turns columns into rows.

3.3 Axis-Angle Representations of Rotation

As observed in Section 3.2, 3D rotation is complicated for several reasons: 1) Nine matrix entries are specified in terms of only three independent parameters, and with no simple parameterization, 2) the axis of rotation is not the same every time, and 3) the operations are noncommutative, implying that the order of matrices is crucial. None of these problems existed for the 2D case.

Kinematic singularities An even worse problem arises when using yaw, pitch, roll angles (and related Euler-angle variants). Even though they start off being intuitively pleasing, the representation becomes degenerate, leading to kinematic singularities that are nearly impossible to visualize. An example will be presented shortly. To prepare for this, recall how we represent locations on the Earth. These are points in $\mathbb{R}^3$, but are represented with longitude and latitude coordinates. Just like the limits of yaw and pitch, longitude ranges from 0 to $2\pi$ and latitude only ranges from $-\pi/2$ to $\pi/2$. (Longitude is usually expressed as 0 to 180 degrees west or east, which is equivalent.) As we travel anywhere on the Earth, the latitude and longitude coordinates behave very much like xy coordinates; however, we tend to stay away from the poles. Near the North Pole, the latitude behaves normally, but the longitude could vary a large amount while corresponding to a tiny distance traveled. Recall how a wall map of the world looks near the poles: Greenland is enormous and Antarctica wraps across the entire bottom (assuming it uses a projection that keeps longitude lines straight). The poles themselves are the kinematic singularities: At these special points, you can vary longitude, but the location on the Earth is not changing. One of two DOFs seems to be lost.
The same problem occurs with 3D rotations, but it is harder to visualize due to the extra dimension. If the pitch angle is held at $\beta = \pi/2$, then a kind of "North Pole" is reached in which $\alpha$ and $\gamma$ vary independently but cause only one DOF (in the case of latitude and longitude, it was one parameter varying but causing zero DOFs). Here is how it looks when combining the yaw, pitch, and roll matrices:
\[
\begin{bmatrix} \cos\alpha & 0 & \sin\alpha \\ 0 & 1 & 0 \\ -\sin\alpha & 0 & \cos\alpha \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}
=
\begin{bmatrix} \cos(\alpha - \gamma) & \sin(\alpha - \gamma) & 0 \\ 0 & 0 & -1 \\ -\sin(\alpha - \gamma) & \cos(\alpha - \gamma) & 0 \end{bmatrix}. \tag{3.27}
\]
The second matrix above corresponds to pitch (3.17) with $\beta = \pi/2$. The result on the right is obtained by performing matrix multiplication and applying a subtraction trigonometric identity. You should observe that the resulting matrix is a function of both $\alpha$ and $\gamma$, but there is one DOF because only the difference $\alpha - \gamma$ affects the resulting rotation. In the video game industry there have been some back-and-forth battles about whether this problem is crucial. In an FPS game, the avatar is usually not allowed to pitch his head all the way to $\pm\pi/2$, thereby avoiding this problem. In VR, it happens all the time that a user could pitch her head straight up or down. The kinematic singularity often causes the viewpoint to spin uncontrollably. This phenomenon also occurs when sensing and controlling a spacecraft's orientation using mechanical gimbals; the result is called gimbal lock.
The problems can be easily solved with axis-angle representations of rotation. They are harder to learn than yaw, pitch, and roll; however, it is a worthwhile investment because it avoids these problems. Furthermore, many well-written software libraries and game engines work directly with these representations. Thus, to use them effectively, you should understand what they are doing.
The most important insight to solving the kinematic singularity problems is Euler's rotation theorem (1775), shown in Figure 3.9. Even though the rotation
axis may change after rotations are combined, Euler showed that any 3D rotation can be expressed as a rotation $\theta$ about some axis that pokes through the origin. This matches the three DOFs for rotation: It takes two parameters to specify the direction of an axis and one parameter for $\theta$. The only trouble is that conversions back and forth between rotation matrices and the axis-angle representation are somewhat inconvenient. This motivates the introduction of a mathematical object that is close to the axis-angle representation, closely mimics the algebra of 3D rotations, and can even be applied directly to rotate models. The perfect representation: Quaternions.

Figure 3.9: Euler's rotation theorem states that every 3D rotation can be considered as a rotation by an angle $\theta$ about an axis through the origin, given by the unit direction vector $v = (v_1, v_2, v_3)$.

Figure 3.10: There are two ways to encode the same rotation in terms of axis and angle, using either $v$ or $-v$.

Two-to-one problem Before getting to quaternions, it is important to point out one annoying problem with Euler's rotation theorem. As shown in Figure 3.10, it does not claim that the axis-angle representation is unique. In fact, for every 3D rotation other than the identity, there are exactly two representations. This is due to the fact that the axis could "point" in either direction. We could insist that the axis always point in one direction, such as positive y, but this does not fully solve the problem because of the boundary cases (horizontal axes). Quaternions, which are coming next, nicely handle all problems with 3D rotations except this one, which is unavoidable.
Quaternions were introduced in 1843 by William Rowan Hamilton. When seeing them the first time, most people have difficulty understanding their peculiar algebra. Therefore, we will instead focus on precisely which quaternions correspond to which rotations. After that, we will introduce some limited quaternion algebra. The algebra is much less important for developing VR systems, unless you want to implement your own 3D rotation library. The correspondence between quaternions and 3D rotations, however, is crucial.
A quaternion q is a 4D vector:
\[
q = (a, b, c, d), \tag{3.28}
\]
in which a, b, c, and d are real numbers. We will use only unit quaternions, which means that
\[
a^2 + b^2 + c^2 + d^2 = 1 \tag{3.29}
\]
must always hold. This should remind you of the equation of a unit sphere ($x^2 + y^2 + z^2 = 1$), but it is one dimension higher. A sphere is a 2D surface, whereas the set of all unit quaternions is a 3D "hypersurface", more formally known as a manifold [27, 151]. We will use the space of unit quaternions to represent the space of all 3D rotations. Both have 3 DOFs, which seems reasonable.
Let $(v, \theta)$ be an axis-angle representation of a 3D rotation, as depicted in Figure 3.9. Let this be represented by the following quaternion:
\[
q = \left( \cos\frac{\theta}{2},\ v_1 \sin\frac{\theta}{2},\ v_2 \sin\frac{\theta}{2},\ v_3 \sin\frac{\theta}{2} \right). \tag{3.30}
\]
Think of q as a data structure that encodes the 3D rotation. It is easy to recover $(v, \theta)$ from q:
\[
\theta = 2 \cos^{-1} a \quad \text{and} \quad v = \frac{1}{\sqrt{1 - a^2}} (b, c, d). \tag{3.31}
\]
If a = 1, then (3.31) breaks; however, this corresponds to the case of the identity rotation.
You now have the mappings $(v, \theta) \mapsto q$ and $q \mapsto (v, \theta)$. To test your understanding, Figure 3.11 shows some simple examples, which commonly occur in practice. Furthermore, Figure 3.12 shows some simple relationships between quaternions and their corresponding rotations. The horizontal arrows indicate that $q$ and $-q$ represent the same rotation. This is true because of the double representation issue shown in Figure 3.10. Applying (3.30) to both cases establishes their equivalence. The vertical arrows correspond to inverse rotations. These
hold because reversing the direction of the axis causes the rotation to be reversed (rotation by $\theta$ becomes rotation by $2\pi - \theta$).

Figure 3.11: For these cases, you should be able to look at the quaternion and quickly picture the axis and angle of the corresponding 3D rotation.

Figure 3.12: Simple relationships between equivalent quaternions and their inverses.

How do we apply the quaternion $h = (a, b, c, d)$ to rotate the model? One way is to use the following conversion into a 3D rotation matrix:
\[
R(h) = \begin{bmatrix}
2(a^2 + b^2) - 1 & 2(bc - ad) & 2(bd + ac) \\
2(bc + ad) & 2(a^2 + c^2) - 1 & 2(cd - ab) \\
2(bd - ac) & 2(cd + ab) & 2(a^2 + d^2) - 1
\end{bmatrix}. \tag{3.32}
\]
A more efficient way exists which avoids converting into a rotation matrix. To accomplish this, we need to define quaternion multiplication. For any two quaternions, $q_1 = (a_1, b_1, c_1, d_1)$ and $q_2 = (a_2, b_2, c_2, d_2)$, let $q_1 * q_2$ denote the product, which is defined as
\[
\begin{aligned}
a_3 &= a_1 a_2 - b_1 b_2 - c_1 c_2 - d_1 d_2 \\
b_3 &= a_1 b_2 + a_2 b_1 + c_1 d_2 - c_2 d_1 \\
c_3 &= a_1 c_2 + a_2 c_1 + b_2 d_1 - b_1 d_2 \\
d_3 &= a_1 d_2 + a_2 d_1 + b_1 c_2 - b_2 c_1.
\end{aligned}
\tag{3.33}
\]
In other words, $q_3 = (a_3, b_3, c_3, d_3) = q_1 * q_2$ as defined in (3.33).
Here is a way to rotate the point (x, y, z) using the rotation represented by the quaternion $q = (a, b, c, d)$. Let $p = (0, x, y, z)$, which is done to give the point the same dimensions as a quaternion. Perform the following operation:
\[
p' = q * p * q^{-1}, \tag{3.34}
\]
in which $q^{-1} = (a, -b, -c, -d)$ (recall from Figure 3.12). The rotated point is $(x', y', z')$, which is taken from the result $p' = (0, x', y', z')$.
Here is a simple example for the point (1, 0, 0). Let $p = (0, 1, 0, 0)$ and consider executing a yaw rotation by $\pi/2$. According to (3.30) with $v = (0, 1, 0)$ and $\theta = \pi/2$, the corresponding quaternion is $q = (\frac{1}{\sqrt{2}}, 0, \frac{1}{\sqrt{2}}, 0)$. The inverse is $q^{-1} = (\frac{1}{\sqrt{2}}, 0, -\frac{1}{\sqrt{2}}, 0)$. After tediously applying (3.33) to calculate (3.34), the result is $p' = (0, 0, 0, -1)$. Thus, the rotated point is $(0, 0, -1)$, which is a correct yaw by $\pi/2$.

3.4 Viewing Transformations

This section describes how to transform the models in the virtual world so that they appear on a virtual screen. The main purpose is to set the foundation for graphical rendering, which adds effects due to lighting, material properties, and quantization. Ultimately, the result appears on the physical display. One side effect of these transforms is that they also explain how cameras form images, at least the idealized mathematics of the process. Think of this section as describing a virtual camera that is placed in the virtual world. What should the virtual picture, taken by that camera, look like? To make VR work correctly, the "camera" should actually be one of two virtual human eyes that are placed into the virtual world. Thus, what should a virtual eye see, based on its position and orientation in the virtual world? Rather than determine precisely what would appear on the retina, which should become clear after Section 4.4, here we merely calculate where the model vertices would appear on a flat, rectangular screen in the virtual world. See Figure 3.13.

Figure 3.13: If we placed a virtual eye or camera into the virtual world, what would it see? Section 3.4 provides transformations that place objects from the virtual world onto a virtual screen, based on the particular viewpoint of a virtual eye. A flat rectangular shape is chosen for engineering and historical reasons, even though it does not match the shape of our retinas.
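Before moving on, the quaternion operations of Section 3.3 can be checked numerically. The following is a minimal sketch in plain Python of the axis-angle mapping (3.30), the product (3.33), and the matrix conversion (3.32). The function names are hypothetical. One assumption should be stated loudly: the sketch embeds the point as (0, x, y, z), with the zero in the same slot as a in (3.28), because that is the embedding for which the product in (3.33) returns the rotated point in the last three components.

```python
import math


def from_axis_angle(v, theta):
    """Unit quaternion (a, b, c, d) for rotation by theta about unit axis v."""
    s = math.sin(theta / 2)
    return (math.cos(theta / 2), v[0] * s, v[1] * s, v[2] * s)


def multiply(q1, q2):
    """Quaternion product q1 * q2, following (3.33)."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + a2 * b1 + c1 * d2 - c2 * d1,
            a1 * c2 + a2 * c1 + b2 * d1 - b1 * d2,
            a1 * d2 + a2 * d1 + b1 * c2 - b2 * c1)


def inverse(q):
    """Inverse of a unit quaternion: negate b, c, and d."""
    a, b, c, d = q
    return (a, -b, -c, -d)


def rotate(q, point):
    """Rotate a 3D point by computing q * p * q^-1 with p = (0, x, y, z)."""
    p = (0.0,) + tuple(point)  # assumption: zero goes in the first slot
    _, x, y, z = multiply(multiply(q, p), inverse(q))
    return (x, y, z)


def to_matrix(q):
    """Rotation matrix of (3.32), stored as a tuple of rows."""
    a, b, c, d = q
    return ((2 * (a * a + b * b) - 1, 2 * (b * c - a * d), 2 * (b * d + a * c)),
            (2 * (b * c + a * d), 2 * (a * a + c * c) - 1, 2 * (c * d - a * b)),
            (2 * (b * d - a * c), 2 * (c * d + a * b), 2 * (a * a + d * d) - 1))


# Yaw by pi/2: rotation about the y axis.
yaw_quarter_turn = from_axis_angle((0.0, 1.0, 0.0), math.pi / 2)
rotated = rotate(yaw_quarter_turn, (1.0, 0.0, 0.0))  # approximately (0, 0, -1)
```

The first column of `to_matrix(yaw_quarter_turn)` agrees with `rotated`, since the first column of a rotation matrix is the image of (1, 0, 0). Negating all four components of the quaternion leaves the result of `rotate` unchanged, which is the two-to-one property of unit quaternions.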
An eye's view Figure 3.14 shows a virtual eye that is looking down the negative z axis. It is placed in this way so that from the eye's perspective, x increases to the right and y is upward. This corresponds to familiar Cartesian coordinates. The alternatives would be: 1) to face the eye in the positive z direction, which makes the xy coordinates appear backwards, or 2) reverse the z axis, which would unfortunately lead to a left-handed coordinate system. Thus, we have made an odd choice that avoids worse complications.

Figure 3.14: Consider an eye that is looking down the z axis in the negative direction. The origin of the model is the point at which light enters the eye.

Suppose that the eye is an object model that we want to place into the virtual world $\mathbb{R}^3$ at some position $e = (e_1, e_2, e_3)$ and orientation given by the matrix
\[
R_{eye} = \begin{bmatrix}
\hat{x}_1 & \hat{y}_1 & \hat{z}_1 \\
\hat{x}_2 & \hat{y}_2 & \hat{z}_2 \\
\hat{x}_3 & \hat{y}_3 & \hat{z}_3
\end{bmatrix}. \tag{3.35}
\]
If the eyeball in Figure 3.14 were made of triangles, then rotation by $R_{eye}$ and translation by $e$ would be applied to all vertices to place it in $\mathbb{R}^3$.
This does not, however, solve the problem of how the virtual world should appear to the eye. Rather than moving the eye in the virtual world, we need to move all of the models in the virtual world to the eye's frame of reference. This means that we need to apply the inverse transformation. The inverse rotation is $R_{eye}^T$, the transpose of $R_{eye}$. The inverse of $e$ is $-e$. Applying (3.26) results in the appropriate transform:
\[
T_{eye} = \begin{bmatrix}
\hat{x}_1 & \hat{x}_2 & \hat{x}_3 & 0 \\
\hat{y}_1 & \hat{y}_2 & \hat{y}_3 & 0 \\
\hat{z}_1 & \hat{z}_2 & \hat{z}_3 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & -e_1 \\
0 & 1 & 0 & -e_2 \\
0 & 0 & 1 & -e_3 \\
0 & 0 & 0 & 1
\end{bmatrix}. \tag{3.36}
\]
Note that $R_{eye}$, as shown in (3.35), has been transposed and placed into the left matrix above. Also, the order of translation and rotation has been swapped, which is required for the inverse, as mentioned in Section 3.2.
Following Figure 3.4, there are two possible interpretations of (3.36). As stated, this could correspond to moving all of the virtual world models (corresponding to Figure 3.4(b)). A more appropriate interpretation in the current setting is that the virtual world's coordinate frame is being moved so that it matches the eye's frame from Figure 3.14. This corresponds to the case of Figure 3.4(c), which was not the appropriate interpretation in Section 3.2.

Starting from a look-at For VR, the position and orientation of the eye in the virtual world are given by a tracking system and possibly controller inputs. By contrast, in computer graphics, it is common to start with a description of where the eye is located and which way it is looking. This is called a look-at, and has the following components:

1. Position of the eye: $e$
2. Central looking direction of the eye: $\hat{c}$
3. Up direction: $\hat{u}$.

Both $\hat{c}$ and $\hat{u}$ are unit vectors. The first direction $\hat{c}$ corresponds to the center of the view. Whatever $\hat{c}$ is pointing at should end up in the center of the display. If we want this to be a particular point $p$ in $\mathbb{R}^3$ (see Figure 3.15), then $\hat{c}$ can be calculated as
\[
\hat{c} = \frac{p - e}{\| p - e \|}, \tag{3.37}
\]
in which $\| \cdot \|$ denotes the length of a vector. The result is just the vector from $e$ to $p$, but normalized.

Figure 3.15: The vector from the eye position $e$ to a point $p$ that it is looking at is normalized to form $\hat{c}$ in (3.37).

The second direction $\hat{u}$ indicates which way is up. Imagine holding a camera out as if you are about to take a photo and then performing a roll rotation. You can make level ground appear to be slanted or even upside down in the picture. Thus, $\hat{u}$ indicates the up direction for the virtual camera or eye.
We now construct the resulting transform $T_{eye}$ from (3.36). The translation components are already determined by $e$, which was given in the look-at. We need only to determine the rotation $R_{eye}$, as expressed in (3.35). Recall from Section 3.2 that the matrix columns indicate how the coordinate axes are transformed by
the matrix (refer to (3.7) and (3.8)). This simplifies the problem of determining $R_{eye}$. Each column vector is calculated as
\[
\begin{aligned}
\hat{z} &= -\hat{c} \\
\hat{x} &= \hat{u} \times \hat{z} \\
\hat{y} &= \hat{z} \times \hat{x}.
\end{aligned}
\tag{3.38}
\]
The minus sign appears for calculating $\hat{z}$ because the eye is looking down the negative z axis. The $\hat{x}$ direction is calculated as the standard cross product of $\hat{u}$ and $\hat{z}$. For the third equation, we could use $\hat{y} = \hat{u}$; however, $\hat{z} \times \hat{x}$ will cleverly correct cases in which $\hat{u}$ generally points upward but is not perpendicular to $\hat{c}$. In such cases, $\hat{u} \times \hat{z}$ should also be normalized so that $\hat{x}$ has unit length. The unit vectors from (3.38) are substituted into (3.35) to obtain $R_{eye}$. Thus, we have all the required information to construct $T_{eye}$.
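The look-at construction can be sketched in a few lines of plain Python. The helper names (`sub`, `cross`, `normalize`, `look_at`, `transform`) are hypothetical, and the sketch makes one assumption beyond (3.38): it normalizes the cross product of the up direction and z so that the x axis has unit length even when the supplied up direction is not exactly perpendicular to the viewing direction. The returned matrix is the product in (3.36) multiplied out, so its last column holds the rotated, negated eye position.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    n = math.sqrt(sum(x * x for x in a))
    return tuple(x / n for x in a)


def look_at(e, p, u):
    """Return T_eye as a 4x4 tuple of rows for eye e, target p, up hint u."""
    c = normalize(sub(p, e))      # (3.37)
    z = tuple(-x for x in c)      # z = -c
    x = normalize(cross(u, z))    # x = u cross z, normalized (assumption)
    y = cross(z, x)               # y = z cross x
    rows = []
    for axis in (x, y, z):
        # Row of R_eye^T followed by the matching entry of R_eye^T (-e).
        rows.append(axis + (-sum(a * b for a, b in zip(axis, e)),))
    rows.append((0.0, 0.0, 0.0, 1.0))
    return tuple(rows)


def transform(T, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    h = tuple(point) + (1.0,)
    return tuple(sum(T[i][j] * h[j] for j in range(4)) for i in range(3))


# Eye at (1, 2, 3) looking at a point 3 units away along the world x axis.
T_eye = look_at((1.0, 2.0, 3.0), (4.0, 2.0, 3.0), (0.0, 1.0, 0.0))
target_in_eye_frame = transform(T_eye, (4.0, 2.0, 3.0))  # about (0, 0, -3)
```

In the eye's frame, the eye position itself maps to the origin and the target lands on the negative z axis at its original distance, which matches the convention that the eye looks down the negative z axis.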
Orthographic projection Let (x, y, z) denote the coordinates of any point, after $T_{eye}$ has been applied. What would happen if we took all points and directly projected them into the vertical xy plane by forcing each z coordinate to be 0? In other words, $(x, y, z) \mapsto (x, y, 0)$, which is called orthographic projection. If we imagine the xy plane as a virtual display of the models, then there would be several problems:

1. A jumble of objects would be superimposed, rather than hiding parts of a model that are in front of another.
2. The display would extend infinitely in all directions (except z). If the display is a small rectangle in the xy plane, then the model parts that are outside of its range can be eliminated.
3. Objects that are closer should appear larger than those further away. This happens in the real world. Recall from Section 1.3 (Figure 1.22(c)) paintings that correctly handle perspective.

The first two problems are important graphics operations that are deferred until Chapter 7. The third problem is addressed next.

Figure 3.16: Starting with any point (x, y, z), a line through the origin can be formed using a parameter $\lambda$. It is the set of all points of the form $(\lambda x, \lambda y, \lambda z)$ for any real value $\lambda$. For example, $\lambda = 1/2$ corresponds to the midpoint between (x, y, z) and (0, 0, 0) along the line.
For each point (x, y, z), consider the line through it and the origin, which is the set of all points of the form
\[
(\lambda x, \lambda y, \lambda z), \tag{3.39}
\]
in which $\lambda$ can be any real number. In other words, $\lambda$ is a parameter that reaches all points on the line that contains both (x, y, z) and (0, 0, 0). See Figure 3.16.

Figure 3.17: An illustration of perspective projection. The model vertices are projected onto a virtual screen by drawing lines through them and the origin (0, 0, 0). The "image" of the points on the virtual screen corresponds to the intersections of the line with the screen.

Now we can place a planar "movie screen" anywhere in the virtual world and see where all of the lines pierce it. To keep the math simple, we pick the z = −1 plane to place our virtual screen directly in front of the eye; see Figure 3.17. Using
the third component of (3.39), we have $\lambda z = -1$, implying that $\lambda = -1/z$. Using the first two components of (3.39), the coordinates for the points on the screen are calculated as $x' = -x/z$ and $y' = -y/z$. Note that since x and y are scaled by the same factor $-1/z$, their aspect ratio is preserved on the screen.
More generally, suppose the vertical screen is placed at some location d along the z axis. In this case, we obtain more general expressions for the location of a point on the screen:
\[
\begin{aligned}
x' &= dx/z \\
y' &= dy/z.
\end{aligned}
\tag{3.40}
\]
This was obtained by solving $d = \lambda z$ for $\lambda$ and substituting it into (3.39).

Figure 3.18: The viewing frustum.
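As a small worked illustration of (3.40), the sketch below projects two points that differ only in depth. The function name `project` and the default screen location d = −1 are choices made for this example.

```python
def project(point, d=-1.0):
    """Perspective-project (x, y, z) onto the plane z = d, following (3.40)."""
    x, y, z = point
    if z == 0:
        raise ValueError("point lies in the plane of the eye; no projection")
    return (d * x / z, d * y / z)


near = project((1.0, 2.0, -2.0))  # (0.5, 1.0)
far = project((1.0, 2.0, -4.0))   # (0.25, 0.5)
```

Doubling the distance from the eye halves both screen coordinates, so the farther point appears closer to the center of the screen while the ratio of x' to y' is unchanged.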
This is all we need to project the points onto a virtual screen, while respecting the scaling properties of objects at various distances. Getting this right in VR helps in the perception of depth and scale, which are covered in Section 6.1. In Section 3.5, we will adapt (3.40) using transformation matrices. Furthermore, only points that lie within a zone in front of the eye will be projected onto the virtual screen. Points that are too close, too far, or outside the normal field of view will not be rendered on the virtual screen; this is addressed in Section 3.5 and Chapter 7.

3.5 Chaining the Transformations

This section links all of the transformations of this chapter together while also slightly adjusting their form to match what is currently used in the VR and computer graphics industries. Some of the matrices appearing in this section may seem unnecessarily complicated. The reason is that the expressions are motivated by algorithm and hardware issues, rather than mathematical simplicity. In particular, there is a bias toward putting every transformation into a 4 by 4 homogeneous transform matrix, even in the case of perspective projection which is not even linear (recall (3.40)). In this way, an efficient matrix multiplication algorithm can be iterated over the chain of matrices to produce the result.

The chain generally appears as follows:
\[
T = T_{vp} T_{can} T_{eye} T_{rb}. \tag{3.41}
\]
When T is applied to a point (x, y, z, 1), the location of the point on the screen is produced. Remember that these matrix multiplications are not commutative, and the operations are applied from right to left. The first matrix Trb is the rigid body transform (3.21) applied to points on a movable model. For each rigid object in the model, Trb remains the same; however, different objects will generally be placed in various positions and orientations. For example, the wheel of a virtual car will move differently than the avatar’s head. After Trb is applied, Teye transforms the virtual world into the coordinate frame of the eye, according to (3.36). At a fixed instant in time, this and all remaining transformation matrices are the same for all points in the virtual world. Here we assume that the eye is positioned at the midpoint between the two virtual human eyes, leading to a cyclopean viewpoint. Later in this section, we will extend it to the case of left and right eyes so that stereo viewpoints can be constructed.

Canonical view transform The next transformation, Tcan, performs the perspective projection as described in Section 3.4; however, we must explain how it is unnaturally forced into a 4 by 4 matrix. We also want the result to be in a canonical form that appears to be unitless, which is again motivated by industrial needs. Therefore, Tcan is called the canonical view transform. Figure 3.18 shows a viewing frustum, which is based on the four corners of a rectangular virtual screen. At z = n and z = f lie a near plane and far plane, respectively. Note that z < 0 for these cases because the z axis points in the opposite direction. The virtual screen is contained in the near plane. The perspective projection should place all of the points inside of the frustum onto a virtual screen that is centered in the near plane. This implies d = n using (3.40).

We now want to reproduce (3.40) using a matrix. Consider the result of applying the following matrix multiplication:
\[
\begin{pmatrix} n & 0 & 0 & 0 \\ 0 & n & 0 & 0 \\ 0 & 0 & n & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
=
\begin{pmatrix} nx \\ ny \\ nz \\ z \end{pmatrix}. \tag{3.42}
\]
In the first two coordinates, we obtain the numerator of (3.40). The nonlinear part of (3.40) is the 1/z factor. To handle this, the fourth coordinate is used to represent z, rather than 1 as in the case of Trb. From this point onward, the resulting 4D vector is interpreted as a 3D vector that is scaled by dividing out its fourth component. For example, (v1, v2, v3, v4) is interpreted as
\[
(v_1/v_4,\; v_2/v_4,\; v_3/v_4). \tag{3.43}
\]
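To see that the matrix in (3.42), followed by the division in (3.43), reproduces (3.40) with d = n, here is a minimal sketch in plain Python. The helper names `perspective_matrix`, `mat_vec`, and `dehomogenize` are hypothetical and chosen only for this illustration.

```python
def perspective_matrix(n):
    """The 4 by 4 matrix on the left side of (3.42) for a near plane at z = n."""
    return [
        [n, 0, 0, 0],
        [0, n, 0, 0],
        [0, 0, n, 0],
        [0, 0, 1, 0],
    ]


def mat_vec(m, v):
    """Multiply a 4 by 4 matrix (list of rows) by a 4D column vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]


def dehomogenize(v):
    """Interpret (v1, v2, v3, v4) as the 3D point in (3.43)."""
    v1, v2, v3, v4 = v
    return (v1 / v4, v2 / v4, v3 / v4)


n = -1.0                                   # near plane, so d = n in (3.40)
point = [2.0, 1.0, -4.0, 1.0]              # (x, y, z, 1)
homogeneous = mat_vec(perspective_matrix(n), point)   # (nx, ny, nz, z)
projected = dehomogenize(homogeneous)      # (nx/z, ny/z, n)
```

For this point the result is (0.5, 0.25, −1.0): the first two coordinates match (3.40), and the third is always n, which means that every projected point lands in the near plane.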
Figure 3.19: The rectangular region formed by the corners of the viewing frustum, after they are transformed by Tp. The coordinates of the selected opposite corners provide the six parameters, ℓ, r, b, t, n, and f, which are used in Tst.

If the frustum is perfectly centered in the xy plane, then the first two components of the last column become 0. Finally, we define the canonical view transform Tcan from (3.41) as
\[
T_{can} = T_{st} T_p. \tag{3.48}
\]
By symmetry, the right eye is similarly handled by replacing Tleft in (3.51) with
\[
T_{right} =
\begin{pmatrix} 1 & 0 & 0 & -\frac{t}{2} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}. \tag{3.52}
\]
Further Reading
Most of the matrix transforms appear in standard computer graphics texts. The pre-
sentation in this chapter closely follows [202]. For more details on quaternions and their
associated algebraic properties, see [157]. Robotics texts usually cover 3D transforma-
tions for both rigid bodies and chains of bodies, and also consider kinematic singularities;
see [163, 303].
Chapter 4
4.1. BASIC BEHAVIOR OF LIGHT 95 96 S. M. LaValle: Virtual Reality
Figure 4.4: Two extreme modes of reflection are shown. Specular reflection means
that all rays reflect at the same angle at which they approached. Diffuse reflection
means that the rays scatter in a way that could be independent of their approach
angle. Specular reflection is common for a polished surface, such as a mirror,
whereas diffuse reflection corresponds to a rough surface.
Figure 4.2: If the point light source were “infinitely far” away, then parallel wavefronts would be obtained. Other names for this setting are: Collimated light, parallel rays, rays from infinity, rays to infinity, and zero vergence.

Figure 4.3: As light energy hits the boundary of a different medium, there are three possibilities: transmission, absorption, and reflection.

Interactions with materials As light strikes the surface of a material, one of three behaviors might occur, as shown in Figure 4.3. In the case of transmission, the energy travels through the material and exits the other side. For a transparent material, such as glass, the transmitted light rays are slowed down and bend according to Snell’s law, which will be covered in Section 4.2. For a translucent material that is not transparent, the rays scatter into various directions before exiting. In the case of absorption, energy is absorbed by the material as the light becomes trapped. The third case is reflection, in which the light is deflected from the surface. Along a perfectly smooth or polished surface, the rays reflect in the same way: The exit angle is equal to the entry angle. See Figure 4.4. This case is called specular reflection, in contrast to diffuse reflection, in which the reflected rays scatter in arbitrary directions. Usually, all three cases of transmission, absorption, and reflection occur simultaneously. The amount of energy divided between the cases depends on many factors, such as the angle of approach, the wavelength, and differences between the two adjacent materials or media.

Coherent versus jumbled light The first complication is that light sources usually do not emit coherent light, a term that means the wavefronts are perfectly aligned in time and space. A laser is an exceptional case that indeed produces coherent light. It emits parallel waves of a constant wavelength that are also synchronized in time so that their peaks align as they propagate. Common light sources, such as light bulbs and the sun, instead emit a jumble of waves that have various wavelengths and do not have their peaks aligned.
Figure 4.6: The spectral power distribution for some common light sources. (Figure from [292]).

Wavelengths and colors To make sense out of the jumble of waves, we will describe how they are distributed in terms of wavelengths. Figure 4.5 shows the range of wavelengths that are visible to humans. Each wavelength corresponds to a spectral color, which is what we would perceive with a coherent light source fixed at that wavelength alone. Wavelengths between 700 and 1000nm are called infrared, which are not visible to us, but some cameras can sense them (see Section 9.3). Wavelengths between 100 and 400nm are called ultraviolet; they are not part of our visible spectrum, but some birds, insects, and fish can perceive ultraviolet wavelengths over 300nm. Thus, our notion of visible light is already tied to human perception.

Spectral power Figure 4.6 shows how the wavelengths are distributed for common light sources. An ideal light source would have all visible wavelengths represented with equal energy, leading to idealized white light. The opposite is total darkness, which is black. We usually do not allow a light source to propagate light directly onto our retinas (don’t stare at the sun!). Instead, we observe light that is reflected from objects all around us, causing us to perceive their color. Each surface has its own distribution of wavelengths that it reflects. The fraction of light energy that is reflected back depends on the wavelength, leading to the plots shown in Figure 4.7. For us to perceive an object surface as red, the red wavelengths must be included in the light source and the surface must strongly reflect red wavelengths. Other wavelengths must also be suppressed. For example, the light source could be white (containing all wavelengths) and the object could strongly reflect all wavelengths, causing the surface to appear white, not red. Section 6.3 will provide more details on color perception.

Figure 4.7: The spectral reflection function of some common familiar materials. (Figure from [292]).

Frequency Often times, it is useful to talk about frequency instead of wavelength. The frequency is the number of times per second that wave peaks pass through a fixed location. Using both the wavelength λ and the speed s, the frequency f is calculated as:
\[
f = \frac{s}{\lambda}. \tag{4.1}
\]
The speed of light in a vacuum is a universal constant c with value approximately equal to 3 × 10^8 m/s. In this case, s = c in (4.1). Light propagates roughly 0.03 percent faster in a vacuum than in air, causing the difference to be neglected in most engineering calculations. Visible light in air has a frequency range of roughly 400 to 800 terahertz, which is obtained by applying (4.1). As light propagates through denser media, such as water or lenses, s is significantly smaller; that difference is the basis of optical systems, which are covered next.
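The frequency range quoted above can be checked by applying (4.1) at the two ends of the visible band. In this minimal sketch, 400nm and 700nm are taken as the ends of the band because those are the ultraviolet and infrared boundaries given earlier; the function names are hypothetical, and the vacuum speed is used for air because the difference is negligible.

```python
SPEED_OF_LIGHT = 3.0e8  # m/s, the approximate vacuum value quoted in the text


def frequency_hz(wavelength_m, speed=SPEED_OF_LIGHT):
    """Apply (4.1): f = s / lambda, with the wavelength given in meters."""
    return speed / wavelength_m


def terahertz(hz):
    """Convert hertz to terahertz."""
    return hz / 1.0e12


violet_thz = terahertz(frequency_hz(400e-9))  # about 750 THz
red_thz = terahertz(frequency_hz(700e-9))     # about 429 THz
```

Both values fall inside the range of roughly 400 to 800 terahertz stated above.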
4.2. LENSES 99 100 S. M. LaValle: Virtual Reality
Figure 4.8: (a) The earliest known artificially constructed lens, which was made between 750 and 710 BC in ancient Assyrian Nimrud. It is not known whether this artifact was purely ornamental or used to produce focused images. Picture from the British Museum. (b) A painting by Conrad von Soest from 1403, which shows the use of reading glasses for an elderly male.

4.2 Lenses

Lenses have been made for thousands of years, with the oldest known artifact shown in Figure 4.8(a). It was constructed before 700 BC in Assyrian Nimrud. Whether constructed from transparent materials or from polished surfaces that act as mirrors, lenses bend rays of light so that a focused image is formed. Over the centuries, their uses have given rise to several well-known devices, such as eyeglasses (Figure 4.8(b)), telescopes, magnifying glasses, binoculars, cameras, and microscopes. Optical engineering is therefore filled with design patterns that indicate how to optimize the designs of these well-understood devices. VR headsets are unlike classical optical devices, leading to many new challenges that are outside of standard patterns that have existed for centuries. Thus, the lens design patterns for VR are still being written. The first step toward addressing the current challenges is to understand how simple lenses work.

Snell’s Law Lenses work because of Snell’s law, which expresses how much rays of light bend when entering or exiting a transparent material. Recall that the speed of light in a medium is less than the speed c in a vacuum. For a given material, let its refractive index be defined as
\[
n = \frac{c}{s}, \tag{4.2}
\]
in which s is the speed of light in the medium. For example, n = 2 means that light takes twice as long to traverse the medium as it would to traverse the same distance in a vacuum. For some common examples, n = 1.000293 for air, n = 1.33 for water, and n = 1.523 for crown glass.

Figure 4.9: Propagating wavefronts from a medium with low refractive index (such as air) to one with a higher index (such as glass). (a) The effect of slower propagation on the wavefronts is shown as they enter the lower medium. (b) This shows the resulting bending of a light ray, which is always perpendicular to the wavefronts. Snell’s Law relates the refractive indices and angles as n1 sin θ1 = n2 sin θ2.

Figure 4.9 shows what happens to incoming light waves and rays. Suppose in this example that the light is traveling from air into glass, so that n1 < n2. Let θ1 represent the incoming angle with respect to the surface normal, and let θ2 represent the resulting angle as it passes through the material. Snell’s law relates the four quantities as
\[
n_1 \sin\theta_1 = n_2 \sin\theta_2. \tag{4.3}
\]
Typically, n1/n2 and θ1 are given, so that (4.3) is solved for θ2 to obtain
\[
\theta_2 = \sin^{-1}\left(\frac{n_1 \sin\theta_1}{n_2}\right). \tag{4.4}
\]
If n1 < n2, then θ2 is closer to perpendicular than θ1. If n1 > n2, then θ2 is further from perpendicular. The case of n1 > n2 is also interesting in that light may not penetrate the surface if the incoming angle θ1 is too large. The inverse sine is defined only for arguments between −1 and 1, and the argument here is nonnegative, which implies that (4.4) provides a solution for θ2 only if
\[
(n_1/n_2) \sin\theta_1 \le 1. \tag{4.5}
\]
If the condition above does not hold, then the light rays reflect from the surface. This situation occurs while under water and looking up at the surface. Rather
than being able to see the world above, a swimmer might instead see a reflection, depending on the viewing angle.

Figure 4.10: The upper part shows how a simple prism bends ascending rays into descending rays, provided that the incoming ray slope is not too high. This was achieved by applying Snell’s law at the incoming and outgoing boundaries. Placing the prism upside down causes descending rays to become ascending. Putting both of these together, we will see that a lens is like a stack of prisms that force diverging rays to converge through the power of refraction.

Prisms Imagine shining a laser beam through a prism, as shown in Figure 4.10. Snell’s Law can be applied to calculate how the light ray bends after it enters and exits the prism. Note that for the upright prism, a ray pointing slightly upward becomes bent downward. Recall that a larger refractive index inside the prism would cause greater bending. By placing the prism upside down, rays pointing slightly downward are bent upward. Once the refractive index is fixed, the bending depends only on the angles at which the rays enter and exit the surface, rather than on the thickness of the prism. To construct a lens, we will exploit this principle and construct a kind of curved version of Figure 4.10.

Simple convex lens Figure 4.11 shows a simple convex lens, which should remind you of the prisms in Figure 4.10. Instead of making a diamond shape, the lens surface is spherically curved so that incoming, parallel, horizontal rays of light converge to a point on the other side of the lens. This special place of convergence is called the focal point. Its distance from the lens center is called the focal depth or focal length.

Figure 4.11: A simple convex lens causes parallel rays to converge at the focal point. The dashed line is the optical axis, which is perpendicular to the lens and pokes through its center.

Figure 4.12: If the rays are not perpendicular to the lens, then the focal point is shifted away from the optical axis.

The incoming rays in Figure 4.11 are special in two ways: 1) They are parallel, thereby corresponding to a source that is infinitely far away, and 2) they are perpendicular to the plane in which the lens is centered. If the rays are parallel but not perpendicular to the lens plane, then the focal point shifts accordingly, as shown in Figure 4.12. In this case, the focal point is not on the optical axis. There are two DOFs of incoming ray directions, leading to a focal plane that contains all of the focal points. Unfortunately, this planarity is just an approximation; Section 4.3 explains what really happens. In this idealized setting, a real image is formed in the image plane, as if it were a projection screen that is showing how the world looks in front of the lens (assuming everything in the world is very far away).

If the rays are not parallel, then it may still be possible to focus them into a real image, as shown in Figure 4.13. Suppose that a lens is given that has focal length f. If the light source is placed at distance s1 from the lens, then the rays from that will be in focus if and only if the following equation is satisfied (which
is derived from Snell’s law):
\[
\frac{1}{s_1} + \frac{1}{s_2} = \frac{1}{f}. \tag{4.6}
\]

Figure 4.13: In the real world, an object is not infinitely far away. When placed at distance s1 from the lens, a real image forms in a focal plane at distance s2 > f behind the lens, as calculated using (4.6).

Figure 4.11 corresponds to the idealized case in which s1 = ∞, for which solving (4.6) yields s2 = f. What if the object being viewed is not completely flat and lying in a plane perpendicular to the lens? In this case, there does not exist a single plane behind the lens that would bring the entire object into focus. We must tolerate the fact that most of it will be approximately in focus. Unfortunately, this is the situation almost always encountered in the real world, including the focus provided by our own eyes (see Section 4.4).

If the light source is placed too close to the lens, then the outgoing rays might be diverging so much that the lens cannot force them to converge. If s1 = f, then the outgoing rays would be parallel (s2 = ∞). If s1 < f, then (4.6) yields s2 < 0. In this case, a real image is not formed; however, something interesting happens: The phenomenon of magnification. A virtual image appears when looking into the lens, as shown in Figure 4.14. This is exactly what happens in the case of the View-Master and the VR headsets that were shown in Figure 2.11. The screen is placed so that it appears magnified. To the user looking through the lenses, it appears as if the screen is infinitely far away (and quite enormous!).

Figure 4.14: If the object is very close to the lens, then the lens cannot force its outgoing light rays to converge to a focal point. In this case, however, a virtual image appears and the lens works as a magnifying glass. This is the way lenses are commonly used for VR headsets.

Lensmaker’s equation For a given simple lens, the focal length f can be calculated using the Lensmaker’s Equation (also derived from Snell’s law):
\[
(n_2 - n_1)\left(\frac{1}{r_1} + \frac{1}{r_2}\right) = \frac{1}{f}. \tag{4.7}
\]
The parameters r1 and r2 represent the radius of curvature of each of the two lens surfaces (front and back). This version assumes a thin lens approximation, which means that the lens thickness is small relative to r1 and r2. Also, it is typically assumed that n1 = 1, which is approximately true for air.

Concave lenses For the sake of completeness, we include the case of a concave simple lens, shown in Figure 4.15. Parallel rays are forced to diverge, rather than converge; however, a meaningful notion of negative focal length exists by tracing the diverging rays backwards through the lens. The Lensmaker’s Equation (4.7) can be slightly adapted to calculate negative f in this case [104].

Figure 4.15: In the case of a concave lens, parallel rays are forced to diverge. The rays can be extended backward through the lens to arrive at a focal point on the left side. The usual sign convention is that f < 0 for concave lenses.

Diopters For optical systems used in VR, several lenses will be combined in succession. What is the effect of the combination? A convenient method to answer this question with simple arithmetic was invented by ophthalmologists. The idea is to define a diopter, which is D = 1/f, with f measured in meters. Thus, it is the reciprocal of the focal length. If a lens focuses parallel rays at a distance of 0.2m behind the lens, then D = 5. A larger diopter D means greater converging power. Likewise, a concave lens yields D < 0, with a lower number implying greater divergence. To combine
several nearby lenses in succession, we simply add their diopters to determine their equivalent power as a single, simple lens. Figure 4.16 shows a simple example.

Figure 4.16: To calculate the combined optical power of a chain of lenses, the algebra is simple: Add their diopters. This arrangement of four lenses is equivalent to a 6-diopter lens, which has a focal length of 0.1667m.

Figure 4.17: Chromatic aberration is caused by longer wavelengths traveling more quickly through the lens. The unfortunate result is a different focal plane for each wavelength or color.
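The formulas of Section 4.2 are easy to try numerically. The following minimal sketch covers Snell’s law with the condition (4.5), the focusing equation (4.6), the Lensmaker’s Equation in the form (4.7), and the addition of diopters. All function names are hypothetical, and the radii and the pair of lenses in the usage lines are made-up values for illustration; both radii are taken as positive for a convex lens, which is an assumption about the sign convention behind (4.7).

```python
import math


def refraction_angle(n1, n2, theta1):
    """Solve Snell's law (4.3) for theta2 (radians), as in (4.4).

    Returns None when condition (4.5) fails, in which case the ray reflects.
    """
    s = (n1 / n2) * math.sin(theta1)
    if s > 1.0:
        return None
    return math.asin(s)


def image_distance(f, s1):
    """Solve (4.6) for s2; the result is negative for a virtual image."""
    if math.isinf(s1):
        return f
    inverse_s2 = 1.0 / f - 1.0 / s1
    if inverse_s2 == 0.0:
        return math.inf  # s1 = f, so the outgoing rays are parallel
    return 1.0 / inverse_s2


def lensmaker_focal_length(n2, r1, r2, n1=1.0):
    """Focal length from (4.7) under the thin lens approximation."""
    return 1.0 / ((n2 - n1) * (1.0 / r1 + 1.0 / r2))


def combined_diopters(focal_lengths_m):
    """Add the diopters D = 1/f of nearby lenses placed in succession."""
    return sum(1.0 / f for f in focal_lengths_m)


# Air into water at 30 degrees bends toward the normal (about 22 degrees).
theta2 = refraction_angle(1.0, 1.33, math.radians(30.0))
# Water into air at 60 degrees violates (4.5), so the ray reflects.
reflected = refraction_angle(1.33, 1.0, math.radians(60.0)) is None

real_s2 = image_distance(0.05, 0.10)      # object beyond f: real image
virtual_s2 = image_distance(0.05, 0.04)   # object inside f: s2 < 0
crown_f = lensmaker_focal_length(1.523, 0.1, 0.1)  # crown glass, made-up radii
power = combined_diopters([0.2, -0.5])    # 5 D convex plus -2 D concave
```

With these inputs the object at 0.10m in front of a lens with f = 0.05m is imaged 0.10m behind it, the object at 0.04m produces a virtual image with s2 = −0.2m, and the pair of lenses acts like a single 3-diopter lens.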
4.3 Optical Aberrations
If lenses in the real world behaved exactly as described in Section 4.2, then VR
systems would be much simpler and more impressive than they are today. Unfor-
tunately, numerous imperfections, called aberrations, degrade the images formed
by lenses. Because these problems are perceptible in everyday uses, such as view-
ing content through VR headsets or images from cameras, they are important
to understand so that some compensation for them can be designed into the VR
system.
Chromatic aberration Recall from Section 4.1 that light energy is usually a
jumble of waves with a spectrum of wavelengths. You have probably seen that
the colors of the entire visible spectrum nicely separate when white light is shined
through a prism. This is a beautiful phenomenon, but for lenses it is a terrible
annoyance because it separates the focused image based on color. This problem
is called chromatic aberration.
The problem is that the speed of light through a medium depends on the
wavelength. We should therefore write a material’s refractive index as n(λ) to
indicate that it is a function of λ. Figure 4.17 shows the effect on a simple convex
lens. The focal depth becomes a function of wavelength. If we shine red, green, and
blue lasers directly into the lens along the same ray, then each color would cross
the optical axis in a different place, resulting in red, green, and blue focal points.
Recall the spectral power distribution and reflection functions from Section 4.1.
For common light sources and materials, the light passing through a lens results
in a whole continuum of focal points. Figure 4.18 shows an image with chromatic aberration artifacts. Chromatic aberration can be reduced at greater expense by combining convex and concave lenses of different materials so that the spreading rays are partly coerced into converging [298].

Figure 4.18: The upper image is properly focused whereas the lower image suffers from chromatic aberration. (Figure by Stan Zurek.)
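To make the dependence of focal depth on wavelength concrete, the sketch below plugs a wavelength-dependent index n(λ) into the Lensmaker’s Equation (4.7). The text does not give a formula for n(λ); this sketch assumes Cauchy’s two-term approximation n(λ) = A + B/λ², with coefficients of roughly the size often quoted for a borosilicate crown glass. The coefficients, the lens radii, and the function names are all assumptions chosen for illustration, not measured data.

```python
def cauchy_index(wavelength_um, a=1.5046, b=0.00420):
    """Assumed dispersion model: n = A + B / wavelength^2 (wavelength in um)."""
    return a + b / wavelength_um ** 2


def focal_length(n2, r1, r2, n1=1.0):
    """Thin-lens focal length from (4.7)."""
    return 1.0 / ((n2 - n1) * (1.0 / r1 + 1.0 / r2))


R1 = R2 = 0.1  # meters, hypothetical radii of curvature

focal_by_color = {
    name: focal_length(cauchy_index(wavelength_um), R1, R2)
    for name, wavelength_um in (("blue", 0.45), ("green", 0.55), ("red", 0.65))
}
```

Under these assumptions the red focal point lies about 2mm farther from the lens than the blue one (roughly 97mm versus 95mm). This is consistent with longer wavelengths traveling more quickly through the glass and therefore being bent less.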
Figure 4.19: Spherical aberration causes imperfect focus because rays far from
the optical axis are refracted more than those close to it.
Optical distortion Even if the image itself projects onto the image plane, it
might be distorted at the periphery. Assuming that the lens is radially symmetric,
the distortion can be described as a stretching or compression of the image that
becomes increasingly severe away from the optical axis. Figure 4.20 shows how
this affects the image for two opposite cases: barrel distortion and pincushion
distortion. For lenses that have a wide field-of-view, the distortion is stronger,
especially in the extreme case of a fish-eye lens. Figure 4.21 shows an image that
has strong barrel distortion. Correcting this distortion is crucial for current VR
headsets that have a wide field-of-view; otherwise, the virtual world would appear
to be warped.
Coma and flare Finally, coma is yet another aberration. In this case, the image
magnification varies dramatically as the rays are far from perpendicular to the lens.
The result is a “comet” pattern in the image plane. Another phenomenon is lens
flare, in which rays from very bright light scatter through the lens and often show
circular patterns. This is often seen in movies as the viewpoint passes by the sun
or stars, and is sometimes added artificially.
All of the aberrations of this section complicate the system or degrade the
experience in a VR headset; therefore, substantial engineering effort is spent on
mitigating these problems.
Figure 4.25: A ray of light travels through five media before hitting the retina. The indices of refraction are indicated. Considering Snell’s law, the greatest bending occurs due to the transition from air to the cornea. Note that once the ray enters the eye, it passes through only liquid or solid materials.

Figure 4.26: Normal eye operation, with relaxed lens.

covered in Sections 5.1 and 5.2. The interior of the eyeball is actually liquid, as opposed to air. The refractive indices of materials along the path from the outside air to the retina are shown in Figure 4.25.

The optical power of the eye The outer diameter of the eyeball is roughly 24mm, which implies that a lens of at least 40D would be required to cause convergence of parallel rays onto the retina center inside of the eye (recall diopters from Section 4.2). There are effectively two convex lenses: The cornea and the lens. The cornea is the outermost part of the eye where the light first enters and has the greatest optical power, approximately 40D. The eye lens is less powerful and provides an additional 20D. By adding diopters, the combined power of the cornea and lens is 60D, which means that parallel rays are focused onto the retina at a distance of roughly 17mm from the outer cornea. Figure 4.26 shows how this system acts on parallel rays for a human with normal vision. Images of far away objects are thereby focused onto the retina.

Figure 4.27: A closer object yields diverging rays, but with a relaxed lens, the image is blurry on the retina.

Figure 4.28: The process of accommodation: The eye muscles pull on the lens, causing it to increase the total optical power and focus the image on the retina.

Figure 4.29: Placing a convex lens in front of the eye is another way to increase the optical power so that nearby objects can be brought into focus by the eye. This is the principle of reading glasses.
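The arithmetic behind these numbers is short enough to sketch. The code below recomputes the power implied by a 24mm eyeball and the focal distance of the combined 60D system. It then uses (4.6) from Section 4.2 to estimate how much extra power is needed to focus a nearby object, whether that power comes from accommodation or from reading glasses. The function names and the 0.25m reading distance are assumptions made for this illustration.

```python
def power_diopters(focal_length_m):
    """Diopters are the reciprocal of the focal length in meters."""
    return 1.0 / focal_length_m


def focal_length_m(power):
    """Focal length in meters for a given optical power in diopters."""
    return 1.0 / power


def required_power(object_distance_m, image_distance_m):
    """Rearranged (4.6): 1/f = 1/s1 + 1/s2, expressed in diopters."""
    return 1.0 / object_distance_m + 1.0 / image_distance_m


CORNEA_D = 40.0   # approximate power of the cornea, from the text
LENS_D = 20.0     # approximate power of the relaxed eye lens, from the text

eyeball_power = power_diopters(0.024)          # about 41.7 D for a 24mm eye
relaxed_eye = CORNEA_D + LENS_D                # 60 D
retina_distance = focal_length_m(relaxed_eye)  # about 0.0167 m, roughly 17mm

# Hypothetical example: an object 0.25m away needs about 4 D beyond the
# relaxed eye to form a sharp image at the same retina distance.
extra_power = required_power(0.25, retina_distance) - relaxed_eye
```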
4.4. THE HUMAN EYE 113 114 S. M. LaValle: Virtual Reality
Vision abnormalities The situations presented so far represent normal vision throughout a person’s lifetime. One problem could be that the optical system simply does not have enough optical power to converge parallel rays onto the retina. This condition is called hyperopia or farsightedness. Eyeglasses come to the rescue. The simple fix is to place a convex lens (positive diopter) in front of the eye, as in the case of reading glasses. In the opposite direction, some eyes have too much optical power. This case is called myopia or nearsightedness, and a concave lens (negative diopter) is placed in front of the eye to reduce the optical power appropriately. Recall that we have two eyes, not one. This allows the possibility for each eye to have a different problem, resulting in different lens diopters per eye. Other vision problems may exist beyond optical power. The most common is astigmatism, which was covered in Section 4.3. In human eyes this is caused by the cornea having an excessively elliptical shape, rather than being radially symmetric. Special, non-simple lenses are needed to correct this condition. You might also wonder whether the aberrations from Section 4.3, such as chromatic aberration, occur in the human eye. They do; however, they are corrected automatically by our brains because we have learned to interpret such flawed images our entire lives!

A simple VR headset Now suppose we are constructing a VR headset by placing a screen very close to the eyes. Young adults would already be unable to bring it into focus if it were closer than 10cm. We want to bring it close so that it fills the view of the user. Therefore, the optical power is increased by using a convex lens, functioning in the same way as reading glasses. See Figure 4.30. This is also the process of magnification, from Section 4.2. The lens is usually placed at the distance of its focal depth from the screen. Using (4.6), this implies that s1 = f, resulting in s2 = ∞. The screen appears as an enormous virtual image that is infinitely far away. Note, however, that a real image is nevertheless projected onto the retina. We do not perceive the world around us unless real images are formed on our retinas.

Figure 4.30: In VR headsets, the lens is placed so that the screen appears to be infinitely far away.

To account for people with vision problems, a focusing knob may appear on the headset, which varies the distance between the lens and the screen. This adjusts the optical power so that the rays between the lens and the cornea are no longer parallel. They can be made to converge, which helps people with hyperopia. Alternatively, they can be made to diverge, which helps people with myopia. Thus, they can focus sharply on the screen without placing their eyeglasses in front of the lens. However, if each eye requires a different diopter, then a focusing knob would be required for each eye. Furthermore, if they have astigmatism, then it cannot be corrected. Placing eyeglasses inside of the headset may be the only remaining solution, but it may be uncomfortable and could reduce the field of view.

Many details have been skipped or dramatically simplified in this section. One important detail for a VR headset is that each lens should be centered perfectly in front of the cornea. If the distance between the two lenses is permanently fixed, then this is impossible to achieve for everyone who uses the headset. The interpupillary distance, or IPD, is the distance between human eye centers. The average among humans is around 64mm, but it varies greatly by race, gender, and age (in the case of children). To be able to center the lenses for everyone, the distance between lens centers should be adjustable from around 55 to 75mm. This is a common range for binoculars. Unfortunately, the situation is not even this simple because our eyes also rotate within their sockets, which changes the position and orientation of the cornea with respect to the lens. This amplifies optical aberration problems that were covered in Section 4.3. Eye movements will be covered in Section 5.3. Another important detail is the fidelity of our vision: What pixel density is needed
for the screen that is placed in front of our eyes so that we do not notice the pixels? A similar question is how many dots-per-inch (DPI) are needed on a printed piece of paper so that we do not see the dots, even when viewed under a magnifying glass? We return to this question in Section 5.1.

4.5 Cameras

Now that we have covered the human eye, it seems natural to describe an engineered eye, otherwise known as a camera. People have built and used cameras for hundreds of years, starting with a camera obscura that allows light to pass through a pinhole and onto a surface that contains the real image. Figure 4.31 shows an example that you might have constructed to view a solar eclipse. (Recall the perspective transformation math from Section 3.4.) Eighteenth-century artists incorporated a mirror and tracing paper to un-invert the image and allow it to be perfectly copied. Across the 19th century, various chemically based technologies were developed to etch the image automatically from the photons hitting the imaging surface. Across the 20th century, film was in widespread use, until digital cameras avoided the etching process altogether by electronically capturing the image using a sensor. Two popular technologies have been a Charge-Coupled Device (CCD) array and a CMOS active-pixel image sensor, which is shown in Figure 4.32(a). Such digital technologies record the amount of light hitting each pixel location along the image, which directly produces a captured image. The cost of these devices has plummeted in recent years, allowing hobbyists to buy a camera module such as the one shown in Figure 4.32(b) for under $30 US.

Figure 4.31: A pinhole camera that is recommended for viewing a solar eclipse. (Figure from TimeAndDate.com.)

Figure 4.32: (a) A CMOS active-pixel image sensor. (b) A low-cost CMOS camera module (SEN-11745), ready for hobbyist projects.

Shutters Several practical issues arise when capturing digital images. The image is a 2D array of pixels, each of which has red (R), green (G), and blue (B) values that typically range from 0 to 255. Consider the total amount of light energy that hits the image plane. For a higher-resolution camera, there will generally be fewer photons per pixel because the pixels are smaller. Each sensing element (one per color per pixel) can be imagined as a bucket that collects photons, much like drops of rain. To control the number of photons, a shutter blocks all the light, opens for a fixed interval of time, and then closes again. For a long interval (low shutter speed), more light is collected; however, the drawbacks are that moving objects in the scene will become blurry and that the sensing elements could become saturated with too much light. Photographers must strike a balance when determining the shutter speed to account for the amount of light in the scene, the sensitivity of the sensing elements, and the motion of the camera and objects in the scene.

Also relating to shutters, CMOS sensors unfortunately work by sending out the image information sequentially, line-by-line. The sensor is therefore coupled with a rolling shutter, which allows light to enter for each line, just before the information is sent. This means that the capture is not synchronized over the entire image, which leads to odd artifacts, such as the one shown in Figure 4.33. Image processing algorithms that work with rolling shutters and motion typically transform the image to correct for this problem. CCD sensors grab and send the entire image at once, resulting in a global shutter. CCDs have historically been more expensive than CMOS sensors, which resulted in widespread appearance of rolling shutter cameras in smartphones; however, the cost of global shutter cameras
is rapidly decreasing.

Figure 4.33: The rotor blades of a flying helicopter are apparently bent backwards
due to the rolling shutter effect.

Figure 4.34: A spectrum of aperture settings, which control the amount of light
that enters the lens. The values shown are called the focal ratio or f-stop.

Aperture The optical system also impacts the amount of light that arrives at
the sensor. Using a pinhole, as shown in Figure 4.31, light would fall onto the
image sensor, but it would not be bright enough for most purposes (other than
viewing a solar eclipse). Therefore, a convex lens is used instead so that multiple
rays are converged to the same point in the image plane; recall Figure 4.11. This
generates more photons per sensing element. The main drawback is that the lens
sharply focuses objects at a single depth, while blurring others; recall (4.6). In the
pinhole case, all depths are essentially “in focus”, but there might not be enough
light. Photographers therefore want to tune the optical system to behave more
like a pinhole or more like a full lens, depending on the desired outcome. The
result is a controllable aperture (Figure 4.34), which appears behind the lens and
sets the size of the hole through which the light rays enter. A small radius mimics
a pinhole by blocking all but the center of the lens. A large radius allows light to
pass through the entire lens. Our eyes control the light levels in a similar manner
by contracting or dilating our pupils. Finally, note that the larger the aperture,
the more that the aberrations covered in Section 4.3 interfere with the imaging
process.

Further Reading
Most of the basic lens and optical system concepts are covered in introductory university
physics texts. For more advanced coverage, especially lens aberrations, see the classic
optical engineering text: [298]. A convenient guide that quickly covers the geometry
of optics is [104]. Thorough coverage of optical systems that utilize electronics, lasers,
and MEMS, is given in [156]. This provides a basis for understanding next-generation
visual display technologies. An excellent book that considers the human eye in
combination with engineered optical components is [296]. Cameras are covered from many
different perspectives, including computer vision [111, 318], camera engineering [126],
and photography [297]. Mathematical foundations of imaging are thoroughly covered in
[19].
Chapter 5
What you perceive about the world around you is “all in your head”. After reading
Chapter 4, especially Section 4.4, you should understand that the light around us
forms images on our retinas that capture colors, motions, and spatial relationships
in the physical world. For someone with normal vision, these captured images
may appear to have perfect clarity, speed, accuracy, and resolution, while being
distributed over a large field of view. However, we are being fooled. We will see
in this chapter that this apparent perfection of our vision is mostly an illusion
because neural structures are filling in plausible details to generate a coherent
picture in our heads that is consistent with our life experiences. When building
VR technology that co-opts these processes, it is important to understand how they
work. They were designed to do more with less, and fooling these processes with
VR produces many unexpected side effects because the display technology is not
a perfect replica of the surrounding world.
Section 5.1 continues where Section 4.4 left off by adding some anatomy of the
human eye to the optical system. Most of the section is on photoreceptors, which
are the “input pixels” that get paired with the “output pixels” of a digital display
for VR. Section 5.2 offers a taste of neuroscience by explaining what is known about
the visual information that hierarchically propagates from the photoreceptors up
to the visual cortex. Section 5.3 explains how our eyes move, which serves a good
purpose, but incessantly interferes with the images in our retinas. Section 5.4
concludes the chapter by applying the knowledge gained about visual physiology
to determine VR display requirements, such as the screen resolution.

Figure 5.1: Physiology of the human eye. This viewpoint shows how the right
eye would appear if sliced horizontally (the nose would be to the left). (From
Wikipedia user Rhcastilhos.)
5.1 From the Cornea to Photoreceptors
Parts of the eye Figure 5.1 shows the physiology of a human eye. The shape is
approximately spherical, with a diameter of around 24mm and only slight variation
among people. The cornea is a hard, transparent surface through which light enters
and which provides the greatest optical power (recall from Section 4.4). The rest of the
outer surface of the eye is protected by a hard, white layer called the sclera. Most
of the eye interior consists of vitreous humor, which is a transparent, gelatinous
mass that allows light rays to penetrate with little distortion or attenuation.

As light rays cross the cornea, they pass through a small chamber containing
aqueous humor, which is another transparent, gelatinous mass. After crossing
this, rays enter the lens by passing through the pupil. The size of the pupil is
controlled by a disc-shaped structure called the iris, which provides an aperture
that regulates the amount of light that is allowed to pass. The optical power of
the lens is altered by ciliary muscles. After passing through the lens, rays pass
through the vitreous humor and strike the retina, which lines more than 180° of
the inner eye boundary. Since Figure 5.1 shows a 2D cross section, the retina is
shaped like an arc; however, keep in mind that it is a 2D surface. Imagine it as a
curved counterpart to a visual display. To catch the light from the output pixels, it
is lined with photoreceptors, which behave like “input pixels”. The most important
part of the retina is the fovea; the highest visual acuity, which is a measure of the
sharpness or clarity of vision, is provided for rays that land on it. The optic disc
is a small hole in the retina through which neural pulses are transmitted outside
of the eye through the optic nerve. It is on the same side of the fovea as the nose.

Photoreceptors The retina contains two kinds of photoreceptors for vision: 1)
rods, which are triggered by very low levels of light, and 2) cones, which require
more light and are designed to distinguish between colors. See Figure 5.2. To
understand the scale, the width of the smallest cones is around 1000nm. This is
quite close to the wavelength of visible light, implying that photoreceptors need
not be much smaller. Each human retina contains about 120 million rods and
6 million cones that are densely packed along the retina. Figure 5.3 shows the
detection capabilities of each photoreceptor type. Rod sensitivity peaks at 498nm,
between blue and green in the spectrum. Three categories of cones exist, based
on whether they are designed to sense blue, green, or red light.

Light source          Luminance (cd/m²)   Photons per receptor (per second)
Paper in starlight    0.0003              0.01
Paper in moonlight    0.2                 1
Computer monitor      63                  100
Room light            316                 1000
Blue sky              2500                10,000
Paper in sunlight     40,000              100,000

Figure 5.4: Several familiar settings and the approximate number of photons per
second hitting a photoreceptor. (Figure adapted from [160, 204].)

Photoreceptors respond to light levels over a large dynamic range. Figure 5.4
shows several familiar examples. The luminance is measured in SI units of candelas
per square meter, which corresponds directly to the amount of light power per area.
The range spans seven orders of magnitude, from 1 photon hitting a photoreceptor
every 100 seconds up to 100,000 photons per receptor per second. At low light
levels, only rods are triggered. Our inability to distinguish colors at night is caused
by the inability of rods to distinguish colors. Our eyes may take up to 35 minutes
to fully adapt to low light, resulting in a monochromatic mode called scotopic
vision. By contrast, our cones become active in brighter light. Adaptation to this
trichromatic mode, called photopic vision, may take up to ten minutes (you have
undoubtedly noticed the adjustment period when someone unexpectedly turns on
lights while you are lying in bed at night).

Figure 5.5: Photoreceptor density as a function of angle. The right of the plot
is the nasal side (which corresponds to rays entering from the opposite, temporal
side). (Figure based on [240])

Photoreceptor density The density of photoreceptors across the retina varies
greatly, as plotted in Figure 5.5. The most interesting region is the fovea, which
has the greatest concentration of photoreceptors. The innermost part of the fovea
has a diameter of only 0.5mm or an angular range of ±0.85 degrees, and contains
almost entirely cones. This implies that the eye must be pointed straight at a
target to perceive a sharp, colored image. The entire fovea has diameter 1.5mm
(±2.6 degrees angular range), with the outer ring having a dominant concentration
of rods. Rays that enter the cornea from the sides land on parts of the retina with
lower rod density and very low cone density. This corresponds to the case of
peripheral vision. We are much better at detecting movement in our periphery,
but cannot distinguish colors effectively. Peripheral movement detection may have
helped our ancestors avoid being eaten by predators. Finally, the most intriguing
part of the plot is the blind spot, where there are no photoreceptors. This is due to
our retinas being inside-out and having no other way to route the neural signals
to the brain; see Section 5.2.

Figure 5.6: An experiment that reveals your blind spot. Close your right eye and
look directly at the “X”. Vary the distance of the paper (or screen) from your eye.
Over some range, the dot should appear to vanish. You can carry this experiment
one step further by writing an “X” and dot on a textured surface, such as graph
paper. In that case, the dot disappears and you might notice the surface texture
perfectly repeating in the place where the dot once existed. This is caused by your
brain filling in the expected texture over the blind spot!

The photoreceptor densities shown in Figure 5.5 leave us with a conundrum.
With 20/20 vision, we perceive the world as if our eyes are capturing a sharp,
colorful image over a huge angular range. This seems impossible, however, because
we can only sense sharp, colored images in a narrow range. Furthermore, the blind
spot should place a black hole in our image. Surprisingly, our perceptual processes
produce an illusion that a complete image is being captured. This is accomplished
by filling in the missing details using contextual information, which is described
in Section 5.2, and by frequent eye movements, the subject of Section 5.3. If you
are still not convinced that your brain is fooling you into seeing a complete image,
then try the blind spot experiment shown in Figure 5.6.

5.2 From Photoreceptors to the Visual Cortex

Photoreceptors are transducers that convert the light-energy stimulus into an
electrical signal called a neural impulse, thereby inserting information about the
outside world into our neural structures. Recall from Section 2.3 that signals are
propagated upward in a hierarchical manner, from photoreceptors to the visual
cortex (Figure 2.19). Think about the influence that each photoreceptor has on
the network of neurons. Figure 5.7 shows a simplified model. As the levels
increase, the number of influenced neurons grows rapidly. Figure 5.8 shows the same
diagram, but highlighted in a different way by showing how the number of
photoreceptors that influence a single neuron increases with level. Neurons at the
lowest levels are able to make simple comparisons of signals from neighboring
photoreceptors. As the levels increase, the neurons may respond to a larger patch
of the retinal image. This principle will become clear when seeing more neural
structures in this section. Eventually, when signals reach the highest levels (beyond
these figures), information from the memory of a lifetime of experiences is fused with
the information that propagated up from photoreceptors. As the brain performs
significant processing, a perceptual phenomenon results, such as recognizing a face
or judging the size of a tree. It takes the brain over 100ms to produce a result
Figure 5.7: Four levels in a simple hierarchy are shown. Each disk corresponds
to a neural cell or photoreceptor, and the arrows indicate the flow of information.
Photoreceptors generate information at Level 0. In this extremely simplified and
idealized view, each photoreceptor and neuron connects to exactly three others at
the next level. The red and gold part highlights the growing zone of influence that
a single photoreceptor can have as the levels increase.
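The growing zone of influence described in this caption can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the cells at each level are imagined to lie in a single row, and cell i feeds cells i-1, i, and i+1 at the next level, so that neighboring zones overlap. The function name zone_of_influence is hypothetical.

```python
def zone_of_influence(levels, fan_out=3):
    """Count the cells that one photoreceptor influences at each level.

    Assumption (not taken from the figure): cells at each level lie in a
    1D row, and cell i feeds the fan_out nearest cells at the next level.
    """
    half = fan_out // 2
    influenced = {0}                 # index of the single Level 0 photoreceptor
    counts = [len(influenced)]
    for _ in range(levels):
        # Every influenced cell passes its signal to its neighbors one level up.
        influenced = {i + k for i in influenced for k in range(-half, half + 1)}
        counts.append(len(influenced))
    return counts


# Four levels (0 through 3), as in the caption.
print(zone_of_influence(3))  # [1, 3, 5, 7]
```

Under this row assumption the zone widens by two cells per level. If the connections did not overlap at all (a pure tree), the counts would instead be 1, 3, 9, 27.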
Figure 5.9: Light passes through a few neural layers before hitting the rods and
cones. (Figure by the Institute for Dynamic Educational Advancement.)
Figure 5.11: The receptive field of an ON-center ganglion cell. (Figure by the
Institute for Dynamic Educational Advancement.)
Figure 5.10: Vertebrates (including humans) have inside-out retinas, which lead
to a blind spot and photoreceptors aimed away from the incoming light. The left
shows a vertebrate eye, and the right shows a cephalopod eye, for which nature
got it right: The photoreceptors face the light and there is no blind spot. (Figure
by Jerry Crimson Mann.)

to rods, with about 30 to 50 rods per bipolar. There are two types of bipolar
cells based on their function. An ON bipolar activates when the rate of photon
absorption in its connected photoreceptors increases. An OFF bipolar activates for
decreasing photon absorption. The bipolars connected to cones have both kinds;
however, rods connect only to ON bipolars. The bipolar connections
are considered to be vertical because they connect directly from photoreceptors
to the ganglion cells. This is in contrast to the remaining two cell types in the
inner nuclear layer. The horizontal cells are connected by inputs (dendrites) to
photoreceptors and bipolar cells within a radius of up to 1mm. Their output
(axon) is fed into photoreceptors, causing lateral inhibition, which means that the
activation of one photoreceptor tends to decrease the activation of its neighbors.
Finally, amacrine cells connect horizontally between bipolar cells, other amacrine
cells, and vertically to ganglion cells. There are dozens of types, and their function
is not well understood. Thus, scientists do not have a complete understanding of
human vision, even at the lowest layers. Nevertheless, the well understood parts
contribute greatly to our ability to design effective VR systems and predict other
human responses to visual stimuli.

At the ganglion cell layer, several kinds of cells process portions of the retinal
image. Each ganglion cell has a large receptive field, which corresponds to the
photoreceptors that contribute to its activation as shown in Figure 5.8. The three
most common and well understood types of ganglion cells are called midget,
parasol, and bistratified. They perform simple filtering operations over their receptive
fields based on spatial, temporal, and spectral (color) variations in the stimulus
across the photoreceptors. Figure 5.11 shows one example. In this case, a ganglion
cell is triggered when red is detected in the center but not green in the surrounding
area. This condition is an example of spatial opponency, for which neural
structures are designed to detect local image variations. Thus, consider ganglion cells as
tiny image processing units that can pick out local changes in time, space, and/or
color. They can detect and emphasize simple image features such as edges. Once
the ganglion axons leave the eye through the optic nerve, a significant amount of
image processing has already been performed to aid in visual perception. The raw
image based purely on photons hitting the photoreceptor never leaves the eye.

The optic nerve connects to a part of the thalamus called the lateral geniculate
nucleus (LGN); see Figure 5.12. The LGN mainly serves as a router that sends
signals from the senses to the brain, but also performs some processing. The
LGN sends image information to the primary visual cortex (V1), which is located
at the back of the brain. The visual cortex, highlighted in Figure 5.13, contains
several interconnected areas that each perform specialized functions. Figure 5.14
shows one well-studied operation performed by the visual cortex. Chapter 6 will
describe visual perception, which is the conscious result of processing in the visual
cortex, based on neural circuitry, stimulation of the retinas, information from
other senses, and expectations based on prior experiences. Characterizing how
all of these processes function and integrate together remains an active field of
research.
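
To make the idea of ganglion cells as tiny image processing units concrete, here is a minimal Python sketch of a center-surround comparison applied to a row of photoreceptor intensities. It is a deliberately simplified stand-in, not a model of real retinal circuitry: it uses a single brightness value instead of the red/green opponency of Figure 5.11, the receptive field is only one neighbor wide on each side, and the function name on_center_response is hypothetical.

```python
def on_center_response(signal, i, surround=1):
    """Center value minus the mean of its neighbors within `surround` samples."""
    lo = max(0, i - surround)
    hi = min(len(signal), i + surround + 1)
    neighbors = [signal[j] for j in range(lo, hi) if j != i]
    return signal[i] - sum(neighbors) / len(neighbors)


# A step edge in intensity across eight photoreceptors.
photoreceptors = [1, 1, 1, 1, 5, 5, 5, 5]
responses = [on_center_response(photoreceptors, i)
             for i in range(len(photoreceptors))]
print(responses)  # [0.0, 0.0, 0.0, -2.0, 2.0, 0.0, 0.0, 0.0]
```

The response is zero over uniform regions and nonzero only on either side of the edge, which is the sense in which such a filter emphasizes local image variations rather than reporting the raw intensities.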
Figure 5.13: The visual cortex is located in the back of the head (Figure by
Washington Irving).
Eye muscles The rotation of each eye is controlled by six muscles that are each
attached to the sclera (outer eyeball surface) by a tendon. Figures 5.17 and 5.18
show their names and arrangement. The tendons pull on the eye in opposite pairs.
For example, to perform a yaw (side-to-side) rotation, the tensions on the medial
5.3. EYE MOVEMENTS 131 132 S. M. LaValle: Virtual Reality
Figure 5.16: The fractal appears to be moving until you carefully fixate on a single
part to verify that it is not.
Figure 5.14: A popular example of visual cortex function is orientation tuning, in
which a single-unit recording is made of a single neuron in the cortex. As the bar
is rotated in front of the eye, the response of the neuron varies. It strongly favors
one particular orientation.
Figure 5.17: There are six muscles per eye, each of which is capable of pulling the
pupil toward its location.
Types of movements We now consider movements based on their purpose,
resulting in six categories: 1) saccades, 2) smooth pursuit, 3) vestibulo-ocular
reflex, 4) optokinetic reflex, 5) vergence, and 6) microsaccades. All of these motions
cause both eyes to rotate approximately the same way, except for vergence, which
causes the eyes to rotate in opposite directions. We will skip a seventh category
of motion, called rapid eye movements (REMs), because they only occur while we
are sleeping and therefore do not contribute to a VR experience. The remaining
six categories will now be discussed in detail.

Saccades The eye can move in a rapid motion called a saccade, which lasts less
than 45ms with rotations of about 900° per second. The purpose is to quickly
relocate the fovea so that important features in a scene are sensed with highest
visual acuity. Figure 5.15 showed an example in which a face is scanned by
fixating on various features in rapid succession. Each transition between features is
accomplished by a saccade. Interestingly, our brains use saccadic masking to hide
the intervals of time over which saccades occur from our memory. This results
in distorted time perception, as in the case when second hands click into position
on an analog clock. The result of saccades is that we obtain the illusion of high
acuity over a large angular range. Although saccades frequently occur while we
have little or no awareness of them, we have the ability to consciously control them
as we choose features for fixation.

Smooth pursuit In the case of smooth pursuit, the eye slowly rotates to track
a moving target feature. Examples are a car, a tennis ball, or a person walking
by. The rate of rotation is usually less than 30° per second, which is much slower
than for saccades. The main function of smooth pursuit is to reduce motion blur
on the retina; this is also known as image stabilization. The blur is due to the
slow response time of photoreceptors, as discussed in Section 5.1. If the target
is moving too fast, then saccades may be intermittently inserted into the pursuit
motions to catch up to it.

Optokinetic reflex The next category is called the optokinetic reflex, which
occurs when a fast object speeds along, as when watching a fast-moving
train while standing nearby on fixed ground. The eyes rapidly and involuntarily
choose features for tracking on the object, while alternating between smooth
pursuit and saccade motions.

Vergence Stereopsis refers to the case in which both eyes are fixated on the
same object, resulting in a single perceived image. Two kinds of vergence motions
occur to align the eyes with an object. See Figure 5.20. If the object is closer than
a previous fixation, then a convergence motion occurs. This means that the eyes
are rotating so that the pupils are becoming closer. If the object is farther, then
divergence motion occurs, which causes the pupils to move farther apart. The eye
orientations resulting from vergence motions provide important information about
the distance of objects.

Microsaccades The sixth category of movements is called microsaccades, which
are small, involuntary jerks of less than one degree that trace out an erratic path.
They are believed to augment many other processes, including control of fixations,
reduction of perceptual fading due to adaptation, improvement of visual acuity,
and resolving perceptual ambiguities [269]. Although these motions have been
known since the 18th century [53], their behavior is extremely complex and not
Figure 5.19: The vestibulo-ocular reflex (VOR). The eye muscles are wired to
angular accelerometers in the vestibular organ to counter head movement with the
opposite eye movement with less than 10ms of latency. The connection between the
eyes and the vestibular organ is provided by specialized vestibular and extraocular
motor nuclei, thereby bypassing higher brain functions.

Figure 5.21: The head and eyes rotate together to fixate on new or moving targets.

Eye and head movements together Although this section has focused on eye
movement, it is important to understand that most of the time the eyes and head
are moving together. Figure 5.21 shows the angular range for yaw rotations of the
head and eyes. Although eye yaw is symmetric by allowing 35° to the left or right,
pitching of the eyes is not. Human eyes can pitch 20° upward and 25° downward,
which suggests that it might be optimal to center a VR display slightly below the
pupils when the eyes are looking directly forward. In the case of VOR, eye rotation
is controlled to counteract head motion. In the case of smooth pursuit, the head
and eyes may move together to keep a moving target in the preferred viewing area.
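
The suggestion about centering the display follows from simple arithmetic on the rotation limits quoted above, which the following Python sketch spells out. The constant and function names are hypothetical, and taking the midpoint of the range as the preferred center is an assumption made only for illustration.

```python
# Eye rotation limits quoted in the text, in degrees.
EYE_YAW_LEFT = 35.0
EYE_YAW_RIGHT = 35.0
EYE_PITCH_UP = 20.0
EYE_PITCH_DOWN = 25.0


def range_midpoint(positive_limit, negative_limit):
    """Midpoint of a range that extends positive_limit one way and negative_limit the other."""
    return (positive_limit - negative_limit) / 2.0


yaw_center = range_midpoint(EYE_YAW_RIGHT, EYE_YAW_LEFT)      # 0.0: symmetric
pitch_center = range_midpoint(EYE_PITCH_UP, EYE_PITCH_DOWN)   # -2.5: below straight ahead

print(f"yaw range {EYE_YAW_LEFT + EYE_YAW_RIGHT} deg, center {yaw_center} deg")
print(f"pitch range {EYE_PITCH_UP + EYE_PITCH_DOWN} deg, center {pitch_center} deg")
```

The pitch midpoint lies 2.5° below the straight-ahead direction, which is the sense in which a display centered slightly below the pupils would sit in the middle of the eye's comfortable vertical range.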
Figure 5.22: In displays, the pixels break into subpixels, much in the same way
that photoreceptors break into red, blue, and green components. (a) An LCD
display. (Photo by Luis Flavio Loureiro dos Santos.) (b) An AMOLED PenTile
display from the Nexus One smartphone. (Photo by Matthew Rollings.)

Figure 5.23: (a) Due to pixels, we obtain a bad case of the jaggies (more formally
known as aliasing) instead of sharp, straight lines. (Figure from Wikipedia user
Jmf145.) (b) In the screen-door effect, a black grid is visible around the pixels.
How good does the VR visual display need to be? Three crucial factors
for the display are:

1. Spatial resolution: How many pixels per unit area are needed?

2. Intensity resolution and range: How many intensity values can be produced,
and what are the minimum and maximum intensity values?

3. Temporal resolution: How fast do displays need to change their pixels?

The spatial resolution factor will be addressed in the next paragraph. The second
factor could also be called color resolution and range because the intensity values of
each red, green, or blue subpixel produce points in the space of colors; see Section
6.3. Recall the range of intensities from Figure 5.4 that trigger photoreceptors.
Photoreceptors can span seven orders of magnitude of light intensity. However,
displays have only 256 intensity levels per color to cover this range. Entering
scotopic vision mode does not even seem possible using current display technology
because of the high intensity resolution needed at extremely low light levels.
Temporal resolution is extremely important, but is deferred until Section 6.2, in the
context of motion perception.

How much pixel density is enough? We now address the spatial resolution.
Insights into the required spatial resolution are obtained from the photoreceptor
densities. As shown in Figure 5.22, we see individual lights when a display is highly
magnified. As it is zoomed out, we may still perceive sharp diagonal lines as being
jagged, as shown in Figure 5.23(a); this phenomenon is known as aliasing. Another
artifact is the screen-door effect, shown in Figure 5.23(b); this is commonly noticed
in an image produced by a digital LCD projector. What does the display pixel
density need to be so that we do not perceive individual pixels? In 2010, Steve Jobs
of Apple Inc. claimed that 326 pixels per linear inch (PPI) is enough, achieving
what they called a retina display.¹ Is this reasonable, and how does it relate to
VR?

¹ This is equivalent to a density of 165 pixels per mm², but we will use linear inches because
it is the international standard for display comparisons.

Assume that the fovea is pointed directly at the display to provide the best
sensing possible. The first issue is that red, green, and blue cones are arranged in
a mosaic, as shown in Figure 5.24. The patterns are more erratic than the
engineered versions in Figure 5.22. Vision scientists and neurobiologists have studied
the effective or perceived input resolution through measures of visual acuity [139].
Subjects in a study are usually asked to indicate whether they can detect or
recognize a particular target. In the case of detection, for example, scientists might
like to know the smallest dot that can be perceived when printed onto a surface.
In terms of displays, a similar question is: How small do pixels need to be so
that a single white pixel against a black background is not detectable? In the
case of recognition, a familiar example is attempting to read an eye chart, which
displays arbitrary letters of various sizes. In terms of displays, this could
correspond to trying to read text under various sizes, resolutions, and fonts. Many
factors contribute to acuity tasks, such as brightness, contrast, eye movements,
time exposure, and the part of the retina that is stimulated.

One of the most widely used concepts is cycles per degree, which roughly
corresponds to the number of stripes (or sinusoidal peaks) that can be seen as separate
along a viewing arc; see Figure 5.25. The Snellen eye chart, which is widely used
by optometrists, is designed so that patients attempt to recognize printed letters
from 20 feet away (or 6 meters). A person with “normal” 20/20 (or 6/6 in
metric) vision is expected to barely make out the horizontal stripes in the letter “E”
shown in Figure 5.25. This assumes that the person is looking directly at the
letters, using the photoreceptors in the central fovea. The 20/20 line on the chart
is designed so that letter height corresponds to 30 cycles per degree when the eye
is 20 feet away.
5.4. IMPLICATIONS FOR VR 139 140 S. M. LaValle: Virtual Reality
The total height of the “E” is 1/6 of a degree. Note that each stripe is half of a
cycle. What happens if the subject stands only 10 feet away from the eye chart?
The letters should roughly appear to twice as large.
Using simple trigonometry,
s = d tan θ, (5.1)
we can determine what the size s of some feature should be for a viewing angle
θ at a distance d from the eye. For very small θ, tan θ ≈ θ (in radians). For the
example of the eye chart, s could correspond to the height of a letter. Doubling
the distance d and the size s should keep θ roughly fixed, which corresponds to
the size of the image on the retina.
We now return to the retina display concept. Suppose that a person with
20/20 vision is viewing a large screen that is 20 feet (6.096m) away. To generate
30 cycles per degree, it must have at least 60 pixels per degree. Using (5.1), the
size would be s = 20 ∗ tan 1◦ = 0.349ft, which is equivalent to 4.189in. Thus, only
60/4.189 = 14.32 PPI would be sufficient. Now suppose that a smartphone screen
is placed 12 inches from the user’s eye. In this case, s = 12 ∗ tan 1◦ = 0.209in. This
requires that the screen have at least 60/0.209 = 286.4 PPI, which was satisfied
by the 326 PPI originally claimed by Apple.
In the case of VR, the user is not looking directly at the screen as in the case
Figure 5.24: Red, green, and blue cone photoreceptors are distributed in a com-
of smartphones. By inserting a lens for magnification, the display can be brought
plicated mosaic in the center of the fovea. (Figure by Mark Fairchild.)
even closer to the eye. This is commonly done for VR headsets, as was shown in
Figure 4.30. Suppose that the lens is positioned at its focal distance away from the
screen, which for the sake of example is only 1.5in (this is comparable to current
VR headsets). In this case, s = 1 ∗ tan 1◦ = 0.0261in, and the display must have at
least 2291.6 PPI to achieve 60 cycles per degree! The highest-density smartphone
display available today is the Super AMOLED 1440x2560 5.1 inch screen on the
Samsung S6, which is used in the Gear VR system. It has only 577 PPI, which
means that the PPI needs to increase by roughly a factor of four to obtain retina
display resolution for VR headsets.
Figure 5.25: (a) A single letter on an eye chart. (b) The size s of the letter (or
other feature of interest), the distance d of the viewer, and the viewing angle θ are
related as s = d tan θ.
This is not the complete story because some people, particularly youths, have
better than 20/20 vision. The limits of visual acuity have been established to
be around 60 to 77 cycles per degree, based on photoreceptor density and neural
processes [38, 51]; however, this is based on shining a laser directly onto the retina,
which bypasses many optical aberration problems as the light passes through the
eye. A small number of people (perhaps one percent) have acuity up to 60 cycles
per degree. In this extreme case, the display density would need to be 4583 PPI.
Thus, many factors are involved in determining a sufficient resolution for VR. It
suffices to say that the resolutions that exist today in consumer VR headsets are
inadequate, and retina display resolution will not be achieved until the PPI is
several times higher.
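The same relation can be run in the other direction to ask what angular resolution
an existing display delivers. The sketch below uses the 577 PPI and 1.5in figures
from the headset example; it assumes, as that example does, that s = d tan θ applies
with the lens at its focal distance, and it ignores lens distortion and non-uniform
pixel density. The function names are hypothetical.

```python
import math

def pixels_per_degree(ppi, distance_in):
    """Pixels spanned by one degree of visual angle near the screen center."""
    return ppi * distance_in * math.tan(math.radians(1.0))

def cycles_per_degree(ppi, distance_in):
    """One cycle (a light-dark pair) needs at least two pixels."""
    return pixels_per_degree(ppi, distance_in) / 2.0

ppi, d = 577.0, 1.5  # values taken from the headset example in the text
cpd = cycles_per_degree(ppi, d)
print(f"{cpd:.2f} cycles per degree")
for target in (30.0, 60.0):
    # Required PPI scales linearly with the target acuity.
    print(f"target {target:.0f} cycles/deg needs {target / cpd:.2f}x the PPI")
```

Under these assumptions the display supports roughly 7.5 cycles per degree, which is
consistent with the factor of about four needed to reach 30 cycles per degree and
about eight to reach 60.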
How much field of view is enough? What if the screen is brought even closer
to the eye to fill more of the field of view? Based on the photoreceptor density plot
in Figure 5.5 and the limits of eye rotations shown in Figure 5.21, the maximum
field of view seems to be around 270°, which is larger than what could be provided
by a flat screen (less than 180°). Increasing the field of view by bringing the screen
closer would require even higher pixel density, but lens aberrations (Section 4.3) at
the periphery may limit the effective field of view. Furthermore, if the lens is too
thick and too close to the eye, then the eyelashes may scrape it; Fresnel lenses may
provide a thin alternative, but introduce artifacts. Thus, the quest for a VR retina
display may end with a balance between optical system quality and limitations of
the human eye. Curved screens may help alleviate some of the problems.
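The trade-off between field of view and pixel density can be sketched numerically.
The example below is a simplification, not a model of a real headset: it assumes a
single eye centered in front of a flat screen with no lens, so that a screen of width
w at distance d spans a field of view given by w = 2 d tan(FOV/2). The 5in width and
the function names are hypothetical.

```python
import math

def viewing_distance_for_fov(screen_width_in, fov_deg):
    """Distance at which a flat screen of the given width spans fov_deg,
    from w = 2 d tan(fov / 2). The distance shrinks toward zero as the
    field of view approaches 180 degrees."""
    return screen_width_in / (2.0 * math.tan(math.radians(fov_deg / 2.0)))

def required_ppi(distance_in, pixels_per_degree=60.0):
    """Density needed at the screen center, using s = d tan(1 degree)."""
    return pixels_per_degree / (distance_in * math.tan(math.radians(1.0)))

width = 5.0  # hypothetical screen width in inches
for fov in (90, 120, 150, 170):
    d = viewing_distance_for_fov(width, fov)
    print(f"FOV {fov:3d} deg: distance {d:.3f} in, center PPI {required_ppi(d):.0f}")
```

In this simplified geometry, widening the field of view forces the screen closer,
and the density needed at the center grows in proportion.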
Foveated rendering One of the frustrations with this analysis is that we have
not been able to exploit the fact that photoreceptor density decreases away from
the fovea. We had to keep the pixel density high everywhere because we have
no control over which part of the display the user will be looking at. If we could
track where the eye is looking and have a tiny, movable display that is always
positioned in front of the pupil, with zero delay, then far fewer pixels would be
needed. This would greatly decrease computational burdens on graphical rendering
systems (covered in Chapter 7). Instead of moving a tiny screen, the process can
be simulated by keeping the fixed display but focusing the graphical rendering only
in the spot where the eye is looking. This is called foveated rendering, which has
been shown to work [105], but is currently too costly and there is too much delay
and other discrepancies between the eye movements and the display updates. In
the near future, it may become an effective approach for the mass market.
VOR gain adaptation The VOR gain is a ratio that compares the eye rotation
rate (numerator) to counter the rotation and translation rate of the head
(denominator). Because head motion has six DOFs, it is appropriate to break the gain
into six components. In the case of head pitch and yaw, the VOR gain is close to
1.0. For example, if you yaw your head to the left at 10° per second, then the eye
yaws at 10° per second in the opposite direction. The VOR roll gain is very small
because the eyes have a tiny roll range. The VOR translational gain depends on
the distance to the features.
Recall from Section 2.3 that adaptation is a universal feature of our sensory
systems. VOR gain is no exception. For those who wear eyeglasses, the VOR gain
must adapt due to the optical transformations described in Section 4.2. Lenses
affect the field of view and perceived size and distance of objects. The VOR
comfortably adapts to this problem by changing the gain. Now suppose that you are
wearing a VR headset that may suffer from flaws such as an imperfect optical
system, tracking latency, and incorrectly rendered objects on the screen. In this case,
adaptation may occur as the brain attempts to adapt its perception of stationarity
to compensate for the flaws. Your visual system could then convince your
brain that the headset is functioning correctly, and your perception of
stationarity in the real world would become distorted until you readapt. For example,
after a flawed VR experience, you might yaw your head in the real world and have
the sensation that truly stationary objects are sliding back and forth!²
² This frequently happened to the author while developing and testing the Oculus Rift.
Display scanout Recall from Section 4.5 that cameras have either a rolling or
global shutter based on whether the sensing elements are scanned line-by-line or
in parallel. Displays work the same way, but whereas cameras are an input device,
displays are the output analog. Most displays today have a rolling scanout (called
raster scan), rather than global scanout. This implies that the pixels are updated
line by line, as shown in Figure 5.26. This procedure is an artifact of old TV sets
and monitors, which each had a cathode ray tube (CRT) with phosphor elements
on the screen. An electron beam was bent by electromagnets so that it would
repeatedly strike and refresh the glowing phosphors.
Figure 5.26: Most displays still work in the same way as old TV sets and CRT monitors:
by updating pixels line-by-line. For a display that has 60 FPS (frames per second),
this could take up to 16.67ms.
Due to the slow charge and response time of photoreceptors, we do not perceive
the scanout pattern during normal use. However, when our eyes, features in the
scene, or both are moving, then side effects of the rolling scanout may become
perceptible. Think about the operation of a line-by-line printer, as in the case of a
receipt printer on a cash register. If we pull on the tape while it is printing, then
the lines would become stretched apart. If it is unable to print a single line at
once, then the lines themselves would become slanted. If we could pull the tape to
the side while it is printing, then the entire page would become slanted. You can
also achieve this effect by repeatedly drawing a horizontal line with a pencil while
using the other hand to gently pull the paper in a particular direction. The paper
in this analogy is the retina and the pencil corresponds to light rays attempting
to charge photoreceptors. Figure 5.27 shows how a rectangle would distort under
cases of smooth pursuit and VOR. One possibility is to fix this by rendering a
distorted image that will be corrected by the distortion due to the line-by-line
scanout [212] (this was later suggested in [1]). Constructing these images requires
precise calculations of the scanout timings. Yet another problem with displays is
that the pixels could take so long to switch (up to 20ms) that sharp edges appear
to be blurred. We will continue discussing these problems in Section 6.2 in the
context of motion perception, and Section 7.4 in the context of rendering.
Figure 5.27: Artifacts due to display scanout: (a) A vertical rectangle in the
scene. (b) How it may distort during smooth pursuit while the rectangle moves to
the right in the virtual world. (c) How a stationary rectangle may distort when
rotating the head to the right while using the VOR to compensate. The cases of
(b) and (c) are swapped if the direction of motion is reversed in each case.
Retinal image slip Recall that eye movements contribute both to maintaining
a target in a fixed location on the retina (smooth pursuit, VOR) and also to
changing its location slightly to reduce perceptual fading (microsaccades). During
ordinary activities (not VR), the eyes move and the image of a feature may move
slightly on the retina due to motions and optical distortions. This is called retinal
image slip. Once a VR headset is used, the motions of image features on the
retina might not match what would happen in the real world. This is due to
many factors already mentioned, such as optical distortions, tracking latency, and
display scanout. Thus, the retinal image slip due to VR artifacts does not match
the retinal image slip encountered in the real world. The consequences of this
have barely been identified, much less characterized scientifically. They are likely
to contribute to fatigue, and possibly VR sickness. As an example of the problem,
there is evidence that microsaccades are triggered by the lack of retinal image slip
[71]. This implies that differences in retinal image slip due to VR usage could
interfere with microsaccade motions, which are already not fully understood.
face, then your eyes will try to increase the lens power while the eyes are strongly
converging. If a lens is placed at a distance of its focal length from a screen, then
with normal eyes it will always be in focus while the eye is relaxed (recall Figure
4.30). What if an object is rendered to the screen so that it appears to be only
10cm away? In this case, the eyes strongly converge, but they do not need to
change the optical power of the eye lens. The eyes may nevertheless try to
accommodate, which would have the effect of blurring the perceived image. The result
is called vergence-accommodation mismatch because the stimulus provided by VR
is inconsistent with the real world. Even if the eyes become accustomed to the
mismatch, the user may feel extra strain or fatigue after prolonged use [246, 289].
The eyes are essentially being trained to allow a new degree of freedom:
separating vergence from accommodation, rather than coupling them. New display
technologies may provide some relief from this problem, but they are currently
too costly and imprecise. For example, the mismatch can be greatly reduced by
using eye tracking to estimate the amount of vergence and then altering the power
of the optical system [4, 187].
Further Reading
Most of the concepts from Sections 5.1 to 5.1 appear in standard textbooks on sensation
and perception [97, 204, 350]. Chapter 7 of [204] contains substantially more neuroscience
than covered in this chapter. More details on photoreceptor structure appear in [51, 225,
337]. The interface between eyes and engineered optical systems is covered in [296], to
which digital optical systems are also related [156].
Sweeping coverage of eye movements is provided in [184]. For eye movements from
a neuroscience perspective, see [177]. VOR gain adaptation is studied in [58, 91, 284].
Theories of microsaccade function are discussed in [269]. Coordination between smooth
pursuit and saccades is explained in [73]. Coordination of head and eye movements
is studied in [162, 247]. See [17, 246, 289] regarding comfort issues with
vergence-accommodation mismatch.
Chapter 6
Visual Perception
This chapter continues where Chapter 5 left off by transitioning from the physiology
of human vision to perception. If we were computers, then this transition might
seem like going from low-level hardware to higher-level software and algorithms.
How do our brains interpret the world around us so effectively in spite of our
limited biological hardware? To understand how we may be fooled by visual
stimuli presented by a display, you must first understand how we perceive or
interpret the real world under normal circumstances. It is not always clear what
we will perceive. We have already seen several optical illusions. VR itself can be
considered as a grand optical illusion. Under what conditions will it succeed or
fail?
Section 6.1 covers perception of the distance of objects from our eyes, which is
also related to the perception of object scale. Section 6.2 explains how we perceive
motion. An important part of this is the illusion of motion that we perceive from
videos, which are merely a sequence of pictures. Section 6.3 covers the perception
of color, which may help explain why displays use only three colors (red, green, and
blue) to simulate the entire spectral power distribution of light (recall from Section
4.1). Finally, Section 6.4 presents a statistically based model of how information
is combined from multiple sources to produce a perceptual experience.
Figure 6.1: This painting uses a monocular depth cue called a texture gradient to
enhance depth perception: The bricks become smaller and thinner as the depth
increases. Other cues arise from perspective projection, including height in the
visual field and retinal image size. (“Paris Street, Rainy Day,” Gustave Caillebotte,
1877. Art Institute of Chicago.)
cue. In this section, we consider only depth cues, which contribute toward depth
perception. If a depth cue is derived from the photoreceptors or movements of a
single eye, then it is called a monocular depth cue. If both eyes are required, then
it is a stereo depth cue. There are many more monocular depth cues than stereo,
which explains why we are able to infer so much depth information from a single
photograph. Figure 6.1 shows an example. The illusions in Figure 6.2 show that
even simple line drawings are enough to provide strong cues. Interestingly, the
cues used by humans also work in computer vision algorithms to extract depth
information from images [318].
6.1.1 Monocular depth cues
Retinal image size Many cues result from the geometric distortions caused
by perspective projection; recall the “3D” appearance of Figure 1.22(c). For a
familiar object, such as a human, coin, or basketball, we often judge its distance
by how “large” it appears to be. Recalling the perspective projection math from
Section 3.4, the size of the image on the retina is proportional to 1/z, in which z
is the distance from the eye (or the common convergence point for all projection
lines). See Figure 6.3. The same thing happens when taking a picture with a
camera: A picture of a basketball would occupy a larger part of the image, covering
more pixels, as it becomes closer to the camera. This cue is called retinal image
size, and was studied in [96].
Figure 6.3: The retinal image size of a familiar object is a strong monocular depth
cue. The closer object projects onto a larger number of photoreceptors, which
cover a larger portion of the retina.
Two important factors exist. First, the viewer must be familiar with the object
to the point of comfortably knowing its true size. For familiar objects, such as
people or cars, our brains perform size constancy scaling by assuming that
the distance, rather than the size, of the person is changing if they come closer.
Size constancy falls under the general heading of subjective constancy, which appears
through many aspects of perception, including shape, size, and color. The second
factor is that the object must appear naturally so that it does not conflict with
other depth cues.
If there is significant uncertainty about the size of an object, then knowledge
of its distance should contribute to estimating its size. This falls under size
perception, which is closely coupled to depth perception. Cues for each influence the
other, in a way discussed in Section 6.4.
One controversial theory is that our perceived visual angle differs from the
actual visual angle. The visual angle is proportional to the retinal image size.
This theory is used to explain the illusion that the moon appears to be larger
when it is near the horizon. For another example, see Figure 6.4.
Figure 6.4: For the Ebbinghaus illusion, the inner disc appears larger when
surrounded by smaller discs. The inner disc is the same size in either case. This may
be evidence of discrepancy between the true visual angle (or retinal image size)
and the perceived visual angle.
Height in the visual field Figure 6.5(a) illustrates another important cue,
which is the height of the object in the visual field. The Ponzo illusion in Figure
6.2(a) exploits this cue. Suppose that we can see over a long distance without
obstructions. Due to perspective projection, the horizon is a line that divides
the view in half. The upper half is perceived as the sky, and the lower half is the
ground. The distance of objects from the horizon line corresponds directly to their
distance due to perspective projection: The closer to the horizon, the further the
perceived distance. Size constancy scaling, if available, combines with the height
in the visual field, as shown in Figure 6.5(b).
Accommodation Recall from Section 4.4 that the human eye lens can change
its optical power through the process of accommodation. For young adults, the
amount of change is around 10D (diopters), but it decreases to less than 1D for
adults over 50 years old. The ciliary muscles control the lens and their tension
level is reported to the brain through efference copies of the motor control signal.
This is the first depth cue that does not depend on signals generated by the
photoreceptors.
Figure 6.5: Height in visual field. (a) Trees closer to the horizon appear to be
further away, even though all yield the same retinal image size. (b) Incorrect
placement of people in the visual field illustrates size constancy scaling, which is
closely coupled with depth cues.
Motion parallax Up until now, the depth cues have not exploited motions. If
you have ever looked out of the side window of a fast-moving vehicle, you might
have noticed that the nearby objects race by much faster than further objects.
The relative difference in speeds is called parallax and is an important depth cue;
see Figure 6.6. Even two images, from varying viewpoints within a short amount
of time, provide strong depth information. Imagine trying to simulate a stereo
rig of cameras by snapping one photo and quickly moving the camera sideways
to snap another. If the rest of the world is stationary, then the result is roughly
equivalent to having two side-by-side cameras. Pigeons frequently bob their heads
back and forth to obtain stronger depth information than is provided by their
pair of eyes. Finally, closely related to motion parallax is optical flow, which is a
characterization of the rates at which features move across the retina. This will
be revisited in Sections 6.2 and 8.4.
Figure 6.6: Motion parallax: As the perspective changes laterally, closer objects
have larger image displacements than further objects. (Figure from Wikipedia.)
Other monocular cues Figure 6.7 shows several other monocular cues. As
shown in Figure 6.7(a), shadows that are cast by a light source encountering an
object provide an important cue. Figure 6.7(b) shows a simple drawing that
provides an ordinal depth cue called interposition by indicating which objects
are in front of others. Figure 6.7(c) illustrates the image blur cue, where levels
of depth are inferred from the varying sharpness of focus. Figure 6.7(d) shows
an atmospheric cue in which air humidity causes far away scenery to have lower
contrast, thereby appearing to be further away.
6.1.2 Stereo depth cues
As you may expect, focusing both eyes on the same object enhances depth
perception. Humans perceive a single focused image over a surface in space called the
horopter; see Figure 6.8. Recall the vergence motions from Section 5.3. Similar to
the accommodation cue case, motor control of the eye muscles for vergence
motions provides information to the brain about the amount of convergence, thereby
providing a direct estimate of distance. Each eye provides a different viewpoint,
which results in different images on the retina. This phenomenon is called
binocular disparity. Recall from (3.50) in Section 3.5 that the viewpoint is shifted to
the right or left to provide a lateral offset for each of the eyes. The transform
essentially shifts the virtual world to either side. The same shift would happen
for a stereo rig of side-by-side cameras in the real world. However, the binocular
disparity for humans is different because the eyes can rotate to converge, in
addition to having a lateral offset. Thus, when fixating on an object, the retinal
images between the left and right eyes may vary only slightly, but this nevertheless
provides a powerful cue used by the brain.
Furthermore, when converging on an object at one depth, we perceive double
images of objects at other depths (although we usually pay no attention to them).
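The statement that the amount of convergence provides a direct estimate of distance
can be illustrated with basic trigonometry. The sketch below assumes symmetric
fixation on a point straight ahead and an interpupillary distance of 64mm; both are
simplifying assumptions made here for illustration, and the function names are
hypothetical.

```python
import math

def vergence_angle_deg(ipd_m, distance_m):
    """Angle between the two lines of sight when both eyes fixate a point
    straight ahead at distance_m (symmetric fixation is an assumption)."""
    return math.degrees(2.0 * math.atan(ipd_m / (2.0 * distance_m)))

def distance_from_vergence(ipd_m, angle_deg):
    """Invert the relation above to recover the fixation distance."""
    return ipd_m / (2.0 * math.tan(math.radians(angle_deg) / 2.0))

ipd = 0.064  # 64 mm, an illustrative interpupillary distance
for d in (0.2, 0.5, 1.0, 2.0, 10.0):
    a = vergence_angle_deg(ipd, d)
    print(f"{d:5.1f} m -> {a:6.3f} deg -> {distance_from_vergence(ipd, a):.3f} m")
```

Under these assumptions the angle changes rapidly for nearby targets and very little
beyond a few meters, which suggests that vergence is most informative at close range.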
Figure 6.7: Several more monocular depth cues: (a) Shadows resolve ambiguous
depth in the ball and shadow illusion. (b) The interposition of objects provides an
ordinal depth cue. (c) Due to image blur, one gnome appears to be much closer
than the others. (d) This scene provides an atmospheric cue: Some scenery is
perceived to be further away because it has lower contrast.
Figure 6.8: The horopter is the locus of points over which the eyes can converge
and focus on a single depth. The T curve shows the theoretical horopter based
on simple geometry. The E curve shows the empirical horopter, which is much
larger and corresponds to the region over which a single focused image is perceived.
(Figure by Rainer Zenz.)
This double-image effect is called diplopia. You can perceive it by placing your
finger about 20cm in front of your face and converging on it. While fixating on your
finger, you should perceive double images of other objects around the periphery.
You can also stare into the distance while keeping your finger in the same place.
You should see a double image of your finger. If you additionally roll your head
back and forth, it should appear as if the left and right versions of your finger are
moving up and down with respect to each other. These correspond to dramatic
differences in the retinal image, but we are usually not aware of them because we
perceive both retinal images as a single image.
6.1.3 Implications for VR
Incorrect scale perception A virtual world may be filled with objects that
are not familiar to us in the real world. In many cases, they might resemble
familiar objects, but their precise scale might be difficult to determine. Consider
the Tuscany demo world from Oculus VR, shown in Figure 6.9. The virtual villa is
designed to be inhabited by humans, but it is difficult to judge the relative sizes
and distances of objects because there are not enough familiar objects. Further
complicating the problem is that the user’s height in the real world might not match
his height in the virtual world. Is the user too short, or is the world too big? A
common and confusing occurrence is that the user might be sitting down in the
real world, but standing in the virtual world. An additional complication occurs if
the interpupillary distance (recall from Section 4.4) is not matched with the real
world. For example, if the user’s pupils are 64mm apart in the real world but only
50mm apart in the virtual world, then the virtual world will seem much larger,
which dramatically affects depth perception. Likewise, if the pupils are very far
apart, the user could either feel enormous or the virtual world might seem small.
Imagine simulating a Godzilla experience, where the user is 200 meters tall and
the entire city appears to be a model. It is fine to experiment with such scale and
depth distortions in VR, but it is important to understand their implications for
the user’s perception.
Figure 6.9: In the Tuscany demo from Oculus VR, there are not enough familiar
objects to precisely resolve depth and size. Have you ever been to a villa like this?
Are the floor tiles a familiar size? Is the desk too low?
Mismatches In the real world, all of the depth cues work together in harmony.
We are sometimes fooled by optical illusions that are designed to intentionally
cause inconsistencies among cues. Sometimes a simple drawing is sufficient. Figure
6.10 shows an elaborate illusion that requires building a distorted room in the real
world. It is perfectly designed so that when viewed under perspective projection
from one location, it appears to be a rectangular box. Once our brains accept
this, we unexpectedly perceive the size of people changing as they walk across the
room! This is because all of the cues based on perspective appear to be functioning
correctly. Section 6.4 may help you to understand how multiple cues are resolved,
even in the case of inconsistencies.
Figure 6.10: The Ames room: (a) Due to incorrect depth cues, incorrect size
perception results. (b) The room is designed so that it only appears to be rectangular
after perspective projection is applied. One person is actually much further away
than the other. (Figure by Alex Valavanis.)
In a VR system, it is easy to cause mismatches and in many cases they are
unavoidable. Recall from Section 5.4 that vergence-accommodation mismatch
occurs in VR headsets. Another source of mismatch may occur from imperfect head
tracking. If there is significant latency, then the visual stimuli will not appear in
the correct place at the expected time. Furthermore, many tracking systems track
the head orientation only. This makes it impossible to use motion parallax as a
depth cue if the user moves from side to side without any rotation. To preserve
most depth cues based on motion, it is important to track head position, in
addition to orientation; see Section 9.3. Optical distortions may cause even more
mismatch.
Figure 6.11: In Google Cardboard and other VR headsets, hundreds of millions of
panoramic Street View images can be viewed. There is significant depth perception,
even when the same image is presented to both eyes, because of monoscopic
depth cues.
Monocular cues are powerful! A common misunderstanding among the general
public is that depth perception is enabled by stereo cues alone. We are bombarded
with marketing of “3D” movies and stereo displays. The most common
instance today is the use of circularly polarized 3D glasses in movie theaters so
that each eye receives a different image when looking at the screen. VR is no
exception to this common misunderstanding. CAVE systems provided 3D glasses
with an active shutter inside so that alternating left and right frames can be
presented to the eyes. Note that this cuts the frame rate in half. Now that we
have comfortable headsets, presenting separate visual stimuli to each eye is much
simpler. One drawback is that the rendering effort (the subject of Chapter 7) is
doubled, although this can be improved through some context-specific tricks.
As you have seen in this section, there are many more monocular depth cues
than stereo cues. Therefore, it is wrong to assume that the world is perceived
as “3D” only if there are stereo images. This insight is particularly valuable for
leveraging captured data from the real world. Recall from Section 1.1 that the
virtual world may be synthetic or captured. It is generally more costly to create
synthetic worlds, but it is then simple to generate stereo viewpoints (at a higher
rendering cost). On the other hand, capturing panoramic, monoscopic images
and movies is fast and inexpensive (examples were shown in Figure 1.8). There
are already smartphone apps that stitch pictures together to make a panoramic
photo, and direct capture of panoramic video is likely to be a standard feature on
smartphones within a few years. By recognizing that this content is sufficiently
“3D” due to the wide field of view and monocular depth cues, it becomes a powerful
way to create VR experiences. There are already hundreds of millions of images in
Google Street View, shown in Figure 6.11, which can be easily viewed using Google
Cardboard or other headsets. They provide a highly immersive experience with
substantial depth perception, even though there is no stereo. There is even strong
evidence that stereo displays cause significant fatigue and discomfort, especially
for objects at a close depth [245, 246]. Therefore, one should think very carefully
about the use of stereo. In many cases, it might be more time, cost, and trouble
than it is worth to obtain the stereo cues when there may already be sufficient
monocular cues for the VR task or experience.
6.2 Perception of Motion
We rely on our vision to perceive motion for many crucial activities. One use
is to separate a moving figure from a stationary background. For example, a
camouflaged animal in the forest might only become noticeable when moving.
This is clearly useful whether humans are the hunter or the hunted. Motion also
helps people to assess the 3D structure of an object. Imagine assessing the value
of a piece of fruit in the market by rotating it around. Another use is to visually
guide actions, such as walking down the street or hammering a nail. VR systems
have the tall order of replicating these uses in a virtual world in spite of limited
technology. Just as important as the perception of motion is the perception of
non-motion, which we called perception of stationarity in Section 2.3. For example, if
we apply the VOR by turning our heads, then do the virtual world objects move
correctly on the display so that they appear to be stationary? Slight errors in time
or image position might inadvertently trigger the perception of motion.
6.2.1 Detection mechanisms
Reichardt detector Figure 6.12 shows a neural circuitry model, called a
Reichardt detector, which responds to directional motion in the human vision system.
Neurons in the ganglion layer and LGN detect simple features in different spots
in the retinal image. At higher levels, motion detection neurons exist that
respond when the feature moves from one spot on the retina to another nearby spot.
The motion detection neuron activates for a feature speed that depends on the
difference in path lengths from its input neurons. It is also sensitive to a
particular direction of motion based on the relative locations of the receptive fields of
the input neurons. Due to the simplicity of the motion detector, it can be easily
fooled. Figure 6.12 shows a feature moving from left to right. Suppose that a
train of features moves from right to left. Based on the speed of the features and
the spacing between them, the detector may inadvertently fire, causing motion to
be perceived in the opposite direction. This is the basis of the wagon-wheel effect,
for which a wheel with spokes or a propeller may appear to be rotating in the
opposite direction, depending on the speed. The process can be further disrupted
by causing eye vibrations from humming [276]. This simulates stroboscopic
conditions, which are discussed in Section 6.2.2. Another point is that the motion
detectors are subject to adaptation. Therefore, several illusions exist, such as the
waterfall illusion [18] and the spiral aftereffect, in which incorrect motions are
perceived due to aftereffects from sustained fixation [18, 205].
Figure 6.12: The neural circuitry directly supports motion detection. As the image
feature moves across the retina, nearby feature detection neurons (labeled a and b)
activate in succession. Their outputs connect to motion detection neurons (labeled
c). Due to different path lengths from a and b to c, the activation signal arrives
at different times. Thus, c activates when the feature was detected by a slightly
before being detected by b.
From local data to global conclusions Motion detectors are local in the
sense that a tiny portion of the visual field causes each to activate. In most
cases, data from detectors across large patches of the visual field are integrated
to indicate coherent motions of rigid bodies. (An exception would be staring at
pure analog TV static.) All pieces of a rigid body move through space according
to the equations from Section 3.2. This coordinated motion is anticipated by our
visual system to match common expectations. If too much of the moving body
is blocked, then the aperture problem results, which is shown in Figure 6.13. A
clean mathematical way to describe the global motions across the retina is by a
vector field, which assigns a velocity vector at every position. The global result is
called the optical flow, which provides powerful cues for both object motion and
self motion. The latter case results in vection, which is a leading cause of VR
sickness; see Sections 8.4 and 10.2 for details.
Figure 6.13: Due to the local nature of motion detectors, the aperture problem results.
The motion of the larger body is ambiguous when perceived through a small hole
because a wide range of possible body motions could produce the same effect inside
of the hole. An incorrect motion inference usually results.
Distinguishing object motion from observer motion Figure 6.14 shows
two cases that produce the same images across the retina over time. In Figure
6.14(a), the eye is fixed while the object moves by. In Figure 6.14(b), the situation
is reversed: The object is fixed, but the eye moves. The brain uses several cues
to differentiate between these cases. Saccadic suppression, which was mentioned
in Section 5.3, hides vision signals during movements; this may suppress motion
detectors in the second case. Another cue is provided by proprioception, which
is the body’s ability to estimate its own motions due to motor commands. This
includes the use of eye muscles in the second case. Finally, information is provided
by large-scale motion. If it appears that the entire scene is moving, then the brain
assumes the most likely interpretation, which is that the user must be moving.
This is why the haunted swing illusion, shown in Figure 2.20, is so effective.
6.2.2 Stroboscopic apparent motion
Nearly everyone on Earth has seen a motion picture, whether through a TV,
smartphone, or movie screen. The motions we see are an illusion because a sequence
of still pictures is being flashed onto the screen. This phenomenon is called
stroboscopic apparent motion; it was discovered and refined across the 19th century.
6.2. PERCEPTION OF MOTION 159 160 S. M. LaValle: Virtual Reality
Figure 6.14: Two motions that cause equivalent movement of the image on the
retina: (a) The eye is fixed and the object moves; (b) the eye moves while the
object is fixed. Both of these are hard to achieve in practice due to eye rotations
(smooth pursuit and VOR).

Figure 6.16: The phi phenomenon and beta movement are physiologically distinct
effects in which motion is perceived [347, 307]. In the sequence of dots, one is
turned off at any given time. A different dot is turned off in each frame, following a
clockwise pattern. At a very low speed (2 FPS), beta movement triggers a motion
perception of each on dot directly behind the off dot. The on dot appears to jump
to the position of the off dot. At a higher rate, such as 15 FPS, there instead
appears to be a moving hole; this corresponds to the phi phenomenon.

The zoetrope, shown in Figure 6.15, was developed around 1834. It consists of a
rotating drum with slits that allow each frame to be visible for an instant while
the drum rotates. In Section 1.3, Figure 1.23 showed the Horse in Motion film
from 1878.
Why does this illusion of motion work? An early theory, which has largely
been refuted in recent years, is called persistence of vision. The theory states that
images persist in the vision system during the intervals in between frames, thereby
causing them to be perceived as continuous. One piece of evidence against this
theory is that images persist in the visual cortex for around 100 ms, which implies
that 10 FPS (frames per second) is the slowest rate at which stroboscopic apparent
motion would work; however, it is also perceived down to 2 FPS [307]. Another
piece of evidence against the persistence of vision is the existence of stroboscopic
apparent motions that cannot be accounted for by it. The phi phenomenon and
beta movement are examples of motion perceived in a sequence of blinking lights,
rather than flashing frames (see Figure 6.16). The most likely reason that stro-
boscopic apparent motion works is that it triggers the neural motion detection
circuitry illustrated in Figure 6.12 [204, 211].
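The wagon-wheel effect mentioned earlier gives a concrete feel for how sampled motion can fool such detectors. The following is a minimal sketch, not a model of the neural circuitry: it assumes that each spoke in one frame is matched to the nearest spoke position in the next frame, and the function name and parameters are hypothetical.

```python
def apparent_step_deg(rev_per_sec, num_spokes, fps):
    """Apparent rotation per frame (degrees) of a spoked wheel under sampling.

    Assumption: the viewer matches each spoke to the nearest spoke position
    in the next frame, so the perceived step is the true step wrapped into
    the interval [-spacing/2, spacing/2), where spacing is the angle
    between adjacent spokes.
    """
    spacing = 360.0 / num_spokes            # angle between adjacent spokes
    true_step = 360.0 * rev_per_sec / fps   # actual rotation between frames
    return (true_step + spacing / 2.0) % spacing - spacing / 2.0


# An 8-spoke wheel filmed at 24 FPS, at three rotation speeds.
for speed in (1.0, 2.5, 3.0):
    step = apparent_step_deg(speed, num_spokes=8, fps=24)
    if step > 0:
        direction = "forward"
    elif step < 0:
        direction = "backward"
    else:
        direction = "stationary"
    print(f"{speed} rev/s: apparent step {step} deg per frame ({direction})")
```

Under this nearest-match assumption, a positive result corresponds to perceived rotation in the true direction, a negative result to the reversed rotation of the wagon-wheel effect, and zero to a wheel that appears frozen.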
Figure 6.15: The zoetrope was developed in the 1830s and provided stroboscopic
apparent motion as images became visible through slits in a rotating disc.

Frame rates How many frames per second are appropriate for a motion picture?
The answer depends on the intended use. Figure 6.17 shows a table of
significant frame rates from 2 to 5000. Stroboscopic apparent motion begins at 2
FPS. Imagine watching a security video at this rate. It is easy to distinguish
individual frames, but the motion of a person would also be perceived. Once 10 FPS
is reached, the motion is noticeably smoother and we start to lose the ability
to distinguish individual frames. Early silent films ranged from 16 to 24 FPS. The
frame rates often fluctuated, and the films were played at a faster speed than that
at which they were filmed. Once sound was added to film, incorrect speeds and
fluctuations in the speed were no longer tolerated because both sound and video
needed to be synchronized. This motivated playback at the fixed rate of 24 FPS,
which is still used today by the movie industry. Personal video cameras remained
at 16 or 18 FPS into the 1970s. The famous Zapruder film of the Kennedy
assassination in 1963 was taken at 18.3 FPS. Although 24 FPS may be enough to
perceive motions smoothly, a large part of cinematography is devoted to ensuring
that motions are not so fast that jumps are visible due to the low frame rate.

FPS    Occurrence
2      Stroboscopic apparent motion starts
10     Ability to distinguish individual frames is lost
16     Old home movies; early silent films
24     Hollywood classic standard
25     PAL television before interlacing
30     NTSC television before interlacing
48     Two-blade shutter; proposed new Hollywood standard
50     Interlaced PAL television
60     Interlaced NTSC television; perceived flicker in some displays
72     Three-blade shutter; minimum CRT refresh rate for comfort
90     Modern VR headsets; no more discomfort from flicker
1000   Ability to see zipper effect for fast, blinking LED
5000   Cannot perceive zipper effect

Figure 6.17: Various frame rates and comments on the corresponding stroboscopic
apparent motion. Units are in Frames Per Second (FPS).

Such low frame rates unfortunately lead to perceptible flicker as the images
rapidly flash on the screen with black in between. This motivated several
workarounds. In the case of movie projectors, two-blade and three-blade shutters
were invented so that they would show each frame two or three times, respectively.
This enabled movies to be shown at 48 FPS and 72 FPS, thereby reducing discomfort
from flickering. Analog television broadcasts in the 20th century were at 25 (PAL
standard) or 30 FPS (NTSC standard), depending on the country. To double the frame
rate and reduce perceived flicker, they used interlacing to draw half the image in
one frame time, and then half in the other. Every other horizontal line is drawn in
the first half, and the remaining lines are drawn in the second. This increased the
frame rates on television screens to 50 and 60 FPS. The game industry has used
60 FPS as a standard target for smooth game play.
As people started sitting close to giant CRT monitors in the early 1990s,
flicker became a problem again because sensitivity to flicker is stronger
at the periphery. Furthermore, even when flicker cannot be directly perceived, it
may still contribute to fatigue or headaches. Therefore, frame rates were increased
to even higher levels. A minimum acceptable ergonomic standard for large CRT
monitors was 72 FPS, with 85 to 90 FPS being widely considered as sufficiently
high to eliminate most flicker problems. The problem has been carefully studied
by psychologists under the heading of flicker fusion threshold; the precise rates at
which flicker is perceptible or causes fatigue depend on many factors in addition
to FPS, such as position on the retina, age, color, and light intensity. Thus, the
actual limit depends on the kind of display, its size, specifications, how it is used,
and who is using it. Modern LCD and LED displays, used as televisions, computer
screens, and smartphone screens, run at 60, 120, and even 240 FPS.
The story does not end there. If you connect an LED to a pulse generator (put
a resistor in series), then flicker can be perceived at much higher rates. Set the
pulse generator to produce a square wave at several hundred Hz. Go to a dark
room and hold the LED in your hand. If you wave it around so fast that your
eyes cannot track it, then the flicker becomes perceptible as a zipper pattern. Let
this be called the zipper effect. This happens because each time the LED pulses
on, it is imaged in a different place on the retina. Without image stabilization, it
appears as an array of lights. The faster the motion, the further apart the images
will appear. The higher the pulse rate (or FPS), the closer together the images
will appear. Therefore, to see the zipper effect at very high speeds, you need to
move the LED very quickly. It is possible to see the effect for a few thousand FPS.

Figure 6.18: A problem with perception of stationarity under stroboscopic apparent
motion: The image of a feature slips across the retina in a repeating pattern as
the VOR is performed.

6.2.3 Implications for VR

Unfortunately, VR systems require much higher display performance than usual.
We have already seen in Section 5.4 that much higher resolution is needed so that
pixels and aliasing artifacts are not visible. The next problem is that higher frame
rates are needed in comparison to ordinary television or movie standards of 24
FPS or even 60 FPS. To understand why, see Figure 6.18. The problem is easiest
Figure 6.19: An engineering solution to reduce retinal image slip: (a) Using low
persistence, the display is lit for a short enough time to trigger photoreceptors
(t1 − t0) and then blanked for the remaining time (t2 − t1). Typically, t1 − t0 is
around one to two milliseconds. (b) If the frame rate were extremely fast (at least
500 FPS), then the blank interval would not be needed.

Figure 6.20: In 2014, this dress photo became an Internet sensation as people were
unable to agree upon whether it was “blue and black” or “white and gold”, which
are strikingly different perceptions of color.
of what human color perceptual systems can handle. About 57% perceive it as blue
and black (correct), 30% perceive it as white and gold, 10% perceive blue
and brown, and 10% could switch between perceiving any of the color combinations
[159].
Dimensionality reduction Recall from Section 4.1 that light energy is a jumble
of wavelengths and magnitudes that form the spectral power distribution. Figure
4.6 provided an illustration. As we see objects, the light in the environment is
reflected off of surfaces in a wavelength-dependent way according to the spec-
tral distribution function (Figure 4.7). As the light passes through our eyes and
is focused onto the retina, each photoreceptor receives a jumble of light energy
that contains many wavelengths. Since the power distribution is a function of
wavelength, the set of all possible distributions is a function space, which is gen-
erally infinite-dimensional. Our limited hardware cannot possibly sense the entire
function. Instead, the rod and cone photoreceptors sample it with a bias toward
certain target wavelengths, as was shown in Figure 5.3 of Section 5.1. The result
is a well-studied principle in engineering called dimensionality reduction. Here,
the infinite-dimensional space of power distributions collapses down to a 3D color
space. It is no coincidence that human eyes have precisely three types of cones,
and that our RGB displays target the same colors as the photoreceptors.

Figure 6.21: One representation of the HSV color space, which involves three
parameters: hue, saturation, and value (brightness). (Figure by Wikipedia user
SharkD.)
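To illustrate the dimensionality reduction described above, the following minimal sketch collapses a sampled spectral power distribution, with arbitrarily many samples, into just three numbers. The Gaussian sensitivity curves, the shared 40 nm bandwidth, and the 420 nm short-wavelength peak are assumptions made only for illustration; the actual curves in Figure 5.3 are broader and asymmetric.

```python
import math

# Hypothetical cone models: peak wavelength in nm for each cone type.
# 533 nm and 564 nm are the "green" and "red" values used in this section;
# 420 nm for the short-wavelength cone is an assumed value.
CONE_PEAKS_NM = {"S": 420.0, "M": 533.0, "L": 564.0}
BANDWIDTH_NM = 40.0  # assumed standard deviation shared by all three curves


def sensitivity(cone, wavelength_nm):
    """Gaussian stand-in for a cone sensitivity curve (peak value 1.0)."""
    peak = CONE_PEAKS_NM[cone]
    return math.exp(-((wavelength_nm - peak) ** 2) / (2.0 * BANDWIDTH_NM ** 2))


def cone_responses(spectrum):
    """Reduce a spectrum, given as (wavelength_nm, power) samples, to (S, M, L)."""
    return tuple(
        sum(power * sensitivity(cone, wl) for wl, power in spectrum)
        for cone in ("S", "M", "L")
    )


# A flat spectrum with 31 samples and a single spike at 580 nm ("yellow").
flat = [(float(wl), 1.0) for wl in range(400, 701, 10)]
spike = [(580.0, 1.0)]

print(len(flat), "samples reduce to", cone_responses(flat))
print(len(spike), "sample reduces to", cone_responses(spike))
```

However many samples the input spectrum has, only three numbers survive. In this toy model the 580 nm spike excites the L cone most, the M cone less, and the S cone hardly at all.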
Yellow = Green + Red To help understand this reduction, consider the
perception of “yellow”. According to the visible light spectrum (Figure 4.5), yellow
has a wavelength of about 580nm. Suppose we had a pure light source that shines
light of exactly 580nm wavelength onto our retinas with no other wavelengths. The
spectral distribution function would have a spike at 580nm and be zero everywhere
else. If we had a cone with peak detection at 580nm and no sensitivity to other
wavelengths, then it would perfectly detect yellow. Instead, we perceive yellow
by activation of both green and red cones because their sensitivity regions (Figure
5.3) include 580nm. It should then be possible to generate the same photoreceptor
response by sending a jumble of light that contains precisely two wavelengths: 1)
Some “green” at 533nm, and 2) some “red” at 564nm. If the magnitudes of green
and red are tuned so that the green and red cones activate in the same way as
they did for pure yellow, then it becomes impossible for our visual system to
distinguish the green/red mixture from pure yellow. Both are perceived as “yellow”.
This matching of colors from red, green and blue components is called metamerism.
Such a blending is precisely what is done on an RGB display to produce yellow.
Suppose the intensity of each color ranges from 0 (dark) to 255 (bright). Red is
produced by RGB = (255, 0, 0), and green is RGB = (0, 255, 0). These each
activate one LED (or LCD) color, thereby producing a pure red or green. If both are
turned on, then yellow is perceived. Thus, yellow is RGB = (255, 255, 0).

Color spaces For convenience, a parameterized color space is often defined. One
of the most common in computer graphics is called HSV, which has the following
three components (Figure 6.21):
• The hue, which corresponds directly to the perceived color, such as “red” or
“green”.
• The saturation, which is the purity of the color. In other words, how much
energy is coming from wavelengths other than the wavelength of the hue?
• The value, which corresponds to the brightness.
There are many methods to scale the HSV coordinates, which distort the color
space in various ways. The RGB values could alternatively be used, but are
sometimes more difficult for people to interpret.
It would be ideal to have a representation in which the distance between two
points corresponds to the amount of perceptual difference. In other words, as
two points are further apart, our ability to distinguish them is increased. The
distance should correspond directly to the amount of distinguishability. Vision
scientists designed a representation to achieve this, resulting in the 1931 CIE color
standard shown in Figure 6.22. Thus, the CIE is considered to be undistorted
from a perceptual perspective. It is only two-dimensional because it disregards
the brightness component, which is independent of color perception according to
color matching experiments [204].

Mixing colors Suppose that we have three pure sources of light, as in that
produced by an LED, in red, blue, and green colors. We have already discussed
6.3. PERCEPTION OF COLOR 167 168 S. M. LaValle: Virtual Reality
how to produce yellow by blending red and green. In general, most perceptible
colors can be matched by a mixture of three. This is called trichromatic theory
(or Young-Helmholtz theory). A set of colors that achieves this is called primary
colors. Mixing all three evenly produces perceived white light, which on a dis-
play is achieved as RGB= (255, 255, 255). Black is the opposite: RGB= (0, 0, 0).
Such light mixtures follow a linearity property. Suppose primary colors are used
to perceptually match power distributions of two different light sources. If the
light sources are combined, then their intensities of the primary colors need only
to be added to obtain the perceptual match for the combination. Furthermore,
the overall intensity can be scaled by multiplying the red, green, and blue com-
ponents without affecting the perceived color. Only the perceived brightness may
be changed.
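The linearity property can be sketched in a few lines. This is only an illustration under stated assumptions: lights are represented as (red, green, blue) intensities on a linear scale from 0 to 1, gamma encoding is ignored, and the helper names are hypothetical.

```python
def add_lights(a, b):
    """Combine two lights by adding their primary intensities."""
    return tuple(x + y for x, y in zip(a, b))


def scale_light(a, k):
    """Scale the overall intensity of a light by the factor k."""
    return tuple(k * x for x in a)


def proportions(a):
    """Relative share of each primary; unchanged by overall scaling."""
    total = sum(a)
    return tuple(x / total for x in a)


def to_8bit(a):
    """Map linear intensities in [0, 1] to integers from 0 to 255."""
    return tuple(round(255 * min(1.0, x)) for x in a)


red = (1.0, 0.0, 0.0)
green = (0.0, 1.0, 0.0)
blue = (0.0, 0.0, 1.0)

yellow = add_lights(red, green)          # red + green is perceived as yellow
white = add_lights(yellow, blue)         # all three primaries evenly mixed
dim_yellow = scale_light(yellow, 0.5)    # same color, lower brightness

print(to_8bit(yellow), to_8bit(white), to_8bit(dim_yellow))
print(proportions(yellow) == proportions(dim_yellow))
```

Scaling changes the 8-bit values but not the proportions of the primaries, which matches the statement that only the perceived brightness is affected.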
The discussion so far has focused on additive mixtures. When mixing paints or
printing books, colors mix subtractively because the spectral reflectance function
is being altered. When starting with a white canvas or sheet of paper, virtually
all wavelengths are reflected. Painting a green line on the page prevents all
wavelengths other than green from being reflected at that spot. Removing all wave-
lengths results in black. Rather than using RGB components, printing presses are
based on CMYK, which correspond to cyan, magenta, yellow, and black. The first
three are pairwise mixes of the primary colors. A black component is included to
reduce the amount of ink wasted by using the other three colors to subtractively
produce black. Note that the targeted colors are observed only if the incoming
light contains the targeted wavelengths. The green line would appear green under
pure, matching green light, but might appear black under pure blue light.
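As a rough illustration of the relationship between additive RGB and subtractive CMYK, the sketch below uses a common naive conversion formula. It is an assumption for illustration only, not how printing presses compute ink amounts; real print workflows depend on measured ink and paper characteristics.

```python
def rgb_to_cmyk(r, g, b):
    """Naive conversion from 8-bit RGB to CMYK fractions in [0, 1].

    The black component k absorbs the part shared by all three inks,
    which is the ink-saving role described in the text.
    """
    rf, gf, bf = r / 255.0, g / 255.0, b / 255.0
    k = 1.0 - max(rf, gf, bf)
    if k == 1.0:
        # Pure black: use only the black ink.
        return (0.0, 0.0, 0.0, 1.0)
    c = (1.0 - rf - k) / (1.0 - k)
    m = (1.0 - gf - k) / (1.0 - k)
    y = (1.0 - bf - k) / (1.0 - k)
    return (c, m, y, k)


print(rgb_to_cmyk(0, 255, 0))      # green
print(rgb_to_cmyk(255, 255, 255))  # white paper
print(rgb_to_cmyk(0, 0, 0))        # black
```

In this simple model, green is produced by full cyan and yellow ink with no magenta, white requires no ink at all, and black uses only the black component.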
Constancy The dress in Figure 6.20 showed an extreme case that results in
color confusion across people due to the strange lighting conditions. Ordinarily,
human color perception is surprisingly robust to the source of color. A red shirt
appears to be red whether illuminated under indoor lights at night or in direct
sunlight. These correspond to vastly different cases in terms of the spectral power
distribution that reaches the retina. Our ability to perceive an object as having
the same color over a wide variety of lighting conditions is called color constancy.
Several perceptual mechanisms allow this to happen. One of them is chromatic
adaptation, which results in a shift in perceived colors due to prolonged exposure
to specific colors. Another factor in the perceived color is the expectation from
the colors of surrounding objects. Furthermore, memory about how objects are
usually colored in the environment biases our interpretation.
Figure 6.22: 1931 CIE color standard with RGB triangle. This representation is
correct in terms of distances between perceived colors. (Figure by Jeff Yurek.)

The constancy principle also appears without regard to particular colors. Our
perceptual system also maintains lightness constancy so that the overall brightness
levels appear to be unchanged, even after lighting conditions are dramatically
altered; see Figure 6.23(a). Under the ratio principle theory, only the ratio of
reflectances between objects in a scene is perceptually maintained, whereas the
overall amount of reflected intensity is not perceived. Further complicating
matters, our perception of object lightness and color are maintained as the scene
6.4. COMBINING SOURCES OF INFORMATION 169 170 S. M. LaValle: Virtual Reality
contains uneven illumination. A clear example is provided by shadows cast by
one object onto another. Our perceptual system accounts for the shadow and
adjusts our perception of the object shade or color. The checker shadow illusion
shown in Figure 6.23 is caused by this compensation due to shadows.

Figure 6.23: (a) The hot air balloon colors are perceived the same regardless of
which portions are in direct sunlight or in a shadow. (Figure by
Wikipedia user Shanta.) (b) The checker shadow illusion from Section 2.3 is
explained by the lightness constancy principle as the shadows prompt compensation
of the perceived lightness. (Figure by Adrian Pingstone.)

Display issues Displays generally use RGB lights to generate the palette of
colors and brightness. Recall Figure 5.22, which showed the subpixel mosaic of
individual component colors for some common displays. Usually, the intensity of
each R, G, and B value is set by selecting an integer from 0 to 255. This is a severe
limitation on the number of brightness levels, as stated in Section 5.4. One cannot
hope to densely cover all seven orders of magnitude of perceptible light intensity.
One way to enhance the amount of contrast over the entire range is to perform
gamma correction. In most displays, images are encoded with a gamma of about
0.45 and decoded with a gamma of 2.2.
Another issue is that the set of all available colors lies inside of the triangle
formed by R, G, and B vertices. This limitation is shown for the case of the sRGB
standard in Figure 6.22. Most of the CIE is covered, but many colors that humans
are capable of perceiving cannot be generated on the display.

Figure 6.24: Gamma correction is used to span more orders of magnitude in spite
of a limited number of bits. The transformation is $v' = c v^{\gamma}$, in which c is constant
(usually c = 1) and γ controls the nonlinearity of the correction or distortion.

6.4 Combining Sources of Information

Throughout this chapter, we have seen perceptual processes that combine
information from multiple sources. These could be cues from the same sense, as in
the numerous monocular cues used to judge depth. Perception may also combine
information from two or more senses. For example, people typically combine both
visual and auditory cues when speaking face to face. Information from both sources
makes it easier to understand someone, especially if there is significant background
noise. We have also seen that information is integrated over time, as in the case
of saccades being employed to fixate on several object features. Finally, our
memories and general expectations about the behavior of the surrounding world bias
our conclusions. Thus, information is integrated from prior expectations and the
reception of many cues, which may come from different senses at different times.
Statistical decision theory provides a useful and straightforward mathematical
model for making choices that incorporate prior biases and sources of relevant,
observed data. It has been applied in many fields, including economics, psychology,
signal processing, and computer science. One key component is Bayes’ rule, which
specifies how the prior beliefs should be updated in light of new observations, to
obtain posterior beliefs. More formally, the “beliefs” are referred to as probabilities.
If the probability takes into account previous information, it is
called a conditional probability. There is no room to properly introduce probability
theory here; only the basic ideas are given to provide some intuition without the
rigor. For further study, find an online course or classic textbook (for example,
[272]).
Let
\[ H = \{h_1, h_2, \ldots, h_n\} \tag{6.1} \]
be a set of hypotheses (or interpretations). Similarly, let
\[ C = \{c_1, c_2, \ldots, c_m\} \tag{6.2} \]
be a set of possible outputs of a cue detector. For example, the cue detector
might output the eye color of a face that is currently visible. In this case C is the
set of possible colors:
C = {brown, blue, green, hazel}. (6.3)
When modeling a face recognizer, H would correspond to the set of people familiar to
the person.
We want to calculate probability values for each of the hypotheses in H. Each
probability value must lie between 0 and 1, and the probability values over all
hypotheses in H must sum to one. Before any cues, we start with an
assignment of values called the prior distribution, which is written as $P(h)$. The
“P” denotes that it is a probability function or assignment; $P(h)$ means that an
assignment has been applied to every h in H. The assignment must be made so
that
\[ P(h_1) + P(h_2) + \cdots + P(h_n) = 1, \tag{6.4} \]
and $0 \le P(h_i) \le 1$ for each i from 1 to n.
The prior probabilities are generally distributed across the hypotheses in a dif-
fuse way; an example is shown in Figure 6.25(a). The likelihood of any hypothesis
being true before any cues is proportional to its frequency of occurring naturally,
based on evolution and the lifetime of experiences of the person. For example, if
you open your eyes at a random time in your life, what is the likelihood of seeing
a human being versus a wild boar?
Under normal circumstances (not VR!), we expect that the probability for
the correct interpretation will rise as cues arrive. The probability of the correct
hypothesis should pull upward toward 1, effectively stealing probability mass from
the other hypotheses, which pushes their values toward 0; see Figure 6.25(b). A
“strong” cue should lift the correct hypothesis upward more quickly than a “weak”
cue. If a single hypothesis has a probability value close to 1, then the distribution is
considered peaked, which implies high confidence; see Figure 6.25(c). In the other
direction, inconsistent or incorrect cues have the effect of diffusing the probability
across two or more hypotheses. Thus, the probability of the correct hypothesis
may be lowered as other hypotheses are considered plausible and receive higher
values. It may also be possible that two alternative hypotheses remain strong due
to ambiguity that cannot be resolved from the given cues; see Figure 6.25(d).

Figure 6.25: Example probability distributions: (a) A possible prior distribution.
(b) Preference for one hypothesis starts to emerge after a cue. (c) A peaked
distribution, which results from strong, consistent cues. (d) Ambiguity may result
in two (or more) hypotheses that are strongly favored over others; this is the basis
of multistable perception.

To take into account information from a cue, a conditional distribution is
defined, which is written as $P(h \mid c)$. This is spoken as “the probability of h
given c.” This corresponds to a probability assignment for all possible
combinations of hypotheses and cues. For example, it would include $P(h_2 \mid c_5)$, if
there are at least two hypotheses and five cues. Continuing our face recognizer,
this would look like $P(\text{Barack Obama} \mid \text{brown})$, which should be larger than
$P(\text{Barack Obama} \mid \text{blue})$ (he has brown eyes).
We now arrive at the fundamental problem, which is to calculate $P(h \mid c)$ after
the cue arrives. This is accomplished by Bayes’ rule:
\[ P(h \mid c) = \frac{P(c \mid h)\,P(h)}{P(c)}. \tag{6.5} \]
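A single Bayesian update of this form can be sketched as follows for the face recognizer with an eye-color cue. The people, the prior values, and the likelihood table are hypothetical numbers chosen only for illustration, and the denominator $P(c)$ is obtained by summing the numerators over all hypotheses.

```python
def bayes_update(prior, likelihood, cue):
    """Return the posterior P(h | cue) for every hypothesis h.

    prior:      dict mapping hypothesis -> P(h)
    likelihood: dict mapping hypothesis -> dict mapping cue -> P(cue | h)
    """
    # Numerator of Bayes' rule for each hypothesis: P(cue | h) * P(h).
    unnormalized = {h: likelihood[h][cue] * p for h, p in prior.items()}
    # Denominator P(cue): the sum of the numerators over all hypotheses.
    p_cue = sum(unnormalized.values())
    if p_cue == 0.0:
        raise ValueError("cue has zero probability under every hypothesis")
    return {h: value / p_cue for h, value in unnormalized.items()}


# Hypothetical prior over three familiar people.
prior = {"Alice": 0.5, "Bob": 0.3, "Carol": 0.2}

# Hypothetical P(eye color reported | person); each row sums to one.
likelihood = {
    "Alice": {"brown": 0.80, "blue": 0.10, "green": 0.05, "hazel": 0.05},
    "Bob":   {"brown": 0.10, "blue": 0.70, "green": 0.10, "hazel": 0.10},
    "Carol": {"brown": 0.25, "blue": 0.25, "green": 0.25, "hazel": 0.25},
}

posterior = bayes_update(prior, likelihood, "blue")
for person, p in posterior.items():
    print(person, round(p, 3))
```

With these made-up numbers, observing blue eyes shifts probability mass toward Bob and away from Alice, even though Alice was the most likely hypothesis beforehand. The returned posterior could be passed back in as the prior when another cue arrives, provided the cues are treated as conditionally independent given the hypothesis.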
The denominator can be expressed as
\[ P(c) = P(c \mid h_1)P(h_1) + P(c \mid h_2)P(h_2) + \cdots + P(c \mid h_n)P(h_n), \tag{6.6} \]
or it can be ignored as a normalization constant, at which point only relative
likelihoods are calculated instead of proper probabilities.
The only thing accomplished by Bayes’ rule was to express $P(h \mid c)$ in terms of
the prior distribution $P(h)$ and a new conditional distribution $P(c \mid h)$. The new
conditional distribution is easy to work with in terms of modeling. It characterizes
the likelihood that each specific cue will appear given that the hypothesis is true.
What if information arrives from a second cue detector? In this case, (6.5) is
applied again, but $P(h \mid c)$ is now considered the prior distribution with respect
to the new information. Let $D = \{d_1, d_2, \ldots, d_k\}$ represent the possible outputs
of the new cue detector. Bayes’ rule becomes
\[ P(h \mid c, d) = \frac{P(d \mid h)\,P(h \mid c)}{P(d \mid c)}. \tag{6.7} \]
Above, $P(d \mid h)$ makes what is called a conditional independence assumption:
$P(d \mid h) = P(d \mid h, c)$. This is simpler from a modeling perspective. More
generally, all four conditional parts of (6.7) should contain c because it is given
before d arrives. As information from even more cues becomes available, Bayes’
rule is applied again as many times as needed. One difficulty that occurs in practice
and is not modeled here is cognitive bias, which corresponds to numerous ways in
which humans make irrational judgments in spite of the probabilistic implications
of the data.

Multistable perception In some cases, our perceptual system may alternate
between two or more conclusions. This is called multistable perception, for which
the special case of two conclusions is called bistable perception. Figure 6.26(a)
shows two well-known examples. For the Necker cube, it is ambiguous which
cube face that is parallel to the viewing plane is in the foreground. It is possible
to switch between both interpretations, resulting in bistable perception. Figure
6.26(b) shows another example, in which people may see a rabbit or a duck at
various times. Another well-known example is called the spinning dancer illusion
by Nobuyuki Kayahara. In that case, the silhouette of a rotating dancer is shown
and it is possible to interpret the motion as clockwise or counterclockwise.

Figure 6.26: (a) The Necker cube, studied in 1832 by Swiss crystallographer Louis
Albert Necker. (b) The rabbit duck illusion, from the 23 October 1892 issue of
Fliegende Blätter.

McGurk effect The McGurk effect is an experiment that clearly indicates the
power of integration by mixing visual and auditory cues [207]. A video of a person
speaking is shown with the audio track dubbed so that the spoken sounds do not
match the video. Two types of illusions were then observed. If “ba” is heard and
“ga” is shown, then most subjects perceive “da” being said. This corresponds to a
plausible fusion of sounds that explains the mismatch, but does not correspond to
either original cue. Alternatively, the sounds may combine to produce a perceived
“bga” in the case of “ga” on the sound track and “ba” on the visual track.

Implications for VR Not all senses are taken over by VR. Thus, conflict will
arise because of mismatch between the real and virtual worlds. As stated several
times, the most problematic case of this is vection, which is a sickness-causing
conflict between visual and vestibular cues arising from apparent self motion in VR
while remaining stationary in the real world; see Section 8.4. As another example
of mismatch, the user’s body may sense that it is sitting in a chair, but the VR
experience may involve walking. There would then be a height mismatch between
the real and virtual worlds, in addition to mismatches based on proprioception
and touch. In addition to mismatches among the senses, imperfections in the VR
hardware, software, content, and interfaces cause inconsistencies in comparison
with real-world experiences. The result is that incorrect or unintended
interpretations may arise. Even worse, such inconsistencies may increase fatigue as
human neural structures use more energy to interpret the confusing combination. In
light of the McGurk effect, it is easy to believe that many unintended interpretations
or perceptions may arise from a VR system that does not provide perfectly consistent
cues.
VR is also quite capable of generating new multistable perceptions. One
example, which actually occurred in the VR industry, involved designing a popup
menu. Suppose that users are placed into a dark environment and a large menu
comes rushing up to them. A user may perceive one of two cases: 1) the menu
approaches the user, or 2) the user is rushing up to the menu. The vestibular sense
should be enough to resolve whether the user is moving, but the visual sense is
overpowering. Prior knowledge about which is happening helps yield the correct
perception. Unfortunately, if the wrong interpretation is made, then VR sickness
is increased due to the sensory conflict. Thus, our perceptual system could be
tricked into an interpretation that is worse for our health! Knowledge is one of
Further Reading
As with Chapter 5, much of the material from this chapter appears in textbooks on
sensation and perception [97, 204, 350]. For a collection of optical illusions and their
explanations, see [233]. For more on motion detection, see Chapter 7 of [204]. Related
to this is the history of motion pictures [32, 28].
To better understand the mathematical foundations of combining cues from multiple
sources, look for books on Bayesian analysis and statistical decision theory. For example,
see [267] and Chapter 9 of [163]. An important issue is adaptation to VR system flaws
through repeated use [282, 345]. This dramatically affects the perceptual results and
fatigue from mismatches, and is a form of perceptual learning, which will be discussed
in Section 12.1.
Chapter 7
Visual Rendering

screen pixels are covered by the transformed triangle and then illuminate them
according to the physics of the virtual world.
An important condition must also be checked: For each pixel, is the triangle
even visible to the eye, or will it be blocked by part of another triangle? This classic
visibility computation problem dramatically complicates the rendering process.
The general problem is to determine for any pair of points in the virtual world,
whether the line segment that connects them intersects with any objects (triangles).
If an intersection occurs, then the line-of-sight visibility between the two
points is blocked. The main difference between the two major families of rendering
methods is how visibility is handled.
7.1. RAY TRACING AND SHADING MODELS 179 180 S. M. LaValle: Virtual Reality
Figure 7.1: The first step in a ray tracing approach is called ray casting, which
extends a viewing ray that corresponds to a particular pixel on the image. The
ray starts at the focal point, which is the origin after the eye transform Teye has
been applied. The task is to determine what part of the virtual world model is
visible. This is the closest intersection point between the viewing ray and the set
of all triangles.

direction (vector), the closed-form solution involves basic operations from analytic
geometry, including dot products, cross products, and the plane equation [320].
For each triangle, it must be determined whether the ray intersects it. If not,
then the next triangle is considered. If it does, then the intersection is recorded as
the candidate solution only if it is closer than the closest intersection encountered
so far. After all triangles have been considered, the closest intersection point
will be found. Although this is simple, it is far more efficient to arrange the
triangles into a 3D data structure. Such structures are usually hierarchical so that
many triangles can be eliminated from consideration by quick coordinate tests.
Popular examples include BSP-trees and Bounding Volume Hierarchies [42, 85].
Algorithms that sort geometric information to obtain greater efficiency generally
fall under computational geometry [54]. In addition to eliminating many triangles
through quick tests, many methods of calculating the ray-triangle intersection have
been developed to reduce the number of operations. One of the most popular is
the Möller-Trumbore intersection algorithm [217].

Figure 7.2: In the Lambertian shading model, the light reaching the pixel depends
on the angle θ between the incoming light and the surface normal, but is
independent of the viewing angle.

is Lambertian shading, for which the resulting pixel R, G, B values are independent
of the angle at which the viewing ray strikes the surface. This corresponds to the case
of diffuse reflection, which is suitable for a “rough” surface (recall Figure 4.4). All
that matters is the angle that the surface makes with respect to the light source.
Let n be the outward surface normal and let ℓ be a vector from the surface
intersection point to the light source. Assume both n and ℓ are unit vectors, and
let θ denote the angle between them. The dot product n · ℓ = cos θ yields the
amount of attenuation (between 0 and 1) due to the tilting of the surface relative
to the light source. Think about how the effective area of the triangle is reduced
due to its tilt. A pixel under the Lambertian shading model is illuminated as
\[
\begin{aligned}
R &= d_R I_R \max(0, n \cdot \ell) \\
G &= d_G I_G \max(0, n \cdot \ell) \\
B &= d_B I_B \max(0, n \cdot \ell),
\end{aligned}
\tag{7.1}
\]
in which $(d_R, d_G, d_B)$ represents the spectral reflectance property of the material
(triangle) and $(I_R, I_G, I_B)$ represents the spectral power distribution of the light
source. Under the typical case of white light, $I_R = I_G = I_B$. For a white or gray
material, we would also have $d_R = d_G = d_B$.
Using vector notation, (7.1) can be compressed into
\[
L = d I \max(0, n \cdot \ell) \tag{7.2}
\]
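The closest-intersection search and the Lambertian model (7.1) can be sketched
together in a few lines of Python. This is only an illustrative sketch under stated
assumptions: the function names (ray_triangle, closest_hit, lambertian) and the tiny
two-triangle scene are hypothetical, the intersection test follows the
Möller-Trumbore approach, and no spatial data structure is used, so every triangle
is tested for every ray.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    length = math.sqrt(dot(a, a))
    return (a[0] / length, a[1] / length, a[2] / length)

def ray_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore test: distance t along the ray to the hit, or None."""
    p0, p1, p2 = tri
    e1, e2 = sub(p1, p0), sub(p2, p0)
    h = cross(direction, e2)
    det = dot(e1, h)
    if abs(det) < eps:              # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, p0)
    u = inv * dot(s, h)
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = inv * dot(direction, q)
    if v < 0.0 or u + v > 1.0:
        return None
    t = inv * dot(e2, q)
    return t if t > eps else None   # ignore hits behind the focal point

def closest_hit(origin, direction, triangles):
    """Keep the candidate only if it is closer than the best one so far."""
    best_t, best_tri = None, None
    for tri in triangles:
        t = ray_triangle(origin, direction, tri)
        if t is not None and (best_t is None or t < best_t):
            best_t, best_tri = t, tri
    return best_t, best_tri

def lambertian(d, I, n, l):
    """Equation (7.1): per-channel d * I * max(0, n . l), with n and l unit vectors."""
    k = max(0.0, dot(n, l))
    return tuple(dc * ic * k for dc, ic in zip(d, I))

# Hypothetical scene: two triangles facing the viewer, one behind the other.
triangles = [
    ((-1.0, -1.0, -5.0), (1.0, -1.0, -5.0), (0.0, 1.0, -5.0)),
    ((-1.0, -1.0, -2.0), (1.0, -1.0, -2.0), (0.0, 1.0, -2.0)),
]
origin, direction = (0.0, 0.0, 0.0), (0.0, 0.0, -1.0)
t, tri = closest_hit(origin, direction, triangles)
hit = tuple(o + t * c for o, c in zip(origin, direction))
n = normalize(cross(sub(tri[1], tri[0]), sub(tri[2], tri[0])))
l = normalize(sub((0.0, 2.0, 0.0), hit))      # assumed light position
rgb = lambertian((1.0, 0.5, 0.25), (1.0, 1.0, 1.0), n, l)
print(t, rgb)
```

The linear scan in closest_hit is the simple approach described in the text; a
hierarchical structure would replace that loop without changing the shading step.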
Figure 7.3: In the Blinn-Phong shading model, the light reaching the pixel depends
on the angle between the normal n and the bisector b of ℓ and v. If n = b,
then ideal reflection is obtained, as in the case of a mirror.

reflection would occur if v and ℓ form the same angle with respect to n. What
if the two angles are close, but do not quite match? The Blinn-Phong shading
model proposes that some amount of light is reflected, depending on the amount
of surface shininess and the difference between v and ℓ [24]. See Figure 7.3. The
bisector b is the vector obtained by averaging ℓ and v:
\[
b = \frac{\ell + v}{\|\ell + v\|}. \tag{7.3}
\]
Using the compressed vector notation, the Blinn-Phong shading model sets the
RGB pixel values as
\[
L = d I \max(0, n \cdot \ell) + s I \max(0, n \cdot b)^x. \tag{7.4}
\]
This additively takes into account shading due to both diffuse and specular components.
The first term is just the Lambertian shading model, (7.2). The second
component causes increasing amounts of light to be reflected as b becomes closer
to n. The exponent x is a material property that expresses the amount of surface
shininess. A lower value, such as x = 100, results in a mild amount of shininess,
whereas x = 10000 would make the surface almost like a mirror. This shading
model does not correspond directly to the physics of the interaction between light
and surfaces. It is merely a convenient and efficient heuristic, but widely used in
computer graphics.

Ambient shading Another heuristic is ambient shading, which causes an object
to glow without being illuminated by a light source. This lights surfaces that fall
into the shadows of all lights; otherwise, they would be completely black. In the
real world this does not happen because light interreflects between objects to illuminate
an entire environment. Such propagation has not been taken into account in the
shading model so far, thereby requiring a hack to fix it. Adding ambient shading
yields
\[
L = d I \max(0, n \cdot \ell) + s I \max(0, n \cdot b)^x + L_a, \tag{7.5}
\]
in which $L_a$ is the ambient light component.

Figure 7.4: A bidirectional reflectance distribution function (BRDF) meticulously
specifies the ratio of incoming and outgoing light energy for all possible perspectives.

Multiple light sources Typically, the virtual world contains multiple light
sources. In this case, the light from each is combined additively at the pixel.
The result for N light sources is
\[
L = L_a + \sum_{i=1}^{N} \left[ d I_i \max(0, n \cdot \ell_i) + s I_i \max(0, n \cdot b_i)^x \right], \tag{7.6}
\]
in which $I_i$, $\ell_i$, and $b_i$ correspond to each source.

BRDFs The shading models presented so far are in widespread use due to their
simplicity and efficiency, even though they neglect most of the physics. To account
for shading in a more precise and general way, a bidirectional reflectance
distribution function (BRDF) is constructed [231]; see Figure 7.4. The $\theta_i$ and $\theta_r$
parameters represent the angles of light source and viewing ray, respectively, with
respect to the surface. The $\phi_i$ and $\phi_r$ parameters range from 0 to 2π and represent
the angles made by the light and viewing vectors when looking straight down on
the surface (the vector n would point at your eye).
The BRDF is a function of the form
\[
f(\theta_i, \phi_i, \theta_r, \phi_r) = \frac{\text{radiance}}{\text{irradiance}}, \tag{7.7}
\]
in which radiance is the light energy reflected from the surface in directions $\theta_r$
and $\phi_r$ and irradiance is the light energy arriving at the surface from directions
$\theta_i$ and $\phi_i$. These are expressed at a differential level, roughly corresponding to an
infinitesimal surface patch. Informally, it is the ratio of the amount of outgoing
light to the amount of incoming light at one point on the surface. The previous
shading models can be expressed in terms of a simple BRDF. For Lambertian
shading, the BRDF is constant because the surface reflects equally in all directions.
The BRDF and its extensions can account for much more complex and physically
correct lighting effects, with a wide variety of surface textures. See Chapter 7 of
[5] for extensive coverage.
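The combined ambient, diffuse, and specular model for several light sources, (7.6),
can be sketched as follows. This is a minimal illustration rather than a reference
implementation: the function name blinn_phong and its argument layout are
hypothetical, all direction vectors are assumed to be unit length, and the degenerate
case in which ℓ and v point in exactly opposite directions (so that the bisector is
undefined) is not handled.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def blinn_phong(n, v, lights, d, s, x, ambient):
    """Evaluate (7.6) for one surface point.

    n, v    : unit surface normal and unit vector toward the viewer
    lights  : list of (l, I) pairs; l is a unit vector toward the light,
              I is its (R, G, B) intensity
    d, s    : diffuse and specular reflectance, each an (R, G, B) triple
    x       : shininess exponent
    ambient : ambient term L_a as an (R, G, B) triple
    """
    color = list(ambient)
    for l, I in lights:
        b = normalize(tuple(lc + vc for lc, vc in zip(l, v)))   # bisector (7.3)
        diffuse = max(0.0, dot(n, l))
        specular = max(0.0, dot(n, b)) ** x
        for c in range(3):
            color[c] += d[c] * I[c] * diffuse + s[c] * I[c] * specular
    return tuple(color)

# Hypothetical example: one light directly above, one at a grazing angle.
n = (0.0, 0.0, 1.0)
v = (0.0, 0.0, 1.0)
lights = [
    ((0.0, 0.0, 1.0), (1.0, 1.0, 1.0)),
    (normalize((1.0, 0.0, 0.1)), (0.5, 0.5, 0.5)),
]
pixel = blinn_phong(n, v, lights, d=(0.5, 0.5, 0.5), s=(0.2, 0.2, 0.2),
                    x=100, ambient=(0.1, 0.1, 0.1))
print(pixel)
```

Because the sum may exceed the displayable range when many lights are present, a
renderer would normally clamp or tone-map the result; that step is omitted here.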
Figure 7.5: Complications emerge with shiny surfaces because the viewpoints are
different for the right and left eyes. Using the Blinn-Phong shading model, a
specular reflection should have different brightness levels for each eye. It may be
difficult to match the effect so that it is consistent with real-world behavior.

Global illumination Recall that the ambient shading term (7.5) was introduced
to prevent surfaces in the shadows of the light source from appearing black. The
computationally intensive but proper way to fix this problem is to calculate how
light reflects from object to object in the virtual world. In this way, objects are
illuminated indirectly from the light that reflects from others, as in the real world.
Unfortunately, this effectively turns all object surfaces into potential sources of
light. This means that ray tracing must account for multiple reflections. This
requires considering piecewise linear paths from the light source to the viewpoint,
in which each bend corresponds to a reflection. An upper limit is usually set on the
number of bounces to consider. The simple Lambertian and Blinn-Phong models
are often used, but more general BRDFs are also common. Increasing levels of
realism can be calculated, but with corresponding increases in computation time.

VR-specific issues VR inherits all of the common issues from computer graphics,
but also contains unique challenges. Chapters 5 and 6 mentioned the increased
resolution and frame rate requirements. This provides strong pressure to reduce
rendering complexity. Furthermore, many heuristics that worked well for graphics
on a screen may be perceptibly wrong in VR. The combination of high field-of-view,
resolution, varying viewpoints, and stereo images may bring out new problems.
For example, Figure 7.5 illustrates how differing viewpoints from stereopsis
could affect the appearance of shiny surfaces. In general, some rendering artifacts
could even contribute to VR sickness. Throughout the remainder of this chapter,
complications that are unique to VR will be increasingly discussed.

7.2 Rasterization
The ray casting operation quickly becomes a bottleneck. For a 1080p image at
90Hz, it would need to be performed over 180 million times per second, and the
ray-triangle intersection test would be performed for every triangle (although data
structures such as a BSP would quickly eliminate many from consideration). In
most common cases, it is much more efficient to switch from such image-order
rendering to object-order rendering. The objects in our case are triangles and
the resulting process is called rasterization, which is the main function of modern
graphics processing units (GPUs). In this case, an image is rendered by iterating
over every triangle and attempting to color the pixels where the triangle lands
on the image. The main difficulty is that the method must solve the unavoidable
problem of determining which part, if any, of the triangle is the closest to the focal
point (roughly, the location of the virtual eye).

Figure 7.6: Due to the possibility of depth cycles, objects cannot be sorted in three
dimensions with respect to distance from the observer. Each object is partially in
front of one and partially behind another.

One way to solve it is to sort the triangles in depth order so that the closest
triangle is last. This enables the triangles to be drawn on the screen in back-to-front
order. If they are properly sorted, then any later triangle to be rendered will
rightfully clobber the image of previously rendered triangles at the same pixels.
The triangles can be drawn one-by-one while totally neglecting the problem of
determining which is nearest. This is known as the Painter’s algorithm. The main
flaw, however, is the potential existence of depth cycles, shown in Figure 7.6, in
which three or more triangles cannot be rendered correctly in any order by the
Painter’s algorithm. One possible fix is to detect such cases and split the triangles.

Depth buffer A simple and efficient method to resolve this problem is to manage
the depth problem on a pixel-by-pixel basis by maintaining a depth buffer (also
called z-buffer), which for every pixel records the distance of the triangle from the
focal point to the intersection point of the ray that intersects the triangle at that
pixel. In other words, if this were the ray casting approach, it would be distance
in which × denotes the standard vector cross product. These three conditions
ensure that p is “to the left” of each edge vector.
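The "to the left of each edge" idea can be illustrated with a short sketch. The
three conditions themselves are not reproduced in this excerpt, so the code below
assumes a common two-dimensional form: the triangle vertices are given in
counterclockwise order in image coordinates, and the z-component of the cross
product of each edge vector with the vector to p must be nonnegative. The function
names are hypothetical. The same edge quantities, once normalized by the triangle
area, give barycentric coordinates of p.

```python
def sub2(a, b):
    return (a[0] - b[0], a[1] - b[1])

def cross2(a, b):
    """z-component of the cross product of two 2D vectors."""
    return a[0] * b[1] - a[1] * b[0]

def inside_triangle(p, p1, p2, p3):
    """True if p is on or to the left of each directed edge p1->p2->p3->p1.

    Assumes the vertices are listed in counterclockwise order.
    """
    return (cross2(sub2(p2, p1), sub2(p, p1)) >= 0 and
            cross2(sub2(p3, p2), sub2(p, p2)) >= 0 and
            cross2(sub2(p1, p3), sub2(p, p3)) >= 0)

def barycentric(p, p1, p2, p3):
    """Weights (a1, a2, a3) with p = a1*p1 + a2*p2 + a3*p3 and a1 + a2 + a3 = 1."""
    area2 = cross2(sub2(p2, p1), sub2(p3, p1))       # twice the signed area
    a1 = cross2(sub2(p3, p2), sub2(p, p2)) / area2   # edge opposite p1
    a2 = cross2(sub2(p1, p3), sub2(p, p3)) / area2   # edge opposite p2
    a3 = cross2(sub2(p2, p1), sub2(p, p1)) / area2   # edge opposite p3
    return a1, a2, a3

tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
print(inside_triangle((0.25, 0.25), *tri), barycentric((0.25, 0.25), *tri))
```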
Figure 7.8: Texture mapping: A simple pattern or an entire image can be mapped
across the triangles and then rendered in the image to provide much more detail
than provided by the triangles in the model. (Figure from Wikipedia.)
Figure 7.9: Bump mapping: By artificially altering the surface normals, the shading
algorithms produce an effect that looks like a rough surface. (Figure by Brian
Vibber.)
be propagated over the surface [41]; see Figure 7.8. More generally, any digital
picture can be mapped onto the patch. The barycentric coordinates reference a
point inside of the image to be used to influence a pixel. The picture or “texture”
is treated as if it were painted onto the triangle; the lighting and reflectance
properties are additionally taken into account for shading the object.
Another possibility is normal mapping, which alters the shading process by
allowing the surface normal to be artificially varied over the triangle, even though
geometrically it is impossible. Recall from Section 7.1 that the normal is used
in the shading models. By allowing it to vary, simulated curvature can be given
to an object. An important case of normal mapping is called bump mapping,
which makes a flat surface look rough by irregularly perturbing the normals. If
the normals appear to have texture, then the surface will look rough after shading
is computed.
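As a rough illustration of bump mapping, the sketch below perturbs the normal of a
geometrically flat patch using the slope of a height function and then applies
Lambertian shading. The height-function formulation, the finite-difference step eps,
and all names are assumptions made for this example; production systems typically
read perturbed normals from a texture instead.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def bumped_normal(height, x, y, eps=1e-3):
    """Normal of a flat patch (facing +z) perturbed by a height function.

    The geometry stays flat; only the normal used for shading changes.
    """
    dhdx = (height(x + eps, y) - height(x - eps, y)) / (2 * eps)
    dhdy = (height(x, y + eps) - height(x, y - eps)) / (2 * eps)
    return normalize((-dhdx, -dhdy, 1.0))

def lambert(n, l):
    """Diffuse attenuation max(0, n . l) for unit vectors n and l."""
    return max(0.0, sum(a * b for a, b in zip(n, l)))

light = normalize((1.0, 0.0, 1.0))
xs = [i / 10 for i in range(11)]

# Constant height: every sample gets the same shade, so the patch looks flat.
flat = [lambert(bumped_normal(lambda x, y: 0.0, x, 0.0), light) for x in xs]

# Rippled height: the shade varies across the patch, so it looks rough.
bumpy = [lambert(bumped_normal(lambda x, y: 0.1 * math.sin(5 * x), x, 0.0), light)
         for x in xs]
print(flat[0], min(bumpy), max(bumpy))
```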
Figure 7.11: A mipmap stores the texture at multiple resolutions so that it can be
appropriately scaled without causing significant aliasing. The overhead for storing
the extra images is typically only 1/3 the size of the original (largest) image. (The
image is from NASA and the mipmap was created by Wikipedia user Mulad.)
Figure 7.12: (a) Due to the perspective transformation, the tiled texture suffers
from spatial aliasing as the depth increases. (b) The problem can be fixed by
performing supersampling.
By deciding to fully include or exclude the triangle based on the coordinates
of p alone, the staircasing effect is unavoidable. A better way is to render the
pixel according to the fraction of the pixel region that is covered by the trian-
gle. This way its values could be blended from multiple triangles that are visible
within the pixel region. Unfortunately, this requires supersampling, which means
casting rays at a much higher density than the pixel density so that the triangle
coverage fraction can be estimated. This dramatically increases cost. Commonly,
a compromise is reached in a method called multisample anti-aliasing (or MSAA),
in which only some values are calculated at the higher density. Typically, depth
values are calculated for each sample, but shading is not.
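The coverage-fraction idea behind supersampling can be sketched as follows. This is
a simplified illustration with hypothetical names: each pixel is assumed to be a
unit square whose lower-left corner is at integer coordinates, a regular k-by-k grid
of sample points stands in for the higher-density rays, and the estimated fraction
is used to blend the triangle color with a background color.

```python
def edge(a, b, p):
    """Signed area test: positive if p is to the left of the directed edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def covers(tri, p):
    """True if p lies inside or on the boundary of tri, for either vertex order."""
    w = [edge(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
    return all(x >= 0 for x in w) or all(x <= 0 for x in w)

def coverage(tri, px, py, k=4):
    """Estimate the fraction of pixel (px, py) covered by tri using k*k samples."""
    hits = 0
    for i in range(k):
        for j in range(k):
            sample = (px + (i + 0.5) / k, py + (j + 0.5) / k)
            if covers(tri, sample):
                hits += 1
    return hits / (k * k)

def blend(fraction, tri_color, background):
    """Mix triangle and background colors according to the coverage fraction."""
    return tuple(fraction * t + (1.0 - fraction) * b
                 for t, b in zip(tri_color, background))

tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
for pixel in [(0, 0), (1, 2), (5, 5)]:
    f = coverage(tri, *pixel)
    print(pixel, f, blend(f, (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)))
```

Pixels crossed by the diagonal edge receive an intermediate value instead of being
fully included or excluded, which softens the staircase. Raising k improves the
estimate at a cost that grows with k squared, which is the expense that
compromises such as MSAA try to limit.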
Figure 7.13: Due to the optical system in front of the screen, the viewing frustum
is replaced by a truncated cone in the case of a circularly symmetric view. Other
cross-sectional shapes may be possible to account for the asymmetry of each eye
view (for example, the nose is obstructing part of the view).

A spatial aliasing problem results from texture mapping. The viewing transformation
may dramatically reduce the size and aspect ratio of the original texture
as it is mapped from the virtual world onto the screen. This may leave insufficient
resolution to properly represent a repeating pattern in the texture; see Figure
7.12. This problem is often addressed in practice by pre-calculating and storing
a mipmap for each texture; see Figure 7.11. The texture is calculated at various
resolutions by performing high-density sampling and storing the rasterized result
in images. Based on the size and viewpoint of the triangle on the screen, the
appropriately scaled texture image is selected and mapped onto the triangle to
reduce the aliasing artifacts.
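The storage overhead of a mipmap and a simple way of choosing a level can be
checked with a short calculation. The sketch below assumes that each level halves
the width and height of the previous one down to a single pixel; the selection rule
in select_level (rounding the base-2 logarithm of the number of texels that fall on
one screen pixel along one axis) is a hypothetical heuristic for illustration, not
the rule used by any particular GPU.

```python
import math

def mip_chain(width, height):
    """Resolutions of all mipmap levels, starting with the original image."""
    levels = [(width, height)]
    while width > 1 or height > 1:
        width, height = max(1, width // 2), max(1, height // 2)
        levels.append((width, height))
    return levels

def overhead(levels):
    """Storage of the extra levels relative to the original image."""
    base = levels[0][0] * levels[0][1]
    extra = sum(w * h for w, h in levels[1:])
    return extra / base

def select_level(levels, texels_per_pixel):
    """Pick the level whose texel size best matches the screen pixel size."""
    k = int(round(math.log2(max(1.0, texels_per_pixel))))
    return min(k, len(levels) - 1)

levels = mip_chain(256, 256)
print(levels)
print(overhead(levels))          # close to 1/3, since 1/4 + 1/16 + ... = 1/3
print(select_level(levels, 4.0))
```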
Culling In practice, many triangles can be quickly eliminated before attempting
to render them. This results in a preprocessing phase of the rendering approach
called culling, which dramatically improves performance and enables faster frame
rates. The efficiency of this operation depends heavily on the data structure used
to represent the triangles. Thousands of triangles could be eliminated with a
single comparison of coordinates if they are all arranged in a hierarchical structure.
The most basic form of culling is called view volume culling, which eliminates all
triangles that are wholly outside of the viewing frustum (recall Figure 3.18). For a
VR headset, the frustum may have a curved cross section due to the limits of the
optical system (see Figure 7.13). In this case, the frustum must be replaced with
a region that has the appropriate shape. In the case of a truncated cone, a simple
geometric test can quickly eliminate all objects outside of the view. For example,
if
\[
\frac{\sqrt{x^2 + y^2}}{-z} > \tan\theta, \tag{7.14}
\]
in which 2θ is the angular field of view, then the point (x, y, z) is outside of the
cone. Alternatively, the stencil buffer can be used in a GPU to mark all pixels that
would be outside of the lens view. These are quickly eliminated from consideration
by a simple test as each frame is rendered.
Another form is called backface culling, which removes triangles that have
outward surface normals that point away from the focal point. These should not be
rendered “from behind” if the model is consistently formed. Additionally, occlusion
culling may be used to eliminate parts of the model that might be hidden from
view by a closer object. This can get complicated because it once again considers
the depth ordering problem. For complete details, see [5].

VR-specific rasterization problems The staircasing problem due to aliasing
is expected to be worse for VR because current resolutions are well below the
required retina display limit calculated in Section 5.4. The problem is made significantly
worse by the continuously changing viewpoint due to head motion. Even
as the user attempts to stare at an edge, the “stairs” appear to be more like an
“escalator” because the exact choice of pixels to include in a triangle depends on
subtle variations in the viewpoint. As part of our normal perceptual processes,
our eyes are drawn toward this distracting motion. With stereo viewpoints, the
situation is worse: The “escalators” from the right and left images will usually
not match. As the brain attempts to fuse the two images into one coherent view,
the aliasing artifacts provide a strong, moving mismatch. Reducing contrast at
edges and using anti-aliasing techniques help alleviate the problem, but aliasing
is likely to remain a significant problem until displays reach the required retina
display density for VR.
A more serious difficulty is caused by the enhanced depth perception afforded
by a VR system. Both head motions and stereo views enable users to perceive
small differences in depth across surfaces. This should be a positive outcome;
however, many tricks developed in computer graphics over the decades rely on the
fact that people cannot perceive these differences when a virtual world is rendered
onto a fixed screen that is viewed from a significant distance. The result for VR
is that texture maps may look fake. For example, texture mapping a picture of
a carpet onto the floor might inadvertently cause the floor to look as if it were
simply painted. In the real world we would certainly be able to distinguish painted
carpet from real carpet. The same problem occurs with normal mapping. A surface
that might look rough in a single static image due to bump mapping could look
completely flat in VR as both eyes converge onto the surface. Thus, as the quality
of VR systems improves, we should expect the rendering quality requirements to
increase, causing many old tricks to be modified or abandoned.

7.3 Correcting Optical Distortions
Recall from Section 4.3 that barrel and pincushion distortions are common for an
optical system with a high field of view (Figure 4.20). When looking through the
lens of a VR headset, a pincushion distortion typically results. If the images are
drawn on the screen without any correction, then the virtual world appears to be
incorrectly warped. If the user yaws his head back and forth, then fixed lines in
the world, such as walls, appear to dynamically change their curvature because
the distortion in the periphery is much stronger than in the center. If it is not
corrected, then the perception of stationarity will fail because static objects should
not appear to be warping dynamically. Furthermore, contributions may be made
to VR sickness because incorrect accelerations are being visually perceived near
the periphery.

Figure 7.14: A Fresnel lens (pronounced like “frenelle”) simulates a simple lens by
making a corrugated surface. The convex surface on the top lens is implemented
in the Fresnel lens shown on the bottom.

How can this problem be solved? Significant research is being done in this area,
and the possible solutions involve different optical systems and display technologies.
For example, digital light processing (DLP) technology directly projects light
into the eye without using lenses. Another way to greatly reduce this problem is to
use a Fresnel lens (see Figure 7.14), which more accurately controls the bending of
light rays by using a corrugated or sawtooth surface over a larger area; an aspheric
design can be implemented as well. A Fresnel lens is used, for example, in the HTC
Vive VR headset. One unfortunate side effect of Fresnel lenses is that glaring can
be frequently observed as light scatters across the ridges along the surface.
Whether small or large, the distortion can also be corrected in software. One
assumption is that the distortion is circularly symmetric. This means that the
amount of distortion depends only on the distance from the lens center, and not
the particular direction from the center. Even if the lens distortion is perfectly
circularly symmetric, it must also be placed so that it is centered over the eye.
Some headsets offer IPD adjustment, which allows the distance between the lenses
to be adjusted so that they are matched to the user’s eyes. If the eye is not centered
on the lens, then asymmetric distortion arises. The situation is not perfect because
as the eye rotates, the pupil moves along a spherical arc. As the position of the
pupil over the lens changes laterally, the distortion varies and becomes asymmetric.
This motivates making the lens as large as possible so that this problem is reduced.
Another factor is that the distortion will change as the distance between the lens
and the screen is altered. This adjustment may be useful to accommodate users
with nearsightedness or farsightedness, as done in the Samsung Gear VR headset.
The adjustment is also common in binoculars, which explains why
many people do not need their glasses to use them. To handle distortion correctly,
the headset should ideally sense the adjustment setting and take it into account.
To fix radially symmetric distortion, suppose that the transformation chain
Tcan Teye Trb has been applied to the geometry, resulting in the canonical view volume,
as covered in Section 3.5. All points that were inside of the viewing frustum
now have x and y coordinates ranging from −1 to 1. Consider referring to these
points using polar coordinates (r, θ):
\[
\begin{aligned}
r &= \sqrt{x^2 + y^2} \\
\theta &= \operatorname{atan2}(y, x),
\end{aligned}
\tag{7.15}
\]
in which atan2 represents the inverse tangent of y/x. This function is commonly
used in programming languages to return an angle θ that covers a full 2π range
(many implementations return it in the interval from −π to π). (The arctangent
alone cannot do this because the quadrant that (x, y) came from is needed.)
We now express the lens distortion in terms of transforming the radius r,
without affecting the direction θ (because of symmetry). Let f denote a function
that applies to positive real numbers and distorts the radius. Let $r_u$ denote the
undistorted radius, and let $r_d$ denote the distorted radius. Both pincushion and
barrel distortion are commonly approximated using polynomials with odd powers,
resulting in f being defined as
\[
r_d = f(r_u) = r_u + c_1 r_u^3 + c_2 r_u^5, \tag{7.16}
\]
in which $c_1$ and $c_2$ are suitably chosen constants. If $c_1 < 0$, then barrel distortion
occurs. If $c_1 > 0$, then pincushion distortion results. Higher-order polynomials
could also be used, such as adding a term $c_3 r_u^7$ on the right above; however, in
practice this is often considered unnecessary.

Figure 7.15: The rendered image appears to have a barrel distortion. Note that
the resolution is effectively dropped near the periphery. (Figure by Nvidia.)

Correcting the distortion involves two phases:
1. Determine the radial distortion function f for a particular headset, which
involves a particular lens placed at a fixed distance from the screen. This
is a regression or curve-fitting problem that involves an experimental setup
that measures the distortion of many points and selects the coefficients $c_1$,
$c_2$, and so on, that provide the best fit.
2. Determine the inverse of f so that it can be applied to the rendered image before
the lens causes its distortion. The composition of the inverse with f should
cancel out the distortion function.
Unfortunately, polynomial functions generally do not have inverses that can
be determined or even expressed in a closed form. Therefore, approximations are
used. One commonly used approximation is [118]:
\[
f^{-1}(r_d) \approx \frac{c_1 r_d^2 + c_2 r_d^4 + c_1^2 r_d^4 + c_2^2 r_d^8 + 2 c_1 c_2 r_d^6}{1 + 4 c_1 r_d^2 + 6 c_2 r_d^4}. \tag{7.17}
\]
Alternatively, the inverse can be calculated very accurately off-line and then stored
in an array for fast access. It needs to be done only once per headset design.
Linear interpolation can be used for improved accuracy. The inverse values can
be accurately calculated using Newton’s method, with initial guesses provided by
simply plotting $f(r_u)$ against $r_u$ and swapping the axes.
The transformation $f^{-1}$ could be worked directly into the perspective transformation,
thereby replacing Tp and Tcan with a nonlinear operation. By leveraging
the existing graphics rendering pipeline, it is instead handled as a post-processing The perfect system As a thought experiment, imagine the perfect VR system.
step. The process of transforming the image is sometimes called distortion shading As the head moves, the viewpoint must accordingly change for visual rendering.
because it can be implemented as a shading operation in the GPU; it has nothing A magic oracle perfectly indicates the head position and orientation at any time.
to do with “shading” as defined in Section 7.1. The rasterized image that was The VWG continuously maintains the positions and orientations of all objects
calculated using methods in Section 7.2 can be converted into a transformed im- in the virtual world. The visual rendering system maintains all perspective and
age using (7.17), or another representation of f −1 , on a pixel-by-pixel basis. If viewport transformations, and the entire rasterization process continuously sets
compensating for a pincushion distortion, the resulting image will appear to have the RGB values on the display according to the shading models. Progressing with
a barrel distortion; see Figure 7.15. To improve VR performance, multiresolution this fantasy, the display itself continuously updates, taking no time to switch the
shading is used in Nvidia GTX 1080 GPUs. One problem is that the resolution is pixels. The display has retina-level resolution, as described in Section 5.4, and a
effectively dropped near the periphery because of the transformed image (Figure dynamic range of light output over seven orders of magnitude to match human
7.15). This results in wasted shading calculations in the original image. Instead, perception. In this case, visual stimulation provided by the virtual world should
the image can be rendered before the transformation by taking into account the match what would occur in a similar physical world in terms of the geometry.
final resulting resolutions after the transformation. A lower-resolution image is There would be no errors in time and space (although the physics might not
rendered in a region that will become compressed by the transformation. match anyway due to assumptions about lighting, shading, material properties,
The methods described in this section may also be used for other optical dis- color spaces, and so on).
tortions that are radially symmetric. For example, chromatic aberration can be
partially corrected by transforming the red, green, and blue subpixels differently. Historical problems In practice, the perfect system is not realizable. All of
Each color is displaced radially by a different amount to compensate for the radial these operations require time to propagate information and perform computations.
distortion that occurs based on its wavelength. If chromatic aberration correction In early VR systems, the total motion-to-photons latency was often over 100ms.
is being used, then if the lenses are removed from the VR headset, it would become In the 1990s, 60ms was considered an acceptable amount. Latency has been stated
clear that the colors are not perfectly aligned in the images being rendered to the display. The rendering system must create a distortion of pixel placements on the basis of color so that they will be moved closer to the correct places after they pass through the lens.

7.4 Improving Latency and Frame Rates

The motion-to-photons latency in a VR headset is the amount of time it takes to update the display in response to a change in head orientation and position. For example, suppose the user is fixating on a stationary feature in the virtual world. As the head yaws to the right, the image of the feature on the display must immediately shift to the left. Otherwise, the feature will appear to move if the eyes remain fixated on it. This breaks the perception of stationarity.

A simple example Consider the following example to get a feeling for the latency problem. Let d be the density of the display in pixels per degree. Let ω be the angular velocity of the head in degrees per second. Let ℓ be the latency in seconds. Due to latency ℓ and angular velocity ω, the image is shifted by dωℓ pixels. For example, if d = 40 pixels per degree, ω = 50 degrees per second, and ℓ = 0.02 seconds, then the head rotates by one degree during the latency interval and the image is incorrectly displaced by dωℓ = 40 pixels. An extremely fast head turn might be at 300 degrees per second, which would result in a 240-pixel error.

as one of the greatest causes of VR sickness, and therefore one of the main obstructions to widespread adoption over the past decades. People generally adapt to a fixed latency, which somewhat mitigates the problem, but then causes problems when they have to readjust to the real world. Variable latencies are even worse due to the inability to adapt [68]. Fortunately, latency is no longer the main problem in most VR systems because of the latest-generation tracking, GPU, and display technology. The latency may be around 15 to 25ms, which is even compensated for by predictive methods in the tracking system. The result is that the effective latency is very close to zero. Thus, other factors are now contributing more strongly to VR sickness and fatigue, such as vection and optical aberrations.

Overview of latency reduction methods The following strategies are used together to both reduce the latency and to minimize the side effects of any remaining latency:

1. Lower the complexity of the virtual world.
2. Improve rendering pipeline performance.
3. Remove delays along the path from the rendered image to switching pixels.
4. Use prediction to estimate future viewpoints and world states.
5. Shift or distort the rendered image to compensate for last-moment viewpoint errors and missing frames.

Each of these will be described in succession.
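Returning to the simple example of latency-induced image shift, the product dωℓ can be evaluated directly. The following is a minimal Python sketch rather than anything taken from a VR toolkit; the function name pixel_error and its parameter names are hypothetical, chosen only to mirror the symbols d, ω, and ℓ.

```python
def pixel_error(d_pixels_per_degree, omega_deg_per_sec, latency_sec):
    """Image displacement in pixels caused by latency during a head rotation.

    Hypothetical helper: it multiplies display density d (pixels per degree),
    angular velocity omega (degrees per second), and latency (seconds).
    """
    return d_pixels_per_degree * omega_deg_per_sec * latency_sec


# The two head-turn speeds from the simple example, with d = 40 and 20 ms latency.
moderate_turn = pixel_error(40.0, 50.0, 0.02)
fast_turn = pixel_error(40.0, 300.0, 0.02)
print(moderate_turn, fast_turn)
```

Because the error is linear in each factor, halving the latency halves the displacement for any head speed, which is one way to see why the strategies listed above target latency directly.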
7.4. IMPROVING LATENCY AND FRAME RATES 197 198 S. M. LaValle: Virtual Reality
Figure 7.16: A variety of mesh simplification algorithms can be used to reduce the model complexity while retaining the most important structures. Shown here is a simplification of a hand model made by the open-source library CGAL. (Figure by Fernando Cacciola.)

Simplifying the virtual world Recall from Section 3.1 that the virtual world is composed of geometric primitives, which are usually 3D triangles arranged in a mesh. The chain of transformations and rasterization process must be applied for each triangle, resulting in a computational cost that is directly proportional to the number of triangles. Thus, a model that contains tens of millions of triangles will take orders of magnitude longer to render than one made of a few thousand. In many cases, we obtain models that are much larger than necessary. They can often be made much smaller (fewer triangles) with no perceptible difference, much in the same way that image, video, and audio compression works. Why are they too big in the first place? If the model was captured from a 3D scan of the real world, then it is likely to contain highly dense data. Capture systems such as the FARO Focus3D X Series capture large worlds while facing outside. Others, such as the Matter and Form MFSV1, capture a small object by rotating it on a turntable. As with cameras, systems that construct 3D models automatically are focused on producing highly accurate and dense representations, which maximize the model size. Even in the case of purely synthetic worlds, a modeling tool such as Maya or Blender will automatically construct a highly accurate mesh of triangles over a curved surface. Without taking specific care of later rendering burdens, the model could quickly become unwieldy. Fortunately, it is possible to reduce the model size by using mesh simplification algorithms; see Figure 7.16. In this case, one must be careful to make sure that the simplified model will have sufficient quality from all viewpoints that might arise in the targeted VR system. In some systems, such as Unity 3D, reducing the number of different material properties across the model will also improve performance.

In addition to reducing the rendering time, a simplified model will also lower computational demands on the Virtual World Generator (VWG). For a static world, the VWG does not need to perform any updates after initialization. The user simply views the fixed world. For dynamic worlds, the VWG maintains a simulation of the virtual world that moves all geometric bodies while satisfying physical laws that mimic the real world. It must handle the motions of any avatars, falling objects, moving vehicles, swaying trees, and so on. Collision detection methods are needed to make bodies react appropriately when in contact. Differential equations that model motion laws may be integrated to place bodies correctly over time. These issues will be explained in Chapter 8, but for now it is sufficient to understand that the VWG must maintain a coherent snapshot of the virtual world each time a rendering request is made. Thus, the VWG has a frame rate in the same way as a display or visual rendering system. Each VWG frame corresponds to the placement of all geometric bodies for a common time instant. How many times per second can the VWG be updated? Can a high, constant rate of VWG frames be maintained? What happens when a rendering request is made while the VWG is in the middle of updating the world? If the rendering module does not wait for the VWG update to be completed, then some objects could be incorrectly placed because some are updated while others are not. Thus, the system should ideally wait until a complete VWG frame is finished before rendering. This suggests that the VWG update should be at least as fast as the rendering process, and the two should be carefully synchronized so that a complete, fresh VWG frame is always ready for rendering.

Improving rendering performance Any techniques that improve rendering performance in the broad field of computer graphics apply here; however, one must avoid cases in which side effects that were imperceptible on a computer display become noticeable in VR. It was already mentioned in Section 7.2 that texture and normal mapping methods are less effective in VR for this reason; many more discrepancies are likely to be revealed in coming years. Regarding improvements that are unique to VR, it was mentioned in Sections 7.2 and 7.3 that the stencil buffer and multiresolution shading can be used to improve rendering performance by exploiting the shape and distortion due to the lens in a VR headset. A further improvement is to perform rasterization for the left and right eyes in parallel in the GPU, using one processor for each. The two processes are completely independent. This represents an important first step, among many that are likely to come, in the design of GPUs that are targeted specifically for VR.

From rendered image to switching pixels The problem of waiting for coherent VWG frames also arises in the process of rendering frames to the display: When it is time to scan out the rendered image to the display, it might not be
finished yet. Recall from Section 5.4 that most displays have a rolling scanout that draws the rows of the rasterized image, which sits in the video memory, onto the screen one-by-one. This was motivated by the motion of the electron beam that lit phosphors on analog TV screens. The motion is left to right, and top to bottom, much in the same way we would write out a page of English text with a pencil and paper. Due to inductive inertia in the magnetic coils that bent the beam, there was a period of several milliseconds called vblank (vertical blanking interval) in which the beam moved from the lower right back to the upper left of the screen to start the next frame. During this time, the beam was turned off to avoid drawing a diagonal streak across the frame, hence, the name “blanking”. Short blanking intervals also occurred after each horizontal line to bring the beam back from the right to the left.

In the era of digital displays, the scanning process is unnecessary, but it nevertheless persists and causes some trouble. Suppose that a display runs at 100 FPS. In this case, a request to draw a new rendered image is made every 10ms. Suppose that vblank occurs for 2ms and the remaining 8ms is spent drawing lines on the display. If the new rasterized image is written to the video memory during the 2ms of vblank, then it will be correctly drawn in the remaining 8ms. It is also possible to earn extra time through beam racing [25, 212]. However, if a new image is being written and passes where the beam is scanning it out, then tearing occurs because it appears as if the screen is torn into pieces; see Figure 7.17. If the VWG and rendering system produce frames at 300 FPS, then parts of 3 or 4 images could appear on the display because the image changes several times while the lines are being scanned out. One solution to this problem is to use vsync (pronounced “vee sink”), which is a flag that prevents the video memory from being written outside of the vblank interval.

Figure 7.17: If a new frame is written to the video memory while a display scanout occurs, then tearing arises, in which parts of two or more frames become visible at the same time.

Another strategy to avoid tearing is buffering, which is shown in Figure 7.18. The approach is simple for programmers because it allows the frames to be written in memory that is not being scanned for output to the display. The unfortunate side effect is that it increases the latency. For double buffering, a new frame is first drawn into the buffer and then transferred to the video memory during vblank. It is often difficult to control the rate at which frames are produced because the operating system may temporarily interrupt the process or alter its priority. In this case, triple buffering is an improvement that allows more time to render each frame. For avoiding tearing and providing smooth video game performance, buffering has been useful; however, it is detrimental to VR because of the increased latency.

Figure 7.18: Buffering is commonly used in visual rendering pipelines to avoid tearing and lost frames; however, it introduces more latency, which is detrimental to VR. (Figure by Wikipedia user Cmglee.)

Ideally, the displays should have a global scanout, in which all pixels are switched at the same time. This allows a much longer interval to write to the
video memory and avoids tearing. It would also reduce the latency in the time it takes to scan the first pixel to the last pixel. In our example, this was an 8ms interval. Finally, displays should reduce the pixel switching time as much as possible. In a smartphone LCD screen, it could take up to 20ms to switch pixels; however, OLED pixels can be switched in under 0.1ms.

The power of prediction For the rest of this section, we consider how to live with whatever latency remains. As another thought experiment, imagine that a fortune teller is able to accurately predict the future. With such a device, it should be possible to eliminate all latency problems. We would want to ask the fortune teller the following:

1. At what future time will the pixels be switching?
2. What will be the positions and orientations of all virtual world models at that time?
3. Where will the user be looking at that time?

Let $t_s$ be the answer to the first question. We need to ask the VWG to produce a frame for time $t_s$ and then perform visual rendering for the user’s viewpoint at time $t_s$. When the pixels are switched at time $t_s$, then the stimulus will be presented to the user at the exact time and place it is expected. In this case, there is zero effective latency.

Now consider what happens in practice. First note that using information from all three questions above implies significant time synchronization across the VR system: All operations must have access to a common clock. For the first question above, determining $t_s$ should be feasible if the computer is powerful enough and the VR system has enough control from the operating system to ensure that VWG frames will be consistently produced and rendered at the frame rate. The second question is easy for the case of a static virtual world. In the case of a dynamic world, it might be straightforward for all bodies that move according to predictable physical laws. However, it is difficult to predict what humans will do in the virtual world. This complicates the answers to both the second and third questions. Fortunately, the latency is so small that momentum and inertia play a significant role; see Chapter 8. Bodies in the matched zone are following physical laws of motion from the real world. These motions are sensed and tracked according to methods covered in Chapter 9. Although it might be hard to predict where you will be looking in 5 seconds, it is possible to predict with very high accuracy where your head will be positioned and oriented in 20ms. You have no free will on the scale of 20ms! Instead, momentum dominates and the head motion can be accurately predicted. Some body parts, especially fingers, have much less inertia, and therefore become more difficult to predict; however, these are not as important as predicting head motion. The viewpoint depends only on the head motion, and latency reduction is most critical in this case to avoid perceptual problems that lead to fatigue and VR sickness.

Perturbation    Image effect
∆α (yaw)        Horizontal shift
∆β (pitch)      Vertical shift
∆γ (roll)       Rotation about image center
∆x              Horizontal shift
∆y              Vertical shift
∆z              Contraction or expansion

Figure 7.19: Six cases of post-rendering image warp based on the degrees of freedom for a change in viewpoint. The first three correspond to an orientation change. The remaining three correspond to a position change. These operations can be visualized by turning on a digital camera and observing how the image changes under each of these perturbations.

Post-rendering image warp Due to both latency and imperfections in the prediction process, a last-moment adjustment might be needed before the frame is scanned out to the display. This is called post-rendering image warp [201] (it has also been rediscovered and called time warp and asynchronous reprojection in the recent VR industry). At this stage, there is no time to perform complicated shading operations; therefore, a simple transformation is made to the image.

Suppose that an image has been rasterized for a particular viewpoint, expressed by position (x, y, z) and orientation given by yaw, pitch, and roll (α, β, γ). What would be different about the image if it were rasterized for a nearby viewpoint? Based on the degrees of freedom for viewpoints, there are six types of adjustments; see Figure 7.19. Each one of these has a direction that is not specified in the figure. For example, if ∆α is positive, which corresponds to a small, counterclockwise yaw of the viewpoint, then the image is shifted horizontally to the right.

Figure 7.20 shows some examples of the image warp. Most cases require the rendered image to be larger than the targeted display; otherwise, there will be no data to shift into the warped image; see Figure 7.20(d). If this ever happens, then it is perhaps best to repeat pixels from the rendered image edge, rather than turning them black [201].

Flaws in the warped image Image warping due to orientation changes produces a correct image in the sense that it should be exactly what would have been rendered from scratch for that orientation (without taking aliasing issues into account). However, positional changes are incorrect! Perturbations in x and y do not account for motion parallax (recall from Section 6.1), which would require knowing the depths of the objects. Changes in z produce similarly incorrect images because nearby objects should expand or contract by a larger amount than further ones. To make matters worse, changes in viewpoint position might lead to a visibility event, in which part of an object may become visible only in the new viewpoint; see Figure 7.21. Data structures such as an aspect graph [250]
and visibility complex [252] are designed to maintain such events, but are usually not included in the rendering process. As latencies become shorter and prediction becomes better, the amount of perturbation is reduced. Careful perceptual studies are needed to evaluate conditions under which image warping errors are perceptible or cause discomfort. An alternative to image warping is to use parallel processing to sample several future viewpoints and render images for all of them. The most correct image can then be selected, to greatly reduce the image warping artifacts.

Figure 7.20: Several examples of post-rendering image warp: (a) Before warping, a larger image is rasterized. The red box shows the part that is intended to be sent to the display based on the viewpoint that was used at the time of rasterization; (b) A negative yaw (turning the head to the right) causes the red box to shift to the right. The image appears to shift to the left; (c) A positive pitch (looking upward) causes the box to shift upward; (d) In this case, the yaw is too large and there is no rasterized data to use for part of the image (this region is shown as a black rectangle).

Figure 7.21: If the viewing position changes, then a visibility event might be encountered. This means that part of the object might suddenly become visible from the new perspective. In this sample, a horizontal shift in the viewpoint reveals a side of the cube that was originally hidden. Furthermore, the top of the cube changes its shape.

Increasing the frame rate Post-rendering image warp can also be used to artificially increase the frame rate. For example, suppose that only one rasterized image is produced every 100 milliseconds by a weak computer or GPU. This would result in poor performance at 10 FPS. Suppose we would like to increase this to 100 FPS. In this case, a single rasterized image can be warped to produce frames every 10ms until the next rasterized image is computed. In this case, 9 warped frames are inserted for every rasterized image that is properly rendered. This process is called inbetweening or tweening, and has been used for over a century (one of the earliest examples is the making of Fantasmagorie, which was depicted in Figure 1.25(a)).

7.5 Immersive Photos and Videos
Up until now, this chapter has focused on rendering a virtual world that was
constructed synthetically from geometric models. The methods developed over
decades of computer graphics research have targeted this case. The trend has
recently changed, though, toward capturing real-world images and video, which
7.5. IMMERSIVE PHOTOS AND VIDEOS 205 206 S. M. LaValle: Virtual Reality
headset display. The result is that the user perceives it as a 3D movie, without
wearing the special glasses! Of course, she would be wearing a VR headset.
Figure 7.23: (a) The 360Heros Pro10 HD is a rig that mounts ten GoPro cameras in opposing directions to capture panoramic images. (b) The Ricoh Theta S captures panoramic photos and videos using only two cameras, each with a lens that provides a field of view larger than 180 degrees.

seams in the stitching may become perceptible. A tradeoff exists in terms of the number of cameras. By using many cameras, very high resolution captures can be made with relatively little optical distortion because each camera contributes a narrow field-of-view image to the photosphere. At the other extreme, as few as two cameras are sufficient, as in the case of the Ricoh Theta S (Figure 7.23(b)). The cameras are pointed 180 degrees apart and a fisheye lens is able to capture a view that is larger than 180 degrees. This design dramatically reduces costs, but requires significant unwarping of the two captured images.

Mapping onto a sphere The well-known map projection problem from cartography would be confronted to map the photosphere onto a screen; however, this does not arise when rendering a photosphere in VR because it is mapped directly onto a sphere in the virtual world. The sphere of all possible viewing directions maps to the virtual-world sphere without distortions. To directly use texture mapping techniques, the virtual-world sphere can be approximated by uniform triangles, as shown in Figure 7.24(a). The photosphere itself should be stored in a way that does not degrade its resolution in some places. We cannot simply use latitude and longitude coordinates to index the pixels because the difference in resolution between the poles and the equator would be too large. We could use coordinates that are similar to the way quaternions cover the sphere by using indices (a, b, c) and requiring that $a^2 + b^2 + c^2 = 1$; however, the structure of neighboring pixels (up, down, left, and right) is not clear. A simple and efficient compromise is to represent the photosphere as six square images, each corresponding to the face of a cube. This is like a virtual version of a six-sided CAVE projection system. Each image can then be easily mapped onto the mesh with little loss in resolution, as shown in Figure 7.24(b).

Figure 7.24: (a) The photosphere is texture-mapped onto the interior of a sphere that is modeled as a triangular mesh. (b) A photosphere stored as a cube of six images can be quickly mapped to the sphere with relatively small loss of resolution; a cross section is shown here.

Once the photosphere (or moviesphere) is rendered onto the virtual sphere, the rendering process is very similar to post-rendering image warp. The image presented to the user is shifted for the rotational cases that were described in Figure 7.19. In fact, the entire rasterization process could be performed only once, for the entire sphere, while the image rendered to the display is adjusted based on the viewing direction. Further optimizations could be made by even bypassing the mesh and directly forming the rasterized image from the captured images.

Perceptual issues Does the virtual world appear to be “3D” when viewing a photosphere or moviesphere? Recall from Section 6.1 that there are many more monocular depth cues than stereo cues. Due to the high field-of-view of modern VR headsets and monocular depth cues, a surprisingly immersive experience is obtained. Thus, it may feel more “3D” than people expect, even if the same part of the panoramic image is presented to both eyes. Many interesting questions remain for future research regarding the perception of panoramas. If different viewpoints are presented to the left and right eyes, then what should the radius of the virtual sphere be for comfortable and realistic viewing? Continuing further, suppose positional head tracking is used. This might improve viewing comfort,
but the virtual world will appear more flat because parallax is not functioning. For example, closer objects will not move more quickly as the head moves from side to side. Can simple transformations be performed to the images so that depth perception is enhanced? Can limited depth data, which could even be extracted automatically from the images, greatly improve parallax and depth perception? Another issue is designing interfaces inside of photospheres. Suppose we would like to have a shared experience with other users inside of the sphere. In this case, how do we perceive virtual objects inserted into the sphere, such as menus or avatars? How well would a virtual laser pointer work to select objects?

Panoramic light fields Panoramic images are simple to construct, but are clearly flawed because they do not account for how the surrounding world would appear from any viewpoint that could be obtained by user movement. To accurately determine this, the ideal situation would be to capture the entire light field of energy inside of whatever viewing volume the user is allowed to move in. A light field provides both the spectral power and direction of light propagation at every point in space. If the user is able to walk around in the physical world while wearing a VR headset, then this seems to be an impossible task. How can a rig of cameras capture the light energy in all possible locations at the same instant in an entire room? If the user is constrained to a small area, then the light field can be approximately captured by a rig of cameras arranged on a sphere; a prototype is shown in Figure 7.25. In this case, dozens of cameras may be necessary, and image warping techniques are used to approximate viewpoints between the cameras or from the interior of the spherical rig. To further improve the experience, light-field cameras (also called plenoptic cameras) offer the ability to capture both the intensity of light rays and the direction that they are traveling through space. This offers many advantages, such as refocusing images to different depths, after the light field has already been captured.

Figure 7.25: The Pantopticam prototype from Figure Digital uses dozens of cameras to improve the ability to approximate more viewpoints so that stereo viewing and parallax from position changes can be simulated.

Further Reading

Close connections exist between VR and computer graphics because both are required to push visual information onto a display; however, many subtle differences arise and VR is much less developed. For basic computer graphics, many texts provide additional coverage on the topics from this chapter; see, for example [202]. For much more detail on high-performance, high-resolution rendering for computer graphics, see [5]. Comprehensive coverage of BRDFs appears in [22], in addition to [5].

Ray tracing paradigms may need to be redesigned for VR. Useful algorithmic background from a computational geometry perspective can be found in [336, 42]. For optical distortion and correction background, see [46, 117, 130, 199, 326, 330]. Chromatic aberration correction appears in [230]. Automatic stitching of panoramas is covered in [33, 299, 317].
Chapter 8
8.1.1 A one-dimensional world

We start with the simplest case, which is shown in Figure 8.1. Imagine a 1D world in which motion is only possible in the vertical direction. Let y be the coordinate of a moving point. Its position at any time t is indicated by y(t), meaning that y actually defines a function of time. It is now as if y were an animated point, with an infinite number of frames per second!

1 The parameter s is used instead of t to indicate that it is integrated away, much like the index in a summation.

Acceleration The next step is to mathematically describe the change in velocity, which results in the acceleration, a; this is defined as:

$$a = \frac{dv(t)}{dt}. \tag{8.5}$$
8.1. VELOCITIES AND ACCELERATIONS 213 214 S. M. LaValle: Virtual Reality
The form is the same as (8.1), except that y has been replaced by v. Approximations can similarly be made. For example, ∆v ≈ a∆t.

The acceleration itself can vary over time, resulting in a(t). The following integral relates acceleration and velocity (compare to (8.4)):

$$v(t) = v(0) + \int_0^t a(s)\,ds. \tag{8.6}$$

Since acceleration may vary, you may wonder whether the naming process continues. It could, with the next derivative called jerk, followed by snap, crackle, and pop. In most cases, however, these higher-order derivatives are not necessary. One of the main reasons is that motions from classical physics are sufficiently characterized through forces and accelerations. For example, Newton’s Second Law states that F = ma, in which F is the force acting on a point, m is its mass, and a is the acceleration.

For a simple example that should be familiar, consider acceleration due to gravity, $g = 9.8\,\mathrm{m/s^2}$. It is as if the ground were accelerating upward by g; hence, the point accelerates downward relative to the Earth. Using (8.6) to integrate the acceleration, the velocity over time is v(t) = v(0) − gt. Using (8.4) to integrate the velocity and supposing v(0) = 0, we obtain

$$y(t) = y(0) - \frac{1}{2} g t^2. \tag{8.7}$$

8.1.2 Motion in a 3D world

A moving point Now consider the motion of a point in a 3D world $\mathbb{R}^3$. Imagine that a geometric model, as defined in Section 3.1, is moving over time. This causes each point (x, y, z) on the model to move, resulting in a function of time for each coordinate of each point:

$$(x(t), y(t), z(t)). \tag{8.8}$$

The velocity v and acceleration a from Section 8.1.1 must therefore expand to have three coordinates. The velocity v is replaced by $(v_x, v_y, v_z)$ to indicate velocity with respect to the x, y, and z coordinates, respectively. The magnitude of v is called the speed:

$$\sqrt{v_x^2 + v_y^2 + v_z^2} \tag{8.9}$$

Continuing further, the acceleration also expands to include three components: $(a_x, a_y, a_z)$.

Rigid-body motion Now suppose that a rigid body is moving through $\mathbb{R}^3$. In this case, all its points move together. How can we easily describe this motion? Recall from Section 3.2 that translations or rotations may be applied. First, consider a simple case. Suppose that rotations are prohibited, and the body is only allowed to translate through space. In this limited setting, knowing the position over time for one point on the body is sufficient for easily determining the positions of all points on the body over time. If one point has changed its position by some $(x_t, y_t, z_t)$, then all points have changed by the same amount. More importantly, the velocity and acceleration of every point would be identical.

Once rotation is allowed, this simple behavior breaks. As a body rotates, the points no longer maintain the same velocities and accelerations. This becomes crucial to understanding VR sickness in Section 8.4 and how tracking methods estimate positions and orientations from sensors embedded in the world, which will be discussed in Chapter 9.

Figure 8.2: (a) Consider a merry-go-round that rotates at constant angular velocity ω. (Picture by Seattle Parks and Recreation.) (b) In a top-down view, the velocity vector, v, for a point on the merry-go-round is tangent to the circle that contains it; the circle is centered on the axis of rotation and the acceleration vector, a, points toward its center.

Angular velocity To understand the issues, consider the simple case of a spinning merry-go-round, as shown in Figure 8.2(a). Its orientation at every time can be described by θ(t); see Figure 8.2(b). Let ω denote its angular velocity:

$$\omega = \frac{d\theta(t)}{dt}. \tag{8.10}$$

By default, ω has units of radians per second. If ω = 2π, then the rigid body returns to the same orientation after one second.

Assuming θ(0) = 0 and ω is constant, the orientation at time t is given by θ = ωt. To describe the motion of a point on the body, it will be convenient to use polar coordinates r and θ:

$$x = r\cos\theta \quad\text{and}\quad y = r\sin\theta. \tag{8.11}$$
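The polar-coordinate description of a point on the merry-go-round lends itself to a quick numerical experiment. The following is a minimal Python sketch, not taken from the text: the function names and the finite-difference step dt are illustrative assumptions. It samples the point position for a constant angular velocity ω using θ = ωt, and then estimates the velocity and acceleration by central differences, so that the estimated speed can be compared with rω and the estimated acceleration magnitude with rω².

```python
import math


def point_on_merry_go_round(r, omega, t):
    """Position (x, y) at time t for radius r and constant angular velocity omega (rad/s)."""
    theta = omega * t  # assumes theta(0) = 0
    return (r * math.cos(theta), r * math.sin(theta))


def finite_difference_motion(r, omega, t, dt=1e-4):
    """Estimate velocity and acceleration of the point by central differences.

    Hypothetical helper for illustration; dt is an assumed step size.
    """
    x0, y0 = point_on_merry_go_round(r, omega, t - dt)
    x1, y1 = point_on_merry_go_round(r, omega, t)
    x2, y2 = point_on_merry_go_round(r, omega, t + dt)
    velocity = ((x2 - x0) / (2 * dt), (y2 - y0) / (2 * dt))
    acceleration = ((x2 - 2 * x1 + x0) / dt**2, (y2 - 2 * y1 + y0) / dt**2)
    return velocity, acceleration


# Example values chosen arbitrarily: a 2 m radius and half a revolution per second.
r, omega, t = 2.0, math.pi, 0.3
v, a = finite_difference_motion(r, omega, t)
print(math.hypot(*v), r * omega)      # estimated speed next to r * omega
print(math.hypot(*a), r * omega**2)   # estimated acceleration magnitude next to r * omega^2
```

The two printed pairs should agree closely, and the estimated acceleration points from the sampled position back toward the origin, which is the axis of rotation in this 2D view.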
Substituting θ = ωt yields

$$x = r\cos\omega t \quad\text{and}\quad y = r\sin\omega t. \tag{8.12}$$

Taking the derivative with respect to time yields²

$$v_x = -r\omega\sin\omega t \quad\text{and}\quad v_y = r\omega\cos\omega t. \tag{8.13}$$

The velocity is a 2D vector that when placed at the point is tangent to the circle that contains the point (x, y); see Figure 8.2(b).

This makes intuitive sense because the point is heading in that direction; however, the direction quickly changes because it must move along a circle. This change in velocity implies that a nonzero acceleration occurs. The acceleration of the point (x, y) is obtained by taking the derivative again:

$$a_x = -r\omega^2\cos\omega t \quad\text{and}\quad a_y = -r\omega^2\sin\omega t. \tag{8.14}$$

The result is a 2D acceleration vector that is pointing toward the center (Figure 8.2(b)), which is the rotation axis. This is called centripetal acceleration. If you were standing at that point, then you would feel a pull in the opposite direction, as if nature were attempting to fling you away from the center. This is precisely how artificial gravity can be achieved in a rotating space station.

3D angular velocity Now consider the rotation of a 3D rigid body. Recall from Section 3.3 that Euler’s rotation theorem implies that every 3D rotation can be described as a rotation θ about an axis $v = (v_1, v_2, v_3)$ through the origin. As the orientation of the body changes over a short period of time ∆t, imagine the axis that corresponds to the change in rotation. In the case of the merry-go-round, the axis would be v = (0, 1, 0). More generally, v could be any unit vector.

The 3D angular velocity is therefore expressed as a 3D vector:

$$(\omega_x, \omega_y, \omega_z), \tag{8.15}$$

which can be imagined as taking the original ω from the 2D case and multiplying it by the vector v. This weights the components according to the coordinate axes. Thus, the components could be considered as $\omega_x = \omega v_1$, $\omega_y = \omega v_2$, and $\omega_z = \omega v_3$. The $\omega_x$, $\omega_y$, and $\omega_z$ components also correspond to the rotation rate in terms of pitch, yaw, and roll, respectively. We avoided these representations in Section 3.3

Angular acceleration If ω is allowed to vary over time, then we must consider angular acceleration. In the 2D case, this is defined as

$$\alpha = \frac{d\omega(t)}{dt}. \tag{8.16}$$

For the 3D case, there are three components, which results in

$$(\alpha_x, \alpha_y, \alpha_z). \tag{8.17}$$

These can be interpreted as accelerations of pitch, yaw, and roll angles, respectively.

8.2 The Vestibular System

As mentioned in Section 2.3, the balance sense (or vestibular sense) provides information to the brain about how the head is oriented or how it is moving in general. This is accomplished through vestibular organs that measure both linear and angular accelerations of the head. These organs, together with their associated neural pathways, will be referred to as the vestibular system. This system plays a crucial role for bodily functions that involve motion, from ordinary activity such as walking or running, to activities that require substantial talent and training, such as gymnastics or ballet dancing. Recall from Section 5.3 that it also enables eye motions that counteract head movements via the VOR.

The vestibular system is important to VR because it is usually neglected, which leads to a mismatch of perceptual cues (recall this problem from Section 6.4). In current VR systems, there is no engineered device that renders vestibular signals to a display that precisely stimulates the vestibular organs to values as desired. Some possibilities may exist in the future with galvanic vestibular stimulation, which provides electrical stimulation to the organ [80, 79]; however, it may take many years before such techniques are sufficiently accurate, comfortable, and generally approved for safe use by the masses. Another possibility is to stimulate the vestibular system through low-frequency vibrations, which at the very least provides some distraction.

Physiology Figure 8.4 shows the location of the vestibular organs inside of the human head. As in the cases of eyes and ears, there are two symmetric organs, corresponding to the right and left sides. Figure 8.3 shows the physiology of each vestibular organ. The cochlea handles hearing, which is covered in Section 11.2,
due to noncommutativity and kinematic singularities; however, it turns out that and the remaining parts belong to the vestibular system. The utricle and saccule
for velocities these problems do not exist [303]. Thus, we can avoid quaternions measure linear acceleration; together they form the otolith system. When the head
at this stage. is not tilted, the sensing surface of the utricle mostly lies in the horizontal plane (or
xz plane in our common coordinate systems), whereas the corresponding surface
2
If this is unfamiliar, then look up the derivatives of sines and cosines, and the chain rule, of the saccule lies in a vertical plane that is aligned in the forward direction (this
from standard calculus sources (for example, [320]). is called the sagittal plane, or yz plane). As will be explained shortly, the utricle
senses acceleration components $a_x$ and $a_z$, and the saccule senses $a_y$ and $a_z$ ($a_z$ is redundantly sensed).

Figure 8.3: The vestibular organ.

Figure 8.5: A depiction of an otolith organ (utricle or saccule), which senses linear acceleration. (Figure by Lippincott Williams & Wilkins.)

The semicircular canals measure angular acceleration. Each canal has a diameter of about 0.2 to 0.3 mm, and is bent along a circular arc with a diameter of about 2 to 3 mm. Amazingly, the three canals are roughly perpendicular so that they independently measure three components of angular acceleration. The particular canal names are anterior canal, posterior canal, and lateral canal. They are not closely aligned with our usual 3D coordinate axes. Note from Figure 8.4 that each set of canals is rotated by 45 degrees with respect to the vertical axis. Thus, the anterior canal of the left ear aligns with the posterior canal of the right ear. Likewise, the posterior canal of the left ear aligns with the anterior canal of the right ear. Although not visible in the figure, the lateral canal is also tilted about 30 degrees away from level. Nevertheless, all three components of angular acceleration are sensed because the canals are roughly perpendicular.
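
The claim that perpendicular canals suffice, even when they are not aligned with the coordinate axes, can be checked with a little linear algebra. The following Python sketch is an idealization and not an anatomical model: it assumes three exactly perpendicular unit sensing axes obtained by rotating the coordinate axes by 45 degrees about the vertical ($y$) axis, and it ignores the 30-degree tilt of the lateral canal. The names `rotate_about_y`, `canal_axes`, `sense`, and `reconstruct`, as well as the sample acceleration, are hypothetical choices made only for illustration.

```python
import math

def rotate_about_y(v, angle):
    """Rotate the 3D vector v about the vertical (y) axis by angle (radians)."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Assumption: idealized, exactly perpendicular sensing axes, rotated 45 degrees
# about the vertical axis relative to the usual x, y, z coordinate axes.
canal_axes = [rotate_about_y(e, math.radians(45.0))
              for e in ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))]

def sense(alpha):
    """Each canal reports the component of alpha along its own axis."""
    return [dot(alpha, axis) for axis in canal_axes]

def reconstruct(readings):
    """Recombine the three readings into (alpha_x, alpha_y, alpha_z)."""
    return tuple(sum(r * axis[i] for r, axis in zip(readings, canal_axes))
                 for i in range(3))

alpha = (0.3, -1.2, 0.5)      # an arbitrary angular acceleration
readings = sense(alpha)       # what the three rotated canals would report
recovered = reconstruct(readings)
```

Because the three axes are mutually perpendicular unit vectors, `recovered` matches `alpha` up to floating-point rounding, even though no single canal is aligned with a coordinate axis.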
Figure 8.6: Because of the Einstein equivalence principle, the otolith organs cannot distinguish linear acceleration of the head from tilt with respect to gravity. In either case, the cilia deflect in the same way, sending equivalent signals to the neural structures.

cause the cilia to bend. To distinguish between particular directions inside of this plane, the cilia are polarized so that each cell is sensitive to one particular direction. This is accomplished by a thicker, lead hair called the kinocilium, to which all other hairs of the cell are attached by a ribbon across their tips so that they all bend together.

One major sensing limitation arises because of a fundamental law from physics: The Einstein equivalence principle. In addition to the vestibular system, it also impacts VR tracking systems (see Section 9.2). The problem is gravity. If we were deep in space, far away from any gravitational forces, then linear accelerations measured by a sensor would correspond to pure accelerations with respect to a fixed coordinate frame. On the Earth, we also experience force due to gravity, which feels as if we were on a rocket ship accelerating upward at roughly $9.8\,\mathrm{m/s^2}$. The equivalence principle states that the effects of gravity and true linear accelerations on a body are indistinguishable. Figure 8.6 shows the result in terms of the otolith organs. The same signals are sent to the brain whether the head is tilted or it is linearly accelerating. If you close your eyes or wear a VR headset, then you should not be able to distinguish tilt from acceleration. In most settings, we are not confused because the vestibular signals are accompanied by other stimuli when accelerating, such as vision and a revving engine.

Sensing angular acceleration The semicircular canals use the same principle as the otolith organs. They measure acceleration by bending cilia at the end of hair cells. A viscous fluid moves inside of each canal. A flexible structure called the cupula blocks one small section of the canal and contains the hair cells; see Figure 8.7. Compare the rotation of a canal to the merry-go-round. If we were to place a liquid-filled tube around the periphery of the merry-go-round, then the fluid would remain fairly stable at a constant angular velocity. However, if angular acceleration is applied, then due to friction between the fluid and the tube (and also internal fluid viscosity), the fluid would start to travel inside the tube. In the semicircular canal, the moving fluid applies pressure to the cupula, causing it to deform and bend the cilia on hair cells inside of it. Note that a constant angular velocity does not, in principle, cause pressure on the cupula; thus, the semicircular canals measure angular acceleration as opposed to velocity. Each canal is polarized in the sense that it responds mainly to rotations about an axis perpendicular to the plane that contains the entire canal.

Figure 8.7: The cupula contains a center membrane that houses the cilia. If angular acceleration occurs that is aligned with the canal direction, then pressure is applied to the cupula, which causes the cilia to bend and send neural signals.

Impact on perception Cues from the vestibular system are generally weak in comparison to other senses, especially vision. For example, a common danger for a skier buried in an avalanche is that he cannot easily determine which way is up without visual cues to accompany the perception of gravity from the vestibular system. Thus, the vestibular system functions well when providing consistent cues with other systems, including vision and proprioception. Mismatched cues are problematic. For example, some people may experience vertigo when the vestibular system is not functioning correctly. In this case, they feel as if the world around them is spinning or swaying. Common symptoms are nausea, vomiting, sweating, and difficulties walking. This may even impact eye movements because of the VOR. Section 8.4 explains a bad side effect that results from mismatched vestibular and visual cues in VR.

8.3 Physics in the Virtual World

8.3.1 Tailoring the Physics to the Experience

If we expect to fool our brains into believing that we inhabit the virtual world, then many of our expectations from the real world should be matched in the virtual world. We have already seen this in the case of the physics of light (Chapter 4)
applying to visual rendering of virtual worlds (Chapter 7). Motions in the virtual
world should also behave in a familiar way.
This implies that the VWG contains a physics engine that governs the motions
of bodies in the virtual world by following principles from the physical world.
Forces acting on bodies, gravity, fluid flows, and collisions between bodies should
be handled in perceptually convincing ways. Physics engines arise throughout
engineering and physics in the context of any simulation. In video games, computer
graphics, and film, these engines perform operations that are very close to our
needs for VR. This is why popular game engines such as Unity 3D and Unreal
Engine have been quickly adapted for use in VR. As stated in Section 2.2, we have
not yet arrived at an era in which general and widely adopted VR engines exist; therefore, modern game engines are worth understanding and utilizing at present.

To determine what kind of physics engine needs to be borrowed, adapted, or constructed from scratch, one should think about the desired VR experience and determine the kinds of motions that will arise. Some common, generic questions are:

• Will the matched zone remain fixed, or will the user need to be moved by locomotion? If locomotion is needed, then will the user walk, run, swim, drive cars, or fly spaceships?

• Will the user interact with objects? If so, then what kind of interaction is needed? Possibilities include carrying weapons, opening doors, tossing objects, pouring drinks, operating machinery, drawing pictures, and assembling structures.

• Will multiple users be sharing the same virtual space? If so, then how will their motions be coordinated or constrained?

• Will the virtual world contain entities that appear to move autonomously, such as robots, animals, or humans?

• Will the user be immersed in a familiar or exotic setting? A familiar setting could be a home, classroom, park, or city streets. An exotic setting might be scuba diving, lunar exploration, or traveling through blood vessels.

In addition to the physics engine, these questions will also guide the design of the interface, which is addressed in Chapter 10.

Based on the answers to the questions above, the physics engine design may be simple and efficient, or completely overwhelming. As mentioned in Section 7.4, a key challenge is to keep the virtual world frequently updated so that interactions between users and objects are well synchronized and renderers provide a low-latency projection onto displays.

Note that the goal may not always be to perfectly match what would happen in the physical world. In a familiar setting, we might expect significant matching; however, in exotic settings, it often becomes more important to make a comfortable experience, rather than matching reality perfectly. Even in the case of simulating oneself walking around in the world, we often want to deviate from real-world physics because of vection, which causes VR sickness (see Section 8.4).

Figure 8.8: (a) A virtual car (Cheetah) that appears in the game Grand Theft Auto; how many degrees of freedom should it have? (b) A human skeleton, with rigid bodies connected via joints, commonly underlies the motions of an avatar. (Figure from SoftKinetic).

The remainder of this section covers some fundamental aspects that commonly arise: 1) numerical simulation of physical systems, 2) the control of systems using human input, and 3) collision detection, which determines whether bodies are interfering with each other.

8.3.2 Numerical simulation

The state of the virtual world Imagine a virtual world that contains many moving rigid bodies. For each body, think about its degrees of freedom (DOFs), which corresponds to the number of independent parameters needed to uniquely determine its position and orientation. We would like to know the complete list of parameters needed to put every body in its proper place in a single time instant. A specification of values for all of these parameters is defined as the state of the virtual world.

The job of the physics engine can then be described as calculating the virtual world state for every time instant or “snapshot” of the virtual world that would be needed by a rendering system. Once the state is determined, the mathematical transforms of Chapter 3 are used to place the bodies correctly in the world and calculate how they should appear on displays.

Degrees of freedom How many parameters are there in a virtual world model? As discussed in Section 3.2, a free-floating body has 6 DOFs which implies 6 parameters to place it anywhere. In many cases, DOFs are lost due to constraints.
For example, a ball that rolls on the ground has only 5 DOFs because it can achieve any 2D position along the ground and also have any 3D orientation. It might be sufficient to describe a car with 3 DOFs by specifying the position along the ground (two parameters) and the direction it is facing (one parameter); see Figure 8.8(a). However, if the car is allowed to fly through the air while performing stunts or crashing, then all 6 DOFs are needed.

For many models, rigid bodies are attached together in a way that allows relative motions. This is called multibody kinematics [163, 303]. For example, a car usually has 4 wheels which can roll to provide one rotational DOF per wheel. Furthermore, the front wheels can be steered to provide an additional DOF. Steering usually turns the front wheels in unison, which implies that one DOF is sufficient to describe both wheels. If the car has a complicated suspension system, then it cannot be treated as a single rigid body, which would add many more DOFs.

Similarly, an animated character can be made by attaching rigid bodies to form a skeleton; see Figure 8.8(b). Each rigid body in the skeleton is attached to one or more other bodies by a joint. For example, a simple human character can be formed by attaching arms, legs, and a neck to a rigid torso. The upper left arm is attached to the torso by a shoulder joint. The lower part of the arm is attached by an elbow joint, and so on. Some joints allow more DOFs than others. For example, the shoulder joint has 3 DOFs because it can yaw, pitch, and roll with respect to the torso, but an elbow joint has only one DOF.

To fully model the flexibility of the human body, 244 DOFs are needed, which are controlled by 630 muscles [366]. In many settings, this would be too much detail, which might lead to high computational complexity and difficult implementation. Furthermore, one should always beware of the uncanny valley (mentioned in Section 1.1), in which more realism might lead to increased perceived creepiness of the character. Thus, having more DOFs is not clearly better, and it is up to a VR content creator to determine how much mobility is needed to bring a character to life, in a way that is compelling for a targeted purpose.

In the extreme case, rigid bodies are not sufficient to model the world. We might want to see waves rippling realistically across a lake, or hair gently flowing in the breeze. In these general settings, nonrigid models are used, in which case the state can be considered as a continuous function. For example, a function of the form $y = f(x, z)$ could describe the surface of the water. Without making some limiting simplifications, the result could effectively be an infinite number of DOFs. Motions in this setting are typically described using partial differential equations (PDEs), which are integrated numerically to obtain the state at a desired time. Usually, the computational cost is high enough to prohibit their use in interactive VR experiences, unless shortcuts are taken by precomputing motions or dramatically simplifying the model.

Differential equations We now introduce some basic differential equations to model motions. The resulting description is often called a dynamical system. The first step is to describe rigid body velocities in terms of state. Returning to models that involve one or more rigid bodies, the state corresponds to a finite number of parameters. Let

$$x = (x_1, x_2, \ldots, x_n) \tag{8.18}$$

denote an $n$-dimensional state vector. If each $x_i$ corresponds to a position or orientation parameter for a rigid body, then the state vector puts all bodies in their place. Let

$$\dot{x}_i = \frac{dx_i}{dt} \tag{8.19}$$

represent the time derivative, or velocity, for each parameter.

To obtain the state at any time $t$, the velocities need to be integrated over time. Following (8.4), the integration of each state variable determines the value at time $t$:

$$x_i(t) = x_i(0) + \int_0^t \dot{x}_i(s)\,ds, \tag{8.20}$$

in which $x_i(0)$ is the value of $x_i$ at time $t = 0$.

Two main problems arise with (8.20):

1. The integral almost always must be evaluated numerically.

2. The velocity $\dot{x}_i(t)$ must be specified at each time $t$.

Sampling rate For the first problem, time is discretized into steps, in which $\Delta t$ is the step size or sampling rate. For example, $\Delta t$ might be 1 ms, in which case the state can be calculated for times $t = 0, 0.001, 0.002, \ldots$, in terms of seconds. This can be considered as a kind of frame rate for the physics engine. Each $\Delta t$ corresponds to the production of a new frame.

As mentioned in Section 7.4, the VWG should synchronize the production of virtual world frames with rendering processes so that the world is not caught in an intermediate state with some variables updated to the new time and others stuck at the previous time. This is a kind of tearing in the virtual world. This does not, however, imply that the frame rates are the same between renderers and the physics engine. Typically, the frame rate for the physics engine is much higher to improve numerical accuracy.

Using the sampling rate $\Delta t$, (8.20) is approximated as

$$x_i((k+1)\Delta t) \approx x_i(0) + \sum_{j=0}^{k} \dot{x}_i(j\Delta t)\,\Delta t, \tag{8.21}$$

for each state variable $x_i$.

It is simpler to view (8.21) one step at a time. Let $x_i[k]$ denote $x_i(k\Delta t)$, which is the state at time $t = k\Delta t$. The following is an update law that expresses the new state $x_i[k+1]$ in terms of the old state $x_i[k]$:

$$x_i[k+1] \approx x_i[k] + \dot{x}_i(k\Delta t)\,\Delta t, \tag{8.22}$$

which starts with $x_i[0] = x_i(0)$.
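
To make the update law (8.22) concrete, here is a minimal Python sketch for a single state variable. It assumes, purely for illustration, that the velocity is given as a known function of time, in this case the cosine, so that the exact answer (the sine) is available for comparison. The function name `euler_integrate` and the chosen step sizes are hypothetical and do not correspond to any particular physics engine.

```python
import math

def euler_integrate(velocity, x0, dt, num_steps):
    """Repeatedly apply x[k+1] = x[k] + velocity(k * dt) * dt, starting at x0."""
    x = x0
    for k in range(num_steps):
        x = x + velocity(k * dt) * dt
    return x

# Assumed example: velocity(t) = cos(t) and x(0) = 0, so the exact state is sin(t).
exact = math.sin(1.0)
coarse = euler_integrate(math.cos, 0.0, 0.1, 10)      # 10 steps of 100 ms
fine = euler_integrate(math.cos, 0.0, 0.001, 1000)    # 1000 steps of 1 ms

coarse_error = abs(coarse - exact)
fine_error = abs(fine - exact)
```

Both runs cover one second of simulated time. The run with the smaller step size ends closer to the exact value, which is consistent with the remark above that the physics engine typically runs at a much higher frame rate than the renderer to improve numerical accuracy.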
Runge-Kutta integration The approximation used in (8.21) is known as Euler integration. It is the simplest approximation, but does not perform well enough in many practical settings. One of the most common improvements is the fourth-order Runge-Kutta integration method, which expresses the new state as

$$x_i[k+1] \approx x_i[k] + \frac{\Delta t}{6}(w_1 + 2w_2 + 2w_3 + w_4), \tag{8.23}$$

in which

$$\begin{aligned}
w_1 &= f(\dot{x}_i(k\Delta t)) \\
w_2 &= f(\dot{x}_i(k\Delta t + \tfrac{1}{2}\Delta t) + \tfrac{1}{2}\Delta t\, w_1) \\
w_3 &= f(\dot{x}_i(k\Delta t + \tfrac{1}{2}\Delta t) + \tfrac{1}{2}\Delta t\, w_2) \\
w_4 &= f(\dot{x}_i(k\Delta t + \Delta t) + \Delta t\, w_3).
\end{aligned} \tag{8.24}$$

Although this is more expensive than Euler integration, the improved accuracy is usually worthwhile in practice. Many other methods exist, with varying performance depending on the particular ways in which $\dot{x}$ is expressed and varies over time [133].

Time-invariant dynamical systems The second problem from (8.20) is to determine an expression for $\dot{x}(t)$. This is where the laws of physics, such as the acceleration of rigid bodies due to applied forces and gravity, enter. The most common case is time-invariant dynamical systems, in which $\dot{x}$ depends only on the current state and not the particular time. This means each component $x_i$ is expressed as

$$\dot{x}_i = f_i(x_1, x_2, \ldots, x_n), \tag{8.25}$$

for some given vector-valued function $f = (f_1, \ldots, f_n)$. This can be written in compressed form by using $x$ and $\dot{x}$ to represent $n$-dimensional vectors:

$$\dot{x} = f(x). \tag{8.26}$$

The expression above is often called the state transition equation because it indicates the state’s rate of change.

Here is a simple, one-dimensional example of a state transition equation:

$$\dot{x} = 2x - 1. \tag{8.27}$$

This is called a linear differential equation. The velocity $\dot{x}$ roughly doubles with the value of $x$. Fortunately, linear problems can be fully solved “on paper”. The solution to (8.27) is of the general form

$$x(t) = \frac{1}{2} + c\,e^{2t}, \tag{8.28}$$

in which $c$ is a constant that depends on the given value for $x(0)$.

The phase space Unfortunately, motions are usually described in terms of accelerations (and sometimes higher-order derivatives), which need to be integrated twice. This leads to higher-order differential equations, which are difficult to work with. For this reason, phase space representations were developed in physics and engineering. In this case, the velocities of the state variables are themselves treated as state variables. That way, the accelerations become the velocities of the velocity variables.

For example, suppose that a position $x_1$ is acted upon by gravity, which generates an acceleration $a = -9.8\,\mathrm{m/s^2}$. This leads to a second variable $x_2$, which is defined as the velocity of $x_1$. Thus, by definition, $\dot{x}_1 = x_2$. Furthermore, $\dot{x}_2 = a$ because the derivative of velocity is acceleration. Both of these equations fit the form of (8.25). Generally, the number of states increases to incorporate accelerations (or even higher-order derivatives), but the resulting dynamics are expressed in the form (8.25), which is easier to work with.

Handling user input Now consider the case in which a user commands an object to move. Examples include driving a car, flying a spaceship, or walking an avatar around. This introduces some new parameters, called the controls, actions, or inputs to the dynamical system. Differential equations that include these new parameters are called control systems [14].

Let $u = (u_1, u_2, \ldots, u_m)$ be a vector of controls. The state transition equation in (8.26) is simply extended to include $u$:

$$\dot{x} = f(x, u). \tag{8.29}$$

Figure 8.9 shows a useful example, which involves driving a car. The control $u_s$ determines the speed of the car. For example, $u_s = 1$ drives forward, and $u_s = -1$ drives in reverse. Setting $u_s = 10$ drives forward at a much faster rate. The control $u_\phi$ determines how the front wheels are steered. The state vector is $(x, z, \theta)$, which corresponds to the position and orientation of the car in the horizontal, $xz$ plane. The state transition equation is:

$$\begin{aligned}
\dot{x} &= u_s \cos\theta \\
\dot{z} &= u_s \sin\theta \\
\dot{\theta} &= \frac{u_s}{L}\tan u_\phi.
\end{aligned} \tag{8.30}$$

Using Runge-Kutta integration, or a similar numerical method, the future states can be calculated for the car, given that controls $u_s$ and $u_\phi$ are applied over time.

This model can also be used to steer the virtual walking of a VR user from first-person perspective. The viewpoint then changes according to $(x, z, \theta)$, while the height $y$ remains fixed. For the model in (8.30), the car must drive forward or backward to change its orientation. By changing the third component to $\dot{\theta} = u_\omega$, the user could instead specify the angular velocity directly. This would cause the user to rotate in place, as if on a merry-go-round. Many more examples
like these appear in Chapter 13 of [163], including bodies that are controlled via accelerations.

Figure 8.9: A top-down view of a simple, steerable car. Its position and orientation are given by $(x, y, \theta)$. The parameter $\rho$ is the minimum turning radius, which depends on the maximum allowable steering angle $\phi$. This model can also be used to “steer” human avatars, by placing the viewpoint above the center of the rear axle.

It is sometimes helpful conceptually to define the motions in terms of discrete points in time, called stages. Using numerical integration of (8.29), we can think about applying a control $u$ over time $\Delta t$ to obtain a new state $x[k+1]$:

$$x[k+1] = F(x[k], u[k]). \tag{8.31}$$

The function $F$ is obtained by integrating (8.29) over $\Delta t$. Thus, if the state is $x[k]$, and $u[k]$ is applied, then $F$ calculates $x[k+1]$ as the state at the next stage.

8.3.3 Collision detection

One of the greatest challenges in building a physics engine is handling collisions between bodies. Standard laws of motion from physics or engineering usually do not take into account such interactions. Therefore, specialized algorithms are used to detect when such collisions occur and respond appropriately. Collision detection methods and corresponding software are plentiful because of widespread needs in computer graphics simulations and video games, and also for motion planning of robots.

Solid or boundary model? Figure 8.10 shows one of the first difficulties with collision detection, in terms of two triangles in a 2D world. The first two cases (Figures 8.10(a) and 8.10(b)) show obvious cases; however, the third case, Figure 8.10(c), could be ambiguous. If one triangle is wholly inside of another, then is this a collision? If we interpret the outer triangle as a solid model, then yes. If the outer triangle is only the boundary edges, and is meant to have an empty interior, then the answer is no. This is why emphasis was placed on having a coherent model in Section 3.1; otherwise, the boundary might not be established well enough to distinguish the inside from the outside.

Figure 8.10: Three interesting cases for collision detection (these are 2D examples). The last case may or may not cause collision, depending on the model.

Distance functions Many collision detection methods benefit from maintaining a distance function, which keeps track of how far the bodies are from colliding. For example, let $A$ and $B$ denote the set of all points occupied in $\mathbb{R}^3$ by two different models. If they are in collision, then their intersection $A \cap B$ is not empty. If they are not in collision, then the distance between $A$ and $B$ is the Euclidean distance between the closest pair of points, taking one from $A$ and one from $B$.³ Let $d(A, B)$ denote this distance. If $A$ and $B$ intersect, then $d(A, B) = 0$ because any point in $A \cap B$ will yield zero distance. If $A$ and $B$ do not intersect, then $d(A, B) > 0$, which implies that they are not in collision (in other words, collision free).

If $d(A, B)$ is large, then $A$ and $B$ are most likely to be collision free in the near future, even if one or both are moving. This leads to a family of collision detection methods called incremental distance computation, which assumes that between successive calls to the algorithm, the bodies move only a small amount. Under this assumption the algorithm achieves “almost constant time” performance for the case of convex polyhedral bodies [181, 214]. Nonconvex bodies can be decomposed into convex components.

³ This assumes models contain all of the points on their boundary and that they have finite extent; otherwise, topological difficulties arise [122, 163].

A concept related to distance is penetration depth, which indicates how far one model is poking into another [182]. This is useful for setting a threshold on how
much interference between the two bodies is allowed. For example, the user might be able to poke his head two centimeters into a wall, but beyond that, an action should be taken.

Simple collision tests At the lowest level, collision detection usually requires testing a pair of model primitives to determine whether they intersect. In the case of models formed from 3D triangles, we need a method that determines whether two triangles intersect. This is similar to the ray-triangle intersection test that was needed for visual rendering in Section 7.1, and involves basic tools from analytic geometry, such as cross products and plane equations. Efficient methods are given in [106, 216].

Broad and narrow phases Suppose that a virtual world has been defined with millions of triangles. If two complicated, nonconvex bodies are to be checked for collision, then the computational cost may be high. For this complicated situation, collision detection often becomes a two-phase process:

1. Broad Phase: In the broad phase, the task is to avoid performing expensive computations for bodies that are far away from each other. Simple bounding boxes can be placed around each of the bodies, and simple tests can be performed to avoid costly collision checking unless the boxes intersect. Hashing schemes can be employed in some cases to greatly reduce the number of pairs of boxes that have to be tested for intersection [215].

2. Narrow Phase: In the narrow phase, individual pairs of model parts are each checked carefully for collision. This involves the expensive tests, such as triangle-triangle intersection.

In the broad phase, hierarchical methods generally decompose each body into a tree. Each vertex in the tree represents a bounding region that contains some subset of the body. The bounding region of the root vertex contains the whole body. Two opposing criteria guide the selection of the type of bounding region:

1. The region should fit the intended model points as tightly as possible.

2. The intersection test for two regions should be as efficient as possible.

Several popular choices are shown in Figure 8.11, for the case of an L-shaped body. Hierarchical methods are also useful for quickly eliminating many triangles from consideration in visual rendering, as mentioned in Section 7.1.

Figure 8.11: Four different kinds of bounding regions: (a) sphere, (b) axis-aligned bounding box (AABB), (c) oriented bounding box (OBB), and (d) convex hull. Each usually provides a tighter approximation than the previous one but is more expensive to test for intersection with others.

The tree is constructed for a body, $A$ (or alternatively, $B$), recursively as follows. For each vertex, consider the set $X$ of all points in $A$ that are contained in the bounding region. Two child vertices are constructed by defining two smaller bounding regions whose union covers $X$. The split is made so that the portion covered by each child is of similar size. If the geometric model consists of primitives such as triangles, then a split could be made to separate the triangles into two sets of roughly the same number of triangles. A bounding region is then computed for each of the children. Figure 8.12 shows an example of a split for the case of an L-shaped body. Children are generated recursively by making splits until very simple sets are obtained. For example, in the case of triangles in space, a split is made unless the vertex represents a single triangle. In this case, it is easy to test for the intersection of two triangles.

Figure 8.12: The large circle shows the bounding region for a vertex that covers an L-shaped body. After performing a split along the dashed line, two smaller circles are used to cover the two halves of the body. Each circle corresponds to a child vertex.

Consider the problem of determining whether bodies $A$ and $B$ are in collision. Suppose that the trees $T_a$ and $T_b$ have been constructed for $A$ and $B$, respectively. If the bounding regions of the root vertices of $T_a$ and $T_b$ do not intersect, then it is known that $T_a$ and $T_b$ are not in collision without performing any additional computation. If the bounding regions do intersect, then the bounding regions of the children of $T_a$ are compared to the bounding region of $T_b$. If either of these intersect, then the bounding region of $T_b$ is replaced with the bounding regions of its children, and the process continues recursively. As long as the bounding regions intersect, lower levels of the trees are traversed, until eventually the leaves are reached. At the leaves the algorithm tests the individual triangles for collision,
8.4. MISMATCHED MOTION AND VECTION 231 232 S. M. LaValle: Virtual Reality
instead of bounding regions. Note that as the trees are traversed, if a bounding
region from the vertex v1 of Ta does not intersect the bounding region from a
vertex, v2 , of Tb , then no children of v1 have to be compared to children of v2 .
Usually, this dramatically reduces the number of comparisons, relative to a naive
approach that tests all pairs of triangles for intersection.
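
The recursive tree comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular engine: bounding regions are 2D axis-aligned boxes, each leaf holds a box as a stand-in for a triangle (so the exact triangle-triangle test is replaced by a box test), and the names `Vertex`, `build_tree`, and `in_collision` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def boxes_overlap(p: Box, q: Box) -> bool:
    """Cheap bounding-region test; touching boxes count as overlapping."""
    return p[0] <= q[2] and q[0] <= p[2] and p[1] <= q[3] and q[1] <= p[3]


def bounding_box(boxes: List[Box]) -> Box:
    """Smallest axis-aligned box that contains all of the given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


@dataclass
class Vertex:
    bound: Box                                   # bounding region of this subset
    children: List["Vertex"] = field(default_factory=list)
    primitive: Optional[Box] = None              # set only at leaves


def build_tree(primitives: List[Box]) -> Vertex:
    """Split recursively until each leaf represents a single primitive."""
    if not primitives:
        raise ValueError("a body needs at least one primitive")
    if len(primitives) == 1:
        return Vertex(bound=primitives[0], primitive=primitives[0])
    ordered = sorted(primitives, key=lambda b: b[0] + b[2])  # sort by x center
    mid = len(ordered) // 2
    return Vertex(bound=bounding_box(ordered),
                  children=[build_tree(ordered[:mid]), build_tree(ordered[mid:])])


def in_collision(va: Vertex, vb: Vertex) -> bool:
    """Descend both trees only where the bounding regions intersect."""
    if not boxes_overlap(va.bound, vb.bound):
        return False                 # prunes every pair of descendants at once
    if va.primitive is not None and vb.primitive is not None:
        # Stand-in for the exact narrow-phase test between two primitives.
        return boxes_overlap(va.primitive, vb.primitive)
    if va.primitive is None:
        return any(in_collision(child, vb) for child in va.children)
    return any(in_collision(va, child) for child in vb.children)
```

For an L-shaped body built from two boxes, a small body that sits inside the empty corner of the L overlaps the root bounding box but is rejected one level down, which is the pruning effect described above.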
is a constant vector field, which assigns vx = −1 and vy = 0 everywhere; see Figure 8.14(a). The vector field

$$(x, y) \mapsto (x + y,\; x + y) \tag{8.34}$$

is non-constant, and assigns vx = vy = x + y at each point (x, y); see Figure 8.14(b). For this vector field, the velocity direction is always diagonal, but the length of the vector (speed) depends on x + y.

To most accurately describe the motion of features along the retina, the vector field should be defined over a spherical surface that corresponds to the locations of the photoreceptors. Instead, we will describe vector fields over a square region, with the understanding that it should be transformed onto a sphere for greater accuracy.

5. Vertical vection: The viewpoint is translated upward, corresponding to positive vy, and resulting in downward flow as shown in Figure 8.15(e). Once again, parallax causes the speed of features to depend on the distance of the corresponding object. This enables vertical vection to be distinguished from pitch vection.

6. Forward/backward vection: If the viewpoint is translated along the optical axis away from the scene (positive vz), then the features flow inward toward the image center, as shown in Figure 8.15(f). Their speed depends
on both their distance from the image center and the distance of their cor-
responding objects in the virtual world. The resulting illusion is backward
motion. Translation in the negative z direction results in perceived forward
motion (as in the case of the Millennium Falcon spaceship after its jump to
hyperspace in the Star Wars movies).
The first three are sometimes called circular vection, and the last three are known as linear vection. Since our eyes are drawn toward moving features, changing the viewpoint may trigger smooth pursuit eye movements (recall from Section 5.3). In this case, the optical flows shown in Figure 8.15 would not correspond to the motions of the features on the retina. Thus, our characterization so far ignores eye movements, which are often designed to counteract optical flow and provide stable images on the retina. Nevertheless, due to proprioception, the brain is aware of these eye rotations, which results in an equivalent perception of self motion.
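
The optical flows in this section are described by vector fields, which are simply functions from an image position (x, y) to a velocity (vx, vy). As a small sketch, the constant field and the field of (8.34) can be written directly as functions; the function names below are hypothetical and chosen only for illustration.

```python
import math


def constant_field(x: float, y: float) -> tuple:
    """Constant vector field: vx = -1 and vy = 0 at every point."""
    return (-1.0, 0.0)


def diagonal_field(x: float, y: float) -> tuple:
    """The field of (8.34): vx = vy = x + y, so the direction is always diagonal."""
    s = x + y
    return (s, s)


def speed(v: tuple) -> float:
    """Length of a velocity vector."""
    return math.hypot(v[0], v[1])


# Sample both fields on a coarse grid over the square [-1, 1] x [-1, 1].
grid = [(-1.0 + 0.5 * i, -1.0 + 0.5 * j) for i in range(5) for j in range(5)]
constant_speeds = [speed(constant_field(x, y)) for x, y in grid]
diagonal_speeds = [speed(diagonal_field(x, y)) for x, y in grid]
```

Every entry of `constant_speeds` is the same, whereas the entries of `diagonal_speeds` vary with x + y and vanish along the line x + y = 0.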
All forms of vection cause perceived velocity, but the perception of acceleration is more complicated. First consider pure rotation of the viewpoint. Angular acceleration is perceived if the rotation rate of yaw, pitch, or roll vection is varied. Linear acceleration is also perceived, even in the case of yaw, pitch, or roll vection at constant angular velocity. This is due to the merry-go-round effect, which was shown in Figure 8.2(b).
Now consider pure linear vection (no rotation). Any linear acceleration of the viewpoint will be perceived as an acceleration. However, if the viewpoint moves at constant velocity, then no acceleration is perceived; this is the only form of vection for which that is the case. In a VR headset, the user may nevertheless perceive accelerations due to optical distortions or other imperfections in the rendering and display.
cues that indicate different accelerations. In some cases, these cues may be more consistent, and in other cases, they may diverge further.

Factors that affect sensitivity The intensity of vection is affected by many factors:

• Percentage of field of view: If only a small part of the visual field is moving, then people tend to perceive that it is caused by a moving object. However, if most of the visual field is moving, then they perceive themselves as moving. The human visual system actually includes neurons with receptive fields that cover a large fraction of the retina for the purpose of detecting self motion [35]. As VR headsets have increased their field of view, they project onto a larger region of the retina, thereby strengthening vection cues.

• Distance from center view: Recall from Section 5.1 that the photoreceptors are not uniformly distributed, with the highest density being at the innermost part of the fovea. Thus, detection may seem stronger near the center. However, in the cases of yaw and forward/backward vection, the optical flow vectors are stronger at the periphery, which indicates that detection may be stronger at the periphery. Sensitivity to the optical flow may therefore be strongest somewhere between the center view and the periphery, depending on the viewpoint velocities, distances to objects, photoreceptor densities, and neural detection mechanisms.

• Exposure time: The perception of self motion due to vection increases with the time of exposure to the optical flow. If the period of exposure is very brief, such as a few milliseconds, then no vection may occur.

• Spatial frequency: If the virtual world is complicated, with many small structures or textures, then the number of visual features will be greatly increased and the optical flow becomes a stronger signal. As the VR headset display resolution increases, higher spatial frequencies can be generated.

• Contrast: With higher levels of contrast, the optical flow signal is stronger because the features are more readily detected. Therefore, vection typically occurs with greater intensity.

• Other sensory cues: Recall from Section 6.4 that a perceptual phenomenon depends on the combination of many cues. Vection can be enhanced by providing additional consistent cues. For example, forward vection could be accompanied by a fan blowing in the user’s face, a rumbling engine, and the sounds of stationary objects in the virtual world racing by. Likewise, vection can be weakened by providing cues that are consistent with the real world, where no corresponding motion is occurring.

• Prior knowledge: Just knowing beforehand what kind of motion should be perceived will affect the onset of vection. This induces a prior bias that might take longer to overcome if the bias is against self motion, but less time to overcome if it is consistent with self motion. The prior bias could be from someone telling the user what is going to happen, or it could simply be from an accumulation of similar visual experiences through the user’s lifetime. Furthermore, the user might expect the motion as the result of an action taken, such as turning the steering wheel of a virtual car.

• Attention: If the user is distracted by another activity, such as aiming a virtual weapon or selecting a menu option, then vection and its side effects may be mitigated.

• Prior training or adaptation: With enough exposure, the body may learn to distinguish vection from true motion to the point that vection becomes comfortable. Thus, many users can be trained to overcome VR sickness through repeated, prolonged exposure.

Due to all of these factors, and the imperfections of modern VR headsets, it becomes extremely difficult to characterize the potency of vection and its resulting side effects on user comfort.

Further Reading

For basic concepts of vector fields, velocities, and dynamical systems, see [12]. Modeling and analysis of mechanical dynamical systems appears in [275]. The specific problem of human body movement is covered in [364, 365]. See [103] for an overview of game engines, including issues such as simulated physics and collision detection. For coverage of particular collision detection algorithms, see [100, 182].
A nice introduction to the vestibular system, including its response as a dynamical system, is [150]. Vection and visually induced motion sickness are thoroughly surveyed in [145], which includes an extensive collection of references for further reading. Some key articles that address sensitivities to vection include [6, 13, 69, 178, 179, 279, 340].
Chapter 9

Tracking

Keeping track of motion in the physical world is a crucial part of any VR system. Tracking was one of the largest obstacles to bringing VR headsets into consumer electronics, and it will remain a major challenge due to our desire to expand and improve VR experiences. Highly accurate tracking methods have been mostly enabled by commodity hardware components, such as inertial measurement units (IMUs) and cameras, that have plummeted in size and cost due to the smartphone industry.
Three categories of tracking may appear in VR systems, based on what is being tracked:

1. The user’s sense organs: Recall from Section 2.1 that sense organs, such as eyes and ears, have DOFs that are controlled by the body. If a display is attached to a sense organ, and it should be perceived in VR as being attached to the surrounding world, then the position and orientation of the organ need to be tracked. The inverse of the tracked transformation is applied to the stimulus to correctly “undo” these DOFs. Most of the focus is on head tracking, which is sufficient for visual and aural components of VR; however, the visual system may further require eye tracking if the rendering and display technology requires compensating for the eye movements discussed in Section 5.3.

2. The user’s other body parts: If the user would like to see a compelling representation of his body in the virtual world, then its motion should be tracked so that it can be reproduced in the matched zone. Perhaps facial expressions or hand gestures are needed for interaction. Although perfect matching is ideal for tracking sense organs, it is not required for tracking other body parts. Small movements in the real world could convert into larger virtual world motions so that the user exerts less energy. In the limiting case, the user could simply press a button to change the body configuration. For example, she might grasp an object in her virtual hand by a single click.

3. The rest of the environment: In the real world that surrounds the user, physical objects may be tracked. For objects that exist in the physical world but not the virtual world, the system might alert the user to their presence for safety reasons. Imagine that the user is about to hit a wall, or trip over a toddler. In some VR applications, the tracked physical objects may be matched in VR so that the user receives touch feedback while interacting with them. In other applications, such as telepresence, a large part of the physical world could be “brought into” the virtual world through live capture.

Section 9.1 covers the easy case of tracking rotations around a single axis to prepare for Section 9.2, which extends the framework to tracking the 3-DOF orientation of a 3D rigid body. This relies mainly on the angular velocity readings of an IMU. The most common use is to track the head that wears a VR headset, but it may apply to tracking handheld controllers or other devices. Section 9.3 addresses the tracking of position and orientation together, which in most systems requires line-of-sight visibility between a fixed part of the physical world and the object being tracked. Section 9.4 discusses the case of tracking multiple bodies that are attached together by joints. Finally, Section 9.5 covers the case of using sensors to build a representation of the physical world so that it can be brought into the virtual world.

9.1 Tracking 2D Orientation

This section explains how the orientation of a rigid body is estimated using an inertial measurement unit (IMU). The main application is determining the viewpoint orientation, Reye from Section 3.4, while the user is wearing a VR headset. Another application is estimating the orientation of a hand-held controller. For example, suppose we would like to make a laser pointer that works in the virtual world, based on a direction indicated by the user. The location of a bright red dot in the scene would be determined by the estimated orientation of a controller. More generally, the orientation of any human body part or moving object in the physical world can be determined if it has an attached IMU.
To estimate orientation, we first consider the 2D case by closely following the merry-go-round model of Section 8.1.2. The technical issues are easy to visualize in this case, and extend to the more important case of 3D rotations. Thus, imagine that we mount a gyroscope on a spinning merry-go-round. Its job is to measure the angular velocity as the merry-go-round spins. It will be convenient throughout this chapter to distinguish a true parameter value from an estimate. To accomplish this, a “hat” will be placed over estimates. Thus, let ω̂ correspond to the estimated or measured angular velocity, which may not be the same as ω, the true value.
How are ω̂ and ω related? If the gyroscope were functioning perfectly, then ω̂ would equal ω; however, in the real world this cannot be achieved. The main contributor to the discrepancy between ω̂ and ω is calibration error. The quality of calibration is the largest differentiator between an expensive IMU (thousands of dollars) and a cheap one (a dollar).
We now define a simple model of calibration error. The following sensor mapping indicates how the sensor output is related to true angular velocity:

$$\hat{\omega} = a + b\,\omega. \tag{9.1}$$

Above, a and b are called the offset and scale, respectively. They are unknown constants that interfere with the measurement. If ω were perfectly measured, then we would have a = 0 and b = 1.
Consider the effect of calibration error. Comparing the measured and true angular velocities yields:

$$\hat{\omega} - \omega = a + b\,\omega - \omega = a + \omega(b - 1). \tag{9.2}$$

Now imagine using the sensor to estimate the orientation of the merry-go-round. We would like to understand the difference between the true orientation θ and an estimate θ̂ computed using the sensor output. Let d(t) denote a function of time called the drift error:

$$d(t) = \theta(t) - \hat{\theta}(t). \tag{9.3}$$

Note that d(t) might be negative, which could be forced into being positive by applying the absolute value to obtain |d(t)|. This will be avoided to simplify the discussion.
Suppose it is initially given that θ(0) = 0, and to keep it simple, the angular velocity ω is constant. Then θ(t) = ωt and θ̂(t) = ω̂t, so integrating (9.2) over time and applying (9.3) gives the drift error

$$d(t) = (\omega - \hat{\omega})\,t = -(a + b\,\omega - \omega)\,t = -(a + \omega(b - 1))\,t. \tag{9.4}$$

Of course, the drift error grows (positively or negatively) as a deviates from 0 or as b deviates from one; however, note that the second component is proportional to ω. Ignoring a, this means that the drift error is proportional to the speed of the merry-go-round. In terms of tracking a VR headset using a gyroscope, this means that tracking error increases at a faster rate as the head rotates more quickly [168].
At this point, four general problems must be solved to make an effective tracking system, even for this simple case:

1. Calibration: If a better sensor is available, then the two can be closely paired so that the outputs of the worse sensor are transformed to behave as closely to the better sensor as possible.

2. Integration: The sensor provides measurements at discrete points in time, resulting in a sampling rate. The orientation is estimated by aggregating or integrating the measurements.

3. Registration: The initial orientation must somehow be determined, either by an additional sensor, or a clever default assumption or start-up procedure.

4. Drift error: As the error grows over time, other sensors are needed to directly estimate it and compensate for it.

All of these issues remain throughout this chapter for the more complicated settings. The process of combining information from multiple sensor readings is often called sensor fusion or filtering.
We discuss each of these for the 2D case, before extending the ideas to the 3D case in Section 9.2.

Calibration You could buy a sensor and start using it with the assumption that it is already well calibrated. For a cheaper sensor, however, the calibration is often unreliable. Suppose we have one expensive, well-calibrated sensor that reports angular velocities with very little error. Let ω̂′ denote its output, to distinguish it from the forever unknown true value ω. Now suppose that we want to calibrate a bunch of cheap sensors so that they behave as closely as possible to the expensive sensor. This could be accomplished by mounting them together on a movable surface and comparing their outputs. For greater accuracy and control, the most expensive sensor may be part of a complete mechanical system such as an expensive turntable, calibration rig, or robot. Let ω̂ denote the output of one cheap sensor to be calibrated; each cheap sensor must be calibrated separately.
Calibration involves taking many samples, sometimes thousands, and comparing ω̂′ to ω̂. A common criterion is the sum of squares error, which is given by

$$\sum_{i=1}^{n} (\hat{\omega}_i - \hat{\omega}'_i)^2 \tag{9.5}$$

for n samples of the angular velocity. The task is to determine a transformation to apply to the cheap sensor outputs ω̂ so that it behaves as closely as possible to the expensive sensor outputs ω̂′.
Using the error model from (9.1), we can select constants c1 and c2 that minimize the error:

$$\sum_{i=1}^{n} (c_1 + c_2\,\hat{\omega}_i - \hat{\omega}'_i)^2. \tag{9.6}$$

This is a classical regression problem referred to as linear least-squares. It is typically solved by calculating the Moore-Penrose pseudoinverse of a non-square matrix that contains the sampled data [341].
Once c1 and c2 are calculated, every future sensor reading is transformed as

$$\hat{\omega}_{cal} = c_1 + c_2\,\hat{\omega}, \tag{9.7}$$

in which ω̂ is the original, raw sensor output, and ω̂cal is the calibrated output. Thus, the calibration produces a kind of invisible wrapper around the cheap sensor outputs so that the expensive sensor is simulated. The raw, cheap sensor outputs are no longer visible to outside processes. The calibrated outputs will therefore simply be referred to as ω̂ in the remainder of this chapter.
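
The calibration step can be sketched as follows. This is an illustration under stated assumptions rather than a production procedure: the cheap sensor obeys the offset and scale model of (9.1) with made-up values, the reference readings are noise free, and the two-parameter fit uses the closed-form least-squares solution instead of an explicit pseudoinverse (both minimize the same sum of squares). The names `fit_calibration` and `calibrate` are hypothetical.

```python
def fit_calibration(raw, ref):
    """Choose c1, c2 that minimize sum((c1 + c2*raw_i - ref_i)**2), as in (9.6)."""
    n = len(raw)
    mean_raw = sum(raw) / n
    mean_ref = sum(ref) / n
    sxx = sum((r - mean_raw) ** 2 for r in raw)
    sxy = sum((r - mean_raw) * (t - mean_ref) for r, t in zip(raw, ref))
    c2 = sxy / sxx                     # slope of the least-squares line
    c1 = mean_ref - c2 * mean_raw      # intercept of the least-squares line
    return c1, c2


def calibrate(c1, c2, omega_raw):
    """Apply the wrapper of (9.7) to one raw reading."""
    return c1 + c2 * omega_raw


# Assumed (made-up) offset and scale of the cheap sensor, following (9.1).
A_OFFSET, B_SCALE = 0.05, 1.02

# Reference readings from the better sensor, in rad/s, and the cheap
# sensor's raw readings taken at the same instants.
ref_samples = [-2.0, -1.0, 0.0, 1.0, 2.0]
raw_samples = [A_OFFSET + B_SCALE * w for w in ref_samples]

c1, c2 = fit_calibration(raw_samples, ref_samples)
corrected = [calibrate(c1, c2, r) for r in raw_samples]
```

With noise-free data the fit recovers c2 = 1/b and c1 = −a/b, so the corrected readings match the reference. With real, noisy samples the fit only approximates these values, which is why many samples are taken.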
Integration Sensor outputs usually arrive at a regular sampling rate. For example, the Oculus Rift gyroscope provides a measurement every 1ms (yielding a 1000Hz sampling rate). Let ω̂[k] refer to the kth sample, which arrives at time k∆t.
The orientation θ(t) at time t = k∆t can be estimated by integration as:

$$\hat{\theta}[k] = \theta(0) + \sum_{i=1}^{k} \hat{\omega}[i]\,\Delta t. \tag{9.8}$$

Each output ω̂[i] causes a rotation of ∆θ[i] = ω̂[i]∆t. It is sometimes more convenient to write (9.8) in an incremental form, which indicates the update to θ̂ after each new sensor output arrives:

$$\hat{\theta}[k] = \hat{\omega}[k]\,\Delta t + \hat{\theta}[k-1]. \tag{9.9}$$

For the first case, θ̂[0] = θ(0).
If ω(t) varies substantially between times k∆t and (k + 1)∆t, then it is helpful to know what ω̂[k] corresponds to. It could be angular velocity at the start of the interval ∆t, the end of the interval, or an average over the interval. If it is the start or end, then a trapezoidal approximation to the integral may yield less error over time [133].

Registration In (9.8), the initial orientation θ(0) was assumed to be known. In practice, this corresponds to a registration problem, which is the initial alignment between the real and virtual worlds. To understand the issue, suppose that θ represents the yaw direction for a VR headset. One possibility is to assign θ(0) = 0, which corresponds to whichever direction the headset is facing when the tracking system is turned on. This might be when the system is booted. If the headset has an “on head” sensor, then it could start when the user attaches the headset to his head. Oftentimes, the forward direction could be unintentionally set in a bad way. For example, if one person starts a VR demo and hands the headset to someone else, who is facing another direction, then in VR the user would not be facing in the intended forward direction. This could be fixed by a simple option that causes “forward” (and hence θ(t)) to be redefined as whichever direction the user is facing at present.
An alternative to this entire problem is to declare θ(0) = 0 to correspond to a direction that is fixed in the physical world. For example, if the user is sitting at a desk in front of a computer monitor, then the forward direction could be defined as the yaw angle for which the user and headset are facing the monitor. Implementing this solution requires a sensor that can measure the yaw orientation with respect to the surrounding physical world. For example, with the Oculus Rift, the user faces a stationary camera, which corresponds to the forward direction.

Drift correction To make a useful tracking system, the drift error (9.3) cannot be allowed to accumulate. Even if the gyroscope were perfectly calibrated, drift error would nevertheless grow due to other factors such as quantized output values, sampling rate limitations, and unmodeled noise. The first problem is to estimate the drift error, which is usually accomplished with an additional sensor. Practical examples of this will be given in Section 9.2. For the simple merry-go-round example, imagine that an overhead camera takes a picture once in a while to measure the orientation. Let θ̂d[k] denote the estimated orientation from this single sensor measurement, arriving at stage k.
Because of drift error, there are now two conflicting sources of information: 1) the orientation θ̂[k] estimated by integrating the gyroscope, and 2) the orientation θ̂d[k] instantaneously estimated by the camera (or some other, independent sensor). A classic approach to blending these two sources is a complementary filter, which mathematically interpolates between the two estimates:

$$\hat{\theta}_c[k] = \alpha\,\hat{\theta}_d[k] + (1 - \alpha)\,\hat{\theta}[k], \tag{9.10}$$

in which α is a gain parameter that must satisfy 0 < α < 1. Above, θ̂c[k] denotes the corrected estimate at stage k. Since the gyroscope is usually accurate over short times but gradually drifts, α is chosen to be close to zero (for example, α = 0.0001). This causes the instantaneous estimate θ̂d[k] to have a gradual impact. At the other extreme, if α were close to 1, then the estimated orientation could wildly fluctuate due to errors in θ̂d[k] in each stage. An additional consideration is that if the sensor output θ̂d[k] arrives at a much lower rate than the gyroscope sampling rate, then the most recently recorded output is used. For example, a camera image might produce an orientation estimate at 60Hz, whereas the gyroscope produces outputs at 1000Hz. In this case, θ̂d[k] would retain the same value for 16 or 17 stages, until a new camera image becomes available.
It is important to select the gain α to be high enough so that the drift is corrected, but low enough so that the user does not perceive the corrections. The gain could be selected “optimally” by employing a Kalman filter [45, 140, 158]; however, the optimality only holds if we have a linear stochastic system, which is not the case in human body tracking. The relationship between Kalman and complementary filters, for the exact models used in this chapter, appears in [119].
Using simple algebra, the complementary filter formulation in (9.10) can be reworked to yield the following equivalent expression:

$$\hat{\theta}_c[k] = \hat{\theta}[k] + \alpha\,\hat{d}[k] \tag{9.11}$$

in which

$$\hat{d}[k] = \hat{\theta}_d[k] - \hat{\theta}[k]. \tag{9.12}$$

Above, d̂[k] is just an estimate of the drift error at stage k, with the same sign convention as (9.3). Thus, the complementary filter can alternatively be imagined as applying the estimated drift error, scaled by a small, proportional amount α, to incrementally force the error to zero.

9.2 Tracking 3D Orientation
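
As a compact recap of the 2D pipeline from Section 9.1, the incremental integration (9.9) and the complementary filter (9.10) can be sketched together. The sketch rests on several assumptions that are not in the text: the corrected estimate is fed back as the state for the next stage, the drift-correcting measurements are supplied as a dictionary from stage index to angle, and the bias, gain, and timing values are made up to keep the run short. The function names are hypothetical.

```python
def integrate_gyro(theta0, omegas, dt):
    """Incremental integration (9.9): theta[k] = omega[k]*dt + theta[k-1]."""
    theta = theta0
    for omega in omegas:
        theta = omega * dt + theta
    return theta


def complementary_filter(theta_gyro, theta_d, alpha):
    """Blend of (9.10); alpha near 0 trusts the integrated gyroscope more."""
    return alpha * theta_d + (1.0 - alpha) * theta_gyro


def track(theta0, omegas, dt, drift_measurements, alpha):
    """Integrate every stage and correct with the most recent measurement.

    drift_measurements maps a stage index k to the orientation measured by
    the slower sensor at that stage (for example, an overhead camera).
    """
    theta = theta0
    latest = None                      # most recently recorded measurement
    history = []
    for k, omega in enumerate(omegas, start=1):
        theta = omega * dt + theta     # gyroscope update (9.9)
        if k in drift_measurements:
            latest = drift_measurements[k]
        if latest is not None:
            # Assumption: the corrected value becomes the new state.
            theta = complementary_filter(theta, latest, alpha)
        history.append(theta)
    return history


# Made-up scenario: the body is stationary (true angle 0), the gyroscope has
# a constant bias of 0.01 rad/s, and a camera reports angle 0 every 17 stages.
DT = 0.001
biased_gyro = [0.01] * 5000
camera = {k: 0.0 for k in range(1, 5001, 17)}

uncorrected = integrate_gyro(0.0, biased_gyro, DT)
corrected = track(0.0, biased_gyro, DT, camera, alpha=0.01)
```

In this run the uncorrected estimate drifts steadily, while the corrected estimate settles near a small residual offset of roughly (1 − α)·bias·∆t/α. A smaller α gives smoother but slower correction, which is the trade-off described in Section 9.1.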
Figure 9.1: The vibrating MEMS elements respond to Coriolis forces during rotation, which are converted into an electrical signal. (Figure by Fabio Pasolini.)

IMUs Recall from Section 2.1 (Figure 2.9) that IMUs have recently gone from large, heavy mechanical systems to cheap, microscopic MEMS circuits. This progression was a key enabler to high-quality orientation tracking. The gyroscope measures angular velocity along three orthogonal axes, to obtain ω̂x, ω̂y, and ω̂z. For each axis, the sensing elements lie in the perpendicular plane, much like the semicircular canals in the vestibular organ (Section 8.2). The sensing elements in each case are micromachined mechanical elements that vibrate and operate like a tuning fork. If the sensor rotates in its direction of sensitivity, then the elements experience Coriolis forces, which are converted into electrical signals. These signals are calibrated to produce an output in degrees or radians per second; see Figure 9.1.
IMUs usually contain additional sensors that are useful for detecting drift errors. Most commonly, accelerometers measure linear acceleration along three axes to obtain âx, ây, and âz. The principle of their operation is shown in Figure 9.2. MEMS magnetometers, which measure magnetic field strength along three perpendicular axes, also appear on many modern IMUs. This is often accomplished by the mechanical motion of a MEMS structure that is subject to Lorentz force as it conducts inside of a magnetic field.

Figure 9.2: (a) A MEMS element for sensing linear acceleration. (b) Due to linear acceleration in one direction, the plates shift and cause a change in capacitance as measured between the outer plates. (Figure by David Askew.)

Calibration Recall from Section 9.1 that the sensor outputs are distorted due to calibration issues. In the one-dimensional angular velocity case, there were only two parameters, for scale and offset, which appeared in (9.1). In the 3D setting, this would naturally extend to 3 scale and 3 offset parameters; however, the situation is worse because there may also be errors due to non-orthogonality of the MEMS elements. All of these can be accounted for by 12 parameters arranged in a homogeneous transform matrix:

$$\begin{pmatrix} \hat{\omega}_x \\ \hat{\omega}_y \\ \hat{\omega}_z \\ 1 \end{pmatrix} = \begin{pmatrix} a & b & c & j \\ d & e & f & k \\ g & h & i & \ell \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \omega_x \\ \omega_y \\ \omega_z \\ 1 \end{pmatrix} \tag{9.13}$$

There are 12 and not 6 DOFs because the upper left, 3-by-3, matrix is not constrained to be a rotation matrix. The j, k, and ℓ parameters correspond to offset, whereas all others handle scale and non-orthogonality. Following the same methodology as in Section 9.1, the inverse of this transform can be estimated by minimizing the least squares error with respect to outputs of a better sensor, which provides ground truth. The outputs of the MEMS sensor are then adjusted by applying the estimated homogeneous transform to improve performance (this is an extension of (9.7) to the 12-parameter case). This general methodology applies to calibrating gyroscopes and accelerometers. Magnetometers may also be calibrated in this way, but have further complications such as soft iron bias.
An additional challenge with MEMS sensors is dealing with other subtle dependencies. For example, the outputs are sensitive to the particular temperature of the MEMS elements. If a VR headset heats up during use, then calibration parameters are needed for every temperature that might arise in practice. Fortunately, IMUs usually contain a temperature sensor that can be used to associate the calibration parameters with the corresponding temperatures. Finally, MEMS elements may be sensitive to forces acting on the circuit board, which could be changed, for example, by a dangling connector. Care must be given to isolate external board forces from the MEMS circuit.

Integration Now consider the problem of converting the sequence of gyroscope outputs into an estimate of the 3D orientation. At each stage k a vector

$$\hat{\omega}[k] = (\hat{\omega}_x[k], \hat{\omega}_y[k], \hat{\omega}_z[k]) \tag{9.14}$$
arrives from the sensor. In Section 9.1, the sensor output ω̂[k] was converted to a change ∆θ[k] in orientation. For the 3D case, the change in orientation is expressed as a quaternion.
Let q(v, θ) be the quaternion obtained by the axis-angle conversion formula (3.30). Recall from Section 8.1.2 that the angular velocity vector points along the instantaneous axis of rotation, and that its magnitude is the rate of rotation. Thus, if ω̂[k] is the sensor output at stage k, then the estimated rotation axis is
$$\hat{v}[k] = \frac{\hat{\omega}[k]}{\|\hat{\omega}[k]\|}. \tag{9.15}$$

Furthermore, the estimated amount of rotation that occurs during time ∆t is

$$\Delta\hat{\theta}[k] = \|\hat{\omega}[k]\|\,\Delta t. \tag{9.16}$$

Using the estimated rotation axis (9.15) and amount (9.16), the orientation change over time ∆t is estimated to be

$$\Delta\hat{q}[k] = q(\hat{v}[k], \Delta\hat{\theta}[k]). \tag{9.17}$$

Using (9.17) at each stage, the estimated orientation q̂[k] after obtaining the latest sensor output is calculated incrementally from q̂[k − 1] as

$$\hat{q}[k] = \Delta\hat{q}[k] * \hat{q}[k-1], \tag{9.18}$$

in which ∗ denotes quaternion multiplication. This is the 3D generalization of (9.9), in which simple addition could be used to combine rotations in the 2D case. In (9.18), quaternion multiplication is needed to aggregate the change in orientation (simple addition is commutative, which is inappropriate for 3D rotations).

Registration The registration problem for the yaw component is the same as in Section 9.1. The forward direction may be chosen from the initial orientation of the rigid body or it could be determined with respect to a fixed direction in the world. The pitch and roll components should be determined so that they align with gravity. The virtual world should not appear to be tilted with respect to the real world (unless that is the desired effect, which is rarely the case).

Tilt correction The drift error d(t) in (9.3) was a single angle, which could be positive or negative. If added to the estimate θ̂(t), the true orientation θ(t) would be obtained. It is similar for the 3D case, but with quaternion algebra. The 3D drift error is expressed as

$$d(t) = q(t) * \hat{q}^{-1}(t), \tag{9.19}$$

which is equal to the identity rotation if q(t) = q̂(t). Furthermore, note that applying the drift error to the estimate yields q(t) = d(t) ∗ q̂(t).
Since the drift error is a 3D rotation, it could be constructed as the product of a yaw, pitch, and a roll. Let tilt error refer to the part of the drift error that corresponds to pitch and roll. This will be detected using an “up” sensor. Let yaw error refer to the remaining part of the drift error, which will be detected using a “compass”. In reality, there do not exist perfect “up” and “compass” sensors, which will be addressed later.
Suppose that a sensor attached to the rigid body always reports an “up” vector that is parallel to the y axis in the fixed, world coordinate frame. In other words, it would be parallel to gravity. Since the sensor is mounted to the body, it reports its values in the coordinate frame of the body. For example, if the body were rolled 90 degrees so that its x axis is pointing straight up, then the “up” vector would be reported as (1, 0, 0), instead of (0, 1, 0). To fix this, it would be convenient to transform the sensor output into the world frame. This involves rotating it by q(t), the body orientation. For our example, this roll rotation would transform (1, 0, 0) into (0, 1, 0). Figure 9.3 shows a 2D example.
Now suppose that drift error has occurred and that q̂[k] is the estimated orientation. If this transform is applied to the “up” vector, then because of drift error, it might not be aligned with the y axis, as shown in Figure 9.4. The up vector û is projected into the xz plane to obtain (ûx, 0, ûz). The tilt axis lies in the xz plane and is constructed as the normal to the projected up vector: t̂ = (ûz, 0, −ûx). Let φ̂ denote the angle between û and the y axis. Performing a rotation of −φ̂ about the axis t̂ would move the up vector into alignment with the y axis; see Figure 9.4(b). The tilt error portion of the drift error is therefore described by the axis t̂ and the angle φ̂, that is, by the quaternion q(t̂, φ̂).
Unfortunately, there is no sensor that directly measures “up”. In practice, the accelerometer is used to measure the “up” direction because gravity acts on the sensor, causing the sensation of upward acceleration at roughly 9.8 m/s². The problem is that it also responds to true linear acceleration of the rigid body, and this cannot be separated from gravity due to the Einstein equivalence principle. It measures the vector sum of gravity and true linear acceleration, as shown in Figure 9.5. A simple heuristic is to trust accelerometer outputs as an estimate of the “up” direction only if its magnitude is close to 9.8 m/s² [75]. This could correspond to the common case in which the rigid body is stationary. However, this assumption is unreliable because downward and lateral linear accelerations can be combined to provide an output magnitude that is close to 9.8 m/s², but with a direction that is far from “up”. Better heuristics may be built from simultaneously considering the

Figure 9.3: If “up” is perfectly sensed by an accelerometer that is rotated by θ, then its output needs to be rotated by θ to view it from the world frame.
outputs of other sensors or the rate at which “up” appears to change.
Assuming that the accelerometer is producing a reliable estimate of the gravity
direction, the up vector û is calculated from the accelerometer output â by using
(3.34), to obtain

û = q̂[k] ∗ â ∗ q̂[k]⁻¹. (9.20)

Figure 9.4: (a) Tilt error causes a discrepancy between the y axis and the sensed
up vector that is rotated using the estimate q̂[k] to obtain û. (b) The tilt axis is
normal to û; a rotation of −φ̂ about the tilt axis would bring them into alignment,
thereby eliminating the tilt error.

Yaw correction The remaining drift error component is detected by a “compass”,
which outputs a vector that lies in the world xz plane and always points
“north”. Suppose this is n̂ = (0, 0, −1). Once again, the sensor output occurs
in the coordinate frame of the body, and needs to be transformed by q̂[k]. The
difference between n̂ and the −z axis is the resulting yaw drift error.
As in the case of the “up” sensor, there is no “compass” in the real world.
Instead, there is a magnetometer, which measures a 3D magnetic field vector:
(m̂_x, m̂_y, m̂_z). Suppose this is used to measure the Earth’s magnetic field. It turns
out that the field vectors do not “point” to the North Pole. The Earth’s magnetic
field produces 3D vectors that generally do not lie in the horizontal plane, resulting
in an inclination angle. Thus, the first problem is that the sensor output must
be projected into the xz plane. Residents of Ecuador may enjoy magnetic field
vectors that are nearly horizontal; however, in Finland they are closer to vertical;
see Figure 9.6. If the magnetic field vector is close to vertical, then the horizontal
component may become too small to be useful.

[Figure 9.5, panels (a) and (b): vectors labeled Head accel, Gravity, and Measured accel.]
Figure 9.5: (a) There is no gravity sensor; the accelerometer measures the vector
sum of apparent acceleration due to gravity and the true acceleration of the body.
(b) A simple heuristic of accepting the reading as gravity only if the magnitude is
approximately 9.8 m/s² will fail in some cases.

Another issue is that the projected vector in the horizontal plane does not
point north, resulting in a declination angle; this is the deviation from north.
Fortunately, reference to the true north is not important. It only matters that the
sensor output is recorded in the registration stage to provide a fixed yaw reference.
The most significant problem is that the magnetometer measures the vector
sum of all magnetic field sources. In addition to the Earth’s field, a building
generates its own field due to ferromagnetic metals. Furthermore, such materials
usually exist on the circuit board that contains the sensor. For this case, the field
moves with the sensor, generating a constant vector offset. Materials that serve as
a source of magnetic fields are called hard iron. Other materials distort magnetic
fields that pass through them; these are called soft iron. Magnetometer calibration
methods mainly take into account offsets due to hard-iron bias and eccentricities
due to soft-iron bias [92, 155].
After these magnetometer calibrations have been performed, the yaw drift
error can be estimated from most locations with a few degrees of accuracy, which is
sufficient to keep yaw errors from gradually accumulating. There are still problems.
If a strong field is placed near the sensor, then the readings become dependent
on small location changes. This could cause the measured direction to change
as the rigid body translates back and forth. Another problem is that in some
building locations, the vector sum of the Earth’s magnetic field and the field generated
by the building could be approximately zero (if they are of similar magnitude
and pointing in opposite directions). In this unfortunate case, the magnetometer
cannot produce useful outputs for yaw drift error detection.

Filtering Using the detected drift error, filtering works in the same way as
described in Section 9.1. The complementary filter (9.10) is upgraded to work with
quaternions. It becomes slightly more complicated to represent the interpolation
in terms of α. Let (v, θ) denote the axis-angle representation of the orientation
d̂[k], which is the estimated drift error (a quaternion value). Let q(v, αθ) represent
the quaternion given by axis v and angle αθ. For a small value of α, this can be
considered as a small step “toward” d̂[k].
The complementary filter in terms of quaternions becomes

q̂_c[k] = q(v, −αθ) ∗ q̂[k], (9.21)

which is similar in form to (9.12). The simple subtraction from the 2D case has
been replaced above by multiplying an inverse quaternion from the left. The
estimated drift error d̂[k] is obtained by multiplying the estimated tilt and yaw
errors. Alternatively, they could contribute separately to the complementary filter,
with different gains for each, and even combined with drift error estimates from
more sources [197].

Figure 9.6: The inclination angle of the Earth’s magnetic field vector varies greatly
over the Earth. (Map developed by NOAA/NGDC and CIRES.)

Setting the viewpoint The viewpoint is set using the estimated orientation
q̂[k], although it might need to be adjusted to account for alternative timings,
for the purpose of prediction or image warping, as discussed in Section 7.4. Let
q̂(t) denote the estimated orientation for time t. In terms of the transformations
from Section 3.4, we have just estimated R_eye. To calculate the correct viewpoint,
the inverse is needed. Thus, q̂⁻¹(t) would correctly transform models to take the
estimated viewpoint into account.

A debugging tip Programmers often make mistakes when connecting the tracked
orientation to the viewpoint. Figure 9.7 shows a table of the common mistakes.
To determine whether the transform has been correctly applied, one should put
on the headset and try rotating about the three canonical axes: a pure yaw, a
pure pitch, and a pure roll. Let + denote that the world is moving correctly with
respect to a head rotation. Let − denote that it seems to move in the opposite
direction. Figure 9.7 shows a table of the eight possible outcomes and the most
likely cause of each problem.

A head model The translation part of the head motion has not been addressed.
Ideally, the head should be the same height in the virtual world as in the real world.
This can be handled by the translation part of the T_eye matrix (3.36).
We must also account for the fact that as the head rotates, the eyes change
their positions. For example, in a yaw head movement (nodding “no”), the pupils
displace a few centimeters in the x direction. More accurately, they travel along
a circular arc in a horizontal plane. To more closely mimic the real world, the
movements of the eyes through space can be simulated by changing the center of
rotation according to a fictitious head model [3]. This trick is needed until Section
9.3, where position is instead estimated from more sensors.
Recall from Section 3.5 that the cyclopean viewpoint was first considered and
then modified to handle left and right eyes by applying horizontal offsets by
inserting T_left (3.50) and T_right (3.52). In a similar way, offsets in the y and z directions
can be added to account for displacement that would come from a rotating head.
The result is to insert the following before or after T_right and T_left:

T_{head} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & h \\ 0 & 0 & 1 & p \\ 0 & 0 & 0 & 1 \end{bmatrix}, (9.22)
in which h is a height parameter and p is a protrusion parameter. See Figure 9.8.
The idea is to choose h and p that would correspond to the center of rotation of the
head. The parameter h is the distance from the rotation center to the eye height,
along the y axis. A typical value is h = 0.15 m. The protrusion p is the distance
from the rotation center to the cyclopean eye. A typical value is p = −0.10 m,
which is negative because it extends opposite to the z axis. Using a fake head
model approximates the eye locations as the user rotates her head; however, it is
far from perfect. If the torso moves, then this model completely breaks, resulting in
a large mismatch between the real and virtual world head motions. Nevertheless,
this head model is currently used in popular headsets, such as Samsung Gear VR.

Figure 9.8: To obtain a head model, the rotation center is moved so that orientation
changes induce a plausible translation of the eyes. The height h is along the y axis,
and the protrusion p is along the z axis (which leads to a negative number).

An issue also exists with the y height of the head center. The user may be
seated in the real world, but standing in the virtual world. This mismatch might be
uncomfortable. The brain knows that the body is seated because of proprioception,
regardless of the visual stimuli provided by VR. If the user is standing, then the
head-center height could be set so that the eyes are at the same height as in the
real world. This issue even exists for the case of full six-DOF tracking, which is
covered next; the user might be sitting, and a vertical offset is added to make him
appear to be standing in VR.

9.3 Tracking Position and Orientation

This section covers tracking of all 6 DOFs for a moving rigid body, with the most
important case being head tracking. For convenience, we will refer to the position
and orientation of a body as its pose. Six-DOF tracking enables T_eye from Section 3.4 to
be fully derived from sensor data, rather than inventing positions from a plausible
head model, as in (9.22). By estimating the position, the powerful depth cue of
parallax becomes much stronger as the user moves her head from side to side. She
could even approach a small object and look at it from any viewpoint, such as
from above, below, or the sides. The methods in this section are also useful for
tracking hands in space or objects that are manipulated during a VR experience.

Why not just integrate the accelerometer? It seems natural to try to
accomplish 6-DOF tracking with an IMU alone. Recall from Figure 9.5 that the
accelerometer measures the vector sum of true linear acceleration and acceleration
due to gravity. If the gravity component is subtracted away from the output, as
is heuristically accomplished for tilt correction, then it seems that the remaining
part is pure body acceleration. Why not simply integrate this acceleration twice
to obtain position estimates? The trouble is that the drift error rate is much larger
than in the case of a gyroscope. A simple calibration error leads to linearly growing
drift error in the gyroscope case because it is the result of a single integration.
After a double integration, a calibration error leads to quadratically growing drift
error. This becomes unbearable in practice after a fraction of a second. Furthermore,
the true body acceleration cannot be accurately extracted, especially when
the body quickly rotates. Finally, as drift accumulates, what sensors can be used
to estimate the positional drift error? The IMU alone cannot help. Note that it
cannot even distinguish motions at constant velocity, including zero motion; this is
the same as our vestibular organs. Despite their shortcomings, modern IMUs remain
an important part of 6-DOF tracking systems because of their high sampling rates
and ability to accurately handle the rotational component.

Make your own waves The IMU-based approach to tracking was passive in the
sense that it relied on sources of information that already exist in the environment.
Instead, an active approach can be taken by transmitting waves into the
environment. Since humans operate in the same environment, waves that are perceptible,
such as light and sound, are not preferred. Instead, common energy sources in
active tracking systems include infrared, ultrasound, and electromagnetic fields.
Consider transmitting an ultrasound pulse (above 20,000 Hz) from a speaker
and using a microphone to listen for its arrival. This is an example of an
emitter-detector pair: The speaker is the emitter, and the microphone is the detector.
If time measurement is synchronized between source and destination, then the
time of arrival (TOA or time of flight) can be calculated. This is the time that
it took for the pulse to travel the distance d between the emitter and detector.
Based on the known propagation speed in the medium (330 m/s for ultrasound),
the distance d̂ is estimated. One frustrating limitation of ultrasound systems is
reverberation between surfaces, causing the pulse to be received multiple times at
each detector.
When functioning correctly, the position of the detector could then be narrowed
down to a sphere of radius d̂, centered at the transmitter; see Figure 9.9(a).
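
The time-of-arrival estimate described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the emitter and
detector clocks are taken to be synchronized, the 330 m/s propagation speed is the
value quoted in the text, and the function names toa_distance and on_sphere are
hypothetical. Because reverberation can cause the pulse to be received multiple
times, the sketch treats the earliest arrival as the direct path, which is an assumed
heuristic rather than a method given in the text.

```python
import math

SPEED_OF_SOUND = 330.0  # m/s, propagation speed quoted in the text


def toa_distance(t_emit, arrival_times, speed=SPEED_OF_SOUND):
    """Estimate the emitter-detector distance (meters) from time of arrival.

    t_emit and arrival_times are in seconds on a shared, synchronized clock.
    Echoes arrive later than the direct pulse, so the earliest arrival is
    used (an assumed heuristic for handling reverberation).
    """
    t_first = min(arrival_times)
    if t_first < t_emit:
        raise ValueError("arrival precedes emission; clocks not synchronized")
    return speed * (t_first - t_emit)


def on_sphere(candidate, emitter, d_hat, tol=0.01):
    """Return True if candidate lies within tol meters of the sphere of
    radius d_hat centered at the emitter."""
    return abs(math.dist(candidate, emitter) - d_hat) <= tol


# One pulse emitted at t = 0 s and detected three times (direct path + echoes).
d_hat = toa_distance(0.0, [0.010, 0.013, 0.021])
print(round(d_hat, 3))                                      # estimated distance in meters
print(on_sphere((3.3, 0.0, 0.0), (0.0, 0.0, 0.0), d_hat))   # consistent position
print(on_sphere((1.0, 1.0, 0.0), (0.0, 0.0, 0.0), d_hat))   # inconsistent position
```

A single distance estimate only constrains the detector to a sphere; on_sphere
accepts every point on that sphere, which is why measurements from additional
emitters are needed to pin down the position.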

Figure 9.9: The principle of trilateration enables the detector location to be
determined from estimates of distances to known emitters. A 2D example is shown:
(a) from a single emitter, the detector could be anywhere along a circle; (b) using
three emitters, the position is uniquely determined.

By using two transmitters and one microphone, the position is narrowed down
to the intersection of two spheres, resulting in a circle (assuming the transmitter
locations are known). With three transmitters, the position is narrowed down to
two points, and with four or more transmitters, the position is uniquely
determined.¹ The emitter and detector roles could easily be reversed so that the object
being tracked carries the emitter, and several receivers are placed around it. The
method of combining these measurements to determine position is called
trilateration. If electromagnetic waves, such as radio, light, or infrared, are used instead
of ultrasound, then trilateration could still be applied even though it is impossible
to measure the propagation time directly. If the transmitter amplitude is known,
then distance can be estimated based on power degradation, rather than TOA.
Alternatively, a time-varying signal can be emitted and its reflected phase shift
can be estimated when the received signal is superimposed onto the transmitted
signal.

¹ Global positioning systems (GPS) work in this way, but using radio signals, the Earth surface
constraint, and at least one more satellite to eliminate time synchronization errors.

If the detectors do not know the precise time that the pulse started, then they
could compare differences in arrival times between themselves; this is called time
difference of arrival (TDOA). The set of possible locations is a hyperboloid instead
of a sphere. Nevertheless, the hyperboloid sheets can be intersected for multiple
emitter-detector pairs to obtain the method of multilateration. This was used in
the Decca Navigation System in World War II to locate ships and aircraft. This
principle is also used by our ears to localize the source of sounds, which will be
covered in Section 11.3.
Finally, some methods could track position by emitting a complicated field that
varies over the tracking area. For example, by creating a magnetic dipole, perhaps
coded with a signal to distinguish it from background fields, the position and
orientation of a body in the field could be estimated; see Figure 9.10(a).
This principle was used for video games in the Razer Hydra tracking system, which used a
base station that generated a magnetic field; see Figure 9.10(b). One drawback
is that the field may become unpredictably warped in each environment, causing
straight-line motions to be estimated as curved. Note that the requirements are
the opposite of what was needed to use a magnetometer for yaw correction in
Section 9.2; in that setting the field needed to be constant over the tracking area.
For estimating position, the field should vary greatly across different locations.

Figure 9.10: (a) A magnetic dipole offers a field that varies its magnitude and
direction as the position changes. (b) The Razer Hydra, a game controller system
that generates a weak magnetic field using a base station, enabling it to track the
controller positions.

The power of visibility The most powerful paradigm for 6-DOF tracking is
visibility. The idea is to identify special parts of the physical world called features
and calculate their positions along a line-of-sight ray to a known location. Figure
9.11 shows an example inspired by a camera, but other hardware could be used.
One crucial aspect for tracking is distinguishability. If all features appear to be the
same, then it may become difficult to determine and maintain “which is which”
during the tracking process. Each feature should be assigned a unique label that
is invariant over time, as rigid bodies in the world move. Confusing features with
each other could cause catastrophically bad estimates to be made regarding the
body pose.
The most common sensor used to detect features is a digital camera. Detecting,
labeling, and tracking features are common tasks in computer vision or image
processing. There are two options for features:
1. Natural: The features are automatically discovered, assigned labels, and
maintained during the tracking process.
2. Artificial: The features are engineered and placed into the environment so
that they can be easily detected, matched to preassigned labels, and tracked.

Figure 9.11: The real world contains special features, which are determined to lie
along a line segment that connects to the focal point via perspective projection.

Natural features are advantageous because there are no setup costs. The
environment does not need to be engineered. Unfortunately, they are also much more
unreliable. Using a camera, this is considered to be a hard computer vision
problem because it may be as challenging as it is for the human visual system. For
some objects, textures, and lighting conditions, it could work well, but it is
extremely hard to make it work reliably for all possible settings. Imagine trying to
find and track features on an empty, white wall. Therefore, artificial features are
much more common in products.

Figure 9.12: A sample QR code, which could be printed and used as an artificial
feature. (Picture from Wikipedia.)

For artificial features, one of the simplest solutions is to print a special tag onto
the object to be tracked. For example, one could print bright red dots onto the
object and then scan for their appearance as red blobs in the image. To solve the
distinguishability problem, multiple colors, such as red, green, blue, and yellow
dots, might be needed. Trouble may occur if these colors exist naturally in other
parts of the image. A more reliable method is to design a specific tag that is
clearly distinct from the rest of the image. Such tags can be coded to contain
large amounts of information, including a unique identification number. One of
the most common coded tags is the QR code, an example of which is shown in
Figure 9.12.
The features described so far are called passive because they do not emit
energy. The hope is that sufficient light is in the world so that enough reflects off of
the feature and enters the camera sensor. A more reliable alternative is to
engineer active features that emit their own light. For example, colored LEDs can be
mounted on the surface of a headset or controller. This comes at the expense of
requiring a power source and increasing overall object cost and weight.
Furthermore, the industrial design may be compromised because the device might light up
like a Christmas tree.

Cloaking with infrared Fortunately, all of these tricks can be moved to the
infrared (IR) part of the spectrum so that features are visible to cameras, but not
to humans. Patterns can be painted onto objects that highly reflect IR energy.
Alternatively, IR LEDs can be mounted onto devices. This is the case for the
Oculus Rift headset, and the IR LEDs are even hidden behind plastic that is
transparent to IR energy, but appears black to humans; see Figure 9.13.

Figure 9.13: The Oculus Rift headset contains IR LEDs hidden behind
IR-transparent plastic. (Photo by iFixit.)

In some settings, it might be difficult to mount LEDs on the objects, as in the
case of tracking the subtle motions of an entire human body. This is called MOCAP
or motion capture, which is described in Section 9.4. In MOCAP systems, powerful
IR LEDs are positioned around the camera so that they illuminate retroreflective
markers that are placed in the scene. Each marker can be imagined as a spherical
mirror in the IR part of the spectrum. One unfortunate drawback is that the range
is limited because IR energy must travel from the camera location to the target
and back again. Since energy dissipates quadratically as a function of distance,
doubling the distance results in one-fourth of the energy level arriving at the
camera.
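
The quadratic falloff mentioned above can be made concrete with a small
calculation. The following sketch assumes only the simplified inverse-square
relationship stated in the text; the function name relative_energy and the reference
distance are hypothetical, and effects such as marker size, lens optics, and sensor
sensitivity are ignored.

```python
def relative_energy(distance, reference_distance=1.0):
    """Fraction of the energy received at reference_distance that is
    received at distance, under the simplified inverse-square model."""
    if distance <= 0 or reference_distance <= 0:
        raise ValueError("distances must be positive")
    return (reference_distance / distance) ** 2


# Each doubling of the distance cuts the received energy to one-fourth.
for d in (1.0, 2.0, 4.0):
    print(d, relative_energy(d))
```

Under this model, moving a marker from 1 m to 2 m leaves 0.25 of the energy, and
moving it to 4 m leaves 0.0625, which illustrates why the usable range is limited.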
At this point, it is natural to wonder why an entire image is being captured if
the resulting image processing problem is trivial. The main reason is the
proliferation of low-cost digital cameras and image processing software. Why not simply
design an emitter-detector pair that produces a binary reading, indicating whether
the visibility beam is occluded? This is precisely how the detection beam works in
an automatic garage door system to ensure the door does not close on someone:
An IR LED emits energy to a detection photodiode, which is essentially a switch
that activates when it receives a sufficient level of energy for its target wavelength
(in this case IR). To reduce the amount of energy dissipation, mirrors or lenses
could be used to focus the energy.
Even better, an IR laser can be aimed directly at the detector. The next task
is to use lenses and moving mirrors so that every detector that is visible from a
fixed location will become illuminated at some point. The beam can be spread
from a dot to a line using a lens, and then the line is moved through space using
a spinning mirror. This is the basis of the lighthouse tracking system for the HTC
Vive headset, which is covered later in this section.

The Perspective-n-Point (PnP) problem A moving rigid body needs to be
“pinned down” using n observed features. This is called the Perspective-n-Point
(or PnP) problem. We can borrow much of the math from Chapter 3; however,
here we consider the placement of bodies in the real world, rather than the virtual
world. Furthermore, we have an inverse problem, which is to determine the body
placement based on points in the image. Up until now, the opposite problem was
considered. For visual rendering in Chapter 7, an image was produced based on
the known body placement in the (virtual) world.
The features could be placed on the body or in the surrounding world,
depending on the sensing method. Suppose for now that they are on the body. Each
feature corresponds to a point p = (x, y, z) with coordinates defined in the frame
of the body. Let T_rb be a homogeneous transformation matrix that contains the
pose parameters, which are assumed to be unknown. Applying the transform T_rb
to the point p as in (3.22) could place it anywhere in the real world. Recall the
chain of transformations (3.41), which furthermore determines where each point
on the body would appear in an image. The matrix T_eye held the camera pose,
whereas T_vp and T_can contained the perspective projection and transformed the
projected point into image coordinates.
Now suppose that a feature has been observed to be at location (i, j) in image
coordinates. If T_rb is unknown, but all other transforms are given, then there
would be six independent parameters to estimate, corresponding to the 6 DOFs.
Observing (i, j) provides two independent constraints on the chain of transforms
(3.41), one for i and one for j. The rigid body therefore loses 2 DOFs, as shown in
Figure 9.14. This was the P1P problem because n, the number of features, was
one.
The P2P problem corresponds to observing two features in the image and
results in four constraints. In this case, each feature eliminates two DOFs,
resulting in only two remaining DOFs; see Figure 9.14. Continuing further, if three
features are observed, then for the P3P problem, zero DOFs remain (except for
the case in which collinear features are chosen on the body). It may seem that the
problem is completely solved; however, zero DOFs allows for multiple solutions
(they are isolated points in the space of solutions). The P3P problem corresponds
to trying to place a given triangle into a pyramid formed by rays so that each
triangle vertex touches a different ray. This can be generally accomplished in
four ways, which are hard to visualize. Imagine trying to slice a tall, thin pyramid
(simplex) made of cheese so that four different slices have the exact same triangular
size and shape. The cases of P4P and P5P also result in ambiguous solutions.
Finally, in the case of P6P, unique solutions are always obtained if no four features
are coplanar. All of the mathematical details are worked out in [354].

Figure 9.14: Each feature that is visible eliminates 2 DOFs. On the left, a single
feature is visible, and the resulting rigid body has only 4 DOFs remaining. On the
right, two features are visible, resulting in only 2 DOFs. This can be visualized
as follows. The edge that touches both segments can be moved back and forth
while preserving its length if some rotation is also applied. Rotation about an axis
common to the edge provides the second DOF.

The PnP problem has been described in the ideal case of having perfect
coordinate assignments to the feature points on the body and the perfect observation
of those through the imaging process. In practice, small errors are made due to
factors such as sensor noise, image quantization, and manufacturing tolerances.
This results in ambiguities and errors in the estimated pose, which could deviate
substantially from the correct answer [281]. Therefore, many more features may
be used in practice to improve accuracy. Furthermore, a calibration procedure,
such as bundle adjustment [111, 280, 325], may be applied before the device is used
so that the feature point locations can be more accurately assigned before pose
estimation is performed. Robustness may be improved by employing RANSAC
[77].

Camera-based implementation The visibility problem may be solved using
a camera in two general ways, as indicated in Figure 9.15. Consider the camera
frame, which is analogous to the eye frame from Figure 3.14 in Chapter 3. A
world-fixed camera is usually stationary, meaning that the camera frame does not
move relative to the world. A single transformation may be used to convert an
object pose as estimated from the camera frame into a convenient world frame. For
example, in the case of the Oculus Rift headset, the head pose could be converted
to a world frame in which the −z direction is pointing at the camera, y is “up”, and
the position is in the center of the camera’s tracking region or a suitable default
based on the user’s initial head position. For an object-fixed camera, the estimated
pose, derived from features that remain fixed in the world, is the transformation
from the camera frame to the world frame. This case would be obtained, for
example, if QR codes were placed on the walls.

Figure 9.15: Two cases for camera placement: (a) A world-fixed camera is
stationary, and the motions of objects relative to it are estimated using features on the
objects. (b) An object-fixed camera is frequently under motion and features are
ideally fixed to the world coordinate frame.

As in the case of an IMU, calibration is important for improving sensing
accuracy. The following homogeneous transformation matrix can be applied to the
image produced by a camera:

\begin{bmatrix} \alpha_x & \gamma & u_0 \\ 0 & \alpha_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} (9.23)

The five variables appearing in the matrix are called intrinsic parameters of the
camera. The α_x and α_y parameters handle scaling, γ handles shearing, and u_0
and v_0 handle offset of the optical axis. These parameters are typically estimated
by taking images of an object for which all dimensions and distances have been
carefully measured, and performing least-squares estimation to select the
parameters that reduce the sum-of-squares error (as described in Section 9.1). For a
wide-angle lens, further calibration may be needed to overcome optical distortions
(recall Section 7.3).
Now suppose that a feature has been observed in the image, perhaps using some
form of blob detection to extract the pixels that correspond to it from the rest of the
image [280, 318]. This is easiest for a global shutter camera because all pixels will
correspond to the same instant of time. In the case of a rolling shutter, the image
may need to be transformed to undo the effects of motion (recall Figure 4.33).
The location of the observed feature is calculated as a statistic of the blob pixel
locations. Most commonly, the average over all blob pixels is used, resulting in
non-integer image coordinates. Many issues affect performance: 1) quantization errors
arise due to image coordinates for each blob pixel being integers; 2) if the feature
does not cover enough pixels, then the quantization errors are worse; 3) changes
in lighting conditions may make it difficult to extract the feature, especially in
the case of natural features; 4) at some angles, two or more features may become
close in the image, making it difficult to separate their corresponding blobs; 5) as
various features enter or leave the camera view, the resulting estimated pose may
jump. Furthermore, errors tend to be larger along the direction of the optical axis.

Figure 9.16: The laser-based tracking approach used in the HTC Vive headset: (a)
A base station contains spinning drums that emit horizontal and vertical sheets
of IR light. An array of IR LEDs appears in the upper left, which provides a
synchronization flash. (b) Photodiodes in pockets on the front of the headset
detect the incident IR light.

Laser-based implementation By designing a special emitter-detector pair,
the visibility problem can be accurately solved over great distances. This was
accomplished by the lighthouse tracking system of the 2016 HTC Vive headset,
and the Minnesota scanner from 1989 [301]. Figure 9.16 shows the lighthouse
tracking hardware for the HTC Vive. The operation of a camera is effectively
simulated, as shown in Figure 9.17(a).
If the base station were a camera, then the sweeping vertical stripe would
correspond to estimating the column of the pixel that corresponds to the feature;
see Figure 9.17(a). Likewise, the sweeping horizontal stripe corresponds to the
pixel row. The rotation rate of the spinning drum is known and is analogous
to the camera frame rate. The precise timing is recorded as the beam hits each
photodiode.
Think about polar coordinates (distance and angle) relative to the base station.
Using the angular velocity of the sweep and the relative timing differences, the
angle between the features as “observed” from the base station can be easily
estimated. Although the angle between features is easily determined, their angles
relative to some fixed direction from the base station must be determined. This is
accomplished by an array of IR LEDs that are pulsed on simultaneously so that all
photodiodes detect the flash (visible in Figure 9.16(a)). This could correspond, for
example, to the instant of time at which each beam is at the 0 orientation. Based
on the time from the flash until the beam hits a photodiode, and the known angular
velocity, the angle of the observed feature is determined. To reduce temporal drift
error, the flash may be periodically used during operation.

Figure 9.17: (a) This is a 2D view of the angular sweep of the IR stripe in the
laser-based tracking approach (as in HTC Vive). This could correspond to a top-down
view, in which a vertical stripe spins with a yaw rotation about the base.
In this case, the angular locations in the horizontal direction are observed, similar
to column coordinates of a camera image. This could also correspond to a side
view, in which case the vertical stripe spins with a pitch rotation and the angular
locations in the vertical direction are observed. As the beam hits the features,
which are photodiodes, the direction is known because of the spinning rate and
time since the synchronization flash. (b) By putting two base stations on top of
poles at the corners of the tracking area, a large region can be accurately tracked
for a headset and controllers. (Drawing by Chris Stobing.)

As in the case of the camera, the distances from the base station to the features
are not known, but can be determined by solving the PnP problem. Multiple base
stations can be used as well, in a way that is comparable to using multiple cameras
or multiple eyes to infer depth. The result is accurate tracking over a large area,
as shown in Figure 9.17(b).

Filtering As in Section 9.2, outputs from sensors are combined over time by a
filtering method to maintain the estimate. In the current setting, the pose can
be maintained by combining both visibility information and outputs of an IMU.
For the orientation component of the pose, the complementary filter from (9.10)
could be used. The camera provides an additional source for detecting orientation
drift error. The camera optical axis is a straightforward reference for yaw error
detection, which makes it a clear replacement for the magnetometer.
If the camera tilt is known, then the camera can also provide accurate tilt error
estimation.

The IMU was crucial for obtaining highly accurate orientation tracking because
of accurate, high-frequency estimates of angular velocity provided by the
gyroscope. If the frame rate for a camera or lighthouse system is very high, then
sufficient sensor data may exist for accurate position tracking; however, it is preferable
to directly measure derivatives. Unfortunately, IMUs do not measure linear
velocity. However, the output of the linear accelerometer could be used as suggested
in the beginning of this section. Suppose that the accelerometer estimates
the body acceleration as

\[ \hat{a}[k] = (\hat{a}_x[k], \hat{a}_y[k], \hat{a}_z[k]) \tag{9.24} \]

in the world frame (this assumes the gravity component has been subtracted from
the accelerometer output).

By numerical integration, the velocity $\hat{v}[k]$ can be estimated from $\hat{a}[k]$. The
position $\hat{p}[k]$ is estimated by integrating the velocity estimate. The update equations
using simple Euler integration are

\[ \begin{aligned}
\hat{v}[k] &= \hat{a}[k]\,\Delta t + \hat{v}[k-1] \\
\hat{p}[k] &= \hat{v}[k]\,\Delta t + \hat{p}[k-1].
\end{aligned} \tag{9.25} \]

Note that each equation actually handles three components, x, y, and z, at the
same time. The accuracy of the second equation can be further improved by
adding $\frac{1}{2}\hat{a}[k]\Delta t^2$ to the right side.

As stated earlier, double integration of the acceleration leads to rapidly growing
position drift error, denoted by $\hat{d}_p[k]$. The error detected from PnP solutions
provides an estimate of $\hat{d}_p[k]$, but perhaps at a much lower rate than the IMU
produces observations. For example, a camera might take pictures at 60 FPS and
the IMU might report accelerations at 1000 Hz.

The complementary filter from (9.10) can be extended to the case of double
integration to obtain

\[ \begin{aligned}
p_c[k] &= \hat{p}[k] - \alpha_p \hat{d}_p[k] \\
v_c[k] &= \hat{v}[k] - \alpha_v \hat{d}_p[k].
\end{aligned} \tag{9.26} \]

Above, $p_c[k]$ and $v_c[k]$ are the corrected position and velocity, respectively, which
are each calculated by a complementary filter. The estimates $\hat{p}[k]$ and $\hat{v}[k]$ are
calculated using (9.25). The parameters $\alpha_p$ and $\alpha_v$ control the amount of importance
given to the drift error estimate in comparison to IMU updates.

Equation (9.26) is actually equivalent to a Kalman filter, which is the optimal
filter (providing the most accurate estimates possible) for the case of a linear
dynamical system with Gaussian noise, and sensors that also suffer from Gaussian
noise. Let $\omega_d^2$ represent the variance of the estimated Gaussian noise in the
dynamical system, and let $\omega_s^2$ represent the sensor noise variance. The complementary
filter (9.26) is equivalent to the Kalman filter if the parameters are chosen
as $\alpha_p = \sqrt{2\omega_d/\omega_s}$ and $\alpha_v = \omega_d/\omega_s$ [119]. A large variety of alternative filtering
methods exist; however, the impact of using different filtering methods is usually
small relative to calibration, sensor error models, and dynamical system models
that are particular to the setup. Furthermore, the performance requirements
are mainly perceptually based, which could be different than the classical criteria
around which filtering methods were designed [167].
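
To make the Euler updates (9.25) and the complementary correction (9.26) concrete, the following is a minimal sketch rather than a reference implementation. The function names, the constant-acceleration scenario, the made-up PnP position, and the gain values 0.5 and 0.1 are all assumptions chosen for illustration; in particular, the drift estimate is assumed here to be the integrated position minus the position reported by a PnP solution.

```python
def euler_step(a, v_prev, p_prev, dt):
    """One Euler update of (9.25); each argument except dt is an (x, y, z) tuple."""
    v = tuple(a_i * dt + v_i for a_i, v_i in zip(a, v_prev))
    p = tuple(v_i * dt + p_i for v_i, p_i in zip(v, p_prev))
    return v, p


def complementary_correct(p_hat, v_hat, d_p, alpha_p, alpha_v):
    """Correction of (9.26): subtract a scaled drift estimate from each state."""
    p_c = tuple(p_i - alpha_p * d_i for p_i, d_i in zip(p_hat, d_p))
    v_c = tuple(v_i - alpha_v * d_i for v_i, d_i in zip(v_hat, d_p))
    return p_c, v_c


# Hypothetical scenario: 1 m/s^2 along x, sampled at 1000 Hz for one second.
dt = 0.001
v_hat, p_hat = (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)
for _ in range(1000):
    v_hat, p_hat = euler_step((1.0, 0.0, 0.0), v_hat, p_hat, dt)

# Assumed drift estimate: integrated position minus a made-up PnP position.
p_pnp = (0.5, 0.0, 0.0)
d_p = tuple(p_i - q_i for p_i, q_i in zip(p_hat, p_pnp))

# Gains are arbitrary illustrative values, not tuned ones.
p_c, v_c = complementary_correct(p_hat, v_hat, d_p, alpha_p=0.5, alpha_v=0.1)
print(p_hat, p_c, v_c)
```

In this sketch the integrated position slightly overshoots the exact value of 0.5 because of the Euler scheme, and the correction pulls the estimate part of the way back toward the PnP position. In a real system the correction would be applied only when a new PnP solution arrives, which is typically far less often than the IMU updates.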
Once the filter is running, its pose estimates can be used to aid the PnP
problem. The PnP problem can be solved incrementally by perturbing the pose
estimated by the filter, using the most recent accelerometer outputs, so that the
observed features are perfectly matched. Small adjustments can be made to the
pose so that the sum-of-squares error is reduced to an acceptable level. In most
cases, this improves reliability when there are so few features visible that the PnP
problem has ambiguous solutions. Without determining the pose incrementally, a
catastrophic jump to another PnP solution might occur.

9.4 Tracking Attached Bodies

Many tracking problems involve estimating the motion of one body relative to
another attached, moving body. For example, an eye rotates inside of its socket,
which is part of the skull. Although the eye may have six DOFs when treated
as a rigid body in space, its position and orientation are sufficiently characterized
with two or three parameters once the head pose is given. Other examples include
the head relative to the torso, a hand relative to the wrist, and the tip of a finger
relative to its middle bone. The entire human body can even be arranged into a
tree of attached bodies, based on a skeleton. Furthermore, bodies may be attached
in a similar way for other organisms, such as dogs or monkeys, and machinery, such
as robots or cars. In the case of a car, the wheels rotate relative to the body. In all
of these cases, the result is a multibody system. The mathematical characterization
of the poses of bodies relative to each other is called multibody kinematics, and the
full determination of their velocities and accelerations is called multibody dynamics.

Eye tracking Eye tracking systems have been used by vision scientists for over a
century to study eye movements. Three main uses for VR are: 1) To accomplish
foveated rendering, as mentioned in Section 5.4, so that high-resolution rendering
need only be performed for the part of the image that lands on the fovea. 2) To
study human behavior by recording tracking data so that insights may be gained
into VR sickness, attention, and effectiveness of experiences. 3) To render the
eye orientations in VR so that social interaction may be improved by offering
eye contact and indicating someone’s focus of attention; see Section 10.4.

Three general categories of eye-tracking approaches have been developed [63,
338]. The first is electro-oculography (EOG), which obtains measurements from
several electrodes placed on the facial skin around each eye. The recorded potentials
correspond to eye muscle activity, from which the eye orientation relative to
the head is determined through filtering. The second approach uses a contact lens,
which contains a tiny magnetic coil that causes a potential change in a surrounding
electromagnetic field. The third approach is called video oculography (VOG),
which shines IR light onto the eye and senses its corneal reflection using a camera
or photodiodes. The reflection is based on Purkinje images, as shown in Figure
9.18. Because of its low cost and minimal invasiveness, this is the most commonly
used method today. The contact lens approach is the most accurate; however, it
is also the most uncomfortable.

Figure 9.18: (a) The first and sometimes the fourth Purkinje images of an IR
light source are used for eye tracking. (Figure from Wikipedia.) (b) The first
Purkinje image generates a bright reflection as shown. (Picture from Massimo
Gneo, Maurizio Schmid, Silvia Conforto, and Tomasso D’Alessio.)

Forward kinematics Suppose that an eye tracking method has estimated the
eye orientation relative to the human skull and it needs to be placed accordingly
in the virtual world. This transformation must involve a combination of the head
and eye transforms. For a more complicated problem, consider placing the right
index finger in the world by using the pose of the torso along with all of the angles
formed between bones at each joint. To understand how these and other related
problems are solved, it is helpful to first consider 2D examples.

Each body of a multibody system is called a link, and a pair of bodies are
attached at a joint, which allows one or more DOFs of motion between them.
Figure 9.19 shows two common ways that one planar body might move while
attached to another. The revolute joint is most common and characterizes the
motion allowed by a human elbow.

Figure 9.19: Two types of 2D joints: A revolute joint allows one link to rotate
with respect to the other, and a prismatic joint allows one link to translate with
respect to the other.

Consider defining a chain of $m$ links, $B_1$ to $B_m$, and determining the location
of a point on the last link. The points on each link are defined using coordinates
of its own body frame. In this frame, the body appears as shown for $B_{i-1}$ in Figure
9.20, with the origin at the joint that connects $B_{i-1}$ to $B_{i-2}$ and the $x_{i-1}$ axis
pointing through the joint that connects $B_{i-1}$ to $B_i$. To move the points on $B_i$ to
the proper location in the body frame of $B_{i-1}$, the homogeneous transform

\[ T_i = \begin{pmatrix}
\cos\theta_i & -\sin\theta_i & a_{i-1} \\
\sin\theta_i & \cos\theta_i & 0 \\
0 & 0 & 1
\end{pmatrix} \tag{9.27} \]

is applied. This rotates $B_i$ by $\theta_i$, and then translates it along the x axis by $a_{i-1}$.
For a revolute joint, $\theta_i$ is a variable, and $a_{i-1}$ is a constant. For a prismatic joint,
$\theta_i$ is constant and $a_{i-1}$ is a variable.
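
To make (9.27) concrete, here is a small sketch that builds one such transform per joint and composes several of them to carry a point on an outer link into the body frame of the first link. The function names and the three-link example values are assumptions made for illustration only; they do not come from the text.

```python
import math


def link_transform(theta, a_prev):
    """3x3 homogeneous transform of the form (9.27): rotate by theta,
    then translate by a_prev along the x axis of the previous link."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, a_prev],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(T, point):
    """Apply a homogeneous transform to a 2D point (x, y)."""
    x, y = point
    return (T[0][0] * x + T[0][1] * y + T[0][2],
            T[1][0] * x + T[1][1] * y + T[1][2])


def point_in_first_frame(point, joints):
    """Map a point given in the frame of the last link into the frame of the
    first link. `joints` lists (theta_i, a_{i-1}) for links 2, 3, ... in order."""
    T = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for theta, a_prev in joints:
        T = matmul(T, link_transform(theta, a_prev))
    return apply(T, point)


# Hypothetical three-link chain with unit-length links: the second link is bent
# +90 degrees relative to the first, the third -90 degrees relative to the second.
joints = [(math.pi / 2, 1.0), (-math.pi / 2, 1.0)]
tip = point_in_first_frame((1.0, 0.0), joints)
print(tip)
```

For this configuration the first link lies along the x axis, the second points straight up from its end, and the third points along x again, so the tip of the third link lands at approximately (2, 1) in the frame of the first link. For a prismatic joint the same function applies, with the length passed as the varying quantity and the angle held fixed.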
Points on $B_i$ are moved into the body frame for $B_1$ by applying the product
$T_2 \cdots T_i$. A three-link example is shown in Figure 9.21. To move the first link $B_1$
into the world frame, a general 2D homogeneous transform can be applied:

\[ T_1 = \begin{pmatrix}
\cos\theta_1 & -\sin\theta_1 & x_t \\
\sin\theta_1 & \cos\theta_1 & y_t \\
0 & 0 & 1
\end{pmatrix}. \tag{9.28} \]

This transform is simply prepended to the matrix product to move each $B_i$ by applying
$T_1 T_2 \cdots T_i$.

A chain of 3D links is handled in the same way conceptually, but the algebra
becomes more complicated. See Section 3.3 of [163] for more details. Figure
9.22 shows six different kinds of joints that are obtained by allowing a pair of 3D
links to slide against each other. Each link is assigned a convenient coordinate
frame based on the joints. Each homogeneous transform $T_i$ contains a mixture of
constants and variables in which the variables correspond to the freedom allowed
by the joint. The most common assignment scheme is called Denavit-Hartenberg
parameters [110]. In some settings, it might be preferable to replace each $T_i$ by
a parameterized quaternion that rotates the body, followed by a simple addition
that translates the body.

Figure 9.22: Types of 3D joints arising from the 2D surface contact between two
bodies: revolute (1 DOF), prismatic (1 DOF), screw (1 DOF), cylindrical (2 DOFs),
spherical (3 DOFs), and planar (3 DOFs).

A tree of links may also be considered; a common example is a human torso
serving as the root, with a head, two arms, and two legs being chains that extend
from it. The human hand is another example. Coordinate frames in this case are
often assigned using Kleinfinger-Khalil parameters [148].

Constraints and inverse kinematics Recall the PnP problem from Section
9.3, which involved calculating the pose of a body based on some observed constraints.
A similar problem is to determine the joint parameters for a chain of
bodies by considering the constraints on the bodies. A common example is to
calculate the poses of the arm links by using only the pose of the hand. This is
generally called the inverse kinematics problem (see [8] and Section 4.4 of [163]).
As in the case of PnP, the number of solutions may be infinite, finite, one, or
zero. Some 2D examples are shown in Figure 9.23. Generally, if the last link is
constrained, then the freedom of motion for the intermediate links increases as the
number of links increases. The Chebychev-Grübler-Kutzbach criterion gives the
number of DOFs, assuming the links are not in some special, singular configurations
[9]. A common problem in animating video game characters is to maintain
a kinematic constraint, such as the hand grasping a doorknob, even though the
torso or door is moving. In this case, iterative optimization is often applied to perturb
each joint parameter until the error is sufficiently reduced. The error would
measure the distance between the hand and the doorknob in our example.

Motion capture systems Tracking systems for attached bodies use kinematic
constraints to improve their accuracy. The most common application is tracking
the human body, for which the skeleton is well-understood in terms of links and
joints [366]. Such motion capture systems have been an important technology
for the movie industry as the motions of real actors are brought into a virtual
world for animation. Figure 9.24 illustrates the operation. Features, of the same
kind as introduced in Section 9.3, are placed over the body and are visible to
cameras mounted around the capture studio. The same options exist for visibility,
with the most common approach over the past decade being to use cameras with
surrounding IR LEDs and placing retroreflective markers on the actor.

To obtain a unique pose for each body part, it might seem that six features are
needed (recall P6P from Section 9.3); however, many fewer are sufficient because
of kinematic constraints. Additional features may nevertheless be used if the goal
is to also capture skin motion as it moves along the skeleton. This is especially
important for facial movement. Many new MOCAP technologies are currently
under development. For example, a system developed by Noitom captures human
body movement solely by placing IMUs on the body. Some systems capture motion
by cameras alone, as in the case of Leap Motion (see Figure 9.25) for hand tracking,
and systems by Microsoft and 8i for full-body tracking by extracting contours
against a green screen. Solutions based on modern depth sensors may also become
prevalent in the near future. One challenge is to make highly accurate and reliable
systems at low cost and with little installation effort.

Figure 9.23: (a) The orientations of both links can be inferred from the position of
the fixed point; however, there is a second solution if the angles are not restricted.
(b) In the case of three links, a one-dimensional family of solutions exists when
the end is fixed. This can be visualized by pushing down on the top joint, which
would cause B1 to rotate counter-clockwise. This is equivalent to the classical
four-bar mechanism, which was used to drive the wheels of a steam engine (the
fourth “link” is simply the fixed background).

Figure 9.24: With a motion capture (MOCAP) system, artificial features are placed
around the body of a human actor. The motions are extracted and matched to a
kinematic model. Each rigid body in the model has an associated geometric model
that is rendered to produce the final animated character. (Picture from Wikipedia
user Hipocrite.)

9.5 3D Scanning of Environments

Up until now, this chapter has described how to use sensors to track the motions of
one or more rigid bodies. By contrast, this section describes how sensors are used
to build geometric models of rigid bodies. These could be movable or stationary
models, as introduced in Section 3.1. A movable model typically corresponds
to an object that is manipulated by the user, such as a sword, hammer, or
coffee cup. These models are often built from a 3D scanner, which images the
object from many viewpoints in a controlled way. The object may be placed on a
surface that is surrounded by cameras and other sensors, or it could be placed on a
turntable that rotates the object so that it is observed from numerous viewpoints.
Alternatively, the sensors may move around while the object remains stationary;
see Figure 9.26(a).

SLAM A 3D scanner is useful for smaller objects, with surrounding sensors facing
inward. For larger objects and stationary models, the sensors are usually inside
facing out; see Figure 9.26(b). A common example of a stationary model is the
inside of a building. Scanning such models is becoming increasingly important for
surveying and forensics. This is also the classical robotics problem of mapping,
in which a robot carrying sensors builds a 2D or 3D representation of its world
for the purposes of navigation and collision avoidance. Robots usually need to
estimate their locations based on sensors, which is called the localization problem.
Robot localization and tracking bodies for VR are fundamentally the same problems,
with the main distinction being that known motion commands are given to
robots, but the corresponding human intent is not directly given. Robots often
need to solve mapping and localization problems at the same time, which results in
the simultaneous localization and mapping problem; the acronym SLAM is widely
used. Due to the similarity of localization, mapping, and VR tracking problems,
deep connections exist between robotics and VR. Therefore, many mathematical
models, algorithms, and sensing technologies overlap.

Figure 9.25: (a) The hand model used by Leap Motion tracking. (b) The tracked
model superimposed in an image of the actual hands.

Figure 9.26: (a) The Afinia ES360 scanner, which produces a 3D model of an
object while it spins on a turntable. (b) The FARO Focus3D X 330 is an outward-facing
scanner for building accurate 3D models of large environments; it includes
a GPS receiver to help fuse individual scans into a coherent map.

Consider the possible uses of a large, stationary model for VR. It could be
captured to provide a virtual world in which the user is placed at the current
time or a later time. Image data could be combined with the 3D coordinates of
the model, to produce a photorealistic model (recall Figure 2.14 from Section 2.2).
This is achieved by texture mapping image patches onto the triangles of the model.

Live capture of the current location Rather than capturing a world in which
to transport the user, sensors could alternatively be used to capture the physical
world where the user is currently experiencing VR. This allows obstacles in the
matched zone to be rendered in the virtual world, which might be useful for safety
or to improve interactivity. For safety, the boundaries of the matched zone could
be rendered to indicate that the user is about to reach the limit. Hazards such as
a hot cup of coffee or a pet walking across the matched zone could be indicated.
Interactivity can be improved by bringing fixed objects from the physical world
into the virtual world. For example, if the user is sitting in front of a desk, then
the desk can be drawn in the virtual world. If she touches the virtual desk, she
will feel the real desk pushing back. This is a relatively easy way to provide touch
feedback in VR.

Are panoramas sufficient? Before embarking on the process of creating a
large, detailed map of a surrounding 3D world, it is important to consider whether
it is necessary. As mentioned in Section 7.5, panoramic images and videos are becoming
increasingly simple to capture. In some applications, it might be sufficient
to build an experience in which the user is transported between panoramas that
were captured from many locations that are close to each other.

The main ingredients Building a 3D model from sensor data involves three
important steps:
1. Extracting a 3D point cloud from a fixed location.
2. Combining point clouds from multiple locations.
3. Converting a point cloud into a mesh of triangles.

For the first step, a sensor is placed at a fixed position and orientation while
3D points are extracted. This could be accomplished in a number of ways. In
theory, any of the depth cues from Section 6.1 can be applied to camera images to
extract 3D points. Variations in focus, texture, and shading are commonly used in
computer vision as monocular cues. If two cameras are facing the same scene and
their relative positions and orientations are known, then binocular cues are used
to determine depth. By identifying the same natural feature in both images, the
corresponding visibility rays from each image are intersected to identify a point in
space; see Figure 9.27. As in Section 9.3, the choice between natural and artificial
features exists. A single camera and an IR projector or laser scanner may be used
in combination so that depth is extracted by identifying where the lit point appears
in the image. This is the basis of the Microsoft Kinect sensor (recall Figure 2.10
from Section 2.1). The resulting collection of 3D points is often called a point
cloud.

Figure 9.27: By using two cameras, stereo vision enables the location of a feature
in the 3D world to be determined by intersecting the corresponding visibility ray
from each camera. To accomplish this, the camera calibration parameters and
relative poses must be known. Similarly, one camera could be replaced by a laser
that illuminates the feature so that it is visible to the remaining camera. In either
case, the principle is to intersect two visibility rays to obtain the result.

In the second step, the problem is to merge scans from multiple locations.
If the relative position and orientation of the scanner between scans is known,
then the problem is solved. In the case of the object scanner shown in Figure
9.26(a), this was achieved by rotating the object on a turntable so that the position
remains fixed and the orientation is precisely known for each scan. Suppose the
sensor is instead carried by a robot, such as a drone. The robot usually maintains
its own estimate of its pose for purposes of collision avoidance and determining
whether its task is achieved. This is also useful for determining the pose that
corresponds to the time at which the scan was performed. Typically, the pose
estimates are not accurate enough, which leads to an optimization problem in
which the estimated pose is varied until the data between overlapping scans nicely
aligns. The expectation-maximization (EM) algorithm is typically used in this case,
which incrementally adjusts the pose in a way that yields the maximum likelihood
explanation of the data in a statistical sense. If the sensor is carried by a human,
then extra sensors may be included with the scanning device, as in the case of
GPS for the scanner in Figure 9.26(b); otherwise, the problem of fusing data from
multiple scans could become too difficult.

In the third stage, a large point cloud has been obtained and the problem is
to generate a clean geometric model. Many difficulties exist. The point density
may vary greatly, especially where two or more overlapping scans were made. In
this case, some points may be discarded. Another problem is that outliers may
exist, which correspond to isolated points that are far from their correct location.
Methods are needed to detect and reject outliers. Yet another problem is that
large holes or gaps in the data may exist. Once the data has been sufficiently
cleaned, surfaces are typically fit to the data, from which triangular meshes are
formed. Each of these problems is a research area in itself. To gain some familiarity,
consider experimenting with the open-source Point Cloud Library, which was
developed to handle the operations that arise in the second and third stages. Once
a triangular mesh is obtained, texture mapping may also be performed if image
data is also available. One of the greatest challenges for VR is that the resulting
models often contain numerous flaws which are much more noticeable in VR than
on a computer screen.

Further Reading

In addition to academic papers such as [82], some of the most useful coverage for IMU
calibration appears in corporate white papers, such as [244]. For magnetometer calibration,
see [92, 155, 165, 329]. Oculus Rift 3D orientation tracking is covered in
[166, 164, 165, 168]. To fully understand vision-based tracking methods, see vision
books [111, 191, 318]. Many approaches to PnP appear in research literature, such as
[354, 367]. An excellent, but older, survey of tracking methods for VR/AR is [344]. One
of the most highly cited works is [141]. See [235] for integration of IMU and visual data
for tracking.

Eye tracking is surveyed in [63, 338]. Human body tracking is covered in [368]. To
fully understand kinematic constraints and solutions to inverse kinematics problems, see
[8, 10, 48]. SLAM from a robotics perspective is thoroughly presented in [322]. A recent
survey of SLAM based on computer vision appears in [86]. Filtering or sensor fusion in
the larger context can be characterized in terms of information spaces (see Chapter 11
of [163]).
Chapter 10

Interaction

How should users interact with the virtual world? How should they move about?
How can they grab and place objects? How should they interact with representations
of each other? How should they interact with files or the Internet? The
following insight suggests many possible interfaces.

Universal Simulation Principle:
Any interaction mechanism from the real world can be simulated in VR.

For example, the user might open a door by turning a knob and pulling. As another
example, the user might operate a virtual aircraft by sitting in a mock-up cockpit (as
was shown in Figure 1.15). One could even simulate putting on a VR headset,
leading to an experience that is comparable to a dream within a dream!

In spite of the universal simulation principle, recall from Section 1.1 that the
goal is not necessarily realism. It is often preferable to make the interaction better
than reality. Therefore, this chapter introduces interaction mechanisms that may
not have a counterpart in the physical world.

Section 10.1 introduces general motor learning and control concepts. The most
important concept is remapping, in which a motion in the real world may be
mapped into a substantially different motion in the virtual world. This enables
many powerful interaction mechanisms. The task is to develop ones that are
easy to learn, easy to use, effective for the task, and provide a comfortable user
experience. Section 10.2 discusses how the user may move himself in the virtual
world, while remaining fixed in the real world. Section 10.3 presents ways in
which the user may interact with other objects in the virtual world. Section 10.4
discusses social interaction mechanisms, which allow users to interact directly with
each other. Section 10.5 briefly considers some additional interaction mechanisms,
such as editing text, designing 3D structures, and Web browsing.

10.1 Motor Programs and Remapping

Motor programs Throughout our lives, we develop fine motor skills to accomplish
many specific tasks, such as writing text, tying shoelaces, throwing a ball,
and riding a bicycle. These are often called motor programs, and are learned
through repetitive trials, with gradual improvements in precision and ease as the
amount of practice increases [196]. Eventually, we produce the motions without
even having to pay attention to them. For example, most people can drive a car
without paying attention to particular operations of the steering wheel, brakes,
and accelerator.

In the same way, most of us have learned how to use interfaces to computers,
such as keyboards, mice, and game controllers. Some devices are easier to learn
than others. For example, a mouse does not take long, but typing quickly on
a keyboard takes years to master. What makes one skill harder to learn than
another? This is not always easy to predict, as illustrated by the backwards brain
bicycle, which was designed by Destin Sandlin by reversing the steering operation
so that turning the handlebars left turns the front wheel to the right [21]. It took
Sandlin six months to learn how to ride it, and at the end he was unable to ride an
ordinary bicycle. Thus, learning the new bicycle came at the expense of unlearning
how to ride a normal one.

Design considerations In the development of interaction mechanisms for VR,
the main considerations are:

1. Effectiveness for the task in terms of achieving the required speed, accuracy,
and motion range, if applicable.

2. Difficulty of learning the new motor programs; ideally, the user should not
be expected to spend many months mastering a new mechanism.

3. Ease of use in terms of cognitive load; in other words, the interaction mechanism
should require little or no focused attention after some practice.

4. Overall comfort during use over extended periods; the user should not develop
muscle fatigue, unless the task is to get some physical exercise.

To design and evaluate new interaction mechanisms, it is helpful to start by
understanding the physiology and psychology of acquiring the motor skills and
programs. Chapters 5 and 6 covered these for visual perception, which is the
process of converting sensory input into a perceptual experience. We now consider
the corresponding parts for generating output in the form of body motions in the
physical world. In this case, the brain sends motor signals to the muscles, causing
them to move, while at the same time incorporating sensory feedback by utilizing
the perceptual processes.

The neurophysiology of movement First consider the neural hardware involved
in learning, control, and execution of voluntary movements. As shown in
(a) (b)
Figure 10.2: (a) Atari 2600 Paddle controller. (b) The Atari Breakout game, in
which the bottom line segment is a virtual paddle that allows the ball to bounce
(a) (b) to the top and eliminate bricks upon contacts.
Figure 10.1: (a) Part of the cerebral cortex is devoted to motion. (b) Many systems, in which sensor-feedback and motor control are combined in applications
other parts interact with the cortex to produce and execute motions, including such as robotics and aircraft stabilization; the subject that deals with this is called
the thalamus, spinal cord, basal ganglion, brain stem, and cerebellum. (Figures control systems. It is well-known that a closed-loop system is preferred in which
provided by The Brain from Top to Bottom, McGill University.) sensor information provides feedback during execution, as opposed to open-loop,
which specifies the motor signals as a function of time.
One of the most important factors is how long it takes to learn a motor program.
Figure 10.1(a), some parts of the cerebral cortex are devoted to motion. The pri- As usual, there is great variation across humans. A key concept is neuroplasticity,
mary motor cortex is the main source of neural signals that control movement, which is the potential of the brain to reorganize its neural structures and form new
whereas the premotor cortex and supplementary motor area appear to be involved pathways to adapt to new stimuli. Toddlers have a high level of neuroplasticity,
in the preparation and planning of movement. Many more parts are involved in which becomes greatly reduced over time through the process of synaptic pruning.
motion and communicate through neural signals, as shown in Figure 10.1(b). The This causes healthy adults to have about half as many synapses per neuron than
most interesting part is the cerebellum, meaning “little brain”, which is located at a child of age two or three [99]. Unfortunately, the result is that adults have a
the back of the skull. It seems to be a special processing unit that is mostly de- harder time acquiring new skills such as learning a new language or learning how
voted to motion, but is also involved in functions such as attention and language. to use a complicated interface. In addition to the reduction of neuroplasticity with
Damage to the cerebellum has been widely seen to affect fine motor control and age, it also greatly varies among people of the same age.
learning of new motor programs. It has been estimated to contain around 101 bil-
lion neurons [7], which is far more than the entire cerebral cortex, which contains Learning motor programs Now consider learning a motor program for a com-
around 20 billion. Even though the cerebellum is much smaller, a large number puter interface. A simple, classic example is the video game Breakout, which was
is achieved through smaller, densely packed cells. In addition to coordinating fine developed by Atari in 1976. The player turns a knob, shown in Figure 10.2. This
movements, it appears to be the storage center for motor programs. causes a line segment on the bottom of the screen to move horizontally. The Pad-
One of the most relevant uses of the cerebellum for VR is in learning sensorimo- dle contains a potentiometer that with calibration allows the knob orientation to
tor relationships, which become encoded into a motor program. All body motions be reliably estimated. The player sees the line segment positioned on the bottom
involve some kind of sensory feedback. The most common example is hand-eye of the screen and quickly associates the knob orientations. The learning process
coordination; however, even if you move your arms with your eyes closed, propri- therefore involves taking information from visual perception and the propriocep-
oception provides information in the form of efference copies of the motor signals. tion signals from turning the knob and determining the sensorimotor relationships.
Developing a tight connection between motor control signals and sensory and per- Skilled players could quickly turn the knob so that they could move the line seg-
ceptual signals is crucial to many tasks. This is also widely known in engineered ment much more quickly than one could move a small tray back and forth in the
real world. Thus, we already have an example where the virtual world version allows better performance than in reality.
In the Breakout example, a one-dimensional mapping was learned between the knob orientation and the line segment position. Many alternative control schemes could be developed; however, they are likely to be more frustrating. If you find an emulator to try Breakout, it will most likely involve using keys on a keyboard to move the segment. In this case, the amount of time that a key is held down corresponds to the segment displacement. The segment velocity is set by the program, rather than the user. A reasonable alternative using modern hardware might be to move a finger back and forth over a touch screen while the segment appears directly above it. The finger would not be constrained enough due to extra DOFs and the rapid back-and-forth motions of the finger may lead to unnecessary fatigue, especially if the screen is large. Furthermore, there are conflicting goals in positioning the screen: Making it as visible as possible versus making it comfortable for rapid hand movement over a long period of time. In the case of the Paddle, the motion is accomplished by the fingers, which have high dexterity, while the forearm moves much less. The mapping provides an association between body movement and virtual object placement that achieves high accuracy, fast placement, and long-term comfort.
Figure 10.3: (a) The Apple Macintosh mouse. (b) As a mouse moves across the table, the virtual finger on the screen moves correspondingly, but is rotated by 90 degrees and travels over longer distances.
Figure 10.3 shows a more familiar example, which is the computer mouse. As the mouse is pushed around on a table, encoders determine the position, which is converted into a pointer position on the screen. The sensorimotor mapping seems a bit more complex than in the Breakout example. Young children seem to immediately learn how to use the mouse, whereas older adults require some practice. The 2D position of the mouse is mapped to a 2D position on the screen, with two fundamental distortions: 1) The screen is rotated 90 degrees in comparison to the table (horizontal to vertical motion). 2) The motion is scaled so that small physical motions produce larger screen motions. The advantages of the original Xerox Alto mouse were scientifically argued in [39] in terms of human skill learning and Fitts’s law [78, 192], which mathematically relates pointing task difficulty to the time required to reach targets.
For a final example, suppose that by pressing a key, the letter “h” is instantly placed on the screen in a familiar font. Our visual perception system recognizes the “h” as being equivalent to the version on paper. Thus, typing the key results in the perception of “h”. This is quite a comfortable, fast, and powerful operation. The amount of learning required seems justified by the value of the output.
Motor programs for VR The examples given so far already seem closely related to VR. A perceptual experience is controlled by body movement that is sensed through a hardware device. Using the universal simulation principle, any of these and more could be brought into a VR system. The physical interaction part might be identical (you could really be holding an Atari Paddle), or it could be simulated through another controller. Think about possible designs.
Using the tracking methods of Chapter 9, the position and orientation of body parts could be reliably estimated and brought into VR. For the case of head tracking, it is essential to maintain the viewpoint with high accuracy and zero effective latency; otherwise, the VR experience is significantly degraded. This is essential because the perception of stationarity must be maintained for believability and comfort. The motion of the sense organ must be matched by a tracking system.
Remapping For the motions of other body parts, this perfect matching is not critical. Our neural systems can instead learn associations that are preferable in terms of comfort, in the same way as the Atari Paddle, mouse, and keyboard work in the real world. Thus, we want to do remapping, which involves learning a sensorimotor mapping that produces different results in a virtual world than one would expect from the real world. The keyboard example above is one of the most common examples of remapping. The process of pushing a pencil across paper to produce a letter has been replaced by pressing a key. The term remapping is even used with keyboards to mean the assignment of one or more keys to another key.
Remapping is natural for VR. For example, rather than reaching out to grab a virtual door knob, one could press a button to open the door. For a simpler case, consider holding a controller for which the pose is tracked through space, as allowed by the HTC Vive system. A scaling parameter could be set so that one centimeter of hand displacement in the real world corresponds to two centimeters of displacement in the virtual world. This is similar to the scaling parameter for the mouse. Section 10.2 covers the remapping from natural walking in the real world to achieving the equivalent in a virtual world by using a controller. Section 10.3 covers object interaction methods, which are again achieved by remappings. You can expect to see many new remapping methods for VR in the coming years.
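The two mouse distortions and the hand-scaling example can be written as very small functions. The following is only a sketch under stated assumptions: the axis conventions (table x to the right and z toward the user; screen x to the right and y up), the function names, and the gain values are all hypothetical and are not taken from any particular system.

```python
def remap_mouse(dx_table, dz_table, gain):
    """Map a mouse displacement on the table to a pointer displacement on screen.

    Assumed axes: table x points right and z points toward the user;
    screen x points right and y points up. Pushing the mouse away from
    the user (negative z) therefore moves the pointer up (positive y),
    which is the 90-degree rotation from horizontal to vertical motion.
    The gain scales small physical motions into larger screen motions.
    """
    return (gain * dx_table, -gain * dz_table)


def remap_hand(real_pos, real_origin, virtual_origin, scale=2.0):
    """Scale a tracked controller position about a reference point.

    With scale = 2.0, one centimeter of real hand displacement from
    real_origin becomes two centimeters of virtual displacement from
    virtual_origin. Positions are (x, y, z) tuples in meters.
    """
    return tuple(v0 + scale * (p - p0)
                 for p, p0, v0 in zip(real_pos, real_origin, virtual_origin))


# Mouse pushed 0.5 units right and 0.25 units away from the user, gain 4.
pointer_motion = remap_mouse(0.5, -0.25, 4.0)

# Hand moved 0.5 m along x from the reference point, scale factor 2.
virtual_hand = remap_hand((1.0, 1.5, 0.0), (0.5, 1.5, 0.0), (0.0, 0.0, 0.0))
```

Keeping the scale as a single parameter makes it easy to experiment with the tradeoff between reach and precision.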
Figure 10.4: Moving from left to right, the amount of viewpoint mismatch between real and virtual motions increases.
10.2 Locomotion
Suppose that the virtual world covers a much larger area than the part of the real world that is tracked. In other words, the matched zone is small relative to the virtual world. In this case, some form of interaction mechanism is needed to move the user in the virtual world while she remains fixed within the tracked area in the real world. An interaction mechanism that moves the user in this way is called locomotion. It is as if the user is riding in a virtual vehicle that is steered through the virtual world.
Figure 10.4 shows a spectrum of common locomotion scenarios. At the left, the user walks around in an open space while wearing a headset. No locomotion is needed unless the virtual world is larger than the open space. This case involves no mismatch between real and virtual motions.
The two center cases correspond to a seated user wearing a headset. In these cases, an interaction mechanism is used to change the position of the matched zone in the virtual world. If the user is seated in a swivel chair, then he could change the direction he is facing (yaw orientation) by rotating the chair. This can be considered as orienting the user’s torso in the virtual world. If the user is seated in a fixed chair, then the virtual torso orientation is typically changed using a controller, which results in more mismatch. The limiting case is on the right of Figure 10.4, in which there is not even head tracking. If the user is facing a screen, as in the case of a first-person shooter game, then a game controller is used to change the position and orientation of the user in the virtual world. This is the largest amount of mismatch because all changes in viewpoint are generated by the controller.
Redirected walking If the user is tracked through a very large space, such as a square region of at least 30 meters on each side, then it is possible to make her think she is walking in straight lines for kilometers while she is in fact walking in circles. This technique is called redirected walking [261]. Walking along a straight line over long distances without visual cues is virtually impossible for humans (and robots!) because in the real world it is impossible to achieve perfect symmetry. One direction will tend to dominate through an imbalance in motor strength and sensory signals, causing people to travel in circles.
Imagine a VR experience in which a virtual city contains long, straight streets. As the user walks down the street, the yaw direction of the viewpoint can be gradually varied. This represents a small amount of mismatch between the real and virtual worlds, and it causes the user to walk along circular arcs. The main trouble with this technique is that the user has free will and might decide to walk to the edge of the matched zone in the real world, even if he cannot directly perceive it. In this case, an unfortunate, disruptive warning might appear, suggesting that he must rotate to reset the yaw orientation.
Figure 10.5: Locomotion along a horizontal terrain can be modeled as steering a cart through the virtual world. A top-down view is shown. The yellow region is the matched zone (recall Figure 2.15), in which the user’s viewpoint is tracked. The values of $x_t$, $z_t$, and $\theta$ are changed by using a controller.
Locomotion implementation Now consider the middle cases from Figure 10.4 of sitting down and wearing a headset. Locomotion can then be simply achieved by moving the viewpoint with a controller. It is helpful to think of the matched zone as a controllable cart that moves across the ground of the virtual environment; see Figure 10.5. First consider the simple case in which the ground is a horizontal plane. Let $T_{track}$ denote the homogeneous transform that represents the tracked position and orientation of the cyclopean (center) eye in the physical world. The methods described in Section 9.3 could be used to provide $T_{track}$ for the current time.
The position and orientation of the cart are determined by a controller. The
homogeneous matrix
\[
T_{cart} =
\begin{bmatrix}
\cos\theta & 0 & \sin\theta & x_t \\
0 & 1 & 0 & 0 \\
-\sin\theta & 0 & \cos\theta & z_t \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{10.1}
\]
encodes the position $(x_t, z_t)$ and orientation $\theta$ of the cart (as a yaw rotation, borrowed from (3.18)). The height is set at $y_t = 0$ in (10.1) so that it does not change the height determined by tracking or other systems (recall from Section 9.2 that the height might be set artificially if the user is sitting in the real world, but standing in the virtual world).
The eye transform is obtained by chaining $T_{track}$ and $T_{cart}$ to obtain
\[
T_{eye} = (T_{track} T_{cart})^{-1} = T_{cart}^{-1} T_{track}^{-1} .
\tag{10.2}
\]
Recall from Section 3.4 that the eye transform is the inverse of the transform that places the geometric models. Therefore, (10.2) corresponds to changing the perspective due to the cart, followed by the perspective of the tracked head on the cart.
To move the viewpoint for a fixed direction $\theta$, the $x_t$ and $z_t$ components are obtained by integrating a differential equation:
\[
\begin{aligned}
\dot{x}_t &= s \cos\theta \\
\dot{z}_t &= s \sin\theta .
\end{aligned}
\tag{10.3}
\]
Integrating (10.3) over a time step $\Delta t$, the position update appears as
\[
\begin{aligned}
x_t[k+1] &= x_t[k] + \dot{x}_t \Delta t \\
z_t[k+1] &= z_t[k] + \dot{z}_t \Delta t .
\end{aligned}
\tag{10.4}
\]
The variable $s$ in (10.3) is the forward speed. The average human walking speed is about 1.4 meters per second. The virtual cart can be moved forward by pressing a button or key that sets $s = 1.4$. Another button can be used to assign $s = -1.4$, which would result in backward motion. If no key or button is held down, then $s = 0$, which causes the cart to remain stopped. An alternative control scheme is to use the two buttons to increase or decrease the speed, until some maximum limit is reached. In this case, motion is sustained without holding down a key.
Keys could also be used to provide lateral motion, in addition to forward/backward motion. This is called strafing in video games. It should be avoided, if possible, because it causes unnecessary lateral vection.
Issues with changing direction Now consider the orientation $\theta$. To move in a different direction, $\theta$ needs to be reassigned. The assignment could be made based on the user’s head yaw direction. This becomes convenient and comfortable when the user is sitting in a swivel chair and looking forward. By rotating the swivel chair, the direction can be set. (However, this could become a problem for a wired headset because the cable could wrap around the user.)
In a fixed chair, it may become frustrating to control $\theta$ because the comfortable head yaw range is limited to only 60 degrees in each direction (recall Figure 5.21). In this case, buttons can be used to change $\theta$ by small increments in clockwise or counterclockwise directions. Unfortunately, changing $\theta$ according to constant angular velocity causes yaw vection, which is nauseating to many people. Some users prefer to tap a button to instantly yaw about 10 degrees each time. If the increments are too small, then vection appears again, and if the increments are too large, then users become confused about their orientation.
Another issue is where to locate the center of rotation, as shown in Figure 10.6. What happens when the user moves his head away from the center of the chair in the real world? Should the center of rotation be about the original head center or the new head center? If it is chosen as the original center, then the user will perceive a large translation as $\theta$ is changed. However, this would also happen in the real world if the user were leaning over while riding in a cart. If it is chosen as the new head center, then the amount of translation is less, but might not correspond as closely to reality.
Figure 10.6: On the right the yaw rotation axis is centered on the head, for a user who is upright in the chair. On the left, the user is leaning over in the chair. Should the rotation axis remain fixed, or move with the user?
For another variation, the car-like motion model (8.30) from Section 8.3.2 could be used so that the viewpoint cannot be rotated without translating. In other words, the avatar would have a minimum turning radius. In general, the viewpoint could be changed by controlling any virtual vehicle model. Figure 1.1 from Chapter 1 showed an example in which the “vehicle” is a bird.
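The cart model of (10.1) through (10.4), together with the tap-to-yaw scheme, can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names, the nested-list matrix representation, the 90 Hz update rate, and the sign convention for the turn direction are all assumptions made here for the example.

```python
import math


def cart_transform(x_t, z_t, theta):
    """Build T_cart as in (10.1): yaw by theta, translation (x_t, 0, z_t)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c,   0.0, s,   x_t],
            [0.0, 1.0, 0.0, 0.0],
            [-s,  0.0, c,   z_t],
            [0.0, 0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t):
    """Invert a rigid-body transform [R p; 0 1] as [R^T  -R^T p; 0 1]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    rows = [r_t[i] + [-sum(r_t[i][k] * p[k] for k in range(3))]
            for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]


def eye_transform(t_track, t_cart):
    """T_eye = T_cart^{-1} T_track^{-1}, the right-hand side of (10.2)."""
    return matmul(rigid_inverse(t_cart), rigid_inverse(t_track))


def step_cart(x_t, z_t, theta, s, dt):
    """One update of (10.4) using the velocities from (10.3)."""
    return (x_t + s * math.cos(theta) * dt,
            z_t + s * math.sin(theta) * dt)


def snap_turn(theta, direction, increment_deg=10.0):
    """Instantly yaw by a fixed increment; direction is +1 or -1."""
    return theta + direction * math.radians(increment_deg)


WALK_SPEED = 1.4  # meters per second, the walking speed quoted in the text

# Hold the "forward" button for one second at an assumed 90 Hz update rate.
x_t, z_t, theta = 0.0, 0.0, 0.0
for _ in range(90):
    x_t, z_t = step_cart(x_t, z_t, theta, WALK_SPEED, 1.0 / 90.0)

# Tap the turn button once, then rebuild the cart and eye transforms.
theta = snap_turn(theta, +1)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
t_eye = eye_transform(identity, cart_transform(x_t, z_t, theta))
```

Setting `s` to zero when no button is held, or accumulating it toward a maximum, gives the two control schemes described above without changing `step_cart`.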
1. If the field of view for the optical flow is reduced, then the vection is weakened. A common example is to make a cockpit or car interior that blocks most of the optical flow.
2. If the viewpoint is too close to the ground, then the magnitudes of velocity and acceleration vectors of moving features are higher. This is why you might feel as if you are traveling faster in a small car that is low to the ground in comparison to riding at the same speed in a truck or minivan.
4. Having high spatial frequency will yield more features for the human vision system to track. Therefore, if the passing environment is smoother, with less detail, then vection should be reduced. Consider the case of traveling up a staircase. If the steps are clearly visible so that they appear as moving horizontal stripes, then the user may quickly become nauseated by the strong vertical vection signal.
5. Reducing contrast, such as making the world seem hazy or foggy while accelerating, may help.
6. Providing other sensory cues such as blowing wind or moving audio sources might provide stronger evidence of motion. Including vestibular stimulation in the form of a rumble or vibration may also help lower the confidence of the vestibular signal. Even using head tilts to induce changes in virtual-world motion may help because it would cause distracting vestibular signals.
7. If the world is supposed to be moving, rather than the user, then making it clear through cues or special instructions can help.
8. Providing specific tasks, such as firing a laser at flying insects, may provide enough distraction from the vestibular conflict. If the user is instead focused entirely on the motion, then she might become sick more quickly.
9. The adverse effects of vection may decrease through repeated practice. People who regularly play FPS games in front of a large screen already seem to have reduced sensitivity to vection in VR. Requiring users to practice before sickness is reduced might not be a wise strategy for companies hoping to introduce new products. Imagine trying some new food that makes you nauseated after the first 20 times of eating it, but then gradually becomes more acceptable. Who would keep trying it?
Figure 10.7: (a) Applying constant acceleration over a time interval to bring the stopped avatar up to a speed limit. The upper plot shows the speed over time. The lower plot shows the acceleration. The interval of time over which there is nonzero acceleration corresponds to a mismatch with the vestibular sense. (b) In this case, an acceleration impulse is applied, resulting in the desired speed limit being immediately achieved. In this case, the mismatch occurs over a time interval that is effectively zero length. In practice, the perceived speed changes in a single pair of consecutive frames. Surprisingly, case (b) is much more comfortable than (a). It seems the brain prefers an outlier for a very short time interval, as opposed to a smaller, sustained mismatch over a longer time interval (such as 5 seconds).
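The two speed profiles compared in Figure 10.7 are simple to state in code. The sketch below assumes the avatar starts from rest at time zero; the function names and the example acceleration of 0.28 meters per second squared (chosen so that a 1.4 meters per second limit is reached after 5 seconds) are illustrative assumptions, not values from the figure.

```python
def ramp_speed(t, s_max, accel):
    """Case (a): constant acceleration from rest until s_max is reached."""
    return min(s_max, accel * max(t, 0.0))


def impulse_speed(t, s_max):
    """Case (b): the speed jumps to s_max at t = 0 and stays there."""
    return s_max if t >= 0.0 else 0.0


def mismatch_duration(s_max, accel):
    """Length of the interval with nonzero acceleration in case (a)."""
    return s_max / accel


S_MAX = 1.4    # speed limit in meters per second (walking speed)
ACCEL = 0.28   # hypothetical constant acceleration in m/s^2

halfway = ramp_speed(2.5, S_MAX, ACCEL)        # still accelerating
settled = ramp_speed(10.0, S_MAX, ACCEL)       # clamped at the limit
duration = mismatch_duration(S_MAX, ACCEL)     # sustained mismatch in (a)
instant = impulse_speed(0.0, S_MAX)            # limit reached immediately
```

In case (b) the corresponding duration is effectively one frame interval, which is the short outlier that the caption reports as more comfortable.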
the flashlight could be adjustable [83]. A virtual mirror could be placed so that a selection could be made around a corner. Chapter 5 of [31] offers many other suggestions.
With a pointer, the user simply illuminates the object of interest and presses a button. If the goal is to retrieve the object, then it can be immediately placed in the user’s virtual hand or inventory. If the goal is to manipulate the object in a standard, repetitive way, then pressing the button could cause a virtual motor program to be executed. This could be used, for example, to turn a doorknob, thereby opening a door. In uses such as this, developers might want to set a limit on the depth of the laser pointer, so that the user must be standing close enough to enable the interaction. It might seem inappropriate, for example, to turn doorknobs from across the room!
If the object is hard to see, then the selection process may be complicated. It might be behind the user’s head, which might require uncomfortable turning. The object could be so small or far away that it occupies only a few pixels on the screen, making it difficult to precisely select it. The problem gets significantly worse if there is substantial clutter around the object of interest, particularly if other selectable objects are nearby. Finally, the object may be partially or totally occluded from view.
Manipulation If the user carries an object over a long distance, then it is not necessary for her to squeeze or clutch the controller; this would yield unnecessary fatigue. In some cases, the user might be expected to carefully inspect the object while having it in possession. For example, he might want to move it around in his hand to determine its 3D structure. The object orientation could be set to follow exactly the 3D orientation of a controller that the user holds. The user could even hold a real object in hand that is tracked by external cameras, but has a different appearance in the virtual world. This enables familiar force feedback to the user, a concept that is revisited in Section 13.1. Note that an object could even be manipulated directly in its original place in the virtual world, without bringing it close to the user’s virtual body [30]. In this case, the virtual hand is brought to the object, while the physical hand remains in place. Having a longer arm than normal can also be simulated [254], to retrieve and place objects over greater distances.
Placement Now consider ungrasping the object and placing it into the world. An easy case for the user is to press a button and have the object simply fall into the right place. This is accomplished by a basin of attraction, which is an attractive potential function defined in a neighborhood of the target pose (position and orientation); see Figure 10.11. The minimum of the potential function is at the target. After the object is released, the object falls into the target pose by moving so that the potential is reduced to its minimum. This behavior is seen in many 2D drawing programs so that the endpoints of line segments conveniently meet. An example of convenient object placement is in the 2011 Minecraft sandbox game by Markus Persson (Notch), in which building blocks simply fall into place. Children have built millions of virtual worlds in this way.
Figure 10.11: To make life easier on the user, a basin of attraction can be defined around an object so that when the basin is entered, the dropped object is attracted directly to the target pose.
Alternatively, the user may be required to delicately place the object. Perhaps the application involves stacking and balancing objects as high as possible. In this case, the precision requirements would be very high, placing a burden on both the controller tracking system and the user.
Remapping Now consider the power of remapping, as described in Section 10.1. The simplest case is the use of the button to select, grasp, and place objects. Instead of a button, continuous motions could be generated by the user and tracked by systems. Examples include turning a knob, moving a slider bar, moving a finger over a touch screen, and moving a free-floating body through space. Recall that one of the most important aspects of remapping is easy learnability. Reducing the number of degrees of freedom that are remapped will generally ease the learning process. To avoid gorilla arms and related problems, a scaling factor could be imposed on the tracked device so that a small amount of position change in the controller corresponds to a large motion in the virtual world. This problem could again be studied using Fitts’s law as in the case of the computer mouse. Note that this might have an adverse effect on precision in the virtual world. In some settings orientation scaling might also be desirable. In this case, the 3D angular velocity $(\omega_x, \omega_y, \omega_z)$ could be scaled by a factor to induce more rotation in the virtual world than in the real world.
Current systems The development of interaction mechanisms for manipulation remains one of the greatest challenges for VR. Current generation consumer VR headsets either leverage existing game controllers, as in the bundling of the XBox 360 controller with the Oculus Rift in 2016, or introduce systems that assume large hand motions are the norm, as in the HTC Vive headset controller, as shown in Figure 10.12. Controllers that have users moving their hands through space seem
Figure 10.14: A collection of starter avatars offered by Second Life.
Figure 10.15: Holographic communication research from Microsoft in 2016. A 3D representation of a person is extracted in real time and superimposed in the world, as seen through augmented reality glasses (HoloLens).
Figure 10.16: The Digital Emily project from 2009: (a) A real person is imaged. (b) Geometric models are animated along with sophisticated rendering techniques to produce realistic facial movement.
exists. At one extreme, a user may represent himself through an avatar, which is a 3D representation that might not correspond at all to his visible, audible, and behavioral characteristics; see Figure 10.14. At the other extreme, a user might be captured using imaging technology and reproduced in the virtual world with a highly accurate 3D representation; see Figure 10.15. In this case, it may seem as if the person were teleported directly from the real world to the virtual world. Many other possibilities exist along this spectrum, and it is worth considering the tradeoffs.
One major appeal of an avatar is anonymity, which offers the chance to play a different role or exhibit different personality traits in a social setting. In a phenomenon called the Proteus effect, it has been observed that a person’s behavior changes based on the virtual characteristics of the avatar, which is similar to the way in which people have been known to behave differently when wearing a uniform or costume [360]. The user might want to live a fantasy, or try to see the world from a different perspective. For example, people might develop a sense of empathy if they are able to experience the world from an avatar that appears to be different in terms of race, gender, height, weight, age, and so on.
Users may also want to experiment with other forms of embodiment. For example, a group of children might want to inhabit the bodies of animals while talking and moving about. Imagine if you could have people perceive you as if you were an alien, an insect, an automobile, or even a talking block of cheese. People were delightfully surprised in 1986 when Pixar brought a desk lamp to life in the animated short Luxo Jr. Hollywood movies over the past decades have been filled with animated characters, and we have the opportunity to embody some of them while inhabiting a virtual world!
Now consider moving toward physical realism. Based on the current technology, three major kinds of similarity can be independently considered:
1. Visual appearance: How close does the avatar seem to the actual person
2. Auditory appearance: How much does the sound coming from the avatar match the voice, language, and speech patterns of the person?
The first kind of similarity could start to match the person by making a kinematic model in the virtual world (recall Section 9.4) that corresponds in size and mobility to the actual person. Other simple matching such as hair color, skin tone, and eye color could be performed. To further improve realism, texture mapping could be used to map skin and clothes onto the avatar. For example, a picture of the user’s face could be texture mapped onto the avatar face. Highly accurate matching might also be made by constructing synthetic models, or combining information from both imaging and synthetic sources. Some of the best synthetic matching performed to date has been by researchers at the USC Institute for Creative Technologies; see Figure 10.16. A frustrating problem, as mentioned in Section 1.1, is the uncanny valley. People often describe computer-generated animation that tends toward human realism as seeing zombies or talking cadavers. Thus, being far from perfectly matched is usually much better than “almost” matched in terms of visual appearance.
For the auditory part, users of Second Life and similar systems have preferred text messaging. This interaction is treated as if they were talking aloud, in the sense that text messages can only be seen by avatars that would have been close enough to hear it at the same distance in the real world. Texting helps to ensure anonymity. Recording and reproducing voice is simple in VR, making it much simpler to match auditory appearance than visual appearance. One must take care to render the audio with proper localization, so that it appears to others to be coming from the mouth of the avatar; see Chapter 11. If desired, anonymity can be easily preserved in spite of audio recording by using real-time voice-changing software (such as MorphVOX or Voxal Voice Changer); this might be preferred to texting in some settings.
Finally, note that the behavioral experience could be matched perfectly, while the avatar has a completely different visual appearance. This is the main motivation for motion capture systems, in which the movements of a real actor are recorded and then used to animate an avatar in a motion picture. Note that movie production is usually a long, off-line process. Accurate, real-time performance that perfectly matches the visual and behavioral appearance of a person is currently unattainable in low-cost VR systems. Furthermore, capturing the user’s face is difficult if part of it is covered by a headset, although some recent progress has been made in this area [180].
On the other hand, current tracking systems can be leveraged to provide accurately matched behavioral appearance in some instances. For example, head tracking can be directly linked to the avatar head so that others can know where the head is turned. Users can also understand head nods or gestures, such as “yes” or “no”. Figure 10.17 shows a simple VR experience in which friends can watch a movie together while being represented by avatar heads that are tracked (they can also talk to each other). In some systems, eye tracking could also be used so that users can see where the avatar is looking; however, in some cases, this might enter back into the uncanny valley. If the hands are tracked, which could be done using controllers such as those shown in Figure 10.12, then they can also be brought into the virtual world.
Figure 10.17: Oculus Social Alpha, which was an application for Samsung Gear VR. Multiple users could meet in a virtual world and socialize. In this case, they are watching a movie together in a theater. Their head movements are provided using head tracking data. They are also able to talk to each other with localized audio.
From one-on-one to societies Now consider social interaction on different scales. The vast majority of one-on-one interaction that we have in the real world is with people we know. Likewise, it is the same when interacting through technology, whether through text messaging, phone calls, or video chat. Most of our interaction through technology is targeted in that there is a specific purpose to the engagement. This suggests that VR can be used to take a video chat to the next level, where two people feel like they are face-to-face in a virtual world, or even in a panoramic capture of the real world. Note, however, that in the real world, we may casually interact simply by being in close proximity while engaged in other activities, rather than having a targeted engagement.
One important aspect of one-on-one communication is whether the relationship between the two people is symmetrical or complementary (from Paul Watzlawick’s
10.4. SOCIAL INTERACTION 301 302 S. M. LaValle: Virtual Reality
Figure 10.19: The Valve Steam game app store when viewed in the HTC Vive
headset.
using two input devices, one for typing and the other for pointing. In the case of a
PC, this has taken the form of a keyboard and mouse. With modern smartphones,
people are expected to type on small touch screens, or use alternatives such as voice
or swipe-to-type. They use their fingers to point by touching, and additionally
zoom with a pair of fingers.
Text entry and editing The typing options on a smartphone are sufficient for
entering search terms or typing a brief message, but they are woefully inadequate
for writing a novel. For professionals who currently sit in front of keyboards to write
reports, computer programs, newspaper articles, and so on, what kind of interfaces
are needed to entice them to work in VR?
One option is to track a real keyboard and mouse, making them visible in VR.
Tracking of fingertips may also be needed to provide visual feedback. This enables
a system to be developed that magically transforms the desk and surrounding
environment into anything. Much like the use of a background image on a desktop
system, a relaxing panoramic image or video could envelop the user while she
works. For the actual work part, rather than having one screen in front of the
user, a number of screens or windows could appear all around and at different
depths.
It is easy to borrow interface concepts from existing desktop windowing systems,
but much research remains to design and evaluate completely novel interfaces
for improved productivity and comfort while writing. What could word processing
look like in VR? What could an integrated development environment (IDE) for
writing and debugging software look like? If the keyboard and mouse are replaced
by other interfaces, then the user might not even need to sit at a desk to work. One
challenge would be to get users to learn a method that offers text entry speeds that
are comparable to using a keyboard, but enables them to work more comfortably.
3D design and visualization What are the professional benefits to being able
to inhabit a 3D virtual world? In addition to video games, several other fields
have motivated the development of computer graphics. Prior to computer-aided
design (CAD), architects and engineers spent many hours with pencil and paper
to painstakingly draw accurate lines on paper. The computer has proved to be an
indispensable tool for design. Data visualization has been a key use of computers
over the past years. Examples are medical, scientific, and market data. With all
of these uses, we are still forced to view designs and data sets by manipulating 2D
projections on screens.
VR offers the ability to interact with and view 3D versions of a design or
data set. This could be from the outside looking in, perhaps at the design of a
new kitchen utensil. It could also be from the inside looking out, perhaps at the
design of a new kitchen. If the perceptual concepts from Chapter 6 are carefully
addressed, then the difference between the designed object or environment and the
real one may be less than ever before. Viewing a design in VR can be considered
as a kind of virtual prototyping, before a physical prototype is constructed. This
enables rapid, low-cost advances in product development cycles.
A fundamental challenge to achieving VR-based design and visualization is
the interaction mechanism. What will allow an architect, artist, game developer,
movie set builder, or engineer to comfortably build 3D worlds over long periods of
time? What tools will allow people to manipulate high-dimensional data sets as
they project onto a 3D world?
The future Many more forms of interaction can be imagined, even by just
applying the universal simulation principle. Video games have already provided many
ideas for interaction via a standard game controller. Beyond that, the Nintendo
Wii remote has been especially effective in making virtual versions of sports
activities such as bowling a ball or swinging a tennis racket. What new interaction
mechanisms will be comfortable and effective for VR? If displays are presented
to senses other than vision, then even more possibilities emerge. For example,
could you give someone a meaningful hug on the other side of the world if they
are wearing a suit that applies the appropriate forces to the body?
Further Reading
For overviews of human motor control and learning, see the books [196, 271].
Proprioception issues in the context of VR are covered in [61]. For more on locomotion
and wayfinding see [52] and Chapters 6 and 7 of [31]. For grasping issues in robotics,
see [203].
The limits of hand-eye coordination were studied in the following seminal papers:
[56, 70, 346]. The power law of practice was introduced in [229], which indicates
that the logarithm of reaction time reduces linearly with the logarithm of the
amount of practice. Research that relates Fitts’s law to pointing device operation
includes [72, 193, 194, 302]. For broad coverage of human-computer interaction,
see [37, 40]. For additional references on social interaction through avatars, see
[20, 210, 328].
Chapter 11
Audio
Hearing is an important sense for VR and has been unfortunately neglected up
until this chapter. Developers of VR systems tend to focus mainly on the vision
part because it is our strongest sense; however, the audio component of VR is
powerful and the technology exists to bring high fidelity audio experiences into
VR. In the real world, audio is crucial to art, entertainment, and oral
communication. As mentioned in Section 2.1, audio recording and reproduction can be
considered as a VR experience by itself, with both a CAVE-like version (surround
sound) and a headset version (wearing headphones). When combined consistently
with the visual component, audio helps provide a compelling and comfortable VR
experience.
Each section of this chapter is the auditory (or audio) complement to one of
Chapters 4 through 7. The progression again goes from physics to physiology, and
then from perception to rendering. Section 11.1 explains the physics of sound in
terms of waves, propagation, and frequency analysis. Section 11.2 describes the
parts of the human ear and their function. This naturally leads to auditory
perception, which is the subject of Section 11.3. Section 11.4 concludes by presenting
auditory rendering, which can produce sounds synthetically from models or
reproduce captured sounds. When reading these sections, it is important to keep
in mind the visual counterpart of each subject. The similarities make it easier to
quickly understand and the differences lead to unusual engineering solutions.
11.1 The Physics of Sound
This section parallels many concepts from Chapter 4, which covered the basic
physics of light. Sound wave propagation is similar in many ways to light, but with
some key differences that have major perceptual and engineering consequences.
Whereas light is a transverse wave, which oscillates in a direction perpendicular
to its propagation, sound is a longitudinal wave, which oscillates in a direction
parallel to its propagation. Figure 11.1 shows an example of this for a parallel
wavefront.
Figure 11.1: Sound is a longitudinal wave of compression and rarefaction of air
molecules. The case of a pure tone is shown here, which leads to a sinusoidal
pressure function. (Figure from Pond Science Institute.)
Sound corresponds to vibration in a medium, which is usually air, but could
also be water, or any other gases, liquids, or solids. There is no sound in a
vacuum, which is unlike light propagation. For sound, the molecules in the medium
displace, causing variations in pressure that range from a compression extreme to
a decompressed, rarefaction extreme. At a fixed point in space, the pressure varies
as a function of time. Most importantly, this could be the pressure variation on a
human eardrum, which is converted into a perceptual experience. The sound
pressure level is frequently reported in decibels (abbreviated as dB), which is defined
as
$$N_{dB} = 20 \log_{10}(p_e / p_r). \tag{11.1}$$
Above, $p_e$ is the pressure level of the peak compression and $p_r$ is a reference pressure
level, which is usually taken as $2 \times 10^{-5}$ newtons per square meter (20 micropascals).
Sound waves are typically produced by vibrating solid materials, especially as
they collide or interact with each other. A simple example is striking a large bell,
which causes it to vibrate for many seconds. Materials may also be forced into
sound vibration by sufficient air flow, as in the case of a flute. Human bodies are
designed to produce sound by using lungs to force air through the vocal cords,
which causes them to vibrate. This enables talking, singing, screaming, and so on.
Sound sources and attenuation As in the case of light, we can consider rays,
for which each sound ray is perpendicular to the sound propagation wavefront.
A point sound source can be defined, which produces emanating rays with equal
power in all directions. This also results in power reduction at a quadratic rate as
a function of distance from the source. Such a point source is useful for modeling,
but cannot be easily achieved in the real world. Planar wavefronts can be achieved
by vibrating a large, flat plate, which results in the acoustic equivalent of
collimated light. An important distinction, however, is the attenuation of sound as it
Propagation speed Sound waves propagate at 343.2 meters per second through
air at 20°C (68°F). For comparison, light propagation is about 874,000 times
faster. We have planes and cars that can surpass the speed of sound, but are
nowhere near traveling at the speed of light. This is perhaps the most important
difference between sound and light for making VR systems. The result is that
human senses and engineered sensors easily measure differences in arrival times of
sound waves, leading to stronger emphasis on temporal information.
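The quantities introduced so far in this section can be checked with a few lines of arithmetic. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, the reference pressure is passed in as a parameter (the default of 20 micropascals is the standard value for air and is an assumption of this sketch), and the point source is idealized as radiating into free space with no reflections or absorption.

```python
import math

SPEED_OF_SOUND = 343.2           # m/s in air at 20 degrees C (value from the text)
SPEED_OF_LIGHT = 299_792_458.0   # m/s in vacuum (defined SI value)
P_REF = 2e-5                     # N/m^2; assumed standard reference (20 micropascals)


def sound_pressure_level(p_e, p_ref=P_REF):
    """Decibel level from (11.1): N_dB = 20 log10(p_e / p_r)."""
    return 20.0 * math.log10(p_e / p_ref)


def point_source_power_ratio(r_near, r_far):
    """Power per unit area at r_far relative to r_near (quadratic falloff)."""
    return (r_near / r_far) ** 2


def level_change_with_distance(r_near, r_far):
    """Change in dB when moving from r_near to r_far from an ideal point source.

    Power per unit area scales with pressure squared, so pressure falls as 1/r.
    """
    return 20.0 * math.log10(r_near / r_far)


def travel_time(distance, speed=SPEED_OF_SOUND):
    """Seconds needed for a wavefront to cover the given distance in meters."""
    return distance / speed


if __name__ == "__main__":
    print("ratio of light speed to sound speed:", SPEED_OF_LIGHT / SPEED_OF_SOUND)
    print("dB change per doubling of distance:", level_change_with_distance(1.0, 2.0))
    print("delay across a 10 m room (ms):", 1000.0 * travel_time(10.0))
```

Under these assumptions, doubling the pressure raises the level by about 6 dB, and doubling the distance from an ideal point source lowers it by the same amount. The travel-time function makes the point of the preceding paragraph concrete: delays of tens of milliseconds across a room are large compared with what hearing can resolve.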
Figure 11.3: (a) A pure tone (sinusoid) of unit amplitude and frequency 100 Hz. (b)
Three pure tones; in addition to the original blue, the green sinusoid has amplitude
1/3 and frequency 300 Hz, and the red one has amplitude 1/5 and frequency 500
Hz. (c) Directly adding the three pure tones approximates a square-like waveform.
(d) In the frequency spectrum, there are three non-zero points, one for each pure
tone.
Fourier analysis Spectral decompositions were important for characterizing
light sources and reflections in Section 4.1. In the case of sound, they are even
more important. A sinusoidal wave, as shown in Figure 11.3(a), corresponds to a
pure tone, which has a single associated frequency; this is analogous to a color from
the light spectrum. A more complex waveform, such as the sound of a piano note,
can be constructed from a combination of various pure tones. Figures 11.3(b) to
11.3(d) provide a simple example. This principle is derived from Fourier analysis,
which enables any periodic function to be decomposed into sinusoids (pure tones
in our case) by simply adding them up. Each pure tone has a particular frequency,
amplitude or scaling factor, and a possible timing for its peak, which is called its
phase. By simply adding up a finite number of pure tones, virtually any useful
waveform can be closely approximated. The higher-frequency, lower-amplitude
sinusoids are often called higher-order harmonics; the largest amplitude wave is
called the fundamental frequency. The plot of amplitude and phase as a function
of frequency is obtained by applying the Fourier transform, which will be briefly
covered in Section 11.4.
Where are the lenses? At this point, the most obvious omission in comparison
to Chapter 4 is the acoustic equivalent of lenses. As stated above, refraction occurs
for sound. Why is it that human ears do not focus sounds onto a spatial image in
the same way as the eyes? One problem is the long wavelengths in comparison to
light. Recall from Section 5.1 that the photoreceptor density in the fovea is close
to the wavelength of visible light. It is likely that an “ear fovea” would have to be
several meters across or more, which would make our heads too large. Another
problem is that low-frequency sound waves interact with objects in the world in a
more complicated way. Thus, rather than forming an image, our ears instead work
by performing Fourier analysis to sift out the structure of sound waves in terms of
sinusoids of various frequencies, amplitudes, and phases. Each ear is more like a
single-pixel camera operating at tens of thousands of “frames per second”, rather
than capturing a large image at a slower frame rate. The emphasis for hearing
is the distribution over time, whereas the emphasis is mainly on space for vision.
Nevertheless, both time and space are important for both hearing and vision.
11.2 The Physiology of Human Hearing
Human ears convert sound pressure waves into neural impulses, which ultimately
lead to a perceptual experience. The anatomy of the human ear is shown in
Figure 11.4. The ear is divided into outer, middle, and inner parts, based on the
flow of sound waves. Recall from Section 5.3 the complications of eye movements.
Although cats and some other animals can rotate their ears, humans cannot, which
simplifies this part of the VR engineering problem.
Outer ear The floppy part of the ear that protrudes from the human head is
called the pinna. It mainly serves as a funnel for collecting sound waves and
guiding them into the ear canal. It has the effect of amplifying sounds in the 1500
to 7500 Hz frequency range [362]. It also performs subtle filtering of the sound,
causing some variation in the high-frequency range that depends on the incoming
direction of the sound source. This provides a powerful cue regarding the direction
of a sound source.
After traveling down the ear canal, the sound waves cause the eardrum to
vibrate. The eardrum is a cone-shaped membrane that separates the outer ear
from the middle ear. It covers only 55 mm² of area. If this were a camera, it
would have a resolution of one pixel at this point because no additional spatial
information exists other than what can be inferred from the membrane vibrations.
Figure 11.5: The operation of the cochlea: (a) The perilymph transmits waves
that are forced by the oval window through a tube that extends the length of the
cochlea and back again, to the round window. (b) Because of varying thickness and
stiffness, the central spine (basilar membrane) is sensitive to particular frequencies
of vibration; this causes the mechanoreceptors, and ultimately auditory perception,
to be frequency sensitive.
Figure 11.6: A cross section of the organ of Corti. The basilar and tectorial
membranes move relative to each other, causing the hairs in the mechanoreceptors
to bend. (Figure from multiple Wikipedia users.)
rial membranes causes a shearing action that moves the hairs. Each ear contains
around 20,000 mechanoreceptors, which is considerably fewer than the 100 million
photoreceptors in the eye.
Spectral decomposition By exploiting the frequency-based sensitivity of the
basilar membrane, the brain effectively has access to a spectral decomposition
of the incoming sound waves. It is similar to, but not exactly the same as, the
Fourier decomposition which was discussed in Section 11.1. Several differences are
mentioned in Chapter 4 of [204]. If pure tones at two different frequencies are
simultaneously presented to the ear, then the basilar membrane produces a third
tone, which is sometimes audible [149]. Also, the neural impulses that result from
mechanoreceptor output are not linearly proportional to the frequency amplitude.
Furthermore, the detection of one tone may cause detections of nearby tones (in
terms of frequency) to be inhibited [277], much like lateral inhibition in horizontal
cells (recall from Section 5.2). Section 11.4.1 will clarify how these differences
make the ear more complex in terms of filtering.
Auditory pathways The neural pulses are routed from the left and right cochleae
up to the highest level, which is the primary auditory cortex in the brain. As usual,
hierarchical processing occurs as the signals are combined through neural
structures. This enables multiple frequencies and phase shifts to be analyzed. An early
structure called the superior olive receives signals from both ears so that
differences in amplitude and phase can be processed. This will become important in
Section 11.3 for determining the location of an audio source. At the highest level,
the primary auditory cortex is mapped out tonotopically (locations are based on
frequency), much in the same way as topographic mapping of the visual cortex.
11.3 Auditory Perception
Now that we have seen the hardware for hearing, the next part is to understand how
we perceive sound. In the visual case, we saw that perceptual experiences are often
surprising because they are based on adaptation, missing data, assumptions filled
in by neural structures, and many other factors. The same is true for auditory
experiences. Furthermore, auditory illusions exist in the same way as optical
illusions. The McGurk effect from Section 6.4 was an example that used vision to
induce incorrect auditory perception.
Figure 11.7: Due to the precedence effect, an auditory illusion occurs if the head
is placed between stereo speakers so that one is much closer than the other. If
they output the same sound at the same time, then the person perceives the sound
arriving from the closer speaker, rather than perceiving an echo.
Precedence effect A more common auditory illusion is the precedence effect, in
which only one sound is perceived if two nearly identical sounds arrive at slightly
different times; see Figure 11.7. Sounds often reflect from surfaces, causing
reverberation, which is the delayed arrival at the ears of many “copies” of the sound
due to the different propagation paths that were taken from reflections,
transmissions, and diffraction. Rather than hearing a jumble, people perceive a single
sound. This is based on the first arrival, which usually has the largest amplitude.
An echo is perceived if the timing difference is larger than the echo threshold (in
ception system performs critical band masking to effectively block out waves that
have frequencies outside of a particular range of interest. Another well-studied
problem is the perception of differences in pitch (or frequency). For example, for a
pure tone at 1000 Hz, could someone distinguish it from a tone at 1010 Hz? This
is an example of JND. It turns out that for frequencies below 1000 Hz, humans
can detect a change of frequency that is less than 1 Hz. The discrimination ability
decreases as the frequency increases. At 10,000 Hz, the JND is about 100 Hz.
In terms of percentages, this means that pitch perception is better than a 0.1%
difference at low frequencies, but increases to 1.0% for higher frequencies.
Also regarding pitch perception, a surprising auditory illusion occurs when the
fundamental frequency is removed from a complex waveform. Recall from Figure
11.3 that a square wave can be approximately represented by adding sinusoids of
smaller and smaller amplitudes, but higher frequencies. It turns out that people
perceive the tone of the fundamental frequency, even when it is removed, and only
the higher-order harmonics remain; several theories for this are summarized in
Chapter 5 of [204].
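One observation that is consistent with this illusion, without claiming to explain it, is that removing the fundamental does not change the period of the waveform. The following minimal sketch uses the three tones of Figure 11.3 (100, 300, and 500 Hz), drops the 100 Hz component, and examines what remains. The sampling rate, the signal length, and the naive discrete Fourier transform are illustrative choices, and all names are hypothetical.

```python
import cmath
import math


def synthesize(components, sample_rate, n_samples):
    """Sum of sinusoids; components is a list of (amplitude, frequency_hz)."""
    return [
        sum(a * math.sin(2.0 * math.pi * f * k / sample_rate) for a, f in components)
        for k in range(n_samples)
    ]


def dft_magnitudes(x):
    """Naive O(n^2) DFT magnitudes, scaled so a unit sinusoid gives 1.0."""
    n = len(x)
    mags = []
    for m in range(n // 2 + 1):
        s = sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
        mags.append(2.0 * abs(s) / n)
    return mags


SAMPLE_RATE = 8000   # Hz (illustrative choice)
N = 800              # 0.1 s of signal, so DFT bin m corresponds to 10 * m Hz

# The square-like wave of Figure 11.3 with its 100 Hz fundamental removed.
no_fundamental = synthesize([(1 / 3, 300), (1 / 5, 500)], SAMPLE_RATE, N)
spectrum = dft_magnitudes(no_fundamental)

# Number of samples in 1/100 s and in 1/200 s.
period_100 = SAMPLE_RATE // 100
period_200 = SAMPLE_RATE // 200

repeats_at_100 = all(
    abs(no_fundamental[k + period_100] - no_fundamental[k]) < 1e-9
    for k in range(N - period_100)
)
repeats_at_200 = all(
    abs(no_fundamental[k + period_200] - no_fundamental[k]) < 1e-9
    for k in range(N - period_200)
)

if __name__ == "__main__":
    print("magnitude at 100, 300, 500 Hz:", spectrum[10], spectrum[30], spectrum[50])
    print("repeats every 1/100 s:", repeats_at_100)
    print("repeats every 1/200 s:", repeats_at_200)
```

The spectrum has no energy at 100 Hz, yet the waveform still repeats every 1/100 of a second and at no shorter interval tested here, because 100 Hz is the greatest common divisor of the remaining frequencies.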
Figure 11.9: Spherical coordinates are used for the source point in auditory
localization. Suppose the head is centered on the origin and facing in the −z direction.
The azimuth θ is the angle with respect to the forward direction after projecting
the source into the xz plane. The elevation φ is the interior angle formed by a
vertical triangle that connects the origin to the source and to the projection of the
source into the plane. The radius r is the distance from the origin to the source.
4. Finally, a powerful monaural cue is provided by the reverberations entering
the ear as the sounds bounce around; this is especially strong in a room. Even
though the precedence effect prevents us from perceiving these reverberations, the
brain nevertheless uses the information for localization. This cue alone is
called echolocation, which is used naturally by some animals, including bats.
Some people can perform this by making clicking sounds or other sharp
noises; this allows acoustic wayfinding for blind people.
Binaural cues If both ears become involved, then a binaural cue for localization
results. The simplest case is the interaural level difference (ILD), which is the
difference in sound magnitude as heard by each ear. For example, one ear may
be facing a sound source, while the other is in the acoustic shadow (the shadow
caused by an object in front of a sound source is similar to the shadow from a light
source). The closer ear would receive a much stronger vibration than the other.
Another binaural cue is interaural time difference (ITD), which is closely re-
lated to the TDOA sensing approach described in Section 9.3. The distance be-
tween the two ears is approximately 21.5cm, which results in different arrival times
of the sound from a source. Note that sound travels 21.5cm in about 0.6ms, which
means that surprisingly small differences are used for localization.
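The size of the ITD can be estimated directly from geometry. The sketch below is a simplification under stated assumptions: the ears are modeled as two points on the x axis separated by 21.5 cm, the head faces the −z direction as in Figure 11.9, sound travels along straight lines, and the head itself does not obstruct or diffract the wave. The function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.2   # m/s, from Section 11.1
EAR_SPACING = 0.215      # m, the 21.5 cm figure quoted above


def ear_positions(spacing=EAR_SPACING):
    """Left and right ear positions as points on the x axis."""
    half = spacing / 2.0
    return (-half, 0.0, 0.0), (half, 0.0, 0.0)


def itd(source, spacing=EAR_SPACING, speed=SPEED_OF_SOUND):
    """Arrival time at the left ear minus arrival time at the right ear (seconds).

    A positive value means the sound reaches the right ear first.
    """
    left, right = ear_positions(spacing)
    return (math.dist(source, left) - math.dist(source, right)) / speed


if __name__ == "__main__":
    print("source straight ahead:", itd((0.0, 0.0, -2.0)))
    print("source directly to the right (ms):", 1000.0 * itd((5.0, 0.0, 0.0)))
    print("source 30 degrees to the right (ms):",
          1000.0 * itd((2.0 * math.sin(math.radians(30)), 0.0,
                        -2.0 * math.cos(math.radians(30)))))
```

In this model the ITD is zero for any source in the vertical plane that bisects the head, and it reaches its largest magnitude, the spacing divided by the speed of sound, when the source lies on the line through both ears.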
Figure 11.10: Plots of the minimum audible angle (MAA) as a function of
frequency. Each plot corresponds to a different azimuth angle.
Suppose that the brain measures the difference in arrival times as 0.3 ms. What
is the set of possible places where the source could have originated? This can be
solved by setting up algebraic equations, which results in a conical surface known
as a hyperboloid. If it is not known which sound came first, then the set of possible
places is a hyperboloid of two disjoint sheets. Since the brain knows which one
came first, the two sheets are narrowed down to one hyperboloid sheet, which is
called the cone of confusion; see Figure 11.11 (in most cases, it approximately
looks like a cone, even though it is a hyperboloid). Uncertainty within this cone can
be partly resolved, however, by using the distortions of the pinna.
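The algebra behind the cone of confusion can be sketched as follows, under the same simplifying assumptions as before: the ears are two points on the x axis, 21.5 cm apart, and sound travels in straight lines. A source heard Δt earlier at the right ear must be closer to it by c·Δt, which is the defining property of one sheet of a hyperboloid of revolution whose foci are the ears. The parameterization and function names below are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.2   # m/s
EAR_SPACING = 0.215      # m


def hyperboloid_point(delta_t, u, v, spacing=EAR_SPACING, speed=SPEED_OF_SOUND):
    """Point on the sheet of sources heard delta_t seconds earlier at the right ear.

    u >= 0 moves away from the head along the sheet; v rotates about the x axis.
    Requires speed * delta_t < spacing.
    """
    f = spacing / 2.0                # distance from the origin to each ear (focus)
    a = speed * delta_t / 2.0        # half of the path-length difference
    b = math.sqrt(f * f - a * a)
    x = a * math.cosh(u)
    rho = b * math.sinh(u)           # distance from the x axis
    return (x, rho * math.cos(v), rho * math.sin(v))


def path_difference(p, spacing=EAR_SPACING):
    """Distance to the left ear minus distance to the right ear, in meters."""
    half = spacing / 2.0
    return math.dist(p, (-half, 0.0, 0.0)) - math.dist(p, (half, 0.0, 0.0))


def asymptotic_cone_angle(delta_t, spacing=EAR_SPACING, speed=SPEED_OF_SOUND):
    """Angle in degrees between the interaural axis and the cone that the sheet
    approaches far from the head."""
    return math.degrees(math.acos(speed * delta_t / spacing))


if __name__ == "__main__":
    dt = 0.0003   # the 0.3 ms example from the text
    for u in (0.0, 1.0, 3.0):
        p = hyperboloid_point(dt, u, v=0.7)
        print(p, "implied delay (ms):", 1000.0 * path_difference(p) / SPEED_OF_SOUND)
    print("asymptotic cone angle (degrees):", asymptotic_cone_angle(dt))
```

Every point generated this way yields the same 0.3 ms difference, regardless of u and v, which is why the time difference alone cannot distinguish among them. Far from the head the sheet is nearly indistinguishable from a cone about the axis through the ears.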
Figure 11.12: An overview of a linear filter and its relationship to Fourier analysis.
The top row of blocks corresponds to the time domain, whereas the bottom row
is the frequency (or spectral) domain.
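The two rows of Figure 11.12 can be compared numerically. The following minimal sketch applies a causal filter of the form (11.3) directly in the time domain, and then obtains the same output by transforming, multiplying by the transform of the impulse response, and transforming back. It assumes a finite-length signal that is zero before the first sample, and it uses a naive finite discrete Fourier transform in place of the infinite sum in (11.6); the function names and the test signal are hypothetical.

```python
import cmath
import math


def fir_filter(coeffs, x):
    """Causal linear filter (11.3): y[k] = sum_i coeffs[i] * x[k - i],
    treating x[k] as zero for k < 0."""
    return [
        sum(c * x[k - i] for i, c in enumerate(coeffs) if k - i >= 0)
        for k in range(len(x))
    ]


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def inverse_dft(spectrum):
    """Naive inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
        for k in range(n)
    ]


def filter_via_spectrum(coeffs, x):
    """Apply the same filter in the frequency domain."""
    # Zero-pad so that the circular convolution implied by the DFT
    # matches the ordinary convolution of (11.3).
    n = len(x) + len(coeffs) - 1
    x_hat = dft(list(x) + [0.0] * (n - len(x)))
    h_hat = dft(list(coeffs) + [0.0] * (n - len(coeffs)))  # sampled transfer function
    y = inverse_dft([a * b for a, b in zip(x_hat, h_hat)])
    return [value.real for value in y[:len(x)]]


MOVING_AVERAGE = [1 / 3, 1 / 3, 1 / 3]                # the filter (11.4)
EXPONENTIAL_SMOOTHING = [1 / 2, 1 / 4, 1 / 8, 1 / 16]  # the filter (11.5)

impulse = [1.0] + [0.0] * 7
signal = [math.sin(2 * math.pi * k / 16) + 0.5 * math.sin(2 * math.pi * k / 4)
          for k in range(32)]

if __name__ == "__main__":
    print("impulse response:", fir_filter(MOVING_AVERAGE, impulse))
    time_domain = fir_filter(MOVING_AVERAGE, signal)
    freq_domain = filter_via_spectrum(MOVING_AVERAGE, signal)
    print("largest disagreement:",
          max(abs(a - b) for a, b in zip(time_domain, freq_domain)))
```

Feeding in the impulse returns the filter coefficients themselves, which is the sense in which the impulse response characterizes the filter. The two routes to the output agree up to floating-point error; in practice the naive transform here would be replaced by an FFT.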
would contain a thousand values for every second. Using an index variable k, we
can refer to the kth sample as x[k], which corresponds to x(k∆t). Arbitrarily, the
first sample is x[0] = x(0).
Linear filters In the context of signal processing, a filter is a transformation
that maps one signal to another. Each signal is a function of time, and the filter
is like a black box that receives one signal as input, and produces another as
output. If x represents an entire signal (over all times), then let F(x) represent
the resulting signal after running it through the filter.
A linear filter is a special kind of filter that satisfies two algebraic properties.
The first algebraic property is additivity, which means that if two signals are added
and sent through the filter, the result should be the same as if they were each sent
through the filter independently, and then the resulting transformed signals were
added. Using notation, this is F(x + x′) = F(x) + F(x′) for any two signals x and
x′. For example, if two different sounds are sent into the filter, the result should
be the same whether they are combined before or after the filtering. This concept
will become useful as multiple sinusoids are sent through the filter.
The second algebraic property is homogeneity, which means that if the signal
is scaled by a constant factor before being sent through the filter, the result would
be the same as if it were scaled by the same factor afterwards. Using notation,
this means that cF(x) = F(cx) for every constant c and signal x. For example,
this means that if we double the sound amplitude, then the output sound from
the filter doubles its amplitude as well.
A linear filter generally takes the form
$$y[k] = c_0 x[k] + c_1 x[k-1] + c_2 x[k-2] + c_3 x[k-3] + \cdots + c_n x[k-n], \tag{11.3}$$
in which each $c_i$ is a constant, and n + 1 is the number of samples involved in the
filter. One may consider the case in which n tends to infinity, but it will not be
pursued here. Not surprisingly, (11.3) is a linear equation. This particular form is a
causal filter because the samples on the right occur no later than the sample y[k]. A
non-causal filter would require dependency on future samples, which is reasonable
for a recorded signal, but not for live sampling (the future is unpredictable!).
Here are some examples of linear filters (special cases of (11.3)). This one takes
a moving average of the last three samples:
$$y[k] = \frac{1}{3} x[k] + \frac{1}{3} x[k-1] + \frac{1}{3} x[k-2]. \tag{11.4}$$
Alternatively, this is an example of exponential smoothing (also called exponentially
weighted moving average):
$$y[k] = \frac{1}{2} x[k] + \frac{1}{4} x[k-1] + \frac{1}{8} x[k-2] + \frac{1}{16} x[k-3]. \tag{11.5}$$
of discretization will be ignored.
Finite impulse response An important and useful result is that the behavior
of a linear filter can be fully characterized in terms of its finite impulse response
(FIR). The filter in (11.3) is often called an FIR filter. A finite impulse is a signal
for which x[0] = 1 and x[k] = 0 for all k > 0. Any other signal can be expressed as
a linear combination of time-shifted finite impulses. If a finite impulse is shifted,
for example x[2] = 1, with x[k] = 0 for all other k ≠ 2, then a linear filter
produces the same result, but it is just delayed two steps later. A finite impulse
can be rescaled due to filter linearity, with the output simply being rescaled. The
results of sending scaled and shifted impulses through the filter are also obtained
directly due to linearity.
Nonlinear filters Any (causal) filter that does not follow the form (11.3) is
called a nonlinear filter. Recall from Section 11.2, that the operation of the human
auditory system is almost a linear filter, but exhibits characteristics that make it
into a nonlinear filter. Linear filters are preferred because of their close connection
to spectral analysis, or frequency components, of the signal. Even if the human
auditory system contains some nonlinear behavior, analysis based on linear filters
is nevertheless valuable.
Returning to Fourier analysis Now consider the bottom part of Figure 11.12.
The operation of a linear filter is easy to understand and compute in the frequency
domain. This is the function obtained by performing the Fourier transform on
the signal, which provides an amplitude for every combination of frequency and
phase. This transform was briefly introduced in Section 11.1 and illustrated in
Figure 11.3. Formally, it is defined for discrete-time systems as
$$X(f) = \sum_{k=-\infty}^{\infty} x[k] e^{-i 2\pi f k}, \tag{11.6}$$
in which X(f) is the resulting spectral distribution, which is a function of the
frequency f. The exponent involves $i = \sqrt{-1}$ and is related to sinusoids through
Euler’s formula:
$$e^{-i 2\pi f k} = \cos(-2\pi f k) + i \sin(-2\pi f k). \tag{11.7}$$
Unit complex numbers are used as an algebraic trick to represent the phase. The
inverse Fourier transform is similar in form and converts the spectral distribution
back into the time domain. These calculations are quickly performed in practice
by using the Fast Fourier Transform (FFT) [11, 189].
Transfer function In some cases, a linear filter is designed by expressing how
it modifies the spectral distribution. It could amplify some frequencies, while
suppressing others. In this case, the filter is defined in terms of a transfer function,
which is applied as follows: 1) transforming the original signal using the Fourier
transform, 2) multiplying the result by the transfer function to obtain the distorted
spectral distribution, and then 3) applying the inverse Fourier transform to obtain
the result as a function of time. The transfer function can be calculated from the
linear filter by applying the discrete Laplace transform (called z-transform) to the
finite impulse response [11, 189].

11.4.2 Acoustic modeling

The geometric modeling concepts from Section 3.1 apply to the auditory side of
VR, in addition to the visual side. In fact, the same models could be used for both.
Walls that reflect light in the virtual world also reflect sound waves. Therefore,
both could be represented by the same triangular mesh. This is fine in theory, but
fine levels of detail or spatial resolution do not matter as much for audio. Due
to high visual acuity, geometric models designed for visual rendering may have a
high level of detail. Recall from Section 5.4 that humans can distinguish 30 stripes
or more per degree of viewing angle. In the case of sound waves, small structures
are essentially invisible to sound. One recommendation is that the acoustic model
needs to have a spatial resolution of only 0.5 m [334]. Figure 11.13 shows an
example. Thus, any small corrugations, door knobs, or other fine structures can
be simplified away. It remains an open challenge to automatically convert a 3D
model designed for visual rendering into one optimized for auditory rendering.

Figure 11.13: An audio model is much simpler. (From Pelzer, Aspock, Schroder,
and Vorlander, 2014, [248])

Now consider a sound source in the virtual environment. This could, for example,
be a “magical” point that emits sound waves or a vibrating planar surface.
The equivalent of white light is called white noise, which in theory contains equal
weight of all frequencies in the audible spectrum. Pure static from an analog TV
or radio is an approximate example of this. In practical settings, the sound of
interest has a high concentration among specific frequencies, rather than being
uniformly distributed.

How does the sound interact with the surface? This is analogous to the shading
problem from Section 7.1. In the case of light, diffuse and specular reflections
occur with a dependency on color. In the case of sound, the same two possibilities
exist, again with a dependency on the wavelength (or equivalently, the frequency).
For a large, smooth, flat surface, a specular reflection of sound waves occurs,
with the outgoing angle being equal to the incoming angle. The reflected sound
usually has a different amplitude and phase. The amplitude may be decreased by
a constant factor due to absorption of sound into the material. The factor usually
depends on the wavelength (or frequency). The back of [334] contains coefficients
of absorption, given with different frequencies, for many common materials.

In the case of smaller objects, or surfaces with repeated structures, such as
bricks or corrugations, the sound waves may scatter in a way that is difficult to
characterize. This is similar to diffuse reflection of light, but the scattering pattern
for sound may be hard to model and calculate. One unfortunate problem is that the
scattering behavior depends on the wavelength. If the wavelength is much smaller
or much larger than the size of the structure (entire object or corrugation), then
the sound waves will mainly reflect. If the wavelength is close to the structure
size, then significant, complicated scattering may occur.

At the extreme end of modeling burdens, a bidirectional scattering distribution
function (BSDF) could be constructed. The BSDF could be estimated from equivalent
materials in the real world by a combination of a speaker placed in different
locations and a microphone array to measure the scattering in a particular direction.
This might work well for flat materials that are large with respect to the
wavelength, but it will still not handle the vast variety of complicated structures
and patterns that can appear on a surface.

Capturing sound Sounds could also be captured in the real world using microphones
and then brought into the physical world. For example, the matched zone
might contain microphones that become speakers at the equivalent poses in the
real world. As in the case of video capture, making a system that fully captures the
sound field is challenging. Simple but effective techniques based on interpolation
of sounds captured by multiple microphones are proposed in [256].

11.4.3 Auralization

Propagation of sound in the virtual world As in visual rendering, there
are two main ways to handle the propagation of waves. The more expensive
way is based on simulating the physics as accurately as possible, which involves
computing numerical solutions to partial differential equations that precisely model
wave propagation. The cheaper way is to shoot visibility rays and characterize
the dominant interactions between sound sources, surfaces, and ears. The choice
between the two methods also depends on the particular setting; some systems
involve both kinds of computations [208, 334]. If the waves are large relative to
the objects in the environment, then numerical methods are preferred. In other
words, the frequencies are low and the geometric models have a high level of detail.
At higher frequencies or with larger, simpler models, visibility-based methods are
preferable.
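The rule of thumb for choosing between numerical and visibility-based propagation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the speed of sound (343 m/s), the function names, and the threshold parameter `ratio` are hypothetical and do not come from the text, which says only that numerical methods are preferred when the waves are large relative to the objects.

```python
SPEED_OF_SOUND = 343.0  # m/s in air; assumed value, not from the text


def wavelength(frequency_hz, speed=SPEED_OF_SOUND):
    """Wavelength in meters of a sound wave with the given frequency."""
    return speed / frequency_hz


def suggest_method(frequency_hz, feature_size_m, ratio=1.0):
    """Suggest a propagation method for one frequency and one feature size.

    Returns "numerical" when the wavelength exceeds ratio * feature_size_m
    (waves are large relative to the objects) and "visibility" otherwise.
    The threshold ratio is a hypothetical tuning knob.
    """
    if wavelength(frequency_hz) > ratio * feature_size_m:
        return "numerical"
    return "visibility"


if __name__ == "__main__":
    for f in (100.0, 1000.0, 8000.0):
        print(f, round(wavelength(f), 3), suggest_method(f, 0.5))
```

With a feature size of 0.5 m, a 100 Hz tone has a wavelength of roughly 3.4 m and the sketch suggests the numerical approach, whereas at 8 kHz the wavelength is about 4 cm and it suggests the visibility-based one. A real system would apply such a test per frequency band and may combine both methods.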
Figure 11.14: Computed results for sound propagation by numerically solving the
Helmholtz wave equation (taken from [208]): (a) The pressure magnitude before
obstacle interaction is considered. (b) The pressure after taking into account
scattering. (c) The scattering component, which is the pressure from (b) minus
the pressure from (a).

Numerical wave propagation The Helmholtz wave equation expresses constraints
at every point in $\mathbb{R}^3$ in terms of partial derivatives of the pressure function.
Its frequency-dependent form is

$$\nabla^2 p + \frac{\omega^2}{s^2}\, p = 0, \qquad (11.8)$$

in which $p$ is the sound pressure, $\nabla^2$ is the Laplacian operator from calculus, and
$\omega$ is related to the frequency $f$ as $\omega = 2\pi f$.

Closed-form solutions to (11.8) do not exist, except in trivial cases. Therefore,
numerical computations are performed by iteratively updating values over
the space; a brief survey of methods in the context of auditory rendering appears
in [208]. The wave equation is defined over the obstacle-free portion of the
virtual world. The edge of this space becomes complicated, leading to boundary
conditions. One or more parts of the boundary correspond to sound sources,
which can be considered as vibrating objects or obstacles that force energy into
the world. At these locations, the 0 in (11.8) is replaced by a forcing function.
At the other boundaries, the wave may undergo some combination of absorption,
reflection, scattering, and diffraction. These are extremely difficult to model; see
[264] for details. In some rendering applications, these boundary interactions may
be simplified and handled with simple Dirichlet boundary conditions and Neumann
boundary conditions [361]. If the virtual world is unbounded, then an additional
Sommerfeld radiation condition is needed. For detailed models and equations for
sound propagation in a variety of settings, see [264]. An example of a numerically
computed sound field is shown in Figure 11.14.

Visibility-based wave propagation The alternative to numerical computations,
which gradually propagate the pressure numbers through the space, is
visibility-based methods, which consider the paths of sound rays that emanate
from the source and bounce between obstacles. The methods involve determining
ray intersections with the geometric model primitives, which is analogous to ray
tracing operations from Section 7.1.

It is insightful to look at the impulse response of a sound source in a virtual
world. If the environment is considered as a linear filter, then the impulse response
provides a complete characterization for any other sound signal [209, 248, 258].
Figure 11.15 shows the simple case of the impulse response for reflections in a
rectangular room. Visibility-based methods are particularly good at simulating
the reverberations, which are important to reproduce for perceptual reasons. More
generally, visibility-based methods may consider rays that correspond to all of the
cases of reflection, absorption, scattering, and diffraction. Due to the high computational
cost of characterizing all rays, stochastic ray tracing offers a practical
alternative by randomly sampling rays and their interactions with materials [334];
this falls under the general family of Monte Carlo methods, which are used, for example,
to approximate solutions to high-dimensional integration and optimization
problems.

Figure 11.15: Reverberations. (From Pelzer, Aspock, Schroder, and Vorlander,
2014, [248])

Entering the ear Sound that is generated in the virtual world must be transmitted
to each ear in the physical world. It is as if a virtual microphone positioned
in the virtual world captures the simulated sound waves. These are then converted
into audio output through a speaker that is positioned in front of the ear. Recall
from Section 11.3 that humans are able to localize sound sources from auditory
cues. How would this occur for VR if all of the sound emanates from a fixed
speaker? The ILD and ITD cues could be simulated by ensuring that each ear receives
the appropriate sound magnitude and phase so that differences in amplitude
and timing are correct. This implies that the physical head must be reproduced
at some level of detail in the virtual world so that these differences are correctly
calculated. For example, the distance between the ears and size of the head may
become important.

HRTFs This solution would still be insufficient to resolve ambiguity within the
cone of confusion. Recall from Section 11.3 that the pinna shape distorts sounds
in a direction-dependent way. To fully take into account the pinna and other parts
of the head that may distort the incoming sound, the solution is to develop a
head-related transfer function (HRTF). The idea is to treat this distortion as a linear
filter, which can be characterized in terms of its transfer function (recall Figure
11.12). This is accomplished by placing a human subject into an anechoic chamber
and placing sound sources at different locations in the space surrounding the head.
At each location, an impulse is generated on a speaker, and the impulse response
is recorded with a small microphone placed inside of the ear canal of a human or
dummy. The locations are selected by incrementally varying the distance, azimuth,
and elevation; recall the coordinates for localization from Figure 11.10. In many
cases, a far-field approximation may be appropriate, in which case a large value is
fixed for the distance. This results in an HRTF that depends on only the azimuth
and elevation.

It is, of course, impractical to build an HRTF for every user. There is significant
motivation to use a single HRTF that represents the “average listener”; however,
the difficulty is that it might not be sufficient in some applications because it is not
designed for individual users (see Section 6.3.2 of [334]). One compromise might
be to offer a small selection of HRTFs to users, to account for variation among
the population, but they may be incapable of picking the one most suitable for
their particular pinnae and head. Another issue is that the transfer function may
depend on factors that frequently change, such as wearing a hat, putting on a jacket
with a hood or large collar, or getting a haircut. Recall that adaptation occurs
throughout human perception and nearly all aspects of VR. If people adapt to
frequent changes in the vicinity of their heads in the real world, then perhaps they
would also adapt to an HRTF that is not perfect. Significant research questions
remain in this area.

Tracking issues The final challenge is to ensure that the physical and virtual
ears align in the matched zone. If the user turns her head, then the sound should
be adjusted accordingly. If the sound emanates from a fixed source, then it should
be perceived as fixed while turning the head. This is another example of the
perception of stationarity. Accordingly, tracking of the ear pose (position and
orientation) is needed to determine the appropriate “viewpoint”. This is equivalent
to head tracking with simple position and orientation offsets for the right and left
ears. As for vision, there are two choices. The head orientation alone may be
tracked, with the full pose of each ear determined by a head model (recall Figure
9.8). Alternatively, the full head pose may be tracked, directly providing the
pose of each ear through offset transforms. To optimize performance, user-specific
parameters can provide a perfect match: The distance along the z axis from the
eyes to the ears and the distance between ears. The latter is analogous to the IPD,
the distance between pupils for the case of vision.

Further Reading

For mathematical and computational foundations of acoustics, see [264, 321]. Physiology
and psychoacoustics are covered in [218, 362] and Chapters 4 and 5 of [204]. Localization
is covered thoroughly in [23]. The cone of confusion is discussed in [290]. Echo thresholds
are covered in [268, 358].

Some basic signal processing texts are [11, 189]. For an overview of auditory displays,
see [335]. Convenient placement of audio sound sources from a psychophysical
perspective is covered in [256]. Auditory rendering is covered in detail in the book [334].
Some key articles on auditory rendering are [87, 209, 248, 257, 258].
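As a supplement to the discussion of ILD and ITD cues and of ear tracking in Section 11.4.3, the following Python sketch derives the two ear positions from a tracked head pose using fixed offsets and then estimates the interaural time difference for a point source. Everything specific here is an assumption made for illustration: the coordinate convention (x to the right, y up, looking along the negative z axis), the default ear distance of 0.15 m and eye-to-ear depth of 0.10 m, the speed of sound, the restriction to yaw rotation, and the function names. It uses straight-line distances only, so it ignores head shadowing and the pinna, and therefore says nothing about ILD or the HRTF.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air; assumed value, not from the text


def yaw_matrix(yaw):
    """Rotation about the vertical (y) axis; positive yaw turns the head left."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def ear_positions(head_pos, yaw, ear_distance=0.15, ear_depth=0.10):
    """Return (left, right) ear positions in world coordinates.

    head_pos is the tracked point between the eyes. The offsets are the two
    user-specific parameters: the distance between the ears and the distance
    along the z axis from the eyes back to the ears (hypothetical defaults).
    """
    head_pos = np.asarray(head_pos, dtype=float)
    rot = yaw_matrix(yaw)
    left_offset = np.array([-ear_distance / 2.0, 0.0, ear_depth])
    right_offset = np.array([ear_distance / 2.0, 0.0, ear_depth])
    return head_pos + rot @ left_offset, head_pos + rot @ right_offset


def itd(source, head_pos, yaw, ear_distance=0.15, ear_depth=0.10):
    """Arrival-time difference in seconds (left ear minus right ear).

    A positive value means the sound reaches the right ear first.
    """
    left, right = ear_positions(head_pos, yaw, ear_distance, ear_depth)
    source = np.asarray(source, dtype=float)
    d_left = np.linalg.norm(source - left)
    d_right = np.linalg.norm(source - right)
    return (d_left - d_right) / SPEED_OF_SOUND


if __name__ == "__main__":
    print(itd([0.0, 0.0, -2.0], [0.0, 0.0, 0.0], 0.0))  # source straight ahead
    print(itd([2.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0.0))   # source to the right
```

Under these assumptions, a source two meters straight ahead yields an ITD of zero, and a source two meters to the right reaches the right ear roughly 0.4 ms before the left. If the head then yaws 90 degrees to the left, the same source lies directly behind the head and the ITD returns to zero. This is the kind of front-back ambiguity that ITD alone cannot resolve.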
Chapter 12

Evaluating VR Systems and Experiences

Which headset is better? Which VR experience is more comfortable over a long
period of time? How much field of view is enough? What is the most appropriate
interaction mechanism? Engineers and developers want to know the answers to
these kinds of questions; however, it should be clear at this point that these are
difficult to answer because of the way that human physiology and perception operate
and interact with engineered systems. By contrast, pure engineering questions,
such as “What is the estimated battery life?” or “What is the vehicle’s top speed
on level ground?”, are much more approachable.

Recall the definition of VR from Section 1.1, which involves an organism. When
VR is applied by scientists to study the neural structures and perception of a rat,
there is a clear separation between the rat and the scientist. However, in the case
of VR for humans, the developer frequently tries out his own creations. In this
case, the developer alternates between the role of scientist and rat. This introduces
numerous problems, especially if the developer is naive about perceptual issues.

Further complicating matters is adaptation, which occurs on all scales. For
example, a person evaluating a VR experience many times over several weeks may
initially find it uncomfortable, but later become accustomed to it. Of course this
does not imply that its likelihood of making a fresh user sick is lower. There is also
great variation across people. Any one person, including the developer, provides
just one data point. People who are immune to sickness from vection will have no
trouble developing such systems and inflicting them upon others.

Another factor is that most people who create systems are biased toward liking
what they create. Furthermore, as discussed in Section 8.4, just having the knowledge
of what the experience represents can affect vection. These issues fall under
the general heading of human factors, which has been studied for decades. One
closely related area is human-computer interaction (HCI), which uses the methods
discussed in this section. However, since VR works by disrupting the low-level
operation of sensory systems that we have trusted for our entire lives, the level of
complications from the lowest-level side effects to the highest-level cognitive effects
seems unprecedented.

Opportunities for failure exist at all levels, from hardware, to low-level software,
to content creation engines. As hardware and low-level software rapidly improve,
the burden is shifting more to developers of software engines and VR experiences.
This chapter presents several topics that may aid engineers and developers in their
quest to build better VR systems and experiences. Section 12.1 introduces methods
for guiding them to improve their discriminatory power. Rather than adapting
to become oblivious to a problem, a developer could train herself to become more
sensitive to problems. Section 12.2 applies the fundamentals from this book to
provide simple advice for VR developers. Section 12.3 covers VR sickness, including
the main symptoms and causes, so that VR systems and experiences may be
improved. Section 12.4 introduces general methods for designing experiments that
involve human subjects, and includes some specific methods from psychophysics.
All of the concepts from this chapter should be used to gain critical feedback and
avoid pitfalls in an iterative VR development process.

12.1 Perceptual Training

Most people who try VR for the first time are unaware of technical flaws that would
be obvious to some experienced engineers and developers. If the VR experience
is functioning as it should, then the user should be overwhelmed by dominant
visual stimuli and feel as if he is inhabiting the virtual world. Minor flaws may be
subtle or unnoticeable as attention is focused mainly on the targeted experience
(as considered in the definition of VR from Section 1.1). Some parts might not
be functioning as designed or some perceptual issues might have been neglected.
This might result in an experience that is not as good as it could have been after
performing some simple adjustments. Even worse, the flaws might cause the user
to become fatigued or sick. At the end, such users are usually not consciously
aware of what went wrong. They might blame anything, such as particular visual
stimuli, a particular experience, the headset hardware, or even the whole concept
of VR.

This problem can be mitigated by training specific users and developers to
notice common types of flaws. By developing a program of perceptual training,
a user could be requested to look for a particular artifact or shortcoming, or to
repeatedly practice performing some task. Throughout this book, we have seen
the importance of adaptation in human perceptual processes. For example, if
a constant stimulus is presented over a long period of time, then its perceived
intensity diminishes.

Through repeated and guided exposure to a particular VR system and experience,
users can adapt their perceptual systems. This is a form of perceptual learning,
which is a branch of perceptual psychology that studies long-lasting changes
to the perceptual systems of an organism in response to its environment. As VR
becomes a new environment for the organism, the opportunities and limits of
perceptual learning remain largely unexplored. Through active training, the way in
which users adapt can be controlled so that their perceptual abilities and
discrimination power increase. This in turn can be used to train evaluators who provide
frequent feedback in the development process. An alternative is to develop an
automated system that can detect flaws without human intervention. It is likely
that a combination of both human and automatic evaluation will be important in
the years to come.
Another common error is to have the right and left eye images reversed. It is
easy to have this problem after making a sign error in (3.50), or misunderstanding
which way the viewpoint needs to shift for each eye. The phenomenon is known
as pseudoscopic vision, in which the perceived concavity of objects may seem
reversed. In many cases, however, it is difficult to visually detect the error.
Solution: Approach the edge of an object so that one side of it is visible to one eye
only. This can be verified by using the eye-closing trick. Based on the geometry
of the object, make sure that the side is visible to the correct eye. For example,
the left eye should not be the only one to see the right side of a box.

Finally, stereoscopic vision could have an incorrect distance between the virtual
pupils (the t parameter in (3.50)). If t = 0, then the eye-closing trick could be
used to detect that the two images look identical. If t is too large or too small,
then depth and scale perception (Section 6.1) are affected. A larger separation t
would cause the world to appear smaller; a smaller t would cause the opposite.

Canonical head motions Now consider errors that involve movement, which
could be caused by head tracking errors, the rendering perspective, or some
combination. It is helpful to make careful, repeatable motions, which will be called
canonical head motions. If rotation alone is tracked, then there are three rotational
DOFs. To spot various kinds of motion or viewpoint errors, the evaluator
should be trained to carefully perform individual, basic rotations. A pure yaw can
be performed by making a “no” gesture. A pure pitch appears as a pure “yes”
gesture. A pure roll is more difficult to accomplish, which involves turning the
head back and forth so that one eye is higher than the other at the extremes. In
any of these movements, it may be beneficial to translate the cyclopean viewpoint
(point between the center of the eyes) as little as possible, or to follow as closely
as possible the translation induced by the head model of Section 9.1.

For each of these basic rotations, the evaluator should practice performing
them at various, constant angular velocities and amplitudes. For example, she
should try to yaw her head very slowly, at a constant rate, up to 45 degrees each way.
Alternatively, she should try to rotate at a fast rate, up to 10 degrees each way,
perhaps with a frequency of 2 Hz. Using canonical head motions, common errors
that were given in Figure 9.7 could be determined. Other problems, such as a
discontinuity in the tracking, tilt errors, latency, and the incorrect depth of the
viewpoint can be more easily detected in this way.

If position is tracked as well, then three more kinds of canonical head motions
become important, one for each position DOF. Thus, horizontal, vertical, and
depth-changing motions can be performed to identify problems. For example, with
horizontal, side-to-side motions, it can be determined whether motion parallax is
functioning correctly.

VOR versus smooth pursuit Recall from Sections 5.3, 5.4, and 6.2 that eye
movements play an important role in visual perception. An evaluator should keep in
mind the particular eye movement mode when evaluating whether an object in
the virtual world is actually stationary when it is supposed to be. If a canonical
yaw motion is made while eyes are fixated on the object, then the vestibulo-ocular
reflex (VOR) is invoked. In this case, the evaluator can determine whether
the object appears to move or distort its shape while the image of the object is
fixed on the retina. Similarly, if an object is slowly moving by and the head is
fixed, the evaluator performs smooth pursuit to keep the object on the retina. As
indicated in Section 5.4, the way in which an object appears to distort for a
line-by-line scanout display depends on whether the motion is due to VOR or smooth
pursuit. If the object moves by very quickly and the eyes do not keep it fixed on
the retina, then it may be possible to perceive the zipper effect.

Peripheral problems The current generation of VR headsets has significant
optical aberration issues; recall from Section 4.3 that these become worse as the
distance from the optical axis increases. It is important to distinguish between two
cases: 1) Looking through the center of the lens while detecting distortion at the
periphery, and 2) rotating the eyes to look directly through the edge of the lens.
Distortion might be less noticeable in the first case because of lower photoreceptor
density at the periphery; however, mismatches could nevertheless have an impact
on comfort and sickness. Optical flow signals are strong at the periphery, and
mismatched values may be perceived as incorrect motions.

In the second case, looking directly through the lens might reveal lack of focus
at the periphery, caused by spherical aberration. Also, chromatic aberration
may become visible, especially for sharp white lines against a black background.
Furthermore, errors in pincushion distortion correction may become evident as
a straight line appears to become curved. These problems cannot be fixed by a
single distortion correction function (as covered in Section 7.3) because the pupil
translates away from the optical axis when the eye rotates. A different, asymmetric
correction function would be needed for each eye orientation, which would
require eye tracking to determine which correction function to use at each time
instant.

To observe pincushion or barrel distortion, the evaluator should apply a canonical
yaw motion over as large an amplitude as possible, while fixating on an
object. In this case, the VOR will cause the eye to rotate over a large range
while sweeping its view across the lens from side to side, as shown in Figure 12.2.
If the virtual world contains a large, flat wall with significant texture or spatial
frequency, then distortions could become clearly visible as the wall appears to be
“breathing” during the motion. The effect may be more noticeable if the wall has
a regular grid pattern painted on it.

Finally, many users do not even notice the limited field of view of the lens.
Recall from Section 5.4 that any flat screen placed in front of the eye will only
cover some of the eye’s field of view. Therefore, photoreceptors at the periphery
will not receive any direct light rays from the display. In most cases, it is dark
inside of the headset, which results in the perception of a black band around the
visible portion of the display. Once this is pointed out to users, it becomes difficult
for them to ignore it.

Figure 12.2: A top-down view that shows how the eye rotates when fixated on
a stationary object in the virtual world, and the head is yawed counterclockwise
(facing right to facing left). Lens distortions at the periphery interfere with the
perception of stationarity.

Latency perception The direct perception of latency varies wildly among people.
Even when it is not perceptible, it has been one of the main contributors to
VR sickness [171]. Adaptation causes great difficulty because people can adjust to
a constant amount of latency through long exposure; returning to the real world
might be difficult in this case. For a period of time, most of the real world may not
appear to be stationary!

In my own efforts at Oculus VR, I could detect latency down to about 40 ms
when I started working with the prototype Oculus Rift in 2012. By 2014, I was
able to detect latency down to as little as 2 ms by the following procedure. The
first step is to face a vertical edge, such as a door frame, in the virtual world. The
evaluator should keep a comfortable distance, such as two meters. While fixated
on the edge, a canonical yaw motion should be performed with very low amplitude,
such as a few degrees, and a frequency of about 2 Hz. The amplitude and frequency
of motions are important. If the amplitude is too large, then optical distortions
may interfere. If the speed is too high, then the headset might start to flop around
with respect to the head. If the speed is too low, then the latency might not be
easily noticeable. When performing this motion, the edge should appear to be
moving out of phase with the head if there is significant latency.

Recall that many VR systems today achieve zero effective latency, as mentioned
in Section 7.4; nevertheless, perceptible latency may occur on many systems due
to the particular combination of hardware, software, and VR content. By using
prediction, it is even possible to obtain negative effective latency. Using arrow
keys that increment or decrement the prediction interval, I was able to tune the
effective latency down to 2 ms by applying the method above. The method is
closely related to the psychophysical method of adjustment, which is covered later
in Section 12.4. I was later able to immediately spot latencies down to 10 ms

Conclusions This section provided some suggestions for training people to spot
problems in VR systems. Many more can be expected to emerge in the future.
For example, to evaluate auditory localization in a virtual world, evaluators should
close their eyes and move their heads in canonical motions. To detect lens glare
in systems that use Fresnel lenses, they should look for patterns formed by bright
lights against dark backgrounds. To detect display flicker (recall from Section 6.2),
especially if it is as low as 60 Hz, the evaluator should enter a bright virtual
world, preferably white, and relax the eyes until vibrations are noticeable at the
periphery. To notice vergence-accommodation mismatch (recall from Section 5.4),
virtual objects can be placed very close to the eyes. As the eyes converge, it may
seem unusual that they are already in focus, or the eyes attempt to focus as they
would in the real world, which would cause the object to be blurred.

There is also a need to have formal training mechanisms or courses that engineers
and developers could use to improve their perceptive powers. In this case,
evaluators could improve their skills through repeated practice. Imagine a VR
experience that is a competitive game designed to enhance your perceptive abilities
in spotting VR flaws.

12.2 Recommendations for Developers

With the widespread availability and affordability of VR headsets, the number of
people developing VR experiences has grown dramatically in recent years. Most
developers to date have come from the video game industry, where their skills and
experience in developing games and game engines are “ported over” to VR. In some
cases, simple adaptations are sufficient, but game developers have been repeatedly
surprised at how a highly successful and popular game experience does not translate
directly to a comfortable, or even fun, VR experience. Most of the surprises
are due to a lack of understanding of human physiology and perception. As the field
progresses, developers are coming from an increasing variety of backgrounds,
including cinema, broadcasting, communications, social networking, visualization,
and engineering. Artists and hobbyists have also joined in to make some of the
most innovative experiences.

This section provides some useful recommendations, which are based on a
combination of the principles covered in this book, and recommendations from
other developer guides (especially [359]). This is undoubtedly an incomplete list
that should grow in coming years as new kinds of hardware and experiences are
developed. The vast majority of VR experiences to date are based on successful
3D video games, which is evident in the kinds of recommendations being made by
developers today. Most of the recommendations below link to prior parts of this
book, which provide scientific motivation or further explanation.

Virtual worlds

• Set units in the virtual world that match the real world so that scales can
  be easily matched. For example, one unit equals one meter in the virtual
  world. This helps with depth and scale perception (Section 6.1).

• Make sure that objects are completely modeled so that missing parts are not
  noticeable as the user looks at them from viewpoints that would have been
  unexpected for graphics on a screen.

• Very thin objects, such as leaves on a tree, might look incorrect in VR due
  to varying viewpoints.

• Design the environment so that less locomotion is required; for example, a
  virtual elevator would be more comfortable than virtual stairs (Sections 8.4
  and 10.2).

• Consider visual and auditory rendering performance issues and simplify the
  geometric models as needed to maintain the proper frame rates on targeted
  hardware (Sections 7.4 and 11.4).

Visual rendering

• The only difference between the left and right views should be the viewpoint,
  not models, textures, colors, and so on (Sections 3.5 and 12.1).

• Never allow words, objects, or images to be fixed to part of the screen; all
  content should appear to be embedded in the virtual world. Recall from
  Section 2.1 that being stationary on the screen is not the same as being
  perceived as stationary in the virtual world.

• Be careful when adjusting the field of view for rendering or any parameters
  that affect lens distortion so that the result does not cause further mismatch
  (Sections 7.3 and 12.1).

• Re-evaluate common graphics tricks such as texture mapping and normal
  mapping, to ensure that they are effective in VR as the user has stereoscopic
  viewing and is able to quickly change viewpoints (Section 7.2).

• Anti-aliasing techniques are much more critical for VR because of the varying
  viewpoint and stereoscopic viewing (Section 7.2).

• The rendering system should be optimized so that the desired virtual world
  can be updated at a frame rate that is at least as high as the hardware
  requirements (for example, 90 FPS for Oculus Rift and HTC Vive); otherwise,
  the frame rate may decrease and vary, which causes discomfort (Section 7.4).

• Avoid movements of objects that cause most of the visual field to change
  in the same way; otherwise, the user might feel as if she is moving (Section
  8.4).

• Determine how to cull away geometry that is too close to the face of the
  user; otherwise, substantial vergence-accommodation mismatch will occur
  (Section 5.4).

• Unlike in games and cinematography, the viewpoint should not change in
  a way that is not matched to head tracking, unless the intention is for the
  user to feel as if she is moving in the virtual world, which itself can be
  uncomfortable (Section 10.2).

• For proper depth and scale perception, the interpupillary distance of the
  user in the real world should match the corresponding viewpoint distance
  between eyes in the virtual world (Section 6.1).

• In comparison to graphics on a screen, reduce the brightness and contrast of
  the models to increase VR comfort.

Tracking and the matched zone

• Never allow head tracking to be frozen or delayed; otherwise, the user might
  immediately perceive self-motion (Section 8.4).

• Make sure that the eye viewpoints are correctly located, considering stereo
  offsets (Section 3.5), head models (Section 9.1), and locomotion (Section
  10.2).

• Beware of obstacles in the real world that do not exist in the virtual world; a
  warning system may be necessary as the user approaches an obstacle (Section
  8.3.1).

• Likewise, beware of obstacles in the virtual world that do not exist in the real
  world. For example, it may have unwanted consequences if a user decides to
  poke his head through a wall (Section 8.3.1).

• As the edge of the tracking region is reached, it is more comfortable to
  gradually reduce contrast and brightness than to simply hold the position
  fixed (Section 8.4).

Interaction

• Consider interaction mechanisms that are better than reality by giving people
  superhuman powers, rather than applying the universal simulation principle
  (Chapter 10).
12.2. RECOMMENDATIONS FOR DEVELOPERS 341 342 S. M. LaValle: Virtual Reality
• For locomotion, follow the suggestions in Section 10.2 to reduce vection side effects.
• For manipulation in the virtual world, try to require the user to move as little as possible in the physical world; avoid giving the user a case of gorilla arms (Section 10.3).
• With regard to social interaction, higher degrees of realism are not necessarily better, due to the uncanny valley (Section 10.4).

User interfaces
• If a floating menu, web browser, or other kind of virtual display appears, then it should be rendered at least two meters away from the user’s viewpoint to minimize vergence-accommodation mismatch (Section 5.4).
• Such a virtual display should be centered and have a relatively narrow field of view, approximately one-third of the total viewing area, to minimize eye and head movement (Section 5.3).
• Embedding menus, options, game status, and other information may be most comfortable if it appears to be written into the virtual world in ways that are familiar; this follows the universal simulation principle (Chapter 10).

Audio
• Be aware of the difference between a user listening over fixed, external speakers versus attached headphones; sound source localization will not function correctly over headphones without tracking (Section 2.1).
• Both position and orientation from tracking and avatar locomotion should be taken into account for auralization (Section 11.4).
• Unexpected differences between the virtual body and real body may be alarming. The virtual body could have a different gender, body type, or species. This could lead to a powerful experience, or could be an accidental distraction.
• If only head tracking is performed, then the virtual body should satisfy some basic kinematic constraints, rather than decapitating the user in the virtual world (Section 9.4).
• Users’ self-appearance will affect their social behavior, as well as the way people around them react to them (Section 10.4).

12.3 Comfort and VR Sickness
Experiencing discomfort as a side effect of using VR systems has been the largest threat to widespread adoption of the technology over the past decades. It is considered the main reason for its failure to live up to overblown expectations in the early 1990s. Few people want a technology that causes them to suffer while using it, and in many cases long after using it. It has also been frustrating for researchers to characterize VR sickness because of many factors such as variation among people, adaptation over repeated use, difficulty of measuring symptoms, rapidly changing technology, and content-dependent sensitivity. Advances in display, sensing, and computing technologies have reduced the adverse side effects due to hardware; however, they nevertheless remain today in consumer VR headsets. As hardware-based side effects diminish, the burden has been shifting more toward software engineers and content developers. This is occurring because the VR experience itself has the opportunity to make people sick, even though the hardware may be deemed to be perfectly comfortable. In fact, the best VR headset available may enable developers to make people more sick than ever before! For these reasons, it is critical for engineers and developers of VR systems to understand these unfortunate side effects so that they can determine how to reduce or eliminate them for the vast majority of users.
associated with exposure to real and/or apparent motion. It generally involves the vestibular organs (Section 8.2), which implies that they involve sensory input or conflict regarding accelerations; in fact, people without functioning vestibular organs do not experience motion sickness [145]. Motion sickness due to real motion occurs because of unusual forces that are experienced. This could happen from spinning oneself around in circles, resulting in dizziness and nausea. Similarly, the symptoms occur from being transported by a vehicle that can produce forces that are extreme or uncommon. The self-spinning episode could be replaced by a hand-powered merry-go-round. More extreme experiences and side effects can be generated by a variety of motorized amusement park rides.
Unfortunately, motion sickness extends well beyond entertainment, as many people suffer from motion sickness while riding in vehicles designed for transportation. People experience car sickness, sea sickness, and air sickness, from cars, boats, and airplanes, respectively. It is estimated that only about 10% of people have never experienced significant nausea during transportation [171]. Militaries have performed the largest motion sickness studies because of soldiers spending long tours of duty on sea vessels and flying high-speed combat aircraft. About 70% of naval personnel experience seasickness, and about 80% of those have decreased work efficiency or motivation [249]. Finally, another example of unusual forces is space travel, in which astronauts who experience microgravity complain of nausea and other symptoms; this is called space sickness.
Simulator sickness and cybersickness Once displays are used, the choices discussed in Section 2.1 reappear: They may be fixed screens that surround the user (as in a CAVE VR system) or a head-mounted display that requires tracking. Vehicle simulators are perhaps the first important application of VR, with the most common examples being driving a car and flying an airplane or helicopter. The user may sit on a fixed base, or a motorized base that responds to controls. The latter case provides vestibular stimulation, for which time synchronization of motion and visual information is crucial to minimize sickness. Usually, the entire cockpit is rebuilt in the real world, and the visual stimuli appear at or outside of the windows. The head could be tracked to provide stereopsis and varying viewpoints, but most often this is not done so that comfort is maximized and technological side effects are minimized. The branch of visually induced motion sickness that results from this activity is aptly called simulator sickness, which has been well-studied by the US military.
The term cybersickness [206] was proposed to cover any sickness associated with VR (or virtual environments), which properly includes simulator sickness. Unfortunately, the meaning of the term has expanded in recent times to include sickness associated with spending too much time interacting with smartphones or computers in general. Furthermore, the term cyber has accumulated many odd connotations over the decades. Therefore, we refer to visually induced motion sickness, and any other forms of discomfort that arise from VR, as VR sickness.
• Eyestrain: Users may feel that their eyes are tired, fatigued, sore, or aching.
Other side effects In addition to the direct symptoms just listed, several other
phenomena are closely associated with motion and VR sickness, and potentially
persist long after usage. One of them is Sopite syndrome [102], which is closely
related to drowsiness, but may include other symptoms, such as laziness, lack of
social participation, mood changes, apathy, and sleep disturbances. These symptoms
may persist even after the symptoms listed above have been greatly reduced or
eliminated through adaptation. Another phenomenon is postural disequilibrium, which
adversely affects balance and coordination [171]. Finally, another phenomenon is loss
of visual acuity during head or body motion [171], which seems to be a natural
consequence of the VOR (Section 5.3) becoming adapted to the flaws in a VR
system. This arises from forcing the perception of stationarity in spite of issues in
resolution, latency, frame rates, optical distortion, and so on.
Figure 12.3: The symptoms are observed, but the causes are not directly measured. Researchers face an inverse problem, which is to speculate on the causes based on observed symptoms. The trouble is that each symptom may have many possible causes, some of which might not be related to the VR experience.
After effects One of the most troubling aspects of VR sickness is that symptoms
might last for hours or even days after usage [304]. Most users who experience
symptoms immediately after withdrawal from a VR experience still show some
sickness, though at diminished levels, 30 minutes later. Only a very small number
of outlier users may continue to experience symptoms for hours or days. Similarly,
some people who experience sea sickness complain of land sickness for extended
periods after returning to stable ground. This corresponds to postural instability
and perceived instability of the visual world; the world might appear to be rocking
[171].
From symptoms to causes The symptoms are the effect, but what are their
causes? See Figure 12.3. The unfortunate problem for the scientist or evaluator of
a VR system is that only the symptoms are observable. Any symptom could have
any number of direct possible causes. Some of them may be known and others may be impossible to determine. Suppose, for example, that a user has developed mild nausea after 5 minutes of a VR experience. What are the chances that he would have gotten nauseated anyway because he rode his bicycle to the test session and forgot to eat breakfast? What if he has a hangover from alcoholic drinks the night before? Perhaps a few users such as this could be discarded as outliers, but what if there was a large festival the night before which increased the number of people who are fatigued before the experiment? Some of these problems can be handled by breaking the users into groups that are expected to have low variability; see Section 12.4. At the very least, one should probably ask them beforehand if they feel nauseated; however, this could even cause them to pay more attention to nausea, which generates a bias.
Even if it is narrowed down that the cause was the VR experience, this determination may not be narrow enough to be useful. Which part of the experience caused it? The user might have had no problems were it not for 10 seconds of stimulus during a 15-minute session. How much of the blame was due to the hardware versus the particular content? The hardware might be as comfortable as an optokinetic drum, which essentially shifts the blame to the particular images on the drum.
Questions relating to cause are answered by finding statistical correlations in the data obtained before, during, and after the exposure to VR. Thus, causation is not determined through directly witnessing the cause and its effect, in the way that one witnesses a glass shattering, which is clearly caused by dropping it on the floor. Eliminating irrelevant causes is an important part of the experimental design, which involves selecting users carefully and gathering appropriate data from them in advance. Determining more specific causes requires more experimental trials. This is complicated by the fact that different trials cannot be easily applied to the same user. Once people are sick, they will not be able to participate, or would at least give biased results that are difficult to compensate for. They could return on different days for different trials, but there could again be issues because of adaptation to VR, including the particular experiment, and simply being in a different health or emotional state on another occasion.
Variation among users A further complication is the wide variability among people in susceptibility to VR sickness. Individual differences among groups must be accounted for in the design of the experiment; see Section 12.4. Most researchers believe that women are more susceptible to motion sickness than men [143, 241]; however, this conclusion is disputed in [171]. Regarding age, it seems that susceptibility is highest in children under 12, which then rapidly decreases as they mature to adulthood, and then gradually decreases further over their lifetime [262]. One study even concludes that Chinese people are more susceptible than some other ethnic groups [309]. The best predictor of an individual’s susceptibility to motion sickness is to determine whether she or he has had it before. Finally, note that there may also be variability across groups as in the severity of the symptoms, the speed of their onset, the time they last after the experiment, and the rate at which the users adapt to VR.
Sensory conflict theory In addition to determining the link between cause and effect in terms of offending stimuli, we should also try to understand why the body is reacting adversely to VR. What physiological and psychological mechanisms are involved in the process? Why might one person be unable to quickly adapt to certain stimuli, while other people are fine? What is particularly bad about the stimulus that might be easily fixed without significantly degrading the experience? The determination of these mechanisms and their reasons for existing falls under etiology. Although there is no widely encompassing and accepted theory that explains motion sickness or VR sickness, some useful and accepted theories exist.
One of the most relevant and powerful theories for understanding VR sickness is sensory conflict theory [132, 145]. Recall the high-level depiction of VR systems from Figure 2.1 of Section 2.1. For VR, two kinds of mismatch exist:
1. The engineered stimuli do not closely enough match that which is expected by the central nervous system and brain in comparison to natural stimuli. Examples include artifacts due to display resolution, aliasing, frame rates, optical distortion, limited colors, synthetic lighting models, and latency.
2. Some sensory systems receive no engineered stimuli. They continue to sense the surrounding physical world in a natural way and send their neural signals accordingly. Examples include signals from the vestibular and proprioceptive systems. Real-world accelerations continue to be sensed by the vestibular organs and the poses of body parts can be roughly estimated from motor signals.
Unsurprisingly, the most important conflict for VR involves accelerations. In the case of vection, the human vision system provides optical flow readings consistent with motion, but the signals from the vestibular organ do not match. Note that this is the reverse of a common form of motion sickness, which is traveling in a moving vehicle without looking outside of it. For example, imagine reading a book while a passenger in a car. In this case, the vestibular system reports the accelerations of the car, but there is no corresponding optical flow.
Forced fusion and fatigue Recall from Section 6.4 that our perceptual systems integrate cues from different sources, across different sensing modalities, to obtain a coherent perceptual interpretation. In the case of minor discrepancies between the cues, the resulting interpretation can be considered as forced fusion [120], in which the perceptual systems appear to work harder to form a match in spite of errors. The situation is similar in engineering systems that perform sensor fusion or visual scene interpretation; the optimization or search for possible interpretations may be much larger in the presence of more noise or incomplete information. Forced fusion appears to lead directly to fatigue and eyestrain. By analogy to computation, it
may not be unlike a CPU or GPU heating up as computations intensify for a more difficult problem. Thus, human bodies are forced to work harder as they learn to interpret virtual worlds in spite of engineering flaws. Fortunately, repeated exposure leads to learning or adaptation, which might ultimately reduce fatigue.
Poison hypotheses Sensory conflict might seem to be enough to explain why extra burden arises, but it does not seem to imply that nausea would result. Scientists wonder what the evolutionary origins might be for this and related symptoms. Note that humans have the ability to naturally nauseate themselves from spinning motions that do not involve technology. The indirect poison hypothesis asserts that nausea associated with motion sickness is a by-product of a mechanism that evolved in humans so that they would vomit an accidentally ingested toxin [323]. The symptoms of such toxins frequently involve conflict between visual and vestibular cues. Scientists have considered alternative evolutionary explanations, such as tree sickness in primates so that they avoid swaying, unstable branches. Another explanation is the direct poison hypothesis, which asserts that nausea became associated with toxins because they were correlated throughout evolution with activities that involved increased or prolonged accelerations. A detailed assessment of these alternative hypotheses and their incompleteness is given in Section 23.9 of [171].
Levels of VR sickness To improve VR systems and experiences, we must first be able to properly compare them in terms of their adverse side effects. Thus, the resulting symptoms need to be quantified. Rather than a simple yes/no response for each symptom, it is more precise to obtain numbers that correspond to relative severity. Several important quantities, for a particular symptom, include
• The intensity of the symptom.
• The rate of symptom onset or intensity increase while the stimulus is presented.
• The rate of symptom decay or intensity decrease after the stimulus is removed.
• The percentage of users who experience the symptom at a fixed level or above.
The first three can be visualized as a plot of intensity over time. The last one is a statistical property; many other statistics could be calculated from the raw data.
Questionnaires The most popular way to gather quantitative data is to have users fill out a questionnaire. Researchers have designed many questionnaires over the years [170]; the most widely known and utilized is the simulator sickness questionnaire (SSQ) [144]. It was designed for simulator sickness studies for the US military, but has been used much more broadly. The users are asked to score each of 16 standard symptoms on a four-point scale: 0 none, 1 slight, 2 moderate, and 3 severe. The results are often aggregated by summing the scores for a selection of the questions. To determine onset or decay rates, the SSQ must be administered multiple times, such as before, after 10 minutes, after 30 minutes, immediately after the experiment, and then 60 minutes afterwards.
Questionnaires suffer from four main drawbacks. The first is that the answers are subjective. For example, there is no clear way to calibrate what it means across the users to feel nausea at level “1” versus level “2”. A single user might even give different ratings based on emotion or even the onset of other symptoms. The second drawback is that users are asked to pay attention to their symptoms, which could bias their perceived onset (they may accidentally become like perceptually trained evaluators, as discussed in Section 12.1). The third drawback is that users must be interrupted so that they can provide scores during a session. The final drawback is that the intensity over time must be sampled coarsely because a new questionnaire must be filled out at each time instant of interest.
Physiological measurements The alternative is to attach sensors to the user so that physiological measurements are automatically obtained before, during, and after the VR session. The data can be obtained continuously without interrupting the user or asking him to pay attention to symptoms. There may, however, be some discomfort or fear associated with the placement of sensors on the body.
Researchers typically purchase a standard sensing system, such as the Biopac MP150, which contains a pack of sensors, records the data, and transmits them to a computer for analysis.
Some physiological measures that have been used for studying VR sickness are:
• Electrocardiogram (ECG): This sensor records the electrical activity of the heart by placing electrodes on the skin. Heart rate typically increases during a VR session.
• Electrogastrogram (EGG): This is similar to the ECG, but the electrodes are placed near the stomach so that gastrointestinal discomfort can be estimated.
• Electrooculogram (EOG): Electrodes are placed around the eyes so that eye movement can be estimated. Alternatively, a camera-based eye tracking system may be used (Section 9.4). Eye rotations and blinking rates can be determined.
• Photoplethysmogram (PPG): This provides additional data on heart movement and is obtained by using a pulse oximeter. Typically this device is clamped onto a fingertip and monitors the oxygen saturation of the blood.
• Galvanic skin response (GSR): This sensor measures electrical resistance across the surface of the skin. As a person sweats, the moisture of the skin surface increases conductivity. This offers a way to measure cold sweating.
• Respiratory effort: The breathing rate and amplitude are measured from
a patch on the chest that responds to differential pressure or expansion. The
rate of breathing may increase during the VR session.
• Skin pallor: This can be measured using a camera and image processing. In
the simplest case, an IR LED and photodiode serve as an emitter-detector
pair that measures skin reflectance.
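The quantities listed under Levels of VR sickness can be estimated from scores that are collected at several times, as suggested for the SSQ. The following is a minimal sketch rather than a standard scoring procedure. It assumes that each user reports one intensity value on the 0 to 3 scale for a single symptom at known times; the function names, the sample data, and the use of a simple first-to-last slope for the onset and decay rates are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Each sample is (minutes since the start of exposure, score on the 0-3 scale).
Samples = List[Tuple[float, int]]

def mean_rate(samples: Samples) -> float:
    """Average change in score per minute between the first and last sample.
    Assumes at least two samples taken at different times."""
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    return (s1 - s0) / (t1 - t0)

def onset_and_decay(samples: Samples, stimulus_end: float) -> Tuple[float, float]:
    """Split the samples at the end of the stimulus and return
    (onset rate, decay rate) in score units per minute."""
    during = [s for s in samples if s[0] <= stimulus_end]
    after = [s for s in samples if s[0] >= stimulus_end]
    return mean_rate(during), mean_rate(after)

def percent_at_or_above(users: Dict[str, Samples], level: int) -> float:
    """Percentage of users whose peak score reaches the given level."""
    hits = sum(1 for s in users.values()
               if max(score for _, score in s) >= level)
    return 100.0 * hits / len(users)

# Hypothetical data for a 30-minute session: scores before exposure,
# after 10 and 30 minutes, and 60 minutes after the session ends.
users = {
    "u1": [(0, 0), (10, 1), (30, 2), (90, 0)],
    "u2": [(0, 0), (10, 0), (30, 1), (90, 1)],
    "u3": [(0, 0), (10, 0), (30, 0), (90, 0)],
    "u4": [(0, 1), (10, 2), (30, 3), (90, 1)],
}

onset, decay = onset_and_decay(users["u1"], stimulus_end=30)
share = percent_at_or_above(users, level=2)
print(f"onset {onset:.3f}/min, decay {decay:.3f}/min, {share:.0f}% at level 2+")
```

With only four or five questionnaire samples per user, such slopes are coarse, which is the sampling drawback noted for questionnaires; continuously recorded physiological signals could be passed through the same kind of functions at a much finer time resolution.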
ladder. The first step is accomplished by studying the appropriate literature or gaining the background necessary to design a new method that is likely to be an improvement. This will reduce the chances of falling from the ladder. The second step is to design and implement the new method. This step could include some simple evaluation on a few users just to make sure it is worth proceeding further. The third step is to precisely formulate the hypothesis, regarding how it is an improvement. Examples are: 1) a reduction in adverse symptoms, 2) improved comfort, 3) greater efficiency at solving tasks, 4) stronger belief that the virtual world is real, and 5) a greater enjoyment of the activity. It often makes sense to evaluate multiple criteria, but the result may be that the new method is better in some ways and worse in others. This is a common outcome, but it is preferable to failing to improve in any way! The hypothesis could even involve improving future experimental procedures; an example is [59], in which researchers determined cases in which physiological measures are better indicators of VR sickness than questionnaires. Finally, the hypothesis should be selected in a way that simplifies the fourth step, the experiment, as much as possible while remaining useful.
For the fourth step, the experiment should be designed and conducted to test the hypothesis. The fifth step is to analyze the data and draw a conclusion. If the result is a “better” method in terms of the criteria of interest, then the sixth step is reached, at which point the new method should be presented to the world.
At any step, failure could occur. For example, right after the experiment is conducted, it might be realized that the pool of subjects is too biased. This requires falling down one step and redesigning or reimplementing the experiment. It is unfortunate if the conclusion at the fifth step is that the method is not a clear improvement, or is even worse. This might require returning to step two or even one. The key is to keep from falling too many steps down the ladder per failure by being careful at each step!
Human subjects Dealing with people is difficult, especially if they are subjects in a scientific experiment. They may differ wildly in terms of their prior VR experience, susceptibility to motion sickness, suspicion of technology, moodiness, and eagerness to make the scientist happy. They may agree to be subjects in the experiment out of curiosity, financial compensation, boredom, or academic degree requirements (psychology students are often forced to participate in experiments). A scientist might be able to guess how some people will fare in the experiment based on factors such as gender, age, or profession. The subject of applying the scientific method to formulate and evaluate hypotheses regarding groups of people (or animals) is called behavioral science [152].
One of the greatest challenges is whether they are being observed “in the wild” (without even knowing they are part of an experiment) or if the experiment presents stimuli or situations they would never encounter in the real world. The contrived setting sometimes causes scientists to object to the ecological validity of the experiment. Fortunately, VR is a particular contrived setting that we want to evaluate. Thus, conclusions made about VR usage are more likely to be ecologically valid, especially if experimental data can be obtained without users even being aware of the experiment. Head tracking data could be collected on a server while millions of people try a VR experience.
Ethical standards This leads to the next challenge, which is the rights of humans, who presumably have more of them than animals. Experiments that affect their privacy or health must be avoided. Scientific experiments that involve human subjects must uphold high standards of ethics, which is a lesson that was painfully learned from Nazi medical experiments and the Tuskegee syphilis experiment in the mid-20th century. The Nazi War Crimes Tribunal outcomes resulted in the Nuremberg code, which states a set of ethical principles for experimentation on human subjects. Today, ethical standards for human subject research are taken seriously around the world, with ongoing debate or differences in particulars [236]. In the United States, experiments involving human subjects are required by law to be approved by an institutional review board (IRB). Typically, the term IRB is also used to refer to the proposal for an experiment or set of experiments that has been approved by the review board, as in the statement, “that requires an IRB”. Experiments involving VR are usually not controversial and are similar to experiments on simulator sickness that have been widely approved for decades.
Variables Behavioral scientists are always concerned with variables. Each variable takes on values in a set, which might be numerical, as in real numbers, or symbolic, as in colors, labels, or names. From their perspective, the three most important classes of variables are:
• Dependent: These are the main objects of interest for the hypothesis.
• Independent: These have values that are directly changed or manipulated by the scientist.
• Nuisance: As these vary, their values might affect the values of the dependent variable, but the scientist has less control over them and they are not the objects of interest.
The high-level task is to formulate a hypothesis that can be evaluated in terms of the relationship between independent and dependent variables, and then design an experiment that can keep the nuisance variables under control and can be conducted within the budget of time, resources, and access to subjects.
The underlying mathematics for formulating models of how the variables behave and predicting their behavior is probability theory, which was introduced in Section 6.4. Unfortunately, we are faced with an inverse problem, as was noted in Figure 12.3. Most of the behavior is not directly observable, which means that we must gather data and make inferences about the underlying models and try to obtain as much confidence as possible. Thus, resolving the hypothesis is a problem in applied statistics, which is the natural complement or inverse of probability theory.
12.4. EXPERIMENTS ON HUMAN SUBJECTS 355 356 S. M. LaValle: Virtual Reality
The possible r-values range between −1 and 1. Three qualitatively different outcomes can occur:
• r > 0: This means that x and y are positively correlated. As x increases, y tends to increase. A larger value of r implies a stronger effect.
in which
$$\hat{\sigma}_p = \sqrt{\frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{n_1 + n_2 - 2}} \qquad (12.6)$$
and $n_i$ is the number of subjects who received treatment $x_i$. The subtractions by
1 and 2 in the expressions are due to Bessel’s correction. Based on the value of t,
• r = 0: This means that x and y are uncorrelated, which is theoretically
the confidence α in the null hypothesis H0 is determined by looking in a table of
equivalent to a null hypothesis.
the Student’s t cdf (Figure 12.5(b)). Typically, α = 0.05 or lower is sufficient to
declare that H1 is true (corresponding to 95% confidence). Such tables are usually • r < 0: This means that x and y are negatively correlated. As x increases, y
arranged so that for a given ν and α, the minimum t value needed to confirm tends to decrease. A smaller value of r implies a stronger effect.
H1 with confidence 1 − α is presented. Note that if t is negative, then the effect
that x has on y runs in the opposite direction, and −t is applied to the table. In practice, it is highly unlikely to obtain r = 0 from experimental data; therefore,
The binary outcome might not be satisfying enough. This is not a problem the absolute value |r| gives an important indication of the likelihood that y depends
because the difference in means, µ̂1 − µ̂2, is an estimate of the amount of change that on x. The theoretical equivalence to the null hypothesis (r = 0) would happen
applying x2 had in comparison to x1. This is called the average treatment effect. only as the number of subjects tends to infinity.
Thus, in addition to determining whether H1 is true via the t-test, we also
obtain an estimate of how much it affects the outcome. Dealing with nuisance variables We have considered dependent and independent
Student’s t-test assumes that the variance within each group is identical. If it variables, but have neglected the nuisance variables. This is the most
is not, then Welch’s t-test is used [343]. Note that the variances were not given challenging part of experimental design. Only the general idea is given here; see
in advance in either case. They are estimated “on the fly” from the experimental [152, 195] for exhaustive presentations. Suppose that when looking through the
data. Welch’s t-test gives the same result as Student’s t-test if the variances data it is noted that the dependent variable y depends heavily on an identifiable
happen to be the same; therefore, when in doubt, it may be best to apply Welch’s property of the subjects, such as gender. This property would become a nuisance
t-test. Many other tests can be used and are debated in particular contexts by variable, z. We could imagine designing an experiment just to determine whether
scientists; see [124]. and how much z affects y, but the interest is in some independent variable x, not
z.
Correlation coefficient In many cases, the independent variable x and the The dependency on z drives the variance high across the subjects; however,
dependent variable y are both continuous (taking on real values). This enables if they are divided into groups that have the same z value inside of each group,
another important measure called the Pearson correlation coefficient (or Pearson’s then the variance could be considerably lower. For example, if gender is the
r). This estimates the amount of linear dependency between the two variables. nuisance variable, then we would divide the subjects into groups of men and women
For each subject i, the treatment (or level) x[i] is applied and the response is y[i]. and discover that the variance is smaller in each group. This technique is called
Note that in this case, there are no groups (or every subject is a unique group). blocking, and each group is called a block. Inside of a block, the variance of y
Also, any treatment could potentially be applied to any subject; the index i only should be low if the independent variable x is held fixed.
denotes the particular subject. The next problem is to determine which treatment should be applied to which
The r-value is calculated as the estimated covariance between x and y when subjects. Continuing with the example, it would be a horrible idea to give treat-
treated as random variables: ment x1 to women and treatment x2 to men. This completely confounds the
nuisance variable z and independent variable x dependencies on the dependent
n
X variable y. The opposite of this would be to apply x1 to half of the women and
(x[i] − µ̂x )(y[i] − µ̂y ) men, and x2 to the other half, which is significantly better. A simple alternative
i=1
r=s n
s n
, (12.7) is to use a randomized design, in which the subjects are assigned x1 or x2 at ran-
dom. This safely eliminates accidental bias and is easy for an experimenter to
X X
(x[i] − µ̂x )2 (y[i] − µ̂y ) 2
i=1 i=1
implement.
If there is more than one nuisance variable, then the assignment process be-
in which µ̂x and µ̂y are the averages of x[i] and y[i], respectively, for the set comes more complicated, which tends to cause a greater preference for randomiza-
of all subjects. The denominator is just the product of the estimated standard tion. If the subjects participate in a multiple-stage experiment where the different
deviations: σ̂x σ̂y . treatments are applied at various times, then the treatments must be carefully
assigned. One way to handle it is by assigning the treatments according to a
Latin square, which is an m-by-m matrix in which every row and column is a
permutation of m labels (in this case, treatments).

Analysis of variance The main remaining challenge is to identify nuisance
variables that would have a significant impact on the variance. This is called
analysis of variance (or ANOVA, pronounced "ay nova"), and methods that take
this into account are called ANOVA design. Gender was an easy factor to imagine,
but others may be more subtle, such as the amount of FPS games played among
the subjects, or the time of day that the subjects participate. The topic is far too
complex to cover here (see [152]), but the important intuition is that low-variance
clusters must be discovered among the subjects, which serves as a basis for dividing
them into blocks. This is closely related to the problem of unsupervised clustering
(or unsupervised learning) because classes are being discovered without the use
of a "teacher" who identifies them in advance. ANOVA is also considered as a
generalization of the t-test to three or more groups.

More variables Variables other than independent, dependent, and nuisance
sometimes become important in the experiment. A control variable is essentially a
nuisance variable that is held fixed through the selection of subjects or experimental
trials. For example, the variance may be held low by controlling the subject
selection so that only males between the ages of 18 and 21 are used in the experiment.
The approach helps to improve the confidence in the conclusions from
the experiment, possibly with a smaller number of subjects or trials, but might
prevent its findings from being generalized to settings outside of the control.

A confounding variable is an extraneous variable that causes the independent
and dependent variables to be correlated, but they become uncorrelated once the
value of the confounding variable is given. For example, having a larger shoe size
may correlate to better speaking ability. In this case the confounding variable
is the person's age. Once the age is known, we realize that older people have
larger feet than small children, and are also better at speaking. This illustrates
the danger of inferring causal relationships from statistical correlations.

Psychophysical methods Recall from Section 2.3 that psychophysics relates
perceptual phenomena to the original stimuli, which makes it crucial for understanding
VR. Stevens' power law (2.1) related the perceived stimulus magnitude
to the actual magnitude. The JND involved determining a differential threshold,
which is the smallest amount of stimulus change that is detectable. A special
case of this is an absolute threshold, which is the smallest magnitude stimulus (in
comparison to zero) that is detectable.

Psychophysical laws or relationships are gained through specific experiments
on human subjects. The term psychophysics and research area were introduced by
Gustav Fechner [76], who formulated three basic experimental approaches, which
will be described next. Suppose that x represents the stimulus magnitude. The task
is to determine how small ∆x can become so that subjects perceive a difference.
The classical approaches are:

• Method of constant stimuli: In this case, stimuli at various magnitudes
are presented in succession, along with the reference stimulus. The subject is
asked for each stimulus pair whether he or she can perceive a difference between
them. The magnitudes are usually presented in random order to suppress
adaptation. Based on the responses over many trials, a best-fitting psychometric
function is calculated, as was shown in Figure 2.21.

• Method of limits: The experimenter varies the stimulus magnitude in
small increments, starting with an upper or lower limit. The subject is asked
in each case whether the new stimulus has less, equal, or more magnitude
than the reference stimulus.

• Method of adjustment: The subject is allowed to adjust the stimulus
magnitude up and down within a short amount of time, while also being able
to compare to the reference stimulus. The subject stops when she reports
that the adjusted and reference stimuli appear to have equal magnitude.

Although these methods are effective and widely used, several problems exist. All
of them may be prone to some kinds of bias. For the last two, adaptation may
interfere with the outcome. For the last one, there is no way to control how the
subject makes decisions. Another problem is efficiency, in that many iterations
may be wasted in the methods by considering stimuli that are far away from the
reference stimulus.

Adaptive methods Due to these shortcomings, researchers have found numerous
ways to improve the experimental methods over the past few decades. A large
number of these are surveyed and compared in [324], and fall under the heading of
adaptive psychophysical methods. Most improved methods perform staircase procedures,
in which the stimulus magnitude starts off with an easy case for the subject
and is gradually decreased (or increased if the reference magnitude is larger) until
the subject makes a mistake [90]. At this point, the direction is reversed and
the steps are increased until another mistake is made. The process of making
a mistake and changing directions continues until the subject makes many mistakes
in a small number of iterations. The step size must be carefully chosen, and
could even be reduced gradually during the experiment. The direction (increase
or decrease) could alternatively be decided using Bayesian or maximum-likelihood
procedures that provide an estimate for the threshold as the data are collected in
each iteration [113, 154, 342]. These methods generally fall under the heading of
the stochastic approximation method [266].

Stimulus magnitude estimation Recall that Stevens' power law is not about
detection thresholds, but is instead about the perceived magnitude of a stimulus.
For example, one plate might feel twice as hot as another. In this case, subjects
are asked to estimate the relative difference in magnitude between stimuli. Over
a sufficient number of trials, the exponent of Stevens’ power law (2.1) can be
estimated by choosing a value for x (the exponent) that minimizes the least-squares
error (recall from Section 9.1).
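
The exponent estimate described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: Stevens' power law is taken to have the form p = c m^x (perceived magnitude p, stimulus magnitude m, exponent x, constant c), and the least-squares error is minimized on the logarithms of the data, which turns the fit into a straight-line regression. The function name and the sample data are hypothetical, and other error criteria could be used instead.

```python
import math


def fit_stevens_exponent(magnitudes, perceived):
    """Fit p = c * m**x by ordinary least squares on (log m, log p).

    Assumes all magnitudes and perceived values are positive.
    Returns the pair (x, c).
    """
    if len(magnitudes) != len(perceived) or len(magnitudes) < 2:
        raise ValueError("need at least two paired observations")
    log_m = [math.log(m) for m in magnitudes]
    log_p = [math.log(p) for p in perceived]
    n = len(log_m)
    mean_m = sum(log_m) / n
    mean_p = sum(log_p) / n
    # Slope of the regression line in log-log space is the exponent x.
    sxx = sum((a - mean_m) ** 2 for a in log_m)
    sxy = sum((a - mean_m) * (b - mean_p) for a, b in zip(log_m, log_p))
    if sxx == 0.0:
        raise ValueError("stimulus magnitudes must not all be equal")
    x = sxy / sxx
    # Intercept in log-log space is log c.
    c = math.exp(mean_p - x * mean_m)
    return x, c


# Hypothetical, noise-free responses generated from p = 2 * m**0.5.
magnitudes = [1.0, 4.0, 9.0, 16.0]
perceived = [2.0 * m ** 0.5 for m in magnitudes]
exponent, scale = fit_stevens_exponent(magnitudes, perceived)
```

With real subject data the responses are noisy, so the recovered exponent is only an estimate whose quality depends on the number of trials.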
Further Reading
For surveys on perceptual learning, see [94, 98, 109, 253]. Hyperacuity through per-
ceptual learning is investigated in [101, 253]. In [283] it is established that perceptual
learning can occur even without focused attention.
Human sensitivity to latency in VR and computer interfaces is analyzed in [62, 67,
200, 357]. Comfort issues in stereo displays are studied in [289]. For connections between
postural sway and sickness, see [294, 313].
For some important studies related to VR sickness, see [13, 146, 147, 153, 223, 263].
General overviews of VR sickness are given in [143, 169, 305]. Motion sickness is surveyed
in [262]. See [120, 138, 55, 255] for additional coverage of forced fusion.
For coverage of the mathematical methods and statistics for human subjects exper-
imentation, see [152]. The book [195] is highly popular for its coverage of hypothesis
testing in the context of psychology. For treatment of psychophysical methods, see
[176, 324, 349] and Chapter 3 of [93].
Chapter 13
Frontiers
We arrive at the final chapter, which surveys some topics that could influence
widespread VR usage in the future, but are currently in a research and development
stage. Sections 13.1 and 13.2 cover the forgotten senses. Earlier in this book, we
covered vision, hearing, and balance (vestibular) senses, which leaves touch, smell,
and taste. Section 13.1 covers touch, or more generally, the somatosensory system.
This includes physiology, perception, and engineering technology that stimulates
the somatosensory system. Section 13.2 covers the two chemical senses, smell and
taste, along with attempts to engineer "displays" for them. Section 13.3 discusses
how robots are used for telepresence and how they may ultimately become our
surrogate selves through which the real world can be explored with a VR interface.
Just like there are avatars in a virtual world (Section 10.4), the robot becomes a
kind of physical avatar in the real world. Finally, Section 13.4 discusses steps
toward the ultimate level of human augmentation and interaction: brain-machine
interfaces.

13.1 Touch and Proprioception

Visual and auditory senses are the main focus of VR systems because they are
relatively easy to co-opt using current technology. Their organs are concentrated in
a small place on the head, and head tracking technology is cheap and accurate.
Unfortunately, this neglects the powerful senses of touch and proprioception, and
related systems, which provide an intimate connection to the world around us. Our
eyes and ears enable us to perceive the world from a distance, but touch seems to
allow us to directly feel it. Furthermore, proprioception gives the body a sense of
where it is in the world with respect to gravity and the relative placement or
configuration of limbs and other structures that can be moved by our muscles. We
will therefore consider these neglected senses, from their receptors to perception,
and then to engineering systems that try to overtake them.

The somatosensory system The body senses provide signals to the brain about
the human body itself, including direct contact with the skin, the body's configuration
and movement in the world, and the ambient temperature. Within this
category, the vestibular system (Section 8.2) handles balance, and the somatosensory
system handles touch, proprioception, and kinesthesis. Consider the human
body and all of its movable parts, such as the legs, arms, fingers, tongue, mouth,
and lips. Proprioception corresponds to the awareness of the pose of each part
relative to others, whereas kinesthesis is the counterpart for the movement itself.
In other words, kinesthesis provides information on velocities, accelerations, and
forces.

Figure 13.1: Six major kinds of receptors in human skin. (Figure by Pearson
Education.)

The somatosensory system has at least nine major kinds of receptors, six of
which are devoted to touch, and the remaining three are devoted to proprioception
and kinesthesis. Figure 13.1 depicts the six main touch receptors, which are
embedded in the skin (dermis). Their names, structures, and functions are:

• Free nerve endings: These are neurons with no specialized structure.
They have axons that extend up into the outer skin (epidermis), with the
primary function of sensing temperature extremes (hot and cold), and pain
from tissue damage. These neurons are special (called pseudounipolar) in
that their axons perform the roles of both the dendrites and the axon of a
typical neural cell.

• Ruffini's endings or corpuscles: These are embedded deeply in the skin
and signal the amount of stretching that is occurring at any moment. They
have a sluggish temporal response.
• Pacinian corpuscles: These are small bodies filled with fluid and respond
to pressure. Their response is fast, allowing them to sense vibrations (pressure
variations) of up to 250 to 350 Hz.

• Merkel's disks: These structures appear just below the epidermis and
respond to static pressure (little or no variation over time), with a slow
temporal response.

• Meissner's corpuscles: These are also just below the epidermis, and respond
to lighter touch. Their response is faster than Merkel's disks and
Ruffini's corpuscles, allowing vibrations up to 30 to 50 Hz to be sensed; this
is not as high as is possible for the Pacinian corpuscles.

• Hair follicle receptors: These correspond to nerve endings that wrap
closely around the hair root; they contribute to light touch sensation, and
also pain if the hair is removed.

The first four of these receptors appear in skin all over the body. Meissner's
corpuscles are only in parts where there are no hair follicles (glabrous skin), and
the hair follicle receptors obviously appear only where there is hair. In some
critical places, such as eyelids, lips, and tongue, thermoreceptors called the end-bulbs
of Krause also appear in the skin. Yet another class is nociceptors, which
appear in joint tissues and cause a pain sensation from overstretching, injury, or
inflammation.

Touch has both spatial and temporal resolutions. The spatial resolution or
acuity corresponds to the density, or receptors per unit area, which varies over
the body. The density is high at the fingertips, and very low on the back. This
has implications on touch perception, which will be covered shortly. The temporal
resolution is not the same as for hearing, which extends up to 20,000 Hz; the
Pacinian corpuscles allow vibrations up to a few hundred Hertz to be distinguished
from a static pressure.

Regarding proprioception (and kinesthesis), there are three kinds of receptors:

• Muscle spindles: As the name suggests, these are embedded inside of each
muscle so that changes in their length can be reported to the central nervous
system (which includes the brain).

• Golgi tendon organs: These are embedded in tendons, which are each a
tough band of fibrous tissue that usually connects a muscle to bone. The
organs report changes in muscle tension.

• Joint receptors: These lie at the joints between bones and help coordinate
muscle movement while also providing information to the central nervous
system regarding relative bone positions.

Through these receptors, the body is aware of the relative positions, orientations,
and velocities of its various moving parts.

The neural pathways for the somatosensory system work in a way that is similar
to the visual pathways of Section 5.2. The signals are routed through the thalamus,
with relevant information eventually arriving at the primary somatosensory
cortex in the brain, where the higher-level processing occurs. Long before the
thalamus, some of the signals are also routed through the spinal cord to motor
neurons that control muscles. This enables rapid motor response, for the purpose
of withdrawing from painful stimuli quickly, and for the knee-jerk reflex. Inside
of the primary somatosensory cortex, neurons fire in a spatial arrangement that
corresponds to their location on the body (topographic mapping). Some neurons
also have receptive fields that correspond to local patches on the skin, much in the
same way as receptive fields work for vision (recall Figure 5.8 from Section 5.2).
Once again, lateral inhibition and spatial opponency exist and form detectors that
allow people to estimate sharp pressure features along the surface of the skin.

Somatosensory perception We now transition from physiology to somatosensory
perception. The familiar concepts from psychophysics (Sections 2.3 and 12.4)
appear again, resulting in determinations of detection thresholds, perceived stimulus
magnitude, and acuity or resolution along temporal and spatial axes. For
example, the ability to detect the presence of a vibration, presented at different
frequencies and temperatures, was studied in [26].

Two-point acuity Spatial resolution has been studied by the two-point acuity
test, in which the skin is poked in two nearby places by a pair of sharp calipers.
The subjects are asked whether they perceive a single poke, or two pokes in different
places at the same time. The detection thresholds are then arranged by
the location on the body to understand how the spatial resolution varies. The
sharpest acuity is on the tongue and hands, where points can be distinguished if
they are as close as 2 or 3 mm. The tips of the tongue and fingers have the highest
acuity. For the forehead, the threshold is around 20 mm. The back has the lowest
acuity, resulting in a threshold of around 60 mm. These results have also been
shown to correspond directly to the sizes of receptive fields in the somatosensory
cortex. For example, neurons that correspond to the back have much larger fields
(in terms of skin area) than those of the fingertip.

Texture perception By running fingers over a surface, texture perception results.
The size, shape, arrangement, and density of small elements that protrude
from, or indent into, the surface affect the resulting perceived texture. The duplex
theory states that coarser textures (larger elements) are mainly perceived by
spatial cues, whereas finer textures are mainly perceived through temporal cues
[125, 142]. A spatial cue means that the structure can be inferred by pressing
the finger against the surface. A temporal cue arises when the finger is slid across the
surface, resulting in a pressure vibration that can be sensed by the Pacinian and
Meissner corpuscles. For a finer texture, a slower motion may be necessary so that
the vibration frequency remains below 250 to 350 Hz. Recall from Section 12.1
that people can learn to improve their texture perception and acuity when reading
Braille. Thus, perceptual learning may be applied to improve tactile (touch)
perception.

Haptic perception For a larger object, its overall geometric shape can be inferred
through haptic exploration, which involves handling the object. Imagine
that someone hands you an unknown object, and you must determine its shape
while blindfolded. Figure 13.2 shows six different qualitative types of haptic exploration,
each of which involves different kinds of receptors and combinations of
spatial and temporal information. By integrating the somatosensory signals from
this in-hand manipulation, a geometric model of the object is learned.

Figure 13.2: Haptic exploration involves several different kinds of interaction between
the hand and an object to learn the object properties, such as size, shape,
weight, firmness, and surface texture. (Figure by Allison Okamura, adapted from
Lederman and Klatzky.)

Somatosensory illusions Recall from Section 6.4 that the brain combines signals
across multiple sensing modalities to provide a perceptual experience. Just
as the McGurk effect uses mismatches between visual and auditory cues, illusions
have also been discovered by mismatching cues between vision and somatosensory
systems. The rubber hand illusion is one of the most widely known [66]. In this
case, scientists conducted an experiment in which the subjects were seated at a
table with both arms resting on it. The subjects' left arm was covered, but a
substitute rubber forearm was placed nearby on the table and remained visible so
that it appeared as if it were their own left arm. The experimenter stroked both
the real and fake forearms with a paint brush to help build up visual and touch
association with the fake forearm. Using a functional MRI scanner, scientists determined
that the same parts of the brain are activated whether it is the real or
fake forearm. Furthermore, they even learned that making a stabbing gesture with
a needle causes anticipation of pain and the tendency to withdraw the real left
arm, which was actually not threatened [66, 293], and that hot or cold sensations
can even be perceived by association [291].

Figure 13.3: The rubber hand illusion, in which a person reacts to a fake hand as
if it were her own. (Figure from Guterstam, Petkova, and Ehrsson, 2011 [107].)

More generally, this is called a body transfer illusion [251, 293]. An example
of this was shown in Figure 1.14 of Section 1.2 for a VR system in which men
and women were convinced that they were swapping bodies, while the visual information
from a camera was coupled with coordinated hand motions to provide
tactile sensory stimulation. Applications of this phenomenon include empathy and
helping amputees to overcome phantom limb sensations. This illusion also gives
insights into the kinds of motor programs that might be learnable, as discussed in
Sections 10.1 and 10.3, by controlling muscles while getting visual feedback from
VR. It furthermore affects the perception of oneself in VR, which was discussed
in Sections 10.4 and 12.2.

Haptic interfaces Touch sensations through engineered devices are provided
through many disparate systems. Figure 1.1 from Section 1.1 showed a system in
which force feedback is provided by allowing the user to push mechanical wings to
fly. Furthermore, a fan simulates wind with intensity that is proportional to the
speed of the person virtually flying. The entire body also tilts so that appropriate
vestibular stimulation is provided.

Figure 13.4 shows several more examples. Figure 13.4(a) shows a PC mouse
with a scroll wheel. As the wheel is rotated with the middle finger, discrete bumps
are felt so that a more carefully calibrated movement can be generated. Figure
13.4(b) shows a game controller attachment that provides vibration at key points
during an experience, such as an explosion or body contact.

Many haptic systems involve using a robot arm to apply force or pressure at
precise locations and directions within a small region. Figure 13.4(c) shows such
a system in which the user holds a pen that is attached to the robot arm. Forces
are communicated from the robot to the pen to the fingers. As the pen strikes
a virtual surface, the robot provides force feedback to the user by blocking its
motion. The pen could be dragged across the virtual surface to feel any kind of
texture [237]; a variety of simulated textures are presented in [50]. Providing such
force feedback is important in the development of medical devices that enable
doctors to perform surgical procedures through an interface that is connected to a
real device. Without accurate and timely haptic feedback, it is difficult for doctors
to perform many procedures. Imagine cutting into layers of tissue without being
able to feel the resistant forces on the scalpel. It would be easy to push a bit too
far!
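
As a rough illustration of how a device might "block" the pen at a virtual surface, the sketch below uses a one-dimensional spring (penalty) model: once the pen tip passes the surface, a restoring force proportional to the penetration depth is commanded. This is an assumed, simplified model chosen for illustration and is not claimed to be how the devices mentioned here are implemented; the function name, units, and stiffness value are hypothetical.

```python
def wall_force(pen_position, wall_position=0.0, stiffness=500.0):
    """Return the 1-D reaction force (newtons) pushing the pen out of a virtual wall.

    The wall fills all positions below wall_position (metres). The stiffness
    (newtons per metre) is a hypothetical tuning value.
    """
    penetration = wall_position - pen_position
    if penetration <= 0.0:
        # Pen tip is in free space: no force is rendered.
        return 0.0
    # Pen tip is inside the wall: push back in proportion to the depth.
    return stiffness * penetration


free_space = wall_force(0.010)    # 10 mm above the surface
in_contact = wall_force(-0.002)   # 2 mm below the surface
```

A stiffer spring feels more like a rigid surface, but in a real control loop the usable stiffness is limited by the update rate and latency of the device, which is one reason timely feedback matters.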
Figure 13.4(d) shows a haptic display that is arranged much like a visual dis-
play. A rectangular region is indexed by rows and columns, and at each location a
small pin can be forced outward. This enables shapes to appear above the surface,
while also allowing various levels of pressure and frequencies of vibration.
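
The row-and-column indexing of such a pin array can be sketched as a small grid of pin states. The example below is a hypothetical illustration only: it raises the pins that fall inside a disk, uses a simple raised or lowered state per pin, and ignores the pressure levels and vibration frequencies that a real device may support.

```python
def disk_pin_pattern(rows, cols, center_row, center_col, radius):
    """Return a rows x cols grid of booleans; True means the pin is raised."""
    return [
        [
            (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Raise a disk of radius 2 centered on a hypothetical 5 x 7 array.
pattern = disk_pin_pattern(rows=5, cols=7, center_row=2, center_col=3, radius=2)
for row in pattern:
    print("".join("#" if raised else "." for raised in row))
```

Replacing the boolean with a height or a drive frequency per pin would extend the same grid to graded pressure or vibration.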
All of the examples involve haptic feedback applied to the hands; however,
touch receptors appear all over the human body. To provide stimulation over a
larger fraction of receptors, a haptic suit may be needed, which provides forces,
vibrations, or even electrical stimulation at various points on the suit. A drawback
of these systems is the cumbersome effort of putting on and removing the suit with
each session.

Figure 13.4: (a) The Logitech M325 wireless mouse with a scroll wheel that provides
tactile feedback in the form of 72 bumps as the wheel performs a full revolution.
(b) The Sega Dreamcast Jump Pack (1999), which attaches to a game
controller and provides vibrations during game play. (c) Haptic Omni pen-guiding
haptic device, which communicates pressure and vibrations through the pen to
the fingers. (d) The KGS Dot View Model DV-2, which is a haptic pin array. The
pins are forced upward to simulate various textures as the finger tip scans across
its surface.

13.2 Smell and Taste

The only human senses not considered so far are smell and taste. They are formally
known as olfaction and gustation, respectively [65]. Furthermore, they are usually
grouped together as the chemical senses because their receptors work by chemical
interactions with molecules that arrive upon them. The resulting chemoreceptors
respond to particular substances and sufficiently high levels of concentration.
Compared to the other senses, much less research has been done about them and
there are far fewer electronic devices that "display" stimuli to the nose and
tongue. Nevertheless, these senses are extremely important. The design of artificial
smells is a huge business, which includes perfumes, deodorants, air fresheners,
cleaners, and incense. Likewise, designing tastes is the basis of the modern food
Smell physiology and perception Odors are important for several biological
purposes, which include detecting prey and predators, selecting potential mates,
and judging whether food is safe to eat. The olfactory receptor neurons lie in the
roof of the nasal cavity, covering an area of 2 to 4 cm². There are around 6 million
receptors, which are believed to span 500 to 1000 different types depending on
their responsiveness to specific chemical compositions [204]. Airborne molecules
dissolve into the olfactory mucus, which triggers detection by cilia (small hairs)
that are part of the receptor. The olfactory receptors are constantly regenerating,
with an average lifespan of about 60 days. In addition to receptors, some free nerve
endings lie in the olfactory mucus as well. The sensory pathways are unusual in
that they do not connect through the thalamus before reaching their highest-
level destination, which for smell is the primary olfactory cortex. There is also a
direct route from the receptors to the amygdala, which is associated with emotional
response. This may help explain the close connection between smell and emotional Figure 13.5: A depiction of a wearable olfactory display from [114]. Micropumps
reactions. force bits of liquid from small reservoirs. The SAW atomizer is an surface acoustic
In terms of perception, humans can recognize thousands of different smells wave device that converts droplets into an atomized odor.
[286], and women generally perform better than men [36]. The discrimination
ability depends on the concentration of the smell (in terms of molecules per cubic motion sickness [147].
area). If the concentration is weaker, then discrimination ability decreases. Fur- Olfactory displays usually involve air pumps that can spray chemical com-
thermore, what is considered to be a high concentration for one odor may be barely pounds into air. The presentation of such engineered odors could be delivered
detectable for another. Consequently, the detection thresholds vary by a factor of close to the nose for a personal experience. In this case, the canisters and distri-
a thousand or more, depending on the substance. Adaptation is also important for bution system could be worn on the body [356]. A recent system is depicted in
smell. People are continuously adapting to surrounding smells, especially those of Figure 13.5. Alternatively, the smells could be delivered on the scale of a room.
their own body or home, so that they become unnoticeable. Smokers also adapt
so that they do not perceive the polluted air in the way that non-smokers can.
It seems that humans can recognize many more smells than the number of
olfactory receptors. This is possible because of combinatorial encoding. Any single
odor (or chemical compound) may trigger multiple kinds of receptors. Likewise,
each receptor may be triggered by multiple odors. Thus, a many-to-many mapping
exists between odors and receptors. This enables far more odors to be distinguished
based on the distinct subsets of receptor types that become activated.

Olfactory interfaces Adding scent to films can be traced back to the early
20th century. One system, from 1960, was called Smell-O-Vision and injected 30
different odors into the movie theater seats at different points during the film. The
Sensorama system mentioned in Figure 1.28(c) of Section 1.3 also included smells.
In addition, the military has used smells as part of simulators for many decades.
A survey of previous olfactory displays and interfaces appears in [136], along
with current challenges and issues. It is generally believed that smell is powerful in
its ability to increase immersion in VR. It also offers advantages in some forms of
medical treatments that involve cravings and emotional responses. Surprisingly,
there is even recent evidence that pleasant odors help reduce visually induced

This would be preferable for a CAVE setting, but it is generally hard to control the
intensity and uniformity of the odor, especially in light of air flow that occurs from
open windows and air vents. It might also be desirable to vary the concentration
of odors over a large area so that localization can be performed, but this is again
difficult to achieve with accuracy.

Taste physiology and perception We now jump from smell to taste. On the
human tongue lie about 10,000 taste buds, each of which contains a group of about
50 to 150 taste receptors [295]. The receptors live for an average of 10 days, with
regeneration constantly occurring. Five basic types of taste receptors have been
identified:
• Umami: This one is sensitive to amino acids, such as monosodium glutamate
(MSG), and is responsible for an overall sense of tastiness. This enables food
manufacturers to cheaply add chemicals that make food seem to taste better.
The biological motivation is likely to be that amino acids are important
building blocks for proteins.

• Sweet: This is useful for identifying a food source in terms of its valuable
sugar content.
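Returning to the combinatorial encoding of odors described earlier in this section, the following is a minimal sketch of why a many-to-many mapping lets a small number of receptor types distinguish a larger number of odors. It assumes a simplified binary (on/off) activation model, which is not claimed to match real olfactory physiology. The odor names, receptor names, and all function names are hypothetical and chosen only for illustration.

```python
def count_activation_patterns(num_receptor_types: int) -> int:
    """Count the non-empty subsets of receptor types (binary on/off model)."""
    return 2 ** num_receptor_types - 1


# Hypothetical toy data: each odor triggers several receptor types,
# and each receptor type is triggered by several odors.
ODOR_TO_RECEPTORS = {
    "odor_A": {"r1", "r2"},
    "odor_B": {"r2", "r3"},
    "odor_C": {"r1", "r3"},
    "odor_D": {"r1", "r2", "r3"},
}


def invert_mapping(odor_to_receptors):
    """Build the receptor -> odors direction of the many-to-many mapping."""
    receptor_to_odors = {}
    for odor, receptors in odor_to_receptors.items():
        for receptor in receptors:
            receptor_to_odors.setdefault(receptor, set()).add(odor)
    return receptor_to_odors


def all_odors_distinguishable(odor_to_receptors):
    """True if no two odors activate exactly the same subset of receptors."""
    patterns = {frozenset(r) for r in odor_to_receptors.values()}
    return len(patterns) == len(odor_to_receptors)


if __name__ == "__main__":
    print(count_activation_patterns(3))
    print(sorted(invert_mapping(ODOR_TO_RECEPTORS)["r1"]))
    print(all_odors_distinguishable(ODOR_TO_RECEPTORS))
```

In this toy example, three receptor types separate four odors because each odor activates a different subset. Under the same binary assumption, the number of available subsets doubles with each additional receptor type.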
Figure 13.7: The HRP-4 humanoid robots, which are produced in Japan by the
National Institute of Advanced Industrial Science and Technology (AIST) and
Kawada Industries.
• Mobility: Where can the robot go? With no mobility, telepresence is reduced
to a stationary camera and microphone. If the task is to interact
with people, then the robot should be able to move into the same places that people
are capable of entering. In other settings, many modes of mobility may be
desirable, such as flying, swimming, or even crawling through pipes.
controlling a cart (Figure 10.5). The robot in the real world behaves geometrically
like the cart in the pure virtual world; however, some differences are: 1) The robot
cannot simply teleport to another location. It is, however, possible to connect to
a different robot, if many are available, which would feel like teleportation to the
user. 2) The robot is subject to constraints based on its physical design and its
environment. It may have rolling wheels or walking legs, and may or may not be
able to easily traverse parts of the environment. It will also have limited driving
speed, turning speed, and battery life. 3) A high cost is usually associated with
crashing the robot into people or obstacles.
A spectrum of choices exists for the user who teleoperates the robot. At one
extreme, the user may continuously control the movements, in the way that a
radio-controlled car is driven using the remote. Latency becomes critical in some
applications, especially telesurgery [188, 355]. At the other extreme, the user may
simply point out the location on a map or use a virtual laser pointer (Section
10.2) to point to a visible location. In this case, the robot could execute all
of the motions by itself and take the user along for the ride. This requires a
higher degree of autonomy for the robot because it must plan its own route that
accomplishes the goals without running into obstacles; this is known in robotics
as motion planning [163]. This frees the user from having to focus attention on the
minor robot movements, but it may be difficult to obtain reliable performance for
some combinations of robot platforms and environments.

VR sickness issues Because of the connection to locomotion, vection once
again arises (Section 8.4). Many of the suggestions from Section 10.2 to reduce
vection can be applied here, such as reducing the contrast or the field of view
while the robot is moving. Now consider some robot-specific suggestions. Users
may be more comfortable controlling the robot themselves rather than relying on a higher
level of autonomy, even though it involves tedious concentration. Furthermore,
the path determined by a motion planning algorithm could itself be optimized to
reduce sickness by shortening times over which accelerations occur or by avoiding
close proximity to walls or objects that have high spatial frequency and contrast.
Another idea is to show the motion on a 2D or 3D map while the robot is moving,
from a third-person perspective. The user could conceivably be shown anything,
such as news feeds, while the robot is moving. As in the case of locomotion for
virtual worlds, one must be careful not to disorient the user by failing to provide
enough information to easily infer the new position and orientation relative to the
old one by the time the user has arrived.

Latency issues As expected, time delays threaten the performance and comfort
of telepresence systems. Such latencies have already been discussed in terms of
visual rendering (Section 7.4) and virtual world simulation (Section 8.3.2). A networked
system causes new latency to be added to that of the VR system because
information must travel from the client to the server and back again. Furthermore,
bandwidth (bits per second) is limited, which might cause further delays or
degradation in quality. For reference, the average worldwide travel time to Google
and back was around 100 ms in 2012 (it was 50 to 60 ms in the US) [219]. Note that
by transmitting an entire panoramic view to the user, the network latency should
not contribute to head tracking and rendering latencies.
However, latency has a dramatic impact on interactivity, which is a well-known
problem for networked gamers. On the other hand, it has been found that people
generally tolerate latencies in phone calls of up to 200 ms before complaining of
difficulty conversing; however, they may become frustrated if they expect the robot
to immediately respond to their movement commands. Completing a manipulation
task is even more difficult because of delays in hand-eye coordination. In
some cases people can be trained to overcome high latencies through adaptation,
assuming the latencies do not substantially vary during and across the trials [68].
The latency poses a considerable challenge for medical applications of telepresence.
Imagine if you were a doctor pushing on a scalpel via a telepresence system, but
could not see or feel that it is time to stop cutting until 500 ms later. This might
be too late!

13.4 Brain-Machine Interfaces

The ultimate interface between humans and machines could be through direct
sensing and stimulation of neurons. One step in this direction is to extract physiological
measures, which were introduced in Section 12.3. Rather than using them
to study VR sickness, we could apply measures such as heart rate, galvanic skin
response, and pallor to adjust the VR experience dynamically. Various goals would
be optimized, such as excitement, fear, comfort, or relaxation. Continuing further,
we could apply technology that is designed to read the firings of neurons so that
the VR system responds to them by altering the visual and auditory displays. The
users can learn that certain thoughts have an associated effect in VR, resulting
in mind control. The powers of neuroplasticity and perceptual learning (Section
12.1) could enable them to comfortably and efficiently move their avatar bodies
in the virtual world. This might sound like pure science fiction, but substantial
progress has been made. For example, monkeys have been recently trained by
neuroscientists at Duke University to drive wheelchairs using only their thoughts
[259]. In the field of brain-machine interfaces (alternatively, BMI, brain-computer
interfaces, or BCI), numerous other experiments have been performed, which connect
humans and animals to mechanical systems and VR experiences via their
thoughts [173, 175, 185]. Surveys of this area include [89, 232, 351].

Measurement methods The goal of devices that measure neural activity is
to decipher the voluntary intentions and decisions of the user. They are usually
divided into two categories: non-invasive (attaching sensors to the skin is allowed)
and invasive (drilling into the skull is allowed).
First consider the non-invasive case, which is by far the most appropriate for
humans. The most accurate way to measure full brain activity to date is by functional
magnetic resonance imaging (fMRI), which is shown in Figure 13.10. This is
related to MRI, which most people are familiar with as a common medical scanning
method. Ordinary MRI differs in that it provides an image of the static structures
to identify abnormalities, whereas fMRI provides images that show activities
of parts of the brain over time. Unfortunately, fMRI is too slow, expensive, and
cumbersome for everyday use as a VR interface [173]. Furthermore, users typically
ingest a dye that increases contrast due to variations in blood flow and also must
remain rigidly fixed.

Figure 13.10: fMRI scans based on various activities. (Figure from Mayfield Brain
and Spine)

Thus, the most common way to measure brain activity for BMI is via electroencephalogram
(EEG), which involves placing electrodes along the scalp to measure
electrical field fluctuations that emanate from neural activity; see Figure 13.11.
The signal-to-noise ratio is unfortunately poor because the brain tissue, bone, and
skin effectively perform low-pass filtering that destroys most of the signal. There
is also significant attenuation and interference with other neural structures. The
transfer rate of information via EEG is between 5 and 25 bits per second [173, 351].
This is roughly equivalent to one to a few characters per second, which is two orders
of magnitude slower than the average typing rate. Extracting the information
from EEG signals involves difficult signal processing [278]; open-source libraries
exist, such as OpenViBE from INRIA Rennes.

Figure 13.11: EEG systems place electrodes around the skull: (a) A skull cap that
allows up to a few dozen signals to be measured. (b) Emotiv wireless EEG device.

For the invasive case, electrodes are implanted intracranially (inside of the
skull). This provides much more information for scientists, but is limited to studies
on animals (and some humans suffering from neural disorders such as Parkinson’s
disease). Thus, invasive methods are not suitable for the vast majority of people
as a VR interface. The simplest case is to perform a single-unit recording for
a particular neuron; however, this often increases the number of required trials
because the neural response typically switches between different neurons across
trials. As the number of neurons increases, deciphering the thoughts
becomes more reliable. Numerous recordings could be from a single site that
performs a known function, or could come from multiple sites to help understand
the distributed processing performed by the brain [173].

Medical motivation It is important to understand the difference between VR
users and the main targeted community for BMI. The field of BMI has rapidly
developed because it may give mobility to people who suffer from neuromuscular
disabilities [351]. Examples include driving a wheelchair and moving a prosthetic
limb by using thoughts alone. The first mental control system was built by Jacques
Vidal in the 1970s [331, 332], and since that time many systems have been built
using several kinds of neural signals. In all cases, it takes a significant amount of
training and skill to operate these interfaces. People with motor disabilities may be
highly motivated to include hours of daily practice as part of their therapy routine,
but this would not be the case for the majority of VR users. One interesting
problem in training is that trainees require feedback, which is a perfect application
of VR. The controller in the VR system is essentially replaced by the output
of the signal processing system that analyzes the neural signals. The user can
thus practice moving a virtual wheelchair or prosthetic limb while receiving visual
feedback from a VR system. This prevents them from injuring themselves or
damaging equipment or furnishings while practicing.

Learning new body schema What happens to the human’s perception of her
own body when controlling a prosthetic limb? The internal brain representation
of the body is referred to as a body schema. It was proposed over a century ago
[115] that when people skillfully use tools, the body schema adapts accordingly
so that the brain operates as if there is a new, extended body. This results in
perceptual assimilation of the tool and hand, which was confirmed from neural
signals in [131]. This raises a fascinating question for VR research: What sort of
body schema could our brains learn through different visual body representations
(avatars) and interaction mechanisms for locomotion and manipulation?

• Better classification techniques that can recognize the intentions and decisions
of the user with higher accuracy and detail. Modern machine learning
methods may help advance this.

• Dramatic reduction in the amount of training that is required before using an
interface. If it requires more work than learning how to type, then widespread
adoption would be unlikely.
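To make the idea of classifying user intentions from measured signals concrete, here is a minimal sketch of a nearest-centroid classifier. It is only a toy illustration and not a method taken from the BMI literature cited above. The two-dimensional feature vectors are synthetic, and the labels (`move_left`, `move_right`) and all function names are hypothetical. A practical system would first extract features from neural recordings through signal processing, and would likely use a more capable learning method.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def train_nearest_centroid(labeled_examples):
    """Map each label to the centroid of its training feature vectors."""
    by_label = {}
    for features, label in labeled_examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(vs) for label, vs in by_label.items()}


def classify(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Synthetic, made-up feature vectors (not real EEG data).
TRAINING = [
    ([0.9, 0.1], "move_left"),
    ([0.8, 0.2], "move_left"),
    ([0.1, 0.9], "move_right"),
    ([0.2, 0.8], "move_right"),
]

if __name__ == "__main__":
    model = train_nearest_centroid(TRAINING)
    print(classify(model, [0.7, 0.3]))
```

The training step here amounts to averaging a handful of labeled examples per intention, which hints at why the amount of required user training and the quality of the classifier are closely linked: noisier features generally need more examples per label before the centroids separate reliably.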
Bibliography

[2] Z. M. Aghajan, L. Acharya, J. J. Moore, J. D. Cushman, C. Vuong, and M. R.
Mehta. Impaired spatial selectivity and intact phase precession in two-dimensional
virtual reality. Nature Neuroscience, 18(1):121–128, 2015.
[3] Y. Akatsuka and G. A. Bekey. Compensation for end to end delays in a VR
system. In Proceedings IEEE Virtual Reality Annual International Symposium,
pages 156–159, 1998.
[4] K. Akeley, S. J. Watt, A. Reza Girschick, and M. S. Banks. A stereo display
prototype with multiple focal distances. ACM Transactions on Graphics, 23(3),
2004.
[5] T. Akenine-Möller, E. Haines, and N. Hoffman. Real-Time Rendering. CRC Press,
Boca Raton, FL, 2008.
[6] D. Alais, C. Morrone, and D. Burr. Separate attentional resources for vision and
audition. Proceedings of the Royal Society B: Biological Sciences, 273(1592):1339–1345,
2006.
[7] B. B. Andersen, L. Korbo, and B. Pakkenberg. A quantitative study of the human
cerebellum with unbiased stereological techniques. Journal of Comparative
Neurology, 326(4):549–560, 1992.
[8] J. Angeles. Spatial Kinematic Chains. Analysis, Synthesis, and Optimisation.
Springer-Verlag, Berlin, 1982.
[9] J. Angeles. Rotational Kinematics. Springer-Verlag, Berlin, 1989.
[10] J. Angeles. Fundamentals of Robotic Mechanical Systems: Theory, Methods, and
Algorithms. Springer-Verlag, Berlin, 2003.
[11] A. Antoniou. Digital Signal Processing: Signals, Systems, and Filters.
McGraw-Hill Education, Columbus, OH, 2005.
[12] D. K. Arrowsmith and C. M. Place. Dynamical Systems: Differential Equations,
Maps, and Chaotic Behaviour. Chapman & Hall/CRC, New York, 1992.
[13] K. W. Arthur. Effects of Field of View on Performance with Head-Mounted Displays.
PhD thesis, University of North Carolina at Chapel Hill, 2000.
[14] K. J. Astrom and R. Murray. Feedback Systems: An Introduction for Scientists
and Engineers. Princeton University Press, Princeton, NJ, 2008.
[19] H. H. Barrett and K. J. Myers. Foundations of Image Science. Wiley, Hoboken,
NJ, 2004.
[20] E. P. Becerra and M. A. Stutts. Ugly duckling by day, super model by night: The
influence of body image on the use of virtual worlds. Journal of Virtual Worlds
Research, 1(2):1–19, 2008.
[21] C. Bergland. The wacky neuroscience of forgetting how to ride a bicycle.
Psychology Today, May 2015. Posted online.
[22] J. Birn. Digital Lighting and Rendering, 3rd Ed. New Riders, San Francisco, CA,
2013.
[23] J. Blauert. Spatial Hearing: Psychophysics of Human Sound Localization. MIT
Press, Boston, MA, 1996.
[24] J. F. Blinn. Models of light reflection for computer synthesized pictures. In
Proceedings Annual Conference on Computer Graphics and Interactive Techniques,
1977.
[25] I. Bogost and N. Montfort. Racing the Beam: The Atari Video Computer System.
MIT Press, Cambridge, MA, 2009.
[26] S. J. Bolanowski, G. A. Gescheider, R. T. Verrillo, and C. M. Checkosky. Four channels
mediate the aspects of touch. Journal of the Acoustical Society of America,
84(5):1680–1694, 1988.
[27] W. M. Boothby. An Introduction to Differentiable Manifolds and Riemannian
Geometry. Revised 2nd Ed. Academic, New York, 2003.
[28] D. Bordwell and K. Thompson. Film History: An Introduction, 3rd Ed.
McGraw-Hill, New York, NY, 2010.
[29] J. K. Bowmaker and H. J. A. Dartnall. Visual pigment of rods and cones in a
human retina. Journal of Physiology, 298:501–511, 1980.
[30] D. Bowman and L. Hodges. An evaluation of techniques for grabbing and manipulating
remote objects in immersive virtual environments. In Proceedings ACM
Symposium on Interactive 3D Graphics, pages 35–38, 1997.
[31] D. A. Bowman, E. Kruijff, J. J. LaViola, and I. Poupyrev. 3D User Interfaces.
Addison-Wesley, Boston, MA, 2005.
[32] K. Brown. Silent films: What was the right speed? Sight and Sound, 49(3):164–167,
1980.
[33] M. Brown and D. G. Lowe. Automatic panoramic image stitching using invariant
features. International Journal of Computer Vision, 74(1):59–73, 2007.
[34] N. C. Burbules. Rethinking the virtual. In J. Weiss, J. Nolan, and P. Trifonas,
editors, The International Handbook of Virtual Learning Environments, pages 3–24.
Kluwer Publishers, Dordrecht, 2005.
[35] D. C. Burr, M. C. Morrone, and L. M. Vaina. Large receptive fields for optic flow
detection in humans. Vision Research, 38(12):1731–1743, 1998.
[36] W. S. Cain. Odor identification by males and females: predictions vs performance.
Chemical Senses, 7(2):129–142, 1994.
[37] P. Cairns and A. L. Cox. Research Methods for Human-Computer Interaction.
Cambridge University Press, Cambridge, U.K., 2008.
[38] F. W. Campbell and D. G. Green. Optical and retinal factors affecting visual
resolution. Journal of Physiology, 181:576–593, 1965.
[39] S. K. Card, W. K. English, and B. J. Burr. Evaluation of mouse, rate-controlled isometric
joystick, step keys, and text keys for text selection on a CRT. Ergonomics,
20:601–613, 1978.
[40] J. M. Carroll. HCI Models, Theories, and Frameworks: Toward a Multidisciplinary
Science. Morgan Kaufmann, San Francisco, CA, 2003.
[41] E. Catmull. A subdivision algorithm for computer display of curved surfaces. PhD
thesis, University of Utah, 1974.
[42] A. Y. Chang. A survey of geometric data structures for ray tracing. Technical
Report TR-CIS-2001-06, Brooklyn Polytechnic University, 2001.
[43] N. Chaudhari, A. M. Landin, and S. D. Roper. A metabotropic glutamate receptor
variant functions as a taste receptor. Nature Neuroscience, 3(3):113–119, 2000.
[44] G. Chen, J. A. King, N. Burgess, and J. O’Keefe. How vision and movement
combine in the hippocampal place code. Proceedings of the National Academy of
Science USA, 110(1):378–383, 2013.
[45] C. K. Chui and G. Chen. Kalman Filtering. Springer-Verlag, Berlin, 1991.
[46] D. Claus and A. W. Fitzgibbon. A rational function lens distortion model for
general cameras. In Proc. Computer Vision and Pattern Recognition, pages 213–219,
2005.
[47] E. Cline. Ready Player One. Random House, 2011.
[48] D. Cox, J. Little, and D. O’Shea. Ideals, Varieties, and Algorithms.
Springer-Verlag, Berlin, 1992.
[49] C. Cruz-Neira, D. J. Sandin, T. A. DeFanti, R. V. Kenyon, and J. C. Hart. The
CAVE: Audio visual experience automatic virtual environment. Communications
of the ACM, 35(6):64–72, 1992.
[50] H. Culbertson, J. J. Lopez Delgado, and K. J. Kuchenbecker. One hundred
data-driven haptic texture models and open-source methods for rendering on 3D objects.
In Proceedings IEEE Haptics Symposium, pages 319–325, 2014.
[51] C. A. Curcio, K. R. Sloan, R. E. Kalina, and A. E. Hendrickson. Human photoreceptor
topography. Journal of Comparative Neurology, 292:497–523, 1990.
[52] R. P. Darken and B. Peterson. Spatial orientation, wayfinding, and representation.
In K. S. Hale and K. M. Stanney, editors, Handbook of Virtual Environments, 2nd
Edition, pages 131–161. CRC Press, Boca Raton, FL, 2015.
[53] R. Darwin. New experiments on the ocular spectra of light and colours.
Philosophical Transactions of the Royal Society of London, 76:313–348, 1786.
[54] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational
Geometry: Algorithms and Applications, 2nd Ed. Springer-Verlag, Berlin, 2000.
[55] K. N. de Winkel, M. Katliar, and H. H. Bülthoff. Forced fusion in multisensory
heading estimation. PloS ONE, 10(5), 2015.
[56] J. R. Dejong. The effects of increasing skill on cycle time and its consequences for
time standards. Ergonomics, 1(1):51–60, 1957.
[57] J. Delwiche. The impact of perceptual interactions on perceived flavor. Food
Quality and Preference, 15:137–146, 2004.
[58] J. L. Demer, J. Goldberg, H. A. Jenkins, and F. I. Porter. Vestibulo-ocular reflex
during magnified vision: Adaptation to reduce visual-vestibular conflict. Aviation,
Space, and Environmental Medicine, 58(9 Pt 2):A175–A179, 1987.
[59] M. Dennison, Z. Wisti, and M. D’Zmura. Use of physiological signals to predict
cybersickness. Displays, 44:52–52, 2016.
[60] D. Deutsch, T. Hamaoui, and T. Henthorn. The glissando illusion and handedness.
Neuropsychologia, 45:2981–2988, 2007.
[61] P. DiZio, J. R. Lackner, and R. K. Champney. Proprioceptive adaptation and
aftereffects. In K. S. Hale and K. M. Stanney, editors, Handbook of Virtual Environments,
2nd Edition. CRC Press, Boca Raton, FL, 2015.
[62] M. H. Draper, E. S. Viire, T. A. Furness, and V. J. Gawron. Effects of
image scale and system time delay on simulator sickness with head-coupled virtual
environments. Human Factors, 43(1):129–146, 2001.
[63] A. T. Duchowski. Eye Tracking Methodology: Theory and Practice, 2nd Ed.
Springer-Verlag, Berlin, 2007.
[64] G. Dudek, P. Giguere, C. Prahacs, S. Saunderson, J. Sattar, L.-A. Torres-Mendez,
M. Jenkin, A. German, A. Hogue, A. Ripsman, J. Zacher, E. Milios, H. Liu,
P. Zhang, M. Buehler, and C. Georgiades. Aqua: An amphibious autonomous
robot. IEEE Computer Magazine, 40(1):46–53, 2007.
[65] R. L. Doty (Ed.). Handbook of Olfaction and Gustation, 3rd Ed. Wiley-Blackwell,
Hoboken, NJ, 2015.
[66] H. H. Ehrsson, C. Spence, and R. E. Passingham. That’s my hand! Activity in
premotor cortex reflects feeling of ownership of a limb. Science, 305(5685):875–877,
2004.
[67] S. R. Ellis, K. Mania, B. D. Adelstein, and M. I. Hill. Generalizeability of latency
detection in a variety of virtual environments. In Proceedings of the Human Factors
and Ergonomics Society Annual Meeting, pages 2632–2636, 2004.
[68] S. R. Ellis, M. J. Young, B. D. Adelstein, and S. M. Ehrlich. Discrimination
of changes in latency during head movement. In Proceedings Computer Human
Interfaces, pages 1129–1133, 1999.
[69] M. Emoto, K. Masaoka, M. Sugawara, and F. Okano. Viewing angle effects from
wide field video projection images on the human equilibrium. Displays, 26(1):9–14,
2005.
[70] D. J. Encross. Control of skilled movement. Psychological Bulletin, 84:14–29, 1977.
[71] R. Engbert and K. Mergenthaler. Microsaccades are triggered by low retinal image
slip. Proceedings of the National Academy of Sciences of the United States of
America, 103(18):7192–7197, 2008.
[72] B. W. Epps. Comparison of six cursor control devices based on Fitts’ law models.
In Proceedings of the 30th Annual Meeting of the Human Factors Society, pages
327–331, 1986.
[73] C. J. Erkelens. Coordination of smooth pursuit and saccades. Vision Research,
46(1–2):163–170, 2006.
[74] D. Fattal, Z. Peng, T. Tran, S. Vo, M. Fiorentino, J. Brug, and R. G. Beausoleil. A
multi-directional backlight for a wide-angle, glasses-free three-dimensional display.
Nature, 495:348–351, 2013.
[75] J. Favre, B. M. Jolles, O. Siegrist, and K. Aminian. Quaternion-based fusion
of gyroscopes and accelerometers to improve 3D angle measurement. Electronics
Letters, 32(11):612–614, 2006.
[76] G. T. Fechner. Elements of Psychophysics (in German). Breitkopf and Härtel,
Leipzig, 1860.
[77] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model
fitting with applications to image analysis and automated cartography. Communications
of the ACM, 24(6):381–395, 1981.
[78] P. M. Fitts. The information capacity of the human motor system in controlling
the amplitude of movement. Journal of Experimental Psychology, 47(6):381–391,
1956.
[79] R. C. Fitzpatrick and B. L. Day. Probing the human vestibular system with
galvanic stimulation. Journal of Applied Physiology, 96(6):2301–2316, 2004.
[80] R. C. Fitzpatrick, J. Marsden, S. R. Lord, and B. L. Day. Galvanic vestibular
stimulation evokes sensations of body rotation. NeuroReport, 13(18):2379–2383,
2002.
[81] T. Fong, I. Nourbakhsh, and K. Dautenhahn. A survey of socially interactive
robots: Concepts, design, and applications. Robotics and Autonomous Systems,
42(3-4):143–166, 2003.
[82] W. T. Fong, S. K. Ong, and A. Y. C. Nee. Methods for in-field user calibration of
an inertial measurement unit without external equipment. Measurement Science
and Technology, 19(8), 2008.
[83] A. K. Forsberg, K. Herndon, and R. Zeleznik. Aperture based selection for immersive
virtual environments. In Proceedings ACM Symposium on User Interface
Software and Technology, pages 95–96, 1996.
[84] D. Friedman, R. Leeb, C. Guger, A. Steed, G. Pfurtscheller, and M. Slater. Navigating
virtual reality by thought: What is it like? Presence: Teleoperators and
Virtual Environments, 16(1):100–110, 2007.
[85] H. Fuchs, Z. M. Kedem, and B. F. Naylor. On visible surface generation by a
priori tree structures. In Proceedings ACM SIGGRAPH, pages 124–133, 1980.
[86] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendon-Mancha. Visual simultaneous
localization and mapping: a survey. Journal Artificial Intelligence Review,
43(1):55–81, 2015.
[87] T. Funkhouser, I. Carlbom, G. Elko, G. Pingali, M. Sondhi, and J. West. A
beam tracing approach to acoustic modeling for interactive virtual environments.
In Proceedings ACM Annual Conference on Computer Graphics and Interactive
Techniques, pages 21–32, 1998.
[88] J. Gallier. Curves and Surfaces in Geometric Modeling. Morgan Kaufmann, San
Francisco, CA, 2000.
[89] S. Gao, Y. Wang, X. Gao, and B. Hong. Visual and auditory brain-computer
interfaces. IEEE Transactions on Biomedical Engineering, 61(5):1436–1447, 2014.
[90] M. A. Garcia-Perez. Forced-choice staircases with fixed step sizes: asymptotic and
small-sample properties. Vision Research, 38(12):1861–81, 1998.
[91] G. M. Gauthier and D. A. Robinson. Adaptation of the human vestibuloocular
reflex to magnifying lenses. Brain Research, 92(2):331–335, 1975.
[92] D. Gebre-Egziabher, G. Elkaim, J. David Powell, and B. Parkinson. Calibration
of strapdown magnetometers in magnetic field domain. Journal of Aerospace Engineering,
19(2):87–102, 2006.
[93] G. Gescheider. Psychophysics: The Fundamentals, 3rd Ed. Lawrence Erlbaum
Associates, Mahwah, NJ, 2015.
[94] E. Gibson. Principles of Perceptual Learning and Development.
Appleton-Century-Crofts, New York, 1969.
[95] W. Gibson. Neuromancer. Ace Books, 1984.
[96] W. C. Gogel. An analysis of perceptions from changes in optical size. Perception
and Psychophysics, 60(5):805–820, 1998.
[97] E. B. Goldstein. Sensation and Perception, 9th Ed. Wadsworth, Belmont, CA,
2014.
[98] R. L. Goldstone. Perceptual learning. Annual Review of Psychology, 49:585–612,
1998.
[99] A. Gopnik, A. N. Meltzoff, and P. K. Kuhl. The Scientist in the Crib: What Early
Learning Tells Us About the Mind. HarperCollins, New York, NY, 2000.
[100] S. Gottschalk, M. C. Lin, and D. Manocha. OBBTree: A hierarchical structure for
rapid interference detection. In Proceedings ACM SIGGRAPH, 1996.
[101] A. C. Grant, M. C. Thiagarajah, and K. Sathian. Tactile perception in blind
Braille readers: A psychophysical study of acuity and hyperacuity using gratings
and dot patterns. Perception and Psychophysics, 62(2):301–312, 2000.
[102] A. Graybiel and J. Knepton. Sopite syndrome - sometimes sole manifestation of
motion sickness. Aviation, Space, and Environmental Medicine, 47(8):873–882,
1976.
[103] J. Gregory. Game Engine Architecture, 2nd Ed. CRC Press, Boca Raton, FL,
2014.
[104] J. E. Greivenkamp. Field Guide to Geometrical Optics. SPIE Press, Bellingham,
WA, 2004.
[105] B. Guenter, M. Finch, S. Drucker, D. Tan, and J. Snyder. Foveated
3D graphics. Technical report, Microsoft Research, 2012. Available at
http://research.microsoft.com/.
[106] P. Guigue and O. Devillers. Fast and robust triangle-triangle overlap test using
orientation predicates. Journal of Graphics Tools, 8(1):25–32, 2003.
[107] A. Guterstam, V. I. Petkova, and H. H. Ehrsson. The illusion of owning a third
arm. PloS ONE, 6(2), 2011.
[108] K. S. Hale and K. M. Stanney. Handbook of Virtual Environments, 2nd Edition.
CRC Press, Boca Raton, FL, 2015.
[109] G. Hall. Perceptual and Associative Learning. Oxford University Press, Oxford,
UK, 1991.
[110] R. S. Hartenberg and J. Denavit. A kinematic notation for lower pair mechanisms
based on matrices. Journal of Applied Mechanics, 77:215–221, 1955.
[111] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision,
2nd Ed. Cambridge University Press, Cambridge, U.K., 2004.
[112] C. D. Harvey, F. Collman, D. A. Dombeck, and D. W. Tank. Intracellular dynamics
of hippocampal place cells during virtual navigation. Nature, 461:941–946, 2009.
[113] J. O. Harvey. Efficient estimation of sensory thresholds with ML-PEST. Spatial
Vision, 11(1):121–128, 1997.
[114] K. Hashimoto, Y. Maruno, and T. Nakamoto. Brief demonstration of olfactory and
visual presentation using wearable olfactory display and head mounted display. In
Proceedings IEEE Virtual Reality Conference, page Abstract, 2016.
[115] H. Head and G. Holmes. Sensory disturbances from cerebral lesion. Brain,
34(2-3):102–254, 1911.
[116] E. G. Heckenmueller. Stabilization of the retinal image: A review of method,
effects, and theory. Psychological Bulletin, 63:157–169, 1965.
[117] J. Heikkilä. Geometric camera calibration using circular control points. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22(10):1066–1077,
2000.
[118] J. Heikkilä and O. Silvén. A four-step camera calibration procedure with implicit
image correction. In Proc. Computer Vision and Pattern Recognition, pages 1106–1112,
1997.
[119] W. T. Higgins. A comparison of complementary and Kalman filtering. IEEE
Transactions on Aerospace and Electronic Systems, 11(3):321–325, 1975.
[120] J. M. Hillis, M. O. Ernst, M. S. Banks, and M. S. Landy. Combining sensory
information: mandatory fusion within, but not between, senses. Science,
298(5098):1627–30, 2002.
[121] P. Hoberman, D. M. Krum, E. A. Suma, and M. Bolas. Immersive training games
for smartphone-based head mounted displays. In IEEE Virtual Reality Short Papers
and Posters, 2012.
[122] J. G. Hocking and G. S. Young. Topology. Dover, New York, 1988.
[123] C. M. Hoffmann. Geometric and Solid Modeling. Morgan Kaufmann, San Francisco,
CA, 1989.
[124] R. V. Hogg, J. McKean, and A. T. Craig. Introduction to Mathematical Statistics,
7th Ed. Pearson, New York, NY, 2012.
[125] M. Hollins, M. H. Buonocore, and G. R. Mangun. The neural mechanisms of
top-down attentional control. Nature Neuroscience, 3(3):284–291, 2002.
[126] G. C. Holst and T. S. Lomheim. CMOS/CCD Sensors and Camera Systems. SPIE
Press, Bellingham, WA, 2011.
[127] X. Hu and H. Hua. Design and assessment of a depth-fused multi-focal-plane
display prototype. Journal of Display Technology, 10(4):308–316, 2014.
[128] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and
N. Roy. Visual odometry and mapping for autonomous flight using an RGB-D
camera. In Proceedings International Symposium on Robotics Research, 2011.
[129] C.-M. Huang and B. Mutlu. The repertoire of robot behavior: Enabling robots
to achieve interaction goals through social behavior. Journal of Human-Robot
Interaction, 2(2), 2013.
[130] W. Hugemann. Correcting lens distortions in digital photographs. In European
Association for Accident Research and Analysis (EVU) Conference, 2010.
[131] A. Iriki, M. Tanaka, and Y. Iwamura. Coding of modified body schema during
tool use by macaque postcentral neurones. Neuroreport, 7(14):2325–2330, 1996.
[132] J. A. Irwin. The pathology of sea-sickness. The Lancet, 118(3039):907–909, 1878.
[133] A. Iserles. A First Course in the Numerical Analysis of Differential Equations,
2nd Ed. Cambridge University Press, Cambridge, U.K., 2008.
[134] M. Izzetoglu, K. Izzetoglu, S. Bunce, H. Ayaz, A. Devaraj, B. Onaral, and K. Pourrazaei.
Functional near-infrared neuroimaging. IEEE Transactions on Neural Systems
and Rehabilitation Engineering, 13(2):153–159, 2005.
[135] J. Jerald. The VR Book. Association for Computing Machinery and Morgan &
Claypool Publishers, 2015.
[136] D. L. Jones, S. Dechmerowski, R. Oden, V. Lugo, J. Wang-Costello, and W. Pike.
Olfactory interfaces. In K. S. Hale and K. M. Stanney, editors, Handbook of Virtual
Environments, 2nd Edition, pages 131–161. CRC Press, Boca Raton, FL, 2015.
[137] N. P. Jouppi and S. Thomas. Telepresence systems with automatic preservation
of user head height, local rotation, and remote translation. In Proc. IEEE International
Conference on Robotics and Automation, pages 62–68, 2005.
[138] M. Kaliuzhna, M. Prsa, S. Gale, S. J. Lee, and O. Blanke. Learning to integrate contradictory multisensory self-motion cue pairings. Journal of Vision, 15(10), 2015.
[139] M. Kalloniatis and C. Luu. Visual acuity. In H. Kolb, R. Nelson, E. Fernandez, and B. Jones, editors, Webvision: The Organization of the Retina and Visual System. 2007. Last retrieved on October 18, 2016.
[140] R. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82:35–45, 1960.
[141] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proceedings of IEEE and ACM International Workshop on Augmented Reality, 1999.
[142] D. Katz. Der Aufbau der Tastwelt. Zeitschrift für Psychologie, Ergänzungsband 11, 1925.
[143] R. S. Kennedy and L. H. Frank. A review of motion sickness with special reference to simulator sickness. Technical Report NAVTRAEQUIPCEN 81-C-0105-16, United States Navy, 1985.
[144] R. S. Kennedy, N. E. Lane, K. S. Berbaum, and M. G. Lilienthal. Simulator sickness questionnaire: An enhanced method for quantifying simulator sickness. International Journal of Aviation Psychology, 3(3):203–220, 1993.
[145] B. Keshavarz, H. Hecht, and B. D. Lawson. Visually induced motion sickness: Causes, characteristics, and countermeasures. In K. S. Hale and K. M. Stanney, editors, Handbook of Virtual Environments, 2nd Edition, pages 647–698. CRC Press, Boca Raton, FL, 2015.
[146] B. Keshavarz, B. E. Riecke, L. J. Hettinger, and J. L. Campos. Vection and visually induced motion sickness: how are they related? Frontiers in Psychology, 6(472), 2015.
[147] B. Keshavarz, D. Stelzmann, A. Paillard, and H. Hecht. Visually induced motion sickness can be alleviated by pleasant odors. Experimental Brain Research, 233:1353–1364, 2015.
[148] W. Khalil and J. F. Kleinfinger. A new geometric notation for open and closed-loop robots. In Proceedings IEEE International Conference on Robotics & Automation, volume 3, pages 1174–1179, 1986.
[149] D. O. Kim, C. E. Molnar, and J. W. Matthews. Cochlear mechanics: Nonlinear behaviour in two-tone responses as reflected in cochlear-nerve-fibre responses and in ear-canal sound pressure. Journal of the Acoustical Society of America, 67(5):1704–1721, 1980.
[150] H. Kingma and M. Janssen. Biophysics of the vestibular system. In A. M. Bronstein, editor, Oxford Textbook of Vertigo and Imbalance. Oxford University Press, Oxford, UK, 2013.
[151] C. L. Kinsey. Topology of Surfaces. Springer-Verlag, Berlin, 1993.
[152] R. E. Kirk. Experimental Design, 4th Ed. Sage, Thousand Oaks, CA, 2013.
[153] E. M. Kolasinski. Simulator sickness in virtual environments. Technical Report 2017, U.S. Army Research Institute, 1995.
[154] L. L. Kontsevich and C. W. Tyler. Bayesian adaptive estimation of psychometric slope and threshold. Vision Research, 39(16):2729–2737, 1999.
[155] C. Konvalin. Compensating for tilt, hard-iron, and soft-iron effects. Available at http://www.sensorsmag.com/sensors/motion-velocity-displacement/compensating-tilt-hard-iron-and-soft-iron-effects-6475, December 2009. Last retrieved on May 30, 2016.
[156] B. C. Kress and P. Meyrueis. Applied Digital Optics: From Micro-optics to Nanophotonics. Wiley, Hoboken, NJ, 2009.
[157] J. B. Kuipers. Quaternions and Rotation Sequences. Princeton University Press, Princeton, NJ, 1999.
[158] P. R. Kumar and P. Varaiya. Stochastic Systems. Prentice-Hall, Englewood Cliffs, NJ, 1986.
[159] R. Lafer-Sousa, K. L. Hermann, and B. R. Conway. Striking individual differences in color perception uncovered by the dress photograph. Current Biology, 25(13):R545–R546, 2015.
[160] M. F. Land and S.-E. Nilsson. Animal Eyes. Oxford University Press, Oxford, UK, 2002.
[161] D. Lanman and D. Luebke. Near-eye light field displays. ACM Transactions on Graphics, 32(6), 2013.
[162] J. Lanman, E. Bizzi, and J. Allum. The coordination of eye and head movement during smooth pursuit. Brain Research, 153(1):39–53, 1978.
[163] S. M. LaValle. Planning Algorithms. Cambridge University Press, Cambridge, U.K., 2006. Available at http://planning.cs.uiuc.edu/.
[164] S. M. LaValle. Help! My cockpit is drifting away. Oculus blog post. Retrieved from https://developer.oculus.com/blog/magnetometer/, December 2013. Last retrieved on Jan 10, 2016.
[165] S. M. LaValle. The latent power of prediction. Oculus blog post. Retrieved from https://developer.oculus.com/blog/the-latent-power-of-prediction/, July 2013. Last retrieved on Jan 10, 2016.
[166] S. M. LaValle. Sensor fusion: Keeping it simple. Oculus blog post. Retrieved from https://developer.oculus.com/blog/sensor-fusion-keeping-it-simple/, May 2013. Last retrieved on Jan 10, 2016.
[167] S. M. LaValle and P. Giokaris. Perception based predictive tracking for head mounted displays. US Patent 20140354515A1, December 2014.
[168] S. M. LaValle, A. Yershova, M. Katsev, and M. Antonov. Head tracking for the Oculus Rift. In Proc. IEEE International Conference on Robotics and Automation, pages 187–194, 2014.
[169] J. J. LaViola. A discussion of cybersickness in virtual environments. ACM SIGCHI Bulletin, 32:47–56, 2000.
[170] B. D. Lawson. Motion sickness scaling. In K. S. Hale and K. M. Stanney, editors, Handbook of Virtual Environments, 2nd Edition, pages 601–626. CRC Press, Boca Raton, FL, 2015.
[171] B. D. Lawson. Motion sickness symptomatology and origins. In K. S. Hale and K. M. Stanney, editors, Handbook of Virtual Environments, 2nd Edition, pages 531–600. CRC Press, Boca Raton, FL, 2015.
[172] D. Lazewatsky and W. Smart. An inexpensive robot platform for teleoperation and experimentation. In Proc. IEEE International Conference on Robotics and Automation, pages 1211–1216, 2011.
[173] M. A. Lebedev and M. A. L. Nicolelis. Brain-machine interfaces: Past, present, and future. TRENDS in Neurosciences, 29(9):536–546, 2006.
[174] A. Lécuyer, L. George, and M. Marchal. Toward adaptive VR simulators combining visual, haptic, and brain-computer interfaces. In IEEE Computer Graphics and Applications, pages 3318–3323, 2013.
[175] A. Lécuyer, F. Lotte, R. B. Reilly, R. Leeb, M. Hirose, and N. Slater. Brain-computer interfaces, virtual reality, and videogames. IEEE Computer, 41(10):66–72, 2008.
[176] M. R. Leek. Adaptive procedures in psychophysical research. Perception and Psychophysics, 63(8):1279–1292, 2001.
[177] R. J. Leigh and D. S. Zee. The Neurology of Eye Movements, 5th Ed. Oxford University Press, 2015.
[178] J.-C. Lepecq, I. Giannopulu, and P.-M. Baudonniere. Cognitive effects on visually induced body motion in children. Perception, 24(4):435–449, 1995.
[179] J.-C. Lepecq, I. Giannopulu, S. Mertz, and P.-M. Baudonniere. Vestibular sensitivity and vection chronometry along the spinal axis in erect man. Perception, 28(1):63–72, 1999.
[180] H. Li, L. Trutoiu, K. Olszewski, L. Wei, T. Trutna, P.-L. Hsieh, A. Nicholls, and C. Ma. Facial performance sensing head mounted display. In Proceedings ACM SIGGRAPH, 2015.
[181] M. C. Lin and J. F. Canny. Efficient algorithms for incremental distance computation. In Proceedings IEEE International Conference on Robotics & Automation, 1991.
[182] M. C. Lin and D. Manocha. Collision and proximity queries. In J. E. Goodman and J. O’Rourke, editors, Handbook of Discrete and Computational Geometry, 2nd Ed., pages 787–807. Chapman and Hall/CRC Press, New York, 2004.
[183] J. Linowes. Unity Virtual Reality Projects. Packt, Birmingham, UK, 2015.
[184] S. Liversedge, I. Gilchrist, and S. Everling (eds). Oxford Handbook of Eye Movements. Oxford University Press, 2011.
[185] F. Lotte, J. Faller, C. Guger, Y. Renard, G. Pfurtscheller, A. Lécuyer, and R. Leeb. Combining BCI with virtual reality: Towards new applications and improved BCI. In B. Z. Allison, S. Dunne, R. Leeb, J. Del R. Millán, and A. Nijholt, editors, Towards Practical Brain-Computer Interfaces, pages 197–220. Springer-Verlag, Berlin, 2012.
[186] F. Lotte, A. van Langhenhove, F. Lamarche, T. Ernest, Y. Renard, B. Arnaldi, and A. Lécuyer. Exploring large virtual environments by thoughts using a brain-computer interface based on motor imagery and high-level commands. Presence: Teleoperators and Virtual Environments, 19(1):154–170, 2010.
[187] G. D. Love, D. M. Hoffman, P. J. H. Hands, J. Gao, A. K. Kirby, and M. S. Banks. High-speed switchable lens enables the development of a volumetric stereoscopic display. Optics Express, 17(18):15716–15725, 2009.
[188] M. J. Lum, J. Rosen, H. King, D. C. Friedman, T. S. Lendvay, A. S. Wright, M. N. Sinanan, and B. Hannaford. Telepresence systems with automatic preservation of user head height, local rotation, and remote translation. In Proc. IEEE Conference on Engineering in Medicine and Biology Society, pages 6860–6863, 2009.
[189] R. G. Lyons. Understanding Digital Signal Processing, 3rd Ed. Prentice-Hall, Englewood Cliffs, NJ, 2010.
[190] K. Y. Ma, P. Chirarattananon, and R. J. Wood. Design and fabrication of an insect-scale flying robot for control autonomy. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1133–1140, 2012.
[191] Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry. An Invitation to 3-D Vision. Springer-Verlag, Berlin, 2003.
[192] I. S. MacKenzie. Fitts’ law as a research and design tool in human-computer interaction. Human-Computer Interaction, 7(1):91–139, 1992.
[193] I. S. MacKenzie. Movement time prediction in human-computer interfaces. In R. M. Baecker, J. Grudin, W. A. S. Buxton, and S. Greenberg, editors, Readings in Human-Computer Interaction, pages 483–492. Morgan Kaufmann, San Francisco, 1995.
[194] I. S. MacKenzie and W. Buxton. Extending Fitts’ law to 2D tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 219–226, 1992.
[195] N. A. Macmillan and C. D. Creelman. Detection Theory: A User’s Guide, 2nd Ed. Lawrence Erlbaum Associates, Mahwah, NJ, 2005.
[196] R. Magill and D. Anderson. Motor Learning and Control: Concepts and Applications, 10th Ed. McGraw-Hill, New York, NY, 2013.
[197] R. Mahony, T. Hamel, and J.-M. Pflimlin. Nonlinear complementary filters on the special orthogonal group. IEEE Transactions on Automatic Control, 53(5):1203–1218, 2008.
[198] A. Maimone, D. Lanman, K. Rathinavel, K. Keller, D. Luebke, and H. Fuchs. Pinlight displays: Wide field of view augmented-reality eyeglasses using defocused point light sources. ACM Transactions on Graphics, 33(4), 2014.
[199] K. Mallon and P. F. Whelan. Precise radial un-distortion of images. In Proc. Computer Vision and Pattern Recognition, pages 18–21, 2004.
[200] K. Mania, B. D. Adelstein, S. R. Ellis, and M. I. Hill. Perceptual sensitivity to head tracking latency in virtual environments with varying degrees of scene complexity. In Proceedings of Symposium on Applied Perception in Graphics and Visualization, pages 39–47, 2004.
[201] W. R. Mark, L. McMillan, and G. Bishop. Post-rendering 3D warping. In Proceedings of the Symposium on Interactive 3D Graphics, pages 7–16, 1997.
[202] S. Marschner and P. Shirley. Fundamentals of Computer Graphics, 4th Ed. CRC Press, Boca Raton, FL, 2015.
[203] M. T. Mason. Mechanics of Robotic Manipulation. MIT Press, Cambridge, MA, 2001.
[204] G. Mather. Foundations of Sensation and Perception. Psychology Press, Hove, UK, 2008.
[205] G. Mather, F. Verstraten, and S. Anstis. The motion aftereffect: A modern perspective. MIT Press, Boston, MA, 1998.
[206] M. E. McCauley and T. J. Sharkey. Cybersickness: Perception of self-motion in virtual environments. Presence, 1(3):311–318, 1992.
[207] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:746–748, 1976.
[208] R. Mehra, N. Raghuvanshi, L. Antani, A. Chandak, S. Curtis, and D. Manocha. Wave-based sound propagation in large open scenes using an equivalent source formulation. ACM Transactions on Graphics, 32(2), 2013.
[209] J. Merimaa and V. Pulkki. Spatial impulse response rendering I: Analysis and synthesis. Journal of the Audio Engineering Society, 53(12):1115–1127, 2005.
[210] P. R. Messinger, E. Stroulia, K. Lyons, M. Bone, R. H. Niu, K. Smirnov, and S. Perelgut. Virtual worlds - past, present, and future: New directions in social computing. Decision Support Systems, 47(3):204–228, 2009.
[211] A. Mikami, W. T. Newsome, and R. H. Wurtz. Motion selectivity in macaque visual cortex. II. Spatiotemporal range of directional interactions in MT and V1. Journal of Neurophysiology, 55:1328–1339, 1986.
[212] M. Mine and G. Bishop. Just-in-time pixels. Technical Report TR93-005, University of North Carolina, Chapel Hill, NC, 1993.
[213] M. Minsky. Telepresence. Omni magazine, pages 44–52, June 1980.
[214] B. Mirtich. V-Clip: Fast and robust polyhedral collision detection. Technical Report TR97-05, Mitsubishi Electronics Research Laboratory, 1997.
[215] B. Mirtich. Efficient algorithms for two-phase collision detection. In K. Gupta and A. P. del Pobil, editors, Practical Motion Planning in Robotics: Current Approaches and Future Directions, pages 203–223. Wiley, Hoboken, NJ, 1998.
[216] T. Möller. A fast triangle-triangle intersection test. Journal of Graphics Tools, 2(2):25–30, 1997.
[217] T. Möller and N. Trumbore. Fast, minimum storage ray/triangle intersection. Journal of Graphics Tools, 2(1):21–28, 1997.
[218] B. Moore. An Introduction to the Psychology of Hearing, 6th Ed. Brill, Somerville, MA, 2012.
[219] G. Morrison. Input lag: How important is it? CNET, June 2013. Posted online at https://www.cnet.com/news/input-lag-how-important-is-it/.
[220] H. S. Mortensen, B. Pakkenberg, M. Dam, R. Dietz, C. Sonne, B. Mikkelsen, and N. Eriksen. Quantitative relationships in delphinid neocortex. Frontiers in Neuroanatomy, 8, 2014.
[221] M. E. Mortenson. Geometric Modeling, 2nd Ed. Wiley, Hoboken, NJ, 1997.
[222] E. I. Moser, E. Kropff, and M.-B. Moser. Place cells, grid cells, and the brain’s spatial representation system. Annual Reviews of Neuroscience, 31:69–89, 2008.
[223] J. D. Moss and E. R. Muth. Characteristics of head-mounted displays and their effects on simulator sickness. Human Factors, 53(3):308–319, 2011.
[224] D. E. Muller and F. P. Preparata. Finding the intersection of two convex polyhedra. Theoretical Computer Science, 7:217–236, 1978.
[225] D. Mustafi, A. H. Engel, and K. Palczewski. Structure of cone photoreceptors. Progress in Retinal and Eye Research, 28:289–302, 2009.
[226] T. Narumi, S. Nishizaka, T. Kajinami, T. Tanikawa, and M. Hirose. Augmented reality flavors: gustatory display based on edible marker and cross-modal interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 93–102, 2011.
[227] N. Naseer and K.-S. Hong. fNIRS-based brain-computer interfaces: a review. Frontiers in Human Neuroscience, 9(3), 2015.
[228] G. Nelson, J. Chandrashekar, M. A. Hoon, L. Feng, G. Zhao, N. J. P. Ryba, and C. S. Zuker. An amino-acid taste receptor. Nature, 416:199–202, 2002.
[229] A. Newell and P. S. Rosenbloom. Mechanisms of skill acquisition and the law of practice. In J. R. Anderson, editor, Cognitive skills and their acquisition, pages 1–55. Erlbaum, Hillsdale, NJ, 1981.
[230] Y. M. H. Ng and C. P. Kwong. Correcting the chromatic aberration in barrel distortion of endoscopic images. Journal of Systemics, Cybernetics, and Informatics, 2003.
[231] F. Nicodemus. Directional reflectance and emissivity of an opaque surface. Applied Optics, 4(7):767–775, 1965.
[232] L. F. Nicolas-Alonso and J. Gomez-Gil. Brain computer interfaces, a review. Sensors, 12(2):1211–1279, 2012.
[233] J. Ninio. The Science of Illusions. Cornell University Press, Ithaca, NY, 2001.
[234] D. Nitz. A place for motion in mapping. Nature Neuroscience, 18:6–7, 2010.
[235] G. Nützi, S. Weiss, D. Scaramuzza, and R. Siegwart. Fusion of IMU and vision for absolute scale estimation in monocular SLAM. Journal of Intelligent and Robotic Systems, 61(1):287–299, 2011.
[236] Office for Human Research Protections. International compilation of human research standards. Technical report, U.S. Department of Health and Human Services, 2016. Available at http://www.hhs.gov/ohrp/international/compilation-human-research-standards.
[237] A. M. Okamura, J. T. Dennerlein, and R. D. Howe. Vibration feedback models for virtual environments. In Proc. IEEE International Conference on Robotics and Automation, volume 1, pages 674–679, 1998.
[238] J. O’Keefe and J. Dostrovsky. The hippocampus as a spatial map. Preliminary evidence from unit activity in the freely-moving rat. Brain Research, 34(1):171–175, 1971.
[239] J. L. Olson, D. M. Krum, E. A. Suma, and M. Bolas. A design for a smartphone-based head mounted display. In Proceedings IEEE Virtual Reality Conference, pages 233–234, 2011.
[240] G. Osterberg. Topography of the layer of rods and cones in the human retina. Acta Ophthalmologica, Supplement, 6:1–103, 1935.
[241] G. D. Park, R. W. Allen, D. Fiorentino, T. J. Rosenthal, and M. L. Cook. Simulator sickness scores according to symptom susceptibility, age, and gender for an older driver assessment study. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, pages 2702–2706, 2006.
[242] E. Paulos and J. Canny. Prop: Personal roving presence. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 296–303, 1995.
[243] E. Paulos and J. Canny. Social tele-embodiment: Understanding presence. Autonomous Robots, 11(1):87–95, 2000.
[244] M. Pedley. High-precision calibration of a three-axis accelerometer. Technical report, Freescale Semiconductor, 2015. Available at http://cache.freescale.com/files/sensors/doc/app_note/AN4399.pdf.
[245] E. Peli. The visual effects of head-mounted display (HMD) are not distinguishable from those of desk-top computer display. Vision Research, 38(13):2053–2066, 1998.
[246] E. Peli. Optometric and perceptual issues with head-mounted displays. In P. Mouroulis, editor, Visual instrumentation: optical design and engineering principles. McGraw-Hill, New York, NY, 1999.
[247] J. Pelz, M. Hayhoe, and R. Loeber. The coordination of eye, head, and hand movements in a natural task. Experimental Brain Research, 139(3):266–277, 2001.
[248] Sönke Pelzer, Lukas Aspöck, Dirk Schröder, and Michael Vorländer. Integrating real-time room acoustics simulation into a CAD modeling software to enhance the architectural design process. Buildings, 2:1103–1138, 2014.
[249] R. J. Pethybridge. Sea sickness incidence in Royal Navy ships. Technical Report 37/82, Institute of Naval Medicine, Gosport, Hants, UK, 1982.
[250] S. Petitjean, D. Kriegman, and J. Ponce. Computing exact aspect graphs of curved objects: algebraic surfaces. International Journal of Computer Vision, 9:231–255, December 1992.
[251] V. I. Petkova and H. H. Ehrsson. If I Were You: Perceptual Illusion of Body Swapping. PloS ONE, 3(12), 2008.
[252] M. Pocchiola and G. Vegter. The visibility complex. International Journal Computational Geometry & Applications, 6(3):279–308, 1996.
[253] T. Poggio, M. Fahle, and S. Edelman. Fast perceptual learning in visual hyperacuity. Science, 256(5059):1018–1021, 1992.
[254] I. Poupyrev, M. Billinghurst, S. Weghorst, and T. Ichikawa. The go-go interaction technique: non-linear mapping for direct manipulation in VR. In Proceedings ACM Symposium on User Interface Software and Technology, pages 79–80, 1996.
[255] M. Prsa, S. Gale, and O. Blanke. Self-motion leads to mandatory cue fusion across sensory modalities. Journal of Neurophysiology, 108(8):2282–2291, 2012.
[256] V. Pulkki. Virtual sound source positioning using vector base amplitude panning. Journal of the Audio Engineering Society, 45(6):456–466, 1997.
[257] V. Pulkki. Virtual sound source positioning using vector base amplitude panning. Journal of the Audio Engineering Society, 55(6):503–516, 2007.
[258] V. Pulkki and J. Merimaa. Spatial impulse response rendering II: Reproduction of diffuse sound and listening tests. Journal of the Audio Engineering Society, 54(1/2):3–20, 2006.
[259] S. Rajangam, P. H. Tseng, A. Yin, G. Lehew, D. Schwarz, M. A. Lebedev, and M. A. Nicolelis. Wireless cortical brain-machine interface for whole-body navigation in primates. Scientific Reports, 2016.
[260] N. Ranasinghe, R. Nakatsu, N. Hieaki, and P. Gopalakrishnakone. Tongue mounted interface for digitally actuating the sense of taste. In Proceedings IEEE International Symposium on Wearable Computers, pages 80–87, 2012.
[261] S. Razzaque, Z. Kohn, and M. C. Whitton. Redirected walking. In Proceedings of Eurographics, pages 289–294, 2001.
[262] J. T. Reason and J. J. Brand. Motion Sickness. Academic, New York, 1975.
[263] M. F. Reschke, J. T. Somers, and G. Ford. Stroboscopic vision as a treatment for motion sickness: strobe lighting vs. shutter glasses. Aviation, Space, and Environmental Medicine, 77(1):2–7, 2006.
[264] S. W. Rienstra and A. Hirschberg. An Introduction to Acoustics. Eindhoven University of Technology, 2016. Available at http://www.win.tue.nl/~sjoerdr/papers/boek.pdf.
[265] K. J. Ritchey. Panoramic image based virtual reality/telepresence audio-visual system and method. US Patent 5495576A, February 1996.
[266] H. Robbins and S. Monro. Stochastic iteration: A Stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
[267] C. P. Robert. The Bayesian Choice, 2nd Ed. Springer-Verlag, Berlin, 2001.
[268] P. Robinson, A. Walther, C. Faller, and J. Braasch. Echo thresholds for reflections from acoustically diffusive architectural surfaces. Journal of the Acoustical Society of America, 134(4):2755–2764, 2013.
[269] M. Rolfs. Microsaccades: Small steps on a long way. Psychological Bulletin, 49(20):2415–2441, 2009.
[270] R. Ron-Angevin and A. Diaz-Estrella. Brain-computer interface: Changes in performance using virtual reality techniques. Neuroscience Letters, 449(2):123–127, 2009.
[271] D. Rosenbaum. Human Motor Control, 2nd Ed. Elsevier, Amsterdam, 2009.
[272] S. Ross. A First Course in Probability, 9th Ed. Pearson, New York, NY, 2012.
[273] G. Roth and U. Dicke. Evolution of the brain and intelligence. Trends in Cognitive Sciences, 9:250–257, 2005.
[274] K. Ruhland, C. E. Peters, S. Andrist, J. B. Badler, N. I. Badler, M. Gleicher, B. Mutlu, and R. McDonnell. A review of eye gaze in virtual agents, social robotics and HCI: Behaviour generation, user interaction and perception. Computer Graphics Forum, 34(6):299–326, 2015.
[275] A. Ruina and R. Pratap. Introduction to Statics and Dynamics. Oxford University Press, Oxford, UK, 2015. Available at http://ruina.tam.cornell.edu/Book/.
[276] W. Rushton. Effect of humming on vision. Nature, 216:1173–1175, 2009.
[277] M. B. Sachs and N. Y. S. Kiang. Two-tone inhibition in auditory nerve fibres. Journal of the Acoustical Society of America, 43:1120–1128, 1968.
[278] S. Sanei and J. A. Chambers. EEG Signal Processing. Wiley, Hoboken, NJ, 2007.
[279] X. M. Sauvan and C. Bonnet. Spatiotemporal boundaries of linear vection. Perception and Psychophysics, 57(6):898–904, 1995.
[280] D. Schmalstieg and T. Höllerer. Augmented Reality: Principles and Practice. Mendeley Ltd., London, 2015.
[281] G. Schweighofer and A. Pinz. Robust pose estimation from a planar target. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2024–2030, 2006.
[282] A. R. Seitz, J. E. Nanez, S. R. Halloway, and T. Watanabe. Perceptual learning of motion leads to faster-flicker perception. Journal of Vision, 6(6):158, 2015.
[283] A. R. Seitz and T. Watanabe. The phenomenon of task-irrelevant perceptual learning. Vision Research, 49(21):2604–2610, 2009.
[284] M. Shelhamer, D. A. Robinson, and H. S. Tan. Context-specific adaptation of the gain of the vestibulo-ocular reflex in humans. Journal of Vestibular Research: Equilibrium and Orientation, 2(1):89–96, 1992.
[285] R. N. Shepard. Circularity in judgements of relative pitch. Journal of the Acoustical Society of America, 36(12):2346–2453, 1964.
[286] G. M. Shepherd. Discrimination of molecular signals by the olfactory receptor neuron. Neuron, 13(4):771–790, 1994.
[287] T. B. Sheridan. Musings on telepresence and virtual presence. Presence: Teleoperators and Virtual Environments, 1(1):120–126, 1992.
[288] W. R. Sherman and A. B. Craig. Understanding Virtual Reality: Interface, Application, and Design. Morgan Kaufmann, San Francisco, CA, 2002.
[289] T. Shibata, J. Kim, D. M. Hoffman, and M. S. Banks. The zone of comfort: predicting visual discomfort with stereo displays. Journal of Vision, 11(8):1–29, 2011.
[290] B. G. Shinn-Cunningham, S. Santarelli, and N. Kopco. Tori of confusion: Binaural localization cues for sources within reach of a listener. Journal of the Acoustical Society of America, 107(3):1627–1636, 2002.
[291] M. Siedlecka, A. Klumza, M. Lukowska, and M. Wierzchon. Rubber hand illusion reduces discomfort caused by cold stimulus. PloS ONE, 9(10), 2014.
[292] P. Signell. Predicting and specifying the perceived colors of reflective objects. Technical Report MISN-0-270, Michigan State University, East Lansing, MI, 2000. Available at http://www.physnet.org/.
[293] M. Slater, B. Spanlang, M. V. Sanchez-Vives, and O. Blanke. Experience of body transfer in virtual reality. PloS ONE, 5(5), 2010.
[294] L. J. Smart, T. A. Stoffregen, and B. G. Bardy. Visually induced motion sickness predicted by postural instability. Human Factors, 44(3):451–465, 2002.
[295] C. U. M. Smith. Biology of Sensory Systems, 2nd Ed. Wiley, Hoboken, NJ, 2008.
[296] G. Smith and D. A. Atchison. The Eye and Visual Optical Instruments. Cambridge University Press, Cambridge, U.K., 1997.
[297] R. Sawdon Smith and A. Fox. Langford’s Basic Photography, 10th Ed. Focal Press, Oxford, U.K., 2016.
[298] W. J. Smith. Modern Optical Engineering, 4th Ed. SPIE Press, Bellingham, WA, 2008.
[299] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. ACM Transactions on Graphics, 25(3):835–846, 2006.
[300] D. Song, K. Goldberg, and N. Y. Chong. Networked telerobots. In O. Khatib and B. Siciliano, editors, Springer Handbook of Robotics, pages 759–771. Springer-Verlag, Berlin, 2008.
[301] B. R. Sorensen, M. Donath, G.-B. Yang, and R. C. Starr. The Minnesota scanner: A prototype sensor for three-dimensional tracking of moving body segments. IEEE Transactions on Robotics, 5(4):499–509, 1989.
[302] R. W. Soukoreff and I. S. MacKenzie. Towards a standard for pointing device evaluation, perspectives on 27 years of Fitts’ law research in HCI. International Journal of Human-Computer Studies, 61:751–759, 2004.
[303] M. W. Spong, S. Hutchinson, and M. Vidyasagar. Robot Modeling and Control. Wiley, Hoboken, NJ, 2005.
[304] K. M. Stanney and R. S. Kennedy. Aftereffects from virtual environment exposure: How long do they last? In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 48(2):1476–1480, 1998.
[305] K. M. Stanney and R. S. Kennedy. Simulation sickness. In D. A. Vincenzi, J. A. Wise, M. Mouloua, and P. A. Hancock, editors, Human Factors in Simulation and Training, pages 117–127. CRC Press, Boca Raton, FL, 2009.
[306] A. Steed and S. Julier. Design and implementation of an immersive virtual reality system based on a smartphone platform. In Proceedings IEEE Symposium on 3D User Interfaces, 2013.
[307] R. M. Steinman, Z. Pizlo, and F. J. Pizlo. Phi is not beta, and why Wertheimer’s discovery launched the Gestalt revolution. Vision Research, 40(17):2257–2264, 2000.
[308] N. Stephenson. Snow Crash. Bantam Books, 1996.
[309] R. M. Stern, S. Hu, R. LeBlanc, and K. L. Koch. Chinese hyper-susceptibility to vection-induced motion sickness. Aviation, Space, and Environmental Medicine, 64(9 Pt 1):827–830, 1993.
[310] J. Steuer. Defining virtual reality: Dimensions determining telepresence. Journal of Communication, 42(4):73–93, 1992.
[311] S. S. Stevens. On the psychophysical law. Psychological Review, 64(3):153–181, 1957.
[312] R. Stoakley, M. J. Conway, and R. Pausch. Virtual reality on a WIM: interactive worlds in miniature. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 265–272, 1995.
[313] T. A. Stoffregen, E. Faugloire, K. Yoshida, M. B. Flanagan, and O. Merhi. Motion sickness and postural sway in console video games. Human Factors, 50(2):322–331, 2008.
[314] Student. The probable error of a mean. Biometrika, 6(1):1–25, 1908.
[315] I. E. Sutherland. The ultimate display. In Proceedings of the IFIP Congress, pages 506–508, 1965.
[316] I. E. Sutherland. A head-mounted three dimensional display. In Proceedings of AFIPS, pages 757–764, 1968.
[317] R. Szeliski. Image alignment and stitching: A tutorial. Technical Report MSR-TR-2004-92, Microsoft Research, 2004. Available at http://research.microsoft.com/.
[318] R. Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag, Berlin, 2010.
[319] L. Takayama, E. Marder-Eppstein, H. Harris, and J. Beer. Assisted driving of a mobile remote presence system: System design and controlled user evaluation. In Proc. IEEE International Conference on Robotics and Automation, pages 1883–1889, 2011.
[320] Thomas and Finney. Calculus and Analytic Geometry, 9th Ed. Addison-Wesley, Boston, MA, 1995.
[321] L. L. Thompson and P. M. Pinsky. Acoustics. Encyclopedia of Computational Mechanics, 2(22), 2004.
[322] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, Cambridge, MA, 2005.
[323] A. Treisman. Focused attention in the perception and retrieval of multidimensional stimuli. Attention, Perception, and Psychophysics, 22(1):1–11, 1977.
[324] B. Treutwein. Minireview: Adaptive psychophysical procedures. Vision Research, 35(17):2503–2522, 1995.
[325] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. In Proceedings IEEE International Workshop on Vision Algorithms, pages 298–372, 1999.
[326] R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 3(4):323–344, 1987.
[327] B. Ullmer and H. Ishii. Emerging frameworks for tangible user interfaces. In J. M. Carroll, editor, Human-Computer Interaction for Tangible User Interfaces, pages 579–601. Addison-Wesley, Boston, MA, 2001.
[328] A. Vasalou and A. Joinson. Me, myself and I: The role of interactional context on self-presentation through avatars. Computers in Human Behavior, 25(2):510–520, 2009.
[329] J. F. Vasconcelos, G. Elkaim, C. Silvestre, P. Oliveira, and B. Cardeira. Geometric approach to strapdown magnetometer calibration in sensor frame. Transactions on Aerospace and Electronic Systems, 47(2):1293–1306, 2011.
[330] G. Vass and T. Perlaki. Applying and removing lens distortion in post production. Technical report, Colorfont, Ltd., Budapest, 2003.
[331] J. Vidal. Toward direct brain-computer communication. Annual Review of Biophysics and Bioengineering, 2:157–180, 1973.
[332] J. J. Vidal. Real-time detection of brain events in EEG. Proceedings of the IEEE, 65(5):633–664, 1977.
[333] S. T. von Soemmerring. Über das Organ der Seele. Königsberg, 1796. With afterword by Immanuel Kant.
[334] M. Vorländer. Auralization. Springer-Verlag, Berlin, 2010.
[335] M. Vorländer and B. Shinn-Cunningham. Virtual auditory displays. In K. S. Hale and K. M. Stanney, editors, Handbook of Virtual Environments, 2nd Edition. CRC Press, Boca Raton, FL, 2015.
[336] C. Wächter and A. Keller. Instant ray tracing: The bounding interval hierarchy. In T. Akenine-Möller and W. Heidrich, editors, Eurographics Symposium on Rendering, pages 139–149, 2006.
[337] B. A. Wandell. Foundations of Vision. Sinauer Associates, 1995. Available at https://foundationsofvision.stanford.edu/.
[338] X. Wang and B. Winslow. Eye tracking in virtual environments. In K. S. Hale and K. M. Stanney, editors, Handbook of Virtual Environments, 2nd Edition. CRC Press, Boca Raton, FL, 2015.
[339] R. M. Warren, J. M. Wrightson, and J. Puretz. Illusory continuity of tonal and infratonal periodic sounds. Journal of the Acoustical Society of America, 84(4):1338–1142, 1964.
[340] W. H. Warren and K. J. Kurtz. The role of central and peripheral vision in perceiving the direction of self-motion. Perception and Psychophysics, 51(5):443–454, 1992.
[341] D. S. Watkins. Fundamentals of Matrix Computations. Wiley, Hoboken, NJ, 2002.
[342] A. B. Watson and D. G. Pelli. QUEST: A Bayesian adaptive psychometric method. Perception and Psychophysics, 33(2):113–120, 1983.
[343] B. L. Welch. The generalization of “Student’s” problem when several different population variances are involved. Biometrika, 34(1-2):28–35, 1947.
[344] G. Welch and E. Foxlin. Motion tracking: no silver bullet, but a respectable arsenal. IEEE Computer Graphics and Applications, 22(6):24–28, 2002.
[345] R. B. Welch and B. J. Mohler. Adapting to virtual environments. In K. S. Hale and K. M. Stanney, editors, Handbook of Virtual Environments, 2nd Edition. CRC Press, Boca Raton, FL, 2015.
[346] A. T. Welford. Fundamentals of Skill. Methuen Publishing, London, 1968.
[347] M. Wertheimer. Experimentelle Studien über das Sehen von Bewegung (Experimental Studies on the Perception of Motion). Zeitschrift für Psychologie, 61:161–265, 1912.
[348] J. Westerhoff. Reality: A Very Short Introduction. Oxford University Press, Oxford, UK, 2011.
[349] F. A. Wichmann and N. J. Hill. The psychometric function: I. Fitting, sampling, and goodness of fit. Perception and Psychophysics, 63(8):1293–1313, 2001.
[350] J. M. Wolfe, K. R. Kluender, and D. M. Levi. Sensation and Perception, 4th Ed. Sinauer, Sunderland, MA, 2015.
[351] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan. Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767–791, 2002.
[352] A. F. Wright, C. F. Chakarova, M. M. Abd El-Aziz, and S. S. Bhattacharya. Photoreceptor degeneration: genetic and mechanistic dissection of a complex trait. Nature Reviews Genetics, 11:273–284, 2010.
[353] F. E. Wright. The Methods of Petrographic-microscopic Research. Carnegie Institution of Washington, 1911.
[354] Y. Wu and Z. Hu. PnP problem revisited. Journal of Mathematical Imaging and Vision, 24(1):131–141, 2006.
[355] S. Xu, M. Perez, K. Yang, C. Perrenot, J. Felblinger, and J. Hubert. Determination of the latency effects on surgical performance and the acceptable latency levels in telesurgery using the dV-Trainer simulator. Surgical Endoscopy, 28(9):2569–2576, 2014.
[356] T. Yamada, S. Yokoyama, T. Tanikawa, K. Hirota, and M. Hirose. Wearable olfactory display: Using odor in outdoor environment. In Proceedings IEEE Virtual Reality Conference, pages 199–206, 2006.
[357] E. Yang and M. Dorneich. The effect of time delay on emotion, arousal, and satisfaction in human-robot interaction. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, pages 443–447, 2015.
[358] X. Yang and W. Grantham. Effects of center frequency and bandwidth on echo threshold and buildup of echo suppression. Journal of the Acoustical Society of America, 95(5):2917, 1994.
[359] R. Yao, T. Heath, A. Davies, T. Forsyth, N. Mitchell, and P. Hoberman. Oculus VR Best Practices Guide. Retrieved from http://brianschrank.com/vrgames/resources/OculusBestPractices.pdf, March 2014. Last retrieved on July 10, 2016.
[360] N. Yee and J. Bailenson. The Proteus effect: The effect of transformed self-representation on behavior. Human Communication Research, 33:271–290, 2007.
[361] H. Yeh, R. Mehra, Z. Ren, L. Antani, M. C. Lin, and D. Manocha. Wave-ray coupling for interactive sound propagation in large complex scenes. ACM Transactions on Graphics, 32(6), 2013.
[362] W. A. Yost. Fundamentals of Hearing: An Introduction, 5th Ed. Emerald Group, Somerville, MA, 2006.
[363] P. Zahorik. Assessing auditory distance perception using virtual acoustics. Journal of the Acoustical Society of America, 111(4):1832–1846, 2002.
[364] V. M. Zatsiorsky. Kinematics of Human Motion. Human Kinetics, Champaign, IL, 1997.
[365] V. M. Zatsiorsky. Kinetics of Human Motion. Human Kinetics, Champaign, IL, 2002.
[366] V. M. Zatsiorsky and B. I. Prilutsky. Biomechanics of Skeletal Muscles. Human Kinetics, Champaign, IL, 2012.
[367] Y. Zheng, Y. Kuang, S. Sugimoto, and K. Åström. Revisiting the PnP problem: A fast, general and optimal solution. In Proceedings IEEE International Conference on Computer Vision, pages 2344–2351, 2013.
[368] H. Zhou and H. Hu. Human motion tracking for rehabilitation - A survey. Biomedical Signal Processing and Control, 3(1):1–18, 2007.