Light Commands PDF
Abstract—We propose a new class of signal injection attacks on microphones based on the photoacoustic effect: converting light to sound using a microphone. We show how an attacker can inject arbitrary audio signals into the target microphone by aiming an amplitude-modulated light at the microphone's aperture. We then proceed to show how this effect leads to a remote voice-command injection attack on voice-controllable systems. Examining various products that use Amazon's Alexa, Apple's Siri, Facebook's Portal, and Google Assistant, we show how to use light to obtain full control over these devices at distances up to 110 meters and from two separate buildings. Next, we show that user authentication on these devices is often lacking or non-existent, allowing the attacker to use light-injected voice commands to unlock the target's smartlock-protected front doors, open garage doors, shop on e-commerce websites at the target's expense, or even locate, unlock and start various vehicles (e.g., Tesla and Ford) that are connected to the target's Google account. Finally, we conclude with possible software and hardware defenses against our attacks.

* Paper version as of November 4, 2019. More information and demonstrations are available at lightcommands.com

Index Terms—Signal Injection Attack, Transduction Attack, Voice-Controllable System, Photoacoustic Effect, Laser, MEMS

I. INTRODUCTION

The consistent growth in computational power is profoundly changing the way that humans and computers interact. Moving away from traditional interfaces like keyboards and mice, in recent years computers have become sufficiently powerful to understand and process human speech. Recognizing the potential of quick and natural human-computer interaction, technology giants like Apple, Google, Facebook, and Amazon have each launched their own large-scale deployment of voice-controllable (VC) systems that continuously listen to and act on human voice commands.

With tens of millions of devices sold with Alexa, Siri, Portal, and Google Assistant, users can now interact with service providers without the need to sit in front of a computer or type on a mobile phone. Responding to this trend, the Internet of Things (IoT) market has also undergone a small revolution. Rather than having each device be controlled via a dedicated piece of software, IoT manufacturers now spend time making hardware, coupled with a lightweight interface to integrate their products with Alexa, Siri, or Google Assistant. Thus, users can receive information and control products by the mere act of speaking, without the need for physical interaction with keyboards, mice, touchscreens, or even buttons.

However, while much attention is being given to improving the capabilities of VC systems, much less is known about the resilience of these systems to software and hardware attacks. Indeed, previous works [1, 2] already highlight a major limitation of voice-only user interaction: the lack of proper user authentication. As such, a voice-controllable system can execute an injected command without the need for additional user confirmation. While early command-injection techniques were easily noticeable by the device's legitimate owner, a more recent line of work [3, 4, 5, 6, 7, 8, 9] focuses on stealthy command injection, preventing the user from recognizing or even hearing the injected commands.

The absence of voice authentication has resulted in a proximity-based threat model, where close-proximity users are considered legitimate, while attackers are kept at bay by physical obstructions like walls, locked doors, and closed windows. For attackers aiming to surreptitiously gain control over physically-inaccessible systems, existing injection techniques are unfortunately limited, as the current state of the art [6] has an injection range limited to 25 ft (7.62 m) in open space, with physical barriers (e.g., windows) further reducing the distance. Thus, in this paper we tackle the following questions: Can commands be remotely and stealthily injected into a voice-controllable system? If so, how can an attacker perform such an attack under realistic conditions and with limited physical access? Finally, what are the implications of such command injections on third-party IoT hardware integrated with the voice-controllable system?

A. Our Contribution

In this paper we present LightCommands, an attack that is capable of covertly injecting commands into voice-controllable systems at long distances.

Laser-Based Audio Injection. We have identified a semantic gap between the physics and specifications of MEMS (micro-electro-mechanical systems) microphones, where such micro-
Fig. 1. Experimental setup for exploring attack range. (Top) Floor plan of the 110 m long corridor. (Left) Laser with telephoto lens mounted on geared tripod head for aiming. (Center) Laser aiming at the target across the 110 m corridor. (Right) Laser spot on the target device mounted on tripod.
phones unintentionally respond to light as if it was sound. Exploiting this effect, we can inject sound into microphones by simply modulating the amplitude of a laser light.

Attacking Voice-Controllable Systems. Next, we investigate the vulnerability of popular VC systems (such as Alexa, Siri, Portal, and Google Assistant) to light-based audio injection attacks. We find that 5 mW of laser power (the equivalent of a laser pointer) is sufficient to obtain full control over many popular Alexa and Google smart home devices, while about 60 mW is sufficient for gaining control over phones and tablets.

Long Range. Using a telephoto lens to focus the laser, we demonstrate the first long-range command injection attack on VC systems, achieving distances of up to 110 meters (the maximum available space in our testing area) as shown in Figure 1. We also demonstrate how light can be used to control VC systems across buildings and through closed glass windows at similar distances. Finally, we note that unlike previous works that have limited range due to the use of sound for signal injection, the range obtained by light-based injection is only limited by the attacker's power budget, optics, and aiming capabilities.

Insufficient Authentication. Having established the feasibility of malicious control over VC systems, we investigate the security implications of sound injection attacks. We find that VC systems are often lacking user authentication mechanisms, or if the mechanisms are present, they are incorrectly implemented (e.g., allowing for PIN brute forcing). We show how an attacker can use light-injected voice commands to unlock the target's smart-lock protected front door, open garage doors, shop on e-commerce websites at the target's expense, or even locate, unlock and start various vehicles (e.g., Tesla and Ford) if the vehicles are connected to the target's Google account.

Attack Stealthiness and Cheap Setup. We then show how an attacker can build a cheap yet effective injection setup, using commercially available laser pointers and laser drivers. Moreover, by using infrared lasers and abusing volume features (e.g., whisper mode for Alexa devices) on the target device, we show how an attacker can mount a light-based audio injection attack while minimizing the chance of discovery by the target's legitimate owner.

Countermeasures. Finally, we discuss software and hardware-based countermeasures against our attacks.

Summary of Contributions. In this paper we make the following contributions.

1) Discover a hardware problem with MEMS microphones, making them susceptible to light-based signal injection attacks (Section IV).
2) Investigate the vulnerability of popular Alexa, Siri, Portal, and Google Assistant devices to light-based command injection across large distances and varying laser power (Section V).
3) Investigate the security implications of malicious command injection attacks on VC systems and demonstrate how such attacks can be mounted using cheap and readily available equipment (Section VI).
4) Discuss software and hardware countermeasures to light-based signal injection attacks (Section VII).

B. Safety and Responsible Disclosure

Laser Safety. Laser radiation requires special controls for safety, as high-powered lasers might cause hazards of fire, eye damage, and skin damage. We urge that researchers receive formal laser safety training and approval of experimental designs before attempting reproduction of our work. In particular, all the experiments in this paper were conducted under a Standard Operating Procedure which was approved by our university's Safety Committee.
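The core primitive behind the attack, encoding an audio waveform in laser intensity by amplitude-modulating the diode drive current around a DC bias, can be sketched numerically as follows. This is a minimal illustration: the bias and peak-to-peak values mirror the experimental setup described later in Section IV, and a perfectly linear current-to-light conversion is assumed.

```python
import numpy as np

# Sketch of the injection primitive: an audio tone is encoded in laser
# intensity by amplitude-modulating the diode drive current around a DC
# bias, I_t = I_DC + (I_pp / 2) * sin(2*pi*f*t). The 26.2 mA bias,
# 7 mA peak-to-peak swing, and 1 kHz tone mirror the Section IV setup;
# the linear current-to-light conversion is an assumption of the sketch.
fs = 100_000                        # waveform sample rate (Hz)
t = np.arange(int(0.01 * fs)) / fs  # 10 ms of samples
i_dc, i_pp, f = 26.2e-3, 7e-3, 1_000.0

i_t = i_dc + (i_pp / 2) * np.sin(2 * np.pi * f * t)

# The drive current, and hence the emitted optical power, swings
# between i_dc - i_pp/2 and i_dc + i_pp/2 around the bias point.
print(f"current swing: {i_t.min()*1e3:.1f} mA to {i_t.max()*1e3:.1f} mA")
```

Because the sine term averages to zero over each period, the mean drive current (and so the average optical power) is set entirely by the DC bias, a property the paper later exploits when choosing operating points.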
Disclosure Process. Following the practice of responsible disclosure, we have shared our findings with Google, Amazon, Apple, August, Ford, Tesla, and Analog Devices, a major supplier of MEMS microphones. We subsequently maintained contact with the security teams of these vendors, as well as with ICS-CERT and the FDA. The findings presented in this paper were made public on the mutually-agreed date of November 4th, 2019.

II. BACKGROUND

A. Voice-Controllable System

The term "Voice-Controllable (VC) system" refers to a system that is controlled primarily by voice commands directly spoken by users in a natural language, e.g., English. While some important exceptions exist, VC systems often immediately operate on voice commands issued by the user, without requiring further interaction. For example, when the user commands the VC system to "open the garage door", the garage door is immediately opened.

Following the terminology of [4], a typical VC system is composed of three main components: (i) voice capture, (ii) speech recognition, and (iii) command execution. First, the voice capture subsystem is responsible for converting the sound produced by the user into electrical signals. Next, the speech recognition subsystem is responsible for detecting the wake word in the acquired signal (e.g., "Alexa" for Amazon's Alexa, "OK Google" for Google Assistant, "Hey Portal" for Facebook's Portal, and "Hey Siri" for Apple's Siri) and subsequently interpreting the meaning of the voice command using signal and natural-language processing. Finally, the command-execution subsystem launches the corresponding application or executes an operation based on the recognized voice command.

B. Attacks on Voice-Controllable Systems

Several previous works explored the security of VC systems, uncovering vulnerabilities that allow attackers to issue unauthorized voice commands to these devices [3, 4, 5, 6, 7].

Malicious Command Injection. More specifically, [1, 2] developed malicious smartphone applications that play synthetic audio commands into nearby VC systems without requiring any special operating system permissions. While these attacks transmit commands that are easily noticeable to a human listener, other works [3, 8, 9] focused on camouflaging commands in audible signals, attempting to make them unintelligible or unnoticeable to human listeners, while still being recognizable to speech recognition models.

Inaudible Voice Commands. A more recent line of work focuses on completely hiding the voice commands from human listeners. Roy et al. [5] demonstrate how high-frequency sounds inaudible to humans can become recordable by commodity microphones. Subsequently, Song and Mittal [10] and DolphinAttack [4] extended the work of [5] by sending inaudible commands to VC systems via word modulation on ultrasound carriers. By exploiting nonlinearities in the microphones, a signal modulated onto an ultrasonic carrier is demodulated to the audible range by the targeted microphone, recovering the original voice command while remaining undetected by humans.

However, both attacks are limited to short distances (from 2 cm to 175 cm) due to the transmitter operating at low power. Unfortunately, increasing the transmitting power generates an audible frequency component containing the (hidden) voice command, as the transmitter is also affected by the same nonlinearity observed in the receiving microphone. Tackling the distance limitation, Roy et al. [6] mitigated this effect by splitting the signal into multiple frequency bins and playing them through an array of 61 speakers. However, the re-appearance of audible leakage still limits the attack's range to 25 ft (7.62 m) in open space, with physical barriers (e.g., windows) and the absorption of ultrasonic waves in air further reducing range by attenuating the transmitted signal.

Skill Squatting Attacks. A final line of work focuses on confusing speech recognition systems, causing them to misinterpret correctly-issued voice commands. These so-called skill squatting attacks [11, 12] work by exploiting systematic errors in the recognition of similarly sounding voice commands to route users to malicious applications without their knowledge.

C. Acoustic Signal Injection Attacks

Several works used acoustic signal injection as a method of inducing unintended behavior in various systems. More specifically, Son et al. [13] showed that MEMS sensors are sensitive to ultrasound signals, resulting in jamming denial-of-service attacks against the inertial measurement units (IMUs) on drones. Subsequently, Yan et al. [14] demonstrated that acoustic waves can be used to saturate and spoof ultrasonic sensors, impairing the safety of cars. This was further improved by Walnut [15], which exploited aliasing and clipping effects in the sensor's components to achieve precise control over MEMS accelerometers via sound injection.

More recently, Nashimoto et al. [16] showed the possibility of using sound to attack sensor-fusion algorithms that rely on data from multiple sensors (e.g., accelerometers, gyroscopes, and magnetometers), while Blue Note [17] demonstrates the feasibility of sound attacks on mechanical hard drives, resulting in operating system crashes.

D. Laser Injection Attacks

In addition to sound, light has also been utilized for signal injection. Indeed, [18, 19, 14] mounted denial-of-service attacks on cameras and LiDARs by illuminating victims' photo-receivers with strong lights. This was later extended by Shin et al. [20] and Cao et al. [21] to a more sophisticated attack that injects precisely-controlled signals into LiDAR systems, causing the target to see an illusory object. Next, Park et al. [22] showed an attack on medical infusion pumps, using light to attack optical sensors that count the number of administered medication drops. Finally, [23] show how various sensors, such as infrared and light sensors, can be used to activate and transfer malware between infected devices.

Another line of work focuses on using light for injecting faults inside computing devices, resulting in security breaches.
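The microphone nonlinearity exploited by the inaudible-command attacks of Section II-B can be illustrated numerically: amplitude-modulating a tone onto an ultrasonic carrier and passing it through a quadratic response recovers an audible baseband copy. The sketch below is illustrative only; the signal parameters and nonlinearity coefficients are made up, not taken from the cited works.

```python
import numpy as np

# A 1 kHz "voice" tone is amplitude-modulated onto a 40 kHz ultrasonic
# carrier, then passed through a second-order nonlinearity
# y = a1*x + a2*x^2. The x^2 term contains (1 + 0.5*m)^2 / 2, i.e. a
# baseband (audible) copy of the hidden tone m(t).
fs = 192_000                      # sample rate (Hz)
t = np.arange(fs) / fs            # 1 second of samples
f_audio, f_carrier = 1_000, 40_000

m = np.sin(2 * np.pi * f_audio * t)                     # hidden tone
x = (1 + 0.5 * m) * np.sin(2 * np.pi * f_carrier * t)   # AM on ultrasound

y = 1.0 * x + 0.5 * x ** 2        # illustrative microphone nonlinearity

spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), 1 / fs)

def magnitude_at(f_hz):
    """FFT magnitude at the bin closest to f_hz."""
    return spectrum[np.argmin(np.abs(freqs - f_hz))]

# A strong component appears at 1 kHz even though the transmitted
# signal only contains energy around the inaudible 40 kHz carrier.
print(magnitude_at(1_000) > 1_000 * magnitude_at(1_500))
```

The same effect explains the range limit mentioned above: driving the transmitter harder makes its own nonlinearity produce this audible baseband copy in the air, revealing the hidden command.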
More specifically, it is well-known that laser light causes soft (temporary) errors in semiconductors, where similar errors are also caused by ionizing radiation [24]. Exploiting this effect, Skorobogatov and Anderson [25] showed the first light-induced fault attacks on smartcards and microcontrollers, demonstrating the possibility of flipping individual bits in memory cells. This effect was subsequently exploited in numerous follow-ups, using laser-induced faults to compromise the hardware's data and logic flow, extract secret keys, and dump the device's memory. See [26, 27] for further details.

E. Photoacoustic Effect

Photoacoustics is a field of research that studies the interaction between light and acoustic pressure waves (see [28] for a survey). The first work in this area dates back to 1880, when Alexander Graham Bell [29] invented an optical communication device that uses a vibrating mirror as a mechanical sunlight modulator and a selenium cell to convert the modulated light back to electricity. While the so-called photophone was successful at transmitting voice across distances, the inherent requirement of having a line of sight between the transmitter and receiver made the technology inferior to the radio communication that was emerging at that time. The rise of digital communication technology made analog modulation even less attractive, and voice transmission over light was forgotten for decades.

Recently, researchers rediscovered light-voice transmission as a sophisticated user interface that delivers an audible message to a particular user by using air as the medium. Tucker [30] reported that the U.S. military is developing a device that ionizes molecules in the air using an extremely short-pulse (femtosecond) laser to generate plasma that makes sound. Sullenberger et al. [31] proposed a different way of generating sound, using an infrared laser with a particular wavelength that efficiently heats up ambient water vapor, causing an acoustic pressure wave in the air which results in successful sound delivery to a user 2.5 meters away.

F. MEMS Microphones

MEMS is an integrated implementation of mechanical components on a chip, typically fabricated with an etching process. While there are a number of different MEMS sensors (e.g., accelerometers and gyroscopes), in this paper we focus on MEMS-based microphones, which are particularly popular in mobile and embedded applications (such as smartphones and smart speakers) due to their small footprints and low prices.

Microphone Overview. The first column of Figure 2 shows the construction of a typical backport MEMS microphone, which is composed of a diaphragm and an ASIC circuit. The diaphragm is a thin membrane that flexes in response to an acoustic wave. The diaphragm and a fixed back plate work as a parallel-plate capacitor, whose capacitance changes as a consequence of the diaphragm's mechanical deformations as it responds to alternating sound pressures. Finally, the ASIC die converts the capacitive change to a voltage signal on the output of the microphone.

Microphone Mounting. A backport MEMS microphone is mounted on the surface of a printed circuit board (PCB), with the microphone's aperture exposed through a cavity on the PCB (see the third column of Figure 2). The cavity, in turn, is part of an acoustic path that guides sound through holes (acoustic ports) in the device's chassis to the microphone's aperture. Finally, the device's acoustic ports typically have a fine mesh, as shown in Figure 3, to prevent dirt and foreign objects from entering the microphone.

G. Laser Sources

Choice of a Laser. A laser is a device that emits a beam of coherent light that can stay narrow over a long distance and be focused to a tight spot. While many technologies exist for emitting coherent light, in this paper we focus on laser diodes, which are common in consumer laser products such as laser pointers. Next, as the light intensity emitted from a laser diode is directly proportional to the diode's driving current, we can easily encode analog signals via the beam's intensity by using a laser driver capable of amplitude modulation.

Laser Safety and Availability. As strong, tightly focused lights can be potentially hazardous, there are standards in place regulating lights emitted from laser systems [32, 33] that divide lasers into classes based on the potential for injury resulting from beam exposure. In this paper, we are interested in two main types of devices, which we now describe.

Low-Power Class 3R Systems. This class contains devices whose output power is less than 5 mW at visible wavelengths (400–700 nm, see Figure 4). While prolonged intentional eye exposure to the beam emitted from these devices might be harmful, these lasers are considered safe to human eyes for brief exposure durations. As such, class 3R systems form a good compromise between safety and usability, making these lasers common in consumer products such as laser pointers.

High-Power Class 3B and Class 4 Systems. Next, lasers that emit between 5 and 500 mW are classified as class 3B systems, and might cause eye injury even from short beam exposure durations. Finally, lasers that emit over 500 mW of power are categorized as class 4 systems, which can instantaneously cause blindness, skin burns, and fires. As such, uncontrolled exposure to class 4 laser beams should be strongly avoided. However, despite the regulation, there are reports of high-power class 3B and 4 systems being openly sold as "laser pointers" [34]. Indeed, while purchasing laser pointers from Amazon and eBay, we discovered a troubling discrepancy between the rated and actual power of laser products. While the labels and descriptions of most products stated an output power of 5 mW, the actual measured power was sometimes as high as 1 W (i.e., ×200 above the allowable limit).

III. THREAT MODEL

The attacker's goal is to inject malicious commands into the targeted voice-controllable device, without being detected by the device's owner and without having physical device access. More specifically, we consider the following threat model.
Fig. 2. MEMS microphone construction. (Left) Cross-sectional view of a MEMS microphone on a device. (Middle) A diaphragm and ASIC on a depackaged microphone. (Right) Magnified view of an acoustic port on PCB.
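As a rough numerical illustration of the capacitive sensing shown in Figure 2, the sketch below computes how a small sound-induced deflection of the diaphragm changes the parallel-plate capacitance that the ASIC converts to a voltage. The diaphragm geometry is invented for the example and is not taken from any datasheet or from this paper's measurements.

```python
# Illustrative parallel-plate model of a MEMS microphone (hypothetical
# geometry): the diaphragm and backplate form a capacitor C = eps0*A/d,
# so a nanometer-scale deflection produces a femtofarad-scale change.
EPS0 = 8.854e-12              # vacuum permittivity (F/m)
AREA = (0.5e-3) ** 2          # assumed 0.5 mm x 0.5 mm diaphragm (m^2)
GAP = 4e-6                    # assumed 4 um diaphragm-backplate gap (m)

def capacitance(gap_m: float) -> float:
    """Parallel-plate capacitance C = eps0 * A / d (fringing ignored)."""
    return EPS0 * AREA / gap_m

c_rest = capacitance(GAP)
c_deflected = capacitance(GAP - 10e-9)   # 10 nm deflection toward backplate
delta_c = c_deflected - c_rest

print(f"rest: {c_rest*1e15:.1f} fF, change for 10 nm: {delta_c*1e15:.3f} fF")
```

The point of the sketch is scale: the ASIC must resolve capacitance changes orders of magnitude below the resting value, which is why any stimulus that moves the diaphragm (or otherwise perturbs the sensing path) appears at the output as sound.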
Fig. 3. Acoustic port of (Left) Google Home and (Right) Echo Dot 3rd generation. The ports are located on the top of the devices, and there are meshes inside the port.

Fig. 4. Wavelength and color of light.

No Physical Access or Owner Interaction. While the attacker is free to choose the target, we assume that the attacker does not have any physical access to the device being attacked. Thus, the attacker cannot press any buttons, alter voice-inaccessible settings, or compromise the device's software. Finally, we assume that the attacker cannot make the device's owner perform any useful interaction (like pressing a button or unlocking the screen).

Line of Sight. We do assume, however, that the attacker has (remote) line-of-sight access to the target device and its microphone ports. We argue that such an assumption is reasonable, as voice-activated devices (such as smart speakers, thermostats, security cameras, or even phones) are often left visible to the attacker, including through closed glass windows.

Device Feedback. We note that the remote line-of-sight access to the target device also allows the attacker to observe the device's LED lights. As these lights come on after the device properly recognizes the wake word, and show unique patterns once a command is recognized and accepted, the attacker can remotely determine whether an attack attempt was successful.

Device Characteristics. Finally, we also assume that the attacker has access to a device of a similar model as the target device. Thus, the attacker knows all the target's physical characteristics, such as the location of the microphone ports and the physical structure of the device's sound path. Such knowledge can easily be acquired by purchasing and analyzing a device of the same model before launching the attacks.

IV. INJECTING SOUND VIA LASER LIGHT

A. Signal Injection Feasibility

In this section we explore the feasibility of injecting acoustic signals into microphones using laser light. We begin by describing our experimental setup.

Setup. We used a blue Osram PLT5 450B 450-nm laser diode connected to a Thorlabs LDC205C current driver. We increased the diode's DC current with the driver until it emitted a continuous 5.0 mW laser beam, while measuring light intensity using the Thorlabs S121C photo-diode power sensor. The beam was subsequently directed to the acoustic port on the SparkFun MEMS microphone breakout board mounting an Analog Devices ADMP401 MEMS microphone. Finally, we recorded the diode current and the microphone's output using a Tektronix MSO5204 oscilloscope. See Figure 5 for a picture of our setup. The experiments were conducted in a regular office environment, with typical ambient noise from human speech, computer equipment, and air conditioning systems.

Signal Injection. We used the current driver to modulate a sine wave on top of the diode's current I_t via amplitude modulation (AM), given by the following equation:

I_t = I_DC + (I_pp/2) sin(2πft)    (1)

where I_DC is a DC bias, I_pp is the peak-to-peak amplitude, and f is the frequency. In our case, we set I_DC = 26.2 mA, I_pp = 7 mA, and f = 1 kHz, where the sine wave was generated using an on-board DAC on a laptop computer and was supplied to the modulation port on the current driver through an audio amplifier (Neoteck NTK059 Headphone Amplifier). As the light intensity emitted by the laser diode is directly proportional to the current provided by the laser driver, this resulted in the 1 kHz sine wave being directly encoded in the intensity of the light emitted by the laser diode.

Observing the Microphone Output. As can be seen in Figure 5, the microphone output clearly shows a 1 kHz sine wave that matches the frequency of the injected signal.

B. Characterizing Laser Audio Injection

Having successfully demonstrated the possibility of injecting audio signals via laser beams, we now proceed to characterize the light intensity response of the diodes (as a function of current) and the frequency response of the microphone to
Fig. 5. Testing signal injection feasibility. (Left) A setup for signal injection feasibility composed of a laser current driver, PC, audio amplifier, and oscilloscope. (Middle) Laser diode with beam aimed at a MEMS microphone breakout board. (Right) Diode current and microphone output waveforms.
laser-based audio injection. To see the wavelength dependency, we also examine a 638-nm red laser (Ushio HL63603TG) in addition to the blue one used in the previous experiment.

Laser Current to Light Characteristics. We begin by examining the relationship between the diode current and the optical power of the laser. For this purpose, we aimed a laser beam at our Thorlabs S121C power sensor while driving the diodes with DC currents, i.e., I_pp = 0 in Equation 1. Considering the different properties of the diodes, the blue and red lasers are examined up to 300 and 200 mA, respectively.

The first column of Figure 6 shows the current vs. light (I-L) curves for the blue and red lasers. The horizontal axis is the diode current I_DC and the vertical axis is the optical power. As can be seen, once the current provided to the laser is above the diode-specific threshold (denoted by I_th), the light power emitted by the laser increases linearly with the provided current. Thus, as |sin(2πft)| < 1, we have an (approximately) linear conversion of current to light provided that I_DC − I_pp/2 > I_th.

Laser Current to Sound Characteristics. We now proceed to characterize the effect of light injection on a MEMS microphone. We achieve this by aiming an amplitude-modulated (AM) laser beam with variable current amplitudes (I_pp) and a constant current offset (I_DC) into the aperture of an Analog Devices ADMP401 microphone, mounted on a breakout board. We subsequently monitor the peak-to-peak voltage of the microphone's output, plotting the resulting signal.

The second column of Figure 6 shows the relationship between the modulating signal I_pp and the resulting signal V_pp for both the blue and red laser diodes. The results suggest that the driving alternating current I_pp (cf. the bias current) is the key to strong injection: we can linearly increase the sound volume received by the microphone by increasing the driving AC current I_pp.

Choosing I_DC and I_pp. Given a laser diode that can emit a maximum average power of L mW, we would like to choose the values for I_DC and I_pp which result in the strongest possible microphone output signals, while having the average optical power emitted by the laser be less than or equal to L mW. From the leftmost column of Figure 6, we deduce that the laser's output power is linearly proportional to the laser's driving current I_t = I_DC + (I_pp/2) sin(2πft), and the average power depends mostly on I_DC, as (I_pp/2) sin(2πft) averages out to zero. Thus, to stay within the power budget of L mW while obtaining the strongest possible signal at the microphone output, the attacker must first determine the DC current offset I_DC that results in the diode outputting light at L mW, and then subsequently maximize the amplitude of the microphone's output signal by setting I_pp/2 = I_DC − I_th.*

* We note here that the subtraction of I_th is designed to ensure that I_DC − I_pp/2 ≥ I_th, meaning that the diode stays in its linear region, thereby avoiding signal distortion.

Characterizing the Frequency Response of Laser Audio Injection. Next, we set out to characterize the response of the microphone to different frequencies of sound signals injected via laser beams. We use the same operating points as in the previous experiment, and set the tone's amplitude such that it fits within the linear region (I_DC = 200 mA and I_pp = 150 mA for the blue laser, and I_DC = 150 mA and I_pp = 75 mA for the red laser). We then record the microphone's output levels while changing the frequency f of the light-modulated sine wave.

The third column of Figure 6 shows the obtained frequency response for both blue and red lasers. The horizontal axis is the frequency while the vertical axis is the peak-to-peak voltage of the microphone output. Both lasers have very similar responses, covering the entire audible band of 20 Hz–20 kHz, implying the possibility of injecting any audio signal.

Choice of Laser. Finally, we note the color insensitivity of the injection. Although blue and red light lie at opposite edges of the visible spectrum (see Figure 4), the levels of injected audio signal are in the same range and the shapes of the frequency-response curves are also similar. Therefore, color has low priority in choosing a laser compared to other factors for mounting LightCommands. In this paper, we consistently use the 450-nm blue laser, mainly because of (i) the better availability of high-power diodes and (ii) the advantage in focusing due to its shorter wavelength.

C. Mechanical or Electrical Transduction?

In this section we set out to investigate whether our light-based acoustic signal injection is due to physical movements of the microphone's diaphragm (i.e., light-induced mechanical
Fig. 6. Characteristics of the 450-nm blue laser (first row) and the 638-nm red laser (second row). (First column) Current-light DC characteristics. (Second column) Microphone response for a 1 kHz tone with different amplitudes. (Third column) Frequency responses of the overall setup for fixed bias and amplitude.
Fig. 7. The microphone's response to laser injection with different mechanical conditions: (Cyan) a baseline measurement without modification, (Magenta) the microphone with its metal package removed, and (Black) transparent glue on the microphone's diaphragm.

In this section we set out to investigate whether our light-based acoustic signal injection is due to physical movements of the microphone's diaphragm (i.e., light-induced mechanical vibration), or to another mechanism such as the photoelectric effect. We achieve this via a series of measurements that gradually modify the microphone's mechanical condition while leaving its optical condition intact. Using an ADMP401 microphone, we first take a baseline measurement without any modification. Next, we remove the microphone's pressure-reference chamber by opening the package covering the diaphragm and ASIC (as shown in the second column of Figure 2). Finally, we dampen the diaphragm's movement by putting glue directly on the opened and exposed diaphragm.† We note that the microphone's optical properties are unchanged, as the glue is transparent and applied from the diaphragm's back side, leaving the acoustic port intact.

Figure 7 presents the resulting voltage signals from the microphone in the three conditions, illuminated with the same laser beam (the blue laser with f = 1 kHz, Ipp = 50 mA, and IDC = 80 mA). As can be seen, each modification decreases the amplitude of the signal detected by the microphone, and the signal after the glue application is less than 10% of the original signal. We thus attribute our light-based signal injection results to mechanical movements of the microphone's diaphragm, which are in turn translated to output voltage by the microphone's internal circuitry.

†We used transparent and non-conductive glue (Gorilla Super Glue), and conducted the measurement while the glue was wet, since surface tension during the curing process can damage the chip.

by the attacker in order to gain control over the VC system under ideal conditions, as well as the maximal distance at which such control can be obtained under more realistic conditions.

Target Selection. We benchmark our attack against several consumer devices which have voice control capabilities (see Table I). We aim to test the most popular voice assistants – namely Alexa, Siri, Portal, and Google Assistant. While we do not claim that our list is exhaustive, we do argue that it provides some intuition about the vulnerability of popular VC systems to laser-based voice injection attacks. Next, to explore how different hardware variations (rather than algorithmic variations) affect our attack performance, we benchmark our attack on multiple devices running the same voice recognition backend: Alexa, Siri, Portal, and Google Assistant, as summarized in Table I. For some devices, we examine different generations to explore the differences in attack performance across hardware models. Finally, we also considered third-party devices with built-in speech recognition, such as the EcoBee thermostat.

A. Exploring Laser Power Requirements

In this section we aim to characterize the minimal laser power required by the attacker under ideal conditions to take control over a voice-activated system. Before describing our experimental setup, we present our selection of benchmarked voice commands and our experiment success criteria.

Command Selection. We selected four different voice commands that represent common operations performed by voice-operated systems.
• What Time Is It? This command was selected to serve as the baseline of our experiments, as it does not require the device to perform nearly any operation besides correctly identifying the command and accessing the Internet to recover the current time.
• Set the Volume to Zero. Here, we demonstrate the attacker's ability to control the output of the VC system. We expect this to be the first voice command issued by the attacker, in an attempt to avoid attracting attention from the target's legitimate owner.
• Purchase a Laser Pointer. With this command we show how an attacker can potentially place orders for various products on behalf (and at the expense) of users. The attacker can subsequently wait for delivery near the target's residence and collect the purchased item.
• Open the Garage Door. Finally, and perhaps most devastatingly, we show how an attacker can interact with additional systems which have been linked by the user to the targeted VC system. While the garage door opener is one such example with clear security implications, we discuss other examples in Section VI.

Command Generation. We generated audio recordings of all four of the above commands using common audio recording software (e.g., Audacity). Each command recording was subsequently appended to a recording of the wake word corresponding to the device being tested (e.g., Alexa, Hey Siri, Hey Portal, or OK, Google) and normalized to adjust the overall volume of the recordings to a constant value, resulting in a corpus of 16 complete commands. Finally, for each device, we injected four of the complete commands (those beginning with the device-appropriate wake word) into the device's microphone using the setup described below and observed the device's response.

Verifying Successful Injection. We consider a command injection attempt successful if the device somehow indicates the correct interpretation of the command. For devices with screens (such as phones and screen-enabled speakers), we considered an attempt successful when the device correctly displayed a transcription of the light-injected voice command. For screen-less devices (e.g., smart speakers), we manually examined the command log of the account associated with the device for the correct command transcription.

Attack Success Criteria. For a given power budget, distance, and command, we consider the injection successful if the device correctly recognized the command during three consecutive injection attempts. We take this as an indication that the power budget and distance are sufficient for achieving a near-perfect success probability assuming suitable aiming. Next, we consider an attack successful for a given power budget and distance if all four of our commands were successfully injected into the device during three consecutive injection attempts. As in the individual-command case, we take this as an indication that the considered power budget and distance are sufficient for high-probability command injection. We note that this criterion is conservative, as some commands are easier to inject than others, presumably due to their phonetic properties. As such, the results in this section should be seen as a conservative estimate of what an attacker can achieve for each device assuming good environmental conditions (e.g., quiet surroundings and suitable aiming), while better results in terms of distance and power can be achieved if less-than-perfect accuracy is acceptable.

Voice Customization and Security Settings. For the experiments conducted in this section, we left all the devices' settings in their default configuration. In embedded Alexa and Google VC systems (e.g., smart speakers, cameras, etc.), voice customization is off by default, meaning that the device will act on commands spoken by any voice. Meanwhile, for phone and tablet devices (which are typically operated by a single user), we left voice identification in its default activated setting. For such devices, to ascertain the minimal power required for a successful attack, we trained the VC system with a human voice and subsequently injected audio recordings of the commands spoken in the same voice. Finally, in Section V-C, we discuss bypassing various voice matching mechanisms.

Experimental Setup. We use the same blue laser and Thorlabs laser driver as in Section IV-A, aiming the laser beam at the microphone ports of the devices listed in Table I from a distance of about 30 cm. To control the surrounding environment, the entire setup was placed in a metal enclosure with opaque bottom and sides and a dark-red semi-transparent acrylic top plate, designed to block blue light. See Figure 8. As the goal of the experiments described in this section is to ascertain the minimal power required for a successful attack on each device, we used a pair of electrically controlled scanning mirrors (a 40 Kbps high-speed laser scanning system for laser shows) to precisely place the laser beam in the center of the device's microphone port. Before each experiment we manually focused the laser so that the laser spot hitting the microphone was as small as possible.

For aiming at devices whose microphone port is covered with cloth (e.g., the Google Home Mini shown in Figure 9), the position of the microphone ports can be determined using an easily observable reference point such as the device's wire connector or LED array. Finally, we note that the distance between the microphone and the reference point is easily obtainable by the attacker, either by exploring his own device or by referring to online teardown videos [35].

Experimental Results. The fifth column of Table I presents a summary of our results. While the power required by the attacker varies from 0.5 mW (Google Home) to 60 mW (Galaxy S9), all the devices are susceptible to laser-based command injection. We further note that the microphone port of some devices (e.g., Google Home Mini) is covered with fabric and/or foam. While we conjecture that this attenuates the optical power, as Table I shows, the attack is still possible. Finally, we note that the experiments in this section were performed under ideal conditions, at close range, and with the aid of electronic aiming mirrors. Thus, in Section V-B we report attack results under more realistic conditions with respect to distance and aiming.

B. Exploring Attack Range

In this section we set out to explore the effective range of our attack under more realistic attack conditions.
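At longer ranges, the main physical obstacle is beam divergence: the laser spot grows with distance, spreading the fixed power budget over a larger area. A rough geometric sketch (the initial spot diameter and divergence angle below are assumed illustrative values, not measurements from the paper; the 5 mW budget echoes the low-power tests):

```python
import math

# Rough geometric model of laser spot growth with distance.
# D0 (initial spot diameter) and THETA (full divergence angle) are
# assumed illustrative values, not measurements from the paper.
D0 = 2e-3       # initial beam diameter (m)
THETA = 0.5e-3  # full divergence angle (rad)
POWER = 5e-3    # optical power budget (W), cf. the 5 mW tests

def spot_diameter(dist_m):
    """Beam diameter after propagating dist_m meters (small-angle approx.)."""
    return D0 + dist_m * THETA

def power_density(dist_m):
    """Average power density (W/m^2) over the spot area."""
    r = spot_diameter(dist_m) / 2
    return POWER / (math.pi * r * r)

for d in (5, 50, 110):
    print(f"{d:4d} m: spot {spot_diameter(d) * 1e3:5.1f} mm, "
          f"{power_density(d):8.1f} W/m^2")
```

Under these assumptions the spot at 110 m is tens of millimeters across, which is why range experiments trade off power density against the precision of aiming at a millimeter-sized microphone port.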
TABLE I
TESTED DEVICES WITH MINIMUM ACTIVATION POWER AND MAXIMUM DISTANCE ACHIEVABLE AT THE GIVEN POWER OF 5 mW AND 60 mW. A 110 m LONG HALLWAY WAS USED FOR 5 mW TESTS, WHILE A 50 m LONG HALLWAY WAS USED FOR TESTS AT 60 mW.
[Figure 10 annotations: laser source and mirror driver at a height of 43 m; target room at a height of 15 m in an office building 70 m away; 75 m beam path at a 21.8° downward angle; telescope used for aiming; laser spot and reflections visible at the target's window.]

Fig. 10. Setup for the low-power cross-building attack: (Top left) Laser and target arrangement. (Bottom left) Picture of the target device as visible through the telescope, with the microphone ports and laser spot clearly visible. (Middle) Picture from the tower: laser on telephoto lens aiming down to the target. (Right) Picture from the office building: laser spot on the target device.
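The distances annotated in Figure 10 are mutually consistent; assuming the 43 m and 15 m figures denote the heights of the laser and the target room, and 70 m the horizontal separation between the buildings, the annotated downward angle and beam-path length follow directly:

```python
import math

# Cross-building geometry from Figure 10's annotations (interpreting
# 43 m and 15 m as the laser and target-room heights, 70 m as the
# horizontal separation between the two buildings).
H_LASER = 43.0     # m, height of the laser source
H_TARGET = 15.0    # m, height of the target room
HORIZONTAL = 70.0  # m, horizontal separation between the buildings

drop = H_LASER - H_TARGET                    # 28 m height difference
angle = math.degrees(math.atan2(drop, HORIZONTAL))
path = math.hypot(HORIZONTAL, drop)          # line-of-sight beam path

print(f"downward angle: {angle:.1f} deg")    # ~21.8 deg, as annotated
print(f"beam path: {path:.1f} m")            # ~75.4 m, matching the 75 m label
```

The computed 21.8° angle and ~75 m path reproduce the figure's annotations, confirming the interpretation of the labeled distances.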
requiring 36 hours to enumerate the entire 4-digit space (3.6 hours for 3 digits). In both the 3- and 4-digit cases, the door was successfully unlocked when the correct PIN was reached.

PIN Bypassing. Finally, we have discovered that while commands like "unlock front door" for August locks or "disable alarm system" for Ring alarms require PIN authentication, other commands, such as "open the garage door" using an assistant-enabled garage door opener,§ generally do not require any authentication. Thus, even if one command is unavailable, the attacker can often achieve a similar goal by using other commands.

C. Attacking Cars

Many modern cars have Internet-over-cellular connectivity, allowing their owners to perform certain operations via a dedicated app on their mobile devices. In some cases, this connectivity has further evolved (either by the vendor or by a third party) into having the target's car connected to a VC system, allowing voice unlocking and/or pre-heating (which often requires engine start). Thus, a compromised VC system might be used by an attacker to gain access to the target's car. In this section we investigate the feasibility of such attacks using two major car manufacturers, namely Tesla and Ford.

Tesla. Tesla cars allow their owners to interact with the car using a dedicated Tesla-provided phone app. After installing the app on our phone and linking it to a Tesla Model S vehicle, we installed the "EV Car"¶ integration, linking it to the vehicle. While "EV Car" is not officially provided by Tesla, after successful configuration using the vehicle's owner credentials, we obtained several capabilities. These included getting information about the vehicle's current location,|| locking and unlocking the doors and trunk, and starting and stopping the vehicle's charging and climate control systems. Next, we note that we were able to perform all of these tasks using only voice commands, without the need for a PIN or key proximity. Finally, we were not able to start the car without key proximity.

Ford Cars. For newer vehicles, Ford provides a phone app called "FordPass" that connects to the car's Ford SYNC system and allows the owner to interact with the car over the Internet. Taking the next step, Ford also provides a FordPass Google Assistant integration** with capabilities similar to the "EV Car" integration for Tesla. While Ford implemented PIN protection for critical voice commands like remote engine start and door unlocking, as in the case of August locks there are no mechanisms in place to prevent PIN brute forcing. Finally, while we were able to remotely open the doors and start the engine, shifting the vehicle out of "Park" immediately stopped the engine, preventing the unlocked car from being driven.

D. Exploring Stealthy Attacks

The attacks described so far can be spotted by the user of the targeted VC system in three ways. First, the user might notice the light indicators on the target device following a successful command injection. Next, the user might hear the device acknowledging the injected command. Finally, the user might notice the laser spot while the attacker tries to aim the laser at the target microphone port.

§https://www.garadget.com/
¶https://assistant.google.com/services/a/uid/000000196c7e079e?hl=en
||Admittedly, the audible location is of little use to a remote attacker who is unable to listen in on the speaker's output.
**https://assistant.google.com/services/a/uid/000000ac1d2afd15
While the first issue is a limitation of our attack (and in fact of any command injection attack), in this section we explore the attacker's options for addressing the remaining two issues.

[Figure annotations: light spot covering the entire target device; diode terminals of the flashlight.]

Acoustic Stealthiness. To tackle the issue of the device