
Light Commands: Laser-Based Audio Injection Attacks on Voice-Controllable Systems*

Takeshi Sugawara (The University of Electro-Communications, sugawara@uec.ac.jp)
Benjamin Cyr (University of Michigan, bencyr@umich.edu)
Sara Rampazzi (University of Michigan, srampazz@umich.edu)
Daniel Genkin (University of Michigan, genkin@umich.edu)
Kevin Fu (University of Michigan, kevinfu@umich.edu)

* Paper version as of November 4, 2019. More information and demonstrations are available at lightcommands.com

Abstract—We propose a new class of signal injection attacks on microphones based on the photoacoustic effect: converting light to sound using a microphone. We show how an attacker can inject arbitrary audio signals into the target microphone by aiming an amplitude-modulated light at the microphone's aperture. We then proceed to show how this effect leads to a remote voice-command injection attack on voice-controllable systems. Examining various products that use Amazon's Alexa, Apple's Siri, Facebook's Portal, and Google Assistant, we show how to use light to obtain full control over these devices at distances up to 110 meters and from two separate buildings. Next, we show that user authentication on these devices is often lacking or non-existent, allowing the attacker to use light-injected voice commands to unlock the target's smartlock-protected front doors, open garage doors, shop on e-commerce websites at the target's expense, or even locate, unlock and start various vehicles (e.g., Tesla and Ford) that are connected to the target's Google account. Finally, we conclude with possible software and hardware defenses against our attacks.

Index Terms—Signal Injection Attack, Transduction Attack, Voice-Controllable System, Photoacoustic Effect, Laser, MEMS

I. INTRODUCTION

The consistent growth in computational power is profoundly changing the way that humans and computers interact. Moving away from traditional interfaces like keyboards and mice, in recent years computers have become sufficiently powerful to understand and process human speech. Recognizing the potential of quick and natural human-computer interaction, technology giants like Apple, Google, Facebook, and Amazon have each launched their own large-scale deployment of voice-controllable (VC) systems that continuously listen to and act on human voice commands.

With tens of millions of devices sold with Alexa, Siri, Portal, and Google Assistant, users can now interact with service providers without the need to sit in front of a computer or type on a mobile phone. Responding to this trend, the Internet of Things (IoT) market has also undergone a small revolution. Rather than having each device be controlled via a dedicated piece of software, IoT manufacturers now spend time making hardware, coupled with a lightweight interface to integrate their products with Alexa, Siri or Google Assistant. Thus, users can receive information and control products by the mere act of speaking, without the need for physical interaction with keyboards, mice, touchscreens, or even buttons.

However, while much attention is being given to improving the capabilities of VC systems, much less is known about the resilience of these systems to software and hardware attacks. Indeed, previous works [1, 2] already highlight a major limitation of voice-only user interaction: the lack of proper user authentication. As such, a voice-controllable system can execute an injected command without the need for additional user confirmation. While early command-injection techniques were easily noticeable by the device's legitimate owner, a more recent line of work [3, 4, 5, 6, 7, 8, 9] focuses on stealthy command injection, preventing the user from recognizing or even hearing the injected commands.

The absence of voice authentication has resulted in a proximity-based threat model, where close-proximity users are considered legitimate, while attackers are kept at bay by physical obstructions like walls, locked doors, and closed windows. For attackers aiming to surreptitiously gain control over physically-inaccessible systems, existing injection techniques are unfortunately limited, as the current state of the art [6] has an injection range limited to 25 ft (7.62 m) in open space, with physical barriers (e.g., windows) further reducing the distance. Thus, in this paper we tackle the following questions: Can commands be remotely and stealthily injected into a voice-controllable system? If so, how can an attacker perform such an attack under realistic conditions and with limited physical access? Finally, what are the implications of such command injections on third-party IoT hardware integrated with the voice-controllable system?

A. Our Contribution

In this paper we present LightCommands, an attack that is capable of covertly injecting commands into voice-controllable systems at long distances.

Fig. 1. Experimental setup for exploring attack range. (Top) Floor plan of the 110 m long corridor. (Left) Laser with telephoto lens mounted on geared tripod head for aiming. (Center) Laser aiming at the target across the 110 m corridor. (Right) Laser spot on the target device mounted on tripod.

Laser-Based Audio Injection. We have identified a semantic gap between the physics and specifications of MEMS (micro-electro-mechanical systems) microphones, where such microphones unintentionally respond to light as if it was sound. Exploiting this effect, we can inject sound into microphones by simply modulating the amplitude of a laser light.

Attacking Voice-Controllable Systems. Next, we investigate the vulnerability of popular VC systems (such as Alexa, Siri, Portal, and Google Assistant) to light-based audio injection attacks. We find that 5 mW of laser power (the equivalent of a laser pointer) is sufficient to obtain full control over many popular Alexa and Google smart home devices, while about 60 mW is sufficient for gaining control over phones and tablets.

Long Range. Using a telephoto lens to focus the laser, we demonstrate the first long-range command injection attack on VC systems, achieving distances of up to 110 meters (the maximum available space in our testing area) as shown in Figure 1. We also demonstrate how light can be used to control VC systems across buildings and through closed glass windows at similar distances. Finally, we note that unlike previous works that have limited range due to the use of sound for signal injection, the range obtained by light-based injection is only limited by the attacker's power budget, optics, and aiming capabilities.

Insufficient Authentication. Having established the feasibility of malicious control over VC systems, we investigate the security implications of sound injection attacks. We find that VC systems are often lacking user authentication mechanisms, or if the mechanisms are present, they are incorrectly implemented (e.g., allowing for PIN brute forcing). We show how an attacker can use light-injected voice commands to unlock the target's smart-lock protected front door, open garage doors, shop on e-commerce websites at the target's expense, or even locate, unlock and start various vehicles (e.g., Tesla and Ford) if the vehicles are connected to the target's Google account.

Attack Stealthiness and Cheap Setup. We then show how an attacker can build a cheap yet effective injection setup, using commercially available laser pointers and laser drivers. Moreover, by using infrared lasers and abusing volume features (e.g., whisper mode for Alexa devices) on the target device, we show how an attacker can mount a light-based audio injection attack while minimizing the chance of discovery by the target's legitimate owner.

Countermeasures. Finally, we discuss software and hardware-based countermeasures against our attacks.

Summary of Contributions. In this paper we make the following contributions.
1) Discover a hardware problem with MEMS microphones, making them susceptible to light-based signal injection attacks (Section IV).
2) Investigate the vulnerability of popular Alexa, Siri, Portal, and Google Assistant devices to light-based command injection across large distances and varying laser power (Section V).
3) Investigate the security implications of malicious command injection attacks on VC systems and demonstrate how such attacks can be mounted using cheap and readily available equipment (Section VI).
4) Discuss software and hardware countermeasures to light-based signal injection attacks (Section VII).

B. Safety and Responsible Disclosure

Laser Safety. Laser radiation requires special controls for safety, as high-powered lasers might cause hazards of fire, eye damage, and skin damage. We urge that researchers receive formal laser safety training and approval of experimental designs before attempting reproduction of our work. In particular, all the experiments in this paper were conducted under a Standard Operating Procedure which was approved by our university's Safety Committee.
Disclosure Process. Following the practice of responsible disclosure, we have shared our findings with Google, Amazon, Apple, August, Ford, Tesla, and Analog Devices, a major supplier of MEMS microphones. We subsequently maintained contact with the security teams of these vendors, as well as with ICS-CERT and the FDA. The findings presented in this paper were made public on the mutually-agreed date of November 4th, 2019.

II. BACKGROUND

A. Voice-Controllable System

The term "Voice-Controllable (VC) system" refers to a system that is controlled primarily by voice commands directly spoken by users in a natural language, e.g., English. While some important exceptions exist, VC systems often immediately operate on voice commands issued by the user, without requiring further interaction. For example, when the user commands the VC system to "open the garage door", the garage door is immediately opened.

Following the terminology of [4], a typical VC system is composed of three main components: (i) voice capture, (ii) speech recognition, and (iii) command execution. First, the voice capture subsystem is responsible for converting the sound produced by the user into electrical signals. Next, the speech recognition subsystem is responsible for detecting the wake word in the acquired signal (e.g., "Alexa" for Amazon's Alexa, "OK Google" for Google Assistant, "Hey, Portal" for Facebook's Portal and "Hey Siri" for Apple's Siri) and subsequently interpreting the meaning of the voice command using signal and natural-language processing. Finally, the command-execution subsystem launches the corresponding application or executes an operation based on the recognized voice command.

B. Attacks on Voice-Controllable Systems

Several previous works explored the security of VC systems, uncovering vulnerabilities that allow attackers to issue unauthorized voice commands to these devices [3, 4, 5, 6, 7].

Malicious Command Injection. More specifically, [1, 2] developed malicious smartphone applications that play synthetic audio commands into nearby VC systems without requiring any special operating system permissions. While these attacks transmit commands that are easily noticeable to a human listener, other works [3, 8, 9] focused on camouflaging commands in audible signals, attempting to make them unintelligible or unnoticeable to human listeners, while still being recognizable to speech recognition models.

Inaudible Voice Commands. A more recent line of work focuses on completely hiding the voice commands from human listeners. Roy et al. [5] demonstrate how high frequency sounds inaudible to humans can become recordable by commodity microphones. Subsequently, Song and Mittal [10] and DolphinAttack [4] extended the work of [5] by sending inaudible commands to VC systems via word modulation on ultrasound carriers. By exploiting nonlinearities in the microphones, a signal modulated onto an ultrasonic carrier is demodulated to the audible range by the targeted microphone, recovering the original voice command while remaining undetected by humans.

However, both attacks are limited to short distances (from 2 cm to 175 cm) due to the transmitter operating at low power. Unfortunately, increasing the transmitting power generates an audible frequency component containing the (hidden) voice command, as the transmitter is also affected by the same nonlinearity observed in the receiving microphone. Tackling the distance limitation, Roy et al. [6] mitigated this effect by splitting the signal in multiple frequency bins and playing them through an array of 61 speakers. However, the re-appearance of audible leakage still limits the attack's range to 25 ft (7.62 m) in open space, with physical barriers (e.g., windows) and the absorption of ultrasonic waves in air further reducing range by attenuating the transmitted signal.

Skill Squatting Attacks. A final line of work focuses on confusing speech recognition systems, causing them to misinterpret correctly-issued voice commands. These so-called skill squatting attacks [11, 12] work by exploiting systematic errors in the recognition of similarly sounding voice commands to route users to malicious applications without their knowledge.

C. Acoustic Signal Injection Attacks

Several works used acoustic signal injection as a method of inducing unintended behavior in various systems. More specifically, Son et al. [13] showed that MEMS sensors are sensitive to ultrasound signals, resulting in jamming denial of service attacks against inertial measurement units (IMU) on drones. Subsequently, Yan et al. [14] demonstrated that acoustic waves can be used to saturate and spoof ultrasonic sensors, impairing the safety of cars. This was further improved by Walnut [15], which exploited aliasing and clipping effects in the sensor's components to achieve precise control over MEMS accelerometers via sound injection.

More recently, Nashimoto et al. [16] showed the possibility of using sound to attack sensor-fusion algorithms that rely on data from multiple sensors (e.g., accelerometers, gyroscopes, and magnetometers) while Blue Note [17] demonstrates the feasibility of sound attacks on mechanical hard drives, resulting in operating system crashes.

D. Laser Injection Attacks

In addition to sound, light has also been utilized for signal injection. Indeed, [18, 19, 14] mounted denial of service attacks on cameras and LiDARs by illuminating victims' photoreceivers with strong lights. This was later extended by Shin et al. [20] and Cao et al. [21] to a more sophisticated attack that injects precisely-controlled signals to LiDAR systems, causing the target to see an illusory object. Next, Park et al. [22] showed an attack on medical infusion pumps, using light to attack optical sensors that count the number of administered medication drops. Finally, [23] show how various sensors, such as infrared and light sensors, can be used to activate and transfer malware between infected devices.

Another line of work focuses on using light for injecting faults inside computing devices, resulting in security breaches.
More specifically, it is well-known that laser light causes soft (temporary) errors in semiconductors, where similar errors are also caused by ionizing radiation [24]. Exploiting this effect, Skorobogatov and Anderson [25] showed the first light-induced fault attacks on smartcards and microcontrollers, demonstrating the possibility of flipping individual bits in memory cells. This effect was subsequently exploited in numerous follow-ups, using laser-induced faults to compromise the hardware's data and logic flow, extract secret keys, and dump the device's memory. See [26, 27] for further details.

E. Photoacoustic Effect

Photoacoustics is a field of research that studies the interaction between light and acoustic pressure waves (see [28] for a survey). The first work in this area dates back to 1880, when Alexander Graham Bell [29] invented an optical communication device that uses a vibrating mirror as a mechanical sunlight modulator and a selenium cell to convert the modulated light back to electricity. While the so-called photophone was successful at transmitting voice across distances, the inherent requirement of having a line of sight between the transmitter and receiver made the technology inferior to the radio communication that was emerging at that time. The rise of digital communication technology made analog modulation even less attractive, and voice transmission over light was forgotten for decades.

Recently, researchers rediscovered light-voice transmission as a sophisticated user interface that delivers an audible message to a particular user by using air as the medium. Tucker [30] reported that the U.S. military is developing a device that ionizes molecules in the air using an extremely short-pulse (femtosecond) laser to generate plasma that makes sound. Sullenberger et al. [31] proposed a different way of generating sound, using an infrared laser with a particular wavelength that efficiently heats up ambient water vapor, causing an acoustic pressure wave in the air which results in successful sound delivery to a user 2.5 meters away.

F. MEMS Microphones

MEMS is an integrated implementation of mechanical components on a chip, typically fabricated with an etching process. While there are a number of different MEMS sensors (e.g., accelerometers and gyroscopes), in this paper we focus on MEMS-based microphones, which are particularly popular in mobile and embedded applications (such as smartphones and smart speakers) due to their small footprints and low prices.

Microphone Overview. The first column of Figure 2 shows the construction of a typical backport MEMS microphone, which is composed of a diaphragm and an ASIC circuit. The diaphragm is a thin membrane that flexes in response to an acoustic wave. The diaphragm and a fixed back plate work as a parallel-plate capacitor, whose capacitance changes as a consequence of the diaphragm's mechanical deformations as it responds to alternating sound pressures. Finally, the ASIC die converts the capacitive change to a voltage signal on the output of the microphone.

Microphone Mounting. A backport MEMS microphone is mounted on the surface of a printed circuit board (PCB), with the microphone's aperture exposed through a cavity on the PCB (see the third column of Figure 2). The cavity, in turn, is part of an acoustic path that guides sound through holes (acoustic ports) in the device's chassis to the microphone's aperture. Finally, the device's acoustic ports typically have a fine mesh, as shown in Figure 3, to prevent dirt and foreign objects from entering the microphone.

G. Laser Sources

Choice of a Laser. A laser is a device that emits a beam of coherent light that can stay narrow over a long distance and be focused to a tight spot. While many technologies exist for emitting coherent light, in this paper we focus on laser diodes, which are common in consumer laser products such as laser pointers. Next, as the light intensity emitted from a laser diode is directly proportional to the diode's driving current, we can easily encode analog signals via the beam's intensity by using a laser driver capable of amplitude modulation.

Laser Safety and Availability. As strong, tightly focused lights can be potentially hazardous, there are standards in place regulating light emitted from laser systems [32, 33], which divide lasers into classes based on the potential for injury resulting from beam exposure. In this paper, we are interested in two main types of devices, which we now describe.

Low-Power Class 3R Systems. This class contains devices whose output power is less than 5 mW at visible wavelengths (400–700 nm, see Figure 4). While prolonged intentional eye exposure to the beam emitted from these devices might be harmful, these lasers are considered safe to human eyes for brief exposure durations. As such, class 3R systems form a good compromise between safety and usability, making these lasers common in consumer products such as laser pointers.

High-Power Class 3B and Class 4 Systems. Next, lasers that emit between 5 and 500 mW are classified as class 3B systems, and might cause eye injury even from short beam exposure durations. Finally, lasers that emit over 500 mW of power are categorized as class 4 systems, which can instantaneously cause blindness, skin burns and fires. As such, uncontrolled exposure to class 4 laser beams should be strongly avoided. However, despite the regulation, there are reports of high-power class 3B and 4 systems being openly sold as "laser pointers" [34]. Indeed, while purchasing laser pointers from Amazon and eBay, we have discovered a troubling discrepancy between the rated and actual power of laser products. While the labels and descriptions of most products stated an output power of 5 mW, the actual measured power was sometimes as high as 1 W (i.e., 200× the allowable limit).
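The class boundaries above can be captured in a small helper. This is only a sketch for visible-wavelength continuous beams, using the power thresholds quoted in this section; the exact boundary handling (e.g., whether 5 mW falls in 3R or 3B) is our simplifying assumption, not a statement of the standard.

```python
def laser_class(power_mw: float) -> str:
    """Rough laser class for a visible-wavelength CW beam, using the
    power boundaries described in Section II-G (sketch only)."""
    if power_mw < 5:
        return "Class 3R (e.g., consumer laser pointers)"
    if power_mw <= 500:
        return "Class 3B (eye injury possible even from brief exposure)"
    return "Class 4 (instantaneous eye/skin/fire hazard)"

# A mislabeled "5 mW" pointer that actually measures 1 W:
print(laser_class(4.9))    # Class 3R
print(laser_class(1000))   # Class 4
```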
Fig. 2. MEMS microphone construction. (Left) Cross-sectional view of a MEMS microphone on a device, showing the device chassis, mesh, gasket, PCB with through hole, package, backplate, ASIC, and diaphragm. (Middle) A diaphragm and ASIC on a depackaged microphone. (Right) Magnified view of an acoustic port on PCB.
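To give a feel for the capacitive transduction just described, here is a back-of-the-envelope sketch. The diaphragm dimensions and deflection are hypothetical round numbers for illustration, not values taken from the ADMP401 or any other datasheet.

```python
# Illustrative numbers only: geometry is hypothetical, not from a datasheet.
EPS0 = 8.854e-12          # vacuum permittivity, F/m
AREA = (0.5e-3) ** 2      # assumed 0.5 mm x 0.5 mm diaphragm, m^2
GAP = 4e-6                # assumed diaphragm-backplate gap, m

def capacitance(gap_m: float) -> float:
    """Parallel-plate capacitance of the diaphragm/backplate pair."""
    return EPS0 * AREA / gap_m

c_rest = capacitance(GAP)
c_deflected = capacitance(GAP - 0.1e-6)  # 0.1 um deflection from sound
print(f"C at rest:       {c_rest * 1e15:.1f} fF")
print(f"C deflected:     {c_deflected * 1e15:.1f} fF")
print(f"relative change: {(c_deflected - c_rest) / c_rest:.1%}")
```

Even a sub-micrometer deflection produces a percent-level capacitance change, which the ASIC then converts to an output voltage.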

Fig. 3. Acoustic port of (Left) Google Home and (Right) Echo Dot 3rd generation. The ports are located on the top of the devices, and there are meshes inside the ports.

Fig. 4. Wavelength and color of light: ultraviolet below 400 nm, the visible band between 400 and 700 nm, and infrared above 700 nm.

III. THREAT MODEL

The attacker's goal is to inject malicious commands into the targeted voice-controllable device, without being detected by the device's owner and without having physical device access. More specifically, we consider the following threat model.

No Physical Access or Owner Interaction. While the attacker is free to choose the target, we assume that the attacker does not have any physical access to the device being attacked. Thus, the attacker cannot press any buttons, alter voice-inaccessible settings, or compromise the device's software. Finally, we assume that the attacker cannot make the device's owner perform any useful interaction (like pressing a button or unlocking the screen).

Line of Sight. We do assume, however, that the attacker has (a remote) line of sight access to the target device and its microphone ports. We argue that such an assumption is reasonable, as voice-activated devices (such as smart speakers, thermostats, security cameras, or even phones) are often left visible to the attacker, including through closed glass windows.

Device Feedback. We note that the remote line of sight access to the target device also allows the attacker to observe the device's LED lights. As these lights come on after the device properly recognizes the wake word, and show unique patterns once a command is properly recognized and accepted, they allow the attacker to remotely determine whether an attack attempt was successful.

Device Characteristics. Finally, we also assume that the attacker has access to a device of a similar model as the target device. Thus, the attacker knows all the target's physical characteristics, such as the location of the microphone ports and the physical structure of the device's sound path. Such knowledge can easily be acquired by purchasing and analyzing a device of the same model before launching the attacks.

IV. INJECTING SOUND VIA LASER LIGHT

A. Signal Injection Feasibility

In this section we explore the feasibility of injecting acoustic signals into microphones using laser light. We begin by describing our experimental setup.

Setup. We used a blue Osram PLT5 450B 450-nm laser diode connected to a Thorlabs LDC205C current driver. We increased the diode's DC current with the driver until it emitted a continuous 5.0 mW laser beam, while measuring light intensity using a Thorlabs S121C photodiode power sensor. The beam was subsequently directed to the acoustic port on a SparkFun MEMS microphone breakout board mounting an Analog Devices ADMP401 MEMS microphone. Finally, we recorded the diode current and the microphone's output using a Tektronix MSO5204 oscilloscope. See Figure 5 for a picture of our setup. The experiments were conducted in a regular office environment, with typical ambient noise from human speech, computer equipment, and air conditioning systems.

Signal Injection. We used the current driver to modulate a sine wave on top of the diode's current I_t via amplitude modulation (AM), given by the following equation:

    I_t = I_DC + I_pp · sin(2πft)    (1)

where I_DC is a DC bias, I_pp is the peak-to-peak amplitude, and f is the frequency. In our case, we set I_DC = 26.2 mA, I_pp = 7 mA and f = 1 kHz, where the sine wave was generated using an on-board DAC on a laptop computer and supplied to the modulation port on the current driver through an audio amplifier (Neoteck NTK059 Headphone Amplifier). As the light intensity emitted by the laser diode is directly proportional to the current provided by the laser driver, this resulted in the 1 kHz sine wave being directly encoded in the intensity of the light emitted by the laser diode.

Observing the Microphone Output. As can be seen in Figure 5, the microphone output clearly shows a 1 kHz sine wave that matches the frequency of the injected signal.
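For concreteness, the following sketch generates the drive waveform of Equation (1) at the operating point above. The 96 kHz sample rate is an assumption (any DAC rate well above the audio band would do), and the code follows the equation literally, using I_pp as the sine coefficient.

```python
import numpy as np

# Drive signal of Equation (1) at the reported operating point.
FS = 96_000      # DAC sample rate, Hz (assumed)
F = 1_000        # injected tone, Hz
I_DC = 26.2e-3   # DC bias, A
I_PP = 7e-3      # modulation amplitude, A

t = np.arange(0, 10e-3, 1 / FS)                  # 10 ms of samples
i_t = I_DC + I_PP * np.sin(2 * np.pi * F * t)    # Equation (1)

print(f"min current: {i_t.min() * 1e3:.1f} mA")  # ~19.2 mA
print(f"max current: {i_t.max() * 1e3:.1f} mA")  # ~33.2 mA
```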
Fig. 5. Testing signal injection feasibility. (Left) A setup for signal injection feasibility composed of a laser current driver, PC, audio amplifier, and oscilloscope. (Middle) Laser diode with beam aimed at a MEMS microphone breakout board. (Right) Diode current and microphone output waveforms.

B. Characterizing Laser Audio Injection

Having successfully demonstrated the possibility of injecting audio signals via laser beams, we now proceed to characterize the light intensity response of the diodes (as a function of current) and the frequency response of the microphone to laser-based audio injection. To see the wavelength dependency, we also examine a 638-nm red laser (Ushio HL63603TG) in addition to the blue one used in the previous experiment.

Laser Current to Light Characteristics. We begin by examining the relationship between the diode current and the optical power of the laser. For this purpose, we aimed a laser beam at our Thorlabs S121C power sensor while driving the diodes with DC currents, i.e., I_pp = 0 in Equation 1. Considering the different properties of the diodes, the blue and red lasers are examined up to 300 and 200 mA, respectively.

The first column of Figure 6 shows the current vs. light (I-L) curves for the blue and red lasers. The horizontal axis is the diode current I_DC and the vertical axis is the optical power. As can be seen, once the current provided to the laser is above the diode-specific threshold (denoted by I_th), the light power emitted by the laser increases linearly with the provided current. Thus, as |sin(2πft)| < 1, we have an (approximately) linear conversion of current to light provided that I_DC − I_pp/2 > I_th.

Laser Current to Sound Characteristics. We now proceed to characterize the effect of light injection on a MEMS microphone. We achieve this by aiming an amplitude-modulated (AM) laser beam with variable current amplitudes (I_pp) and a constant current offset (I_DC) into the aperture of an Analog Devices ADMP401 microphone, mounted on a breakout board. We subsequently monitor the peak-to-peak voltage of the microphone's output, plotting the resulting signal.

The second column of Figure 6 shows the relationship between the modulating signal I_pp and the resulting signal V_pp for both the blue and red laser diodes. The results suggest that the driving alternating current I_pp (cf. the bias current) is the key for strong injection: we can linearly increase the sound volume received by the microphone by increasing the driving AC current I_pp.

Choosing I_DC and I_pp. Given a laser diode that can emit a maximum average power of L mW, we would like to choose the values for I_DC and I_pp which result in the strongest possible microphone output signals, while having the average optical power emitted by the laser be less than or equal to L mW. From the leftmost column of Figure 6, we deduce that the laser's output power is linearly proportional to the laser's driving current I_t = I_DC + I_pp · sin(2πft), and the average power depends mostly on I_DC, as I_pp · sin(2πft) averages out to zero. Thus, to stay within the power budget of L mW while obtaining the strongest possible signal at the microphone output, the attacker must first determine the DC current offset I_DC that results in the diode outputting light at L mW, and then subsequently maximize the amplitude of the microphone's output signal by setting I_pp/2 = I_DC − I_th.*

* We note here that the subtraction of I_th is designed to ensure that I_DC − I_pp/2 > I_th, meaning that the diode stays in its linear region, thereby avoiding signal distortion.
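The operating-point rule above can be summarized in a few lines. The threshold and slope numbers below are illustrative assumptions (a slope would in practice be measured from the diode's I-L curve), not parameters of the diodes used in our experiments.

```python
def choose_operating_point(l_mw: float, i_th_ma: float, slope_mw_per_ma: float):
    """Pick I_DC and I_pp for an average power budget of l_mw, following
    the rule above: set I_DC so the average output is l_mw, then use the
    largest swing that keeps the diode linear, I_pp / 2 = I_DC - I_th.

    i_th_ma:         diode threshold current, mA (diode-specific)
    slope_mw_per_ma: optical power per current above threshold, mW/mA
                     (assumed measured from the I-L curve)
    """
    i_dc = i_th_ma + l_mw / slope_mw_per_ma  # average power set by I_DC alone
    i_pp = 2 * (i_dc - i_th_ma)              # largest distortion-free amplitude
    return i_dc, i_pp

# Hypothetical blue-diode numbers, for illustration only:
i_dc, i_pp = choose_operating_point(l_mw=5.0, i_th_ma=26.0, slope_mw_per_ma=1.2)
print(f"I_DC = {i_dc:.1f} mA, I_pp = {i_pp:.1f} mA")
```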
Fig. 6. Characteristics of the 450-nm blue laser (first row) and the 638-nm red laser (second row). (First column) Current-light DC characteristics. (Second column) Microphone response for a 1 kHz tone with different amplitudes. (Third column) Frequency responses of the overall setup for fixed bias and amplitude.

Characterizing the Frequency Response of Laser Audio Injection. Next, we set out to characterize the response of the microphone to different frequencies of sound signals injected via laser beams. We use the same operating points as in the previous experiment, and set the tone's amplitude such that it fits within the linear region (I_DC = 200 mA and I_pp = 150 mA for the blue laser, and I_DC = 150 mA and I_pp = 75 mA for the red laser). We then record the microphone's output levels while changing the frequency f of the light-modulated sine wave.

The third column of Figure 6 shows the obtained frequency response for both blue and red lasers. The horizontal axis is the frequency while the vertical axis is the peak-to-peak voltage of the microphone output. Both lasers have very similar responses, covering the entire audible band of 20 Hz–20 kHz, implying the possibility of injecting any audio signal.

Choice of Laser. Finally, we note the color insensitivity of the injection. Although blue and red lights are on opposite edges of the visible spectrum (see Figure 4), the levels of the injected audio signal are in the same range and the shapes of the frequency-response curves are also similar. Therefore, color has low priority in choosing a laser compared to other factors for making LightCommands. In this paper, we consistently use the 450-nm blue laser mainly because of (i) better availability of high-power diodes and (ii) the advantage in focusing due to its shorter wavelength.

C. Mechanical or Electrical Transduction?

In this section we set out to investigate whether our light-based acoustic signal injection is due to physical movements of the microphone's diaphragm (i.e., light-induced mechanical vibration), or to another mechanism such as the photoelectric effect. We achieve this via a series of measurements that gradually modify the microphone's mechanical condition while leaving its optical condition intact. Using an ADMP401 microphone, we first take a baseline measurement without any modification. Next, we remove the microphone's pressure reference chamber by opening the package covering the diaphragm and ASIC (as shown in the second column of Figure 2). Finally, we dampen the diaphragm's movement by putting glue directly on the opened and exposed diaphragm.† We note that the microphone's optical properties are unchanged, as the glue is transparent and applied from the diaphragm's back side, leaving the acoustic port intact.

Figure 7 presents the resulting voltage signals from the microphone in the three conditions, illuminated with the same laser beam (the blue laser with f = 1 kHz, I_pp = 50 mA, and I_DC = 80 mA). As can be seen, the modifications decrease the amplitude of the signal detected by the microphone, and the signal after the glue application is less than 10% of the original signal. We thus attribute our light-based signal injection results to mechanical movements of the microphone's diaphragm, which are in turn translated to output voltage by the microphone's internal circuitry.

† We used transparent and non-conductive glue (Gorilla Super Glue), and conducted the measurement while the glue was wet, since surface tension during the curing process can damage the chip.

Fig. 7. The microphone's response to laser injection (450 nm, f = 1 kHz, I_pp = 50 mA, I_DC = 80 mA) under different mechanical conditions: (Cyan) a baseline measurement without modification, (Magenta) the microphone with its metal package removed, and (Black) a transparent glue on the microphone's diaphragm.

V. ATTACKING VARIOUS VOICE-CONTROLLABLE SYSTEMS

In this section we evaluate our attack on sixteen popular VC systems. We aim to find the minimal laser power required by the attacker in order to gain control over the VC system under ideal conditions, as well as the maximal distance at which such control can be obtained under more realistic conditions.

Target Selection. We benchmark our attack against several consumer devices which have voice control capabilities (see Table I). We aim to test the most popular voice assistants, namely Alexa, Siri, Portal, and Google Assistant. While we do not claim that our list is exhaustive, we do argue that it provides some intuition about the vulnerability of popular VC systems to laser-based voice injection attacks. Next, to explore how different hardware variations (rather than algorithmic variations) affect our attack performance, we benchmark our attack on multiple devices running the same voice recognition backend: Alexa, Siri, Portal and Google Assistant, as summarized in Table I. For some devices, we examine different generations to explore the differences in attack performance across hardware models. Finally, we also considered third-party devices with built-in speech recognition, such as the EcoBee thermostat.

A. Exploring Laser Power Requirements

In this section we aim to characterize the minimal laser power required by the attacker under ideal conditions to take control over a voice-activated system. Before describing our experimental setup, we present our selection of benchmarked voice commands and the experiment success criteria.

Command Selection. We have selected four different voice commands that represent common operations performed by voice-operated systems.

• What Time Is It? This command was selected to serve as the baseline of our experiments, as it does not require the device to perform nearly any operation besides correctly identifying the command and accessing the Internet to recover the current time.
• Set the Volume to Zero. Here, we demonstrate the attacker's ability to control the output of the VC system. We expect this to be the first voice command issued by the attacker, in an attempt to avoid attracting attention from the target's legitimate owner.

• Purchase a Laser Pointer. With this command we show how an attacker can potentially place orders for various products on behalf (and at the expense) of users. The attacker can subsequently wait for the delivery near the target's residence and collect the purchased item.

• Open the Garage Door. Finally, and perhaps most devastatingly, we show how an attacker can interact with additional systems which have been linked by the user to the targeted VC system. While the garage door opener is one such example with clear security implications, we discuss other examples in Section VI.

Command Generation. We have generated audio recordings of all four of the above commands using a common audio recording system (e.g., Audacity). Each command recording was subsequently appended to a recording of the wake word corresponding to the device being tested (e.g., Alexa, Hey Siri, Hey Portal, or OK, Google) and normalized to adjust the overall volume of the recordings to a constant value. We obtained a resulting corpus of 16 complete commands. Finally, for each device, we injected four of the complete commands (those beginning with the device-appropriate wake word) into the device's microphone using the setup described below, and observed the device's response.

Verifying Successful Injection. We consider a command injection attempt as successful in case the device somehow indicates the correct interpretation of the command. For devices with screens (such as phones and screen-enabled speakers), we considered an attempt successful when the device correctly displayed a transcription of the light-injected voice command. For screen-less devices (e.g., smart speakers), we manually examined the command log of the account associated with the device for the correct command transcription.

Attack Success Criteria. For a given power budget, distance, and command, we consider the injection successful in case the device correctly recognized the command during three consecutive injection attempts. We take this as an indication that the power budget and distance are sufficient for achieving a near-perfect success probability assuming suitable aiming. Next, we consider an attack successful for a given power budget and distance in case all four of our commands were successfully injected into the device during three consecutive injection attempts. As in the individual command case, we take this as an indication that the considered power budget and distance are sufficient for a high-probability successful command injection. We note that this criterion is conservative, as some commands are easier to inject than others, presumably due to their phonetic properties. As such, the results in this section should be seen as a conservative estimate of what an attacker can achieve for each device assuming good environmental conditions (e.g., quiet surroundings and suitable aiming), while better results in terms of distance and power can be achieved if less-than-perfect accuracy is acceptable.

Voice Customization and Security Settings. For the experiments conducted in this section, we left all the devices' settings in their default configuration. In embedded Alexa and Google VC systems (e.g., smart speakers, cameras, etc.), voice customization is off by default, meaning that the device will operate on commands spoken by any voice. Meanwhile, for phone and tablet devices (which are typically operated by a single user), we left the voice identification in its default activated setting. For such devices, to ascertain the minimal required power for a successful attack, we trained the VC system with a human voice and subsequently injected the audio recording of the commands spoken using the same voice. Finally, in Section V-C, we discuss bypassing various voice matching mechanisms.

Experimental Setup. We use the same blue laser and Thorlabs laser driver as in Section IV-A, aiming the laser beam at the microphone ports of the devices listed in Table I from a distance of about 30 cm. To control the surrounding environment, the entire setup was placed in a metal enclosure, with opaque bottom and sides and a dark red semi-transparent acrylic top plate designed to block blue light. See Figure 8. As the goal of the experiments described in this section is to ascertain the minimal required power for a successful attack on each device, we used a pair of electrically controlled scanning mirrors (a 40 Kbps high-speed laser scanning system for laser shows) to precisely place the laser beam in the center of the device's microphone port. Before each experiment we manually focused the laser so that the laser spot size hitting the microphone is minimal.

For aiming at devices whose microphone port is covered with cloth (e.g., the Google Home Mini shown in Figure 9), the position of the microphone ports can be determined using an easily-observable reference point such as the device's wire connector or LED array. Finally, we note that the distance between the microphone and the reference point is easily obtainable by the attacker, either by exploring his own device or by referring to online teardown videos [35].

Experimental Results. The fifth column of Table I presents a summary of our results. While the power required from the attacker varies from 0.5 mW (Google Home) to 60 mW (Galaxy S9), all the devices are susceptible to laser-based command injection. Finally, we note that the microphone port of some devices (e.g., Google Home Mini) is covered with fabric and/or foam. While we conjecture that this attenuates optical power, as Table I shows, the attack is still possible.

Finally, we note that the experiments done in this section are performed under ideal conditions, at close range and with the aid of electronic aiming mirrors. Thus, in Section V-B we report on attack results under more realistic conditions with respect to distance and aiming.

B. Exploring Attack Range

In this section we set out to explore the effective range of our attack under more realistic attack conditions.
TABLE I
Tested devices with minimum activation power and maximum distance achievable at the given power of 5 mW and 60 mW. A 110 m long hallway was used for 5 mW tests, while a 50 m long hallway was used for tests at 60 mW.

Device | Backend | Category | Authentication | Minimum Power [mW] | Max Distance at 60 mW [m] | Max Distance at 5 mW [m]
Google Home | Google Assistant | Speaker | No | 0.5 | 50+ | 110+
Google Home Mini | Google Assistant | Speaker | No | 16 | 20 | —
Google Nest Cam IQ | Google Assistant | Camera | No | 9 | 50+ | —
Echo Plus 1st Generation | Alexa | Speaker | No | 2.4 | 50+ | 110+
Echo Plus 2nd Generation | Alexa | Speaker | No | 2.9 | 50+ | 50
Echo | Alexa | Speaker | No | 25 | 50+ | —
Echo Dot 2nd Generation | Alexa | Speaker | No | 7 | 50+ | —
Echo Dot 3rd Generation | Alexa | Speaker | No | 9 | 50+ | —
Echo Show 5 | Alexa | Speaker | No | 17 | 50+ | —
Echo Spot | Alexa | Speaker | No | 29 | 50+ | —
Facebook Portal Mini | Alexa + Portal | Speaker | No | 18 | 5 | —
Fire Cube TV | Alexa | Streamer | No | 13 | 20 | —
EcoBee 4 | Alexa | Thermostat | No | 1.7 | 50+ | 70
iPhone XR (Front Mic) | Siri | Phone | Yes | 21 | 10 | —
iPad 6th Gen | Siri | Tablet | Yes | 27 | 20 | —
Samsung Galaxy S9 (Bottom Mic) | Google Assistant | Phone | Yes | 60 | 5 | —
Google Pixel 2 (Bottom Mic) | Google Assistant | Phone | Yes | 46 | 5 | —

Fig. 8. Setup for exploring laser power requirements: the laser and target are arranged in the laser enclosure. The laser spot is aimed at the target acoustic port using electrically controllable scanning mirrors inside the enclosure. The enclosure's top red acrylic cover was removed for visual clarity.

Fig. 9. Google Home Mini. Notice the cloth-covered microphone ports, located relative to the LED array.

Experimental Setup. From the experiments performed in Section V-A we note that about 60 mW of laser power is sufficient for successfully attacking all of our tested devices (at least under ideal conditions). Thus, in this section we benchmark the range of our attack using two power budgets.

• 60 mW High-Power Laser. As explained in Section II-G, we frequently encountered laser pointers whose measured power output was above 60 mW, which greatly exceeds the legal 5 mW restriction. Thus, emulating an attacker who does not follow laser safety protocols for consumer devices, we benchmark our attack using 60 mW lasers, which is sufficient for successfully attacking all of our tested devices in the previous experiment.

• 5 mW Low-Power Laser. Next, we also explore the maximum range of a more restricted attacker, who is limited to the maximum amount of power allowed in the U.S. for consumer laser pointers, namely 5 mW.

Laser Focusing and Aiming. For large attack distances (tens of meters), laser focusing requires a large diameter lens and cannot be done via the small lenses that are typically used for laser pointers. Thus, we mounted our laser to an Opteka 650-1300 mm high-definition telephoto lens, with 86 mm diameter (Figure 1(left)). Finally, to simulate realistic aiming conditions for the attacker, we avoided the use of electronic scanning mirrors (used in Section V-A) and mounted the lens and laser on a geared camera head (Manfrotto 410 Junior Geared Tripod Head) and tripod. Laser aiming and focusing was done manually, with the target also mounted on a (separate) tripod.

Test Locations and Experimental Procedure. As eye exposure to a 60 mW laser is potentially dangerous, we blocked off a 50 meter long corridor in our office building and performed the experiments at night. However, due to safety reasons, we were unable to obtain a longer corridor for our high-power tests. For lower-power attacks, we performed the experiments in a 110 meter long corridor connecting two buildings (see Figure 1(top)). In both cases, we fixed the target at distance and subsequently adjusted the optics, obtaining the smallest possible laser spot. We then regulated the diode current so that the target is illuminated with 5 or 60 mW, respectively. Finally, the corridor is illuminated with regular fluorescent lamps at office-level brightness, while the ambient acoustic noise in both experiments was about 46 dB (measured using a General Tools DSM403SD sound level meter).
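For intuition about why a large-diameter lens matters at these ranges, here is a quick diffraction-limit estimate. The Airy-disk formula below is a standard optics approximation, not a calculation from the paper's measurements, and real spots will be larger due to beam quality and focusing error.

```python
def spot_diameter_m(wavelength_m: float, lens_d_m: float, range_m: float) -> float:
    """Ideal diffraction-limited spot diameter at the target, using the
    textbook ~2.44 * lambda * R / D Airy-disk estimate."""
    return 2.44 * wavelength_m * range_m / lens_d_m

# 450 nm beam, 86 mm telephoto aperture, 110 m corridor:
d = spot_diameter_m(450e-9, 86e-3, 110)
print(f"ideal spot at 110 m: {d * 1e3:.2f} mm")  # ~1.4 mm
```

Even in the ideal case the spot is on the order of a millimeter, comparable to a microphone port, which is why both a large aperture and precise geared aiming are needed.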
Experimental Results. Table I contains a summary of our distance-benchmarking results. With 60 mW laser power, we have successfully injected voice commands into all the tested devices from a distance of several meters. For devices that reached the maximum 50 meters in the high-power experiment, we also conducted the low-power experiment in the 110 m hallway. Untested devices are indicated by '—' in Table I because of their high minimal activation power. While most devices require a 60 mW laser for successful command injection (e.g., a non-standard-compliant laser pointer), some popular smart speakers such as Google Home and Echo Plus 1st and 2nd Generation are particularly sensitive, allowing for command injection even with 5 mW power over tens of meters. Finally, as our attacks were conducted in 50 and 110 meter hallways (for 60 and 5 mW lasers, respectively), for some devices we had to stop the attack when the maximal hallway length was reached. We mark this case with a + sign near the device's range in the appropriate column.

C. Attacking Speaker Authentication

We begin by distinguishing between speaker recognition features, which are designed to recognize the voice of specific users and personalize the device's content, and speaker authentication features, which are designed to restrict access control to specific users. While not the main topic of this work, in this section we discuss both features in the context of light-based command injection.

No Speaker Authentication for Smart Speakers. We begin by observing that for smart speaker type devices (which are the main focus of this work), speaker recognition is off by default at the time of writing. Next, even if the feature is enabled by careful users, smart speakers are designed to be utilized by multiple people. Thus, their speaker recognition features are usually limited to content personalization rather than authentication, treating unknown voices as guests. Empirically verifying this, we found that Google Home and Alexa smart speakers block voice purchasing for unrecognized voices (presumably as they do not know which account should be billed for the purchase) while allowing previously-unheard voices to execute security-critical voice commands such as unlocking doors. Finally, we note that at the time of writing voice authentication is not available for smart speaker devices, which are common home smart assistant deployments.

Phone and Tablet Devices. Next, while not the main focus of this work, we also investigated the feasibility of light command injection into phone and tablet devices. For such devices, speaker authentication is enabled by default due to the high processing power and single-owner use.

Overview of Voice Authentication. After training using samples of the owner's voice speaking specific sentences, the tablet or phone continuously listens to the microphone, acquiring a set of voice samples. These are in turn fed into deep learning models which recognize if the voice sample corresponds to assistant-specific wake-up words (e.g., "Hey Siri" or "Ok Google") spoken by the owner. Finally, in case of a successful detection of features matching the owner's voice, the phone or tablet proceeds to parse and subsequently execute the voice command.

Bypassing Voice Authentication. Intuitively, an attacker can defeat the speaker authentication feature using authentic voice recordings of the device's legitimate owner speaking the desired voice commands. Alternatively, if no such recordings are available, DolphinAttack [4] suggests using speech synthesis techniques, such as splicing relevant phonemes from other recordings of the owner's voice, to construct the commands.

Wake-Only Security. However, during our experiments we found that speaker recognition is used by Google and Apple to only verify the wake word, as opposed to the entire command. For example, Android and iOS phones trained to recognize a female voice correctly execute commands where the wake word was spoken by the female voice, while the rest of the command was spoken using a male voice. Thus, to bypass voice authentication, an attacker only needs a recording of the device's wake word in the owner's voice (which can be obtained by recording any command spoken by the owner).

Reproducing Wake Words. Finally, we explore the possibility of using Text-To-Speech (TTS) techniques for reproducing the owner's voice saying the wake words for a tablet or phone based voice assistant. To that aim, we repeat the phone and tablet experiments done in Sections V-A, V-B and Table I, training all the phone and tablet devices with a human female voice. We then used NaturalReader [36], an online TTS tool, for generating the wake words specific to each device, hoping that the features of one of the offered voices would match the human voice used for training. See Table II for device-specific voice configurations matching the female voice used for training. Next, we concatenate the synthetically-generated wake word spoken in a female voice to a voice command pronounced by a male native-English speaker. Using these recordings, we successfully replicated the minimal power and maximum distance results as presented in Table I.

We thus conclude that while voice recognition is able to enforce some similarity between the attacker's and owner's voices, it does not offer sufficient entropy to form an adequate countermeasure to command injection attacks. In particular, out of the 18 English voices supported by NaturalReader, we were able to find an artificial voice matching the human female voice used for training for all 4 of the tablet and phone devices considered in this work. Finally, we did not test the ability to match voices for devices other than phones and tablets, as voice authentication is not available for smart speaker devices at the time of writing.
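As an illustration of the recording-concatenation step (a TTS wake word appended to a separately recorded command and normalized), here is a minimal sketch using the pydub audio library. The file names are hypothetical, and pydub is our choice for the sketch, not necessarily the tooling used in the experiments.

```python
from pydub import AudioSegment
from pydub.effects import normalize

# Hypothetical file names; any TTS-generated wake word and a
# separately recorded command clip would do.
wake = AudioSegment.from_wav("wake_word_tts_female.wav")
command = AudioSegment.from_wav("command_male_speaker.wav")

# Append the command to the wake word and normalize the overall
# volume, mirroring the corpus-construction step of Section V-A.
full = normalize(wake + command)
full.export("injected_command.wav", format="wav")
```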
TABLE II
Bypassing voice authentication on phone and tablet devices

Device | Assistant | TTS Service | Voice Name
iPhone XR | Siri | NaturalReader US English | Heather
iPad 6th Gen | Siri | NaturalReader US English | Laura
Galaxy S9 | Google Assistant | NaturalReader US English | Laura
Pixel 2 | Google Assistant | NaturalReader US English | Laura
VI. E XPLORING VARIOUS ATTACK S CENARIOS attack conditions.
The results of Section V clearly demonstrate the feasibil- B. Attacking Authentication
ity of laser-based injection of voice commands into voice-
Some of the current generation of VC systems attempt
controlled devices across large attack distances. In this section,
to protect unauthorized execution of sensitive commands by
we explore the security implications of such an injection, as
requiring additional user authentication step. For phone and
well as experiment with more realistic attack conditions.
tablet devices, the Siri and Alexa apps require the user to
A. A Low-Power Cross-Building Attack unlock the phone before executing certain commands (e.g.,
For the long-range attacks presented in Section V-B, we unlock front door, disable home alarm system). However, for
deliberately placed the target device so that the microphone devices that do not have other form of inputs beside the
ports are facing directly into the laser beam. While this is user’s voice (e.g., voice-enabled smart speakers, cameras, and
realistic for some devices (who have microphone ports on their thermostats) a digit-based PIN code is used to authenticate the
sides), such an arrangement is artificial for devices with top- user before critical commands are performed.
facing microphones (unless mounted sideways on the wall). PIN Eavesdropping. The PIN number spoken by the user
In this section we perform the attack under a more real- is inherently vulnerable to eavesdropping attacks, which can
istic conditions where an attacker aims from another higher be performed remotely using a laser microphone (measuring
building at a target device placed upright on a window sill. the acoustic vibration of a glass window using a laser reflec-
Experimental Conditions. We use the laser diode, telephoto tion [37]), or using common audio eavesdropping techniques.
lens and laser driver from Section V, operating the diode at Moreover, within an application the same PIN is used to
5 mW (equivalent to a laser pointer). Next, we placed a Google authenticate more than one critical command (e.g., “unlock
Home device (which only has top-facing microphones) upright the car” and “start the engine”) while users often re-use PIN
near a window, on a fourth-floor office (15 meters above the numbers across different applications. In both cases, increasing
ground). The attacker’s laser was placed on a platform inside the number of PIN-protected commands ironically increases
a nearby bell tower, located 43 meters above ground level. the opportunity for PIN eavesdropping attacks.
Overall, the distance between the attacker’s and laser was 75 PIN Brute forcing. We also observed incorrect implemen-
meters, see Figure 10 for the configuration. tation of PIN verification mechanisms. While Alexa naturally
Laser Focusing and Aiming. As in Section V-B, it is supports PIN authentication (limiting the user to three wrong
impossible to focus the laser using the small lens typically attempts before requiring interaction with a phone application),
used for laser pointers. We thus mounted the laser to an Opteka Google Assistant delegates PIN authentication to third-party
650-1300 mm telephoto lens. Next, to aim the laser across device vendors that often lack security experience.
large distances, we have mounted the telephoto lens on a Evaluating this design choice, we have investigated the
Manfrotto 410 geared tripod head. This allows us to precisely feasibility of PIN brute forcing attacks on an August Smart
aim the laser beam on the target device across large distances, Lock Pro, which is the most reviewed smart lock on Amazon
achieving an accuracy far exceeding the one possible with at the time of writing. First, we have discovered that August
regular (non-geared) tripod heads where the attacker’s arm does not enforce a reasonable PIN code length, allowing the
directly moves the laser module. Finally, in order to see the user to set a PIN containing anywhere from 1 to 6 digits for
laser spot and the device’s microphone ports from far away, we door unlocking. Next, we observed that the August lock does
have used a consumer-grade Meade Infinity 102 telescope. As not limit the number of wrong attempts permitted by the user,
can be seen in Figure 10 (left), the Google Home microphone’s nor does the lock implement a time delay mechanism between
ports are clearly visible through the telescope.‡ incorrect attempts. Thus, all the attacker has to do to unlock
Attack Results. We have successfully injected commands the target’s door is to simply enumerate all possible PIN codes.
into the Google Home target in the above described conditions. Empirically verifying this, we have written a Python im-
We note that despite its low 5 mW power and windy conditions plementation that enumerates all 4-digit PIN numbers using a
synthetic voice. After each unsuccessful attempt, the Google
‡ Figure 10 (left) was taken via a cell phone camera attached to the tele- home device responded with “Sorry, the security code is
scope’s eyepiece. Unfortunately, due to imperfect phone-eyepiece alignment, incorrect, can I have your security code to unlock the front
the outcome is slightly out of focus and the laser spot is over saturated.
However, the Google Home was in sharp focus with a small laser spot when door?” only to have our program speak the next PIN candidate.
viewed directly by a human observer. Overall, a single unlock attempt lasted about 13 seconds,
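As a quick consistency check on the geometry above (the 43 m and 15 m heights, together with the roughly 70 m of horizontal separation annotated in Figure 10), the beam path length and incidence angle follow from simple trigonometry. A small Python check, with all values taken from the figure:

    import math

    # Geometry from Figure 10 (all values in meters).
    laser_height = 43.0    # platform in the bell tower
    target_height = 15.0   # window sill of the fourth-floor office
    horizontal = 70.0      # horizontal separation between the buildings

    drop = laser_height - target_height                 # 28 m vertical drop
    path = math.hypot(horizontal, drop)                 # slant beam path
    angle = math.degrees(math.atan2(drop, horizontal))  # angle from horizontal

    print(f"beam path {path:.1f} m, incidence angle {angle:.1f} degrees")
    # -> beam path 75.4 m, incidence angle 21.8 degrees

This matches the 75-meter path and the 21.8-degree incidence angle reported above.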
Fig. 10. Setup for the low-power cross-building attack: (Top left) Laser and target arrangement. (Bottom left) Picture of the target device as visible through
the telescope, with the microphone ports and laser spot clearly visible. (Middle) Picture from the tower: laser on telephoto lens aiming down to the target.
(Right) Picture from the office building: laser spot on the target device.

B. Attacking Authentication

Some of the current generation of VC systems attempt to prevent the unauthorized execution of sensitive commands by requiring an additional user authentication step. For phone and tablet devices, the Siri and Alexa apps require the user to unlock the phone before executing certain commands (e.g., unlocking the front door or disabling the home alarm system). However, for devices that have no form of input besides the user's voice (e.g., voice-enabled smart speakers, cameras, and thermostats), a digit-based PIN code is used to authenticate the user before critical commands are performed.

PIN Eavesdropping. The PIN spoken by the user is inherently vulnerable to eavesdropping attacks, which can be performed remotely using a laser microphone (measuring the acoustic vibration of a glass window using a laser reflection [37]) or using common audio eavesdropping techniques. Moreover, within an application the same PIN is used to authenticate more than one critical command (e.g., "unlock the car" and "start the engine"), while users often re-use PINs across different applications. In both cases, increasing the number of PIN-protected commands ironically increases the opportunity for PIN eavesdropping attacks.

PIN Brute Forcing. We also observed incorrect implementations of PIN verification mechanisms. While Alexa natively supports PIN authentication (limiting the user to three wrong attempts before requiring interaction with a phone application), Google Assistant delegates PIN authentication to third-party device vendors that often lack security experience.

Evaluating this design choice, we investigated the feasibility of PIN brute-forcing attacks on an August Smart Lock Pro, the most-reviewed smart lock on Amazon at the time of writing. First, we discovered that August does not enforce a reasonable PIN code length, allowing the user to set a PIN containing anywhere from 1 to 6 digits for door unlocking. Next, we observed that the August lock neither limits the number of wrong attempts permitted by the user nor implements a time delay mechanism between incorrect attempts. Thus, all the attacker has to do to unlock the target's door is to simply enumerate all possible PIN codes.

Empirically verifying this, we wrote a Python implementation that enumerates all 4-digit PINs using a synthetic voice. After each unsuccessful attempt, the Google Home device responded with "Sorry, the security code is incorrect, can I have your security code to unlock the front door?", only to have our program speak the next PIN candidate. Overall, a single unlock attempt lasted about 13 seconds, requiring 36 hours to enumerate the entire 4-digit space (3.6 hours for 3 digits). In both the 3- and 4-digit cases, the door was successfully unlocked when the correct PIN was reached.
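Our original enumeration script is not reproduced here; the following minimal sketch conveys the idea, assuming the pyttsx3 text-to-speech package (any synthetic voice works) and the roughly 13-second per-attempt interaction measured above. The wake phrase and timing constants are illustrative:

    import time
    import pyttsx3  # off-line text-to-speech engine

    engine = pyttsx3.init()

    def speak(text: str) -> None:
        # The synthesized audio leaves through the sound card and thus
        # amplitude-modulates the laser, like any other injected command.
        engine.say(text)
        engine.runAndWait()

    speak("OK Google, unlock the front door")  # triggers the PIN prompt

    # 10,000 candidates at ~13 s per attempt gives the ~36 h figure above;
    # the 1,000-candidate 3-digit space takes ~3.6 h.
    for pin in range(10_000):
        speak(" ".join(f"{pin:04d}"))  # e.g., 42 is spoken as "0 0 4 2"
        time.sleep(13)                 # wait out the spoken rejection prompt
    # Success is observed out-of-band: the door simply unlocks once the
    # correct PIN is spoken, as the device offers no other feedback channel.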
PIN Bypassing. Finally, we discovered that while commands like "unlock the front door" for August locks or "disable the alarm system" for Ring alarms require PIN authentication, other commands, such as "open the garage door" using an assistant-enabled garage door opener,§ generally do not require any authentication. Thus, even if one command is unavailable, the attacker can often achieve a similar goal by using other commands.

C. Attacking Cars

Many modern cars have Internet-over-cellular connectivity, allowing their owners to perform certain operations via a dedicated app on their mobile devices. In some cases, this connectivity has further evolved (either by the vendor or by a third party) into having the target's car be connected to a VC system, allowing voice unlocking and/or pre-heating (which often requires an engine start). Thus, a compromised VC system might be used by an attacker to gain access to the target's car. In this section we investigate the feasibility of such attacks, using two major car manufacturers, namely Tesla and Ford.

Tesla. Tesla cars allow their owners to interact with the car using a dedicated Tesla-provided phone app. After installing the app on our phone and linking it to a Tesla Model S vehicle, we installed the "EV Car"¶ integration, linking it to the vehicle. While "EV Car" is not officially provided by Tesla, after successful configuration using the vehicle's owner credentials we obtained several capabilities. These included getting information about the vehicle's current location,|| locking and unlocking the doors and trunk, and starting and stopping the vehicle's charging and climate control system. Next, we note that we were able to perform all of these tasks using only voice commands, without the need for a PIN or key proximity. Finally, we were not able to start the car without key proximity.

Ford Cars. For newer vehicles, Ford provides a phone app called "FordPass" that connects to the car's Ford SYNC system and allows the owner to interact with the car over the Internet. Taking the next step, Ford also provides a FordPass Google Assistant integration** with capabilities similar to those of the "EV Car" integration for Tesla. While Ford implemented PIN protection for critical voice commands like remote engine start and door unlocking, as in the case of August locks there are no mechanisms in place to prevent PIN brute forcing. Finally, while we were able to remotely open the doors and start the engine, shifting the vehicle out of "Park" immediately stopped the engine, preventing the unlocked car from being driven.

D. Exploring Stealthy Attacks

The attacks described so far can be spotted by the user of the targeted VC system in three ways. First, the user might notice the light indicators on the target device following a successful command injection. Next, the user might hear the device acknowledging the injected command. Finally, the user might notice the laser spot while the attacker tries to aim the laser at the target's microphone port.

§ https://www.garadget.com/
¶ https://assistant.google.com/services/a/uid/000000196c7e079e?hl=en
|| Admittedly, the audible location is of little use to a remote attacker who is unable to listen in on the speaker's output.
** https://assistant.google.com/services/a/uid/000000ac1d2afd15
While the first issue is a limitation of our attack (and in fact of any command injection attack), in this section we explore the attacker's options for addressing the remaining two issues.

Acoustic Stealthiness. To tackle the issue of the device owner hearing the targeted device acknowledging the execution of a voice command (or asking for a PIN during the brute-forcing process), the attacker can start the attack by asking the device to lower its speaker volume. For some devices (EcoBee, Google Nest Camera IQ, and Fire TV), the volume can be reduced to completely zero, while for other devices it can be set to barely-audible levels. Moreover, the attacker can also abuse device features to achieve the same goal. For Google Assistant, enabling "do not disturb mode" mutes reminders, broadcast messages, and other spoken notifications. For Amazon Echo devices, enabling "whisper mode" significantly reduces the volume of the device's responses during the attack, to almost inaudible levels.

Optical Stealthiness. Next, to avoid having the owner spot the laser light aimed at the target device, the attacker can use an invisible laser wavelength. Experimentally verifying this, we replicated the attack on the Google Home device from Section V-A using a 980-nm infrared laser (a Lilly Electronics 30 mW laser module). We then connected the laser to a Thorlabs LDC205C driver, limiting its power to 5 mW. Finally, as the spot created by infrared lasers is invisible to human eyes, we aimed the laser using a smartphone camera (as these typically do not contain infrared filters).

Using this setup, we successfully injected voice commands into a Google Home at a distance of about 30 centimeters, in the same enclosure as in Section V-A. The spot created by the infrared laser was barely visible using the phone camera, and completely invisible to the human eye. Finally, not wanting to risk prolonged exposure to invisible (but eye-damaging) laser beams, we did not perform range experiments with this setup. However, given the color insensitivity described in Section IV-A, we conjecture that results similar to those obtained in Section V-B could be obtained here as well.
E. Avoiding the Need for Precise Aiming

Another limitation of the attacks described so far is the need to aim the laser spot precisely at the target's microphone ports. While we achieved such aiming in Section VI-A by using geared camera tripod heads, in this section we show how the need for precise aiming can be avoided altogether.

An attacker can use a higher-power laser and trade its power for a larger laser spot size, which makes aiming considerably easier. Indeed, laser modules above 4,000 mW are commonly available on e-commerce sites for laser engraving. Since we could not test such a high-power laser in an open-air environment due to safety concerns, we decided to use a laser-excited phosphor flashlight (an Acebeam W30 with 500 lumens), which is technically a laser but is sold as a flashlight with beam-expanding optics.

To allow for voice modulation, we modified the flashlight by removing its original current driver and connecting its diode terminals to the Thorlabs LDC240C laser driver (see Figure 11). Then, the experimental setup of Section V-B was replicated, except that the laser diode and telephoto lens were replaced with the flashlight. Using this setup, we successfully injected commands into a Google Home device at a range of about 10 meters, while running the flashlight at an output power of 1 W. Next, as can be seen in Figure 11, the beam spot created by the flashlight is large enough to cover the entire target (and its microphone ports), without the need for additional focusing optics and aiming equipment. However, we note that while the large spot size helps with imprecise aiming, the flashlight's quickly diverging beam also limits the attack's maximal distance.

Finally, the large spot created by the flashlight (covering the entire device surface) can also be used to inject the sound into multiple microphones simultaneously, thereby potentially defeating the software-based anomaly detection countermeasures described in Section VII.

Fig. 11. Setup with laser flashlight to avoid precise aiming. (Left) Target device illuminated by the flashlight, with the light spot covering the entire target including its microphone holes. (Right) Modified laser flashlight, with its diode terminals wired to the external driver, mounted on a geared tripod head and aiming at the target 10 meters away.

F. Reducing the Attack Costs

While the setups used for all the attacks described in this paper are built using readily available components, some equipment (such as the laser driver and diodes) is intended for lab use, making assembly and testing somewhat difficult for a non-experienced user. In this section we present a low-cost setup that can be easily constructed using improvised means and off-the-shelf components.

Laser Diode and Optics. Modifying off-the-shelf laser pointers can be an easy way to get a laser with collimation optics. In particular, cheap laser pointers often have no current regulators, having their anodes and cathodes directly connected to the batteries. Thus, we can easily connect a current driver to the pointer's battery connectors via alligator clips. Figure 12 shows a setup based on a cheap laser pointer, available at $18 for 3 pieces on Amazon.††

†† https://www.amazon.com/gp/product/B075K69DTQ
Fig. 12. Setup for the low-cost attack: a laser current driver connected to a laser pointer attacking a Google Home device. The pointer's battery electrodes are connected to the driver via alligator clips, and the driver's modulation input is fed from a PC audio output through an audio amplifier.

Laser Driver. The laser current driver with an analog modulation port is the most specialized instrument in the attacker's setup. We used scientific-grade laser drivers that cost about $1,500; however, there are cheaper alternatives, such as the Wavelength Electronics LD5CHA current driver, available at a cost of about $300.

Sound Source and Experimental Results. Finally, the attacker needs a method for playing recorded audio commands. We used an ordinary on-board laptop sound card (Dell XPS 15 9570), amplified using a Neoteck NTK059 Headphone Amplifier ($30 on Amazon). See Figure 12 for a picture of a complete low-cost setup. We have experimentally verified successful command injection using this setup into a Google Home target located at a distance of 15 meters (with the main range limitation being the laser focusing optics and an artificially-limited power budget of 5 mW for safety reasons).
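On the software side, the sound source is nothing more than ordinary audio playback into the driver's modulation port. A minimal sketch using the sounddevice and scipy packages (the file name is illustrative):

    import sounddevice as sd
    from scipy.io import wavfile

    # Load a pre-recorded or synthesized voice command (name is illustrative).
    rate, samples = wavfile.read("command.wav")

    # Play through the default sound card; the headphone amplifier then
    # drives the laser current driver's analog modulation input, so this
    # waveform directly amplitude-modulates the diode current.
    sd.play(samples, samplerate=rate)
    sd.wait()  # block until playback completes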
VII. COUNTERMEASURES AND LIMITATIONS

A. Software-Based Approach

As discussed in Section VI-B, an additional layer of authentication can be effective at somewhat mitigating the attack. Alternatively, in case the attacker cannot eavesdrop on the device's response (for example, because the device is located far away behind a closed window), having the VC system ask the user a simple randomized question before command execution can be an effective way of preventing the attacker from obtaining successful command execution. However, we note that adding an additional layer of interaction often comes at a cost in usability, limiting user adoption.
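For concreteness, the following sketch shows the two server-side mechanisms whose absence enabled the PIN brute forcing of Section VI-B, namely an attempt cap and a growing inter-attempt delay. The class, thresholds, and lockout behavior are illustrative, not any vendor's actual implementation:

    import time
    import hmac

    class PinVerifier:
        """Illustrative PIN check with an attempt cap and a delay that
        doubles after every failure; both thresholds are example values."""

        MAX_ATTEMPTS = 3   # e.g., Alexa allows three wrong attempts
        BASE_DELAY = 2.0   # seconds

        def __init__(self, pin: str) -> None:
            self._pin = pin
            self._failures = 0

        def verify(self, attempt: str) -> bool:
            if self._failures >= self.MAX_ATTEMPTS:
                raise PermissionError("locked out; confirm via the phone app")
            time.sleep(self.BASE_DELAY * 2 ** self._failures)  # slow enumeration
            if hmac.compare_digest(attempt, self._pin):  # constant-time compare
                self._failures = 0
                return True
            self._failures += 1
            return False

Even these modest parameters defeat voice-based enumeration of the kind described in Section VI-B, as the attempt cap removes the voice channel from the brute-forcing loop after three failures.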
Finally, manufacturers can attempt to use sensor fusion techniques [38] in the hope of detecting light-based command injection. More specifically, common VC systems often have multiple microphones, which should receive similar signals due to the omnidirectional nature of acoustic wave propagation. Meanwhile, when the attacker uses a single laser, only a single microphone receives a signal while the others receive nothing. Thus, manufacturers can attempt to detect such anomalies, ignoring the injected commands. However, we note that attackers can defeat such comparison countermeasures by simultaneously injecting light into all of the device's microphones using wide beams; see Section VI-E.
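A minimal sketch of this comparison idea, flagging captures in which one microphone channel carries far more energy than the rest (the helper and its 20 dB threshold are illustrative, not a tuned detector):

    import numpy as np

    def single_mic_injection_suspected(frames: np.ndarray,
                                       max_spread_db: float = 20.0) -> bool:
        """frames has shape (num_mics, num_samples), one row per port.
        A genuine acoustic source excites every port to a similar degree,
        while a focused laser drives only the illuminated microphone."""
        rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1)) + 1e-12
        spread_db = 20.0 * np.log10(rms.max() / rms.min())
        return spread_db > max_spread_db

As noted above, a wide beam that covers all microphone ports equalizes the per-channel energies and defeats this kind of check.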
Finally, LightCommands are very different from normal audible commands. For sensor-rich devices like phones and tablets, sensor-based intrusion detection techniques [39] can potentially be used to identify and subsequently block such irregular command injections. We leave further exploration of this direction to future work.

B. Hardware-Based Approach

It is possible to reduce the amount of light reaching the microphone's diaphragm using a barrier that physically blocks straight light beams while allowing acoustic pressure waves to detour around it. Performing a literature review of proposed microphone designs, we found several such suggestions, mainly aimed at protecting microphones from sudden pressure spikes. For example, the designs in Figure 13 have a silicon plate or a movable shutter, both of which eliminate the line of sight to the diaphragm [40]. It is important to note, however, that such barriers should be opaque at all light wavelengths (including infrared and ultraviolet), preventing the attacker from going through the barrier using a different-colored light. Finally, a light-blocking barrier can also be implemented at the device level, by placing a non-transparent cover on top of the microphone hole, which attenuates the amount of light hitting the microphone.

However, we note that such physical barriers are only effective up to a point, as an attacker can always increase the laser power in an attempt to compensate for the cover-induced attenuation. Finally, in case such compensation is not possible, the attacker can always use the laser to burn through the barrier, creating his own light path.

Fig. 13. Designs of MEMS microphones with light-blocking barriers [40]: a silicon plate or a movable shutter above the acoustic port eliminates the line of sight between the port and the diaphragm.

C. Limitations

Hardware Limitations. Being a light-based attack, LightCommands inherits all the limitations of light physics. In particular, light does not properly penetrate opaque obstacles that might be penetrable to sound. In addition, unlike sound, LightCommands requires careful aiming and line-of-sight access. Finally, while line-of-sight access is often available for smart speakers visible through windows, the situation is different for mobile devices such as smart watches, phones, and tablets. This is because, unlike static smart speakers, these devices are often mobile, requiring an attacker to quickly aim and inject commands. When combined with the precise aiming and higher laser power required to attack such devices, successful LightCommands attacks might be particularly challenging. We thus leave the task of systematically exploring such devices to future work.
Liveness Test. As opposed to other attacks, LightCommands' remote threat model and lack of a proper feedback channel make it difficult for the attacker to pass any sort of liveness check. Such checks can be as primitive as asking the user a simple question before performing a sensitive command, or as sophisticated as using data from different microphones [41, 42, 43] or sound reflections [44] in order to verify that the incoming commands were indeed spoken by a live human (as opposed to played back via a speaker). We note, however, that interactive liveness tests (e.g., questions) typically hurt usability, while the works of [41, 42, 43, 44] can only authenticate users at close distances (e.g., tens of centimeters), making them inapplicable to smart speaker devices.

VIII. CONCLUSIONS AND FUTURE WORK

In this paper we presented LightCommands, an attack that uses light to inject commands into voice-controllable systems from large distances. To mount the attack, an attacker transmits light modulated with an audio signal, which is converted back to the original audio signal within a microphone. We demonstrated LightCommands on many commercially available voice-controllable systems that use Siri, Portal, Google Assistant, and Alexa, obtaining successful command injections at a maximum distance of more than 100 meters while penetrating clear glass windows. Next, we highlighted deficiencies in the security of voice-controllable systems, which lead to additional compromises of third-party hardware such as locks and cars.

A better understanding of the physics behind the attack will benefit both new attacks and countermeasures. In particular, the same principle can possibly be used to mount other acoustic injection attacks (e.g., on motion sensors) using light. In addition, heating by laser can also be an effective way of injecting false signals into sensors.

IX. ACKNOWLEDGMENTS

The authors would like to thank John Nees for providing helpful advice regarding laser operation and laser optics. This research was funded by JSPS KAKENHI Grant Numbers JP18K18047 and JP18KK0312, by the Defense Advanced Research Projects Agency (DARPA) under contract FA8750-19-C-0531, gifts from Intel, AMD, and Analog Devices, an award from MCity at the University of Michigan, and by the National Science Foundation under grant CNS-1330142.

REFERENCES

[1] W. Diao, X. Liu, Z. Zhou, and K. Zhang, "Your voice assistant is mine: How to abuse speakers to steal information and control your phone," in Proceedings of the 4th ACM Workshop on Security and Privacy in Smartphones & Mobile Devices. ACM, 2014, pp. 63–74.
[2] Y. Jang, C. Song, S. P. Chung, T. Wang, and W. Lee, "A11y attacks: Exploiting accessibility in operating systems," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 103–115.
[3] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, "Hidden voice commands," in USENIX Security Symposium, 2016, pp. 513–530.
[4] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu, "DolphinAttack: Inaudible voice commands," in ACM Conference on Computer and Communications Security. ACM, 2017, pp. 103–117.
[5] N. Roy, H. Hassanieh, and R. Roy Choudhury, "BackDoor: Making microphones hear inaudible sounds," in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2017, pp. 2–14.
[6] N. Roy, S. Shen, H. Hassanieh, and R. R. Choudhury, "Inaudible voice commands: The long-range attack and defense," in 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), 2018, pp. 547–560.
[7] X. Yuan, Y. Chen, Y. Zhao, Y. Long, X. Liu, K. Chen, S. Zhang, H. Huang, X. Wang, and C. A. Gunter, "CommanderSong: A systematic approach for practical adversarial voice recognition," in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 49–64.
[8] T. Vaidya, Y. Zhang, M. Sherr, and C. Shields, "Cocaine noodles: Exploiting the gap between human and machine speech recognition," presented at WOOT, vol. 15, pp. 10–11, 2015.
[9] M. M. Cisse, Y. Adi, N. Neverova, and J. Keshet, "Houdini: Fooling deep structured visual and speech recognition models with adversarial examples," in Advances in Neural Information Processing Systems, 2017, pp. 6977–6987.
[10] L. Song and P. Mittal, "Inaudible voice commands," arXiv preprint arXiv:1708.07238, 2017.
[11] D. Kumar, R. Paccagnella, P. Murley, E. Hennenfent, J. Mason, A. Bates, and M. Bailey, "Skill squatting attacks on Amazon Alexa," in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 33–47.
[12] N. Zhang, X. Mi, X. Feng, X. Wang, Y. Tian, and F. Qian, "Understanding and mitigating the security risks of voice-controlled third-party skills on Amazon Alexa and Google Home," arXiv preprint arXiv:1805.01525, 2018.
[13] Y. Son, H. Shin, D. Kim, Y.-S. Park, J. Noh, K. Choi, J. Choi, Y. Kim et al., "Rocking drones with intentional sound noise on gyroscopic sensors," in USENIX Security Symposium, 2015, pp. 881–896.
[14] C. Yan, W. Xu, and J. Liu, "Can you trust autonomous vehicles: Contactless attacks against sensors of self-driving vehicle," DEFCON, vol. 24, 2016.
[15] T. Trippel, O. Weisse, W. Xu, P. Honeyman, and K. Fu, "WALNUT: Waging doubt on the integrity of MEMS accelerometers with acoustic injection attacks," in EuroS&P. IEEE, 2017, pp. 3–18.
[16] S. Nashimoto, D. Suzuki, T. Sugawara, and K. Sakiyama, "Sensor CON-Fusion: Defeating Kalman filter in signal injection attack," in Proceedings of the 2018 on Asia Conference on Computer and Communications Security, AsiaCCS 2018, 2018, pp. 511–524.
[17] C. Bolton, S. Rampazzi, C. Li, A. Kwong, W. Xu, and K. Fu, "Blue Note: How intentional acoustic interference damages availability and integrity in hard disk drives and operating systems," in IEEE Symposium on Security and Privacy. IEEE Computer Society, 2018, pp. 1048–1062.
[18] J. Petit, B. Stottelaar, M. Feiri, and F. Kargl, "Remote attacks on automated vehicles sensors: Experiments on camera and LiDAR," Black Hat Europe, vol. 11, p. 2015, 2015.
[19] J. Petit and S. E. Shladover, "Potential cyberattacks on automated vehicles," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 546–556, 2015.
[20] H. Shin, D. Kim, Y. Kwon, and Y. Kim, "Illusion and dazzle: Adversarial optical channel exploits against lidars for automotive applications," in Cryptographic Hardware and Embedded Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan, September 25-28, 2017, Proceedings, 2017, pp. 445–467.
[21] Y. Cao, C. Xiao, B. Cyr, Y. Zhou, W. Park, S. Rampazzi, Q. A. Chen, K. Fu, and Z. M. Mao, "Adversarial sensor attack on LiDAR-based perception in autonomous driving," 2019.
[22] Y.-S. Park, Y. Son, H. Shin, D. Kim, and Y. Kim, "This ain't your dose: Sensor spoofing attack on medical infusion pump," in WOOT, 2016.
[23] A. S. Uluagac, V. Subramanian, and R. Beyah, "Sensory channel threats to cyber physical systems: A wake-up call," in 2014 IEEE Conference on Communications and Network Security. IEEE, 2014, pp. 301–309.
[24] D. H. Habing, "The use of lasers to simulate radiation-induced transients in semiconductor devices and circuits," IEEE Transactions on Nuclear Science, vol. 12, no. 5, pp. 91–100, 1965.
[25] S. P. Skorobogatov and R. J. Anderson, "Optical fault induction attacks," in Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 13-15, 2002, Revised Papers, 2002, pp. 2–12.
[26] D. Karaklajić, J. Schmidt, and I. Verbauwhede, "Hardware designer's guide to fault attacks," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 12, pp. 2295–2306, 2013.
[27] J.-M. Dutertre, J. J. Fournier, A.-P. Mirbaha, D. Naccache, J.-B. Rigaud, B. Robisson, and A. Tria, "Review of fault injection mechanisms and consequences on countermeasures design," in 2011 6th International Conference on Design & Technology of Integrated Systems in Nanoscale Era (DTIS). IEEE, 2011, pp. 1–6.
[28] S. Manohar and D. Razansky, "Photoacoustics: A historical review," Advances in Optics and Photonics, vol. 8, no. 4, pp. 586–617, December 2016.
[29] A. G. Bell, "Upon the production and reproduction of sound by light," Journal of the Society of Telegraph Engineers, vol. 9, no. 34, pp. 404–426, 1880.
[30] P. Tucker, "The US military is making lasers that create voices out of thin air," accessed: 2019-08-20.
[31] R. M. Sullenberger, S. Kaushik, and C. M. Wynn, "Photoacoustic communications: Delivering audible signals via absorption of light by atmospheric H2O," Opt. Lett., vol. 44, no. 3, pp. 622–625, 2019.
[32] IEC System of Conformity Assessment Schemes for Electrotechnical Equipment and Components, "IEC 60825-1:2014 Safety of laser products - Part 1: Equipment classification and requirements." [Online]. Available: https://www.iecee.org/index.htm
[33] U.S. Department of Health and Human Services, Food and Drug Administration, Center for Devices and Radiological Health, "Laser products - Conformance with IEC 60825-1 Ed. 3 and IEC 60601-2-22 Ed. 3.1 (Laser Notice No. 56): Guidance for industry and Food and Drug Administration staff." [Online]. Available: https://www.fda.gov/media/110120/download
[34] S. M. Goldwasser and B. Edwards, "Hidden menace: Recognizing and controlling the hazards posed by smaller and lower power lasers," http://www.repairfaq.org/sam/laser/ILSC 2011-1303.pdf, 2011, accessed: 2019-08-20.
[35] iFixit, "Google Home Mini teardown," https://www.ifixit.com/Teardown/Google+Home+Mini+Teardown/102264, accessed: 2019-08-25.
[36] N. Ltd., "NaturalReader," https://www.naturalreaders.com/online/, accessed: 2019-08-25.
[37] N. Melena, N. Neuenfeldt, A. Slagel, M. Hamel, C. Mackin, and C. Smith, "Covert IR-laser remote listening device," The University of Arizona Honors Thesis, https://repository.arizona.edu/handle/10150/244475, accessed: 2019-08-20.
[38] D. Davidson, H. Wu, R. Jellinek, T. Ristenpart, and V. Singh, "Controlling UAVs with sensor input spoofing attacks," in Proceedings of the 10th USENIX Conference on Offensive Technologies, ser. WOOT'16. Berkeley, CA, USA: USENIX Association, 2016, pp. 221–231. [Online]. Available: http://dl.acm.org/citation.cfm?id=3027019.3027039
[39] A. K. Sikder, H. Aksu, and A. S. Uluagac, "6thSense: A context-aware sensor-based attack detector for smart devices," in 26th USENIX Security Symposium (USENIX Security 17), 2017, pp. 397–414.
[40] Z. Wang, Q. Zou, Q. Song, and J. Tao, "The era of silicon MEMS microphone and look beyond," in 2015 Transducers - 2015 18th International Conference on Solid-State Sensors, Actuators and Microsystems (TRANSDUCERS), June 2015, pp. 375–378.
[41] L. Zhang, S. Tan, J. Yang, and Y. Chen, "VoiceLive:
A phoneme localization based liveness detection for
voice authentication on smartphones,” in Proceedings of
the 2016 ACM SIGSAC Conference on Computer and
Communications Security. ACM, 2016, pp. 1080–1091.
[42] L. Zhang, S. Tan, and J. Yang, “Hearing your voice
is not enough: An articulatory gesture based liveness
detection for voice authentication,” in Proceedings of
the 2017 ACM SIGSAC Conference on Computer and
Communications Security, CCS 2017, Dallas, TX, USA,
October 30 - November 03, 2017. ACM, 2017, pp.
57–71.
[43] L. Lu, J. Yu, Y. Chen, H. Liu, Y. Zhu, Y. Liu, and
M. Li, "LipPass: Lip reading-based user authentication
on smartphones leveraging acoustic signals,” in IEEE
INFOCOM 2018-IEEE Conference on Computer Com-
munications. IEEE, 2018, pp. 1466–1474.
[44] L. Lu, J. Yu, Y. Chen, H. Liu, Y. Zhu, L. Kong, and M. Li,
“Lip reading-based user authentication through acoustic
sensing on smartphones,” IEEE/ACM Transactions on
Networking, vol. 27, no. 1, pp. 447–460, 2019.
