Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

arXiv:1804.03209v1 [cs.CL] 9 Apr 2018

Pete Warden
Google Brain
Mountain View, California
petewarden@google.com

April 2018
1 Abstract

Describes an audio dataset[1] of spoken words designed to help train and evaluate keyword spotting systems. Discusses why this task is an interesting challenge, and why it requires a specialized dataset that's different from conventional datasets used for automatic speech recognition of full sentences. Suggests a methodology for reproducible and comparable accuracy metrics for this task. Describes how the data was collected and verified, what it contains, previous versions[2] and properties. Concludes by reporting baseline results of models trained on this dataset.

2 Introduction

Speech recognition research has traditionally required the resources of large organizations such as universities or corporations to pursue. People working in those organizations usually have free access to either academic datasets through agreements with groups like the Linguistic Data Consortium[3], or to proprietary commercial data.

As speech technology has matured, the number of people who want to train and evaluate recognition models has grown beyond these traditional groups, but the availability of datasets hasn't widened. As the example of ImageNet[4] and similar collections in computer vision has shown, broadening access to datasets encourages collaborations across groups and enables apples-for-apples comparisons between different approaches, helping the whole field move forward.

The Speech Commands dataset is an attempt to build a standard training and evaluation dataset for a class of simple speech recognition tasks. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten or fewer target words, with as few false positives as possible from background noise or unrelated speech. This task is often known as keyword spotting.

To reach a wider audience of researchers and developers, this dataset has been released under the Creative Commons BY 4.0 license[5]. This enables it to easily be incorporated in tutorials and other scripts where it can be downloaded and used without any user intervention required (for example to register on a website or email an administrator for permission). This license is also well known in commercial settings, and so can usually be dealt with quickly by legal teams where approval is required.

3 Related Work

Mozilla's Common Voice dataset[6] has over 500 hours from 20,000 different people, and is available under the Creative Commons Zero license (similar to public domain). This licensing makes it very easy to build on top of. It is aligned by sentence, and was created by volunteers reading requested phrases through a web application.
LibriSpeech[7] is a collection of 1,000 hours of read English speech, released under a Creative Commons BY 4.0 license, and stored using the open source FLAC encoder, which is widely supported. Its labels are aligned at the sentence level only, thus lacking word-level alignment information. This makes it more suitable for full automatic speech recognition than keyword spotting.

TIDIGITS[8] contains 25,000 digit sequences spoken by 300 different speakers, recorded in a quiet room by paid contributors. The dataset is only available under a commercial license from the Linguistic Data Consortium, and is stored in the NIST SPHERE file format, which proved hard to decode using modern software. Our initial experiments on keyword spotting were performed using this dataset.

CHiME-5[9] has 50 hours of speech recorded in people's homes, stored as 16 KHz WAV files, and available under a restricted license. It's aligned at the sentence level.

4 Motivations

Many voice interfaces rely on keyword spotting to start interactions. For example you might say "Hey Google" or "Hey Siri"[10] to begin a query or command for your phone. Once the device knows that you want to interact, it's possible to send the audio to a web service to run a model that's only limited by commercial considerations, since it can run on a server whose resources are controlled by the cloud provider. The initial detection of the start of an interaction is impractical to run as a cloud-based service though, since it would require sending audio data over the web from all devices all the time. This would be very costly to maintain, and would increase the privacy risks of the technology.

Instead, most voice interfaces run a recognition module locally on the phone or other device. This listens continuously to audio input from microphones, and rather than sending the data over the internet to a server, it runs models that listen for the desired trigger phrases. Once a likely trigger is heard, the transfer of the audio to a web service begins. Because the local model is running on hardware that's not under the web service provider's control, there are hard resource constraints that the on-device model has to respect. The most obvious of these is that the mobile processors typically present have total compute capabilities that are much lower than most servers, so to run in near real-time for an interactive response, on-device models must require fewer calculations than their cloud equivalents. More subtly, mobile devices have limited battery lives, and anything that is running continuously needs to be very energy efficient or users will find their device is drained too quickly. This consideration doesn't apply to plugged-in home devices, but those do have thermal constraints on how much heat they can dissipate that restrict the amount of energy available to local models, and are encouraged by programs like EnergyStar to reduce their overall power usage as much as possible. A final consideration is that users expect a fast response from their devices, and network latency can be highly variable depending on the environment, so some initial acknowledgement that a command was received is important for a good experience, even if the full server response is delayed.

These constraints mean that the task of keyword spotting is quite different from the kind of speech recognition that's performed on a server once an interaction has been spotted:

• Keyword spotting models must be smaller and involve less compute.

• They need to run in a very energy-efficient way.

• Most of their input will be silence or background noise, not speech, so false positives on those must be minimized.

• Most of the input that is speech will be unrelated to the voice interface, so the model should be unlikely to trigger on arbitrary speech.

• The important unit of recognition is a single word or short phrase, not an entire sentence.

These differences mean that the training and evaluation process between on-device keyword spotting and general speech recognition models is quite different. There are some promising datasets to support general speech tasks, such as Mozilla's Common Voice, but they aren't easily adaptable to keyword spotting.
This Speech Commands dataset aims to meet the special needs around building and testing on-device models, to enable model authors to demonstrate the accuracy of their architectures using metrics that are comparable to other models, and give a simple way for teams to reproduce baseline models by training on identical data. The hope is that this will speed up progress and collaboration, and improve the overall quality of models that are available.

A second important audience is hardware manufacturers. By using a publicly-available task that closely reflects product requirements, chip vendors can demonstrate the accuracy and energy usage of their offerings in a way that's easily comparable for potential purchasers. This increased transparency should result in hardware that better meets product requirements over time. The models should also provide clear specifications that hardware engineers can use to optimize their chips, and potentially suggest model changes that make it easier to provide efficient implementations. This kind of co-design between machine learning and hardware can be a virtuous circle, increasing the flow of useful information between the domains in a way that helps both sides.

5 Collection

5.1 Requirements

I made the decision to focus on capturing audio that reflected the on-device trigger phrase task described above. This meant that the use of studio-captured samples seemed unrealistic, since that audio would lack background noise, would be captured with high-quality microphones, and in a formal setting. Successful models would need to cope with noisy environments, poor quality recording equipment, and people talking in a natural, chatty way. To reflect this, all utterances were captured through phone or laptop microphones, wherever users happened to be. The one exception was that I asked them to avoid recording themselves whenever there were background conversations happening, for privacy reasons, so I asked them to be in a room alone with the door closed.

I also decided to focus on English. This was for pragmatic reasons, to limit the scope of the gathering process and make it easier for native speakers to perform quality control on the gathered data. I hope that transfer learning and other techniques will still make this dataset useful for other languages though, and I open-sourced the collection application to allow others to easily gather similar data in other languages. I did want to gather as wide a variety of accents as possible however, since we're familiar from experience with the bias towards American English in many voice interfaces.

Another goal was to record as many different people as I could. Keyword-spotting models are much more useful if they're speaker-independent, since the process of personalizing a model to an individual requires an intrusive user interface experience. With this in mind, the recording process had to be quick and easy to use, to reduce the number of people who would fail to complete it.

I also wanted to avoid recording any personally-identifiable information from contributors, since any such data requires handling with extreme care for privacy reasons. This meant that I wouldn't ask for any attributes like gender or ethnicity, wouldn't require a sign-in through a user ID that could link to personal data, and would need users to agree to a data-usage agreement before contributing.

To simplify the training and evaluation process, I decided to restrict all utterances to a standard duration of one second. This excludes longer words, but the usual targets for keyword recognition are short so this didn't seem to be too restrictive. I also decided to record only single words spoken in isolation, rather than as part of a sentence, since this more closely resembles the trigger word task we're targeting. It also makes labeling much easier, since alignment is not as crucial.
5.2 Word Choice

I wanted to have a limited vocabulary to make sure the capture process was lightweight, but still have enough variety for models trained on the data to potentially be useful for some applications. I also wanted the dataset to be usable in comparable ways to common proprietary collections like TIDIGITS. This led me to pick twenty common words as the core of our vocabulary. These included the digits zero to nine, and in version one, ten words that would be useful as commands in IoT or robotics applications: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", and "Go". In version 2 of the dataset, I added four more command words: "Backward", "Forward", "Follow", and "Learn". One of the most challenging problems for keyword recognition is ignoring speech that doesn't contain triggers, so I also needed a set of words that could act as tests of that ability in the dataset. Some of these, such as "Tree", were picked because they sound similar to target words and would be good tests of a model's discernment. Others were chosen arbitrarily as short words that covered a lot of different phonemes. The final list was "Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", and "Wow".

5.3 Implementation

To meet all these requirements, I created an open-source web-based application that recorded utterances using the WebAudioAPI[11]. This API is supported on desktop browsers like Firefox and Chrome, and on Android mobile devices. It's not available on iOS, which was considered to be unfortunate, but there were no alternatives that were more attractive. I also looked into building native mobile applications for iOS and Android, but I found that users were reluctant to install them, for privacy and security reasons. The web experience requires users to grant permission to the website to access the microphone, but that seemed a lot more acceptable, based on the increased response rate. The initial test of the application was hosted at an appspot.com subdomain, but it was pointed out that teaching users to give microphone permissions to domains that were easy for malicious actors to create was a bad idea. To address this, the final home of the application was moved to:

    https://aiyprojects.withgoogle.com/open_speech_recording

This is a known domain that's controlled by Google, and so it should be much harder to create confusing spoofs of it.

The initial page that a new user sees when navigating to the application explains what the project is doing, and asks them to explicitly and formally agree to participating in the study. This process was designed to ensure that the resulting utterances could be freely redistributed as part of an open dataset, and that users had a clear understanding of what the application was doing. When a user clicks on "I Agree", a session cookie is added to record their agreement. The recording portion of the application will only be shown if this session cookie is found, and all upload accesses are guarded by cross-site request forgery tokens, to ensure that only audio recorded from the application can be uploaded, and that utterances are from users who have agreed to the terms.

The recording page asks users to press a "Record" button when they're ready, and then displays a random word from the list described above. The word is displayed for 1.5 seconds while audio is recorded, and then another randomly-chosen word is shown after a one-second pause. Each audio clip is added to a list that's stored locally on the client's machine, and they remain there until the user has finished recording all words and has a chance to review them. The random ordering of words was chosen to avoid pronunciation changes that might be caused by repetition of the same word multiple times. Core words are shown five times each in total, whereas auxiliary words only appear once. There are 135 utterances collected overall, which takes around six minutes in total to run through completely. The user can pause and restart at any point.

Once the recording process is complete, the user is asked to review all of the clips, and if they're happy with them, upload them. This then invokes a web API which uploads the audio to the server application, which saves them into a cloud storage bucket.
The WebAudioAPI returns the audio data in OGG-compressed format, and this is what gets stored in the resulting files. The session ID is used as the prefix of each file name, and then the requested word is followed by a unique instance ID for the recording. This session ID has been randomly generated, and is not tied to an account or any other demographic information, since none has been gathered. It does serve as a speaker identifier for utterances however. To ensure there's a good distribution of different speakers, once a user has gone through this process once, a cookie is added to the application that ensures they can't access the recording page again.

To gather volunteers for this process, I used appeals on social media to share the link and the aims of the project. I also experimented with using paid crowdsourcing for some of the utterances, though the majority of the dataset comes from the open site.

5.4 Quality Control

The gathered audio utterances were of variable quality, and so I needed criteria to accept or reject submissions. The informal guideline I used was that if a human listener couldn't tell what word was being spoken, or it sounded like an incorrect word, then the clip should be rejected. To accomplish this, I used several layers of review.

To remove clips that were extremely short or quiet, I took advantage of the nature of the OGG compression format. Compressed clips that contained very little audio would be very small in size, so a good heuristic was that any files that were smaller than 5 KB were unlikely to be correct. To implement this rule, I used the following Linux shell command:

    find ${BASEDIR}/oggs -iname "*.ogg" -size -5k -delete

With that complete, I then converted the OGG files into uncompressed WAV files containing PCM sample data at 16 KHz, since this is an easier format for further processing:

    find ${BASEDIR}/oggs -iname "*.ogg" -print0 | xargs -0 basename -s .ogg \
      | xargs -I {} ffmpeg -i ${BASEDIR}/oggs/{}.ogg -ar 16000 ${BASEDIR}/wavs/{}.wav

Samples from other sources came as varying sample-rate WAV files, so they were also resampled to 16 KHz WAV files using a similar ffmpeg command.

5.5 Extract Loudest Section

From manual inspection of the results, there were still large numbers of utterances that were too quiet or completely silent. The alignment of the spoken words within the 1.5 second file was quite arbitrary too, depending on the speed of the user's response to the word displayed. To solve both these problems, I created a simple audio processing tool called Extract Loudest Section to examine the overall volume of the clips. As a first stage, I summed the absolute differences of all the samples from zero (using a scale where -32768 in the 16-bit sample data was -1.0 as a floating-point number, and +32767 was 1.0), and looked at the mean average of that value to estimate the overall volume of the utterance. From experimentation, anything below 0.004 on this metric was likely to be too quiet to be intelligible, and so all of those clips were removed.

To approximate the correct alignment, the tool then extracted the one-second clip that contained the highest overall volume. This tended to center the spoken word in the middle of the trimmed clip, assuming that the utterance was the loudest part of the recording. To run these processes, the following commands were called:

    git clone https://github.com/petewarden/extract_loudest_section tmp/extract_loudest_section
    cd tmp/extract_loudest_section
    make
    cd ../..
    mkdir -p ${BASEDIR}/trimmed_wavs
    /tmp/extract_loudest_section/gen/bin/extract_loudest_section \
      ${BASEDIR}'/wavs/*.wav' ${BASEDIR}/trimmed_wavs/
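For readers who want to sanity-check their own clips without building the C++ tool, the volume heuristic and one-second trim described above can be approximated in a few lines of Python. This is a rough sketch under the assumptions noted in the comments, not the released implementation; the function names and file paths are placeholders.

    # Rough Python sketch of the volume check and loudest-one-second trim
    # described above; the released tool is the C++ Extract Loudest Section.
    import numpy as np
    import scipy.io.wavfile

    def mean_volume(samples):
        # Scale 16-bit samples towards [-1.0, 1.0] and take the mean absolute value.
        return np.mean(np.abs(samples.astype(np.float32) / 32768.0))

    def loudest_second(samples, sample_rate=16000):
        # Slide a one-second window over the clip and keep the window with the
        # highest summed volume, which tends to center the spoken word.
        window = sample_rate
        if len(samples) <= window:
            return samples
        volume = np.abs(samples.astype(np.float32) / 32768.0)
        cumulative = np.concatenate(([0.0], np.cumsum(volume)))
        window_sums = cumulative[window:] - cumulative[:-window]
        start = int(np.argmax(window_sums))
        return samples[start:start + window]

    rate, samples = scipy.io.wavfile.read('example.wav')  # placeholder path, mono 16-bit PCM assumed
    if mean_volume(samples) < 0.004:
        print('too quiet, discard')
    else:
        scipy.io.wavfile.write('trimmed.wav', rate, loudest_second(samples, rate))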
5.6 Manual Review

These automatic processes caught technical problems with quiet or silent recordings, but there were still some utterances that were of incorrect words or were unintelligible for other reasons. To filter these out I turned to commercial crowdsourcing. The task asked workers to type in the word they heard from each clip, and gave a list of the expected words as examples. Each clip was only evaluated by a single worker, and any clips that had responses that didn't match their expected labels were removed from the dataset.

5.7 Release Process

The recorded utterances were moved into folders, with one for each word. The original 16-digit hexadecimal speaker ID numbers from the web application's file names were hashed into 8-digit hexadecimal IDs. Speaker IDs from other sources (like the paid crowdsourcing sites) were also hashed into the same format. This was to ensure that any connection to worker IDs or other personally-identifiable information was removed. The hash function used is stable though, so in future releases the IDs for existing files should remain the same, even as more speakers are added.

5.8 Background Noise

A key requirement for keyword spotting in real products is distinguishing between audio that contains speech, and clips that contain none. To help train and test this capability, I added several minute-long 16 KHz WAV files of various kinds of background noise. Several of these were recorded directly from noisy environments, for example near running water or machinery. Others were generated mathematically using these commands in Python:

    import acoustics
    import numpy as np
    import scipy.io.wavfile

    scipy.io.wavfile.write('/tmp/white_noise.wav', 16000, np.array(
        ((acoustics.generator.noise(16000*60, color='white'))/3) * 32767).astype(np.int16))
    scipy.io.wavfile.write('/tmp/pink_noise.wav', 16000, np.array(
        ((acoustics.generator.noise(16000*60, color='pink'))/3) * 32767).astype(np.int16))

To distinguish these files from word utterances, they were placed in a specially-named "_background_noise_" folder, in the root of the archive.

6 Properties

The final dataset consisted of 105,829 utterances of 35 words, broken into the categories and frequencies shown in Table 1.

Each utterance is stored as a one-second (or less) WAVE format file, with the sample data encoded as linear 16-bit single-channel PCM values, at a 16 KHz rate. There are 2,618 speakers recorded, each with a unique eight-digit hexadecimal identifier assigned as described above. The uncompressed files take up approximately 3.8 GB on disk, and can be stored as a 2.7 GB gzip-compressed tar archive.

7 Evaluation

One of this dataset's primary goals is to enable meaningful comparisons between different models' results, so it's important to suggest some precise testing protocols. As a starting point, it's useful to specify exactly which utterances can be used for training, and which must be reserved for testing, to avoid overfitting. The dataset download includes a text file called validation_list.txt, which contains a list of files that are expected to be used for validating results during training, and so can be used frequently to help adjust hyperparameters and make other model changes. The testing_list.txt file contains the names of audio clips that should only be used for measuring the results of trained models, not for training or validation. The set that a file belongs to is chosen using a hash function on its name. This is to ensure that files remain in the same set across releases, even as the total number changes, and to avoid set cross-contamination when trying old models on the more recent test data. The Python implementation of the set assignment algorithm is given in the TensorFlow tutorial code[12] that is a companion to the dataset.
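As a rough illustration of how a filename-hash split keeps assignments stable, here is a simplified Python sketch. The bucketing details and default percentages in the released input_data.py[12] differ, so treat the helper below as an assumption-laden outline rather than the canonical implementation.

    # Simplified sketch of stable, hash-based set assignment; the canonical
    # version is the which_set() function in the companion input_data.py[12].
    import hashlib
    import re

    def which_set(filename, validation_percentage=10, testing_percentage=10):
        # Ignore anything after '_nohash_' so that all clips from the same
        # recording session land in the same set (mirrors the released code's intent).
        base = re.sub(r'_nohash_.*$', '', filename)
        hash_hex = hashlib.sha1(base.encode('utf-8')).hexdigest()
        percentage = int(hash_hex, 16) % 100
        if percentage < validation_percentage:
            return 'validation'
        if percentage < validation_percentage + testing_percentage:
            return 'testing'
        return 'training'

    print(which_set('3cfc6b3a_nohash_2.wav'))  # illustrative file name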
    Word        Number of Utterances
    Backward    1,664
    Bed         2,014
    Bird        2,064
    Cat         2,031
    Dog         2,128
    Down        3,917
    Eight       3,787
    Five        4,052
    Follow      1,579
    Forward     1,557
    Four        3,728
    Go          3,880
    Happy       2,054
    House       2,113
    Learn       1,575
    Left        3,801
    Marvin      2,100
    Nine        3,934
    No          3,941
    Off         3,745
    On          3,845
    One         3,890
    Right       3,778
    Seven       3,998
    Sheila      2,022
    Six         3,860
    Stop        3,872
    Three       3,727
    Tree        1,759
    Two         3,880
    Up          3,723
    Visual      1,592
    Wow         2,123
    Yes         4,044
    Zero        4,052

Table 1: How many recordings of each word are present in the dataset

7.1 Top-One Error

The simplest metric to judge a trained model against is how many utterances it can correctly identify. In principle this can be calculated by running the model against all the files in the testing set, and comparing the reported label against the expected label for each. Unlike image classification tasks like ImageNet, it's not obvious how to weight all of the different categories. For example, I want a model to indicate when no speech is present, and separately to indicate when it thinks a word has been spoken that's not one it recognizes. These "open world" categories need to be weighted according to their expected occurrence in a real application to produce a realistic metric that reflects the perceived quality of the results in a product.

The standard chosen for the TensorFlow speech commands example code is to look for the ten words "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", and "Go", and to have one additional special label for "Unknown Word", and another for "Silence" (no speech detected). The testing is then done by providing equal numbers of examples for each of the twelve categories, which means each class accounts for approximately 8.3% of the total. The "Unknown Word" category contains words randomly sampled from classes that are not part of the target set. The "Silence" category has one-second clips extracted randomly from the background noise audio files.

I've uploaded a standard set of test files[13] to make it easier to reproduce this metric. If you want to calculate the canonical Top-One score for a model, run inference on each audio clip, and compare the top predicted class against the ground truth label encoded in its containing subfolder name. The proportion of correct predictions gives you the Top-One score. There's also a similar collection of test files[14] available for version one of the dataset.

The example training code that accompanies the dataset[15] provides results of 88.2% on this metric for the highest-quality model when fully trained. This translates into a model that qualitatively gives a reasonable, but far from perfect response, so it's expected that this will serve as a baseline to be exceeded by more sophisticated architectures.
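As a concrete illustration of the scoring loop described above, the following sketch walks the test archive's per-word subfolders and compares predictions against the folder names. The predict() callback is a placeholder for whatever inference routine a model provides; only the directory layout is taken from the dataset itself.

    # Sketch of Top-One scoring over the extracted test archive; predict() is
    # a placeholder for a user-supplied inference function returning a label string.
    import os

    def top_one_accuracy(test_dir, predict):
        correct, total = 0, 0
        for label in sorted(os.listdir(test_dir)):
            label_dir = os.path.join(test_dir, label)
            if not os.path.isdir(label_dir):
                continue
            for wav_name in os.listdir(label_dir):
                if not wav_name.endswith('.wav'):
                    continue
                # The ground truth is encoded in the containing subfolder name.
                total += 1
                if predict(os.path.join(label_dir, wav_name)) == label:
                    correct += 1
        return correct / total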
7.2 Streaming Error Metrics

Top-One captures a single dimension of the perceived quality of the results, but doesn't reveal much about other aspects of a model's performance in a real application. For example, models in products receive a continuous stream of audio data and don't know when words start and end, whereas the inputs to Top-One evaluations are aligned to the beginning of utterances. The equal weighting of each category in the overall score also doesn't reflect the distribution of trigger words and silence in typical environments.

To measure some of these more complex properties of models, I test them against continuous streams of audio and score them on multiple metrics. Here's what the baseline model trained with V2 data produces:

    49.0% matched, 46.0% correctly, 3.0% wrongly, 0.0% false positives

To produce this result, I ran the following bash script against the 10 minute streaming test audio clip and ground truth labels:

    bazel run tensorflow/examples/speech_commands:freeze -- \
      --start_checkpoint=/tmp/speech_commands_train/conv.ckpt-18000 \
      --output_file=/tmp/v2_frozen_graph.pb
    bazel run tensorflow/examples/speech_commands:test_streaming_accuracy -- \
      --graph=/tmp/v2_frozen_graph.pb \
      --wav=/tmp/speech_commands_train/streaming_test.wav \
      --labels=/tmp/speech_commands_train/conv_labels.txt \
      --ground_truth=/tmp/speech_commands_train/streaming_test_labels.txt

• Matched-percentage represents how many words were correctly identified, within a given time tolerance.

• Wrong-percentage shows how many words were correctly distinguished as speech rather than background noise, but were given the wrong class label.

• False-positive percentage is the number of words detected in parts of the audio where no speech was actually present.

An algorithm for calculating these values, given an audio file and a text file listing ground truth labels, is implemented in TensorFlow as test_streaming_accuracy.cc[16].

Performing successfully on these metrics requires more than basic template recognition of audio clips. There has to be at least a very crude set of rules to suppress repeated recognitions of the same word in short time frames, so default logic for this is implemented in recognize_commands.cc[17]. This allows a simple template-style recognition model to be used directly to generate these statistics. One of the other configurable features of the accuracy test is the time tolerance for how close to the ground truth's time a recognition result must be to count as a match. The default for this is set to 750 ms, since that seems to match the requirements of some of the applications that are supported.

To make reproducing and comparing results easier, I've made available a one-hour audio file[18] containing a mix of utterances at random times and noise, together with a text file marking the times and ground truth labels of each utterance. This was generated using the script included in the TensorFlow tutorial, and can be used to compare different models' performance on streaming applications.
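To make the three streaming percentages concrete, here is a simplified sketch of the bookkeeping. The canonical logic lives in test_streaming_accuracy.cc[16] and recognize_commands.cc[17], so the matching rules and the (label, time in milliseconds) tuples below are illustrative assumptions, with only the 750 ms tolerance taken from the text.

    # Simplified sketch of the matched/wrong/false-positive accounting; the
    # canonical implementation is test_streaming_accuracy.cc. Detections and
    # ground truth are assumed to be lists of (label, time_ms) tuples.
    def streaming_stats(detections, ground_truth, tolerance_ms=750):
        matched = wrong = false_positives = 0
        used = [False] * len(ground_truth)
        for det_label, det_time in detections:
            # Find the closest unused ground-truth word within the tolerance.
            best = None
            for i, (gt_label, gt_time) in enumerate(ground_truth):
                if used[i] or abs(gt_time - det_time) > tolerance_ms:
                    continue
                if best is None or abs(gt_time - det_time) < abs(ground_truth[best][1] - det_time):
                    best = i
            if best is None:
                false_positives += 1   # detection where no word was expected
            elif det_label == ground_truth[best][0]:
                used[best] = True
                matched += 1           # right word, close enough in time
            else:
                used[best] = True
                wrong += 1             # speech detected, but wrong label
        total = len(ground_truth)
        return matched / total, wrong / total, false_positives / total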
7.3 Historical Evaluations

Version 1 of the dataset[2] was released August 3rd 2017, and contained 64,727 utterances from 1,881 speakers. Training the default convolution model from the TensorFlow tutorial (based on Convolutional Neural Networks for Small-footprint Keyword Spotting[19]) using the V1 training data gave a Top-One score of 85.4%, when evaluated against the test set from V1. Training the same model against version 2 of the dataset[1], documented in this paper, produces a model that scores 88.2% Top-One on the test set extracted from the V2 data. A model trained on V2 data, but evaluated against the V1 test set, gives 89.7% Top-One, which indicates that the V2 training data is responsible for a substantial improvement in accuracy over V1. The full set of results is shown in Table 2.

    Data      V1 Training    V2 Training
    V1 Test   85.4%          89.7%
    V2 Test   82.7%          88.2%

Table 2: Top-One accuracy evaluations using different training data

These figures were produced using the checkpoints produced by the following training commands:

    python tensorflow/examples/speech_commands/train.py \
      --data_url=http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
    python tensorflow/examples/speech_commands/train.py \
      --data_url=http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

The results of these commands are available as pre-trained checkpoints[20]. The evaluations were performed by running variations on the following command line (with the v1/v2's substituted as appropriate):

    python tensorflow/examples/speech_commands/train.py \
      --data_url=http://download.tensorflow.org/data/speech_commands_v0.0{1,2}.tar.gz \
      --start_checkpoint=${HOME}/speech_commands_checkpoints/conv-v{1,2}.ckpt-18000

7.4 Applications

The TensorFlow tutorial gives a variety of baseline models, but one of the goals of the dataset is to enable the creation and comparison of a wide range of models on a lot of different platforms, and version one has enabled some interesting applications. CMSIS-NN[21] covers a new optimized implementation of neural network operations for ARM microcontrollers, and uses Speech Commands to train and evaluate the results. Listening to the World[22] demonstrates how combining the dataset and UrbanSounds[23] can improve the noise tolerance of recognition models. Did you Hear That[24] uses the dataset to test adversarial attacks on voice interfaces. Deep Residual Learning for Small-Footprint Keyword Spotting[25] shows how approaches learned from ResNet can produce more efficient and accurate models. Raw Waveform-based Audio Classification[26] investigates alternatives to traditional feature extraction for speech and music models. Keyword Spotting Through Image Recognition[27] looks at the effect of virtual adversarial training on the keyword task.

8 Conclusion

The Speech Commands dataset has been shown to be useful for training and evaluating a variety of models, and the second version shows improved results on equivalent test data, compared to the original.

9 Acknowledgements

Massive thanks are due to everyone who donated recordings to this data set; I'm very grateful. I also couldn't have put this together without the help and support of Billy Rutledge, Rajat Monga, Raziel Alvarez, Brad Krueger, Barbara Petit, Gursheesh Kour, Robert Munro, Kirsten Gokay, David Klein, Lukas Biewald, and all the AIY and TensorFlow teams.
References
[1] (2018) Speech commands dataset version 2. [Online]. Available: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

[2] (2017) Speech commands dataset version 1. [Online]. Available: http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz

[3] (2018) Linguistic Data Consortium. [Online]. Available: https://www.ldc.upenn.edu/

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.

[5] (2018) Creative Commons Attribution 4.0 International license. [Online]. Available: https://creativecommons.org/licenses/by/4.0/

[6] (2017) Mozilla Common Voice. [Online]. Available: https://voice.mozilla.org/en

[7] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.

[8] R. G. Leonard and G. R. Doddington. (1992) A speaker-independent connected-digit database. [Online]. Available: https://catalog.ldc.upenn.edu/docs/LDC93S10/tidigits.readme.html

[9] (2018) The 5th CHiME speech separation and recognition challenge. [Online]. Available: http://spandh.dcs.shef.ac.uk/chime_challenge/data.html

[10] (2017) Hey Siri: An on-device DNN-powered voice trigger for Apple’s personal assistant. [Online]. Available: https://machinelearning.apple.com/2017/10/01/hey-siri.html

[11] (2015) Web Audio API. [Online]. Available: https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API

[12] (2018) Implementation of set assignment algorithm. [Online]. Available: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/speech_commands/input_data.py#L61

[13] (2018) Speech commands dataset test set version 2. [Online]. Available: http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz

[14] (2017) Speech commands dataset test set version 1. [Online]. Available: http://download.tensorflow.org/data/speech_commands_test_set_v0.01.tar.gz

[15] (2017) TensorFlow audio recognition tutorial. [Online]. Available: https://www.tensorflow.org/tutorials/audio_recognition

[16] (2018) test_streaming_accuracy.cc source file. [Online]. Available: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/speech_commands/test_streaming_accuracy.cc

[17] (2018) recognize_commands.cc source file. [Online]. Available: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/speech_commands/recognize_commands.cc

[18] (2018) Speech commands dataset streaming test version 2. [Online]. Available: http://download.tensorflow.org/data/speech_commands_streaming_test_v0.02.tar.gz

[19] T. N. Sainath and C. Parada, “Convolutional Neural Networks for Small-Footprint Keyword Spotting,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015. [Online]. Available: https://www.isca-speech.org/archive/interspeech_2015/papers/i15_1478.pdf

[20] (2018) Speech commands tutorial checkpoints. [Online]. Available: https://storage.googleapis.com/download.tensorflow.org/models/speech_commands_checkpoints.tar.gz

[21] L. Lai, N. Suda, and V. Chandra, “CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs,” ArXiv e-prints, Jan. 2018.

[22] B. McMahan and D. Rao, “Listening to the World Improves Speech Command Recognition,” ArXiv e-prints, Oct. 2017.

[23] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM ’14. New York, NY, USA: ACM, 2014, pp. 1041–1044. [Online]. Available: http://doi.acm.org/10.1145/2647868.2655045

[24] M. Alzantot, B. Balaji, and M. Srivastava, “Did you hear that? Adversarial Examples Against Automatic Speech Recognition,” ArXiv e-prints, Jan. 2018.

[25] R. Tang and J. Lin, “Deep Residual Learning for Small-Footprint Keyword Spotting,” ArXiv e-prints, Oct. 2017.

[26] J. Lee, T. Kim, J. Park, and J. Nam, “Raw Waveform-based Audio Classification Using Sample-level CNN Architectures,” ArXiv e-prints, Dec. 2017.

[27] S. Krishna Gouda, S. Kanetkar, D. Harrison, and M. K. Warmuth, “Speech Recognition: Keyword Spotting Through Image Recognition,” ArXiv e-prints, Mar. 2018.
