Portable mTBI Assessment Using Temporal and Frequency Analysis of Speech

Louis Daudet, Matthew Perez, Christian Poellabauer
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA
ldaudet@nd.edu, mperez14@nd.edu, cpoellab@nd.edu

Nikhil Yadav
Division of Computer Science, Mathematics, and Science, St. John's University, Queens, NY, USA
yadavn@stjohns.edu

Sandra Schneider
Department of Communicative Sciences and Disorders, Saint Mary's College, Notre Dame, IN, USA
sschneider@saintmarys.edu

Alan Huebner
Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA
alan.huebner.10@nd.edu
ABSTRACT
This paper shows that extraction and analysis of various
acoustic features from speech using mobile devices can allow
the detection of patterns that could be indicative of neurological trauma. This may pave the way for new types of
biomarkers and diagnostic tools. Toward this end, we created a mobile application designed to diagnose mild traumatic brain injuries (mTBI) such as concussions. Using
this application, data was collected from youth athletes from
47 high schools and colleges in the Midwestern United
States. In this paper, we focus on the design of a methodology to collect speech data, the extraction of various temporal and frequency metrics from that data, and the statistical
analysis of these metrics to find patterns that are indicative
of a concussion. Our results suggest a strong correlation
between certain temporal and frequency features and the
likelihood of a concussion.
Keywords: speech analysis, voice pathology, portable
diagnostics, concussions
1. INTRODUCTION
Speech recognition is a standard feature in current mobile
devices, e.g., applications often make use of speech recognition to perform certain user-driven actions. While it has
previously been shown that changes in motor speech production often accompany progressive neurological diseases
and neurotrauma (i.e., traumatic brain injury or TBI), only
recent advances in speech analysis and mobile technologies
have made it possible to develop speech-based diagnostic
and assessment tools for various health conditions, specifically focusing on neurodevelopmental conditions, neurodegenerative diseases, and traumatic injuries. Speech processing for health diagnostics and assessment in clinical settings
can be performed in the cloud, i.e., speech is captured on
a portable device, transmitted to a speech processing service that extracts and analyzes various acoustic features,
and sends the results back to the device for the practitioner
to interpret. However, in many settings, this approach is
not feasible, e.g., whenever assessment must occur in real time in non-clinical environments. Examples include sideline speech assessment of athletes (e.g., due to a suspected
concussion) in areas with no or poor wireless connectivity,
as well as the assessment of military personnel deployed in
remote areas. With the arrival of mobile devices (such as
smartphones and tablets) with relatively large processing,
energy, and storage resources, it is becoming increasingly
realistic to develop health tools on such devices that not
only capture the human voice, but also perform processing,
filtering, analysis, and diagnostics in near real-time [1, 2, 3].
The technical and computational challenges for the deployment of lightweight speech recognition applications on mobile devices are daunting, e.g., while mobile devices are increasingly powerful, they still have much stricter resource
limitations than other computing systems [4]. It can also
be challenging to build such systems due to the proprietary
nature of (desktop or server-based) speech recognition software toolkits, which are often provided without access to
source code. However, with recent advances on both the
software side (e.g., more efficient algorithms) and the hardware side (e.g., resource-richer devices), mobile speech recognition applications have become available [5, 6], thereby
making speech-based diagnostics on mobile devices possible.
In this paper, we first present a speech analysis application
using open source tools and software focusing on the extraction and analysis of temporal features of speech. We then expand on these results by adding an analysis of features from
the frequency domain. In consultation with speech-language pathologists, a set of reading tests has been designed to measure a variety of speech features such as speaking rate, word
duration, or pitch and intensity fluctuation. Stimuli are selected to test maximum movement, strength, accuracy, and
range of the different oral motor structures (i.e., tongue, soft
palate, lips). We present our approach to data collection and
analysis with an emphasis on data collected from over 2500
athletes from 47 high schools and colleges in the Midwestern
United States using this application.
2. RELATED WORK
Several studies [7, 8, 9] have explored acoustic measures
to help distinguish various types of dysarthria (i.e., motor
speech disorders) due to neurotrauma (e.g., traumatic brain
injury) or neurodegenerative diseases (e.g., Parkinson’s Disease, Amyotrophic Lateral Sclerosis, Huntington’s Disease,
etc.). Exploring rhythm abnormalities [7, 8] and vowel metrics [9], the authors found that for individuals with Parkinson’s Disease, significant changes in the articulation rate (a
measure of rate of speaking excluding pauses) and pause
times (periods of absence of speech) can be detected. While acoustic differences in vowels due to different types of dysarthria can also be detected, listeners' perceptions of these changes were variable. The latter suggests that acoustic measures may be more accurate in detecting early signs of neurodegenerative disease processes.
While there are many studies [10, 11] examining speech
changes in individuals with moderate to severe TBI, there
is limited literature on speech changes due to mild TBI or
concussions. Currently, the diagnosis of concussion and its
ramifications for neurological diseases and disorders remain elusive, which is why it is so important to have an instrument that can detect changes in speech that will serve as a
biomarker for neurological wellness. In other studies [11, 12],
the authors investigated the possible impact of concussions
on speech. They focused on data collected from collegiate
boxers, where the acoustic features of pitch, formant frequencies, jitter (cycle-to-cycle variations of the fundamental
frequency), and shimmer (variability of peak-to-peak amplitude) were explored. While preliminary in focus, there was
substantial evidence that even mild traumatic brain injuries
leave their fingerprints in the human voice.
Pitch and formant changes have been used to detect early
signs of autism in infants’ cry melodies and other oral productions of young children [13, 14]. In [13], the authors
found that children with a high risk of being on the autism
spectrum had pain cries with a higher fundamental formant
frequency than children with a lower risk. In [14], they found
that children later diagnosed to be on the autism spectrum
had more monotonous voices than children without autism,
with their pitch contours being less complex.
Recent research using speech analysis for early Alzheimer’s
disease detection [15] emphasized the non-invasive nature of
using speech analysis for health diagnostics and monitoring.
Using emotional temperature, a metric of their own design
to measure emotional response, and the analysis of automatic spontaneous speech, they were able to differentiate
with some success between patients with Alzheimer’s disease
and a control group. This method could assist in making an
earlier diagnosis than what is offered by current diagnostic
methods that rely on extensive cognitive and medical assessment when the disease is already advanced enough for its strongest effects to become noticeable.
Figure 1: Speech recognition system block diagram
3. METHODS
3.1 Speech Recognition
An overview of the architecture of the third-party speech
recognition and analysis system used in this project (described in more detail in Section 4.2) is shown in Figure 1.
Users are asked to go through a series of tests, where words,
sentences, and sounds are produced and captured by a low-impedance, unidirectional dynamic microphone attached to
the mobile device. The external microphone minimizes the
effects of background noise and cross-talk, making the captured signal as clear as possible. This cannot be achieved by
the built-in microphone on the device as it is not sensitive
enough to filter these discrepancies, which are a product of
the recording environment. The integrity of the recording is
maintained by detecting its noise content using the methods
highlighted in [16]. If the voiced or unvoiced SNR (signal-to-noise ratio) is below a threshold, the user is prompted to
retake the test. If not, the speech recording is saved into
a speech repository, either locally on the mobile device, or
remotely in cloud storage. Each recording is sent to a feature extraction tool, where feature frames are extracted and
then passed on to the speech recognition software. This software includes a decoder, which in turn consists of a linguistic
component that receives information from a knowledge base
comprised of the following:
• Lexicon/Dictionary: This maps words to their pronunciations. Single words can have multiple pronunciations, which are represented in phonetic units. Dictionaries can vary in size from just a few words to
hundreds of thousands.
• Language Model: Describes what is likely to be spoken. A stochastic approach is employed where word
transitions are defined in terms of transition probabilities, e.g., in the sentence “Shut the door”, a noun
is expected after “the”, hence the language model can
constrain the search space to only nouns in the dictionary, reducing computational overhead.
• Acoustic model: This is a collection of trained statistical models, each representing a phonetic unit of
speech. They are trained by analyzing large corpora
of speech. Acoustic models are not necessarily constrained by the speaker, in that they can be generated
for a group of speakers or an individual. In our case,
we use a group-based acoustic model.
The decoder forms the backbone of the speech recognizer.
It selects the next set of likely states, assigns a score for
incoming features based on these states, and eliminates the
low scoring states to generate results. State selection by the
decoder is driven by the knowledge base, where the grammar
selects the next set of possible words. The dictionary is used
to collect pronunciations for the words. An acoustic model
is used to collect Hidden Markov Models (HMMs) for each
pronunciation, from which transition probabilities are used
to select the next set of states.
The linguist component of the decoder obtains word pronunciations, probabilities, transitions, and state information
from the knowledge base as shown in Figure 1. These are
used to generate nodes in HMMs that represent the speech
samples. HMMs are commonly used in speech recognition
software to calculate the likelihood of each state [17, 18].
The HMMs used in the third-party speech recognition software in this project are specialized versions that emit
observations from an observation sequence O from left-toright with probabilities defined in a Probability Distribution Function (PDF). Backward transitions are not allowed.
Each of the states in the HMM is represented by a Gaussian
mixture of density functions. In other words, the linguist
translates the rules provided by the user into a grammar
that the search manager can use. Based on this grammar,
the function of the search manager is then to build a tree of
possibilities for what the signal might be, and then search
the tree to find the best hypothesis. To do the latter, the
search manager uses the acoustic scorer. Its role is to compute, for a given input vector, the state output probability.
It provides these to the search module on demand. Using
them, the search module can then prune the tree of possibilities until a best hypothesis is found [25]. This best hypothesis is then output by the decoder together with its timing
boundaries. The computation complexity for these tasks can
be significant and in the past, mobile devices lacked a fast
virtual memory subsystem and a complete processing library
(most notably in C/C++) to handle the computational complexity of automatic speech recognition (ASR) tasks.
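As a generic illustration of how such a decoder scores hypotheses, the sketch below shows Viterbi decoding over a left-to-right HMM whose states emit observations through diagonal-covariance Gaussian mixtures. This is a textbook sketch only, not Sphinx's actual implementation, which uses fixed-point acoustic scoring, beam pruning, and a lexical search tree.

```python
import numpy as np

def log_gmm(x, weights, means, variances):
    """Log-likelihood of frame x under a diagonal-covariance Gaussian mixture."""
    x = np.asarray(x)
    comp = []
    for w, m, v in zip(weights, means, variances):
        ll = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        comp.append(np.log(w) + ll)
    return np.logaddexp.reduce(comp)

def viterbi_left_to_right(frames, states, log_trans):
    """Best state path for a left-to-right HMM (no backward transitions).

    `states` is a list of (weights, means, variances) GMM parameters per state;
    `log_trans` is an (n_states x n_states) numpy array of log transition
    probabilities, with -inf for backward moves.
    """
    n_states, n_frames = len(states), len(frames)
    score = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)

    score[0, 0] = log_gmm(frames[0], *states[0])   # must start in the first state
    for t in range(1, n_frames):
        for j in range(n_states):
            prev = score[t - 1] + log_trans[:, j]
            back[t, j] = np.argmax(prev)
            score[t, j] = prev[back[t, j]] + log_gmm(frames[t], *states[j])

    path = [n_states - 1]                           # must end in the last state
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```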
3.2 Temporal and Frequency Speech Features
In this work, we consider both temporal and frequency features in speech to detect signs of a potential mTBI. The
temporal features that are extracted together with their descriptions are listed in Table 1. Average duration has been
chosen as a feature since it has been extensively studied
and has been found to provide useful information regarding
speaking rate, dialect, phonetic context, stress, and specific
characteristics (i.e., gender, age, neurological status) of the
speaker [19, 20, 21]. The same applies to the metric relating
to stress. When the participants have to put stress on a particular word, they have to adjust their respiratory system to
reflect that increase in stress, the coordination of which is
anticipated to be more difficult for an individual suffering
from a diffuse head injury such as a concussion. In the same
way, a participant reading a continuous passage of speech
will tax the motor speech system to make sure all words are
coordinated and stress is timed appropriately.
Further, we also measure the diadochokinetic (DDK) rate.
The DDK rate measures the speed, strength, steadiness, and
accuracy of rapid, repetitive motor speech movements. In
our test, these sounds are Pa, Ta, and Pa-Ta-Ka [22]. This
part of the data collection is designed to emphasize various
oral motor skills such as articulation, respiration, tone control, and phonation. The DDK rate test determines if there
are problems in the speech mechanisms that control motor
skills or speech planning functions. In the case of a participant having suffered an mTBI, we anticipate, based on the work in [23, 24], that different temporal metrics will change: the speed/rate is expected to be either abnormally fast or slow, the frequency of iterations (steadiness) is expected to fluctuate, and the strength of the repetition is expected to become either hyper- or hypokinetic.
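To make the relationship between the DDK measures in Table 1 explicit, here is a minimal sketch that assumes the consonant-vowel onset times have already been detected (for example by a syllable detector such as the one in [22]); the onset values below are purely illustrative.

```python
import numpy as np

def ddk_metrics(onsets):
    """DDK measures from detected consonant-vowel (C-V) onset times, in seconds."""
    onsets = np.asarray(onsets)
    periods = np.diff(onsets)                      # time between successive C-V onsets
    avg_period = periods.mean()                    # Average DDK Period
    avg_rate = 1.0 / avg_period                    # Average DDK Rate (vocalizations/s)
    sdev_period = periods.std()                    # Standard Deviation in DDK Period
    cov_period = 100.0 * sdev_period / avg_period  # Coefficient of Variation (%)
    return avg_period, avg_rate, sdev_period, cov_period

# Example: a fairly steady "pa" repetition at roughly 5 syllables per second.
print(ddk_metrics([0.00, 0.21, 0.40, 0.61, 0.80, 1.01, 1.20]))
```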
In the frequency domain, several metrics are being considered as shown in Table 2. At this point, all metrics are
extracted off device (i.e., on a remote server), but their extraction does not necessarily require any special software
and thus could easily be performed on-device (although the
performance degradation due to hardware constraints is to
be determined). The first metrics that we compute are the
average pitch for the entire sound file and the standard deviation of this pitch. This is an easy metric to compute, as
there are many tools available to compute the pitch for every window of a given time interval over a given sound file.
In this paper, we used the popular Praat1 software tool to
perform these measurements. We also compute the average
power and its standard deviation as they are also easy to extract and popular in speech assessment. However, analysis
of these features has to be performed carefully since small
variations in the data capture (e.g., changes in the placement of the microphone) can lead to drastic changes in the
recorded intensity of the voice samples.
When listening to the recorded speech of concussed participants, several frequency domain features appear to show
consistent differences (compared to non-concussed participants). Common patterns include increased monotonicity of
the voice, increased tone fluctuations, and increased stress.
To translate these perceptive assessments into quantifiable
metrics, we first measured the pitch (or power) of the sound
file for every window of 10ms of voiced time in the sound
files. From these measurements, we computed the average
pitch for the sound file and then used that average to analyze
the speech signal’s pitch values to identify when they crossed
that average. We first considered measuring the amount of pitch variation by counting the number of times the pitch values crossed the average pitch. However, that approach made very steady recordings with numerous small fluctuations around the average appear more variable than recordings with fewer but larger fluctuations. Instead, we computed a weight for each of these crossings (seen in Figure 2). That way, very steady recordings with numerous fluctuations just around the average accumulate many small weights that add up to a small value, while recordings with fewer but larger fluctuations accumulate a few large weights that add up to a large value.
1 www.praat.org, http://www.fon.hum.uva.nl/praat
Table 1 Temporal Acoustic Metrics and Features Extracted

Average Duration: Average duration taken to say a word in the test
Standard Deviation in Duration: The standard deviation in durations of words being spoken
Stressed Word Duration: The time taken to say a word while stressing it
Stress Pause: Pause time before saying the stressed word
Average Syllable Duration: The average syllable duration in a continuous passage of speech
Average Pause Duration: The average pause duration in a continuous passage of speech. Notable pauses indicate a possible concussion
Average Diadochokinetic Rate (DDK): The number of consonant-vowel (C-V) vocalizations per second
Average Diadochokinetic (DDK) Period: The average time between consonant and vowel vocalizations (inverse of the rate)
Standard Deviation in DDK Period: The standard deviation of the DDK period (in ms)
Coefficient of Variation in DDK Period: This parameter measures the degree of rate variation in the period (%). If the C-V vocalization is repeated with little variation in rate, then this number is very small. However, as a speaker varies the rate of DDK during the seven-second analysis window, this number increases. This parameter assesses the participant's ability to maintain a constant rate of C-V combinations
Figure 2: For every time the pitch values go across
the average pitch, a new data point is computed for
the variation of the pitch’s amplitude
Figure 3: For every time the pitch values go across
the average pitch, a new data point is computed for
the time component of the pitch’s variance
After having computed all of the crossing points’ weights,
we added them together to obtain a value for the entire
file, providing a measure for the variance of the pitch. To
see if that variance was constant throughout the file, we
then computed the average and standard deviation of that
variance. In addition to the variance of the amplitude of the
pitch, we also wanted to get a sense of the frequency of the
variance of the pitch. To do so, for each point at which the pitch crosses the average-pitch threshold, we computed a new value by averaging the time elapsed before and after that crossing. This process can be seen in Figure 3. Again, we
wanted to be able to see if the frequency of the variance
was steady or not, and thus we computed the average and
the standard deviation of this frequency. This methodology
that we followed to extract metrics related to the pitch was
similarly applied to extract metrics related to the intensity
of the sound signal. The details of the frequency domain
features extracted in our work are presented in Table 2.
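Below is a simplified Python sketch of one way to compute these crossing-based metrics. It assumes that the weight of a crossing is the peak deviation from the average within the excursion that follows it, and that the time component averages the intervals before and after each crossing; it is meant to illustrate the idea rather than reproduce the exact implementation.

```python
import numpy as np

def crossing_metrics(track, step=0.01):
    """Pitch (or power) variation metrics from a track sampled every `step` s.

    `track` is a 1-D array of voiced pitch values (or power values); unvoiced
    frames are assumed to have been removed already.
    """
    centered = track - track.mean()

    # Indices where the signal crosses its average (sign change between frames).
    crossings = np.where(np.diff(np.signbit(centered)))[0]
    if len(crossings) < 3:
        return None  # not enough fluctuation to compute the metrics

    # Amplitude weight of each crossing: largest deviation from the average
    # within the excursion between this crossing and the next one (Figure 2).
    segments = [centered[a + 1:b + 1] for a, b in zip(crossings[:-1], crossings[1:])]
    amp_weights = np.array([np.abs(seg).max() for seg in segments])

    # Time component: average of the interval before and after each crossing (Figure 3).
    intervals = np.diff(crossings) * step
    time_weights = (intervals[:-1] + intervals[1:]) / 2.0

    return {
        "variation": amp_weights.sum(),                        # "Pitch Variation" in Table 2
        "avg_variation": amp_weights.mean(),
        "std_variation": amp_weights.std(),
        "freq_variation": time_weights.sum(),                  # "Frequency of Pitch Variation"
        "avg_freq_variation": time_weights.sum() / len(crossings),
        "std_freq_variation": intervals.std(),                 # std of time between fluctuations
    }
```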
4. IMPLEMENTATION
4.1 Data Collection
Concussion testing typically involves recording a pre-season
baseline for a subject representing their “healthy” state. Subsequent post-baseline recordings are classified and tagged as
either being normal or from a subject who has a suspected
concussive injury. Post-baseline recordings are compared to
the initial baseline. Intuitively, subjects with a concussive
injury should display a pattern that is atypical of a healthy
control group. Speech was recorded on an iOS mobile device (iPad mini) fitted with a low-impedance Shure SM10A
microphone2 designed for close-talk, headworn applications such as remote-site sports broadcasting; speech was sampled at 44.1 kHz, 16-bit, mono. A custom mobile application was designed
for a multi-syllabic word reading test, which asks users to
read out a sequence of words as they appear on a mobile device screen at given time intervals. The words were carefully
2 http://www.shure.com/americas/products/microphones/sm/sm10a-headworn-microphone
Table 2 Frequency Acoustic Metrics and Features Extracted

Average Pitch: Average pitch from a speech sample
Pitch Standard Deviation: The standard deviation of the pitch in a speech sample
Pitch Variation: How many times the pitch goes above or below the pitch average in a speech sample, weighted by how much it deviates from that average
Average Pitch Variation: Average of the pitch variation
Pitch Variation Standard Deviation: Standard deviation of the pitch variation
Frequency of Pitch Variation: This metric is computed by adding together all the weights for the time component of the pitch variance, as seen in Figure 3
Average Frequency of Pitch Variation: Computed by averaging the frequency of pitch variation metric by the number of crossing points
Standard Deviation of the Frequency of Pitch Variation: Standard deviation of the time between fluctuations in pitch
Average Power: Average power from a speech sample
Power Standard Deviation: The standard deviation of the power in a speech sample
Power Variation: How many times the power deviates from the power average in a speech sample, weighted by how much it deviates from that average
Average Power Variation: Average of the power variation
Power Variation Standard Deviation: Standard deviation of the power variation
Frequency of Power Variation: This metric is computed by adding together all the weights for the time component of the power variance, similarly to what is seen in Figure 3 for pitch
Average Frequency of Power Variation: Computed by averaging the frequency of power variation metric by the number of crossing points
Standard Deviation of the Frequency of Power Variation: Standard deviation of the time between fluctuations in power
handpicked after consultation with speech language pathologists. The details of the tests used are shown in Table 3;
the words and sounds are selected in a way that will require
the users to use different parts of the speech production system (front and back of the mouth, soft palate in the back of
the throat). Each of the tests was designed and selected for
specific reasons, as described in Table 4.
The setup for the test is shown in Figure 4. The noise management techniques and signal-to-noise ratio threshold proposed in [16] were used. This approach rejects a test when
the voiced or unvoiced SNRs drop below a threshold. Feedback is provided based on these values to convey whether
the environment is noisy, the microphone is not placed optimally, or the user is not speaking loud enough (speech intensity is also measured). A test retake is recommended in
each of these cases. A sample waveform and corresponding
spectrogram of a recording are shown in Figure 5, generated using the Praat software. The waveform signal shows
the amplitude of seven of the words. The broadband spectrogram shows the spectral energy of the sound over time.
The red dots represent the formants; blue lines represent the
speaker’s pitch, and the yellow line, faint at the bottom of
the frequency graph, represents intensity.
The recorded speech to be analyzed was then transferred
to a local application repository (disk storage) after being
down-sampled to 16 kHz, both on the server and on the mobile devices. The downsampling was necessary due to the selection of the speech processing library, Sphinx3, which can only process 16 kHz signals. The original 44.1 kHz signals were retained for quality purposes for use in future research. The downsampled files are approximately 12 s long and only occupy 415 KB in storage each.

3 http://cmusphinx.sourceforge.net/
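For illustration, the 44.1 kHz to 16 kHz downsampling step could be scripted as in the following sketch; SciPy and 16-bit mono WAV input are assumptions, and the actual pipeline components may differ on the server and devices.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

rate, samples = wavfile.read("recording_44k.wav")   # 44.1 kHz, 16-bit mono capture
assert rate == 44100

# 44100 * 160 / 441 = 16000 Hz, the rate expected by the Sphinx models used here.
downsampled = resample_poly(samples.astype(np.float64), up=160, down=441)
downsampled = np.clip(downsampled, -32768, 32767).astype(np.int16)

wavfile.write("recording_16k.wav", 16000, downsampled)
```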
Figure 4: Recording setup
4.2 Speech Recognition
4.2.1 Decoder
The Sphinx speech recognition software toolkit [26, 27] was
selected as the speech decoder due to its open-source nature. In particular, the lightweight Pocketsphinx implementation was selected since it can easily be ported to mobile
devices. The software is written in C/C++, making it ideal
for the iOS mobile platform, which has a C/C++ virtual
memory subsystem. Pocketsphinx has been optimized for
speed on mobile device processors, e.g., the feature extraction has been changed to fixed point, which has resulted in
a significant improvement in speed [4].
Table 3 Speech Test Details

Test 1 (1.5s per word). Words or Sounds: application, participate, education, difficulty, congratulations, possibility, mathematical, opportunity. Description: Multisyllabic words are shown on the device's screen for exactly 1.5 seconds each. It is anticipated that participants with mTBI would find it more difficult to accurately produce these words and there may be induced delays in this operation. Temporal Metrics Extracted: Average Duration / Standard Deviation in Duration.

Test 2 (10s). Words or Sounds: put the book here. Description: Stress is placed on different words in the sentence (PUT, BOOK, HERE) that were displayed in bold at each of the iterations. This is a standard test used to measure stress, but timing parameters can be extracted from this as well. Temporal Metrics Extracted: Stressed Word Duration / Stress Pause.

Test 3 (5s). Words or Sounds: we saw several wild animals. Description: The standard syllabic rate may be affected and cause perceptual differences in articulation. Temporal Metrics Extracted: Average Syllable Duration / Average Pause Duration.

Test 4 (5s). Words or Sounds: pa. Description: Participants repeat the pa sound as quickly as possible. Temporal Metrics Extracted: Average DDK Period / Average DDK Rate / Standard Deviation in DDK Period / Coefficient of Variation in DDK Period.

Test 5 (5s). Words or Sounds: ka. Description: Participants repeat the ka sound as quickly as possible. Similar to the previous test. Temporal Metrics Extracted: Average DDK Period / Average DDK Rate / Standard Deviation in DDK Period / Coefficient of Variation in DDK Period.

Test 6 (5s). Words or Sounds: pa-ta-ka. Description: Similar to the category 4 and 5 tests. Alternating sounds are used to measure sequential motion rate. Temporal Metrics Extracted: Average DDK Period / Average DDK Rate / Standard Deviation in DDK Period / Coefficient of Variation in DDK Period.
Table 4 Speech Test Reasonings

Test 1: The multi-syllabic words (4 syllables) chosen for this test contain front, middle, and back vowels and bilabial, alveolar, velar, and glide consonants, allowing for maximum oral structure movement.
Test 2: By requesting individuals to emphasize the highlighted word in the sentence, stress, rhythm, amplitude, and frequency can be measured.
Test 3: The sentence used was controlled for syllable length (including 1- and 3-syllable words) containing front, middle, and back vowels and consonants, measuring accuracy of articulatory production and movement, and syllable duration.
Test 4: Tests 4 and 5 are used to measure the diadochokinetic rate (the ability to rapidly, accurately, and steadily produce the neutral vowel with a front and back consonant sound); a measure of the accuracy of alternating motor movements.
Test 5: See explanation for test 4.
Test 6: This test assesses sequential diadochokinetic motion rate by measuring the accuracy, rate, and duration of each syllable produced.
4.2.2 Knowledge Base

• Lexicon/Dictionary: The Carnegie Mellon University pronouncing dictionary for general American English4 that comes packaged with PocketSphinx was used (39 phones). The phonetic decomposition of the words used in test 1 (multisyllabic words) from this corpus is shown in Table 5 with all possible pronunciations.

• Language Model: A JSGF (Java Speech Grammar Format)5 grammar file is defined for the multi-syllabic word test. A fixed text corpus is used, since our grammar is restricted and subjects speak a known set of words.

• Acoustic model: The generic hub4 Wall Street Journal (WSJ) [28, 29] acoustic model that comes prepackaged with Sphinx was used. It is a comprehensive model that has been trained on over 140 hours of speech.

4 http://www.speech.cs.cmu.edu/cgi-bin/cmudict
5 http://www.w3.org/TR/jsgf
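To make the decoder configuration concrete, the sketch below wires a restricted JSGF grammar, the CMU dictionary, and an acoustic model into Pocketsphinx and prints per-word timing boundaries similar to those in Table 6. It uses the older pocketsphinx Python bindings purely for illustration; module paths, option names, and file locations are assumptions and differ across Pocketsphinx releases (the application itself uses the C/C++ API on iOS).

```python
# Sketch: decoding one multi-syllabic test recording with Pocketsphinx and a
# restricted JSGF grammar, then printing per-word timing boundaries.
import wave

from pocketsphinx.pocketsphinx import Decoder  # legacy bindings (assumed)

JSGF_GRAMMAR = """#JSGF V1.0;
grammar mtbi_test1;
public <word> = application | participate | education | difficulty |
                congratulations | possibility | mathematical | opportunity ;
"""

with open("test1.gram", "w") as f:
    f.write(JSGF_GRAMMAR)

config = Decoder.default_config()
config.set_string("-hmm", "model/en-us")          # WSJ/hub4-style acoustic model (assumed path)
config.set_string("-dict", "cmudict-en-us.dict")  # CMU pronouncing dictionary (assumed path)
config.set_string("-jsgf", "test1.gram")          # restricted grammar for test 1
decoder = Decoder(config)

# Recordings are 16 kHz, 16-bit, mono after downsampling.
wav = wave.open("subject1_test1_16k.wav", "rb")
decoder.start_utt()
decoder.process_raw(wav.readframes(wav.getnframes()), False, True)
decoder.end_utt()

# Segments include <sil> regions; at the default 100 frames/s, frame/100 gives seconds.
for seg in decoder.seg():
    print(f"{seg.word:>18s}  {seg.start_frame / 100.0:6.2f}s  {seg.end_frame / 100.0:6.2f}s")
```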
Figure 5: Sample waveform and spectrogram for multi-syllabic reading test (amplitude over time; spectrogram frequency in Hz over time in seconds)
Table 5 Phonetic breakdown of words used in multi-syllabic reading test

WORD             PHONETIC BREAKDOWN
participate      P AA R T IH S AH P EY T
application      AE P L AH K EY SH AH N
education        EH JH AH K EY SH AH N
difficulty       D IH F AH K AH L T IY
difficulty (2)   D IH F IH K AH L T IY
congratulations  K AH N G R AE CH AH L EY SH AH N Z
possibility      P AA S AH B IH L AH T IY
mathematical     M AE TH AH M AE T IH K AH L
opportunity      AA P ER T UW N AH T IY
Figure 6: Power histogram before SNR algorithm

5. RESULTS
5.1 Noise Management
The method described in [16] was used to determine a threshold value for the voiced and unvoiced SNR below which a
speech signal is considered noisy. The code used to compute the SNR values from the sound files is a modified version of code written by Antoine Fillinger and Vincent Stanford (adapted with their help). The description of the code presented in this paper is based on a description of an earlier version of the code by Jon Fiscus6. The program estimates the SNR values of a file using a logarithmic
function and the speech power of the sound file. In order
to estimate speech noise levels a signal energy histogram
is created. With the assumption that the recording is not
too noisy, we expect to see two different peaks in this distribution: one for the noise level on the left and one for
the speech level on the right. The noise distribution is estimated by fitting a raised cosine function to the left peak of
the RMS histogram. The fitting procedure first issues an estimate for the location, amplitude, and width of the leftmost peak. Then a pattern search algorithm called "direct search" [30] is applied to maximize the fit. With the best fit found, the midpoint of the raised cosine function is labeled as the mean noise power level, as seen in Figure 6. This raised cosine function, which estimates the noise power distribution, is then subtracted from the RMS power histogram in order to estimate the speech power distribution. The speech level is defined to be the bin midpoint where the 95th percentile occurs in the speech power histogram, as shown in Figure 7. The noise level is then subtracted from the speech level to obtain the SNR.
6 http://labrosa.ee.columbia.edu/projects/snreval/
Figure 7: Power histogram after SNR algorithm
This method relies on the fact that the analyzed sound file
is a mix of two distinct power distributions, one emanating
from the signal, and one emanating from the noise. It is
important to note that if the noise and speech distributions
are close to one another (i.e., a very noisy recording), then
this technique will produce unreliable results.
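A simplified sketch of this histogram-based estimate is shown below; it replaces the raised-cosine fit and direct search with a simple histogram-peak estimate of the noise level, so it only approximates the NIST algorithm described above. Frame RMS powers in dB are assumed as input, and a recording would then be rejected when the resulting SNR falls below the thresholds discussed next.

```python
import numpy as np

def estimate_snr(frame_power_db, n_bins=100):
    """Rough SNR estimate from per-frame RMS power values (in dB).

    Simplification of the histogram method described above: the noise level is
    taken as the largest histogram peak among the lower-power bins (instead of
    a raised-cosine fit refined by direct search), and the speech level as the
    95th percentile of the frames above the noise level.
    """
    counts, edges = np.histogram(frame_power_db, bins=n_bins)
    mids = 0.5 * (edges[:-1] + edges[1:])

    # Noise level: dominant peak in the lower half of the power range.
    lower = n_bins // 2
    noise_level = mids[np.argmax(counts[:lower])]

    # Speech level: 95th percentile of the remaining (higher-power) frames.
    speech_frames = frame_power_db[frame_power_db > noise_level]
    if speech_frames.size == 0:
        return 0.0  # degenerate case: no frames above the noise peak
    speech_level = np.percentile(speech_frames, 95)

    return speech_level - noise_level
```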
We then had to compute the threshold value for the SNR
below which to reject the recordings. 110 speech recordings
were taken from our speech repository and a noise signal
was added to them at varying intensities. The accuracy
of the ASR using Sphinx in identifying the original timing
with the added noise was evaluated for the given files. At
first, the noise signal was added to the speech signal at full
power, then the noise signal’s power was reduced by varying
amounts before being added to the speech signal decreasing
its effect on the SNR and accuracy. This reduction in signal intensity is what is described in Figure 8 as “power drop
in dB”. This figure shows the averaged voiced and unvoiced
SNR values impacted by induced noise, and the correspond-
ing accuracy for temporal breakdown of words using Sphinx.
For accurate ASR operation, the average voiced SNR value
of the speech recording should be above 28dB, and the average unvoiced SNR reading should be above 16dB. These
were the criteria used for selecting the signal and identifying
it for processing.
Table 6 Sphinx output for a single test file

WORD             START TIME (s)  END TIME (s)
<sil>            0               1.78
participate      1.79            2.52
<sil>            2.53            3.11
application      3.12            3.86
<sil>            3.87            4.5
education        4.51            5.19
<sil>            5.2             6.04
difficulty       6.05            6.74
<sil>            6.75            7.42
congratulations  7.43            8.44
<sil>            8.45            8.97
possibility      8.98            9.68
<sil>            9.69            10.46
mathematical     10.47           11.18
<sil>            11.19           12
opportunity      12.01           12.7
<sil>            12.71           12.94
Figure 8: Impact of noise on ASR accuracy (for 110
subjects)
We finally modified the SNR measuring code to make it
faster. Specifically, we changed the number of passes made
by the “direct search” algorithm to maximize the fit of the
raised cosine function. By default, the code was making a
large number of passes to try to find the most accurate fit between the raised cosine function and the noise peak. Using the same speech recordings used to determine good SNR values, we iteratively lowered that number, each time measuring the difference between the raised cosine function produced with the lower number of passes and the one produced with the default number of passes. We wanted to keep that difference low, and found that we could reduce the number of passes from more than a hundred down to ten and still accomplish that goal. At ten passes, all the sound files received SNR values that varied from the original results by less than 10%.
5.2 Correctness
The accuracy in decoding the speech was analyzed by superimposing the extracted syllable timing boundaries obtained
using the Sphinx decoder, as shown in Table 6, with the
actual waveform. The <sil> in the sphinx output denotes
silence regions in the speech. Figure 9 shows how Sphinx
decomposes the waveforms of a recording.
As can be seen, it performed well provided the recording was "clean" and taken in a relatively controlled, noise-free environment. Only recordings collected from subjects who met these criteria were used in the analysis. The output from running Pocketsphinx on the mobile devices was identical to the output generated on a server. This was validated for multiple files, and implies the correctness of the Sphinx decoder irrespective of the operating platform. As for the frequency domain features, their extraction has not been tested on a mobile device, but since this extraction was done using only basic mathematical functions, there should be no drastic changes required to port them to a mobile device.
Figure 9: Temporal decomposition of speech using
Sphinx for subject 1 (the red lines indicate where
Sphinx marked the starting and ending times of the
uttered syllables)
5.3 Performance
Our concern was to highlight the performance of the temporal decomposition when run completely on a mobile device, making it ideal in cases where network connectivity is unavailable and where a server-side cloud solution would be impractical. We measured the average CPU processing time
per file obtained when running temporal analysis over ten
files. The iPad 2 and iPad mini performed temporal decomposition and gave feedback in the 1.5s range. The iPhone
4 gave the highest response time in the 2.5s region. As expected, the MacBook Air laptop performed exceptionally
well (<.1s). This indicated a much faster virtual memory
subsystem on the laptop, and also faster I/O handling. The
mobile application achieving the same had a basic single
view graphical user interface (GUI), however graphical tasks
contributed a maximum of only 3.2% of CPU activity across
the mobile devices.
As for the frequency domain features, their extraction has only been performed off device, so it is difficult to assess exactly how long it would take on a mobile device. That being said, it took about 30 seconds to extract these features on a MacBook Air for 580 sound files. According to Geekbench, a cross-platform benchmark used to measure a computer's processor and memory performance, our iPad devices are about ten times slower than a MacBook Air. Thus, the processing would take roughly 300 seconds on the iPad for these 580 sound files, or about 0.5 seconds per speech sample, making the extraction feasible on the iPad as well.
Table 7 Results from Statistical Analysis of Temporal Metrics

TEST  ACOUSTIC METRIC              Pr(>|z|)
1     Average Duration             0.0076
2     BOOK Stressed Word Duration  0.0237
4     Average DDK Period           0
4     Average DDK Rate             0.0237
4     sDev DDK Period              0.0141
4     Variance DDK Period          0.0192
6     Average DDK Rate             0.0412
5.4 Statistical Significance Tests of Temporal Features
Temporal speech test data from 486 controls (i.e., they had
no prior concussions and had a post-baseline recording) and
95 concussed subjects were collected and analyzed using a logistic regression approach. An established
rule-of-thumb for logistic regression modeling is that, in order to obtain stable results, the data must contain at least
10 events (i.e., concussions) for every predictor variable included in the model [31, 32]. Since we had 95 concussions,
this meant we could include at most nine predictors in our
model. We chose the timing acoustic metric predictors for all
six tests for the modeling. For a large number of recordings,
no useful speech features could be extracted, because of interferences such as background noise or other sounds (e.g.,
laughter, mis-pronounced words, etc.). Simply discarding
these observations from our relatively small data set was
not possible, so the technique of multiple imputation was
used to statistically “fill in” the missing data [33, 34]. The
results below are adjusted to take into account the fact that
data was imputed. The statistically significant timing predictors are shown in Table 7.
To assess the predictive power of the model we used the
method of receiver operating characteristic (ROC) curves.
A model that classified subjects no better than a coin flip
would yield a straight line from (0,0), to (1,1), i.e., the black
line in Figure 10; hence the area under this curve (AUC)
would be 0.50. Any increase in AUC from 0.50 indicates
better predictive power, with 0.80 considered to be excellent [35]. The red curve shows the ROC curve yielded by
the model fitted with the features included in Table 7. At its highest point, the curve is noticeably higher than the black diagonal. The AUC was computed to be 0.70; thus, the discrimination between concussions and non-concussions yielded by the model was at the lower bound of what is considered acceptable. This shows that the temporal acoustic features identified by our system could be useful biomarkers in concussion detection.
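As a simplified illustration of this modeling step: the actual analysis was performed in R with multiple imputation of missing observations, whereas the sketch below uses Python/scikit-learn, skips imputation, and assumes a hypothetical file and column layout.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical feature table: one row per subject, temporal metrics as columns,
# and a 0/1 label indicating a concussed recording.
data = pd.read_csv("temporal_features.csv")  # assumed file/column layout
features = ["avg_duration", "book_stressed_duration", "avg_ddk_period",
            "avg_ddk_rate", "sdev_ddk_period", "var_ddk_period"]
X = data[features]
y = data["concussed"]

model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]

# An AUC of 0.5 corresponds to chance; ~0.70 was obtained for the temporal features.
print("AUC:", roc_auc_score(y, scores))
fpr, tpr, _ = roc_curve(y, scores)  # points along the ROC curve
```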
5.5 Statistical Significance Tests of Spectral Features
The spectral features were extracted in a two step process.
First, a Praat script was used to extract the pitch and the
Figure 10: ROC Curve Using Statistically Significant Temporal Features
power for every 10ms increment from each sound file. When
the power was too low, the pitch and/or the power could be marked by Praat as unknown; every other interval received a value. This Praat script goes through each of the sound files, test by test, and creates a text file with the same name as the sound file containing the power and pitch data, with one line for every 10 ms. Once all the text files have been created, a second script, this time written in Python, produces the values described in Table 2. The results for these features are then added to the results of the temporal features in an Excel file. This Excel file is used as an input for an R script that goes through them all to determine the statistical significance (see Tables 7 and 8 for results), and then computes the AUC and draws the ROC curve for the chosen metrics. One ROC curve shows only the spectral
metrics with a p-value < .05, listed in Table 8. This curve,
shown in Figure 11, has an AUC of 0.80.
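For illustration, the first extraction step described above could look roughly like the following sketch, which uses the parselmouth Python interface to Praat instead of a Praat script; file names and the output format are assumptions.

```python
import numpy as np
import parselmouth  # Python interface to Praat (substitute for the Praat script)

def extract_pitch_track(wav_path, out_path, step=0.01):
    """Write the pitch (Hz) of a recording every 10 ms, one line per frame.

    Frames where Praat finds no pitch (unvoiced or too little power) are
    written as 'unknown', mirroring the described output. The intensity
    (power) track can be extracted analogously with snd.to_intensity().
    """
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(time_step=step)
    f0 = pitch.selected_array["frequency"]  # 0.0 where no pitch was found
    times = pitch.xs()

    with open(out_path, "w") as out:
        for t, value in zip(times, f0):
            f0_str = "unknown" if value == 0.0 else f"{value:.2f}"
            out.write(f"{t:.2f}\t{f0_str}\n")

extract_pitch_track("subject1_test3.wav", "subject1_test3_pitch.txt")
```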
For best results, a second ROC curve was drawn using the statistically relevant features from both the temporal and spectral domains. This ROC curve can be seen in Figure 12. When using both types of features, the AUC from the ROC curve increases to 0.86. It is important to note, though, that 23 features are used for this ROC curve. Since we tested these features against 98 concussed participants, only 4.26 concussed participants were available to support each feature, whereas 5 to 10 are typically recommended to avoid over-fitting of the model. With future data collection (and therefore more concussed participants recorded), we intend to revisit and revise these results as part of our future work.
Table 8 Results from Statistical Analysis of Frequency Metrics

TEST  ACOUSTIC METRIC             Pr(>|z|)
3     Std freq variance pitch     0
3     Freq variance pitch         0.0015
3     Freq variance amplitude     0.0024
3     Std variance amplitude      0.0177
3     Variance pitch              0.0296
5     Freq variance pitch         0.0185
5     Std variance amplitude      0.0194
5     Average variance pitch      0.0257
5     Std pitch                   0.0258
6     Std variance amplitude      0.0001
6     Average variance amplitude  0.0002
6     Std amplitude               0.0011
6     Average variance pitch      0.0043
6     Std pitch                   0.0115
6     Std variance pitch          0.0176
6     Freq variance amplitude     0.026
Figure 11: ROC Curve Using Statistically Significant Spectral Features
Figure 12: ROC Curve Using Both Statistically Significant Spectral and Temporal Features
6. CONCLUSIONS
This paper described a reading test to capture speech recordings from potentially concussed subjects, noise management techniques for such data collections, and feature extraction techniques in both the time and frequency domains. Various combinations of these features show great potential as speech biomarkers for mTBI. In our future work, we intend to increase our speech corpus and study additional acoustic features beyond the ones described in this work.

Acknowledgment

This research was supported in part by GE Health and the National Football League through the GE/NFL Head Health Challenge. The research was further supported in part by the National Science Foundation under Grant Number IIS-1450349. The authors would like to thank Vince Stanford of the National Institute of Standards and Technology (NIST) for his help in providing us with the SNR estimation algorithm for noise detection.

7. REFERENCES
[1] A. Waibel, A. Badran, A. W Black, R. Frederking,
D. Gates, A. Lavie, L. Levin, K. Lenzo, L. Mayfield
Tomokiyo, J. Reichert, T. Schultz, D. Wallace,
M. Woszczyna, and J. Zhang, “Speechalator: Two-way
speech-to-speech translation in your hand,” in
Proceedings of NAACL-HLT, 2003.
[2] H. Franco, J. Zheng, J. Butzberger, F. Cesari,
M. Frandsen, J. Arnold, V. R. R. Gadde, A. Stolcke,
and V. Abrash, “Dynaspeak: SRI’s scalable speech
recognizer for embedded and mobile systems,” in
Proceedings of HLT, 2002.
[3] T. W. Köhler, C. Fügen, S. Stüker, and A. Waibel,
“Rapid porting of ASR-systems to mobile devices,” in
Proceedings of Interspeech, 2005.
[4] D. Huggins-Daines et al., "Pocketsphinx: A free, real-time
continuous speech recognition system for hand held
devices,” International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2006.
[5] X. Lei, A. Senior, A. Gruenstein, and J. Sorensen,
“Accurate and compact large vocabulary speech
recognition on mobile devices.” in Interspeech, 2013.
[6] I. McGraw, R. Prabhavalkar, R. Alvarez, M.G. Arenas
“Personalized speech recognition on mobile devices” in
2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2016.
[7] S. Skodda and U. Schleggel, “Speech rate and rhythm
in Parkinson’s disease,” Mov Disord.,23(7), pp.
985–992, 2008.
[8] K.L. Lansford, and J.M. Liss. “Vowel acoustics in
dysarthria: Mapping to perception,” Journal of Speech,
Language, and Hearing Research, vol. 57.1, pp. 68-80,
2014.
[9] J.M. Liss, L. White, S.L. Mattys, K. Lansford,
A.J. Lotto, S.M. Spitzer, and J.N. Caviness,
“Quantifying Speech Rhythm Abnormalities in the
Dysarthrias,” JSLHR, vol. 52, pp. 1334-1352, 2009.
[10] A. D. Hinton-Bayre et al., “Mild head injury and
speed of information processing: a prospective study of
professional rugby league players,” Journal of Clinical
and Experimental Neuropsychology, vol. 19, pp.
275-289, 1997.
[11] M. Falcone, N. Yadav, C. Poellabauer, and P. Flynn,
“Using isolated vowel sounds for classification of mild
traumatic brain injury,” International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
2013.
[12] C. Poellabauer, N. Yadav, L. Daudet, S. Schneider,
C. Busso, and P. Flynn, “Challenges in Concussion
Detection Using Vocal Acoustic Biomarkers,” IEEE
Access, vol. 3, pp. 1143-1160, 2015.
[13] S. J. Sheinkopf, J. M. Iverson, M. L. Rinaldi, and
B. M. Lester, “Atypical Cry Acoustics in 6-Month-Old
Infants at Risk for Autism Spectrum Disorder,” Autism
Research, 5(5), pp. 331-339, October 2012.
[14] J. Brisson, K. Martel, J. Serres, S. Sirois, and J. L.
Adrien, “Acoustic Analysis of Oral Productions of
Infants Later Diagnosed with Autism and Their
Mother,” Infant Mental Health Journal, 35(3), pp.
285-295, 2014.
[15] K. Lopez-de-Ipina, J. B. Alonso, N. Barroso,
M. Faundez-Zanuy, M. Ecay, J. Sole-Casals, C. M.
Travieso, A. Estanga, and A. Ezeiza, “New Approaches
for Alzheimer’s Disease Diagnosis Based on Automatic
Spontaneous Speech Analysis and Emotional
Temperature,” Ambient Assisted Living and Home
Care, Lecture Notes in Computer Science, vol. 7657,
pp. 407-414, 2012.
[16] N. Yadav, L. Daudet, C. Poellabauer, and P. Flynn,
“Noise Management in Mobile Speech Based Health
Tools,” IEEE Healthcare Innovation and Point-of-Care
Technologies (HIC-POCT), 2014.
[17] D. B. Paul, “Speech Recognition Using Hidden
Markov Models,” The Lincoln Laboratory Journal, vol.
3, no. 1, 1990.
[18] M. Gales and S. Young, “The Application of Hidden
Markov Models in Speech Recognition,” Foundations
and Trends in Signal Processing, vol. 1, no. 3,
pp.195-304, 2008.
[19] T. Crystal and A. House, “Segmental durations in
connected speech signals: Current results,” The journal
of the acoustical society of America, vol. 83, no. 4,
pp.1553-1573, 1988.
[20] T. Crystal and A. House, “Segmental durations in
connected speech signals: Syllabic stress,” The journal
of the acoustical society of America, vol. 83, no. 4,
pp.1574-1585, 1988.
[21] F. Darley, A. Aronson, and J. Brown, “Clusters of
deviant speech dimensions in the dysarthrias,” Journal
of Speech, Language, and Hearing Research, 12.3,
pp.462-496, 1969.
[22] F. Tao, L. Daudet, C. Poellabauer, S. Schneider, and
C. Busso, “A Portable Automatic PA-TA-KA Syllable
Detection System to Derive Biomarkers for Neurological
Disorders”. Interspeech 2016, pp. 362-366, 2016.
[23] J.R. Duffy, “Motor Speech Disorders: Substrates,
Differential Diagnosis, and Management,” St. Louis,
MO, USA, Mosby, 3rd ed., 2005.
[24] F.L. Darley, A.E. Aronson, and J.R. Brown, “Motor
Speech Disorders,” Philadelphia, PA, USA: Saunders,
1975.
[25] P. Lamere, P. Kwok, E. Gouvea, B. Raj, R. Singh,
W. Walker, M. Warmuth, and P. Wolf, “The CMU
SPHINX-4 speech recognition system," IEEE Intl.
Conf. on Acoustics, Speech and Signal Processing, Hong
Kong. Vol. 1. 2003.
[26] K. F. Lee, H. W. Hon, and R. Reddy, “An overview of
the SPHINX speech recognition system,” IEEE
Transactions on Acoustics, Speech, and Signal
Processing, Vol. 38, No. 1 , 1990.
[27] A. Varela, H. Cuayáhuitl, and J. A. Nolazco-Flores,
“Creating a Mexican Spanish Version of the CMU
Sphinx-III Speech Recognition System,” Iberoamerican
Congress on Pattern Recognition (CIARP), Springer
Lecture Notes in Computer Science(LNCS), 2905, pp.
251-258, 2003.
[28] P. Placeway, S. Chen, M. Eskenazi, U. Jain,
V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K.
Seymore, M. Siegler, R. Stern, and E. Thayer, “The
1996 hub-4 sphinx-3 system,” in Proc. DARPA Speech
Recognition Workshop, 1997.
[29] D. Paul and J. Baker, “The design of the wall street
journal-based csr corpus,” in Proceedings of ARPA
Speech and Natural Language Processing Workshop, pp
357-362, 1992.
[30] R. Hooke and T.A. Jeeves. ““Direct Search” Solution
of Numerical and Statistical Problems,” Journal of the
ACM (JACM), vol. 8.2, pp. 212-229, 1961.
[31] P. Peduzzi, J. Concato, E. Kemper, T.R. Holford, and
A.R. Feinstein, “A simulation study of the number of
events per variable in logistic regression analysis,”
Journal of clinical epidemiology, Vol.49(12), pp.
1373-1379, 1996.
[32] E. Vittinghoff, and C. E. McCulloch. “Relaxing the
rule of ten events per variable in logistic and Cox
regression.” American journal of epidemiology, Vol. 165,
no. 6, pp. 710-718, 2007.
[33] C.K. Enders, “Applied Missing Data Analysis” New
York: Guilford Press, 2010. Print
[34] J.L. Schafer, L. Joseph, and J.W. Graham, “Missing
data: our view of the state of the art,” in Psychological
methods 7.2, pp 147, 2002.
[35] D.W. Hosmer Jr, W. David, S. Lemeshow, and
R.X. Sturdivant, “Applied logistic regression,” John
Wiley and Sons, Vol. 398, 2013.