Portable mTBI Assessment Using Temporal and Frequency Analysis of Speech

Louis Daudet, Matthew Perez, Christian Poellabauer
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA
ldaudet@nd.edu, mperez14@nd.edu, cpoellab@nd.edu

Nikhil Yadav
Division of Computer Science, Mathematics, and Science, St. John's University, Queens, NY, USA
yadavn@stjohns.edu

Sandra Schneider
Department of Communicative Sciences and Disorders, Saint Mary's College, Notre Dame, IN, USA
sschneider@saintmarys.edu

Alan Huebner
Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA
alan.huebner.10@nd.edu
ABSTRACT
This paper shows that extraction and analysis of various
acoustic features from speech using mobile devices can allow
the detection of patterns that could be indicative of neurological trauma. This may pave the way for new types of
biomarkers and diagnostic tools. Toward this end, we created a mobile application designed to diagnose mild traumatic brain injuries (mTBI) such as concussions. Using
this application, data was collected from youth athletes from
47 high schools and colleges in the Midwestern United
States. In this paper, we focus on the design of a methodology to collect speech data, the extraction of various temporal and frequency metrics from that data, and the statistical
analysis of these metrics to find patterns that are indicative
of a concussion. Our results suggest a strong correlation
between certain temporal and frequency features and the
likelihood of a concussion.
Keywords: speech analysis, voice pathology, portable
diagnostics, concussions
1. INTRODUCTION
Speech recognition is a standard feature in current mobile
devices, e.g., applications often make use of speech recognition to perform certain user-driven actions. While it has
previously been shown that changes in motor speech production often accompany progressive neurological diseases
and neurotrauma (i.e., traumatic brain injury or TBI), only
recent advances in speech analysis and mobile technologies
have made it possible to develop speech-based diagnostic
and assessment tools for various health conditions, specifically focusing on neurodevelopmental conditions, neurodegenerative diseases, and traumatic injuries. Speech processing for health diagnostics and assessment in clinical settings
can be performed in the cloud, i.e., speech is captured on
a portable device, transmitted to a speech processing service that extracts and analyzes various acoustic features,
and sends the results back to the device for the practitioner
to interpret. However, in many settings, this approach is
not feasible, e.g., whenever assessment must occur in real time in non-clinical environments. Examples include sideline speech assessment of athletes (e.g., due to a suspected
concussion) in areas with no or poor wireless connectivity,
as well as the assessment of military personnel deployed in
remote areas. With the arrival of mobile devices (such as
smartphones and tablets) with relatively large processing,
energy, and storage resources, it is becoming increasingly
realistic to develop health tools on such devices that not
only capture the human voice, but also perform processing,
filtering, analysis, and diagnostics in near real-time [1, 2, 3].
The technical and computational challenges for the deployment of lightweight speech recognition applications on mobile devices are daunting, e.g., while mobile devices are increasingly powerful, they still have much stricter resource
limitations than other computing systems [4]. It can also
be challenging to build such systems due to the proprietary
nature of (desktop or server-based) speech recognition software toolkits, which are often provided without access to
source code. However, with recent advances on both the
software side (e.g., more efficient algorithms) and the hardware side (e.g., resource-richer devices), mobile speech recognition applications have become available [5, 6], thereby
making speech-based diagnostics on mobile devices possible.
In this paper, we first present a speech analysis application
using open source tools and software focusing on the extraction and analysis of temporal features of speech. We then expand on these results by adding an analysis of features from
the frequency domain. In consultation with speech-language pathologists, a set of reading tests has been designed to measure a variety of speech features such as speaking rate, word
duration, or pitch and intensity fluctuation. Stimuli are selected to test maximum movement, strength, accuracy, and
range of the different oral motor structures (i.e., tongue, soft
palate, lips). We present our approach to data collection and
analysis with an emphasis on data collected from over 2500
athletes from 47 high schools and colleges in the Midwestern
United States using this application.
2. RELATED WORK
Several studies [7, 8, 9] have explored acoustic measures
to help distinguish various types of dysarthria (i.e., motor
speech disorders) due to neurotrauma (e.g., traumatic brain
injury) or neurodegenerative diseases (e.g., Parkinson’s Disease, Amyotrophic Lateral Sclerosis, Huntington’s Disease,
etc.). Exploring rhythm abnormalities [7, 8] and vowel metrics [9], the authors found that for individuals with Parkinson’s Disease, significant changes in the articulation rate (a
measure of rate of speaking excluding pauses) and pause
times (periods of absence of speech) can be detected. While acoustic differences in vowels due to different types of dysarthria can also be detected, listeners' perceptions of these changes were variable. The latter suggests that acoustic measures may be more accurate in detecting early signs of neurodegenerative disease processes.
While there are many studies [10, 11] examining speech
changes in individuals with moderate to severe TBI, there
is limited literature on speech changes due to mild TBI or
concussions. Currently, the diagnosis of concussion and its
ramifications for neurological diseases and disorders remain elusive, which is why it is so important to have an instrument that can detect changes in speech that will serve as a
biomarker for neurological wellness. In other studies [11, 12],
the authors investigated the possible impact of concussions
on speech. They focused on data collected from collegiate
boxers, where the acoustic features of pitch, formant frequencies, jitter (cycle-to-cycle variations of the fundamental
frequency), and shimmer (variability of peak-to-peak amplitude) were explored. While preliminary in focus, there was
substantial evidence that even mild traumatic brain injuries
leave their fingerprints in the human voice.
Pitch and formant changes have been used to detect early
signs of autism in infants’ cry melodies and other oral productions of young children [13, 14]. In [13], the authors
found that children with a high risk of being on the autism
spectrum had pain cries with a higher fundamental formant
frequency than children with a lower risk. In [14], they found
that children later diagnosed to be on the autism spectrum
had more monotonous voices than children without autism,
with their pitch contours being less complex.
Recent research using speech analysis for early Alzheimer’s
disease detection [15] emphasized the non-invasive nature of
using speech analysis for health diagnostics and monitoring.
Using emotional temperature, a metric of their own design
to measure emotional response, and the analysis of automatic spontaneous speech, they were able to differentiate
with some success between patients with Alzheimer’s disease
and a control group. This method could assist in making an
earlier diagnosis than what is offered by current diagnostic
methods that rely on extensive cognitive and medical assessment when the disease is already advanced enough for its strongest effects to become noticeable.
Figure 1: Speech recognition system block diagram
3. METHODS
3.1 Speech Recognition
An overview of the architecture of the third-party speech
recognition and analysis system used in this project (described in more detail in Section 4.2) is shown in Figure 1.
Users are asked to go through a series of tests, where words,
sentences, and sounds are produced and captured by a low-impedance, unidirectional dynamic microphone attached to
the mobile device. The external microphone minimizes the
effects of background noise and cross-talk, making the captured signal as clear as possible. This cannot be achieved by
the built-in microphone on the device as it is not sensitive
enough to filter these discrepancies, which are a product of
the recording environment. The integrity of the recording is
maintained by detecting its noise content using the methods
highlighted in [16]. If the voiced or unvoiced SNR (signal-to-noise ratio) is below a threshold, the user is prompted to
retake the test. If not, the speech recording is saved into
a speech repository, either locally on the mobile device, or
remotely in cloud storage. Each recording is sent to a feature extraction tool, where feature frames are extracted and
then passed on to the speech recognition software. This software includes a decoder, which in turn consists of a linguistic
component that receives information from a knowledge base
comprised of the following:
• Lexicon/Dictionary: This maps words to their pronunciations. Single words can have multiple pronunciations, which are represented in phonetic units. Dictionaries can vary in size from just a few words to
hundreds of thousands.
• Language Model: Describes what is likely to be spoken. A stochastic approach is employed where word
transitions are defined in terms of transition probabilities, e.g., in the sentence “Shut the door”, a noun
is expected after “the”, hence the language model can
constrain the search space to only nouns in the dictionary, reducing computational overhead.
• Acoustic model: This is a collection of trained statistical models, each representing a phonetic unit of
speech. They are trained by analyzing large corpora
of speech. Acoustic models are not necessarily constrained by the speaker, in that they can be generated
for a group of speakers or an individual. In our case,
we use a group-based acoustic model.
The decoder forms the backbone of the speech recognizer.
It selects the next set of likely states, assigns a score for
incoming features based on these states, and eliminates the
low scoring states to generate results. State selection by the
decoder is driven by the knowledge base, where the grammar
selects the next set of possible words. The dictionary is used
to collect pronunciations for the words. An acoustic model
is used to collect Hidden Markov Models (HMMs) for each
pronunciation, from which transition probabilities are used
to select the next set of states.
The linguist component of the decoder obtains word pronunciations, probabilities, transitions, and state information
from the knowledge base as shown in Figure 1. These are
used to generate nodes in HMMs that represent the speech
samples. HMMs are commonly used in speech recognition
software to calculate the likelihood of each state [17, 18].
The HMMs used in the third-party speech recognition software in this project are specialized versions that emit
observations from an observation sequence O from left-toright with probabilities defined in a Probability Distribution Function (PDF). Backward transitions are not allowed.
Each of the states in the HMM is represented by a Gaussian
mixture of density functions. In other words, the linguist
translates the rules provided by the user into a grammar
that the search manager can use. Based on this grammar,
the function of the search manager is then to build a tree of
possibilities for what the signal might be, and then search
the tree to find the best hypothesis. To do the latter, the
search manager uses the acoustic scorer. Its role is to compute, for a given input vector, the state output probability.
It provides these to the search module on demand. Using
them, the search module can then prune the tree of possibilities until a best hypothesis is found [25]. This best hypothesis is then output by the decoder together with its timing
boundaries. The computation complexity for these tasks can
be significant and in the past, mobile devices lacked a fast
virtual memory subsystem and a complete processing library
(most notably in C/C++) to handle the computational complexity of automatic speech recognition (ASR) tasks.
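As a generic illustration of how such a decoder scores hypotheses, the sketch below shows Viterbi decoding over a left-to-right HMM whose states emit observations through diagonal-covariance Gaussian mixtures. This is a textbook sketch only, not Sphinx's actual implementation, which uses fixed-point acoustic scoring, beam pruning, and a lexical search tree.

```python
import numpy as np

def log_gmm(x, weights, means, variances):
    """Log-likelihood of frame x under a diagonal-covariance Gaussian mixture."""
    x = np.asarray(x)
    comp = []
    for w, m, v in zip(weights, means, variances):
        ll = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        comp.append(np.log(w) + ll)
    return np.logaddexp.reduce(comp)

def viterbi_left_to_right(frames, states, log_trans):
    """Best state path for a left-to-right HMM (no backward transitions).

    `states` is a list of (weights, means, variances) GMM parameters per state;
    `log_trans` is an (n_states x n_states) numpy array of log transition
    probabilities, with -inf for backward moves.
    """
    n_states, n_frames = len(states), len(frames)
    score = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)

    score[0, 0] = log_gmm(frames[0], *states[0])   # must start in the first state
    for t in range(1, n_frames):
        for j in range(n_states):
            prev = score[t - 1] + log_trans[:, j]
            back[t, j] = np.argmax(prev)
            score[t, j] = prev[back[t, j]] + log_gmm(frames[t], *states[j])

    path = [n_states - 1]                           # must end in the last state
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```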
3.2 Temporal and Frequency Speech Features
In this work, we consider both temporal and frequency features in speech to detect signs of a potential mTBI. The
temporal features that are extracted together with their descriptions are listed in Table 1. Average duration has been
chosen as a feature since it has been extensively studied
and has been found to provide useful information regarding
speaking rate, dialect, phonetic context, stress, and specific
characteristics (i.e., gender, age, neurological status) of the
speaker [19, 20, 21]. The same applies to the metric relating
to stress. When the participants have to put stress on a particular word, they have to adjust their respiratory system to
reflect that increase in stress, the coordination of which is
anticipated to be more difficult for an individual suffering
from a diffuse head injury such as a concussion. In the same
way, a participant reading a continuous passage of speech
will tax the motor speech system to make sure all words are
coordinated and stress is timed appropriately.
Further, we also measure the diadochokinetic (DDK) rate.
The DDK rate measures the speed, strength, steadiness, and
accuracy of rapid, repetitive motor speech movements. In
our test, these sounds are Pa, Ta, and Pa-Ta-Ka [22]. This
part of the data collection is designed to emphasize various
oral motor skills such as articulation, respiration, tone control, and phonation. The DDK rate test determines if there
are problems in the speech mechanisms that control motor
skills or speech planning functions. In the case of a participant having suffered an mTBI, we anticipate, based on the work in [23, 24], that different temporal metrics will change: the speed/rate is expected to be either abnormally fast or slow, the frequency of iterations (steadiness) is expected to fluctuate, and the strength of the repetition is expected to become either hyper- or hypokinetic.
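To make the relationship between the DDK measures in Table 1 explicit, here is a minimal sketch that assumes the consonant-vowel onset times have already been detected (for example by a syllable detector such as the one in [22]); the onset values below are purely illustrative.

```python
import numpy as np

def ddk_metrics(onsets):
    """DDK measures from detected consonant-vowel (C-V) onset times, in seconds."""
    onsets = np.asarray(onsets)
    periods = np.diff(onsets)                      # time between successive C-V onsets
    avg_period = periods.mean()                    # Average DDK Period
    avg_rate = 1.0 / avg_period                    # Average DDK Rate (vocalizations/s)
    sdev_period = periods.std()                    # Standard Deviation in DDK Period
    cov_period = 100.0 * sdev_period / avg_period  # Coefficient of Variation (%)
    return avg_period, avg_rate, sdev_period, cov_period

# Example: a fairly steady "pa" repetition at roughly 5 syllables per second.
print(ddk_metrics([0.00, 0.21, 0.40, 0.61, 0.80, 1.01, 1.20]))
```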
In the frequency domain, several metrics are being considered as shown in Table 2. At this point, all metrics are
extracted off device (i.e., on a remote server), but their extraction does not necessarily require any special software
and thus could easily be performed on-device (although the
performance degradation due to hardware constraints is to
be determined). The first metrics that we compute are the
average pitch for the entire sound file and the standard deviation of this pitch. This is an easy metric to compute, as
there are many tools available to compute the pitch for every window of a given time interval over a given sound file.
In this paper, we used the popular Praat1 software tool to
perform these measurements. We also compute the average
power and its standard deviation as they are also easy to extract and popular in speech assessment. However, analysis
of these features has to be performed carefully since small
variations in the data capture (e.g., changes in the placement of the microphone) can lead to drastic changes in the
recorded intensity of the voice samples.
When listening to the recorded speech of concussed participants, several frequency domain features appear to show
consistent differences (compared to non-concussed participants). Common patterns include increased monotonicity of
the voice, increased tone fluctuations, and increased stress.
To translate these perceptive assessments into quantifiable
metrics, we first measured the pitch (or power) of the sound
file for every window of 10ms of voiced time in the sound
files. From these measurements, we computed the average
pitch for the sound file and then used that average to analyze
the speech signal’s pitch values to identify when they crossed
that average. We first considered measuring the amount of pitch variation by counting the number of times the pitch values crossed the average pitch. However, that approach made very steady recordings with numerous small fluctuations around the average appear more variable than recordings with fewer but larger fluctuations. Instead, we computed a weight for each of these crossings (seen in Figure 2). That way, very steady recordings with numerous fluctuations just around the average accumulate many small weights that add up to a small value, while recordings with fewer but larger fluctuations accumulate a few large weights that add up to a large value.
1 www.praat.org, http://www.fon.hum.uva.nl/praat
Table 1 Temporal Acoustic Metrics and Features Extracted

Average Duration: Average duration taken to say a word in the test
Standard Deviation in Duration: The standard deviation in durations of words being spoken
Stressed Word Duration: The time taken to say a word while stressing it
Stress Pause: Pause time before saying the stressed word
Average Syllable Duration: The average syllable duration in a continuous passage of speech
Average Pause Duration: The average pause duration in a continuous passage of speech. Notable pauses indicate a possible concussion
Average Diadochokinetic Rate (DDK): The number of consonant-vowel (C-V) vocalizations per second
Average Diadochokinetic (DDK) Period: The average time between consonant and vowel vocalizations (inverse of the rate)
Standard Deviation in DDK Period: The standard deviation of the DDK period (in ms)
Coefficient of Variation in DDK Period: This parameter measures the degree of rate variation in the period (%). If the C-V vocalization is repeated with little variation in rate, then this number is very small. However, as a speaker varies the rate of DDK during the seven-second analysis window, this number increases. This parameter assesses the participant's ability to maintain a constant rate of C-V combinations
Figure 2: For every time the pitch values go across
the average pitch, a new data point is computed for
the variation of the pitch’s amplitude
Figure 3: For every time the pitch values go across
the average pitch, a new data point is computed for
the time component of the pitch’s variance
After having computed all of the crossing points’ weights,
we added them together to obtain a value for the entire
file, providing a measure for the variance of the pitch. To
see if that variance was constant throughout the file, we
then computed the average and standard deviation of that
variance. In addition to the variance of the amplitude of the
pitch, we also wanted to get a sense of the frequency of the
variance of the pitch. To do so, for each point at which the pitch crosses the average-pitch threshold, we computed a new value by averaging the time elapsed before and after that crossing. This process can be seen in Figure 3. Again, we
wanted to be able to see if the frequency of the variance
was steady or not, and thus we computed the average and
the standard deviation of this frequency. This methodology
that we followed to extract metrics related to the pitch was
similarly applied to extract metrics related to the intensity
of the sound signal. The details of the frequency domain
features extracted in our work are presented in Table 2.
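Below is a simplified Python sketch of one way to compute these crossing-based metrics. It assumes that the weight of a crossing is the peak deviation from the average within the excursion that follows it, and that the time component averages the intervals before and after each crossing; it is meant to illustrate the idea rather than reproduce the exact implementation.

```python
import numpy as np

def crossing_metrics(track, step=0.01):
    """Pitch (or power) variation metrics from a track sampled every `step` s.

    `track` is a 1-D array of voiced pitch values (or power values); unvoiced
    frames are assumed to have been removed already.
    """
    centered = track - track.mean()

    # Indices where the signal crosses its average (sign change between frames).
    crossings = np.where(np.diff(np.signbit(centered)))[0]
    if len(crossings) < 3:
        return None  # not enough fluctuation to compute the metrics

    # Amplitude weight of each crossing: largest deviation from the average
    # within the excursion between this crossing and the next one (Figure 2).
    segments = [centered[a + 1:b + 1] for a, b in zip(crossings[:-1], crossings[1:])]
    amp_weights = np.array([np.abs(seg).max() for seg in segments])

    # Time component: average of the interval before and after each crossing (Figure 3).
    intervals = np.diff(crossings) * step
    time_weights = (intervals[:-1] + intervals[1:]) / 2.0

    return {
        "variation": amp_weights.sum(),                        # "Pitch Variation" in Table 2
        "avg_variation": amp_weights.mean(),
        "std_variation": amp_weights.std(),
        "freq_variation": time_weights.sum(),                  # "Frequency of Pitch Variation"
        "avg_freq_variation": time_weights.sum() / len(crossings),
        "std_freq_variation": intervals.std(),                 # std of time between fluctuations
    }
```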
4. IMPLEMENTATION
4.1 Data Collection
Concussion testing typically involves recording a pre-season
baseline for a subject representing their “healthy” state. Subsequent post-baseline recordings are classified and tagged as
either being normal or from a subject who has a suspected
concussive injury. Post-baseline recordings are compared to
the initial baseline. Intuitively, subjects with a concussive
injury should display a pattern that is atypical of a healthy
control group. Speech was recorded on an iOS mobile device (iPad mini) fitted with a low-impedance Shure SM10A
microphone2 designed for close-talk, headworn applications such as remote-site sports broadcasting; speech was sampled at 44.1 kHz, 16-bit, mono. A custom mobile application was designed
for a multi-syllabic word reading test, which asks users to
read out a sequence of words as they appear on a mobile device screen at given time intervals. The words were carefully
2 http://www.shure.com/americas/products/microphones/sm/sm10a-headworn-microphone
Table 2 Frequency Acoustic Metrics and Features Extracted

Average Pitch: Average pitch from a speech sample
Pitch Standard Deviation: The standard deviation of the pitch in a speech sample
Pitch Variation: How many times the pitch goes above or below the pitch average in a speech sample, weighted by how much it deviates from that average
Average Pitch Variation: Average of the pitch variation
Pitch Variation Standard Deviation: Standard deviation of the pitch variation
Frequency of Pitch Variation: This metric is computed by adding together all the weights for the time component of the pitch variance, as seen in Figure 3
Average Frequency of Pitch Variation: Computed by averaging the frequency of pitch variation metric by the number of crossing points
Standard Deviation of the Frequency of Pitch Variation: Standard deviation of the time between fluctuations in pitch
Average Power: Average power from a speech sample
Power Standard Deviation: The standard deviation of the power in a speech sample
Power Variation: How many times the power deviates from the power average in a speech sample, weighted by how much it deviates from that average
Average Power Variation: Average of the power variation
Power Variation Standard Deviation: Standard deviation of the power variation
Frequency of Power Variation: This metric is computed by adding together all the weights for the time component of the power variance, similarly to what is seen in Figure 3 for pitch
Average Frequency of Power Variation: Computed by averaging the frequency of power variation metric by the number of crossing points
Standard Deviation of the Frequency of Power Variation: Standard deviation of the time between fluctuations in power
handpicked after consultation with speech language pathologists. The details of the tests used are shown in Table 3;
the words and sounds are selected in a way that will require
the users to use different parts of the speech production system (front and back of the mouth, soft palate in the back of
the throat). Each of the tests was designed and selected for
specific reasons, as described in Table 4.
The setup for the test is shown in Figure 4. The noise management techniques and signal-to-noise ratio threshold proposed in [16] were used. This approach rejects a test when
the voiced or unvoiced SNRs drop below a threshold. Feedback is provided based on these values to convey whether
the environment is noisy, the microphone is not placed optimally, or the user is not speaking loud enough (speech intensity is also measured). A test retake is recommended in
each of these cases. A sample waveform and corresponding
spectrogram of a recording are shown in Figure 5, generated using the Praat software. The waveform signal shows
the amplitude of seven of the words. The broadband spectrogram shows the spectral energy of the sound over time.
The red dots represent the formants; blue lines represent the
speaker’s pitch, and the yellow line, faint at the bottom of
the frequency graph, represents intensity.
The recorded speech to be analyzed was then transferred
to a local application repository (disk storage) after being
down-sampled to 16 kHz, both on the server and on the mobile devices. The downsampling was necessary due to the selection of the speech processing library, Sphinx3, which can only process 16 kHz signals. The original 44.1 kHz signals were retained for quality purposes for use in future research. The downsampled files are approximately 12 s long and only occupy 415 KB in storage each.

3 http://cmusphinx.sourceforge.net/
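For illustration, the 44.1 kHz to 16 kHz downsampling step could be scripted as in the following sketch; SciPy and 16-bit mono WAV input are assumptions, and the actual pipeline components may differ on the server and devices.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

rate, samples = wavfile.read("recording_44k.wav")   # 44.1 kHz, 16-bit mono capture
assert rate == 44100

# 44100 * 160 / 441 = 16000 Hz, the rate expected by the Sphinx models used here.
downsampled = resample_poly(samples.astype(np.float64), up=160, down=441)
downsampled = np.clip(downsampled, -32768, 32767).astype(np.int16)

wavfile.write("recording_16k.wav", 16000, downsampled)
```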
Figure 4: Recording setup
4.2 Speech Recognition
4.2.1 Decoder
The Sphinx speech recognition software toolkit [26, 27] was
selected as the speech decoder due to its open-source nature. In particular, the lightweight Pocketsphinx implementation was selected since it can easily be ported to mobile
devices. The software is written in C/C++, making it ideal
for the iOS mobile platform, which has a C/C++ virtual
memory subsystem. Pocketsphinx has been optimized for
speed on mobile device processors, e.g., the feature extraction has been changed to fixed point, which has resulted in
a significant improvement in speed [4].
Table 3 Speech Test Details

Test 1 (1.5s per word). Words or Sounds: application, participate, education, difficulty, congratulations, possibility, mathematical, opportunity. Description: Multisyllabic words are shown on the device's screen for exactly 1.5 seconds each. It is anticipated that participants with mTBI would find it more difficult to accurately produce these words and there may be induced delays in this operation. Temporal Metrics Extracted: Average Duration / Standard Deviation in Duration.

Test 2 (10s). Words or Sounds: put the book here. Description: Stress is placed on different words in the sentence (PUT, BOOK, HERE) that were displayed in bold at each of the iterations. This is a standard test used to measure stress, but timing parameters can be extracted from this as well. Temporal Metrics Extracted: Stressed Word Duration / Stress Pause.

Test 3 (5s). Words or Sounds: we saw several wild animals. Description: The standard syllabic rate may be affected and cause perceptual differences in articulation. Temporal Metrics Extracted: Average Syllable Duration / Average Pause Duration.

Test 4 (5s). Words or Sounds: pa. Description: Participants repeat the pa sound as quickly as possible. Temporal Metrics Extracted: Average DDK Period / Average DDK Rate / Standard Deviation in DDK Period / Coefficient of Variation in DDK Period.

Test 5 (5s). Words or Sounds: ka. Description: Participants repeat the ka sound as quickly as possible. Similar to the previous test. Temporal Metrics Extracted: Average DDK Period / Average DDK Rate / Standard Deviation in DDK Period / Coefficient of Variation in DDK Period.

Test 6 (5s). Words or Sounds: pa-ta-ka. Description: Similar to the category 4 and 5 tests. Alternating sounds are used to measure sequential motion rate. Temporal Metrics Extracted: Average DDK Period / Average DDK Rate / Standard Deviation in DDK Period / Coefficient of Variation in DDK Period.
Table 4 Speech Test Reasonings

Test 1: The multi-syllabic words (4 syllables) chosen for this test contain front, middle, and back vowels and bilabial, alveolar, velar, and glide consonants, allowing for maximum oral structure movement.
Test 2: By requesting individuals to emphasize the highlighted word in the sentence, stress, rhythm, amplitude, and frequency can be measured.
Test 3: The sentence used was controlled for syllable length (including 1- and 3-syllable words) containing front, middle, and back vowels and consonants, measuring accuracy of articulatory production and movement, and syllable duration.
Test 4: Tests 4 and 5 are used to measure the diadochokinetic rate (the ability to rapidly, accurately, and steadily produce the neutral vowel with a front and back consonant sound); a measure of the accuracy of alternating motor movements.
Test 5: See explanation for test 4.
Test 6: This test assesses sequential diadochokinetic motion rate by measuring the accuracy, rate, and duration of each syllable produced.
4.2.2 Knowledge Base

• Lexicon/Dictionary: The Carnegie Mellon University pronouncing dictionary for general American English4 that comes packaged with PocketSphinx was used (39 phones). The phonetic decomposition of the words used in test 1 (multisyllabic words) from this corpus is shown in Table 5 with all possible pronunciations.

• Language Model: A JSGF (Java Speech Grammar Format)5 grammar file is defined for the multi-syllabic word test. A fixed text corpus is used, since our grammar is restricted and subjects speak a known set of words.

• Acoustic model: The generic hub4 Wall Street Journal (WSJ) [28, 29] acoustic model that comes prepackaged with Sphinx was used. It is a comprehensive model that has been trained on over 140 hours of speech.

4 http://www.speech.cs.cmu.edu/cgi-bin/cmudict
5 http://www.w3.org/TR/jsgf
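To make the decoder configuration concrete, the sketch below wires a restricted JSGF grammar, the CMU dictionary, and an acoustic model into Pocketsphinx and prints per-word timing boundaries similar to those in Table 6. It uses the older pocketsphinx Python bindings purely for illustration; module paths, option names, and file locations are assumptions and differ across Pocketsphinx releases (the application itself uses the C/C++ API on iOS).

```python
# Sketch: decoding one multi-syllabic test recording with Pocketsphinx and a
# restricted JSGF grammar, then printing per-word timing boundaries.
import wave

from pocketsphinx.pocketsphinx import Decoder  # legacy bindings (assumed)

JSGF_GRAMMAR = """#JSGF V1.0;
grammar mtbi_test1;
public <word> = application | participate | education | difficulty |
                congratulations | possibility | mathematical | opportunity ;
"""

with open("test1.gram", "w") as f:
    f.write(JSGF_GRAMMAR)

config = Decoder.default_config()
config.set_string("-hmm", "model/en-us")          # WSJ/hub4-style acoustic model (assumed path)
config.set_string("-dict", "cmudict-en-us.dict")  # CMU pronouncing dictionary (assumed path)
config.set_string("-jsgf", "test1.gram")          # restricted grammar for test 1
decoder = Decoder(config)

# Recordings are 16 kHz, 16-bit, mono after downsampling.
wav = wave.open("subject1_test1_16k.wav", "rb")
decoder.start_utt()
decoder.process_raw(wav.readframes(wav.getnframes()), False, True)
decoder.end_utt()

# Segments include <sil> regions; at the default 100 frames/s, frame/100 gives seconds.
for seg in decoder.seg():
    print(f"{seg.word:>18s}  {seg.start_frame / 100.0:6.2f}s  {seg.end_frame / 100.0:6.2f}s")
```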
Figure 5: Sample waveform and spectrogram for multi-syllabic reading test (amplitude over time; spectrogram frequency in Hz over time in seconds)
Table 5 Phonetic breakdown of words used in multi-syllabic reading test

WORD             PHONETIC BREAKDOWN
participate      P AA R T IH S AH P EY T
application      AE P L AH K EY SH AH N
education        EH JH AH K EY SH AH N
difficulty       D IH F AH K AH L T IY
difficulty (2)   D IH F IH K AH L T IY
congratulations  K AH N G R AE CH AH L EY SH AH N Z
possibility      P AA S AH B IH L AH T IY
mathematical     M AE TH AH M AE T IH K AH L
opportunity      AA P ER T UW N AH T IY
Figure 6: Power histogram before SNR algorithm

5. RESULTS
5.1 Noise Management
The method described in [16] was used to determine a threshold value for the voiced and unvoiced SNR below which a
speech signal is considered noisy. The code used to compute the SNR values from the sound files is a modified version of code written by Antoine Fillinger and Vincent Stanford (adapted with their help). The description of the code presented in this paper is based on a description of an earlier version of the code by Jon Fiscus6. The program estimates the SNR values of a file using a logarithmic
function and the speech power of the sound file. In order
to estimate speech noise levels a signal energy histogram
is created. With the assumption that the recording is not
too noisy, we expect to see two different peaks in this distribution: one for the noise level on the left and one for
the speech level on the right. The noise distribution is estimated by fitting a raised cosine function to the left peak of
the RMS histogram. The fitting procedure first issues an estimate for the location, amplitude, and width of the leftmost peak. Then a pattern search algorithm called "direct search" [30] is applied to maximize the fit. With the best fit found, the midpoint of the raised cosine function is labeled as the mean noise power level, as seen in Figure 6. This raised cosine function, which estimates the noise power distribution, is then subtracted from the RMS power histogram in order to estimate the speech power distribution. The speech level is defined to be the bin midpoint where the 95th percentile occurs in the speech power histogram, as shown in Figure 7. The noise level is then subtracted from the speech level to obtain the SNR.
6 http://labrosa.ee.columbia.edu/projects/snreval/
Figure 7: Power histogram after SNR algorithm
This method relies on the fact that the analyzed sound file
is a mix of two distinct power distributions, one emanating
from the signal, and one emanating from the noise. It is
important to note that if the noise and speech distributions
are close to one another (i.e., a very noisy recording), then
this technique will produce unreliable results.
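A simplified sketch of this histogram-based estimate is shown below; it replaces the raised-cosine fit and direct search with a simple histogram-peak estimate of the noise level, so it only approximates the NIST algorithm described above. Frame RMS powers in dB are assumed as input, and a recording would then be rejected when the resulting SNR falls below the thresholds discussed next.

```python
import numpy as np

def estimate_snr(frame_power_db, n_bins=100):
    """Rough SNR estimate from per-frame RMS power values (in dB).

    Simplification of the histogram method described above: the noise level is
    taken as the largest histogram peak among the lower-power bins (instead of
    a raised-cosine fit refined by direct search), and the speech level as the
    95th percentile of the frames above the noise level.
    """
    counts, edges = np.histogram(frame_power_db, bins=n_bins)
    mids = 0.5 * (edges[:-1] + edges[1:])

    # Noise level: dominant peak in the lower half of the power range.
    lower = n_bins // 2
    noise_level = mids[np.argmax(counts[:lower])]

    # Speech level: 95th percentile of the remaining (higher-power) frames.
    speech_frames = frame_power_db[frame_power_db > noise_level]
    if speech_frames.size == 0:
        return 0.0  # degenerate case: no frames above the noise peak
    speech_level = np.percentile(speech_frames, 95)

    return speech_level - noise_level
```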
We then had to compute the threshold value for the SNR
below which to reject the recordings. 110 speech recordings
were taken from our speech repository and a noise signal
was added to them at varying intensities. The accuracy
of the ASR using Sphinx in identifying the original timing
with the added noise was evaluated for the given files. At
first, the noise signal was added to the speech signal at full
power, then the noise signal’s power was reduced by varying
amounts before being added to the speech signal decreasing
its effect on the SNR and accuracy. This reduction in signal intensity is what is described in Figure 8 as “power drop
in dB”. This figure shows the averaged voiced and unvoiced
SNR values impacted by induced noise, and the correspond-
ing accuracy for temporal breakdown of words using Sphinx.
For accurate ASR operation, the average voiced SNR value
of the speech recording should be above 28dB, and the average unvoiced SNR reading should be above 16dB. These
were the criteria used for selecting the signal and identifying
it for processing.
Table 6 Sphinx output for a single test file

WORD             START TIME (s)  END TIME (s)
<sil>            0               1.78
participate      1.79            2.52
<sil>            2.53            3.11
application      3.12            3.86
<sil>            3.87            4.5
education        4.51            5.19
<sil>            5.2             6.04
difficulty       6.05            6.74
<sil>            6.75            7.42
congratulations  7.43            8.44
<sil>            8.45            8.97
possibility      8.98            9.68
<sil>            9.69            10.46
mathematical     10.47           11.18
<sil>            11.19           12
opportunity      12.01           12.7
<sil>            12.71           12.94
Figure 8: Impact of noise on ASR accuracy (for 110
subjects)
We finally modified the SNR measuring code to make it
faster. Specifically, we changed the number of passes made
by the “direct search” algorithm to maximize the fit of the
raised cosine function. By default, the code was making a
large number of passes to try to find the most accurate fit between the raised cosine function and the noise peak. Using the same speech recordings used to determine good SNR values, we iteratively lowered that number, each time measuring the difference between the raised cosine function produced with the lower number of passes and the one produced with the default number of passes. We wanted to keep that difference low, and found that we could reduce the number of passes from more than a hundred down to ten and still accomplish that goal. At ten passes, all the sound files received SNR values that varied from the original results by less than 10%.
5.2 Correctness
The accuracy in decoding the speech was analyzed by superimposing the extracted syllable timing boundaries obtained
using the Sphinx decoder, as shown in Table 6, with the
actual waveform. The <sil> in the sphinx output denotes
silence regions in the speech. Figure 9 shows how Sphinx
decomposes the waveforms of a recording.
As can be seen, it performed well provided the recording was "clean" and taken in a relatively controlled, noise-free environment. Only recordings collected from subjects who met these criteria were used in the analysis. The output from running Pocketsphinx on the mobile devices was identical to the output generated on a server. This was validated for multiple files, and implies the correctness of the Sphinx decoder irrespective of the operating platform. As for the frequency domain features, their extraction has not been tested on a mobile device, but since this extraction was done using only basic mathematical functions, there should be no drastic changes required to port them to a mobile device.
Figure 9: Temporal decomposition of speech using
Sphinx for subject 1 (the red lines indicate where
Sphinx marked the starting and ending times of the
uttered syllables)
5.3 Performance
Our concern was to highlight the performance of the temporal decomposition when run completely on a mobile device, making it ideal in cases where network connectivity is unavailable and where a server-side cloud solution would be impractical. We measured the average CPU processing time
per file obtained when running temporal analysis over ten
files. The iPad 2 and iPad mini performed temporal decomposition and gave feedback in the 1.5s range. The iPhone
4 gave the highest response time in the 2.5s region. As expected, the MacBook Air laptop performed exceptionally
well (<.1s). This indicated a much faster virtual memory
subsystem on the laptop, and also faster I/O handling. The
mobile application achieving the same had a basic single
view graphical user interface (GUI), however graphical tasks
contributed a maximum of only 3.2% of CPU activity across
the mobile devices.
As for the frequency domain features, their extraction has only been performed off device, so it is difficult to assess exactly how long it would take on a mobile device. That being said, it took about 30 seconds to extract these features on a MacBook Air for 580 sound files. According to Geekbench, a cross-platform benchmark used to measure a computer's processor and memory performance, our iPad devices are about ten times slower than a MacBook Air. Thus, the processing would take roughly 300 seconds on the iPad for these 580 sound files, or about 0.5 seconds per speech sample, making the extraction feasible on the iPad as well.
Table 7 Results from Statistical Analysis of Temporal Metrics

TEST  ACOUSTIC METRIC              Pr(>|z|)
1     Average Duration             0.0076
2     BOOK Stressed Word Duration  0.0237
4     Average DDK Period           0
4     Average DDK Rate             0.0237
4     sDev DDK Period              0.0141
4     Variance DDK Period          0.0192
6     Average DDK Rate             0.0412
5.4 Statistical Significance Tests of Temporal Features
Temporal speech test data from 486 controls (i.e., they had
no prior concussions and had a post-baseline recording) and
95 concussed subjects were collected and analyzed using a logistic regression approach. An established
rule-of-thumb for logistic regression modeling is that, in order to obtain stable results, the data must contain at least
10 events (i.e., concussions) for every predictor variable included in the model [31, 32]. Since we had 95 concussions,
this meant we could include at most nine predictors in our
model. We chose the timing acoustic metric predictors for all
six tests for the modeling. For a large number of recordings,
no useful speech features could be extracted, because of interferences such as background noise or other sounds (e.g.,
laughter, mis-pronounced words, etc.). Simply discarding
these observations from our relatively small data set was
not possible, so the technique of multiple imputation was
used to statistically “fill in” the missing data [33, 34]. The
results below are adjusted to take into account the fact that
data was imputed. The statistically significant timing predictors are shown in Table 7.
To assess the predictive power of the model we used the
method of receiver operating characteristic (ROC) curves.
A model that classified subjects no better than a coin flip
would yield a straight line from (0,0), to (1,1), i.e., the black
line in Figure 10; hence the area under this curve (AUC)
would be 0.50. Any increase in AUC from 0.50 indicates
better predictive power, with 0.80 considered to be excellent [35]. The red curve shows the ROC curve yielded by
the model fitted with the features included in Table 7. At its highest point, the curve is noticeably higher than the black diagonal. The AUC was computed to be 0.70; thus, the discrimination between concussions and non-concussions yielded by the model was at the lower bound of what is considered acceptable. This shows that the temporal acoustic features identified by our system could be useful biomarkers in concussion detection.
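As a simplified illustration of this modeling step: the actual analysis was performed in R with multiple imputation of missing observations, whereas the sketch below uses Python/scikit-learn, skips imputation, and assumes a hypothetical file and column layout.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical feature table: one row per subject, temporal metrics as columns,
# and a 0/1 label indicating a concussed recording.
data = pd.read_csv("temporal_features.csv")  # assumed file/column layout
features = ["avg_duration", "book_stressed_duration", "avg_ddk_period",
            "avg_ddk_rate", "sdev_ddk_period", "var_ddk_period"]
X = data[features]
y = data["concussed"]

model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]

# An AUC of 0.5 corresponds to chance; ~0.70 was obtained for the temporal features.
print("AUC:", roc_auc_score(y, scores))
fpr, tpr, _ = roc_curve(y, scores)  # points along the ROC curve
```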
5.5 Statistical Significance Tests of Spectral Features
The spectral features were extracted in a two step process.
First, a Praat script was used to extract the pitch and the
Figure 10: ROC Curve Using Statistically Significant Temporal Features
power for every 10ms increment from each sound file. When
the power was too low, the pitch and/or the power could be marked by Praat as unknown; every other interval received a value. This Praat script goes through each of the sound files, test by test, and creates a text file with the same name as the sound file containing the power and pitch data, with one line for every 10 ms. Once all the text files have been created, a second script, this time written in Python, produces the values described in Table 2. The results for these features are then added to the results of the temporal features in an Excel file. This Excel file is used as an input for an R script that goes through them all to determine the statistical significance (see Tables 7 and 8 for results), and then computes the AUC and draws the ROC curve for the chosen metrics. One ROC curve shows only the spectral
metrics with a p-value < .05, listed in Table 8. This curve,
shown in Figure 11, has an AUC of 0.80.
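For illustration, the first extraction step described above could look roughly like the following sketch, which uses the parselmouth Python interface to Praat instead of a Praat script; file names and the output format are assumptions.

```python
import numpy as np
import parselmouth  # Python interface to Praat (substitute for the Praat script)

def extract_pitch_track(wav_path, out_path, step=0.01):
    """Write the pitch (Hz) of a recording every 10 ms, one line per frame.

    Frames where Praat finds no pitch (unvoiced or too little power) are
    written as 'unknown', mirroring the described output. The intensity
    (power) track can be extracted analogously with snd.to_intensity().
    """
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(time_step=step)
    f0 = pitch.selected_array["frequency"]  # 0.0 where no pitch was found
    times = pitch.xs()

    with open(out_path, "w") as out:
        for t, value in zip(times, f0):
            f0_str = "unknown" if value == 0.0 else f"{value:.2f}"
            out.write(f"{t:.2f}\t{f0_str}\n")

extract_pitch_track("subject1_test3.wav", "subject1_test3_pitch.txt")
```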
For best results, a second ROC curve was drawn using the statistically relevant features from both the temporal and spectral domains. This ROC curve can be seen in Figure 12. When using both types of features, the AUC from the ROC curve increases to 0.86. It is important to note, though, that 23 features are used for this ROC curve. Since we tested these features against 98 concussed participants, only 4.26 concussed participants were available to support each feature, whereas 5 to 10 are typically recommended to avoid over-fitting of the model. With future data collection (and therefore more concussed participants recorded), we intend to revisit and revise these results as part of our future work.
Table 8 Results from Statistical Analysis of Frequency Metrics

TEST  ACOUSTIC METRIC             Pr(>|z|)
3     Std freq variance pitch     0
3     Freq variance pitch         0.0015
3     Freq variance amplitude     0.0024
3     Std variance amplitude      0.0177
3     Variance pitch              0.0296
5     Freq variance pitch         0.0185
5     Std variance amplitude      0.0194
5     Average variance pitch      0.0257
5     Std pitch                   0.0258
6     Std variance amplitude      0.0001
6     Average variance amplitude  0.0002
6     Std amplitude               0.0011
6     Average variance pitch      0.0043
6     Std pitch                   0.0115
6     Std variance pitch          0.0176
6     Freq variance amplitude     0.026
Figure 11: ROC Curve Using Statistically Significant Spectral Features
Figure 12: ROC Curve Using Both Statistically Significant Spectral and Temporal Features
6. CONCLUSIONS
This paper described a reading test to capture speech recordings from potentially concussed subjects, noise management techniques for such data collections, and feature extraction techniques in both the time and frequency domains. Various combinations of these features show great potential as speech biomarkers for mTBI. In our future work, we intend to increase our speech corpus and study additional acoustic features beyond the ones described in this work.

Acknowledgment

This research was supported in part by GE Health and the National Football League through the GE/NFL Head Health Challenge. The research was further supported in part by the National Science Foundation under Grant Number IIS-1450349. The authors would like to thank Vince Stanford of the National Institute of Standards and Technology (NIST) for his help in providing us with the SNR estimation algorithm for noise detection.

7. REFERENCES
[1] A. Waibel, A. Badran, A. W Black, R. Frederking,
D. Gates, A. Lavie, L. Levin, K. Lenzo, L. Mayfield
Tomokiyo, J. Reichert, T. Schultz, D. Wallace,
M. Woszczyna, and J. Zhang, “Speechalator: Two-way
speech-to-speech translation in your hand,” in
Proceedings of NAACL-HLT, 2003.
[2] H. Franco, J. Zheng, J. Butzberger, F. Cesari,
M. Frandsen, J. Arnold, V. R. R. Gadde, A. Stolcke,
and V. Abrash, “Dynaspeak: SRI’s scalable speech
recognizer for embedded and mobile systems,” in
Proceedings of HLT, 2002.
[3] T. W. Köhler, C. Fügen, S. Stüker, and A. Waibel,
“Rapid porting of ASR-systems to mobile devices,” in
Proceedings of Interspeech, 2005.
[4] D. Huggins-Daines et al., "Pocketsphinx: A free, real-time
continuous speech recognition system for hand held
devices,” International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2006.
[5] X. Lei, A. Senior, A. Gruenstein, and J. Sorensen,
“Accurate and compact large vocabulary speech
recognition on mobile devices.” in Interspeech, 2013.
[6] I. McGraw, R. Prabhavalkar, R. Alvarez, M.G. Arenas
“Personalized speech recognition on mobile devices” in
2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2016.
[7] S. Skodda and U. Schleggel, “Speech rate and rhythm
in Parkinson’s disease,” Mov Disord.,23(7), pp.
985–992, 2008.
[8] K.L. Lansford, and J.M. Liss. “Vowel acoustics in
dysarthria: Mapping to perception,” Journal of Speech,
Language, and Hearing Research, vol. 57.1, pp. 68-80,
2014.
[9] J.M. Liss, L. White, S.L. Mattys, K. Lansford,
A.J. Lotto, S.M. Spitzer, and J.N. Caviness,
“Quantifying Speech Rhythm Abnormalities in the
Dysarthrias,” JSLHR, vol. 52, pp. 1334-1352, 2009.
[10] A. D. Hinton-Bayre et al., “Mild head injury and
speed of information processing: a prospective study of
professional rugby league players,” Journal of Clinical
and Experimental Neuropsychology, vol. 19, pp.
275-289, 1997.
[11] M. Falcone, N. Yadav, C. Poellabauer, and P. Flynn,
“Using isolated vowel sounds for classification of mild
traumatic brain injury,” International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
2013.
[12] C. Poellabauer, N. Yadav, L. Daudet, S. Schneider,
C. Busso, and P. Flynn, “Challenges in Concussion
Detection Using Vocal Acoustic Biomarkers,” IEEE
Access, vol. 3, pp. 1143-1160, 2015.
[13] S. J. Sheinkopf, J. M. Iverson, M. L. Rinaldi, and
B. M. Lester, “Atypical Cry Acoustics in 6-Month-Old
Infants at Risk for Autism Spectrum Disorder,” Autism
Research, 5(5), pp. 331-339, October 2012.
[14] J. Brisson, K. Martel, J. Serres, S. Sirois, and J. L.
Adrien, “Acoustic Analysis of Oral Productions of
Infants Later Diagnosed with Autism and Their
Mother,” Infant Mental Health Journal, 35(3), pp.
285-295, 2014.
[15] K. Lopez-de-Ipina, J. B. Alonso, N. Barroso,
M. Faundez-Zanuy, M. Ecay, J. Sole-Casals, C. M.
Travieso, A. Estanga, and A. Ezeiza, “New Approaches
for Alzheimer’s Disease Diagnosis Based on Automatic
Spontaneous Speech Analysis and Emotional
Temperature,” Ambient Assisted Living and Home
Care, Lecture Notes in Computer Science, vol. 7657,
pp. 407-414, 2012.
[16] N. Yadav, L. Daudet, C. Poellabauer, and P. Flynn,
“Noise Management in Mobile Speech Based Health
Tools,” IEEE Healthcare Innovation and Point-of-Care
Technologies (HIC-POCT), 2014.
[17] D. B. Paul, “Speech Recognition Using Hidden
Markov Models,” The Lincoln Laboratory Journal, vol.
3, no. 1, 1990.
[18] M. Gales and S. Young, “The Application of Hidden
Markov Models in Speech Recognition,” Foundations
and Trends in Signal Processing, vol. 1, no. 3,
pp.195-304, 2008.
[19] T. Crystal and A. House, “Segmental durations in
connected speech signals: Current results,” The journal
of the acoustical society of America, vol. 83, no. 4,
pp.1553-1573, 1988.
[20] T. Crystal and A. House, “Segmental durations in
connected speech signals: Syllabic stress,” The journal
of the acoustical society of America, vol. 83, no. 4,
pp.1574-1585, 1988.
[21] F. Darley, A. Aronson, and J. Brown, “Clusters of
deviant speech dimensions in the dysarthrias,” Journal
of Speech, Language, and Hearing Research, 12.3,
pp.462-496, 1969.
[22] F. Tao, L. Daudet, C. Poellabauer, S. Schneider, and
C. Busso, “A Portable Automatic PA-TA-KA Syllable
Detection System to Derive Biomarkers for Neurological
Disorders”. Interspeech 2016, pp. 362-366, 2016.
[23] J.R. Duffy, “Motor Speech Disorders: Substrates,
Differential Diagnosis, and Management,” St. Louis,
MO, USA, Mosby, 3rd ed., 2005.
[24] F.L. Darley, A.E. Aronson, and J.R. Brown, “Motor
Speech Disorders,” Philadelphia, PA, USA: Saunders,
1975.
[25] P. Lamere, P. Kwok, E. Gouvea, B. Raj, R. Singh,
W. Walker, M. Warmuth, and P. Wolf, “The CMU
SPHINX-4 speech recognition system," IEEE Intl.
Conf. on Acoustics, Speech and Signal Processing, Hong
Kong. Vol. 1. 2003.
[26] K. F. Lee, H. W. Hon, and R. Reddy, “An overview of
the SPHINX speech recognition system,” IEEE
Transactions on Acoustics, Speech, and Signal
Processing, Vol. 38, No. 1 , 1990.
[27] A. Varela, H. Cuayáhuitl, and J. A. Nolazco-Flores,
“Creating a Mexican Spanish Version of the CMU
Sphinx-III Speech Recognition System,” Iberoamerican
Congress on Pattern Recognition (CIARP), Springer
Lecture Notes in Computer Science(LNCS), 2905, pp.
251-258, 2003.
[28] P. Placeway, S. Chen, M. Eskenazi, U. Jain,
V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K.
Seymore, M. Siegler, R. Stern, and E. Thayer, “The
1996 hub-4 sphinx-3 system,” in Proc. DARPA Speech
Recognition Workshop, 1997.
[29] D. Paul and J. Baker, “The design of the wall street
journal-based csr corpus,” in Proceedings of ARPA
Speech and Natural Language Processing Workshop, pp
357-362, 1992.
[30] R. Hooke and T.A. Jeeves. ““Direct Search” Solution
of Numerical and Statistical Problems,” Journal of the
ACM (JACM), vol. 8.2, pp. 212-229, 1961.
[31] P. Peduzzi, J. Concato, E. Kemper, T.R. Holford, and
A.R. Feinstein, “A simulation study of the number of
events per variable in logistic regression analysis,”
Journal of clinical epidemiology, Vol.49(12), pp.
1373-1379, 1996.
[32] E. Vittinghoff, and C. E. McCulloch. “Relaxing the
rule of ten events per variable in logistic and Cox
regression.” American journal of epidemiology, Vol. 165,
no. 6, pp. 710-718, 2007.
[33] C.K. Enders, “Applied Missing Data Analysis” New
York: Guilford Press, 2010. Print
[34] J.L. Schafer, L. Joseph, and J.W. Graham, “Missing
data: our view of the state of the art,” in Psychological
methods 7.2, pp 147, 2002.
[35] D.W. Hosmer Jr, W. David, S. Lemeshow, and
R.X. Sturdivant, “Applied logistic regression,” John
Wiley and Sons, Vol. 398, 2013.