An Utterance Verification System for Word Naming Therapy in Aphasia
David S. Barbera¹, Mark Huckvale², Victoria Fleming¹, Emily Upton¹, Henry Coley-Fisher¹, Ian
Shaw³, William Latham⁴, Alexander P. Leff¹, Jenny Crinion¹
¹ Institute of Cognitive Neuroscience, University College London, U.K.
² Speech, Hearing & Phonetic Sciences, University College London, U.K.
³ Technical Consultant at SoftV, U.K.
⁴ Goldsmiths College, University of London, U.K.
david.barbera.16@ucl.ac.uk
Abstract
Anomia (word finding difficulties) is the hallmark of aphasia,
an acquired language disorder most commonly caused by
stroke. Assessment of speech performance using picture
naming tasks is therefore a key method for identification of the
disorder and monitoring patients’ response to treatment
interventions. Currently, this assessment is conducted manually
by speech and language therapists (SLT). Surprisingly, despite
advancements in ASR and artificial intelligence with
technologies like deep learning, research on developing
automated systems for this task has been scarce. Here we
present an utterance verification system incorporating a deep
learning element that classifies ‘correct’/‘incorrect’ naming
attempts from aphasic stroke patients. When tested on 8 native
British-English speaking aphasics the system’s performance
accuracy ranged from 83.6% to 93.6%, with a 10-fold cross-validation
mean of 89.5%. This performance was not only
significantly better than one of the leading commercially
available ASRs (Google speech-to-text service) but also
comparable in some instances with two independent SLT
ratings for the same dataset.
Index Terms: speech disorders, word naming, aphasia
1. Introduction
Word retrieval difficulty, or anomia, is the most pervasive
symptom of post-stroke aphasia [1]. Recent data suggest there
are around 350,000 people in the UK alone who have chronic
aphasia post-stroke [2]. Despite the prevalence of aphasia, few
patients receive a sufficient dose of speech and language
therapy to recover maximally. For example, in the UK through
the National Health Service patients receive on average 8-12
hours when the recommended dose to see a significant change
is in the order of 100 hours [3]. Assessment of patients’ spoken
picture naming abilities and then practising repetitively over
time a range of vocabulary using spoken picture naming tasks
is an integral part of anomia treatment [4]. The intervention is
primarily administered by a speech and language therapist
(SLT), and the patient is confronted with a picture or drawing
of an object to name. An automatic speech recognition (ASR)
system that could reliably assess patients’ speech
performance on these picture naming tests would not only offer
increased consistency and sensitivity to changes in patients’
speech abilities but also enable patients to perform these tests
independently of SLTs, potentially remotely, away from the clinic
and in the comfort of their own home. This would not only ‘free up’ clinicians to deliver more complex interventions in their
‘face-to-face’ time but also support more patients who are
unable to travel into the clinic, a need which has become more
pressing in light of recent COVID-19 travel restrictions.
1.1. ASR for single-word naming performance in aphasia
Different to single and isolated spoken word recognition,
assessing spoken picture naming performance has the
advantage that the target word is known. Therefore, the
challenge for ASR in this context is actually to verify that a
particular target word is uttered in a given segment of speech
[5]. Furthermore, an ASR-based system, or utterance verifier,
within a therapy app must immediately provide binary
‘correct’/‘incorrect’ feedback to the patient for each
spoken naming attempt, often over thousands of trials repeated over
time.
To the best of our knowledge, only two groups have used
and assessed an ASR-based system of this type for
single-word picture naming in people with aphasia. In the Vithea project
[6], researchers developed an aphasia treatment app for
Portuguese speakers. Their in-house ASR engine, AUDIMUS [7],
used a keyword spotting technique to score
spoken naming attempts as ‘correct’/‘incorrect’ and achieved an
average accuracy of 82%, ranging between 69% and 93%
across patients [5]. The second group [8] evaluated a digitally
delivered picture naming intervention in native Australian
English speaking people with apraxia plus aphasia. They used
the open-source ASR engine CMU PocketSphinx [9] to provide
patients with ‘correct’/’incorrect’ feedback. For 124 words,
which were phonetically different, they reported an overall
ASR accuracy of 80% and a range of scores between 65.1% and
82.8% across patients, depending on impairment severity. Both
these systems provide useful ‘proof-of-concept’ data that ASR
systems for anomia assessment are feasible. Still, the high error
rates and variable performance across aphasic patients mean that their
clinical utility remains low.
This project aims to present and assess the feasibility of a
tailor-made system incorporating a deep learning element to
assess word naming attempts in people with aphasia. We will
provide an open-access implementation of our system and
trained models online for researchers, therapists and clinicians
interested in adopting this approach.
2. An utterance verifier for word naming
Given the scarcity of speech corpora in aphasia, we used a
template-based system for picture naming verification. We
built on the framework developed by Ann Lee and James Glass
in “A comparison-based approach to mispronunciation
detection” [10]. Their ASR system was developed to detect
word-level mispronunciations in non-native speech. It was
initially designed to be language agnostic. It works by
comparing a word uttered by a native speaker, or teacher, with
the same word uttered by a non-native speaker or student. It
relies on posteriorgram based pattern matching via a dynamic
time warping (DTW) algorithm to compare the utterances. Our
system replaced their Gaussian Mixture Model trained on
unlabeled corpora with an acoustic model based on a deep
neural architecture trained on English corpora from healthy
speakers to generate phone-based posteriors. Then, similar to
Lee’s teacher-versus-student framework, we compare healthy-versus-aphasic utterances. We defined a posteriorgram as a
vector of posterior probabilities over phoneme classes in the
English language for which we employed the ARPAbet system
as used in the BEEP dictionary [11] consisting of 45 symbols:
44 ARPAbet symbols plus silence. To enable future clinical
utility of our system, we developed it to run embedded on
mobile devices without sophisticated model compression
techniques.
2.1. Signal pre-processing and acoustic modelling
Speech recordings were pre-processed in overlapping frames of
30 milliseconds every 10 milliseconds, and a fast Fourier
transform size of 64 milliseconds after a pre-emphasis filter of
0.95 to obtain a vector of 26 acoustic features per frame: 12
Mel-frequency cepstral coefficients (with a final liftering step
with a coefficient of 23 applied to them), energy and 13 deltas.
See steps 1 and 2 in Figure 1.
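As a rough illustration of steps 1 and 2, the sketch below applies the pre-emphasis filter and cuts a signal into 30 ms frames every 10 ms. It assumes the 16 kHz sampling rate of the recordings described in Section 3.3; the function names are hypothetical, and the cepstral analysis itself is omitted.

```python
SAMPLE_RATE = 16000  # Hz; assumed from the 16 kHz recordings used in this study


def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a number of samples."""
    return sample_rate * ms // 1000


def pre_emphasis(signal, coeff=0.95):
    """Apply y[t] = x[t] - coeff * x[t-1]; the first sample is left unchanged."""
    if not signal:
        return []
    return [signal[0]] + [signal[t] - coeff * signal[t - 1]
                          for t in range(1, len(signal))]


def frame_signal(signal, win_ms=30, hop_ms=10, sample_rate=SAMPLE_RATE):
    """Split a signal into overlapping frames, dropping an incomplete last frame."""
    win = ms_to_samples(win_ms, sample_rate)
    hop = ms_to_samples(hop_ms, sample_rate)
    if len(signal) < win:
        return []
    n_frames = 1 + (len(signal) - win) // hop
    return [signal[i * hop:i * hop + win] for i in range(n_frames)]


one_second = [0.0] * SAMPLE_RATE  # placeholder signal, not real speech
frames = frame_signal(pre_emphasis(one_second))
print(len(frames), len(frames[0]), ms_to_samples(64))
```

Under these assumptions, one second of audio yields 98 frames of 480 samples each, and a 64 ms FFT corresponds to 1024 points.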
To train our acoustic model, we used WSJCAM0 [12], a corpus of healthy
British English speakers, to match the native spoken
language of our patients. WSJCAM0 offers phone-level
transcriptions using the ARPAbet phone set for British English.
We used the Keras deep learning framework [13] with
TensorFlow [14] as the back-end. All our models used batch
normalisation, a dropout rate of 0.5 and categorical cross-entropy over 45 classes as the loss function. Training lasted
until there was no improvement in accuracy for 50 epochs. We
explored several types and configurations of recurrent neural
networks and chose as our final model the one with the lowest
Phone Error Rate (PER) on the WSJCAM0 test set. Our
winning model was a Bidirectional GRU [15] of 128 units and
7 layers of depth trained with the Adam optimiser [16], resulting
in around 2 million parameters and achieving a segment-based
PER of 15.85%. See step 3 in Figure 1.
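As a rough sanity check on the reported model size, the sketch below counts the weights of a stacked bidirectional GRU. It rests on several assumptions that the text does not state: 26 input features per frame, 128 units in each direction, one bias vector per gate, a 45-way softmax output layer on top, and no batch-normalisation parameters. The helper names are hypothetical.

```python
def gru_params(input_dim, units):
    """Weights of one GRU direction: 3 gates, each with input weights,
    recurrent weights and one bias vector (assumed)."""
    return 3 * (units * (input_dim + units) + units)


def bigru_stack_params(n_features=26, units=128, n_layers=7, n_classes=45):
    """Total weights of a stack of bidirectional GRU layers plus a dense output."""
    total = 0
    input_dim = n_features
    for _ in range(n_layers):
        total += 2 * gru_params(input_dim, units)  # forward and backward directions
        input_dim = 2 * units  # concatenated directions feed the next layer
    total += input_dim * n_classes + n_classes  # assumed dense softmax layer
    return total


print(bigru_stack_params())
```

Under these assumptions the count comes to roughly 1.9 million weights, which is in line with the figure of around 2 million parameters quoted above.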
Figure 1. From signal to posterior probabilities. Left to
right: speech signal is fragmented into frames every
10 milliseconds of a window size of 30 milliseconds
(1), from each frame a vector of acoustic features is
extracted (2) then each vector is fed to a Deep Neural
Network (3) which outputs a vector of posterior
probabilities or posteriorgram (4).
2.2. Comparison of utterances
Our system uses two recordings from native healthy speakers
for each target word, which are transformed into posteriorgrams
offline via our DNN, as shown in Figure 1 (steps 1-4). Each
naming attempt by an aphasic speaker is transformed into
posteriorgrams using our DNN and then compared to each of
the posteriorgrams from the two healthy speakers via the DTW
algorithm as in [10]; see Figure 2. Adapting Lee’s notation,
given a sequence of posteriorgrams for the healthy speaker
$H = (p_{h_1}, p_{h_2}, \ldots, p_{h_n})$ and the aphasic speaker
$A = (p_{a_1}, p_{a_2}, \ldots, p_{a_m})$, an $n \times m$ distance matrix can be defined
using the following inner product:
$$\varphi_{ha}(i, j) = -\log\left(p_{h_i} \cdot p_{a_j}\right) \qquad (1)$$
For such a distance matrix, DTW will search for the path from
(1,1) to (𝑛, 𝑚) that minimises the accumulated distance.
Different from Lee’s work, we used the minimum of the DTW
accumulated distances for all comparisons with the two healthy
speakers to make a final decision.
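The comparison step can be sketched as follows. This is a minimal illustration rather than the implementation used here: it assumes the three standard DTW step directions (horizontal, vertical and diagonal), uses the raw accumulated distance without any length normalisation, and floors the inner product at a small `eps` to avoid taking the logarithm of zero. All function names and the toy posteriorgrams are hypothetical.

```python
import math


def frame_distance(p, q, eps=1e-12):
    """Equation (1): negative log of the inner product of two posterior vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))  # eps floor is an assumption


def dtw_distance(healthy, aphasic):
    """Minimum accumulated distance along a path from (1, 1) to (n, m)."""
    n, m = len(healthy), len(aphasic)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(healthy[i - 1], aphasic[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def verify_attempt(attempt, references, threshold):
    """Take the minimum DTW distance over all references and compare to a threshold."""
    best = min(dtw_distance(ref, attempt) for ref in references)
    return best < threshold, best


# Toy two-class posteriorgrams, invented for illustration only.
ref_a = [[1.0, 0.0], [0.0, 1.0]]
ref_b = [[0.0, 1.0], [1.0, 0.0]]
attempt = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(verify_attempt(attempt, [ref_a, ref_b], threshold=0.675))
```

Taking the minimum over references means that a single close match with either healthy speaker is enough for the attempt to be accepted.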
Figure 2. An utterance verification system for word
naming. Given a naming attempt, e.g. target word tree,
the voice of an aphasic patient is recorded and
processed through our DNN to generate
posteriorgrams (1). The system keeps posteriorgrams
of previously recorded healthy speakers’ utterances
for each target word, (2a and 2b). Posteriorgrams are
compared using the DTW algorithm yielding a
distance number between 0 and +∞ (3a and 3b). The
minimum of both distances is selected (4) and
compared to a set threshold (5) calibrated per speaker,
in this example 0.675. If the distance is less than the
threshold then the decision is that the aphasic speaker
has uttered the target word correctly, otherwise it is
classified as incorrect.
3. Experiment and data
3.1. Participants
Eight native English speakers (6 male) with chronic anomia following
aphasic stroke were recruited. Demographics are shown in
Table 1 below. Inclusion criteria were chronic aphasia in the
absence of speech apraxia (severe motor speech impairment) as
evidenced by: (i) impaired naming ability on the object naming
subtest of the Comprehensive Aphasia Test [17], where scores below
38 are classified as impaired; and (ii) good single word repetition
on the same test (normative cut-off > 12). All patients gave
written consent, and data were processed in accordance with
current GDPR guidelines. Ethical approval was granted
by NRES Committee East of England – Cambridge, 18/EE/228.
Table 1. Demographic and clinical data of the patients

Patient ID   Sex   Age      Months post-stroke   CAT Object naming   CAT Repetition
P1           M     65       108                  32                  19
P2           M     58       90                   19                  22
P3           M     70       91                   10                  28
P4           F     62       21                   28                  24
P5           M     64       14                   6                   25
P6           M     59       98                   30                  31
P7           M     57       109                  27                  24
P8           F     82       38                   29                  23
Mean (SD)          65 (8)   71 (40)              23 (10)             25 (4)

3.2. Stimuli
Picture naming stimuli consisted of 220 coloured drawings.
They were selected from the top 2000 most frequent words
using the Zipf index of the SUBTLEX-UK corpus [18], keeping
the same distribution of parts of speech for nouns, verbs and
adjectives.
3.3. Dataset Collection
We used a tailor-made gamified picture naming treatment app
developed in Unity on an Android tablet (Samsung SM-T820) to
deliver the picture stimuli and record the patients’ speech
responses. Patients wore a Sennheiser SC 665 USB headset to
obtain the speech recordings at 16 kHz, which were then stored
in a compliant WAVE-formatted file using 16-bit PCM
encoding.
Patients were instructed to name each item presented
on screen as quickly and accurately as possible using a single
word response. They were given up to 6 seconds to complete
each picture naming attempt. The item presentation order was
randomised across patients. An SLT was present throughout the
assessment and scored the naming responses online in a
separate file without giving the patient any performance
feedback. A total of 1760 speech recordings (220 words × 8
patients) were acquired.
3.4. Procedure
The SLT classified all naming attempts into one of the
following categories: “Correct”, “No Response”, “Filler”,
“Phonological Error”, “Circumlocution” and “Other”. When
patients produced multiple speech responses, only the most
representative response was selected. For example, when a
patient response was scored as ‘Filler’ and the corresponding
recording comprised multiple ‘um’, ‘ah’, ‘eh’, only one of
those attempts was selected to create a single-utterance naming
attempt per item. These single-utterance recordings were the
data used to evaluate our spoken word verification system and
the baseline. Each naming attempt was then re-labelled as
‘correct’ or ‘incorrect’, and this last classification was used as
the ground truth to evaluate the performance of our system and
of the baseline. Figure 3 describes the dataset and each patient’s
naming performance.
Figure 3. Each patient’s picture naming performance
on the 220 item test, as classified by a speech and
language therapist (SLT).
3.4.1. Inter-SLT-rater Agreement
A second SLT independently rated all patients’ naming
attempts to obtain an SLT ‘gold-standard’ performance metric.
Inter-SLT-rater reliability was high, with an overall
Cohen’s kappa of 0.92, ranging between 0.84 and 0.99 across
patients. To compare our system to the gold-standard, the
agreement between SLT raters was calculated on all
reported metrics (accuracy, F1-score, Pearson’s r).
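For binary ‘correct’/‘incorrect’ labels, the agreement metrics mentioned above can all be computed from the four cells of a confusion matrix. The sketch below is illustrative only: the function names and toy labels are invented, Pearson's r is computed in its binary form (the phi coefficient), and degenerate cases with a zero denominator simply return 0.0.

```python
import math


def confusion(reference, other):
    """Count (tp, fp, fn, tn) for two equal-length lists of 0/1 labels."""
    pairs = list(zip(reference, other))
    tp = sum(1 for r, o in pairs if r == 1 and o == 1)
    fp = sum(1 for r, o in pairs if r == 0 and o == 1)
    fn = sum(1 for r, o in pairs if r == 1 and o == 0)
    tn = sum(1 for r, o in pairs if r == 0 and o == 0)
    return tp, fp, fn, tn


def accuracy(reference, other):
    tp, fp, fn, tn = confusion(reference, other)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(reference, other):
    tp, fp, fn, _ = confusion(reference, other)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pearson_r(reference, other):
    """Pearson correlation of two binary variables (phi coefficient)."""
    tp, fp, fn, tn = confusion(reference, other)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(reference, other):
    """Observed agreement corrected for the agreement expected by chance."""
    tp, fp, fn, tn = confusion(reference, other)
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Toy ratings (1 = correct, 0 = incorrect), invented for illustration.
slt1 = [1, 1, 1, 0, 0, 0, 1, 0]
slt2 = [1, 1, 0, 0, 0, 1, 1, 0]
print(accuracy(slt1, slt2), f1_score(slt1, slt2),
      pearson_r(slt1, slt2), cohen_kappa(slt1, slt2))
```

On the toy ratings, the two raters agree on 6 of 8 items, so accuracy and F1 are both 0.75 while kappa and Pearson's r are 0.5, which shows how chance correction lowers the agreement estimate.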
3.4.2. ASR Baseline
We used a commercially available ASR engine, Google’s
standard speech-to-text service configured for British English
(date used: 24/3/20), to create a baseline against which to compare
the performance of our utterance verification system. For each
aphasic patient’s naming attempt, the same recording used to test our
system was sent to Google’s server and a transcription
obtained. If the target word was found in the transcript, the
attempt was classified as ‘correct’; otherwise it was classified as ‘incorrect’.
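The decision rule applied to the baseline transcripts can be sketched as a simple word search. The sketch assumes case-insensitive matching on whole words, which the text does not specify, and it takes the transcript as a plain string; the call to the speech-to-text service is deliberately left out. The function name is hypothetical.

```python
import re


def baseline_decision(transcript, target):
    """Return 'correct' if the target word occurs as a whole word in the transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return "correct" if target.lower() in words else "incorrect"


print(baseline_decision("it is a tree", "tree"))
print(baseline_decision("street", "tree"))
print(baseline_decision("", "tree"))
```

Whole-word matching is a design choice in this sketch: a plain substring search would accept ‘street’ for the target ‘tree’.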
4. Results
4.1. System Performance
As indicated in section 2, our system utilised a set threshold to
make a final decision on marking a patient’s naming attempt
either ‘correct’ or ‘incorrect’. Two ways of calculating the best
threshold were evaluated offline: one that was fixed after
optimising it across all patients, and one that was adapted per
patient after optimising for each patient separately.
Performance results are shown in Table 2. Where differences were significant,
pairwise McNemar post-hoc tests with Bonferroni correction
were calculated. Fixed and adapted versions of our system were
significantly better than the baseline with p<0.05 and p<0.005,
respectively.
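The two calibration schemes can be illustrated with a small sketch. It assumes that the threshold is chosen to maximise accuracy against the SLT labels, a criterion the text does not state, and it searches the midpoints between sorted DTW distances. The patient identifiers, distances, labels and function names below are invented for illustration.

```python
def accuracy_at(threshold, distances, labels):
    """Accuracy when attempts with distance below the threshold are marked correct."""
    hits = sum((d < threshold) == bool(y) for d, y in zip(distances, labels))
    return hits / len(labels)


def best_threshold(distances, labels):
    """Pick the candidate threshold with the highest accuracy (assumed criterion)."""
    ds = sorted(set(distances))
    candidates = [ds[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(ds, ds[1:])]
    candidates.append(ds[-1] + 1.0)
    return max(candidates, key=lambda t: accuracy_at(t, distances, labels))


# Toy data per patient: (DTW distances, SLT labels with 1 = correct).
data = {
    "PA": ([0.2, 0.4, 0.5, 0.9, 1.1], [1, 1, 1, 0, 0]),
    "PB": ([0.6, 0.95, 1.0, 1.4, 1.6], [1, 1, 1, 0, 0]),
}

# 'Adapted': one threshold per patient.
adapted = {pid: best_threshold(d, y) for pid, (d, y) in data.items()}

# 'Fixed': a single threshold optimised over all patients pooled together.
pooled_d = [d for ds, _ in data.values() for d in ds]
pooled_y = [y for _, ys in data.values() for y in ys]
fixed = best_threshold(pooled_d, pooled_y)

for pid, (d, y) in data.items():
    print(pid, accuracy_at(fixed, d, y), accuracy_at(adapted[pid], d, y))
```

In this toy example the pooled threshold misclassifies one attempt from the first patient, whereas the per-patient thresholds separate both patients' attempts perfectly.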
Table 2. Overall performance of our system (fixed and
adapted versions) and the commercial baseline. A
second SLT scoring (SLT2) is also shown.

System     Accuracy   Pearson’s r   F1-Score
baseline   0.882      0.727         0.795
fixed      0.905      0.784         0.855
adapted    0.913      0.807         0.871
SLT2       0.965      0.921         0.947
Performance per patient is illustrated in Figure 4, and the
significance of these results is shown in Table 3.

Figure 4. Comparison of performance between (i) a
commercial baseline, (ii) the ‘fixed’ version of our
system, (iii) the ‘adapted’ version, and (iv) a second
independent SLT. The higher the score, the better the
performance.

Table 3. Post hoc significance testing per patient;
pairwise McNemar test with Bonferroni correction.
*** p<0.0005, ** p<0.005, * p<0.05 and NS, non-significant.

Pair               P1    P2    P3    P4    P5    P6    P7    P8
fixed-baseline     NS    NS    NS    NS    NS    **    *     **
adapted-baseline   NS    NS    NS    NS    NS    ***   *     ***
fixed-adapted      NS    NS    NS    NS    NS    NS    NS    NS
baseline-SLT2      ***   **    *     NS    NS    ***   ***   ***
fixed-SLT2         *     **    **    *     ***   NS    NS    NS
adapted-SLT2       *     **    **    NS    **    NS    NS    NS

The fixed and adapted versions performed significantly better
than the baseline and comparable to the second SLT rater for
patients 6, 7 and 8. For the remaining patients, our system did
not differ significantly from the baseline. The fixed and adapted
versions were not significantly different from each other.
4.2. System Cross-validation
Generalisation of the adapted version of our system to unseen
data was assessed offline using cross-validation. The
assumption, in this case, was that previously collected speech
samples from patients could be used to optimise the system’s
decision threshold. For each patient, a 10-fold cross-validation
procedure was applied, and the average performance across
folds is reported; see Table 4. Accuracy was high for all patients,
ranging from 83.6% to 93.6% (a range of 10 percentage points),
with a group average of 89.5%.

Table 4. Results for a 10-fold cross-validation for each
patient of the adapted system. For each patient the
average across all folds is reported as Mean (±SD).

Patient     Accuracy        F1-Score        Pearson’s r
P1          0.93 (±0.068)   0.89 (±0.106)   0.85 (±0.149)
P2          0.84 (±0.082)   0.78 (±0.116)   0.67 (±0.162)
P3          0.88 (±0.055)   0.51 (±0.247)   0.46 (±0.278)
P4          0.94 (±0.055)   0.89 (±0.088)   0.85 (±0.123)
P5          0.87 (±0.060)   0.61 (±0.247)   0.56 (±0.261)
P6          0.93 (±0.071)   0.91 (±0.104)   0.85 (±0.150)
P7          0.87 (±0.081)   0.90 (±0.065)   0.72 (±0.183)
P8          0.90 (±0.038)   0.85 (±0.067)   0.79 (±0.087)
Mean (SD)   0.895 (0.03)    0.790 (0.14)    0.718 (0.14)
Min         0.836           0.506           0.462
Max         0.936           0.905           0.852
Range       0.1             0.399           0.389

¹ https://github.com/DavidBarbera/WNUVforPWA

5. Conclusion
We present here a tailor-made system based on a deep
learning architecture to automatically assess word naming
attempts for people with aphasia. In a sample of eight patients’
1760 naming attempts, our system performed significantly
better than the commercial baseline (Google STT service) and,
in some instances, was comparable to the gold-standard SLT scoring.
Given the scarcity of aphasic speech corpora, this represents a
significant step towards creating a reliable and automatic
spoken word assessment system for aphasic speakers and offers
clinical practice a deployable preliminary solution for further
research and optimisation of similar systems.
Future work will focus on analysing the effects of live
feedback on digitally delivered naming interventions. We will
adapt our current system to parse large volumes of aphasic
speech recordings of word naming attempts offline. Also, given
the language-agnostic framework our system is based upon, it
will be interesting to see if our system can be used in other
languages despite being initially trained in English. This would
offer an invaluable tool for aphasic speakers of under-researched languages.
Our system is available open-source to encourage
reproducibility and further development in this field; we
welcome further insights and collaborations.¹
6. Acknowledgements
DB and the ASR technical development were funded by a
Medical Research Council iCASE PhD studentship, award
number 1803748. JC is funded by a Wellcome Senior Clinical
Fellowship.
7. References
[1] M. Laine, Anomia: theoretical and clinical aspects. Hove: Psychology, 2006.
[2] ‘Stroke Association’, Stroke Association, 2018. https://www.stroke.org.uk/ (accessed Nov. 28, 2018).
[3] S. K. Bhogal, R. W. Teasell, N. C. Foley, and M. R. Speechley, ‘Rehabilitation of Aphasia: More Is Better’, Topics in Stroke Rehabilitation, vol. 10, no. 2, pp. 66–76, Jul. 2003, doi: 10.1310/RCM8-5TUL-NC5D-BX58.
[4] A. Whitworth, J. Webster, and D. Howard, A cognitive neuropsychological approach to assessment and intervention in aphasia: a clinician’s guide, Second edition. London; New York: Psychology Press, 2014.
[5] A. Abad et al., ‘Automatic word naming recognition for an online aphasia treatment system’, Computer Speech & Language, vol. 27, no. 6, pp. 1235–1248, Sep. 2013, doi: 10.1016/j.csl.2012.10.003.
[6] A. Pompili et al., ‘An online system for remote treatment of aphasia’, in Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, 2011, pp. 1–10, Accessed: Sep. 26, 2017. [Online]. Available: http://dl.acm.org/citation.cfm?id=2140501.
[7] H. Meinedo, D. Caseiro, J. Neto, and I. Trancoso, ‘AUDIMUS.MEDIA: A Broadcast News Speech Recognition System for the European Portuguese Language’, in Computational Processing of the Portuguese Language, vol. 2721, N. J. Mamede, I. Trancoso, J. Baptista, and M. das Graças Volpe Nunes, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 9–17.
[8] K. J. Ballard, N. M. Etter, S. Shen, P. Monroe, and C. T. Tan, ‘Feasibility of Automatic Speech Recognition for Providing Feedback During Tablet-Based Treatment for Apraxia of Speech Plus Aphasia’, American Journal of Speech-Language Pathology, vol. 28, no. 2S, pp. 818–834, Jul. 2019, doi: 10.1044/2018_AJSLP-MSC18-18-0109.
[9] cmusphinx/pocketsphinx. cmusphinx, 2020.
[10] A. Lee and J. Glass, ‘A comparison-based approach to mispronunciation detection’, in 2012 IEEE Spoken Language Technology Workshop (SLT), Miami, FL, USA, Dec. 2012, pp. 382–387, doi: 10.1109/SLT.2012.6424254.
[11] T. Robinson, ‘BEEP dictionary’, BEEP dictionary, 1996. http://svr-www.eng.cam.ac.uk/comp.speech/Section1/Lexical/beep.html (accessed Nov. 09, 2018).
[12] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, ‘Wsjcam0: A British English Speech Corpus For Large Vocabulary Continuous Speech Recognition’, in Proc. ICASSP 95, 1995, pp. 81–84.
[13] F. Chollet and others, Keras. 2015.
[14] M. Abadi et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015.
[15] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, ‘Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling’, arXiv:1412.3555 [cs], Dec. 2014, Accessed: Jan. 21, 2019. [Online]. Available: http://arxiv.org/abs/1412.3555.
[16] D. P. Kingma and J. Ba, ‘Adam: A Method for Stochastic Optimisation’, arXiv:1412.6980 [cs], Dec. 2014, Accessed: Sep. 16, 2017. [Online]. Available: http://arxiv.org/abs/1412.6980.
[17] K. Swinburn, G. Porter, and D. Howard, Comprehensive aphasia test: CAT. Hove: Psychology Press, 2004.
[18] W. J. B. van Heuven, P. Mandera, E. Keuleers, and M. Brysbaert, ‘Subtlex-UK: A New and Improved Word Frequency Database for British English’, Quarterly Journal of Experimental Psychology, vol. 67, no. 6, pp. 1176–1190, Jun. 2014, doi: 10.1080/17470218.2013.850521.