Handling Numeric Expressions in Automatic Speech Recognition

Abstract

This paper addresses the problem of correctly formatting numeric expressions in automatic speech recognition (ASR) transcripts. This is challenging since the expected transcript format depends on the context, e.g., 1945 (year) vs. 19:45 (timestamp). We compare cascaded and end-to-end approaches to recognize and format numeric expression, such as years, timestamps, currency amounts, and quantities. For the end-to-end approach we employed a data generation strategy using a large language model (LLM) together with a text to speech (TTS) model to generate adaptation data. The results on our test dataset show that while approaches based on LLMs perform well on recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.

Index Terms: numeric expression formatting, automatic speech recognition.

1 Introduction

In the last decade, ASR systems improved tremendously in terms of word error rate (WER) due to more data, more computation power and better architectures [1, 2, 3]. These systems are normally trained with labeled ASR data, i.e., human transcribed speech or human correct automatically transcribed speech.

The way how numeric expressions are transcribed - using numeric literals, e.g., 1945, or number words, e.g., nineteen forty-five - can vary between different datasets of sometimes even within a dataset. Furthermore, dependent on the usage of the ASR system, different transcript formats might be preferred. For example when a video conference call is automatically subtitled using an ASR system, the readers might prefer numeric literals since they are shorter and easier to read.

On the other hand, transcripts of an ASR system containing numeric expressions should be formatted dependent on the context the numeric expressions occur in. For example the number word nineteen forty-five should be formatted as 1945 if it represents a year, as 19:45 if it represents a timestamp, as $19.45 if it represents a currency amount and as 1,945 if quantity is meant.

Often numeric expression formatting is not reflected in the WER because numeric expression formats get normalized before calculating the WER. However, proper formatting of numeric expressions is important because it heavily improves readability of the transcript.

Therefore, in this work, we tackle the problem of properly formatting numeric expressions in ASR transcripts. For this, we 1) created a test set containing the numeric expression types year, timestamp, currency amount and quantity, 2) propose a strategy using a LLM together with a TTS model to get synthetic data with which an end-to-end ASR system can be adapted (see figure 1 and section 3.1), and 3) compare cascaded and end-to-end approaches to recognize the numeric expression types (see sections 3.2 and 4). We show that while approaches based on LLMs perform well on recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.

2 Related Work

Numeric expression type	Number words	Wanted formatting
Year	in nineteen forty-five	in 1945
Timestamp	at quarter to eight (in the evening)	at 19:45
Currency amount (English)	one thousand dollars and fifty cents	$1,000.50
Currency amount (German)	eintausend Euro und fünfzig Cent	1.000,50€
Quantity (English)	two thousand pieces	2,000 pieces
Quantity (German)	zweitausend Teile	2.000 Teile

Table 1: Examples of wanted formatting of numeric expressions for different numeric expression types.

Until a few years ago, most ASR systems were trained to output lowercase transcripts without punctuation [4]. For this the transcripts of the training data got normalized. To get a transcript which contains casing and punctuation, inverse text normalization [5] was applied. This was done by applying a text segmentation model [6] after the transcription. For the text segmentation step auto regressive models similar to models used in machine translation can be used. To minimize the training-test mismatch in the input distribution such a text segmentation model should to be trained on hypotheses specific to some ASR model. Therefore, when changing the ASR model also the text segmentation model should be changed.

The lately introduced LLMs [7, 8, 9] can also be used as a text segmentation model, i.e. to reformat the ASR hypotheses. LLMs are pre-trained on a lot of text to predict the next token and then adapted to e.g. answer questions. In-context learning [10] can be used to increase the performance without changing the weights of the LLM by providing examples how questions should be answered.

On the other hand, ASR systems lately moved more and more towards end-to-end approaches where the transcript already contains casing and punctuation [3]. This has the advantage that only one model has to be executed decreasing latency and reducing maintenance effort. Furthermore, end-to-end approaches search for a global optimum and with enough training data this works well [3]. The drawback is that the formatting of numeric expressions in the transcript can not be easily changed with text-only data and the question is how to get ASR data with suitable numeric expression formatting.

We use a TTS model for synthetic data generation (see next section). Other works [11, 12] have shown that it is possible to use synthetic TTS data to improve ASR performance.

3 Experiments

3.1 Data

Refer to caption — Figure 1: Data generation: 1) Generate a sentence containing a numeric expression written down as number words, 2) Generate audio, 3) Convert the sentence such that the numeric expression is written using numeric literals.

Set	Utterances	Hours
Training	2409	2.85
Development	288	0.36
Test	909	1.53
Training-larger	3637	6.86

Table 2: Numeric expression dataset: Number of utterances and hours for the different parts of the generated numeric expressions dataset.

To adapt and test the numeric expression formatting of our models (see 3.2) we created a numeric expression dataset (see figure 1).

For this, we first used gpt3.5-turbo from OpenAI to generate sentences containing the different numeric expression types we consider written down as number words. This is done by a prompt like (the actually used prompt is a little more complex e.g. to make the LLM not output enumeration)

Generate {n} diverse [German (optional)] sentences containing a {numeric expression type} written down using number words.

With half of the executed prompts we generated English sentences and with the other half German sentences.

Then, we used the TTS model tts-1-hd from OpenAI (with voice randomly chosen) to generate audio. For this it was crucial to have the numeric expressions transcribed as number words since the TTS model did not produce correct output using numeric literals e.g. $19.45.

Third, we prompted the LLM to convert the number words to numeric literals in the wanted format (see table 1). This is done by a prompt like

Convert the {numeric expression type} in the sentences to numeric literals.

The output of this step is used a labels for the utterances.

To get a high quality dataset we applied some filtering using simple rules, e.g., output sentences of the third step not containing numeric literals were ignored.

The created data was then split in training, development and test sets. We noticed that the numeric expressions created by the LLM sometimes repeated. Therefore, we split the data such that the numeric expressions contained in the three sets were chosen to be pairwise disjoint (see table 2). Then, the test set was read by human annotators to collect real audio samples.

We noticed that for the end-to-end approach (see section 4) the performance on the timestamps was worse than the cascaded approaches. Therefore, we evaluated if more data generation could help and created more training data (Training-larger) containing timestamps with gpt-4o. This is done by a prompt like

Generate a [German (optional)] sentence containing the timestamp {timestamp} written down using number words.

For {timestamp} we iterate over many possibilities, e.g. for English one o’clock, quarter past one, half past one, quarter to one, two minutes past one, two minutes to one. For German we use equivalent translations. The seconds and third step of the data generation are the same as before.

To evaluate the general performance of our model, we report the WER on the Common voice [13] test sets in English and German. We filtered the test sets by excluding utterances containing numeric expressions because the numeric expressions contained in the labels are written down using number words and we tuned our models to output numeric literals. The English and German test sets each contain 2,000 utterances and 3.3 hours of audio.

Model	WER (%) Common voice EN	WER (%) Common voice DE	WER (%) Numeric expressions EN	WER (%) Numeric expressions DE
ASR only	13.5	10.5	8.7	14.6
ASR + mbart baseline	13.4	10.1	15.2	19.5
ASR + mbart numeric expressions	13.5	10.3	5.1	9.4
+ more data	13.4	10.3	3.8	9.3
ASR + gpt3.5-turbo	13.5	10.6	4.3	8.0
ASR + gpt4-turbo	13.5	10.4	3.5	7.4
ASR + gpt-4o	13.6	10.5	3.4	7.4
fine-tuned ASR	13.7	10.7	3.8	5.7
+ more data	13.7	10.9	3.3	5.8

Table 3: Results: WER (

\downarrow

) for English, German on a filtered version of Common voice not containing numeric expressions and the numeric expressions test set for the different approaches.

Model	Acc. (%) years	Acc. (%) timestamps	Acc. (%) currency amounts	Acc. (%) quantities	Acc. (%) average
ASR only	97.4	3.4	36.3	93.3	57.6
ASR + mbart baseline	74.3	1.0	4.4	8.6	22.1
ASR + mbart numeric expressions	96.6	20.3	71.1	91.4	69.9
+ more data	95.8	39.2	73.3	93.3	75.4
ASR + gpt3.5-turbo	96.6	64.9	79.3	96.3	84.3
ASR + gpt4-turbo	95.5	78.4	94.8	99.4	92.0
ASR + gpt-4o	95.5	83.8	95.6	99.4	93.6
fine-tuned ASR	97.7	63.9	97.0	99.4	89.5
+ more data	97.7	75.9	98.5	99.4	92.9

Table 4: Results: Accuracy (

\uparrow

) for years, timestamps, currency amounts and quantities on our numeric expressions test data for the different approaches.

3.2 Models and Approaches

We compare cascaded and end-to-end approaches for numeric expression formatting (see figure 2). For the cascaded approach we use a trained ASR model and reformat the output using a text segmentation model. For the end-to-end approach we adapt the trained ASR model by fine-tuning on the training set of our numeric expressions dataset (see section 3.1).

We use Whisper [3] (whisper-large-v2) as our baseline ASR model and for the text segmentation model we compare using a mbart-based model [14] (mbart-large-50) and LLMs.

We adapted the pretrained mbart model in two steps since we only have limited data for the second step. First, we fine-tuned it to predict the transcript labels of the Common voice training sets (excluding utterances containing numeric expressions similar to the test sets) given the ASR hypothesis generated by our baseline ASR model. This model we denote by mbart baseline. Second, we fine-tuned the model on the numeric expressions dataset to format numeric expressions correctly. This is done by using the sentences where numeric expressions are written down as number words as input and the corresponding sentences where numeric expressions are written down as numeric literals as labels. This model we denote by mbart numeric expressions. For both steps we froze the embedding of the model since this yielded better performance than not freezing it.

For the LLM we use GPT3.5 (gpt-3.5-turbo) or GPT 4 [15] (gpt4-turbo and gpt-4o) with in-context learning using one example for each numeric expression type (9 examples in total).

4 Results

The results can be seen in table 3 (WER) and table 4 (accuracy of the different numeric expression types).

We see that ASR + mbart baseline slightly improves the WER on the Commonvoice test sets due to the learned correction of the ASR hypothesis. However, the performance (WER and accuracy) on the numeric expressions test sets heavily decreased since the model was not trained to predict numeric literals.

The model ASR + mbart numeric expressions performs better and outperforms the ASR only model on the numeric expressions test sets, while there is not much difference on the Common voice test sets. However, the model struggles to format the timestamps (and currency amounts) correctly, e.g., the ASR hypothesis ”The library opens at 10 o’clock, but it’s best to arrive early.” is converted to ”The library opens at 17:00, but it’s best to arrive early.” This is probably due to the limited amount of numeric expressions data. Using more data did help a bit but the accuracy on e.g. timestamps is still less than 40%.

Using a LLM as text segmentation model sometimes (gpt3.5-turbo: 9.7%, gpt4-turbo: 1.4%, gpt-4o: 2.7%) does not follow the prompt, e.g., when the input sentence is a question, it is answered. This leads to a completely different transcript and increases in the WER. To circumvent this problem we compute the WER between input and output of the LLM and if the WER is larger than a threshold (0.5) we ignore to LLM output and return the input instead. With this, the LLMs clearly outperform the mbart-based model both in terms of WER and accuracy. The advantage of the LLMs is that they got trained on a lot more data. Most errors are caused by the LLM not following the prompt, e.g., in the sentence ”The bus leaves at five past seven.” the timestamps is not changed to 7:05. Furthermore, the most recently published LLMs perform better. Using an LLM which follows the prompt better would probably yield a bit better scores.

It is quite expensive to reformat each hypothesis using an LLM ( $\approx$ $15 for evaluating the ASR + gpt-4o approach on the 4.000 Common voice test sets sentences), especially if the goal is to provide the transcription to lots of customers. Therefore, we experimented adapting the ASR model end-to-end. The results show similar WER performance as the LLM-based approaches, however the WER on the German numeric expressions test set is a bit better. The improvement is due to the fact that for most of the timestamp data the baseline ASR model outputs e.g. for an audio containing ”Ich habe bis fünfzehn Uhr fünfundvierzig Zeit.” a transcript ”Ich habe bis 15.45 Uhr Zeit”. The conversion by the LLM does not remove the ”Uhr” which is counted as an error. The fine-tuned ASR model does not output this ”Uhr”. The accuracy of the fine-tuned ASR model with more data on the numeric expressions is better than the approaches ASR + gpt3.5-turbo / gpt4-turbo and only a bit worse than ASR + gpt-4o. The largest difference is on the timestamp numeric expressions. For example the audio containing ”The deadline is at thirty minutes past three.” was transcribed ”The deadline is 03:30.” While this could be correct, a deadline is more likely not to be within the night and the better world knowledge helps the LLMs here to output ”15:30”.

With our data generation strategy it is easy to add more formatting rules, e.g., new currency symbols, by performing more augmenting data and adapting the model.

4.1 Limitations

The main limitation for the cascaded approach using an LLM is the ability of the LLM to follow the prompt correctly. This is expected to be handled even better for newer LLMs getting trained. For the fine-tuned ASR model the limitation is getting diverse data containing suitable numeric expression formatting.

We also tried adapting the ASR model using batch weighting [16] and/or a factorization-based approach [2] together with the common voice training dataset. While the performance on the Common voice test sets improved, which is expected since the training and test datasets are more similar, the performance on the numeric expression formatting was slightly worse. Furthermore, we tried freezing the encoder or only adapting the final projection layer during the adaptation with the numeric expressions data. For both, the performance on the numeric expressions test data was slightly worse compared to not freezing any weights.

5 Conclusion

In this paper, we tackled the problem of correctly formatting numeric expressions in ASR transcripts. Our experiments revealed that LLMs, particularly the latest models, deliver strong performance in recognizing and formatting numeric expressions. On the other hand, end-to-end models adapted with synthetic ASR data provide competitive performance.

6 Acknowledgements

This research was supported in part by a grant from Zoom Video Communications, Inc. The authors gratefully acknowledge the support.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[2] N.-Q. Pham, T.-S. Nguyen, J. Niehues, M. Müller, S. Stüker, and A. Waibel, “Very deep self-attention networks for end-to-end speech recognition,” arXiv preprint arXiv:1904.13377, 2019.
[3] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28 492–28 518.
[4] T.-S. Nguyen, S. Stueker, J. Niehues, and A. Waibel, “Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7689–7693.
[5] M. Shugrina, “Formatting time-aligned asr transcripts for readability,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 198–206.
[6] E. Cho, J. Niehues, and A. Waibel, “Nmt-based segmentation and punctuation insertion for real-time spoken language translation.” in Interspeech, 2017, pp. 2645–2649.
[7] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[8] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[9] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[10] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey on in-context learning,” arXiv preprint arXiv:2301.00234, 2022.
[11] N. Rossenbach, A. Zeyer, R. Schlüter, and H. Ney, “Generating synthetic audio data for attention-based speech recognition systems,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7069–7073.
[12] A. Fazel, W. Yang, Y. Liu, R. Barra-Chicote, Y. Meng, R. Maas, and J. Droppo, “Synthasr: Unlocking synthetic data for speech recognition,” arXiv preprint arXiv:2106.07803, 2021.
[13] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019.
[14] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual denoising pre-training for neural machine translation,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020.
[15] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[16] C. Huber, J. Hussain, T.-N. Nguyen, K. Song, S. Stüker, and A. Waibel, “Supervised adaptation of sequence-to-sequence speech recognition systems using batch-weighting,” in Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, 2020, pp. 9–17.