Abstract
We have recently seen the emergence of several publicly available Natural Language Understanding (NLU) toolkits, which map user utterances to structured, but more abstract, Dialogue Act (DA) or Intent specifications, while making this process accessible to the lay developer. In this paper, we present the first wide-coverage evaluation and comparison of some of the most popular NLU services on a large, multi-domain (21 domains) dataset of 25K user utterances that we have collected and annotated with Intent and Entity Type specifications, and which will be released as part of this submission (https://github.com/xliuhw/NLU-Evaluation-Data). The results show that on Intent classification Watson significantly outperforms the other platforms, namely Dialogflow, LUIS and Rasa, though these also perform well. Interestingly, on Entity Type recognition, Watson performs significantly worse due to its low Precision (at the time of producing the camera-ready version of this paper, we noticed the seemingly recent addition of a ‘Contextual Entity’ annotation tool to Watson, much like e.g. in Rasa; we would therefore like to stress that this paper does not include an evaluation of this feature in Watson NLU). Again, Dialogflow, LUIS and Rasa perform well on this task.
(Work done when Pawel was with Emotech North LTD).
Notes
- 1.
According to anecdotal evidence from academic and start-up communities.
- 2.
- 3.
- 4.
- 5.
- 6.
Note that one could develop one’s own system using existing libraries, e.g. scikit-learn (http://scikit-learn.org/stable/) or spaCy (https://spacy.io/), but a quicker and more accessible way is to use an existing service platform.
- 7.
It was not yet open source when we carried out the benchmarking, and was later also introduced in https://arxiv.org/abs/1805.10190.
- 8.
We also note that our dataset was inevitably unbalanced across the different Intents and Entities; e.g. some Intents had far fewer instances (iot_wemo had only 77). However, this affects the performance of the four platforms equally, and thus does not confound the results presented below.
- 9.
At the time of producing the camera-ready version of this paper, we noticed the seemingly recent addition of a ‘Contextual Entity’ annotation tool to Watson, much like e.g. in Rasa. We would like to stress that this paper does not include an evaluation of this feature in Watson NLU.
- 10.
Micro-averaging sums the individual TP, FP and FN counts of all Intent/Entity classes before computing the metric (see the formulas after this list).
- 11.
Interestingly, Watson only requires a list of possible entities, rather than entity annotations in utterances as the other platforms do (see Table 1).
- 12.
Tables for the other folds are omitted for reasons of space.
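For reference, the micro-averaged metrics mentioned in footnote 10 can be written out as follows; this is the standard formulation (the notation, with C denoting the set of Intent or Entity classes, is ours and not taken from the paper):

```latex
% Micro-averaged Precision, Recall and F1 over the set of classes C
P_{\mathrm{micro}} = \frac{\sum_{c \in C} \mathrm{TP}_c}{\sum_{c \in C} \left(\mathrm{TP}_c + \mathrm{FP}_c\right)}, \qquad
R_{\mathrm{micro}} = \frac{\sum_{c \in C} \mathrm{TP}_c}{\sum_{c \in C} \left(\mathrm{TP}_c + \mathrm{FN}_c\right)}, \qquad
F1_{\mathrm{micro}} = \frac{2\, P_{\mathrm{micro}}\, R_{\mathrm{micro}}}{P_{\mathrm{micro}} + R_{\mathrm{micro}}}
```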
Appendix
We provide some examples of the data annotation and of the training inputs to each of the four platforms in Table 4 and Listings 1, 2, 3 and 4.
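Since the listings themselves are not reproduced in this excerpt, the snippet below is a minimal, hypothetical sketch of what a Rasa-NLU-style training example with Intent and Entity annotation looks like, written as a Python dictionary mirroring the JSON format; the utterance, intent and entity labels are invented for illustration and are not taken from the paper’s Listings.

```python
# Illustrative sketch of a Rasa-NLU-style training example (not the paper's Listing).
# The utterance, intent and entity labels below are invented for illustration only.
rasa_nlu_example = {
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "wake me up at five am this week",
                "intent": "alarm_set",  # Intent label for the whole utterance
                "entities": [
                    {
                        # character-span annotation of one entity (end is exclusive)
                        "start": 14,
                        "end": 21,
                        "value": "five am",
                        "entity": "time",
                    }
                ],
            }
        ]
    }
}
```

The other platforms take analogously annotated utterances, with the exception of Watson, which, as noted in footnote 11, only requires a list of possible entities.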
We also provide more details on the training and test data distributions, as well as the Confusion Matrix for the first fold (Fold_1) of the 10-Fold Cross-Validation. Table 5 shows the number of sentences for each Intent in each dataset. Table 6 lists the number of entity samples for each Entity Type in each dataset. Tables 7 and 8 show the confusion matrices used to calculate the Precision, Recall and F1 scores for Intents and Entities. TP, FP, FN and TN in the tables stand for True Positive, False Positive, False Negative and True Negative, respectively.
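To make explicit how counts like those in Tables 7 and 8 turn into the reported scores, here is a short Python sketch, our own illustration rather than the authors’ evaluation code, that computes micro-averaged Precision, Recall and F1 from per-class TP, FP and FN counts; the example counts are invented.

```python
# Micro-averaged Precision, Recall and F1 from per-class confusion-matrix counts.
# Illustrative sketch only; the per-class counts below are invented.
def micro_prf(counts):
    """counts: dict mapping class name -> (TP, FP, FN)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts for three Intent classes: (TP, FP, FN)
example_counts = {
    "alarm_set": (90, 5, 10),
    "weather_query": (80, 12, 8),
    "iot_wemo": (60, 7, 17),
}
print(micro_prf(example_counts))  # -> (precision, recall, f1)
```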
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Liu, X., Eshghi, A., Swietojanski, P., Rieser, V. (2021). Benchmarking Natural Language Understanding Services for Building Conversational Agents. In: Marchi, E., Siniscalchi, S.M., Cumani, S., Salerno, V.M., Li, H. (eds) Increasing Naturalness and Flexibility in Spoken Dialogue Interaction. Lecture Notes in Electrical Engineering, vol 714. Springer, Singapore. https://doi.org/10.1007/978-981-15-9323-9_15
DOI: https://doi.org/10.1007/978-981-15-9323-9_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9322-2
Online ISBN: 978-981-15-9323-9