When to Support and when to Bot

Nemi Pelgrom1 and Daniel Ihrmark2 ∗

1- Linnaeus University - Dept of Computer Science, Växjö - Sweden
2- Linnaeus University - Dept of Cultural Sciences, Växjö - Sweden

April 29, 2024

∗ We thank Fortnox AB for sponsoring this research. We also clarify that only the dataset is from Fortnox AB; none of the discussion has any connection to the company.

Abstract

In this study, we analyse a dataset comprising 8,700 customer support chat lines to discern patterns that differentiate simple from complex problems, and discuss the implications for automated customer support systems. We identify a distinct threshold: messages of 20 words or fewer typically present issues straightforward enough for bots to understand, while messages of 21 words or more often detail multi-step or personalised problems that are challenging for bots to address effectively. This pattern was found through manual annotation of 1,026 messages spanning 18-24 words. We also report on a trial of using an LLM to automate the annotation of 8,700 messages ranging from 18 to 30 words. Our findings show the critical role of message length in determining the suitability of automated support and offer insights into optimising bot deployment in customer service environments.

1 Introduction

The distinction between tasks that are appropriate for bots and those that require human intervention is a critical aspect of streamlining customer support services [1]. The aim is to find solutions that are both computationally efficient and able to give the customer a satisfying support experience. The accurate identification of a customer's primary issue is central. Issues that demand an elaborate explanation are typically of a complexity level beyond a bot's limited ability to accurately understand the context of the issue, and in many cases such issues require human interaction (for example personalised features or system bugs). The common problem of customers being unable to identify by themselves what support they need to solve their issues is central to customer satisfaction [2], and it is a hindrance for deploying bots [3]. This problem necessitates an effective and efficient system within customer support operations to assess and direct issues to the suitable resolution channel, whether a bot or a human agent. We contribute to this by presenting a pattern found in initial messages to several support chat channels of a Swedish company. We discuss this pattern in the context of evaluating efficiency and customer satisfaction.

From a linguistic standpoint, a topic is often seen as a defined cluster of words that frequently co-occur across the sample being explored. One popular method for topic modelling is based on Latent Dirichlet Allocation [4], which has previously been used to explore topic segmentation in the context of customer support. There are also several LLM-based trials on categorising topics, but these rely on much larger corpora for training and prompting than is relevant to our aims [5], as our focus is not on the definition of the topics themselves, but rather on the number of topics included in a short text segment. Making detailed definitions and lists of topics is useful when there is a known and limited set of possible questions. For larger companies that offer a wide variety of services, the set of possible problems is large and complex, which makes such solutions time-consuming to administrate.
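For illustration, a minimal sketch of the LDA approach in scikit-learn could look as follows. The example messages and the number of topics are placeholders, not drawn from our data; the sketch only shows why a fixed, pre-specified topic inventory is hard to maintain for an open-ended service catalogue.

    # A minimal sketch of LDA topic modelling with scikit-learn; the example
    # messages and the number of topics are illustrative, not from our dataset.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    messages = [
        "Var hittar jag prislistan för denna tjänst?",     # hypothetical message
        "Jag har skickat en faktura till fel mottagare.",  # hypothetical message
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(messages)

    # The number of topics must be fixed in advance, which is exactly what is
    # time-consuming to administrate for a wide and changing service offering.
    lda = LatentDirichletAllocation(n_components=5, random_state=0)
    doc_topics = lda.fit_transform(X)

    # Print the top words per inferred topic.
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-3:]]
        print(f"topic {k}: {top}")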
The issue needs to be communicated from the user to the support system or support worker, the solution needs to be identified, and the solution needs to be either implemented by them or communicated back to the user. This makes customer support heavily dependent on the quality of communication [6]. Some problems are very straightforward, such as "Where do I find the price list for this service?", and they can be solved by a bot that is able to extract the particular request and connect it with the right link on the company web page. Other problems are more complex, such as "I think I accidentally sent an invoice to the wrong receiver, and didn't notice until after I lost the option to take it back. Could you help me remove it or contact the recipient?", and these need a support worker to be solved. If we could identify which messages go into which category, we could both avoid making the first category of users wait for an available support worker and spare the second category from having to talk to a bot before they reach a person.

2 Methodology

While doing unrelated processing of support chats, we noticed what appeared to be a cut-off point for the complexity of user requests in the messages, and we decided to explore whether this was a real pattern. We collected a dataset of real messages from users to various support channels at Fortnox AB. We wanted to use a zero-shot LLM to help annotate the dataset, so that we would have more data to validate the hypothesis; however, the model was not able to learn the pattern based on 240 annotated examples. Since the pattern was clear in this first round of manual annotation, we decided to complete the trial manually, annotating more messages in our selection of word counts until we reached over 1,000 annotated messages, with an even distribution of messages across each word-count category.

2.1 Preprocessing the dataset

The original dataset was 89,000 lines. It contained all messages from all conversations, and since we are only interested in initial messages, we removed all lines that were not clearly introducing a problem. Approximately half of the messages we used had the Swedish word for "hello" as the first word, but since this was less common in the longer messages, we decided not to limit ourselves to such messages. We then removed all messages containing identifying information (such as email addresses, links, full names, and organisation IDs). To avoid language-barrier biases, we removed the few chats that were in languages other than Swedish. Finally, we removed messages quoting other texts, since the quoted material distorted the word count.

2.2 Criteria for chat message categorisation

We want to see if there is a correlation between how many words a message contains and how many separable topics it discusses. To do this, we need a criterion that separates topics, so that we can apply it to our messages and see where they fall. Since our chat data is directed towards topics limited to the services the company provides, and many topics are referred to by names particular to the company, we found that the available topic segmentation algorithms were not effective on our data. While we acknowledge that topic segmentation is a wide field of its own, it is not of primary interest for this study, so we have used a manual approximation to estimate how many topics there are in any particular message.
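To make the preprocessing of Section 2.1 concrete, a minimal sketch of the kind of filters involved is given below, assuming simple regex-based scrubbing. The patterns and example messages are illustrative placeholders, not our exact rules.

    # A minimal sketch of the filtering described in Section 2.1; the regex
    # patterns and example messages are illustrative, not our exact rules.
    import re

    EMAIL = re.compile(r"\S+@\S+\.\S+")
    URL = re.compile(r"https?://\S+")
    ORG_ID = re.compile(r"\b\d{6}-\d{4}\b")  # common Swedish organisation-number format

    def has_identifying_info(msg: str) -> bool:
        """Messages with email addresses, links or organisation IDs were removed."""
        return any(p.search(msg) for p in (EMAIL, URL, ORG_ID))

    def quotes_other_text(msg: str) -> bool:
        """Quoted material distorts the word count, so such messages were removed."""
        return '"' in msg or "»" in msg

    def word_count(msg: str) -> int:
        return len(msg.split())

    messages = [
        "Hej! Var hittar jag prislistan för denna tjänst?",  # hypothetical
        "Kontakta mig på namn@exempel.se om fakturan.",      # hypothetical
    ]
    kept = [m for m in messages
            if not has_identifying_info(m) and not quotes_other_text(m)]
    print([(m, word_count(m)) for m in kept])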
We are not experts in the topics mentioned in the messages, which makes it more likely that we have false positives (categorised as 2, but should be 1) than false negatives (categorised as 1, but should be 2) in the dataset. We separated the topics on the following basis. If it is possible to reformulate the content of the message as a single one-step question, we categorised it as 1. If there are several steps, or several not clearly connected questions, we categorised it as 2. If the question was about something particular to that user, such as a particular invoice or person, we also categorised it as 2, on the basis that one could not answer the question without first locating the particular invoice or person, which makes answering a two-step process. Messages of the form "I need help with understanding this product/service" were categorised as 2, since we read them as requests covering more than one simple question.

Most of the messages that were difficult to categorise consisted of a question followed by a specification of where the user had encountered the problem or why they were asking. The difficulty lay in judging whether the specification was necessary for answering the question. If it was necessary, we categorised the message as 2, otherwise as 1. To avoid biasing the sample towards messages that were easy to categorise, we annotated every message in the final dataset. While this means that some difficult-to-categorise messages could arguably be categorised differently, the difference should only be marginal, and this was judged preferable to changing the dataset itself in a way that would introduce more bias.

We only performed the binary categorisation of sorting the data into category 1 or 2. While some messages could be considered to contain more than two topics, we were looking for a cut-off point between 1 and 2, so although further detail in the categorisation process could give slightly more accurate results, it would not affect the applicability of the results and was deemed unnecessary.

2.3 Examples of categorisation

Two topics:

• Hello! I have a customer who has offset a credit against the wrong invoice. Can you help me match the overpayment against the underpayment? This involves two steps: first understanding how to handle a credit offset against the wrong invoice, and then deciding if and how the overpayment can be matched against the underpayment.

• I went in a while ago and entered on the invoice that payment has been made, has it not reached you yet? This takes two steps: identifying the invoice, and checking if it has been paid.

One topic:

• Hello, I need to get a certification list for how the certification is set up in Fortnox for an auditor. How do I get it out? This is one step because it only involves a specified data extraction.

• Hello, I wonder if I can see if the employer declaration has already been sent to the tax authority. I am unsure if this has been done. This is one step because it only asks how to locate a specific part of the webpage.

2.4 Trial with zero-shot BERT model

We ran a trial on automating the categorisation of the messages into our two categories. The trial was conducted using the zero-shot BERT model created by KBLab [7]. The model has previously struggled with text components such as punctuation, but was considered an interesting possible avenue because it can parse based on context, in contrast to the content-agnostic approach represented by word-count measurement.
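A minimal sketch of how such a zero-shot setup can be wired up with the Hugging Face transformers pipeline follows. The checkpoint name and the Swedish label phrasings are assumptions for illustration, not the exact configuration we used; the relevant model should be looked up on the KBLab hub page.

    # A minimal sketch of zero-shot categorisation with the transformers
    # pipeline. The model name is an assumption; check the Hugging Face hub
    # for the actual KBLab zero-shot checkpoint.
    from transformers import pipeline

    classifier = pipeline(
        "zero-shot-classification",
        model="KBLab/megatron-bert-large-swedish-cased-165-zero-shot",  # hypothetical
    )

    message = "Jag har skickat en faktura till fel mottagare, kan ni hjälpa mig?"
    result = classifier(
        message,
        candidate_labels=["en fråga", "flera frågor"],  # one topic vs. several
        hypothesis_template="Detta meddelande innehåller {}.",
    )
    print(result["labels"][0], result["scores"][0])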
Our trial was performed on 8,700 messages of 18-30 words, but the results were poor. There was a significant overrepresentation of messages being categorised as containing several topics, even when it was clear from manual analysis that only one topic was discussed. We drew the conclusion that the model could not identify the chat message categorisation criteria that we are using.

3 Results

Figure 1: Plot of the percentage of messages with one or two topics against the number of words in the message.

Words   1 topic   2 topics
18         88          0
19        223          0
20        118         57
21         72        131
22         24        111
23          9        100
24          8         85

Table 1: Number of messages containing one or two topics, by number of words in the message, manually annotated.

4 Discussion

These results give us new insight into a correlation between the number of words in a support request and the complexity of the request. This could be used as a basis for decisions on which parts, or how much, of customer support to automate. Since many chat bots have difficulty understanding when several tasks are mentioned, or when there are complex contexts to consider, it is useful to be able to separate such issues from the simpler ones. Implementing this separation of incoming messages to support chats could improve the experience of both the users and the people working in support. Users with complex problems would not have to spend extra time answering questions from a bot before being connected to the support they need, and support workers would be assigned fewer of the simple problems. While we have not checked whether the results replicate in support chats from other companies, or in other languages, we believe that the insights from our results still have value, and our trial can easily be recreated on any similar dataset for more trials before a possible implementation of the discussed division of messages.

5 Conclusion

We found a very clear pattern in the complexity of the issues users bring up in their initial messages to support chats at Fortnox AB: messages of 20 words or fewer are clear in what they are requesting, while messages of 21 words or more usually contain complex or multi-step requests. While it is not strange that shorter messages are easier to handle, it is surprising that there is such a clear cut-off point based on simple word count. This insight could be used as part of any system that aims to make AI-aided customer support more efficient.

References

[1] Debayan Banerjee, Mathis Poser, Christina Wiethof, Varun Shankar Subramanian, Richard Paucar, Eva A. C. Bittner, and Chris Biemann. A system for human-AI collaboration for online customer support, 2023.

[2] Alexander Rossmann, Alfred Zimmermann, and Dieter Hertweck. The impact of chatbots on customer service performance. In Advances in the Human Side of Service Engineering: Proceedings of the AHFE 2020 Virtual Conference on The Human Side of Service Engineering, July 16-20, 2020, USA, pages 237-243. Springer, 2020.

[3] Xinyi Yang. Understanding chatbot service encounters: Consumers' satisfactory and dissatisfactory experiences. Master's thesis, 2020.

[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022, 2003.

[5] Derek Greene, Derek O'Callaghan, and Pádraig Cunningham. How many topics? Stability analysis for topic models. In Toon Calders, Floriana Esposito, Eyke Hüllermeier, and Rosa Meo, editors, Machine Learning and Knowledge Discovery in Databases, volume 8724 of Lecture Notes in Computer Science, Berlin, Heidelberg, 2014. Springer.

[6] Mark Anthony Camilleri and Ciro Troise. Live support by chatbots with artificial intelligence: A future research agenda. Service Business, 17, November 2022.

[7] Love Börjeson, Chris Haffenden, Martin Malmsten, Fredrik Klingwall, Emma Rende, Robin Kurtz, Faton Rekathati, Hillevi Hägglöf, and Justyna Sikora. Transfiguring the library as digital research infrastructure: Making KBLab at the National Library of Sweden. 2023.