DOI: 10.1145/3613904.3642939

Unblind Text Inputs: Predicting Hint-text of Text Input in Mobile Apps via LLM

Published: 11 May 2024
Abstract

Mobile apps have become indispensable for accessing and participating in various environments, especially for low-vision users. Users with visual impairments can use screen readers to read the content of each screen and understand what needs to be operated. Screen readers need to read the hint-text attribute of text input components to remind visually impaired users what to fill in. Unfortunately, based on our analysis of 4,501 Android apps with text inputs, over 76% of them are missing hint-text. These issues are mostly caused by developers’ lack of awareness of visually impaired users. To overcome these challenges, we developed an LLM-based hint-text generation model called HintDroid, which analyzes the GUI information of input components and uses in-context learning to generate the hint-text. To ensure the quality of the generated hint-text, we further designed a feedback-based inspection mechanism to adjust it. Automated experiments demonstrate high BLEU scores, and a user study further confirms its usefulness. HintDroid can not only help visually impaired individuals, but also help ordinary people understand the requirements of input components. HintDroid demo video: https://youtu.be/FWgfcctRbfI.

    1 Introduction

In the era of burgeoning mobile device development, mobile applications have evolved into indispensable tools for accessing and engaging with diverse environments, offering unparalleled convenience, particularly for individuals with low vision. According to statistics from Google Play [7] and the App Store [3], over 4 million apps are widely used for various tasks such as entertainment, financial services, reading, shopping, banking, and chatting. However, many app developers do not build apps according to accessibility standards, which directly prevents disabled people from accessing the functionalities of many apps. This is especially true for visually impaired users, who need rich Graphical User Interface (GUI/UI) hints to understand the functionality. According to statistics from the World Health Organization (WHO) [5], at least 2.2 billion people have near or distant vision impairment, and in at least 1 billion of these cases, the impairment could have been prevented or has yet to be addressed [5]. Smart devices and mobile apps therefore need to accommodate these users, and making mobile apps accessible to them is a social justice issue that deserves society-wide attention [5, 71].
Figure 1: Examples of differences between hint-text, label and content description. (a) Label is used to briefly describe image components. (b) Content description provides an overview of related input components. (c) Hint-text further explains the input requirements.
Figure 2: Workflow of our HintDroid: It extracts GUI entity information from the view hierarchy file of the GUI page and constructs a GUI prompt that helps LLM understand the context. To facilitate LLM’s better understanding of the task, HintDroid uses a retrieval-based example selection method to construct the in-context learning prompts. It also uses input content as a bridge to evaluate the generated hint-text and extracts feedback information by checking whether the input content can trigger the next GUI page.
To facilitate interaction between visually impaired users and apps, an increasing number of mobile devices support screen reader apps [8, 10], and many companies have developed screen readers for different disabled users to help them understand and use an app’s functionalities. The screen reader recognizes the “text” or “content description” of a text component, the “label” of an image component, and the “hint-text” of a text input component, and reads them aloud for visually impaired users. This is very effective for visually impaired users, who can understand the app’s UI through screen readers and use its functionalities. As shown in Figure 1, hint-text is different from label and content description. A label is usually used to briefly describe image components as in Figure 1 (a), a content description provides an overview of related input components as in Figure 1 (b), while hint-text goes a step further to explain the input requirements and help users understand what they should input into the component (the red rectangles in Figure 1 indicate example hint-text). The Google developer accessibility guideline [2] requires developers to provide hint-text for input components, especially those that lack a content description. “It’s helpful to show text in the element itself, in addition to making this hint-text available to screen readers” [9].
Despite the guideline, real-world practice falls short. According to our observation of 4,950 popular apps from Google Play (in Section 4), about 91% of them have text input components, yet as many as 76% of these apps have input components with missing hint-text, as shown in Figure 3 (a). Without hint-text, the screen reader cannot obtain information about the text input component, which can cause it to skip the component. In addition, even when input components have hint-text, it can be so simple that it lacks practical meaning, as in Figure 3 (b), and fails to help people understand the text input requirements. This prevents disabled people from using the functionalities provided by the app and affects their access to the functionality of the page.
Figure 3: Example of the text input component without/with hint-text. (a) These inputs have issues of missing hint-text. (b) Hint-text lacks practical meaning. (c) Hint-text can help visually impaired users successfully fill in the correct input.
Therefore, automatic support for hint-text generation is needed, yet to the best of our knowledge no existing study tackles this problem. The most relevant work is the generation of missing labels for image components with deep learning techniques to improve the accessibility of apps [30, 31, 142]. However, label generation mainly involves understanding icons, while hint-text generation not only requires a more thorough view of the whole GUI page but also calls for deeper comprehension of the related information, as shown in Figure 1. Taken in this sense, this work develops advanced techniques for generating meaningful hint-text as shown in Figure 3 (c), which can help visually impaired users successfully fill in the correct input.
To overcome these challenges, inspired by the excellent performance of Large Language Models (LLMs) in natural language processing tasks [27, 32, 37, 82, 116, 140], we propose HintDroid for automated hint-text generation based on an LLM and GUI page information. Given the view hierarchy file of a GUI page, we first extract the text input component and the GUI entity information of the page, and then design GUI prompts that enable the LLM to understand the text input context. To help the LLM better understand our task, we use in-context learning and construct an example database with hint-texts and their corresponding input component information. We then design a retrieval-based example selection method to construct an in-context learning prompt. Combining the above two prompts, the LLM outputs the hint-text and the input content generated from it. To ensure the quality of the hint-text, we use the input content as a bridge to evaluate the generated hint-text: we extract feedback information by checking whether the input content can trigger the next GUI page, and then construct a feedback prompt that lets the LLM further adjust the hint-text.
We evaluate the effectiveness of HintDroid on 2,659 text input components with hint-text from 753 popular apps in Google Play (the largest repository of Android apps). Results show that HintDroid achieves 83% BLEU@1, 77% BLEU@2, 73% BLEU@3, 66% BLEU@4, 67% METEOR, 63% ROUGE and 62% CIDEr for the generated hint-text. Compared with 12 commonly-used and state-of-the-art baselines, HintDroid achieves more than an 82% boost in exact match over the best baseline. To understand the role of each module and sub-module of the approach, we conduct ablation experiments that further demonstrate its effectiveness. We also carry out a user study with 33 apps from Google Play to evaluate its usefulness in assisting visually impaired users. Results show that participants with HintDroid fill in 152% more correct inputs and cover 66% more states and 77% more activities within 139% less time, compared with those without HintDroid. The experimental results and feedback from these participants confirm the usefulness of HintDroid.
    The contributions of this paper are as follows:
To the best of our knowledge, this is the first work to automatically predict the hint-text of text input components for enhancing app accessibility. We hope this work can draw the community’s attention to maintaining the accessibility of mobile apps from the viewpoint of hint-text.
An empirical study investigating how well the text input components of current apps support accessibility for users with vision impairment, which motivates this study as well as follow-up research.
A large-scale evaluation on real-world mobile apps with promising results, and a user study demonstrating HintDroid’s usefulness in helping visually impaired users successfully fill in the correct input.

    2 Related Work

With the development of mobile applications, more and more companies (Google, Apple) recognize the need to improve the accessibility of apps and have introduced guidelines for app developers and designers [2, 4, 6, 9], including basic principles for accessibility design. The accessibility of apps is crucial for users to use them normally, especially for visually impaired and other disabled users [50, 57, 96, 101, 109, 114, 118, 126, 144]. Missing hint-text in text input components is a common accessibility issue that affects users’ understanding of input requirements, especially for visually impaired users, since screen readers convey input requirements to them by reading the hint-text field [2, 8, 9, 10]. Therefore, this paper focuses on the accessibility issue of text input components. Our HintDroid can automatically complete the hint-text of input components, helping visually impaired people understand text input components and successfully fill in the correct input.

    2.1 App Accessibility for Visually Impaired Users

Within the domain of Human-Computer Interaction, researchers have extensively delved into accessibility challenges prevalent in various categories of small-scale mobile applications [56, 63, 83, 84, 87, 107, 134], spanning domains like health [39, 61, 93, 110, 129], smart cities [95, 127, 132], and government engagement [13, 66]. While these works scrutinize distinct facets of accessibility, a recurring theme has been the conspicuous absence of descriptions for image-based components [68, 98, 112, 122]. The gravity of this issue has been explicitly recognized across studies. Notably, Park et al. [107] conducted an evaluation that assigned both severity and frequency ratings to different accessibility errors, with missing labels emerging as the most severe among various issues. Kane et al. [62] investigated mobile device adoption and accessibility for users with visual and motor disabilities. Ross et al. [112] conducted a comprehensive analysis of image-based button labeling within a relatively large corpus of Android apps, pinpointing prevalent labeling deficiencies. Based on the above analysis of missing labels, Chen et al. [30] proposed LabelDroid, which employs deep-learning techniques to train a model on a dataset of existing icons with labels to automatically generate labels for visually similar, unlabeled icons. To further improve the performance of label generation, Mehralian et al. [91] considered more GUI information and proposed a context-aware label generation approach, COALA, that incorporates several sources of information from the icon to generate accurate labels. The above studies have conducted in-depth research on the accessibility issues of image components. In addition, statistical data show that the accessibility issues associated with text input are also serious, but this issue has received relatively little attention in existing research. Our study therefore not only presents the most expansive scrutiny of text input components but also introduces an LLM-based solution to generate hint-text.
Several works have also focused on detecting and rectifying accessibility gaps, particularly for users with visual impairments [11, 12, 15, 20, 35, 74, 85, 86, 113, 121]. Eler et al. [43] proposed an automated test generation model to dynamically evaluate mobile apps. Salehnamadi et al. [113] designed Latte, a high-fidelity form of accessibility testing for Android apps that automatically reuses tests written to evaluate an app’s functional correctness to assess its accessibility as well. Zhang et al. [143] leveraged crowd-sourcing to annotate GUI elements devoid of original content descriptions. Although this research helps enhance mobile accessibility, it still cannot fix the issue of missing hint-text. This further motivates us to design an automated method to generate hint-text that satisfies contextual semantics.

    2.2 GUI Understanding and Intelligent Interactions for Visually Impaired Users

To help visually impaired individuals understand the meaning of GUI pages and components, researchers have attempted to use computer vision technology for GUI modeling and semantic understanding of GUI pages [16, 19, 28, 41, 44, 46, 51, 60, 70, 131, 136, 142]. Schoop et al. [115] designed a novel system that models the perceived tappability of mobile UI elements with a vision-based deep neural network and helps provide design insights with dataset-level and instance-level explanations of model predictions. He et al. [54] designed a new pre-trained UI representation model, ActionBert, which utilizes visual, linguistic, and domain-specific features from user interaction traces to pre-train a universal feature representation of the UI and its components. Although these studies can help users understand the GUI information of pages, they do not identify and understand the relevant information of text input components. Since visually impaired users cannot obtain the GUI information of a page through their eyes, it is difficult for them to determine input requirements. To address these challenges, this paper repairs the missing hint-text of input components and helps visually impaired users understand them.
The portability of mobile devices has led to an increasing number of visually impaired people using smartphones in daily life [57, 64, 111, 114]. Researchers have explored innovations in alternative sensory modalities like speech systems [17, 104], auditory feedback [25, 49], and multimodal interaction [26]. This has opened a new era of accessible smartphone usage for visually impaired people [40, 99]. To assist visually impaired people in filling in input content, researchers have proposed input solutions that rely on Braille mappings to enter characters [14, 18, 24, 47, 48, 67, 88, 90, 102, 123]. While input methods to assist mobile text entry have been extensively studied in the recent literature [79, 81], text entry research has focused much less on the needs of persons with vision problems. For visually impaired people, using the text input functionality of an app is a challenging task: they not only need to understand the intent of the input component, but also fill in the correct input.

    2.3 LLM Usage in Human-computer Interaction

Recently, pre-trained Large Language Models [27, 32, 37, 116, 140] have achieved great success in a variety of NLP tasks. Considering the powerful performance of LLMs, researchers have successfully leveraged them to solve various tasks in the fields of human-computer interaction and software engineering [100, 108, 135, 139, 141]. Supported by code naturalness [55], researchers have applied LLMs to code writing in different programming languages [45, 133]. A related work, QTypist [81], leveraged an LLM to generate text inputs that trigger the next GUI page in order to improve the coverage of mobile GUI testing. Different from its sole focus on text input generation to boost existing GUI testing tools, HintDroid is designed to generate the hint-text of text input components, which can help visually impaired users fill in the correct input.
LLMs have also been successfully applied in research related to the HCI community [38, 58, 59, 65, 73, 80, 128]. Stylette [65] allows users to modify web designs with language commands and uses LLMs to infer the corresponding CSS properties. Lee et al. [72] presented CoAuthor, a dataset designed to reveal GPT-3’s capabilities in assisting creative and argumentative writing. Othman et al. [103] proposed an automated accessibility issue repair method that, for the first time, utilizes an LLM to repair website accessibility issues. Wang et al. [128] investigated the feasibility of enabling versatile conversational interactions with mobile UIs using a single LLM and designed prompting techniques to adapt an LLM to mobile UIs. This research also inspires us to use LLM knowledge to understand UIs and generate hint-text. Unlike these works, this paper addresses the challenge of missing hint-text, which cannot be fixed by current accessibility repair tools. To our knowledge, this is the first work to propose using an LLM to automatically repair missing hint-text, helping users further understand input requirements.

    3 Android Accessibility Background

In this section, we introduce Android-related terminology and Android screen readers. This background supports the concise and clear terminology used throughout our analysis.
Android Text Input Components. When developers build a UI, the Android platform provides many different types of UI components [2], such as TextView, ImageView, EditText, Button, and so on. EditText is a user interface element for entering and modifying text. When defining an EditText component, developers must specify the R.styleable.TextView_InputType attribute. The “hint” attribute is mainly used to display a prompt describing the input requirements. For text input components, the Google developer accessibility guideline [2] requires developers to provide the android:hint attribute to convey input requirements to users.
Android Screen Readers. According to WebAIM [1], 95.1% of visually impaired respondents rely on smartphone screen readers. Google and Apple, as major industry players, shape mobile technology and apps. For vision-impaired users, Google’s TalkBack [8] and Apple’s VoiceOver [10] provide crucial accessibility, enabling mobile app engagement. Focusing on Android apps, TalkBack is a cornerstone accessibility service. Pre-installed on many Android devices, it empowers blind and visually impaired users. TalkBack narrates blocks of text, alerts users to interactive elements such as buttons, and supports interaction with apps through gestures. It also offers local and global context menus, allowing tailored interaction experiences and global setting adjustments. By bridging the gap between user intent and device response, it enhances mobile app experiences for users who navigate with different senses. TalkBack typically obtains the label and hint-text from apps to provide its services, so providing labels and hint-text in mobile applications is crucial for helping users understand the app.

    4 Motivational Study

Figure 4: Statistical results of the hint-text missing rate. In 18 app categories, more than 80% of hint-text is missing.
This paper proposes an automated approach to generate hint-text for text input components. Compared with input content generation, hint-text focuses more on helping users, including visually impaired individuals, understand the needs of input components from a human cognitive perspective. To understand the distribution of text input components and missing hint-text in popular apps, we conduct a motivational study to explore the potential impact of missing hint-text on people with disabilities.

    4.1 Data Collection

To assess how well popular mobile apps support users with vision impairment, we randomly crawl 4,950 apps from 33 categories (150 apps per category) on Google Play [7], all of which were updated between December 2022 and June 2023, with installations ranging from 1K to 100M. We use the Application Explorer [29] (ensuring that each app runs normally during exploration) to automatically explore different screens in each app through various operations such as clicking, editing, and scrolling. During the exploration, we capture screenshots of the app GUI and their view hierarchy files (runtime front-end code files), which identify the type of each element (such as EditText, TextView), its coordinates in the screenshot, its content description, and other metadata. After removing all duplicate screenshots, we ultimately collected 73,286 GUI screenshots and their corresponding view hierarchy files from 4,950 apps. In the collected data, 45,803 (63%) screenshots from 4,501 apps (91%) included text input components, which form the dataset we analyze in this study.
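As a minimal illustration (not our actual analysis pipeline), the following sketch flags text input components whose hint is empty in a UIAutomator-style view hierarchy dump; it assumes each node exposes "class", "resource-id" and "hint" attributes, which is a simplification of the files collected above.

```python
# Minimal sketch (assumed dump format): flag EditText nodes whose hint-text
# is missing in a UIAutomator-style view hierarchy dump.
import xml.etree.ElementTree as ET

def find_inputs_missing_hint(vh_xml_path: str) -> list:
    """Return resource-ids of text input components without a non-empty hint."""
    tree = ET.parse(vh_xml_path)
    missing = []
    for node in tree.iter("node"):
        if "EditText" not in node.get("class", ""):   # only text input components
            continue
        hint = (node.get("hint") or "").strip()
        if not hint:                                   # absent or empty hint-text
            missing.append(node.get("resource-id", "<no-id>"))
    return missing

if __name__ == "__main__":
    print(find_inputs_missing_hint("window_dump.xml"))
```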

    4.2 Current Status of Text Input Component Hint in Mobile Applications

Statistical results show that out of 4,501 apps with text input components, 3,398 (76%) are without hint-text. Among all 45,803 screens, 30,226 (66%) had at least one text input without explicit hint-text content. This means that more than half of the text input components do not provide hint-text to users. These statistics confirm the severity of missing hint-text, which may seriously hinder visually impaired users from using mobile apps. We then further analyze missing hint-text in text input components across different categories of mobile apps. The results show that missing hint-text is widespread across app categories, and in some categories the problem is severe. As shown in Figure 4 (a), in 18 app categories more than 80% of hint-text is missing. The missing rate of hint-text in the Tools, Shopping, Books and Reference, and Travel and Local categories, which are commonly used in the daily lives of visually impaired individuals, exceeds 90%, greatly affecting their daily use.
According to Google Play’s download statistics standards [7], we further analyze the hint-text missing rate for apps with different download numbers (popularity). We find that regardless of the number of downloads, apps have serious hint-text missing rates of 71%-89%. We also find that popular apps with high download counts have serious missing hint-text as well, i.e., apps with 100K-100M downloads have similar missing rates, with an average of 72%. Missing hint-text in popular apps may have a greater negative impact, as these apps have a larger audience. These findings confirm the severity of the problem, and therefore an automated method is needed to generate missing hint-text for text input components.
Figure 5: Overview of HintDroid. HintDroid consists of three main modules: (1) Module 1 is used to extract the contextual GUI information of the text input and generate the GUI prompt. (2) Module 2 is used to construct the in-context learning prompt to improve the performance of LLM. (3) Module 3 further optimizes the generation results of hint-text through a feedback mechanism.

    5 Approach

This paper proposes HintDroid, which uses an LLM to automatically generate the hint-text of text input components so that screen readers can convey the inherent meaning of input components to disabled users. Figure 5 presents an overview of HintDroid, which consists of three main modules. First, given the view hierarchy file of the text input page of an Android application, the GUI entity extraction module extracts GUI information related to the text input component and the contextual information of its nearby components. Then, based on this extracted GUI information, HintDroid generates LLM-understandable GUI prompts for hint-text generation (Section 5.1). Second, based on the GUI prompt, the information retrieval-based prompt enhancement module automatically selects examples that are most similar to the current input scenario and constructs an in-context learning prompt. This prompt uses in-context learning to help the LLM better understand the hint-text generation task, thereby enhancing its hint-text generation performance (Section 5.2). Third, with the above prompts, we instruct the LLM to generate not only the hint-text but also suggested input content that aligns with the hint-text. We then enter the content into the component and check whether it triggers the next page. If it does not, we assume the generated hint-text may be inappropriate and provide the failed case as a feedback prompt to help the LLM adjust its answer. Additionally, when receiving inappropriate input content, certain text input components report an error message indicating the expected input, and we also include this error message in the feedback when re-querying the LLM (Section 5.3).

    5.1 GUI Entity Extraction and Prompting

    The first module of our approach is to understand, analyze, organize, and extract the entities from the GUI page with the text input component. Although LLM is equipped with knowledge learned from large-scale training corpora and excels in various tasks, its performance may be significantly affected by the quality of its input, that is, whether the input can accurately describe what to ask. Therefore, we design an approach to extract and organize the GUI information.
Table 1:
Type | Entity | Description | Instantiation
App information | [AppName] | Name of the app under testing | [AppName]: “Flight”
App information | [Activities] | List of names for all activities of the app, obtained from the AndroidManifest.xml file | [Activities]: “Main, OneWay, RoundTrip, ...”
Page GUI information | [ActivityName] | Activity name of the current GUI page | [ActivityName]: “RoundTrip”
Page GUI information | [Component] | List of all widgets in the current page, represented with text/id | [Component]: “Depart, Arrive, Departure time, ...”
Page GUI information | [Position] | Relative position of widgets, obtained through their coordinates | [Position]: Upper: “Flight Search, ...”, Lower: “Departure time, ...”
Input component information | [TextInput] | The text input denoted with its textual related fields | [TextInput]: “Departure time”
Input component information | [NearbyInput] | Nearby widgets denoted with their textual related fields | [NearbyInput]: “Flight Search, ...”
Table 1: The example of the GUI entity extraction. HintDroid extracts the GUI entity information of the app, the GUI page currently in use, and the components on the page.

    5.1.1 GUI Entity Extraction.

Inspired by the screen reader’s ability, we convert GUI information into natural language descriptions [8, 10, 75, 142]. We first extract the GUI entity information of the app, the GUI page currently in use, and the components on the page, which helps the LLM understand the GUI of the current page and the input requirements of the text input components. The app entity information is extracted from the AndroidManifest.xml file, while the other two types of entity information are extracted from the view hierarchy file, which can be obtained with UIAutomator [124]. Table 1 presents a summarized view of them.
    App Entity Information provides the macro-level semantics of the app under testing, which facilitates the LLM to gain a general perspective about the functionalities of the app. The extracted information includes the app name and the name of all its activities. The app name can help LLM understand the type of app it belongs to, and the activity name can help LLM infer the current functional GUI page and the functional GUI page after entering the input content, thereby providing a basis for generating hint-text that conforms to contextual semantics.
Page GUI Entity Information provides the semantics of the current GUI page during the interactive process, which facilitates the LLM to capture the current snapshot. We extract the activity name of the current GUI page, all the components represented by the “text” field or “resource-id” field (the first non-empty one in order), and the component positions on the page. For the position, inspired by the screen reader [8, 10], we first obtain the coordinates of each component in order from top to bottom and from left to right; components whose ordinate is below the middle of the page are marked as lower, and the rest are marked as upper.
    Input Component Entity information denotes the micro-level semantics of the GUI page, i.e., the inherent meaning of all its text input components, which facilitates the LLM in understanding the input requirements and the semantic information of its contextual components. The extracted information includes “text” and “resource-id” fields (the first non-empty one in order). To avoid the empty textual fields of a component, we also extract the information from nearby components to provide a more thorough perspective, which includes the “text” of parent node components and sibling node components.
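As a rough sketch of this extraction step, the snippet below pulls component labels, upper/lower positions, and text input components from a view hierarchy dump; the attribute names (“text”, “resource-id”, “bounds”, “class”) follow the UIAutomator dump format, and the screen height is a placeholder rather than a value from our implementation.

```python
# Rough sketch (assumed dump format and placeholder screen height): extract
# the page-level entities summarized in Table 1.
import re
import xml.etree.ElementTree as ET

def extract_page_entities(vh_xml_path: str, screen_height: int = 2280) -> dict:
    tree = ET.parse(vh_xml_path)
    widgets, text_inputs = [], []
    for node in tree.iter("node"):
        # Use "text" first, then fall back to "resource-id", as in Section 5.1.1.
        label = node.get("text") or node.get("resource-id") or ""
        if not label:
            continue
        # Bounds look like "[x1,y1][x2,y2]"; widgets below mid-screen are "lower".
        ys = list(map(int, re.findall(r"\d+", node.get("bounds", "[0,0][0,0]"))))[1::2]
        position = "lower" if sum(ys) / len(ys) > screen_height / 2 else "upper"
        widgets.append({"label": label, "position": position})
        if "EditText" in node.get("class", ""):
            text_inputs.append(label)
    return {"components": widgets, "text_inputs": text_inputs}
```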
Table 2:
Id | Prompt type | Instantiation | Example
1 | In-context learning prompt <Hint-text examples> | We will provide you with 6 examples: 1. [TextInput], [NearbyInput], [Hint-text]; 2. [TextInput], [NearbyInput], [Hint-text]; ...; 6. [TextInput], [NearbyInput], [Hint-text] | We will provide you with 6 examples: 1st text input is “From”, its nearby components are “...”, its hint-text is ...; 2nd text input is “To”, its nearby components are ..., its hint-text is ...; ...; 6th text input is “Flight”, its nearby ..., its hint-text is “Enter the city”.
2 | GUI prompt <App info> | [AppName], [Activities] | The app name is “Flight”, it has the following activities: “Main, ...”
3 | GUI prompt <Page GUI info> | [ActivityName], [Component], [Position] | The current GUI page is “SearchFlight”, it has the following components: “Search, ...”, the upper part of the page is “...”, the lower part ...
4 | GUI prompt <Input component info> | [TextInput], [NearbyInput] | The text input of this page is “Depart”, its nearby components are ...
5 | Feedback prompt <Feedback> | [Feedback], [ErrorMessage] | The input content “train” doesn’t pass the page, the error message of the input component is: “Please enter the correct city name”.
6 | Query & feedback query <Query> | Please generate a hint-text for the input component based on the above information, and generate corresponding input content based on the generated hint-text. | —
7 | Query & feedback query <Feedback Query> | Please regenerate the hint-text and its corresponding input content based on the feedback information above. | —
8 | Example output <Example output> | [Hint-text], [InputContent] | Please output according to the following example: the hint-text is “xxx”, the input content is “xxx”.
Table 2: The example of the GUI prompt construction. It provides the app, GUI page and the input component information, then queries the LLM for the hint-text and its corresponding input content.

    5.1.2 GUI Prompt Construction.

With the extracted GUI information, we combine it into a GUI prompt that the LLM can understand, as shown in Table 2, i.e., <App information> + <GUI page information> + <Input component information> + <Query> + <Example output>. Generally speaking, it first provides the app information, GUI page information and input component information, then queries the LLM for the hint-text of the text input and its corresponding input content, and gives it the example output as shown in Figure 6. Due to the robustness of the LLM, the generated prompt sentences do not need to fully follow grammar. After receiving the GUI prompt, the LLM returns its recommended hint-text and inferred input content.
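A minimal sketch of this assembly step is shown below; the dictionary fields and exact wording are illustrative rather than our verbatim prompt templates.

```python
# Illustrative sketch: assemble the GUI prompt of Table 2 from extracted
# entities. Field names and phrasing are assumptions, not verbatim templates.
def build_gui_prompt(app: dict, page: dict, text_input: dict) -> str:
    app_info = (f'The app name is "{app["name"]}", it has the following '
                f'activities: "{", ".join(app["activities"])}".')
    page_info = (f'The current GUI page is "{page["activity"]}", it has the '
                 f'following components: "{", ".join(page["components"])}". '
                 f'The upper part of the page is "{", ".join(page["upper"])}", '
                 f'the lower part is "{", ".join(page["lower"])}".')
    input_info = (f'The text input of this page is "{text_input["label"]}", '
                  f'its nearby components are "{", ".join(text_input["nearby"])}".')
    query = ("Please generate a hint-text for the input component based on the "
             "above information, and generate corresponding input content based "
             "on the generated hint-text.")
    example = ('Please output according to the following example: '
               'the hint-text is "xxx", the input content is "xxx".')
    return "\n".join([app_info, page_info, input_info, query, example])
```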

    5.2 Enriching Prompt with Examples

It is usually difficult for an LLM to perform well on domain-specific tasks such as our hint-text generation, and a common practice is to employ the in-context learning schema [42, 94, 119] to boost performance. It provides the LLM with examples that demonstrate the instruction, enabling the LLM to better understand the task. Following the in-context learning schema, along with the GUI prompt for the text input described in Section 5.1.2, we additionally provide the LLM with examples of hint-text. To achieve this, we first build a basic example dataset of hint-text from the popular mobile apps in our motivational study. Research shows that the quality and relevance of examples can significantly affect the performance of an LLM [42, 94, 119]. Therefore, based on the dataset we built, a retrieval-based example selection method (Section 5.2.2) is designed to select the most appropriate examples according to the text input and its hint-text, which further enables the LLM to learn with pertinence.

    5.2.1 Example Dataset Construction.

We collect the hint-text of text input components from the Android apps in our motivational study and continuously build an example dataset that serves as the basis for in-context learning. Each data instance, as demonstrated in Table 2, records the GUI information of a text input component and its hint-text, which enables us to select the most suitable examples and facilitates the LLM’s understanding of what the hint-text and the text input context look like.
Mining Hint-text from Android Apps. First, we automatically crawl the view hierarchy files from the Android mobile apps in our motivational study. Then we use keyword matching to filter those related to text input components (e.g., EditText) that have hint-text. In this way, we obtain the 15,577 (45,803 - 30,226) view hierarchy files from Section 4 with text input components (all of which have hint-text) and store them in the example dataset (there is no overlap with the evaluation datasets). We then extract the GUI entity information of each text input component with the method in Section 5.1.1 and store it together with the hint-text.
Enlarging the Dataset with Hint-text During Use. We also enrich the example dataset with new hint-texts produced while HintDroid runs on various apps. Specifically, for each generated hint-text, if the input content generated from it actually triggers the next page (transition), we put the hint-text and its GUI information into the example dataset.

    5.2.2 Retrieval-based Example Selection and In-context Learning.

The hint-text examples can provide intuitive guidance to the LLM in accomplishing the task, yet excessive examples might mislead the LLM and cause performance to decline. Therefore, we design a retrieval-based example selection method to choose the most suitable examples (i.e., those most similar to the input component) for the LLM.
In detail, the similarity comparison is based on the GUI entity information of the text input components. We use Word2Vec (a lightweight word embedding method) [92] to encode the context information of each input component into a 300-dimensional sentence embedding, and calculate the cosine similarity between the input component and each data instance in the example dataset. We choose the top-K data instances with the highest similarity scores, and set K to 6 empirically. The selected data instances (i.e., examples) are provided to the LLM in the format of GUI entity information and hint-text, as demonstrated in Table 2.
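The following sketch illustrates the retrieval step; the `embed` function stands in for the 300-dimensional Word2Vec sentence embedding described above, and the example database is assumed to store each instance’s context string and hint-text.

```python
# Sketch of retrieval-based example selection: rank stored examples by cosine
# similarity of their context embeddings and keep the top K (= 6 empirically).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_examples(query_context: str, example_db: list, embed, k: int = 6) -> list:
    """example_db: list of dicts with keys 'context' and 'hint_text';
    embed: callable mapping a context string to a fixed-size vector."""
    q = embed(query_context)
    ranked = sorted(example_db,
                    key=lambda ex: cosine(q, embed(ex["context"])),
                    reverse=True)
    return ranked[:k]
```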
Figure 6: Example of the prompt generation. The prompts include: in-context learning prompt, GUI prompt, feedback prompt, query and example output.

    5.3 Feedback Extraction and Prompting

Since the LLM cannot always generate the correct hint-text, we ask it to generate not only the hint-text but also input content that aligns with the hint-text, and use the input content to help determine whether the hint-text is appropriate. The design idea of HintDroid is to have the LLM role-play a low-vision user, providing it with the GUI context and hint-text to see whether it can generate the input content correctly. We therefore optimize the hint-text by determining whether the generated input content can trigger the next page, rather than simply triggering the next page, which is what QTypist [81] did for software testing. In detail, we enter the generated content into the text input component and check whether it triggers the next page (transition). If it does not, we use the failure information as feedback and add it to the prompt to re-query the LLM.

    5.3.1 Automated Input Content Checking.

We design an automated script to input the content into components and perform operations. Specifically, the LLM outputs the hint-text and input content of each text input component in a fixed format. We recode it into an operation script in Android ADB format, such as adb shell input text “xxx”. We then iterate through the components of the current page, identify the components that require further action after the input is completed, and use ADB commands to perform the operations. After the operations are completed, we check whether the next page has been triggered, i.e., whether the page name and components have changed. For cases that fail to trigger a transition, we proceed with the following feedback extraction.
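A simplified sketch of such a script is shown below; treating a change of the focused window reported by dumpsys as the page transition, and the tap coordinates and wait time, are assumptions rather than our exact implementation.

```python
# Sketch (assumed details): type the LLM-suggested content via adb, tap a
# confirm button, and treat a change of the focused window as "next page".
import subprocess
import time

def adb(*args: str) -> str:
    return subprocess.run(["adb", "shell", *args],
                          capture_output=True, text=True).stdout

def focused_window() -> str:
    # The mCurrentFocus line identifies the foreground window/activity.
    out = adb("dumpsys", "window", "windows")
    return next((line for line in out.splitlines() if "mCurrentFocus" in line), "")

def try_input(content: str, confirm_xy=(540, 1600)) -> bool:
    before = focused_window()
    adb("input", "text", content.replace(" ", "%s"))   # adb encodes spaces as %s
    adb("input", "tap", str(confirm_xy[0]), str(confirm_xy[1]))
    time.sleep(2)                                       # wait for the transition
    return focused_window() != before
```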

    5.3.2 Error Message Extraction.

When incorrect text is entered into a component, the app may report an error message, e.g., it may alert the user that the password should contain letters and digits. The error message can further help the LLM understand what a valid input should look like.
We extract the error messages via differential analysis, which compares the GUI page before and after inputting the text and extracts the text fields of newly emerged components (e.g., a popup window) in the later GUI page, with examples shown in Figure 2. We also record the text input that triggered the error message, which helps the LLM understand the reason behind it.
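This differential analysis can be sketched as a simple set difference over the visible text fields of the two view hierarchy dumps, as below; this is a simplification of the comparison described above.

```python
# Sketch of the differential analysis: keep only the text fields that newly
# appear after submitting the input (e.g., a popup's error message).
import xml.etree.ElementTree as ET

def visible_texts(vh_xml_path: str) -> set:
    tree = ET.parse(vh_xml_path)
    return {n.get("text", "").strip()
            for n in tree.iter("node") if n.get("text", "").strip()}

def extract_error_messages(before_xml: str, after_xml: str) -> list:
    new_texts = visible_texts(after_xml) - visible_texts(before_xml)
    return sorted(new_texts)   # e.g., ["Please enter the correct city name"]
```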

    5.3.3 Feedback Prompt Construction.

For cases where there is no successful page transition, we combine the aforementioned information in Table 2 into the feedback prompt, i.e., <Feedback> + GUI prompt + <Feedback query> + <Example output>. We design two types of <Feedback>. One tells the LLM that the current hint-text and input content are incorrect. The other provides the information dynamically generated by the text input component, which may make the expected input clearer. Note that not all input components provide an error message; for components without one, we display null in the prompt. Finally, we ask the LLM to re-optimize the hint-text based on the feedback prompt.
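Putting the pieces together, a feedback prompt could be assembled as in the following sketch; the phrasing mirrors Table 2 but is illustrative rather than our exact template.

```python
# Illustrative feedback prompt assembly (wording is an assumption): the failed
# input and the error message (or "null") are combined with the GUI prompt
# and the feedback query from Table 2 before re-querying the LLM.
from typing import Optional

def build_feedback_prompt(gui_prompt: str, failed_input: str,
                          error_msg: Optional[str]) -> str:
    feedback = (f'The input content "{failed_input}" doesn\'t pass the page, '
                f'the error message of the input component is: "{error_msg or "null"}".')
    query = ("Please regenerate the hint-text and its corresponding input "
             "content based on the feedback information above.")
    return "\n".join([feedback, gui_prompt, query])
```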

    5.4 Implementation

We implement HintDroid based on GPT-3.5 Turbo, released on the OpenAI website. It obtains the view hierarchy file of the current GUI page through UIAutomator [124] to extract the GUI information of text input components. HintDroid can be integrated with an automated GUI exploration tool, which automatically extracts the GUI information and generates the hint-text. After obtaining the generated hint-text from HintDroid, we further design an automated script to add the missing hint-text to the app. Specifically, as shown in Figure 7, given any app, the automated script consists of the following five steps. (1) We use Application Explorer [29] to automatically run the app and get the view hierarchy files for each page. (2) We detect GUI pages (view hierarchy files) with missing hint-text based on their text input components. (3) HintDroid automatically generates the corresponding hint-text. (4) We automatically decompile the APK file of the app and retrieve the code of the text input; we design a script that automatically decodes the APK. (5) We automatically repackage the APK file; we design a script that automatically encodes and packages the modified code into an APK, completing the repair. Note that if the app is open-source, we can directly modify its source code and repackage it. HintDroid is therefore not a dynamic interactive process; it is a one-time offline job. We also measure that the average time for generating hint-text for each GUI page is 1.86 seconds.
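As one possible realization of steps (4) and (5) (the description above does not name a specific tool), the sketch below drives decoding and rebuilding with apktool and delegates the hint-text injection to a separate edit function; the rebuilt APK would still need to be signed before installation.

```python
# Sketch of steps (4)-(5) only, assuming apktool is used for decoding and
# rebuilding and that hint-texts are injected into the decoded layout files
# by a caller-provided edit step.
import subprocess

def run(cmd: list) -> None:
    subprocess.run(cmd, check=True)

def repackage_with_hints(apk: str, workdir: str, out_apk: str, edit_layouts) -> None:
    run(["apktool", "d", apk, "-o", workdir, "-f"])   # (4) decode the APK
    edit_layouts(workdir)                              # inject android:hint values
    run(["apktool", "b", workdir, "-o", out_apk])      # (5) rebuild the APK
    # Signing (e.g., with apksigner) is still required before installing out_apk.
```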
    Figure 7:
    Figure 7: Workflow of implementation. ① Extracting GUI pages. ② Detecting GUI pages with missing hint-text. ③ Predicting hint-text based on GUI information. ④ Decompiling APK to obtain code. ⑤ Repackaging APK after code modification.

    6 Effectiveness Evaluation

We evaluate the effectiveness of HintDroid in terms of hint-text generation accuracy. For accuracy, we compare exact match, BLEU, METEOR, ROUGE-L and CIDEr with 12 baseline methods to demonstrate its advantage (details in Section 6.1.2). For the model structure, we conduct ablation experiments to evaluate the impact of each (sub-)module on performance.

    6.1 Experiment Setup

    6.1.1 Dataset and Experiment Procedures.

We collect GUI pages that contain text input components and their corresponding hint-text from popular apps on Google Play as the experimental dataset. Specifically, we follow the data collection method of the motivational study in Section 4 and randomly select 50 apps from each of the 33 categories, different from the motivational study dataset, totaling 1,650 apps. We further ensure that the experimental data does not overlap with previous data through app name matching. We use the Application Explorer [29] to obtain GUI page files for each app, and filter out GUI pages with text input components and their corresponding hint-text through the “hint” attribute field. According to these criteria, we obtain a total of 2,797 text input components with hint-text from 753 apps. We further recruited 2 developers with over 5 years of development experience to evaluate the quality of these hint-texts. The developers evaluate hint-text based on the accessibility specifications in the Google developer guidelines and annotate it following an open coding protocol [117]. A third developer re-examines the results until a consensus is reached. In the end, 2,659 text input components with hint-text are used for our effectiveness evaluation.
Table 3:
Method | Exact match | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr
Learning-based methods:
RNN | 0.29 | 0.37 | 0.35 | 0.31 | 0.29 | 0.26 | 0.24 | 0.21
LSTM | 0.28 | 0.33 | 0.31 | 0.25 | 0.22 | 0.19 | 0.17 | 0.13
Seq2Seq | 0.30 | 0.37 | 0.32 | 0.29 | 0.27 | 0.25 | 0.21 | 0.18
Transformer | 0.39 | 0.53 | 0.47 | 0.43 | 0.40 | 0.37 | 0.36 | 0.35
RNNInput | 0.28 | 0.35 | 0.33 | 0.32 | 0.27 | 0.25 | 0.23 | 0.19
LabelDroid | 0.34 | 0.47 | 0.45 | 0.39 | 0.36 | 0.35 | 0.32 | 0.31
CNN+LSTM | 0.26 | 0.29 | 0.24 | 0.19 | 0.17 | 0.15 | 0.09 | 0.08
Matching-based methods:
Retrieval based | 0.21 | 0.27 | 0.24 | 0.22 | 0.18 | 0.20 | 0.18 | 0.15
Random based | 0.11 | 0.16 | 0.13 | 0.10 | 0.07 | 0.08 | 0.09 | 0.07
Rule based | 0.32 | 0.45 | 0.39 | 0.32 | 0.27 | 0.29 | 0.31 | 0.27
LLM-based methods:
ChatGPT | 0.35 | 0.49 | 0.43 | 0.39 | 0.36 | 0.38 | 0.33 | 0.31
QTypist | 0.31 | 0.47 | 0.41 | 0.35 | 0.33 | 0.37 | 0.34 | 0.33
HintDroid | 0.71 | 0.83 | 0.77 | 0.73 | 0.66 | 0.67 | 0.63 | 0.62
Table 3: Result of the accuracy of hint-text generated by HintDroid and baselines.

    6.1.2 Baselines.

    Since there are hardly any existing approaches for the hint-text generation of mobile apps, we employ 12 baselines from various aspects to provide a thorough comparison.
    First, we directly utilize ChatGPT [116] as the baseline. We provide the GUI information of the text input component (as described in Table 2), and ask it to generate hint-text.
Secondly, we use the hint-text example dataset constructed in Section 5.2.1 to train text generation models. The example dataset contains 15,577 pairs of GUI information for input components and their corresponding hint-text. We use the GUI information as model input and the hint-text as output to train the text generation models. For text-based hint-text generation, we select Recurrent Neural Network (RNN) [137], LSTM [138], Seq2Seq [77], and Transformer [52] as hint-text generation models. For image-based hint-text generation, we choose LabelDroid [30] and CNN+LSTM [137]. LabelDroid [30] is a deep learning-based model that automatically predicts the content description of an image button. Since our hint-text example dataset records the GUI screenshots corresponding to each hint-text, we also use this dataset as the training set for these models.
Thirdly, considering that the input scenarios of some apps are similar to those in the example dataset, we design a retrieval-based matching method and a random-based matching method to select similar hint-text, using the retrieval method in Section 5.2.2. We also design a rule-based hint-text generation method with 36 general rules summarized from the example data.
Fourthly, we select existing input text generation tools (QTypist and RNNInput) and fine-tune them with the example data in Section 5.2.2. QTypist [81] is a text input generation approach based on GPT-3; RNNInput [79] utilizes an RNN model and Word2Vec to predict the text input value for a given text input component.

    6.1.3 Metric.

To evaluate the performance of HintDroid, we select 5 widely-used evaluation metrics: exact match [30], BLEU [106], METEOR [21], ROUGE [78], and CIDEr [125], inspired by related work on machine translation and image captioning. The exact match rate is the percentage of testing pairs whose predicted hint-text exactly matches the ground truth. Exact match is a binary metric, i.e., 0 if there is any difference, otherwise 1. It cannot tell the extent to which a generated hint-text differs from the ground truth, so we also adopt the other metrics.
BLEU [106] is an automatic evaluation metric widely used in machine translation. It calculates the similarity between machine-generated translations and human-created reference translations. BLEU is defined as the product of n-gram precision and a brevity penalty. As most hint-texts are short, we measure the BLEU value with n set to 1, 2, 3, 4, represented as BLEU@1, BLEU@2, BLEU@3 and BLEU@4.
    METEOR [21] (Metric for Evaluation of Translation with Explicit ORdering) is another metric used for machine translation evaluation. It is proposed to fix some disadvantages of BLEU which ignores the existence of synonyms and recall ratio. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [78] is a set of metrics based on recall rate, and we use ROUGE-L, which calculates the similarity between predicted sentence and reference based on the longest common subsequence. CIDEr (Consensus-Based Image Description Evaluation) [125] uses the term frequency-inverse document frequency to calculate the weights in reference sentences for different n-grams.
All of these metrics give a real value in the range [0,1] and are usually expressed as a percentage. The higher the metric score, the more similar the machine-generated hint-text is to the ground truth. If the predicted result exactly matches the ground truth, the score of these metrics is 1 (100%). We compute these metrics using the coco-caption code [33].
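For illustration only (our evaluation uses the coco-caption code [33]), the snippet below shows how exact match and BLEU@n could be computed with NLTK as a lightweight stand-in.

```python
# Lightweight illustration (not the coco-caption pipeline): exact match and
# BLEU@n over predicted vs. ground-truth hint-texts using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match(preds: list, refs: list) -> float:
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)

def bleu_at_n(preds: list, refs: list, n: int = 1) -> float:
    weights = tuple([1.0 / n] * n)
    smooth = SmoothingFunction().method1   # avoid zero scores on short texts
    scores = [sentence_bleu([r.split()], p.split(), weights=weights,
                            smoothing_function=smooth)
              for p, r in zip(preds, refs)]
    return sum(scores) / len(scores)

# Example: bleu_at_n(["enter the city"], ["enter the departure city"], n=2)
```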

    6.2 Results and Analysis

    6.2.1 Accuracy of Hint-text Generation.

Table 3 shows the overall hint-text generation performance of HintDroid and the baselines. HintDroid achieves an average exact match, BLEU@1, BLEU@2, BLEU@3, BLEU@4, METEOR, ROUGE-L and CIDEr of 0.71, 0.83, 0.77, 0.73, 0.66, 0.67, 0.63 and 0.62 across 2,659 hint-texts. This indicates the effectiveness of our approach in generating hint-text for text input components. We further examine the bad cases of hint-text generated by HintDroid and summarize two reasons. (1) The GUI page has no contextual GUI components, or the components lack semantic information; in this case, observing the GUI context of the previous page can enhance the understanding of the input component. (2) The hint-text expresses a similar meaning but does not fully match the benchmark; we manually analyze these hint-texts and find that 93% of them express similar meanings.
Figure 8: Examples of bad cases in hint-text generation. (a) The GUI page has no contextual GUI components, or the components lack semantic information. (b) The hint-text expresses a similar meaning but doesn’t fully match the benchmark.
To show the generalization of HintDroid, we also calculate its performance in different app categories, as seen in Figure 9. We find that our approach can generate diversified hint-text for different categories of apps, which helps users understand the input requirements. Furthermore, it is good at capturing the contextual semantic information of the input components when generating the hint-text. HintDroid is also not sensitive to the app category, i.e., it shows steady performance across different app categories.
Performance comparison with baselines. Table 3 also shows the performance comparison with the baselines. Our proposed HintDroid is much better than the baselines, i.e., 82%, 57%, 64%, 70%, 65%, 43%, 75% and 77% higher in exact match, BLEU@1, BLEU@2, BLEU@3, BLEU@4, METEOR, ROUGE-L and CIDEr compared with the best baseline, Transformer. We analyze the reasons for the failure of these baselines: they mainly focus on generating correct input content for mobile GUI testing, rather than generating hint-text from the perspective of helping users and human cognition. This further indicates the advantages of our approach. We conduct the Mann-Whitney U test [89] between these models on all testing metrics. Since we perform multiple inferential statistical tests, we apply the Benjamini & Hochberg (BH) [22] method to correct p-values. Results show the improvement of our model is significant in all comparisons (p-value < 0.01). Without our elaborate design, the raw ChatGPT demonstrates poor performance, which further indicates the necessity of our approach.
Figure 9: Results for different app categories. HintDroid can generate diversified hint-text for different categories of apps, which helps users understand the input requirements.

    6.3 Ablation Study

    6.3.1 Contribution of Modules.

Figure 10 (a) shows the performance of HintDroid and its 2 variants, which remove the second and third modules respectively. In detail, for HintDroid w/o Module 2 (in-context learning prompt), we do not provide the example data to the LLM. For HintDroid w/o Module 3 (feedback prompt), we do not use feedback and just use the results generated once; we only provide the information related to the input components (as in Table 2) to the LLM.
    We can see that HintDroid’s hint-text generation performance is much higher than all other variants, indicating the necessity of the designed modules and the advantage of our approach. Compared with HintDroid, HintDroid w/o Module 3 results in the largest performance decline, i.e., 106% drop (0.30 vs. 0.62) in CIDEr rate. This further indicates that the feedback module can help LLM to deeply understand the input requirements and optimize the generated hint-text based on the error message.
    HintDroid w/o in-context learning prompt also undergoes a big performance decrease, i.e., 93% (0.32 vs. 0.62) in CIDEr rate. This might be because without the examples, the LLM would not understand the input intention and criteria for what kinds of hint-text are needed.
Figure 10: Result of ablation study. The results demonstrate that removing any of the modules/sub-modules would result in a noticeable performance decline, indicating the necessity and effectiveness of the designed modules/sub-modules.
Contribution of Sub-modules. Figure 10 further demonstrates the performance of HintDroid and its 4 variants, in which we remove parts of the prompt when querying the LLM, i.e., the app information, page GUI information, input component information and error message. The experimental results demonstrate that removing any of the sub-modules results in a noticeable performance decline, indicating the necessity and effectiveness of the designed sub-modules.

    6.3.2 Influence of Different Number of Examples.

Figure 10 demonstrates the performance under different numbers of examples provided in the prompt. We can see that the hint-text generation performance increases with more examples, reaching the highest exact match, BLEU, METEOR, ROUGE-L and CIDEr with 6 examples. After that, the performance gradually decreases even as more examples are added. This indicates that too few or too many examples both damage performance, because of too little information or noise in the provided examples.

    7 Usefulness Evaluation

To evaluate HintDroid, we also conduct a user study to demonstrate its usefulness in real-world practice. Our goal is to examine: (1) whether HintDroid can help visually impaired users successfully fill in the correct input; (2) whether HintDroid can effectively help visually impaired users explore the functionality of the application; (3) whether HintDroid can reduce the time needed to fill in the correct input.

    7.1 Dataset of User Study

To ensure the representativeness of the test data, we begin with the 3,398 apps from Google Play described in Section 4, which have text input components without hint-text (note that the data in this section is not used for model training or in-context learning). To further confirm the universality and usefulness of our model, we first filter them according to the following rules: (1) at least 3 pages require text input components, (2) the generated hint-texts can be integrated into the app, and (3) the number of activities in the app is more than 10. According to these rules, we obtain 371 apps, and further select the app with the highest download number in each app category as our experimental data. As shown in Figure 11, we describe the data selection process according to the PRISMA flow diagram [105]. We end up with 33 apps (1 app per category) with 237 text input components and use them for the final evaluation, with details in Table 4.
Figure 11: Flowchart of data selection. We filter them according to 3 rules. (1) At least 3 pages requiring text input components, (2) generated hint-texts can be integrated into the app, and (3) the number of activities in the app is more than 10.

    7.2 Participants Recruitment

We recruit 36 visually impaired users to participate in the experiment, of whom 20 are male and 16 are female. Their ages range from 20 to 55 years (median = 36 years). 22 participants have no residual vision, 8 have only light/dark perception, and 6 have very little central vision. The participants have had visual impairments for between 7 and 41 years. All participants use screen readers (TalkBack [8]) as their primary assistive technology for mobile apps, and all have been using mobile devices for 5 years or more. Every participant receives $100 as a reward after the experiment. At the beginning of the experiment, we ask participants to use app functions as much as possible. We also conduct a follow-up survey among the participants regarding their experience in the experiment.
The study divides the 36 participants into two groups: the experimental group (P1 to P18), who use the mobile apps with the hint-text generated by our HintDroid, and the control group (P19 to P36), who use the apps without hint-text. Each pair of participants ⟨Px, P(x+18)⟩ has comparable app-using experience to ensure that the experimental group has similar expertise and capability to the control group overall [36, 97, 109, 120]. Specifically, given that all participants are affiliated with the same rehabilitation and education institution, we seek the collaboration of the institution’s director to assist in the matching process. This process ensures a one-to-one ratio between the experimental and control groups, with pairings based on comparable personal competencies. The director’s extensive familiarity with the participants, stemming from over two years of close association, provides invaluable insight into their capabilities, ensuring an equitable and balanced distribution between the two study groups.

    7.3 Experimental Design

To avoid potential inconsistency, we pre-install the 33 apps on a Samsung Galaxy Note 10 with Android 9.0 OS. For each app used by the experimental group, we first run Application Explorer [29] to collect the GUI page files, then run our HintDroid to complete the missing hint-text, and finally repackage the APK file according to the automated script in Section 5.4. To ensure the correctness of the experiment, we check that each app still runs correctly after repackaging.
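For illustration, the following is a minimal sketch of how generated hint-text could be injected into decompiled layout files before repackaging; the resource-id-to-hint mapping, the directory layout, and the assumption that the APK has already been unpacked (e.g., with apktool) are illustrative, not the exact automated script from Section 5.4.

```python
# Minimal sketch: add an android:hint attribute to EditText nodes that lack one,
# given a mapping from resource id to generated hint-text. The mapping and the
# res/layout/*.xml layout are hypothetical assumptions for this illustration.
import xml.etree.ElementTree as ET
from pathlib import Path

ANDROID_NS = "http://schemas.android.com/apk/res/android"
ET.register_namespace("android", ANDROID_NS)

def inject_hints(layout_dir: str, hints_by_id: dict[str, str]) -> None:
    hint_attr = f"{{{ANDROID_NS}}}hint"
    id_attr = f"{{{ANDROID_NS}}}id"
    for layout_file in Path(layout_dir).glob("*.xml"):
        tree = ET.parse(layout_file)
        changed = False
        for node in tree.iter():
            # Match both "EditText" and fully qualified widget class names.
            if not node.tag.endswith("EditText"):
                continue
            res_id = node.get(id_attr, "")
            if node.get(hint_attr) is None and res_id in hints_by_id:
                node.set(hint_attr, hints_by_id[res_id])
                changed = True
        if changed:
            tree.write(layout_file, encoding="utf-8", xml_declaration=True)
```

After the layouts are rewritten, the unpacked app would be rebuilt and re-signed before installation.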
We start the screen readers on the devices and ask the participants to explore each app separately. The participants in both groups use the 33 given mobile apps. They are required to fully explore each app and cover as many functionalities as possible. Each participant has up to 15 minutes per app, which is far more than the typical app session (71.56 seconds) [23], and can choose to end the exploration early based on their perceived exploration progress. Each of them conducts the experiment individually without any discussion with the others. During their exploration, all their screen interactions are recorded, based on which we derive their exploration performance.

    7.4 Evaluation Metrics

    Following previous studies [34, 76, 81], we use the following metrics to evaluate the effectiveness of HintDroid.
Input accuracy: (number of correct inputs filled in by the user in an app) / (number of all input components in the app)
Activity coverage: (number of discovered activities) / (number of all activities)
State coverage: (number of discovered states) / (number of all possible states)
Filling time: average time from arriving at a page with a text input to filling in the correct input.
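As a concrete reading of these definitions, the sketch below computes the four metrics from per-app interaction logs; the log structure (sets of discovered activities/states and per-input timestamps) is a hypothetical illustration, not the actual analysis scripts.

```python
# Minimal sketch of the four evaluation metrics above.

def input_accuracy(correct_inputs: int, total_inputs: int) -> float:
    return correct_inputs / total_inputs

def activity_coverage(discovered: set[str], all_activities: set[str]) -> float:
    return len(discovered) / len(all_activities)

def state_coverage(discovered_states: set[str], all_states: set[str]) -> float:
    return len(discovered_states) / len(all_states)

def avg_filling_time_minutes(fill_records: list[tuple[float, float]]) -> float:
    """Average of (time correct input submitted - time input page reached), in minutes.

    Each record is a pair of timestamps in seconds: (arrived_at_page, correct_input_done).
    """
    durations = [(done - arrived) / 60.0 for arrived, done in fill_records]
    return sum(durations) / len(durations)
```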
Table 4:
| id | App | Category | Input accuracy (control / experiment) | Activity coverage (control / experiment) | State coverage (control / experiment) | Filling time, min (control / experiment) |
| 1 | HealthHB | Health | 0.50 / 0.81 | 0.54 / 0.72 | 0.51 / 0.64 | 2.38 / 1.18 |
| 2 | WeatherL&W | Weather | 0.40 / 0.88 | 0.24 / 0.66 | 0.62 / 0.78 | 1.93 / 1.02 |
| 3 | MessagerWA | Communi | 0.31 / 0.95 | 0.29 / 0.73 | 0.24 / 0.68 | 2.30 / 1.10 |
| 4 | MoneyTK | Finance | 0.53 / 0.72 | 0.50 / 0.69 | 0.36 / 0.74 | 2.74 / 0.70 |
| 5 | FoodFacts | Food | 0.44 / 0.90 | 0.52 / 0.65 | 0.47 / 0.63 | 1.49 / 1.37 |
| 6 | MPAS+ | Maps | 0.50 / 0.80 | 0.28 / 0.67 | 0.35 / 0.74 | 1.79 / 1.46 |
| 7 | PSStore | Product | 0.34 / 0.80 | 0.28 / 0.63 | 0.52 / 0.76 | 1.50 / 1.43 |
| 8 | NewAudio | Music | 0.42 / 0.84 | 0.20 / 0.64 | 0.51 / 0.82 | 2.89 / 0.64 |
| 9 | WallETH | Personal | 0.47 / 0.90 | 0.21 / 0.74 | 0.26 / 0.81 | 1.72 / 0.93 |
| 10 | PicGall | Photo | 0.45 / 0.71 | 0.27 / 0.59 | 0.37 / 0.81 | 2.92 / 0.53 |
| 11 | SmartNew | News | 0.36 / 0.75 | 0.52 / 0.68 | 0.25 / 0.64 | 1.60 / 1.49 |
| 12 | MyHM | House | 0.18 / 0.87 | 0.47 / 0.75 | 0.43 / 0.71 | 2.56 / 0.83 |
| 13 | INSTEAD | Life | 0.15 / 0.77 | 0.53 / 0.69 | 0.30 / 0.75 | 1.80 / 0.54 |
| 14 | GameSpe | Game | 0.55 / 0.73 | 0.25 / 0.75 | 0.50 / 0.76 | 1.71 / 0.51 |
| 15 | BusinessEX | Business | 0.12 / 0.72 | 0.21 / 0.65 | 0.35 / 0.79 | 1.50 / 0.55 |
| 16 | PocketMaps | Travel | 0.14 / 0.80 | 0.41 / 0.71 | 0.46 / 0.78 | 2.84 / 0.58 |
| 17 | EventOR | Events | 0.12 / 0.95 | 0.52 / 0.73 | 0.57 / 0.75 | 1.43 / 0.23 |
| 18 | FitTAP | Comics | 0.16 / 0.93 | 0.46 / 0.69 | 0.31 / 0.82 | 2.13 / 0.46 |
| 19 | SkyTube | Video | 0.17 / 0.96 | 0.45 / 0.72 | 0.36 / 0.82 | 2.19 / 0.95 |
| 20 | LibReader | Books | 0.26 / 0.71 | 0.33 / 0.69 | 0.25 / 0.69 | 2.50 / 0.32 |
| 21 | NoxSecu | Tool | 0.53 / 0.96 | 0.47 / 0.68 | 0.63 / 0.64 | 2.03 / 1.29 |
| 22 | EarnMon | Social | 0.13 / 0.91 | 0.39 / 0.73 | 0.63 / 0.66 | 2.29 / 0.32 |
| 23 | WalkTra | Sports | 0.35 / 0.70 | 0.42 / 0.72 | 0.56 / 0.65 | 2.71 / 0.68 |
| 24 | ParentLA | Parenting | 0.36 / 0.72 | 0.24 / 0.69 | 0.59 / 0.80 | 1.70 / 1.55 |
| 25 | ISAY | Medical | 0.53 / 0.96 | 0.48 / 0.60 | 0.29 / 0.72 | 2.18 / 0.43 |
| 26 | Ipsos | Commun | 0.09 / 0.72 | 0.54 / 0.79 | 0.55 / 0.79 | 2.71 / 1.25 |
| 27 | FIRR | Libraries | 0.53 / 0.79 | 0.52 / 0.65 | 0.34 / 0.67 | 1.33 / 1.23 |
| 28 | DRBUs | Shopping | 0.28 / 0.72 | 0.53 / 0.74 | 0.63 / 0.71 | 2.19 / 1.44 |
| 29 | Learning | Education | 0.49 / 0.89 | 0.33 / 0.69 | 0.40 / 0.63 | 2.17 / 0.95 |
| 30 | MMDR | Dating | 0.06 / 0.74 | 0.21 / 0.75 | 0.31 / 0.77 | 1.37 / 0.73 |
| 31 | Pretty | Beauty | 0.19 / 0.94 | 0.58 / 0.70 | 0.60 / 0.73 | 1.99 / 0.36 |
| 32 | Fair | Auto | 0.33 / 0.82 | 0.46 / 0.74 | 0.51 / 0.65 | 1.77 / 1.25 |
| 33 | ArtPIX | Art | 0.49 / 0.85 | 0.24 / 0.57 | 0.41 / 0.67 | 2.88 / 0.86 |
| Average | | | 0.33 / 0.83 | 0.39 / 0.69 | 0.44 / 0.73 | 2.10 / 0.88 |
Table 4: The comparison of the experiment and control groups. We present the input accuracy, average activity coverage, state coverage, and average filling time across the two groups; each metric cell shows control / experiment.

    7.5 Results and Analysis

We present the input accuracy, average activity coverage, state coverage, and average filling time of the two groups, as shown in Table 4.

    7.5.1 Higher Input Accuracy.

As shown in Table 4, the average input accuracy of the experimental group is 0.83, which is about 152% ((0.83-0.33)/0.33) higher than that of the control group. The results of the Mann-Whitney U Test [89] show a significant difference (p-value < 0.01) between the two groups on the input accuracy metric. This indicates that HintDroid can generate hint-text by analyzing the GUI information of text input components, helping visually impaired individuals better understand input requirements and successfully fill in the correct input. We also find that for some input components with limited information, the hint-text generated by HintDroid is of great help to visually impaired individuals.
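As a concrete illustration of this comparison, the snippet below is a minimal sketch of running the Mann-Whitney U test on per-participant input accuracy values with SciPy; the CSV file names and column name are hypothetical placeholders for the released experiment data.

```python
# Minimal sketch: two-sided Mann-Whitney U test between the two independent groups.
import csv
from scipy.stats import mannwhitneyu

def load_metric(path: str, column: str) -> list[float]:
    """Read one metric column from a per-participant CSV export (hypothetical layout)."""
    with open(path, newline="") as f:
        return [float(row[column]) for row in csv.DictReader(f)]

control = load_metric("control_group.csv", "input_accuracy")
experiment = load_metric("experiment_group.csv", "input_accuracy")

stat, p_value = mannwhitneyu(experiment, control, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```

The same test applies unchanged to the activity coverage, state coverage, and filling time columns reported below.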
    Figure 12:
Figure 12: Examples of good cases generated by HintDroid. (a) HintDroid generates the full name of the abbreviation and further explains it. (b) HintDroid provides an explanation for a function distinguished only by color. (c) HintDroid infers hint-text based on context.
We analyze these text input components and summarize them into three categories. First, some input components only have a content description/alt-text containing an abbreviation. For example, the abbreviation “BFR” for “Body Fat Ratio” shown in Figure 12 (a), commonly found in health apps, may not be understood by blind users. Second, due to poor GUI design, the interface may only use simple icon components or colors to differentiate inputs. For example, Figure 12 (b) shows an arrow icon for switching between the departure city and arrival city, using only colors to differentiate them (which visually impaired individuals cannot perceive), without providing any textual description. Finally, some input components provide neither hint-text nor a content description/alt-text, and the requirement needs to be inferred from the context. For example, as shown in Figure 12 (c), in the date selection, users need to enter “diastolic pressure” and “systolic pressure” without any text explanation.

    7.5.2 More Explored GUI Pages.

With our HintDroid, the activity coverage of the experimental group is 0.69, which is about 77% ((0.69-0.39)/0.39) higher than that of the control group, and the state coverage is 0.73, which is about 66% ((0.73-0.44)/0.44) higher. The results of the Mann-Whitney U Test [89] show a significant difference (p-value < 0.01; more detailed experimental information is available on our website1) between the two groups on both metrics. This indicates that the hint-text generated by our HintDroid helps visually impaired users fill in the correct input and thereby explore more states and activities. We also find that HintDroid helps visually impaired users reach activities that are hard to find without it. For example, some search-type text input components block access to subsequent content unless the correct search query is entered.

    7.5.3 Less Time Cost.

It takes only 0.88 minutes on average for visually impaired users with our HintDroid to trigger the next page by filling in the correct input, versus 2.10 minutes in the control group. The results of the Mann-Whitney U Test show a significant difference (p-value < 0.01) between the two groups in filling time. In fact, the average time of the control group is underestimated, because on average 9 participants either do not attempt to fill in the input or do not continue after filling in an incorrect input, which means they would likely need even more time for these input functions.
We watch the video recordings of the app explorations in the control group to further investigate the reasons for the higher time cost. Without HintDroid, participants are often unable to understand the requirements of an input component, and content entered based on their own experience deviates significantly from the actual input requirements. In contrast, participants who use the hint-text generated by HintDroid almost always fill in the correct input in one go. This observation further confirms the importance of the hint-text generated by our HintDroid.

    7.6 Users’ Experience With HintDroid

According to the visually impaired users’ feedback, all of them confirm the usefulness of our HintDroid in assisting their app exploration. They all appreciate that the hint-text generated by our HintDroid helps them understand the input requirements and successfully fill in the correct input, increasing activity and state coverage. For example, “The hint-text generated by HintDroid is very helpful for us to fill in the input.” (P1), “The hints were super helpful in guiding me through the input fields. Thanks for making it clear!” (P3), “I really like the straightforward expression of these hints.” (P8), “The hints gave me exactly what I needed to know.” (P14), “They made it easy for me to fill in the blanks.” (P15). Participants also express that they like the hint-text, such as “Great, it’s useful. I like it!” (P2), “Yo, these hints were bang on!” (P5), “Plain and simple!” (P9), “These hints were crystal clear!” (P11), “I liked how the hints were friendly.” (P12). Participants further note that our HintDroid saves exploration time, such as “Nice job! The hint-text of HintDroid saves our time.” (P17), “These hints made the whole input process so much faster.” (P13).
The participants also mention drawbacks and potential improvements of our HintDroid. They hope that we can also provide examples of correct input or the expected input formats (our method can also generate input content based on the hint-text). For example, “If the hints could guide me better for specific formats like phone numbers or emails, that’d be awesome!” (P4), “Providing links or references for more info in the hints would be really helpful.” (P7), “Providing links or references for more info in the hints would be helpful.” (P10), “If these prompts can show me how much I need to input, it might be more useful” (P18). Participants also hope that HintDroid can be adjusted in real time based on their input in the future. For example, “Can it be made into interactive hints? When I encounter problems, your tool can provide more details.” (P12), “If we make a mistake, it’d be awesome if the hint could help us figure out what went wrong.” (P16).

    8 Discussion

    In summary, we find that the hint-text generated by our HintDroid can effectively help visually impaired users successfully fill in the correct input.

    8.1 The Generalization of Our Approach and Findings

HintDroid is designed to generate hint-text for text input components, helping visually impaired users successfully fill in the correct input. In addition to Android, there are many other platforms such as iOS, Web, and Desktop. To reach a larger market, developers tend to develop either one cross-platform app or separate native apps for each platform, given the performance benefits of native apps. Although HintDroid is designed specifically for Android, other platforms expose similar types of GUI information, so it can also be extended to them.
We conduct a small-scale experiment on two other popular platforms: 20 iOS apps with 34 text inputs and 20 Web apps with 57 text inputs, with details on our website. Results show that HintDroid achieves an average exact match and BLEU@1 of 0.73 and 0.88 for iOS apps, and 0.71 and 0.85 for Web apps. This further demonstrates the generality and usefulness of HintDroid, and we will conduct more thorough experiments in the future.

    8.2 Potential Applications to End-users

In addition to generating hint-text that helps visually impaired users successfully fill in the correct input, our HintDroid can also be applied to help end-users in their daily app usage. Given the increasing complexity of mobile apps, filling in the correct input is a challenging task, especially for elderly users. For example, a single GUI page may contain too many text input components for elderly users to work out the correct input, leaving them stuck on one page with repeated failed attempts. Even ordinary users may linger on an input page due to unclear input requirements (without hint-text), especially in new apps.
Based on the GUI information of the text input component and knowledge from popular app datasets, HintDroid can automatically generate hint-text and complete the “hint” field of the text input components in the app. Therefore, HintDroid can also be integrated into UI automation tools [53, 69, 130] to provide developers with more diverse hint-text. In addition, HintDroid generates corresponding input content based on the generated hint-text, which can be used in software testing to help testers generate diverse test cases.

    8.3 Potential Directions to Improve the Hint-text

Our experiment shows that hint-text can help visually impaired individuals understand input requirements. However, developers lack a unified style and approach when designing hint-texts, which may lead to ambiguity for visually impaired individuals. HintDroid utilizes generative models to generate hint-text, and this process could be further optimized in the future. For example, a hint-text generation model can be customized based on the historical usage records of visually impaired individuals, or based on the usage scenarios of different types of applications, providing personalized hint-text.

    8.4 Limitations

Although the average metric of the hint-text generated by HintDroid exceeds 70%, there are still some inaccuracies in the generated hint-texts. As analyzed in Section 6.2.1, different developers have different design styles for text input components, and some components have little or no contextual information. All of these factors can affect the correct generation of hint-text. We will keep improving HintDroid to generate hint-text more accurately by exploiting information from the preceding GUI pages.
For the correctness and rationality of the hint-text generated by HintDroid, we only consider whether the input content generated from the hint-text can trigger a page transition. As described in Section 5.3, not triggering a page transition is only the worst-case signal. Other factors, such as whether the information conveyed by the hint-text is reasonable, complete, and unambiguous, also need to be considered. We will incorporate these evaluation indicators in our future work.
In addition, HintDroid is currently an offline, one-time approach for repairing missing hint-text that relies on repackaging to inject the hint-text. For closed-source apps that use code encryption, code obfuscation, or other techniques that prevent decompilation and repackaging, we will instead send the hint-text generated by HintDroid to the developers via email. Given the low cost of the approach (the average time to generate hint-text for each GUI page is 1.86 seconds) and the potential security risks of repackaging, we would prefer to deliver HintDroid through real-time interaction. As suggested by the participants in Section 7.6, we will design a real-time interaction approach in the future and integrate it into the screen reader, making it possible to dynamically adjust the hint-text based on user input.

    9 Conclusion

The development of mobile applications brings a lot of convenience to the daily lives of visually impaired people. They can use the screen readers embedded in mobile operating systems to read the content of each screen within an app and understand what needs to be operated. However, missing hint-text in text input components prevents screen readers from conveying the input requirements. Based on our analysis of 4,501 Android apps with text inputs, over 76% of them are missing hint-text. To overcome these challenges, we develop an LLM-based hint-text generation model called HintDroid, which analyzes the GUI information of input components and uses in-context learning to generate the hint-text. To ensure the quality of hint-text generation, we further design a feedback-based inspection mechanism to optimize the hint-text. The automated experiments demonstrate high BLEU scores, and a user study further confirms its usefulness.
In the future, we will work in two directions. First, we will improve the performance of our approach by extracting more GUI context information. Guided by the user feedback, we will optimize the hint-text generated by HintDroid, borrowing ideas from human-machine collaboration studies to better serve users. Second, we will not limit HintDroid to assisting visually impaired individuals in app usage, and plan to explore its potential applications in software development, such as integrating it into IDEs.

    Acknowledgments

    This work was supported by the National Natural Science Foundation of China Grant No.62232016, No.62072442 and No.62272445, Youth Innovation Promotion Association Chinese Academy of Sciences, Basic Research Program of ISCAS Grant No. ISCAS-JCZD-202304, and Major Program of ISCAS Grant No. ISCAS-ZD-202302.

    Footnotes

    1
We release the source code, experiment details, and demo videos of our HintDroid at https://github.com/franklin/HintDroid. The demo video link is https://youtu.be/FWgfcctRbfI.

    Supplemental Material

• Video Preview (MP4 file)
• Video Presentation (MP4 file)
• Demo video of HintDroid (MP4 file)

    References

    [1]
2017. Screen Reader Survey. https://webaim.org/projects/screenreadersurvey7/.
    [2]
    2023. Android Developer Accessibility Guideline. https://developer.android.com/guide/topics/ui/accessibility.
    [3]
    2023. Apple App Store. https://www.apple.com/au/ios/app-store/.
    [4]
    2023. Apple Human Interface Guidelines-Accessibility. https://developer.apple.com/design/human-interface-guidelines/accessibility.
    [5]
    2023. Blindness and vision impairment. https://www.who.int/zh/news-room/fact-sheets/detail/blindness-and-visual-impairment.
    [6]
2023. Google MaterialDesign-Accessibility. https://material.io/design/usability/accessibility.html#understanding-accessibility.
    [7]
    2023. Google Play Store. https://play.google.com.
    [8]
    2023. Google TalkBack. https://github.com/google/talkback.
    [9]
    2023. Principles for improving app accessibility. https://developer.android.com/guide/topics/ui/accessibility/principles.
    [10]
    2023. VoiceOver. https://cloud.google.com/translate/docs.
    [11]
    Patricia Acosta-Vargas, Belén Salvador-Acosta, Luis Salvador-Ullauri, William Villegas-Ch, and Mario Gonzalez. 2021. Accessibility in native mobile applications for users with disabilities: A scoping review. Applied Sciences 11, 12 (2021), 5707.
    [12]
    Dragan Ahmetovic, Roberto Manduchi, James M Coughlan, and Sergio Mascetti. 2015. Zebra crossing spotter: Automatic population of spatial databases for increased safety of blind travelers. In Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility. 251–258.
    [13]
    Yakup Akgül. 2022. Evaluating the performance of websites from a public value, usability, and readability perspectives: a review of Turkish national government websites. Universal Access in the Information Society (2022), 1–16.
    [14]
    Mrim Alnfiai and Srinivas Sampalli. 2016. SingleTapBraille: Developing a text entry method based on braille patterns using a single tap. Procedia Computer Science 94 (2016), 248–255.
    [15]
    Abdulaziz Alshayban, Iftekhar Ahmed, and Sam Malek. 2020. Accessibility issues in Android apps: state of affairs, sentiments, and ways forward. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1323–1334.
    [16]
    Gary Ang and Ee Peng Lim. 2022. Learning User Interface Semantics from Heterogeneous Networks with Multimodal and Positional Attributes. In 27th International Conference on Intelligent User Interfaces. 433–446.
    [17]
    Shiri Azenkot and Nicole B Lee. 2013. Exploring the use of speech input by blind people on mobile devices. In Proceedings of the 15th international ACM SIGACCESS conference on computers and accessibility. 1–8.
    [18]
    Shiri Azenkot, Jacob O Wobbrock, Sanjana Prasain, and Richard E Ladner. 2012. Input finger detection for nonvisual touch screen text entry in Perkinput. In Proceedings of graphics interface 2012. 121–129.
    [19]
    Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, 2021. Uibert: Learning generic multimodal representations for ui understanding. arXiv preprint arXiv:2107.13731 (2021).
    [20]
    Mars Ballantyne, Archit Jha, Anna Jacobsen, J Scott Hawker, and Yasmine N El-Glaly. 2018. Study of accessibility guidelines of mobile applications. In Proceedings of the 17th international conference on mobile and ubiquitous multimedia. 305–315.
    [21]
    Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72.
    [22]
    Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57, 1 (1995), 289–300.
    [23]
    Matthias Böhmer, Brent Hecht, Johannes Schöning, Antonio Krüger, and Gernot Bauer. 2011. Falling asleep with Angry Birds, Facebook and Kindle: a large scale study on mobile application usage. In Proceedings of the 13th international conference on Human computer interaction with mobile devices and services. 47–56.
    [24]
    Matthew N Bonner, Jeremy T Brudvik, Gregory D Abowd, and W Keith Edwards. 2010. No-look notes: accessible eyes-free multi-touch text entry. In Pervasive Computing: 8th International Conference, Pervasive 2010, Helsinki, Finland, May 17-20, 2010. Proceedings 8. Springer, 409–426.
    [25]
    Stephen Brewster. 2002. Overcoming the lack of screen space on mobile computers. Personal and Ubiquitous computing 6 (2002), 188–205.
    [26]
    Stephen Brewster, Faraz Chohan, and Lorna Brown. 2007. Tactile feedback for mobile interactions. In Proceedings of the SIGCHI conference on Human factors in computing systems. 159–162.
    [27]
    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
    [28]
    Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. 2022. Interactive mobile app navigation with uncertain or under-specified natural language commands. arXiv preprint arXiv:2202.02312 (2022).
    [29]
    Chunyang Chen, Ting Su, Guozhu Meng, Zhenchang Xing, and Yang Liu. 2018. From ui design image to gui skeleton: a neural machine translator to bootstrap mobile gui implementation. In Proceedings of the 40th International Conference on Software Engineering. 665–676.
    [30]
    Jieshan Chen, Chunyang Chen, Zhenchang Xing, Xiwei Xu, Liming Zhu, Guoqiang Li, and Jinshui Wang. 2020. Unblind your apps: Predicting natural-language labels for mobile gui components by deep learning. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 322–334.
    [31]
    Jieshan Chen, Amanda Swearngin, Jason Wu, Titus Barik, Jeffrey Nichols, and Xiaoyi Zhang. 2022. Towards Complete Icon Labeling in Mobile Applications. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–14.
    [32]
    Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020. Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems 33 (2020), 22243–22255.
    [33]
    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).
    [34]
    Yan Chen, Maulishree Pandey, Jean Y Song, Walter S Lasecki, and Steve Oney. 2020. Improving Crowd-Supported GUI Testing with Structural Guidance. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–13.
    [35]
    Paul T Chiou, Ali S Alotaibi, and William GJ Halfond. 2023. BAGEL: An Approach to Automatically Detect Navigation-Based Web Accessibility Barriers for Keyboard Users. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–17.
    [36]
    Kenny Tsu Wei Choo, Rajesh Krishna Balan, and Youngki Lee. 2019. Examining augmented virtuality impairment simulation for mobile app accessibility design. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–11.
    [37]
    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
    [38]
    John Joon Young Chung, Wooseok Kim, Kang Min Yoo, Hwaran Lee, Eytan Adar, and Minsuk Chang. 2022. TaleBrush: visual sketching of story generation with pretrained language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–4.
    [39]
    X Yu Daihua, Bambang Parmanto, Brad E Dicianno, and Gede Pramana. 2015. Accessibility of mHealth self-care apps for individuals with spina bifida. Perspectives in health information management 12, Spring (2015).
    [40]
    Rafael Jeferson Pezzuto Damaceno, Juliana Cristina Braga, and Jesús Pascual Mena-Chalco. 2018. Mobile device accessibility for the visually impaired: problems mapping and recommendations. Universal Access in the Information Society 17 (2018), 421–435.
    [41]
    Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th annual ACM symposium on user interface software and technology. 845–854.
    [42]
    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234 (2022).
    [43]
    Marcelo Medeiros Eler, José Miguel Rojas, Yan Ge, and Gordon Fraser. 2018. Automated accessibility testing of mobile apps. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 116–126.
    [44]
    Shirin Feiz, Jason Wu, Xiaoyi Zhang, Amanda Swearngin, Titus Barik, and Jeffrey Nichols. 2022. Understanding Screen Relationships from Screenshots of Smartphone Applications. In 27th International Conference on Intelligent User Interfaces. 447–458.
    [45]
    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, 2020. Codebert: A pre-trained model for programming and natural languages. EMNLP (2020).
    [46]
    Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Sam Yang, and Grayson Hilliard. 2021. Understanding Mobile GUI: from Pixel-Words to Screen-Sentences. arXiv preprint arXiv:2105.11941 (2021).
    [47]
    Dylan Gaines. 2018. Exploring an ambiguous technique for eyes-free mobile text entry. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility. 471–473.
    [48]
    Dylan Gaines, Mackenzie M Baker, and Keith Vertanen. 2023. FlexType: Flexible Text Input with a Small Set of Input Gestures. In Proceedings of the 28th International Conference on Intelligent User Interfaces. 584–594.
    [49]
    Monica Gori, Giulio Sandini, Cristina Martinoli, and David C Burr. 2014. Impairment of auditory spatial localization in congenitally blind human subjects. Brain 137, 1 (2014), 288–293.
    [50]
    Darren Guinness, Edward Cutrell, and Meredith Ringel Morris. 2018. Caption crawler: Enabling reusable alternative text descriptions using reverse image search. In Proceedings of the 2018 chi conference on human factors in computing systems. 1–11.
    [51]
    Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. 2022. Understanding html with large language models. arXiv preprint arXiv:2210.03945 (2022).
    [52]
    Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. 2021. Transformer in transformer. Advances in Neural Information Processing Systems 34 (2021), 15908–15919.
    [53]
    Shuai Hao, Bin Liu, Suman Nath, William GJ Halfond, and Ramesh Govindan. 2014. Puma: Programmable ui-automation for large-scale dynamic analysis of mobile apps. In Proceedings of the 12th annual international conference on Mobile systems, applications, and services. 204–217.
    [54]
    Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, and Jindong Chen. 2021. Actionbert: Leveraging user actions for semantic understanding of user interfaces. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 5931–5938.
    [55]
    Abram Hindle, Earl T Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. 2016. On the naturalness of software. Commun. ACM 59, 5 (2016), 122–131.
    [56]
    Yavuz Inal, Frode Guribye, Dorina Rajanen, Mikko Rajanen, and Mattias Rost. 2020. Perspectives and practices of digital accessibility: A survey of user experience professionals in nordic countries. In Proceedings of the 11th Nordic Conference on Human-Computer Interaction: Shaping Experiences, Shaping Society. 1–11.
    [57]
    Mohit Jain, Nirmalendu Diwakar, and Manohar Swaminathan. 2021. Smartphone usage by expert blind users. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–15.
    [58]
    Ellen Jiang, Kristen Olson, Edwin Toh, Alejandra Molina, Aaron Donsbach, Michael Terry, and Carrie J Cai. 2022. Promptmaker: Prompt-based prototyping with large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–8.
    [59]
    Ellen Jiang, Edwin Toh, Alejandra Molina, Kristen Olson, Claire Kayacik, Aaron Donsbach, Carrie J Cai, and Michael Terry. 2022. Discovering the syntax and strategies of natural language programming with generative language models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19.
    [60]
    Christoph Albert Johns, Michael Barz, and Daniel Sonntag. 2023. Interactive Link Prediction as a Downstream Task for Foundational GUI Understanding Models. In German Conference on Artificial Intelligence (Künstliche Intelligenz). Springer, 75–89.
    [61]
    Michael Jones, John Morris, and Frank Deruyter. 2018. Mobile healthcare and people with disabilities: current state and future needs. International journal of environmental research and public health 15, 3 (2018), 515.
    [62]
    Shaun K Kane, Chandrika Jayant, Jacob O Wobbrock, and Richard E Ladner. 2009. Freedom to roam: a study of mobile device adoption and accessibility for people with visual and motor disabilities. In Proceedings of the 11th international ACM SIGACCESS conference on Computers and accessibility. 115–122.
    [63]
    Akif Khan and Shah Khusro. 2021. An insight into smartphone-based assistive solutions for visually impaired and blind people: issues, challenges and opportunities. Universal Access in the Information Society 20 (2021), 265–298.
    [64]
    Julie A Kientz, Shwetak N Patel, Arwa Z Tyebkhan, Brian Gane, Jennifer Wiley, and Gregory D Abowd. 2006. Where’s my stuff? Design and evaluation of a mobile system for locating lost items for the visually impaired. In Proceedings of the 8th international ACM SIGACCESS Conference on Computers and Accessibility. 103–110.
    [65]
    Tae Soo Kim, DaEun Choi, Yoonseo Choi, and Juho Kim. 2022. Stylette: Styling the web with natural language. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–17.
    [66]
    Bridgett A King and Norman E Youngblood. 2016. E-government in Alabama: An analysis of county voting and election website content, usability, accessibility, and mobile readiness. Government Information Quarterly 33, 4 (2016), 715–726.
    [67]
    Andreas Komninos, Vassilios Stefanis, and John Garofalakis. 2023. A Review of Design and Evaluation Practices in Mobile Text Entry for Visually Impaired and Blind Persons. Multimodal Technologies and Interaction 7, 2 (2023), 22.
    [68]
    Elisa Kreiss, Cynthia Bennett, Shayan Hooshmand, Eric Zelikman, Meredith Ringel Morris, and Christopher Potts. 2022. Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics. arXiv preprint arXiv:2205.10646 (2022).
    [69]
    Rebecca Krosnick and Steve Oney. 2022. ParamMacros: Creating UI Automation Leveraging End-User Natural Language Parameterization. In 2022 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, 1–10.
    [70]
    Ranjitha Kumar, Arvind Satyanarayan, Cesar Torres, Maxine Lim, Salman Ahmad, Scott R Klemmer, and Jerry O Talton. 2013. Webzeitgeist: design mining the web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 3083–3092.
    [71]
    Richard E Ladner. 2015. Design for user empowerment. interactions 22, 2 (2015), 24–29.
    [72]
    Mina Lee, Percy Liang, and Qian Yang. 2022. Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI conference on human factors in computing systems. 1–19.
    [73]
    Yoonjoo Lee, John Joon Young Chung, Tae Soo Kim, Jean Y Song, and Juho Kim. 2022. Promptiverse: Scalable generation of scaffolding prompts through human-AI hybrid knowledge graph annotation. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–18.
    [74]
    Barbara Leporini, Maria Claudia Buzzi, and Marina Buzzi. 2012. Interacting with mobile devices via VoiceOver: usability and accessibility issues. In Proceedings of the 24th Australian computer-human interaction conference. 339–348.
    [75]
    Toby Jia-Jun Li, Lindsay Popowski, Tom Mitchell, and Brad A Myers. 2021. Screen2vec: Semantic embedding of gui screens and gui components. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–15.
    [76]
    Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2017. Droidbot: a lightweight ui-guided test input generator for android. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). IEEE, 23–26.
    [77]
    Zuchao Li, Jiaxun Cai, Shexia He, and Hai Zhao. 2018. Seq2seq dependency parsing. In Proceedings of the 27th International Conference on Computational Linguistics. 3203–3214.
    [78]
    Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics. 150–157.
    [79]
    Peng Liu, Xiangyu Zhang, Marco Pistoia, Yunhui Zheng, Manoel Marques, and Lingfei Zeng. 2017. Automatic text input generation for mobile testing. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 643–653.
    [80]
    Yihe Liu, Anushk Mittal, Diyi Yang, and Amy Bruckman. 2022. Will AI console me when I lose my pet? Understanding perceptions of AI-mediated email writing. In Proceedings of the 2022 CHI conference on human factors in computing systems. 1–13.
    [81]
    Zhe Liu, Chunyang Chen, Junjie Wang, Xing Che, Yuekai Huang, Jun Hu, and Qing Wang. 2023. Fill in the blank: Context-aware automated text input generation for mobile gui testing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1355–1367.
    [82]
    Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2023. Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions. arXiv preprint arXiv:2310.15780 (2023).
    [83]
    Zhe Liu, Chunyang Chen, Junjie Wang, Yuekai Huang, Jun Hu, and Qing Wang. 2020. Owl Eyes: Spotting UI Display Issues via Visual Understanding. In ASE. IEEE. https://doi.org/10.1145/3324884.3416547
    [84]
    Zhe Liu, Chunyang Chen, Junjie Wang, Yuekai Huang, Jun Hu, and Qing Wang. 2022. Nighthawk: Fully Automated Localizing UI Display Issues via Visual Understanding. IEEE Transactions on Software Engineering (2022), 1–16. https://doi.org/10.1109/TSE.2022.3150876
    [85]
    Zhe Liu, Chunyang Chen, Junjie Wang, Yuhui Su, Yuekai Huang, Jun Hu, and Qing Wang. 2023. Ex pede Herculem: Augmenting Activity Transition Graph for Apps via Graph Convolution Network. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1983–1995.
    [86]
    Zhe Liu, Chunyang Chen, Junjie Wang, Yuhui Su, and Qing Wang. 2022. NaviDroid: a tool for guiding manual Android testing via hint moves. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 154–158.
    [87]
    Zhe Liu, Chunyang Chen, Junjie Wang, and Qing Wang. 2022. Guided Bug Crush: Assist Manual GUI Testing of Android Apps via Hint Moves. In CHI 2022. https://doi.org/10.1145/3491102.3501903
    [88]
    Mateus M Luna, Hugo AD Nascimento, Aaron Quigley, and Fabrizzio Soares. 2023. Text entry for the Blind on Smartwatches: A study of Braille code input methods for a novel device. Universal Access in the Information Society 22, 3 (2023), 737–755.
    [89]
    Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics (1947), 50–60.
    [90]
    Sergio Mascetti, Cristian Bernareggi, and Matteo Belotti. 2012. TypeInBraille: quick eyes-free typing on smartphones. In Computers Helping People with Special Needs: 13th International Conference, ICCHP 2012, Linz, Austria, July 11-13, 2012, Proceedings, Part II 13. Springer, 615–622.
    [91]
    Forough Mehralian, Navid Salehnamadi, and Sam Malek. 2021. Data-driven accessibility repair revisited: on the effectiveness of generating labels for icons in Android apps. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 107–118.
    [92]
    Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. ICLR (2013).
    [93]
    Lauren R Milne, Cynthia L Bennett, and Richard E Ladner. 2014. The accessibility of mobile health sensors for blind users. In International technology and persons with disabilities conference scientific/research proceedings (CSUN 2014). 166–175.
    [94]
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837 (2022).
    [95]
    Higinio Mora, Virgilio Gilart-Iglesias, Raquel Pérez-del Hoyo, and María Dolores Andújar-Montoya. 2017. A comprehensive system for monitoring urban accessibility in smart cities. Sensors 17, 8 (2017), 1834.
    [96]
    John Morris and James Mueller. 2014. Blind and deaf consumer preferences for android and iOS smartphones. In Inclusive designing: Joining usability, accessibility, and inclusion. Springer, 69–79.
    [97]
    Meredith Ringel Morris, Jazette Johnson, Cynthia L Bennett, and Edward Cutrell. 2018. Rich representations of visual content for screen reader users. In Proceedings of the 2018 CHI conference on human factors in computing systems. 1–11.
    [98]
Meredith Ringel Morris, Annuska Zolyomi, Catherine Yao, Sina Bahram, Jeffrey P Bigham, and Shaun K Kane. 2016. "With most of it being pictures now, I rarely use it": Understanding Twitter’s Evolving Accessibility to Blind Users. In Proceedings of the 2016 CHI conference on human factors in computing systems. 5506–5516.
    [99]
    Fiona Fui-Hoon Nah, Dongsong Zhang, John Krogstie, and Shengdong Zhao. 2017. Editorial of the special issue on mobile human–computer interaction., 429–430 pages.
    [100]
    Noor Nashid, Mifta Sintaha, and Ali Mesbah. 2023. Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning. In Proceedings of the 45th International Conference on Software Engineering (ICSE’23).
    [101]
    Leo Neat, Ren Peng, Siyang Qin, and Roberto Manduchi. 2019. Scene text access: A comparison of mobile OCR modalities for blind users. In Proceedings of the 24th International Conference on Intelligent User Interfaces. 197–207.
    [102]
    João Oliveira, Tiago Guerreiro, Hugo Nicolau, Joaquim Jorge, and Daniel Gonçalves. 2011. BrailleType: unleashing braille over touch screen mobile phones. In Human-Computer Interaction–INTERACT 2011: 13th IFIP TC 13 International Conference, Lisbon, Portugal, September 5-9, 2011, Proceedings, Part I 13. Springer, 100–107.
    [103]
    Achraf Othman, Amira Dhouib, and Aljazi Nasser Al Jabor. 2023. Fostering websites accessibility: A case study on the use of the Large Language Models ChatGPT for automatic remediation. In Proceedings of the 16th International Conference on PErvasive Technologies Related to Assistive Environments. 707–713.
    [104]
    Tim Paek and David Maxwell Chickering. 2007. Improving command and control speech recognition on mobile devices: using predictive user models for language modeling. User modeling and user-adapted interaction 17 (2007), 93–117.
    [105]
    Matthew J Page, David Moher, Patrick M Bossuyt, Isabelle Boutron, Tammy C Hoffmann, Cynthia D Mulrow, Larissa Shamseer, Jennifer M Tetzlaff, Elie A Akl, Sue E Brennan, 2021. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. bmj 372 (2021).
    [106]
    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318.
    [107]
    Kyudong Park, Taedong Goh, and Hyo-Jeong So. 2014. Toward accessible mobile application design: developing mobile application accessibility guidelines for people with visual impairment. Proceedings of HCI Korea (2014), 31–38.
    [108]
    Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2022. Examining Zero-Shot Vulnerability Repair with Large Language Models. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 1–18.
    [109]
    Christopher Power, André Freire, Helen Petrie, and David Swallow. 2012. Guidelines are only half of the story: accessibility problems encountered by blind users on the web. In Proceedings of the SIGCHI conference on human factors in computing systems. 433–442.
    [110]
    Lindsay Ramey, Candice Osborne, Donald Kasitinon, and Shannon Juengst. 2019. Apps and mobile health technology in rehabilitation: the good, the bad, and the unknown. Physical Medicine and Rehabilitation Clinics 30, 2 (2019), 485–497.
    [111]
    André Rodrigues, Kyle Montague, Hugo Nicolau, and Tiago Guerreiro. 2015. Getting smartphones to talkback: Understanding the smartphone adoption process of blind users. In Proceedings of the 17th international acm sigaccess conference on computers & accessibility. 23–32.
    [112]
    Anne Spencer Ross, Xiaoyi Zhang, James Fogarty, and Jacob O Wobbrock. 2018. Examining image-based button labeling for accessibility in Android apps through large-scale analysis. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility. 119–130.
    [113]
    Navid Salehnamadi, Abdulaziz Alshayban, Jun-Wei Lin, Iftekhar Ahmed, Stacy Branham, and Sam Malek. 2021. Latte: Use-case and assistive-service driven automated accessibility testing framework for android. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–11.
    [114]
    Navid Salehnamadi, Forough Mehralian, and Sam Malek. 2022. Groundhog: An Automated Accessibility Crawler for Mobile Apps. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12.
    [115]
    Eldon Schoop, Xin Zhou, Gang Li, Zhourong Chen, Bjoern Hartmann, and Yang Li. 2022. Predicting and explaining mobile ui tappability with vision modeling and saliency analysis. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–21.
    [116]
    J Schulman, B Zoph, C Kim, J Hilton, J Menick, J Weng, JFC Uribe, L Fedus, L Metz, M Pokorny, 2022. ChatGPT: Optimizing language models for dialogue.
    [117]
    Carolyn B. Seaman. 1999. Qualitative methods in empirical studies of software engineering. IEEE Transactions on software engineering 25, 4 (1999), 557–572.
    [118]
    Akbar S Shaik, Gahangir Hossain, and Mohammed Yeasin. 2010. Design, development and performance evaluation of reconfigured mobile Android phone for people who are blind or visually impaired. In Proceedings of the 28th ACM International Conference on Design of Communication. 159–166.
    [119]
    Seongjin Shin, Sang-Woo Lee, Hwijeen Ahn, Sungdong Kim, HyoungSeok Kim, Boseop Kim, Kyunghyun Cho, Gichang Lee, Woomyoung Park, Jung-Woo Ha, 2022. On the effect of pretraining corpora on in-context learning by a large-scale language model. arXiv preprint arXiv:2204.13509 (2022).
    [120]
    Kristen Shinohara, Murtaza Tamjeed, Michael McQuaid, and Dymen A Barkins. 2022. Usability, Accessibility and Social Entanglements in Advanced Tool Use by Vision Impaired Graduate Students. Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (2022), 1–21.
    [121]
    Javier Sánchez Sierra and J Togores. 2012. Designing mobile apps for visually impaired and blind users. In The Fifth international conference on advances in computer-human interactions. Citeseer, 47–52.
    [122]
    Yingli Tian, Xiaodong Yang, and Aries Arditi. 2010. Computer vision-based door detection for accessibility of unfamiliar environments to blind persons. In Computers Helping People with Special Needs: 12th International Conference, ICCHP 2010, Vienna, Austria, July14-16, 2010, Proceedings, Part II 12. Springer, 263–270.
    [123]
    Hussain Tinwala and I Scott MacKenzie. 2010. Eyes-free text entry with error correction on touchscreen mobile devices. In Proceedings of the 6th Nordic Conference on Human-Computer Interaction: Extending Boundaries. 511–520.
    [124]
UIAutomator. 2021. Python wrapper of Android uiautomator test tool. https://github.com/xiaocong/uiautomator.
    [125]
    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4566–4575.
    [126]
    Christopher Vendome, Diana Solano, Santiago Liñán, and Mario Linares-Vásquez. 2019. Can everyone use my app? an empirical study on accessibility in android apps. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 41–52.
    [127]
    Anna Visvizi and Miltiadis D Lytras. 2019. Sustainable smart cities and smart villages research: Rethinking security, safety, well-being, and happiness., 215 pages.
    [128]
    Bryan Wang, Gang Li, and Yang Li. 2023. Enabling conversational interaction with mobile ui using large language models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–17.
    [129]
    Fahui Wang. 2012. Measurement, optimization, and impact of health care accessibility: a methodological review. Annals of the Association of American Geographers 102, 5 (2012), 1104–1112.
    [130]
    Hao Wen, Hongming Wang, Jiaxuan Liu, and Yuanchun Li. 2023. DroidBot-GPT: GPT-powered UI Automation for Android. arXiv preprint arXiv:2304.07061 (2023).
    [131]
    Jason Wu, Rebecca Krosnick, Eldon Schoop, Amanda Swearngin, Jeffrey P Bigham, and Jeffrey Nichols. 2023. Never-ending Learning of User Interfaces. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–13.
    [132]
    Yenchun Jim Wu, Wan-Ju Liu, and Chih-Hung Yuan. 2020. A mobile-based barrier-free service transportation platform for people with disabilities. Computers in Human Behavior 107 (2020), 105776.
    [133]
    Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 1–10.
    [134]
    Shunguo Yan and PG Ramachandran. 2019. The current status of accessibility in mobile apps. ACM Transactions on Accessible Computing (TACCESS) 12, 1 (2019), 1–31.
    [135]
    Guixin Ye, Zhanyong Tang, Shin Hwei Tan, Songfang Huang, Dingyi Fang, Xiaoyang Sun, Lizhong Bian, Haibo Wang, and Zheng Wang. 2021. Automated conformance testing for JavaScript engines via deep compiler fuzzing. In Proceedings of the 42nd ACM SIGPLAN international conference on programming language design and implementation. 435–450.
    [136]
    Tom Yeh, Tsung-Hsiang Chang, and Robert C Miller. 2009. Sikuli: using GUI screenshots for search and automation. In Proceedings of the 22nd annual ACM symposium on User interface software and technology. 183–192.
    [137]
    Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923 (2017).
    [138]
    Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. 2019. A review of recurrent neural networks: LSTM cells and network architectures. Neural computation 31, 7 (2019), 1235–1270.
    [139]
    Wei Yuan, Quanjun Zhang, Tieke He, Chunrong Fang, Nguyen Quoc Viet Hung, Xiaodong Hao, and Hongzhi Yin. 2022. CIRCLE: continual repair across programming languages. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 678–690.
    [140]
    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
    [141]
    Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, DongGyun Han, David Lo, and Lingxiao Jiang. 2022. iTiger: an automatic issue title generation tool. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1637–1641.
    [142]
    Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, 2021. Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–15.
    [143]
    Xiaoyi Zhang, Anne Spencer Ross, and James Fogarty. 2018. Robust annotation of mobile application interfaces in methods for accessibility repair and enhancement. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. 609–621.
    [144]
    Yuhang Zhao, Edward Cutrell, Christian Holz, Meredith Ringel Morris, Eyal Ofek, and Andrew D Wilson. 2019. SeeingVR: A set of tools to make virtual reality more accessible to people with low vision. In Proceedings of the 2019 CHI conference on human factors in computing systems. 1–14.
