1 Introduction
In the era of burgeoning mobile devices, mobile applications have evolved into indispensable tools for accessing and engaging with diverse services, offering unparalleled convenience, particularly for individuals with low vision. According to statistics from Google Play [7] and the App Store [3], over 4 million apps are widely used for tasks such as entertainment, financial services, reading, shopping, banking, and chatting. However, many app developers do not build apps according to accessibility standards, which directly prevents people with disabilities from accessing the functionality of many apps. This is especially true for visually impaired users, who need rich Graphical User Interface (GUI/UI) hints to understand an app's functionality. According to statistics from the World Health Organization (WHO) [5], at least 2.2 billion people have near or distant vision impairment, and in at least 1 billion of these cases the impairment could have been prevented or has yet to be addressed. Smart devices and mobile apps should therefore accommodate these users, and making mobile apps better serve visually impaired users is a social justice issue that deserves attention from the whole society [5, 71].
To facilitate interaction between visually impaired users and apps, an increasing number of mobile devices support screen reader apps [8, 10], and many companies have developed screen readers for users with different disabilities to help them understand and use an app's functionality. The screen reader recognizes the "text" or "content description" of a text component, the "label" of an image component, and the "hint-text" of a text input component, and reads it aloud for visually impaired users. This is very effective: visually impaired users can understand the app's UI through screen readers and use its functionality. As shown in Figure 1, hint-text differs from label and content description. A label is usually used to briefly describe an image component as in Figure 1 (a), a content description provides an overview of a related input component as in Figure 1 (b), while hint-text goes a step further to explain the input requirements and help users understand what should be entered into the component (the red rectangles in Figure 1 indicate example hint-texts). The Google developer accessibility guidelines [2] require developers to provide hint-text for input components, especially those that lack a content description: "It's helpful to show text in the element itself, in addition to making this hint-text available to screen readers" [9].
Despite the guideline, real-world practice falls short. According to our observation of 4,950 popular apps from Google Play (Section 4), about 91% of them contain text input components, yet as many as 76% of these components lack hint-text, as shown in Figure 3 (a). Without hint-text, the screen reader cannot obtain information about a text input component and may simply skip it. Moreover, even when an input component does have hint-text, it can be overly simple and lack practical meaning, as in Figure 3 (b), which does not help users understand the input requirements. Both issues leave disabled users unable to use the functionality the app provides.
This calls for automated support for hint-text generation, yet to the best of our knowledge no existing study tackles this problem. The most relevant work generates missing labels for image components with deep learning techniques to improve app accessibility [30, 31, 142]. However, label generation mainly involves understanding icons, while hint-text generation not only requires a more thorough view of the whole GUI page but also calls for a deeper comprehension of the related information, as shown in Figure 1. In this work, we therefore develop techniques for generating meaningful hint-text, as shown in Figure 3 (c), which can help visually impaired users successfully fill in the correct input.
To overcome these challenges, inspired by the excellent performance of Large Language Models (LLMs) on natural language processing tasks [27, 32, 37, 82, 116, 140], we propose HintDroid for automated hint-text generation based on an LLM and GUI page information. Given the view hierarchy file of a GUI page, we first extract the text input component and the GUI entity information of the page, and then design GUI prompts that enable the LLM to understand the text input context. To help the LLM better understand our task, we use in-context learning: we construct an example database of hint-texts and their corresponding input component information, and design a retrieval-based example selection method to build the in-context learning prompt. Combining these two prompts, the LLM outputs the hint-text and the input content generated from it. To ensure the quality of the hint-text, we use the input content as a bridge to evaluate the generated hint-text: we check whether the input content can trigger the next GUI page, and construct a feedback prompt from the outcome to let the LLM further adjust the hint-text.
We evaluate the effectiveness of HintDroid on 2,659 text input components with hint-text from 753 popular apps on Google Play (the largest repository of Android apps). Results show that HintDroid achieves 83% BLEU@1, 77% BLEU@2, 73% BLEU@3, 66% BLEU@4, 67% METEOR, 63% ROUGE, and 62% CIDEr for the generated hint-text. Compared with 12 commonly used and state-of-the-art baselines, HintDroid achieves a more than 82% boost in exact match over the best baseline. To further understand the role of each module and sub-module, we conduct ablation experiments that demonstrate their effectiveness. We also carry out a user study with 33 apps from Google Play to evaluate HintDroid's usefulness in assisting visually impaired users. Results show that participants using HintDroid fill in 152% more correct inputs and cover 66% more states and 77% more activities in 139% less time than those without it. The experimental results and participant feedback confirm the usefulness of HintDroid.
The contributions of this paper are as follows:
• To the best of our knowledge, this is the first work to automatically predict the hint-text of text input components for enhancing app accessibility. We hope this work draws the community's attention to maintaining the accessibility of mobile apps from the viewpoint of hint-text.
• An empirical study investigating how well the text input components of current apps support accessibility for users with vision impairment, which motivates this study as well as follow-up research.
• A large-scale evaluation on real-world mobile apps with promising results, and a user study demonstrating HintDroid's usefulness in assisting visually impaired users to successfully fill in the correct input.
3 Android Accessibility Background
In this section, we first introduce Android-related terminology and Android screen readers. This background supports the concise and clear terminology used throughout our analysis.
Android Text Input Components. When developers build UIs, the Android platform provides many different types of UI components [2], such as TextView, ImageView, EditText, and Button. EditText is a user interface element for entering and modifying text. When defining an EditText component, developers must specify the R.styleable.TextView_InputType attribute. The "hint" attribute is mainly used to display prompts about input requirements. For text input components, the Google developer accessibility guidelines [2] require developers to provide the android:hint attribute to convey input requirements to users.
Android Screen Readers. According to WebAIM [1], a significant 95.1% of visually impaired respondents rely on smartphone screen readers. Google and Apple, as major industry players, shape mobile technology and apps; for vision-impaired users, Google's TalkBack [8] and Apple's VoiceOver [10] provide crucial accessibility support, enabling mobile app engagement. Focusing on Android apps, TalkBack is a cornerstone accessibility service. Pre-installed on many Android devices, it narrates blocks of text, alerts users to interactive elements such as buttons, and supports interaction with apps through gestures. It also offers local and global context menus, allowing tailored interaction experiences and global setting adjustments. TalkBack typically obtains the label and hint-text from apps to provide its service, so providing labels and hint-text in mobile applications is crucial for helping users understand the app.
5 Approach
This paper proposes HintDroid, which uses an LLM to automatically generate the hint-text of text input components, helping the screen readers of disabled users better convey the inherent meaning of the input components. Figure 5 presents an overview of HintDroid, which consists of three main modules.
First, given the view hierarchy file of a text input page of an Android app, the GUI entity extraction module extracts the GUI information related to the text input component and the contextual information of its nearby components. Based on this extracted information, HintDroid generates LLM-understandable GUI prompts for hint-text generation (Section 5.1).
Second, based on the GUI prompt, the information retrieval-based prompt enhancement module automatically selects the examples most similar to the current input scenario and constructs an in-context learning prompt. This prompt uses in-context learning to help the LLM better understand the hint-text generation task, thereby improving its hint-text generation performance (Section 5.2).
Third, with the above prompts, we instruct the LLM to generate not only the hint-text but also suggested input content that aligns with it. We then enter the content into the component and check whether it triggers the next page. If it does not, we assume the generated hint-text may be inappropriate and provide the failed case as a feedback prompt so the LLM can adjust its answer. Additionally, certain text input components report an error message indicating the expected input when they receive inappropriate content, and we also include this error message in the feedback when re-querying the LLM (Section 5.3).
5.1 GUI Entity Extraction and Prompting
The first module of our approach understands, analyzes, organizes, and extracts the entities from the GUI page containing the text input component. Although the LLM is equipped with knowledge learned from large-scale training corpora and excels at various tasks, its performance can be significantly affected by the quality of its input, i.e., whether the input accurately describes what is being asked. We therefore design an approach to extract and organize the GUI information.
5.1.1 GUI Entity Extraction.
Inspired by the screen reader's ability, we convert GUI information into natural language descriptions [8, 10, 75, 142]. We first extract the GUI entity information of the app, of the GUI page currently in use, and of the components on the page, which helps the LLM understand the current GUI and the input requirements of the text input components. The app entity information is extracted from the AndroidManifest.xml file, while the other two types of entity information are extracted from the view hierarchy file, which can be obtained with UIAutomator [124]. Table 1 presents a summary.
App Entity Information provides the macro-level semantics of the app under test, which helps the LLM gain a general perspective of the app's functionality. The extracted information includes the app name and the names of all its activities. The app name helps the LLM understand what type of app it is, and the activity names help the LLM infer the current functional GUI page and the page reached after entering the input content, thereby providing a basis for generating hint-text that conforms to the contextual semantics.
Page GUI Entity Information provides the semantics of the current GUI page during the interactive process, which helps the LLM capture the current snapshot. We extract the activity name of the current GUI page, all the components represented by the "text" field or "resource-id" field (the first non-empty one, in that order), and the position of each component on the page. For the position, inspired by screen readers [8, 10], we obtain the coordinates of each component in order from top to bottom and left to right; components whose vertical coordinate is below the middle of the page are marked as lower, and the rest as upper.
Input Component Entity Information denotes the micro-level semantics of the GUI page, i.e., the inherent meaning of all its text input components, which helps the LLM understand the input requirements and the semantics of the contextual components. The extracted information includes the "text" and "resource-id" fields (the first non-empty one, in that order). Because a component's textual fields may be empty, we also extract information from nearby components to provide a more thorough perspective, including the "text" of its parent node and sibling node components.
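To make this concrete, the sketch below parses a UIAutomator dump and collects the fields described above. It is a minimal illustration, assuming the standard attributes of a uiautomator XML dump (class, text, resource-id, bounds); the function names are ours, not HintDroid's.

```python
import re
import xml.etree.ElementTree as ET

def parse_bounds(bounds):
    # uiautomator encodes bounds as "[x1,y1][x2,y2]"
    x1, y1, x2, y2 = map(int, re.findall(r"-?\d+", bounds))
    return x1, y1, x2, y2

def extract_input_entities(dump_path, screen_height):
    root = ET.parse(dump_path).getroot()
    # keep parent pointers so we can reach parent/sibling text
    parents = {child: parent for parent in root.iter() for child in parent}
    entities = []
    for node in root.iter("node"):
        if "EditText" not in node.get("class", ""):
            continue
        _, y1, _, y2 = parse_bounds(node.get("bounds", "[0,0][0,0]"))
        parent = parents.get(node)
        siblings = [] if parent is None else [n for n in parent if n is not node]
        entities.append({
            # first non-empty textual field, in order: "text" then "resource-id"
            "text": node.get("text") or node.get("resource-id") or "",
            # components below the page midpoint are "lower", the rest "upper"
            "position": "lower" if (y1 + y2) / 2 > screen_height / 2 else "upper",
            "parent_text": "" if parent is None else parent.get("text", ""),
            "sibling_texts": [n.get("text") for n in siblings if n.get("text")],
        })
    return entities
```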
5.1.2 GUI Prompt Construction.
With the extracted GUI information, we combine it into a GUI prompt the LLM can understand, as shown in Table 2, i.e., <App information> + <GUI page information> + <Input component information> + <Query> + <Example output>. The prompt first provides the app information, GUI page information, and input component information, then queries the LLM for the hint-text of the text input and its corresponding input content, and gives an example output, as shown in Figure 6. Owing to the robustness of the LLM, the generated prompt sentence need not fully follow grammar. Given the GUI prompt, the LLM returns its recommended hint-text and the inferred input content.
5.2 Enriching Prompt with Examples
It is usually difficult for an LLM to perform well on domain-specific tasks such as our hint-text generation, and a common practice is to employ the in-context learning schema [42, 94, 119] to boost performance: the LLM is shown examples that demonstrate the instruction so that it better understands the task. Following this schema, along with the GUI prompt for the text input described in Section 5.1.2, we additionally provide the LLM with examples of hint-text. To achieve this, we first build a basic example dataset of hint-texts from the popular mobile apps in our motivational study. Research shows that the quality and relevance of examples can significantly affect LLM performance [42, 94, 119]. Therefore, based on the dataset we built, we design a retrieval-based example selection method (Section 5.2.2) that selects the most appropriate examples according to the text input and its hint-text, enabling the LLM to learn with pertinence.
5.2.1 Example Dataset Construction.
We collect the hint-text of text input components from the Android apps in our motivational study and continuously build an example dataset that serves as the basis for in-context learning. Each data instance, as illustrated in Table 2, records the GUI information of a text input component and its hint-text, which enables us to select the most suitable examples and helps the LLM understand what the hint-text and its text input context look like.
Mining Hint-text from Android Apps. First, we automatically crawl the view hierarchy files from the Android mobile apps in our motivational study. We then use keyword matching to filter those related to text input components (e.g., EditText) that have hint-text. In this way, we obtain 15,577 (45,803 - 30,226) view hierarchy files from Section 4 with text input components (all of which have hint-text) and store them in the example dataset (there is no overlap with the evaluation datasets). We then extract the GUI entity information of each text input component with the method in Section 5.1.1 and store it together with its hint-text.
Enlarging the Dataset with Hint-text During Use. We also enrich the example dataset with new hint-texts whose generated input content truly triggers the next page (transition) while HintDroid runs on various apps. Specifically, for each generated hint-text, after running it in the mobile app, we add the hint-text and its GUI information to the example dataset if the generated input content triggers the next page.
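A minimal sketch of the mining step follows, assuming the view hierarchy dumps serialize the hint into a hint="..." attribute (not every dumping tool does) and that extract_entities denotes the Section 5.1.1 extraction routine, extended here to also return the hint field.

```python
import pathlib

def mine_examples(dump_dir, extract_entities):
    # keyword matching over raw view hierarchy files, as in Section 5.2.1
    examples = []
    for path in pathlib.Path(dump_dir).glob("**/*.xml"):
        raw = path.read_text(encoding="utf-8", errors="ignore")
        if "EditText" not in raw or 'hint="' not in raw:
            continue  # keep only files whose text inputs carry hint-text
        for entity in extract_entities(path):
            if entity.get("hint"):
                examples.append(entity)  # GUI info stored alongside its hint-text
    return examples
```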
5.2.2 Retrieval-based Example Selection and In-context Learning.
Hint-text examples can provide intuitive guidance to the LLM in accomplishing the task, yet excessive examples might mislead the LLM and degrade performance. Therefore, we design a retrieval-based example selection method that chooses the most suitable examples (i.e., those most similar to the input component) for the LLM.
In detail, the similarity comparison is based on the GUI entity information of the text input components. We use Word2Vec (a lightweight word embedding method) [92] to encode the context information of each input component into a 300-dimensional sentence embedding, and compute the cosine similarity between the input component and each data instance in the example dataset. We choose the top-K data instances with the highest similarity scores, setting K to 6 empirically. The selected data instances (i.e., examples) are provided to the LLM in the format of GUI entity information plus hint-text, as illustrated in Table 2.
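The selection step can be sketched as follows, assuming pretrained 300-dimensional word vectors (e.g., loaded with gensim) exposed as a mapping from word to vector; mean pooling is our assumption, as the paper does not specify how word vectors are combined into a sentence embedding.

```python
import numpy as np

def embed(text, word_vectors, dim=300):
    # average of 300-d Word2Vec vectors over in-vocabulary words
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def select_examples(query_context, dataset, word_vectors, k=6):
    # rank example instances by cosine similarity and keep the top-K (K = 6)
    q = embed(query_context, word_vectors)
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0
    return sorted(dataset,
                  key=lambda ex: cosine(q, embed(ex["context"], word_vectors)),
                  reverse=True)[:k]
```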
5.3 Feedback Extraction and Prompting
Since the LLM cannot always generate the hint-text we want, we ask it to generate not only the hint-text but also input content that aligns with it, and use the input content to help determine whether the hint-text is appropriate. The design idea of HintDroid is to have the LLM role-play a low vision user: given the GUI context and the hint-text, can it produce the correct input content? We therefore optimize the hint-text by checking whether the generated input content can trigger the next page, rather than merely aiming to trigger the next page as QTypist [81] does for software testing. In detail, we enter the generated content into the text input component and check whether it triggers the next page (transition). If it does not, we use the failure information as feedback and add it to the prompt to re-query the LLM.
5.3.1 Automated Input Content Checking.
We design an automated script to input the content into components and perform the subsequent operations. Specifically, the LLM outputs the hint-text and input content of each text input component in a fixed format, which we recode into an operation script in Android ADB format, such as adb shell input text "xxx". We then iterate through the components of the current page, identify the component that requires further action after the input is completed, and use ADB commands to perform the operation. After the operation completes, we check whether the next page has been triggered, i.e., whether the page name and components have changed. For cases that fail to trigger the next page, we proceed with the following feedback extraction.
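A simplified sketch of this checking loop is shown below. It submits the input with KEYCODE_ENTER for brevity, whereas HintDroid locates and taps the appropriate confirm component; dump_page is a placeholder for a snapshot function (e.g., the activity name plus component texts).

```python
import subprocess

def adb_shell(*args):
    subprocess.run(["adb", "shell", *args], check=True)

def try_input(content, dump_page):
    before = dump_page()                                 # snapshot before the input
    adb_shell("input", "text", content.replace(" ", "%s"))  # adb escapes spaces as %s
    adb_shell("input", "keyevent", "66")                 # KEYCODE_ENTER to submit
    after = dump_page()                                  # snapshot after the operation
    # a changed page name or component set counts as triggering the next page
    return before != after
```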
5.3.2 Error Message Extraction.
When incorrect text is entered into a component, the app might report an error message, e.g., alerting the user that the password should contain letters and digits. The error message can further help the LLM understand what a valid input should look like.
We extract error messages via a differential analysis that compares the GUI page before and after the text is input, and extracts the text field of any newly emerged components (e.g., a popup window) on the later GUI page, with examples shown in Figure 2. We also record the text input that caused the error message, which helps the LLM understand the reason behind it.
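Reusing the entity lists from Section 5.1.1, the differential analysis reduces to a set difference over component texts; the sketch below is a minimal version of that idea.

```python
def extract_error_message(before_entities, after_entities):
    # differential analysis: components that appear only after the input
    # (e.g., a popup window or inline warning) are candidate error messages
    before_texts = {e["text"] for e in before_entities}
    new_texts = [e["text"] for e in after_entities
                 if e["text"] and e["text"] not in before_texts]
    return " ".join(new_texts) if new_texts else None
```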
5.3.3 Feedback Prompt Construction.
For cases without a successful page transition, we combine the aforementioned information into the feedback prompt shown in Table 2, i.e., <Feedback> + GUI prompt + <Feedback query> + <Example output>. We design two types of <Feedback>: one tells the LLM that the current hint-text and input content are incorrect, and the other provides the error message dynamically generated by the text input component, which can make the expected input clearer. Note that not all input components provide an error message; for components without one, we put null in the prompt. Finally, we ask the LLM to re-optimize the hint-text based on this feedback prompt.
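A sketch of the feedback prompt assembly follows, with the slot order taken from Table 2; the wording is our illustration.

```python
def build_feedback_prompt(gui_prompt, hint, content, error_msg):
    # <Feedback> + GUI prompt + <Feedback query> + <Example output>
    feedback = (
        f"The hint-text '{hint}' with input content '{content}' failed to "
        f"trigger the next page. The app reported: {error_msg or 'null'}. "
    )
    query = (
        "Please re-optimize the hint-text and input content based on this "
        "feedback. "
        'Example output: {"hint-text": "...", "input": "..."}'
    )
    return feedback + gui_prompt + query
```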
5.4 Implementation
We implement HintDroid on top of the GPT-3.5 Turbo model released by OpenAI. It obtains the view hierarchy file of the current GUI page through UIAutomator [124] to extract the GUI information of the text input components. HintDroid can be integrated with an automated GUI exploration tool, which automatically extracts the GUI information and generates the hint-text. Once we obtain the generated hint-text from HintDroid, we further design an automated script to add the missing hint-text to the app. Specifically, as shown in Figure 7, given any app, the automated script consists of five steps. (1) We use Application Explorer [29] to automatically run the app and obtain the view hierarchy file of each page. (2) We detect GUI pages (view hierarchy files) whose text input components are missing hint-text. (3) HintDroid automatically generates the corresponding hint-text. (4) We automatically decompile the app's APK file and locate the code of the text input components. (5) We automatically repackage the APK: a script encodes and packages the modified code into an APK, completing the repair. Note that for an open-source app, we can directly modify its source code and repackage it. HintDroid is thus not a dynamic interactive process but a one-time offline job; the average time to generate hint-text for a GUI page is 1.86 seconds.
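Steps (4) and (5) can be sketched as follows. The paper does not name its decompilation toolchain, so apktool and apksigner are our assumptions, and the layout-patching helper is our own illustration of injecting android:hint.

```python
import subprocess
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"

def patch_hint(layout_path, component_id, hint):
    # inject android:hint into the view whose android:id matches
    ET.register_namespace("android", ANDROID_NS)
    tree = ET.parse(layout_path)
    for node in tree.iter():
        if node.get(f"{{{ANDROID_NS}}}id", "").endswith(component_id):
            node.set(f"{{{ANDROID_NS}}}hint", hint)
    tree.write(layout_path, encoding="utf-8", xml_declaration=True)

def repair_apk(apk_path, fixes, keystore):
    # steps (4)-(5) of Figure 7, assuming the apktool/apksigner toolchain
    subprocess.run(["apktool", "d", apk_path, "-o", "decoded", "-f"], check=True)
    for layout, component_id, hint in fixes:
        patch_hint(f"decoded/res/layout/{layout}", component_id, hint)
    subprocess.run(["apktool", "b", "decoded", "-o", "repaired.apk"], check=True)
    subprocess.run(["apksigner", "sign", "--ks", keystore, "repaired.apk"], check=True)
```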
7 Usefulness Evaluation
To evaluate our HintDroid, we also conduct a user study to demonstrate its usefulness in real-world practice. Our goal is to examine: (1) whether HintDroid can help visually impaired users successfully fill in the correct input; (2) whether HintDroid can effectively help visually impaired users explore the functionality of an application; and (3) whether HintDroid can save time in filling in the correct input.
7.1 Dataset of User Study
To ensure the representativeness of the test data, we begin with the 3,398 apps from Google Play described in Section 4, which have text input components without hint-text (note that the data in this section is not used for model training or in-context learning). To further confirm the universality and usefulness of our model, we filter them according to the following rules: (1) at least 3 pages require text input components, (2) the generated hint-texts can be integrated into the app, and (3) the app has more than 10 activities. This yields 371 apps, from which we select the app with the highest download count in each app category as our experimental data. As shown in Figure 11, we describe the data selection process with a PRISMA flow diagram [105]. We end up with 33 apps (1 app per category) with 237 text input components, which we use for the final evaluation, with details in Table 4.
7.2 Participants Recruitment
We recruit 36 visually impaired users to participate in the experiment, of whom 20 are male and 16 are female. Their ages range from 20 to 55 years (median = 36). 22 participants have no residual vision, 8 have only light/dark perception, and 6 have very little central vision. The participants have had visual impairments for between 7 and 41 years. All participants use screen readers (TalkBack [8]) as their primary assistive technology for mobile apps, and all have been using mobile devices for 5 years or more. Every participant receives $100 as a reward after the experiment. At the beginning of the experiment, we ask participants to use app functions as much as possible. We also conduct a follow-up survey among the participants about their experience.
The study involves two groups of 18 participants each: the experimental group (P1 to P18), who use the mobile apps with the hint-text generated by our HintDroid, and the control group (P19 to P36), who use the apps without hint-text. Each pair of participants ⟨Px, P(x+18)⟩ has comparable app experience, ensuring that the experimental group has expertise and capability similar to the control group overall [36, 97, 109, 120]. Specifically, given that all participants are affiliated with the same rehabilitation and education institution, we seek the collaboration of the institution's director to assist with the matching. This process ensures a one-to-one ratio between the experimental and control groups, with pairings based on comparable personal competencies. The director's extensive familiarity with the participants, stemming from over two years of close association, provides invaluable insight into their capabilities, ensuring an equitable and balanced distribution between the two study groups.
7.3 Experimental Design
To avoid potential inconsistency, we pre-install the 33 apps on a Samsung Galaxy Note 10 with Android 9.0. For each app in the experimental group, we first run Application Explorer [29] to obtain the GUI page files, then run our HintDroid to complete the missing hint-text, and finally repackage the APK file with the automated script in Section 5.4. To ensure the correctness of the experiment, we check that each app still runs correctly after repackaging.
We start the screen readers on the devices and ask the participants to explore each app separately. The participants in the two groups use the 33 given mobile apps and are required to explore each app fully, covering as many functionalities as possible. Each participant has up to 15 minutes per app, far more than the typical app session (71.56 seconds) [23], and may end the exploration early based on their perceived progress. Each participant conducts the experiment individually, without discussion with others. During their exploration, all screen interactions are recorded, from which we derive their exploration performance.
7.4 Evaluation Metrics
Following previous studies [34, 76, 81], we use the following metrics to evaluate the effectiveness of HintDroid.
• Input accuracy: (number of correct inputs filled in by the user in an app) / (number of all input components in the app)
• Activity coverage: (number of discovered activities) / (number of all activities)
• State coverage: (number of discovered states) / (number of all possible states)
• Filling time: average time from arriving at a page with a text input to filling in the correct input
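For clarity, the sketch below computes these four metrics from a recorded exploration session; the log and app field names are ours, not from the paper.

```python
def session_metrics(log, app):
    # per-participant metrics derived from a recorded exploration session
    fills = log["fill_durations"]  # seconds from reaching an input page to a correct fill
    return {
        "input_accuracy": len(log["correct_inputs"]) / app["num_input_components"],
        "activity_coverage": len(set(log["visited_activities"])) / app["num_activities"],
        "state_coverage": len(set(log["visited_states"])) / app["num_states"],
        "filling_time": sum(fills) / len(fills) if fills else float("nan"),
    }
```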
7.5 Results and Analysis
We present HintDroid's input accuracy, average activity coverage, state coverage, and average filling time across the two groups in Table 4.
7.5.1 Higher Input Accuracy.
As shown in Table 4, the average input accuracy of the experimental group is 0.83, about 152% ((0.83-0.33)/0.33) higher than that of the control group. The Mann-Whitney U test [89] shows a significant difference (p-value < 0.01) between the two groups on the input accuracy metric. This indicates that HintDroid can generate hint-text by analyzing the GUI information of text input components, helping visually impaired individuals better understand input requirements and successfully fill in the correct input. We also find that for some input components with limited information, the hint-text generated by HintDroid is of great help to visually impaired individuals.
We analyze these text input components and summarize them into three categories. First, some input components use abbreviations in their content description/alt-text; for example, the abbreviation "BFR" for "Body Fat Ratio" shown in Figure 12 (a), commonly found in health apps, may not be understood by blind users. Second, due to poor GUI design, an interface may rely only on simple icon components or colors to differentiate inputs; for example, Figure 12 (b) shows an arrow icon for switching between departure and arrival cities that uses colors to differentiate them (visually impaired individuals cannot distinguish colors), without any textual description. Finally, some input components provide neither a hint nor a content description/alt-text, so the meaning must be inferred from context; for example, in Figure 12 (c), users need to enter "diastolic pressure" and "systolic pressure" without any text explanation.
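For reference, the significance tests reported in this section can be reproduced with a standard Mann-Whitney U implementation; sidedness is our assumption, as the paper only reports p < 0.01.

```python
from scipy.stats import mannwhitneyu

def compare_groups(experimental, control, alpha=0.01):
    # two-sided Mann-Whitney U test over per-participant metric values
    stat, p = mannwhitneyu(experimental, control, alternative="two-sided")
    return stat, p, p < alpha
```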
7.5.2 More Explored GUI Pages.
With our HintDroid, the activity coverage of the experimental group is 0.69, about 77% ((0.69-0.39)/0.39) higher than that of the control group, and the state coverage is 0.73, about 66% ((0.73-0.44)/0.44) higher. The Mann-Whitney U test [89] shows a significant difference (p-value < 0.01; more detailed experimental information is on our website) between the two groups on both metrics. This indicates that the hint-text generated by our HintDroid helps visually impaired users fill in the correct input and thus explore more states and activities. We also find that HintDroid helps visually impaired users reach activities that are hard to discover without it; for example, some search-type text input components block access to subsequent content unless the correct search content is entered.
7.5.3 Less Time Cost.
It takes visually impaired users with our HintDroid just 0.88 minutes to trigger the next page by filling in the correct input, versus 2.10 minutes in the control group. The Mann-Whitney U test shows a significant difference (p-value < 0.01) between the two groups in filling time. In fact, the average time of the control group is underestimated, because on average 9 participants either do not attempt to fill in the input or do not continue after entering incorrect input, meaning they would likely need even more time for these input functions.
We watch the video recordings of the app exploration in the control group to further investigate the reasons for the higher time cost. Without HintDroid, we find that participants are unable to understand the requirements of an input component: attempts to input content based on their own experience deviate significantly from the actual input requirements. In contrast, participants using the hint-text generated by HintDroid almost always fill in the correct input in one go. This observation further confirms the importance of the hint-text generated by our HintDroid.
7.6 Users’ Experience With HintDroid
According to the visually impaired users' feedback, all of them confirm the usefulness of our HintDroid in assisting their app exploration. They all appreciate that the hint-text generated by HintDroid helps them understand the input requirements and successfully fill in the correct input, increasing activity and state coverage. For example: "The hint-text generated by HintDroid is very helpful for us to fill in the input." (P1); "The hints were super helpful in guiding me through the input fields. Thanks for making it clear!" (P3); "I really like the straightforward expression of these hints." (P8); "The hints gave me exactly what I needed to know." (P14); "They made it easy for me to fill in the blanks." (P15). Participants express that they like the hint-text: "Great, it's useful. I like it!" (P2); "Yo, these hints were bang on!" (P5); "Plain and simple!" (P9); "These hints were crystal clear!" (P11); "I liked how the hints were friendly." (P12). Participants also note that HintDroid saves exploration time: "Nice job! The hint-text of HintDroid saves our time." (P17); "These hints made the whole input process so much faster." (P13).
The participants also mention drawbacks and potential improvements of HintDroid. They hope it could additionally provide examples of correct input or specify input formats (our method can also generate input content based on the hint-text). For example: "If the hints could guide me better for specific formats like phone numbers or emails, that'd be awesome!" (P4); "Providing links or references for more info in the hints would be really helpful." (P7); "Providing links or references for more info in the hints would be helpful." (P10); "If these prompts can show me how much I need to input, it might be more useful." (P18). Participants also hope that HintDroid can adjust in real time based on their input: "Can it be made into interactive hints? When I encounter problems, your tool can provide more details." (P12); "If we make a mistake, it'd be awesome if the hint could help us figure out what went wrong." (P16).
8 Discussion
In summary, we find that the hint-text generated by our HintDroid can effectively help visually impaired users successfully fill in the correct input.
8.1 The Generalization of Our Approach and Findings
HintDroid is designed to generate the hint-text of text input components, helping visually impaired users successfully fill in the correct input. In addition to Android, there are many other platforms, such as iOS, Web, and Desktop. To reach a broad market, developers tend to build either one cross-platform app or separate native apps for each platform, considering the performance benefits of native apps. Although HintDroid is designed specifically for Android, other platforms expose similar types of information, so it can also be extended to them.
We conduct a small-scale experiment on two other popular platforms: 20 iOS apps with 34 text inputs and 20 Web apps with 57 text inputs, with details on our website. Results show that HintDroid achieves an average exact match and BLEU@1 of 0.73 and 0.88 for iOS apps, and 0.71 and 0.85 for Web apps. This further demonstrates the generality and usefulness of HintDroid, and we will conduct more thorough experiments in the future.
8.2 Potential Applications to End-users
In addition to generating hint-text that helps visually impaired users fill in the correct input, HintDroid can also help end-users in their daily app usage. Given the increasing complexity of mobile apps, filling in the correct input is a challenging task, especially for older users. For example, a GUI page may contain too many text input components for older users to find the correct input, and they may get stuck on one page with repetitive but fruitless attempts. Even typical users may linger on an input page due to unclear input requirements (without hint-text), especially in unfamiliar apps.
Based on the GUI information of the text input component and knowledge from popular app datasets, HintDroid can automatically generate hint-text and complete the "hint" fields of the text input components in an app. HintDroid can therefore also be integrated into UI automation tools [53, 69, 130] to provide developers with more diverse hint-texts. In addition, HintDroid generates input content corresponding to the generated hint-text, which can be used in software testing to help testers generate diverse test cases.
8.3 Potential Directions to Improve the Hint-text
Our experiments show that hint-text can help visually impaired individuals understand input requirements. However, developers lack a unified style and approach when designing hint-texts, which may lead to ambiguity for visually impaired individuals. HintDroid utilizes generative models to produce hint-text, and this process could be further optimized in the future: for example, a hint-text generation model could be customized based on the historical usage records of visually impaired individuals, or on the usage scenarios of different types of applications, providing personalized hint-text.
8.4 Limitations
Although the average metric of the hint-text generated by HintDroid exceeds 70%, there are still some inaccuracies in the generated hint-texts. As analyzed in Section 6.2.1, different developers have different design styles for text input components, and some components have little or no contextual information, all of which can hinder correct hint-text generation. We will keep improving HintDroid to generate hint-text more accurately, e.g., by using information from the previous GUI page.
Regarding the correctness and rationality of the hint-text generated by HintDroid, we only consider whether the input content generated from the hint-text can trigger a page transition. As described in Section 5.3, failing to trigger a page transition is only the worst-case signal; factors such as whether the information conveyed by the hint-text is reasonable, complete, and unambiguous also need to be considered. We will incorporate these evaluation indicators in future work.
In addition, HintDroid is currently an offline, one-time repair approach for missing hint-text that uses repackaging to inject the hint-text. For closed-source apps that use code encryption, code obfuscation, or other techniques preventing decompilation and repackaging, we instead send the hint-text generated by HintDroid to the developers via email. Considering the low cost of the approach (the average time to generate hint-text for a GUI page is 1.86 seconds) and the potential security risks of repackaging, we would prefer to implement HintDroid through real-time interaction. As suggested by the participants in Section 7.6, we will design a real-time interaction approach in the future and integrate it into the screen reader, making it possible to dynamically adjust the hint-text based on user input.
9 Conclusion
The development of applications brings a lot of convenience to the daily lives of visually impaired people: they can use screen readers embedded in mobile operating systems to read the content of each screen within an app and understand what needs to be operated. However, missing hint-text in text input components prevents screen readers from conveying input information. Based on our analysis of 4,501 Android apps with text inputs, over 76% have missing hint-text. To overcome this challenge, we develop an LLM-based hint-text generation approach called HintDroid, which analyzes the GUI information of input components and uses in-context learning to generate the hint-text. To ensure the quality of hint-text generation, we further design a feedback-based inspection mechanism to optimize the hint-text. The automated experiments demonstrate high BLEU scores, and a user study further confirms HintDroid's usefulness.
In the future, we will work in two directions. First, we will improve the performance of our approach by extracting more GUI context information; following the user feedback, we will also optimize the hint-text generated by HintDroid, borrowing ideas from human-machine collaboration studies to better serve users. Second, we will not limit HintDroid to assisting visually impaired individuals in app usage, and plan to explore its potential applications in software development, such as integrating it into IDEs.