1 Introduction

The rapid advancement of information technology has brought significant changes and improvements to many aspects of people's lives. In higher education, extensive research has yielded remarkable achievements. However, the comprehensive implementation of educational digitization is hindered by constraints imposed by teaching requirements and the entrenched habits of teachers and students. The practical application of these achievements in actual instruction and learning remains severely limited (Johnson et al., 2016; Marcelo et al., 2015), and their implementation varies considerably across disciplines (Mercader & Gairín, 2020).

Nevertheless, language education stands out among other fields with its proactive, persistent, and extensive demand for educational technology, particularly in second language education, which has pioneered the digitalization of instruction. Relevant studies have consistently demonstrated the effectiveness of computers and other tools as aids for language learners (Nomass, 2013), catering to a wide range of users, from elementary school students to lifelong learners, with diverse applications in mobile-assisted language learning (MALL) (Elaish et al., 2019). Language learning applications cover a broad spectrum of functionalities, from looking up words on a mobile device (Alemi et al., 2012) to evaluating pronunciation (Tarighat & Khodabakhsh, 2016). Nowadays, even learners' listening comprehension cannot be tested without information technology. According to a report by Talking Data (2021), five of the top ten educational apps in China in 2020 were English learning apps, and 15 of the top 50 were language learning apps, 12 of them dedicated to foreign language learning. This underscores the indispensability of technology-assisted tools in English learning.

However, despite the gold mine of big data generated by English learning apps, their commercial nature hinders scholars from accessing and studying the behavior patterns hidden in these data. One major reason is that user data are seldom publicly available. Currently, the measurement of variables in English learning relies primarily on traditional methods such as self-reporting, observation, tests, and peer reviews. Additionally, experimental data are often limited in geographical scope and scale. For instance, a study revealed that only 25% of research in educational data mining related to language learning involved a data volume exceeding 1,000 items (Bravo-Agapito et al., 2020). This indicates that research methods in this area lag behind the rapid development of foreign language learning apps.

In this context, we have accessed publicly available data from 2022 to conduct our research, specifically focusing on user behavior during vocabulary memorization. Utilizing big data and educational mining technology, our objective is to analyze English vocabulary learning behavior and identify potential learning patterns and correlations among learners' strategies.

2 Literature review

2.1 Mobile-assisted language learning

The development of information and communication technology has expanded the scope of educational activities and opportunities. Mobile devices, particularly smartphones, are widely recognized as the most commonly used information and communication technologies, enabling people to learn anytime and anywhere. In the past two decades, the number of mobile devices has surpassed that of traditional desktop computers (Loewen et al., 2019), allowing individuals to learn while traveling, walking, working, or commuting. English learning, as a leading field in integrating technology into education, has witnessed a rapid and extensive application of various learning devices and applications.

The utilization of mobile devices, especially smartphones, in the context of mobile-assisted language learning has garnered significant attention from the academic community. Mobile-assisted language learning (MALL) refers to the use of diverse mobile devices to support language learning. A growing body of research indicates that incorporating mobile technology in various aspects of language learning can enhance second language acquisition (Jeong, 2022; Shortt et al., 2023), particularly in areas such as oral communication, writing, vocabulary, and pronunciation (Hwang & Fu, 2019). Many commercial language learning software programs provide a wide range of features to cater to diverse learning needs, including grammar exercises, vocabulary retention, listening training, and communication practice. These applications are highly popular among students due to their interactivity and personalized learning experiences. Research also suggests that many learners express willingness to utilize mobile-assisted language learning materials outside classrooms and hold a positive attitude towards MALL tools (Burston, 2014; Cheng & Kim, 2019; Rosell-Aguilar, 2018).

However, despite the widespread application of MALL in language learning, there are still several challenges in investigating the learning effectiveness of mobile-assisted learning devices. Existing literature primarily relies on self-report, experimental, and observational methods, with relatively small sample sizes, which limits the conclusions drawn from the studies (Hwang & Fu, 2019). In this context, there is a new research opportunity to explore how students can better utilize mobile-assisted learning devices to improve learning outcomes. As mentioned by Kukulska-Hulme (2009), mobile devices, due to their widespread use, already have an impact on how people learn and educators need to do more than just watch it happen. Therefore, future research can integrate learners' actual behaviors and learning strategies to provide more practical mobile learning recommendations, enhancing the effectiveness of MALL devices in serving learners.

2.2 Language learning strategies

The field of learning strategies has witnessed a significant increase in research activity over the past few decades. Employing effective learning strategies can greatly enhance the learning experience by enabling learners to efficiently retain and master language knowledge. Currently, there is no universally agreed-upon definition of learning strategies, and different perspectives have emerged from various research angles. Some view learning strategies as specific methods or skills (Naveh et al., 2011), while others conceptualize them as processes and specific steps (Neroni et al., 2019). Additionally, some argue that learning strategies encompass implicit rule systems for learning (Liu & Huang, 2002; Chen & Liu, 2019). Despite these varying perspectives, scholars widely acknowledge that the definition of learning strategies is closely associated with learning behaviors, means, and procedures. Certain scholars have explored the connection between learning strategies and learners' behavioral patterns. For instance, Franklin et al. (2020) examined whether the utilization of learning strategies affects student performance by analyzing their behavioral patterns. Similarly, Ahmad Uzir et al. (2020) discovered a significant correlation between students' use of time management strategies and their academic performance through an examination of their behavioral patterns. Thus, a certain correlation exists between learning behavior and learning strategies.

Similar to general learning strategies, second language learning strategies are considered effective means of achieving success in language acquisition. Carton (1966) emphasized that language learning is a problem-solving process in which learners utilize their existing knowledge and experiences. This perspective shifted the focus from innate ability and talent to the significant impact of learners' personal effort and strategy use on language proficiency. Consequently, researchers have shown great interest in the strategies employed during the language learning process. The advent and widespread use of technology-based teaching methods, particularly in language learning, have revolutionized learners' approaches to language acquisition. With a simple internet connection, learners can access and select suitable technologies and tools to facilitate their language learning journey. This has led to the diversification and growing complexity of language learning strategies, which play a crucial role in enhancing learners' language proficiency (Ramalingam et al., 2022). Furthermore, language learning strategies are closely connected to learning behaviors. Oxford (2003) defines second language learning strategies as conscious and intentional activities that manage and control the process of learning a second language; these strategies encompass a set of teachable behaviors.

Based on these findings, we conclude that language learning strategies encompass a specific sequence of behaviors displayed by learners to efficiently acquire, retain, and utilize language knowledge. These strategies directly impact the overall management and regulation of the language learning process. However, since observing a student's language learning strategies directly is challenging, it is necessary to employ suitable mining methods to externalize them.

2.3 Learning ROI

To assess the effectiveness of learning strategies, it is crucial to adopt appropriate evaluation metrics. The traditional evaluation system, which solely relies on learning outcomes as the criteria, has faced criticism for deviating from learning objectives (Eisner, 2001). However, there is no consensus among scholars regarding how to evaluate the learning process and determine suitable evaluation metrics (Murtonen et al., 2017). To enhance the evaluation of the intricate learning process, the input–output model from economics has been introduced to the field of education. Since Coleman (1968) proposed that school resource investment is a criterion for assessing teaching quality in the mid-1960s, numerous studies have explored the relationship between measurable resource inputs in education and student performance. Concepts such as "education production function" and "education productivity" have been adopted by researchers to describe specific relationships within the education field (Verstegen & King, 1998).

In recent years, with the rapid development of online courses, students' learning environments and methods have undergone significant changes. The relationship between students' course input and learning efficiency has become a prominent topic among educational researchers (Smiderle et al., 2020; Veluvali & Surisetti, 2022). Some scholars describe learning efficiency as the ratio of output to input during the learning process, that is, the ratio of effective learning volume to energy input; efficient learning thus means achieving better learning outcomes with less energy input (Zhang & Yang, 2002). Yang (2007) borrowed the concept of "cost and benefit" from economics and proposed that learning effort can be viewed as the cost of learning, while the gains represent the return on learning costs, referred to as learning returns. By analogy, a learner's learning process can be seen as a production process, where learning investment acts as a type of productive investment with a cost–benefit relationship, and learning ROI (return on investment) can be used to evaluate students' learning effectiveness. Some scholars have focused their research on enhancing students' learning efficiency to address long study hours and heavy study burdens (Luo & Yu, 2021; Zhang, 2020). However, this concept has not been quantified into a mathematical formula, even though surveys indicate that Chinese students invest significant time and effort in learning English during their school years, and that this substantial investment has not translated into significant improvements in learning outcomes (Sun & Wang, 2017). Therefore, evaluating the effectiveness of learning strategies must consider the cost of learning input and strive to address this high-cost, low-efficiency situation.

In this study, learning cost refers to the efforts made by learners to achieve their learning objectives, while learning benefit represents the overall advantage gained through students' learning behaviors. We use the term "learning benefits" to measure the extent of students' mastery of English vocabulary. Additionally, we calculate the "learning ROI" by comparing the learning benefit to the learning cost, providing a means to assess the effectiveness of various learning strategies. More details can be found in the method section.

2.4 Applying educational data mining techniques in language learning

The rapid advancement of technology-enabled learning tools and the introduction of new educational environments, such as blended learning, virtual reality, and mobile learning, have led to the accumulation of student data. One of the primary challenges educational institutions are facing today is effectively harnessing this data to gain valuable insights into the learning process. In response to this challenge, a novel approach known as Educational Data Mining (EDM) has emerged. Unlike relying solely on observations or learners' subjective perceptions and recollections, EDM directly taps into the immense potential of data. Specifically, it analyzes users' learning traces and utilizes data mining techniques to examine their actual behaviors. This enables EDM to uncover, monitor, and enhance the educational process, gradually unraveling previously unexplored facets of learning.

In the field of language learning, EDM techniques are commonly used for various purposes. These include exploring second language learners' motivation (Saeed et al., 2014), providing feedback to teachers (Coskun & Mutlu, 2017; Kaoropthai et al., 2016), identifying language anxiety (Baghaei & Ravand, 2015; Guntzviller et al., 2016; Martin & Alvarez Valdivia, 2017), and predicting academic performance (Hwang et al., 2021), with performance prediction being the most prevalent (Wu, 2021). Unsupervised mining methods, such as clustering, sequence pattern mining, and hidden Markov models (HMMs), are widely adopted in this field. These techniques have demonstrated their effectiveness in uncovering hidden learning patterns and relationships from students' learning data (Kovanović et al., 2015). Despite the growing popularity of online and mobile learning environments in language learning, a report on the usage of EDM technology in this domain (Wu, 2021) revealed that more than 50% of studies still take place in traditional learning settings. While many studies in online environments are based on platforms like MOOCs, there is relatively less research focusing on mobile applications. Additionally, although big data is one of the key characteristics of EDM research, only 25% of language learning research projects involve datasets with more than 1,000 items of data (Wu, 2021).

Additionally, the literature on utilizing EDM to uncover learners' learning strategies is limited, with further constraints posed by platform selection and the data that can be collected. Saint et al. (2018) analyzed the behavioral data of computing learners in online environments using HMM techniques, focusing on the relationships between learning strategies; however, they did not explicitly recommend "beneficial" learning strategies. Maldonado-Mahauad et al. (2018) identified self-regulated learning strategies among MOOC learners, categorizing them based on questionnaire data and partial behavioral data, but did not delve into the connections between learning strategies and learning outcomes. Furthermore, to the best of our knowledge, no published literature has involved the identification of learning strategies from MALL data using data mining techniques.

In summary, two gaps exist in current research:

  A) To the best of our knowledge, there is a lack of research on extracting language learning strategies from MALL data.

  B) Many educational data mining studies have not integrated learning ROI, while the existing literature that introduces the idea of learning ROI has not provided quantitative measurements or evaluations.

We have acquired a dataset that allows us to observe language learners' behavior from a different angle: identifying their learning strategies and discovering potential patterns. This study utilizes learning ROI as a measure of learning strategy effectiveness and provides a quantified method for calculating it. By considering the associated learning costs, we examine which learning strategies can improve learners' learning ROI. Specifically, we aim to answer the following two questions:

  • RQ1. Which learning strategies have been used by learners with different behavioral characteristics?

  • RQ2. Which strategies can increase learning ROI?

3 Methods

In this section, we randomly divided the processed behavioral data into two parts: a training set and a test set. The first part, consisting of over 1 million data points, was assigned to the training set (referred to as "data0"), while the remaining 1 million-plus data points were assigned to the test set (referred to as "data1"). To address RQ1, we employed clustering analysis to categorize the learning events (one example is presented in Table 2) in the training set data0 and subsequently identified the learning strategies for word memorization based on the features associated with each category of learning events. To validate the reliability of the extracted strategies, we used data1 as the test set and compared the features and learning strategies of each category between data0 and data1 to assess their consistency. For RQ2, we compared the learning ROI of each strategy to explore the optimal approach for enhancing word learning efficiency. Similarly, we replicated the same procedures on the test set data1 to examine the effectiveness of the optimal strategy obtained from the training set.
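The random split described above can be sketched as follows. This is a minimal illustration, not the authors' actual code; it assumes the learning events are held in a pandas DataFrame and that a simple shuffle-and-halve split is acceptable.

```python
import pandas as pd

def split_events(events: pd.DataFrame, seed: int = 42):
    """Randomly shuffle learning events, then split them into two
    halves: a training set (data0) and a test set (data1)."""
    shuffled = events.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    mid = len(shuffled) // 2
    return shuffled.iloc[:mid].copy(), shuffled.iloc[mid:].copy()
```

Fixing the random seed makes the split reproducible, so the same data0/data1 partition can be regenerated for later validation runs.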

3.1 Data source

3.1.1 Data collection

We utilized an open-source dataset from Maimemo, a language learning application that helps learners memorize English vocabulary. The dataset includes behavior data from over 16,000 randomly selected Chinese learners, comprising 2 million instances of word memorization.Footnote 1 The data provide insights into each student's review frequency, time intervals between reviews, word difficulty, and whether the words were successfully memorized during use of the app. Due to data anonymization, we are unable to obtain more specific user information. However, we can determine that the user base of Maimemo consists mainly of full-time students aged 20–29, with a focus on individuals who need to take or prepare for exams (Chen, 2021). The target vocabulary in the application is primarily presented in a flashcard format. Learners have the option to indicate whether they remember the current word. If they indicate that they remember the word, it is skipped; if not, they are directed to a word learning page, as illustrated in Fig. 1. The application periodically prompts learners to recall words they have previously learned, aiding in the consolidation and review of their knowledge.

Fig. 1
figure 1

Vocabulary learning interface in the language learning application

We selected this dataset for two primary reasons. First, it is large, comprising 2 million instances of users' word memorization information, which allows for robust analyses based on big data. Second, the dataset links multiple memory actions by the same user on the same word across the whole lifespan of learning that word, rather than treating them as separate instances. This approach takes into account the impact of each recall action on the learner's retention state and forms a retention sequence. Consequently, the dataset is well suited for studying learners' complex learning characteristics.

3.1.2 Preprocessing of the data source

To eliminate the influence of learners' prior memories formed before using the app, the dataset excludes events where learners reported knowing a word upon their first encounter. We selected seven attributes relevant to our study, as listed in Table 1.

Table 1 Attributes selected for clustering groups

A sample of the dataset is presented in Table 2. In this example, a learner with user ID f17fb3 reviewed the word "circumstance", whose difficulty level was 7, four times in total. The intervals preceding the first three reviews were 0, 1, and 3 days, respectively, and the corresponding retention results were "not remembered", "remembered", and "not remembered". The result of the final review of the word was "not remembered".

Table 2 Dataset sample

Due to the sequential nature of "r-history" and its recorded format as a sequence, a data format conversion is required before clustering. Specifically, we utilized the "Nominal to Numerical" operator in the RapidMinerFootnote 2 tool. This operatorFootnote 3 changes the data type of non-numeric attributes to a numeric type and maps all attribute values to corresponding numeric values, ensuring that identical sequences correspond to the same integer values. The same conversion applies to "t-history".
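The essential property of this conversion, that identical sequences always receive the same integer code, can be sketched as below. The function name is ours; this is an illustrative stand-in for what RapidMiner's "Nominal to Numerical" operator does, not its actual implementation.

```python
def encode_nominal(values):
    """Map each distinct sequence string (e.g. an r-history such
    as "0,1,1") to an integer code. Identical sequences always
    receive the same code; codes are assigned in order of first
    appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes
```

Note that the resulting integers are arbitrary identifiers rather than magnitudes, which is why, as discussed later, the means of these encoded attributes are not directly interpretable.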

Among these seven attributes, six are labels used in the original dataset (Ye et al., 2022). The only exception, difficulty (d), was introduced by us to control for the influence of word difficulty on word learning. We looked up each word (w) in the word difficulty database adopted by the app to obtain its difficulty. This database divides words into 10 levels based on the average probability of users remembering them, with 1 being the easiest and 10 the most difficult (Ye et al., 2022). As we were concerned about the potentially biased influence of very easy words on the results, we excluded all events with word difficulty levels below 4 (for example, "among"). After filtering, the training dataset (data0) contained 901,225 learning events, and the testing dataset (data1) contained 919,165 learning events.
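The difficulty filter described above amounts to a one-line predicate; the sketch below assumes, for illustration, that each event is a dict carrying the word (w) and its looked-up difficulty (d).

```python
def filter_by_difficulty(events, min_d=4):
    """Keep only learning events whose word difficulty d is at
    least min_d, excluding very easy words (levels 1-3)."""
    return [e for e in events if e["d"] >= min_d]
```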

3.2 Methods for RQ1

To address RQ1, our analysis entails the following steps: 1) Conducting clustering analysis to categorize the learning events into distinct clusters based on their behavioral characteristics; 2) Employing the Kruskal–Wallis H test to determine if there are significant differences in attributes among the clusters; 3) Performing descriptive statistical analysis to identify typical values associated with each cluster of learning events.

To identify the learning strategies of different clusters of learners, it is necessary to extract meaningful information from both group-level and individual behavioral data. Simple statistical analysis has limited explanatory power and may not reveal hidden correlations, given that learning is a complex process involving multiple variables. Hence, we employed cluster analysis, which enables simultaneous consideration of multiple attributes and unsupervised identification of similar groups. Cluster analysis has also been extensively utilized in educational environments (Merceron & Yacef, 2005; Sinaga & Yang, 2020).

Excluding user ID and word, five attributes were used as clustering features, i.e., i, t-history, r-history, d, and r. The clustering algorithms used for comparison and analysis were DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (Ester et al., 1996), support vector clustering (Ben-Hur et al., 2001), hierarchical clustering (Defays, 1977), k-means (Ahmed et al., 2020), and x-means (Pelleg & Moore, 2000). These five algorithms were applied to the training dataset data0 to compare their performance. Except for the hierarchical clustering algorithm, which was run in the OriginFootnote 4 tool, the algorithms were run in the RapidMiner tool as illustrated in Fig. 2.

Fig. 2
figure 2

Data flowchart of clustering analysis in RapidMiner

These algorithms differ in computational complexity, and some could not finish within a reasonable time when the input was the whole of data0. In that case, we re-sampled data0 to obtain a smaller subset on which the algorithms could complete. Details on the clustering results of the various algorithms can be found in the Appendix.

Considering its best trade-off between data scale and running time, and the consistency of clustering results across the algorithms, the k-means algorithm in RapidMiner was chosen for the subsequent analysis tasks, e.g., checking the clustering results in data1.

K-means is an unsupervised algorithm that partitions data, here learners' behavioral data, by iteratively searching for the closest cluster centers (centroids) (Bangoria Bhoomi, 2014; Sinaga et al., 2021). It has become one of the most commonly used classical algorithms in data mining due to its ease of implementation, efficiency, and speed (Venkateswarlu & Raju, 2013; Yang & Sinaga, 2019). The k-means process consists of three stages: first, determining the centroids; second, calculating the distances between centroids and individual data objects using the Euclidean distance; and third, grouping the data based on the minimum distance between centroids and objects.
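Although the study ran k-means inside RapidMiner, the same clustering step can be reproduced with scikit-learn, as sketched below. The feature matrix is assumed to already contain the five numeric clustering features (i, encoded t-history, encoded r-history, d, r); function name and defaults are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_events(X: np.ndarray, k: int = 4, seed: int = 0):
    """Partition learning events into k clusters with k-means.
    Each row of X holds one event's numeric clustering features.
    Returns the cluster label of each event and the centroids."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(X)
    return labels, km.cluster_centers_
```

Because the encoded sequence attributes are arbitrary integer codes, distances along those dimensions should be interpreted with care; the study mitigates this by validating the resulting groups with significance tests rather than relying on the geometry alone.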

To confirm that word memory behavior differs significantly among groups of learners, and thus that clustering successfully divided the learners into four distinct groups, a test of group differences is required. Because the data deviated from the normality assumption underlying analysis of variance, we employed Kruskal–Wallis H tests on all attributes and conducted pairwise comparisons between clusters. Additionally, to observe the contribution of each attribute to the clustering results, we conducted a correlation analysis on the clustering variables, using η2 to measure attribute contribution. η2 is an effect size measure commonly used to assess the strength of association between a grouping variable and an outcome variable, and it is widely applied in research on second language acquisition (Norouzian & Plonsky, 2018). Its values range from 0 to 1, indicating the extent to which one variable explains the variance in another. A higher value of η2 indicates a stronger association, where one variable accounts for a substantial portion of the variance in the other; a value close to 0 suggests a weak association, where one variable explains little or no variance in the other (Richardson, 2011).
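The test and effect size can be sketched for a single attribute as follows. The paper does not state which η2 formula was used; the sketch uses a common estimator derived from the H statistic, η2 = (H − k + 1)/(n − k), purely for illustration.

```python
from scipy.stats import kruskal

def kruskal_with_eta2(groups):
    """Kruskal-Wallis H test on one attribute across clusters,
    plus an eta-squared effect size estimated from H, where k is
    the number of groups and n the total sample size."""
    h, p = kruskal(*groups)
    n = sum(len(g) for g in groups)
    k = len(groups)
    eta2 = (h - k + 1) / (n - k)
    return h, p, eta2
```

A small p-value indicates that at least one cluster differs on the attribute, while η2 quantifies how much of the attribute's variance the cluster membership explains.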

To extract meaningful group-level characteristics from the clusters for identifying learning strategies, it is necessary to select the most representative events from each cluster. Since the main behavioral data in this study are the students' retention sequences and time interval sequences, we identified the top 1‰ most frequent r-history values in each cluster, and then selected the three most representative events based on the number of reviews and the most frequent t-history. By comparing the characteristics of these typical events, we can determine the corresponding learning strategies of learners in each group.
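The selection of frequent r-history values can be sketched as below. The paper does not specify whether the 1‰ is taken over all events or over distinct sequences; this sketch assumes distinct sequences and always keeps at least one.

```python
from collections import Counter

def top_permille_histories(histories, permille=1):
    """Return the most frequent r-history sequences in a cluster,
    keeping roughly the top `permille` per thousand distinct
    sequences (at least one is always kept)."""
    counts = Counter(histories)
    n_keep = max(1, len(counts) * permille // 1000)
    return [h for h, _ in counts.most_common(n_keep)]
```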

3.3 Methods for RQ2

To answer RQ2, it is necessary to evaluate the learning ROI of each cluster. As mentioned previously, the learning ROI is the ratio of a student's learning benefit to their learning cost. Ideally, learning cost would be measured by the time spent on a single review or accumulated over multiple reviews. However, no variable in the original dataset recorded such information, so we used the number of reviews (i.e., i) instead. It is reasonable to assume that a learner who reviews a word multiple times has expended more effort than one who reviews it only once. To indicate the chance of a word being successfully remembered, we use the average value of r within each event, denoted by P (recall probability); a higher P indicates better recall performance. The learning benefit is thus associated with the recall probability (i.e., P) and the word's difficulty level (i.e., d).

In other words, among the clustering variables, the number of reviews is inversely proportional to the learning ROI, while word difficulty and the probability of successfully recalling a word are directly proportional to it. Therefore, for each learning event, we derived the following formula for the learning ROI:

$$ROI=\frac{P\times d}{i}$$
(1)

Using cluster analysis, we input the values of each group into the formula to calculate the learning ROI for each cluster. Subsequently, the Kruskal–Wallis H test was employed to examine whether there are significant differences in the learning ROI among the clusters. Through comparative analysis, we aim to identify the strategies that yield higher learning ROI.
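Eq. (1) and the per-cluster aggregation can be sketched directly; the event representation (a dict with keys 'cluster', 'P', 'd', 'i') is an assumption made for illustration.

```python
def learning_roi(P, d, i):
    """Eq. (1): learning ROI = (recall probability x word
    difficulty) / number of reviews."""
    return P * d / i

def mean_roi_by_cluster(events):
    """Average learning ROI per cluster; each event is a dict
    with keys 'cluster', 'P', 'd', and 'i'."""
    sums, counts = {}, {}
    for e in events:
        c = e["cluster"]
        sums[c] = sums.get(c, 0.0) + learning_roi(e["P"], e["d"], e["i"])
        counts[c] = counts.get(c, 0) + 1
    return {c: s / counts[c] for c, s in sums.items()}
```

For example, an event with P = 0.8, d = 7, and i = 4 yields an ROI of 1.4, whereas the same recall performance achieved over 24 reviews would yield a much lower ROI, reflecting the higher cost.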

3.4 Validating answers to RQ1 and RQ2 in a test dataset

We utilized the remaining portion of the data, data1, as a test set to validate the results of training on data0 for RQ1 and RQ2.

To validate the results of RQ1 in data1, we followed the same procedure as in data0. We selected the same clustering attributes, and performed k-means clustering analysis. To determine if there were any significant differences between each cluster, we employed the Kruskal–Wallis H test. Within each cluster, we identified the top 1‰ of events based on the frequencies of r-history and analyzed their attributes, such as i and t-history. These attributes helped us identify typical values, representing the learning strategies of each group. Finally, we compared these results with the ones found in the training set to enhance the reliability of our findings.

To validate the results of RQ2 in data1, we also followed the same procedure as in data0. We calculated the learning ROI of each cluster using the aforementioned formula. In addition, we performed a Kruskal–Wallis H test and correlation analysis on the clustering variables, to assess the variations in learning ROI across different groups and examine the variables' contributions to the clustering results.

4 Results

4.1 Learning strategies employed by learners with different behavioral characteristics

4.1.1 Clustering group

The clustering results for data0 are presented in Tables 3 and 4. Among the four clusters, cluster0 contains the highest number of events, significantly more than the other groups, as shown in Table 3.

Table 3 Number of events for each group in data0
Table 4 Descriptive statistics of attributes in each cluster in data0

In Table 4, the average number of reviews (i) varies greatly across the groups, from around 6 in cluster0 to 25 in cluster1. The average difficulty level of the words (d) ranges from 6.8 to 7.4 across the groups. Cluster3 has the highest P at 0.845, while cluster0 has the lowest at 0.761.

As described in Section 3.1.2, both "t-history" and "r-history" are sequences that were transformed into numerical values for use in the clustering algorithms. Consequently, neither their means nor their standard deviations are directly interpretable in this descriptive table; the averages are nonetheless reported to give an intuitive sense of the differences among clusters.

We find that r-history, t-history, and i contribute more strongly to the clustering results than d or P (discussed in detail later in this section). To visualize the attributes' distributions, we plotted the samples along these three variables (Fig. 3); because of RapidMiner's display limit, 2000 samples were randomly selected. Events in cluster 0 have relatively low values on all three attributes, while events in cluster 3 are concentrated in higher value ranges.

Fig. 3 Samples of each cluster in data0

Subsequently, Kruskal–Wallis H tests and correlation measurements were conducted on all clustering attributes, followed by pairwise comparisons between the clusters. The tests show statistically significant differences among the groups in five clustering attributes: i, d, P, r-history, and t-history. Due to space limitations, we present only the Kruskal–Wallis H test on attribute "i" as an example in Table 5.

Table 5 Pairwise comparison of "i" in data0

The results of the correlation measurements are shown in Table 6: t-history, r-history, and i have higher values of η2, indicating a strong contribution to the clustering results.

Table 6 Measurement on attributes’ correlation in data0

The attributes P and d, on the other hand, exhibit relatively low values of η2, yet the Kruskal–Wallis H test still reveals significant differences in these two attributes among the clusters. This may be attributable to factors such as sample size and the complexity of the clustering results affecting the correlation analysis.

Based on these findings, we conclude that the k-means algorithm successfully categorizes the learners into four distinct groups with significant differences. Among the attributes, t-history, r-history, and i make substantial contributions to the clustering results, while P and d contribute relatively little. This highlights the importance of the number of reviews and the allocation of intervals between them.
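The η2 values in Table 6 can in principle be derived from the H statistics. The article does not state which formulation the authors used, so the function below is an assumption: it implements a commonly used effect-size estimate for the Kruskal–Wallis test, η2 = (H − k + 1) / (n − k), with k groups and n total observations.

```python
# Common eta-squared effect size for a Kruskal-Wallis H test (assumed
# formulation, not necessarily the one used in the paper):
# eta^2 = (H - k + 1) / (n - k), with k groups and n observations.
def kruskal_eta_squared(H: float, k: int, n: int) -> float:
    return (H - k + 1) / (n - k)

# e.g., a hypothetical H of 120.5 over k = 4 clusters and n = 2000 events
print(round(kruskal_eta_squared(120.5, 4, 2000), 4))  # -> 0.0589
```

Values near zero indicate a weak association between cluster membership and the attribute (as observed for P and d), while larger values indicate strong contributors (t-history, r-history, i).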

4.1.2 Learning strategies of learners in the training set

By comparing i and the most frequent t-history among the top 1‰ of r-history patterns in each group, we identified the three most representative events for each cluster. The clustering attributes and their average values are shown in Table 7, where the mean P represents the average probability of recalling words in that group's learning events.

Table 7 Representative events for each cluster in data0
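The representative-event selection described above can be sketched as follows. The column names and toy data are our assumptions; the idea is simply to count r-history pattern frequencies within each cluster and keep those at or above the top-1‰ frequency cutoff.

```python
import pandas as pd

# Toy events table; "r_history" mirrors the paper's recall-history attribute
events = pd.DataFrame({
    "cluster":   [0, 0, 0, 0, 1, 1, 1, 1],
    "r_history": ["1,1", "1,1", "1,0", "1,1", "0,1", "0,1", "1,1", "0,1"],
})

def top_patterns(group, q=0.999):
    """Return r-history patterns whose frequency reaches the top 1 permille."""
    counts = group["r_history"].value_counts()
    return counts[counts >= counts.quantile(q)].index.tolist()

for c, g in events.groupby("cluster"):
    print(c, top_patterns(g))
```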

Table 7 shows that learners from different clusters adopt different learning strategies when using the language learning application. Based on the representative events and attribute meanings for each cluster, these strategies can be described as follows:

  • Data0-Cluster 0—Low Effort-Short Interval Group: The learners in this cluster exhibit a relatively low level of effort, with an average of about six review times per learning event. They tend to review words with shorter intervals of one to two days and learn words with a lower difficulty level compared to other clusters. However, their ability to recall the learned words is relatively low.

  • Data0-Cluster 1—High Effort-Short Interval Group: This cluster is characterized by learners who put in a high level of effort, with an average of about 25 review times per learning event, and have a higher tendency to forget after review. They review with shorter intervals, mostly five days or less, and learn words with a higher difficulty level than other clusters. However, their ability to recall the learned words is also relatively low.

  • Data0-Cluster 2—High Effort-Long Interval Group: The learners in this cluster also put in a high level of effort, with an average of about 20 reviews per learning event, second only to Cluster 1. They review with longer intervals of 10 days or more and learn words with a slightly higher difficulty level. Their ability to recall the learned words is relatively high.

  • Data0-Cluster 3—Low Effort-Long Interval Group: The learners in this cluster exhibit a relatively low level of effort, with an average of about 13 review times per learning event. However, they tend to forget the learned words less frequently after review, and review with longer intervals. They learn words with a slightly lower difficulty level, but their ability to recall the learned words in the final review is the highest among all the clusters.

4.1.3 Validation of the test set

To validate the results of the training set data0, the same analysis was conducted on the test set data1, yielding results consistent with the training set. The clustering results are presented in Table 8. The Kruskal–Wallis H test and variable correlation analysis again show significant statistical differences among the attributes of each cluster, with t-history, i, and r-history having the greatest impact on the clustering results. Three typical values were selected from the top 1‰ of r-history frequency rankings in each cluster, as shown in Table 9.

Table 8 Average values of cluster attributes in data1
Table 9 Representative events for each cluster in data1

Similar to the classification of the training set data0, the following four classes can be characterized.

Data1-Cluster 0—High Effort-Long Interval Group

These learners have a relatively high average number of reviews and longer intervals between them, resulting in a higher probability of successfully recalling the vocabulary. This group is similar to cluster 2 in data0.

Data1-Cluster 1—Low Effort-Short Interval Group

These learners have the lowest average number of reviews, with shorter intervals between them, which leads to a lower probability of successfully recalling the vocabulary. This group is similar to cluster 0 in data0.

Data1-Cluster 2—Low Effort-Long Interval Group

These learners have a lower average number of reviews with relatively longer intervals between them. However, they have a relatively high probability of successfully recalling the vocabulary after their final review. This group is similar to cluster 3 in data0.

Data1-Cluster 3—High Effort-Short Interval Group

These learners have the highest average number of reviews with shorter intervals between them, and the vocabulary they learn is more difficult. However, they have a relatively low probability of successfully recalling the vocabulary. This group is similar to cluster 1 in data0.

4.2 Learning ROI of different groups

4.2.1 Learning ROI in the training set

The mean values of each clustering variable in data0 are used to calculate the learning ROI for each group, as shown in Table 10. The groups rank from highest to lowest learning ROI as follows: the Low Effort-Short Interval Group (cluster0), followed by the Low Effort-Long Interval Group (cluster3), the High Effort-Long Interval Group (cluster2), and lastly the High Effort-Short Interval Group (cluster1).

Table 10 Average learning ROI for each cluster in data0

To test whether there are significant differences in learning ROI among the groups, we also conducted Kruskal–Wallis H tests on the learning ROI of each group and found significant differences (Table 11).

Table 11 Significance test of inter-group differences in learning ROI for Data0

4.2.2 Validation of the test set

Table 12 presents the learning ROI of each cluster in data1. The clusters rank from highest to lowest learning ROI as follows: Low Effort-Short Interval (cluster1) > Low Effort-Long Interval (cluster2) > High Effort-Long Interval (cluster0) > High Effort-Short Interval (cluster3). This ordering is completely consistent with that obtained in data0. The learning ROI differences among the clusters also pass the Kruskal–Wallis H test, indicating significant differences between each pair of clusters.

Table 12 Average learning ROI for each cluster in data1

5 Discussion and conclusions

5.1 RQ1: Profiles of users’ learning strategies in technology-supported second language learning

Learners' behavior can be classified into four categories based on the number of review sessions and the time interval between reviews. These categories correspond to four learning strategies: 1) Low-effort Short-interval, characterized by fewer review sessions with short intervals between them; 2) High-effort Short-interval, characterized by more review sessions with short intervals; 3) High-effort Long-interval, characterized by more review sessions with longer intervals; 4) Low-effort Long-interval, characterized by fewer review sessions with longer intervals. These behavior patterns can be interpreted as manifestations of specific learning strategies (Maldonado-Mahauad et al., 2018), primarily reflecting the effort learners invest during the learning process and the methods they employ to memorize words. The number of review sessions varies significantly across strategies: the group with the fewest averages only about six sessions, while the group with the most averages around 25, roughly four times as many.

5.2 RQ2: The differences in learning ROI among different groups

5.2.1 There is a significant correlation between learning ROI and learning strategies

The experimental results showed significant differences in the clustering attributes across groups in both the training and test sets, indicating that the clustering algorithm successfully divided the dataset into subsets with distinct features. The Kruskal–Wallis H test also revealed significant differences in learning ROI between groups in both experiments. In this study, different learning strategies corresponded to different behavioral sequences, and learners employing different strategies had significantly different learning ROI. This finding is consistent with the conclusions of other scholars: Jeong and Biswas (2008) found a strong correlation between the learning performance of middle school science students and identified behavior patterns; Kovanović et al. (2015) identified effective learning strategies and established a significant relationship with higher-level cognition from the perspectives of self-regulation, goal orientation, and cognitive presence; and Ahmad Uzir et al. (2020) used clustering methods to show that students who employed time management strategies in a flipped classroom achieved better learning outcomes.

5.2.2 Low effort-short interval learning strategy yields higher learning ROI

This study reveals that achieving high learning ROI is closely related to two factors: low effort and short review intervals. While the number of word reviews in this study can be viewed as repetitions of the word-memorization task, few studies have explored the interaction between task repetition and the distribution of learning outcomes; most experiments on task repetition involve only three or four repetitions (Rogers, 2022). Our experiment, however, clearly demonstrates the significant impact of repetition on learning ROI in word-memorization tasks. In both the training and test sets, the "low effort-short interval" strategy proves optimal: learners in this group (with about 6 review sessions) achieve significantly higher learning ROI than the other groups. This suggests that, considering the learning cost, to effectively remember words of a certain difficulty (d ≥ 4), the number of reviews can be kept at approximately 6, and the review interval should not be too long. Typical values for the optimal-strategy group indicate that their review intervals mostly fall within 5 days, often within 1–2 days, enabling them to reach the peak of learning ROI more easily. These results align with those of other researchers: Suzuki and DeKeyser (2017) compared learners with review intervals of 1 day and 7 days in language learning and observed that learners with shorter intervals had an advantage in delayed post-tests after 28 days; Suzuki (2017) subsequently found that a 3.3-day group (average review interval of approximately 3.3 days) exhibited higher accuracy in speaking tests than a 7-day group (average review interval of approximately 7 days).

Additionally, we noticed that the "low effort-short interval" group had by far the highest number of events, roughly 19 times that of the second-ranked group. This suggests that the majority of learners using the language learning application have already adopted the learning strategy associated with higher learning ROI. We believe this may be attributed to the application deliberately implementing review plans based on the forgetting curve, which overlap to some extent with our optimal "low effort-short interval" strategy.

5.2.3 The highest recall probability does not guarantee the highest learning ROI

We also observed an interesting phenomenon. Considering P alone, the "low effort-long interval" group has the highest recall probability in both the training and test sets. Although this group (with about 13 review sessions) has the highest P, its learning ROI is not the highest: once learning cost is taken into account, it ranks only second. The group with the lower learning cost, the "low effort-short interval" group (with about 6 review sessions), has the highest learning ROI. Furthermore, the two "high effort" groups (averaging more than 20 review sessions) have both lower recall probability and lower learning ROI. Therefore, improving learning efficiency cannot rely solely on increasing learning costs; when using language learning applications, learners need to keep the learning cost within a reasonable range.

5.2.4 Higher learning ROI possible with fewer review sessions

Many studies suggest a positive correlation between effort and academic performance (Ampofo & Osei-Owusu, 2015), implying that more review sessions and effort lead to better learning outcomes. Contrary to this conventional wisdom, however, more effort and review sessions do not necessarily yield greater learning ROI. In fact, at both short and long intervals, the "high effort" groups showed significantly lower learning ROI than the two "low effort" groups: even with a high number of reviews, whether the interval is long or short, the learning ROI is relatively low, while the groups with fewer repetitions obtained greater learning ROI. This is consistent with other scholars' findings: Sample and Michel (2014) found, in a study of children, that students improved significantly during the first two repetitions of a task but their attention declined during the third; Lambert et al. (2017) observed that when learners repeated a learning task six times within one class, performance improved significantly over the first three repetitions but remained stable in the fifth and sixth, indicating a diminishing return on investment for repeated tasks.

We believe that there are two main reasons for this phenomenon. First, there is a law of diminishing marginal returns in the learning process. That is, as learning time increases, the learner's learning performance will show an increasing trend. However, when learning time exceeds a certain limit, the input gradually becomes greater than the output, resulting in the disappearance of the increasing trend and ultimately showing a decreasing trend (Cui, 2000; Huang & Zhen, 2015). In our study, as i increases, the increase in P diminishes or even decreases, resulting in a lower learning ROI.
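The diminishing-marginal-returns argument can be made concrete with a toy model of our own (purely illustrative, not fitted to the study's data): if recall probability saturates as the number of reviews i grows, then a benefit/cost ratio such as P(i)/i necessarily declines.

```python
import math

# Illustrative saturating recall curve (assumed form, not empirical):
# P(i) = 1 - exp(-0.5 * i), so each extra review adds less recall benefit.
def p_recall(i, rate=0.5):
    return 1.0 - math.exp(-rate * i)

# ROI stand-in (recall per review) shrinks as reviews accumulate
roi = [p_recall(i) / i for i in (3, 6, 13, 25)]
print([round(r, 3) for r in roi])
```

The printed sequence is strictly decreasing: the recall gain from review 13 to review 25 is negligible while the cost doubles, which is the mechanism the diminishing-returns explanation appeals to.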

Another reason may be the presence of an inhibitory effect in the learning process, where negative interference occurs between previously learned and subsequently learned information, resulting in the phenomenon of retroactive inhibition in memory retention (Chen & Liu, 2019). Particularly when faced with frequent repetition of word memorization, retroactive inhibition can significantly contribute to information forgetting, and the likelihood of retroactive inhibition increases with the number of repetitions. This finding also supports the perspective of Luo and Yu (2021) that increasing study time does not necessarily lead to improved learning performance, and that fewer repetitions of word memory can actually yield higher learning ROI. Therefore, it is essential to find the optimal balance between learning costs and benefits, adjust the learning state accordingly, and rationally allocate learning time.

5.3 Implications for educational practices for helping users engage in language learning

The main contribution of this article is to utilize learner behavior data to identify optimal learning strategies that enhance learning ROI and effectiveness. The concept of learning ROI, borrowed from the economic framework of cost–benefit analysis, quantifies learning outcomes by assessing the ratio of costs to benefits. This concept emphasizes the importance of cost-effectiveness, guiding learners to make more rational choices in their learning strategies and avoid blindly following trends or investing excessive time and energy without careful consideration. Additionally, this article offers appropriate language learning strategies for learners. The interaction between review intervals, review times, and their impact on task performance and language learning development remains a debatable issue in current practice (Rogers, 2022). Through the analysis of learner behavior data, this study provides exploratory insights to address this problem and identifies the optimal language learning strategy in terms of repetition frequency and interval. Learners can practically apply this strategy to enhance their learning efficiency and effectiveness. Furthermore, it aids teachers in gaining a better understanding of their students' learning processes and enables them to tailor their teaching approaches to enhance teaching quality. In addition, the research methodology and ideas presented in this article provide valuable guidance for designing future language learning applications. By mining and analyzing data, learner characteristics and behavior patterns can be uncovered, offering a solid foundation for designing applications that align with learners' needs and habits, ultimately improving usability and enhancing the user experience. Overall, the contributions of this article carry significant implications for language learning research and teaching practices.

5.4 Limitation and future directions

The use of data mining techniques to identify and compare learning strategies among language learners in educational applications is a novel approach, yet several limitations need to be addressed. Firstly, due to the limited sample size and acquisition method, our research was confined to examining learners' review strategies, and we were therefore unable to explore whether they employed other learning strategies. To expand our understanding of learners' use of language applications, future research could collect more diverse data to investigate other learning strategies.

Secondly, our evaluation of learning ROI focused mainly on time cost and retention benefit; a more comprehensive evaluation of learners' strategies would require consideration of other explicit and implicit aspects, such as economic and emotional investment during the learning process. To achieve this, future research could measure these additional dimensions, improve clustering accuracy, and identify further potential conclusions, applying our research approach and findings to disciplines beyond language learning.

Lastly, future research could collect other forms of data and employ various data mining methods, such as time-series analysis, to investigate learners' most commonly used word-memorization strategies. These methods would provide greater insight into learners' learning behaviors, enabling the identification of more effective strategies and the development of personalized learning support.

5.5 Conclusions

This study has identified distinct groups of language learners based on their learning behaviors within a language application, with each group exhibiting unique characteristics related to effort and review intervals. By analyzing these characteristics, specific learning strategies were identified for each group, with the low-effort, short-interval group demonstrating the highest learning ROI and the high-effort, short-interval group showing the lowest. This highlights the importance of cost-effectiveness in language learning, with the optimal learning strategy being one that balances review frequency and interval to maximize learning ROI.

The study also introduced the economic concept of cost–benefit analysis to the language learning context, emphasizing the importance of balancing investment and expected results. By leveraging data mining techniques to analyze learner behavior data, effective learning strategies were uncovered, providing learners with a more scientific approach to language learning. Furthermore, this data-driven approach offers a new perspective and direction for the design of future language learning applications. In summary, this article offers a data-driven and cost-effective learning method and strategy for language learning through the analysis of big data within an English learning application. It underscores the importance of cost-effectiveness in language learning and highlights the potential of data mining in discovering effective learning strategies.