1 Introduction
Linguistic synchrony, or linguistic alignment, has been identified as one of the key factors that are positively correlated with cognitive and affective dimensions essential to student learning. In classroom learning environments, students and teachers often engage in linguistic alignment both intentionally and unintentionally, by mirroring each other’s use of vocabulary, syntax, and intonation patterns [
46]. This alignment, encompassing lexical, syntactic, and semantic synchrony, enables students to more effectively grasp complex concepts through clearer communication [
61]. High levels of linguistic synchrony have been shown to positively affect student task performance [
51] and foster a sense of connection between tutors and students, serving as an indicator of improvement in engagement and rapport [
49].
Linguistic synchrony assumes an even more critical role in informal educational settings such as peer-assisted learning, and in AI-powered or online tutoring systems [
14,
25,
51]. Unlike traditional classroom environments where a teacher addresses a group, these one-to-one tutoring formats offer enhanced opportunities for dynamic, back-and-forth interactions. These interactions often facilitate the development of synchronous conversations crucial for effective learning [
54]. Moreover, non-traditional environments introduce unique communication dynamics, including the absence of physical presence, delayed responses, and a reliance on technology, all of which can significantly affect the effectiveness of linguistic alignment [
13,
22,
38]. For instance, in virtual tutoring sessions, the lack of face-to-face interaction may impede natural rapport building, making linguistic synchrony a critical bridge to close the communication gap and ensure students remain engaged and comprehend the material.
In second language (L2) tutoring, students must navigate not only comprehension but also the production of target language features [
9,
47]. In such settings, linguistic synchrony not only facilitates understanding but also aids in acquiring nuanced aspects of the target language [
30]. Studies indicate that linguistic synchrony between tutors and students in L2 learning differs markedly from alignment in fluent conversational dialogues [
9,
47]. This is because students’ ability to align their language with the tutor may be constrained by their proficiency and familiarity with the target language [
41,
53]. Additionally, although tutors aim to guide students toward synchronous language production, they must also balance this with fostering independent language generation, which may temporarily reduce synchrony [
48]. This dynamic tension between alignment and independent language use makes the study of linguistic synchrony in online L2 tutoring especially critical for optimizing teaching strategies and improving student outcomes [
28].
Despite its significance, research specifically focusing on linguistic synchrony within language tutoring for second language learners remains limited in scope [
28]. Existing studies often examine synchrony in a global context [
14,
51], evaluating how overall or average synchrony manifests across an entire learning process. However, they seldom investigate how synchrony evolves dynamically as a moving variable throughout the session or how it interacts with the learning environment, including the tutor’s pedagogical strategies. A few studies have highlighted the potential impact of tutors’ strategies that could influence students’ linguistic alignment (e.g., [
53]), such as the complexity of their word choices [
35,
62]. However, there is limited evidence to understand how tutors’ pedagogical strategies, such as their dialogue acts, associate with the dynamics of linguistic synchrony over the course of a learning session (e.g., [
48]).
Furthermore, many previous studies have focused on evaluating specific aspects of linguistic synchrony independently. Linguistic synchrony is multifaceted, meaning that the alignment between student and tutor can occur across different language dimensions [
46], such as word choices (lexical [
47]), sentence structure or grammatical patterns (syntactic), overall topics, themes, and concepts (semantic [
14]), as well as timing and interaction patterns (temporal [
51]). Recent advancements in computational linguistics and natural language processing (NLP) have significantly enhanced the capacity to efficiently and accurately capture varying dimensions of linguistic synchrony in dialogues [
15,
46,
60].
Hence, this study aims to evaluate the dynamics of linguistic synchrony in virtual second language tutoring environments, focusing on how the multifaceted dimensions of synchrony between the tutor and the students (namely, semantic, syntactic, and lexical synchrony) evolve and relate to one another. We also aim to understand how the specific pedagogical strategies (i.e., pedagogical dialogue acts) employed by tutors are associated with synchrony in encounters with higher- and lower-performing students. The following research questions guided the study:
• RQ1: What is the relationship between semantic, syntactic, and lexical synchrony and the pedagogical dialogue acts (e.g., scaffolding, eliciting) introduced by the tutor?
• RQ2: How do the dynamics of semantic, syntactic, and lexical synchrony differ in tutoring conversations based on student performance?
In addressing these questions, we aim to strengthen the empirical evidence through a detailed, step-by-step process involving the computational modeling of the multifaceted aspects of linguistic synchrony using multivariate time-series analysis (e.g., [
8]). By examining synchrony as a dynamic process, our research provides a more detailed analysis of student-tutor interactions, with potential applications for improving personalized learning environments and tailoring pedagogical strategies in virtual second language tutoring environments.
3 Results
3.1 RQ1. Linguistic Synchrony Measures and Pedagogical Dialogue Act
To address RQ1, we investigated synchrony methods that represent the lexical, syntactic, and semantic dimensions of the tutoring sessions. The average synchrony measures for syntactic synchrony were extracted from JSDuPOS and ALIGN (Penn POS distribution features). Lexical synchrony was evaluated using the ALIGN method (lexical token overlap). Lastly, semantic synchrony was calculated in relation to the pedagogical dialogue acts using CLiD (Word2Vec+WMD) and variants of semDist (using BERT, FastText, and GloVe embeddings with cosine distance).
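The three families of measures named above can be illustrated with a minimal sketch. It is not the JSDuPOS, ALIGN, CLiD, or semDist implementation: the function names are hypothetical, the Jaccard token overlap is only a stand-in for a lexical overlap score, and the embedding vectors passed to `cosine_distance` are assumed to be precomputed utterance embeddings.

```python
import math
from collections import Counter


def js_divergence(tags_a, tags_b):
    """Jensen-Shannon divergence (base 2) between two POS-tag distributions.

    0.0 means identical tag distributions; 1.0 means no tags in common.
    Both tag lists are assumed to be non-empty.
    """
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    total_a, total_b = sum(count_a.values()), sum(count_b.values())
    jsd = 0.0
    for tag in set(count_a) | set(count_b):
        p = count_a[tag] / total_a
        q = count_b[tag] / total_b
        m = 0.5 * (p + q)
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


def lexical_overlap(tokens_a, tokens_b):
    """Jaccard overlap of two token sets (assumed stand-in for lexical alignment)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm
```

Note the direction of each score in this sketch: the two distance-style measures (`js_divergence`, `cosine_distance`) are lower when the tutor and student turns are closer, whereas `lexical_overlap` is higher when they share more vocabulary.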
Among the semantic synchrony measures, the reference and clarification dialogue acts showed notably the lowest average values across all four methods of semantic synchrony, as illustrated in Figure 2 and Table 4. Since these metrics are distance-based, where a lower value indicates a closer distance, this suggests that making references and clarifications is typically related to higher semantic synchrony on average. However, it is important to note that both categories occurred considerably less often than the other dialogue acts, limiting their representation in the dataset. This infrequent occurrence could lead to higher variability in the synchrony metrics and may not fully capture the role of these acts in fostering semantic alignment. Conversely, the revision and opening dialogue acts indicated the lowest semantic synchrony, or highest semantic divergence.
Because the data were not normally distributed, a non-parametric statistical test, the Kruskal-Wallis H test, was conducted to evaluate whether any dialogue acts exhibited significantly different synchrony values. Dunn’s post-hoc test with a Bonferroni correction was then applied to pairwise comparisons to determine which dialogue acts differed significantly from each other. In summary, opening was often identified as the most semantically and lexically divergent compared to the other dialogue acts. For instance, across the semantic measures, the opening dialogue act differed significantly from enquiry (p<.001) and topic development (p<.001). Similarly, on the lexical measure, opening differed significantly from other categories, including eliciting (p<.001), enquiry (p<.001), repair (p<.001), revision (p=.012), scaffolding (p<.001), topic development (p<.001), and topic opening (p=.008). Syntactically, the ‘topic’-related dialogue acts often showed significantly higher synchrony. The JSDuPOS score revealed significant differences in syntactic synchrony between topic development and eliciting (p=.011), opening (p=.036), repair (p=.001), revision (p=.013), and scaffolding (p<.001). Significant differences were also found between topic opening and repair (p=.035), as well as between topic opening and scaffolding (p=.015).
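The testing procedure described here can be sketched in plain Python. This is a minimal illustration rather than the analysis code used in the study: the function names and the toy values are hypothetical, the H statistic uses the usual tie correction, and Dunn’s z statistics are converted to two-sided p-values and multiplied by the number of comparisons (Bonferroni). The p-value for H itself would come from a chi-square distribution with k-1 degrees of freedom, which is omitted here.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_dunn(groups):
    """groups: dict mapping a dialogue act name to its synchrony values.

    Returns (H, {(act_a, act_b): Bonferroni-adjusted p}).
    Assumes the pooled values are not all identical.
    """
    names = list(groups)
    pooled = [v for name in names for v in groups[name]]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    rank_sums, sizes, start = {}, {}, 0
    for name in names:
        size = len(groups[name])
        rank_sums[name] = sum(ranks[start:start + size])
        sizes[name] = size
        start += size

    tie_sum = sum(t ** 3 - t for t in Counter(pooled).values())
    h = 12 / (n_total * (n_total + 1)) * sum(
        rank_sums[g] ** 2 / sizes[g] for g in names
    ) - 3 * (n_total + 1)
    h /= 1 - tie_sum / (n_total ** 3 - n_total)  # tie correction

    pairs = list(combinations(names, 2))
    base_var = n_total * (n_total + 1) / 12 - tie_sum / (12 * (n_total - 1))
    adjusted = {}
    for a, b in pairs:
        diff = rank_sums[a] / sizes[a] - rank_sums[b] / sizes[b]
        z = diff / math.sqrt(base_var * (1 / sizes[a] + 1 / sizes[b]))
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        adjusted[(a, b)] = min(1.0, p * len(pairs))  # Bonferroni
    return h, adjusted


# Hypothetical synchrony values per dialogue act (illustration only).
toy = {
    "opening": [0.90, 0.80, 0.85, 0.95],
    "enquiry": [0.30, 0.35, 0.40, 0.25],
    "scaffolding": [0.50, 0.45, 0.55, 0.60],
}
h_stat, dunn_p = kruskal_dunn(toy)
```

Because the Bonferroni step multiplies every pairwise p-value by the number of comparisons, it becomes conservative as the number of dialogue act categories grows.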
3.2 RQ2. Student Performance based on Synchrony Patterns in the Sessions
A total of 320 Tsfresh features (out of 396) remained after the feature filtering described in Section 2.5.4. Separately, 11 dummy-coded features capturing the total frequency of each pedagogical dialogue act were extracted. Figure 2 provides an overview of how the final Tsfresh features were extracted to model the time-series patterns of linguistic synchrony in the three dimensions: semantic, syntactic, and lexical. Table 5 reports the performance of the five classification models. The left side of the table shows model performance when no pedagogical dialogue act features were introduced, that is, when the models relied solely on the time-series characteristics of the semantic, syntactic, and lexical synchrony features. By contrast, the right side shows performance when the pedagogical dialogue act frequency features were added. Overall performance improved across all five models. The improvement was largest for GBM, which achieved an accuracy of 0.808, precision of 0.814, recall of 0.808, and F1 of 0.810 when both the time-series features and the pedagogical dialogue acts were included in model training. This represents 17.4% higher accuracy and a 17.1% higher F1 score for GBM when the pedagogical dialogue acts were introduced, indicating the significant role that pedagogical dialogue acts play in predicting student performance from tutor-student interaction over the course of the tutoring sessions.
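As a rough sketch of how the two feature sets compared in Table 5 might be assembled, the snippet below counts dialogue acts per session and appends the counts to a vector of time-series features. The label list contains only the ten dialogue acts named in this section; the full scheme has 11 features, and the remaining label, the function names, and the ordering are assumptions made for illustration.

```python
from collections import Counter

# Dialogue acts named in this section (assumed, incomplete label set).
DIALOGUE_ACTS = [
    "opening", "topic opening", "topic development", "enquiry", "eliciting",
    "scaffolding", "repair", "revision", "reference", "clarification",
]


def act_frequency_features(session_acts, labels=DIALOGUE_ACTS):
    """Count how often each dialogue act occurs in one session, in a fixed order."""
    counts = Counter(session_acts)
    return [counts.get(label, 0) for label in labels]


def combine_features(time_series_features, session_acts):
    """Concatenate time-series features with dialogue act frequency features."""
    return list(time_series_features) + act_frequency_features(session_acts)
```

Training the same classifier once on `time_series_features` alone and once on the output of `combine_features` would mirror the left and right sides of the comparison described above.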
3.3 RQ2. Student Performance based on Dynamic Synchrony and Pedagogical Patterns
We identified the top 10% of features with the highest importance among the Tsfresh and dialogue act features. The importance of each feature in the Random Forest model was determined by the mean decrease in impurity, calculated as the sum of the Gini impurity reduction each feature provided across all trees in the forest. Features that resulted in larger reductions in impurity were assigned higher importance scores. The total importance was normalized so that the sum of all feature importance values equaled 1. For the Gradient Boosting algorithm, feature importance was determined by the cumulative error reduction associated with each feature, normalized so that the total importance across all features summed to 1. Lastly, for the KNN algorithm, we used a permutation-based approach to compute feature importance. As shown in Figure 3, the importance of the features is ordered by each method, with the top categories representing the dialogue acts, lexical, semantic, and syntactic features.
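The permutation-based approach and the sum-to-1 normalization can be sketched as follows. This is an illustrative sketch under stated assumptions: a 1-nearest-neighbour classifier stands in for the KNN model (the value of k is not given here), accuracy is used as the score, and all function names are hypothetical.

```python
import random


def normalize(importances):
    """Scale importance scores so that they sum to 1 (unchanged if the sum is 0)."""
    total = sum(importances)
    return [value / total for value in importances] if total else list(importances)


def predict_1nn(train_x, train_y, row):
    """Label of the closest training row (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, t)) for t in train_x]
    return train_y[distances.index(min(distances))]


def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(
        predict_1nn(train_x, train_y, row) == label
        for row, label in zip(test_x, test_y)
    )
    return hits / len(test_y)


def permutation_importance(train_x, train_y, test_x, test_y, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column of the test set is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(train_x, train_y, test_x, test_y)
    importances = []
    for col in range(len(test_x[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[col] for row in test_x]
            rng.shuffle(column)
            shuffled = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(test_x, column)
            ]
            drops.append(baseline - accuracy(train_x, train_y, shuffled, test_y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

Unlike impurity-based importance, the permutation approach only needs predictions from the fitted model, which is why it suits models such as KNN that expose no internal importance scores.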
In terms of pedagogical dialogue acts, the frequencies of scaffolding, eliciting, and enquiry were found to have the highest importance in the GBM model, while the frequency of revision dialogue acts was identified in the SVM model as a key pedagogical strategy predictive of student performance. Scaffolding and enquiry dialogue acts occurred most frequently in the higher-performance groups, while eliciting and revision dialogue acts appeared more frequently in the other groups. Additionally, the frequency of the eliciting dialogue act was identified as a key predictive feature of student performance in the RF model (0.05).
For lexical features, the GBM model assigned the highest importance to a CWT coefficient with a larger window (n=14), indicating that changes in lexical synchrony over a longer conversational context were the most significant variable. Similarly, the GBM highlighted the FFT absolute coefficient for a larger context (n=26), emphasizing the importance of lexical synchrony over a broader window of synchrony changes. We found that the lower-performance group tended to have higher values for the CWT coefficient (B1:0.132, B2:0.087) and the FFT coefficient (B1:0.314, B2:0.347) than the higher-performance group (CWT: C1:0.080, C2:0.051; FFT: C1:0.296, C2:0.292). Higher values in these coefficients indicate stronger long-term lexical synchrony that is persistent and smooth between the tutor and student, while lower values generally signify weaker or less consistent lexical synchrony over a longer window, suggesting rapid and short-term changes in lexical choices. The SVM model also emphasized the importance of FFT and CWT coefficients, and it assigned the highest importance to the mean change quartile and Lempel-Ziv complexity (n=3), which captures the repetitiveness of the lexical synchrony pattern over a short-term window of conversational turns. This feature also decreased in value as student performance level increased (B1:0.356 vs. B2:0.306 vs. C1:0.300 vs. C2:0.289), indicating that, in terms of short-term lexical alignment, lower-performance students exhibited higher lexical mimicry of the tutors than others. The RF model showed similar behavior, where the C3 lag score was identified as the most predictive feature. To summarize, both long-term and short-term lexical synchrony intensity and repetitiveness were predictive of student performance, with lower-performance students demonstrating higher lexical synchrony with the tutors in both contexts.
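To make the FFT absolute coefficient concrete, the sketch below computes the magnitude of a single discrete Fourier coefficient of a synchrony series directly from the DFT definition. The function name is hypothetical, and treating the reported n as the coefficient index `k` is an assumption; the CWT and Lempel-Ziv features are not covered here.

```python
import cmath
import math


def fft_abs_coefficient(series, k):
    """Magnitude of the k-th discrete Fourier coefficient of a series.

    A large value means the series carries a strong oscillation that
    completes k cycles over the length of the series.
    """
    n = len(series)
    total = sum(
        x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series)
    )
    return abs(total)


# Hypothetical lexical synchrony series: two full cycles over 16 turns.
toy_series = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
strong = fft_abs_coefficient(toy_series, 2)  # matches the oscillation
weak = fft_abs_coefficient(toy_series, 3)    # no energy at this frequency
```

The direct sum is quadratic in the series length, which is acceptable for session-length series; a fast Fourier transform gives the same coefficients more efficiently.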
In the case of semantic features, the GBM model identified the overall value of semantic synchrony across the session as a key feature, as evidenced by the high importance scores assigned to the quantile values of semantic synchrony. This pattern was identical in the RF model. The C3 lag score reflects how previous semantic patterns (such as topics, meanings, or concepts) immediately resurface in the dialogue. In the SVM model, important predictors of student performance included absolute energy (the overall intensity of the semantic score) and mean change (the intensity of semantic score fluctuations over the tutoring sessions). Interestingly, all of the coefficients showed lower semantic synchrony (with higher coefficient values) in the lower-performance group and higher synchrony (with lower coefficient values) in the higher-performance group. This indicates that the overall intensity (e.g., quantile, absolute energy) and short-term recurrence and repetitiveness of semantic topics (e.g., C3 lag score, mean) were much stronger in the higher-performance group.
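The three time-series statistics named in this paragraph have short closed-form definitions, sketched below in plain Python following their commonly used formulations. The function names mirror the feature names but are written here for illustration; they are not calls into the Tsfresh library.

```python
def abs_energy(series):
    """Sum of squared values: the overall intensity of the series."""
    return sum(x * x for x in series)


def mean_change(series):
    """Average difference between consecutive values.

    Equivalent to (last - first) / (n - 1); needs at least two values.
    """
    return (series[-1] - series[0]) / (len(series) - 1)


def c3(series, lag):
    """C3 statistic: mean of x[i] * x[i + lag] * x[i + 2 * lag].

    Returns 0.0 when the series is too short for the requested lag.
    """
    n = len(series) - 2 * lag
    if n <= 0:
        return 0.0
    return sum(
        series[i] * series[i + lag] * series[i + 2 * lag] for i in range(n)
    ) / n
```

Because the semantic measures are distances, larger values of `abs_energy` on such a series correspond to larger distances overall, which is consistent with the reading above that higher coefficient values indicate lower semantic synchrony.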
Lastly, for syntactic features, the GBM model assigned the highest importance to the ratio of syntactic synchrony values, which indicates the density of syntactic synchrony occurrences. We observed that the higher-performance group showed a higher ratio (B1:0.845, B2:0.859, C1:0.940, C2:0.947). The SVM model highlighted approximate entropy indices as the most important features, signifying the predictability and variability of syntactic synchrony. Again, higher performance was associated with higher entropy scores (B1:0.325, B2:0.422, C1:0.429, C2:0.452). These results indicate that the frequent occurrence of high syntactic alignment, which is repetitive and predictable, is predictive of students in the higher-performance group compared to those in the lower-performance group.
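Approximate entropy can be sketched from its standard definition: it compares how often short windows of length m that match within a tolerance continue to match when extended to length m+1. The function below is a minimal illustration with assumed defaults (m=2 and a tolerance of 0.2 times the standard deviation); the parameters used in the study are not stated in this section.

```python
import math


def approximate_entropy(series, m=2, r=0.2):
    """Approximate entropy of a non-constant series (standard definition).

    m is the window length; the match tolerance is r times the
    population standard deviation of the series (an assumed convention).
    """
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    tolerance = r * std

    def phi(length):
        windows = [series[i:i + length] for i in range(n - length + 1)]
        total = 0.0
        for a in windows:
            # Each window matches itself, so the count is never zero.
            matches = sum(
                1
                for b in windows
                if max(abs(x - y) for x, y in zip(a, b)) <= tolerance
            )
            total += math.log(matches / len(windows))
        return total / len(windows)

    return phi(m) - phi(m + 1)


# Hypothetical series: strictly alternating versus chaotic (logistic map).
regular = [float(t % 2) for t in range(100)]
irregular = [0.3]
for _ in range(99):
    irregular.append(4.0 * irregular[-1] * (1.0 - irregular[-1]))
```

Under this definition a strictly periodic series such as `regular` scores close to 0, and the score grows as the series becomes less regular.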
3.3.1 Time-series Pattern Comparisons.
Comparison of the patterns indicates differences in synchrony levels by pedagogical dialogue act, as shown in Figure 4. We specifically focused on comparing the dialogue acts that were assigned as important features in the classification model decisions. At the syntactic (JSD) and semantic (WMD) levels, lower values refer to higher levels of synchrony, whereas at the lexical level (ALIGN), higher values reflect higher synchrony. At the syntactic and semantic levels, higher performing groups show lower distance values overall, displaying higher levels of synchrony throughout the session. Furthermore, the patterns of the dialogue acts align with our findings from the models’ feature importance, where levels of consistency in long-term lexical synchrony contribute to predicting student performance.
The eliciting dialogue act at the semantic level shows that the tutor and student in the higher performing groups have greater levels of synchrony than those in the lower performing groups. As the tutor elicits information during the session, the responses of students in the lower performance group appear not to align well with the tutors’ intended topic. At the lexical level, synchrony of the eliciting dialogue act improves for the lower performing groups towards the latter part of the session. However, at the syntactic and semantic levels, initial synchrony values are either maintained or worsen. This suggests that students from this group tend to replicate the vocabulary associated with the elicited topic; however, the use of these terms does not necessarily reflect full comprehension. For the scaffolding dialogue act, the higher performing group shows steady improvement in synchrony over time across all three dimensions of synchrony, indicating that the tutors’ supports are effective in improving understanding. On the other hand, synchrony levels for the lower performing group remain lower and steadily decrease, suggesting that the content discussed in the session presents a substantial challenge to their understanding. By contrast, the enquiry dialogue act shows similar patterns between higher and lower performance groups. A prominent distinction in synchrony levels across the three dimensions is that higher performance groups exhibit markedly low levels of synchrony at the end of the tutoring session. This is depicted by high values at the syntactic and semantic levels, and a flat line towards the 60-turn mark of the lexical dimension. This suggests that during the latter stage of the session, the questions raised refer to challenging content, resulting in a disconnect in their shared understanding.