Duolingo Speaking Whitepaper
Xiangying Jiang∗ , Joseph Rollinson∗ , Haoyu Chen∗ , Ben Reuveni∗ , Erin Gustafson∗ , Luke Plonsky† , and
Bozena Pajak∗
Abstract
Duolingo has previously been shown to be highly effective at teaching receptive listening and reading comprehension skills. The
question remains as to how well Duolingo courses teach productive skills, such as speaking. This study measured the speaking
proficiency of Duolingo learners who had completed the beginning-level course material in the Spanish and French courses. Results
of the Pearson Versant Spanish Test and French Test showed that the speaking skills of Duolingo learners, who had little to no
prior knowledge of the target language and used Duolingo as their only language learning tool, were in line with Duolingo’s course
expectations. Specifically, most of the study participants achieved the level of A2 or above on the CEFR scale. The findings of the
study suggest that Duolingo is effective at teaching speaking in addition to listening and reading.
Keywords
Duolingo, efficacy, Spanish, French, speaking, foreign language
Table 1. Number of Units in Each Section of the Duolingo Spanish and French Courses
3.1 The Background Survey
The background questionnaire included questions related to participants’ language background, reasons for learning the language, level of education, age group, and whether they took classes or used other programs/apps during the time they used Duolingo. The latter question confirmed eligibility to satisfy Criterion #2 for participant selection; see Participants above.

3.2 The Versant Spanish and French Tests
The Versant Spanish and French Tests are tests of spoken language developed by Pearson Education (https://www.pearson.com/english/versant.html). The spoken language tests were designed to “measure the core speaking skills” of language learners. The Spanish and French tests include seven tasks. Table 4 lists each task with a brief description (Pearson Education, 2019).

The Versant Spanish and French Tests require test-takers to read sentences aloud, listen and repeat sentences, say the opposites of words they hear, answer short questions, build sentences from jumbled-up word combinations, retell stories, and answer open-ended questions. According to the test description (Pearson Education, 2018a, 2018b), the tests place a great deal of emphasis on automaticity with the language. In particular, the demands for automaticity are shown in tasks such as “saying the opposite of a word you hear” and “building sentences from jumbled-up word combinations you hear.” For these tasks, test-takers are required to recognize words or word combinations they hear, quickly access and retrieve lexical items or build phrases and clause structures, and articulate them under extreme time pressure.

The responses from test-takers are scored automatically by means of a speech recognition and parser program. The score report (Pearson Education, 2019–2020) provides an overall proficiency score and four subscores (fluency, pronunciation, sentence mastery, vocabulary), each on a scale from 20 to 80. The overall score of the test represents the ability to understand the spoken language and “speak it intelligibly at a native-like conversational pace on everyday topics” (Pearson Education, 2018b, p. 11), and it is calculated based on a weighted combination of the four diagnostic subscores (30% Sentence Mastery, 20% Vocabulary, 30% Fluency, and 20% Pronunciation). Among the four subcomponents of speaking, sentence mastery measures “the ability to understand, recall, and produce phrases and clauses in complete sentences”; vocabulary “reflects the ability to understand common everyday words spoken in sentence context and to produce such words as needed”; fluency is measured from “the rhythm, phrasing and timing evident in constructing, reading and repeating sentences”; and pronunciation assesses “the ability to produce consonants, vowels, and stress in a native-like manner in sentence context” (Pearson Education, 2018b, pp. 11–12).

Based on the Test Description and Validation Summary (Pearson Education, 2018a, 2018b), the split-half reliability coefficients of the Spanish test and the French test were both 0.97, indicating that both tests are highly reliable. The split-half reliability coefficients for the Spanish subscores ranged from 0.91 to 0.95, and those for the French subscores were 0.77 for vocabulary, 0.89 for sentence mastery, 0.93 for fluency, and 0.95 for pronunciation. Furthermore, the Versant Spanish and French scores correlate with CEFR estimates at 0.90 and 0.88, respectively. The overall score and the subscores are mapped to the CEFR scales as shown in Table 5, with detailed oral interaction descriptors in Appendix B (Pearson Education, 2018a, 2018b).

Table 5. Mapping of Versant Spanish and French Test Scores to CEFR Levels
Versant test score    CEFR level
79-80                 C2
69-78                 C1
58-68                 B2
47-57                 B1
36-46                 A2
26-35                 A1
20-25                 <A1

The Versant test takes 15-17 minutes to complete. To strengthen the validity of our findings, we used the remote monitoring feature provided by HirePro. It video-records participants as they take the test to flag suspicious behavior (e.g., a second person entering the camera view) and monitors browser use (to see if they navigate away from the test). In the data we report in our analysis, we excluded all scores from participants who were marked “suspicious” by the system (see Table 6 for the number of suspicious scores).

4 Procedures
We sent an email soliciting participation in the research study to a random sample of Duolingo learners when they completed Unit 5 in the Spanish or French course, if they met the following criteria: prior proficiency of 0-2 in the language and an IP address in a country where Spanish or French is not an official or widely spoken language. Learners aged 18 and above who were interested in participating completed a background survey, which we used to verify eligibility and collect additional demographic information. Learners who responded that they had taken classes or used other programs/apps to learn the language during the time they used Duolingo were excluded from participation.

Qualified participants were emailed on a rolling basis and invited to take the Versant Spanish or French Test for free. Participants completed the test within two weeks. Each participant received $20 and their score report after taking the test. Table 6 shows the data collection funnel. This funnel is noteworthy in several respects. First, only about 40% of the learners who were eligible for the test attempted to take the test. This large drop in participation rate was mostly due to lack of appropriate equipment. The incorporation of the remote monitoring system imposed restrictions and high system requirements; for example,
it only allowed the test to be taken with Version 80 or higher of the Google Chrome browser on a computer with a stable internet connection and high quality video and audio equipment. Most Duolingo learners use their mobile phones to learn and communicate with Duolingo and they might not have access to all the required equipment. Second, 18 learners in French and 20 in Spanish started the test but did not complete it. Third, 43 participants (about 28%) in French did not receive a score after they completed the test. According to a Versant representative, “this seemed to suggest that some candidates are either not speaking clearly in French or are taking the test in an improper environment (background noise, faulty mic, etc.)” (M. Kumar, personal communication, May 18, 2021). However, given that the same was not the case with our participants in Spanish, the improper environment explanation seems less likely; instead, the pronunciation of the participants in French was probably insufficiently clear for the Versant French speech recognition program. Finally, the remote monitoring system flagged suspicious behavior from 10 participants in French and 19 participants in Spanish. These suspicious scores were excluded from the following analyses.

5 Results
To answer the first research question (what levels of speaking proficiency did Duolingo learners achieve upon completing the beginning-level course content for Spanish or French?), we report the means and standard deviations of the overall scores on the Versant test (see Table 7). According to the guidelines for mapping Versant scores to CEFR levels (see Table 5), a score range of 36-46 indicates the CEFR level of A2. Therefore, for Duolingo Spanish learners, an average of 40.97 indicates solid A2 speaking abilities. For Duolingo French learners, an average of 36.72 indicates a low A2.

In addition to learners’ average scores, we also present the distribution of scores in Figure 3. For Spanish, 66.03% of learners scored at A2 or above; for French, 52.94% of learners scored at A2 or above.

To answer our second research question concerning the extent to which the Duolingo Spanish and French courses prepare learners in the sub-skills of speaking, including sentence mastery, vocabulary, fluency, and pronunciation, we report the means and standard deviations of the subscores in Table 8 and their distributions in Figure 4.

The subscores provide important diagnostic feedback regarding Duolingo courses. First, there were dramatic differences in pronunciation scores across the Spanish and French learners: the Spanish pronunciation score was the highest of all subscores, while the French pronunciation score was the weakest of all subscores in both courses and lowered the overall French score. The French pronunciation score (30.37) fell below the A2 threshold of 36 and indicates that more and improved pronunciation instruction is needed to meet the goal of teaching A2-level pronunciation skills by the end of Unit 5 in the Duolingo French course. The fact that 43 completed tests (28%) could not be scored by the speech recognizer (see Table 6) could be additional evidence that the French participants struggled with pronunciation. Although the Versant representative could not confirm the exact reason why these tests were not scored, it is unlikely this was due to recording quality since there were no similar problems with Spanish tests. If the unscored tests were indeed due to low participant intelligibility caused by poor French pronunciation, including all these participants in our sample could have further lowered the already low French pronunciation scores.

On average, participants in both Spanish and French Duolingo courses demonstrated A2 speaking abilities in the sub-skills of sentence mastery and fluency. Duolingo courses focus on sentence-level language throughout all lessons and levels, so the participants had a considerable amount of practice building sentences in their Duolingo exercises. Participants in both courses achieved, on average, solid A2 scores in understanding, recalling, and producing phrases and clauses in complete sentences, as measured by the sentence mastery subscore. They also achieved, on average, A2 level for the fluency subscore, which measured their ability to produce rhythmic language and appropriate phrasing in constructing, reading, and repeating sentences.

The subscores also showed that the vocabulary score was the weakest in Spanish and the second weakest in French. The lower vocabulary scores might have been related to the specific test tasks in the Versant Spanish and French Tests. For example, one of the tasks that assesses vocabulary knowledge asks the test-takers to say the opposites of the words they hear within a few seconds. This task requires strong automaticity in lexical access and retrieval (Pearson Education, 2018a, 2018b), which exerts high time pressure on the test-takers. A less time-sensitive measure of vocabulary knowledge (i.e., one that relies less on automatic production) would have likely yielded higher scores in this domain. Duolingo courses, however, may be more facilitative in developing learners’ receptive vocabulary knowledge. Having more activities in the courses that require lexical retrieval in productive tasks would likely be beneficial for Duolingo learners, especially in timed vocabulary tasks such as the one used in the Versant Spanish and French Tests.

Table 7. Means and Standard Deviations of Overall Versant Scores of Duolingo Learners
Figure 3. Distribution of Versant test scores of Duolingo learners based on CEFR levels.
Table 8. Means (and Standard Deviations) of Versant Test Subscores of Duolingo Learners
Figure 4. Distribution of Versant Test subscores of Duolingo learners shown with density plots. Dashed line represents median and dotted lines represent interquartile range.

6 Discussion
This study evaluated the speaking proficiency of Duolingo learners who had completed the beginning-level course content in the Spanish and French courses. The results of the study showed that, on average, the participants in the Spanish course achieved solid A2 speaking abilities and those in French achieved a somewhat weaker A2 level. Specifically, about two-thirds of the participants (66%) in Spanish and more than half
of the participants (53%) in French achieved the level of A2 or above in speaking. With extensive opportunities in the Duolingo courses to hear the target language and practice sentence-level speaking with feedback from the speech recognition program, participants developed speaking skills and reached the proficiency level targeted by the CEFR-based curriculum standards.

The subscores of the speaking tests were mostly in line with the overall scores, but three observations are noteworthy. First, the subscores of the speaking tests indicated a strong contrast in the pronunciation skills of the participants in Spanish and French. Among the subscores, pronunciation scored the highest in Spanish but the lowest in French. This is not entirely surprising given that French pronunciation is known to be difficult for English speakers to learn, requiring pedagogical attention over a long time (Huensch, 2019; Sturm, 2019). Second, the subscores on sentence mastery and fluency indicated that Duolingo learners developed these sub-skills as expected, and they were able to understand, recall, and produce complete sentences, and articulate them with good rhythm and appropriate phrasing. The third observation is that the vocabulary subscores were relatively low for both Spanish and French learners. Previous research has consistently shown that vocabulary learning is one of the strengths of mobile-based language learning (Loewen et al., 2020; Lord, 2015, 2016; Vesselinov & Grego, 2016). However, this study demonstrates that Duolingo’s emphasis on receptive vocabulary knowledge may not transfer directly to productive knowledge, especially when automaticity is the goal of the assessment. At the same time, these results are not necessarily an indication of a lack of vocabulary knowledge among beginner Duolingo users. Rather, the relatively low vocabulary subscore may be seen in part as an artifact of Versant’s vocabulary measure, which requires a high level of automaticity in speech production.

The results for learners in the French course, however, need to be interpreted with caution. As mentioned earlier, about 28% of the test-takers in French did not receive a score after they completed the test, but that was not the case for Spanish participants. Although we cannot pinpoint the exact reason, it is likely that the French participants did not speak clearly enough for the scoring system to capture meaningful production in French. If that was the case, their inclusion would have lowered the overall average score for French learners. We consider this an important limitation of our findings.

The speaking assessment provided important diagnostic information to help us understand the strengths and weaknesses of Duolingo courses in teaching various components of learners’ speaking ability. One pedagogical implication of the findings is the need to enhance the intelligibility of Duolingo learners by teaching pronunciation more effectively in French (Hirschi, 2020). Another pedagogical implication of the findings is that in addition to teaching receptive vocabulary knowledge, Duolingo courses would benefit from more activities that would facilitate the development of productive vocabulary knowledge.

7 Conclusion
The results of the speaking assessment demonstrated that most beginning-level Duolingo learners have achieved the expected proficiency outcomes and curriculum objectives for speaking skills. Specifically, the test subscores indicated that Duolingo learners have speaking abilities in line with the standards for the four speaking sub-skills, with the exception of French learners’ pronunciation. Together with findings from a previous study (Jiang et al., 2020), which evaluated the listening and reading proficiency of Duolingo learners, this study adds to the accumulating body of evidence for the efficacy of the beginning-level course content in the Duolingo Spanish and French courses.
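
The weighting of the overall Versant score described in Section 3.2 (30% sentence mastery, 20% vocabulary, 30% fluency, 20% pronunciation) can be illustrated with a minimal Python sketch. The function name `overall_score`, the dictionary keys, and the example subscores are hypothetical and are not data from this study; Pearson's actual scoring procedure, including any rounding, may differ.

```python
# Weights as reported in Section 3.2 (Pearson Education, 2018b).
WEIGHTS = {
    "sentence_mastery": 0.30,
    "vocabulary": 0.20,
    "fluency": 0.30,
    "pronunciation": 0.20,
}


def overall_score(subscores):
    """Return the weighted combination of the four subscores.

    Assumes each subscore is on the 20-80 scale and that the overall
    score is a plain weighted sum (an assumption; no rounding applied).
    """
    missing = set(WEIGHTS) - set(subscores)
    if missing:
        raise ValueError(f"missing subscores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 20 <= subscores[name] <= 80:
            raise ValueError(f"{name} must be between 20 and 80")
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


# Hypothetical learner profile, NOT taken from the study's data.
example = {
    "sentence_mastery": 40,
    "vocabulary": 35,
    "fluency": 42,
    "pronunciation": 30,
}
print(round(overall_score(example), 2))
```

Because sentence mastery and fluency together carry 60% of the weight, a low pronunciation subscore lowers the overall score but cannot by itself determine it, which is consistent with the French results discussed above.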
A Appendix
Table 9. Spanish- and French-Speaking Countries or Regions. Duolingo learners whose IP addresses were in those countries or regions
(Spanish-speaking for learners of Spanish, and French-speaking for learners of French) were considered ineligible to participate in this study.
B Appendix
Table 10. Relation of Scores of Versant Spanish and French Tests to Oral Interaction Descriptors Based on Council of Europe (2001) Framework
(as cited in Pearson Education, 2018a, 2018b)
Versant Spanish or French Test Score    CEFR level    Oral Interaction Descriptors Based on Council of Europe (2001)

79-80    C2    Conveys finer shades of meaning precisely and naturally. Can express him/herself spontaneously at length with a natural colloquial flow. Consistent grammatical and phonological control of a wide range of complex language, including appropriate use of connectors and other cohesive devices.

69-78    C1    Shows fluent, spontaneous expression in clear, well-structured speech. Can express him/herself fluently and spontaneously, almost effortlessly, with a smooth flow of language. Clear, natural pronunciation. Can vary intonation and stress for emphasis. High degree of accuracy; errors are rare. Controlled use of connectors and cohesive devices.

58-68    B2    Relates information and points of view clearly and without noticeable strain. Can produce stretches of language with a fairly even tempo; few noticeably long pauses. Clear pronunciation and intonation. Does not make errors that cause misunderstanding. Clear, coherent, linked discourse, though there may be some “jumpiness.”

47-57    B1    Relates comprehensibly main points he/she wants to make on familiar matters. Can keep going comprehensibly, even though pausing for grammatical and lexical planning and repair may be very evident. Pronunciation is intelligible even if a foreign accent is sometimes evident and occasional mispronunciations occur. Reasonably accurate use of main repertoire associated with more predictable situations. Can link discrete, simple elements into a connected sequence.

36-46    A2    Relates basic information on, e.g., work, background, family, free time, etc. Can make him/herself understood in very short utterances, even though pauses, false starts, and reformulation are very evident. Pronunciation is generally clear enough to be understood despite a noticeable foreign accent. Uses some simple structures correctly, but still systematically makes basic mistakes. Can link groups of words with simple connectors like “and,” “but,” and “because.”

26-35    A1    Makes simple statements on personal details and very familiar topics. Can manage very short, isolated, mainly prepackaged utterances. Much pausing to search for expressions to articulate less familiar words. Pronunciation is very foreign.

20-25    <A1    Candidate performs below level defined as A1.
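
The score-to-level mapping in Table 5 and Table 10 can be expressed as a small lookup, sketched below in Python. The names `CEFR_BANDS` and `cefr_level` are hypothetical, and treating each band's lower bound as a threshold for non-integer values such as group means is an assumption; it matches how the Results section reads the means (for example, 30.37 falling below the A2 threshold of 36).

```python
# Lower bound of each band from Table 5, highest band first.
CEFR_BANDS = [
    (79, "C2"),
    (69, "C1"),
    (58, "B2"),
    (47, "B1"),
    (36, "A2"),
    (26, "A1"),
    (20, "<A1"),
]


def cefr_level(score):
    """Map a Versant score (20-80) to a CEFR level.

    Assumption: a score belongs to the highest band whose lower bound
    it meets or exceeds, so fractional means are handled as well.
    """
    if not 20 <= score <= 80:
        raise ValueError("Versant scores range from 20 to 80")
    for lower_bound, level in CEFR_BANDS:
        if score >= lower_bound:
            return level


# Mean scores reported in the Results section.
print(cefr_level(40.97))  # Spanish overall mean -> A2
print(cefr_level(36.72))  # French overall mean -> A2
print(cefr_level(30.37))  # French pronunciation mean -> A1
```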