Introduction

Word frequency is one of the most studied variables in the visual word recognition and reading literatures; as frequency increases, processing time decreases. Of particular interest here are the effects of word frequency on eye movements in reading. Frequency affects both early (first-pass) and later eye movement measures (e.g. Inhoff & Rayner, 1986; Rayner & Duffy, 1986; Williams & Morris, 2004; see Plummer, Perea, & Rayner, 2014 for recent discussion and review).

Recently, striking evidence has emerged that effects previously attributed to token frequency might be better accounted for by metrics that take into account the range of contexts in which a word occurs. For example, Adelman and colleagues (2006) operationalized a measure, contextual diversity (CD), as the proportion of texts in a corpus in which a word occurs. CD is typically correlated with frequency; high frequency words typically occur in a wide variety of contexts. However, when Adelman et al. (2006) used databases to manipulate frequency while controlling for CD, and vice versa, CD accounted for nearly all of the variance in naming and lexical decision times. The same pattern is observed with young developing readers (Hills, Maouene, Riordan, & Smith, 2010; Perea, Soares, & Comesaña, 2013) and has been replicated with related measures (Adelman & Brown, 2008; Hoffman, Lambon Ralph, & Rogers, 2013; Hoffman, Rogers, & Lambon Ralph, 2011; Johns, Gruenenfelder, Pisoni, & Jones, 2012; Jones, Johns, & Recchia, 2012; Steyvers & Malmberg, 2003; Yap, Tan, Pexman, & Hargreaves, 2011).

CD and related measures capture an important aspect of the relationship between contexts and the likelihood of encountering a word form as well as a given sense of a word. If frequency effects can be reduced to CD effects, it would have important consequences for how we conceptualize and manipulate context in studies of visual and spoken word recognition. Typically, context is manipulated locally, i.e., within a sentence or two. However, an equally important aspect of context might be a more general notion, such as topic, which might affect the likelihood of occurrence of otherwise unrelated words. For example, “balk” occurs relatively infrequently but it is much more likely to occur in a description of a baseball game. Moreover, the relative frequency of syntactic structures that a word occurs in varies with word sense, which can be strongly influenced by context. In a baseball context, the types of syntactic structures that the verb “walked” appears in, and their relative frequencies, differ from those in non-baseball contexts. We return to these issues in the general discussion.

Most studies manipulating CD and frequency measure responses to words in isolation. An exception is Plummer et al. (2014), who monitored eye movements in sentences, which more closely approximates normal reading. Reading times were longer for words with lower CD than for words with higher CD. When CD was controlled, there were no reliable effects of frequency.

There are two important limitations of Plummer et al.’s experiment. First, Plummer et al. created separate sentences for words from three groups: (a) Control words; (b) Lower frequency (LWF) words with the same CD as the Control words; and (c) Higher CD (HCD) words matched in frequency with the Control words. It is challenging to construct plausible sentences in which target words from different conditions are rotated through the same sentence frames, which likely influenced Plummer et al.’s design. However, a replication with a rotated design would make it less likely that either the effects of CD or the absence of frequency effects could be attributed to uncontrolled differences in sentence frames. Experiment 1 uses a design in which yoked triples of LWF, Control and HCD words are rotated through the same sentence frame.

Second, few low frequency words have high CD values, thus restricting the range of frequencies that can be compared. Moreover, in Plummer et al.’s results, and in the results we report, there are small but consistent numerical differences between the Control words and the LWF words that would be consistent with a small frequency effect. In Plummer et al., for example, six of the seven measures show this pattern. In Experiment 2 we manipulated the size of the frequency differences between the Control and the LWF words. Reading times increase as a linear function of log frequency. Therefore, any differences due to frequency should increase as the log difference in frequency between the conditions increases.

The current studies used Mandarin Chinese, which has a logographic orthography, with frequency and CD computed from a corpus of words and characters used in films (Cai & Brysbaert, 2010). Word frequency has similar effects on eye movements in Chinese and in English (e.g., Li & Pollatsek, 2011; Liversedge, Zang, Zhang, Bai, Yan, & Drieghe, 2014), which makes it reasonable to compare results across the two languages.

Experiment 1 – Frequency and contextual diversity: Words rotated through sentence frames

In all of the experiments reported here, we compared three groups of words: a Control group with higher frequency and lower CD; a lower frequency (LWF) group with a CD similar to that of the Control group but lower frequency; and a higher-CD (HCD) group with a frequency similar to that of the Control group but higher CD. A higher CD, lower frequency condition was not included because we could not identify sufficient low frequency words with higher CD. The contrasts of primary interest are the LWF group with the Control group, which would reveal any effects of frequency with CD controlled, and the HCD group with the Control group, which would reveal any effects of CD with frequency controlled.

Method

Participants

Participants were 30 native Mandarin Chinese speakers with normal or corrected-to-normal vision. Participants gave informed consent and were paid 20 RMB.

Materials

Target words for all experiments were chosen from SUBTLEX-CH-WF (Cai & Brysbaert, 2010), which provides values of word frequency based on the log10 transformed number of occurrences from 33 million words and CD based on the log10 transformed number of films in which the word appears in a 6,243 film corpus. We chose this corpus because Cai and Brysbaert (2010) found that frequencies based on this database explain more of the variance in word and character reading than frequencies based on written texts, namely the Language Corpus System of Modern Chinese, the Center for Chinese Linguistics Character Frequency, and the Lancaster Corpus of Modern Chinese.
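To make the distinction between the two measures concrete, the following is a minimal sketch that computes log10 token frequency and log10 CD (number of films containing a word) from a toy corpus. The corpus, the function name `frequency_and_cd`, and the pre-tokenized input are assumptions made for illustration; SUBTLEX-CH was built from segmented Chinese subtitles and may involve further processing not shown here.

```python
import math
from collections import Counter

def frequency_and_cd(films):
    """Return {word: (log10 token count, log10 number of films containing word)}.

    `films` is a list of token lists, one per film (hypothetical toy input).
    """
    token_counts = Counter()
    film_counts = Counter()
    for tokens in films:
        token_counts.update(tokens)      # every occurrence counts toward frequency
        film_counts.update(set(tokens))  # each film counts at most once toward CD
    return {
        word: (math.log10(token_counts[word]), math.log10(film_counts[word]))
        for word in token_counts
    }

# Toy corpus: "balk" is concentrated in one film; "walk" is spread over four.
films = [
    ["balk", "balk", "balk", "balk", "walk"],
    ["walk", "home"],
    ["walk", "home"],
    ["walk"],
]
measures = frequency_and_cd(films)
```

In this toy corpus the two words have the same token frequency (four occurrences each) but differ in CD, which is the kind of dissociation the LWF, Control and HCD groups exploit.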

Twenty-seven word triples were selected. LWF words have a similar CD to the Control group (t (52) = 1.62, p = 0.11) with a lower word frequency, t (52) = 12.67, p < 0.001. HCD words have similar frequencies to the Control group (t (52) = -0.79, p = 0.43) but with a higher CD value, t (52) = -13.26, p < 0.001. Target words are two-character compound nouns matched in number of strokes (ts < 1.34, ps > 0.18), radicals (ts < 1.24, ps > 0.22), orthographic neighborhood size (ts < 0.64, ps > 0.53), and semantic diversity (ts < 0.77, ps > 0.44). The values for each measure in each condition are presented in Appendix Table 8.
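The reported 52 degrees of freedom are what a pooled-variance two-sample t-test yields with 27 words per group (27 + 27 - 2). A minimal sketch of that statistic follows, assuming this is the test that was used; the function name `pooled_t` and the example values are hypothetical and are not the study's data.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Student's two-sample t statistic with pooled variance, plus its df."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df

# Hypothetical log-frequency values, not taken from the experiment.
t_small, df_small = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])

# With 27 items per group the degrees of freedom match those in the text.
_, df_design = pooled_t(list(range(27)), list(range(1, 28)))
```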

We created 27 sentence frames and rotated target words from each triple through the same frame across three lists (see examples 1a–1c). Each list contained nine sentences from each condition, with one version of each sentence.

  1a. 爷爷收藏的陨石是他挚爱的珍宝。
      The meteorite my grandpa collected is his treasure.

  1b. 爷爷收藏的牛角是他挚爱的珍宝。
      The ox horn my grandpa collected is his treasure.

  1c. 爷爷收藏的古董是他挚爱的珍宝。
      The antique my grandpa collected is his treasure.
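The rotation of triples through frames described above amounts to a Latin-square assignment. The sketch below shows one way such lists could be built; the function name `build_lists` and the particular rotation rule are assumptions, and the actual assignment of triples to lists in the experiment is not specified in the text.

```python
from collections import Counter

CONDITIONS = ["LWF", "Control", "HCD"]

def build_lists(n_frames=27):
    """Assign each sentence frame one condition per list, rotating across lists."""
    lists = []
    for list_idx in range(len(CONDITIONS)):
        assignment = {}
        for frame in range(n_frames):
            # Shift the condition by one position for each successive list.
            assignment[frame] = CONDITIONS[(frame + list_idx) % len(CONDITIONS)]
        lists.append(assignment)
    return lists

lists = build_lists()
per_list_counts = [Counter(assignment.values()) for assignment in lists]
```

Under this scheme each list contains nine frames per condition, and every frame appears in all three conditions across the three lists, so no participant sees the same frame twice.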

Control norms

For the three studies we report, separate groups of 20–25 participants rated the plausibility and the difficulty of the experimental sentences on a 7-point scale. Word predictability was assessed using a cloze completion task in which the target word was replaced by a blank. For Experiment 1, there were no significant differences between conditions (ts < 0.82, ps > 0.41) in difficulty or plausibility. The target word appeared in 2.22 % of completions, with no significant difference across conditions (ts < 0.22, ps > 0.83).

Apparatus

Eye movements were monitored using a SensoMotoric Instruments (Teltow/Berlin, Germany) iView Hi-Speed system, sampling at 1,250 Hz (tracking resolution < 0.01°). Viewing was binocular but data were collected only from the right eye. Sentences were presented on a 17-in. CRT monitor. Each character subtended 1.05° of visual angle at a viewing distance of 70 cm. Stimulus presentation and response collection used the E-Prime software package (Psychology Software Tools, Pittsburgh, PA, USA).

Procedure

Trials began with presentation of a fixation point (+) left-aligned with the first character. The order of the 27 experimental sentences and 27 filler sentences was randomized for each participant. Participants were instructed to read each sentence silently at their normal rate and then answer a yes/no question. The experiment began with eight practice sentences.

Results and discussion

All participants responded correctly to at least 90 % of the questions. Reading times for critical words were analyzed as target regions. We excluded 5.42 % of trials because of track losses, blinks, and fixations that were shorter than 80 ms, longer than 800 ms, or more than three standard deviations above or below the mean fixation time (Chen & Huang, 2012). Means for each dependent measure are presented in Table 1.
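The duration-based part of these criteria can be sketched as a simple filter. This is an illustration under stated assumptions: the function name `keep_fixations` is hypothetical, the absolute cutoffs are applied before the standard-deviation cutoff, and the mean and SD are computed over whatever set of durations is passed in. The text does not say whether the SD criterion was applied per participant, per condition, or overall, nor how track losses and blinks were detected.

```python
from statistics import mean, stdev

def keep_fixations(durations_ms, low=80, high=800, n_sd=3):
    """Drop durations outside [low, high] ms, then those beyond n_sd SDs of the mean."""
    in_range = [d for d in durations_ms if low <= d <= high]
    if len(in_range) < 2:
        return in_range  # an SD cannot be estimated from fewer than two values
    m, sd = mean(in_range), stdev(in_range)
    return [d for d in in_range if abs(d - m) <= n_sd * sd]

# Hypothetical durations: 60 and 900 fail the absolute cutoffs,
# and 790 survives them but lies more than 3 SDs above the mean.
raw = [60, 900] + [200] * 10 + [220] * 10 + [790]
kept = keep_fixations(raw)
```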

Table 1 Eye movement measures on target words in each group

We computed three “first-pass” measures considered to primarily reflect early processes and three measures considered to reflect later processes (Rayner, 1998, 2009). First-pass measures were first-fixation duration, gaze duration, and skipping rate. Later measures were go-past time, regression rate, and total fixation time. Planned comparisons used linear mixed-effects models for fixation durations and mixed logit models for skipping using the lme4 package (Bates, Maechler, & Bolker, 2012) in R (R Development Core Team, 2014). The regression model included fixed effects (e.g., log-transformed word frequency and log-transformed CD) and the maximal random effects structure with by-participants and by-items random intercepts and slopes (Barr, Levy, Scheepers, & Tily, 2013; Jaeger, 2008). We report t-values for linear mixed-effects models, z-values for mixed logit models, and corresponding p-values (see Table 2). For the linear mixed-effects models, p-values were estimated with the lmerTest package using the Satterthwaite approximation for degrees of freedom (Kuznetsova, Christensen, & Brockhoff, 2014).
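For readers less familiar with these measures, the sketch below derives the fixation-time measures for one trial from an ordered fixation sequence, following the standard definitions in the eye-movement literature. The function name `target_measures` and the representation of a trial as (word index, duration) pairs are assumptions for illustration. Regression rate is omitted because the text does not specify how regressions were scored.

```python
def target_measures(fixations, target):
    """Compute reading measures for one target word in one trial.

    `fixations` is a list of (word_index, duration_ms) pairs in temporal order
    (a hypothetical representation); `target` is the index of the target word.
    """
    total = sum(d for w, d in fixations if w == target)

    # Locate the first first-pass fixation on the target, if any.
    first_idx = None
    for i, (w, _) in enumerate(fixations):
        if w > target:
            break  # the eyes passed the target before fixating it: skipped
        if w == target:
            first_idx = i
            break
    if first_idx is None:
        return {"skipped": True, "first_fixation": None, "gaze": None,
                "go_past": None, "total": total}

    # Gaze duration: consecutive fixations on the target before leaving it.
    gaze = 0
    i = first_idx
    while i < len(fixations) and fixations[i][0] == target:
        gaze += fixations[i][1]
        i += 1

    # Go-past time: all fixations from first entering the target until the
    # eyes move to its right, including fixations on earlier words.
    go_past = 0
    i = first_idx
    while i < len(fixations) and fixations[i][0] <= target:
        go_past += fixations[i][1]
        i += 1

    return {"skipped": False, "first_fixation": fixations[first_idx][1],
            "gaze": gaze, "go_past": go_past, "total": total}

# Hypothetical trial: two fixations on word 2, a regression to word 1,
# a refixation of word 2, then a move on to word 3.
trial = [(0, 200), (1, 250), (2, 220), (2, 180), (1, 150), (2, 210), (3, 230)]
result = target_measures(trial, target=2)
skipped_result = target_measures([(0, 200), (1, 240), (3, 260), (2, 190)], target=2)
```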

Table 2 Regression coefficients and test statistics from linear mixed-effects and logistic mixed-effects models for eye movement measures on the target

Effects of word frequency (control group vs. LWF group)

Fixations to low and high frequency words did not differ in either the first-pass measures, all ts < 1, ps > 0.33, βs < 2.20, or measures reflecting later processing, all ts < 1.16, ps > 0.25, βs < 25.50. Differences in skipping rate were not reliable, z < 0.10, p > 0.90.

Effects of contextual diversity (control group vs. HCD group)

For first-pass measures, HCD target words had shorter first-fixation durations (β = -22.74) and gaze durations (β = -46.31) than words with lower CD, ts > -2.80, ps < 0.01. The effect of CD in skipping rate was not significant, z < 1, p > 0.33.

For later measures, go-past time (β = -73.49) and total fixation time (β = -58.23) were shorter for words with higher CD than for words with lower CD, ts > -3.01, ps < 0.007. The regression rates did not differ significantly (β = -0.02).

In sum, CD affected eye-movement measures reflecting both early and later processes with no effects of word frequency when CD was controlled. Because we embedded the words from the different conditions in the same sentence frames, the results cannot be plausibly attributed to differences in sentence frames.

Finally, we note that the first-fixation durations are relatively long. They are, however, within the range of frequently occurring fixations for reading Mandarin (see Li, Bicknell, Liu, Wei, & Rayner, 2014 for a distribution of fixation times). We return to this issue when we present the results for Experiment 2a, which has shorter fixation durations, and Experiment 2b, where the fixation durations are similar to those in this experiment.

Experiment 2 – Comparison between studies with different-sized frequency ranges

Differences between the LWF and the Control condition did not approach significance in Experiment 1. However, the direction of the effects for four of the six measures is consistent with small increases in processing difficulty for the LWF words. For example, the (mean) Total Reading Time for the LWF words and the Control words is 542 ms and 517 ms, respectively. There was a similar pattern in Plummer et al. (2014) for six of seven measures. In Experiment 2 we compare results between two comparably powered experiments (Experiments 2a and 2b) in which the frequency difference between the LWF words and the Control words differs in size. If there are underlying frequency effects, differences between the LWF words and the Control words should be larger in Experiment 2b.

In Experiment 2a mean log frequencies for the LWF and Control words are 2.36 and 2.77, respectively. In Experiment 2b, the log frequency differences are substantially larger: 1.30 and 2.36, respectively. Each word was placed in its own sentence frame, allowing us to use a larger set of items than in Experiment 1.
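The size of the two frequency manipulations can be made explicit with a short calculation. The sketch below uses only the mean log frequencies reported above; the function name `log_gap` is hypothetical, and the scaling factor assumes, as the text does, that reading time is a linear function of log frequency.

```python
def log_gap(control_log_freq, lwf_log_freq):
    """Return (difference in log10 units, implied ratio of raw frequencies)."""
    diff = control_log_freq - lwf_log_freq
    return diff, 10 ** diff

gap_2a, ratio_2a = log_gap(2.77, 2.36)  # Experiment 2a: Control vs. LWF
gap_2b, ratio_2b = log_gap(2.36, 1.30)  # Experiment 2b: Control vs. LWF

# Under a log-linear account, any frequency effect should scale with the gap.
scaling = gap_2b / gap_2a
```

The gap is 0.41 log units in Experiment 2a and 1.06 in Experiment 2b, so under the log-linear assumption a frequency effect in Experiment 2b would be expected to be roughly 2.6 times as large as in Experiment 2a.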

Experiment 2

Method

Experiments 2a and 2b used the same apparatus and procedure as Experiment 1.

Participants

Participants were 90 native Mandarin Chinese speakers with normal or corrected-to-normal vision, 45 each for Experiment 2a and 2b. Participants gave informed consent and were paid 20 RMB.

Materials

For each experiment, we selected 48 words with 16 words for each condition (see Table 3). Words in the LWF group have a similar CD to the control condition (ts < 1.60, ps > 0.12) but with lower word frequencies (ts > 8.20, ps < 0.001). HCD words have similar frequencies to the control condition (ts < -1.60, ps > 0.10) but with a higher CD value (ts > 10.60, ps < 0.001). All were two-character compound nouns and matched in number of strokes (ts < 0.63, ps > 0.39), radicals (ts < 0.30, ps > 0.53), orthographic neighborhood size (ts < 0.94, ps > 0.35), and semantic diversity (ts < 0.87, ps > 0.39). Words immediately prior to and after the target word were matched in frequency, CD, and strokes (ts < 1.49, ps > 0.15) across conditions. Values for each condition are presented in Appendix Table 8.

Table 3 Experimental conditions and exemplar sentences

Norming studies were used to match plausibility and perceived difficulty across conditions (ts < 1.28, ps > 0.20). Cloze probability was 2.60 % in Experiment 2a and 1.77 % in Experiment 2b, with no significant differences across conditions (ts < 1.12, ps > 0.27).

Results

We used the same exclusion criteria as in Experiment 1. For Experiment 2a, 7.83 % of trials were excluded, and 6.77 % for Experiment 2b. Tables 4 and 6 display means for each eye movement measure in each condition for Experiments 2a and 2b, respectively. As in Experiment 1, linear mixed-effects models and mixed logit models were used to implement planned comparisons. Model results for Experiments 2a and 2b are presented in Tables 5 and 7, respectively.

Table 4 Eye movement measures on target words in Experiment 2a
Table 5 Regression coefficients and test statistics from linear mixed-effects and logistic mixed-effects models for eye movement measures on the target in Experiment 2a
Table 6 Eye movement measures on target words in Experiment 2b
Table 7 Regression coefficients and test statistics from linear mixed-effects and logistic mixed-effects models for eye movement measures on the target in Experiment 2b

Effects of word frequency (control group vs. LWF group)

There were no reliable differences in first-pass measures and measures reflecting later processing in either experiment: For first pass measures in Experiment 2a, all ts < 1, ps > 0.30, βs < 10.10 (for skipping rate, z = -1.54, p = 0.123, β = -0.53); for Experiment 2b, all ts < 1, ps > 0.70, βs < 1.60 (for skipping rate, z = -1.57, p = 0.117, β = -0.63). For measures reflecting later processing in Experiment 2a, all ts < 1.36, ps > 0.18, βs < 17.50; for Experiment 2b, all ts < 1.27, ps > 0.21, βs < 9.60.

Effects of contextual diversity (control group vs. HCD group)

The data patterns were similar for Experiments 2a and 2b. For first-pass measures in Experiment 2a, HCD target words had shorter first-fixation durations (β = -15.44) and gaze durations (β = -29.79) than words with lower CD, ts > -2.53, ps < 0.019, with no effect of CD in skipping rate, z = 1.50, p = 0.134, β = 0.46. In Experiment 2b, HCD target words had shorter first-fixation durations (β = -19.85) and gaze durations (β = -44.26) than words with lower CD, ts > -3.48, ps < 0.01. There was no main effect of CD in skipping rate, z = 1.28, p = 0.199, β = 0.41.

For the later measures in Experiment 2a, go-past time (β = -68.28) and total fixation time (β = -70.79) were shorter for HCD words than for Control words, ts > -2.96, ps < 0.01. The regression rate was numerically, but not significantly, lower for HCD words (β = -0.04). For the later measures in Experiment 2b, go-past time (β = -90.34) and total fixation time (β = -95.03) were shorter for HCD words, ts > -4.27, ps < 0.01. The regression rate was again numerically, but not significantly, lower for HCD words (β = -0.03).

We note that the first-fixation durations varied across the three experiments. Experiment 1 and Experiment 2b have relatively long fixation durations compared to Experiment 2a, and some other experiments in the literature (e.g., Cui, Yan, Bai, Hyönä, Wang, & Liversedge, 2013; Yan, Tian, Bai, & Rayner, 2006). As noted earlier, they are nonetheless within the range of commonly observed fixation durations (Li et al., 2014). In contrast, the first fixations in Experiment 2a are on the shorter end of the distribution. It is difficult to compare across experiments because fixation durations can be variable (McBride-Chang & Chen, 2003). Moreover, the words selected in these experiments are a tightly constrained set, which complicates comparisons with other studies in the literature. That said, in Experiments 1 and 2b, which had longer first fixations, the overall CDs were similar between the two experiments, and they were considerably lower than in Experiment 2a, which had the shorter fixation durations. This is the expected pattern.

Four of the six measures showed small differences that would be consistent with a frequency effect. When combined with Plummer et al., 19 of 25 measures showed small differences in the same direction (P (19) = 0.005 by binomial test, though that p-value almost certainly overstates the evidence because the binomial test assumes that the measures are uncorrelated, which is unlikely to be the case). Crucially, however, whereas log frequency differences are larger for Experiment 2b compared to Experiment 2a, for most measures the differences between the LWF group and the Control group are slightly smaller (see Fig. 1). This is inconsistent with a frequency-based explanation.
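The binomial figure can be checked with a few lines of arithmetic. The sketch below assumes 25 independent measures, each with a 0.5 chance of going in the predicted direction; the function names are hypothetical. The text does not state whether the reported value is a point or a tail probability, so both are computed.

```python
from math import comb

def binom_point(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_upper_tail(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(binom_point(i, n, p) for i in range(k, n + 1))

point = binom_point(19, 25)       # exactly 19 of 25
upper = binom_upper_tail(19, 25)  # 19 or more of 25
```

Under these assumptions the probability of exactly 19 of 25 is about 0.005, and the probability of 19 or more is about 0.007. Either way, the independence assumption is the weak point, as the text notes.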

Fig. 1 Differences in values in the smaller (Exp. 2a) and larger frequency range (Exp. 2b)

General discussion

The current studies make three contributions. First, we replicate Plummer et al.’s finding that frequency does not affect fixation patterns when CD is controlled, using a design in which sentence frames are matched across conditions. Second, we provide novel evidence that the absence of frequency effects is not the result of using a restricted range of frequencies. Third, we extend the evidence for CD effects in silent reading to Mandarin.

As Plummer et al. (2014) note, the absence of a frequency effect is only compatible with models of visual word recognition in which frequency of occurrence is the relevant dimension if we make the implausible assumption that CD measures are better estimates of frequency of occurrence than frequency per se.

A more likely explanation is that CD provides a better measure of the prior that a reader would have for encountering a particular word, a notion that is compatible with Smith and Levy’s (2013) conclusion that predictability is the primary determinant of word-level reading time. When CD is matched, the differences in token frequency arise because the higher frequency words occur more often within specific contexts. That is, once one sees that word, it is more likely to re-occur. In the absence of reliable information that one is in such a context, the priors would be the same. This maps onto the situation common in most studies of word recognition, where either there is little or no context (words in isolation) or words are presented in sentences constructed to test psycholinguistic hypotheses.

The more far-reaching implication is that instantiating a context that increases the likelihood of encountering a word or class of words, e.g., a report about a baseball game, would be an even more powerful predictor of reading time than CD, especially for words with lower CD measures in which the contexts (e.g., type of texts) a word occurs in are either thematically related or share genres. For example, subcategorization frequency and sense-based frequencies of argument structures (e.g., Hare, McRae, & Elman, 2003) are context and genre dependent. Examining the interaction between CD and context-specific effects might then shed light on the organization of semantic memory and on lexical processing in natural contexts. For example, “frequency” effects for words and structures are likely to differ across different genres, including texts written for readers of different ages or readers with different levels of expertise.

Chinese might prove particularly useful for examining the issues because the sense of a word and the sense of a character bear a quasi-systematic relationship. In Chen et al. (submitted) we demonstrate CD effects (with no residual frequency effects) for the first character of a two-character word, with those effects being smaller in magnitude than the word-level CD effects reported in this paper. Comparisons of alphabetic languages with languages like Chinese might also help tease apart word-level context effects from lower-level orthographic/phonotactic effects.