Abstract
Knowledge is an important predictor and outcome of learning and development. Its measurement is challenged by the fact that knowledge can be integrated and homogeneous, or fragmented and heterogeneous, and that these characteristics can change through learning. This is at odds with current standards for test development, which demand a high internal consistency (e.g., Cronbach's Alpha greater than .70). To provide an initial empirical basis for this debate, we conducted a meta-analysis of the Cronbach's Alphas of knowledge tests derived from an available data set. Based on 285 effect sizes from 55 samples, the estimated typical Alpha of domain-specific knowledge tests in publications was α = .85, 90% CI [.82, .87]. Alpha was this high despite a low mean item intercorrelation of .22 because the tests were relatively long on average and because bias in the test construction or publication process led to an underrepresentation of low Alphas. Alpha was higher in tests with more items, in tests with open answers, and in younger samples; it increased after interventions and throughout development, and it was higher for knowledge in languages and mathematics than in science and social sciences/humanities. Generally, Alphas varied strongly between knowledge tests and between populations with different characteristics, reflected in a 90% prediction interval of [.35, .96]. We suggest this range as a guideline for the Alphas that researchers can expect for knowledge tests with 20 items and provide corresponding guidelines for shorter and longer tests. We discuss implications for our understanding of domain-specific knowledge and how fixed cut-off values for the internal consistency of knowledge tests bias research findings.
Many studies in psychological and educational science include measures of knowledge because knowledge is a powerful determinant of learning and an important learning outcome (e.g., Ackerman & Beier, 2006; Koedinger et al., 2012; Siegler & Alibali, 2011). For example, a recent bibliometric analysis identified 13,507 publications on the role of domain-specific prior knowledge in human learning (Bittermann et al., 2023). A meta-analysis on the association of prior knowledge and learning synthesized about 8000 effect sizes from 500 studies longitudinally measuring knowledge at least twice (Simonsmeier et al., 2022). Knowledge is domain specific when it relates to central concepts, principles, or procedures in a domain, such as the concept of force in Physics or how to grow bacteria in Biology (Edelsbrunner et al., 2022). Domain-specific knowledge stored in learners’ long-term memories is a central component of achievement (Watts et al., 2014), a predisposition for reaching expert levels of performance (Meier et al., 2023), and an instigator of cognitive development (Keil, 1981; Suffill et al., 2022). Consequently, developing valid measures for assessing domain-specific knowledge is highly relevant for research and educational settings.
Domain-specific knowledge in learners’ memory is notoriously difficult to measure because it is a heterogeneous and dynamically changing construct. Several sources contribute to the heterogeneity of learners’ knowledge (de Jong & Ferguson-Hessler, 1996). A learner’s knowledge elements in long-term memory can differ in their domain (for example, mathematics vs. history), their content (e.g., the principle of commutativity vs. the principle of associativity in mathematics; Schneider & Stern, 2009), their knowledge type (e.g., conceptual understanding of the equivalence principle vs. procedural skills for solving systems of equations), their sources (e.g., parents vs. books), their representations (e.g., symbolic verbal code vs. analog pictorial code), and the contexts they are typically activated in (e.g., physics instruction vs. science fiction movies). Learners’ knowledge in long-term memory has a network structure, and knowledge elements can differ in how often, strongly, and profoundly they are interlinked with other knowledge elements in this network (Anderson, 2020). When knowledge elements are acquired in different contexts or times, learners often fail to see their interrelations and thus store the knowledge elements independently in long-term memory (Linn, 2006), leading to a fragmented knowledge base (diSessa et al., 2004). Instruction can integrate knowledge representations in long-term memory by showing how knowledge elements relate (Linn, 2000). However, instruction can also increase knowledge fragmentation in long-term memory when it conveys new information without clarifying how it relates to learners' prior knowledge (e.g., Flaig et al., 2018). Thus, learners can differ in the degree of their knowledge heterogeneity, and instruction can change this degree. Due to the heterogeneous and dynamically changing nature of knowledge, the items of knowledge tests might not always intercorrelate highly. One way to quantify this interrelatedness of the items is the index Cronbach's Alpha.
Cronbach's Alpha is commonly interpreted as a measure of the internal consistency of tests (Cortina, 1993; McNeish, 2018; Schmitt, 1996; Sijtsma, 2009). It indicates the strength of the interrelations between the test items in relation to test length (Cortina, 1993; Schmitt, 1996). If test items interrelate only moderately, Alpha tends to be low; if they interrelate strongly, it tends to be high (Schmitt, 1996). Usually, a high Alpha is seen as desirable because it is assumed to show that the test items reflect a low degree of random measurement error, so that the test has a high reliability (Raykov & Marcoulides, 2015). Accordingly, it might be difficult to publish the results of tests with a low Alpha. Published yet arbitrary (Taber, 2018) recommendations suggest that tests below certain levels of Alpha (e.g., 0.70) should not be used at all or need improvement (cf. Schmitt, 1996).
Two recent articles highlight a substantial problem with how the Alpha of knowledge tests is typically used and interpreted (Stadler et al., 2021; Taber, 2018). Learners with heterogeneous knowledge have inconsistent answer patterns on knowledge tests. Thus, it is unrealistic to expect that each reliable knowledge test has a high internal consistency. It may also be unrealistic to expect that a knowledge test has the same Alpha when used before and after instruction or at different achievement levels because the heterogeneity of knowledge can change through learning and experience (e.g., Linn, 2000). The response format of a knowledge test might affect the cognitive processes and the characteristics of the knowledge activated by the items (Currie & Chiramanee, 2010), which may in turn affect Alpha. When journal editors and reviewers prevent the publication of studies using knowledge tests with low Alphas, this can lead to distorted findings that might not be representative of knowledge in general. This might hinder the progress of research on knowledge acquisition.
These insights raise several questions: How high is the Alpha of knowledge tests on average, and how strongly does it vary between knowledge tests? Does the distribution of Alphas indicate that studies with low Alphas are underrepresented in the literature? How strongly does the Alpha of knowledge tests change through instruction? How does it differ between age groups and for different types of knowledge? In the present study, we examine these questions in a meta-analysis of Cronbach's Alphas of knowledge tests. The first aim of this meta-analysis is methodological. Specifically, we aim to provide indications for educational researchers regarding the typical Alphas they can expect when developing a new test of domain-specific knowledge. The second aim is to provide theoretical insights into variables that explain variation in the Alphas of knowledge tests, including characteristics of the test, the domain, and the learner.
Sources of the Heterogeneity of Domain-Specific Knowledge
Domain-specific knowledge is also called domain knowledge or content knowledge (cf. Simonsmeier et al., 2022). It can be distinguished from domain-general knowledge, such as knowledge about learning and reasoning strategies (Edelsbrunner et al., 2022). Knowledge is a heterogeneous construct. At least five influences contribute to this heterogeneity.
First, knowledge is stored in long-term memory in the form of elements and their interrelations, for example, as nodes and edges of a semantic network (e.g., Anderson et al., 2004; De Deyne et al., 2016). Knowledge is heterogeneous in that learners can hold some knowledge elements in memory (e.g., the principle of associativity) without necessarily holding other, related knowledge elements in memory (e.g., the principle of commutativity).
Second, for some knowledge elements, learners might understand how they interrelate and store them in an interrelated way in long-term memory. But for other elements, learners might fail to see how they connect and store them independently, without inter-relation, in their long-term memory (Linn, 2000). The resulting structure of knowledge elements with few connections is called fragmented knowledge (diSessa et al., 2004). Due to the lack of interrelations, fragmented knowledge can contain knowledge elements that are inconsistent or incompatible with each other (Schneider & Hardy, 2013).
Third, the problem of fragmented knowledge is aggravated by the fact that learners acquire their knowledge in different contexts (e.g., school, hobbies, internet, movies, friends) and activate knowledge elements depending on the context (diSessa et al., 2004). When they learn a piece of knowledge in one context, they might fail to activate another piece of knowledge in their long-term memory that would be related but has been learned in a different context (Barnett & Ceci, 2002). Accordingly, they do not realize that the new knowledge relates to their prior knowledge and cannot establish a relation between the two elements in their long-term memory. So, knowledge is heterogeneous also in the sense that different contexts (e.g., test items) activate different knowledge elements.
Fourth, learners do not automatically erase their misconceptions about a topic from memory when they learn the correct scientific concepts. Instead, correct concepts and intuitive misconceptions can co-exist in long-term memory and interfere during recall (Shtulman & Valcarcel, 2012; Stricker et al., 2021).
Finally, even when learners have complete, correct, and integrated knowledge in a domain, this knowledge can be heterogeneous in that knowledge elements can differ in their types and qualities (see de Jong & Ferguson-Hessler, 1996 and Bittermann et al., 2023 for reviews). For example, knowledge can be symbolic (e.g., verbal) or analog (e.g., pictorial), abstract or concrete, explicit (conscious and verbalizable) or implicit (not conscious and verbalizable), and it can be of various knowledge types differing in their functional characteristics.
Commonly investigated knowledge types are fact knowledge, conceptual knowledge, and procedural knowledge (de Jong & Ferguson-Hessler, 1996). Fact knowledge is explicit verbal declarative knowledge about facts (e.g., France lies in Europe; Agarwal, 2019). Conceptual knowledge gives learners a general abstract understanding of principles and their interrelations in a domain (Crooks & Alibali, 2014). It tends to be more relational (i.e., understanding the relations between multiple knowledge representations) than fact knowledge or procedural knowledge (Goldwater & Schalk, 2016). Procedural knowledge is the foundation of skills by specifying which operators (e.g., manual actions or cognitive transformations) learners can perform to reach a specific goal. Procedural knowledge is built up through practice and can be automatized to different degrees (Ziegler et al., 2021).
Knowledge of different types measured in the same domain tends to be highly intercorrelated (Rittle-Johnson & Siegler, 1998). The reason for this is demonstrated by longitudinal studies finding bi-directional influences between knowledge types over time (Rittle-Johnson et al., 2001; Schneider et al., 2011). Factor analytic studies found that knowledge types can be inseparably intertwined in some samples or content domains (Schneider & Stern, 2010) but form inter-correlated but clearly separable latent factors in other cases (Lenz et al., 2020; Schneider et al., 2011).
In addition to characteristics of the studied knowledge type, domain, and population of test takers, a characteristic of the test itself that might affect heterogeneity in assessed knowledge is the response format. Prior research provides theoretical debates and empirical evidence regarding how multiple-choice and open response items differ in the response processes or knowledge that they evoke. Individual differences in fact knowledge are stable across response formats, although multiple-choice items tend to be less difficult because they draw more strongly on response recognition than on the more demanding production of responses (Currie & Chiramanee, 2010; Goecke et al., 2022). This may not be the case, or may even reverse, for items measuring conceptual knowledge because open responses can be better suited than multiple-choice items to reveal learners' understanding of relational knowledge elements (Ha & Lee, 2011). Open response formats may also indicate more pronounced knowledge integration than multiple-choice items (Lee et al., 2011).
Empirical Studies Investigating the Heterogeneity of Knowledge
Several studies have used multivariate methods to model the multifaceted nature of knowledge and changes in knowledge structure over time. Some of these studies modeled facets of learners' knowledge as latent factors, investigating how strongly the factors inter-correlated and how this factor structure changed over time (Schneider & Stern, 2010; Watrin et al., 2022). Other studies used latent profile transition analysis, in which each profile score indicated one facet of knowledge, such as a specific concept (e.g., Edelsbrunner et al., 2018; Schneider & Hardy, 2013). These studies found evidence for fragmented knowledge and for influences of instruction on the degree of fragmentation. In a study with seventh- and eighth-graders with low prior knowledge, instruction about equation solving increased the homogeneity of knowledge (Schneider et al., 2011). Latent factors representing different aspects of knowledge had higher intercorrelations after than before instruction (Schneider et al., 2011). In the same study, the instruction did not affect the heterogeneity in a similar sample of students with high prior knowledge. By applying a latent transition analysis, Schneider and Hardy (2013) examined elementary school students' conceptual knowledge about floating and sinking. In this analysis, the frequency of profiles indicating fragmented knowledge decreased from 34 to 24% during participation in a constructivist learning environment over 1 week. After that, the frequency remained almost constant over a year without further instruction on the topics. Edelsbrunner et al. (2018) replicated most of these findings in a larger sample, finding increased diversity in knowledge structures after instruction. In another study of adults aged 18 to 70 years, the factor structure of declarative knowledge (i.e., knowledge about facts or concepts; de Jong & Ferguson-Hessler, 1996) was not moderated by age, indicating stability of the heterogeneity of knowledge across the lifespan (Watrin et al., 2022).
These findings suggest that the heterogeneity of learners' knowledge does not generally change with age but when the knowledge is targeted by instruction. This hypothesis aligns with the knowledge integration perspective, positing that students rarely spontaneously engage in knowledge integration (Linn, 2006). Therefore, it is an important task for instructional interventions to not only convey new knowledge but also to stimulate the integration of learners' old and new knowledge (Linn, 2006; Schneider & Stern, 2009). However, the comparability and generalizability of the findings reported above are unclear, as they have been obtained in differing age groups, content domains, and countries.
Cronbach's Alpha
Cronbach's Alpha is a psychometric index frequently used in educational research and the behavioral sciences in general as an indicator of the internal consistency of a test (McNeish, 2018; Taber, 2018). By a test, we mean a diagnostic tool consisting of multiple items used to estimate individuals' domain-specific knowledge about a topic.
Alpha is the product of two characteristics of the test items: the ratio of the average covariances of the items to their total variance, and test length (Bland & Altman, 1997; McNeish, 2018). In other words, Alpha is a product of the strength of interrelations of the items in a test and the number of items. The formula of Alpha is:
\[
\alpha =\frac{k}{k-1}\left(1-\frac{{\sum }_{i=1}^{k}{s}_{i}^{2}}{{s}_{X}^{2}}\right) \qquad (1)
\]
where k is the number of test items, \({s}_{i}^{2}\) is the variance of item i, summed across all items i = 1, …, k, and \({s}_{X}^{2}\) is the total variance of all items (McNeish, 2018). The first term of the product represents test length and the second the interrelatedness of the items. This formula can also be expressed as:
\[
\alpha =\frac{{k}^{2}\,{\overline{s} }_{ij}}{{s}_{X}^{2}} \qquad (2)
\]
where \({\overline{s} }_{ij}\) is the average covariance between all items (Cortina, 1993; McNeish, 2018).
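For illustration, the following minimal R sketch (our own, not taken from any of the meta-analyzed studies) computes Alpha from a matrix of item scores using both Eq. (1) and Eq. (2); the simulated data merely demonstrate that the two forms yield the same value.

```r
# Minimal sketch: Cronbach's Alpha from item scores via Eq. (1) and Eq. (2).
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                               # number of test items
  S <- cov(items, use = "pairwise.complete.obs") # item variance/covariance matrix
  alpha_eq1 <- (k / (k - 1)) * (1 - sum(diag(S)) / sum(S))
  s_bar     <- mean(S[upper.tri(S)])             # average covariance between items
  alpha_eq2 <- (k^2 * s_bar) / sum(S)            # sum(S) = total variance of the test
  c(eq1 = alpha_eq1, eq2 = alpha_eq2)            # both forms give the same value
}

set.seed(1)
common <- rnorm(200)                             # shared component across items
items  <- replicate(10, common + rnorm(200))     # 10 correlated items, 200 test takers
cronbach_alpha(items)
```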
Alpha increases with the item covariances (or, in standardized metric, the item inter-correlations) and the number of items. Consequently, researchers have two options for increasing the Alpha in test construction. Either they add further items to a test that correlate positively with the other items (for the exact requirements for increase in Alpha, see Yousfi, 2005a, b) or they modify items. In both cases, to increase Alpha, alterations must increase the ratio of the average item covariance in comparison to the total item variance. When researchers try to optimize Alpha by creating sets of items that are similar to each other and thus strongly interrelated, they might inadvertently decrease the construct validity of their test because the items fail to cover the construct to its full extent (so-called validity–reliability trade-off; e.g., Clifton, 2020; Garner, 2024; Steger et al., 2023).
Many researchers have criticized the use of Alpha as an indicator of the reliability or dimensionality of a test (for an overview of criticisms, see McNeish, 2018). Alpha is a lower bound of reliability that can be close to or further away from the true reliability value depending on the psychometric characteristics of the test (Sijtsma, 2009). In the present study, we do not use Alpha as an indicator of reliability or dimensionality but as an indicator of the strength of interrelations of the items in a test. This interpretation of Alpha is in line with Cortina (1993) and Schmitt (1996), who clarified that Alpha reflects the internal consistency, or interrelatedness, of a set of items (see Crano & Brewer, 2014; Green et al., 1977, for similar views, and Revelle & Zinbarg, 2009, for discussion of the term internal consistency).
The dependence of Alpha on the number and interrelatedness of items implies that the stronger the associations between items in a test, the fewer items are required to result in an Alpha of a specific magnitude. When items interrelate strongly, for example because they have good psychometric characteristics, fewer items suffice to yield the same Alpha coefficient than when items interrelate less strongly. Building on this characteristic, for a given Alpha and number of items, the well-known Spearman–Brown prophecy formula can be applied to predict how the Alpha would change if the number of items was reduced or increased (Lord & Novick, 1968; Schmitt, 1996):
\[
{\alpha }^{*}=\frac{n\alpha }{1+\left(n-1\right)\alpha } \qquad (3)
\]
where α is the Alpha estimate for the original test length (i.e., number of items), n is the proportion by which the test is shortened or lengthened (i.e., the new test length divided by the original test length), and α* is the predicted Alpha for the new test length. Building on this formula, one can predict the average intercorrelation of items on the respective test by first predicting the Alpha for a test length of just two items and then, by plugging in the formula for standardized Alpha (Falk & Savalei, 2011), deriving their correlation by:
\[
\overline{r }=\frac{{\alpha }^{*}}{2-{\alpha }^{*}} \qquad (4)
\]
where \({\alpha }^{*}\) is the Alpha predicted by the Spearman–Brown formula for a test length of just two items, and \(\overline{r }\) is the expected average item intercorrelation. Although the Spearman–Brown formula in principle makes the strong assumption that newly added or removed items are parallel (i.e., they all measure the same construct with the same item discrimination), deviations from this assumption tend to have limited impact on the precision of the formula (Ellis & Sijtsma, 2024).
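The following R sketch (again our own illustration) implements Eqs. (3) and (4). As a usage example, plugging in the meta-analytic values reported later in this article (Alpha = .85 at the median test length of 20 items) yields an expected average item intercorrelation of about .22.

```r
# Minimal sketch: Spearman-Brown prophecy (Eq. 3) and the implied average
# item intercorrelation (Eq. 4).
spearman_brown <- function(alpha, n) {
  # n = new test length divided by original test length
  (n * alpha) / (1 + (n - 1) * alpha)
}

expected_item_cor <- function(alpha, k_items) {
  alpha_two <- spearman_brown(alpha, n = 2 / k_items)  # predicted Alpha of a two-item test
  alpha_two / (2 - alpha_two)                          # solve standardized Alpha for r
}

expected_item_cor(alpha = 0.85, k_items = 20)   # approximately 0.22
```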
Cut-off Values for Cronbach's Alpha
There are published guidelines for developing tests and interpreting the estimated Alphas of these tests. Probably the most commonly applied guidelines are those attributed to Nunnally (1978), who described Alpha as moderate if it exceeds 0.70, as good for research purposes if it exceeds 0.80, and as sufficient for diagnostic purposes if it exceeds 0.90. To exceed one of these values, researchers might add, remove, or exchange test items until the Alpha estimate reaches the respective cut-off. However, it has been argued that these and other cut-offs lack a theoretical or empirical basis and might be outdated (Greco et al., 2018). In addition, recent publications emphasize the reliability–validity trade-off, arguing that ensuring high construct validity often requires compromising on internal consistency. These issues remain for alternative indices such as Omega (Dunn et al., 2014), which is seen as a more precise indicator of internal consistency than Cronbach's Alpha but whose cut-offs also lack an empirical basis. Researchers might still stick to these cut-offs because there is a lack of more principled guidelines or comparison standards, particularly within educational research. Information on the typical Alphas that are empirically found for specific kinds of tests could be used to develop such guidelines. Researchers could use these guidelines to judge whether their tests exhibit Alphas similar to those of previous tests that were supposed to measure a similar construct. Such empirically based guidelines have been published for effect sizes such as Cohen's d (Kraft, 2020) and correlations (Gignac & Szodorai, 2016). In management research, meta-analytic estimates of Alpha have been compared to the guidelines by Nunnally (Greco et al., 2018). The authors found that the Alphas of measures assessing individual differences, attitudes, and behaviors almost always exceeded 0.70 but were often below 0.80. We are not aware of similar studies or empirically based guidelines for Cronbach's Alpha in educational research. In the present research, we aim to establish such guidelines for tests assessing domain-specific knowledge, although we will discuss reasons for the view that dropping guidelines altogether may be a better way forward.
The Present Study
In sum, Cronbach's Alpha indicates the internal consistency, that is, the degree of interrelatedness of the items of a test (Cortina, 1993; Schmitt, 1996). Some guidelines recommend that Cronbach's Alpha of tests should always be high and that tests with a low Alpha should not be used. These guidelines have been challenged for knowledge tests based on theoretical and empirical arguments about the heterogeneous and dynamically changing structure of knowledge (Stadler et al., 2021; Taber, 2018). However, a synthesis of the evidence on the heterogeneity of knowledge tests and variables moderating this heterogeneity is still missing.
Therefore, we conducted an initial meta-analysis of the Cronbach's Alphas of domain-specific knowledge tests in research on learning and instruction. Since we were interested in how instruction and development affect Alpha, we only analyzed Alphas from studies with at least two measurement points. To this end, we used the data from an extensive recent meta-analysis on the influence of prior knowledge on learning, including over 8000 effect sizes (Simonsmeier et al., 2022). The meta-analysis examined the correlations between prior knowledge and learning outcomes and coded Cronbach's Alphas for attenuation correction of correlations. The Alphas were not further described or analyzed in that meta-analysis. In the present study, we build on this data pool. Although this is a selective sample, the data coded by Simonsmeier and colleagues are well suited for the present analysis because they had conducted a comprehensive literature search, screening 10,000 titles and abstracts for eligibility. They did not limit their meta-analysis to specific content domains, age groups, countries, or school forms. Therefore, the sub-sample of their data for which Cronbach's Alphas were available represents a large corpus of the literature concerned with knowledge change through learning or development. We used this data set to investigate the following research questions, which we group into four more general topics:
Meta-Analytic Average Alpha, Prediction Interval, and Average Item-Intercorrelation
The first research questions should shed light on the typical Alpha that researchers can expect for their knowledge tests on average, and on the expected average item-intercorrelation that this Alpha implies. These research questions have both theoretical and methodological merit, by informing about the typical homogeneity of knowledge items and the typical variation therein that researchers may expect for their knowledge tests. The specific questions were as follows:
Research question 1: What is the average (i.e., meta-analytic estimate of) Cronbach's Alpha and concomitant expected item-intercorrelation of tests used in studies of knowledge acquisition and development?
As explained before, from a theoretical perspective, we expect knowledge to be a rather heterogeneous construct, which would be a reason to expect a relatively low average Alpha. On the other hand, studies with Alphas below 0.70 might be hard to publish because they go against established cut-offs.
Hypothesis 1: The estimated average Alpha is greater than 0.70, but only slightly so (0.70 < α < 0.75).
Since Alpha depends on test length, that is, on the number of items, this expected average Alpha also implies rather moderate item-intercorrelations. We will convert the predicted Cronbach's Alphas into average item-intercorrelations by using Eq. (4) to derive an indicator of internal consistency that is interpretable in a standardized and well-known metric (i.e., the correlation coefficient) and independent of test length. Depending on the average number of items in our sample of knowledge tests, our prediction of an average Alpha in the range of 0.70–0.75 implies different expected item-intercorrelations according to Eq. (4). If the average knowledge test comprises just ten items, the implied item-intercorrelation is rexp = 0.19 for an Alpha of 0.70 and rexp = 0.23 for an Alpha of 0.75. If the average knowledge test comprises 30 items, the implied item-intercorrelation is rexp = 0.07 for an Alpha of 0.70 and rexp = 0.09 for an Alpha of 0.75. Overall, we cannot pin down a single expected item-intercorrelation because it is a function of both Alpha and test length, but given our theoretical expectation of substantial heterogeneity and the values implied by Eq. (4), we expect it to be moderate at best.
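The implied intercorrelations above can be checked with a few lines of R (our own verification of the arithmetic via Eqs. (3) and (4)):

```r
# Reproducing the implied item intercorrelations stated above.
implied_r <- function(alpha, k) {
  a2 <- (2 / k * alpha) / (1 + (2 / k - 1) * alpha)  # Eq. (3): Alpha of a two-item test
  a2 / (2 - a2)                                      # Eq. (4): expected intercorrelation
}

grid <- expand.grid(alpha = c(0.70, 0.75), k = c(10, 30))
grid$r_exp <- round(implied_r(grid$alpha, grid$k), 2)
grid
#>   alpha  k r_exp
#> 1  0.70 10  0.19
#> 2  0.75 10  0.23
#> 3  0.70 30  0.07
#> 4  0.75 30  0.09
```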
Hypothesis 2: The expected item intercorrelation of the domain-specific knowledge tests is rexp < 0.30.
However, we did not expect the Alphas of knowledge tests to be very similar across different tests, which brings us to our next research question:
Research question 2: What typical Alphas, as represented in the prediction interval around the meta-analytic average Alpha, can researchers expect for their knowledge tests?
We expected a broad distribution of Alphas around the meta-analytic average. The assumption of strong variability of Alphas arises from the fact that the Alphas are very likely influenced by the test length, the breadth of the tested knowledge, and other sources of heterogeneity (e.g., whether the test includes different response formats). This broad distribution should lead to large heterogeneity estimates and wide prediction intervals (i.e., meta-analytic predictions for new tests from the same population; Riley et al., 2011) for the expected Cronbach's Alphas. In contrast to confidence intervals, prediction intervals acknowledge the random variation around model estimates, implying that different tests do not all receive the same population estimate. A prediction interval thereby provides a more realistic picture of the typical range of Alphas. This research question addresses the methodological aim of this meta-analysis: by computing the prediction interval around the meta-analytic estimate of Alpha, we provide a first attempt at a guideline for the Cronbach's Alphas that researchers can typically expect for their tests. Although we expected rather large variation of Alphas across tests, we did not have a specific expectation regarding the width of the resulting prediction intervals.
Dependence on Test Length, Publication Bias, and Effects of the Response Format
Research questions 3 and 4 were concerned with the dependence of the meta-analytic Alpha estimates on test length (i.e., the number of items), which, as we will explain, relates to publication bias, and research question 5 covers another characteristic of the test, specifically the response format.
Research question 3: How does the Alpha of knowledge tests relate to the number of items in the test?
We examine test length as a moderator of Alpha. Since Alpha is a product of the strength of interrelations and the number of test items, we expect the number of items to be a good predictor of the Alphas in our database. This relation is expected to be quadratic (Hypothesis 3), as evidenced by tables depicting the relation between the strength of interrelations between items and their number (e.g., Table 3 in Cortina, 1993).
Hypothesis 3: There is a quadratic relation between test length and Alpha.
The expected relation between Alpha and test length is not self-evident because other moderators and publication bias may distort the relation. Since the relation with test length is part of the equation of Alpha, we expect test length to be the most important moderator variable. We will therefore derive prediction intervals for the expected Cronbach’s Alphas for tests with differing length to be more precise regarding research question 2 (typically expected Alphas for knowledge tests) and establish guidelines that researchers can use to evaluate the internal consistencies of their knowledge tests depending on test length.
Research question 4: Is there evidence for bias in the distribution of Alphas?
When an Alpha estimate does not reach standard cut-offs such as 0.70 or 0.80, this can lead to bias in several ways: Researchers might not submit the study for publication or might get rejected from journals, contributing to file drawer bias (Franco et al., 2014), or they might remove, select, or re-write items to increase Alpha. We examined bias in the distribution of Alphas with four different approaches. First, we examined the skewness of the distribution of transformed Alphas. Alphas typically exhibit a left-skewed distribution, which we normalized by applying the transformation by Bonett (2002). If there is no bias, the transformed Alphas should be approximately normally distributed; if there is bias, the transformed distribution would be non-normal. We hypothesized an underrepresentation of small Alphas in published studies, implying the prediction that the distribution of the transformed Alphas will be right-skewed.
Hypothesis 4: The distribution of Alphas transformed following Bonett (2002) is right-skewed, indicating publication bias.
Second, we examined the frequency distribution of Alphas and expected to find higher frequencies of Alphas in the areas just above the typical cut-offs because researchers might modify their tests until they exceed one of the cut-offs suggested by Nunnally (1978).
Hypothesis 5: There is increased frequency of Alphas just above 0.70, 0.80, and 0.90.
Third, we used the Spearman–Brown formula to examine bias. Based on this formula, we computed the expected curve of Alphas across differing test lengths. Since Alpha depends strongly on test length, yielding a high Alpha with shorter tests is much harder than with longer ones. We assumed that when researchers construct short tests that result in low Alphas, they will modify these tests until Alpha reaches a satisfactory level to avoid problems with publishing the results.
Hypothesis 6: There are fewer short tests with low Alphas than the Spearman–Brown formula predicts.
Finally, we used a funnel plot to examine bias in the Alphas. We expected that the funnel plot would fail to identify bias. The reason is that a typical funnel plot relates sample size (specifically, the inverse of the standard error) to effect size (Duval & Tweedie, 2000). The magnitude of Alpha does not depend on sample size, but instead on test length. Estimates of Alphas will be more precise with increasing sample sizes (i.e., its standard error will be lower), but sample sizes do not affect the expectation (i.e., statistically expected mean estimate) of Alpha. Consequently, there is no statistical reason to condition the publication of Alphas on sample size.
Hypothesis 7: Bias indicated by our other approaches is not visible in a funnel plot.
We also combined the funnel plot with the trim-and-fill procedure (Duval & Tweedie, 2000) to see whether data points imputed by this method appear in areas of the distribution in which our other methods indicate bias.
Research question 5: Do tests with different response formats exhibit differences in internal consistencies?
The internal consistencies of tests and their related items might differ depending on whether the items require open answers, multiple-choice answers, or other response formats (e.g., Likert scales). We examined this question in an exploratory manner and did not have related hypotheses.
Changes of Alpha Throughout Learning and Development
Research questions 6 to 8 concern how the Alphas change with learning and development (i.e., interventions or age):
Research question 6: How does instruction affect the internal consistency of knowledge tests? That is, how does Alpha change from pretest to posttest (and follow-up tests) in the course of intervention studies?
In line with previous findings, we expected that well-designed instruction tends to decrease knowledge heterogeneity by clarifying relations between pieces of knowledge for the learners.
Hypothesis 8: Alphas increase in intervention studies from before to after learning.
Research question 7: Does the internal consistency of knowledge tests differ between measurement points in longitudinal studies?
It has been suggested that students rarely spontaneously integrate their knowledge. Therefore, it is the task of instruction to achieve this integration (Linn, 2000).
Hypothesis 9: There is no visible change in Alphas throughout longitudinal studies that do not employ an intervention.
Research question 8: Does the sample mean age moderate the internal consistency of knowledge tests?
When individuals acquire knowledge throughout and beyond schooling, this might lead to a broad yet only loosely related knowledge network. On the other hand, the older individuals get, the more time they have had to reflect on knowledge and acquire a deeper understanding of the connections between different pieces of knowledge. However, a large previous study found that age does not moderate the heterogeneity of adults' knowledge over the lifespan (18–70 years; Watrin et al., 2022).
Hypothesis 10: There is no visible correlation between Alpha and sample mean age.
Characteristics of the Assessed Knowledge
Research question 9: Does the type of assessed knowledge moderate the internal consistency of knowledge tests?
Fact knowledge and procedural knowledge are usually simpler in their structure than conceptual knowledge with its network of interrelations (Goldwater & Schalk, 2016). Facts can be accumulated without understanding their interrelation, whereas conceptual knowledge is highly dependent on interrelations.
Hypothesis 11: The Alphas of conceptual knowledge are higher than those of fact and procedural knowledge.
Research question 10: Do content domains differ in the internal consistency of knowledge tests?
Hypothesis 12: The internal consistency is higher for well-structured domains, such as mathematics and language (Ball et al., 2005), than for less structured domains, such as the humanities and social sciences (Buehl et al., 2002).
Method
Study Inclusion
The study search and inclusion process is shown in the PRISMA flowchart in Fig. 1 (Page et al., 2021). In our study, we re-analyzed a data set coded in the meta-analysis on prior knowledge and learning by Simonsmeier et al. (2022). They investigated the relation between prior knowledge and learning, thus including all studies measuring knowledge before and after learning. They coded the Cronbach's Alphas of the knowledge tests when they were reported in the publications. Thus, the data set from the meta-analysis by Simonsmeier et al. (2022) is a good basis for the present meta-analysis as well. Further advantages of using these data are the broad scope of the meta-analysis, which was not limited to specific content domains or age groups, the extensive literature search, and the high inter-coder reliabilities, as described below.
Simonsmeier et al. (2022) had three inclusion criteria. First, the study included an objective measure or an experimental manipulation of the quantity of domain-specific prior knowledge. Studies investigating differences in the activation of or familiarity with knowledge were excluded. Second, the study used at least one objective measure of knowledge or achievement after learning. Studies in which a different test was used before and after learning were included. Third, the study reported a standardized effect size of the relation between prior knowledge and knowledge after learning or the pretest–posttest knowledge gains. Studies reporting the information to compute such an effect size for the meta-analysis were also included.
Simonsmeier et al.’s (2022) search was conducted in PsycINFO and ERIC with the search string ((“pre-test” or “post-test” or “pretest” or “posttest” or “pre test” or “post test” or “longitudinal” or “repeated measure” or “repeated measures” or “measurement point” or “measurement points” or “Matthew effect” or “compensatory effect” or “learning gain” or “learning gains”) and knowledge) or “prior knowledge” or “conceptual change” or “knowledge change” or “knowledge gain” or “knowledge gains”. The search string was designed to not only find studies using the term “prior knowledge” but also other studies measuring knowledge at least twice. An additional search used mailing lists and exploratory internet searches. For the screening of the titles and abstracts, the absolute inter-coder agreement was 83% for 100 randomly chosen records. For the inclusion of the full texts, the absolute agreements of three trained research assistants with the main coder were 82, 72, and 79%. Simonsmeier et al. (2022) included 493 studies.
Many of the 493 studies included in the previous meta-analysis could not be included in the present meta-analysis because they did not report information necessary for investigating our research questions and, thus, did not fulfill the two additional inclusion criteria we had in our study. First, since our research questions relate to Cronbach’s Alphas, we included only studies reporting Cronbach’s alphas. Second, since we were interested in how the Alphas change with instruction and development, we included only studies reporting Alpha for the same test before and after learning. Thus, we excluded studies that had measured prior knowledge and learning outcomes using different tests. Of Simonsmeier et al.’s (2022) 493 included studies, 52 studies with 55 samples and 285 effect sizes fulfilled our additional inclusion criteria and were analyzed in the present meta-analysis. Multiple effect sizes per study resulted from multiple knowledge outcome measures or multiple measurement points per study. The average number of effect sizes per study was 4.83, with a median of 2 and a range of 1–24.
Data Coding
In the study by Simonsmeier et al. (2022), one main coder and three trained research assistants coded the effect sizes, Alphas, and moderator variables with absolute agreements of 92, 89, and 90%. If reported, multiple effect sizes per study were coded, for example, when a study reported results from several time points or measures. In the present study, we extended this data set by additionally coding the sample size used for the computation of each Alpha. For example, when 100 persons' knowledge was measured at the pretest and 50 persons' knowledge was measured at the posttest, Simonsmeier et al. (2022) had coded the sample size as 50 because this is the number of persons used to compute the correlation between the pretest and posttest. By contrast, we coded sample sizes of 100 for Alpha at the pretest and 50 for Alpha at the posttest. This procedure led to the final version of the data set analyzed in the present meta-analysis. The analytic data and scripts are available at https://osf.io/7ygc5.
Statistical Analyses
For the estimation of meta-analytic effect sizes and meta-regressions, we used cluster-robust random-effects estimation in the metafor package in the R software environment (Borenstein et al., 2017; Cameron & Miller, 2015; Sánchez-Meca et al., 2021; Viechtbauer, 2010). We specified the 55 samples as clusters, which allowed unbiased estimation with multiple effect sizes per study/sample (Viechtbauer, 2010). The metafor package then estimates the meta-analytic variance/covariance matrix with a sandwich estimator that corrects the estimates for the study dependence (Cameron & Miller, 2015). As Cronbach's Alpha cannot exceed 1, it is usually non-normally distributed with a left-skewed distribution. To compensate for this skew, we normalized the Alphas before our analyses with the transformation by Bonett (2002), \({\alpha }_{\mathrm{transformed}}=\mathrm{ln}(1-\alpha )\). For obtaining predicted Alphas from the statistical models, we back-transformed model predictions via \({\alpha }_{\mathrm{predicted}}=1-\mathrm{exp}({\alpha }_{\mathrm{transformed}})\). As estimates of the heterogeneity of the Alphas, we report the between-studies variance τ² (Deeks et al., 2008) and the ratio of true heterogeneity to total observed variation, I² (Borenstein et al., 2017). We used the psych (Revelle, 2017) and tidyverse (Wickham et al., 2019) packages for descriptive statistics and data visualizations.
To analyze the effects of moderator variables, we applied the same estimation technique but added predictor variables to the model using the meta-regression function of the same package. In addition, we estimated the marginal effects (i.e., model-implied predicted values) for the individual levels of categorical moderators. We analyzed each moderator individually, separately from the others, to avoid collinearity, which can cause power and interpretability issues. However, in all moderator analyses, we controlled for the log-transformed number of items to account for this confounding factor, which directly enters the computation of Alpha. The log transformation was applied because the relation between the log-transformed number of items and the Alphas was approximately linear, in line with the assumptions of linear meta-regression.
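To make the estimation approach concrete, the following R sketch illustrates the main steps described in this and the preceding paragraph. It is our illustration under stated assumptions rather than the authors' code (their data and scripts are available at the OSF link above): the column names alpha, n_items, n, and sample_id are hypothetical, and the sampling variances use Bonett's (2002) approximation, which the text does not specify explicitly.

```r
# Illustrative sketch: cluster-robust random-effects estimation of transformed
# Alphas and a meta-regression on test length with the metafor package.
library(metafor)

# dat is assumed to hold one row per Alpha with hypothetical columns:
#   alpha = reported Cronbach's Alpha, n_items = number of items,
#   n = sample size underlying the Alpha, sample_id = cluster identifier
dat$yi <- log(1 - dat$alpha)                 # Bonett (2002) transformation
dat$vi <- 2 * dat$n_items /                  # approximate sampling variance of the
  ((dat$n_items - 1) * (dat$n - 2))          # transformed Alpha (assumed here)

res <- rma(yi, vi, data = dat, level = 90)   # random-effects model, 90% intervals
rob <- robust(res, cluster = dat$sample_id)  # cluster-robust (sandwich) inference

# Pooled estimate, confidence interval, and prediction interval,
# back-transformed to the Alpha scale
predict(rob, transf = function(x) 1 - exp(x))

# Moderator analysis: log-transformed, centered test length as predictor
dat$log_items <- log(dat$n_items) - mean(log(dat$n_items))
mod <- robust(rma(yi, vi, mods = ~ log_items, data = dat, level = 90),
              cluster = dat$sample_id)

# Predicted Alphas at test lengths of 10, 20, and 40 items
new_k <- log(c(10, 20, 40)) - mean(log(dat$n_items))
predict(mod, newmods = new_k, transf = function(x) 1 - exp(x))
```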
We present 90% confidence intervals and interpret p-values below 0.10 as significant findings because the number of relevant Alphas in most moderator analyses was relatively small. Widening the confidence interval and the critical p-value prevents high rates of beta errors (Tabachnick et al., 2013). We also took this decision because, to our knowledge, this is the first meta-analysis of Alphas of knowledge tests, and we did not want to overlook effects that future research might want to explore further. A default alpha error level of 10% decreases the chance of such beta errors and thus balances the two types of errors in accordance with our research focus (Fisher, 1926). We present 90% prediction intervals (Riley et al., 2011) in addition to confidence intervals to provide preliminary guidelines for expected Alphas in future studies. Prediction intervals do not treat estimated effects as fixed but acknowledge the nature of random effects in predicting Alphas that can be expected in future studies. Based on these intervals, researchers can see whether their obtained Alphas lie within the bulk of Alphas that can be expected in future studies based on our data. Two of our hypotheses predict the absence of effects (Hypothesis 9, no change throughout development; Hypothesis 10, no effect of age). Since the absence of statistical significance itself does not provide evidence for the absence of effects, we will interpret visualizations of the raw data and descriptive statistics, as well as the model-estimated confidence intervals, when evaluating these hypotheses (Edelsbrunner & Thurn, 2024).
Results
We begin the results with a short descriptive overview of the distribution of Alphas in our sample, followed by the results grouped by the four topics outlined in the research questions and hypotheses. To help readers follow the results and their later discussion, Table 1 provides an overview of our hypotheses together with short summaries of the analytic result regarding each hypothesis.
The distribution of the Alphas included in this meta-analysis is shown in Fig. 2, separately for individual Alphas and study-average Alphas. As expected, the distribution was non-normal and left-skewed. The distribution had its highest density between 0.80 and 0.90, with a long tail toward the lower end.
Meta-Analytic Alpha, Expected Item-Intercorrelation, and Prediction Interval
Research Question 1: Meta-Analytic Alpha
Regarding our first research question, the overall estimate from the cluster-robust random-effects meta-analysis was Alpha = 0.85, with a 90% confidence interval (CI) of [0.82, 0.87]. This finding contradicts Hypothesis 1, which predicted an average Alpha just above 0.70. Note that the mean Alpha does not lie in the middle of the CI or the PI because the Alphas were transformed to a linear scale, averaged, and then back-transformed to the non-linear Alpha scale.
To examine the second, related aspect of the first research question, we used Eqs. (3) and (4) and the meta-analytic Alpha to derive the average item intercorrelation that can be expected for tests of domain-specific knowledge. This estimated intercorrelation was \({\overline{r}}_{exp}\) = 0.22, 90% CI [0.18, 0.25], with a 90% prediction interval (PI) of [0.03, 0.57]. This expected intercorrelation is in line with Hypothesis 2, which predicted rexp < 0.30, although the broad prediction interval indicates substantial heterogeneity in the expected range of this statistic.
Research Question 2: Meta-Analytic Prediction Interval
The 90% PI for the overall Alpha was [0.35, 0.96]. The estimated between-studies variance was τ² = 0.73, which corresponded to an estimated proportion of I² = 98.68% of the total variability. This large estimated between-study variance is in line with the large width of the prediction interval, which makes it difficult to predict the Alpha of future studies.
Dependence on Test Length, Publication Bias, and Effects of the Response Format
Research Question 3: Relation with Test Length
Table 2 lists the results of the moderator analyses. The first moderator we examined was test length, that is, the number of items. It had a range of 3 to 150 items, with a mean of 28.81 and a median of 20. Figure 3 presents a scatter plot of the number of items in each sample related to the Alphas, with larger bubbles indicating larger sample sizes. As visible from Fig. 3, there was a substantial relation between the number of items and Alpha, potentially with a moderate exponential or quadratic tail in the lower end of the distribution. As explained above, the number of items was log-transformed for the meta-analytic regression model, after which the number of items was approximately normally distributed, and its relation to Alpha appeared almost linear. As indicated from the result of the respective meta-analytic regression model in Table 2, there was a substantial association of the number of items with the estimated Alphas on the log scale. This confirmed our prediction of a strong relationship. However, the relation appeared less quadratic than we had expected based on the equation of Alpha in formulating Hypothesis 3.
In accordance with this result, we controlled for the log-transformed (to be in line with the linearity assumption) and centered (for improved interpretability of model intercepts) test length in all further meta-analytic regression models.
Research Question 4: Publication Bias
As discussed above (research question 4), we applied four methods to examine bias in the distribution of Alphas. First, as a descriptive method, we estimated skewness in the distribution of transformed Alphas. After the transformation by Bonett (2002), Alphas can be expected to follow a normal distribution. The transformed Alphas in our study are depicted in Fig. 4. In accordance with Hypothesis 4, the distribution indicates moderate right-skewness, as indicated by a skewness index of 0.97 (SE = 0.14; bootstrapped p < 0.001). Figure 4 indicates that nine exceptionally high Alphas stemming from two studies might significantly contribute to this skewness. After removing the Alphas from these two studies, skewness was still moderate at 0.44 (SE = 0.14; bootstrapped p = 0.003), indicating that the overall skewness was partially but not entirely caused by these extreme values.
Second, we investigated in how many samples and studies the Alphas just exceeded the commonly cited cut-offs of 0.70, 0.80, and 0.90. In line with Hypothesis 5, Figs. 2 and 4 show moderate yet salient peaks in the numbers of reported Alphas just above these three cut-offs. The peaks are visible directly to the right of 0.70, 0.80, and 0.90 for the individual Alphas (in the back of Fig. 2 and in Fig. 4) and the study-average Alphas (in the front of Fig. 2).
Third, we investigated whether there is still evidence for a distributional bias when test length is taken into account. This is important because Alpha increases with the number of items in a test. We used our meta-analytic mean estimate of Alpha = 0.85 and the Spearman–Brown prophecy formula to predict which Alpha can be expected for which test length (Fig. 3). As shown in Fig. 3, the shorter-dashed fit line, which adapts smoothly to the empirical distribution of Alphas, runs mostly parallel to the predicted curve for longer tests in the right part of the plot. The parts of the empirical curve located below the predicted curve might be explained by the sample-size weighting in the meta-analytic model, which estimated Alpha to be 0.85 at the median of 20 items. This explanation is supported by the fact that many studies with large sample sizes (represented by bigger bubbles) are located in higher areas of Alphas. There were two further observations relating to distributional bias. First, just above 0.80, the empirical curve flattens out where it should show a steeper decline with a decreasing number of items. The steeper decline follows only between test lengths of about 25 to 20 items, at Alphas below 0.80. Second, the steepness of the empirical curve deviates visibly from the predicted curve again for tests below the median length of 20 items. In this area, the longer-dashed line computed with the Spearman–Brown formula predicts a much steeper decline in Alphas with lower numbers of items than found empirically. As predicted (Hypothesis 6), this finding indicates an underrepresentation of short tests with low Alphas compared to what can be expected for an unbiased distribution. The result of this bias is a large cluster of studies with high Alphas and low numbers of items in the upper left of Fig. 3.
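To illustrate how such a predicted curve can be derived, the following sketch (our own reconstruction, assuming only that the curve is anchored at the meta-analytic Alpha of .85 for the median test length of 20 items, as described above) computes the expected Alpha across the observed range of test lengths.

```r
# Illustrative sketch: expected Alpha across test lengths, anchored at
# Alpha = .85 for the median test length of 20 items.
sb <- function(alpha, n) (n * alpha) / (1 + (n - 1) * alpha)  # Eq. (3)

k_grid     <- 3:150                       # observed range of test lengths
alpha_pred <- sb(0.85, n = k_grid / 20)   # predicted Alpha for each test length

plot(k_grid, alpha_pred, type = "l", ylim = c(0, 1),
     xlab = "Number of items", ylab = "Predicted Cronbach's Alpha")
abline(h = c(0.70, 0.80, 0.90), lty = 3)  # common cut-offs for reference
```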
Finally, to examine Hypothesis 7, we created a funnel plot and applied the trim-and-fill procedure, which is standard practice in meta-analyses. The resulting plot in Fig. 5 shows that these methods indicate a very different pattern of bias than the other three methods. Specifically, plotting transformed Alphas against standard errors suggests that Alphas might be missing in the right part of the plot, with the trim-and-fill procedure imputing Alphas only in the very high range.
Overall, our investigation of bias supports Hypotheses 4, 5, and 6, which assumed that our three preferred methods (skewness, frequencies just above cut-offs, deviation from the Spearman–Brown prediction) would indicate bias. It also supports Hypothesis 7, which predicted that a funnel plot, the method we deem less appropriate for application to Alphas, would yield results incompatible with those of the alternative methods for estimating bias.
Research Question 5: Response Format as Moderator
In an exploratory moderator analysis, we examined whether the response format of the test items played a role for the Alphas (research question 5). As indicated by Fig. 6 and confirmed by the results in Table 2, tests with open response formats had the highest Alphas.
Change Through Interventions and Development
Research Questions 6 and 7: Change in Alphas in Interventions and Development
We hypothesized that Alphas are higher after interventions than before (Hypothesis 8) but do not change over time points in longitudinal studies (Hypothesis 9). There were 127 (44.6%) Alphas from studies with interventions and 158 (55.4%) from longitudinal studies without targeted interventions. As visible from Table 2 and Fig. 7, the estimated Alphas of developmental studies (i.e., longitudinal studies without intervention) tended to be higher than those of studies with interventions. In intervention studies, Alphas increased substantially from the first to the second measurement point. Differences between the first measurement point and all following time points were statistically significant when controlling for test length (Table 2). This confirmed Hypothesis 8 (predicting an increase after interventions). In developmental studies, there was also a trend that Alphas increased over time. This increase became clearly visible at time point 4 and was statistically significant, when controlling for test length, at time points 5 and 6 (Table 2). This contradicted Hypothesis 9 (predicting no change throughout development). In sum, in studies with several measurement points, Alpha tends to increase over time, regardless of whether the study included an intervention.
Research Question 8: Age as Moderator
We hypothesized that we would not find a clear relation with age (Hypothesis 10). Participant mean age across all samples and studies was Mage = 8.64 years (SD = 5.38), with a median of 6.75 and a range of 3.42 to 24.15 years. Age was skewed and therefore log-transformed for the meta-regression. There was a clear and significant trend for Alphas to be lower in samples with higher mean age, as visible from Fig. 8 and Table 2. In addition, Fig. 8 descriptively indicates a curvilinear relation, with Alphas being very low and showing large variation in young children, highest in older children, and lower again throughout adolescence and beginning adulthood.
Research Question 9: Knowledge Type as Moderator
We expected higher Alphas for conceptual knowledge than for other knowledge types (Hypothesis 11). Figure 9 indicates that fact knowledge had higher Alphas than conceptual knowledge, with procedural knowledge in between. The respective meta-regression did not confirm this impression; none of the three contrasts of conceptual knowledge with the other types of knowledge was significant. Tests assessing fact knowledge tended to have more items (M = 57.27, SD = 60.70, Md = 25) than tests assessing conceptual knowledge (M = 19.52, SD = 9.32, Md = 20). Since we controlled for the number of items in all models, the descriptive differences visible in Fig. 9 are explained by test length in the moderator analyses, leaving no visible remaining differences between the Alphas of the different knowledge types. Since a non-significant result does not directly provide evidence for the absence of an effect (Edelsbrunner & Thurn, 2024; Mehler et al., 2019), we also inspected the estimated R² and confidence intervals (Table 2). These estimates indicated very similar Alphas for the different knowledge types, consistent with a lack of effect.
Research Question 10: Content Domain as Moderator
Regarding domains, we expected Alphas to be higher for mathematics and languages than for the humanities and social sciences (Hypothesis 11). Figure 10 shows that tests assessing knowledge in languages or mathematics had higher Alphas than tests assessing knowledge in social sciences/humanities or science. This visual impression was confirmed by a meta-regression indicating significantly higher Alphas for languages and mathematics than for the reference category science (see Table 2). The Alphas were also significantly larger for language (t = 5.35, p < 0.001) and mathematics (t = 2.07, p = 0.044) than for social sciences/humanities when the latter was used as the reference category.
Predicted Alphas in Future Studies
As a final result of our analyses, Table 3 presents prediction intervals for the Alphas that can be expected for future tests of differing lengths, and Table 2 presents prediction intervals for tests with different characteristics on the moderator variables. As visible from these tables, the prediction intervals were generally very wide, reflecting the large estimates of between-study heterogeneity in effect sizes presented in Table 2.
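As an illustration of how such prediction intervals can be obtained, the following hedged R sketch continues the toy example from above: an intercept-only random-effects model is fitted to the transformed Alphas, and predict() returns a 90% prediction interval (cf. Riley et al., 2011) back-transformed to the Alpha scale. The values in Tables 2 and 3 come from our actual models, not from this sketch.

res0 <- rma(yi, vi, data = dat)                  # intercept-only random-effects model on the toy data
predict(res0, level = 90, transf = transf.iabt)  # pi.lb and pi.ub give the 90% prediction interval on the Alpha scale
# On the transformed scale, the interval is approximately
# coef(res0) +/- qt(.95, df = res0$k - 2) * sqrt(res0$tau2 + res0$se^2)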
Discussion
Theoretically, knowledge is commonly described as a heterogeneous construct with a dynamically changing structure. What does this mean for the internal consistency of knowledge tests? Our meta-analysis showed that the Cronbach's Alpha of knowledge tests in published studies is rather high, 0.85 on average, with a 90% CI ranging from 0.82 to 0.87. The expected intercorrelation between any two knowledge items appears much lower, at rexp = 0.22. Alphas varied considerably between studies, resulting in large prediction intervals. The findings appear to be distorted by a distributional bias, particularly an underrepresentation of small Alphas for short knowledge tests in published studies. Alpha increased from before to after instruction and over time in longitudinal studies. It was higher for younger participants than for older ones, for open-answer items than for multiple-choice items, and for languages and mathematics compared to science and social sciences/humanities. In the following, we first discuss in which ways our findings are biased and then turn to the mean Alpha, the moderator effects, and their implications.
The Alphas of Knowledge Tests in Published Studies are Biased
We had hypothesized that the distribution of Alphas in the meta-analyzed publications would be biased because knowledge tests with Alphas below 0.70 are difficult to publish (Hypothesis 3). Several meta-analytic findings are in line with this hypothesis. First, when we projected the Alphas onto a linear scale, there was an asymmetry in their distribution, indicating a lack of low Alphas. Such an underrepresentation of low effect sizes in a meta-analysis usually indicates publication bias (Zhu & Carriere, 2018). We had expected this bias under the assumption that editors and reviewers tend to reject manuscripts with Alphas smaller than 0.70 based on published guidelines for test quality. As explained in the introduction, this is bad practice because a high Alpha can only be expected for homogeneous constructs (e.g., cognitive skills), which knowledge is not, at least not always. Thus, the selective publication of high Alphas for knowledge tests creates the erroneous impression that knowledge tests have a higher internal consistency, and that learners' knowledge is more homogeneous, than is actually the case.
Second, there was another bias in the distribution of Alphas. Many more Alphas were just above 0.70, 0.80, and 0.90 than just below these values. This bias is likely caused by researchers who tried to increase the Alphas of their tests from below to above these values, or again by publication bias. Researchers can increase Alpha through at least three strategies: they can exclude items with low item-total correlations, they can reformulate their items to target a narrower range of content, or they can add more items to the test. All of these strategies have disadvantages. The first two strategies make the items more similar, which bears the danger that they no longer span the full range of the construct to be assessed. For example, an Algebra test with items only on functions likely has a higher Alpha than an Algebra test with items on functions and equation solving. However, an Algebra test including only items on functions does not cover the whole domain of Algebra and thus has lower content validity. This problem is known as the reliability–validity trade-off in test construction (e.g., Steger et al., 2023). When researchers add items to a test for the sole reason of increasing Alpha, they waste test time and other resources.
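To illustrate the first of these strategies, the following R sketch (with simulated data, not data from our meta-analysis) shows the statistics that researchers typically consult when dropping items: corrected item-total correlations and "Alpha if item dropped" as reported by the psych package.

library(psych)

# Simulate five items loading on one knowledge dimension; the last item is only
# weakly related and would be a typical candidate for removal.
set.seed(1)
n     <- 300
theta <- rnorm(n)
items <- sapply(c(.7, .7, .6, .6, .2),
                function(l) l * theta + rnorm(n, sd = sqrt(1 - l^2)))

out <- psych::alpha(as.data.frame(items))
out$total$raw_alpha        # overall Alpha of the five items
out$item.stats$r.drop      # corrected item-total correlations
out$alpha.drop$raw_alpha   # Alpha if each item were dropped in turn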
The third piece of evidence indicating a bias in the distribution of Alphas resulted from plotting Alpha against test length (i.e., item number). We did this once for the empirically found Alphas and once for the meta-analytically found mean Alpha of 0.85 projected to different test lengths using the Spearman–Brown formula. A comparison of these two curves showed an underrepresentation of short tests with low Alphas in the published studies. This finding further corroborates the assumption that researchers modify their tests to increase their Alphas. Such modifications are easier for shorter than longer tests. Adding an item to a short test increases Alpha more than adding an item to a long test, and it is easier to construct a small than a large set of homogeneous items.
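For readers who want to reproduce this projection, a minimal R sketch of the Spearman–Brown formula applied to the meta-analytic mean Alpha is given below; the chosen test lengths are arbitrary examples.

# Project the meta-analytic mean Alpha (.85 at the median length of 20 items)
# to other test lengths with the Spearman-Brown prophecy formula.
spearman_brown <- function(rel, m) m * rel / (1 + (m - 1) * rel)

k_ref     <- 20     # median test length in the meta-analysis
alpha_ref <- 0.85   # meta-analytic mean Alpha at that length
k_new     <- c(5, 10, 20, 40, 80)
round(spearman_brown(alpha_ref, m = k_new / k_ref), 2)
# Under these assumptions, a 5-item test would be expected to reach Alpha of about .59
# and an 80-item test about .96, if its items behave like those of a 20-item test.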
In contrast to the three rather novel methods we used to identify bias in the distribution of Alphas, the more traditional funnel plot indicated some bias in the opposite, unexpected direction. We suppose this occurred because plotting effect sizes against the inverse of their standard errors is not informative for indices such as Alpha, which do not relate to their standard errors in the way that conventional effect sizes do (Sánchez-Meca et al., 2021).
Overall, there was converging evidence of bias in the distribution of published Alphas. Thus, our meta-analytic results do not indicate precisely how high the Alpha of knowledge tests actually is. We cannot correct our estimates for these biases, as to our knowledge such methods do not yet exist for indices of internal consistency. Subjectively, the increased number of Alphas just above cut-offs seems moderate, whereas the lack of short tests with low Alphas seems more pronounced. Likely, the actual average Alpha of knowledge tests is lower than the one found in this meta-analysis. Our results do still provide an initial basis for evaluating new knowledge tests' adequacy and internal consistency. Specifically, in Table 3, we provide estimates of prediction intervals for the Alphas of knowledge tests with different lengths. As visible from Table 3, the prediction intervals are very broad, reflecting the high heterogeneity of Alphas that we found in our sample. Despite evidence for publication bias of unknown extent, the width of the prediction intervals provides a first indication of the great heterogeneity among knowledge tests.
We are not in favor of guidelines for Alphas or similar measures for reasons we outline in the implications section, but if researchers intend to use guidelines, our empirically derived estimates might be a step forward in comparison to earlier arbitrary cut-offs (e.g., Nunnally, 1978). Given the publication bias of unknown magnitude in our sample, we suggest that researchers who use our guidelines aim at reaching at least the lower boundaries of these prediction intervals with the Cronbach’s Alphas of their knowledge tests. In addition, for tests with specific characteristics, the prediction intervals from the moderator analyses (Table 2) provide a first basis for guidelines for typical Cronbach’s Alphas that researchers can consult. These guidelines should, however, be updated in future research collecting larger samples of Cronbach’s Alphas or other indices relating to internal consistency. In addition, methods are required to correct for publication bias or test construction biases in the distributions of Cronbach’s Alphas.
Explanations for the High Mean Alpha
The mean Alpha for published domain-specific knowledge tests in this meta-analysis was 0.85. This value is higher than expected (Hypothesis 1) and lies far above the typically recommended cut-off of 0.70. It is also higher than most Alphas of measures of individual differences, which lay between 0.70 and 0.80 in Greco et al.'s (2018) review. We see at least five possible explanations for this unexpected finding; these explanations are not mutually exclusive. First, as explained above, Alpha was biased upward by publication bias. Second, this problem might have been exacerbated by researchers modifying their tests until their Alphas exceeded 0.70, 0.80, or 0.90.
Third, it is possible that the knowledge in the meta-analyzed studies was less heterogeneous than expected. Previous studies (e.g., Edelsbrunner et al., 2018; Schneider & Hardy, 2013) showed that knowledge can be fragmented and heterogeneous. However, there is a lack of large-scale studies with representative samples showing how often and how strongly knowledge is fragmented and heterogeneous in learners. Latent profile transition analyses show strong interindividual differences in knowledge fragmentation at each point in time as well as intraindividual changes over time. For example, Schneider and Hardy (2013) found that 35% of their sample had fragmented knowledge before instruction, 25% after instruction, and 20% at follow-up. Accordingly, the claim that knowledge is always highly fragmented would be overblown. When interpreting the Alphas of knowledge tests, it is important to consider interindividual and intraindividual differences in the prevalence of fragmented knowledge.
Fourth, it is possible that knowledge was more homogeneous in the meta-analyzed studies than in general due to a selection bias in the included studies. We re-analyzed data from Simonsmeier et al. (2022), who had included only studies measuring knowledge longitudinally at least twice. These studies are only a small subset of all studies measuring knowledge. They are not representative of studies with only one measurement point or of unpublished studies. As we explain below, synthesizing these effect sizes remains an important but challenging task for future research.
Finally, and qualifying the previous points to some degree, the Alphas found in our meta-analysis were also relatively high because Alpha increases with test length, and the tests in our meta-analysis were relatively long, with a median length of 20 items. As hypothesized, test length was a moderator of the Alphas. When Alpha was transformed into the expected mean item intercorrelation, which does not depend on test length, it was only rexp = 0.22, in accordance with our expectation of rexp < 0.30. This correlation is relatively weak, considering that it represents the relation between items that are meant to measure the same construct. Moreover, this estimate is a mean value, implying that the (sample-size weighted) item intercorrelation was even lower in a good share of the included studies. Thus, even though we found a relatively high mean Alpha, our findings support the view that knowledge can be integrated as well as fragmented, so that the items of a knowledge test sometimes intercorrelate strongly and sometimes weakly (Schiefer et al., 2022; Stadler et al., 2021; Taber, 2018). We discuss in the implications section what to make of this overall incoherent situation.
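To make the relation between these two numbers transparent, the following short R sketch shows the standard conversion between the mean inter-item correlation and (standardized) Alpha that underlies this interpretation; with the median test length of 20 items, a mean item intercorrelation of about .22 indeed corresponds to an Alpha of about .85.

# Standardized Alpha of a test with k items and mean inter-item correlation r_bar:
#   alpha = k * r_bar / (1 + (k - 1) * r_bar), and its inverse.
alpha_from_r <- function(r_bar, k) k * r_bar / (1 + (k - 1) * r_bar)
r_from_alpha <- function(alpha, k) alpha / (k - alpha * (k - 1))

r_from_alpha(alpha = 0.85, k = 20)   # ~0.22: the expected item intercorrelation
alpha_from_r(r_bar = 0.22, k = 20)   # ~0.85: back to the meta-analytic mean Alpha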
Effects of the Response Format
An additional exploratory moderator analysis showed that Alphas were higher for knowledge tests with open-answer items than for knowledge tests with multiple-choice items. This difference remained statistically significant when controlling for the number of items, which might be higher in tests with multiple-choice items. A potential reason for the higher Alpha of tests with open-answer items is that these items can yield more detailed information than multiple-choice items and thus might better indicate to what extent learners' knowledge is integrated and homogeneous (Lee et al., 2011). Prior research on the effects of the response format has often found inconsistent results regarding its influence on reliability and validity (Martinez, 1999). Within Likert-scale formats, and mostly for different kinds of personality questionnaires, research rather consistently shows that having more response options (e.g., seven instead of three Likert-scale options) increases reliability and correlations with external criteria (Haslbeck et al., 2024; Simms et al., 2019). Regarding knowledge tests, Goecke et al. (2022) found a perfect correlation between multiple-choice and open-response items probing fact knowledge, but the authors did not compare internal consistency estimates across the response formats. Rodriguez (2003) found in a meta-analysis that multiple-choice and open-response tests correlated perfectly when the same item stems were used, but that meta-analysis did not focus on knowledge tests. Future studies should examine the effect of different response formats experimentally by varying only the response format in otherwise isomorphic knowledge tests. This would ensure that higher Alphas of open-response formats are not caused by confounding characteristics, such as multiple-choice items potentially covering broader content areas.
Learning and Development as Moderators
In line with our expectations (Hypothesis 7), the internal consistency of learners’ answers to knowledge tests increased through learning. This finding aligns with theories proposing that knowledge integration is a central aim of instruction (Linn, 2000). The finding also dovetails with empirical studies indicating increased knowledge integration through instruction (Edelsbrunner et al., 2018; Schneider & Hardy, 2013; Schneider et al., 2011). Apparently, the interventions in the studies included in our meta-analysis succeeded in integrating students’ knowledge so that their answer patterns became more internally consistent. This finding illustrates the often-forgotten point that Alpha is not a fixed characteristic of a test (e.g., Sijtsma, 2009). The reason is that Alpha does not indicate the internal consistency of the items but the internal consistency of participants’ responses to the items. Participants’ answers to the items can differ between samples and change through learning. Thus, it is important to report Alpha not for a test but for a sample taking a test in a specific situation (Parsons et al., 2019). In addition, this finding points to the relevance of inspecting and modeling intervention effects not only on quantitative (e.g., mean) change but also on the internal structure of knowledge tests, for example with latent variable approaches (Schneider & Stern, 2010). Care should be taken, however, when interpreting factor-analytic studies from a theoretical perspective, as knowledge might be more congruent with formative construct modeling than with reflective (i.e., factor) constructs (Stadler et al., 2021). Fortunately, latent variable methods have recently been extended to integrate formative constructs into classical modeling frameworks such as structural equation modeling (Schuberth, 2021; Yu et al., 2023). Formative measurement is demanding because it requires including all items that cover the relevant conceptual space. Yet, thinking of knowledge constructs as formative might help develop new approaches to their measurement. For example, integrating domain-sampling theory, in which knowledge items are sampled from the conceptual space but the underlying construct is still considered reflective, with formative measurement conceptualizations might yield a novel approach that combines those meta-theoretical assumptions of both frameworks that are appropriate for knowledge constructs. A recent proposal goes in that direction and might be applicable to knowledge constructs: it introduces a measurement theory which acknowledges that the assumptions of reflective and formative measurement are both fallible and that psychological constructs are multidimensional in nature (VanderWeele, 2022).
A related approach that we could not pursue in this study is juxtaposing changes in Alphas with changes in means or other distributional characteristics of knowledge. For example, future studies may examine whether learners who gain more knowledge over time also show stronger increases in internal consistency, since the amount of knowledge acquisition may, on average, correlate positively with the increase in knowledge integration. Of the studies included in our meta-analysis, only very few reported Alphas together with descriptive statistics at pretest and posttest that would allow such an analysis.
We had expected no increase in Alphas over the measurement points of longitudinal studies (Hypothesis 8) because it has been suggested that knowledge integration requires instruction and rarely happens spontaneously (Linn, 2000). Against this expectation, the Alphas increased over time in longitudinal studies. A potential explanation is that the study participants might have been exposed to relevant formal or informal instruction between the measurement points of the longitudinal studies, for example, school lessons or internet videos, which fostered knowledge integration. Another explanation might be (re-)testing effects that prompted participants to process and integrate their knowledge (Rowland, 2014; Wilson et al., 2006). Overall, this finding raises the hypothesis that knowledge tends to become more homogeneous with development. Future research should test the robustness of this phenomenon and examine explanatory factors, such as whether individuals undergo further topic-related instruction or use the respective knowledge frequently in their daily lives.
Sample mean age had a negative moderating effect on Alpha. Against Hypothesis 9, higher age was associated with lower Alphas. One explanation is that, after instruction, selective forgetting slowly but steadily increases knowledge fragmentation again. Another explanation is that, throughout their lives, people constantly learn new things that are not yet integrated with their prior knowledge. A third, related explanation is that knowledge tests for older learners might target more complex content (e.g., Algebra in seventh grade and higher) than knowledge tests for younger learners (e.g., counting in preschool or whole-number arithmetic in primary school). Of note, the reviewed studies did not cover samples with a mean age beyond 25 years. Thus, our findings are not representative of populations beyond the tertiary education sector.
Knowledge Type and Content Domain as Moderators
We found that the internal consistency of knowledge tests was not significantly moderated by knowledge type, with the R2 estimate and the confidence intervals indicating that Alphas were rather similar across the levels of this moderator. In the literature, conceptual knowledge is described as more relational than fact knowledge and procedural knowledge. Accordingly, we had expected higher Alphas for conceptual knowledge than for the other knowledge types (Hypothesis 10). This was not the case. This finding could be due to the problem that behavior can be influenced by several types of knowledge simultaneously, making it difficult to measure knowledge types in isolation (cf. Bittermann et al., 2023). Another possibility is that the conceptual knowledge was partly fragmented or that fact knowledge and procedural knowledge were more heterogeneous than expected. This would be in line with approaches emphasizing that procedural knowledge can be rich in relations when learners reflect on their solution procedures and connect them to each other and to their conceptual understanding (Star, 2005).
In accordance with Hypothesis 11, we found higher internal consistencies for knowledge in mathematics and languages than for science and social sciences/humanities. This finding is in accordance with general descriptions of mathematics as a well-structured domain (Ball et al., 2005) and social sciences and humanities as less well-structured domains (Buehl et al., 2002).
Limitations
Our meta-analysis is limited in that we used the data obtained by Simonsmeier et al. (2022) instead of conducting a comprehensive literature search including published and gray literature. The meta-analysis only included studies with at least two measurement points, and not all included studies reported the Alphas for both points in time. This limits the breadth and generalizability of our findings. Still, as described above, even without being able to analyze unpublished effect sizes, we found firm evidence for a publication bias in the published effect sizes. Overall, we see our meta-analysis as a limited and preliminary but still useful steppingstone for more comprehensive or more direct investigations of the internal consistency of knowledge tests.
We meta-analyzed Cronbach’s Alphas as an indicator of the internal consistency of interindividual differences. The usefulness of this approach depends on the homogeneity of the item correlations; for tests with large variance in the interrelations between items, Alpha may not summarize the item relations well. We therefore recommend reporting the standard deviation of the inter-item correlations in addition to Alpha to help readers evaluate whether Alpha is a good summary of the item intercorrelations (see the sketch below). A complementary approach may be modeling idiographic knowledge networks to examine internal consistency within individuals over time (Chaku & Beltz, 2022; Mansueto et al., 2023). This would require intensive data collection methods to monitor knowledge development over time. Such an approach may be problematic because it may induce testing effects, but elaborate experimental designs using different items across measurement waves (Little & Rhemtulla, 2013) or microgenetic designs (Kuhn et al., 1992; Siegler, 2007) may allow disentangling retest effects from individual development. Related to retest effects, we also do not know whether and to what extent the findings regarding changes of Alphas throughout interventions and development may be explained by such effects. This could be examined by including studies without pretests in future reviews or by conducting experimental studies that examine retest effects with Solomon designs (Roozenbeek et al., 2021).
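As a hedged illustration of this recommendation (using the simulated items matrix from the earlier sketch, or any person-by-item matrix of scored responses), the mean and spread of the inter-item correlations can be reported alongside Alpha:

R     <- cor(items, use = "pairwise.complete.obs")   # inter-item correlation matrix
r_vec <- R[lower.tri(R)]                             # correlations of the unique item pairs
round(c(mean_r = mean(r_vec), sd_r = sd(r_vec),
        min_r = min(r_vec), max_r = max(r_vec)), 2)  # report next to Alpha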
In the reviewed studies, researchers usually did not mention how they computed Alpha. Cronbach (1951) mentioned that point–tetrachoric correlations may not be appropriate for the computation of Alpha based on dichotomous items. In cases in which inappropriate correlation matrices were used to compute Alphas, the resulting estimates may have been biased. We suggest reporting the exact approach used to compute Alpha so that such bias can be identified.
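As a hedged illustration of why the computational basis matters for dichotomous items (simulated data, not data from our meta-analysis), Alpha computed from the observed 0/1 item scores can differ noticeably from Alpha computed from a tetrachoric correlation matrix:

library(psych)

# Simulate eight dichotomous items from a single latent dimension.
set.seed(2)
n     <- 500
theta <- rnorm(n)
bin_items <- sapply(rep(0.6, 8), function(l)
  as.numeric(l * theta + rnorm(n, sd = sqrt(1 - l^2)) > 0))

alpha_phi   <- psych::alpha(bin_items)$total$raw_alpha                         # from observed 0/1 scores
alpha_tetra <- psych::alpha(psych::tetrachoric(bin_items)$rho)$total$std.alpha # from tetrachoric correlations
c(alpha_phi = alpha_phi, alpha_tetra = alpha_tetra)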
Implications of the Meta-Analytic Findings
Although it is only a first steppingstone and requires replication and extension to broader populations of tests and test-takers, our meta-analysis has at least six implications for research. First, the strong evidence for bias in the published Alphas shows that current standards requiring all tests to have Alphas greater than 0.70 should be modified. For heterogeneous and dynamically changing constructs such as knowledge, a fixed cut-off may be of limited use or even harmful because it is at odds with theoretical conceptions of the construct, hinders empirical research, increases the file-drawer problem in research syntheses, and falsely incentivizes researchers to construct tests that are either overly long or so homogeneous that they do not cover the construct in its breadth (Revelle, 2024).
But what should applied researchers who are under pressure to report the psychometric characteristics of their knowledge tests do with this recommendation? A first step is to confidently argue that estimates of internal consistency, such as Alpha or Omega (Dunn et al., 2014), are of limited value for knowledge tests. References such as Taber (2018), Stadler et al. (2021), and the current article support this statement and summarize various theoretical and empirical arguments in this regard. Instead of insisting on the reporting and interpretation of Alpha or similar statistics, reviewers and editors should focus on validity evidence for the individual items (e.g., cognitive surveys) that a knowledge test is composed of and for its overall score (e.g., convergent and divergent validities). A second step is to report alternative psychometric characteristics of knowledge tests. Stadler et al. (2021) suggested the Variance Inflation Factor (VIF) in this regard: the VIF indicates whether and to what extent individual knowledge items are redundant with the other items, implying redundancy in covering the knowledge construct of interest. Relying on this index will typically decrease internal consistency because it suggests removing items that are highly correlated with the other items.
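The following hedged R sketch illustrates one way such item-level VIFs can be computed (again using the simulated items from above); it is our illustration of the general idea, not code from Stadler et al. (2021): each item is regressed on all other items, and VIF = 1 / (1 − R²).

item_vif <- function(items) {
  items <- as.data.frame(items)
  sapply(seq_along(items), function(j) {
    r2 <- summary(lm(items[[j]] ~ ., data = items[-j]))$r.squared
    1 / (1 - r2)                 # large values flag items that are largely redundant
  })
}
round(item_vif(items), 2)
# Equivalent shortcut via squared multiple correlations: 1 / (1 - psych::smc(cor(items)))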
Alpha is often reported as if it were a fixed characteristic of a test. Methodologists urge researchers to instead report Alpha and similar statistics for tests in specific samples (Bonett & Wright, 2015; Sijtsma, 2009). Our results suggest that, for knowledge and other constructs affected by learning, researchers need to report Alpha for a test (characterized by item number and response format), a population (characterized, for example, by age), and a point in the learning process (e.g., whether it is the pretest, the posttest, or the nth measurement point of a longitudinal study).
Researchers should report both Alpha (or Omega; Dunn et al., 2014) and the mean item intercorrelation for their knowledge tests (and perhaps also the variation in these correlations to capture heterogeneity; see Cortina, 1993). This makes it easier to interpret empirically obtained Alphas and to decide what to do with tests with low Alphas. If a test exhibits a moderate estimate of Alpha, researchers should compare the average intercorrelation of their items with theoretical expectations. If the items intercorrelate within the expected range, further refinement might not be needed. The average item intercorrelation, then, is an index of validity rather than reliability: If the intercorrelations between items turn out as theoretically expected, this lends credibility both to the theory and to the assessment instrument. Importantly, dropping or replacing items despite theoretically plausible intercorrelations should be avoided because it leads to a distorted picture of the internal consistency of constructs. The item intercorrelation on its own does not have as direct a connection to reliability theory as Cronbach’s Alpha. However, as argued and shown recently (Neubauer & Hofer, 2022; Schiefer et al., 2022; Stadler et al., 2021; Taber, 2018), even tests with low Alphas can exhibit high test–retest reliability and provide generalizable results. Obtaining reliable estimates of retest reliability with knowledge tests is difficult because knowledge can change over time and through retest effects (Francis et al., 2020), although different approaches exist to control for retest effects (e.g., Broitman et al., 2020). If researchers are concerned with the reliability of their knowledge tests, they should plan retest designs in which retest reliability can be examined. Although such designs are not easy to implement because retest effects might distort the resulting reliability indices, this issue might be circumvented by a planned missing data design in which each participant receives only a part of the items at each measurement point. This would allow suppressing retest effects for items that participants work on only once, while estimating retest effects for items they work on twice or more often, so that such effects can be statistically corrected for.
We interpreted Alpha as an indirect measure of the fragmentation or homogeneity/heterogeneity of knowledge (for discussions of these interpretations of Alpha, see Cortina, 1993; Revelle & Zinbarg, 2009). However, as shown, Alpha also reflects item number, response format, content domain, learner age, and other variables. Therefore, more direct and standardized measures of the degree of fragmentation or integration of knowledge are needed to cross-validate the current findings. A recent literature review concluded that such measures are lacking (Bittermann et al., 2023). Their construction thus remains an important task for future research. Researchers should also explore to what extent multivariate statistical models, including latent variables, latent transition analysis (Hickendorff et al., 2018), and psychometric network analysis (Isvoranu et al., 2022), can be used to trace knowledge structures and their changes over time. As a starting point in this regard, in the current meta-analysis we computed and interpreted the expected item intercorrelation as a proxy for the homogeneity/heterogeneity of knowledge. In theorizing about how strong item intercorrelations should be, a helpful consideration is the distinction between more reflective and more formative constructs. Reflective constructs are usually represented by traditional latent variables, that is, latent factors. For these constructs, indicators are interchangeable, meaning that all items of a test point to the same psychological construct that determines how participants answer knowledge items (White et al., 2024). For knowledge constructs, this assumption seems unrealistic in most cases. An exception might be items that cover understanding of the same abstract concept, such as Newton’s laws. Tests that assess understanding of Newton’s laws typically exhibit moderate to high internal consistencies as indicated by Alpha or Omega, even in short test versions (Hofer et al., 2017). For knowledge constructs of this kind, where understanding of the same concept or procedure is required to solve different items, higher Alphas should be expected because, if learners have acquired the underlying concept or procedure, their answers across all items should tend to improve, in line with reflective measurement. For knowledge tests that cover different kinds of concepts, procedures, or facts, the expected item intercorrelations are lower, and these tests should be seen as providing formative index variables whose items do not all point to the same psychological trait but are connected by the common meaning that we give them (e.g., mathematics tests that cover a broad range of items; Baird et al., 2017). Domain sampling theory may provide another informative perspective on constructs of this kind. In this approach, items are sampled from a typically very broad item pool that covers a construct (van Bork et al., 2024). Yet, domain sampling theory, like classical test theory and item response theory, assumes that items must correlate to capture the same construct. Knowledge space theory, in contrast, assumes correlations only in cases in which the knowledge captured by one item functions as a prerequisite for solving other items (Cosyn et al., 2021; de Chiusole et al., 2020, 2024; Stefanutti et al., 2020).
We suggest that researchers choose the psychometric approach that best represents the meta-theoretical assumptions they ascribe to the knowledge construct they aim to measure (Edelsbrunner, 2022).
We re-analyzed an existing dataset. Future studies could use text-mining approaches to find and synthesize as many Alphas of knowledge tests as possible. This might decrease the data quality compared to the carefully hand-coded data analyzed in our study, but it would considerably broaden the database.
We showed that publication bias is more difficult to investigate for Alphas than for more common types of effect sizes because Alphas are not normally distributed and depend on test length but show no direct relation to standard errors. Researchers should therefore be cautious when applying and interpreting funnel plots and similar methods for investigating publication bias in Cronbach's Alphas. Future studies could employ mixture meta-regression (Beath, 2014) to statistically model in more detail which Alphas might have been subject to biases in test construction and which might belong to the unbiased component of the distribution.
Conclusion
Although Cronbach’s Alpha promised to be an easy-to-use and easy-to-interpret indicator when it was introduced in the 1950s (Cronbach & Shavelson, 2004), its widespread application has led to distorted expectations for the internal consistency of knowledge tests. Researchers should develop careful theoretical and formal models (Robinaugh et al., 2021; Smaldino, 2020) that allow them to derive expectations regarding the internal consistency of their tests and to check whether the intercorrelations of their items are in accordance with these expectations. Guidelines based on the internal consistencies of published studies might help in developing such theories, and we have provided a first steppingstone in this regard. Future research should aim at extending the empirical basis regarding the internal consistency of knowledge tests and consider that internal consistencies are not just part of reliability theory but also a tool that informs us about the structural characteristics and validity of our tests and of the underlying construct, that is, knowledge.
Data Availability
The analytic data and scripts are available at https://osf.io/7ygc5.
References
Ackerman, P. L., & Beier, M. E. (2006). Determinants of domain knowledge and independent study learning in an adult sample. Journal of Educational Psychology, 98(2), 366–381. https://doi.org/10.1037/0022-0663.98.2.366
Agarwal, P. K. (2019). Retrieval practice & Bloom’s taxonomy: Do students need fact knowledge before higher order learning? Journal of Educational Psychology, 111(2), 189. https://doi.org/10.1037/edu0000282
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036–1060. https://doi.org/10.1037/0033-295X.111.4.1036
Anderson, J. R. (2020). Cognitive psychology and its implications (9th ed.). Macmillan Learning.
Baird, J. A., Andrich, D., Hopfenbeck, T. N., & Stobart, G. (2017). Assessment and learning: Fields apart? Assessment in Education: Principles, Policy & Practice, 24(3), 317–350. https://doi.org/10.1080/0969594X.2017.1319337
Ball, D. L., Hill, H. C., & Bass, H. (2005). Knowing mathematics for teaching: Who knows mathematics well enough to teach third grade, and how can we decide? American Educator, 29(1), 14–17. Retrieved from https://deepblue.lib.umich.edu/handle/2027.42/65072
Barnett, S. M., & Ceci, S. J. (2002). When and where do we apply what we learn?: A taxonomy for far transfer. Psychological Bulletin, 128(4), 612–637. https://doi.org/10.1037/0033-2909.128.4.612
Beath, K. J. (2014). A finite mixture method for outlier detection and robustness in meta-analysis. Research Synthesis Methods, 5(4), 285–293. https://doi.org/10.1002/jrsm.1114
Bittermann, A., McNamara, D., Simonsmeier, B. A., & Schneider, M. (2023). The landscape of research on prior knowledge and learning: A bibliometric analysis. Educational Psychology Review, 35(2), 58. https://doi.org/10.1007/s10648-023-09775-9
Bland, J. M., & Altman, D. G. (1997). Statistics notes: Cronbach's alpha. BMJ, 314(7080), 572.
Bonett, D. G. (2002). Sample size requirements for testing and estimating coefficient alpha. Journal of Educational and Behavioral Statistics, 27(4), 335–340. https://doi.org/10.3102/10769986027004335
Bonett, D., & Wright, T. (2015). Cronbach’s alpha reliability: Interval estimation, hypothesis testing, and sample size planning. Journal of Organizational Behavior, 36, 3–15. https://doi.org/10.1002/JOB.1960
Borenstein, M., Higgins, J. P. T., Hedges, L. V., & Rothstein, H. R. (2017). Basics of meta-analysis: I2 is not an absolute measure of heterogeneity. Research Synthesis Methods, 8(1), 5–18. https://doi.org/10.1002/jrsm.1230
Broitman, A. W., Kahana, M. J., & Healey, M. K. (2020). Modeling retest effects in a longitudinal measurement burst study of memory. Computational Brain & Behavior, 3(2), 200–207. https://doi.org/10.1007/s42113-019-00047-w
Buehl, M. M., Alexander, P. A., & Murphy, P. K. (2002). Beliefs about schooled knowledge: Domain specific or domain general? Contemporary Educational Psychology, 27(3), 415–449. https://doi.org/10.1006/ceps.2001.1103
Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster-robust inference. Journal of Human Resources, 50(2), 317–372. https://doi.org/10.3368/jhr.50.2.317
Chaku, N., & Beltz, A. M. (2022). Using temporal network methods to reveal the idiographic nature of development. Advances in Child Development and Behavior, 62, 159–190. Elsevier. https://doi.org/10.1016/bs.acdb.2021.11.003
Clifton, J. D. W. (2020). Managing validity versus reliability trade-offs in scale-building decisions. Psychological Methods, 25(3), 259–270. https://doi.org/10.1037/met0000236
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98. https://doi.org/10.1037/0021-9010.78.1.98
Cosyn, E., Uzun, H., Doble, C., & Matayoshi, J. (2021). A practical perspective on knowledge space theory: ALEKS and its data. Journal of Mathematical Psychology, 101, 102512. https://doi.org/10.1016/j.jmp.2021.102512
Crano, W. D., Brewer, M. B., & Lac, A. (2014). Principles and methods of social research (3rd ed.). Routledge.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. https://doi.org/10.1007/BF02310555
Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64(3), 391–418. https://doi.org/10.1177/0013164404266386
Crooks, N. M., & Alibali, M. W. (2014). Defining and measuring conceptual knowledge in mathematics. Developmental Review, 34(4), 344–377. https://doi.org/10.1016/j.dr.2014.10.001
Currie, M., & Chiramanee, T. (2010). The effect of the multiple-choice item format on the measurement of knowledge of language structure. Language Testing, 27, 471–491. https://doi.org/10.1177/0265532209356790
de Jong, T., & Ferguson-Hessler, M. G. M. (1996). Types and qualities of knowledge. Educational Psychologist, 31(2), 105–113. https://doi.org/10.1207/s15326985ep3102_2
de Chiusole, D., Stefanutti, L., Anselmi, P., & Robusto, E. (2020). Stat-Knowlab. Assessment and learning of statistics with competence-based knowledge space theory. International Journal of Artificial Intelligence in Education, 30, 668–700. https://doi.org/10.1007/s40593-020-00223-1
de Chiusole, D., Granziol, U., Spoto, A., & Stefanutti, L. (2024). Reliability of a probabilistic knowledge structure. Behavior Research Methods, 56(7), 8022–8037. https://doi.org/10.3758/s13428-024-02468-3
De Deyne, S., Navarro, D. J., Perfors, A., & Storms, G. (2016). Structure at every scale: A semantic network account of the similarities between unrelated concepts. Journal of Experimental Psychology: General, 145(9), 1228–1254. https://doi.org/10.1037/xge0000192
Deeks, J. J., Higgins, J. P. T., & Altman, D. G. (2008). Analysing data and undertaking meta-analyses. In J. P. T. Higgins & S. Green (Eds.), Cochrane handbook for systematic reviews of interventions: Cochrane Book Series (pp. 243–296). Wiley.
diSessa, A. A., Gillespie, N. M., & Esterly, J. B. (2004). Coherence versus fragmentation in the development of the concept of force. Cognitive Science, 28(6), 843–900. https://doi.org/10.1207/s15516709cog2806_1
Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105(3), 399–412. https://doi.org/10.1111/bjop.12046
Duval, S., & Tweedie, R. (2000). A nonparametric "trim and fill" method of accounting for publication bias in meta-analysis. Journal of the American Statistical Association, 95(449), 89–98. https://doi.org/10.1080/01621459.2000.10473905
Edelsbrunner, P. A. (2022). A model and its fit lie in the eye of the beholder: Long live the sum score. Frontiers in Psychology, 13, 986767. https://doi.org/10.3389/fpsyg.2022.986767
Edelsbrunner, P. A., & Thurn, C. M. (2024). Improving the utility of non-significant results for educational research: A review and recommendations. Educational Research Review, 42, 100590. https://doi.org/10.1016/j.edurev.2023.100590
Edelsbrunner, P. A., Schalk, L., Schumacher, R., & Stern, E. (2018). Variable control and conceptual change: A large-scale quantitative study in elementary school. Learning and Individual Differences, 66, 38–53. https://doi.org/10.1016/j.lindif.2018.02.003
Edelsbrunner, P. A., Schumacher, R., & Stern, E. (2022). Children's scientific reasoning in light of general cognitive development. In O. Houdé & G. Borst (Eds.), The Cambridge handbook of cognitive development (pp. 585–605). Cambridge University Press.
Ellis, J. L., & Sijtsma, K. (2024). Proof of reliability convergence to 1 at rate of Spearman-Brown formula for random test forms and irrespective of item pool dimensionality. Psychometrika, 89(3), 1–22. https://doi.org/10.1007/s11336-024-09956-7
Falk, C. F., & Savalei, V. (2011). The relationship between unstandardized and standardized alpha, true reliability, and the underlying measurement model. Journal of Personality Assessment, 93(5), 445–453. https://doi.org/10.1080/00223891.2011.594129
Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture, 33, 503–15. https://doi.org/10.23637/rothamsted.8v61q
Flaig, M., Simonsmeier, B. A., Mayer, A.-K., Rosman, T., Gorges, J., & Schneider, M. (2018). Conceptual change and knowledge integration as learning processes in higher education: A latent transition analysis. Learning and Individual Differences, 66, 92–104. https://doi.org/10.1016/j.lindif.2018.07.001
Francis, A. P., Wieth, M. B., Zabel, K. L., & Carr, T. H. (2020). A classroom study on the role of prior knowledge and retrieval tool in the testing effect. Psychology Learning & Teaching, 19(3), 258–274. https://doi.org/10.1177/1475725720924872
Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505. https://doi.org/10.1126/science.1255484
Garner, K. M. (2024). The forgotten trade-off between internal consistency and validity. Multivariate Behavioral Research, 59(3), 656–657. https://doi.org/10.1080/00273171.2024.2310429
Gignac, G. E., & Szodorai, E. T. (2016). Effect size guidelines for individual differences researchers. Personality and Individual Differences, 102(11), 74–78. https://doi.org/10.1016/j.paid.2016.06.069
Goecke, B., Staab, M., Schittenhelm, C., & Wilhelm, O. (2022). Stop worrying about multiple-choice: Fact knowledge does not change with response format. Journal of Intelligence, 10. https://doi.org/10.3390/jintelligence10040102
Goldwater, M. B., & Schalk, L. (2016). Relational categories as a bridge between cognitive and educational research. Psychological Bulletin, 142(7), 729–757. https://doi.org/10.1037/bul0000043
Greco, L. M., O’Boyle, E. H., Cockburn, B. S., & Yuan, Z. (2018). Meta-analysis of coefficient alpha: A reliability generalization study. Journal of Management Studies, 55(4), 583–618. https://doi.org/10.1111/joms.12328
Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of test unidimensionality. Educational and Psychological Measurement, 37(4), 827–838. https://doi.org/10.1177/001316447703700403
Ha, M., & Lee, J.-K. (2011). The analysis of pre-service biology teachers’ natural selection conceptions in multiple-choice and open-response instruments. https://doi.org/10.14697/jkase.2011.31.6.887
Haslbeck, J. M., Jover-Martínez, A., Roefs, A. J., Fried, E. I., Lemmens, L. H., Groot, E., & Edelsbrunner, P. A. (2024). Comparing likert and visual analogue scales in ecological momentary assessment. Preprint available from https://osf.io/preprints/psyarxiv/yt8xw
Hickendorff, M., Edelsbrunner, P. A., McMullen, J., Schneider, M., & Trezise, K. (2018). Informative tools for characterizing individual differences in learning: Latent class, latent profile, and latent transition analysis. Learning and Individual Differences, 66, 4–15. https://doi.org/10.1016/j.lindif.2017.11.001
Hofer, S. I., Schumacher, R., & Rubin, H. (2017). The test of basic Mechanics Conceptual Understanding (bMCU): Using Rasch analysis to develop and evaluate an efficient multiple choice test on Newton’s mechanics. International Journal of STEM Education, 4, 1–20. https://doi.org/10.1186/s40594-017-0080-5
Isvoranu, A. M., Epskamp, S., Waldorp, L., & Borsboom, D. (Eds.). (2022). Network psychometrics with R: A guide for behavioral and social scientists. Routledge.
Keil, F. C. (1981). Constraints on knowledge and cognitive development. Psychological Review, 88(3), 197–227. https://doi.org/10.1037/0033-295X.88.3.197
Koedinger, K. R., Corbett, A. T., & Perfetti, C. (2012). The knowledge-learning-instruction framework: Bridging the science-practice chasm to enhance robust student learning. Cognitive Science, 36(5), 757–798. https://doi.org/10.1111/j.1551-6709.2012.01245.x
Kraft, M. A. (2020). Interpreting effect sizes of education interventions. Educational Researcher, 49(4), 241–253. https://doi.org/10.3102/0013189X20912798
Kuhn, D., Schauble, L., & Garcia-Mila, M. (1992). Cross-domain development of scientific reasoning. Cognition and Instruction, 9, 285–327. https://doi.org/10.1207/S1532690XCI0904_1
Lee, H.-S., Liu, O., & Linn, M. (2011). Validating measurement of knowledge integration in science using multiple-choice and explanation items. Applied Measurement in Education, 24, 115–136. https://doi.org/10.1080/08957347.2011.554604
Lenz, K., Dreher, A., Holzäpfel, L., & Wittmann, G. (2020). Are conceptual knowledge and procedural knowledge empirically separable? The case of fractions. British Journal of Educational Psychology, 90(3), 809–829. https://doi.org/10.1111/bjep.12333
Linn, M. C. (2000). Designing the knowledge integration environment. International Journal of Science Education, 22(8), 781–796. https://doi.org/10.1080/095006900412275
Linn, M. C. (2006). The knowledge integration perspective on learning and instruction. Cambridge University Press.
Little, T., & Rhemtulla, M. (2013). Planned missing data designs for developmental researchers. Child Development Perspectives, 7, 199–204. https://doi.org/10.1111/CDEP.12043
Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Addison-Wesley.
Mansueto, A. C., Pan, T., van Dessel, P., & Wiers, R. W. (2023). Ecological momentary assessment and personalized networks in cognitive bias modification studies on addiction: Advances and challenges. Journal of Experimental Psychopathology, 14(2). https://doi.org/10.1177/20438087231178123
Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207–218. https://doi.org/10.1207/s15326985ep3404_2
McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23(3), 412–433. https://doi.org/10.1037/met0000144
Mehler, D. M. A., Edelsbrunner, P. A., & Matić, K. (2019). Appreciating the significance of non-significant findings in psychology. Journal of European Psychology Students, 10(4), 4. https://doi.org/10.5334/e2019a
Meier, M. A., Gross, F., Vogel, S. E., & Grabner, R. H. (2023). Mathematical expertise: The role of domain-specific knowledge for memory and creativity. Scientific Reports, 13(1), 12500. https://doi.org/10.1038/s41598-023-39309-w
Neubauer, A. C., & Hofer, G. (2022). (Retest-) Reliable and valid despite low alphas? An example from a typical performance situational judgment test of emotional management. Personality and Individual Differences, 189, 111511. https://doi.org/10.1016/j.paid.2022.111511
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). McGraw-Hill.
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., & Brennan, S. E. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71. https://doi.org/10.1136/bmj.n71
Parsons, S., Kruijt, A.-W., & Fox, E. (2019). Psychological science needs a standard practice of reporting the reliability of cognitive-behavioral measurements. Advances in Methods and Practices in Psychological Science, 2(4), 378–395. https://doi.org/10.1177/2515245919879695
Raykov, T., & Marcoulides, G. A. (2015). A direct latent variable modeling based method for point and interval estimation of coefficient alpha. Educational and Psychological Measurement, 75(1), 146–156. https://doi.org/10.1177/0013164414526039
Revelle, W. (2017). psych: Procedures for personality and psychological research (R package version 2.1.9). Northwestern University.
Revelle, W. (2024). The seductive beauty of latent variable models: Or why I don’t believe in the Easter Bunny. Personality and Individual Differences, 221, 112552. https://doi.org/10.1016/j.paid.2024.112552
Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74, 145–154. https://doi.org/10.1007/s11336-008-9102-z
Riley, R. D., Higgins, J. P. T., & Deeks, J. J. (2011). Interpretation of random effects meta-analyses. BMJ, 342, d549. https://doi.org/10.1136/bmj.d549
Rittle-Johnson, B., Siegler, R. S., & Alibali, M. W. (2001). Developing conceptual understanding and procedural skill in mathematics: An iterative process. Journal of Educational Psychology, 93(2), 346–362. https://doi.org/10.1037/0022-0663.93.2.346
Rittle-Johnson, B., & Siegler, R. S. (1998). The relation between conceptual and procedural knowledge in learning mathematics: A review. In C. Donlan (Ed.), The development of mathematical skills (pp. 75–110). Psychology Press/Taylor & Francis (UK).
Robinaugh, D. J., Haslbeck, J. M., Ryan, O., Fried, E. I., & Waldorp, L. J. (2021). Invisible hands and fine calipers: A call to use formal theory as a toolkit for theory construction. Perspectives on Psychological Science, 16(4), 725–743. https://doi.org/10.1177/1745691620974697
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40(2), 163–184. https://doi.org/10.1111/j.1745-3984.2003.tb01102.x
Roozenbeek, J., Maertens, R., McClanahan, W., & Van Der Linden, S. (2021). Disentangling item and testing effects in inoculation research on online misinformation: Solomon revisited. Educational and Psychological Measurement, 81(2), 340–362. https://doi.org/10.1177/0013164420940378
Rowland, C. A. (2014). The effect of testing versus restudy on retention: A meta-analytic review of the testing effect. Psychological Bulletin, 140(6), 1432–1463. https://doi.org/10.1037/a0037559
Sánchez-Meca, J., Marín-Martínez, F., López-López, J. A., Núñez-Núñez, R. M., Rubio-Aparicio, M., López-García, J. J., López-Pina, J. A., Blázquez-Rincón, D. M., López-Ibáñez, C., & López-Nicolás, R. (2021). Improving the reporting quality of reliability generalization meta-analyses: The REGEMA checklist. Research Synthesis Methods, 12(4), 516–536. https://doi.org/10.1002/jrsm.1487
Schiefer, J., Edelsbrunner, P. A., Bernholt, A., Kampa, N., & Nehring, A. (2022). Profiles of epistemic beliefs in science: An integration of evidence from multiple studies. Educational Psychology Review, 34(1), 1541–1575. https://doi.org/10.1007/s10648-022-09661-w
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8(4), 350. https://doi.org/10.1037/1040-3590.8.4.350
Schneider, M., & Hardy, I. (2013). Profiles of inconsistent knowledge in children’s pathways of conceptual change. Developmental Psychology, 49(9), 1639–1649. https://doi.org/10.1037/a0030976
Schneider, M., & Stern, E. (2009). The inverse relation of addition and subtraction: A knowledge integration perspective. Mathematical Thinking and Learning, 11(1–2), 92–101. https://doi.org/10.1080/10986060802584012
Schneider, M., & Stern, E. (2010). The developmental relations between conceptual and procedural knowledge: A multimethod approach. Developmental Psychology, 46(1), 178–192. https://doi.org/10.1037/a0016701
Schneider, M., Rittle-Johnson, B., & Star, J. R. (2011). Relations among conceptual knowledge, procedural knowledge, and procedural flexibility in two samples differing in prior knowledge. Developmental Psychology, 47(6), 1525–1538. https://doi.org/10.1037/a0024997
Schuberth, F. (2021). The Henseler-Ogasawara specification of composites in structural equation modeling: A tutorial. Psychological Methods, 28(4), 843–859. https://doi.org/10.1037/met0000432
Shtulman, A., & Valcarcel, J. (2012). Scientific knowledge suppresses but does not supplant earlier intuitions. Cognition, 124(2), 209–215. https://doi.org/10.1016/j.cognition.2012.04.005
Siegler, R., & Alibali, M. (2011). Children’s thinking (5th ed.). Pearson.
Siegler, R. (2007). Microgenetic analyses of learning. In W. Damon, & R. M. Lerner (Eds.), Handbook of child psychology (pp. 464–510). Wiley.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s Alpha. Psychometrika, 74(1), 107–120. https://doi.org/10.1007/s11336-008-9101-0
Simms, L. J., Zelazny, K., Williams, T. F., & Bernstein, L. (2019). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Psychological Assessment, 31(4), 557–566. https://doi.org/10.1037/pas0000648
Simonsmeier, B. A., Flaig, M., Deiglmayr, A., Schalk, L., & Schneider, M. (2022). Domain-specific prior knowledge and learning: A meta-analysis. Educational Psychologist, 57(1), 31–54. https://doi.org/10.1080/00461520.2021.1939700
Smaldino, P. E. (2020). How to translate a verbal theory into a formal model. Social Psychology, 51(4), 207–218. https://doi.org/10.1027/1864-9335/a000425
Stadler, M., Sailer, M., & Fischer, F. (2021). Knowledge as a formative construct: A good alpha is not always better. New Ideas in Psychology, 60, 100832. https://doi.org/10.1016/j.newideapsych.2020.100832
Star, J. R. (2005). Reconceptualizing procedural knowledge. Journal for Research in Mathematics Education, 36(5), 404–411.
Stefanutti, L., de Chiusole, D., Gondan, M., & Maurer, A. (2020). Modeling misconceptions in knowledge space theory. Journal of Mathematical Psychology, 99, 102435. https://doi.org/10.1016/j.jmp.2020.102435
Steger, D., Jankowsky, K., Schroeders, U., & Wilhelm, O. (2023). The road to hell is paved with good intentions: How common practices in scale construction hurt validity. Assessment, 30(6), 1811–1824. https://doi.org/10.1177/10731911221124846
Stricker, J., Vogel, S. E., Schöneburg-Lehnert, S., Krohn, T., Dögnitz, S., Jud, N., Spirk, M., Windhaber, M.-C., Schneider, M., & Grabner, R. H. (2021). Interference between naïve and scientific theories occurs in mathematics and is related to mathematical achievement. Cognition, 214, 104789. https://doi.org/10.1016/j.cognition.2021.104789
Suffill, E., Schonberg, C., Vlach, H. A., & Lupyan, G. (2022). Children’s knowledge of superordinate words predicts subsequent inductive reasoning. Journal of Experimental Child Psychology, 221, 105449. https://doi.org/10.1016/j.jecp.2022.105449
Tabachnick, B. G., Fidell, L. S., & Ullman, J. B. (2013). Using multivariate statistics (6th ed.). Pearson.
Taber, K. S. (2018). The use of Cronbach’s Alpha when developing and reporting research instruments in science education. Research in Science Education, 48(6), 1273–1296. https://doi.org/10.1007/s11165-016-9602-2
Tipton, E. (2015). Small sample adjustments for robust variance estimation with meta-regression. Psychological Methods, 20(3), 375–393. https://doi.org/10.1037/met0000011
van Bork, R., Rhemtulla, M., Sijtsma, K., & Borsboom, D. (2024). A causal theory of error scores. Psychological Methods, 29(4), 807–826. https://doi.org/10.1037/met0000521
VanderWeele, T. J. (2022). Constructed measures and causal inference: Towards a new model of measurement for psychosocial constructs. Epidemiology, 33(1), 141–151. https://doi.org/10.1097/EDE.0000000000001434
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03
Watrin, L., Schroeders, U., & Wilhelm, O. (2022). Structural invariance of declarative knowledge across the adult lifespan. Psychology and Aging, 37(3), 283–297. https://doi.org/10.1037/pag0000660
Watts, T. W., Duncan, G. J., Siegler, R. S., & Davis-Kean, P. E. (2014). What’s past is prologue: Relations between early mathematics knowledge and high school achievement. Educational Researcher, 43(7), 352–360. https://doi.org/10.3102/0013189X14553660
White, M., Edelsbrunner, P. A., & Thurn, C. M. (2024). The conceptualisation implies the statistical model: Implications for measuring domains of teaching quality. Assessment in Education: Principles, Policy & Practice, 31(3–4), 254–278. https://doi.org/10.1080/0969594X.2024.2368252
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D. A., François, R., ... & Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wilson, R. S., Li, Y., Bienias, L., & Bennett, D. A. (2006). Cognitive decline in old age: Separating retest effects from the effects of growing older. Psychology and Aging, 21(4), 774–789. https://doi.org/10.1037/0882-7974.21.4.774
Yousfi, S. (2005a). Mythen und Paradoxien der klassischen Testtheorie (I) [Myths and paradoxes of classical test theory (I)]. Diagnostica, 51(1), 1–11. https://doi.org/10.1026/0012-1924.51.1.1
Yousfi, S. (2005b). Mythen und Paradoxien der klassischen Testtheorie (II) [Myths and paradoxes of classical test theory (II)]. Diagnostica, 51(2), 55–66. https://doi.org/10.1026/0012-1924.51.2.55
Yu, X., Schuberth, F., & Henseler, J. (2023). Specifying composites in structural equation modeling: A refinement of the Henseler-Ogasawara specification. Statistical Analysis and Data Mining: The ASA Data Science Journal, 16(4), 348–357. https://doi.org/10.1002/sam.11608
Zhu, Q., & Carriere, K. C. (2018). Detecting and correcting for publication bias in meta-analysis–A truncated normal distribution approach. Statistical Methods in Medical Research, 27(9), 2722–2741. https://doi.org/10.1177/096228021668467
Ziegler, E., Trninic, D., & Kapur, M. (2021). Micro productive failure and the acquisition of algebraic procedural knowledge. Instructional Science, 49(3), 313–336. https://doi.org/10.1007/s11251-021-09544-7
Acknowledgements
The authors thank Ulrich Schroeders, Diana Steger, Oliver Wilhelm, Eric Klopp, and Noell Röhrig for helpful feedback on this project.
Funding
Open access funding provided by Swiss Federal Institute of Technology Zurich.
Ethics declarations
Conflict of Interest
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Edelsbrunner, P.A., Simonsmeier, B.A. & Schneider, M. The Cronbach’s Alpha of Domain-Specific Knowledge Tests Before and After Learning: A Meta-Analysis of Published Studies. Educ Psychol Rev 37, 4 (2025). https://doi.org/10.1007/s10648-024-09982-y