Levels of Reading Comprehension in Higher Education: Systematic Review and Meta-Analysis

Cristina de-la-Peña

1 Departamento de Métodos de Investigación y Diagnóstico en Educación, Universidad Internacional de la Rioja, Logroño, Spain

María Jesús Luque-Rojas

2 Department of Theory and History of Education and Research Methods and Diagnosis in Education, University of Malaga, Málaga, Spain

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Abstract

Higher education aims for university students to produce knowledge through the critical reflection of scientific texts, which requires developing a deep mental representation of written information. The objective of this research was to determine, through a systematic review and meta-analysis, the proportion of university students who perform optimally at each level of reading comprehension. The systematic review covered empirical studies published from 2010 to March 2021 in the Web of Science, Scopus, Medline, and PsycINFO databases. Two reviewers performed data extraction independently. A random-effects model of proportions was used for the meta-analysis, and heterogeneity was assessed with I². Meta-regression was used to analyze the influence of moderating variables, and publication bias was examined in two ways. Seven articles were identified, with a total sample of 1,044 students. The proportion of students at the literal level was 56% (95% CI = 39–72%, I² = 96.3%), at the inferential level 33% (95% CI = 19–46%, I² = 95.2%), at the critical level 22% (95% CI = 9–35%, I² = 99.04%), and at the organizational level 22% (95% CI = 6–37%, I² = 99.67%). Comparing reading comprehension levels, a significantly higher proportion of university students reach an optimal literal level than any other level. The results must be interpreted with caution but serve as a guide for future research.

Introduction

Reading comprehension allows the integration of knowledge, which facilitates training processes and successful coping with academic and personal situations. In higher education, reading comprehension has to give students the autonomy to self-direct their academic and professional learning and to support critical thinking in the service of the community (UNESCO, 2009). However, research in recent years (Bharuthram, 2012; Afflerbach et al., 2015) indicates that some university students are not prepared to deal successfully with academic texts or have reading difficulties (Smagorinsky, 2001; Cox et al., 2014), which may limit academic training centered on written texts. This work reviews the levels of reading comprehension reported by studies carried out in different countries, considering the heterogeneity of existing educational models.

The level of reading comprehension refers to the type of mental representation made of the written text. The reader builds a mental model integrating explicit and implicit data from the text with experiences and previous knowledge (Kucer, 2016; van den Broek et al., 2016). Within the framework of the construction-integration model (Kintsch and van Dijk, 1978; Kintsch, 1998), the most widely accepted model of reading comprehension, two processing levels are differentiated: a superficial level, at which data forming the text base are identified or memorized, and a deep level, at which a situation model of the text is elaborated by integrating previous experiences and knowledge. At these processing levels, the cognitive strategies used differ, according to the model of domain learning (Alexander, 2004), ranging from basic coding to a transformation of the text. In the scientific literature, there are also investigations (Yussof et al., 2013; Ulum, 2016) that identify levels of reading comprehension ranging from a literal level of identifying ideas to inferential and critical levels that require elaborating inferences and transforming the data.

Studies focused on higher education (Barletta et al., 2005; Yáñez Botello, 2013) show that university students remain at a literal or basic level of understanding: they often have difficulty making inferences and recognizing the macrostructure of the written text, so they do not develop a situation model of the text. These results point in the same direction as research on reading comprehension in the mother tongue in university populations. Bharuthram (2012) indicates that university students do not access or develop effective strategies for reading comprehension, such as the capacity for abstraction and synthesis-analysis. Later, Livingston et al. (2015) found that first-year education students present limited reading strategies and difficulties in understanding written texts. Ntereke and Ramoroka (2017) found that only 12.4% of students performed well in a reading comprehension task, while 34.3% presented a low level of execution.

Factors related to the level of understanding of written information include the mode of presentation of the text (printed vs. digital), the type of metacognitive strategies used (planning, making inferences, inhibition, monitoring, etc.), the type of text and its difficulty (e.g., a novel vs. a science passage), the mode of writing (text vs. multimodal), the type of reading comprehension task, and the diversity of the students. For example, several studies (Tuncer and Bahadir, 2014; Trakhman et al., 2019; Kazazoglu, 2020) indicate that reading is more efficient, with better performance on reading comprehension tests, for printed texts than for the same text in digital format, and according to Spencer (2006), college students prefer reading print rather than digital texts. Metacognitive strategies are involved in reading written text (Amril et al., 2019), but studies (Channa et al., 2018) seem to indicate that students do not use them for reading comprehension; specifically, Korotaeva (2012) finds that only 7% of students use them. Concerning the type of text and its difficulty, for Wolfe and Woodwyk (2010), expository texts benefit more than narrative texts from the construction of a situational model of the text, although Feng (2011) finds that expository texts are more difficult to read than narrative texts. Regarding the modality of the text, Mayer (2009) and Guo et al. (2020) indicate that multimodal texts that incorporate images improve reading comprehension. A study by Kobayashi (2002) using open, cloze, and multiple-choice questions shows that the type and format of the reading comprehension assessment test significantly influence student performance and that more structured tests better differentiate good from poor comprehenders. Finally, regarding student diversity, studies link reading comprehension with the interest and intrinsic motivation of university students (Cartwright et al., 2019; Dewi et al., 2020), with gender (Saracaloglu and Karasakaloglu, 2011), finding that women present a better level of reading comprehension than men, and with knowledge related to reading (Perfetti et al., 1987). In this research, it was verified that all texts were printed and unimodal, that is, text only. This is essential because the cognitive processes involved in reading comprehension can vary with these factors (Butcher and Kintsch, 2003; Xu et al., 2020).

The Present Study

Regardless of the educational context, in any university discipline, preparing essays or developing arguments are formative tasks that require a deep level of reading comprehension (inferences and transformation of information) allowing the elaboration of a situation model; lacking this level can limit formative learning. Therefore, the objective of this research was to determine the state of reading comprehension levels in higher education, specifically, the proportion of university students who perform optimally at each level of reading comprehension. It is important to note that little information exists about the different levels in university students and that this is the only meta-analytic review exploring the different levels of reading comprehension at this educational stage. This is a relevant issue because the university system requires students to produce knowledge from the critical reflection of scientific texts, preparing them for innovation, employability, and coexistence in society.

Materials and Methods

Eligibility Criteria: Inclusion and Exclusion

Empirical studies written in Spanish or English that analyze the level of reading comprehension in university students are selected.

The exclusion criteria are as follows: (a) book chapters or review books or publications; (b) articles in other languages; (c) studies of lower educational levels; (d) articles that do not identify the age of the sample; (e) second language studies; (f) students with learning difficulties or other disorders; (g) publications that do not indicate the level of reading comprehension; (h) studies that relate reading competence with other variables but do not report reading comprehension levels; (i) pre-post program application work; (j) studies with experimental and control groups; (k) articles comparing pre-university stages or adults; (l) publications that use multi-texts; (m) studies that use some type of technology (computer, hypertext, web, psychophysiological, online questionnaire, etc.); and (n) studies unrelated to the subject of interest.

Only publications that meet the following criteria are included: (a) empirical research (article, thesis, final degree/master's project, or conference proceedings); (b) university stage; (c) including data or some measure of the level of reading comprehension that allows calculating the effect size; (d) written in English or Spanish; (e) reading comprehension in the first language or mother tongue; and (f) published between January 2010 and March 2021.

Search Strategies

A three-step procedure is used to select the studies included in the meta-analysis. In the first step, research and empirical articles in English and Spanish from January 2010 to March 2021 are reviewed. The search is carried out in online databases covering Spanish and English, namely Web of Science (WoS), Scopus, Medline, and PsycINFO, to review empirical work that analyzes the level of reading comprehension in university students. In the second step, the following terms are applied to titles, abstracts, keywords, and full text to select the articles: reading comprehension and higher education or university students, in Spanish and English, combined with the Boolean operators AND and OR. In the last step, secondary sources, such as the Google search engine, Theseus, and the references in publications, are explored.

The search returned 4,294 publications (articles, theses, and conference proceedings) from the databases, specifically, 1,989 from WoS, 2,001 from Scopus, 42 from Medline, and 262 from PsycINFO, plus eight records from secondary sources. Of this combined total (4,302), 1,568 are eliminated as duplicates, leaving 2,734 valid records. Next, titles and abstracts are reviewed and 2,659 records are excluded for not meeting the inclusion criteria. The sample of 75 publications is reduced to 40 articles, excluding 35 because the full text could not be accessed (the authors were contacted but did not respond), the full text did not report specific statistical data, or the study used online questionnaires or computerized presentations of the text. Finally, seven articles in Spanish were selected for the meta-analysis of the reading comprehension level of university students. Data additional to those included in the articles were not requested from the selected authors.

The PRISMA-P guidelines (Moher et al., 2015) are followed to perform the meta-analysis, and the flow chart for the selection of relevant publications is shown in Figure 1.

Figure 1. Flow diagram for the selection of articles.

Coding Procedure

This research complies with the handbook for systematic reviews (Higgins and Green, 2008), which calls for clear objectives, specific search terms, and previously defined eligibility criteria. Two independent coders carried out the study search process, reaching 100% agreement. Subsequently, the studies were coded using a coding protocol as a guide to help resolve ambiguities between the coders; proposals were reflected upon and discussed, and discrepancies were resolved, reaching 97% agreement between the two coders.

For all studies, the reference, country, research objective, sample size, age and gender, reading comprehension test, other tests, and reading comprehension results (in percentages) were coded. All this information was later systematized in Table 1.

Table 1. Results of the empirical studies included in the meta-analysis.

In relation to the type of reading comprehension level, it was coded based on the levels of the scientific literature as follows: 1 = literal; 2 = inferential; 3 = critical; and 4 = organizational.

Regarding the possible moderating variables, it was coded whether the investigations used a standardized reading comprehension measure (value = 1) or a non-standardized one (value = 0); this research considers as non-standardized those measures created by the researchers themselves in their studies or questionnaires taken from other authors. By type of evaluation test, we code multiple-choice (value = 0) versus multiple-choice plus open questions (value = 1). By type of text, we code argumentative (value = 1) versus unknown (value = 0). By type of career, we code social sciences (value = 1) versus other careers (e.g., health sciences; value = 0). Finally, by type of publication, we code article (value = 1) versus doctoral thesis (value = 0).
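For concreteness, the coding scheme above can be expressed as a simple lookup table. This is a minimal sketch with hypothetical names, not part of the original coding protocol:

```python
# Minimal sketch of the coding scheme described above (names hypothetical).
LEVEL_CODES = {"literal": 1, "inferential": 2, "critical": 3, "organizational": 4}

MODERATOR_CODES = {
    "measure":     {"standardized": 1, "non_standardized": 0},
    "test_format": {"multiple_choice": 0, "multiple_choice_plus_open": 1},
    "text_type":   {"argumentative": 1, "unknown": 0},
    "career":      {"social_sciences": 1, "other": 0},  # e.g., health sciences
    "publication": {"article": 1, "doctoral_thesis": 0},
}
```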

Effect Size and Statistical Analysis

This descriptive study, with k = 7 studies and a total of 1,044 university students, used proportions as the (continuous) effect size to analyze the proportion of students with optimal performance at each level of reading comprehension. The percentages reported for each level of reading comprehension were transformed into absolute frequencies. A random-effects model (Borenstein et al., 2009) was used to pool the effect sizes; random-effects models have a greater capacity to generalize conclusions and allow estimating the effects of different sources of variation (moderating variables). The DerSimonian and Laird method (Egger et al., 2001) was used, calculating the raw proportion and, for each proportion, its standard error, p-value, and 95% confidence interval (CI).
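The analysis itself was run in Jamovi (see the software note below), but as an illustration of the DerSimonian and Laird method named here, a minimal sketch of random-effects pooling of raw proportions might look as follows. The counts are hypothetical, and the per-study variance formula assumes proportions strictly between 0 and 1:

```python
import numpy as np

def dl_pooled_proportion(x, n):
    """DerSimonian-Laird random-effects pooling of raw proportions.

    x: number of students with optimal performance per study.
    n: sample size per study. Assumes 0 < x < n for every study.
    """
    x = np.asarray(x, dtype=float)
    n = np.asarray(n, dtype=float)
    p = x / n                            # raw proportion per study
    v = p * (1.0 - p) / n                # within-study variance
    w = 1.0 / v                          # fixed-effect weights
    p_fixed = np.sum(w * p) / np.sum(w)  # fixed-effect pooled proportion
    k = len(p)
    q = np.sum(w * (p - p_fixed) ** 2)   # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)   # DL between-study variance
    w_star = 1.0 / (v + tau2)            # random-effects weights
    p_pooled = np.sum(w_star * p) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    ci = (p_pooled - 1.96 * se, p_pooled + 1.96 * se)
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    return p_pooled, ci, q, i2

# Hypothetical counts for illustration only (not the studies' data):
print(dl_pooled_proportion(x=[45, 20, 310, 80], n=[80, 60, 570, 150]))
```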

To examine sampling variability, Cochran's Q test (testing the null hypothesis of homogeneity between studies) and I² (the proportion of variability due to heterogeneity) were used. According to Higgins et al. (2003), an I² around 25% is considered low, around 50% moderate, and above 75% high. A meta-regression analysis was used to investigate the effect of the moderator variables (type of measure, type of evaluation test, type of text, type of career, and type of publication) on each level of reading comprehension in the sampled studies. For each moderating variable, the necessary statistics were calculated (estimate, standard error, CI, Q, and I²).
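In symbols, the heterogeneity statistics and the meta-regression model described above take the following standard forms (our notation, with w_i = 1/v_i the fixed-effect weight and v_i the within-study variance):

```latex
% Cochran's Q and I^2, with fixed-effect weights w_i = 1/v_i:
Q = \sum_{i=1}^{k} w_i \,(p_i - \bar{p})^2, \qquad
\bar{p} = \frac{\sum_i w_i p_i}{\sum_i w_i}, \qquad
I^2 = \max\!\left(0,\; \frac{Q - (k-1)}{Q}\right) \times 100\%.

% Meta-regression of study proportions on a coded moderator x_i:
p_i = \beta_0 + \beta_1 x_i + u_i + \varepsilon_i, \qquad
u_i \sim \mathcal{N}(0, \tau^2), \quad
\varepsilon_i \sim \mathcal{N}(0, v_i).
```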

To compare the effect sizes of each level (literal, inferential, critical, and organizational) of reading comprehension, the chi-square test for proportions recommended by Campbell (2007) was used.
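Campbell (2007) recommends the "N − 1" chi-squared test for two-by-two comparisons of proportions. As an illustration only, the counts below are reconstructed from the reported pooled percentages (56% literal and 33% inferential of 1,044 students) and are hypothetical:

```python
from scipy.stats import chi2

def n_minus_1_chi2(a, b, c, d):
    """Campbell's (2007) 'N - 1' chi-squared test for a 2x2 table.

    Row 1: (a) optimal, (b) not optimal at one level.
    Row 2: (c) optimal, (d) not optimal at the other level.
    The statistic is the Pearson chi-squared scaled by (N - 1) / N.
    """
    n_total = a + b + c + d
    stat = (n_total - 1) * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    return stat, chi2.sf(stat, df=1)  # statistic and p-value (1 df)

# Hypothetical example: literal (585/1044 optimal) vs. inferential (344/1044)
print(n_minus_1_chi2(585, 459, 344, 700))
```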

Finally, publication bias was analyzed in two ways: Rosenthal's fail-safe number and a regression test. Rosenthal's fail-safe number indicates how many missing studies with null effects would be needed to render the pooled results nonsignificant (Borenstein et al., 2009); large values indicate the absence of bias. In the regression test, a nonsignificant regression indicates the absence of bias.
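The article does not name the specific regression test, so the sketch below pairs Rosenthal's fail-safe N with a commonly used Egger-style regression as an assumed stand-in. All inputs are hypothetical, and `intercept_stderr` requires SciPy 1.7 or later:

```python
import numpy as np
from scipy.stats import norm, linregress

def rosenthal_fail_safe_n(z_values, alpha=0.05):
    """Rosenthal's fail-safe N: how many unpublished null-result studies
    would be needed to drag the combined one-tailed test below significance.
    Large values suggest robustness to publication bias."""
    z = np.asarray(z_values, dtype=float)
    z_alpha = norm.isf(alpha)  # 1.645 for a one-tailed test at alpha = 0.05
    return (z.sum() ** 2) / (z_alpha ** 2) - len(z)

def egger_test(effects, se):
    """Egger-style regression test: regress effect/SE on 1/SE.
    An intercept far from zero signals small-study (publication) bias."""
    effects = np.asarray(effects, dtype=float)
    se = np.asarray(se, dtype=float)
    fit = linregress(1.0 / se, effects / se)
    return fit.intercept, fit.intercept / fit.intercept_stderr  # intercept, t

# Hypothetical per-study z-scores and (proportion, SE) pairs for illustration:
print(rosenthal_fail_safe_n([2.1, 2.8, 3.3, 1.9]))
print(egger_test([0.55, 0.40, 0.62, 0.35], [0.05, 0.06, 0.02, 0.04]))
```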

Microsoft Excel was used to classify and code the data and to produce descriptive statistics, and the meta-analysis was performed with the free software Jamovi, version 1.6.

Results

The results of the meta-analysis are presented in three parts: the general descriptive analysis of the included studies; the meta-analytic analysis, with the effect sizes, heterogeneity, moderating variables, and comparison of effect sizes; and the study of publication bias.

Overview of Included Studies

The search of the scientific literature on the subject published from 2010 to March 2021 generated a small number of publications because it was limited to the higher education stage and required clear statistical data on reading comprehension.

Table 1 presents all the publications reviewed in this meta-analysis, with a total of 1,044 students evaluated across the reviewed works; the smallest sample size is 30 (Del Pino-Yépez et al., 2019) and the largest 570 (Guevara Benítez et al., 2014). Regarding gender, 72% of participants were women and 28% men. Most of the sample comes from university degrees in the social sciences, such as psychology and education (71.42%), followed by health sciences and engineering (14.28%) and one publication (14.28%) that does not indicate the students' field. The publications selected according to the inclusion criteria come from several countries with a variety of educational systems, all in Latin America. Specifically, Mexico contributes the most studies (28.57%), while Colombia, Chile, Bolivia, Peru, and Ecuador contribute 14.28% each. Regarding publication year, 28.57% were published in 2018, 28.57% in 2016, and 14.28% in each of 2019, 2014, and 2013.

A total of 57% of the studies analyze four levels of reading comprehension (literal, inferential, critical, and organizational) and 43% investigate three levels (literal, inferential, and critical). Regarding the moderating variables, 57% of the studies use standardized reading comprehension measures and 43% non-standardized measures. According to the evaluation test used, 29% use multiple-choice questions and 71% combine multiple-choice and open questions. 43% use an argumentative text and 57% other types of texts (not indicated in the studies). By type of career, 71% involve students of the social sciences and 29% students of other careers, such as engineering or health sciences. In addition, 71% are articles and 29% research works (theses and degree projects).

Table 2 shows the reading comprehension assessment instruments used by the authors of the empirical research integrated into the meta-analysis.

Table 2. Reading comprehension assessment tests used in higher education.

Meta-Analytic Analysis of the Level of Reading Comprehension

The literal level presents a mean proportion effect size of 56% (95% CI = 39–72%; Figure 2). The variability between the different samples at the literal level of reading comprehension was significant (Q = 162.066, p < 0.001; I² = 96.3%). No moderating variable used in this research contributed significantly to the heterogeneity: type of measure (p = 0.520), type of test (p = 0.114), type of text (p = 0.520), type of career (p = 0.235), and type of publication (p = 0.585). The high variability may be explained by factors not considered in this work, such as student characteristics (cognitive abilities) or other issues.

Figure 2. Forest plot of literal level.

The inferential level presents a mean proportion effect size of 33% (95% CI = 19–46%; Figure 3). The variability between the different samples at the inferential level of reading comprehension was significant (Q = 125.123, p < 0.001; I² = 95.2%). The type of measure (p = 0.011) and the type of text (p = 0.011) contributed significantly to the heterogeneity. The remaining variables were not significant: type of test (p = 0.214), type of career (p = 0.449), and type of publication (p = 0.218). According to the type of measure, the proportion of students with an optimal inferential level is 28.7% lower when a standardized test is administered than when a non-standardized test is administered; the type of measure reduces the variability by 2.57% and helps explain the differences between the results of the studies at the inferential level. According to the type of text, the proportion of students with an optimal inferential level is 28.7% lower with an argumentative text than with another type of text; the type of text likewise reduces the variability by 2.57%.

Figure 3. Forest plot of inferential level.

The critical level has a mean proportion effect size of 22% (95% CI = 9–35%; Figure 4). The variability between the different samples at the critical level of reading comprehension was significant (Q = 627.044, p < 0.001; I² = 99.04%). No moderating variable used in this research contributed significantly to the heterogeneity: type of measure (p = 0.575), type of test (p = 0.691), type of text (p = 0.575), type of career (p = 0.699), and type of publication (p = 0.293). The high variability may be explained by factors not considered in this work, such as student characteristics (cognitive abilities).

Figure 4. Forest plot of critical level.

The organizational level presents a mean proportion effect size of 22% (95% CI = 6–37%; Figure 5). The variability between the different samples at the organizational level of reading comprehension was significant (Q = 1799.366, p < 0.001; I² = 99.67%). The type of test made a significant contribution to heterogeneity (p = 0.289); the other moderating variables were not significant in this research: type of measure (p = 0.289), type of text (p = 0.289), type of career (p = 0.361), and type of publication (p = 0.371). Depending on the type of test, the proportion of students with an optimal organizational level is 37% higher with multiple-choice plus open-question tests than with multiple-choice tests only. The type of test reduces the variability by 0.27% and helps explain the differences between the results of the studies at the organizational level.

Figure 5. Forest plot of organizational level.

Table 3 shows the differences between the estimated effect sizes and their significance. A larger proportion of students reach an optimal level of reading comprehension at the literal level than at the inferential, critical, and organizational levels, and a larger proportion reach an optimal level at the inferential level than at the critical and organizational levels.

Table 3. Results of effect size comparison.

Analysis of Publication Bias

This research uses two approaches to verify the existence of bias independently of the sample size. Table 4 shows the results: there is no publication bias at any level of reading comprehension.

Table 4. Publication bias results.

Discussion

This research used a systematic literature search and meta-analysis to estimate the proportion of university students who perform optimally at each level of reading comprehension. All the information available on the subject at the international level was analyzed using international databases in English and Spanish, but the potentially relevant publications were limited: only seven Spanish-language studies were identified internationally. Across these seven studies, optimal performance at each level of reading comprehension varied, with very high heterogeneity estimates, which indicates that the summary estimates have to be interpreted with caution and in the context of the sample and the variables used in this meta-analysis.

In this research, the effects of the type of measure, type of test, type of text, type of career, and type of publication were analyzed. Due to the limited information in the publications, it was not possible to assess the effect of additional moderating variables.

We found that some factors significantly influence heterogeneity depending on the level of reading comprehension considered. The type of measure influenced the optimal performance of students at the inferential level of reading comprehension; specifically, the proportion of students with an optimal inferential level is lower when the test is standardized. Several studies (Pike, 1996; Koretz, 2002) identify differences between standardized and non-standardized measures of reading comprehension, in favor of non-standardized measures developed by the researchers (Pyle et al., 2017). Each individual's ability to generate inferences may be difficult to standardize because each person identifies the relationships between the parts of the text, and integrates them with previous knowledge, in a different way (Oakhill, 1982; Cain et al., 2004). This mental representation of the meaning of the text is necessary to create a situation model and a deep understanding (McNamara and Magliano, 2009; van den Broek and Espin, 2012).

The type of test was significant for the organizational level of reading comprehension: the proportion of students with an optimal organizational level improves if the assessment combines multiple-choice and open questions. The organizational level requires reordering written information through analysis and synthesis processes (Guevara Benítez et al., 2014); it therefore constitutes a production task that is better reflected in open questions than in reproduction questions such as multiple choice (Dinsmore and Alexander, 2015). McNamara and Kintsch (1996) note that open tasks require an effort to make inferences related to previous and multidisciplinary knowledge. It is important to note that different test formats can measure different aspects of reading comprehension (Zheng et al., 2007).

The type of text significantly influenced the inferential level of reading comprehension: the proportion of students with an optimal inferential level decreases with an argumentative text. The expectations created by an argumentative text may have made it difficult to generate inferences and, therefore, to construct the meaning of the text. This result runs counter to the study by Diakidoy et al. (2011), who find that refutation texts, such as argumentative ones, facilitate the elaboration of inferences compared with other types of texts. It is possible that the argumentative text, given its dialogical nature of arguments and counterarguments on a subject unknown to the students, reduced inference generation because of their scarce previous knowledge of the subject, leaving them needing help to elaborate the structure of the text read (Reznitskaya et al., 2007). It should be noted that 43% of the studies in this meta-analysis use argumentative texts. Knowing the type of text is relevant for generating inferences; for instance, according to Baretta et al. (2009), different types of text are processed differently in the brain, generating more or fewer inferences; specifically, using the N400 component, they find that expository texts generate more inferences from the text read.

For the type of career and the type of publication, no significance was found at any level of reading comprehension in this sample. This seems to indicate that university students have the same level of performance in literal, inferential, critical, and organizational comprehension tasks regardless of whether they study social sciences, health sciences, or engineering. Nor does the type of publication affect the state of the different levels of reading comprehension in higher education.

The high residual heterogeneity at all levels of reading comprehension was not captured in this review, indicating that other factors, such as student characteristics, gender, or other issues, moderate and explain the variability in literal, inferential, critical, and organizational reading comprehension among university students.

Regarding the comparison between the different levels of reading comprehension, the literal level has a significantly higher proportion of students with an optimal level than the inferential, critical, and organizational levels, and the inferential level has a significantly higher proportion than the critical and organizational levels. This agrees with data from other investigations (Márquez et al., 2016; Del Pino-Yépez et al., 2019) indicating that the literal level is where university students perform most successfully, with the inferential, organizational, and critical levels being more difficult and less successfully handled. This suggests that the university students in this sample do not generate a coherent situation model providing a global mental representation of the text read, as described in the model of Kintsch (1998), but rather perform a literal analysis of its explicit content. This level of understanding can lead to less desirable results in educational terms (Dinsmore and Alexander, 2015).

The educational implications of this meta-analysis are aimed at making universities aware of the state of their students' reading comprehension levels and at designing strategies (courses and workshops) to optimize them, improving the training and employability of students. Some proposals involve the use of reflection tasks, integration of information, graphic organizers, evaluation, interpretation, and paraphrasing (Rahmani, 2011). Some studies (Hong-Nam and Leavell, 2011; Parr and Woloshyn, 2013) demonstrate the effectiveness of instructional courses in improving performance in reading comprehension and metacognitive strategies. In addition, it is necessary to design balanced, validated, and reliable reading comprehension assessment tests for higher education that provide data for the different levels of reading comprehension.

Limitations and Conclusion

This meta-analysis can serve as a starting point for reporting on reading comprehension levels in higher education, but the results should be interpreted with caution and in the context of the study sample and variables. Publications without sufficient data and inaccessible articles reduced the sample to seven studies, which may have limited the international perspective. The focus on reading comprehension in the mother tongue, using only unimodal texts, without the influence of technology, and restricted to English and Spanish has also limited the review, and the limited amount of data in the studies constrained the meta-regression.

This review is a guide for future research, broadening the focus on levels of reading comprehension to include digital technology, experimental designs, second languages, and investigations relating reading comprehension to other factors (gender, cognitive abilities, etc.) that may explain the heterogeneity across the different levels. The possibility of developing a comprehensive reading comprehension assessment test for higher education could also be explored.

This review contributes to the scientific literature in several ways. First, it is the only meta-analytic review that analyzes the proportion of university students who perform optimally at the different levels of reading comprehension; it draws on international publications, and the topic is mostly investigated in Latin America. Second, optimal performance can be improved at all levels of reading comprehension, fundamentally the inferential, critical, and organizational levels; the literal level is the level of reading comprehension with the significantly highest proportion of optimal performance among university students. Third, the students in this sample show optimal performance at the inferential level when non-argumentative texts and non-standardized measures are used and, in the analyzed works, optimal performance at the organizational level when multiple-choice plus open questions are used.

The current research is linked to the research project “Study of reading comprehension in higher education” of Asociación Educar para el Desarrollo Humano from Argentina.

Author Contributions

Cd-l-P had the idea for the article and analyzed the data. ML-R searched the data. Cd-l-P and ML-R selected the data and contributed valuable comments and manuscript writing. All authors contributed to the article and approved the submitted version.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor declared a shared affiliation, though no other collaboration, with one of the authors, ML-R, at the time of the review.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Funding. This paper was funded by the Universidad Internacional de la Rioja and Universidad de Málaga.

  • Afflerbach P., Cho B.-Y., Kim J.-Y. (2015). Conceptualizing and assessing higher-order thinking in reading . Theory Pract. 54 , 203–212. 10.1080/00405841.2015.1044367 [ CrossRef ] [ Google Scholar ]
  • Alexander P. A. (2004). “ A model of domain learning: reinterpreting expertise as a multidimensional, multistage process ,” in Motivation, Emotion, and Cognition: Integrative Perspectives on Intellectual Functioning and Development. eds. Dai D. Y., Sternberg R. J. (Mahwah, NJ: Erlbaum; ), 273–298. [ Google Scholar ]
  • Amril A., Hasanuddin W. S., Atmazaki (2019). The contributions of reading strategies and reading frequencies toward students’ reading comprehension skill in higher education . Int. J. Eng. Adv. Technol. (IJEAT) 8 , 593–595. 10.35940/ijeat.F1105.0986S319 [ CrossRef ] [ Google Scholar ]
  • Baretta L., Braga Tomitch L. M., MacNair N., Kwan Lim V., Waldie K. E. (2009). Inference making while reading narrative and expository texts: an ERP study . Psychol. Neurosci. 2 , 137–145. 10.3922/j.psns.2009.2.005 [ CrossRef ] [ Google Scholar ]
  • Barletta M., Bovea V., Delgado P., Del Villar L., Lozano A., May O., et al. (2005). Comprensión y Competencias Lectoras en Estudiantes Universitarios. Barranquilla: Uninorte. [ Google Scholar ]
  • Bharuthram S. (2012). Making a case for the teaching of reading across the curriculum in higher education . S. Afr. J. Educ. 32 , 205–214. 10.15700/saje.v32n2a557 [ CrossRef ] [ Google Scholar ]
  • Borenstein M., Hedges L. V., Higgins J. P. T., Rothstein H. R. (2009). Introduction to Meta-Analysis. United Kingdom: John Wiley and Sons, Ltd, 45–49. [ Google Scholar ]
  • Butcher K. R., Kintsch W. (2003). “ Text comprehension and discourse processing ,” in Handbook of Psychology: Experimental Psychology. 2nd Edn . Vol . 4 . eds. Healy A. F., Proctor R. W., Weiner I. B. (New Jersey: John Wiley and Sons, Inc.), 575–595. [ Google Scholar ]
  • Cain K., Oakhill J., Bryant P. (2004). Children’s reading comprehension ability: concurrent prediction by working memory, verbal ability, and component skills . J. Educ. Psychol. 96 , 31–42. 10.1037/0022-0663.96.1.31 [ CrossRef ] [ Google Scholar ]
  • Campbell I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations . Stat. Med. 26 , 3661–3675. 10.1002/sim.2832, PMID: [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Cartwright K. B., Lee S. A., Barber A. T., DeWyngaert L. U., Lane A. B., Singleton T. (2019). Contributions of executive function and cognitive intrinsic motivation to university students’ reading comprehension . Read. Res. Q. 55 , 345–369. 10.1002/rrq.273 [ CrossRef ] [ Google Scholar ]
  • Channa M. A., Abassi A. M., John S., Sahito J. K. M. (2018). Reading comprehension and metacognitive strategies in first-year engineering university students in Pakistan . Int. J. Engl. Ling. 8 , 78–87. 10.5539/ijel.v8n6p78 [ CrossRef ] [ Google Scholar ]
  • Cox S. R., Friesner D. L., Khayum M. (2014). Do Reading skills courses help underprepared readers achieve academic success in college? J. College Reading and Learn. 33 , 170–196. 10.1080/10790195.2003.10850147 [ CrossRef ] [ Google Scholar ]
  • Del Pino-Yépez G. M., Saltos-Rodríguez L. J., Moreira-Aguayo P. Y. (2019). Estrategias didácticas para el afianzamiento de la comprensión lectora en estudiantes universitarios . Revista científica Dominio de las Ciencias 5 , 171–187. 10.23857/dc.v5i1.1038 [ CrossRef ] [ Google Scholar ]
  • Dewi R. S., Fahrurrozi, Hasanah U., Wahyudi A. (2020). Reading interest and Reading comprehension: a correlational study in Syarif Hidayatullah State Islamic University, Jakarta . Talent Dev. Excell. 12 , 241–250. [ Google Scholar ]
  • Diakidoy I. N., Mouskounti T., Ioannides C. (2011). Comprehension and learning from refutation and expository texts . Read. Res. Q. 46 , 22–38. 10.1598/RRQ.46.1.2 [ CrossRef ] [ Google Scholar ]
  • Dinsmore D. J., Alexander P. A. (2015). A multidimensional investigation of deep-level and surface-level processing . J. Exp. Educ. 84 , 213–244. 10.1080/00220973.2014.979126 [ CrossRef ] [ Google Scholar ]
  • Egger M., Smith D., Altmand D. G. (2001). Systematic Reviews in Health Care: Meta-Analysis in Context. London: BMJ Publishing Group. [ Google Scholar ]
  • Feng L. (2011). A short analysis of the text variables affecting reading and testing reading . Stud. Lit. Lang. 2 , 44–49. [ Google Scholar ]
  • Figueroa Romero R. L., Castañeda Sánchez W., Tamay Carranza I. A. (2016). Nivel de comprensión lectora en los estudiantes del primer ciclo de la Universidad San Pedro, filial Caraz, 2016. (Trabajo de investigación, Universidad San Pedro). Repositorio Institucional USP. Available at: http://repositorio.usanpedro.edu.pe/bitstream/handle/USANPEDRO/305/PI1640418.pdf?sequence=1&isAllowed=y (Accessed February 15, 2021).
  • Guevara Benítez Y., Guerra García J., Delgado Sánchez U., Flores Rubí C. (2014). Evaluación de distintos niveles de comprensión lectora en estudiantes mexicanos de Psicología . Acta Colombiana de Psicología 17 , 113–121. 10.14718/ACP.2014.17.2.12 [ CrossRef ] [ Google Scholar ]
  • Guo D., Zhang S., Wright K. L., McTigue E. M. (2020). Do you get the picture? A meta-analysis of the effect of graphics on reading comprehension . AERA Open 6 , 1–20. 10.1177/2332858420901696 [ CrossRef ] [ Google Scholar ]
  • Higgins J. P., Green S. (2008). Cochrane Handbook for Systematic Reviews of Interventions. The Cochrane Collaboration. [ Google Scholar ]
  • Higgins J. P., Thompson S. G., Deeks J. J., Altman D. G. (2003). Measuring inconsistency in meta-analyses . BMJ 327 , 557–560. 10.1136/bmj.327.7414.557, PMID: [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Hong-Nam K., Leavell A. G. (2011). Reading strategy instruction, metacognitive awareness, and self-perception of striving college developmental readers . J. College Literacy Learn. 37 , 3–17. [ Google Scholar ]
  • Kazazoglu S. (2020). Is printed text the best choice? A mixed-method case study on reading comprehension . J. Lang. Linguistic Stud. 16 , 458–473. 10.17263/jlls.712879 [ CrossRef ] [ Google Scholar ]
  • Kintsch W. (1998). Comprehension: A Paradigm for Cognition. New York: Cambridge University Press. [ Google Scholar ]
  • Kintsch W., van Dijk T. A. (1978). Toward a model of text comprehension and production . Psychol. Rev. 85 , 363–394. 10.1037/0033-295X.85.5.363 [ CrossRef ] [ Google Scholar ]
  • Kobayashi M. (2002). Method effects on reading comprehension test performance: test organization and response format . Lang. Test. 19 , 193–220. 10.1191/0265532202lt227oa [ CrossRef ] [ Google Scholar ]
  • Koretz D. (2002). Limitations in the use of achievement tests as measures of educators’ productivity . J. Hum. Resour. 37 , 752–777. 10.2307/3069616 [ CrossRef ] [ Google Scholar ]
  • Korotaeva I. V. (2012). Metacognitive strategies in reading comprehension of education majors . Procedural-Social and Behav. Sci. 69 , 1895–1900. 10.1016/j.sbspro.2012.12.143 [ CrossRef ] [ Google Scholar ]
  • Kucer S. B. (2016). Accuracy, miscues, and the comprehension of complex literary and scientific texts . Read. Psychol. 37 , 1076–1095. 10.1080/02702711.2016.1159632 [ CrossRef ] [ Google Scholar ]
  • Livingston C., Klopper B., Cox S., Uys C. (2015). The impact of an academic reading program in the bachelor of education (intermediate and senior phase) degree . Read. Writ. 6 , 1–11. 10.4102/rw.v6i1.66 [ CrossRef ] [ Google Scholar ]
  • Márquez H., Díaz C., Muñoz R., Fuentes R. (2016). Evaluación de los niveles de comprensión lectora en estudiantes universitarios pertenecientes a las carreras de Kinesiología y Nutrición y Dietética de la Universidad Andrés Bello, Concepción . Revista de Educación en Ciencias de la Salud 13 , 154–160. [ Google Scholar ]
  • Mayer R. E. (ed.) (2009). “ Modality principle ,” in Multimedia Learning. United States: Cambridge University Press, 200–220. [ Google Scholar ]
  • McNamara D. S., Kintsch W. (1996). Learning from texts: effects of prior knowledge and text coherence . Discourse Process 22 , 247–288. 10.1080/01638539609544975 [ CrossRef ] [ Google Scholar ]
  • McNamara D. S., Magliano J. (2009). “ Toward a comprehensive model of comprehension ,” in The psychology of learning and motivation. ed. Ross E. B. (New York: Elsevier; ), 297–384. [ Google Scholar ]
  • Moher D., Shamseer L., Clarke M., Ghersi D., Liberati A., Petticrew M., et al. (2015). Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement . Syst. Rev. 4 :1. 10.1186/2046-4053-4-1, PMID: [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Ntereke B. B., Ramoroka B. T. (2017). Reading competency of first-year undergraduate students at the University of Botswana: a case study . Read. Writ. 8 :a123. 10.4102/rw.v8i1.123 [ CrossRef ] [ Google Scholar ]
  • Oakhill J. (1982). Constructive processes in skilled and less skilled comprehenders’ memory for sentences . Br. J. Psychol. 73 , 13–20. 10.1111/j.2044-8295.1982.tb01785.x, PMID: [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Parr C., Woloshyn V. (2013). Reading comprehension strategy instruction in a first-year course: an instructor’s self-study . Can. J. Scholarship Teach. Learn. 4 :3. 10.5206/cjsotl-rcacea.2013.2.3 [ CrossRef ] [ Google Scholar ]
  • Perfetti C. A., Beck I., Bell L. C., Hughes C. (1987). Phonemic knowledge and learning to read are reciprocal: a longitudinal study of first-grade children . Merrill-Palmer Q. 33 , 283–319. [ Google Scholar ]
  • Pike G. R. (1996). Limitations of using students’ self-reports of academic development as proxies for traditional achievement measures . Res. High. Educ. 37 , 89–114. 10.1007/BF01680043 [ CrossRef ] [ Google Scholar ]
  • Pyle N., Vasquez A. C., Lignugaris B., Gillam S. L., Reutzel D. R., Olszewski A., et al. (2017). Effects of expository text structure interventions on comprehension: a meta-analysis . Read. Res. Q. 52 , 469–501. 10.1002/rrq.179 [ CrossRef ] [ Google Scholar ]
  • Rahmani M. (2011). Effects of note-taking training on reading comprehension and recall . Reading Matrix: An Int. Online J. 11 , 116–126. [ Google Scholar ]
  • Reznitskaya A., Anderson R., Kuo L. J. (2007). Teaching and learning argumentation . Elem. Sch. J. 107 , 449–472. 10.1086/518623 [ CrossRef ] [ Google Scholar ]
  • Sáez Sánchez B. K. (2018). La comprensión lectora en jóvenes universitarios de una escuela formadora de docentes . Revista Electrónica Científica de Investigación Educativa 4 , 609–618. [ Google Scholar ]
  • Sanabria Mantilla T. R. (2018). Relación entre comprensión lectora y rendimiento académico en estudiantes de primer año de Psicología de la Universidad Pontificia Bolivariana. (Trabajo de grado, Universidad Pontificia Bolivariana). Repositorio Institucional UPB. Available at: https://repository.upb.edu.co/bitstream/handle/20.500.11912/5443/digital_36863.pdf?sequence=1&isAllowed=y (Accessed February 15, 2021).
  • Saracaloglu A. S., Karasakaloglu N. (2011). An investigation of prospective teachers’ reading comprehension levels and study and learning strategies related to some variables . Egit. ve Bilim 36 , 98–115. [ Google Scholar ]
  • Smagorinsky P. (2001). If meaning is constructed, what is it made from? Toward a cultural theory of reading . Rev. Educ. Res. 71 , 133–169. 10.3102/00346543071001133 [ CrossRef ] [ Google Scholar ]
  • Spencer C. (2006). Research on learners’ preferences for reading from a printed text or a computer screen . J. Dist. Educ. 21 , 33–50. [ Google Scholar ]
  • Trakhman L. M. S., Alexander P., Berkowitz L. E. (2019). Effects of processing time on comprehension and calibration in print and digital mediums . J. Exp. Educ. 87 , 101–115. 10.1080/00220973.2017.1411877 [ CrossRef ] [ Google Scholar ]
  • Tuncer M., Bahadir F. (2014). Effect of screen reading and reading from printed out material on student success and permanency in introduction to computer lesson . Turk. Online J. Educ. Technol. 13 , 41–49. [ Google Scholar ]
  • Ulum O. G. (2016). A descriptive content analysis of the extent of Bloom’s taxonomy in the reading comprehension questions of the course book Q: skills for success 4 reading and writing . Qual. Rep. 21 , 1674–1683. [ Google Scholar ]
  • UNESCO (2009). “Conferencia mundial sobre la Educación Superior – 2009.” La nueva dinámica de la educación superior y la investigación para el cambio social y el desarrollo; July 5-8, 2009; Paris.
  • van den Broek P., Espin C. A. (2012). Connecting cognitive theory and assessment: measuring individual differences in reading comprehension . Sch. Psychol. Rev. 43 , 315–325. 10.1080/02796015.2012.12087512 [ CrossRef ] [ Google Scholar ]
  • van den Broek P., Mouw J. M., Kraal A. (2016). “ Individual differences in reading comprehension ,” in Handbook of Individual Differences in Reading: Reader, Text, and Context. ed. Afflerbach E. P. (New York: Routledge; ), 138–150. [ Google Scholar ]
  • Wolfe M. B. W., Woodwyk J. M. (2010). Processing and memory of information presented in narrative or expository texts . Br. J. Educ. Psychol. 80 , 341–362. 10.1348/000709910X485700, PMID: [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Xu Y., Wong R., He S., Veldre A., Andrews S. (2020). Is it smart to read on your phone? The impact of reading format and culture on the continued influence of misinformation . Mem. Cogn. 48 , 1112–1127. 10.3758/s13421-020-01046-0, PMID: [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Yáñez Botello C. R. (2013). Caracterización de los procesos cognoscitivos y competencias involucrados en los niveles de comprensión lectora en Estudiantes Universitarios . Cuadernos Hispanoamericanos de Psicología 13 , 75–90. 10.18270/chps.v13i2.1350 [ CrossRef ] [ Google Scholar ]
  • Yussof Y. M., Jamian A. R., Hamzah Z. A. Z., Roslan A. (2013). Students’ reading comprehension performance with emotional . Int. J. Edu. Literacy Stud. 1 , 82–88. 10.7575/aiac.ijels.v.1n.1p.82 [ CrossRef ] [ Google Scholar ]
  • Zheng Y., Cheng L., Klinger D. A. (2007). Do test format in reading comprehension affect second-language students’ test performance differently? TESL Can. J. 25 , 65–78. 10.18806/tesl.v25i1.108 [ CrossRef ] [ Google Scholar ]

The achievement gap in reading competence: the effect of measurement non-invariance across school types

Theresa Rohm (ORCID: orcid.org/0000-0001-9203-327X), Claus H. Carstensen, Luise Fischer & Timo Gnambs

Large-scale Assessments in Education (an IERI – International Educational Research Institute Journal), volume 9, article number 23 (2021). Open access; published 28 October 2021.


Background

After elementary school, students in Germany are separated into different school tracks (i.e., school types) with the aim of creating homogeneous student groups in secondary school. Consequently, the development of students’ reading achievement diverges across school types. Findings on this achievement gap have been criticized as depending on the quality of the administered measure. Therefore, the present study examined to what degree differential item functioning affects estimates of the achievement gap in reading competence.

Methods

Using data from the German National Educational Panel Study, reading competence was investigated across three timepoints during secondary school: in grades 5, 7, and 9 (N = 7,276). First, measurement invariance across school types was tested using the invariance alignment method. Then, multilevel structural equation models were used to examine whether a lack of measurement invariance between school types affected the results regarding reading development.

Results

Our analyses revealed some measurement non-invariant items that did not alter the patterns of competence development found among school types in the longitudinal modeling approach. However, misleading conclusions about the development of reading competence in different school types emerged when the hierarchical data structure (i.e., students being nested in schools) was not taken into account.

Conclusions

We assessed the relevance of measurement invariance and accounting for clustering in the context of longitudinal competence measurement. Even though differential item functioning between school types was found for each measurement occasion, taking these differences in item estimates into account did not alter the parallel pattern of reading competence development across German secondary school types. However, ignoring the clustered data structure of students being nested within schools led to an overestimation of the statistical significance of school type effects.

Introduction

Evaluating measurement invariance is a prerequisite for the meaningful interpretation of differences in latent constructs between groups or over time (Brown, 2006 ). Assessing measurement invariance ensures that observed changes represent true change rather than differences in the interpretation of items. The present study investigates measurement invariance between secondary school types for student reading competence, the cornerstone of learning. Reading competence develops in secondary school from reading simple texts, retrieving information, and making inferences from what is explicitly stated, up to the level of a fluent reader who reads longer and more complex texts and can infer what is not explicitly stated in the text (Chall, 1983 ). In particular, students’ reading competence is essential for the comprehension of educational content in secondary school (Edossa et al., 2019 ; O’Brien et al., 2001 ). Reading development is often investigated either from a school-level perspective or by focusing on individual-level differences. When taking a school-level perspective on reading competence growth within the German secondary school system, the high degree of segregation after the end of primary school must be considered. Most students are separated into different school tracks on the basis of their fourth-grade achievement level to obtain homogeneous student groups in secondary school (Köller & Baumert, 2002 ). This homogenization based on proficiency levels is supposed to optimize teaching and education to account for students’ preconditions, enhancing learning for all students (Baumert et al., 2006 ; Gamoran & Mare, 1989 ). Consequently, divergence in competence attainment already exists at the beginning of secondary school and might increase among the school tracks over the school years. Previous studies comparing reading competence development between different German secondary school types have presented ambiguous results, finding either a comparable increase in reading competence (e.g., Retelsdorf & Möller, 2008 ; Schneider & Stefanek, 2004 ) or a widening gap between upper, middle, and lower academic school tracks (e.g., Pfost & Artelt, 2013 ) for the same schooling years. Increasing performance differences in reading over time are termed “Matthew effects”, in the biblical analogy of the rich getting richer and the poor getting poorer (e.g., Bast and Reitsma, 1998 ; Walberg & Tsai, 1983 ). The Matthew effect hypothesis was first used in the educational context by Stanovich ( 1986 ) to examine individual differences in reading competence development. Besides the widening pattern described by the Matthew effect, parallel or compensatory patterns of reading development can also occur. Parallel development occurs when the studied groups initially diverge in their reading competence and increase similarly over time. A compensatory pattern describes a development in which an initially diverging reading competence between groups converges over time.

Moreover, findings on the divergence in competence attainment have been criticized as being dependent on the quality of the measurement construct (Pfost et al., 2014 ; Protopapas et al., 2016 ). More precisely, the psychometric properties of the administered tests, such as the measurement (non-)invariance of items, can distort individual- or school-level differences. A core assumption of many measurement models pertains to comparable item functioning across groups, meaning that differences between item parameters are zero across groups, or in case of approximate measurement invariance, approximately zero. In practice, this often holds for only a subset of items and partial invariance can then be applied, where some item parameters (i.e., intercepts) are held constant across groups and others are allowed to be freely estimated (Van de Schoot et al., 2013 ). Using data from the German National Educational Panel Study (NEPS; Blossfeld et al., 2011 ), we focus on school-level differences in reading competence across three timepoints. We aim to examine the degree to which measurement non-invariance distorts comparisons of competence development across school types. We therefore compare a model that assumes partial measurement invariance across school types with a model that does not take differences in item estimates between school types into account. Finally, we demonstrate the need to account for clustering (i.e., students nested in schools) in longitudinal reading competence measurement when German secondary school types are compared.
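Schematically, in our own notation rather than the authors' exact specification, the invariance constraints under test can be written for a linear measurement model as follows:

```latex
% Linear measurement model for item j, student i, school type g:
x_{ij}^{(g)} = \tau_j^{(g)} + \lambda_j^{(g)} \, \eta_i + \varepsilon_{ij} .

% Full (scalar) invariance constrains, for every group g:
\tau_j^{(g)} = \tau_j , \qquad \lambda_j^{(g)} = \lambda_j \quad \forall j .

% Partial invariance keeps these equalities only for a subset of
% (anchor) items and frees the intercepts tau (and possibly the
% loadings lambda) of items flagged as non-invariant.
```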

School segregation and reading competence development

Ability tracking of students can take place within schools (e.g., differentiation through course assignment as, for example, in U.S. high schools) or between schools with a curricular differentiation between school types and with distinct learning certificates being offered by each school track, as is the German case (Heck et al., 2004 ; LeTendre et al., 2003 ; Oakes & Wells, 1996 ). The different kinds of curricula at each school type are tailored to the prerequisites of the students and provide different learning opportunities. German students are assigned to different school types based on primary school recommendations that take primary school performance during fourth grade into account, but factors such as support within the family are also considered (Cortina & Trommer, 2009 ; Pfost & Artelt, 2013 ; Retelsdorf et al., 2012 ). Nevertheless, this recommendation is not equally binding across German federal states, leaving room for parents to decide on their children’s school track. Consequently, student achievement in secondary school is associated with the cognitive abilities of students but also with their social characteristics and family background (Baumert et al., 2006 ; Ditton et al., 2005 ). This explicit between-school tracking after fourth grade has consequences for students’ achievement of reading competence in secondary school.

There might be several reasons why different trajectories of competence attainment are observed in the tracked secondary school system (Becker et al., 2006). First, students might already differ in their initial achievement and learning rates at the beginning of secondary school. This is related to curricular differentiation, as early separation aims to create homogeneous student groups in terms of student proficiency levels and, in effect, to enhance learning for all students by providing targeted learning opportunities (Baumert et al., 2003; Köller & Baumert, 2002; Retelsdorf & Möller, 2008). Hence, different learning rates are expected due to selection at the beginning of secondary school (Becker et al., 2006). Second, there are differences in learning and teaching methods among the school tracks, as learning settings are targeted towards students' preconditions. Differences among school types relate to cognitive activation, the amount of teacher support in problem solving, and the demands placed on students' accomplishments (Baumert et al., 2003). Third, composition effects due to the different socioeconomic and ethnic compositions of schools can shape student achievement. Student achievement is determined not only by belonging to a particular school type but also by individual student characteristics. Moreover, the mixture of student characteristics might have decisive effects (Neumann et al., 2007). For example, average achievement rates and the characteristics of students' social backgrounds were found to have effects on competence attainment in secondary school beyond mere school track affiliation and individual characteristics (Baumert et al., 2006). Hence, schools of the same school type were found to differ greatly from each other in their attainment levels and their social compositions (Baumert et al., 2003).

Findings from the cross-sectional Programme for International Student Assessment (PISA) studies, conducted on behalf of the OECD every three years since 2000, consistently show large differences between school tracks in reading competence for German students in ninth grade (Baumert et al., 2001, 2003; Nagy et al., 2017; Naumann et al., 2010; Weis et al., 2016, 2020). Students in upper academic track schools have, on average, higher reading achievement scores than students in the middle and lower academic tracks. Reading competence is thereby highly correlated with other assessed competencies, such as mathematics and science, for which these differences between school tracks hold as well.

A few studies have also examined between-school track differences in the development of reading competence in German secondary schools, with most studies focusing on fifth and seventh grade in selected German federal states (e.g., Bos et al., 2009 ; Lehmann & Lenkeit, 2008 ; Lehmann et al., 1999 ; Pfost & Artelt, 2013 ; Retelsdorf & Möller, 2008 ). While some studies reported parallel developments in reading competence from fifth to seventh grade between school types (Retelsdorf & Möller, 2008 ; Schneider & Stefanek, 2004 ), others found a widening gap (Pfost & Artelt, 2013 ; Pfost et al., 2010 ). A widening gap between school types was also found for other competence domains, such as mathematics (Baumert et al., 2003 , 2006 ; Becker et al., 2006 ; Köller & Baumert, 2001 ), while parallel developments were rarely observed (Schneider & Stefanek, 2004 ).

In summary, different school milieus might be created by the processes of selection into secondary school and formed by the social and ethnic origins of the students (Baumert et al., 2003). This has consequences for reading competence development during secondary school, which can follow a parallel, widening, or compensatory pattern across school types. The cross-sectional PISA study regularly indicates large differences among German school types in ninth grade but does not offer insight into whether these differences already existed at the beginning of secondary school or how they developed throughout secondary school. In comparison, longitudinal studies have traced reading competence development through secondary school, but the studies conducted in the past were regionally limited and presented inconsistent findings for German secondary school types. In addition to differences in curricula, learning and teaching methods, students' social backgrounds, family support, and student composition, the manner in which competence development during secondary school is measured and analyzed might contribute to the observed pattern in reading competence development.

Measuring differences in reading development

A meaningful longitudinal comparison of reading competence between school types and across grades requires a scale with a common metric. More specifically, the relationship between the latent trait score and each observed item should not depend on group membership. The interpretability of scales has been questioned due to scaling issues (Protopapas et al., 2016). While item response theory (IRT) calibration is assumed to be theoretically invariant, it depends in practice on the sample, item fit, and the equivalence of item properties (e.g., discrimination and difficulty) among test takers and compared groups. Hence, empirically discovered between-group differences might be confounded with the psychometric properties of the administered tests. For example, Pfost et al. (2014) concluded from a meta-analysis of 28 studies on Matthew effects in primary school (i.e., the longitudinally widening achievement gap between good and poor readers) that low measurement precision (e.g., constructs presenting floor or ceiling effects) is strongly linked with compensatory patterns in reading achievement. Consequently, measuring change using reading competence scores might depend on the quality of the measurement. Regarding competence development in secondary school, measurement precision is enhanced by accounting for measurement error, by considering the multilevel data structure, and by establishing measurement invariance across groups. A biased measurement model might result when measurement error or the multilevel data structure is ignored, while the presence of differential item functioning (DIF) can be evidence of test-internal item bias. Moreover, the presence of statistical item bias might also contribute to test unfairness and, thus, invalid systematic disadvantages for specific groups (Camilli, 2006).

Latent variable modeling of reading competence, such as latent change models (Raykov, 1999; Steyer et al., 2000), can be advantageous compared to using composite scores. When composite scores are used to represent latent competences, measurement error is ignored (Lüdtke et al., 2011). Hence, biased estimates might be obtained if the construct is represented by composite scores instead of a latent variable measured by multiple indicators that accounts for measurement error (Lüdtke et al., 2008). Investigating student competence growth in secondary school poses a further challenge, as the clustered structure of the data needs to be taken into account. This can, for example, be achieved using cluster-robust standard error estimation methods or through hierarchical linear modeling (cf. McNeish et al., 2017). If the school is the primary sampling unit, students are nested within schools and classes. Ignoring this hierarchical structure during estimation might result in inaccurate standard errors and biased significance tests, as standard errors would be underestimated. In turn, the statistical significance of the effects would be overestimated (Finch & Bolin, 2017; Hox, 2002; Raudenbush & Bryk, 2002; Silva et al., 2019). As one solution, multilevel structural equation modeling (MSEM) takes the hierarchical structure of the data into account while allowing for the estimation of latent variables with dichotomous and ordered categorical indicators (Kaplan et al., 2009; Marsh et al., 2009; Rabe-Hesketh et al., 2007). Although explicitly modeling the multilevel structure (as compared to cluster-robust standard error estimation) involves additional assumptions regarding the distribution of the random effects and the covariance structure of the random effects, it allows for the partitioning of variance across hierarchical levels and for cluster-specific inferences (McNeish et al., 2017).
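
For intuition, the following minimal simulation sketch (illustrative only; the data-generating values, variable names, and the use of statsmodels are our assumptions and not part of the reported analyses) contrasts naive and cluster-robust standard errors for a school-level predictor when outcomes are correlated within schools:

```python
# Minimal sketch: clustered data with a school-level covariate (like school
# type). Naive OLS treats students as independent; the cluster-robust variant
# accounts for within-school correlation. All values are invented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

n_schools, students_per_school = 100, 25
school = np.repeat(np.arange(n_schools), students_per_school)

x = rng.normal(0, 1, n_schools)[school]            # school-level covariate
u = rng.normal(0, 1, n_schools)[school]            # school random effect
y = 0.3 * x + u + rng.normal(0, 1, school.size)    # student-level outcome

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()                         # ignores clustering
robust = sm.OLS(y, X).fit(cov_type="cluster",      # school-clustered SEs
                          cov_kwds={"groups": school})

print(f"naive SE:   {naive.bse[1]:.4f}")
print(f"cluster SE: {robust.bse[1]:.4f}")          # typically clearly larger
```

In such settings, the naive standard error is markedly smaller than the cluster-robust one, which is the mechanism behind the overestimated statistical significance described above.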

Furthermore, regarding the longitudinal modeling of performance divergence, an interpretation of growth relies on the assumption that the same attributes are measured across all timepoints (Williamson et al., 1991 ) and that the administered instrument (e.g., reading competence test items) is measurement invariant across groups (Jöreskog, 1971 ; Schweig, 2014 ). The assumption of measurement invariance presupposes that all items discriminate comparably across groups as well as timepoints and are equally difficult, independent of group membership and measurement occasion. Hence, the item parameters of a measurement model have to be constant across groups, meaning that the probability of answering an item correctly should be the same for members of different groups and at different timepoints when they have equal ability levels (Holland & Wainer, 1993 ; Millsap & Everson, 1993 ). When an item parameter is not independent of group membership, DIF is present.
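
In item response terms, this assumption can be sketched with a two-parameter logistic (2PL) model; the notation below is ours and serves only to illustrate the condition, not to reproduce the operational NEPS scaling. For a student with ability \(\theta\) in group \(g\), the probability of solving item \(j\) with group-specific discrimination \(a_{jg}\) and difficulty \(b_{jg}\) is

\[
P\left(X_{j}=1 \mid \theta, g\right) \;=\; \frac{\exp\left\{a_{jg}\,(\theta - b_{jg})\right\}}{1 + \exp\left\{a_{jg}\,(\theta - b_{jg})\right\}}.
\]

Measurement invariance requires \(a_{jg} = a_j\) and \(b_{jg} = b_j\) for all groups \(g\) (and, longitudinally, for all measurement occasions); item \(j\) exhibits DIF whenever one of these equalities fails.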

The aim of our study is to investigate the effects of measurement non-invariance among school types on the achievement gap in reading competence development in German secondary schools. Measurement invariance between secondary school types is investigated for each measurement occasion to test whether items are biased among the school types. Then, we incorporate the detected DIF into the longitudinal estimation of reading competence development between school types. A model considering school-type-specific item discrimination and difficulty for items exhibiting non-invariance between school types is therefore compared to a model that does not consider these school-type specificities. To achieve measurement precision for this longitudinal competence measurement, we account for measurement error and the clustered data structure through multilevel latent variable modeling. Finally, we present the same models without consideration of the clustered data structure and compare school type effects on reading competence development.

It is our goal to investigate whether the longitudinal development of reading competence is sensitive to the consideration of measurement non-invariance between the analyzed groups and to the consideration of the clustered data structure. This has practical relevance for all studies on reading competence development, where comparisons between school types are of interest and where schools were the primary sampling unit. Such evaluations increase the certainty that observed changes between school types reflect true changes.

Sample and procedure

The sample consisted of N = 7276 German secondary school students, repeatedly tested and interviewed in 2010 and 2011 (grade 5), 2012 and 2013 (grade 7), and 2014 and 2015 (grade 9) as part of the NEPS. Approximately half of the sample was female (48.08%), and 25.46% had a migration background (defined as either the student or at least one parent having been born abroad). Please note that migration background was unequally distributed across school types: 22.1% of high school students, 26.9% of middle secondary school students, 38.5% of lower secondary school students, 31.2% of comprehensive school students, and 15.2% of students from schools offering all tracks of secondary education except the high school track had a migration background. In fifth grade, the students' ages ranged from 9 to 15 years (M = 11.17, SD = 0.54). Students were tested within their class context through written questionnaires and achievement tests. For the first timepoint in grade 5, immediately after students were assigned to different school tracks, a representative sample of German secondary schools was drawn using a stratified multistage sampling design (Aßmann et al., 2011). First, schools that teach at the secondary level were randomly drawn, and second, two grade 5 classes were randomly selected within these schools. Five types of schools were distinguished and served as strata in the first step: high schools ("Gymnasium"), middle secondary schools ("Realschule"), lower secondary schools ("Hauptschule"), comprehensive schools ("Gesamtschule"), and schools offering all tracks of secondary education except the high school track ("Schule mit mehreren Bildungsgängen"). The schools were drawn from these strata proportional to their number of classes. Finally, all students of the selected classes for whom positive parental consent was obtained before panel participation were asked to take part in the study. At the second measurement timepoint in 2012 to 2013, when students attended grade 7, a refreshment sample was drawn due to German federal state-specific differences in the timing of the transition to lower secondary education (N = 2170; 29.82% of the total sample). The sampling design of the refreshment sample resembled that of the original sample (Steinhauer & Zinn, 2016). The ninth-grade sample in 2014 and 2015 was taken at the third measurement timepoint and was a follow-up survey of the students from regular schools in both the original and the refreshment sample. Students were tested at their schools, but N = 1797 students (24.70% of the total sample) had to be tested at at least one measurement timepoint through an individual follow-up within their home context. In both cases, the competence assessments were conducted by a professional survey institute that sent test administrators to the participating schools or households. For an overview of the students tested per measurement timepoint and school type, within the school or home context, as well as information on temporary and final sample attrition, see Table 1.

To group students into their corresponding school type, we used the information on the survey wave when the students were sampled (original sample in grade 5, refreshment sample in grade 7). Overall, most of the sampled students attended high schools ( N  = 3224; 44.31%), 23.65% attended middle secondary schools ( N  = 1721), 13.95% attended lower secondary schools ( N  = 1015), 11.96% of students attended schools offering all tracks of secondary education except the high school track ( N  = 870), and 6.13% attended comprehensive schools ( N  = 446). Altogether, the students attended 299 different schools, with a median of 24 students per school. Further details on the survey and the data collection process are presented on the project website ( http://www.neps-data.de/ ).

Instruments

During each assessment, reading competence was measured with a paper-based achievement test, including 32 items in fifth grade, 40 items in seventh grade administered in easy (27 items) and difficult (29 items) booklet versions, and 46 items in ninth grade administered in easy (30 items) and difficult (32 items) booklet versions. The items were specifically constructed for the administration of the NEPS, and each item was administered once (Krannich et al., 2017; Pohl et al., 2012; Scharl et al., 2017). Because memory effects might distort responses if items are repeatedly administered, the linking of the reading measurements in the NEPS is based on an anchor-group design (Fischer et al., 2016). With two independent link samples (one to link the grade 5 and grade 7 reading competence tests and the other to link the grade 7 with the grade 9 test), drawn from the same population as the original sample, a mean/mean linking was performed (Loyd & Hoover, 1980). In addition, the unidimensionality of the tests and the measurement invariance of the items across grade levels, as well as across relevant sample characteristics (i.e., gender and migration background), have been demonstrated (Fischer et al., 2016; Krannich et al., 2017; Pohl et al., 2012; Scharl et al., 2017). Marginal reliabilities were reported as good, with 0.81 in grade 5, 0.83 in grade 7, and 0.81 in grade 9.
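
Schematically, the mean/mean linking uses a link sample that takes both adjacent tests to derive an additive constant from the mean estimated item difficulties; in simplified notation (ours),

\[
c_{5 \to 7} \;=\; \bar{b}^{(7)}_{\text{link}} - \bar{b}^{(5)}_{\text{link}},
\]

where \(\bar{b}^{(t)}_{\text{link}}\) is the mean estimated difficulty of the grade-\(t\) test in the link sample. Up to the sign convention, this constant shifts the later measurement onto the metric of the earlier one; the grade 7 to grade 9 link is obtained analogously.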

Each test administered to the respondents consisted of five different text types (domains: information, instruction, advertising, commenting and literary text) with subsequent questions in either a simple or complex multiple-choice format or a matching response format. In addition, but unrelated to the five text types, the questions covered three types of cognitive requirements (finding information in the text, drawing text-related conclusions, and reflecting and assessing). To answer the respective question types, these cognitive processes needed to be activated. These dimensional concepts and question types are linked to the frameworks of other large-scale assessment studies, such as PISA (OECD, 2017 ) or the International Adult Literacy Survey (IALS/ALL; e.g., OECD & Statistics Canada 1995 ). Further details on the reading test construction and development are presented by Gehrer et al. ( 2003 ).

Statistical analysis

We adopted the multilevel structural equation modeling framework for the modeling of student reading competence development and fitted a two-level factor model with categorical indicators (Kamata & Vaughn, 2010) to the reading competence tests. Each of the three measurement occasions was modeled as a latent factor. Please note that MSEM is the more general framework for fitting multilevel item response theory models (Fox, 2010; Fox & Glas, 2001; Kamata & Vaughn, 2010; Lu et al., 2005; Muthén & Asparouhov, 2012), and therefore, each factor in our model resembles a unidimensional, two-parameter IRT model. The model setup was the same at the student and the school level; therefore, discrimination parameters (i.e., item loadings) were constrained to be equal at the within and between levels, while difficulty estimates (i.e., item thresholds) and item residual variances were modeled at the between level (i.e., the school level). School type variables were included as binary predictors of latent abilities at the school level.
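
In simplified notation (ours, assuming a probit link, as is common for Bayesian estimation with categorical indicators in Mplus), the within/between decomposition for a single measurement occasion can be sketched as follows. For student \(p\) in school \(s\) responding to item \(j\),

\[
P\!\left(Y_{psj}=1 \,\middle|\, \eta^{W}_{ps}, \eta^{B}_{s}\right) \;=\; \Phi\!\left(\lambda_j\,(\eta^{W}_{ps} + \eta^{B}_{s}) - \tau_j\right),
\qquad
\eta^{B}_{s} \;=\; \sum_{k} \gamma_k\, T_{ks} + \zeta_s,
\]

where the loading \(\lambda_j\) is held equal across levels, the threshold \(\tau_j\) is located at the between level, \(T_{ks}\) are the school type predictors, and \(\zeta_s\) is a school-level residual.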

The multilevel structural equation models for longitudinal competence measurement were estimated using Bayesian MCMC estimation methods in the Mplus software program (version 8.0; Muthén & Muthén, 1998–2020). Two Markov chains were run for each parameter, and chain convergence was assessed using the potential scale reduction (PSR; Gelman & Rubin, 1992) criterion, where values below 1.10 indicate convergence (Gelman et al., 2004). Furthermore, successful convergence of the estimates was evaluated based on trace plots for each parameter. To determine whether the estimated models delivered reliable estimates, autocorrelation plots were inspected. The mean of the posterior distribution and the Bayesian 95% credibility interval were used to evaluate the model parameters. In addition, the hypothesis that both MCMC chains have the same distribution was evaluated with a Kolmogorov–Smirnov test based on 100 draws from each of the two chains per parameter. For all estimated models, the PSR criterion (i.e., the Gelman and Rubin diagnostic) indicated that convergence was achieved, which was confirmed by a visual inspection of the trace plots for each model parameter.
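
As a point of reference, the PSR for a single parameter can be computed as follows (a simplified illustration following Gelman & Rubin, 1992, not the Mplus-internal implementation):

```python
# Potential scale reduction (PSR / R-hat) for m chains of n draws each:
# compares the pooled variance estimate with the mean within-chain variance.
import numpy as np

def psr(chains: np.ndarray) -> float:
    """chains: array of shape (m, n) with m chains of n post-burn-in draws."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    chain_vars = chains.var(axis=1, ddof=1)
    W = chain_vars.mean()                    # mean within-chain variance
    B = n * chain_means.var(ddof=1)          # between-chain variance
    V_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return float(np.sqrt(V_hat / W))         # < 1.10 suggests convergence

rng = np.random.default_rng(1)
draws = rng.normal(0, 1, size=(2, 4000))     # two well-mixed chains
print(f"PSR = {psr(draws):.3f}")             # close to 1.0
```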

Diffuse priors were used: normal distributions with mean zero and infinite variance, N(0, ∞), for intercepts, loading parameters, and regression slopes of continuous indicators; normal priors with mean zero and a variance of 5, N(0, 5), for the thresholds of categorical indicators; inverse-gamma priors IG(−1, 0) for residual variances; and inverse-Wishart priors IW(0, −4) for variances and covariances.

Model fit was assessed using the posterior predictive p-value (PPP), obtained through a fit statistic based on the likelihood-ratio χ² test of an H0 model against an unrestricted H1 model, as implemented in Mplus. A low PPP indicates poor fit, while acceptable model fit starts at PPP > 0.05, and an excellent-fitting model has a PPP value of approximately 0.5 (Asparouhov & Muthén, 2010).
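
In general terms, the PPP is the posterior probability that data replicated under the model fit worse than the observed data; schematically (notation ours),

\[
\text{PPP} \;=\; \Pr\!\left[\chi^{2}\!\left(Y^{\mathrm{rep}}, \theta\right) \ge \chi^{2}\!\left(Y^{\mathrm{obs}}, \theta\right) \,\middle|\, Y^{\mathrm{obs}}\right],
\]

evaluated over the posterior distribution of the model parameters \(\theta\).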

Differential item functioning was examined using the invariance alignment method (IA; Asparouhov & Muthén, 2014; Kim et al., 2017; Muthén & Asparouhov, 2014). These models were estimated with maximum likelihood estimation using numerical integration, taking the nested data structure into account through cluster-robust estimation. The alignment can be run in a fixed variant, in which one group is fixed as the reference, or in a free variant. As fixed alignment was shown to slightly outperform free alignment in a simulation study (Kim et al., 2017), we applied fixed alignment and ran several models, fixing each of the five school types once. Item information for items exhibiting DIF between school types was then split into the respective non-aligning group versus the remaining student groups. Hence, new pseudo-items were introduced for the models that take school-type-specific item properties into account.
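
The following data-manipulation sketch illustrates this pseudo-item split (column and group names are invented for illustration): responses of the non-aligning school type go into one new column and responses of all other school types into another, so that group-specific item parameters can be estimated for that item.

```python
# Sketch of the pseudo-item split for a DIF item; all names are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "school_type": ["high", "middle", "lower", "high", "lower"],
    "item_07":     [1, 0, 1, 1, 0],          # item flagged as non-invariant
})

dif_group = "high"                            # assumed non-aligning school type
is_dif = df["school_type"].eq(dif_group)

df["item_07_dif"]  = df["item_07"].where(is_dif)    # pseudo-item: DIF group only
df["item_07_rest"] = df["item_07"].where(~is_dif)   # pseudo-item: remaining groups
df = df.drop(columns="item_07")               # original item replaced by the split
print(df)                                     # NaN marks "not administered" cells
```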

In the multilevel structural equation models, for the students selected as part of the refreshment sample at the time of the second measurement, we treated their missing information from the first measurement occasion as missing completely at random (Rubin, 1987). Please note that student attrition from the seventh- and ninth-grade samples can be related to features of the sample, even though the multilevel SEM accounts for cases with missing values at the second and third measurement occasions. We fixed the latent factor intercept for the seventh- and ninth-grade assessments to the value of the respective link constant. The average change in item difficulty relative to the original sample was computed from the link samples, and in that manner, an additive linking constant for the overall sample was obtained. Please note that this (additive) linking constant does not change the relations among school type effects per measurement occasion.

Furthermore, we applied weighted effect coding to the school type variables; this is preferred over simple effect coding because the categorical variable school type has categories of different sizes (Sweeney & Ulveling, 1972; Te Grotenhuis et al., 2017). This procedure is advantageous for observational studies, as the data are not balanced, in contrast to data collected via experimental designs. First, we set the high school type as the reference category. Second, to obtain an estimate for this group, we re-estimated the model using the middle secondary school as the reference category. Furthermore, we report Cohen's (1969) d effect size per school type estimate. We calculated this effect size as the difference between each school type effect and the average of all other school type effects per measurement occasion, divided by the square root of the factor variance (hence, the standard deviation) of the respective latent factor. For models where the multilevel structure was accounted for, the within- and between-level components of the respective factor variance were summed for the calculation of Cohen's d.
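
A minimal sketch of both computations is given below; apart from the school type counts reported above, all numeric values are invented placeholders, and the column names are our own.

```python
# Weighted effect coding (cf. Te Grotenhuis et al., 2017): each coded column
# is 1 for its own category, -n_k/n_ref for the reference category, and 0
# otherwise, so coefficients express deviations from the weighted grand mean.
import pandas as pd

school_type = pd.Series(
    ["high"] * 3224 + ["middle"] * 1721 + ["lower"] * 1015
    + ["all_tracks"] * 870 + ["comprehensive"] * 446, name="school_type")

ref = "high"                                   # reference category
counts = school_type.value_counts()

codes = pd.DataFrame(index=school_type.index)
for k in counts.index.drop(ref):
    col = (school_type == k).astype(float)     # 1 for the category itself
    col[school_type == ref] = -counts[k] / counts[ref]   # weighted ref code
    codes[f"wec_{k}"] = col                    # 0 for all remaining categories
print(codes.drop_duplicates())

# Cohen's d per school type, as described above: deviation of each effect
# from the mean of the other school type effects, divided by the total factor
# SD (within- plus between-level variance). Placeholder values throughout.
effects = {"high": 0.90, "middle": -0.05, "lower": -0.80,
           "all_tracks": -0.10, "comprehensive": 0.05}
sd_total = (0.60 + 0.45) ** 0.5                # assumed variance components

d = {k: (v - sum(o for g, o in effects.items() if g != k) / (len(effects) - 1))
        / sd_total for k, v in effects.items()}
print({k: round(x, 2) for k, x in d.items()})
```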

Data availability and analysis syntax

The data analyzed in this study and documentation are available at https://doi.org/10.5157/NEPS:SC3:9.0.0 . Moreover, the syntax used to generate the reported results is provided in an online repository at https://osf.io/5ugwn/?view_only=327ba9ae72684d07be8b4e0c6e6f1684 .

We first tested for measurement invariance between school types and subsequently probed the sensitivity of school type comparisons to accounting for measurement non-invariance. In our analyses, sufficient convergence of the parameter estimates was indicated for all models by an inspection of the trace and autocorrelation plots. Furthermore, the PSR criterion fell below 1.10 for all parameters after 8000 iterations. Hence, appropriate posterior quality was assumed for all parameters at the between and within levels.

DIF between school types

Measurement invariance of the reading competence test items across the school types was assessed using IA. Items with non-aligning, and hence measurement non-invariant, item parameters between these higher-level groups were found for each measurement occasion (see the third, sixth and last columns of Table 2 ). For the reading competence measurement in fifth grade, 11 out of the 32 administered items showed measurement non-invariance in either discrimination or threshold parameters across school types. Most non-invariance occurred for the lowest (lower secondary school) and the highest (high school) types. For 5 of the 11 non-invariant items, the school types with non-invariance were the same for both the discrimination and threshold parameters. In seventh grade, non-invariance across school types was found for 11 out of the 40 test items in either discrimination or threshold parameters. While non-invariance occurred six times in discrimination parameters, it occurred seven times in threshold parameters, and most non-invariance occurred for the high school type (10 out of the 11 non-invariant items). Applying the IA to the competence test administered in ninth grade showed non-invariance for 11 out of the 44 test items. Nearly all non-invariances were between the lowest and highest school types, and most item non-invariance in discrimination and threshold parameters occurred for the last test items.

Consequences of DIF for school type effects

Comparisons of competence development across school types were estimated using MSEM. Each timepoint was modeled as a latent factor, and the between-level component of each latent factor was regressed on the school type. Furthermore, the latent factors were correlated through this modeling approach, both at the within and between levels. Please note that the within- and between-level model setup was the same, and each factor was modeled with several categorical indicators. In Models 1a and 1b, no school-type-specific item discrimination or item difficulty estimates were accounted for, while in Models 2a and 2b, school-type-specific item discrimination and item difficulty estimates were taken into account for items exhibiting DIF. The amount of variance attributable to the school level (intraclass correlation) was high in both of these longitudinal models and amounted to 43.0% (Model 1a)/42.4% (Model 2a) in grade 5, 40.3% (Model 1a)/40.6% (Model 2a) in grade 7, and 43.4% (Model 1a)/43.3% (Model 2a) in grade 9. After including the school type covariates (Model 1b and Model 2b), the amount of variance in the school-level random effects was reduced by approximately two-thirds for each school-level factor, while the amount of variance in the student-level random effects remained nearly the same.
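
In this decomposition, the intraclass correlation per measurement occasion t is simply the share of the latent factor variance located at the school level (notation ours):

\[
\rho_t \;=\; \frac{\sigma^{2}_{B,t}}{\sigma^{2}_{B,t} + \sigma^{2}_{W,t}},
\]

where \(\sigma^{2}_{B,t}\) and \(\sigma^{2}_{W,t}\) denote the between- and within-school variance components of the reading factor in grade t.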

The development of reading competence from fifth to ninth grade appeared to be almost parallel between school types. The results of the first model (see Model 1b in Table 3 ) present quite similar differences in reading competence between school types at each measurement occasion. The highest reading competence is achieved by students attending high schools, followed by middle secondary schools, comprehensive schools and schools offering all school tracks except high school. Students in lower secondary schools had the lowest achievement at all timepoints. As the 95 percent posterior probability intervals overlap between the middle secondary school type, the comprehensive school type and the type of schools offering all school tracks except high school (see Model 1b and Model 2b in Table 3 ), three distinct groups of school types, as defined by reading competence achievement, remain. Furthermore, the comparison of competence development from fifth to ninth grade across these school types was quite stable. The Cohen’s d effect size per school type estimate and per estimated model are presented in Table 4 and support this finding. A large positive effect relative to the average reading competence of the other school types is found for high school students across all grades. A large negative effect is found across all grades for lower secondary school students relative to the other school types. The other three school types have overall small effect sizes across all grades relative to the averages of the other school types.

The results of the second model (see Model 2b in Table 3) show similar differences between the school types compared to the former model. Additionally, effect sizes are similar between the two models. Hence, differences in the development of reading competence across school types are parallel, and this pattern is robust to the discovered school-type-specific DIF in item discrimination and difficulty estimates. With regard to model fit, only the two models that accounted for school-type-specific item discrimination and item difficulty estimates for items exhibiting DIF (Models 2a and 2b) showed an acceptable fit with PPP > 0.05. Furthermore, single-level regression analyses with cluster-robust standard error estimation using the robust maximum likelihood (MLR) estimator were performed to investigate whether the findings were robust to an alternative estimation method for hierarchical data. Please note that the result tables for these analyses are presented in Additional file 1. The main findings remain unaltered: a parallel pattern of reading competence development between the school types was found, as well as three distinct school type groups.

Consequences when ignoring clustering effects

Finally, we estimated the same models without accounting for the clustered data structure (see Table 5 ). In comparison to the previous models, Model 3a and Model 4a show that in seventh and ninth grade the comprehensive school type performed significantly better than the middle secondary schools and schools offering all school tracks except high school.

Additionally, we replicated the analyses of longitudinal reading competence development using point estimates of student reading competence. The point estimates are the linked weighted maximum likelihood estimates (WLE; Warm, 1989) provided by NEPS, and we performed linear growth modeling with and without cluster-robust standard error estimation. Results are presented in Additional file 1: Tables S3–S5. As before, these results support our main findings on the pattern of competence development between German secondary school types and the three distinct school type groups. When the clustered data structure was not accounted for, the misleading finding emerged that comprehensive schools performed significantly better in seventh and ninth grade than middle secondary schools and schools offering all school tracks except high school.

We evaluated measurement invariance between German secondary school types and tested the sensitivity of longitudinal comparisons to the measurement non-invariance that was found. Differences in reading competence between German secondary school types from fifth to ninth grade were investigated, with reading competence modeled as a latent variable that takes measurement error into account. Multilevel modeling was employed to account for the clustered data structure, and measurement invariance between school types was assessed. Based on our results, partial invariance between school types can be assumed (i.e., more than half of the items were measurement invariant, that is, free of DIF; Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000).

The results on the longitudinal estimation of reading competence revealed a parallel pattern between German secondary school types, and this pattern remained when school-type-specific item estimates were included for items exhibiting DIF. Nevertheless, estimating the same models without consideration of the clustered data structure led to misleading conclusions about the pattern of longitudinal reading competence development. In these models, students attending the comprehensive school type were estimated to perform significantly better in seventh and ninth grade than students attending the middle secondary school type and those attending schools offering all school tracks except high school. For research focusing on school type comparisons of latent competences, we emphasize the use of hierarchical modeling when a nested data structure is present.

Furthermore, although we recommend the assessment of measurement invariance, it is not (or not only) a statistical question whether an item induces bias for group comparisons. Rather, procedures for measurement invariance testing are at best part of the test development process, including expert reviews of items exhibiting DIF (Camilli, 1993). Items that are measurement non-invariant and judged to be associated with construct-irrelevant factors are revised or replaced throughout the test development process. Robitzsch and Lüdtke (2020) provide a thoughtful discussion of the reasoning behind (partial) measurement invariance for group comparisons under construct-relevant DIF and DIF caused by construct-irrelevant factors. Information about the amount of item bias in a developed test is also useful for quantifying the uncertainty in group comparisons, analogous to the reporting of linking errors in longitudinal large-scale assessments (cf. Robitzsch & Lüdtke, 2020). While the assumption of exact item parameter invariance across groups is quite strict, we presented a method to assess the less strict approach of partial measurement invariance. Even when a measured construct is only partially invariant, comparisons of school types can be valid. Nevertheless, no statistical method alone can establish construct validity without further theoretical reasoning and expert evaluation. As demonstrated in this study, the sensitivity of longitudinal reading competence development to partial measurement invariance between school types can be assessed.

Implications for research on the achievement gap in reading competence

Studies on reading competence development have presented either parallel development (e.g., Retelsdorf & Möller, 2008; Schneider & Stefanek, 2004) or a widening gap (e.g., Pfost & Artelt, 2013) among secondary school types. In these studies, samples were drawn from different regions (i.e., German federal states), and different methods of statistical analysis were used. We argued that group differences, such as school type effects, can be distorted by measurement non-invariance of test items. As these previous studies did not report analyses of measurement invariance such as DIF, it is unknown whether the differences found relate to the psychometric properties of the administered tests. In our analyses, we found no indication that the pattern of competence development is affected by DIF. As a prerequisite for group-mean comparisons, studies should present evidence of measurement invariance between the investigated groups (and, in the longitudinal case, across measurement occasions) or refer to the sources where these analyses are presented. Also, to enhance the comparability of results across studies on reading competence development, researchers should discuss whether the construct has the same meaning for all groups and across all measurement occasions. On a further note, the previous analyses were regionally limited and considered only one or two German federal states. In comparison, the sample we used is representative at the national level, and we encourage future research to strive to include more regions. Please note that the clustered data structure was always accounted for in previous analyses of reading competence development through cluster-robust maximum likelihood estimation. When the focus is on regression coefficients and variance partitioning, or when inference at the cluster level is not of interest, researchers need to make fewer assumptions about their data when choosing the cluster-robust maximum likelihood estimation approach, as compared to hierarchical linear modeling (McNeish et al., 2017; Stapleton et al., 2016). As mentioned before, inaccurate standard errors and biased significance tests can result when hierarchical structures are ignored during estimation (Hox, 2002; Raudenbush & Bryk, 2002). As a result, standard errors are underestimated, confidence intervals are narrower than they should be, and effects become statistically significant too easily. As our results showed, ignoring the clustered data structure can result in misleading conclusions about the pattern of longitudinal reading competence development in comparisons of German secondary school types.

Limitations

One focus of our study was to investigate the consequences for longitudinal measurements of latent competence when partial invariance is taken into account in the estimation model. It was assumed that the psychometric properties of the scale and the underlying relationship among variables can be affected when some items are non-invariant and thus unfair between school types. With the NEPS study design for reading competence measurement, this assumption cannot be entirely tested, as for each measurement occasion, a completely new set of items is administered to circumvent memory effects. The three measurement occasions are linked through a mean/mean linking approach based on an anchor-group design (Fischer et al., 2016 , 2019 ). Hence, a unique linking constant is assumed to hold for all school types. The computation of the linking constant relies on the assumption that items are invariant across all groups under investigation (e.g., school types). Due to data restrictions, as the data from the additional linking studies are not published by NEPS, we could not investigate the effect of item non-invariance across school types on the computation of linking constants. Therefore, we cannot test the assumption that the scale score metric, upon which the linking constant is computed, holds across measurement occasions for the school clusters and the school types under study. Overall, we assume that high effort was invested in the item and test construction for the NEPS. However, we can conclude that the longitudinal competence measurement is quite robust against the findings presented here regarding measurement non-invariance between school types, as the same measurement instruments are used to create the linking constants. Whenever possible, we encourage researchers to additionally assess measurement invariance across repeated measurements.

On a more general note, and looking beyond issues of statistical modeling, the available information on school types for our study is not exhaustive, as the German secondary school system is very complex and offers several options for students regarding schooling trajectories. A detailed variable on secondary school types and an identification of students who change school types between measurement occasions is desired but difficult to provide for longitudinal analyses (Bayer et al., 2014 ). As we use the school type information that generated the strata for the sampling of students, this information is constant over measurement occasions, but the comparability for later measurement timepoints (e.g., ninth grade) is rather limited.

In summary, it was assumed that school-level differences in measurement constructs may impact the longitudinal measurement of reading competence development. Therefore, we assessed measurement invariance between school types. Differences in item estimates between school types were found for each of the three measurement occasions. Nevertheless, taking these differences in item discrimination and difficulty estimates into account did not alter the parallel pattern of reading competence development when comparing German secondary school types from fifth to ninth grade. Furthermore, the necessity of taking the hierarchical data structure into account when comparing competence development across the school types was demonstrated. Ignoring the fact that students are nested within schools by sampling design in the estimation led to an overestimation of the statistical significance of the effects for the comprehensive school type in seventh and ninth grade.

This paper uses data from the National Educational Panel Study (NEPS): Starting Cohort Grade 5, doi: https://doi.org/10.5157/NEPS:SC3:9.0.0 . From 2008 to 2013, NEPS data was collected as part of the Framework Program for the Promotion of Empirical Educational Research funded by the German Federal Ministry of Education and Research (BMBF). As of 2014, NEPS is carried out by the Leibniz Institute for Educational Trajectories (LIfBi) at the University of Bamberg in cooperation with a nationwide network.

Asparouhov, T., & Muthén, B. (2010). Bayesian analysis using Mplus: Technical implementation (Mplus Technical Report). http://statmodel.com/download/Bayes3.pdf . Accessed 12 November 2020.

Asparouhov, T., & Muthén, B. (2014). Multiple-group factor analysis alignment. Structural Equation Modeling: A Multidisciplinary Journal, 21 (4), 495–508. https://doi.org/10.1080/10705511.2014.919210

Aßmann, C., Steinhauer, H. W., Kiesl, H., Koch, S., Schönberger, B., Müller-Kuller, A., Rohwer, G., Rässler, S., & Blossfeld, H.-P. (2011). 4 Sampling designs of the National Educational Panel Study: Challenges and solutions. Zeitschrift Für Erziehungswissenschaft, 14 (S2), 51–65. https://doi.org/10.1007/s11618-011-0181-8

Bast, J., & Reitsma, P. (1998). Analyzing the development of individual differences in terms of Matthew effects in reading: Results from a Dutch longitudinal study. Developmental Psychology, 34 (6), 1373–1399. https://doi.org/10.1037/0012-1649.34.6.1373

Baumert, J., Klieme, E., Neubrand, M., Prenzel, M., Schiefele, U., Schneider, W., Stanat, P., Tillmann, K.-J., & Weiß, M. (2001). PISA 2000: Basiskompetenzen von Schülerinnen und Schülern im internationalen Vergleich . Leske + Budrich. https://doi.org/10.1007/978-3-322-83412-6

Baumert, J., Stanat, P., & Watermann, R. (2006). Schulstruktur und die Entstehung differenzieller Lern- und Entwicklungsmilieus. In J. Baumert, P. Stanat, & R. Watermann (Eds.), Herkunftsbedingte Disparitäten im Bildungssystem (pp. 95–188). VS Verlag für Sozialwissenschaften.

Baumert, J., Trautwein, U., & Artelt, C. (2003). Schulumwelten—institutionelle Bedingungen des Lehrens und Lernens. In J. Baumert, C. Artelt, E. Klieme, M. Neubrand, M. Prenzel, U. Schiefele, W. Schneider, K.-J. Tillmann, & M. Weiß (Eds.), PISA 2000. Ein differenzierter Blick auf die Länder der Bundesrepublik Deutschland (pp. 261–331). Leske u. Budrich.

Bayer, M., Goßmann, F., & Bela, D. (2014). NEPS technical report: Generated school type variable t723080_g1 in Starting Cohorts 3 and 4 (NEPS Working Paper No. 46). Bamberg: Leibniz Institute for Educational Trajectories, National Educational Panel Study. https://www.neps-data.de/Portals/0/Working%20Papers/WP_XLVI.pdf . Accessed 12 November 2020.

Becker, M., Lüdtke, O., Trautwein, U., & Baumert, J. (2006). Leistungszuwachs in Mathematik. Zeitschrift Für Pädagogische Psychologie, 20 (4), 233–242. https://doi.org/10.1024/1010-0652.20.4.233

Blossfeld, H.-P., Roßbach, H.-G., & von Maurice, J. (Eds.), (2011). Education as a lifelong process: The German National Educational Panel Study (NEPS) [Special Issue]. Zeitschrift für Erziehungswissenschaft , 14.

Bos, W., Bonsen, M., & Gröhlich, C. (2009). KESS 7 Kompetenzen und Einstellungen von Schülerinnen und Schülern an Hamburger Schulen zu Beginn der Jahrgangsstufe 7. HANSE—Hamburger Schriften zur Qualität im Bildungswesen (Vol. 5). Waxmann.

Brown, T. A. (2006). Confirmatory factor analysis for applied research . Guilford Press.

Camilli, G. (1993). The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In P. W. Holland & H. Wainer (Eds.), Differential item functioning: Theory and practice (pp. 397–417). Erlbaum.

Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 221–256). American Council on Education and Praeger.

Chall, J. S. (1983). Stages of reading development . McGraw-Hill.

Cohen, J. (1969). Statistical power analysis for the behavioral sciences . Academic Press.

Cortina, K. S., & Trommer, L. (2009). Bildungswege und Bildungsbiographien in der Sekundarstufe I. Das Bildungswesen in der Bundesrepublik Deutschland: Strukturen und Entwicklungen im Überblick . Waxmann.

Ditton, H., Krüsken, J., & Schauenberg, M. (2005). Bildungsungleichheit—der Beitrag von Familie und Schule. Zeitschrift Für Erziehungswissenschaft, 8 (2), 285–304. https://doi.org/10.1007/s11618-005-0138-x

Edossa, A. K., Neuenhaus, N., Artelt, C., Lingel, K., & Schneider, W. (2019). Developmental relationship between declarative metacognitive knowledge and reading comprehension during secondary school. European Journal of Psychology of Education, 34 (2), 397–416. https://doi.org/10.1007/s10212-018-0393-x

Finch, W. H., & Bolin, J. E. (2017). Multilevel Modeling using Mplus . Chapman and Hall—CRC.

Fischer, L., Gnambs, T., Rohm, T., & Carstensen, C. H. (2019). Longitudinal linking of Rasch-model-scaled competence tests in large-scale assessments: A comparison and evaluation of different linking methods and anchoring designs based on two tests on mathematical competence administered in grades 5 and 7. Psychological Test and Assessment Modeling, 61 , 37–64.

Fischer, L., Rohm, T., Gnambs, T., & Carstensen, C. H. (2016). Linking the data of the competence tests (NEPS Survey Paper No. 1). Bamberg: Leibniz Institute for Educational Trajectories, National Educational Panel Study. https://www.lifbi.de/Portals/0/Survey%20Papers/SP_I.pdf . Accessed 12 November 2020.

Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications . Springer.

Fox, J.-P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using gibbs sampling. Psychometrika, 66 , 271–288.

Gamoran, A., & Mare, R. D. (1989). Secondary school tracking and educational inequality: Compensation, reinforcement, or neutrality? American Journal of Sociology, 94 (5), 1146–1183. https://doi.org/10.1086/229114

Gehrer, K., Zimmermann, S., Artelt, C., & Weinert, S. (2003). NEPS framework for assessing reading competence and results from an adult pilot study. Journal for Educational Research Online, 5 , 50–79.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). Chapman & Hall.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple Sequences. Statistical Science, 7 , 457–472.

Heck, R. H., Price, C. L., & Thomas, S. L. (2004). Tracks as emergent structures: A network analysis of student differentiation in a high school. American Journal of Education, 110 (4), 321–353. https://doi.org/10.1086/422789

Holland, P. W., & Wainer, H. (1993). Differential item functioning . Routledge. https://doi.org/10.4324/9780203357811

Hox, J. J. (2002). Multilevel analysis: Techniques and applications. Quantitative methodology series . Erlbaum.

Jak, S., & Jorgensen, T. (2017). Relating measurement invariance, cross-level invariance, and multilevel reliability. Frontiers in Psychology, 8 , 1640. https://doi.org/10.3389/fpsyg.2017.01640

Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36 (4), 409–426. https://doi.org/10.1007/BF02291366

Kamata, A., & Vaughn, B. K. (2010). Multilevel IRT modeling. In J. J. Hox & J. K. Roberts (Eds.), Handbook of advanced multilevel analysis (pp. 41–57). Routledge.

Kaplan, D., Kim, J.-S., & Kim, S.-Y. (2009). Multilevel latent variable modeling: Current research and recent developments. In R. E. Millsap & A. Maydeu-Olivares (Eds.), The Sage handbook of quantitative methods in psychology (pp. 592–612). Sage Publications Ltd. https://doi.org/10.4135/9780857020994.n24

Kim, E., Cao, C., Wang, Y., & Nguyen, D. (2017). Measurement invariance testing with many groups: A comparison of five approaches. Structural Equation Modeling: A Multidisciplinary Journal . https://doi.org/10.1080/10705511.2017.1304822

Köller, O., & Baumert, J. (2001). Leistungsgruppierungen in der Sekundarstufe I. Ihre Konsequenzen für die Mathematikleistung und das mathematische Selbstkonzept der Begabung. Zeitschrift Für Pädagogische Psychologie, 15 , 99–110. https://doi.org/10.1024//1010-0652.15.2.99

Köller, O., & Baumert, J. (2002). Entwicklung von Schulleistungen. In R. Oerter & L. Montada (Eds.), Entwicklungspsychologie (pp. 735–768). Beltz/PVU.

Krannich, M., Jost, O., Rohm, T., Koller, I., Carstensen, C. H., Fischer, L., & Gnambs, T. (2017). NEPS Technical report for reading—scaling results of starting cohort 3 for grade 7 (NEPS Survey Paper No. 14). Bamberg: Leibniz Institute for Educational Trajectories, National Educational Panel Study. https://www.neps-data.de/Portals/0/Survey%20Papers/SP_XIV.pdf . Accessed 12 November 2020.

Lehmann, R., Gänsfuß, R., & Peek, R. (1999). Aspekte der Lernausgangslage und der Lernentwicklung von Schülerinnen und Schülern an Hamburger Schulen: Klassenstufe 7; Bericht über die Untersuchung im September 1999 . Hamburg: Behörde für Schule, Jugend und Berufsbildung, Amt für Schule.

Lehmann, R. H., & Lenkeit, J. (2008). ELEMENT. Erhebung zum Lese- und Mathematikverständnis. Entwicklungen in den Jahrgangsstufen 4 bis 6 in Berlin . Berlin: Senatsverwaltung für Bildung, Jugend und Sport.

LeTendre, G. K., Hofer, B. K., & Shimizu, H. (2003). What Is tracking? Cultural expectations in the United States, Germany, and Japan. American Educational Research Journal, 40 (1), 43–89. https://doi.org/10.3102/00028312040001043

Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17 , 179–193.

Lu, I. R. R., Thomas, D. R., & Zumbo, B. D. (2005). Embedding IRT in structural equation models: A comparison with regression based on IRT scores. Structural Equation Modeling: A Multidisciplinary Journal, 12 (2), 263–277. https://doi.org/10.1207/s15328007sem1202_5

Lüdtke, O., Marsh, H. W., Robitzsch, A., Trautwein, U., Asparouhov, T., & Muthén, B. (2008). The multilevel latent covariate model: A new, more reliable approach to group-level effects in contextual studies. Psychological Methods, 13 , 203–229.

Lüdtke, O., Marsh, H. W., Robitzsch, A., & Trautwein, U. (2011). A 2x2 taxonomy of multilevel latent contextual model: Accuracy-bias trade-offs in full and partial error correction models. Psychological Methods, 16 , 444–467.

Marsh, H. W., Lüdtke, O., Robitzsch, A., Trautwein, U., Asparouhov, T., Muthén, B., & Nagengast, B. (2009). Doubly-latent models of school contextual effects: Integrating multilevel and structural equation approaches to control measurement and sampling error. Multivariate Behavioral Research, 44 , 764–802.

McNeish, D., Stapleton, L. M., & Silverman, R. D. (2017). On the unnecessary ubiquity of hierarchical linear modeling. Psychological Methods, 22 (1), 114–140. https://doi.org/10.1037/met0000078

Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17 (4), 297–334. https://doi.org/10.1177/014662169301700401

Muthén, B., & Asparouhov, T. (2012). Bayesian SEM: A more flexible representation of substantive theory. Psychological Methods, 17 , 313–335.

Muthén, B., & Asparouhov, T. (2014). IRT studies of many groups: The alignment method. Frontiers in Psychology, 5 , 978. https://doi.org/10.3389/fpsyg.2014.00978

Muthén, L. K., & Muthén, B. O. (1998–2020). Mplus user's guide (8th ed.). Los Angeles, CA: Muthén & Muthén.

Nagy, G., Retelsdorf, J., Goldhammer, F., Schiepe-Tiska, A., & Lüdtke, O. (2017). Veränderungen der Lesekompetenz von der 9. zur 10. Klasse: Differenzielle Entwicklungen in Abhängigkeit der Schulform, des Geschlechts und des soziodemografischen Hintergrunds? Zeitschrift Für Erziehungswissenschaft, 20 (S2), 177–203. https://doi.org/10.1007/s11618-017-0747-1

Naumann, J., Artelt, C., Schneider, W. & Stanat, P. (2010). Lesekompetenz von PISA 2000 bis PISA 2009. In E. Klieme, C. Artelt, J. Hartig, N. Jude, O. Köller, M. Prenzel (Eds.), PISA 2009. Bilanz nach einem Jahrzehnt. Münster: Waxmann. https://www.pedocs.de/volltexte/2011/3526/pdf/DIPF_PISA_ISBN_2450_PDFX_1b_D_A.pdf . Accessed 12 November 2020.

Neumann, M., Schnyder, I., Trautwein, U., Niggli, A., Lüdtke, O., & Cathomas, R. (2007). Schulformen als differenzielle Lernmilieus. Zeitschrift Für Erziehungswissenschaft, 10 (3), 399–420. https://doi.org/10.1007/s11618-007-0043-6




The Education Trust


Hiding In Plain Sight: How Complex Decoding Challenges Can Block Comprehension for Older Readers

The news about American students’ reading abilities isn’t good: For far too long, our education system has failed to teach many children to become proficient readers. Recently, there has been a lot of focus on learning losses experienced by students of all ages during the pandemic — including, unfortunately, in reading. Less than one-third of all students scored at Proficient or above on the 2022 National Assessment of Educational Progress (NAEP) reading assessment. But our struggles to effectively teach all students to read didn’t start in March 2020; they were evident long before then. For example, decades of assessment data show that fewer eighth graders than fourth graders demonstrate grade-level reading proficiency on tests like the NAEP, even though eighth graders have had four more years of schooling.

A lot happens between fourth grade and eighth grade, but one thing students rarely receive during those years is explicit instruction in foundational literacy skills. Why does explicit instruction in reading stop after third grade? Conventional wisdom holds that, once students have mastered basics like phonics, which focuses on decoding one- and two-syllable words, they know “how” to read and need little or no further support to be able to decode more complex words. After third grade, instruction generally shifts to support students’ higher-order reading skills, like fluency and comprehension.

People often blame smartphones and other digital distractions as the reason why so many kids today aren’t good readers. But the low reading proficiency rates of middle school students predate the advent of the smartphone, so the answer must lie elsewhere. An often-overlooked culprit may be the increasing demands placed on older students’ foundational literacy skills once they must independently read and comprehend complicated, discipline-specific texts: Students in upper elementary and middle school often encounter texts that feature sentences with more complicated syntax than those used in early elementary texts or in their everyday speech. These more complicated texts are loaded with words derived from other languages, like Latin and Greek, which have different spelling conventions than the words students learned in their basic phonics lessons. Older students must also draw on their implicit knowledge of morphology (e.g., recognizing the relationship between words like “merry” and “merriment”). A student whose foundational literacy skills allowed her to read third grade texts proficiently might struggle in sixth or seventh grade to decode abstract, multisyllabic words like “tenacious,” “inclination,” or “xeriscape.”
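To make the “merry”/“merriment” relationship concrete, here is a deliberately crude, purely illustrative sketch of suffix stripping. Real morphological analysis is far richer than this; the suffix list and the spelling rule below are invented for the example:

```python
SUFFIXES = ["ment", "ness", "tion", "ious"]

def crude_stem(word):
    """Naive suffix stripping: just enough to relate 'merriment' to 'merry'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # toy spelling repair for the y -> i alternation ("merri" -> "merry")
            return stem[:-1] + "y" if stem.endswith("i") else stem
    return word

print(crude_stem("merriment"))  # -> merry
```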

The widespread belief that the basic decoding skills students have acquired by the end of third grade will provide a sufficient foundation for their future reading growth may actually be undermining our efforts to move the needle on older students’ literacy. Complex decoding challenges are hiding in plain sight and can block older readers from comprehending grade-level texts.

In 2019, a team of researchers at the Educational Testing Service (ETS) published groundbreaking research that drew on an unusual dataset measuring the foundational literacy skills of students in upper elementary, middle, and high school. Assessments designed to test older students’ reading abilities almost always measure comprehension alone; they do not provide any information about students’ foundational skills, like decoding. By using this unique dataset, the ETS team was able to examine the relationship between older students’ decoding skills and their reading comprehension, and the team discovered a decoding threshold: Older students with low decoding skills had consistently low reading comprehension scores, while the students whose decoding scores were above a threshold value had much better comprehension scores. This phenomenon reveals that, while decoding skills alone are no silver bullet to address all older students’ literacy learning needs, the students whose decoding skills are not yet strong enough to read complex text accurately and efficiently will struggle to comprehend that complex text.
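To see what a “decoding threshold” means statistically, here is a minimal sketch on invented data. This is not the ETS analysis; the scales, sample size, and the broken-stick model are all assumptions. It scans for the decoding score at which comprehension starts climbing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: below a true threshold of 60, comprehension is flat and low;
# above it, comprehension rises with decoding skill. Both 0-100 scales are invented.
decoding = rng.uniform(20, 100, 400)
comprehension = np.where(decoding < 60, 25.0, 25.0 + 1.2 * (decoding - 60))
comprehension += rng.normal(0, 6, decoding.size)

def sse_for_threshold(t):
    """Sum of squared errors for a broken-stick model with a knot at t."""
    below = decoding < t
    # flat segment below the candidate threshold
    err = np.sum((comprehension[below] - comprehension[below].mean()) ** 2)
    # linear segment above it (ordinary least squares on one predictor)
    x, y = decoding[~below], comprehension[~below]
    slope, intercept = np.polyfit(x, y, 1)
    return err + np.sum((y - (slope * x + intercept)) ** 2)

candidates = np.arange(30, 91, 1.0)
best = min(candidates, key=sse_for_threshold)
print(f"Estimated decoding threshold: {best:.0f}")  # lands near the true 60
```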

The relationship between decoding ability and comprehension at any age might seem obvious, but it has largely been missing from conversations about supporting older students’ literacy development. It’s hard to talk about something we cannot see or describe; without accurate diagnostic information about older students’ foundational skills, their teachers quite literally cannot know how to support them. The good news is that there is a path forward: Schools can now easily adopt validated screening assessments — e.g., tests like ROAR or Capti Assess — to measure foundational literacy skills among older students. Widespread use of a screening test will enable educators to identify the students who need continued instructional support in foundational literacy skills. Meanwhile, Reading Reimagined , a program by the Advanced Education Research and Development Fund (AERDF), is partnering with researchers to develop both assessment and instructional solutions that will allow teachers to support their older students to cross the decoding threshold, so they can focus on comprehending what they read. Having actionable data is, of course, just the first step to ending the literacy crisis, but every successful journey begins with a first step.

Rebecca Kockler is the Executive Director and Rebecca Sutherland is the Associate Director of Research of Reading Reimagined. Reading Reimagined is a program by the Advanced Education Research & Development Fund (AERDF) .


MIT News | Massachusetts Institute of Technology

Using ideas from game theory to improve the reliability of language models


Imagine you and a friend are playing a game where your goal is to communicate secret messages to each other using only cryptic sentences. Your friend's job is to guess the secret message behind your sentences. Sometimes, you give clues directly, and other times, your friend has to guess the message by asking yes-or-no questions about the clues you've given. The challenge is that both of you want to make sure you're understanding each other correctly and agreeing on the secret message.

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers have created a similar "game" to help improve how AI understands and generates text. It is known as a “consensus game” and it involves two parts of an AI system — one part tries to generate sentences (like giving clues), and the other part tries to understand and evaluate those sentences (like guessing the secret message).

The researchers discovered that by treating this interaction as a game, where both parts of the AI work together under specific rules to agree on the right message, they could significantly improve the AI's ability to give correct and coherent answers to questions. They tested this new game-like approach on a variety of tasks, such as reading comprehension, solving math problems, and carrying on conversations, and found that it helped the AI perform better across the board.

Traditionally, large language models answer one of two ways: generating answers directly from the model (generative querying) or using the model to score a set of predefined answers (discriminative querying), which can lead to differing and sometimes incompatible results. With the generative approach, "Who is the president of the United States?" might yield a straightforward answer like "Joe Biden." However, a discriminative query could incorrectly dispute this fact when evaluating the same answer, such as "Barack Obama."
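A toy illustration of that mismatch, with invented numbers standing in for model log-probabilities (a real implementation would query the language model twice, once per mode):

```python
# Invented log-probabilities; not taken from any actual model.
gen_logprob = {"Joe Biden": -0.3, "Barack Obama": -1.5}   # generative: P(answer | question)
disc_logprob = {"Joe Biden": -1.2, "Barack Obama": -0.4}  # discriminative: P(correct | question, answer)

print("generative pick:    ", max(gen_logprob, key=gen_logprob.get))
print("discriminative pick:", max(disc_logprob, key=disc_logprob.get))
# The two modes disagree, which is exactly the incompatibility the consensus game targets.
```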

So, how do we reconcile mutually incompatible scoring procedures to achieve coherent, efficient predictions? 

"Imagine a new way to help language models understand and generate text, like a game. We've developed a training-free, game-theoretic method that treats the whole process as a complex game of clues and signals, where a generator tries to send the right message to a discriminator using natural language. Instead of chess pieces, they're using words and sentences," says Athul Jacob, an MIT PhD student in electrical engineering and computer science and CSAIL affiliate. "Our way to navigate this game is finding the 'approximate equilibria,' leading to a new decoding algorithm called 'equilibrium ranking.' It's a pretty exciting demonstration of how bringing game-theoretic strategies into the mix can tackle some big challenges in making language models more reliable and consistent."

When tested across many tasks, like reading comprehension, commonsense reasoning, math problem-solving, and dialogue, the team's algorithm consistently improved how well these models performed. Using the ER algorithm with the LLaMA-7B model even outshone the results from much larger models. "Given that they are already competitive, that people have been working on it for a while, but the level of improvements we saw being able to outperform a model that's 10 times the size was a pleasant surprise," says Jacob. 

"Diplomacy," a strategic board game set in pre-World War I Europe, where players negotiate alliances, betray friends, and conquer territories without the use of dice — relying purely on skill, strategy, and interpersonal manipulation — recently had a second coming. In November 2022, computer scientists, including Jacob, developed “Cicero,” an AI agent that achieves human-level capabilities in the mixed-motive seven-player game, which requires the same aforementioned skills, but with natural language. The math behind this partially inspired the Consensus Game. 

While the history of AI agents long predates when OpenAI's software entered the chat in November 2022, it's well documented that they can still cosplay as your well-meaning, yet pathological friend. 

The consensus game system reaches equilibrium as an agreement, ensuring accuracy and fidelity to the model's original insights. To achieve this, the method iteratively adjusts the interactions between the generative and discriminative components until they reach a consensus on an answer that accurately reflects reality and aligns with their initial beliefs. This approach effectively bridges the gap between the two querying methods. 
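Under stated assumptions, the sketch below captures the flavor of that iterative adjustment. It is not the paper’s exact algorithm: this is a simplified regularized no-regret loop in which each player is rewarded for agreeing with the other while being anchored to its initial beliefs, and candidates are finally ranked by joint agreement.

```python
import numpy as np

def consensus_rank(gen_logprobs, disc_logprobs, iters=500, eta=0.1, lam=0.1):
    """Toy 'equilibrium ranking': pull two scoring policies toward consensus.

    gen_logprobs / disc_logprobs: one score per candidate answer (assumed inputs).
    """
    g0 = np.exp(gen_logprobs - gen_logprobs.max()); g0 /= g0.sum()    # initial generator beliefs
    d0 = np.exp(disc_logprobs - disc_logprobs.max()); d0 /= d0.sum()  # initial discriminator beliefs
    qg, qd = np.zeros_like(g0), np.zeros_like(d0)  # cumulative payoffs per player
    g, d = g0.copy(), d0.copy()
    for t in range(1, iters + 1):
        qg += d  # the generator is paid for matching the discriminator
        qd += g  # and vice versa
        # regularized update: follow accumulated payoffs but stay near initial beliefs
        g = g0 * np.exp(eta * qg / (1 + lam * t)); g /= g.sum()
        d = d0 * np.exp(eta * qd / (1 + lam * t)); d /= d.sum()
    return g * d  # rank candidates by joint agreement at the approximate equilibrium
```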

In practice, implementing the consensus game approach to language model querying, especially for question-answering tasks, does involve significant computational challenges. For example, when using datasets like MMLU, which have thousands of questions and multiple-choice answers, the model must apply the mechanism to each query. Then, it must reach a consensus between the generative and discriminative components for every question and its possible answers. 
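Concretely, for an MMLU-style benchmark the per-question cost looks like the loop below, reusing the `consensus_rank` sketch above; the questions and scores are invented:

```python
import numpy as np

# One generative and one discriminative score per multiple-choice option (invented).
questions = {
    "q1": {"gen": [-0.2, -1.9, -2.3, -2.5], "disc": [-1.4, -0.5, -2.0, -2.2]},
    "q2": {"gen": [-1.1, -0.9, -2.8, -0.7], "disc": [-0.8, -1.6, -2.5, -1.0]},
}

for qid, s in questions.items():
    joint = consensus_rank(np.array(s["gen"]), np.array(s["disc"]))
    print(qid, "consensus choice:", int(np.argmax(joint)))  # one equilibrium solve per question
```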

The system did struggle with a grade school rite of passage: math word problems. It couldn't generate wrong answers, which is a critical component of understanding the process of coming up with the right one.

“The last few years have seen really impressive progress in both strategic decision-making and language generation from AI systems, but we’re just starting to figure out how to put the two together. Equilibrium ranking is a first step in this direction, but I think there’s a lot we’ll be able to do to scale this up to more complex problems,” says Jacob.   

An avenue of future work involves enhancing the base model by integrating the outputs of the current method. This is particularly promising since it can yield more factual and consistent answers across various tasks, including factuality and open-ended generation. The potential for such a method to significantly improve the base model's performance is high, which could result in more reliable and factual outputs from ChatGPT and similar language models that people use daily. 

"Even though modern language models, such as ChatGPT and Gemini, have led to solving various tasks through chat interfaces, the statistical decoding process that generates a response from such models has remained unchanged for decades," says Google Research Scientist Ahmad Beirami, who was not involved in the work. "The proposal by the MIT researchers is an innovative game-theoretic framework for decoding from language models through solving the equilibrium of a consensus game. The significant performance gains reported in the research paper are promising, opening the door to a potential paradigm shift in language model decoding that may fuel a flurry of new applications."

Jacob wrote the paper with MIT-IBM Watson Lab researcher Yikang Shen and MIT Department of Electrical Engineering and Computer Science assistant professors Gabriele Farina and Jacob Andreas, who is also a CSAIL member. They presented their work at the International Conference on Learning Representations (ICLR) earlier this month, where it was highlighted as a "spotlight paper." The research also received a “best paper award” at the NeurIPS R0-FoMo Workshop in December 2023.

Press mentions

Quanta Magazine

MIT researchers have developed a new procedure that uses game theory to improve the accuracy and consistency of large language models (LLMs), reports Steve Nadis for Quanta Magazine . “The new work, which uses games to improve AI, stands in contrast to past approaches, which measured an AI program’s success via its mastery of games,” explains Nadis. 


Microsoft Research Blog

MatterSim: A deep-learning model for materials under real-world conditions

Published May 13, 2024

By Han Yang, Senior Researcher; Jielan Li, Researcher; Hongxia Hao, Senior Researcher; and Ziheng Lu, Principal Researcher


In the quest for groundbreaking materials crucial to nanoelectronics, energy storage, and healthcare, a critical challenge looms: predicting a material’s properties before it is even created. This is no small feat, with any combination of 118 elements in the periodic table, and the range of temperatures and pressures under which materials are synthesized and operated. These factors drastically affect atomic interactions within materials, making accurate property prediction and behavior simulation exceedingly demanding.

Here at Microsoft Research, we developed MatterSim , a deep-learning model for accurate and efficient materials simulation and property prediction over a broad range of elements, temperatures, and pressures, enabling in silico materials design. MatterSim employs deep learning to understand atomic interactions from the very fundamental principles of quantum mechanics, across a comprehensive spectrum of elements and conditions: from 0 to 5,000 Kelvin (K), and from standard atmospheric pressure to 10,000,000 atmospheres. In our experiments, MatterSim efficiently handles simulations for a variety of materials, including metals, oxides, sulfides, halides, and their various states such as crystals, amorphous solids, and liquids. Additionally, it offers customization options for intricate prediction tasks by incorporating user-provided data.

Figure 1: Atomic structures of 12 example materials spanning metals, oxides, sulfides, halides, and organic molecules (left), and the temperature and pressure ranges over which materials are synthesized and applied (right).

Simulating materials under realistic conditions across the periodic table

MatterSim’s learning foundation is built on large-scale synthetic data, generated through a blend of active learning, generative models, and molecular dynamics simulations. This data generation strategy ensures extensive coverage of material space, enabling the model to predict energies, atomic forces, and stresses. It serves as a machine-learning force field with a level of accuracy compatible with first-principles predictions. Notably, MatterSim achieves a 10-fold increase in accuracy for material property predictions at finite temperatures and pressures when compared to previous state-of-the-art models. Our research demonstrates its proficiency in simulating a vast array of material properties, including thermal, mechanical, and transport properties, and can even predict phase diagrams.
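The post shows no code, but the workflow it describes, dropping a learned force field into a standard atomistic simulation loop, looks roughly like the sketch below. ASE’s toy EMT potential stands in for MatterSim here, since the model’s own interface is not public in this post; swapping the calculator object is the assumed integration point:

```python
from ase.build import bulk
from ase.calculators.emt import EMT
from ase.md.langevin import Langevin
from ase import units

# Build a copper crystal; EMT is a cheap stand-in for a learned force field.
atoms = bulk("Cu", "fcc", a=3.6).repeat((3, 3, 3))
atoms.calc = EMT()  # a MatterSim-style ML calculator would be assigned here instead

# Finite-temperature molecular dynamics: energies and forces come from the potential.
dyn = Langevin(atoms, timestep=1.0 * units.fs, temperature_K=1000, friction=0.02)
dyn.run(200)

print("potential energy (eV):", atoms.get_potential_energy())
print("max force (eV/Å):", abs(atoms.get_forces()).max())
```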

Figure 2: Parity plots comparing MatterSim predictions with first-principles results for the highest phonon frequency and for the free energies of around 50 materials, and the predicted temperature-pressure phase diagram of MgO, which agrees with experimental measurements.

Adapting to complex design tasks

While trained on broad synthetic datasets, MatterSim is also adaptable to specific design requirements by incorporating additional data. The model utilizes active learning and fine-tuning to customize predictions with high data efficiency. For example, simulating water properties, a task seemingly straightforward but computationally intensive, is significantly streamlined by MatterSim’s adaptive capability. To match experimental accuracy, the model requires only 3% of the data needed by traditional methods; reaching the same accuracy would take roughly 30 times more resources with a specialized model trained from scratch, and exponentially more with first-principles methods.
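As a hedged sketch of what such data-efficient customization might look like in practice (the module names, sizes, and training details here are invented, not MatterSim’s API): freeze a broadly pretrained backbone and fit only a small head on the user’s handful of labeled configurations.

```python
import torch
from torch import nn

# Hypothetical stand-ins: a pretrained encoder that maps an atomic structure
# descriptor to an embedding, plus a small user dataset of (structure, property) pairs.
backbone = nn.Sequential(nn.Linear(64, 128), nn.SiLU(), nn.Linear(128, 128))
for p in backbone.parameters():
    p.requires_grad = False  # keep the broadly pretrained weights fixed

head = nn.Linear(128, 1)  # tiny task-specific head: the only trainable part
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(50, 64)   # 50 labeled examples: the "3% of the data" regime
y = torch.randn(50, 1)

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(head(backbone(x)), y)
    loss.backward()
    opt.step()
```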

Figure 3: The structure of Li2B12H12, a complex solid-state battery material used to benchmark MatterSim (right), and the number of data points needed to reach a given accuracy (left): customizing MatterSim requires roughly 3% and 10% of the data needed to train a comparable model from scratch.


Bridging the gap between atomistic models and real-world measurements

Translating material properties from atomic structures is a complex task, often too intricate for current methods based on statistics, such as molecular dynamics. MatterSim addresses this by mapping these relationships directly through machine learning. It incorporates custom adaptor modules that refine the model to predict material properties from structural data, eliminating the need for intricate simulations. Benchmarked against MatBench, a renowned material property prediction benchmark set, MatterSim demonstrates significant accuracy improvements and outperforms all specialized property-specific models, showcasing its robust capability in direct material property prediction from domain-specific data.
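The “adaptor module” idea, predicting a property directly from the model’s internal representation of a structure rather than from a long simulation, might be sketched as follows. This is purely illustrative; the class, property names, and embedding size are assumptions, not MatterSim internals:

```python
import torch
from torch import nn

class PropertyAdaptors(nn.Module):
    """Illustrative adaptor heads: map a frozen structure embedding straight
    to named properties, skipping any molecular-dynamics averaging step."""
    def __init__(self, dim=128, properties=("bulk_modulus", "band_gap")):
        super().__init__()
        self.heads = nn.ModuleDict({name: nn.Linear(dim, 1) for name in properties})

    def forward(self, embedding):
        return {name: head(embedding) for name, head in self.heads.items()}

emb = torch.randn(4, 128)  # embeddings of 4 structures (invented)
print({name: out.shape for name, out in PropertyAdaptors()(emb).items()})
```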

Looking ahead 

As MatterSim research advances, the emphasis is on experimental validation to reinforce its potential role in pivotal sectors, including the design of catalysts for sustainability, energy storage breakthroughs, and nanotechnology advancements. The planned integration of MatterSim with generative AI models and reinforcement learning heralds a new era in the systematic pursuit of novel materials. This synergy is expected to revolutionize the field, streamlining guided creation of materials tailored for diverse applications ranging from semiconductor technologies to biomedical engineering. Such progress promises to expedite material development and bolster sustainable industrial practices, thereby fostering technological advancements that will benefit society. 

Related publication

MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures and Pressures

