The Impact of Peer Assessment on Academic Performance: A Meta-analysis of Control Group Studies

  • Meta-Analysis
  • Open access
  • Published: 10 December 2019
  • Volume 32, pages 481–509 (2020)


  • Kit S. Double (ORCID: orcid.org/0000-0001-8120-1573),
  • Joshua A. McGrane &
  • Therese N. Hopfenbeck


Peer assessment has been the subject of considerable research interest over the last three decades, with numerous educational researchers advocating for the integration of peer assessment into schools and instructional practice. Research synthesis in this area has, however, largely relied on narrative reviews to evaluate the efficacy of peer assessment. Here, we present a meta-analysis (54 studies, k = 141) of experimental and quasi-experimental studies that evaluated the effect of peer assessment on academic performance in primary, secondary, or tertiary students across subjects and domains. An overall small to medium effect of peer assessment on academic performance was found ( g = 0.31, p < .001). The results suggest that peer assessment improves academic performance compared with no assessment ( g = 0.31, p = .004) and teacher assessment ( g = 0.28, p = .007), but was not significantly different in its effect from self-assessment ( g = 0.23, p = .209). Additionally, meta-regressions examined the moderating effects of several feedback and educational characteristics (e.g., online vs offline, frequency, education level). Results suggested that the effectiveness of peer assessment was remarkably robust across a wide range of contexts. These findings provide support for peer assessment as a formative practice and suggest several implications for the implementation of peer assessment into the classroom.


Feedback is often regarded as a central component of educational practice and crucial to students’ learning and development (Fyfe & Rittle-Johnson, 2016; Hattie and Timperley 2007; Hays, Kornell, & Bjork, 2010; Paulus, 1999). Peer assessment has been identified as one method for delivering feedback efficiently and effectively to learners (Topping 1998; van Zundert et al. 2010). The use of students to generate feedback about the performance of their peers is referred to in the literature using various terms, including peer assessment, peer feedback, peer evaluation, and peer grading. In this article, we adopt the term peer assessment, as it more generally refers to the method of peers assessing or being assessed by each other, whereas the term feedback is used when we refer to the actual content or quality of the information exchanged between peers. This feedback can be delivered in a variety of forms, including written comments, grading, or verbal feedback (Topping 1998). Importantly, by both performing the role of assessor and being assessed themselves, students’ learning can potentially benefit more than if they are only assessed (Reinholz 2016).

Peer assessments tend to be highly correlated with teacher assessments of the same students (Falchikov and Goldfinch 2000 ; Li et al. 2016 ; Sanchez et al. 2017 ). However, in addition to establishing comparability between teacher and peer assessment scores, it is important to determine whether peer assessment also has a positive effect on future academic performance. Several narrative reviews have argued for the positive formative effects of peer assessment (e.g., Black and Wiliam 1998a ; Topping 1998 ; van Zundert et al. 2010 ) and have additionally identified a number of potentially important moderators for the effect of peer assessment. This meta-analysis will build upon these reviews and provide quantitative evaluations for some of the instructional features identified in these narrative reviews by utilising them as moderators within our analysis.

Evaluating the Evidence for Peer Assessment

Empirical Studies

Despite the optimism surrounding peer assessment as a formative practice, there are relatively few control group studies that evaluate the effect of peer assessment on academic performance (Flórez and Sammons 2013 ; Strijbos and Sluijsmans 2010 ). Most studies on peer assessment have tended to focus on either students’ or teachers’ subjective perceptions of the practice rather than its effect on academic performance (e.g., Brown et al. 2009 ; Young and Jackman 2014 ). Moreover, interventions involving peer assessment often confound the effect of peer assessment with other assessment practices that are theoretically related under the umbrella of formative assessment (Black and Wiliam 2009 ). For instance, Wiliam et al. ( 2004 ) reported a mean effect size of .32 in favor of a formative assessment intervention but they were unable to determine the unique contribution of peer assessment to students’ achievement, as it was one of more than 15 assessment practices included in the intervention.

However, as shown in Fig. 1 , there has been a sharp increase in the number of studies related to peer assessment, with over 75% of relevant studies published in the last decade. Although it is still far from being the dominant outcome measure in research on formative practices, many of these recent studies have examined the effect of peer assessment on objective measures of academic performance (e.g., Gielen et al. 2010a ; Liu et al. 2016 ; Wang et al. 2014a ). The number of studies of peer assessment using control group designs also appears to be increasing in frequency (e.g., van Ginkel et al. 2017 ; Wang et al. 2017 ). These studies have typically compared the formative effect of peer assessment with either teacher assessment (e.g., Chaney and Ingraham 2009 ; Sippel and Jackson 2015 ; van Ginkel et al. 2017 ) or no assessment conditions (e.g., Kamp et al. 2014 ; L. Li and Steckelberg 2004 ; Schonrock-Adema et al. 2007 ). Given the increase in peer assessment research, and in particular experimental research, it seems pertinent to synthesise this new body of research, as it provides a basis for critically evaluating the overall effectiveness of peer assessment and its moderators.

Figure 1. Number of records returned by year. Records were collated by searching Web of Science (www.webofknowledge.com) for the keywords ‘peer assessment’, ‘peer grading’, ‘peer evaluation’, or ‘peer feedback’ and categorising the results by year.

Previous Reviews

Efforts to synthesise peer assessment research have largely been limited to narrative reviews, which have made very strong claims regarding the efficacy of peer assessment. For example, in a review of peer assessment with tertiary students, Topping ( 1998 ) argued that the effects of peer assessment are, ‘as good as or better than the effects of teacher assessment’ (p. 249). Similarly, in a review on peer and self-assessment with tertiary students, Dochy et al. ( 1999 ) concluded that peer assessment can have a positive effect on learning but may be hampered by social factors such as friendships, collusion, and perceived fairness. Reviews into peer assessment have also tended to focus on determining the accuracy of peer assessments, which is typically established by the correlation between peer and teacher assessments for the same performances. High correlations have been observed between peer and teacher assessments in three meta-analyses to date ( r = .69, .63, and .68 respectively; Falchikov and Goldfinch 2000 ; H. Li et al. 2016 ; Sanchez et al. 2017 ). Given that peer assessment is often advocated as a formative practice (e.g., Black and Wiliam 1998a ; Topping 1998 ), it is important to expand on these correlational meta-analyses to examine the formative effect that peer assessment has on academic performance.

In addition to examining the correlation between peer and teacher grading, Sanchez et al. ( 2017 ) additionally performed a meta-analysis on the formative effect of peer grading (i.e., a numerical or letter grade was provided to a student by their peer) in intervention studies. They found that there was a significant positive effect of peer grading on academic performance for primary and secondary (grades 3 to 12) students ( g = .29). However, it is unclear whether their findings would generalise to other forms of peer feedback (e.g., written or verbal feedback) and to tertiary students, both of which we will evaluate in the current meta-analysis.

Moderators of the Effectiveness of Peer Assessment

Theoretical frameworks of peer assessment propose that it is beneficial in at least two respects. Firstly, peer assessment allows students to critically engage with the assessed material, to compare and contrast performance with their peers, and to identify gaps or errors in their own knowledge (Topping 1998 ). In addition, peer assessment may improve the communication of feedback, as peers may use similar and more accessible language, as well as reduce negative feelings of being evaluated by an authority figure (Liu et al. 2016 ). However, the efficacy of peer assessment, like traditional feedback, is likely to be contingent on a range of factors including characteristics of the learning environment, the student, and the assessment itself (Kluger and DeNisi 1996 ; Ossenberg et al. 2018 ). Some of the characteristics that have been proposed to moderate the efficacy of feedback include anonymity (e.g., Rotsaert et al. 2018 ; Yu and Liu 2009 ), scaffolding (e.g., Panadero and Jonsson 2013 ), quality and timing of the feedback (Diab 2011 ), and elaboration (e.g., Gielen et al. 2010b ). Drawing on the previously mentioned narrative reviews and empirical evidence, we now briefly outline the evidence for each of the included theoretical moderators.

It is somewhat surprising that most studies that examine the effect of peer assessment tend to assess only the impact on the assessee and not the assessor (van Popta et al. 2017). Assessing may confer several distinct advantages, such as drawing comparisons with peers’ work and increased familiarity with evaluative criteria. Several studies have compared the effect of assessing with being assessed. Lundstrom and Baker (2009) found that assessing a peer’s written work was more beneficial to students’ own writing than being assessed by a peer. Meanwhile, Graner (1987) found that students who acted as assessors and received feedback from a peer did not perform better than students who acted as assessors but did not receive peer feedback. Reviewing peers’ work is also likely to help students become better reviewers of their own work and to revise and improve their own work (Rollinson 2005). While, in practice, students will most often act as both assessor and assessee during peer assessment, it is useful to gain greater insight into the relative impact of performing each of these roles, both for practical reasons and to help determine the mechanisms by which peer assessment improves academic performance.

Peer Assessment Type

The characteristics of peer assessment vary greatly both in practice and within the research literature. Because meta-analysis is unable to capture all of the nuanced dimensions that determine the type, intensity, and quality of peer assessment, we focus on distinguishing between what we regard as the most prevalent types of peer assessment in the literature: grading, peer dialogs, and written assessment. Each of these peer assessment types is widely used in the classroom, often in various combinations (e.g., written qualitative feedback in combination with a numerical grade). While these assessment types differ substantially in terms of their cognitive complexity and comprehensiveness, each has shown at least some evidence of improving academic performance (e.g., Sanchez et al. 2017; Smith et al. 2009; Topping 2009).

Freeform/Scaffolding

Peer assessment is often implemented in conjunction with some form of scaffolding, for example, rubrics and scoring scripts. Scaffolding has been shown both to improve the quality of peer assessment and to increase the amount of feedback assessors provide (Peters, Körndle & Narciss, 2018). Peer assessment has also been shown to be more accurate when rubrics are utilised; for example, Panadero, Romero, & Strijbos (2013) found that students were less likely to overscore their peers when a rubric was provided.

Increasingly, peer assessment has been performed online due in part to the growth in online learning activities as well as the ease by which peer assessment can be implemented online (van Popta et al. 2017 ). Conducting peer assessment online can significantly reduce the logistical burden of implementing peer assessment (e.g., Tannacito and Tuzi 2002 ). Several studies have shown that peer assessment can effectively be carried out online (e.g., Hsu 2016 ; Li and Gao 2016 ). Van Popta et al. ( 2017 ) argue that the cognitive processes involved in peer assessment, such as evaluating, explaining, and suggesting, similarly play out in online and offline environments. However, the social processes involved in peer assessment are likely to substantively differ between online and offline peer assessment (e.g., collaborating, discussing), and it is unclear whether this might limit the benefits of peer assessment through one or the other medium. To the authors’ knowledge, no prior studies have compared the effects of online and offline peer assessment on academic performance.

Because peer assessment is fundamentally a collaborative assessment practice, interpersonal variables play a substantial role in determining the type and quality of peer assessment (Strijbos and Wichmann 2018). Some researchers have argued that anonymous peer assessment is advantageous because assessors are more likely to be honest in their feedback, and interpersonal processes cannot influence how assessees receive the assessment feedback (Rotsaert et al. 2018). Qualitative evidence suggests that anonymous peer assessment results in improved feedback quality and more positive perceptions towards peer assessment (Rotsaert et al. 2018; Vanderhoven et al. 2015). A recent qualitative review by Panadero and Alqassab (2019) found three studies that had compared anonymous peer assessment to a control group (i.e., open peer assessment) and examined academic performance as the outcome. The evidence regarding the benefit of anonymity was mixed, with one of the included studies finding an advantage of anonymity and the other two finding little benefit. Others have questioned whether anonymity impairs cognitive and interpersonal development by limiting the collaborative nature of peer assessment (Strijbos and Wichmann 2018).

Peers are often novices at providing constructive assessment, and inexperienced learners tend to provide limited feedback (Hattie and Timperley 2007). Several studies have therefore suggested that peer assessment becomes more effective as students’ experience with it increases. For example, with greater experience, peers tend to make greater use of scoring criteria (Sluijsmans et al. 2004). Similarly, training in peer assessment over time can improve the quality of the feedback students provide, although the effects may be limited by the extent of a student’s relevant domain knowledge (Alqassab et al. 2018). Frequent peer assessment may also increase positive learner perceptions of peer assessment (e.g., Sluijsmans et al. 2004). However, other studies have found that learner perceptions of peer assessment are not necessarily positive (Alqassab et al. 2018). This may suggest that learner perceptions of peer assessment vary depending on its characteristics (e.g., quality, detail).

Current Study

Given the previous reliance on narrative reviews and the increasing research and teacher interest in peer assessment, as well as the popularity of instructional theories advocating for peer assessment and formative assessment practices in the classroom, we present a quantitative meta-analytic review to develop and synthesise the evidence in relation to peer assessment. This meta-analysis evaluates the effect of peer assessment on academic performance when compared to no assessment as well as teacher assessment. To do this, the meta-analysis only evaluates intervention studies that utilised experimental or quasi-experimental designs, i.e., only studies with control groups, so that the effects of maturation and other confounding variables are mitigated. Control groups can be either passive (e.g., no feedback) or active (e.g., teacher feedback). We meta-analytically address two related research questions:

What effect do peer assessment interventions have on academic performance relative to the observed control groups?

What characteristics moderate the effectiveness of peer assessment?

Working Definitions

The specific methods of peer assessment can vary considerably, but there are a number of shared characteristics across most methods. Peers are defined as individuals at similar (i.e., within 1–2 grades) or identical education levels. Peer assessment must involve assessing or being assessed by peers, or both. Peer assessment requires the communication (either written, verbal, or online) of task-relevant feedback, although the style of feedback can differ markedly, from elaborate written and verbal feedback to holistic ratings of performance.

We took a deliberately broad view of academic performance for this meta-analysis, including traditional outcomes (e.g., test performance or essay writing) as well as practical skills (e.g., constructing a circuit in science class). Despite this broad interpretation of academic performance, we did not include studies carried out in professional/organisational settings, except where professional skills (e.g., teacher training) were being taught in a traditional educational setting (e.g., a university).

Selection Criteria

To be included in this meta-analysis, studies had to meet several criteria. Firstly, a study needed to examine the effect of peer assessment. Secondly, the assessment could be delivered in any form (e.g., written, verbal, online), but needed to be distinguishable from peer-coaching/peer-tutoring. Thirdly, a study needed to compare the effect of peer assessment with a control group. Pre-post designs that did not include a control/comparison group were excluded because we could not discount the effects of maturation or other confounding variables. Moreover, the comparison group could take the form of either a passive control (e.g., a no assessment condition) or an active control (e.g., teacher assessment). Fourthly, a study needed to examine the effect of peer assessment on a non-self-reported measure of academic performance.

In addition to these criteria, a study needed to be carried out in an educational context or be related to educational outcomes in some way. Any level of education (i.e., tertiary, secondary, primary) was acceptable. A study also needed to provide sufficient data to calculate an effect size. If insufficient data was available in the manuscript, the authors were contacted by email to request the necessary data (additional information was provided for a single study). Studies also needed to be written in English.

Literature Search

The literature search was carried out on 8 June 2018 using PsycInfo , Google Scholar , and ERIC. Google Scholar was used to check for additional references as it does not allow for the exporting of entries. These three electronic databases were selected due to their relevance to educational instruction and practice. Results were not filtered based on publication date, but ERIC only holds records from 1966 to present. A deliberately wide selection of search terms was used in the first instance to capture all relevant articles. The search terms included ‘peer grading’ or ‘peer assessment’ or ‘peer evaluation’ or ‘peer feedback’, which were paired with ‘learning’ or ‘performance’ or ‘academic achievement’ or ‘academic performance’ or ‘grades’. All peer assessment-related search terms were included with and without hyphenation. In addition, an ancestry search (i.e., back-search) was performed on the reference lists of the included articles. Conference programs for major educational conferences were searched. Finally, unpublished results were sourced by emailing prominent authors in the field and through social media. Although there is significant disagreement about the inclusion of unpublished data and conference abstracts, i.e., ‘grey literature’ (Cook et al. 1993 ), we opted to include it in the first instance because including only published studies can result in a meta-analysis over-estimating effect sizes due to publication bias (Hopewell et al. 2007 ). It should, however, be noted that none of the substantive conclusions changed when the analyses were re-run with the grey literature excluded.

The database search returned 4072 records. An ancestry search returned an additional 37 potentially relevant articles. No unpublished data could be found. After duplicates were removed, two reviewers independently screened titles and abstracts for relevance. A kappa statistic was calculated to assess inter-rater reliability between the two coders and was found to be .78 (89.06% overall agreement, CI .63 to .94), which is above the recommended minimum level of inter-rater reliability (Fleiss 1971). Subsequently, the full text of articles that were deemed relevant based on their abstracts was examined to ensure that they met the selection criteria described previously. Disagreements between the coders were discussed and, when necessary, resolved by a third coder. Ultimately, 55 articles with 143 effect sizes met the inclusion criteria and were included in the meta-analysis. The search process is depicted in Fig. 2.

Figure 2. Flow chart for the identification, screening, and inclusion of publications in the meta-analysis.

Data Extraction

A research assistant and the first author extracted data from the included papers. We took an iterative approach to the coding procedure, whereby the coders refined the classification of each variable as they progressed through the included studies to ensure that the classifications best characterised the extant literature. Below, the coding strategy is reviewed along with the classifications utilised. Frequency statistics and inter-rater reliability for the extracted data for the different classifications are presented in Table 1. All extracted variables showed at least moderate agreement, except for whether the peer assessment was freeform or structured, which showed fair agreement (Landis and Koch 1977).

Publication Type

Publications were classified into journal articles, conference papers, dissertations, reports, or unpublished records.

Education Level

Education level was coded as either graduate tertiary, undergraduate tertiary, secondary, or primary. Given the small number of studies that utilised graduate samples (N = 2), we subsequently combined this classification with undergraduate to form a general tertiary category. In addition, we recorded the grade level of the students. Generally speaking, primary education covers ages 6–12, secondary education covers ages 13–18, and tertiary education is undertaken after the age of 18.

Age and Sex

The percentage of students in a study that were female was recorded. In addition, we recorded the mean age from each study. Unfortunately, only 55.5% of studies recorded participants’ sex and only 18.5% of studies recorded mean age information.

The subject area associated with the academic performance measure was coded. We also recorded the nature of the academic performance variable for descriptive purposes.

Assessment Role

Studies were coded as to whether the students acted as peer assessors, assessees, or both assessors and assessees.

Comparison Group

Four types of comparison group were found in the included studies: no assessment, teacher assessment, self-assessment, and reader-control. In many instances, a no assessment condition could be characterised as typical instruction; that is, two versions of a course were run—one with peer assessment and one without peer assessment. As such, while no specific teacher assessment comparison condition is referenced in the article, participants would most likely have received some form of teacher feedback as is typical in standard instructional practice. Studies were classified as having teacher assessment on the basis of a specific reference to teacher feedback being provided.

Studies were classified as self-assessment controls if there was an explicit reference to a self-assessment activity, e.g., self-grading/rating. Studies that only included revision, e.g., working alone on revising an assignment, were classified as no assessment rather than self-assessment because they did not necessarily involve explicit self-assessment. Studies where both the comparison and intervention groups received teacher assessment (in addition to peer assessment in the case of the intervention group) were coded as no assessment to reflect the fact that the comparison group received no additional assessment compared to the peer assessment condition. In addition, Philippakos and MacArthur ( 2016 ) and Cho and MacArthur ( 2011 ) were notable in that they utilised a reader-control condition whereby students read, but did not assess peers’ work. Due to the small frequency of this control condition, we ultimately classified them as no assessment controls.

Peer assessment was characterised using coding we believed best captured the theoretical distinctions in the literature. Our typology of peer assessment used three distinct components, which were combined for classification:

Did the peer feedback include a dialog between peers?

Did the peer feedback include written comments?

Did the peer feedback include grading?

Each study was classified using a dichotomous present/absent scoring system for each of the three components.

Studies were dichotomously classified as to whether a specific rubric, assessment script, or scoring system was provided to students. Studies that only provided basic instructions to students to conduct the peer feedback were coded as freeform.

Was the Assessment Online?

Studies were classified based on whether the peer assessment was online or offline.

Studies were classified based on whether the peer assessment was anonymous or identified.

Frequency of Assessment

Studies were coded dichotomously as to whether they involved only a single peer assessment occasion or, alternatively, whether students provided/received peer feedback on multiple occasions.

The level of transfer between the peer assessment task and the academic performance measure was coded into three categories:

No transfer—the peer-assessed task was the same as the academic performance measure. For example, a student’s assignment was assessed by peers and this feedback was utilised to make revisions before it was graded by their teacher.

Near transfer—the peer-assessed task was in the same or very similar format as the academic performance measure, e.g., an essay on a different, but similar topic.

Far transfer—the peer-assessed task was in a different form to the academic performance task, although they may have overlapping content. For example, a student’s assignment was peer assessed, while the final course exam grade was the academic performance measure.

We recorded how participants were allocated to a condition. Three categories of allocation were found in the included studies: random allocation at the class level, at the student level, or at the year/semester level. As only two studies allocated students to conditions at the year/semester level, we combined these studies with the studies allocated at the classroom level (i.e., as quasi-experiments).

Statistical Analyses of Effect Sizes

Effect Size Estimation and Heterogeneity

A random effects, multi-level meta-analysis was carried out using R version 3.4.3 (R Core Team 2017). The primary outcome was the standardised mean difference between the peer assessment and comparison (i.e., control) conditions. A common effect size metric, Hedges’ g, was calculated. A positive Hedges’ g value indicates comparatively higher values on the dependent variable in the peer assessment group (i.e., higher academic performance). Heterogeneity in the effect sizes was estimated using the I² statistic, which indicates the percentage of the variation between studies that is due to heterogeneity rather than chance (Schwarzer et al. 2015). Larger values of the I² statistic indicate greater heterogeneity between the studies in the analysis.
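For reference, these statistics take their conventional forms; the following is a sketch of the standard definitions (the paper does not reproduce the formulas itself), where $n_1$, $n_2$, $s_1$, and $s_2$ are the group sample sizes and standard deviations, $Q$ is Cochran’s heterogeneity statistic, and $k$ is the number of effect sizes:

$$ g = J \cdot \frac{\bar{X}_{\mathrm{peer}} - \bar{X}_{\mathrm{control}}}{s_{\mathrm{pooled}}}, \qquad J = 1 - \frac{3}{4(n_1 + n_2 - 2) - 1}, \qquad s_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$

$$ I^2 = \max\!\left(0, \; \frac{Q - (k - 1)}{Q}\right) \times 100\% $$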

Meta-regressions were performed to examine the moderating effects of the various factors that differed across the studies. We report the results of these meta-regressions alongside sub-groups analyses. While it is possible to judge whether sub-groups differ significantly from each other by examining whether the confidence intervals around their effect sizes overlap, sub-groups analyses may produce biased estimates when heteroscedasticity or multicollinearity is present (Steel and Kammeyer-Mueller 2002). We therefore performed meta-regressions separately for each predictor to test the overall effect of each moderator.

Finally, as this meta-analysis included students from primary school to graduate school, which are highly varied participant and educational contexts, we opted to analyse the data both in complete form, as well as after controlling for each level of education. As such, we were able to look at the effect of each moderator across education levels and for each education level separately.

Robust Variance Estimation

Often meta-analyses include multiple effect sizes from the same sample (e.g., the effect of peer assessment on two different measures of academic performance). Including these dependent effect sizes in a meta-analysis can be problematic, as it can potentially bias the results of the analysis in favour of studies that contribute more effect sizes. Robust Variance Estimation (RVE) was recently developed as a technique to address such concerns (Hedges et al. 2010). RVE allows for the modelling of dependence between effect sizes even when the nature of the dependence is not specifically known. Under such conditions, RVE produces unbiased estimates of fixed effects when dependent effect sizes are included in the analysis (Moeyaert et al. 2017). A correlated effects structure was specified for the meta-analysis (i.e., the random errors in the effects from a single paper were expected to be correlated due to similar participants and procedures). A rho value of .8 was specified for the correlated effects (i.e., effects from the same study), as is standard practice when the correlation is unknown (Hedges et al. 2010). A sensitivity analysis indicated that none of the results varied as a function of the chosen rho. We utilised the ‘robumeta’ package (Fisher et al. 2017) to perform the meta-analyses. Our approach was to use only summative dependent variables when they were provided (e.g., an overall writing quality score rather than individual trait measures), but to utilise individual measures when overall indicators were not available. When a pre-post design was used in a study, we adjusted the effect size for pre-intervention differences in academic performance as long as there was sufficient data to do so (e.g., t tests for pre-post change).
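To illustrate the analytic approach described in this and the preceding section, the sketch below shows how an intercept-only correlated-effects RVE model, a sensitivity analysis over rho, and a single-moderator meta-regression could be fitted with the robumeta package. The data frame dat and its column names (effect_size, var_g, study_id, online) are hypothetical placeholders rather than the authors’ actual dataset or script.

  # Sketch only: correlated-effects RVE meta-analysis with robumeta
  library(robumeta)

  # Overall effect: intercept-only model, effects within a study assumed
  # correlated at rho = .8 (the conventional value when rho is unknown)
  overall <- robu(formula = effect_size ~ 1, data = dat,
                  studynum = study_id, var.eff.size = var_g,
                  modelweights = "CORR", rho = 0.8)
  print(overall)

  # Sensitivity of the estimates to the assumed value of rho
  sensitivity(overall)

  # Meta-regression testing one moderator at a time (here a hypothetical
  # dichotomous indicator for online vs offline peer assessment)
  mod_online <- robu(formula = effect_size ~ online, data = dat,
                     studynum = study_id, var.eff.size = var_g,
                     modelweights = "CORR", rho = 0.8)
  print(mod_online)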

Overall Meta-analysis of the Effect of Peer Assessment

Prior to conducting the analysis, two effect sizes (g = 2.06 and 1.91) were identified as outliers and removed using the outlier labelling rule (Hoaglin and Iglewicz 1987). Descriptive characteristics of the included studies are presented in Table 2. The meta-analysis indicated that there was a significant positive effect of peer assessment on academic performance (g = 0.31, SE = .06, 95% CI = .18 to .44, p < .001). A density graph of the recorded effect sizes is provided in Fig. 3. A sensitivity analysis indicated that the effect size estimates did not differ with different values of rho. Heterogeneity between the studies’ effect sizes was large, I² = 81.08%, supporting the use of a meta-regression/sub-groups analysis in order to explain the observed heterogeneity in effect sizes.
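A minimal sketch of how the outlier labelling rule could be applied is given below, assuming the 2.2 multiplier recommended by Hoaglin and Iglewicz (1987); g_values is a hypothetical vector of the observed effect sizes, not the study data.

  # Outlier labelling rule: flag values beyond the quartiles +/- 2.2 * IQR
  label_outliers <- function(x, k = 2.2) {
    q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE)
    spread <- k * (q[2] - q[1])
    x < q[1] - spread | x > q[2] + spread   # TRUE flags a labelled outlier
  }

  # e.g. g_values[label_outliers(g_values)] would return the flagged effect sizes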

Figure 3. A density plot of effect sizes.

Meta-Regressions and Sub-Groups Analyses

Effect sizes for sub-groups are presented in Table 3 . The results of the meta-regressions are presented in Table 4 .

A meta-regression with tertiary students as the reference category indicated that there was no significant difference in effect size as a function of education level. The effect of peer assessment was similar for secondary students (g = .44, p < .001) and primary school students (g = .41, p = .006) and smaller for tertiary students (g = .21, p = .043). There is, however, a strong theoretical basis for examining effects separately at different education levels (primary, secondary, tertiary), because of the large degree of heterogeneity across such a wide span of learning contexts (e.g., pedagogical practices, intellectual and social development of the students). We therefore report the data both as a whole and separately for each education level for all of the moderators considered here. Education level was contrast coded such that tertiary was compared with the average of secondary and primary, and secondary and primary were compared with each other.
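One way the contrast coding described above could be set up is sketched below; the factor levels and the data frame dat are illustrative assumptions rather than the authors’ code.

  # Contrast coding for education level: tertiary vs the average of
  # secondary and primary, and secondary vs primary
  dat$education <- factor(dat$education, levels = c("tertiary", "secondary", "primary"))
  contrasts(dat$education) <- cbind(
    tertiary_vs_school   = c(2/3, -1/3, -1/3),
    secondary_vs_primary = c(0,    1/2, -1/2)
  )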

A meta-regression indicated that the effect size when comparing peer assessment with teacher assessment was not significantly different from the effect size when comparing peer assessment with no assessment (b = .02, 95% CI − .26 to .31, p = .865). The difference between peer assessment vs. no assessment and peer assessment vs. self-assessment was also not significant (b = − .03, CI − .44 to .38, p = .860); see Table 4. An examination of sub-groups suggested that peer assessment had a moderate positive effect compared with no assessment controls (g = .31, p = .004) and teacher assessment (g = .28, p = .007) and was not significantly different compared with self-assessment (g = .23, p = .209). The meta-regression was also re-run with education level as a covariate, but the results were unchanged.

Meta-regressions indicated that the participant’s role was not a significant moderator of the effect size; see Table 4 . However, given the extremely small number of studies where participants did not act as both assessees ( n = 2) and assessors ( n = 4), we did not perform a sub-groups analysis, as such analyses are unreliable with small samples (Fisher et al. 2017 ).

Subject Area

Given that many subject areas had few studies (see Table 1 ) and the writing subject area made up the majority of effect sizes (40.74%), we opted to perform a meta-regression comparing writing with other subject areas. However, the effect of peer assessment did not differ between writing ( g = .30 , p = .001) and other subject areas ( g = .31 , p = .002); b = − .003, 95% CI − .25 to .25, p = .979. Similarly, the results did not substantially change when education level was entered into the model.

The effect of peer assessment did not differ significantly when peer assessment included a written component (g = .35, p < .001) compared with when it did not (g = .20, p = .015); b = .144, 95% CI − .10 to .39, p = .241. Including education level as a variable in the model did not change the effect of written feedback. Similarly, studies with a dialog component (g = .21, p = .033) did not differ significantly from those without one (g = .35, p < .001); b = − .137, 95% CI − .39 to .12, p = .279.

Studies where peer feedback included a grading component (g = .37, p < .001) did not differ significantly from those that did not (g = .17, p = .138). However, when education level was included in the model, there was a significant interaction, such that the effect of grading for tertiary students differed from the average effect of grading for primary and secondary students (b = .395, 95% CI .06 to .73, p = .022). A follow-up sub-groups analysis showed that grading was beneficial for academic performance in tertiary students (g = .55, p = .009), but not in secondary school students (g = .002, p = .991) or primary school students (g = − .08, p = .762). When the three variables used to characterise peer assessment were entered simultaneously, the results were unchanged.

The average effect size was not significantly different for studies where assessment was freeform, i.e., where no specific script or rubric was given ( g = .42, p = .030) compared to those where a specific script or rubric was provided ( g = .29, p < .001); b = − .13, 95% CI − .51 to .25, p = .455. However, there were few studies where feedback was freeform ( n = 9, k =29). The results were unchanged when education level was controlled for in the meta-regression.

Studies where peer assessment was online ( g = .38, p = .003) did not differ from studies where assessment was offline ( g = .24, p = .004); b = .16, 95% CI − .10 to .42, p = .215. This result was unchanged when education level was included in the meta-regression.

There was no significant difference in effect size between studies where peer assessment was anonymised (g = .27, p = .019) and those where it was not (g = .25, p = .004); b = .03, 95% CI − .22 to .28, p = .811. Nor was the effect significant when education level was controlled for.

Studies where peer assessment was performed only a single time (g = .19, p = .103) did not differ significantly from those where it was performed multiple times (g = .37, p < .001); b = − .17, 95% CI − .45 to .11, p = .223. It is worth noting, however, that the sub-groups analysis suggests that the effect of peer assessment was not significant when considering only the studies that applied it a single time. The result did not change when education level was included in the model.

There was no significant difference in effect size between studies utilising far transfer (g = .21, p = .124) and those with near (g = .42, p < .001) or no transfer (g = .29, p = .017). It is worth noting, however, that the sub-groups analysis suggests that the effect of peer assessment was not significant for studies involving far transfer to the criterion task. As shown in Table 4, the moderating effect of transfer was also not significant when analysed using meta-regressions, either with or without education level in the model.

Studies that allocated participants to experimental conditions at the student level (g = .21, p = .14) did not differ from those that allocated participants at the classroom level (g = .31, p < .001) or the year/semester level (g = .79, p = .223); see Table 4 for the meta-regressions.

Publication Bias

Risk of publication bias was assessed by inspecting the funnel plot (see Fig. 4) of the relationship between observed effects and standard errors for asymmetry (Schwarzer et al. 2015). Egger’s test was also run by including standard error as a predictor in a meta-regression. Based on the funnel plot and a non-significant Egger’s test of asymmetry (b = .886, p = .226), the risk of publication bias was judged to be low.

Figure 4. A funnel plot showing the relationship between standard error and observed effect size for the academic performance meta-analysis.
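Continuing the earlier robumeta sketch, the Egger-type test described above could be implemented by entering the standard error as a predictor in the same RVE framework; the data frame dat and its columns remain hypothetical placeholders.

  # Egger-type asymmetry test: regress effect sizes on their standard errors
  dat$se_g <- sqrt(dat$var_g)
  egger <- robu(formula = effect_size ~ se_g, data = dat,
                studynum = study_id, var.eff.size = var_g,
                modelweights = "CORR", rho = 0.8)
  print(egger)  # a non-significant slope on se_g suggests little funnel-plot asymmetry

  # A simple funnel plot: effect sizes against (inverted) standard errors
  plot(dat$effect_size, dat$se_g, ylim = rev(range(dat$se_g)),
       xlab = "Hedges' g", ylab = "Standard error")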

Proponents of peer assessment argue that it is an effective classroom technique for improving academic performance (Topping 2009). While previous narrative reviews have argued for the benefits of peer assessment, the current meta-analysis quantifies the effect of peer assessment interventions on academic performance within educational contexts. Overall, the results suggest that there is a positive effect of peer assessment on academic performance in primary, secondary, and tertiary students. The magnitude of the overall effect size was within the small to medium range (Sawilowsky 2009). These findings also suggest that the benefits of peer assessment are robust across many contextual factors, including different feedback and educational characteristics.

Recently, researchers have increasingly advocated for the role of assessment in promoting learning in educational practice (Wiliam 2018). Peer assessment forms a core part of theories of formative assessment because it is seen as providing new information about the learning process to the teacher or student, which in turn facilitates later performance (Pellegrino et al. 2001). The current results provide support for the position that peer assessment can be an effective classroom technique for improving academic performance. The results suggest that peer assessment is effective compared with both no assessment (which often involved ‘teaching as usual’) and teacher assessment, suggesting that peer assessment can play an important formative role in the classroom. The findings suggest that structuring classroom activities in a way that utilises peer assessment may be an effective way to promote learning and to optimise the use of teaching resources by permitting the teacher to focus on assisting students with greater difficulties or on more complex tasks. Importantly, the results indicate that peer assessment can be effective across a wide range of subject areas, education levels, and assessment types. Pragmatically, this suggests that classroom teachers can implement peer assessment in a variety of ways and tailor the peer assessment design to the particular characteristics and constraints of their classroom context.

Notably, the results of this quantitative meta-analysis align well with past narrative reviews (e.g., Black and Wiliam 1998a ; Topping 1998 ; van Zundert et al. 2010 ). The fact that both quantitative and qualitative syntheses of the literature suggest that peer assessment can be beneficial provides a stronger basis for recommending peer assessment as a practice. However, several of the moderators of the effectiveness of peer feedback that have been argued for in the available narrative reviews (e.g., rubrics; Panadero and Jonsson 2013 ) have received little support from this quantitative meta-analysis. As detailed below, this may suggest that the prominence of such feedback characteristics in narrative reviews is more driven by theoretical considerations rather than quantitative empirical evidence. However, many of these moderating variables are complex, for example, rubrics can take many forms, and due to this complexity may not lend themselves as well to quantitative synthesis/aggregation (for a detailed discussion on combining qualitative and quantitative evidence, see Gorard 2002 ).

Mechanisms and Moderators

Indeed, the current findings suggest that the feedback characteristics deemed important by current theories of peer assessment may not be as significant as first thought. Previously, individual studies have argued for the importance of characteristics such as rubrics (Panadero and Jonsson 2013 ), anonymity (Bloom & Hautaluoma, 1987 ), and allowing students to practice peer assessment (Smith, Cooper, & Lancaster, 2002 ). While these feedback characteristics have been shown to affect the efficacy of peer assessment in individual studies, we find little evidence that they moderate the effect of peer assessment when analysed across studies. Many of the current models of peer assessment rely on qualitative evidence, theoretical arguments, and pedagogical experience to formulate theories about what determines effective peer assessment. While such evidence should not be discounted, the current findings also point to the need for better quantitative and experimental studies to test some of the assumptions embedded in these models. We suggest that the null findings observed in this meta-analysis regarding the proposed moderators of peer assessment efficacy should be interpreted cautiously, as more studies that experimentally manipulate these variables are needed to provide more definitive insight into how to design better peer assessment procedures.

While the current findings are ambiguous regarding the mechanisms of peer assessment, it is worth noting that without a solid understanding of the mechanisms underlying peer assessment effects, it is difficult to identify important moderators or optimally use peer assessment in the classroom. Often the research literature makes somewhat broad claims about the possible benefits of peer assessment. For example, Topping ( 1998 , p.256) suggested that peer assessment may, ‘promote a sense of ownership, personal responsibility, and motivation… [and] might also increase variety and interest, activity and interactivity, identification and bonding, self-confidence, and empathy for others’. Others have argued that peer assessment is beneficial because it is less personally evaluative—with evidence suggesting that teacher assessment is often personally evaluative (e.g., ‘good boy, that is correct’) which may have little or even negative effects on performance particularly if the assessee has low self-efficacy (Birney, Beckmann, Beckmann & Double 2017 ; Double and Birney 2017 , 2018 ; Hattie and Timperley 2007 ). However, more research is needed to distinguish between the many proposed mechanisms for peer assessment’s formative effects made within the extant literature, particularly as claims about the mechanisms of the effectiveness of peer assessment are often evidenced by student self-reports about the aspects of peer assessment they rate as useful. While such self-reports may be informative, more experimental research that systematically manipulates aspects of the design of peer assessment is likely to provide greater clarity about what aspects of peer assessment drive the observed benefits.

Our findings did indicate an important role for grading in determining the effectiveness of peer feedback. We found that peer grading was beneficial for tertiary students but not for primary or secondary school students, suggesting that grading adds little to the peer feedback process for non-tertiary students. In contrast, a recent meta-analysis by Sanchez et al. (2017) on peer grading found a benefit for non-tertiary students, albeit based on a relatively small number of studies compared with the current meta-analysis. The present findings suggest that there may be significant qualitative differences in how peer grading functions as students develop. For example, the criteria students use to assess ability may change as they age (Stipek and Iver 1989). It is difficult to ascertain precisely why grading has positive additive effects only in tertiary students, but there are substantial differences in pedagogy, curriculum, motivation for learning, and grading systems that may account for these differences. One possibility is that tertiary students are more ‘grade orientated’ and therefore put more weight on peer assessment that includes a specific grade. Further research is needed to explore the effects of grading at different educational levels.

One of the more unexpected findings of this meta-analysis was the positive effect of peer assessment compared with teacher assessment. This finding is somewhat counterintuitive given the greater qualifications and pedagogical experience of the teacher. In addition, in many of the studies, the teacher had privileged knowledge about, and often graded, the outcome assessment. Thus, it seems reasonable to expect that teacher feedback would better align with assessment objectives and therefore produce better outcomes. Despite all these advantages, teacher assessment appeared to be less efficacious than peer assessment for academic performance. It is possible that the pedagogical disadvantages of peer assessment are compensated for by its affective or motivational aspects, or by the substantial benefits of acting as an assessor. However, more experimental research is needed to rule out the effects of the potential methodological issues discussed in detail below.

Limitations

A major limitation of the current results is that they cannot adequately distinguish between the effect of assessing versus being an assessee. Most of the current studies confound giving and receiving peer assessment in their designs (i.e., the students in the peer assessment group both provide assessment and receive it), and therefore, no substantive conclusions can be drawn about whether the benefits of peer assessment extend from giving feedback, receiving feedback, or both. This raises the possibility that the benefit of peer assessment comes more from assessing, rather than being assessed (Usher 2018 ). Consistent with this, Lundstrom and Baker ( 2009 ) directly compared the effects of giving and receiving assessment on students’ writing performance and found that assessing was more beneficial than being assessed. Similarly, Graner ( 1987 ) found that assessing papers without being assessed was as effective for improving writing performance as assessing papers and receiving feedback.

Furthermore, more true experiments are needed, as the present results suggest that such designs produce more conservative estimates of the effect of peer assessment. The studies included in this meta-analysis were not only predominantly randomly allocated at the classroom level (i.e., quasi-experiments), but in all but one case were not analysed using techniques appropriate for clustered data (e.g., multi-level modelling). This is problematic because it makes disentangling classroom-level effects (e.g., teacher quality) from the intervention effect difficult, which may lead to biased statistical inferences (Hox 1998). While experimental designs with individual allocation are often not pragmatic for classroom interventions, online peer assessment interventions appear to be obvious candidates for more true experiments. In particular, carefully controlled experimental designs that examine the effect of specific assessment characteristics, rather than ‘black-box’ studies of the effectiveness of peer assessment, are crucial for understanding when and how peer assessment is most likely to be effective. For example, peer assessment may be counterproductive when learning novel tasks due to students’ inadequate domain knowledge (Könings et al. 2019).

While the current results provide an overall estimate of the efficacy of peer assessment in improving academic performance when compared with teacher and no assessment, it should be noted that these effects are averaged across a wide range of outcome measures, including science project grades, essay writing ratings, and end-of-semester exam scores. Aggregating across such disparate outcomes is always problematic in meta-analysis and is a particular concern for meta-analyses in educational research, as some outcome measures are likely to be more sensitive to interventions than others (Wiliam, 2010). A further issue is that the effect of moderators may differ between academic domains. For example, some assessment characteristics may be important when teaching writing but not mathematics. Because there were too few studies in the individual academic domains (with the exception of writing), we were unable to account for these differential effects. The effects of the moderators reported here therefore need to be considered as overall averages that provide information about the extent to which the effect of a moderator generalises across domains.

Finally, the findings of the current meta-analysis are also somewhat limited by the fact that few studies gave a complete profile of the participants and measures used. For example, few studies reported the ability of the peer reviewer relative to the reviewee, and the age difference between peers was not necessarily clear. Furthermore, it was not possible to classify the academic performance measures further, for example based on novelty, or to code for the quality of the measures, including their reliability and validity, because very few studies provided comprehensive details about the outcome measure(s) they utilised. Moreover, other important variables, such as fidelity of treatment, were almost never reported in the included manuscripts. Indeed, many of the included variables needed to be coded based on inferences from the included studies’ text rather than explicit statements, even when one would reasonably expect that information to be made clear in a peer-reviewed manuscript. The observed effect sizes reported here should therefore be taken as an indicator of average efficacy based on the extant literature and not an indication of expected effects for specific implementations of peer assessment.

Overall, our findings provide support for the use of peer assessment as a formative practice for improving academic performance. The results indicate that peer assessment is more effective than no assessment and teacher assessment and not significantly different in its effect from self-assessment. These findings are consistent with current theories of formative assessment and instructional best practice and provide strong empirical support for the continued use of peer assessment in the classroom and other educational contexts. Further experimental work is needed to clarify the contextual and educational factors that moderate the effectiveness of peer assessment, but the present findings are encouraging for those looking to utilise peer assessment to enhance learning.

References marked with an * were included in the meta-analysis

* AbuSeileek, A. F., & Abualsha'r, A. (2014). Using peer computer-mediated corrective feedback to support EFL learners'. Language Learning & Technology, 18 (1), 76-95.

Alqassab, M., Strijbos, J. W., & Ufer, S. (2018). Training peer-feedback skills on geometric construction tasks: Role of domain knowledge and peer-feedback levels. European Journal of Psychology of Education, 33 (1), 11–30.


* Anderson, N. O., & Flash, P. (2014). The power of peer reviewing to enhance writing in horticulture: Greenhouse management. International Journal of Teaching and Learning in Higher Education, 26 (3), 310–334.

* Bangert, A. W. (1995). Peer assessment: an instructional strategy for effectively implementing performance-based assessments. (Unpublished doctoral dissertation). University of South Dakota.

* Benson, N. L. (1979). The effects of peer feedback during the writing process on writing performance, revision behavior, and attitude toward writing. (Unpublished doctoral dissertation). University of Colorado, Boulder.

* Bhullar, N., Rose, K. C., Utell, J. M., & Healey, K. N. (2014). The impact of peer review on writing in a psychology course: Lessons learned. Journal on Excellence in College Teaching, 25(2), 91–106.

* Birjandi, P., & Hadidi Tamjid, N. (2012). The role of self-, peer and teacher assessment in promoting Iranian EFL learners’ writing performance. Assessment & Evaluation in Higher Education, 37 (5), 513–533.

Birney, D. P., Beckmann, J. F., Beckmann, N., & Double, K. S. (2017). Beyond the intellect: Complexity and learning trajectories in Raven’s Progressive Matrices depend on self-regulatory processes and conative dispositions. Intelligence, 61 , 63–77.

Black, P., & Wiliam, D. (1998a). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5 (1), 7–74.


Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability (formerly: Journal of Personnel Evaluation in Education), 21 (1), 5.

Bloom, A. J., & Hautaluoma, J. E. (1987). Effects of message valence, communicator credibility, and source anonymity on reactions to peer feedback. The Journal of Social Psychology, 127 (4), 329–338.

Brown, G. T., Irving, S. E., Peterson, E. R., & Hirschfeld, G. H. (2009). Use of interactive–informal assessment practices: New Zealand secondary students' conceptions of assessment. Learning and Instruction, 19 (2), 97–111.

* Califano, L. Z. (1987). Teacher and peer editing: Their effects on students' writing as measured by t-unit length, holistic scoring, and the attitudes of fifth and sixth grade students (Unpublished doctoral dissertation), Northern Arizona University.

* Chaney, B. A., & Ingraham, L. R. (2009). Using peer grading and proofreading to ratchet student expectations in preparing accounting cases. American Journal of Business Education, 2 (3), 39-48.

* Chang, S. H., Wu, T. C., Kuo, Y. K., & You, L. C. (2012). Project-based learning with an online peer assessment system in a photonics instruction for enhancing led design skills. Turkish Online Journal of Educational Technology-TOJET, 11(4), 236–246.

* Cho, K., & MacArthur, C. (2011). Learning by reviewing. Journal of Educational Psychology, 103 (1), 73.

Cho, K., Schunn, C. D., & Charney, D. (2006). Commenting on writing: Typology and perceived helpfulness of comments from novice peer reviewers and subject matter experts. Written Communication, 23 (3), 260–294.

Cook, D. J., Guyatt, G. H., Ryan, G., Clifton, J., Buckingham, L., Willan, A., et al. (1993). Should unpublished data be included in meta-analyses?: Current convictions and controversies. JAMA, 269 (21), 2749–2753.

*Crowe, J. A., Silva, T., & Ceresola, R. (2015). The effect of peer review on student learning outcomes in a research methods course.  Teaching Sociology, 43 (3), 201–213.

* Diab, N. M. (2011). Assessing the relationship between different types of student feedback and the quality of revised writing . Assessing Writing, 16(4), 274-292.

Demetriadis, S., Egerter, T., Hanisch, F., & Fischer, F. (2011). Peer review-based scripted collaboration to support domain-specific and domain-general knowledge acquisition in computer science. Computer Science Education, 21 (1), 29–56.

Dochy, F., Segers, M., & Sluijsmans, D. (1999). The use of self-, peer and co-assessment in higher education: A review. Studies in Higher Education, 24 (3), 331–350.

Double, K. S., & Birney, D. (2017). Are you sure about that? Eliciting confidence ratings may influence performance on Raven’s progressive matrices. Thinking & Reasoning, 23 (2), 190–206.

Double, K. S., & Birney, D. P. (2018). Reactivity to confidence ratings in older individuals performing the latin square task. Metacognition and Learning, 13(3), 309–326.

* Enders, F. B., Jenkins, S., & Hoverman, V. (2010). Calibrated peer review for interpreting linear regression parameters: Results from a graduate course. Journal of Statistics Education , 18 (2).

* English, R., Brookes, S. T., Avery, K., Blazeby, J. M., & Ben-Shlomo, Y. (2006). The effectiveness and reliability of peer-marking in first-year medical students. Medical Education, 40 (10), 965-972.

* Erfani, S. S., & Nikbin, S. (2015). The effect of peer-assisted mediation vs. tutor-intervention within dynamic assessment framework on writing development of Iranian Intermediate EFL Learners. English Language Teaching, 8 (4), 128–141.

Falchikov, N., & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-analysis comparing peer and teacher marks. Review of Educational Research, 70 (3), 287–322.

* Farrell, K. J. (1977). A comparison of three instructional approaches for teaching written composition to high school juniors: teacher lecture, peer evaluation, and group tutoring (Unpublished doctoral dissertation), Boston University, Boston.

Fisher, Z., Tipton, E., & Zhipeng, Z. (2017). robumeta: Robust variance meta-regression (Version 2). Retrieved from https://CRAN.R-project.org/package=robumeta

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76 (5), 378.

Flórez, M. T., & Sammons, P. (2013). Assessment for learning: Effects and impact. Reading, England: CfBT Education Trust.

Fyfe, E. R., & Rittle-Johnson, B. (2016). Feedback both helps and hinders learning: The causal role of prior knowledge. Journal of Educational Psychology, 108 (1), 82.

Gielen, S., Peeters, E., Dochy, F., Onghena, P., & Struyven, K. (2010a). Improving the effectiveness of peer feedback for learning. Learning and Instruction, 20 (4), 304–315.

* Gielen, S., Tops, L., Dochy, F., Onghena, P., & Smeets, S. (2010b). A comparative study of peer and teacher feedback and of various peer feedback forms in a secondary school writing curriculum. British Educational Research Journal , 36 (1), 143-162.

Gorard, S. (2002). Can we overcome the methodological schism? Four models for combining qualitative and quantitative evidence. Research Papers in Education Policy and Practice, 17 (4), 345–361.

Graner, M. H. (1987). Revision workshops: An alternative to peer editing groups. The English Journal, 76 (3), 40–45.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 (1), 81–112.

Hays, M. J., Kornell, N., & Bjork, R. A. (2010). The costs and benefits of providing feedback during learning. Psychonomic bulletin & review, 17 (6), 797–801.

Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6 (2), 107–128.

Hedges, L. V., Tipton, E., & Johnson, M. C. (2010). Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1 (1), 39–65.

Higgins, J. P., & Green, S. (2011). Cochrane handbook for systematic reviews of interventions. The Cochrane Collaboration. Version 5.1.0, www.handbook.cochrane.org

Hoaglin, D. C., & Iglewicz, B. (1987). Fine-tuning some resistant rules for outlier labeling. Journal of the American Statistical Association, 82 (400), 1147–1149.

Hopewell, S., McDonald, S., Clarke, M. J., & Egger, M. (2007). Grey literature in meta-analyses of randomized trials of health care interventions. Cochrane Database of Systematic Reviews .

* Horn, G. C. (2009). Rubrics and revision: What are the effects of 3rd graders using rubrics to self-assess or peer-assess drafts of writing? (Unpublished doctoral thesis), Boise State University.

Hox, J. J. (1998). Multilevel modeling: When and why. In I. Balderjahn, R. Mathar, & M. Schader (Eds.), Classification, data analysis, and data highways (pp. 147–154). New York: Springer Verlag.


* Hsia, L. H., Huang, I., & Hwang, G. J. (2016). A web-based peer-assessment approach to improving junior high school students’ performance, self-efficacy and motivation in performing arts courses. British Journal of Educational Technology, 47 (4), 618–632.

* Hsu, T. C. (2016). Effects of a peer assessment system based on a grid-based knowledge classification approach on computer skills training. Journal of Educational Technology & Society , 19 (4), 100-111.

* Hussein, M. A. H., & Al Ashri, El Shirbini A. F. (2013). The effectiveness of writing conferences and peer response groups strategies on the EFL secondary students' writing performance and their self efficacy (A Comparative Study). Egypt: National Program Zero.

* Hwang, G. J., Hung, C. M., & Chen, N. S. (2014). Improving learning achievements, motivations and problem-solving skills through a peer assessment-based game development approach. Educational Technology Research and Development, 62 (2), 129–145.

* Hwang, G. J., Tu, N. T., & Wang, X. M. (2018). Creating interactive E-books through learning by design: The impacts of guided peer-feedback on students’ learning achievements and project outcomes in science courses. Journal of Educational Technology & Society, 21 (1), 25–36.

* Kamp, R. J., van Berkel, H. J., Popeijus, H. E., Leppink, J., Schmidt, H. G., & Dolmans, D. H. (2014). Midterm peer feedback in problem-based learning groups: The effect on individual contributions and achievement. Advances in Health Sciences Education, 19 (1), 53–69.

* Karegianes, M. J., Pascarella, E. T., & Pflaum, S. W. (1980). The effects of peer editing on the writing proficiency of low-achieving tenth grade students. The Journal of Educational Research , 73 (4), 203-207.

* Khonbi, Z. A., & Sadeghi, K. (2013). The effect of assessment type (self vs. peer) on Iranian university EFL students’ course achievement. Procedia-Social and Behavioral Sciences , 70 , 1552-1564.

Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119 (2), 254.

Könings, K. D., van Zundert, M., & van Merriënboer, J. J. G. (2019). Scaffolding peer-assessment skills: Risk of interference with learning domain-specific skills? Learning and Instruction, 60 , 85–94.

* Kurihara, N. (2017). Do peer reviews help improve student writing abilities in an EFL high school classroom? TESOL Journal, 8 (2), 450–470.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33 (1), 159–174.

* Li, L., & Gao, F. (2016). The effect of peer assessment on project performance of students at different learning levels. Assessment & Evaluation in Higher Education, 41 (6), 885–900.

* Li, L., & Steckelberg, A. (2004). Using peer feedback to enhance student meaningful learning . Chicago: Association for Educational Communications and Technology.

Li, H., Xiong, Y., Zang, X., Kornhaber, M. L., Lyu, Y., Chung, K. S., & Suen, K. H. (2016). Peer assessment in the digital age: a meta-analysis comparing peer and teacher ratings. Assessment & Evaluation in Higher Education, 41 (2), 245–264.

* Lin, Y.-C. A. (2009). An examination of teacher feedback, face-to-face peer feedback, and google documents peer feedback in Taiwanese EFL college students’ writing. (Unpublished doctoral dissertation), Alliant International University, San Diego, United States

Lipsey, M. W., & Wilson, D. B. (2001). Practical Meta-analysis . Thousand Oaks: SAGE publications.

* Liu, C.-C., Lu, K.-H., Wu, L. Y., & Tsai, C.-C. (2016). The impact of peer review on creative self-efficacy and learning performance in Web 2.0 learning activities. Journal of Educational Technology & Society, 19 (2), 286–297.

Lundstrom, K., & Baker, W. (2009). To give is better than to receive: The benefits of peer review to the reviewer's own writing. Journal of Second Language Writing, 18 (1), 30–43.

* McCurdy, B. L., & Shapiro, E. S. (1992). A comparison of teacher-, peer-, and self-monitoring with curriculum-based measurement in reading among students with learning disabilities. The Journal of Special Education , 26 (2), 162-180.

Moeyaert, M., Ugille, M., Natasha Beretvas, S., Ferron, J., Bunuan, R., & Van den Noortgate, W. (2017). Methods for dealing with multiple outcomes in meta-analysis: a comparison between averaging effect sizes, robust variance estimation and multilevel meta-analysis. International Journal of Social Research Methodology, 20 (6), 559–572.

* Montanero, M., Lucero, M., & Fernandez, M.-J. (2014). Iterative co-evaluation with a rubric of narrative texts in primary education. Journal for the Study of Education and Development, 37 (1), 184-198.

Morris, S. B. (2008). Estimating effect sizes from pretest-posttest-control group designs. Organizational Research Methods, 11 (2), 364–386.

* Olson, V. L. B. (1990). The revising processes of sixth-grade writers with and without peer feedback. The Journal of Educational Research, 84(1), 22–29.

Ossenberg, C., Henderson, A., & Mitchell, M. (2018). What attributes guide best practice for effective feedback? A scoping review. Advances in Health Sciences Education , 1–19.

* Ozogul, G., Olina, Z., & Sullivan, H. (2008). Teacher, self and peer evaluation of lesson plans written by preservice teachers. Educational Technology Research and Development, 56 (2), 181.

Panadero, E., & Alqassab, M. (2019). An empirical review of anonymity effects in peer assessment, peer feedback, peer review, peer evaluation and peer grading. Assessment & Evaluation in Higher Education , 1–26.

Panadero, E., & Jonsson, A. (2013). The use of scoring rubrics for formative assessment purposes revisited: A review. Educational Research Review, 9 , 129–144.

Panadero, E., Romero, M., & Strijbos, J. W. (2013). The impact of a rubric and friendship on peer assessment: Effects on construct validity, performance, and perceptions of fairness and comfort. Studies in Educational Evaluation, 39 (4), 195–203.

* Papadopoulos, P. M., Lagkas, T. D., & Demetriadis, S. N. (2012). How to improve the peer review method: Free-selection vs assigned-pair protocol evaluated in a computer networking course. Computers & Education, 59 (2), 182–195.

Paulus, T. M. (1999). The effect of peer and teacher feedback on student writing. Journal of second language writing, 8 (3), 265–289.

Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: the science and design of educational assessment . Washington: National Academy Press.

Peters, O., Körndle, H., & Narciss, S. (2018). Effects of a formative assessment script on how vocational students generate formative feedback to a peer’s or their own performance. European Journal of Psychology of Education, 33 (1), 117–143.

* Philippakos, Z. A., & MacArthur, C. A. (2016). The effects of giving feedback on the persuasive writing of fourth-and fifth-grade students. Reading Research Quarterly, 51 (4), 419-433.

* Pierson, H. (1967). Peer and teacher correction: A comparison of the effects of two methods of teaching composition in grade nine English classes. (Unpublished doctoral dissertation), New York University.

* Prater, D., & Bermudez, A. (1993). Using peer response groups with limited English proficient writers. Bilingual Research Journal , 17 (1-2), 99-116.

Reinholz, D. (2016). The assessment cycle: A model for learning through peer assessment. Assessment & Evaluation in Higher Education, 41 (2), 301–315.

* Rijlaarsdam, G., & Schoonen, R. (1988). Effects of a teaching program based on peer evaluation on written composition and some variables related to writing apprehension. (Unpublished doctoral dissertation), Amsterdam University, Amsterdam

Rollinson, P. (2005). Using peer feedback in the ESL writing class. ELT Journal, 59 (1), 23–30.

Rotsaert, T., Panadero, E., & Schellens, T. (2018). Anonymity as an instructional scaffold in peer assessment: its effects on peer feedback quality and evolution in students’ perceptions about peer assessment skills. European Journal of Psychology of Education, 33 (1), 75–99.

* Rudd II, J. A., Wang, V. Z., Cervato, C., & Ridky, R. W. (2009). Calibrated peer review assignments for the Earth Sciences. Journal of Geoscience Education , 57 (5), 328-334.

* Ruegg, R. (2015). The relative effects of peer and teacher feedback on improvement in EFL students' writing ability. Linguistics and Education, 29 , 73-82.

* Sadeghi, K., & Abolfazli Khonbi, Z. (2015). Iranian university students’ experiences of and attitudes towards alternatives in assessment. Assessment & Evaluation in Higher Education, 40 (5), 641–665.

* Sadler, P. M., & Good, E. (2006). The impact of self- and peer-grading on student learning. Educational Assessment , 11 (1), 1-31.

Sanchez, C. E., Atkinson, K. M., Koenka, A. C., Moshontz, H., & Cooper, H. (2017). Self-grading and peer-grading for formative and summative assessments in 3rd through 12th grade classrooms: A meta-analysis. Journal of Educational Psychology, 109 (8), 1049.

Sawilowsky, S. S. (2009). New effect size rules of thumb. Journal of Modern Applied Statistical Methods, 8 (2), 26.

* Schonrock-Adema, J., Heijne-Penninga, M., van Duijn, M. A., Geertsma, J., & Cohen-Schotanus, J. (2007). Assessment of professional behaviour in undergraduate medical education: Peer assessment enhances performance. Medical Education, 41 (9), 836-842.

Schwarzer, G., Carpenter, J. R., & Rücker, G. (2015). Meta-analysis with R . Cham: Springer.


* Sippel, L., & Jackson, C. N. (2015). Teacher vs. peer oral corrective feedback in the German language classroom. Foreign Language Annals , 48 (4), 688-705.

Sluijsmans, D. M., Brand-Gruwel, S., van Merriënboer, J. J., & Martens, R. L. (2004). Training teachers in peer-assessment skills: Effects on performance and perceptions. Innovations in Education and Teaching International, 41 (1), 59–78.

Smith, H., Cooper, A., & Lancaster, L. (2002). Improving the quality of undergraduate peer assessment: A case for student and staff development. Innovations in education and teaching international, 39 (1), 71–81.

Smith, M. K., Wood, W. B., Adams, W. K., Wieman, C., Knight, J. K., Guild, N., & Su, T. T. (2009). Why peer discussion improves student performance on in-class concept questions. Science, 323 (5910), 122–124.

Steel, P. D., & Kammeyer-Mueller, J. D. (2002). Comparing meta-analytic moderator estimation techniques under realistic conditions. Journal of Applied Psychology, 87 (1), 96.

Stipek, D., & Iver, D. M. (1989). Developmental change in children's assessment of intellectual competence. Child Development , 521–538.

Strijbos, J. W., & Wichmann, A. (2018). Promoting learning by leveraging the collaborative nature of formative peer assessment with instructional scaffolds. European Journal of Psychology of Education, 33 (1), 1–9.

Strijbos, J.-W., Narciss, S., & Dünnebier, K. (2010). Peer feedback content and sender's competence level in academic writing revision tasks: Are they critical for feedback perceptions and efficiency? Learning and Instruction, 20 (4), 291–303.

* Sun, D. L., Harris, N., Walther, G., & Baiocchi, M. (2015). Peer assessment enhances student learning: The results of a matched randomized crossover experiment in a college statistics class. PLoS One, 10 (12).

Tannacito, T., & Tuzi, F. (2002). A comparison of e-response: Two experiences, one conclusion. Kairos, 7 (3), 1–14.

R Core Team (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

Topping, K. (1998). Peer assessment between students in colleges and universities. Review of Educational Research, 68 (3), 249-276.

Topping, K. (2009). Peer assessment. Theory Into Practice, 48 (1), 20–27.

Usher, N. (2018). Learning about academic writing through holistic peer assessment. (Unpublished doctoral thesis), University of Oxford, Oxford, UK.

* van den Boom, G., Paas, F., & van Merriënboer, J. J. (2007). Effects of elicited reflections combined with tutor or peer feedback on self-regulated learning and learning outcomes. Learning and Instruction , 17 (5), 532-548.

* van Ginkel, S., Gulikers, J., Biemans, H., & Mulder, M. (2017). The impact of the feedback source on developing oral presentation competence. Studies in Higher Education, 42 (9), 1671-1685.

van Popta, E., Kral, M., Camp, G., Martens, R. L., & Simons, P. R. J. (2017). Exploring the value of peer feedback in online learning for the provider. Educational Research Review, 20 , 24–34.

van Zundert, M., Sluijsmans, D., & van Merriënboer, J. (2010). Effective peer assessment processes: Research findings and future directions. Learning and Instruction, 20 (4), 270–279.

Vanderhoven, E., Raes, A., Montrieux, H., Rotsaert, T., & Schellens, T. (2015). What if pupils can assess their peers anonymously? A quasi-experimental study. Computers & Education, 81 , 123–132.

Wang, J.-H., Hsu, S.-H., Chen, S. Y., Ko, H.-W., Ku, Y.-M., & Chan, T.-W. (2014a). Effects of a mixed-mode peer response on student response behavior and writing performance. Journal of Educational Computing Research, 51 (2), 233–256.

* Wang, J. H., Hsu, S. H., Chen, S. Y., Ko, H. W., Ku, Y. M., & Chan, T. W. (2014b). Effects of a mixed-mode peer response on student response behavior and writing performance. Journal of Educational Computing Research , 51 (2), 233-256.

* Wang, X.-M., Hwang, G.-J., Liang, Z.-Y., & Wang, H.-Y. (2017). Enhancing students’ computer programming performances, critical thinking awareness and attitudes towards programming: An online peer-assessment attempt. Journal of Educational Technology & Society, 20 (4), 58-68.

Wiliam, D. (2010). What counts as evidence of educational achievement? The role of constructs in the pursuit of equity in assessment. Review of Research in Education, 34 (1), 254–284.

Wiliam, D. (2018). How can assessment support learning? A response to Wilson and Shepard, Penuel, and Pellegrino. Educational Measurement: Issues and Practice, 37 (1), 42–44.

Wiliam, D., Lee, C., Harrison, C., & Black, P. (2004). Teachers developing assessment for learning: Impact on student achievement. Assessment in Education: Principles, Policy & Practice, 11 (1), 49–65.

* Wise, W. G. (1992). The effects of revision instruction on eighth graders' persuasive writing (Unpublished doctoral dissertation), University of Maryland, Maryland

* Wong, H. M. H., & Storey, P. (2006). Knowing and doing in the ESL writing class. Language Awareness , 15 (4), 283.

* Xie, Y., Ke, F., & Sharma, P. (2008). The effect of peer feedback for blogging on college students' reflective learning processes. The Internet and Higher Education , 11 (1), 18-25.

Young, J. E., & Jackman, M. G.-A. (2014). Formative assessment in the Grenadian lower secondary school: Teachers’ perceptions, attitudes and practices. Assessment in Education: Principles, Policy & Practice, 21 (4), 398–411.

Yu, F.-Y., & Liu, Y.-H. (2009). Creating a psychologically safe online space for a student-generated questions learning activity via different identity revelation modes. British Journal of Educational Technology, 40 (6), 1109–1123.


Acknowledgements

The authors would like to thank Kristine Gorgen and Jessica Chan for their help coding the studies included in the meta-analysis.

Author information

Authors and Affiliations

Department of Education, University of Oxford, Oxford, England

Kit S. Double, Joshua A. McGrane & Therese N. Hopfenbeck


Corresponding author

Correspondence to Kit S. Double.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary data (XLSX 40 kb)

Effect Size Calculation

Standardised mean differences were calculated as the measure of effect size. The standardised mean difference (d) was calculated using the following formula, which is typically used in meta-analyses (e.g., Lipsey and Wilson 2001).
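The standard expression given by Lipsey and Wilson (2001), presumably the one applied here, is:

$$ d = \frac{\bar{X}_{T} - \bar{X}_{C}}{SD_{\mathrm{pooled}}}, \qquad SD_{\mathrm{pooled}} = \sqrt{\frac{(n_{T}-1)s_{T}^{2} + (n_{C}-1)s_{C}^{2}}{n_{T}+n_{C}-2}} $$

where $\bar{X}_{T}$ and $\bar{X}_{C}$ are the peer assessment and comparison group means, $s_{T}$ and $s_{C}$ their standard deviations, and $n_{T}$ and $n_{C}$ the group sample sizes (the subscripts are our notation, added for illustration).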

As the standardised mean difference (d) is known to have a slight positive bias (Hedges 1981), we applied a small-sample correction to the estimates, resulting in what is often referred to as Hedges' g.
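The usual correction factor (Hedges 1981), presumably the one applied here, is:

$$ g = d \left(1 - \frac{3}{4(n_{T}+n_{C}) - 9}\right) $$

where $n_{T}+n_{C}$ is the total sample size for the comparison.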

For studies where there was insufficient information to calculate Hedges' g using the above method, we used the online effect size calculator developed by Lipsey and Wilson (2001), available at http://www.campbellcollaboration.org/escalc. For pre-post design studies where adjusted means were not provided, we used the critical value relevant to the difference between the peer feedback and control groups from the reported analysis adjusted for pre-intervention scores (e.g., analysis of covariance), as suggested by Higgins and Green (2011). For pre-post design studies where both pre- and post-intervention means and standard deviations were provided, we used an effect size estimate based on the mean pre-post change in the peer feedback group minus the mean pre-post change in the control group, divided by the pooled pre-intervention standard deviation, as this approach minimises bias and improves the precision of the estimate (Morris 2008).
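In symbols, a minimal sketch of the Morris (2008) pre-post-control effect size described above (our notation, for illustration only) is:

$$ d_{ppc} = \frac{(\bar{X}_{\mathrm{post},T} - \bar{X}_{\mathrm{pre},T}) - (\bar{X}_{\mathrm{post},C} - \bar{X}_{\mathrm{pre},C})}{SD_{\mathrm{pre,pooled}}} $$

where T denotes the peer feedback group, C the control group, and $SD_{\mathrm{pre,pooled}}$ the pooled pre-intervention standard deviation; the same type of small-sample correction as above is then applied.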

Variance estimates for each effect size were calculated using the following formula:
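The standard large-sample variance of the bias-corrected standardised mean difference (Hedges 1981; Lipsey and Wilson 2001), presumably the expression used here, is:

$$ v_{g} = \frac{n_{T}+n_{C}}{n_{T}\,n_{C}} + \frac{g^{2}}{2(n_{T}+n_{C})} $$

where $n_{T}$ and $n_{C}$ are the group sample sizes and g is the bias-corrected effect size.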

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Double, K.S., McGrane, J.A. & Hopfenbeck, T.N. The Impact of Peer Assessment on Academic Performance: A Meta-analysis of Control Group Studies. Educ Psychol Rev 32 , 481–509 (2020). https://doi.org/10.1007/s10648-019-09510-3


Published: 10 December 2019

Issue Date: June 2020

DOI: https://doi.org/10.1007/s10648-019-09510-3


Keywords

  • Peer assessment
  • Meta-analysis
  • Experimental design
  • Effect size
  • Formative assessment

People also looked at

Systematic Review Article: A Critical Review of Research on Student Self-Assessment


  • Educational Psychology and Methodology, University at Albany, Albany, NY, United States

This article is a review of research on student self-assessment conducted largely between 2013 and 2018. The purpose of the review is to provide an updated overview of theory and research. The treatment of theory involves articulating a refined definition and operationalization of self-assessment. The review of 76 empirical studies offers a critical perspective on what has been investigated, including the relationship between self-assessment and achievement, consistency of self-assessment and others' assessments, student perceptions of self-assessment, and the association between self-assessment and self-regulated learning. An argument is made for less research on consistency and summative self-assessment, and more on the cognitive and affective mechanisms of formative self-assessment.

This review of research on student self-assessment expands on a review published as a chapter in the Cambridge Handbook of Instructional Feedback ( Andrade, 2018 , reprinted with permission). The timespan for the original review was January 2013 to October 2016. A lot of research has been done on the subject since then, including at least two meta-analyses; hence this expanded review, in which I provide an updated overview of theory and research. The treatment of theory presented here involves articulating a refined definition and operationalization of self-assessment through a lens of feedback. My review of the growing body of empirical research offers a critical perspective, in the interest of provoking new investigations into neglected areas.

Defining and Operationalizing Student Self-Assessment

Without exception, reviews of self-assessment ( Sargeant, 2008 ; Brown and Harris, 2013 ; Panadero et al., 2016a ) call for clearer definitions: What is self-assessment, and what is not? This question is surprisingly difficult to answer, as the term self-assessment has been used to describe a diverse range of activities, such as assigning a happy or sad face to a story just told, estimating the number of correct answers on a math test, graphing scores for dart throwing, indicating understanding (or the lack thereof) of a science concept, using a rubric to identify strengths and weaknesses in one's persuasive essay, writing reflective journal entries, and so on. Each of those activities involves some kind of assessment of one's own functioning, but they are so different that distinctions among types of self-assessment are needed. I will draw those distinctions in terms of the purposes of self-assessment which, in turn, determine its features: a classic form-fits-function analysis.

What is Self-Assessment?

Brown and Harris (2013) defined self-assessment in the K-16 context as a “descriptive and evaluative act carried out by the student concerning his or her own work and academic abilities” (p. 368). Panadero et al. (2016a) defined it as a “wide variety of mechanisms and techniques through which students describe (i.e., assess) and possibly assign merit or worth to (i.e., evaluate) the qualities of their own learning processes and products” (p. 804). Referring to physicians, Epstein et al. (2008) defined “concurrent self-assessment” as “ongoing moment-to-moment self-monitoring” (p. 5). Self-monitoring “refers to the ability to notice our own actions, curiosity to examine the effects of those actions, and willingness to use those observations to improve behavior and thinking in the future” (p. 5). Taken together, these definitions include self-assessment of one's abilities, processes , and products —everything but the kitchen sink. This very broad conception might seem unwieldy, but it works because each object of assessment—competence, process, and product—is subject to the influence of feedback from oneself.

What is missing from each of these definitions, however, is the purpose of the act of self-assessment. Their authors might rightly point out that the purpose is implied, but a formal definition requires us to make it plain: Why do we ask students to self-assess? I have long held that self-assessment is feedback ( Andrade, 2010 ), and that the purpose of feedback is to inform adjustments to processes and products that deepen learning and enhance performance; hence the purpose of self-assessment is to generate feedback that promotes learning and improvements in performance. This learning-oriented purpose of self-assessment implies that it should be formative: if there is no opportunity for adjustment and correction, self-assessment is almost pointless.

Why Self-Assess?

Clarity about the purpose of self-assessment allows us to interpret what otherwise appear to be discordant findings from research, which has produced mixed results in terms of both the accuracy of students' self-assessments and their influence on learning and/or performance. I believe the source of the discord can be traced to the different ways in which self-assessment is carried out, such as whether it is summative and formative. This issue will be taken up again in the review of current research that follows this overview. For now, consider a study of the accuracy and validity of summative self-assessment in teacher education conducted by Tejeiro et al. (2012) , which showed that students' self-assigned marks tended to be higher than marks given by professors. All 122 students in the study assigned themselves a grade at the end of their course, but half of the students were told that their self-assigned grade would count toward 5% of their final grade. In both groups, students' self-assessments were higher than grades given by professors, especially for students with “poorer results” (p. 791) and those for whom self-assessment counted toward the final grade. In the group that was told their self-assessments would count toward their final grade, no relationship was found between the professor's and the students' assessments. Tejeiro et al. concluded that, although students' and professor's assessments tend to be highly similar when self-assessment did not count toward final grades, overestimations increased dramatically when students' self-assessments did count. Interviews of students who self-assigned highly discrepant grades revealed (as you might guess) that they were motivated by the desire to obtain the highest possible grades.

Studies like Tejeiro et al.'s (2012) are interesting in terms of the information they provide about the relationship between consistency and honesty, but the purpose of the self-assessment, beyond addressing interesting research questions, is unclear. There is no feedback purpose. This is also true for another example of a study of summative self-assessment of competence, during which elementary-school children took the Test of Narrative Language and then were asked to self-evaluate “how you did in making up stories today” by pointing to one of five pictures, from a “very happy face” (rating of five) to a “very sad face” (rating of one) ( Kaderavek et al., 2004 , p. 37). The usual results were reported: Older children and good narrators were more accurate than younger children and poor narrators, and males tended to overestimate their ability more frequently.

Typical of clinical studies of accuracy in self-evaluation, this study rests on a definition and operationalization of self-assessment with no value in terms of instructional feedback. If those children were asked to rate their stories and then revise or, better yet, if they assessed their stories according to clear, developmentally appropriate criteria before revising, the valence of their self-assessments in terms of instructional feedback would skyrocket. I speculate that their accuracy would too. In contrast, studies of formative self-assessment suggest that when the act of self-assessing is given a learning-oriented purpose, students' self-assessments are relatively consistent with those of external evaluators, including professors ( Lopez and Kossack, 2007 ; Barney et al., 2012 ; Leach, 2012 ), teachers ( Bol et al., 2012 ; Chang et al., 2012 , 2013 ), researchers ( Panadero and Romero, 2014 ; Fitzpatrick and Schulz, 2016 ), and expert medical assessors ( Hawkins et al., 2012 ).

My commitment to keeping self-assessment formative is firm. However, Gavin Brown (personal communication, April 2011) reminded me that summative self-assessment exists and we cannot ignore it; any definition of self-assessment must acknowledge and distinguish between formative and summative forms of it. Thus, the taxonomy in Table 1 , which depicts self-assessment as serving formative and/or summative purposes, and focuses on competence, processes, and/or products.


Table 1. A taxonomy of self-assessment.

Fortunately, a formative view of self-assessment seems to be taking hold in various educational contexts. For instance, Sargeant (2008) noted that all seven authors in a special issue of the Journal of Continuing Education in the Health Professions “conceptualize self-assessment within a formative, educational perspective, and see it as an activity that draws upon both external and internal data, standards, and resources to inform and make decisions about one's performance” (p. 1). Sargeant also stresses the point that self-assessment should be guided by evaluative criteria: “Multiple external sources can and should inform self-assessment, perhaps most important among them performance standards” (p. 1). Now we are talking about the how of self-assessment, which demands an operationalization of self-assessment practice. Let us examine each object of self-assessment (competence, processes, and/or products) with an eye for what is assessed and why.

What is Self-Assessed?

Monitoring and self-assessing processes are practically synonymous with self-regulated learning (SRL), or at least central components of it such as goal-setting and monitoring, or metacognition. Research on SRL has clearly shown that self-generated feedback on one's approach to learning is associated with academic gains ( Zimmerman and Schunk, 2011 ). Self-assessment of the products , such as papers and presentations, are the easiest to defend as feedback, especially when those self-assessments are grounded in explicit, relevant, evaluative criteria and followed by opportunities to relearn and/or revise ( Andrade, 2010 ).

Including the self-assessment of competence in this definition is a little trickier. I hesitated to include it because of the risk of sneaking in global assessments of one's overall ability, self-esteem, and self-concept (“I'm good enough, I'm smart enough, and doggone it, people like me,” Franken, 1992 ), which do not seem relevant to a discussion of feedback in the context of learning. Research on global self-assessment, or self-perception, is popular in the medical education literature, but even there, scholars have begun to question its usefulness in terms of influencing learning and professional growth (e.g., see Sargeant et al., 2008 ). Eva and Regehr (2008) seem to agree in the following passage, which states the case in a way that makes it worthy of a long quotation:

Self-assessment is often (implicitly or otherwise) conceptualized as a personal, unguided reflection on performance for the purposes of generating an individually derived summary of one's own level of knowledge, skill, and understanding in a particular area. For example, this conceptualization would appear to be the only reasonable basis for studies that fit into what Colliver et al. (2005) has described as the “guess your grade” model of self-assessment research, the results of which form the core foundation for the recurring conclusion that self-assessment is generally poor. This unguided, internally generated construction of self-assessment stands in stark contrast to the model put forward by Boud (1999) , who argued that the phrase self-assessment should not imply an isolated or individualistic activity; it should commonly involve peers, teachers, and other sources of information. The conceptualization of self-assessment as enunciated in Boud's description would appear to involve a process by which one takes personal responsibility for looking outward, explicitly seeking feedback, and information from external sources, then using these externally generated sources of assessment data to direct performance improvements. In this construction, self-assessment is more of a pedagogical strategy than an ability to judge for oneself; it is a habit that one needs to acquire and enact rather than an ability that one needs to master (p. 15).

As in the K-16 context, self-assessment is coming to be seen as having value as much or more so in terms of pedagogy as in assessment ( Silver et al., 2008 ; Brown and Harris, 2014 ). In the end, however, I decided that self-assessing one's competence to successfully learn a particular concept or complete a particular task (which sounds a lot like self-efficacy—more on that later) might be useful feedback because it can inform decisions about how to proceed, such as the amount of time to invest in learning how to play the flute, or whether or not to seek help learning the steps of the jitterbug. An important caveat, however, is that self-assessments of competence are only useful if students have opportunities to do something about their perceived low competence—that is, it serves the purpose of formative feedback for the learner.

How to Self-Assess?

Panadero et al. ( 2016a ) summarized five very different taxonomies of self-assessment and called for the development of a comprehensive typology that considers, among other things, its purpose, the presence or absence of criteria, and the method. In response, I propose the taxonomy depicted in Table 1, which focuses on the what (competence, process, or product), the why (formative or summative), and the how (methods, including whether or not they include standards, e.g., criteria) of self-assessment. The collection of example methods in the table is not exhaustive.

I put the methods in Table 1 where I think they belong, but many of them could be placed in more than one cell. Take self-efficacy , for instance, which is essentially a self-assessment of one's competence to successfully undertake a particular task ( Bandura, 1997 ). Summative judgments of self-efficacy are certainly possible but they seem like a silly thing to do—what is the point, from a learning perspective? Formative self-efficacy judgments, on the other hand, can inform next steps in learning and skill building. There is reason to believe that monitoring and making adjustments to one's self-efficacy (e.g., by setting goals or attributing success to effort) can be productive ( Zimmerman, 2000 ), so I placed self-efficacy in the formative row.

It is important to emphasize that self-efficacy is task-specific, more or less ( Bandura, 1997 ). This taxonomy does not include general, holistic evaluations of one's abilities, for example, “I am good at math.” Global assessment of competence does not provide the leverage, in terms of feedback, that is provided by task-specific assessments of competence, that is, self-efficacy. Eva and Regehr (2008) provided an illustrative example: “We suspect most people are prompted to open a dictionary as a result of encountering a word for which they are uncertain of the meaning rather than out of a broader assessment that their vocabulary could be improved” (p. 16). The exclusion of global evaluations of oneself resonates with research that clearly shows that feedback that focuses on aspects of a task (e.g., “I did not solve most of the algebra problems”) is more effective than feedback that focuses on the self (e.g., “I am bad at math”) ( Kluger and DeNisi, 1996 ; Dweck, 2006 ; Hattie and Timperley, 2007 ). Hence, global self-evaluations of ability or competence do not appear in Table 1 .

Another approach to student self-assessment that could be placed in more than one cell is traffic lights . The term traffic lights refers to asking students to use green, yellow, or red objects (or thumbs up, sideways, or down—anything will do) to indicate whether they think they have good, partial, or little understanding ( Black et al., 2003 ). It would be appropriate for traffic lights to appear in multiple places in Table 1, depending on how they are used. Traffic lights seem to be most effective at supporting students' reflections on how well they understand a concept or have mastered a skill, which is in line with their creators' original intent, so they are categorized as formative self-assessments of one's learning—which sounds like metacognition.

In fact, several of the methods included in Table 1 come from research on metacognition, including self-monitoring , such as checking one's reading comprehension, and self-testing , e.g., checking one's performance on test items. These last two methods have been excluded from some taxonomies of self-assessment (e.g., Boud and Brew, 1995 ) because they do not engage students in explicitly considering relevant standards or criteria. However, new conceptions of self-assessment are grounded in theories of the self- and co-regulation of learning ( Andrade and Brookhart, 2016 ), which includes self-monitoring of learning processes with and without explicit standards.

However, my research favors self-assessment with regard to standards ( Andrade and Boulay, 2003 ; Andrade and Du, 2007 ; Andrade et al., 2008 , 2009 , 2010 ), as does related research by Panadero and his colleagues (see below). I have involved students in self-assessment of stories, essays, or mathematical word problems according to rubrics or checklists with criteria. For example, two studies investigated the relationship between elementary or middle school students' scores on a written assignment and a process that involved them in reading a model paper, co-creating criteria, self-assessing first drafts with a rubric, and revising ( Andrade et al., 2008 , 2010 ). The self-assessment was highly scaffolded: students were asked to underline key phrases in the rubric with colored pencils (e.g., underline “clearly states an opinion” in blue), then underline or circle in their drafts the evidence of having met the standard articulated by the phrase (e.g., his or her opinion) with the same blue pencil. If students found they had not met the standard, they were asked to write themselves a reminder to make improvements when they wrote their final drafts. This process was followed for each criterion on the rubric. There were main effects on scores for every self-assessed criterion on the rubric, suggesting that guided self-assessment according to the co-created criteria helped students produce more effective writing.

Panadero and his colleagues have also done quasi-experimental and experimental research on standards-referenced self-assessment, using rubrics or lists of assessment criteria that are presented in the form of questions ( Panadero et al., 2012 , 2013 , 2014 ; Panadero and Romero, 2014 ). Panadero calls the list of assessment criteria a script because his work is grounded in research on scaffolding (e.g., Kollar et al., 2006 ): I call it a checklist because that is the term used in classroom assessment contexts. Either way, the list provides standards for the task. Here is a script for a written summary that Panadero et al. (2014) used with college students in a psychology class:

• Does my summary transmit the main idea from the text? Is it at the beginning of my summary?

• Are the important ideas also in my summary?

• Have I selected the main ideas from the text to make them explicit in my summary?

• Have I thought about my purpose for the summary? What is my goal?

Taken together, the results of the studies cited above suggest that students who engaged in self-assessment using scripts or rubrics were more self-regulated, as measured by self-report questionnaires and/or think-aloud protocols, than were students in the comparison or control groups. Effect sizes were very small to moderate (η² = 0.06–0.42) and statistically significant. Most interesting, perhaps, is one study ( Panadero and Romero, 2014 ) that demonstrated an association between rubric-referenced self-assessment activities and all three phases of SRL: forethought, performance, and reflection.

There are surely many other methods of self-assessment to include in Table 1 , as well as interesting conversations to be had about which method goes where and why. In the meantime, I offer the taxonomy in Table 1 as a way to define and operationalize self-assessment in instructional contexts and as a framework for the following overview of current research on the subject.

An Overview of Current Research on Self-Assessment

Several recent reviews of self-assessment are available ( Brown and Harris, 2013 ; Brown et al., 2015 ; Panadero et al., 2017 ), so I will not summarize the entire body of research here. Instead, I chose to take a bird's-eye view of the field, with the goal of reporting on what has been sufficiently researched and what remains to be done. I used the reference lists from those reviews, as well as other relevant sources, as a starting point. In order to update the list of sources, I directed two new searches, the first of the ERIC database and the second of both ERIC and PsycINFO. Both searches included two search terms, “self-assessment” OR “self-evaluation.” Advanced search options had four delimiters: (1) peer-reviewed, (2) January 2013–October 2016 and then October 2016–March 2019, (3) English, and (4) full-text. Because the focus was on K-20 educational contexts, sources were excluded if they were about early childhood education or professional development.

The first search yielded 347 hits; the second 1,163. Research that was unrelated to instructional feedback was excluded, such as studies limited to self-estimates of performance before or after taking a test, guesses about whether a test item was answered correctly, and estimates of how many tasks could be completed in a certain amount of time. Although some of the excluded studies might be thought of as useful investigations of self-monitoring, as a group they seemed too unrelated to theories of self-generated feedback to be appropriate for this review. Seventy-six studies were selected for inclusion in Table S1 (Supplementary Material), which also contains a few studies published before 2013 that were not included in key reviews, as well as studies solicited directly from authors.

Table S1 in the Supplementary Material contains a complete list of studies included in this review, organized by the focus or topic of the study, as well as brief descriptions of each. The “type” column of Table S1 indicates whether the study focused on formative or summative self-assessment. This distinction was often difficult to make due to a lack of information. For example, Memis and Seven (2015) frame their study in terms of formative assessment, and note that the purpose of the self-evaluation done by the sixth-grade students is to “help students improve their [science] reports” (p. 39), but they do not indicate how the self-assessments were done, nor whether students were given time to revise their reports based on their judgments or supported in making revisions. A sentence or two of explanation about the process of self-assessment in the procedures sections of published studies would be most useful.

Figure 1 graphically represents the number of studies in the four most common topic categories found in the table—achievement, consistency, student perceptions, and SRL. The figure reveals that research on self-assessment is on the rise, with consistency the most popular topic. Of the 76 studies in the table in the appendix, 44 were inquiries into the consistency of students' self-assessments with other judgments (e.g., a test score or teacher's grade). Twenty-five studies investigated the relationship between self-assessment and achievement. Fifteen explored students' perceptions of self-assessment. Twelve studies focused on the association between self-assessment and self-regulated learning. One examined self-efficacy, and two qualitative studies documented the mental processes involved in self-assessment. The sum ( n = 99) of the list of research topics is more than 76 because several studies had multiple foci. In the remainder of this review I examine each topic in turn.


Figure 1. Topics of self-assessment studies, 2013–2018.

Consistency

Table S1 (Supplementary Material) reveals that much of the recent research on self-assessment has investigated the accuracy or, more accurately, consistency, of students' self-assessments. The term consistency is more appropriate in the classroom context because the quality of students' self-assessments is often determined by comparing them with their teachers' assessments and then generating correlations. Given the evidence of the unreliability of teachers' grades ( Falchikov, 2005 ), the assumption that teachers' assessments are accurate might not be well-founded ( Leach, 2012 ; Brown et al., 2015 ). Ratings of student work done by researchers are also suspect, unless evidence of the validity and reliability of the inferences made about student work by researchers is available. Consequently, much of the research on classroom-based self-assessment should use the term consistency , which refers to the degree of alignment between students' and expert raters' evaluations, avoiding the purer, more rigorous term accuracy unless it is fitting.

In their review, Brown and Harris (2013) reported that correlations between student self-ratings and other measures tended to be weakly to strongly positive, ranging from r ≈ 0.20 to 0.80, with few studies reporting correlations >0.60. But their review included results from studies of any self-appraisal of school work, including summative self-rating/grading, predictions about the correctness of answers on test items, and formative, criteria-based self-assessments, a combination of methods that makes the correlations they reported difficult to interpret. Qualitatively different forms of self-assessment, especially summative and formative types, cannot be lumped together without obfuscating important aspects of self-assessment as feedback.

Given my concern about combining studies of summative and formative assessment, you might anticipate a call for research on consistency that distinguishes between the two. I will make no such call for three reasons. One is that we have enough research on the subject, including the 22 studies in Table S1 (Supplementary Material) that were published after Brown and Harris's review (2013 ). Drawing only on studies included in Table S1 (Supplementary Material), we can say with confidence that summative self-assessment tends to be inconsistent with external judgements ( Baxter and Norman, 2011 ; De Grez et al., 2012 ; Admiraal et al., 2015 ), with males tending to overrate and females to underrate ( Nowell and Alston, 2007 ; Marks et al., 2018 ). There are exceptions ( Alaoutinen, 2012 ; Lopez-Pastor et al., 2012 ) as well as mixed results, with students being consistent regarding some aspects of their learning but not others ( Blanch-Hartigan, 2011 ; Harding and Hbaci, 2015 ; Nguyen and Foster, 2018 ). We can also say that older, more academically competent learners tend to be more consistent ( Hacker et al., 2000 ; Lew et al., 2010 ; Alaoutinen, 2012 ; Guillory and Blankson, 2017 ; Butler, 2018 ; Nagel and Lindsey, 2018 ). There is evidence that consistency can be improved through experience ( Lopez and Kossack, 2007 ; Yilmaz, 2017 ; Nagel and Lindsey, 2018 ), the use of guidelines ( Bol et al., 2012 ), feedback ( Thawabieh, 2017 ), and standards ( Baars et al., 2014 ), perhaps in the form of rubrics ( Panadero and Romero, 2014 ). Modeling and feedback also help ( Labuhn et al., 2010 ; Miller and Geraci, 2011 ; Hawkins et al., 2012 ; Kostons et al., 2012 ).

An outcome typical of research on the consistency of summative self-assessment can be found in row 59, which summarizes the study by Tejeiro et al. (2012) discussed earlier: Students' self-assessments were higher than marks given by professors, especially for students with poorer results, and no relationship was found between the professors' and the students' assessments in the group in which self-assessment counted toward the final mark. Students are not stupid: if they know that they can influence their final grade, and that their judgment is summative rather than intended to inform revision and improvement, they will be motivated to inflate their self-evaluation. I do not believe we need more research to demonstrate that phenomenon.

The second reason I am not calling for additional research on consistency is a lot of it seems somewhat irrelevant. This might be because the interest in accuracy is rooted in clinical research on calibration, which has very different aims. Calibration accuracy is the “magnitude of consent between learners' true and self-evaluated task performance. Accurately calibrated learners' task performance equals their self-evaluated task performance” ( Wollenschläger et al., 2016 ). Calibration research often asks study participants to predict or postdict the correctness of their responses to test items. I caution about generalizing from clinical experiments to authentic classroom contexts because the dismal picture of our human potential to self-judge was painted by calibration researchers before study participants were effectively taught how to predict with accuracy, or provided with the tools they needed to be accurate, or motivated to do so. Calibration researchers know that, of course, and have conducted intervention studies that attempt to improve accuracy, with some success (e.g., Bol et al., 2012 ). Studies of formative self-assessment also suggest that consistency increases when it is taught and supported in many of the ways any other skill must be taught and supported ( Lopez and Kossack, 2007 ; Labuhn et al., 2010 ; Chang et al., 2012 , 2013 ; Hawkins et al., 2012 ; Panadero and Romero, 2014 ; Lin-Siegler et al., 2015 ; Fitzpatrick and Schulz, 2016 ).

Even clinical psychological studies that go beyond calibration to examine the associations between monitoring accuracy and subsequent study behaviors do not transfer well to classroom assessment research. After repeatedly encountering claims that, for example, low self-assessment accuracy leads to poor task-selection accuracy and “suboptimal learning outcomes” ( Raaijmakers et al., 2019 , p. 1), I dug into the cited studies and discovered two limitations. The first is that the tasks in which study participants engage are quite inauthentic. A typical task involves studying “word pairs (e.g., railroad—mother), followed by a delayed judgment of learning (JOL) in which the students predicted the chances of remembering the pair… After making a JOL, the entire pair was presented for restudy for 4 s [ sic ], and after all pairs had been restudied, a criterion test of paired-associate recall occurred” ( Dunlosky and Rawson, 2012 , p. 272). Although memory for word pairs might be important in some classroom contexts, it is not safe to assume that results from studies like that one can predict students' behaviors after criterion-referenced self-assessment of their comprehension of complex texts, lengthy compositions, or solutions to multi-step mathematical problems.

The second limitation of studies like the typical one described above is more serious: Participants in research like that are not permitted to regulate their own studying, which is experimentally manipulated by a computer program. This came as a surprise, since many of the claims were about students' poor study choices but they were rarely allowed to make actual choices. For example, Dunlosky and Rawson (2012) permitted participants to “use monitoring to effectively control learning” by programming the computer so that “a participant would need to have judged his or her recall of a definition entirely correct on three different trials, and once they judged it entirely correct on the third trial, that particular key term definition was dropped [by the computer program] from further practice” (p. 272). The authors note that this study design is an improvement on designs that did not require all participants to use the same regulation algorithm, but it does not reflect the kinds of decisions that learners make in class or while doing homework. In fact, a large body of research shows that students can make wise choices when they self-pace the study of to-be-learned materials and then allocate study time to each item ( Bjork et al., 2013 , p. 425):

In a typical experiment, the students first study all the items at an experimenter-paced rate (e.g., study 60 paired associates for 3 s each), which familiarizes the students with the items; after this familiarity phase, the students then either choose which items they want to restudy (e.g., all items are presented in an array, and the students select which ones to restudy) and/or pace their restudy of each item. Several dependent measures have been widely used, such as how long each item is studied, whether an item is selected for restudy, and in what order items are selected for restudy. The literature on these aspects of self-regulated study is massive (for a comprehensive overview, see both Dunlosky and Ariel, 2011 and Son and Metcalfe, 2000 ), but the evidence is largely consistent with a few basic conclusions. First, if students have a chance to practice retrieval prior to restudying items, they almost exclusively choose to restudy unrecalled items and drop the previously recalled items from restudy ( Metcalfe and Kornell, 2005 ). Second, when pacing their study of individual items that have been selected for restudy, students typically spend more time studying items that are more, rather than less, difficult to learn. Such a strategy is consistent with a discrepancy-reduction model of self-paced study (which states that people continue to study an item until they reach mastery), although some key revisions to this model are needed to account for all the data. For instance, students may not continue to study until they reach some static criterion of mastery, but instead, they may continue to study until they perceive that they are no longer making progress.

I propose that this research, which suggests that students' unscaffolded, unmeasured, informal self-assessments tend to lead to appropriate task selection, is better aligned with research on classroom-based self-assessment. Nonetheless, even this comparison is inadequate because the study participants were not taught to compare their performance to the criteria for mastery, as is often done in classroom-based self-assessment.
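
To make the discrepancy-reduction account quoted above concrete, here is a minimal, hypothetical simulation. The function name, the learning-rate parameter, and the mastery criterion are my own illustrative choices rather than an implementation from any study cited here; the sketch simply shows that if study continues until perceived mastery reaches a criterion, items that are harder to learn (lower learning rates) accumulate more study trials.

```python
def study_time_under_discrepancy_reduction(learning_rate, criterion=0.9, max_trials=50):
    """Trials spent on one item if study continues until perceived mastery >= criterion.

    learning_rate is a hypothetical per-trial gain in perceived mastery (0-1);
    lower values stand in for items that are more difficult to learn.
    """
    perceived_mastery = 0.0
    trials = 0
    while perceived_mastery < criterion and trials < max_trials:
        perceived_mastery += learning_rate * (1 - perceived_mastery)  # diminishing gains
        trials += 1
    return trials

# Harder items (smaller learning rates) accumulate more study trials:
# the three calls below return 4, 7, and 22 trials, respectively.
for rate in (0.5, 0.3, 0.1):
    print(f"learning rate {rate}: {study_time_under_discrepancy_reduction(rate)} trials")
```

The revision mentioned at the end of the quoted passage (stopping when progress stalls rather than at a fixed criterion) could be modeled by replacing the criterion with a minimum per-trial gain, but the simple version above captures the model's core logic.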

The third and final reason I do not believe we need additional research on consistency is that I think it is a distraction from the true purposes of self-assessment. Many if not most of the articles about the accuracy of self-assessment are grounded in the assumption that accuracy is necessary for self-assessment to be useful, particularly in terms of subsequent studying and revision behaviors. Although it seems obvious that accurate self-evaluations of performance should positively influence students' selection of study strategies, and thereby improve achievement, I have not seen relevant research that tests those conjectures. Some claim that inaccurate estimates of learning lead to the selection of inappropriate learning tasks ( Kostons et al., 2012 ), but they cite research that does not support their claim. For example, Kostons et al. cite studies that focus on the effectiveness of SRL interventions but do not address the accuracy of participants' estimates of learning, nor the relationship of those estimates to the selection of next steps. Other studies produce findings that support my skepticism. Take, for instance, two relevant studies of calibration. One suggested that performance and judgments of performance had little influence on subsequent test preparation behavior ( Hacker et al., 2000 ), and the other showed that study participants followed their predictions of performance to the same degree, regardless of monitoring accuracy ( van Loon et al., 2014 ).

Eva and Regehr (2008) believe that:

Research questions that take the form of “How well do various practitioners self-assess?” “How can we improve self-assessment?” or “How can we measure self-assessment skill?” should be considered defunct and removed from the research agenda [because] there have been hundreds of studies into these questions and the answers are “Poorly,” “You can't,” and “Don't bother” (p. 18).

I almost agree. A study that could change my mind about the importance of accuracy in self-assessment would be an investigation that goes beyond attempting to improve accuracy for its own sake and instead examines the relearning and revision behaviors of accurate and inaccurate self-assessors: Do students whose self-assessments match the valid and reliable judgments of expert raters (hence my use of the term accuracy) make better decisions about what they need to do to deepen their learning and improve their work? Here, I admit, is a call for research related to consistency: I would love to see a high-quality investigation of the relationship between accuracy in formative self-assessment, students' subsequent study and revision behaviors, and their learning. For example, a study that closely examines the revisions to writing made by accurate and inaccurate self-assessors, and the resulting quality of their writing, would be most welcome.

Table S1 (Supplementary Material) indicates that by 2018 researchers had begun publishing studies that more directly address the hypothesized link between self-assessment and subsequent learning behaviors, as well as important questions about the processes learners engage in while self-assessing ( Yan and Brown, 2017 ). One, a study by Nugteren et al. (2018, row 19 in Table S1 (Supplementary Material)), asked “How do inaccurate [summative] self-assessments influence task selections?” (p. 368) and employed a clever exploratory research design. The results suggested that most of the 15 students in their sample overestimated their performance and made inaccurate learning-task selections. Nugteren et al. recommended helping students make more accurate self-assessments, but I think the more interesting finding is why students made task selections that were too difficult or too easy, given their prior performance: they based most task selections on interest in the content of particular items (not the overarching content to be learned), and infrequently considered task difficulty and support level. For instance, while working on the genetics tasks, students reported selecting tasks because they were fun or interesting, not because they addressed self-identified weaknesses in their understanding of genetics. Nugteren et al. proposed that students would benefit from instruction on task selection. I second that proposal: rather than directing our efforts at accuracy in the service of improving subsequent task selection, let us simply teach students to use the information at hand to select the next best steps, among other things.

Butler (2018 , row 76 in Table S1 (Supplementary Material)) has conducted at least two studies of learners' processes of responding to self-assessment items and how they arrived at their judgments. Comparing generic, decontextualized items to task-specific, contextualized items (which she calls after-task items ), she drew two unsurprising conclusions: the task-specific items “generally showed higher correlations with task performance,” and older students “appeared to be more conservative in their judgment compared with their younger counterparts” (p. 249). The contribution of the study is the detailed information it provides about how students generated their judgments. For example, Butler's qualitative data analyses revealed that when asked to self-assess in terms of vague or non-specific items, the children often “contextualized the descriptions based on their own experiences, goals, and expectations,” (p. 257) focused on the task at hand, and situated items in the specific task context. Perhaps as a result, the correlation between after-task self-assessment and task performance was generally higher than for generic self-assessment.

Butler (2018) notes that her study enriches our empirical understanding of the processes by which children respond to self-assessment. This is a very promising direction for the field. Similar studies of processing during formative self-assessment of a variety of task types in a classroom context would likely produce significant advances in our understanding of how and why self-assessment influences learning and performance.

Student Perceptions

Fifteen of the studies listed in Table S1 (Supplementary Material) focused on students' perceptions of self-assessment. The studies of children suggest that they tend to have unsophisticated understandings of its purposes ( Harris and Brown, 2013 ; Bourke, 2016 ) that might lead to shallow implementation of related processes. In contrast, results from the studies conducted in higher education settings suggested that college and university students understood the function of self-assessment ( Ratminingsih et al., 2018 ) and generally found it to be useful for guiding evaluation and revision ( Micán and Medina, 2017 ), understanding how to take responsibility for learning ( Lopez and Kossack, 2007 ; Bourke, 2014 ; Ndoye, 2017 ), prompting them to think more critically and deeply ( van Helvoort, 2012 ; Siow, 2015 ), applying newfound skills ( Murakami et al., 2012 ), and fostering self-regulated learning by guiding them to set goals, plan, self-monitor and reflect ( Wang, 2017 ).

Not surprisingly, positive perceptions of self-assessment were typically developed by students who actively engaged in the formative type by, for example, developing their own criteria for an effective self-assessment response ( Bourke, 2014 ), or using a rubric or checklist to guide their assessments and then revising their work ( Huang and Gui, 2015 ; Wang, 2017 ). Earlier research suggested that children's attitudes toward self-assessment can become negative if it is summative ( Ross et al., 1998 ). However, even summative self-assessment was reported by adult learners to be useful in helping them become more critical of their own and others' writing throughout the course and in subsequent courses ( van Helvoort, 2012 ).

Achievement

Twenty-five of the studies in Table S1 (Supplementary Material) investigated the relation between self-assessment and achievement, including two meta-analyses. Twenty of the 25 clearly employed the formative type. Without exception, those 20 studies, plus the two meta-analyses ( Graham et al., 2015 ; Sanchez et al., 2017 ) demonstrated a positive association between self-assessment and learning. The meta-analysis conducted by Graham and his colleagues, which included 10 studies, yielded an average weighted effect size of 0.62 on writing quality. The Sanchez et al. meta-analysis revealed that, although 12 of the 44 effect sizes were negative, on average, “students who engaged in self-grading performed better ( g = 0.34) on subsequent tests than did students who did not” (p. 1,049).
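
For readers unfamiliar with how such meta-analytic averages are produced, the sketch below shows a generic fixed-effect, inverse-variance-weighted mean of study-level effect sizes (Hedges' g). The effect sizes and variances are hypothetical and the computation is deliberately simplified; it does not reproduce the weighting models actually used by Graham et al. (2015) or Sanchez et al. (2017).

```python
def weighted_mean_effect_size(effect_sizes, variances):
    """Fixed-effect, inverse-variance-weighted mean of effect sizes (e.g., Hedges' g)."""
    weights = [1.0 / v for v in variances]  # more precise studies receive more weight
    weighted_sum = sum(w * g for w, g in zip(weights, effect_sizes))
    return weighted_sum / sum(weights)

# Hypothetical effect sizes (g) and sampling variances for five studies.
g_values  = [0.80, 0.55, 0.20, 0.65, 0.45]
variances = [0.04, 0.02, 0.10, 0.05, 0.03]
print(round(weighted_mean_effect_size(g_values, variances), 2))  # -> 0.56
```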

All but two of the non-meta-analytic studies of achievement in Table S1 (Supplementary Material) were quasi-experimental or experimental, providing relatively rigorous evidence that their treatment groups outperformed their comparison or control groups in terms of everything from writing to dart-throwing, map-making, speaking English, and exams in a wide variety of disciplines. One experiment on summative self-assessment ( Miller and Geraci, 2011 ), in contrast, resulted in no improvements in exam scores, while the other one did ( Raaijmakers et al., 2017 ).

It would be easy to overgeneralize and claim that the question about the effect of self-assessment on learning has been answered, but there are unanswered questions about the key components of effective self-assessment, especially social-emotional components related to power and trust ( Andrade and Brown, 2016 ). The trends are pretty clear, however: it appears that formative forms of self-assessment can promote knowledge and skill development. This is not surprising, given that it involves many of the processes known to support learning, including practice, feedback, revision, and especially the intellectually demanding work of making complex, criteria-referenced judgments ( Panadero et al., 2014 ). Boud (1995a , b) predicted this trend when he noted that many self-assessment processes undermine learning by rushing to judgment, thereby failing to engage students with the standards or criteria for their work.

Self-Regulated Learning

The association between self-assessment and learning has also been explained in terms of self-regulation ( Andrade, 2010 ; Panadero and Alonso-Tapia, 2013 ; Andrade and Brookhart, 2016 , 2019 ; Panadero et al., 2016b ). Self-regulated learning (SRL) occurs when learners set goals and then monitor and manage their thoughts, feelings, and actions to reach those goals. SRL is moderately to highly correlated with achievement ( Zimmerman and Schunk, 2011 ). Research suggests that formative assessment is a potential influence on SRL ( Nicol and Macfarlane-Dick, 2006 ). The 12 studies in Table S1 (Supplementary Material) that focus on SRL demonstrate the recent increase in interest in the relationship between self-assessment and SRL.

Conceptual and practical overlaps between the two fields are abundant. In fact, Brown and Harris (2014) recommend that student self-assessment no longer be treated as an assessment, but as an essential competence for self-regulation. Butler and Winne (1995) introduced the role of self-generated feedback in self-regulation years ago:

[For] all self-regulated activities, feedback is an inherent catalyst. As learners monitor their engagement with tasks, internal feedback is generated by the monitoring process. That feedback describes the nature of outcomes and the qualities of the cognitive processes that led to those states (p. 245).

The outcomes and processes referred to by Butler and Winne are many of the same products and processes I referred to earlier in the definition of self-assessment and in Table 1 .

In general, research and practice related to self-assessment have tended to focus on judging the products of student learning, while scholarship on self-regulated learning encompasses both processes and products. Because of its very practical focus, self-assessment research might be playing catch-up with the SRL literature in terms of theory development; the SRL literature is grounded in experimental paradigms from cognitive psychology ( de Bruin and van Gog, 2012 ), whereas self-assessment research is ahead in terms of implementation (E. Panadero, personal communication, October 21, 2016). One major exception is the work done on Self-regulated Strategy Development ( Glaser and Brunstein, 2007 ; Harris et al., 2008 ), which has successfully integrated SRL research with classroom practices, including self-assessment, to teach writing to students with special needs.

Nicol and Macfarlane-Dick (2006) have been explicit about the potential for self-assessment practices to support self-regulated learning:

To develop systematically the learner's capacity for self-regulation, teachers need to create more structured opportunities for self-monitoring and the judging of progression to goals. Self-assessment tasks are an effective way of achieving this, as are activities that encourage reflection on learning progress (p. 207).

The studies of SRL in Table S1 (Supplementary Material) provide encouraging findings regarding the potential role of self-assessment in promoting achievement, self-regulated learning in general, and metacognition and study strategies related to task selection in particular. The studies also represent a solution to the “methodological and theoretical challenges involved in bringing metacognitive research to the real world, using meaningful learning materials” ( Koriat, 2012 , p. 296).

Future Directions for Research

I agree with Yan and Brown's ( 2017 ) statement that “from a pedagogical perspective, the benefits of self-assessment may come from active engagement in the learning process, rather than by being 'veridical' or coinciding with reality, because students' reflection and metacognitive monitoring lead to improved learning” (p. 1,248). Future research should focus less on accuracy/consistency/veridicality and more on the precise mechanisms of self-assessment ( Butler, 2018 ).

An important aspect of research on self-assessment that is not explicitly represented in Table S1 (Supplementary Material) is practice, or pedagogy: Under what conditions does self-assessment work best, and how are those conditions influenced by context? Fortunately, the studies listed in the table, as well as others (see especially Andrade and Valtcheva, 2009 ; Nielsen, 2014 ; Panadero et al., 2016a ), point toward an answer. But we still have questions about how best to scaffold effective formative self-assessment. One area of inquiry is about the characteristics of the task being assessed, and the standards or criteria used by learners during self-assessment.

Influence of Types of Tasks and Standards or Criteria

Type of task or competency assessed seems to matter (e.g., Dolosic, 2018 , Nguyen and Foster, 2018 ), as do the criteria ( Yilmaz, 2017 ), but we do not yet have a comprehensive understanding of how or why. There is some evidence that it is important that the criteria used to self-assess are concrete, task-specific ( Butler, 2018 ), and graduated. For example, Fastre et al. (2010) revealed an association between self-assessment according to task-specific criteria and task performance: In a quasi-experimental study of 39 novice vocational education students studying stoma care, they compared concrete, task-specific criteria (“performance-based criteria”) such as “Introduces herself to the patient” and “Consults the care file for details concerning the stoma” to vaguer, “competence-based criteria” such as “Shows interest, listens actively, shows empathy to the patient” and “Is discrete with sensitive topics.” The performance-based criteria group outperformed the competence-based group on tests of task performance, presumably because “performance-based criteria make it easier to distinguish levels of performance, enabling a step-by-step process of performance improvement” (p. 530).

This finding echoes the results of a study of self-regulated learning by Kitsantas and Zimmerman (2006) , who argued that “fine-grained standards can have two key benefits: They can enable learners to be more sensitive to small changes in skill and make more appropriate adaptations in learning strategies” (p. 203). In their study, 70 college students were taught how to throw darts at a target. The purpose of the study was to examine the role of graphing of self-recorded outcomes and self-evaluative standards in learning a motor skill. Students who were provided with graduated self-evaluative standards surpassed “those who were provided with absolute standards or no standards (control) in both motor skill and in motivational beliefs (i.e., self-efficacy, attributions, and self-satisfaction)” (p. 201). Kitsantas and Zimmerman hypothesized that setting high absolute standards would limit a learner's sensitivity to small improvements in functioning. This hypothesis was supported by the finding that students who set absolute standards reported significantly less awareness of learning progress (and hit the bull's-eye less often) than students who set graduated standards. “The correlation between the self-evaluation and dart-throwing outcomes measures was extraordinarily high ( r = 0.94)” (p. 210). Classroom-based research on specific, graduated self-assessment criteria would be informative.
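
Kitsantas and Zimmerman's hypothesis about graduated standards can be illustrated with a small, purely hypothetical sketch. The scores, the standards, and the meets_standard helper below are invented for illustration; the point is simply that a single high absolute standard registers almost no successes during gradual improvement, while graduated standards register progress on nearly every practice trial.

```python
def meets_standard(score, standard):
    """Hypothetical check of one practice outcome against a self-evaluative standard."""
    return score >= standard

# Invented practice scores showing gradual improvement over ten trials.
scores = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9]

absolute_standard = 9                                  # a single high bar
graduated_standards = [1, 2, 3, 3, 4, 5, 5, 6, 7, 8]   # bar rises with practice

hits_absolute  = sum(meets_standard(s, absolute_standard) for s in scores)
hits_graduated = sum(meets_standard(s, g) for s, g in zip(scores, graduated_standards))

# Prints "1 10": the absolute standard is met only once, so the learner
# perceives little progress; graduated standards are met on every trial.
print(hits_absolute, hits_graduated)
```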

Cognitive and Affective Mechanisms of Self-Assessment

There are many additional questions about pedagogy, such as the hoped-for investigation mentioned above of the relationship between accuracy in formative self-assessment, students' subsequent study behaviors, and their learning. There is also a need for research on how to help teachers give students a central role in their learning by creating space for self-assessment (e.g., see Hawe and Parr, 2014 ), and the complex power dynamics involved in doing so ( Tan, 2004 , 2009 ; Taras, 2008 ; Leach, 2012 ). However, there is an even more pressing need for investigations into the internal mechanisms experienced by students engaged in assessing their own learning. Angela Lui and I call this the next black box ( Lui, 2017 ).

Black and Wiliam (1998) used the term black box to emphasize the fact that what happened in most classrooms was largely unknown: all we knew was that some inputs (e.g., teachers, resources, standards, and requirements) were fed into the box, and that certain outputs (e.g., more knowledgeable and competent students, acceptable levels of achievement) would follow. But what, they asked, is happening inside, and what new inputs will produce better outputs? Black and Wiliam's review spawned a great deal of research on formative assessment, some but not all of which suggests a positive relationship with academic achievement ( Bennett, 2011 ; Kingston and Nash, 2011 ). To better understand why and how the use of formative assessment in general and self-assessment in particular is associated with improvements in academic achievement in some instances but not others, we need research that looks into the next black box: the cognitive and affective mechanisms of students who are engaged in assessment processes ( Lui, 2017 ).

The role of internal mechanisms has been discussed in theory but not yet fully tested. Crooks (1988) argued that the impact of assessment is influenced by students' interpretation of the tasks and results, and Butler and Winne (1995) theorized that both cognitive and affective processes play a role in determining how feedback is internalized and used to self-regulate learning. Other theoretical frameworks about the internal processes of receiving and responding to feedback have been developed (e.g., Nicol and Macfarlane-Dick, 2006 ; Draper, 2009 ; Andrade, 2013 ; Lipnevich et al., 2016 ). Yet, Shute (2008) noted in her review of the literature on formative feedback that “despite the plethora of research on the topic, the specific mechanisms relating feedback to learning are still mostly murky, with very few (if any) general conclusions” (p. 156). This area is ripe for research.

Self-assessment is the act of monitoring one's processes and products in order to make adjustments that deepen learning and enhance performance. Although it can be summative, the evidence presented in this review strongly suggests that self-assessment is most beneficial, in terms of both achievement and self-regulated learning, when it is used formatively and supported by training.

What is not yet clear is why and how self-assessment works. Those of you who like to investigate phenomena that are maddeningly difficult to measure will rejoice to hear that the cognitive and affective mechanisms of self-assessment are the next black box. Studies of the ways in which learners think and feel, the interactions between their thoughts and feelings and their context, and the implications for pedagogy will make major contributions to our field.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2019.00087/full#supplementary-material

1. I am grateful to my graduate assistants, Joanna Weaver and Taja Young, for conducting the searches.

Admiraal, W., Huisman, B., and Pilli, O. (2015). Assessment in massive open online courses. Electron. J. e-Learning , 13, 207–216.

Alaoutinen, S. (2012). Evaluating the effect of learning style and student background on self-assessment accuracy. Comput. Sci. Educ. 22, 175–198. doi: 10.1080/08993408.2012.692924

Al-Rawahi, N. M., and Al-Balushi, S. M. (2015). The effect of reflective science journal writing on students' self-regulated learning strategies. Int. J. Environ. Sci. Educ. 10, 367–379. doi: 10.12973/ijese.2015.250a

Andrade, H. (2010). “Students as the definitive source of formative assessment: academic self-assessment and the self-regulation of learning,” in Handbook of Formative Assessment , eds H. Andrade and G. Cizek (New York, NY: Routledge), 90–105.

Andrade, H. (2013). “Classroom assessment in the context of learning theory and research,” in Sage Handbook of Research on Classroom Assessment , ed J. H. McMillan (New York, NY: Sage), 17–34. doi: 10.4135/9781452218649.n2

Andrade, H. (2018). “Feedback in the context of self-assessment,” in Cambridge Handbook of Instructional Feedback , eds A. Lipnevich and J. Smith (Cambridge: Cambridge University Press), 376–408.

Andrade, H., and Boulay, B. (2003). The role of rubric-referenced self-assessment in learning to write. J. Educ. Res. 97, 21–34. doi: 10.1080/00220670309596625

Andrade, H., and Brookhart, S. (2019). Classroom assessment as the co-regulation of learning. Assessm. Educ. Principles Policy Pract. doi: 10.1080/0969594X.2019.1571992

Andrade, H., and Brookhart, S. M. (2016). “The role of classroom assessment in supporting self-regulated learning,” in Assessment for Learning: Meeting the Challenge of Implementation , eds D. Laveault and L. Allal (Heidelberg: Springer), 293–309. doi: 10.1007/978-3-319-39211-0_17

Andrade, H., and Du, Y. (2007). Student responses to criteria-referenced self-assessment. Assess. Evalu. High. Educ. 32, 159–181. doi: 10.1080/02602930600801928

Andrade, H., Du, Y., and Mycek, K. (2010). Rubric-referenced self-assessment and middle school students' writing. Assess. Educ. 17, 199–214. doi: 10.1080/09695941003696172

Andrade, H., Du, Y., and Wang, X. (2008). Putting rubrics to the test: The effect of a model, criteria generation, and rubric-referenced self-assessment on elementary school students' writing. Educ. Meas. 27, 3–13. doi: 10.1111/j.1745-3992.2008.00118.x

Andrade, H., and Valtcheva, A. (2009). Promoting learning and achievement through self- assessment. Theory Pract. 48, 12–19. doi: 10.1080/00405840802577544

Andrade, H., Wang, X., Du, Y., and Akawi, R. (2009). Rubric-referenced self-assessment and self-efficacy for writing. J. Educ. Res. 102, 287–302. doi: 10.3200/JOER.102.4.287-302

Andrade, H. L., and Brown, G. T. L. (2016). “Student self-assessment in the classroom,” in Handbook of Human and Social Conditions in Assessment , eds G. T. L. Brown and L. R. Harris (New York, NY: Routledge), 319–334.

Baars, M., Vink, S., van Gog, T., de Bruin, A., and Paas, F. (2014). Effects of training self-assessment and using assessment standards on retrospective and prospective monitoring of problem solving. Learn. Instruc. 33, 92–107. doi: 10.1016/j.learninstruc.2014.04.004

Balderas, I., and Cuamatzi, P. M. (2018). Self and peer correction to improve college students' writing skills. Profile. 20, 179–194. doi: 10.15446/profile.v20n2.67095

Bandura, A. (1997). Self-efficacy: The Exercise of Control . New York, NY: Freeman.

Barney, S., Khurum, M., Petersen, K., Unterkalmsteiner, M., and Jabangwe, R. (2012). Improving students with rubric-based self-assessment and oral feedback. IEEE Transac. Educ. 55, 319–325. doi: 10.1109/TE.2011.2172981

Baxter, P., and Norman, G. (2011). Self-assessment or self deception? A lack of association between nursing students' self-assessment and performance. J. Adv. Nurs. 67, 2406–2413. doi: 10.1111/j.1365-2648.2011.05658.x

Bennett, R. E. (2011). Formative assessment: a critical review. Assess. Educ. 18, 5–25. doi: 10.1080/0969594X.2010.513678

Birjandi, P., and Hadidi Tamjid, N. (2012). The role of self-, peer and teacher assessment in promoting Iranian EFL learners' writing performance. Assess. Evalu. High. Educ. 37, 513–533. doi: 10.1080/02602938.2010.549204

Bjork, R. A., Dunlosky, J., and Kornell, N. (2013). Self-regulated learning: beliefs, techniques, and illusions. Annu. Rev. Psychol. 64, 417–444. doi: 10.1146/annurev-psych-113011-143823

Black, P., Harrison, C., Lee, C., Marshall, B., and Wiliam, D. (2003). Assessment for Learning: Putting it into Practice . Berkshire: Open University Press.

Black, P., and Wiliam, D. (1998). Inside the black box: raising standards through classroom assessment. Phi Delta Kappan 80, 139–144; 146–148.

Blanch-Hartigan, D. (2011). Medical students' self-assessment of performance: results from three meta-analyses. Patient Educ. Counsel. 84, 3–9. doi: 10.1016/j.pec.2010.06.037

Bol, L., Hacker, D. J., Walck, C. C., and Nunnery, J. A. (2012). The effects of individual or group guidelines on the calibration accuracy and achievement of high school biology students. Contemp. Educ. Psychol. 37, 280–287. doi: 10.1016/j.cedpsych.2012.02.004

Boud, D. (1995a). Implementing Student Self-Assessment, 2nd Edn. Australian Capital Territory: Higher Education Research and Development Society of Australasia.

Boud, D. (1995b). Enhancing Learning Through Self-Assessment. London: Kogan Page.

Boud, D. (1999). Avoiding the traps: Seeking good practice in the use of self-assessment and reflection in professional courses. Soc. Work Educ. 18, 121–132. doi: 10.1080/02615479911220131

Boud, D., and Brew, A. (1995). Developing a typology for learner self-assessment practices. Res. Dev. High. Educ. 18, 130–135.

Bourke, R. (2014). Self-assessment in professional programmes within tertiary institutions. Teach. High. Educ. 19, 908–918. doi: 10.1080/13562517.2014.934353

Bourke, R. (2016). Liberating the learner through self-assessment. Cambridge J. Educ. 46, 97–111. doi: 10.1080/0305764X.2015.1015963

Brown, G., Andrade, H., and Chen, F. (2015). Accuracy in student self-assessment: directions and cautions for research. Assess. Educ. 22, 444–457. doi: 10.1080/0969594X.2014.996523

Brown, G. T., and Harris, L. R. (2013). “Student self-assessment,” in Sage Handbook of Research on Classroom Assessment , ed J. H. McMillan (Los Angeles, CA: Sage), 367–393. doi: 10.4135/9781452218649.n21

Brown, G. T. L., and Harris, L. R. (2014). The future of self-assessment in classroom practice: reframing self-assessment as a core competency. Frontline Learn. Res. 3, 22–30. doi: 10.14786/flr.v2i1.24

Butler, D. L., and Winne, P. H. (1995). Feedback and self-regulated learning: a theoretical synthesis. Rev. Educ. Res. 65, 245–281. doi: 10.3102/00346543065003245

Butler, Y. G. (2018). “Young learners' processes and rationales for responding to self-assessment items: cases for generic can-do and five-point Likert-type formats,” in Useful Assessment and Evaluation in Language Education , eds J. Davis et al. (Washington, DC: Georgetown University Press), 21–39. doi: 10.2307/j.ctvvngrq.5

Chang, C.-C., Liang, C., and Chen, Y.-H. (2013). Is learner self-assessment reliable and valid in a Web-based portfolio environment for high school students? Comput. Educ. 60, 325–334. doi: 10.1016/j.compedu.2012.05.012

Chang, C.-C., Tseng, K.-H., and Lou, S.-J. (2012). A comparative analysis of the consistency and difference among teacher-assessment, student self-assessment and peer-assessment in a Web-based portfolio assessment environment for high school students. Comput. Educ. 58, 303–320. doi: 10.1016/j.compedu.2011.08.005

Colliver, J., Verhulst, S., and Barrows, H. (2005). Self-assessment in medical practice: a further concern about the conventional research paradigm. Teach. Learn. Med. 17, 200–201. doi: 10.1207/s15328015tlm1703_1

Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Rev. Educ. Res. 58, 438–481. doi: 10.3102/00346543058004438

de Bruin, A. B. H., and van Gog, T. (2012). Improving self-monitoring and self-regulation: from cognitive psychology to the classroom. Learn. Instruct. 22, 245–252. doi: 10.1016/j.learninstruc.2012.01.003

De Grez, L., Valcke, M., and Roozen, I. (2012). How effective are self- and peer assessment of oral presentation skills compared with teachers' assessments? Active Learn. High. Educ. 13, 129–142. doi: 10.1177/1469787412441284

Dolosic, H. (2018). An examination of self-assessment and interconnected facets of second language reading. Read. Foreign Langu. 30, 189–208.

Draper, S. W. (2009). What are learners actually regulating when given feedback? Br. J. Educ. Technol. 40, 306–315. doi: 10.1111/j.1467-8535.2008.00930.x

Dunlosky, J., and Ariel, R. (2011). “Self-regulated learning and the allocation of study time,” in Psychology of Learning and Motivation , Vol. 54 ed B. Ross (Cambridge, MA: Academic Press), 103–140. doi: 10.1016/B978-0-12-385527-5.00004-8

Dunlosky, J., and Rawson, K. A. (2012). Overconfidence produces underachievement: inaccurate self evaluations undermine students' learning and retention. Learn. Instr. 22, 271–280. doi: 10.1016/j.learninstruc.2011.08.003

Dweck, C. (2006). Mindset: The New Psychology of Success. New York, NY: Random House.

Epstein, R. M., Siegel, D. J., and Silberman, J. (2008). Self-monitoring in clinical practice: a challenge for medical educators. J. Contin. Educ. Health Prof. 28, 5–13. doi: 10.1002/chp.149

Eva, K. W., and Regehr, G. (2008). “I'll never play professional football” and other fallacies of self-assessment. J. Contin. Educ. Health Prof. 28, 14–19. doi: 10.1002/chp.150

Falchikov, N. (2005). Improving Assessment Through Student Involvement: Practical Solutions for Aiding Learning in Higher and Further Education . London: Routledge Falmer.

Fastre, G. M. J., van der Klink, M. R., Sluijsmans, D., and van Merrienboer, J. J. G. (2012). Drawing students' attention to relevant assessment criteria: effects on self-assessment skills and performance. J. Voc. Educ. Train. 64, 185–198. doi: 10.1080/13636820.2011.630537

Fastre, G. M. J., van der Klink, M. R., and van Merrienboer, J. J. G. (2010). The effects of performance-based assessment criteria on student performance and self-assessment skills. Adv. Health Sci. Educ. 15, 517–532. doi: 10.1007/s10459-009-9215-x

Fitzpatrick, B., and Schulz, H. (2016). “Teaching young students to self-assess critically,” Paper presented at the Annual Meeting of the American Educational Research Association (Washington, DC).

Franken, A. S. (1992). I'm Good Enough, I'm Smart Enough, and Doggone it, People Like Me! Daily affirmations by Stuart Smalley. New York, NY: Dell.

Glaser, C., and Brunstein, J. C. (2007). Improving fourth-grade students' composition skills: effects of strategy instruction and self-regulation procedures. J. Educ. Psychol. 99, 297–310. doi: 10.1037/0022-0663.99.2.297

Gonida, E. N., and Leondari, A. (2011). Patterns of motivation among adolescents with biased and accurate self-efficacy beliefs. Int. J. Educ. Res. 50, 209–220. doi: 10.1016/j.ijer.2011.08.002

Graham, S., Hebert, M., and Harris, K. R. (2015). Formative assessment and writing. Elem. Sch. J. 115, 523–547. doi: 10.1086/681947

Guillory, J. J., and Blankson, A. N. (2017). Using recently acquired knowledge to self-assess understanding in the classroom. Sch. Teach. Learn. Psychol. 3, 77–89. doi: 10.1037/stl0000079

Hacker, D. J., Bol, L., Horgan, D. D., and Rakow, E. A. (2000). Test prediction and performance in a classroom context. J. Educ. Psychol. 92, 160–170. doi: 10.1037/0022-0663.92.1.160

Harding, J. L., and Hbaci, I. (2015). Evaluating pre-service teachers math teaching experience from different perspectives. Univ. J. Educ. Res. 3, 382–389. doi: 10.13189/ujer.2015.030605

Harris, K. R., Graham, S., Mason, L. H., and Friedlander, B. (2008). Powerful Writing Strategies for All Students . Baltimore, MD: Brookes.

Harris, L. R., and Brown, G. T. L. (2013). Opportunities and obstacles to consider when using peer- and self-assessment to improve student learning: case studies into teachers' implementation. Teach. Teach. Educ. 36, 101–111. doi: 10.1016/j.tate.2013.07.008

Hattie, J., and Timperley, H. (2007). The power of feedback. Rev. Educ. Res. 77, 81–112. doi: 10.3102/003465430298487

Hawe, E., and Parr, J. (2014). Assessment for learning in the writing classroom: an incomplete realization. Curr. J. 25, 210–237. doi: 10.1080/09585176.2013.862172

Hawkins, S. C., Osborne, A., Schofield, S. J., Pournaras, D. J., and Chester, J. F. (2012). Improving the accuracy of self-assessment of practical clinical skills using video feedback: the importance of including benchmarks. Med. Teach. 34, 279–284. doi: 10.3109/0142159X.2012.658897

Huang, Y., and Gui, M. (2015). Articulating teachers' expectations afore: Impact of rubrics on Chinese EFL learners' self-assessment and speaking ability. J. Educ. Train. Stud. 3, 126–132. doi: 10.11114/jets.v3i3.753

Kaderavek, J. N., Gillam, R. B., Ukrainetz, T. A., Justice, L. M., and Eisenberg, S. N. (2004). School-age children's self-assessment of oral narrative production. Commun. Disord. Q. 26, 37–48. doi: 10.1177/15257401040260010401

Karnilowicz, W. (2012). A comparison of self-assessment and tutor assessment of undergraduate psychology students. Soc. Behav. Person. 40, 591–604. doi: 10.2224/sbp.2012.40.4.591

Kevereski, L. (2017). (Self) evaluation of knowledge in students' population in higher education in Macedonia. Res. Pedag. 7, 69–75. doi: 10.17810/2015.49

Kingston, N. M., and Nash, B. (2011). Formative assessment: a meta-analysis and a call for research. Educ. Meas. 30, 28–37. doi: 10.1111/j.1745-3992.2011.00220.x

Kitsantas, A., and Zimmerman, B. J. (2006). Enhancing self-regulation of practice: the influence of graphing and self-evaluative standards. Metacogn. Learn. 1, 201–212. doi: 10.1007/s11409-006-9000-7

Kluger, A. N., and DeNisi, A. (1996). The effects of feedback interventions on performance: a historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychol. Bull. 119, 254–284. doi: 10.1037/0033-2909.119.2.254

Kollar, I., Fischer, F., and Hesse, F. (2006). Collaboration scripts: a conceptual analysis. Educ. Psychol. Rev. 18, 159–185. doi: 10.1007/s10648-006-9007-2

Kolovelonis, A., Goudas, M., and Dermitzaki, I. (2012). Students' performance calibration in a basketball dribbling task in elementary physical education. Int. Electron. J. Elem. Educ. 4, 507–517.

Koriat, A. (2012). The relationships between monitoring, regulation and performance. Learn. Instru. 22, 296–298. doi: 10.1016/j.learninstruc.2012.01.002

Kostons, D., van Gog, T., and Paas, F. (2012). Training self-assessment and task-selection skills: a cognitive approach to improving self-regulated learning. Learn. Instruc. 22, 121–132. doi: 10.1016/j.learninstruc.2011.08.004

Labuhn, A. S., Zimmerman, B. J., and Hasselhorn, M. (2010). Enhancing students' self-regulation and mathematics performance: the influence of feedback and self-evaluative standards. Metacogn. Learn. 5, 173–194. doi: 10.1007/s11409-010-9056-2

Leach, L. (2012). Optional self-assessment: some tensions and dilemmas. Assess. Evalu. High. Educ. 37, 137–147. doi: 10.1080/02602938.2010.515013

Lew, M. D. N., Alwis, W. A. M., and Schmidt, H. G. (2010). Accuracy of students' self-assessment and their beliefs about its utility. Assess. Evalu. High. Educ. 35, 135–156. doi: 10.1080/02602930802687737

Lin-Siegler, X., Shaenfield, D., and Elder, A. D. (2015). Contrasting case instruction can improve self-assessment of writing. Educ. Technol. Res. Dev. 63, 517–537. doi: 10.1007/s11423-015-9390-9

Lipnevich, A. A., Berg, D. A. G., and Smith, J. K. (2016). “Toward a model of student response to feedback,” in The Handbook of Human and Social Conditions in Assessment , eds G. T. L. Brown and L. R. Harris (New York, NY: Routledge), 169–185.

Lopez, R., and Kossack, S. (2007). Effects of recurring use of self-assessment in university courses. Int. J. Learn. 14, 203–216. doi: 10.18848/1447-9494/CGP/v14i04/45277

Lopez-Pastor, V. M., Fernandez-Balboa, J.-M., Santos Pastor, M. L., and Aranda, A. F. (2012). Students' self-grading, professor's grading and negotiated final grading at three university programmes: analysis of reliability and grade difference ranges and tendencies. Assess. Evalu. High. Educ. 37, 453–464. doi: 10.1080/02602938.2010.545868

Lui, A. (2017). Validity of the responses to feedback survey: operationalizing and measuring students' cognitive and affective responses to teachers' feedback (Doctoral dissertation). University at Albany—SUNY: Albany NY.

Marks, M. B., Haug, J. C., and Hu, H. (2018). Investigating undergraduate business internships: do supervisor and self-evaluations differ? J. Educ. Bus. 93, 33–45. doi: 10.1080/08832323.2017.1414025

Memis, E. K., and Seven, S. (2015). Effects of an SWH approach and self-evaluation on sixth grade students' learning and retention of an electricity unit. Int. J. Prog. Educ. 11, 32–49.

Metcalfe, J., and Kornell, N. (2005). A region of proximal learning model of study time allocation. J. Mem. Langu. 52, 463–477. doi: 10.1016/j.jml.2004.12.001

Meusen-Beekman, K. D., Joosten-ten Brinke, D., and Boshuizen, H. P. A. (2016). Effects of formative assessments to develop self-regulation among sixth grade students: results from a randomized controlled intervention. Stud. Educ. Evalu. 51, 126–136. doi: 10.1016/j.stueduc.2016.10.008

Micán, D. A., and Medina, C. L. (2017). Boosting vocabulary learning through self-assessment in an English language teaching context. Assess. Evalu. High. Educ. 42, 398–414. doi: 10.1080/02602938.2015.1118433

Miller, T. M., and Geraci, L. (2011). Training metacognition in the classroom: the influence of incentives and feedback on exam predictions. Metacogn. Learn. 6, 303–314. doi: 10.1007/s11409-011-9083-7

Murakami, C., Valvona, C., and Broudy, D. (2012). Turning apathy into activeness in oral communication classes: regular self- and peer-assessment in a TBLT programme. System 40, 407–420. doi: 10.1016/j.system.2012.07.003

Nagel, M., and Lindsey, B. (2018). The use of classroom clickers to support improved self-assessment in introductory chemistry. J. College Sci. Teach. 47, 72–79.

Ndoye, A. (2017). Peer/self-assessment and student learning. Int. J. Teach. Learn. High. Educ. 29, 255–269.

Nguyen, T., and Foster, K. A. (2018). Research note—multiple time point course evaluation and student learning outcomes in an MSW course. J. Soc. Work Educ. 54, 715–723. doi: 10.1080/10437797.2018.1474151

Nicol, D., and Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: a model and seven principles of good feedback practice. Stud. High. Educ. 31, 199–218. doi: 10.1080/03075070600572090

Nielsen, K. (2014). Self-assessment methods in writing instruction: a conceptual framework, successful practices and essential strategies. J. Res. Read. 37, 1–16. doi: 10.1111/j.1467-9817.2012.01533.x

Nowell, C., and Alston, R. M. (2007). I thought I got an A! Overconfidence across the economics curriculum. J. Econ. Educ. 38, 131–142. doi: 10.3200/JECE.38.2.131-142

Nugteren, M. L., Jarodzka, H., Kester, L., and Van Merriënboer, J. J. G. (2018). Self-regulation of secondary school students: self-assessments are inaccurate and insufficiently used for learning-task selection. Instruc. Sci. 46, 357–381. doi: 10.1007/s11251-018-9448-2

Panadero, E., and Alonso-Tapia, J. (2013). Self-assessment: theoretical and practical connotations. When it happens, how is it acquired and what to do to develop it in our students. Electron. J. Res. Educ. Psychol. 11, 551–576. doi: 10.14204/ejrep.30.12200

Panadero, E., Alonso-Tapia, J., and Huertas, J. A. (2012). Rubrics and self-assessment scripts effects on self-regulation, learning and self-efficacy in secondary education. Learn. Individ. Differ. 22, 806–813. doi: 10.1016/j.lindif.2012.04.007

Panadero, E., Alonso-Tapia, J., and Huertas, J. A. (2014). Rubrics vs. self-assessment scripts: effects on first year university students' self-regulation and performance. J. Study Educ. Dev. 3, 149–183. doi: 10.1080/02103702.2014.881655

Panadero, E., Alonso-Tapia, J., and Reche, E. (2013). Rubrics vs. self-assessment scripts effect on self-regulation, performance and self-efficacy in pre-service teachers. Stud. Educ. Evalu. 39, 125–132. doi: 10.1016/j.stueduc.2013.04.001

Panadero, E., Brown, G. L., and Strijbos, J.-W. (2016a). The future of student self-assessment: a review of known unknowns and potential directions. Educ. Psychol. Rev. 28, 803–830. doi: 10.1007/s10648-015-9350-2

Panadero, E., Jonsson, A., and Botella, J. (2017). Effects of self-assessment on self-regulated learning and self-efficacy: four meta-analyses. Educ. Res. Rev. 22, 74–98. doi: 10.1016/j.edurev.2017.08.004

Panadero, E., Jonsson, A., and Strijbos, J. W. (2016b). “Scaffolding self-regulated learning through self-assessment and peer assessment: guidelines for classroom implementation,” in Assessment for Learning: Meeting the Challenge of Implementation , eds D. Laveault and L. Allal (New York, NY: Springer), 311–326. doi: 10.1007/978-3-319-39211-0_18

Panadero, E., and Romero, M. (2014). To rubric or not to rubric? The effects of self-assessment on self-regulation, performance and self-efficacy. Assess. Educ. 21, 133–148. doi: 10.1080/0969594X.2013.877872

Papanthymou, A., and Darra, M. (2018). Student self-assessment in higher education: The international experience and the Greek example. World J. Educ. 8, 130–146. doi: 10.5430/wje.v8n6p130

Punhagui, G. C., and de Souza, N. A. (2013). Self-regulation in the learning process: actions through self-assessment activities with Brazilian students. Int. Educ. Stud. 6, 47–62. doi: 10.5539/ies.v6n10p47

Raaijmakers, S. F., Baars, M., Paas, F., van Merriënboer, J. J. G., and van Gog, T. (2019). Metacognition and Learning , 1–22. doi: 10.1007/s11409-019-09189-5

Raaijmakers, S. F., Baars, M., Schapp, L., Paas, F., van Merrienboer, J., and van Gog, T. (2017). Training self-regulated learning with video modeling examples: do task-selection skills transfer? Instr. Sci. 46, 273–290. doi: 10.1007/s11251-017-9434-0

Ratminingsih, N. M., Marhaeni, A. A. I. N., and Vigayanti, L. P. D. (2018). Self-assessment: the effect on students' independence and writing competence. Int. J. Instruc. 11, 277–290. doi: 10.12973/iji.2018.11320a

Ross, J. A., Rolheiser, C., and Hogaboam-Gray, A. (1998). “Impact of self-evaluation training on mathematics achievement in a cooperative learning environment,” Paper presented at the annual meeting of the American Educational Research Association (San Diego, CA).

Ross, J. A., and Starling, M. (2008). Self-assessment in a technology-supported environment: the case of grade 9 geography. Assess. Educ. 15, 183–199. doi: 10.1080/09695940802164218

Samaie, M., Nejad, A. M., and Qaracholloo, M. (2018). An inquiry into the efficiency of whatsapp for self- and peer-assessments of oral language proficiency. Br. J. Educ. Technol. 49, 111–126. doi: 10.1111/bjet.12519

Sanchez, C. E., Atkinson, K. M., Koenka, A. C., Moshontz, H., and Cooper, H. (2017). Self-grading and peer-grading for formative and summative assessments in 3rd through 12th grade classrooms: a meta-analysis. J. Educ. Psychol. 109, 1049–1066. doi: 10.1037/edu0000190

Sargeant, J. (2008). Toward a common understanding of self-assessment. J. Contin. Educ. Health Prof. 28, 1–4. doi: 10.1002/chp.148

Sargeant, J., Mann, K., van der Vleuten, C., and Metsemakers, J. (2008). “Directed” self-assessment: practice and feedback within a social context. J. Contin. Educ. Health Prof. 28, 47–54. doi: 10.1002/chp.155

Shute, V. (2008). Focus on formative feedback. Rev. Educ. Res. 78, 153–189. doi: 10.3102/0034654307313795

Silver, I., Campbell, C., Marlow, B., and Sargeant, J. (2008). Self-assessment and continuing professional development: the Canadian perspective. J. Contin. Educ. Health Prof. 28, 25–31. doi: 10.1002/chp.152

Siow, L.-F. (2015). Students' perceptions on self- and peer-assessment in enhancing learning experience. Malaysian Online J. Educ. Sci. 3, 21–35.

Son, L. K., and Metcalfe, J. (2000). Metacognitive and control strategies in study-time allocation. J. Exp. Psychol. 26, 204–221. doi: 10.1037/0278-7393.26.1.204

Tan, K. (2004). Does student self-assessment empower or discipline students? Assess. Evalu. Higher Educ. 29, 651–662. doi: 10.1080/0260293042000227209

Tan, K. (2009). Meanings and practices of power in academics' conceptions of student self-assessment. Teach. High. Educ. 14, 361–373. doi: 10.1080/13562510903050111

Taras, M. (2008). Issues of power and equity in two models of self-assessment. Teach. High. Educ. 13, 81–92. doi: 10.1080/13562510701794076

Tejeiro, R. A., Gomez-Vallecillo, J. L., Romero, A. F., Pelegrina, M., Wallace, A., and Emberley, E. (2012). Summative self-assessment in higher education: implications of its counting towards the final mark. Electron. J. Res. Educ. Psychol. 10, 789–812.

Thawabieh, A. M. (2017). A comparison between students' self-assessment and teachers' assessment. J. Curri. Teach. 6, 14–20. doi: 10.5430/jct.v6n1p14

Tulgar, A. T. (2017). Selfie@ssessment as an alternative form of self-assessment at undergraduate level in higher education. J. Langu. Linguis. Stud. 13, 321–335.

van Helvoort, A. A. J. (2012). How adult students in information studies use a scoring rubric for the development of their information literacy skills. J. Acad. Librarian. 38, 165–171. doi: 10.1016/j.acalib.2012.03.016

van Loon, M. H., de Bruin, A. B. H., van Gog, T., van Merriënboer, J. J. G., and Dunlosky, J. (2014). Can students evaluate their understanding of cause-and-effect relations? The effects of diagram completion on monitoring accuracy. Acta Psychol. 151, 143–154. doi: 10.1016/j.actpsy.2014.06.007

van Reybroeck, M., Penneman, J., Vidick, C., and Galand, B. (2017). Progressive treatment and self-assessment: Effects on students' automatisation of grammatical spelling and self-efficacy beliefs. Read. Writing 30, 1965–1985. doi: 10.1007/s11145-017-9761-1

Wang, W. (2017). Using rubrics in student self-assessment: student perceptions in the English as a foreign language writing context. Assess. Evalu. High. Educ. 42, 1280–1292. doi: 10.1080/02602938.2016.1261993

Wollenschläger, M., Hattie, J., Machts, N., Möller, J., and Harms, U. (2016). What makes rubrics effective in teacher-feedback? Transparency of learning goals is not enough. Contemp. Educ. Psychol. 44–45, 1–11. doi: 10.1016/j.cedpsych.2015.11.003

Yan, Z., and Brown, G. T. L. (2017). A cyclical self-assessment process: towards a model of how students engage in self-assessment. Assess. Evalu. High. Educ. 42, 1247–1262. doi: 10.1080/02602938.2016.1260091

Yilmaz, F. N. (2017). Reliability of scores obtained from self-, peer-, and teacher-assessments on teaching materials prepared by teacher candidates. Educ. Sci. 17, 395–409. doi: 10.12738/estp.2017.2.0098

Zimmerman, B. J. (2000). Self-efficacy: an essential motive to learn. Contemp. Educ. Psychol. 25, 82–91. doi: 10.1006/ceps.1999.1016

Zimmerman, B. J., and Schunk, D. H. (2011). “Self-regulated learning and performance: an introduction and overview,” in Handbook of Self-Regulation of Learning and Performance , eds B. J. Zimmerman and D. H. Schunk (New York, NY: Routledge), 1–14.

Keywords: self-assessment, self-evaluation, self-grading, formative assessment, classroom assessment, self-regulated learning (SRL)

Citation: Andrade HL (2019) A Critical Review of Research on Student Self-Assessment. Front. Educ. 4:87. doi: 10.3389/feduc.2019.00087

Received: 27 April 2019; Accepted: 02 August 2019; Published: 27 August 2019.

Copyright © 2019 Andrade. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Heidi L. Andrade, handrade@albany.edu

This article is part of the Research Topic

Advances in Classroom Assessment Theory and Practice

Charting the Future of Assessments E.T.S. Research Institute: Charting the Future of Assessments

This in-depth report delves into four areas of the future of assessment: skills for the future, innovative measures, operations breakthroughs and feedback.

In the future we will measure what matters, not what's easy to measure.

A view of a crowded classroom from the last row of seats. The background is blurred; the foreground shows a woman with her back to the camera, taking notes on her laptop.

Download the full report

Read the highlights

Meet the authors

Patrick c. kyllonen.

Distinguished Presidential Appointee

Patrick Kyllonen is a distinguished presidential appointee at ETS. He has conducted innovative research on (a) higher education assessment, (b) workforce readiness, (c) international large-scale assessment (e.g., Program for International Student Assessment, or PISA), and (d) 21st century skills assessment, such as creativity, collaborative problem solving and situational interviews. He received his B.A. from St. John's University and Ph.D. from Stanford University and is author of Generating Items for Cognitive Tests (with S. Irvine, 2001), Learning and Individual Differences (with P. L. Ackerman & R. D. Roberts, 1999), Extending Intelligence: Enhancement and New Constructs (with R. Roberts & L. Stankov, 2008), and Innovative Assessment of Collaboration (with A. von Davier & M. Zhu, 2017). He is a fellow of the American Psychological Association and the American Educational Research Association, recipient of The Technical Cooperation Program Achievement Award for the “design, development, and evaluation of the Trait-Self Description (TSD) Personality Inventory,” and was a coauthor of the National Academy of Sciences 2012 report, Education for Life and Work: Developing Transferable Knowledge and Skills in the 21st Century.

Patrick C. Kyllonen

Chief Executive Officer

As CEO of ETS, Amit Sevak leads the largest private educational assessment organization in the world, with 2500 employees across 200 countries serving over 50 million people each year. Amit has been a driving force in education, learning and workforce development around the globe. He has led the University of Europe in Madrid in Spain, INTI International University in Malaysia and Universidad Tecnológica de México (UNITEC) in Mexico. His transformational style of leadership consistently led to innovation, better learning and improved job prospects for hundreds of thousands of students and workers.

Amit Sevak

Teresa Ober

Research Scientist

Teresa M. Ober is a Research Scientist at ETS. She received a Ph.D. in Educational Psychology from the Graduate Center of the City University of New York (CUNY). Before joining ETS, she was an Assistant Research Professor at the University of Notre Dame where she conducted research on student engagement and learning. Throughout her research, she maintains a focus on issues related to fairness and equity in educational assessment.

Teresa Ober

Managing Senior Research Scientist

Ikkyu Choi is a as lManaging Senior Research Scientist at ETS. He received a Ph.D. in applied linguistics from the University of California–Los Angeles with a specialization in language assessment in 2013. At ETS, he contributed to the development of statistical, analytic, and interpretative procedures for constructed-response data as well as automated item generation and associated workflow enhancements. His research has been published in peer-reviewed journals in multiple disciplines including educational measurement, language assessment, and natural language processing, and he received the 2019 best article award from International Language Testing Association (ILTA). He chaired a committee and served on multiple committees for ILTA. He has also served on the editorial advisory board of the National Council on Measurement in Education Instructional Topics in Educational Measurement Series and Language Assessment Quarterly.

Ikkyu Choi

Jesse Sparks

Jesse R. Sparks is a senior research scientist in the Learning and Assessment Foundations and Innovations research center in the Research & Development division at ETS. She joined ETS in 2013 after completing her Ph.D. in the learning sciences (with a certificate in cognitive science) at Northwestern University. Her research at ETS has focused on applying theory and research in the cognitive and learning sciences—especially areas of multiple source comprehension, discourse processing, social contexts of learning, and design of learning environments—toward the design of digital assessments for literacy, science, and social science domains, including scenario-based tasks that provide an overarching purpose, narrative, and virtual characters, game-based assessments that support student engagement, and adaptive assessments that capture and respond to students’ background characteristics and dynamic solution processes.

Jesse Sparks

Dan Fishtein

Research Project Manager

Daniel Fishtein is a Research Project Manager in the ETS Research Institute. He received an M.A. in Communication Studies, focusing on science communication and public communication and a B.A. in Psychology, both from the University of Rhode Island. Since joining Research in 2018, Daniel has worked on mission-driven themes such as increasing diversity in graduate admissions and tracking the trajectories of CTE students and their careers. Daniel managed the development and operationalization of the PSQ assessment as the RPM for the PSQ Research Team. Additionally, Daniel provides internal communication support and planning to the VP of ETS Research Institute and the Office of the CEO. Prior to joining The Research Institute, he worked for 4 years in the Policy Evaluation and Research Center (PERC) at ETS, providing data analysis and support to a number of higher education-focused projects.

Dan Fishtein


  • Open access
  • Published: 10 April 2024

Cultural adaptation and validation of the caring behaviors assessment tool into Spanish

  • Juan M. Leyva-Moral 1 ,
  • Carolina Watson 1 ,
  • Nina Granel 1 ,
  • Cecilia Raij-Johansen 1 &
  • Ricardo A. Ayala 2 , 3  

BMC Nursing, volume 23, Article number: 240 (2024)

The aim of the research was to translate, culturally adapt and validate the Caring Behaviors Assessment (CBA) tool in Spain, ensuring its appropriateness in the Spanish cultural context.

Three-phase cross-cultural adaptation and validation study. Phase 1 involved the transculturation process, which included translation of the CBA tool from English to Spanish, back-translation, and refinement of the translated tool based on pilot testing and linguistic and cultural adjustments. Phase 2 involved training research assistants to ensure standardized administration of the instrument. Phase 3 involved administering the transculturally-adapted tool to a non-probabilistic sample of 402 adults who had been hospitalized within the previous 6 months. Statistical analyses were conducted to assess the consistency of the item-scale, demographic differences, validity of the tool, and the importance of various caring behaviors within the Spanish cultural context. R statistical software version 4.3.3 and psych package version 2.4.1 were used for statistical analyses.

The overall internal consistency of the CBA tool was high, indicating its reliability for assessing caring behaviors. The subscales within the instrument also demonstrated high internal consistency. Descriptive analysis revealed that Spanish participants prioritized technical and cognitive aspects of care over emotional and existential dimensions.

Conclusions

The new version of the tool proved to be valid, reliable and culturally situated, which will facilitate the provision of objective and reliable data on patients' beliefs about what is essential in terms of care behaviors in Spain.

• This paper provides a culturally translated, adapted, and validated version of the Caring Behaviors Assessment tool in the Spanish context, which can be used to obtain reliable and culturally adapted data on essential aspects of patient care.

• The findings of this study contribute to the wider global clinical community by demonstrating the importance of considering cultural factors when assessing and evaluating patient care from patients’ own perspective, and also emphasizes the need for culturally sensitive approaches in healthcare settings.

• This validated instrument facilitates the measurement of caring behaviors in the Spanish context, allowing for objective evaluation and improvement. Use of the Caring Behaviors Assessment tool could thus serve as a valuable resource for both future research and clinical practice.

Caring,  as a complex culturally derived phenomenon, encompasses recognition of individuals’ uniqueness and includes moral, emotional, and cognitive dimensions [ 1 ]. Within the field of nursing, the professional act of caring is defined as an interpersonal process characterized by nurses’ expertise, competencies, personal maturity, and interpersonal sensitivity. The ultimate aim is to meet patients’ bio-psycho-social needs, ensuring their protection, emotional support, and overall satisfaction [ 2 ]. Furthermore, caring has been understood as the pivotal element that patients expect and should encounter to feel satisfied with nursing services [ 3 ]. Therefore, the concept of caring is dynamic, requiring adaptation to diverse sociocultural contexts.

Drawing on humanistic, transformative, integrative, and complex ontological and epistemological perspectives, various nursing theories have been developed that focus on promoting human-centred care [ 4 , 5 ]. One such perspective is the theory of human-to-human relationships proposed by Travelbee [ 6 ], which emphasizes the unique and irreplaceable nature of anyone who has lived or will live in this world. In this perspective, therapeutic human relationships evolve through a series of interactive steps, including the emergence of identities and the development of empathy (and later sympathy) until finally establishing rapport with persons receiving care [ 7 ].

Similarly, Watson [ 8 , 9 ] has elaborated a care process consisting of the following ten steps (caritas process): 1) consciously practising kindness and honesty while providing care; 2) being authentically present in a facilitative manner; 3) cultivating spirituality by transcending the self; 4) developing and maintaining a relationship of trust; 5) supporting the expression of both positive and negative feelings; 6) using creativity to obtain information during the care process; 7) engaging in genuine teaching and learning that take a global view of phenomena, while considering the perspective of the other; 8) creating healing environments that enhance integrity, comfort, dignity, and peace; 9) consciously and intentionally assisting with basic needs while enhancing the mind, body, and spirit; 10) remaining open to the experience of life and death, including care of both the professional and the patient’s soul. In short, caring is the essence of nursing and is a fundamental element for establishing effective nurse-patient relationships and achieving high-quality health outcomes.

The quality of nursing care is directly related to patients’ general experience and satisfaction. Evidence shows that patient experience with nursing care is a crucial predictor of patient satisfaction [ 10 , 11 ]. Studies indicate that providing expert and integrated care contributes to patients’ sense of safety and feeling embraced [ 12 ]. Conversely, professional nursing practice based on the biomedical model has been associated with low patient satisfaction and limited professional fulfilment among nurses [ 13 ].

Nevertheless, measuring nursing care plays an essential role in assessing its effectiveness and quality. By measuring nursing care, healthcare organisations and policymakers can identify areas for improvement and make evidence-based decisions to enhance patient outcomes. While caring cannot be reduced to a mere collection of actions and behaviours, this step is crucial in systematising the components of care that impact patients’ experiences [ 14 ] and in determining the contribution of nursing to health systems [ 15 ]. Watson [ 9 ] argues that, without engaging in philosophical contradictions, the use of quantitative instruments to assess care is necessary to provide scientific evidence. Such evidence helps managers and researchers to evaluate the complex and unique role of nursing and its effects on health.

The presence of an adequate number of well-trained nurses is known to reduce the risk of patient mortality, with outcomes similar to those achieved by physicians [ 16 ]. Nevertheless, nursing care extends beyond numerical values and clinical outcomes. It is well-established that discrepancies exist between the perceptions of nurses and patients regarding what constitutes care, primarily due to the uniqueness of each individual; hence the application of individualized care is promoted and takes into account the sociocultural context [ 17 ]. Moreover, humanised care is associated with high levels of patient and family satisfaction in various contexts [ 18 ].

One of the oldest and most widely used tools for assessing nursing care is the Caring Behaviours Assessment (CBA) tool, developed by Cronin and Harrison [ 19 ]. The authors were concerned about the exclusion of patients’ perspective in care settings and sought to identify which behaviours communicated care and how their effectiveness could be evaluated. Consequently, they created and validated the CBA, which comprises 63 items, grouped into seven subscales based on Watson’s ten carative factors. The instrument has been translated and validated in several languages, including Chilean Spanish [ 15 ]. However, the Spanish spoken in Spain exhibits distinct differences to the Chilean variety in word usage, meaning and cultural nuances, influenced by other languages spoken in the country such as Catalan or Galician. Consequently, despite extensive debate in recent years, there are currently no reliable assessment instruments available in the Spanish context that adequately consider cultural nuances in patients’ experiences. Therefore, using the CBA in an apparently similar but different language variety could lead to misinterpretation [ 20 ].

The aim of this study is to report the process of cultural translation, adaptation, and validation of the CBA in Spain, which to the best of our knowledge is the only culturally grounded version available. This new version of the CBA will provide a reliable means to obtain objective, tangible, and culturally adapted data on patients’ perceptions of the elements they deem to be essential in their care.

Approval was obtained from the relevant Ethics Committee in 2020 (ethics committee name hidden for blinding purposes). A study organised in three phases was then undertaken in 2021–2022. The phases were as follows: 1. Transculturation; 2. Training; 3. Administration.

A previous publication reported the process of creating a version of the CBA in Latin-American Spanish, namely in Chile. The authors of that publication suggested several steps for obtaining a transculturally adapted version, which we used here. These steps were as follows:

Translating the CBA from English to Spanish : one translation (draft 1) was done by a non-nursing translator, and another one (draft 2) by two bilingual nurses, who were familiar with Watson’s theory. The two drafts were then contrasted, leading to an agreed translation (draft 3).

Back-translation from Spanish into English : A bilingual nurse who was familiar with the subject but unfamiliar with the CBA, back-translated draft 3 into English (draft 4).

Refining the Spanish draft prior to the pilot test : the authors reworked a refined version (draft 5) by contrasting the back-translation with the original CBA in English.

Pilot-testing the translated version : Once satisfactorily refined, the translated version was tested with 36 volunteers. This step included interviewing them to identify their understanding of each item.

Linguistic and cultural adjustment: draft 5 was further adjusted by analyzing the volunteers’ responses and using three linguistic criteria: semantic disambiguation, morpho-syntax, and language. This step aimed to ensure one of the key traits of the CBA: plain language. As in the Latin-American version by Ayala and Calvo [ 15 ], conjugation was adjusted (i.e., use of the subjunctive tense instead of the present tense), so that the items reflected hypothetical situations. Otherwise, it would be all too easy for patients to misconstrue that they were being asked to assess the actual care provided by specific nursing staff. Equally, the order of the Likert-type scale was maintained from 1 to 5, left to right. Lastly, grammatical structures and words that sounded natural in spoken Spanish were double-checked with a linguistic consultant. This process led to the preliminary version of the CBA in Spanish.

A team of research assistants was trained in the application of the instrument to ensure a standardised administration process. The training included, for example, that informed consent had to be obtained from all participants before they were given a copy of the questionnaire, that the instructions had to be read aloud to the participants clearly and calmly, that the instrument had to be completed privately, and that the assistants had to remain nearby and attend to participants’ queries. This phase was crucial to minimize the risk of inducing an observer effect on responses.

We administered the transculturally-adapted version of the CBA to a non-probability sample ( N  = 402). To test its psychometric properties [ 21 ], the preliminary version was applied to a sample of adults (between 5 and 10 per item; with a mean age of 39.5 years [SD = 16.5]), who had been hospitalised within the previous 6 months (mean = 2.75 times). This phase aimed to assess the CBA with users of similar characteristics and under similar conditions to those of the final intended users: the CBA is specifically designed to be used in hospital settings.

The procedure yielded 402 observations, providing a substantial amount of data for the analysis of item/scale and subscale/scale consistency, as well as the overall reliability of the CBA in measuring a single construct. Of the 402 observations, 120 were excluded from the analysis as they were from health practitioners. As a result, the final sample size for the analysis was N = 282.

Statistical analysis

Our objective was to analyse the single items and item‐scale consistency, as well as explain potential differences in perceptions based on demographic data. In addition to assessing the validity of the scale, we also aimed to determine the relevance of diverse caring behaviours within the particular cultural setting of the study. To achieve this, we used correlation analyses to examine the associations between caring behaviors and relevant cultural factors.

Analyses began by examining mean and SD (± 1SD) values per item to identify the highest- and lowest-ranking behaviours. In addition, the Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy and Bartlett's test of sphericity were used to determine whether the dataset was suitable for factor analysis. Exploratory Factor Analysis (EFA) was then used to identify a common structure in the data. The final number of factors was determined using parallel analysis. The factorial method employed was minimum residual extraction with Varimax rotation.

Finally, Cronbach's alpha and McDonald's omega were used to estimate internal consistency and reliability, respectively. All statistical analyses were performed using R statistical software (v4.3.3) [ 22 ] and the package psych (v2.4.1) [ 23 ].
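To make this pipeline concrete, the following is a minimal sketch of the analyses described above using the psych package named in the text. It is illustrative only, not the authors' script; the data frame name cba_items (one column per CBA item, one row per respondent) is an assumption.

```r
# Illustrative sketch only; `cba_items` is an assumed data frame of item responses.
library(psych)

# Factorability checks
KMO(cba_items)                                         # Kaiser-Meyer-Olkin sampling adequacy
cortest.bartlett(cor(cba_items), n = nrow(cba_items))  # Bartlett's test of sphericity

# Parallel analysis to suggest how many factors to retain
fa.parallel(cba_items, fm = "minres", fa = "fa")

# Exploratory factor analysis: minimum residual extraction with Varimax rotation
efa_fit <- fa(cba_items, nfactors = 5, fm = "minres", rotate = "varimax")
print(efa_fit$loadings, cutoff = 0.5)                  # loadings above 0.5 (cf. Table 5)

# Internal consistency and reliability of the overall scale
alpha(cba_items)                                       # Cronbach's alpha
omega(cba_items)                                       # McDonald's omega
```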

As previously mentioned, 120 of the 402 participants were health professionals. Our initial intention was to retain them in the sample, but their responses made many items appear markedly redundant, likely owing to their familiarity with philosophies of care or a self-validating effect. These participants were therefore excluded from the sample. The paragraphs below report the results of the validation tests.

Scores by items

As descriptive statistics, we calculated mean scores ± 1SD for each of the 63 items of the CBA. The five highest-ranking and five lowest-ranking behaviours are listed below (Tables 1 and 2). The means ranged from a maximum of 4.87 (± 0.44) for item 3 "Know what they're doing" to a minimum of 2.88 (± 1.06) for item 25 "Visit me if I move to another hospital unit."
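As an illustration of how this item-level ranking could be reproduced (a sketch under the same assumption that cba_items holds the item responses):

```r
# Illustrative sketch: per-item means and SDs, ranked to identify the highest- and
# lowest-scoring caring behaviours (cf. Tables 1 and 2).
item_stats <- data.frame(
  mean = sapply(cba_items, mean, na.rm = TRUE),
  sd   = sapply(cba_items, sd, na.rm = TRUE)
)
head(item_stats[order(-item_stats$mean), ], 5)  # five highest-ranking items
head(item_stats[order(item_stats$mean), ], 5)   # five lowest-ranking items
```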

Cronbach's alpha and McDonald's omega scores by subscales

To calculate the mean ± 1SD per subscale, the items were grouped into their respective subscales. Table 3 shows the scores by subscale alongside their reliability coefficients (ω). As expected, the subscale "Existential/phenomenological/spiritual forces" was the lowest-ranking subscale (3.76 ± 0.34), while "Human needs assistance" was the highest-ranking subscale (4.49 ± 0.23). Both Cronbach's alpha and McDonald's omega were 0.8 or higher for all subscales. Importantly, Cronbach's alpha for the overall scale was 0.96, indicating high internal consistency, while McDonald's omega also indicated high reliability (0.97).
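A per-subscale reliability check along the lines of Table 3 might look as follows. This is a sketch, and subscale_items (a named list mapping each of the seven subscales to its item columns) is an assumed structure, not part of the published materials; McDonald's omega could be computed analogously with psych::omega().

```r
# Illustrative sketch: Cronbach's alpha per subscale (cf. Table 3).
subscale_alpha <- sapply(subscale_items, function(cols)
  psych::alpha(cba_items[, cols])$total$raw_alpha)
round(subscale_alpha, 2)
```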

Consideration of scale purification

After running the statistical tests, we were dissatisfied with some of the results and deliberated on the need for scale purification [ 24 ]. We found that the items correlating less strongly with the overall scale, typically those carrying some existential meaning, were not automatically associated with nursing care by respondents, and some respondents even considered them not pertinent to nurses' work.

Additionally, numerous participants informed us that some items were confusing or sounded redundant. This result had already been detected during the linguistic phase of the study (phase 1), when participants often pointed out that some questions were being asked twice, although differently, which they found somewhat tiresome or repetitive (see Table  4 ).

The decision on whether to perform scale purification for the sake of simplicity required some debate among the researchers, as our aim was to achieve very high correlations for all of the items, even though this is not, in itself, the aim of validating an instrument. More problematic still were the items that had relatively lower correlations but were meaningful from a theoretical perspective [ 25 ].

We thus aimed to combine personal judgement and statistical criteria, as keeping those items could allow changes in perception to be assessed across time. Furthermore, when removing the items in question, the overall Cronbach’s alpha increased only minimally (from 0.960 to 0.963). Therefore, we decided to keep all 63 items, as in the original CBA [ 19 ], resulting in the validated version of the CBA questionnaire in Spanish. The final version and the item-by-item translation are provided in the Supplementary material .
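The item-retention deliberation described above can be informed by the "alpha if item dropped" statistics that psych::alpha() reports; a brief sketch, again assuming cba_items:

```r
# Illustrative sketch: how the overall alpha would change if individual items were removed.
rel <- psych::alpha(cba_items)
rel$total$raw_alpha                 # overall alpha with all 63 items retained
head(rel$alpha.drop["raw_alpha"])   # alpha if each of the listed items were dropped
```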

Exploratory factor analysis

Interestingly, EFA showed that subscales 1, 2 and 5 (Humanism/Faith-hope/Sensitivity, Helping/trust, Supportive/protective/corrective environment), which are conceptually linked, were also strongly associated in the dataset. Subscales 4 and 6 (Teaching/learning, Human needs assistance) and subscales 3 and 7 (Expression of positive/negative feelings, Existential/phenomenological/spiritual forces) each formed separate groupings of their own, yielding four further groupings. This was also highlighted by the parallel analysis, which indicated five factors, a reassuring result in terms of how well structured the CBA tool is. Additionally, EFA enabled us to identify the highest loadings (L, see Table 5): item 17 "Really listen to me when I talk" (L = 0.71); item 36 "Ask me what I want to know about my health/illness" (L = 0.70); item 37 "Help me set realistic goals for my health" (L = 0.69); item 06 "Encourage me to believe in myself" (L = 0.69); item 07 "Point out positive things about me and my condition" (L = 0.67); and item 28 "Encourage me to talk about how I feel" (L = 0.67).

The KMO measure and Bartlett's test of sphericity showed that our dataset could be factorized. The overall KMO was 0.93, and Bartlett's test of sphericity (χ² = 11126.8, p < 0.05) also indicated that the dataset was suitable for EFA. The analysis was run with 5 factors, as indicated by the parallel analysis. Table 5 shows the item loadings above 0.5 for each factor, and the EFA results are shown in Table 6. The first three factors explained 30% of the observed variance, and adding factors 4 and 5 raised the total explained variance to 45% (see Table 6).
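The variance-explained figures can be recovered directly from the factor loadings; a sketch using the efa_fit object assumed in the earlier example:

```r
# Illustrative sketch: proportion of variance explained per factor, and cumulatively.
ss_loadings <- colSums(efa_fit$loadings^2)           # sum of squared loadings per factor
prop_var    <- ss_loadings / nrow(efa_fit$loadings)  # divided by the number of items
round(rbind(proportion = prop_var, cumulative = cumsum(prop_var)), 2)
```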

The variance explained by the EFA illustrates the complexity of the response patterns captured by the CBA tool.

How respondents answered the open‐ended question

Selected examples of the participants' responses are shown in Table 7. Additionally, in Phase 1 participants seemed surprised by the items relating to existential/phenomenological/spiritual dimensions and disagreed that these dimensions pertained to nursing care (e.g., "What have nurses become now? Psychologists?").

Discussion of cultural adaptation and validity of the CBA

The steps taken to ensure accurate cultural adaptation of the Spanish version of the CBA were essential to creating a version tailored to Spanish users, considering the specific features of a region influenced by several languages. Cronbach's alpha for the overall scale was high (0.96), and the values for all subscales were 0.8 or higher. The overall Cronbach's alpha is reassuring as it mirrors that of the Chilean Spanish CBA validated by Ayala and Calvo in 2017 [ 15 ], although in our study there was more dispersion across the subscales. Equally, McDonald's omega showed high reliability.

Research studies conducted in different regions have also validated CBA versions for patients in the USA [ 26 ], Saudi Arabia [ 27 ] and Jordan [ 28 ]. These studies consistently reported overall Cronbach’s alpha values above 0.8, adding cumulative evidence in support of the CBA as a valid instrument to measure nurses’ caring behaviours.

Moreover, a descriptive analysis was conducted to identify the caring behaviours receiving the highest and lowest ranking. As expected, some items showed weaker correlations with the overall scale, and some participants even considered them “irrelevant” or unrelated to nurses’ duties. When we compared our study to that performed by Ayala and Calvo [ 15 ] and the original by Cronin and Harrison [ 19 ], similarities were found in the results for most of the items. However, differences were found in the item “consider my spiritual needs”, which was rated lower by the Spanish sample. This discrepancy may be related to cultural and contextual factors influencing perceptions and expectations regarding caring behaviours.

Emergence of a 5-dimensional factorial solution for the CBA scale in the Spanish context

Our study presents evidence for a 5-dimensional factorial solution for the CBA scale in the Spanish healthcare context. The convergence of findings suggests that the identified dimensions capture meaningful variance in the dataset and reflect underlying patterns of caring behaviors within the Spanish healthcare context.

Our findings suggest a strong theoretical coherence among certain dimensions within the CBA scale, reflecting interconnected clusters of caring behaviors. For instance, subscales 1, 2, and 5 demonstrate conceptual linkage, forming a cohesive first dimension that encompasses 'Humanism/Faith-hope/Sensitivity, Helping/Trust, and Supportive/Protective/Corrective Environment'. Specifically, our analysis reveals an expanded understanding within the first dimension, encompassing not only the initial three carative factors as in the original version but also incorporating two additional factors. These include the formation of a humanistic-altruistic system of values, the instillation of faith-hope, the cultivation of sensitivity to oneself and others, the development of a helping-trust relationship, and the provision for a supportive, protective, and corrective environment. This expanded dimension highlights the interconnectedness of empathy, compassion, trust, and reliability within caregiving relationships, reinforcing the foundational principles outlined in Watson's Theory of Transpersonal Care [ 8 ] and also supported by established theories of patient-centered care [ 29 ]. Additionally, this dimension highlights the importance of providing a supportive, protective, and corrective mental, physical, sociocultural, and spiritual environment, aligning closely with Watson's emphasis on creating conducive environments for healing and growth. By recognizing this evolution in our analysis, we underscore the ongoing refinement and adaptation of theoretical frameworks to specific contexts to better capture the complexities of caregiving dynamics and promote holistic patient care.

While subscales 1, 2, and 5 form a single cohesive dimension, subscales 3, 4, 6 and 7, form separate groupings, resulting in a total of five dimensions, each representing specific facets of caring behaviors. The second dimension, ‘Teaching/Learning’, focuses on the educational aspects of caregiving and skills training. This dimension aligns with the principles of transpersonal care, emphasizing the importance of nurturing the growth and development of both caregivers and recipients through shared learning experiences. The third dimension, ‘Human Needs Assistance,’ emphasizes the importance of fulfilling the fundamental needs of people receiving care, reflecting the humanistic approach to caregiving that prioritizes the preservation of dignity and autonomy. The subscale ‘Expression of Positive/Negative Feelings’ captures the acknowledgement and validation of the emotional experiences of patients receiving care, resonating with the empathetic and compassionate aspects of transpersonal care. Lastly, the dimension ‘Existential/Phenomenological/Spiritual Forces’ addresses the existential, phenomenological, and spiritual aspects of caregiving. This dimension emphasizes the interconnectedness of mind, body, and spirit, echoing the holistic perspective of transpersonal care, which acknowledges the spiritual essence and interconnectedness of all beings. This comprehensive framework illuminates the multifaceted nature of caregiving, addressing diverse aspects essential for holistic patient care and well-being.

Relevant findings and preferences of Spanish individuals

The highest-ranking items among the Spanish participants mainly related to technical and cognitive components, such as competence in clinical procedures and the handling of equipment. Conversely, the lowest-ranking behaviours related to emotional and existential dimensions, such as talking about life outside the hospital, understanding patients' experiences, and considering spiritual needs. These results may indicate that, within the Spanish context, these components are perceived by patients as less important than technical competencies, thus highlighting their priorities in terms of their care, even though the respondents were not hospitalised at the time of the survey. These results suggest that clinical skills and technical competencies play an important role in patients' perceptions of the quality of nursing care in Spain [ 30 ]. This finding is supported by a prior study [ 31 ] comparing nursing practice in Spain with that in the UK.

The prioritization of technical competencies over emotional and existential dimensions in nursing care may be explained by how people prioritize their needs: individuals usually attend to basic needs first and gradually move on to more complex ones once those are met. The perception of care may follow a similar pattern. The primary focus may thus be on safety and on meeting the standard of performance required to guarantee this basic need, with less emphasis on the overall experience of wellbeing and being looked after. This approach also tends to be used in healthcare delivery, where the main focus is usually placed on survival-related outcomes [ 32 ]. However, as healthcare evolves toward value-based and person-focused approaches, there is growing awareness of the need to expand services and prioritize broader aspects of care. Expectations may thus come to be informed by factors such as recovery and quality of life, aligning with patients' priorities and their desire for comprehensive care and enhanced overall quality of life. By understanding this dynamic, healthcare professionals can better navigate the complexities of patient expectations and ensure the delivery of care in accordance with diverse needs and preferences.

However, to ensure comprehensive nursing care aligned with the expectations of individuals in Spain, it is essential to have a deep understanding of their individual needs and priorities. Validation studies conducted for specific populations may shed light on the elements of healthcare that are highly valued and contribute to humanisation. For example, research focusing on transgender populations has shown that being asked about their preferred form of address is highly valued [ 33 ] but does not seem to be a priority for the general population in our setting. Similarly, individuals in end-of-life processes place great importance on the ability of nurses and clinicians to show compassion and empathise with their feelings, while these qualities were not prioritised in the participants in our sample [ 34 ]. Equally, women going through challenging experiences, such as miscarriage, stressed that a key element of the care they required was being helped to cope with the future and understand their feelings [ 35 ].

In a similar vein, another study focused on how the general population perceived the quality of nursing services. The findings of that study revealed that various dimensions of quality, such as psychological, physical, and communication components, were rated at a moderate level, suggesting that there was room for improvement in meeting patients’ expectations [ 36 ]. This finding emphasises the importance of tailoring nursing care to specific populations to address the complexity of individual preferences, and highlights the need to focus on the multidimensional aspects of care to enhance the overall quality of nursing activity.

An awareness of contemporary nursing training and the scope of nurses' work in society could fruitfully contribute to shifting such expectations away from a focus on technical and knowledge-related issues. As stated by López-Verdugo et al. [ 37 ], society often relies on misinformation when referring to nursing work, which is also often based on widely disseminated myths and stereotypes. A stereotyped image of nursing work, and of nurses themselves, may well lie beneath the reaction of some of the Spanish participants in our study when asked about the importance of emotional and spiritual needs in nursing care. Participants may not always fully appreciate the importance of integrated care, just as contemporary nursing remains largely unknown in Spain [ 37 ]. Therefore, a change in perspective is needed to foster greater appreciation of the profession and to make experiences during periods of health and illness more rewarding, both for users and for healthcare providers.

Previous research has emphasised human care as a driving force in nursing practice, highlighting that quality care relies on a holistic view of care that extends beyond technical proficiency [ 38 ]. Several studies have underscored that human care, which encompasses emotional support, effective communication, and attention to patients’ psychosocial needs, is essential for promoting patient satisfaction and achieving favourable health care outcomes [ 39 ].

A drawback of the CBA is its length, which risks tiring respondents. This limitation has been acknowledged in previous literature [ 15 ]. In addition, during the cultural adaptation phase of the present study, participants reported that some items were somewhat repetitive. To address this concern, future research could focus on validating abbreviated versions of this and other instruments. This approach would allow more streamlined integration of theoretical perspectives into routine assessments in clinical practice. Similarly, exploring the perspectives of specific population groups could provide a more nuanced understanding of their unique expectations regarding healthcare.

As patient-centered care gains recognition as a fundamental aspect of quality healthcare, understanding and measuring caring behaviors become necessary for healthcare organizations and professionals, highlighting the importance of tools like the CBA scale.

The interplay between theory and practice has gained prominence in nursing care over the past two decades. This dynamic encompasses various dimensions, ranging from abstract concepts like human sensitivity and emotional engagement to more tangible factors such as clinical skills. In this context, the use of tools to assess and translate nursing care into workable data has gained importance in healthcare policy and management. Indeed, such objective data can be useful for decision-makers in higher-level management, as nurses' work is key to user satisfaction and to transforming the biomedical paradigm in health care. Adapting and validating instruments can thus contribute to these processes.

Similarly, implementing 'tooling up' strategies can be a useful way of rendering nurses' often invisible work visible; in the process, this could incentivise a humane approach that is perceived to have been lost in the evolution of healthcare in the industrialised world.

To support this endeavour, this article provides a validated version of the CBA for users in Spain. This version remains true to the original CBA but incorporates certain modifications for respondents' ease of use. Through a process of translation, cultural adaptation and statistical analysis, this new version has been demonstrated to be a valid and culturally appropriate instrument, which provides reliable, objective, comparable and culturally sensitive data on patients' perceptions of the most essential elements of care during hospitalization.

All authors declare that they have no conflicts of interest. The individuals who participated in this study were research participants and were not involved in the design, conduct, or preparation of the manuscript.

Relevance for clinical practice

The study addressed the lack of a culturally translated, adapted and validated version of the Caring Behaviors Assessment (CBA) tool in the Spanish context. This was a significant issue, as it hindered the collection of objective and culturally sensitive data on essential aspects of care.

The research will have an impact on several groups. First, it will benefit healthcare professionals and providers, policymakers and managers by providing them with a reliable instrument to evaluate and improve patient care. This instrument could enhance their understanding of patient needs and preferences, enabling them to identify areas for improvement and promote person-centered care.

Second, the research could directly benefit the Spanish-speaking population. Through the CBA tool, individuals will be able to ask for care that aligns more closely with their personal values and preferences, thus promoting a shift towards person-centered care.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

CBA: Caring Behaviors Assessment

EFA: Exploratory Factor Analysis

SD: Standard Deviation

Widar M, Ek AC, Ahlström G. Caring and uncaring experiences as narrated by persons with long-term pain after a stroke. Scand J Caring Sci. 2007;21(1):41–7.

Drahošová L, Jarošová D. Concept caring in nursing. Cent Eur J Nurs Midwifery. 2016;7(2):453–60.

Haryani AL. Predictors of nurse’s caring behavior towards patients with Critical Illness. KnE Life Sci. 2019;4(13):12–22.

Alligood MR. Nursing theory: utilization & application. 5th ed. Mosby; 2013. p. 488.

Turkel MC, Watson J, Giovannoni J. Caring science or science of caring. Nurs Sci Q. 2018;31(1):66–71.

Travelbee J. Interpersonal aspects of nursing. 2nd ed. Philadelphia: F. A. Davis Company; 1971.

Travelbee J. What’s wrong with sympathy? Am J Nurs. 1964;64:68–71.

Watson J. Watson’s theory of human caring and subjective living experiences: carative factors/caritas processes as a disciplinary guide to the professional nursing practice. Texto Contexto - Enfermagem. 2007;16(1):129–35.

Watson J. Caring science and human caring theory: transforming personal and professional practices of nursing and health care. J Health Hum Serv Adm. 2009;31(4):466–82.

Al-Awamreh K, Suliman M. Patients’ satisfaction with the quality of nursing care in Thalassemia units. Appl Nurs Res. 2019;47:46–51.

Chen X, Zhang Y, Qin W, Yu Z, Yu J, Lin Y, et al. How does overall hospital satisfaction relate to patient experience with nursing care? a cross-sectional study in China. BMJ Open. 2022;12(1):e053899. Available from: https://pubmed.ncbi.nlm.nih.gov/35039296/ . [cited 2024 Mar 19].

van Dusseldorp L, Groot M, Adriaansen M, van Vught A, Vissers K, Peters J. What does the nurse practitioner mean to you? A patient-oriented qualitative study in oncological/palliative care. J Clin Nurs. 2019;28(3–4):589–602.

Jara PC, Behn V, Ortiz N, Valenzuela S. Nursing in Chile. In: Breda KL, editor. Nursing and globalization in the Americas: a critical perspective. New York: Baywood; 2009. p. 55–98.

Spichiger E, Wallhagen MI, Benner P. Nursing as a caring practice from a phenomenological perspective. Scand J Caring Sci. 2005;19(4):303–9.

Ayala RA, Calvo MJ. Cultural adaptation and validation of the caring behaviors assessment tool in Chile. Nurs Health Sci. 2017;19(4):459–66.

Coster S, Watkins M, Norman IJ. What is the impact of professional nursing on patients’ outcomes globally? An overview of research evidence. Int J Nurs Stud. 2018;78:76–83.

Patistea E, Siamanta H. A literature review of patients’ compared with nurses’ perceptions of caring: implications for practice and research. J Prof Nurs. 1999;15(5):302–12.

De La Nube P, Pulla P, Mesa-Cano IC, Alexis Ramírez-Coronel A. Patient family perceptions of nursing staff’s humanized care: systematic review. Int J Innov Sci Res Technol. 2021;6(4):545–51.

Cronin SN, Harrison B. Importance of nurse caring behaviors as perceived by patients after myocardial infarction. Heart Lung. 1988;17(4):374–80.

Palmieri PA, Leyva-Moral JM, Camacho-Rodriguez DE, Granel-Gimenez N, Ford EW, Mathieson KM, et al. Hospital survey on patient safety culture (HSOPSC): a multi-method approach for target-language instrument translation, adaptation, and validation to improve the equivalence of meaning for cross-cultural research. BMC Nurs. 2020;19(1):23.

Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 4th ed. Oxford: University Press; 2008. p. 1–452.

R Core Team. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2023.

Revelle W. Psych: procedures for psychological, psychometric, and personality research. Evanston: Northwestern University; 2024.

Wieland A, Kock F, Josiassen A. Scale purification: state-of-the-art review and guidelines. Int J Contemp Hospitality Manage. 2018;30(11):3346–62.

Redondo YP, Cambra Fierro JJ. Educational level as moderating element of long-term orientation of supply relationships. J Mark Manage. 2008;24(3–4):383–408.

Dorsey C, Phillips KD, Williams C. Adult sickle cell patients’ perceptions of nurses’ caring behaviors. ABNF J. 2001;12(5):95–100.

Suliman WA, Welmann E, Omer T, Thomas L. Applying watson’s nursing theory to assess patient perceptions of being cared for in a multicultural environment. J Nurs Res. 2009;17(4):293–300.

Omari FH, Abualrub R, Ayasreh IR. Perceptions of patients and nurses towards nurse caring behaviors in coronary care units in Jordan. J Clin Nurs. 2013;22(21–22):3183–91.

Morgan S, Yoder LH. A concept analysis of person-centered care. J Holist Nurs. 2012;30(1):6–15. Available from: https://pubmed.ncbi.nlm.nih.gov/21772048/ . Cited 2024 Mar 19.

Granel N, Manresa-Domínguez JM, Watson CE, Gómez-Ibáñez R, Bernabeu-Tamayo MD. Nurses’ perceptions of patient safety culture: a mixed-methods study. BMC Health Serv Res. 2020;20(1):584.

Granel N, Bernabeu-Tamayo MD. Mapping nursing practices in rehabilitation units in Spain and the United Kingdom: a multiple case study. Nurs Health Sci. 2020;22(3):521–8.

Porter ME. What is value in health care? N Engl J Med. 2010;363(26):2477–81.

Santander-Morillas K, Leyva-Moral JM, Villar-Salgueiro M, Aguayo-González M, Téllez-Velasco D, Granel-Giménez N, et al. TRANSALUD: a qualitative study of the healthcare experiences of transgender people in Barcelona (Spain). PLoS One. 2022;17(8):e0271484.

Sinclair S, Beamer K, Hack TF, McClement S, Raffin Bouchal S, Chochinov HM, et al. Sympathy, empathy, and compassion: a grounded theory study of palliative care patients’ understandings, experiences, and preferences. Palliat Med. 2017;31(5):437–47.

Radford EJ, Hughes M. Women’s experiences of early miscarriage: implications for nursing care. J Clin Nurs. 2015;24(11–12):1457–65.

Yusefi AR, Sarvestani SR, Kavosi Z, Bahmaei J, Mehrizi MM, Mehralian G. Patients’ perceptions of the quality of nursing services. BMC Nurs. 2022;21(1):131.

López-Verdugo M, Ponce-Blandón JA, López-Narbona FJ, Romero-Castillo R, Guerra-Martín MD. Social image of nursing. An integrative review about a yet unknown profession. Nurs Rep. 2021;11(2):460–74.

Albinsson G, Arnesson K. The emotion work of nurses in a person-centred care model. Int J Work Organisation Emot. 2019;10(1):28–49.

Bao L, Shi C, Lai J, Zhan Y. Impact of humanized nursing care on negative emotions and quality of life of patients with mental disorders. Am J Transl Res. 2021;13(11):13123–8.

Acknowledgements

We would like to express our gratitude to all the participants who took part in the study. We also wish to thank Dr Pedro Hervé (U. Magallanes, Chile) for providing statistical support. Lastly, we would like to thank Dr Sherill N. Cronin (Bellarmine University, USA) for giving us permission to use and translate the CBA tool into Spanish.

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Authors and affiliations.

Nursing Department, Faculty of Medicine, Universitat Autònoma de Barcelona, Av. Can Domènech S/N, Cerdanyola del Vallès, 08193, Spain

Juan M. Leyva-Moral, Carolina Watson, Nina Granel & Cecilia Raij-Johansen

Universidad de Las Américas, Santiago de Chile, Chile

Ricardo A. Ayala

Ghent University, Ghent, Belgium

Contributions

JM.LM. and RA.A. made substantial contributions to the conception and design of the project, including the development of survey instruments and strategic planning for project dissemination. C.W. and N.G. played a key role in data acquisition, overseeing survey implementation and managing outreach efforts. JM.LM. and RA.A. analyzed and interpreted the data. C.W., C.RJ., and N.G. were involved in drafting and revising the manuscript. JM.LM. and RA.A. critically reviewed it for significant intellectual content. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Carolina Watson .

Ethics declarations

Ethics approval and consent to participate.

All ethical principles of biomedical research advocated in the Declaration of Helsinki were respected. This study was reviewed and approved by the UAB Research Ethics Committee in accordance with ethical standards and guidelines (approval reference number CEEAH 5194). Participants were provided with a thorough explanation of the study procedures before accessing the questionnaire, ensuring their voluntary participation, with a commitment to maintaining the anonymity of the collected data. Informed consent was obtained from each participant before the completion of the questionnaires.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary material 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Leyva-Moral, J.M., Watson, C., Granel, N. et al. Cultural adaptation and validation of the caring behaviors assessment tool into Spanish. BMC Nurs 23 , 240 (2024). https://doi.org/10.1186/s12912-024-01892-2

Download citation

Received : 17 December 2023

Accepted : 22 March 2024

Published : 10 April 2024

DOI : https://doi.org/10.1186/s12912-024-01892-2

Keywords

  • Caring behaviors
  • Transculturation
  • Humanization
  • Nursing care

Grantham Research Institute on Climate Change and the Environment

Optimal climate policy under exogenous and endogenous technical change: making sense of the different approaches

How does technical change affect economically optimal emission trajectories? Many low-carbon technologies, such as photovoltaic (PV) cells, wind energy and batteries, have become much cheaper in recent decades. Technical change can be an argument to postpone emission abatement, to wait for technology to become cheaper. Conversely, it can be an argument for earlier abatement, when abatement itself is the driver of future cost reductions. Whether technical change means prioritising or postponing emission abatement also depends on the economic objective. This can be either cost-benefit analysis, where the goal is to find a welfare-maximising balance between abatement costs and avoided climate damages (benefits); or it can be cost-effectiveness analysis, where the objective is to minimise abatement costs to stay below a given temperature.

In this paper, the authors assess, both qualitatively and quantitatively, the effect of technical change on optimal climate policy in integrated assessment models (IAMs), which provide key inputs to decision-makers for economically efficient climate policies. They also develop a transparent model to represent the key features of technical change and reproduce how costs differ between scenarios with early vs. later abatement.
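To illustrate the distinction at the heart of the paper, the contrast between cost reductions that arrive with the passage of time and those driven by cumulative deployment can be sketched in a few lines of R. This toy example is not the authors' model; the functional forms (exponential decline vs. a Wright's-law learning curve), the parameter values and the object names are assumptions chosen purely for illustration.

```r
# Toy illustration of exogenous vs. endogenous (learning-by-doing) abatement cost dynamics.
years     <- 0:30
abatement <- rep(1, length(years))   # hypothetical constant abatement path (e.g., GtCO2/yr)
c0        <- 100                     # assumed initial unit abatement cost

# Exogenous technical change: cost falls with time, regardless of how much is abated
g        <- 0.03                     # assumed annual rate of cost decline
cost_exo <- c0 * (1 - g)^years

# Endogenous technical change: cost falls with cumulative abatement (learning curve)
lr         <- 0.15                   # assumed 15% cost reduction per doubling of experience
b          <- -log2(1 - lr)
experience <- cumsum(abatement)
cost_endo  <- c0 * (experience / experience[1])^(-b)

plot(years, cost_exo, type = "l", xlab = "Year", ylab = "Unit abatement cost")
lines(years, cost_endo, lty = 2)
legend("topright", c("Exogenous", "Endogenous (learning-by-doing)"), lty = 1:2)
```

Under the endogenous curve, every unit abated today lowers the cost of future abatement, which is the 'endogenous future gain effect' discussed in the key messages below; under the exogenous curve, waiting carries no such learning penalty.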

Key messages for decision-makers

  • Technical change is one of the key assumptions in any IAM that estimates mitigation costs. By conducting a systematic survey of how technical change is currently represented in the main IAMs, the authors find that a diversity of approaches continues to exist. This makes it important to conduct an up-to-date assessment of what difference technical change can make to IAM results.
  • Deployment of abatement technologies brings down their cost; this is referred to as endogenous technical change because the process does not happen without climate policy (the declining cost of PV cells is an example). It can occur through learning-by-doing, economies of scale, R&D that requires feedback from deployment, and so on.
  • When cost reductions are unrelated to the deployment of a technology, the process is called exogenous: the technology improves simply with the passage of time. An example is the development of lithium-ion batteries for smartphones, which later helped the development of electric cars.
  • Under cost-benefit analysis, technical change reduces optimal long-term emissions and temperature substantially.
  • Under cost-effectiveness analysis, technical change has a small effect on transient emissions and temperatures, but it substantially lowers carbon prices, almost irrespective of the policy instruments available.
  • Fast exogenous technological change creates an incentive to abate later, with less initial abatement, as the reduced abatement costs in the future are anticipated.
  • By contrast, fast endogenous technological change has almost no effect on initial abatement because cheap future abatement depends crucially on early abatement – and the policymaker anticipates this. Each tonne of abated emissions will make future abatement even cheaper, referred to as the ‘endogenous future gain effect’.
  • This endogenous future gain effect is excluded in 18 of the 22 models studied in our survey. Adding the endogenous future gain incentive in these models would lead to earlier optimal abatement.
  • Early-stage R&D into green technologies, for which deployment is not yet required, theoretically has the same dynamic properties as exogenous technical change, even though it is developed in anticipation of future abatement.

COMMENTS

  1. (PDF) ASSESSMENT AND EVALUATION IN EDUCATION

    Aims: This study explored learning and assessment of Civic Education from the pupils' standpoint. Study Design: Employed a qualitative research paradigm, in particular a descriptive research design.

  2. Full article: A practical approach to assessment for learning and

    Assessment for learning (AfL) and differentiated instruction (DI) both imply a focus on learning processes and affect student learning positively. However, both AfL and DI prove to be difficult to implement for teachers. Two chemistry and two physics teachers were studied when designing and implementing the formative assessment of conceptual ...

  3. Formative assessment: A systematic review of critical teacher

    Research design specifics, such as research questions, instruments, and analysis methods; • Research sample, such as number of schools, teachers, and students; • Type(s) of formative assessment approach: DBDM and/or AfL; • Results, such as the evidence with regard to the role of the teacher in formative assessment (i.e., the prerequisites

  4. Assessing the Assessment: Evidence of Reliability and Validity in the

    The assessment task requires candidates to collect assessment records (e.g., feedback provided, evaluation criteria, summary of students' performance, video-recorded clips) and assessment samples for three focus students, including one student with a significant learning need, that illustrate identified patterns from a whole-class analysis of ...

  5. PDF Theoretical Framework for Educational Assessment: A Synoptic Review

    Research on authentic assessment has explored various aspects including design, scoring, effects on teaching and learning, professional development, validity, reliability, and costs. Those are relative to authentic assessment (used interchangeably with performance assessment) in the classroom and will be reviewed.

  6. The past, present and future of educational assessment: A

    To see the horizon of educational assessment, a history of how assessment has been used and analysed from the earliest records, through the 20th century, and into contemporary times is deployed. Since paper-and-pencil assessments validity and integrity of candidate achievement has mattered. Assessments have relied on expert judgment. With the massification of education, formal group ...

  7. PDF Assessment in Higher Education and Student Learning

    academics appear to rely on traditional pen and paper examinations to determine student knowledge (Carless et al., 2010; Duncan & Buskirk-Cohen, 2011; Gilles et al., ... This research brings awareness to assessment practices in higher education. Only with awareness, will instructors learn the value of assessment, its effect on learning, and be

  8. The power of assessment feedback in teaching and learning: a ...

    The paper contributes to the extant literature on assessment feedback by highlighting the integral role it plays in improving teaching and learning in the education field. The article is intended for educators (school administrators/leaders and teachers) and students whose goal is to facilitate teaching and learning for school effectiveness.

  9. The Impact of Peer Assessment on Academic Performance: A ...

    Peer assessment has been the subject of considerable research interest over the last three decades, with numerous educational researchers advocating for the integration of peer assessment into schools and instructional practice. Research synthesis in this area has, however, largely relied on narrative reviews to evaluate the efficacy of peer assessment. Here, we present a meta-analysis (54 ...

  10. The quality of assessment tasks as a determinant of learning

    His research interest is focused on research methods and assessment and evaluation in higher education. He is currently co-researcher of the FLOASS Project-Learning outcomes and learning analytics in higher education: A framework for action from sustainable assessment (RTI2018-093630-B-I00). Author of articles, book chapters and contributions ...

  11. PDF Assessment criteria for the research paper

    LAW5082: MASTERS RESEARCH UNIT Assessment criteria for the research paper Aspects of the research paper that will be relevant to the determination of a final grade are as follows: Problem Definition and Methodology Statement of the research problem, the aims of the paper and the significance of the research. Explanation of scope of the study.

  12. PDF Issues and Concerns in Classroom Assessment Practices

    (Research Centre, In Education, University of Calicut), Kerala, India [email protected] Mob: 9447847053 Abstract Assessment is an integral part of any teaching learning process. Assessment has large number of functions to perform, whether it is formative or summative assessment. This paper analyse the

  13. A Critical Review of Research on Student Self-Assessment

    This article is a review of research on student self-assessment conducted largely between 2013 and 2018. The purpose of the review is to provide an updated overview of theory and research. The treatment of theory involves articulating a refined definition and operationalization of self-assessment. The review of 76 empirical studies offers a critical perspective on what has been investigated ...

  14. Full article: Effects of Classroom Assessment Practices on Students

    Previous research findings (e.g., Wang, 2004) suggest that gender may also need to be considered when investigating the impact of classroom assessment on students' achievement goals. Specifically, in Wang's study, performance-approach goals were found to be positively related to both perceptions of the classroom assessment environment as being learning-oriented and test ...

  15. Artificial Intelligence in Technology-Enhanced Assessment: A Survey of

    This paper explored how machine learning methods facilitate PE, intelligent assessment, and data-intensive research in education. By using machine learning, especially deep learning (DL) models, in adaptive intelligent tutoring systems (ITS), learning paths can be adjusted dynamically and personalized to match the learner's progress and pace.

  16. Charting the Future of Assessment

    Daniel managed the development and operationalization of the PSQ assessment as the RPM for the PSQ Research Team. Additionally, Daniel provides internal communication support and planning to the VP of ETS Research Institute and the Office of the CEO. Prior to joining The Research Institute, he worked for 4 years in the Policy Evaluation and ...

  17. Reviewing Assessment Tools for Measuring Country Statistical Capacity

    Country statistical capacity is increasingly recognized as crucial for development, but no academic study exists that reviews the available assessment tools. This paper offers the first review study that fills this gap, paying particular attention to data and practical measurement challenges.

  18. Cultural adaptation and validation of the caring behaviors assessment

    The aim of the research was to translate, culturally adapt, and validate the Caring Behaviors Assessment (CBA) tool in Spain, ensuring its appropriateness in the Spanish cultural context. A three-phase cross-cultural adaptation and validation study was conducted. Phase 1 involved the transculturation process, which included translation of the CBA tool from English to Spanish, back-translation, and refinement ...

  19. Full article: Self-assessment is about more than self: the enabling

    The purpose of this conceptual article is twofold. First, we articulate the interplay between feedback literacy and self-assessment based on a reframing and integration of the two concepts. Second, we unfold the self-assessment process into three steps: (1) determining and applying assessment criteria, (2) self-reflection, and (3) self ...

  20. PDF The Effects of Formative Assessment on Academic Achievement ...

    This paper was derived from the doctoral dissertation of Ceyhun Ozan, conducted under the supervision of Prof. Dr. Remzi ... According to the research results, formative assessment was the third most influential factor among 138 factors affecting students' achievement.

  21. Assessing Cognitive Functions Remotely Using a Music-Game ...

    There is a growing need to develop ways of assessing cognitive functions remotely and providing interventions using a web-based approach. The Ipsilon Test is a music-based cognitive assessment and training tablet application. On each trial, it presents simplified musical notation with colours, spatial cues, and gestural references so that users can translate spatial information into motor ...

  22. Optimal climate policy under exogenous and endogenous technical change

    In this paper, the authors assess, both qualitatively and quantitatively, the effect of technical change on optimal climate policy in integrated assessment models (IAMs), which provide key inputs to decision-makers for economically efficient climate policies.

  23. Conservation

    Statewide Assessment of Public ...

  24. The Impact of Assessment for Learning on Students' Achievement in ...

    Assessment for learning, or constructive assessment, is defined as a process used by teachers and learners during instruction that provides feedback to adjust ongoing teaching and learning to improve students' achievement of intended instructional goals (Sadler, 1989). For Popham (2008), assessment for learning is a planned process in ...

  25. Research on visual quality assessment and landscape elements influence

    Rural landscapes have significant ecological, historical, and cultural value, including numerous green spaces and forest areas that should be protected and utilized. With the growing demand for green tourism in rural areas in recent years, rural greenways have become increasingly crucial for promoting urban-rural development by connecting linear spatial corridors such as landscape patches ...