AI Essay Grader

CoGrader is an AI Essay Grader that helps teachers save 80% of the time spent grading essays, with instant first-pass feedback and grades based on your rubrics.


Pick your own Rubrics to Grade Essays with AI

We have 30+ rubrics in our library, but you can also build your own.

Argumentative Essays

Rubrics from 6th to 12th Grade and Higher Education. Grades Claim/Focus, Support/Evidence, Organization and Language/Style.

Informative Essays

Rubrics from 6th to 12th Grade and Higher Education. Grades Clarity/Focus, Development, Organization and Language/Style.

Narrative Essays

Rubrics from 6th to 12th Grade and Higher Education. Grades Plot/Ideas, Development/Elaboration, Organization and Language/Style.

Analytical Essays

Rubrics from 6th to 12th Grade and Higher Education. Grades Claim/Focus, Analysis/Evidence, Organization and Language/Style.

AP Essays, DBQs & LEQs

Grade Essays from AP Classes, including DBQs & LEQs. Grades according to the AP rubrics.

30+ Rubrics Available

You can also build your own rubric/criteria.

Your AI Essay Grading Tool

Grading is a never-ending task that consumes valuable time and energy, often leaving teachers frustrated and overwhelmed.

With CoGrader, grading becomes a breeze. You will have more time for what really matters: teaching, supporting students and providing them with meaningful feedback.


Meet your AI Grader

Leverage Artificial Intelligence (AI) to get First-Pass Feedback on your students' assignments instantaneously, detect ChatGPT usage and see class data analytics.

Save time and Effort

Streamline your grading process and save hours or days.

Ensure fairness and consistency

Remove human biases from the equation with CoGrader's objective and fair grading system.

Provide better feedback

Provide lightning-fast comprehensive feedback to your students, helping them understand their performance better.

Class Analytics

Get an x-ray of your class's performance to spot challenges and strengths, and inform planning.

Google Classroom Integration

Import assignments from Google Classroom to CoGrader, and export reviewed feedback and grades back to it.

Canvas and Schoology compatibility

Export your assignments in bulk and upload them to CoGrader with one click.

Used at 1000+ schools

Backed by UC Berkeley


Teachers love CoGrader


How does CoGrader work?

It's easy to supercharge your grading process

Import Assignments from Google Classroom

CoGrader will automatically import the prompt given to the students and all files they have turned in.


Define Grading Criteria

Use our rubric templates, based on your state's standards, or set up your own grading criteria to align with your evaluation standards, specific requirements, and teaching objectives. Built-in rubrics include Argumentative, Narrative, and Informative pieces.


Get Grades, Feedback and Reports

CoGrader generates detailed feedback and justification reports for each student, highlighting areas of improvement together with the grade.


Review and Adjust

The teacher has the final say! Adjust the grades and the feedback so you can make sure every student gets the attention they deserve.


Want to see what Education looks like in 2023?


CoGrader: the AI copilot for teachers.

You can think of CoGrader as your teaching assistant, who streamlines grading by drafting initial feedback and grade suggestions, saving you time and hassle, and providing top notch feedback for the kids. You can use standardized rubrics and customize criteria, ensuring that your grading process is fair and consistent. Plus, you can detect if your student used ChatGPT to answer the assignment.

CoGrader considers the rubric and your grading instructions to automatically grade and suggest feedback, using AI. Currently CoGrader integrates with Google Classroom and will soon integrate with other LMS. If you don't use Google Classroom, let your LMS provider know that you are interested, so they speed up the process.

Try it out! We have designed CoGrader to be user-friendly and intuitive. We offer training and support to help you get started. Let us know if you need any help.

Privacy matters to us and we're committed to protecting student privacy. We are FERPA-compliant. We use student names to match assignments with the right students, but we quickly change them into a code that keeps the information private, and we get rid of the original names. We don't keep any other personal information about the students. The only thing we do keep is the text of the students' answers to assignments, because we need it for our grading service. This information is kept safe using Google’s secure system, known as OAuth2, which follows all the rules to make sure the information stays private. For a complete understanding of our commitment to privacy and the measures we take to ensure it, we encourage you to read our detailed privacy policy.

CoGrader finally allows educators to provide specific and timely feedback. In addition, it saves time and hassle, ensures consistency and accuracy in grading, reduces biases, and promotes academic integrity.

Soon, we'll indicate whether students have used ChatGPT or other AI systems for assignments, but achieving 100% accurate detection is not possible due to the complexity of human and AI-generated writing. Claims to the contrary are misinformation, as they overlook the nuanced nature of modern technology.

CoGrader uses cutting-edge generative AI algorithms that have undergone rigorous testing and human validation to ensure accuracy and consistency. In comparisons to manual grading, CoGrader typically shows only a small difference of up to ~5% in grades, often less than the variance between human graders. Some teachers have noted that this variance can be influenced by personal bias or the workload of grading. While CoGrader works hard to minimize errors and offer reliable results, it is always a good practice to review and validate the grades (and feedback) before submitting them.

CoGrader is designed to assist educators by streamlining the grading process with AI-driven suggestions. However, the final feedback and grades remain the responsibility of the educator. While CoGrader aims for accuracy and fairness, it should be used as an aid, not a replacement, for professional judgment. Educators should review and validate the grades and feedback before finalizing results. The use of CoGrader constitutes acceptance of these terms, and we expressly disclaim any liability for errors or inconsistencies. The final grading decision always rests with the educator.

Just try it out! We'll guide you along the way. If you have any questions, we're here to help. Once you're in, you'll save countless hours, avoid procrastination, and make grading efficient, fair, and helpful.

EssayGrader

Welcome to EssayGrader – where innovation meets education! 📚 In the realm of education, where the demands on teachers seem endless, EssayGrader emerged as a beacon of relief, born from a singular vision: lightening the grading burden for teachers. Picture this: educators faced with a daunting 200-to-1 student-to-teacher ratio for a single writing assignment. It's a challenge we understand all too well. With the power of artificial intelligence at our fingertips, we crafted a solution that not only eases this load but transforms the entire grading experience.

At the heart of our journey are four passionate individuals: Payton and Suraj, visionary software engineers; Chan, a world-class product marketer; and Ashley, a dedicated English teacher. Together, they embarked on a mission to revolutionize how teachers approach grading. The result? EssayGrader, a groundbreaking product meticulously designed to save time and energy. It's astonishing: what used to take an average of 10 minutes per essay can now be expertly handled in just 30 seconds, marking a phenomenal 95% reduction in grading time.

EssayGrader is more than just a tool; it's a testament to our unwavering commitment to educators. We've witnessed the exhaustion that plagues classrooms, and we set out to create something impactful. Through relentless dedication and numerous iterations, we've developed a product that resonates deeply with teachers – a tool they love and trust.

But our journey doesn't end here. EssayGrader is a dynamic creation in constant evolution. We are not merely satisfied; we are driven to refine, improve, and adapt. How? By engaging with our users, the lifeblood of our community. Your insights fuel our innovation. We are steadfast in our promise to listen, learn, and implement changes that empower both educators and students. Together, we're not just envisioning the future of education; we're shaping it.

Join us on this transformative expedition. Be a part of the EssayGrader family, where every click signifies progress, every grade transforms a student's journey, and every educator finds the support they deserve. Together, let's redefine education. Together, let's make a difference.

Meet the team


Chan is responsible for the day-to-day operations at EssayGrader. He comes from a family of teachers. After completing his Comp Sci degree in the US, Chan founded a school for underprivileged children in India serving 700+ students. He loved teaching Math and Physics to high schoolers during this time. He then relocated to Canada and subsequently held positions in Product Management/Marketing and Sales Engineering at some of the world's largest software companies.

Photo of Suraj, team member of EssayGrader.ai

Suraj is responsible for the technology behind EssayGrader. He has over 15 years of software development experience working with enterprise companies and tech startups. Suraj deeply cares about the people that work for EssayGrader and works hard to bring out the best in them. In his spare time, you can find Suraj spending time with his family and playing video games.

Photo of Payton, team member of EssayGrader.ai

After noticing his wife Ashley's unreasonably heavy grading load, Payton came up with the idea to start EssayGrader. With over 10 years of software development experience, he was the original architect of the platform and now advises the software development team on continuous product improvements. In his free time, he enjoys playing video games, working out, and spending time with his family.

Photo of Ashley, team member of EssayGrader.ai

Ashley is an amazing English teacher with over 10 years of experience. She regularly advises the EssayGrader team, drawing on her daily struggles as a teacher to help build a product loved by all teachers. In her free time, she enjoys watching TV, spending time with her family, and going out with friends.

Automate grading whilst maintaining academic integrity

Save up to 90% in marking costs with our bespoke AI models. Compatible with most marking and LMS systems in higher education and beyond.


Join organisations already reducing marking time with Gradingly

Setting the new benchmark in AI grading for education

Our goal is to automate the entire marking process, including scoring and written feedback for all subject areas. Early tests and simulations show a reduction in marker involvement of 70-90% using our bespoke-trained AI models, while maintaining a >90% agreement rate compared to human markers.

This “marker-in-the-loop” approach focuses on aiding existing markers with pre-filled feedback and scores. In the long run, the remaining human marker involvement will be utilised for benchmarking, spot checking, and borderline cases.

We believe AI grading, applicable in high-stakes testing for higher education and in practice environments, can play a significant role in reshaping the educational landscape. This movement can lead to more personalised feedback, support, and ultimately a new learning experience.

With a reduction in the marking workload, more teaching resources can be allocated to support students and increase learning efficiency.


“Working with Gradingly is the perfect partnership: their technical expertise and ‘can-do’ approach has made our content accessible in ways we couldn’t fathom.”

— Tom O'Reilly

Director, Prosperity Education

Ethical, inclusive and responsible AI models

Especially important for high-stakes environments, our diligent approach is focused on maintaining the highest academic integrity.

Bespoke AI Models

  • Multiple subject areas
  • Ethical AI development
  • Trusted worldwide
  • Considering AI grading

Sign up to our mailing list for occasional product and research updates

We’re proud of our success stories

Case studies from a small collection of our amazing customers who are benefiting from AI solutions.

  • Prosperity English
  • English Education Group

Contact our friendly team

We’d love to hear from you. Please fill out this form or shoot us an email.

Our friendly team is here to help.

Come say hello at our office HQ.

Lavant House, PO18 9AB, Chichester (United Kingdom)


Download our white paper


Original research article: Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value?


  • School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Edmonton, AB, Canada

Automated essay scoring (AES) is a compelling topic in Learning Analytics for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores. Consequently, the AES black box has remained impenetrable. Although several algorithms from Explainable Artificial Intelligence have recently been published, no research has yet investigated the role that these explanation models can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing personalized, formative, and fine-grained feedback to students during the writing process. Building on previous studies where models were trained to predict both the holistic and rubric scores of essays, using the Automated Student Assessment Prize’s essay datasets, this study focuses on predicting the quality of the writing style of Grade-7 essays and exposes the decision processes that lead to these predictions. In doing so, it evaluates the impact of deep learning (multi-layer perceptron neural networks) on the performance of AES. It has been found that the effect of deep learning can be best viewed when assessing the trustworthiness of explanation models. As more hidden layers were added to the neural network, the descriptive accuracy increased by about 10%. This study shows that faster (up to three orders of magnitude) SHAP implementations are as accurate as the slower model-agnostic one. It leverages the state-of-the-art in natural language processing, applying feature selection on a pool of 1592 linguistic indices that measure aspects of text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity. In addition to the list of most globally important features, this study reports (a) a list of features that are important for a specific essay (locally), (b) a range of values for each feature that contribute to higher or lower rubric scores, and (c) a model that allows to quantify the impact of the implementation of formative feedback.

Automated essay scoring (AES) is a compelling topic in Learning Analytics (LA) for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; only a few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores ( Kumar et al., 2017 ; Taghipour, 2017 ; Kumar and Boulanger, 2020 ). None has attempted to explain the whole decision process of AES, from holistic scores to rubric scores and from rubric scores to writing feature modeling. Although several algorithms from XAI (explainable artificial intelligence) ( Adadi and Berrada, 2018 ; Murdoch et al., 2019 ) have recently been published (e.g., LIME, SHAP) ( Ribeiro et al., 2016 ; Lundberg and Lee, 2017 ), no research has yet investigated the role that these explanation models (trained on top of predictive models) can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing teachers and students with personalized, formative, and fine-grained feedback during the writing process.

One of the key anticipated benefits of AES is the elimination of human biases such as rater fatigue, rater's expertise, severity/leniency, scale shrinkage, stereotyping, the halo effect, rater drift, perception difference, and inconsistency ( Taghipour, 2017 ). In turn, AES may suffer from its own set of biases (e.g., imperfections in training data, spurious correlations, overrepresented minority groups), which has incited the research community to look for ways to make AES more transparent, accountable, fair, unbiased, and consequently trustworthy while remaining accurate. This required changing the perception that AES is merely a machine learning and feature engineering task ( Madnani et al., 2017 ; Madnani and Cahill, 2018 ). Hence, researchers have advocated that AES should be seen as a shared task requiring several methodological design decisions along the way, such as curriculum alignment, construction of training corpora, a reliable scoring process, and rater performance evaluation, where the goal is to build and deploy fair and unbiased scoring models to be used in large-scale assessments and classroom settings ( Rupp, 2018 ; West-Smith et al., 2018 ; Rupp et al., 2019 ). Unfortunately, although these measures are intended to design reliable and valid AES systems, they may still fail to build trust among users, keeping the AES black box impenetrable for teachers and students.

It has previously been recognized that divergence of opinion between human and machine graders has been investigated only superficially ( Reinertsen, 2018 ). So far, researchers have used qualitative analyses to investigate the characteristics of essays that were rejected by AES systems (requiring a human to score them) ( Reinertsen, 2018 ). Others have strived to justify predicted scores by identifying the essay segments that actually caused them. Although these justifications hinted at and quantified the importance of these spatial cues, they did not provide any feedback on how to improve those suboptimal essay segments ( Mizumoto et al., 2019 ).

Related to this study and the work of Kumar and Boulanger (2020) is Revision Assistant, a commercial AES system developed by Turnitin ( Woods et al., 2017 ; West-Smith et al., 2018 ), which in addition to predicting essays’ holistic scores provides formative, rubric-specific, and sentence-level feedback over multiple drafts of a student’s essay. The implementation of Revision Assistant moved away from the traditional approach to AES, which consists in using a limited set of features engineered by human experts representing only high-level characteristics of essays. Like this study, it rather opted for including a large number of low-level writing features, demonstrating that expert-designed features are not required to produce interpretable predictions. Revision Assistant’s performance was reported on two essay datasets, one of which was the Automated Student Assessment Prize (ASAP) 1 dataset. However, performance on the ASAP dataset was reported in terms of quadratic weighted kappa and this for holistic scores only. Models predicting rubric scores were trained only with the other dataset which was hosted on and collected through Revision Assistant itself.

In contrast to feature-based approaches like the one adopted by Revision Assistant, other AES systems are implemented using deep neural networks where features are learned during model training. For example, Taghipour (2017) in his doctoral dissertation leverages a recurrent neural network to improve accuracy in predicting holistic scores, implement rubric scoring (i.e., organization and argument strength), and distinguish between human-written and computer-generated essays. Interestingly, Taghipour compared the performance of his AES system against other AES systems using the ASAP corpora, but he did not use the ASAP corpora when it came to train rubric scoring models although ASAP provides two corpora provisioning rubric scores (#7 and #8). Finally, research was also undertaken to assess the generalizability of rubric-based models by performing experiments across various datasets. It was found that the predictive power of such rubric-based models was related to how much the underlying feature set covered a rubric’s criteria ( Rahimi et al., 2017 ).

Despite their numbers, rubrics (e.g., organization, prompt adherence, argument strength, essay length, conventions, word choices, readability, coherence, sentence fluency, style, audience, ideas) are usually investigated in isolation and not as a whole, with the exception of Revision Assistant which provides feedback at the same time on the following five rubrics: claim, development, audience, cohesion, and conventions. The literature reveals that rubric-specific automated feedback includes numerical rubric scores as well as recommendations on how to improve essay quality and correct errors ( Taghipour, 2017 ). Again, except for Revision Assistant which undertook a holistic approach to AES including holistic and rubric scoring and provision of rubric-specific feedback at the sentence level, AES has generally not been investigated as a whole or as an end-to-end product. Hence, the AES used in this study and developed by Kumar and Boulanger (2020) is unique in that it uses both deep learning (multi-layer perceptron neural network) and a huge pool of linguistic indices (1592), predicts both holistic and rubric scores, explaining holistic scores in terms of rubric scores, and reports which linguistic indices are the most important by rubric. This study, however, goes one step further and showcases how to explain the decision process behind the prediction of a rubric score for a specific essay, one of the main AES limitations identified in the literature ( Taghipour, 2017 ) that this research intends to address, at least partially.

Besides providing explanations of predictions both globally and individually, this study not only goes one step further toward the automated provision of formative feedback but also does so in alignment with the explanation model and the predictive model, allowing to better map feedback to the actual characteristics of an essay. Woods et al. (2017) succeeded in associating sentence-level expert-derived feedback with strong/weak sentences having the greatest influence on a rubric score based on the rubric, essay score, and the sentence characteristics. While Revision Assistant’s feature space consists of counts and binary occurrence indicators of word unigrams, bigrams and trigrams, character four-grams, and part-of-speech bigrams and trigrams, they are mainly textual and locational indices; by nature they are not descriptive or self-explanative. This research fills this gap by proposing feedback based on a set of linguistic indices that can encompass several sentences at a time. However, the proposed approach omits locational hints, leaving the merging of the two approaches as the next step to be addressed by the research community.

Although this paper proposes to extend the automated provision of formative feedback through an interpretable machine learning method, it rather focuses on the feasibility of automating it in the context of AES instead of evaluating the pedagogical quality (such as the informational and communicational value of feedback messages) or impact on students’ writing performance, a topic that will be kept for an upcoming study. Having an AES system that is capable of delivering real-time formative feedback sets the stage to investigate (1) when feedback is effective, (2) the types of feedback that are effective, and (3) whether there exist different kinds of behaviors in terms of seeking and using feedback ( Goldin et al., 2017 ). Finally, this paper omits describing the mapping between the AES model’s linguistic indices and a pedagogical language that is easily understandable by students and teachers, which is beyond its scope.

Methodology

This study showcases the application of the PDR framework ( Murdoch et al., 2019 ), which provides three pillars to describe interpretations in the context of the data science life cycle: P redictive accuracy, D escriptive accuracy, and R elevancy to human audience(s). It is important to note that in a broader sense both terms “explainable artificial intelligence” and “interpretable machine learning” can be used interchangeably with the following meaning ( Murdoch et al., 2019 ): “the use of machine-learning models for the extraction of relevant knowledge about domain relationships contained in data.” Here “predictive accuracy” refers to the measurement of a model’s ability to fit data; “descriptive accuracy” is the degree at which the relationships learned by a machine learning model can be objectively captured; and “relevant knowledge” implies that a particular audience gets insights into a chosen domain problem that guide its communication, actions, and discovery ( Murdoch et al., 2019 ).

In the context of this article, formative feedback that assesses students’ writing skills and prescribes remedial writing strategies is the relevant knowledge sought for, whose effectiveness on students’ writing performance will be validated in an upcoming study. However, the current study puts forward the tools and evaluates the feasibility to offer this real-time formative feedback. It also measures the predictive and descriptive accuracies of AES and explanation models, two key components to generate trustworthy interpretations ( Murdoch et al., 2019 ). Naturally, the provision of formative feedback is dependent on the speed of training and evaluating new explanation models every time a new essay is ingested by the AES system. That is why this paper investigates the potential of various SHAP implementations for speed optimization without compromising the predictive and descriptive accuracies. This article will show how the insights generated by the explanation model can serve to debug the predictive model and contribute to enhance the feature selection and/or engineering process ( Murdoch et al., 2019 ), laying the foundation for the provision of actionable and impactful pieces of knowledge to educational audiences, whose relevancy will be judged by the human stakeholders and estimated by the magnitude of resulting changes.

Figure 1 overviews all the elements and steps encompassed by the AES system in this study. The following subsections will address each facet of the overall methodology, from hyperparameter optimization to relevancy to both students and teachers.


Figure 1. A flow chart exhibiting the sequence of activities to develop an end-to-end AES system and how the various elements work together to produce relevant knowledge to the intended stakeholders.

Automated Essay Scoring System, Dataset, and Feature Selection

As previously mentioned, this paper reuses the AES system developed by Kumar and Boulanger (2020) . The AES models were trained using the ASAP’s seventh essay corpus. These narrative essays were written by Grade-7 students in the setting of state-wide assessments in the United States and had an average length of 171 words. Students were asked to write a story about patience. Kumar and Boulanger’s work consisted in training a predictive model for each of the four rubrics according to which essays were graded: ideas, organization, style, and conventions. Each essay was scored by two human raters on a 0−3 scale (integer scale). Rubric scores were resolved by adding the rubric scores assigned by the two human raters, producing a resolved rubric score between 0 and 6. This paper is a continuation of Boulanger and Kumar (2018 , 2019 , 2020) and Kumar and Boulanger (2020) where the objective is to open the AES black box to explain the holistic and rubric scores that it predicts. Essentially, the holistic score ( Boulanger and Kumar, 2018 , 2019 ) is determined and justified through its four rubrics. Rubric scores, in turn, are investigated to highlight the writing features that play an important role within each rubric ( Kumar and Boulanger, 2020 ). Finally, beyond global feature importance, it is not only indispensable to identify which writing indices are important for a particular essay (local), but also to discover how they contribute to increase or decrease the predicted rubric score, and which feature values are more/less desirable ( Boulanger and Kumar, 2020 ). This paper is a continuation of these previous works by adding the following link to the AES chain: holistic score, rubric scores, feature importance, explanations, and formative feedback. The objective is to highlight the means for transparent and trustable AES while empowering learning analytics practitioners with the tools to debug these models and equip educational stakeholders with an AI companion that will semi-autonomously generate formative feedback to teachers and students. Specifically, this paper analyzes the AES reasoning underlying its assessment of the “style” rubric, which looks for command of language, including effective and compelling word choice and varied sentence structure, that clearly supports the writer’s purpose and audience.

This research’s approach to AES leverages a feature-based multi-layer perceptron (MLP) deep neural network to predict rubric scores. The AES system is fed by 1592 linguistic indices quantitatively measured by the Suite of Automatic Linguistic Analysis Tools 2 (SALAT), which assess aspects of grammar and mechanics, sentiment analysis and cognition, text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity ( Kumar and Boulanger, 2020 ). The purpose of using such a huge pool of low-level writing features is to let deep learning extract the most important ones; the literature supports this practice since there is evidence that features automatically selected are not less interpretable than those engineered ( Woods et al., 2017 ). However, to facilitate this process, this study opted for a semi-automatic strategy that consisted of both filter and embedded methods. Firstly, the original ASAP’s seventh essay dataset consists of a training set of 1567 essays and validation and testing sets of 894 essays combined. While the texts of all 2461 essays are still available to the public, only the labels (the rubric scores of two human raters) of the training set have been released. Yet, this paper reused the unlabeled 894 essays of the validation and testing sets for feature selection, a process that must be carried out carefully so that it is not informed by the essays that will train the predictive model. Secondly, feature data were normalized, and features with variances lower than 0.01 were pruned. Thirdly, the last feature of any pair of features having an absolute Pearson correlation coefficient greater than 0.7 was also pruned (the one that comes last in terms of the column ordering in the datasets). After the application of these filter methods, the number of features was reduced from 1592 to 282. Finally, the Lasso and Ridge regression regularization methods (whose combination is also called ElasticNet) were applied during the training of the rubric scoring models. Lasso is responsible for pruning further features, while Ridge regression is entrusted with eliminating multicollinearity among features.
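As a concrete illustration of the filter steps just described (normalization, variance pruning at 0.01, and pruning the later member of any feature pair with an absolute Pearson correlation above 0.7), here is a minimal Python sketch. It assumes the SALAT indices are loaded into a pandas DataFrame `X`; the choice of min-max scaling is an assumption, since the text does not name the normalization method, and the Lasso/Ridge (ElasticNet) step would be applied later during model training rather than here.

```python
# Minimal sketch of the filter-based feature selection described above,
# assuming the 1592 SALAT indices are in a pandas DataFrame `X`.
# Min-max normalization is an assumption; the text only says "normalized".
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def filter_features(X: pd.DataFrame,
                    var_threshold: float = 0.01,
                    corr_threshold: float = 0.7) -> pd.DataFrame:
    # Normalize each feature to [0, 1] before applying the variance filter.
    X_norm = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

    # Prune features whose variance falls below the threshold.
    X_norm = X_norm.loc[:, X_norm.var() > var_threshold]

    # For any pair with |Pearson r| > corr_threshold, prune the feature that
    # comes later in the column ordering.
    corr = X_norm.corr().abs()
    cols = list(X_norm.columns)
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if cols[j] not in to_drop and corr.iloc[i, j] > corr_threshold:
                to_drop.add(cols[j])
    return X_norm.drop(columns=list(to_drop))

# Usage (hypothetical): X_filtered = filter_features(salat_indices)
```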

Hyperparameter Optimization and Training

To ensure a fair evaluation of the potential of deep learning, it is of utmost importance to minimally describe this study’s exploration of the hyperparameter space, a step that is often found to be missing when reporting the outcomes of AES models’ performance ( Kumar and Boulanger, 2020 ). First, a study should list the hyperparameters it is going to investigate by testing for various values of each hyperparameter. For example, Table 1 lists all hyperparameters explored in this study. Note that L 1 and L 2 are two regularization hyperparameters contributing to feature selection. Second, each study should also report the range of values of each hyperparameter. Finally, the strategy to explore the selected hyperparameter subspace should be clearly defined. For instance, given the availability of high-performance computing resources and the time/cost of training AES models, one might favor performing a grid (a systematic testing of all combinations of hyperparameters and hyperparameter values within a subspace) or a random search (randomly selecting a hyperparameter value from a range of values per hyperparameter) or both by first applying random search to identify a good starting candidate and then grid search to test all possible combinations in the vicinity of the starting candidate’s subspace. Of particular interest to this study is the neural network itself, that is, how many hidden layers should a neural network have and how many neurons should compose each hidden layer and the neural network as a whole. These two variables are directly related to the size of the neural network, with the number of hidden layers being a defining trait of deep learning. A vast swath of literature is silent about the application of interpretable machine learning in AES and even more about measuring its descriptive accuracy, the two components of trustworthiness. Hence, this study pioneers the comprehensive assessment of deep learning impact on AES’s predictive and descriptive accuracies.


Table 1. Hyperparameter subspace investigated in this article along with best hyperparameter values per neural network architecture.

Consequently, the 1567 labeled essays were divided into a training set (80%) and a testing set (20%). No validation set was put aside; 5-fold cross-validation was rather used for hyperparameter optimization. Table 1 delineates the hyperparameter subspace from which 800 different combinations of hyperparameter values were randomly selected out of a subspace of 86,248,800 possible combinations. Since this research proposes to investigate the potential of deep learning to predict rubric scores, several architectures consisting of 2 to 6 hidden layers and ranging from 9,156 to 119,312 parameters were tested. Table 1 shows the best hyperparameter values per depth of neural networks.
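A random search over the hyperparameter subspace with 5-fold cross-validation could look roughly like the sketch below. The hyperparameter names, candidate values, and the Keras MLP builder are illustrative placeholders, not the study's Table 1 subspace; `X` and `y` are assumed to be NumPy arrays of features and resolved rubric scores.

```python
# Illustrative sketch of random hyperparameter search with 5-fold CV.
# Candidate values below are placeholders, not the study's Table 1 values.
import random
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

def build_mlp(n_features, hidden_layers, units, l1, l2, lr):
    reg = tf.keras.regularizers.L1L2(l1=l1, l2=l2)
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(n_features,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(units, activation="relu",
                                        kernel_regularizer=reg))
    model.add(tf.keras.layers.Dense(1))  # resolved rubric score (0-6)
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    return model

def random_search(X, y, n_trials=800, n_splits=5):
    best_params, best_loss = None, np.inf
    for _ in range(n_trials):
        params = dict(hidden_layers=random.choice([2, 3, 4, 5, 6]),
                      units=random.choice([64, 128, 256]),
                      l1=random.choice([0.0, 1e-4, 1e-3]),
                      l2=random.choice([0.0, 1e-4, 1e-3]),
                      lr=random.choice([1e-3, 1e-4]))
        losses = []
        for train_idx, val_idx in KFold(n_splits, shuffle=True).split(X):
            model = build_mlp(X.shape[1], **params)
            model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
            losses.append(model.evaluate(X[val_idx], y[val_idx], verbose=0))
        if np.mean(losses) < best_loss:
            best_params, best_loss = params, float(np.mean(losses))
    return best_params, best_loss
```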

Again, the essays of the testing set were never used during the training and cross-validation processes. In order to retrieve the best predictive models during training, every time the validation loss reached a record low, the model was overwritten. Training stopped when no new record low was reached during 100 epochs. Moreover, to avoid reporting the performance of overfit models, each model was trained five times using the same set of best hyperparameter values. Finally, for each resulting predictive model, a corresponding ensemble model (bagging) was also obtained out of the five models trained during cross-validation.
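A minimal sketch of this training scheme, under the assumptions stated in the text: the weights with the lowest validation loss are kept, training stops after 100 epochs without a new record low, and the five cross-validation models are bagged into an ensemble. `build_fn` stands for any function returning a freshly compiled Keras model (for example, the hypothetical `build_mlp` above with the best hyperparameters fixed).

```python
# Sketch of early stopping with best-weight restoration and bagging:
# the ensemble's prediction is the mean of its members' predictions.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

def train_ensemble(X, y, build_fn, n_splits=5):
    members = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True).split(X):
        model = build_fn()
        stopper = tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=100, restore_best_weights=True)
        model.fit(X[train_idx], y[train_idx],
                  validation_data=(X[val_idx], y[val_idx]),
                  epochs=10000, callbacks=[stopper], verbose=0)
        members.append(model)
    return members

def ensemble_predict(members, X):
    # Bagging: average the members' predictions.
    return np.mean([m.predict(X, verbose=0) for m in members], axis=0)
```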

Predictive Models and Predictive Accuracy

Table 2 delineates the performance of predictive models trained previously by Kumar and Boulanger (2020) on the four scoring rubrics. The first row lists the agreement levels between the resolved and predicted rubric scores measured by the quadratic weighted kappa. The second row is the percentage of accurate predictions; the third row reports the percentages of predictions that are either accurate or off by 1; and the fourth row reports the percentages of predictions that are either accurate or at most off by 2. Prediction of holistic scores is done merely by adding up all rubric scores. Since the scale of rubric scores is 0−6 for every rubric, then the scale of holistic scores is 0−24.


Table 2. Rubric scoring models’ performance on testing set.

While each of these rubric scoring models might suffer from its own systemic bias and hence cancel off each other’s bias by adding up the rubric scores to derive the holistic score, this study (unlike related works) intends to highlight these biases by exposing the decision making process underlying the prediction of rubric scores. Although this paper exclusively focuses on the Style rubric, the methodology put forward to analyze the local and global importance of writing indices and their context-specific contributions to predicted rubric scores is applicable to every rubric and allows to control for these biases one rubric at a time. Comparing and contrasting the role that a specific writing index plays within each rubric context deserves its own investigation, which has been partly addressed in the study led by Kumar and Boulanger (2020) . Moreover, this paper underscores the necessity to measure the predictive accuracy of rubric-based holistic scoring using additional metrics to account for these rubric-specific biases. For example, there exist several combinations of rubric scores to obtain a holistic score of 16 (e.g., 4-4-4-4 vs. 4-3-4-5 vs. 3-5-2-6). Even though the predicted holistic score might be accurate, the rubric scores could all be inaccurate. Similarity or distance metrics (e.g., Manhattan and Euclidean) should then be used to describe the authenticity of the composition of these holistic scores.
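To make these evaluation metrics concrete, the sketch below computes the quadratic weighted kappa per rubric and on the summed (holistic) scores, together with a per-essay Manhattan distance between predicted and resolved rubric-score vectors, as suggested above. The function name and array layout are assumptions for illustration.

```python
# Sketch of the metrics discussed above: quadratic weighted kappa per rubric
# and on holistic scores, plus a Manhattan distance between rubric-score
# vectors to flag accurate holistic scores built from inaccurate rubric
# scores (e.g., 4-4-4-4 vs. 3-5-2-6). Array layout is an assumption.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def evaluate_rubrics(true_rubrics: np.ndarray, pred_rubrics: np.ndarray):
    # Shape (n_essays, 4); integer resolved scores on the 0-6 scale.
    qwk_per_rubric = [
        cohen_kappa_score(true_rubrics[:, r], pred_rubrics[:, r],
                          weights="quadratic")
        for r in range(true_rubrics.shape[1])
    ]
    # Holistic score (0-24) is the sum of the four rubric scores.
    qwk_holistic = cohen_kappa_score(true_rubrics.sum(axis=1),
                                     pred_rubrics.sum(axis=1),
                                     weights="quadratic")
    # Per-essay Manhattan distance between rubric-score vectors.
    manhattan = np.abs(true_rubrics - pred_rubrics).sum(axis=1)
    return qwk_per_rubric, qwk_holistic, manhattan.mean()
```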

According to what Kumar and Boulanger (2020) report on the performance of several state-of-the-art AES systems trained on ASAP’s seventh essay dataset, the AES system they developed and which will be reused in this paper proved competitive while being fully and deeply interpretable, which no other AES system does. They also supply further information about the study setting, essay datasets, rubrics, features, natural language processing (NLP) tools, model training, and evaluation against human performance. Again, this paper showcases the application of explainable artificial intelligence in automated essay scoring by focusing on the decision process of the Rubric #3 (Style) scoring model. Remember that the same methodology is applicable to each rubric.

Explanation Model: SHAP

SH apley A dditive ex P lanations (SHAP) is a theoretically justified XAI framework that can provide simultaneously both local and global explanations ( Molnar, 2020 ); that is, SHAP is able to explain individual predictions taking into account the uniqueness of each prediction, while highlighting the global factors influencing the overall performance of a predictive model. SHAP is of keen interest because it unifies all algorithms of the class of additive feature attribution methods, adhering to a set of three properties that are desirable in interpretable machine learning: local accuracy, missingness, and consistency ( Lundberg and Lee, 2017 ). A key advantage of SHAP is that feature contributions are all expressed in terms of the outcome variable (e.g., rubric scores), providing a same scale to compare the importance of each feature against each other. Local accuracy refers to the fact that no matter the explanation model, the sum of all feature contributions is always equal to the prediction explained by these features. The missingness property implies that the prediction is never explained by unmeasured factors, which are always assigned a contribution of zero. However, the converse is not true; a contribution of zero does not imply an unobserved factor, it can also denote a feature irrelevant to explain the prediction. The consistency property guarantees that a more important feature will always have a greater magnitude than a less important one, no matter how many other features are included in the explanation model. SHAP proves superior to other additive attribution methods such as LIME (Local Interpretable Model-Agnostic Explanations), Shapley values, and DeepLIFT in that they never comply with all three properties, while SHAP does ( Lundberg and Lee, 2017 ). Moreover, the way SHAP assesses the importance of a feature differs from permutation importance methods (e.g., ELI5), measured as the decrease in model performance (accuracy) as a feature is perturbated, in that it is based on how much a feature contributes to every prediction.

Essentially, a SHAP explanation model (linear regression) is trained on top of a predictive model, which in this case is a complex ensemble deep learning model. Table 3 demonstrates a small-scale example of an explanation model, showing how SHAP values (feature contributions) work. In this example, there are five instances and five features describing each instance (in the context of this paper, an instance is an essay). Predictions are listed in the second-to-last column, and the base value is the mean of all predictions. The base value constitutes the reference point according to which predictions are explained; in other words, reasons are given to justify the discrepancy between the individual prediction and the mean prediction (the base value). Notice that the table does not contain the actual feature values; these are SHAP values that quantify the contribution of each feature to the predicted score. For example, the prediction of Instance 1 is 2.46, while the base value is 3.76. Adding up the feature contributions of Instance 1 to the base value produces the predicted score of 2.46.


Table 3. Array of SHAP values: local and global importance of features and feature coverage per instance.

Hence, the generic equation of the explanation model ( Lundberg and Lee, 2017 ) is:

$$g(x) = \sigma_0 + \sum_{i=1}^{j} \sigma_i x_i$$

where g(x) is the prediction of an individual instance x, σ_0 is the base value, σ_i is the feature contribution of feature x_i, x_i ∈ {0,1} denotes whether feature x_i is part of the individual explanation, and j is the total number of features. Furthermore, the global importance I_j of a feature j is calculated by adding up the absolute values of its corresponding SHAP values over all instances, where n is the total number of instances and σ_i(j) is the feature contribution for instance i ( Lundberg et al., 2018 ):

$$I_j = \sum_{i=1}^{n} |\sigma_i(j)|$$

Therefore, it can be seen that Feature 3 is the most globally important feature, while Feature 2 is the least important one. Similarly, Feature 5 is Instance 3’s most important feature at the local level, while Feature 2 is the least locally important. The reader should also note that a feature shall not necessarily be assigned any contribution; some of them are just not part of the explanation such as Feature 2 and Feature 3 in Instance 2. These concepts lay the foundation for the explainable AES system presented in this paper. Just imagine that each instance (essay) will be rather summarized by 282 features and that the explanations of all the testing set’s 314 essays will be provided.

Several implementations of SHAP exist: KernelSHAP, DeepSHAP, GradientSHAP, and TreeSHAP, among others. KernelSHAP is model-agnostic and works for any type of predictive model; however, KernelSHAP is very computing-intensive, which makes it undesirable for practical purposes. DeepSHAP and GradientSHAP are two implementations intended for deep learning that take advantage of the known properties of neural networks (i.e., MLP-NN, CNN, or RNN) to accelerate the processing time needed to explain predictions by up to three orders of magnitude ( Chen et al., 2019 ). Finally, TreeSHAP is the most powerful implementation intended for tree-based models. TreeSHAP is not only fast; it is also accurate. While the three former implementations estimate SHAP values, TreeSHAP computes them exactly. Moreover, TreeSHAP not only measures the contribution of individual features, but it also considers interactions between pairs of features and assigns them SHAP values. Since one of the goals of this paper is to assess the potential of deep learning on the performance of both predictive and explanation models, this research tested the first three implementations. TreeSHAP is recommended for future work since the interaction among features is critical information to consider. Moreover, KernelSHAP, DeepSHAP, and GradientSHAP all require access to the whole original dataset to derive the explanation of a new instance, another constraint TreeSHAP is not subject to.
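A rough sketch of how the three tested implementations could be instantiated with the `shap` library for a Keras model is shown below. `background` stands for a sample of the training data that the explainers need as reference; summarizing it with `shap.kmeans` for KernelSHAP is a common practice rather than something prescribed by the study, and (as the paper notes later) DeepSHAP may fail for some architectures.

```python
# Rough sketch of the three SHAP implementations tested in the study,
# applied to a Keras model. `background` is assumed to be a NumPy sample
# of the training data.
import shap

def build_explainers(model, background, kmeans_k=32):
    return {
        # DeepSHAP: exploits the network's structure (may fail for some
        # architectures, as the paper notes for its 6-layer model).
        "Deep": shap.DeepExplainer(model, background),
        # GradientSHAP: gradient-based, also intended for deep learning.
        "Grad": shap.GradientExplainer(model, background),
        # KernelSHAP: model-agnostic but slow; a k-means summary of the
        # background data keeps it tractable.
        "Kernel": shap.KernelExplainer(
            lambda x: model.predict(x, verbose=0).ravel(),
            shap.kmeans(background, kmeans_k)),
    }

# Usage (hypothetical):
# shap_values = build_explainers(model, background)["Grad"].shap_values(X_test)
```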

Descriptive Accuracy: Trustworthiness of Explanation Models

This paper reuses and adapts the methodology introduced by Ribeiro et al. (2016) . Several explanation models will be trained, using different SHAP implementations and configurations, per deep learning predictive model (for each number of hidden layers). The rationale consists of randomly selecting and ignoring 25% of the 282 features feeding the predictive model (e.g., turning them to zero). If this causes the prediction to change beyond a specific threshold (in this study 0.10 and 0.25 were tested), then the explanation model should also reflect the magnitude of this change while ignoring the contributions of these same features. For example, the original predicted rubric score of an essay might be 5; however, when ignoring the information brought in by a subset of 70 randomly selected features (25% of 282), the prediction may drop to 4. If the explanation model also predicts a 4 while ignoring the contributions of the same subset of features, then the explanation is considered trustworthy. This makes it possible to compute the precision, recall, and F1-score of each explanation model (from the numbers of true and false positives and true and false negatives). The process is repeated 500 times for every essay to determine the average precision and recall of every explanation model.
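The sketch below illustrates one way to implement this trustworthiness test for a single essay: zero out a random 25% of the features, check whether the model's prediction moves beyond the threshold, and compare that with the change implied by the explanation (here taken as the summed SHAP contributions of the ignored features, by the additive property). The exact criterion the study uses to decide that an explanation "reflects" the change is our interpretation; the 25% fraction, the 0.10 threshold, and the 500 repetitions follow the text.

```python
# Sketch of the trustworthiness test for a single essay.
# The way the explanation "change" is computed is an interpretation.
import numpy as np

def trust_precision_recall(predict, shap_values, x, n_trials=500,
                           frac=0.25, threshold=0.10, seed=0):
    rng = np.random.default_rng(seed)
    base_pred = float(np.ravel(predict(x[None, :]))[0])
    tp = fp = tn = fn = 0
    n_feat = x.shape[0]
    for _ in range(n_trials):
        ignored = rng.choice(n_feat, size=int(frac * n_feat), replace=False)
        x_masked = x.copy()
        x_masked[ignored] = 0.0
        new_pred = float(np.ravel(predict(x_masked[None, :]))[0])
        pred_changed = abs(new_pred - base_pred) > threshold
        expl_changed = abs(shap_values[ignored].sum()) > threshold
        if pred_changed and expl_changed:
            tp += 1          # prediction and explanation move together
        elif pred_changed:
            fn += 1          # prediction moved but explanation missed it
        elif expl_changed:
            fp += 1          # explanation claims a change that did not happen
        else:
            tn += 1          # both stayed within the threshold
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```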

Judging Relevancy

So far, the consistency of explanations with predictions has been considered. However, consistent explanations do not imply relevant or meaningful explanations. Put another way, explanations only reflect what predictive models have learned during training. How can the black box of these explanations be opened? Looking directly at the numerical SHAP values of each explanation might seem a daunting task, but there exist tools, mainly visualizations (decision plot, summary plot, and dependence plot), that help make sense of these explanations. However, before visualizing these explanations, another question needs to be addressed: which explanations or essays should be picked for further scrutiny of the AES system? Given the huge number of essays to examine and the tedious task of understanding the underpinnings of a single explanation, a small subset of essays should be carefully picked to concisely represent the state of correctness of the underlying predictive model. Again, this study applies and adapts the methodology in Ribeiro et al. (2016) . A greedy algorithm selects essays whose predictions are explained by as many features of global importance as possible to optimize feature coverage. Ribeiro et al. demonstrated in unrelated studies (i.e., sentiment analysis) that the correctness of a predictive model can be assessed with as few as four or five well-picked explanations.

For example, Table 3 reveals the global importance of five features. The square root of each feature’s global importance is also computed and considered instead to limit the influence of a small group of very influential features. The feature coverage of Instance 1 is 100% because all features are engaged in the explanation of the prediction. On the other hand, Instance 2 has a feature coverage of 61.5% because only Features 1, 4, and 5 are part of the prediction’s explanation. The feature coverage is calculated by summing the square roots of the global importance of the features taking part in the explanation and dividing by the sum of the square roots of all features’ global importance:

$$\text{coverage} = \frac{\sum_{j \in E} \sqrt{I_j}}{\sum_{j=1}^{m} \sqrt{I_j}}$$

where E is the set of features taking part in the explanation, I_j is the global importance of feature j, and m is the total number of features.

Additionally, it can be seen that Instance 4 does not have any zero feature value although its feature coverage is only 84.6%. The algorithm was constrained to discard from the explanation any feature whose contribution (local importance) was too close to zero. In the case of Table 3 ’s example, any feature whose absolute SHAP value is less than 0.10 is ignored, hence leading to a feature coverage of 84.6%.

In this paper’s study, the real threshold was 0.01. This constraint was actually a requirement for the DeepSHAP and GradientSHAP implementations because they only output non-zero SHAP values contrary to KernelSHAP which generates explanations with a fixed number of features: a non-zero SHAP value indicates that the feature is part of the explanation, while a zero value excludes the feature from the explanation. Without this parameter, all 282 features would be part of the explanation although a huge number only has a trivial (very close to zero) SHAP value. Now, a much smaller but variable subset of features makes up each explanation. This is one way in which Ribeiro et al.’s SP-LIME algorithm (SP stands for Submodular Pick) has been adapted to this study’s needs. In conclusion, notice how Instance 4 would be selected in preference to Instance 5 to explain Table 3 ’s underlying predictive model. Even though both instances have four features explaining their prediction, Instance 4’s features are more globally important than Instance 5’s features, and therefore Instance 4 has greater feature coverage than Instance 5.

Whereas Table 3 ’s example exhibits the feature coverage of one instance at a time, this study computes it for a subset of instances, where the absolute SHAP values are aggregated (summed) per candidate subset. When the sum of absolute SHAP values per feature exceeds the set threshold, the feature is then considered as covered by the selected set of instances. The objective in this study was to optimize the feature coverage while minimizing the number of essays to validate the AES model.
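A possible implementation of this adapted submodular-pick selection is sketched below: essays are added greedily, a feature counts as covered once the summed absolute SHAP values of the selected essays reach the 0.01 threshold, and coverage is weighted by the square-rooted global importances. Function and variable names are illustrative, and details such as tie-breaking are assumptions.

```python
# Sketch of the adapted SP-LIME-style greedy selection described above.
# shap_matrix: (n_essays, n_features) array of SHAP contributions.
import numpy as np

def greedy_pick(shap_matrix: np.ndarray, n_pick: int, min_contrib: float = 0.01):
    abs_vals = np.abs(shap_matrix)
    importance = np.sqrt(abs_vals.sum(axis=0))  # square-rooted global importance
    acc = np.zeros(shap_matrix.shape[1])        # accumulated |SHAP| of picked essays
    picked = []
    for _ in range(min(n_pick, shap_matrix.shape[0])):
        current = importance[acc >= min_contrib].sum()
        gains = [
            importance[(acc + abs_vals[r]) >= min_contrib].sum() - current
            if r not in picked else -np.inf
            for r in range(shap_matrix.shape[0])
        ]
        best_row = int(np.argmax(gains))
        picked.append(best_row)
        acc += abs_vals[best_row]
    coverage = importance[acc >= min_contrib].sum() / importance.sum()
    return picked, coverage
```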

Research Questions

One of this article’s objectives is to assess the potential of deep learning in automated essay scoring. The literature has often claimed ( Hussein et al., 2019 ) that there are two approaches to AES, feature-based and deep learning, as though these two approaches were mutually exclusive. Yet, the literature also puts forward that feature-based AES models may be more interpretable than deep learning ones ( Amorim et al., 2018 ). This paper embraces the viewpoint that these two approaches can also be complementary: it leverages the state of the art in NLP and automatic linguistic analysis, harnesses one of the richest pools of linguistic indices put forward in the research community ( Crossley et al., 2016 , 2017 , 2019 ; Kyle, 2016 ; Kyle et al., 2018 ), and applies a thorough feature selection process powered by deep learning. Moreover, the ability of deep learning to model complex non-linear relationships makes it particularly well-suited for AES, given that the importance of a writing feature is highly dependent on its context, that is, its interactions with other writing features. Besides, this study leverages the SHAP interpretation method, which is well-suited to interpret very complex models. Hence, this study elected to work with deep learning models and ensembles to test SHAP’s ability to explain these complex models. Previously, the literature has revealed the difficulty of having models that are both accurate and interpretable ( Ribeiro et al., 2016 ; Murdoch et al., 2019 ), where favoring one comes at the expense of the other. However, this research shows how XAI now makes it possible to produce both accurate and interpretable models in the area of AES. Since ensembles have been repeatedly shown to boost the accuracy of predictive models, they were included as part of the tested deep learning architectures to maximize generalizability and accuracy, while making these predictive models interpretable and exploring whether deep learning can even enhance their descriptive accuracy further.

This study investigates the trustworthiness of explanation models, and more specifically, those explaining deep learning predictive models. For instance, does the depth, defined as the number of hidden layers, of an MLP neural network increase the trustworthiness of its SHAP explanation model? The answer to this question will help determine whether it is possible to have very accurate AES models while having competitively interpretable/explainable models, the cornerstone for the generation of formative feedback. Remember that formative feedback is defined as “any kind of information provided to students about their actual state of learning or performance in order to modify the learner’s thinking or behavior in the direction of the learning standards” and that formative feedback “conveys where the student is, what are the goals to reach, and how to reach the goals” ( Goldin et al., 2017 ). This notion contrasts with summative feedback, which is basically “a justification of the assessment results” ( Hao and Tsikerdekis, 2019 ).

As pointed out in the previous section, multiple SHAP implementations are evaluated in this study. Hence, this paper showcases whether the faster DeepSHAP and GradientSHAP implementations are as reliable as the slower KernelSHAP implementation. The answer to this research question will shed light on the feasibility of providing immediate formative feedback, multiple times throughout students’ writing processes.

This study also looks at whether a summary of the data produces explanations as trustworthy as those derived from the original data. This question will be of interest to AES researchers and practitioners because it could significantly decrease the processing time of the computing-intensive and model-agnostic KernelSHAP implementation and further test the potential of customizable explanations.

KernelSHAP allows the user to specify the total number of features that will shape the explanation of a prediction; for instance, this study experiments with explanations of 16 and 32 features and observes whether there exists a statistically significant difference in the reliability of these explanation models. Knowing this will hint at whether simpler or more complex explanations are more desirable when it comes to optimizing their trustworthiness. If there is no statistically significant difference, then AES practitioners are given further flexibility in the selection of SHAP implementations to find the sweet spot between complexity of explanations and speed of processing. For instance, the KernelSHAP implementation allows customizing the number of factors making up an explanation, while the faster DeepSHAP and GradientSHAP do not.

Finally, this paper highlights the means to debug and compare the performance of predictive models through their explanations. Once a model is debugged, the process can be reused to fine-tune feature selection and/or feature engineering to improve predictive models and for the generation of formative feedback to both students and teachers.

The training, validation, and testing sets consist of 1567 essays, each of which has been scored by two human raters, who assigned a score between 0 and 3 per rubric (ideas, organization, style, and conventions). In particular, this article looks at the predictive and descriptive accuracy of AES models on the third rubric, style. Note that although each essay has been scored by two human raters, the literature ( Shermis, 2014 ) is not explicit about whether only two human raters or more participated in the scoring of all 1567 essays; given the huge number of essays, it is likely that more than two human raters were involved, so the amount of noise introduced by the various raters’ biases is unknown, although it is probably balanced to some degree between the two groups of raters. Figure 2 shows the confusion matrices of human raters on the Style rubric. The diagonal elements (dark gray) correspond to exact matches, whereas the light gray squares indicate adjacent matches. Figure 2A delineates the number of essays per pair of ratings, and Figure 2B shows the percentages per pair of ratings. The agreement level between each pair of human raters, measured by the quadratic weighted kappa, is 0.54; the percentage of exact matches is 65.3%; the percentage of adjacent matches is 34.4%; and 0.3% of essays are neither exact nor adjacent matches. Figures 2A,B specify the distributions of 0−3 ratings per group of human raters. Figure 2C exhibits the distribution of resolved scores (a resolved score is the sum of the two human ratings). The mean is 3.99 (with a standard deviation of 1.10), and the median and mode are 4. It is important to note that the levels of predictive accuracy reported in this article are measured on the scale of resolved scores (0−6) and that larger scales tend to slightly inflate quadratic weighted kappa values, which must be taken into account when comparing against the level of agreement between human raters. Comparison of percentages of exact and adjacent matches must also be made with this scoring scale discrepancy in mind.


Figure 2. Summary of the essay dataset (1567 Grade-7 narrative essays) investigated in this study. (A) Number of essays per pair of human ratings; the diagonal (dark gray squares) lists the numbers of exact matches while the light-gray squares list the numbers of adjacent matches; and the bottom row and the rightmost column highlight the distributions of ratings for both groups of human raters. (B) Percentages of essays per pair of human ratings; the diagonal (dark gray squares) lists the percentages of exact matches while the light-gray squares list the percentages of adjacent matches; and the bottom row and the rightmost column highlight the distributions (frequencies) of ratings for both groups of human raters. (C) The distribution of resolved rubric scores; a resolved score is the sum of its two constituent human ratings.

Predictive Accuracy and Descriptive Accuracy

Table 4 compiles the performance outcomes of the 10 predictive models evaluated in this study. The reader should remember that the performance of each model was averaged over five iterations and that two models were trained per number of hidden layers, one non-ensemble and one ensemble. Except for the 6-layer models, there is no clear winner among the models; even the 6-layer models are superior only in terms of exact matches, the primary goal for a reliable AES system, and not according to adjacent matches. Nevertheless, on average the ensemble models slightly outperform the non-ensemble models, and they are therefore retained for the next analysis step. Moreover, given that five ensemble models were trained per neural network depth, the most accurate model among the five is selected and displayed in Table 4.


Table 4. Performance of majority classifier and average/maximal performance of trained predictive models.

Next, several explanation models are trained for each selected ensemble predictive model. Every predictive model is explained by the “Deep,” “Grad,” and “Random” explainers, except for the 6-layer model, for which it was not possible to train a “Deep” explainer, apparently due to a bug in the original SHAP code triggered by a particular condition in this study’s data or neural network architecture. Fixing and investigating this issue was beyond the scope of this study. As will be demonstrated, no statistically significant difference exists between the accuracy of these explainers.
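For context, a minimal sketch of how such explainers are typically instantiated with the SHAP library follows; the small Keras MLP and synthetic data are illustrative assumptions rather than the paper’s architecture, and compatibility of DeepSHAP with specific TensorFlow versions can vary, consistent with the 6-layer failure reported above.

import numpy as np
import shap
import tensorflow as tf

# Synthetic stand-in for the writing-feature matrix (placeholder data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50)).astype("float32")
y = (X[:, 0] + rng.normal(size=200)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

background = X[:50]
deep_explainer = shap.DeepExplainer(model, background)      # DeepSHAP
grad_explainer = shap.GradientExplainer(model, background)  # GradientSHAP
phi_deep = deep_explainer.shap_values(X[:5])
phi_grad = grad_explainer.shap_values(X[:5])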

The “Random” explainer serves as a baseline model for comparison purposes. Remember that, to evaluate the reliability of explanation models, the concurrent impact of randomly selecting and ignoring a subset of features on the prediction and explanation of rubric scores is analyzed. If the prediction changes significantly and its corresponding explanation changes accordingly (beyond a set threshold), that is a true positive; if the prediction remains within the threshold and so does the explanation, that is a true negative; in both cases the explanation is deemed trustworthy. The Random explainer simulates random explanations by randomly selecting 32 non-zero features from the original set of 282 features. These random explanations consist only of non-zero features because, according to SHAP’s missingness property, a feature with a zero or missing value is never assigned any contribution to the prediction. If at least one of these 32 features is also an element of the subset of ignored features, then the explanation is considered untrustworthy, no matter the size of the feature’s contribution.
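The snippet below is a hedged sketch of one way to implement the consistency check just described, under the assumption that “ignoring” a feature means zeroing it in the input and dropping its contribution from the explanation; the 0.10 tolerance mirrors the threshold used later in this paper, and the Ridge model, data, and helper name is_trustworthy are placeholders, not the authors’ code.

import numpy as np
import shap
from sklearn.linear_model import Ridge

def is_trustworthy(model, base_value, shap_values, x, ignored, threshold=0.10):
    """Check whether the explanation still reconstructs the prediction once the
    randomly selected `ignored` features are zeroed out."""
    x_pert = x.copy()
    x_pert[ignored] = 0.0                                     # "ignore" the selected features
    y_pert = float(model.predict(x_pert.reshape(1, -1))[0])   # perturbed prediction
    kept = np.setdiff1d(np.arange(x.size), ignored)
    y_expl = base_value + shap_values[kept].sum()             # explanation's reconstruction
    return abs(y_expl - y_pert) <= threshold

# Illustrative setup (placeholder model and data)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
y = X @ rng.normal(size=40) + rng.normal(size=300)
model = Ridge().fit(X, y)
explainer = shap.KernelExplainer(model.predict, shap.kmeans(X, 25))
phi = explainer.shap_values(X[0])
ignored = rng.choice(40, size=10, replace=False)
print(is_trustworthy(model, explainer.expected_value, phi, X[0], ignored))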

As for the 2-layer model, six different explanation models are evaluated. Recall that 2-layer models generated the lowest mean squared error (MSE) during hyperparameter optimization (see Table 1); hence, this architecture was selected to test the reliability of these various explainers. The “Kernel” explainer is the most computing-intensive and took approximately 8 h of processing; it was trained using the full distributions of feature values in the training set and shaped explanations in terms of 32 features. The “Kernel-16” and “Kernel-32” models were trained on a summary (50 k-means centroids) of the training set to accelerate the processing by about one order of magnitude (less than 1 h); the “Kernel-16” explainer derived explanations in terms of 16 features, while the “Kernel-32” explainer explained predictions through 32 features. Table 5 exhibits the descriptive accuracy of these various explanation models according to the 0.10 and 0.25 thresholds; in other words, by ignoring a subset of randomly picked features, it assesses whether or not the prediction and explanation change simultaneously. Note also how each explanation model, no matter the underlying predictive model, outperforms the “Random” model.
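As a minimal sketch of this speed/accuracy trade-off, the snippet below summarizes the background data with shap.kmeans before running KernelSHAP; the Ridge model and random data are again illustrative placeholders.

import numpy as np
import shap
from sklearn.linear_model import Ridge

# Illustrative data and model (placeholders for the 282-feature AES model)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = X[:, 0] - X[:, 1] + rng.normal(size=300)
model = Ridge().fit(X, y)

explainer_full = shap.KernelExplainer(model.predict, X)                   # full background: slow
explainer_fast = shap.KernelExplainer(model.predict, shap.kmeans(X, 50))  # 50 centroids: much faster

phi_16 = explainer_fast.shap_values(X[:5], l1_reg="num_features(16)")  # analogous to "Kernel-16"
phi_32 = explainer_fast.shap_values(X[:5], l1_reg="num_features(32)")  # analogous to "Kernel-32"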


Table 5. Precision, recall, and F1 scores of the various explainers tested per type of predictive model.

The first research question addressed in this subsection asks whether there exists a statistically significant difference between the “Kernel” explainer, which generates 32-feature explanations and is trained on the whole training set, and the “Kernel-32” explainer, which also generates 32-feature explanations but is trained on a summary of the training set. To determine this, an independent t-test was conducted using the precision, recall, and F1-score distributions (500 iterations) of both explainers. Table 6 reports the p-values of all the tests for the 0.10 and 0.25 thresholds. It reveals that there is no statistically significant difference between the two explainers.
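A minimal sketch of such a comparison with SciPy is shown below; the two 500-value distributions are randomly generated stand-ins, not the study’s measured scores.

import numpy as np
from scipy import stats

# Random stand-ins for the 500-iteration F1-score distributions of two explainers
rng = np.random.default_rng(0)
f1_kernel = rng.normal(loc=0.64, scale=0.05, size=500)      # "Kernel"
f1_kernel_32 = rng.normal(loc=0.64, scale=0.05, size=500)   # "Kernel-32"

t_stat, p_value = stats.ttest_ind(f1_kernel, f1_kernel_32)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # p > 0.05 -> no statistically significant difference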


Table 6. p-values of independent t-tests comparing whether there exist statistically significant differences between the mean precisions, recalls, and F1-scores of 2-layer explainers and between those of the 2-layer, 4-layer, and 6-layer Gradient explainers.

The next research question tests whether there exists a difference in the trustworthiness of explainers shaping 16- versus 32-feature explanations. Again, t-tests were conducted to verify this, and Table 6 lists the resulting p-values. Once more, there is no statistically significant difference in the average precisions, recalls, and F1-scores of the two explainers.

This leads to investigating whether the “Kernel,” “Deep,” and “Grad” explainers are equivalent. Table 6 exhibits the results of the t-tests conducted to verify this and reveals that none of the explainers performs statistically significantly better than the others.

Armed with this evidence, it is now possible to verify whether deeper MLP neural networks produce more trustworthy explanation models. For this purpose, the performance of the “Grad” explainer is compared across the types of predictive model, using the same methodology as above. Table 6 confirms that the explanation model of the 2-layer predictive model is statistically significantly less trustworthy than the 4-layer model’s explanation model, and the same can be said of the 4-layer relative to the 6-layer model. The only exceptions are the differences in average precision between the 2-layer and 4-layer models and between the 4-layer and 6-layer models; however, there clearly exists a statistically significant difference in terms of precision (as well as recall and F1-score) between the 2-layer and 6-layer models.

The Best Subset of Essays to Judge AES Relevancy

Table 7 lists the four best essays optimizing feature coverage (93.9%) along with their resolved and predicted scores. Notice how, among the four essays picked by the adapted SP-LIME algorithm, two exhibit strong disagreement between the human and machine graders, two contain short and trivial text, and two show perfect agreement between the human and machine graders. Interestingly, each pair of longer and shorter essays exposes both strong agreement and strong disagreement between the human and AI agents, offering an opportunity to debug the model and evaluate its ability to detect the presence or absence of more basic aspects (e.g., a very small number of words, occurrences of sentence fragments) and more advanced aspects (e.g., cohesion between adjacent sentences, variety of sentence structures) of narrative essay writing, and to reward or penalize them appropriately.


Table 7. Set of best essays to evaluate the correctness of the 6-layer ensemble AES model.

Local Explanation: The Decision Plot

The decision plot lists writing features by order of importance from top to bottom. The line segments display the contribution (SHAP value) of each feature to the predicted rubric score. Note that an actual decision plot consists of all 282 features and that only its top portion (the 20 most important features) can be displayed (see Figure 3). A decision plot is read from bottom to top: the line starts at the base value and ends at the predicted rubric score. Given that the “Grad” explainer is the only explainer common to all predictive models, it has been selected to derive all explanations. The decision plots in Figure 3 show the explanations of the four essays in Table 7; the dashed line in these plots represents the explanation of the most accurate predictive model, that is, the ensemble model with 6 hidden layers, which also produced the most trustworthy explanation model. The predicted rubric score of each explanation model is listed in the bottom-right legend. The writing features themselves are described in a later subsection.
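A minimal sketch of producing such a plot with shap.decision_plot follows; the model, data, and feature names are illustrative placeholders, and feature_display_range limits the view to the top 20 features, as in Figure 3.

import numpy as np
import shap
from sklearn.linear_model import Ridge

# Illustrative model and data (placeholders for the AES model and essay features)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
y = X @ rng.normal(size=40) + rng.normal(size=300)
model = Ridge().fit(X, y)

explainer = shap.KernelExplainer(model.predict, shap.kmeans(X, 25))
phi = explainer.shap_values(X[0])                  # explanation of a single "essay"
feature_names = [f"feature_{i}" for i in range(40)]

# Read bottom to top: the line starts at the base value and ends at the prediction
shap.decision_plot(explainer.expected_value, phi, feature_names=feature_names,
                   feature_display_range=slice(-1, -21, -1))  # top 20 features only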


Figure 3. Comparisons of all models’ explanations of the most representative set of four essays: (A) Essay 228, (B) Essay 68, (C) Essay 219, and (D) Essay 124.

Global Explanation: The Summary Plot

It is advantageous to use SHAP to build explanation models because it provides a single framework to discover the writing features that are important to an individual essay (local) or to a set of essays (global). While the decision plots list features of local importance, Figure 4’s summary plot ranks writing features by order of global importance (from top to bottom). All 314 essays of the testing set are represented as dots in the scatterplot of each writing feature. The position of a dot on the horizontal axis corresponds to the importance (SHAP value) of the writing feature for a specific essay, and its color indicates the magnitude of the feature value in relation to the range of all 314 feature values. For example, large or small numbers of words within an essay generally increase or decrease rubric scores by up to 1.5 and 1.0, respectively. Decision plots can also be used to find the most important features for a small subset of essays; Figure 5 demonstrates the new ordering of writing indices when aggregating the feature contributions (summing the absolute SHAP values) of the four essays in Table 7. Moreover, Figure 5 allows the contributions of a feature to various essays to be compared. Note how the orderings in Figures 3−5 can differ from each other, sharing many features of global importance while also having their own features of local importance.
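For completeness, a minimal sketch of the corresponding global view with shap.summary_plot is given below; as before, the model and the 40 placeholder features are illustrative assumptions, and max_display mirrors the 32-feature view of Figure 4.

import numpy as np
import shap
from sklearn.linear_model import Ridge

# Illustrative "testing set" with 40 placeholder features
rng = np.random.default_rng(0)
X = rng.normal(size=(314, 40))
y = X @ rng.normal(size=40) + rng.normal(size=314)
model = Ridge().fit(X, y)

explainer = shap.KernelExplainer(model.predict, shap.kmeans(X, 25))
phi = explainer.shap_values(X[:100])               # subset kept small for speed

shap.summary_plot(phi, X[:100],
                  feature_names=[f"feature_{i}" for i in range(40)],
                  max_display=32)                  # rank features by global importance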


Figure 4. Summary plot listing the 32 most important features globally.


Figure 5. Decision plot delineating the best model’s explanations of Essays 228, 68, 219, and 124 (6-layer ensemble).

Definition of Important Writing Indices

It is beyond the scope of this paper to describe all writing features thoroughly. Nevertheless, the summary and decision plots in Figures 4, 5 allow the identification of a subset of features that should be examined in order to validate this study’s predictive model. Supplementary Table 1 combines and describes the 38 features appearing in Figures 4, 5.

Dependence Plots

Although the summary plot in Figure 4 is insightful for determining whether small or large feature values are desirable, the dependence plots in Figure 6 prove essential for recommending whether a student should aim at increasing or decreasing the value of a specific writing feature. The dependence plots also reveal whether the student should act directly upon the targeted writing feature or indirectly on other features. The horizontal axis in each of the dependence plots in Figure 6 is the scale of the writing feature, and the vertical axis is the scale of that feature’s contributions to the predicted rubric scores. Each dot in a dependence plot represents one of the 314 essays of the testing set, that is, the feature value and SHAP value belonging to that essay. The vertical dispersion of the dots over small intervals of the horizontal axis is indicative of interaction with other features (Molnar, 2020). If the vertical dispersion is widespread (e.g., the [50, 100] horizontal-axis interval in the “word_count” dependence plot), then the contribution of the writing feature most likely depends to some degree on other writing feature(s).
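A minimal sketch of generating such a plot with shap.dependence_plot is shown below; “word_count” is used here only as a placeholder feature name on synthetic data, and the interaction coloring is chosen automatically by the library.

import numpy as np
import shap
from sklearn.linear_model import Ridge

# Illustrative data where the first column plays the role of "word_count"
rng = np.random.default_rng(0)
feature_names = ["word_count"] + [f"feature_{i}" for i in range(1, 40)]
X = rng.normal(size=(314, 40))
y = X @ rng.normal(size=40) + rng.normal(size=314)
model = Ridge().fit(X, y)

explainer = shap.KernelExplainer(model.predict, shap.kmeans(X, 25))
phi = explainer.shap_values(X[:100])               # subset kept small for speed

# x-axis: feature value; y-axis: SHAP value; color: the feature it interacts with most
shap.dependence_plot("word_count", phi, X[:100], feature_names=feature_names)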


Figure 6. Dependence plots: the horizontal axes represent feature values while vertical axes represent feature contributions (SHAP values). Each dot represents one of the 314 essays of the testing set and is colored according to the value of the feature with which it interacts most strongly. (A) word_count. (B) hdd42_aw. (C) ncomp_stdev. (D) dobj_per_cl. (E) grammar. (F) SENTENCE_FRAGMENT. (G) Sv_GI. (H) adjacent_overlap_verb_sent.

The contributions of this paper can be summarized as follows: (1) it proposes a means (SHAP) to explain individual predictions of AES systems and provides flexible guidelines to build powerful predictive models using more complex algorithms such as ensembles and deep learning neural networks; (2) it applies a methodology to quantitatively assess the trustworthiness of explanation models; (3) it tests whether faster SHAP implementations impact the descriptive accuracy of explanation models, giving insight into the applicability of SHAP in real pedagogical contexts such as AES; (4) it offers a toolkit to debug AES models, highlights linguistic intricacies, and underscores the means to offer formative feedback to novice writers; and, more importantly, (5) it empowers learning analytics practitioners to make AI pedagogical agents accountable to the human educator, the ultimate problem holder responsible for the decisions and actions of AI (Abbass, 2019). In this view, learning analytics (which encompasses tools such as AES) is characterized as an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that recurrently measures and proactively advances knowledge boundaries in human learning.

To exemplify this, imagine an AES system that supports instructors in the detection of plagiarism, gaming behaviors, and the marking of writing activities. As previously mentioned, essays are marked according to a grid of scoring rubrics: ideas, organization, style, and conventions. While an abundance of data (e.g., the 1592 writing metrics) can be collected by the AES tool, these data might still be insufficient to automate the scoring process of certain rubrics (e.g., ideas). Nevertheless, some scoring subtasks such as assessing a student’s vocabulary, sentence fluency, and conventions might still be assigned to AI, since the data types available through existing automatic linguistic analysis tools prove sufficient to reliably alleviate the human marker’s workload. Interestingly, learning analytics is key to the accountability of AI agents to the human problem holder. As the volume of writing data (through a large student population, high-frequency capture of learning episodes, and a variety of big learning data) accumulates in the system, new AI agents (predictive models) may apply for the job of “automarker.” These AI agents can be quite transparent through XAI (Arrieta et al., 2020) explanation models, and a human instructor may assess the suitability of an agent for the job and hire the candidate agent that comes closest to human performance. Explanations derived from these models could also serve as formative feedback to the students.

The AI marker can be assigned to assess writing activities that are similar to those previously scored by the human marker(s) from whom it learns. Dissimilar and unseen essays can be automatically routed to the human marker for reliable scoring, and the AI agent can learn from this manual scoring. To ensure accountability, students should be allowed to appeal the AI agent’s marking to the human marker. In addition, the human marker should be empowered to monitor and validate the scoring of select writing rubrics scored by the AI marker. If the human marker does not agree with the machine scores, the writing assignments may be flagged as incorrectly scored and reassigned to a human marker; these flagged assignments may then serve to update the predictive models. Moreover, among the essays assigned to the machine marker, a small subset can be simultaneously assigned to the human marker for continuous quality control, that is, to keep checking whether the agreement level between human and machine markers remains within an acceptable threshold. The human marker should at any time be able to “fire” an AI marker or “hire” one from a pool of potential machine markers.

This notion of a human-AI fusion has been observed in previous AES systems, where the human marker’s workload has been significantly alleviated, passing from scoring several hundred essays to just a few dozen (Dronen et al., 2015; Hellman et al., 2019). As the AES technology matures and learning analytics tools continue to penetrate the education market, this alliance of semi-autonomous human and AI agents will lead to better evidence-based/informed pedagogy (Nelson and Campbell, 2017). Such a human-AI alliance can also be guided to autonomously self-regulate its own hypothesis-authoring and data-acquisition processes for the purpose of measuring and advancing knowledge boundaries in human learning.

Real-Time Formative Pedagogical Feedback

This paper provides evidence that deep learning and SHAP can be used not only to score essays automatically but also to offer explanations in real time. More specifically, the processing time needed to derive the 314 explanations of the testing set’s essays was benchmarked for several types of explainers. It was found that the faster DeepSHAP and GradientSHAP implementations, which took only a few seconds of processing, did not produce less accurate explanations than the much slower KernelSHAP, which took approximately 8 h of processing to derive the explanation model of the 2-layer MLP predictive model and 16 h for the 6-layer predictive model.

This finding also holds for the various configurations of KernelSHAP, where the number of features (16 vs. 32) shaping the explanation (with all other features assigned zero contributions) did not produce a statistically significant difference in the reliability of the explanation models. On average, the models had a precision between 63.9 and 64.1% and a recall between 41.0 and 42.9%. This means that, after perturbation of the predictive and explanation models, on average 64% of the predictions that the explanation models identified as changing had actually changed, while only about 42% of all predictions that changed were detected by the various 2-layer explainers. An explanation was considered untrustworthy if the sum of its feature contributions, when added to the average prediction (base value), was not within 0.1 of the perturbed prediction. Similarly, the average precision and recall of 2-layer explainers at the 0.25 threshold were about 69% and 62%, respectively.

Impact of Deep Learning on Descriptive Accuracy of Explanations

By analyzing the performance of the various predictive models in Table 4, no clear conclusion can be reached as to which model should be deemed the most desirable. Although the 6-layer models slightly outperform the other models in terms of accuracy (percentage of exact matches between the resolved [human] and predicted [machine] scores), they are not the best when it comes to the percentages of adjacent (within 1 and 2) matches. Even if the selection of the “best” model is based on the quadratic weighted kappas, the decision remains a nebulous one. Moreover, ensuring that machine learning actually learned something meaningful remains paramount, especially in contexts where the performance of a majority classifier is close to the human and machine performance. For example, a majority classifier would get 46.3% of predictions right (Table 4), while the trained predictive models at best produce accurate predictions between 51.9 and 55.1%.

Since the interpretability of a machine learning model should be prioritized over accuracy (Ribeiro et al., 2016; Murdoch et al., 2019) for reasons of transparency and trust, this paper investigated whether the impact of the depth of an MLP neural network might be more visible when assessing its interpretability, that is, the trustworthiness of its corresponding SHAP explanation model. The data in Tables 1, 5, 6 effectively support the hypothesis that as the depth of the neural network increases, the precision and recall of the corresponding explanation model improve. This observation is particularly interesting because the 4-layer (Grad) explainer, which has hardly more parameters than the 2-layer model, is also more accurate than the 2-layer model, suggesting that the 6-layer explainer is most likely superior to the other explainers not only because of its greater number of parameters but also because of its number of hidden layers. By increasing the number of hidden layers, the precision and recall of an explanation model increase on average from approximately 64 to 73% and from 42 to 52%, respectively, at the 0.10 threshold; at the 0.25 threshold, they increase from 69 to 79% and from 62 to 75%, respectively.

These results imply that the descriptive accuracy of an explanation model is evidence of effective machine learning, evidence which may exceed the level of agreement between the human and machine graders. Moreover, given that the superiority of a trained predictive model over a majority classifier is not always obvious, the consistency of its associated explanation model demonstrates it better. Note that, theoretically, the SHAP explanation model of the majority classifier should assign a zero contribution to each writing feature, since the average prediction of such a model is the most frequent rubric score given by the human raters; hence, the base value is the explanation.

An interesting fact emerges from Figure 3: all explainers (2-layer to 6-layer) are more or less similar and do not appear to contradict each other. More specifically, they all agree on the direction of the contributions of the most important features; in other words, they unanimously determine whether a feature should increase or decrease the predicted score. They differ from each other only in the magnitude of the feature contributions.

To conclude, this study highlights the need to train predictive models that take into account the descriptive accuracy of explanations. Just as explanation models consider predictions to derive explanations, explanations should be considered when training predictive models. This would not only help train interpretable models from the start but also potentially break the status quo that may exist among similar explainers and produce more powerful models. In addition, this research calls for a mechanism (e.g., causal diagrams) to allow teachers to guide the training process of predictive models. Put another way, as learning analytics practitioners debug predictive models, their insights should be encoded in a language that the machine understands and that guides the training process, so as to avoid learning the same errors and to accelerate training.

Accountable AES

Now that the superiority of the 6-layer predictive and explanation models has been demonstrated, some aspects of the relevancy of explanations should be examined more deeply, since having an explanation model consistent with its underlying predictive model does not guarantee relevant explanations. Table 7 discloses the set of four essays that optimize the coverage of the most globally important features in order to evaluate the correctness of the best AES model. It is quite intriguing that two of the four essays are among the 16 essays with a major disagreement (off by 2) between the resolved and predicted rubric scores (1 vs. 3 and 4 vs. 2). The AES tool clearly overrated Essay 228, while it underrated Essay 219. Naturally, these two essays offer an opportunity to understand what is wrong with the model and ultimately to debug it to improve its accuracy and interpretability.

In particular, Essay 228 raises suspicion about the positive contributions of features such as “Ortho_N,” “lemma_mattr,” “all_logical,” “det_pobj_deps_struct,” and “dobj_per_cl.” Moreover, notice how the remaining 262 less important features (not visible in the decision plot in Figure 5) have already inflated the rubric score beyond the base value, more than for any other essay. Given the very short length and very low quality of the essay, whose meaning is seriously undermined by spelling and grammatical errors, it is of utmost importance to verify how some of these features are computed. For example, is the average number of orthographic neighbors (Ortho_N) per token computed for meaningless tokens such as “R” and “whe”? Similarly, are these tokens counted as types in the type-token ratio over lemmas (lemma_mattr)? Given the absence of a meaningful grammatical structure conveying a complete idea through well-articulated words, it becomes obvious that the quality of natural language processing (NLP) parsing may become a source of (measurement) bias impacting both the way some writing features are computed and the predicted rubric score. To remedy this, two solutions are proposed: (1) enhancing the dataset with the part-of-speech sequence or the structure of dependency relationships along with associated confidence levels, or (2) augmenting the essay dataset with essays containing various types of nonsensical content to improve the learning of these feature contributions.

Note that all four essays have a text length shorter than the average of 171 words. Notice also how “hdd42_aw” and “hdd42_fw” play a significant role in decreasing the predicted scores of Essays 228 and 68. These metrics require a minimum of 42 tokens to compute a non-zero D index, a measure of lexical diversity, as explained in Supplementary Table 1. Figure 6B also shows how zero “hdd42_aw” values are heavily penalized. This is additional evidence of the strong role that the number of words plays in determining these rubric scores, especially for very short essays where it is one of the few observations that can be reliably recorded.

Two other issues with the best trained AES model were identified. First, in the eyes of the model, the lower the average number of direct objects per clause (dobj_per_cl), as seen in Figure 6D, the better. This appears to contradict one of the requirements of the “Style” rubric, which looks for a variety of sentence structures. Remember that direct objects imply the presence of transitive verbs (action verbs) and that the balanced usage of linking and action verbs, as well as of transitive and intransitive verbs, is key to meeting the requirement of variety of sentence structures. Moreover, note that the writing feature counts direct objects per clause, not per sentence: at most one direct object is possible per clause, whereas a sentence may contain several clauses (which determines whether it is a simple, compound, or complex sentence) and may therefore have multiple direct objects, so that a high ratio of direct objects per clause is indicative of sentence complexity. Too much complexity is also undesirable. Hence, it is fair to conclude that the higher range of feature values has reasonable feature contributions (SHAP values), while the lower range does not capture well the requirements of the rubric; the dependence plot should rather display a positive peak somewhere in the middle. Notice how the poor quality of Essay 228’s single sentence prevented the proper detection of its single direct object (“broke my finger”), and the supposed absence of direct objects was one of the reasons the predicted rubric score was wrongfully improved.

The model’s second issue discussed here concerns sentence fragments, a type of grammatical error. Essentially, a sentence fragment is a clause that misses one of three critical components: a subject, a verb, or a complete idea. Figure 6E shows the contribution model of grammatical errors, all types combined, while Figure 6F shows specifically the contribution model of sentence fragments. It is interesting to see how SHAP penalizes larger numbers of grammatical errors more heavily and that it takes into account the length of the essay (red dots represent essays with larger numbers of words; blue dots represent essays with smaller numbers of words). For example, except for essays with no identified grammatical errors, longer essays are less penalized than shorter ones; this is particularly obvious when there are 2−4 grammatical errors. The model increases the predicted rubric score only when there is no grammatical error, and it tolerates longer essays with a single grammatical error, which sounds quite reasonable. On the other hand, the model treats high numbers of sentence fragments, a non-trivial type of grammatical error, as desirable. Even worse, the model decreases the rubric score of essays having no sentence fragment at all. Although grammatical issues are beyond the scope of the “Style” rubric, the model has probably included these features because of their impact on the quality of the assessment of vocabulary usage and sentence fluency. The reader should observe how the very poor quality of an essay can even prevent the detection of such fundamental grammatical errors, as in the case of Essay 228, where the AES tool did not find any grammatical error or sentence fragment. Therefore, there should be a way for AES systems to detect a minimum level of text quality before attempting to score an essay. Note that the objective of this section was not to undertake a thorough debugging of the model, but rather to underscore the effectiveness of SHAP in doing so.

Formative Feedback

Once an AES model is considered reasonably valid, SHAP can be a suitable formalism to empower the machine to provide formative feedback. For instance, the explanation of Essay 124, which was assigned a rubric score of 3 by both human and machine markers, indicates that the top two factors decreasing the predicted rubric score are: (1) the essay length being smaller than average, and (2) the average number of verb lemma types occurring at least once in the next sentence (adjacent_overlap_verb_sent). Figures 6A,H give the overall picture in which the realism of the contributions of these two features can be analyzed. More specifically, Essay 124 is one of very few essays (Figure 6H) that make redundant use of the same verbs across adjacent sentences. Moreover, the essay displays poor sentence fluency, with everything expressed in only two sentences. To understand more accurately the impact of “adjacent_overlap_verb_sent” on the prediction, a few spelling errors were corrected and the text was divided into four sentences instead of two. Revision 1 in Table 8 exhibits the corrections made to the original essay. The dashed line of the decision plot in Figure 3D represents the original explanation of Essay 124, while Figure 7A shows the new explanation of the revised essay. The “adjacent_overlap_verb_sent” feature is still the second most important feature in this new explanation, with a feature value of 0.429, which is still considered very poor according to the dependence plot in Figure 6H.


Table 8. Revisions of Essay 124: improvement of sentence splitting, correction of some spelling errors, and elimination of redundant usage of same verbs (bold for emphasis in Essay 124’s original version; corrections in bold for Revisions 1 and 2).


Figure 7. Explanations of the various versions of Essay 124 and evaluation of feature effect for a range of feature values. (A) Explanation of Essay 124’s first revision. (B) Forecasting the effect of changing the ‘adjacent_overlap_verb_sent’ feature on the rubric score. (C) Explanation of Essay 124’s second revision. (D) Comparison of the explanations of all Essay 124’s versions.

To show how SHAP could be leveraged to offer remedial formative feedback, the revised version of Essay 124 is explained again for eight different values of “adjacent_overlap_verb_sent” (0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.0), while keeping the values of all other features constant. The set of these eight essays is explained by a newly trained SHAP explainer (Gradient), producing new SHAP values for each feature and each “revised” essay. Notice how the new model, called the feedback model, allows one to foresee by how much a novice writer can hope to improve his or her score according to the “Style” rubric. If the student employs different verbs in every sentence, the feedback model estimates that the rubric score could be improved from 3.47 up to 3.65 (Figure 7B). The dashed line represents Revision 1, while the other lines correspond to the seven other altered essays. Moreover, it is important to note how changing the value of a single feature may influence the contributions that other features have on the predicted score. Again, all explanations look similar in terms of direction, but certain features differ in the magnitude of their contributions. However, the reader should observe how the targeted feature varies not only in magnitude but also in direction, allowing the student to ponder the relevancy of executing the recommended writing strategy.
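The sketch below is a hedged illustration of this “feedback model” idea: copies of a single feature vector are re-predicted and re-explained while one feature is swept over a grid and all others are held constant. The helper name sweep_feature, the Ridge model, and the chosen feature index are illustrative assumptions, not the authors’ implementation.

import numpy as np
import shap
from sklearn.linear_model import Ridge

def sweep_feature(model, explainer, x, feature_index, grid):
    """Predict and explain copies of `x` in which only the feature at
    `feature_index` is swept over `grid`; all other features stay constant."""
    variants = np.tile(x, (len(grid), 1))
    variants[:, feature_index] = grid                        # alter only the targeted feature
    return model.predict(variants), explainer.shap_values(variants)

# Illustrative model, data, and explainer (placeholders)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
y = X @ rng.normal(size=40) + rng.normal(size=300)
model = Ridge().fit(X, y)
explainer = shap.KernelExplainer(model.predict, shap.kmeans(X, 25))

# Grid mirroring the eight values used for 'adjacent_overlap_verb_sent'
grid = np.array([0.0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.0])
preds, phis = sweep_feature(model, explainer, X[0], feature_index=3, grid=grid)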

Thus, upon receiving this feedback, assume that a student sets the goal of improving the effectiveness of his or her verb choice by eliminating any redundant verb, producing Revision 2 in Table 8. The student submits the essay again to the AES system, which now gives a new rubric score of 3.98, a significant improvement from the previous 3.47 that allows the student to obtain a 4 instead of a 3. Figure 7C exhibits the decision plot of Revision 2. To better observe how the various revisions of the student’s essay changed over time, their respective explanations have been plotted in the same decision plot (Figure 7D). Notice that this time the ordering of the features has changed to list the features of common importance to all of the essay’s versions. The feature ordering in Figures 7A−C follows the same ordering as in Figure 3D, the decision plot of the original essay. These figures underscore the importance of tracking the interaction between the various features so that the model understands well the impact that changing one feature has on the others. TreeSHAP, a SHAP implementation for tree-based models, offers this capability, and its potential for improving the quality of feedback provided to novice writers will be tested in a future version of this AES system.

This paper serves as a proof of concept of the applicability of XAI techniques to automated essay scoring, providing learning analytics practitioners and educators with a methodology for “hiring” AI markers and making them accountable to their human counterparts. In addition to debugging predictive models, SHAP explanation models can serve as a formalism within a broader learning analytics platform, where aspects of prescriptive analytics (the provision of remedial formative feedback) can be added on top of the more pervasive predictive analytics.

However, the main weakness of the approach put forward in this paper is that it omits many types of spatio-temporal data. In other words, it ignores precious information inherent to the writing process, which may prove essential to inferring the intent of the student, especially in contexts of poor sentence structure and high grammatical inaccuracy. Hence, this paper calls for adapting current NLP technologies to educational purposes, where the quality of writing may be suboptimal, contrary to many idealized scenarios where NLP is used for content analysis, opinion mining, topic modeling, or fact extraction trained on corpora of high-quality texts. By capturing the writing process preceding the submission of an essay to an AES tool, other kinds of explanation models could also be trained to offer feedback not only from a linguistic perspective but also from a behavioral one (e.g., composing vs. revising); that is, the AES system could inform novice writers about suboptimal and optimal writing strategies (e.g., planning a revision phase after bursts of writing).

In addition, associating sections of text with suboptimal writing features, those whose contributions lower the predicted score, would be much more informative. This spatial information would not only point out what is wrong but also where it is wrong, more efficiently answering the question of why an essay is wrong. This problem could be approached simply through a multiple-input, mixed-data, feature-based (MLP) neural network architecture fed by both linguistic indices and textual data (n-grams), where the SHAP explanation model would assign feature contributions to both types of features and to any potential interaction between them. A more complex approach could address the problem through special types of recurrent neural networks such as Ordered-Neurons LSTMs (long short-term memory), which are well adapted to the parsing of natural language and capture not only the natural sequence of the text but also its hierarchy of constituents (Shen et al., 2018). After all, this paper highlights the fact that the potential of deep learning can reach beyond the training of powerful predictive models and be even more visible in the higher trustworthiness of explanation models. This paper also calls for optimizing the training of predictive models by considering the descriptive accuracy of explanations and the human expert’s qualitative knowledge (e.g., indicating the direction of feature contributions) during the training process.

Data Availability Statement

The datasets and code of this study can be found in the Open Science Framework online repository: https://osf.io/fxvru/.

Author Contributions

VK architected the concept of an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that measures and advances knowledge boundaries in human learning, which essentially defines the key traits of learning analytics. DB was responsible for its implementation in the area of explainable automated essay scoring and for the training and validation of the predictive and explanation models. Together they offer an XAI-based proof of concept of a prescriptive model that can offer real-time formative remedial feedback to novice writers. Both authors contributed to the article and approved its publication.

Funding

Research reported in this article was supported by the Academic Research Fund (ARF) publication grant of Athabasca University under award number 24087.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2020.572367/full#supplementary-material

  • ^ https://www.kaggle.com/c/asap-aes
  • ^ https://www.linguisticanalysistools.org/

Abbass, H. A. (2019). Social integration of artificial intelligence: functions, automation allocation logic and human-autonomy trust. Cogn. Comput. 11, 159–171. doi: 10.1007/s12559-018-9619-0


Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi: 10.1109/ACCESS.2018.2870052

Amorim, E., Cançado, M., and Veloso, A. (2018). “Automated essay scoring in the presence of biased ratings,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , New Orleans, LA, 229–237.


Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., et al. (2020). Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inform. Fusion 58, 82–115. doi: 10.1016/j.inffus.2019.12.012

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., et al. (2007). The English lexicon project. Behav. Res. Methods 39, 445–459. doi: 10.3758/BF03193014


Boulanger, D., and Kumar, V. (2018). “Deep learning in automated essay scoring,” in Proceedings of the International Conference of Intelligent Tutoring Systems , eds R. Nkambou, R. Azevedo, and J. Vassileva (Cham: Springer International Publishing), 294–299. doi: 10.1007/978-3-319-91464-0_30

Boulanger, D., and Kumar, V. (2019). “Shedding light on the automated essay scoring process,” in Proceedings of the International Conference on Educational Data Mining , 512–515.

Boulanger, D., and Kumar, V. (2020). “SHAPed automated essay scoring: explaining writing features’ contributions to English writing organization,” in Intelligent Tutoring Systems , eds V. Kumar and C. Troussas (Cham: Springer International Publishing), 68–78. doi: 10.1007/978-3-030-49663-0_10

Chen, H., Lundberg, S., and Lee, S.-I. (2019). Explaining models by propagating Shapley values of local components. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1911.11888 (accessed September 22, 2020).

Crossley, S. A., Bradfield, F., and Bustamante, A. (2019). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. J. Writ. Res. 11, 251–270. doi: 10.17239/jowr-2019.11.02.01

Crossley, S. A., Kyle, K., and McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): automatic assessment of local, global, and text cohesion. Behav. Res. Methods 48, 1227–1237. doi: 10.3758/s13428-015-0651-7

Crossley, S. A., Kyle, K., and McNamara, D. S. (2017). Sentiment analysis and social cognition engine (SEANCE): an automatic tool for sentiment, social cognition, and social-order analysis. Behav. Res. Methods 49, 803–821. doi: 10.3758/s13428-016-0743-z

Dronen, N., Foltz, P. W., and Habermehl, K. (2015). “Effective sampling for large-scale automated writing evaluation systems,” in Proceedings of the Second (2015) ACM Conference on Learning @ Scale , 3–10.

Goldin, I., Narciss, S., Foltz, P., and Bauer, M. (2017). New directions in formative feedback in interactive learning environments. Int. J. Artif. Intellig. Educ. 27, 385–392. doi: 10.1007/s40593-016-0135-7

Hao, Q., and Tsikerdekis, M. (2019). “How automated feedback is delivered matters: formative feedback and knowledge transfer,” in Proceedings of the 2019 IEEE Frontiers in Education Conference (FIE) , Covington, KY, 1–6.

Hellman, S., Rosenstein, M., Gorman, A., Murray, W., Becker, L., Baikadi, A., et al. (2019). “Scaling up writing in the curriculum: batch mode active learning for automated essay scoring,” in Proceedings of the Sixth (2019) ACM Conference on Learning @ Scale , (New York, NY: Association for Computing Machinery).

Hussein, M. A., Hassan, H., and Nassef, M. (2019). Automated language essay scoring systems: a literature review. PeerJ Comput. Sci. 5:e208. doi: 10.7717/peerj-cs.208

Kumar, V., and Boulanger, D. (2020). Automated essay scoring and the deep learning black box: how are rubric scores determined? Int. J. Artif. Intellig. Educ. doi: 10.1007/s40593-020-00211-5

Kumar, V., Fraser, S. N., and Boulanger, D. (2017). Discovering the predictive power of five baseline writing competences. J. Writ. Anal. 1, 176–226.

Kyle, K. (2016). Measuring Syntactic Development In L2 Writing: Fine Grained Indices Of Syntactic Complexity And Usage-Based Indices Of Syntactic Sophistication. Dissertation, Georgia State University, Atlanta, GA.

Kyle, K., Crossley, S., and Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): version 2.0. Behav. Res. Methods 50, 1030–1046. doi: 10.3758/s13428-017-0924-4

Lundberg, S. M., Erion, G. G., and Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1802.03888 (accessed September 22, 2020).

Lundberg, S. M., and Lee, S.-I. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems , eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, et al. (Red Hook, NY: Curran Associates, Inc), 4765–4774.

Madnani, N., and Cahill, A. (2018). “Automated scoring: beyond natural language processing,” in Proceedings of the 27th International Conference on Computational Linguistics , (Santa Fe: Association for Computational Linguistics), 1099–1109.

Madnani, N., Loukina, A., von Davier, A., Burstein, J., and Cahill, A. (2017). “Building better open-source tools to support fairness in automated scoring,” in Proceedings of the First (ACL) Workshop on Ethics in Natural Language Processing , (Valencia: Association for Computational Linguistics), 41–52.

McCarthy, P. M., and Jarvis, S. (2010). MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 42, 381–392. doi: 10.3758/brm.42.2.381

Mizumoto, T., Ouchi, H., Isobe, Y., Reisert, P., Nagata, R., Sekine, S., et al. (2019). “Analytic score prediction and justification identification in automated short answer scoring,” in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications , Florence, 316–325.

Molnar, C. (2020). Interpretable Machine Learning. Abu Dhabi: Lulu.

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. U.S.A. 116, 22071–22080. doi: 10.1073/pnas.1900654116

Nelson, J., and Campbell, C. (2017). Evidence-informed practice in education: meanings and applications. Educ. Res. 59, 127–135. doi: 10.1080/00131881.2017.1314115

Rahimi, Z., Litman, D., Correnti, R., Wang, E., and Matsumura, L. C. (2017). Assessing students’ use of evidence and organization in response-to-text writing: using natural language processing for rubric-based automated scoring. Int. J. Artif. Intellig. Educ. 27, 694–728. doi: 10.1007/s40593-017-0143-2

Reinertsen, N. (2018). Why can’t it mark this one? A qualitative analysis of student writing rejected by an automated essay scoring system. English Austral. 53:52.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should i trust you?”: explaining the predictions of any classifier. CoRR, abs/1602.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1602.04938 (accessed September 22, 2020).

Rupp, A. A. (2018). Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl. Meas. Educ. 31, 191–214. doi: 10.1080/08957347.2018.1464448

Rupp, A. A., Casabianca, J. M., Krüger, M., Keller, S., and Köller, O. (2019). Automated essay scoring at scale: a case study in Switzerland and Germany. ETS Res. Rep. Ser. 2019, 1–23. doi: 10.1002/ets2.12249

Shen, Y., Tan, S., Sordoni, A., and Courville, A. C. (2018). Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. CoRR, abs/1810.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1810.09536 (accessed September 22, 2020).

Shermis, M. D. (2014). State-of-the-art automated essay scoring: competition, results, and future directions from a United States demonstration. Assess. Writ. 20, 53–76. doi: 10.1016/j.asw.2013.04.001

Taghipour, K. (2017). Robust Trait-Specific Essay Scoring using Neural Networks and Density Estimators. Dissertation, National University of Singapore, Singapore.

West-Smith, P., Butler, S., and Mayfield, E. (2018). “Trustworthy automated essay scoring without explicit construct validity,” in Proceedings of the 2018 AAAI Spring Symposium Series , (New York, NY: ACM).

Woods, B., Adamson, D., Miel, S., and Mayfield, E. (2017). “Formative essay feedback using predictive scoring models,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , (New York, NY: ACM), 2071–2080.

Keywords: explainable artificial intelligence, SHAP, automated essay scoring, deep learning, trust, learning analytics, feedback, rubric

Citation: Kumar V and Boulanger D (2020) Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value. Front. Educ. 5:572367. doi: 10.3389/feduc.2020.572367

Received: 14 June 2020; Accepted: 09 September 2020; Published: 06 October 2020.


Copyright © 2020 Kumar and Boulanger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: David Boulanger, [email protected]

This article is part of the Research Topic: Learning Analytics for Supporting Individualization: Data-informed Adaptation of Learning.

Revolutionize Your Writing Process with Smodin AI Grader: A Smarter Way to Get Feedback and Achieve Academic Excellence!


For Students

Stay ahead of the curve, with objective feedback and tools to improve your writing.

Your Virtual Tutor

Harness the expertise of a real-time virtual teacher who will guide every paragraph in your writing process, ensuring you produce an A+ masterpiece in a fraction of the time.

Unbiased Evaluation

Ensure an impartial and objective assessment, removing any potential bias or subjectivity that may be an influence in traditional grading methods.

Perfect your assignments

With the “Write with AI” tool, transform your ideas into words in a few simple clicks. Excel at all your essays, assignments, reports, and more, and watch your writing skills soar to new heights.

For teachers

Revolutionize your Teaching Methods

Spend less on grading

Embrace the power of efficiency and instant feedback with our cutting-edge tool, designed to save you time while providing a fair and unbiased evaluation, delivering consistent and objective feedback.

Reach out to more students

Upload documents in bulk and establish your custom assessment criteria, ensuring a tailored evaluation process. Expand your reach and impact by engaging with more students.

Focus on what you love

Let AI Grading handle the heavy lifting of assessments for you. With its data-driven algorithms and standardized criteria, it takes care of all your grading tasks, freeing up your valuable time to do what you're passionate about: teaching.

Grader Rubrics

Pick the systematic frameworks that work as guidelines for assessing and evaluating the quality, proficiency, and alignment of your work, allowing for consistent and objective grading without any bias.

Analytical Thinking

Originality

Organization

Focus Point

Write with AI

Set your tone and keywords, and generate brilliance through your words


AI Grader Average Deviation from Real Grade

Our AI grader matches human scores 82% of the time* AI Scores are 100% consistent**

Deviation from real grade (10 point scale)

Graph: A dataset of essays was graded by professional graders on a range of 1-10 and cross-referenced against the detailed criteria within the rubric to determine their real scores. Deviation was defined as the variation of a score from the real score. The graph contains an overall score (the average of all criteria) as well as each individual criterion. The criteria are the premade criteria available in Smodin's AI Grader, listed in the graph as column headings. The custom rubrics were made using Smodin's AI Grader custom criteria generator to produce each criterion listed in Smodin's premade criteria (the same criteria as the column headings). The overall score for Smodin premade rubrics matched human scores 73% of the time with our advanced AI, while custom rubrics generated by Smodin's custom rubric generator matched human grades 82% of the time with our advanced AI. The average deviation from the real scores for all criteria is shown above.

* Rubrics created using Smodin's AI custom criteria matched human scores 82% of the time on the advanced AI setting. Smodin's premade criteria matched human scores 73% of the time. When the AI score differed from the human scores, 86% of the time the score only differed by 1 point on a 10 point scale.

** The AI grader provides 100% consistency, meaning that the same essay will produce the same score every time it's graded. All grades used in the data were repeated 3 times and produced 100% consistency across all 3 grading attempts.


AI Feedback

Unleash the Power of Personalized Feedback: Elevate Your Writing with the Ultimate Web-based Feedback Tool

Elevate your essay writing skills and achieve the success you deserve with Smodin AI Grader, the ultimate AI-powered essay grading tool. Whether you are a student looking to improve your grades or a teacher looking to provide valuable feedback to your students, Smodin has you covered. Get objective feedback to improve your essays and excel at writing like never before. Don't miss this opportunity to transform your essay-writing journey and unlock your full potential.

Smodin AI Grader: The Best AI Essay Grader for Writing Improvement

For teachers and students alike, writing and grading essays can be a daunting task. It takes time, effort, and a lot of attention to detail. But what if there were a tool that could make the process easier? Meet Smodin AI Grader, the best AI essay grader on the market, providing objective feedback and helping you improve your writing skills.

Objective Feedback with Smodin - The Best AI Essay Grader

Traditional grading methods can often be subjective, with different teachers providing vastly different grades for the same piece of writing. Smodin eliminates this problem by providing consistent and unbiased feedback, ensuring that all students are evaluated fairly. With advanced algorithms, Smodin can analyze and grade essays in real-time, providing instant feedback on strengths and weaknesses.

Improve Your Writing Skills with Smodin - The Best AI Essay Grader

Smodin can analyze essays quickly and accurately, providing detailed feedback on different aspects of your writing, including structure, grammar, vocabulary, and coherence. It identifies areas that need improvement and suggests how to make your writing more effective: if Smodin detects that your essay has a weak thesis statement, it will suggest how to improve it; if it detects poor grammar, it will suggest how to correct the errors. This makes it easier for you to improve your essay, get better grades, and become a better writer.

Smodin AI Grader for Teachers - The Best Essay Analysis Tool

For teachers, Smodin can be a valuable tool for grading essays quickly and efficiently, providing detailed feedback to students, and helping them improve their writing skills. With Smodin Ai Grader, teachers can grade essays in real-time, identify common errors, and provide suggestions on how to correct them.

Smodin AI Grader for Students - The Best Essay Analysis Tool

For students, Smodin can be a valuable tool for improving your writing skills and getting better grades. By analyzing your essay's strengths and weaknesses, Smodin can help you identify areas that need improvement and provide suggestions on how to make your writing more effective. This can be especially useful for students who are struggling with essay writing and need extra help and guidance.

Increase your productivity - The Best AI Essay Grader

Using Smodin can save you a lot of time and effort. Instead of spending hours grading essays manually or struggling to improve your writing without feedback, you can use Smodin to get instant and objective feedback, allowing you to focus on other important tasks.

Smodin is the best AI essay grader on the market that uses advanced algorithms to provide objective feedback and help improve writing skills. With its ability to analyze essays quickly and accurately, Smodin can help students and teachers alike to achieve better results in essay writing.


Your AI-Powered Grading System

Vexis is the ultimate grading game changer for educators. We're speeding up the grading process, freeing up your time to focus on teaching. Sync your progress, secure your students' data, and enjoy unbiased, accurate grading every time.


How it works

Upload. Grade. Share.

Personalized Feedback

Tailored insights for each student, improving their learning journey.

Unbiased Grading

Objective assessment, eliminating human bias for a fair grading system.

Detailed Reports

In-depth analysis for each answer, providing a detailed grading breakdown.

Specialized Technology

State-of-the-art tech, revolutionizing the grading process for educators.

OCR Capabilities

Transform scanned answer sheets into digital data effortlessly.

Free-Form Writing

Made for free-form checking; understands the context rather than matching keywords.


Built for productivity

Set up classes as individual projects. Enjoy unique functionalities for each class. Keep your grading organized and your teaching streamlined with Vexis.

Personalized Vex Reports

Vex reports go beyond traditional evaluations. They offer personalized strategies, pinpointing areas of improvement and providing actionable insights. This empowers students to grow, evolve, and excel, transforming the learning experience into a journey of empowerment.


Cut grading time to 10%

Invest your time in lesson planning and curriculum development, not grading.


e-rater ®  Scoring Engine

Evaluates students’ writing proficiency with automatic scoring and feedback

Select an option below to learn more.

How the e-rater engine uses AI technology

ETS is a global leader in educational assessment, measurement and learning science. Our AI technology, such as the e-rater ® scoring engine, informs decisions and creates opportunities for learners around the world.

The e-rater engine automatically:

  • assesses and nurtures key writing skills
  • scores essays and provides feedback on writing using a model built on the theory of writing to assess both analytical and independent writing skills

About the e-rater Engine

This ETS capability identifies features related to writing proficiency.
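
The e-rater engine's internal model is proprietary, but the general idea of feature-based automated essay scoring is easy to illustrate: extract shallow writing features (length, sentence structure, vocabulary variety, and so on) and fit them against human-assigned scores. The sketch below is a generic, hypothetical toy example of that approach; the features, training data, and linear model are invented for illustration and are not ETS's actual implementation.

```python
# Toy feature-based essay scoring: extract shallow writing features and fit a
# linear model to human scores. A generic illustration only, not the e-rater
# engine's actual features or model. All data here is invented.
import re
import numpy as np

def features(essay: str) -> list[float]:
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    avg_sentence_len = len(words) / max(len(sentences), 1)
    vocab_diversity = len(set(w.lower() for w in words)) / max(len(words), 1)
    return [1.0, len(words), avg_sentence_len, vocab_diversity]  # 1.0 acts as the intercept

# Hypothetical training essays with scores assigned by human raters (1-6 scale).
training = [
    ("Short answer. Few ideas.", 2),
    ("A developed paragraph with several sentences. It states a claim. "
     "It supports the claim with an example and a transition.", 4),
    ("A longer essay that organizes ideas, varies vocabulary, and uses "
     "transitions effectively. Each paragraph supports the thesis. "
     "The conclusion restates the argument clearly and concisely.", 5),
]

X = np.array([features(text) for text, _ in training])
y = np.array([score for _, score in training], dtype=float)
weights, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit to human scores

def predict(essay: str) -> float:
    return float(np.clip(np.array(features(essay)) @ weights, 1, 6))

print(round(predict("A new essay with a clear claim, evidence, and a conclusion."), 1))
```

In practice, production engines are trained on large sets of human-scored essays and use far richer linguistic features, but the train-then-predict structure is the same.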

How It Works

See how the e-rater engine provides scoring and writing feedback.

Custom Applications

Use standard prompts or develop your own custom model with ETS’s expertise.

Use in Criterion ® Service

Learn how the e-rater engine is used in the Criterion ® Service.

FEATURED RESEARCH

E-rater as a Quality Control on Human Scores

See All Research (PDF)


Ready to begin? Contact us to learn how the e-rater service can enhance your existing program.



Top 7 AI Essay Graders for Smart & Fast Essay Scoring


Updated on April 25, 2024


Uncover the efficiency offered by AI essay graders in evaluating student writing. Release teachers from endless scoring and reshape the educational landscape.

As we dive deeper into the world of technological advancements, artificial intelligence has become an integral part of almost every aspect of our lives, including education. The intervention of AI in education has brought great efficiency and effectiveness for teachers and students. Teachers who want to save time and improve their accuracy often search for AI essay graders: tools that can grade students' assignments in a snap and with great accuracy. If you are looking for such AI tools in the field of education, this blog has you covered. Read on to discover the top AI essay grader tools.

AI Essay Grader

An AI essay grader is designed to help both teachers and students in academics. Its main objective is to revolutionize the grading process and enhance learning outcomes.

If you are a teacher, you can expect this technology to take over the time-consuming burden of checking and grading assignments. As a result, you can focus on giving your students detailed feedback on their performance and on improving your teaching strategies. As one of the best AI tools for teachers, it allows you to be consistent and efficient in assessments and to dedicate more time to other aspects of education.

An AI essay grader is beneficial for students as well. By using this tool, you can get immediate feedback on your performance and improve it. If you are a student, you can use it to understand your strengths and areas for improvement in writing and to refine your skills. Ultimately, you will be able to assess yourself effectively and focus more on personal growth.

All in all, an AI grader offers both teachers and students a streamlined grading process and reliable educational support to enhance productivity.


After learning about the benefits of AI graders, you may want to try one to make your educational journey more productive. Here are some of the best AI grading tools you can consider.

EssayGrader AI

EssayGrader offers great relief for teachers by revolutionizing the grading process and making it efficient. The tool reduces the time per essay from 10 minutes to just 30 seconds, so you can save about 95% of your grading time. It is reliable and highly accurate, so you can rest assured. The best part about this tool is its user-friendly interface, which keeps users engaged and makes the process pleasant.

AI Essay Grader - EssayGrader

Key Features:

  • Adaptive learning algorithms for personalized feedback.
  • Comprehensive plagiarism detection.
  • Detailed analytics providing insights.
  • Integration with learning management systems.

Smodin AI Grader

Smodin AI Grader is another AI tool that makes the grading process seamless for teachers. For students, this AI essay grader offers customized, objective feedback that can help them improve their writing. Another commendable feature is its virtual tutor service, which ensures that students get unbiased evaluations. Besides streamlined grading, it offers customized assessments to help both students and teachers. All in all, it is a solid tool for efficiently elevating writing skills, and you can try it for free with limited features.

Smodin AI Grader

  • Real-time feedback generation.
  • Multilingual support for various languages.
  • Interactive user interface.
  • Virtual tutor service.

SmartMarq AI Essay Scoring

SmartMarq makes grading essays easier. It uses both human raters and AI to grade essays, an approach that keeps results accurate and efficient. Users can create grading rules, manage raters, and collect grades quickly. Using AI as an additional rater makes the process faster without losing quality (a simple sketch of this hybrid pattern follows the feature list below). SmartMarq is part of FastTest, an end-to-end assessment system that makes building and grading tests simpler.

SmartMarq AI Essay Grader

  • Cognitive assessment algorithms for logical reasoning.
  • Allows you to set marking rules.
  • 24/7 expert help available. 
  • Essay comparison functionalities to get a better judgment.
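
SmartMarq's own workflow is not documented here, but the "AI as a second rater" pattern described above can be sketched in a few lines: compare the human and AI scores for each essay and route large disagreements to a human adjudicator. Everything below (names, threshold, averaging rule) is a hypothetical illustration of that pattern, not SmartMarq's API.

```python
# Hypothetical sketch of hybrid human + AI scoring with adjudication.
# Names, threshold, and averaging rule are invented for illustration.
from dataclasses import dataclass

@dataclass
class ScoredEssay:
    essay_id: str
    human_score: int
    ai_score: int

def needs_adjudication(item: ScoredEssay, threshold: int = 1) -> bool:
    """Flag essays where the human and AI ratings differ by more than `threshold`."""
    return abs(item.human_score - item.ai_score) > threshold

batch = [
    ScoredEssay("e-001", human_score=4, ai_score=4),
    ScoredEssay("e-002", human_score=3, ai_score=5),   # disagreement of 2 -> review
    ScoredEssay("e-003", human_score=5, ai_score=4),
]

final_scores = {}
review_queue = []
for item in batch:
    if needs_adjudication(item):
        review_queue.append(item.essay_id)             # send to a third, human rater
    else:
        final_scores[item.essay_id] = (item.human_score + item.ai_score) / 2

print("final:", final_scores)
print("needs human adjudication:", review_queue)
```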

ClassX AI Essay Grader

ClassX's AI Essay Grader is another cutting-edge tool for grading essays. With the help of advanced AI technology, it not only grades essays but also gives feedback on the content's structure, grammar, vocabulary, and overall impression. This allows teachers to save time, give students quick feedback, and offer fair evaluations free of bias.


AI Essay Grader - ClassX

  • Subject-specific evaluation to get customized feedback for improvement.
  • AI-powered revision suggestions.
  • Grading consistency for fairness.
  • Compatibility with various file formats.

Progressay AI Marking

With AI-powered rubrics, Progressay improves accuracy and drastically cuts down on marking time. It provides a user-friendly interface similar to Google Classroom, but with automatically graded essays, and it gives real-time information on students' development. It is compatible with all devices and browsers. As its name implies, Progressay promises smarter grading for greater educational progress.

Progressay AI Essay Grader

  • Linguistic analysis for grammar and syntax.
  • Interactive progress dashboards for enhanced learning.
  • Prompt-specific evaluation.
  • Cloud-based storage to let you access anywhere.

IntelliMetric AI-Powered Essay Scoring

No matter the level of writing, IntelliMetric is a capable AI-based tool for grading responses to writing prompts. It maintains accuracy while significantly cutting down the time required to review professional and student writing. By allowing multiple submissions per prompt with thorough adaptive feedback, IntelliMetric also helps you manage your time better. A noteworthy aspect is its Legitimacy component, which identifies, highlights, and removes irrelevant or harmful remarks in the content.

IntelliMetric AI Essay Scoring

  • Real-time multifaceted evaluation
  • Strategic educational integration by offering educational aid
  • Great academic reliability

Copyleaks AI Grader

Another excellent option for grading standardized examinations at the state, federal, and university levels is the Copyleaks AI Grader. For better evaluation accuracy, it combines cutting-edge AI technology with a grammar API, a plagiarism detector, and an AI content detector, and its quick, precise grading method eliminates inconsistencies. This technology stands out for its ability to grade in more than 100 languages, making it a flexible option for educational institutions around the globe.

Copyleaks AI Grader

  • Grammar Checker API to ensure error free writing.
  • LMS integration 
  • Ability to scan and grade physical exams
  • Works in more than 100 languages 

The intervention of AI in education is not limited to essay grading. In fact, there are many other AI examples in education that benefit both students and teachers. Let's have a look at some of the AI advancements that make the education process quick and effective.

AI Powerpoint Generator

By harnessing the capabilities of artificial intelligence, educators can streamline the process of developing visually compelling slides to support their lessons. That's where AI PowerPoint generators come in. These tools enable teachers to save time, enhance the quality of visual aids, and ensure consistency across presentations. They also let educators allocate more time to student engagement and personalized instruction, ultimately enriching the learning experience for their students.

AI Video Creator

As a teacher, your goal is to ensure a clear understanding of the content taught in class, and teaching through videos is one of the best ways to do so. For that, you can use AI video generators like Vidnoz AI, which simplify the process of creating presentations by converting PowerPoint slides into dynamic video presentations. This AI-powered tool also has features like talking avatars that can transform plain slides into engaging videos with voiceovers, and it lets you add animations and transitions. This not only makes the presentation more interesting but also reduces production time.

Adaptive Learning System 

Another AI tool popular in the realm of education is the adaptive learning system. It is designed to cater to individual students' needs by analyzing their performance. As a result, students can get content based on their strengths and areas of improvement. 

AI-Powered Content Curation

Teachers must ensure that their students get relevant and up-to-date content, which is a huge responsibility. As a teacher, you can fulfill this responsibility with AI-powered content curation. These tools use AI algorithms to analyze and filter information and recommend the most relevant and credible educational resources. This saves teachers time and ensures that students get the most effective content for their needs.

Conclusion 

AI in education is no less than a blessing for teachers and students. By using the free and paid AI essay grader tools mentioned in this blog, you can save time and ensure effective learning. However, the intervention of AI does not stop here; you will find many AI-powered tools that can aid the educational journey. One effective example is Vidnoz's AI video generator, which allows you to create engaging videos quickly and make the learning process more effective.


Author

Griffin, a former software engineer and technology enthusiast, has over 5 years of writing experience about technology. He is always looking for and sharing tools that promote creativity, productivity, and teamwork.


Transforming Assessment With AI-Powered Essay Scoring

IntelliMetric ® delivers accurate, consistent, and reliable scores for writing prompts for K-12 school districts, for higher education, for personnel testing, and as an API for software.

IntelliMetric ® Is The Gold Standard In AI Scoring Of Written Prompts

Trusted by educational institutions for surpassing human expert scoring, IntelliMetric ® is the go-to essay scoring platform for colleges and universities. IntelliMetric ®  also aids in hiring by identifying candidates with excellent communication skills. As an assessment API, it enhances software products and increases product value. Unlock its potential today.

Proven Capabilities Across Markets

Whether you’re a hiring manager, school district administrator, or higher education administrator, IntelliMetric® can help you meet your organization’s goals. Click below to learn how it works for your industry.

Powerful Features

IntelliMetric ® delivers relevant scores and expert feedback tailored to writers’ capabilities. IntelliMetric ® scores prompts of varying lengths, providing invaluable insights for both testing and writing improvement. Don't settle for less; unleash the power of IntelliMetric ® for scoring excellence.


IntelliMetric ® scores writing instantly to help organizations save time and money evaluating writing with the same level of accuracy and consistency as expert human scorers.

IntelliMetric ® can be used to either test writing skills or improve instruction by providing detailed and accurate feedback according to a rubric.

IntelliMetric ® gives teachers and business professionals the bandwidth to focus on other more impactful job duties by scoring writing prompts that would otherwise take countless hours each day.

Using Legitimacy detection, IntelliMetric ® ensures all writing is original without any plagiarism - and doesn’t contain any messages that diverge from the assigned writing prompt.

Case Studies and Testimonials

Below are use cases and testimonials from customers who used IntelliMetric ® to reach their goals by automating the process of analyzing and grading written responses. These users found IntelliMetric ® to be a vital tool in providing instant feedback and scoring written responses.


Santa Ana School District

Through the use of IntelliMetric ®, the Santa Ana school district was able to evaluate student writing, and students used the instantaneous feedback to drastically improve their writing. The majority of teachers found IntelliMetric to be a beneficial instructional tool in their classrooms and found that students were more motivated to write.

I have worked with Vantage Learning’s MY Access Automated Essay Scoring product both as a teacher and as a secondary ELA Curriculum specialist for grades 6-12.  As a teacher, I immediately saw the benefits of the program. My students were more motivated to write because they knew that they would receive immediate feedback upon completion and submission of their essays.  I also taught my students how to use the “My Tutor” and “My Editor” feedback in order to revise their essays. In the past, I felt like Sisyphus pushing a boulder up the hill, but with MY Access that heavy weight was lifted and my students were revising using specific feedback from My Access and consequently their writing drastically improved. When it comes to giving instantaneous feedback, MY Access performed more efficiently than myself.   

More than 350 research studies conducted both in-house and by third-party experts have determined that IntelliMetric® has levels of consistency, accuracy and reliability that meet, and more often exceed, those of human expert scorers.

After performing the practice DWA within SAUSD, I surveyed our English teachers and asked them about their recent experience with MY Access. Of the 85 teachers that responded to the survey, 82% of the teachers felt that their students’ experience with MY Access was either fair, good or very good. Similarly, 75% of the teachers thought the accuracy of the scoring was fair, good, or very good. Lastly, 77% of the teachers surveyed said that they would like to use MY Access in their classrooms as an instructional tool.

Many of the teachers’ responses to the MY Access survey included a call for a plagiarism detector. At the time, we had not opted for the addition of Cite Smart, the onboard plagiarism detector for MY Access. This year, however, we will be using it, and teachers across the district are excited to have this much-needed tool available.

As a teacher and as an ELA curriculum specialist, I know of no other writing tool available to teachers that is more helpful than MY Access. When I tell teachers that we will be using MY Access for instruction and not just benchmarking this year, the most common reply I receive is “Oh great! That means that I can teach a lot more writing!” Think about it - if a secondary teacher has 175 students (35 students in 5 periods) and the teacher takes 10 minutes to provide feedback on each student’s paper, then it would take the teacher 29 hours (1,750 minutes) to give effective feedback to his/her students. MY Access is a writing teacher’s best friend!  

Jason Crabbe  

Secondary Language Arts Curriculum Specialist  

Santa Ana Unified School District


Arkansas State University

“In 2018, our students performed poorly in style and mechanics. Other forms of intervention have not proven successful. We piloted IntelliWriter and IntelliMetric and produced great results. The University leadership has since implemented a full-scale rollout across all campuses.” - Dr. Melodie Philhours, Arkansas State University


The United Nations

The United Nations utilizes IntelliMetric® via the Adaptera Platform for real-time evaluation of personnel writing assessments, offering a cost-effective solution to ensure communication skills in the workforce.

The United Nations, the Department of Homeland Security, and the world's largest online retailer all access IntelliMetric® for immediate scoring of personnel writing assessments using the Adaptera Platform. In a world where clear, concise communication is essential in the workforce, using IntelliMetric® to score writing assessments provides an immediate, cost-effective evaluation of your employees' skills.

IntelliMetric ® Offers Multilingual Scoring & Support

Score written responses in your native language with IntelliMetric! The automated scoring platform offers instant feedback and scoring in several languages to provide more accuracy and consistency than human scoring wherever you’re located. Accessible any time or place for educational or professional needs, IntelliMetric® is the perfect solution to your scoring requirements.


IntelliMetric-powered Solutions

  • Automated Essay Scoring using AI
  • Online Writing Instruction and Assessment
  • Adaptive Learning and Assessment Platform
  • District-Level Capture and Assessment of Student Writing
  • AI-Powered Assessment and Instruction APIs
  • Advanced AI-Driven Writing Mastery Tool


How AI Can Enhance the Grading Process

Artificial intelligence tools, combined with human expertise, can help teachers save time when they’re reviewing student work.


After tucking my son into bed, I’m hit with the realization that since waking up at 6:00 a.m., I’ve not had a moment’s rest, nor have I managed more than brief exchanges with my wife. She’s deeply engrossed in grading sixth-grade math quizzes, a world away in her concentration. Shifting my focus, I dive into assessing a substantial stack of high school history essays, with a firm deadline to return them by tomorrow.

The punctuality I expect from my students is the same standard I set for myself in returning their assignments. Missing a deadline isn’t an option, unless unexpected events or an illness intervenes. In such cases, I may offer bonus points or postpone future deadlines. More than just a pledge, this is my commitment to their success, requiring their best effort and guaranteeing mine in return.

Enhancing Efficiency, Precision, and Fairness

Late that night, CoGrader —a new artificial intelligence (AI)–enhanced platform—piques my interest. A notification on social media directs me to their website, boasting a compelling promise: “Reduce grading time by 80% and provide instant feedback on student drafts.” The allure is heightened by the offer of a 30-day trial, free and without requiring a credit card.

Intrigued by this “AI copilot for teachers,” I sign up to see how its feedback stacks up against my own. I upload a student’s essay on Reconstruction, which I’ve already evaluated and annotated. The results astonish me with their accuracy and detail and are neatly presented in a customizable rubric.

“Your essay employs an organizational structure that shows the relationships between ideas, providing a cohesive analysis of the topic,” reads a portion of the written feedback. “You use transitions effectively to guide the reader through your argument.” This feedback mirrors my own observations, validating my assessment, and increasing my confidence in CoGrader.

CoGrader also deeply impresses me with its “Glow” section, which offers specific praise, and the guidance provided in its “Grow” counterpart, which provides similarly focused areas for improvement.

The inclusion of action items also fosters student inquiry, exemplified by one of the questions provided in my test upload: “How might you further enhance the connections between different sections of your essay to strengthen your argument?” This reflects my priority of encouraging critical thinking.

Even fatigued, I recognize that CoGrader’s value extends beyond saving time. By promising an “objective and fair grading system,” the platform provides a check against the unconscious biases that inadvertently influence grading, try as we might to curb them.

I dip into my caffeine stash and devote the rest of my waking hours to grading solo, yet I’m captivated by what CoGrader could offer me, my students, and the future of impactful feedback.

Annotated Feedback and Mitigating Teacher Guilt

The potential of CoGrader motivated me to contact its cofounder, Gabriel Adamante, to express my admiration and to get his thoughts about the rapidly advancing field of AI. I mentioned that I had expected something like CoGrader to emerge around the time of my son’s 10th or 11th birthday, not his fifth or sixth.

“I ask myself a lot of questions on what’s the responsible use of AI,” Adamante offered. “What is the line? I think we are living in the Wild West of AI right now. Everything is happening so fast, much faster than anyone would have expected. Like you said, you thought this would be five or six years away.”

I asked Adamante whether CoGrader, in this fast-paced environment, has plans to add annotated feedback on a student’s work.

“Honestly, I think we’re talking anywhere from three to five months until we do that,” Adamante said.

“That’s it?” I said in astonishment. “Really? I thought you were going to say three to five years.”

“No, no, no,” he replied. “It’s coming. It should come this year, in 2024. I won’t be happy if it comes in 2025.” 

Adamante acknowledged the dizzying effect of AI development, understanding the potential discomfort among educators.

“Of course teachers are mad when students use AI to do their work,” Adamante said. “That makes sense to me because the purpose of a kid doing the work is that they do the work. That’s the purpose. They should write to get practice. The purpose of a teacher grading is not that they grade. The purpose of a teacher grading is that they provide feedback to the student so that the student learns faster. The purpose of grading is not grading itself, whereas the purpose of writing an essay is writing an essay, because you’re practicing.”

Teachers who use CoGrader without reviewing the feedback contradict its intended purpose, which is to scaffold, but not replace, the human element, Adamante explained. He regularly communicates with educators, urging them to carefully read the comments instead of quickly clicking “approve” and moving on to the next submission. Taking time to read the results not only ensures that the teacher agrees, but also keeps the teacher informed about their students’ strengths and weaknesses.

Lessons Learned with ChatGPT 4 and Transparency

Following our conversation, I’ve remained mindful of applying Adamante’s insights to my current use of ChatGPT 4 to aid in providing written feedback. While I avoid asking it to generate feedback on my behalf, I do seek its assistance in clarifying my overarching comments and ensuring their coherence. With the paid subscription, I can even upload a Microsoft Word or PDF document for more precise and detailed assistance. I always meticulously review any output before sharing it with my students.

Adamante’s reminder highlights the importance of transparency in my use of AI to enhance feedback for my students. While I’m not particularly unsettled by my use of ChatGPT 4 as an assistance tool, I realize that I haven’t been as forthcoming about it as I should be regarding when and how I utilize it. I must delve into this matter within my stated classroom policies and through class discussions.

I want students to understand that I always thoroughly review their work and that employing AI to aid in providing feedback doesn’t diminish my dedication. Rather, it’s about finding the most effective way to leverage technology to support their growth as writers and learners. 

Reflecting on my discussion with Adamante, I’ve concluded that CoGrader’s precise feedback surpasses that of ChatGPT 4. Moreover, I foresee CoGrader further outshining it with the introduction of annotated feedback.

Now, if only I can get my wife on board with researching how AI can expedite and enhance her feedback on math work. We’d both get more rest and time together, before putting our son to sleep.


Accurate structure prediction of biomolecular interactions with AlphaFold 3


The introduction of AlphaFold 2 [1] has spurred a revolution in modelling the structure of proteins and their interactions, enabling a huge range of applications in protein modelling and design [2-6]. In this paper, we describe our AlphaFold 3 model with a substantially updated diffusion-based architecture, which is capable of joint structure prediction of complexes including proteins, nucleic acids, small molecules, ions, and modified residues. The new AlphaFold model demonstrates significantly improved accuracy over many previous specialised tools: far greater accuracy on protein-ligand interactions than state-of-the-art docking tools, much higher accuracy on protein-nucleic acid interactions than nucleic-acid-specific predictors, and significantly higher antibody-antigen prediction accuracy than AlphaFold-Multimer v2.3 [7,8]. Together these results show that high-accuracy modelling across biomolecular space is possible within a single unified deep learning framework.


Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature (2024). https://doi.org/10.1038/s41586-024-07487-w


IMAGES

  1. 5 Best Automated AI Essay Grader Software in 2024

  2. 70 Best Automated essay grading software AI tools

  3. Streamline essay grading with AI tool for teachers

  4. AI Grader

  5. 5 Best Automated AI Essay Grader Software in 2024

  6. AI Essay Grading

VIDEO

  1. Essay Grading Demo

  2. Essay Grading Tip ✏️

  3. Ai Tools For Students

COMMENTS

  1. AI Essay Grader

    ClassX's AI Essay Grader empowers teachers by automating the grading process without compromising on accuracy or fairness. The concept is elegantly simple: teachers input or copy the students' essays into the provided text box, select the appropriate grade level and subject, and ClassX's AI Essay Grader takes it from there.

  2. Essay Grader AI

    EssayGrader is an AI-powered grading assistant that gives high-quality, specific and accurate writing feedback for essays. On average it takes a teacher 10 minutes to grade a single essay; with EssayGrader that time is cut down to 30 seconds. That's a 95% reduction in the time it takes to grade an essay, with the same results. Get started for free.

  3. Top 10 AI Solutions for Grading Papers to Streamline Your Essay

    Top 10 AI-Powered Essay Grading Software to Consider. 1. EssayGrader. EssayGrader.ai is the most accurate AI grading platform, trusted by 30,000+ educators worldwide. On average it takes a teacher 10 minutes to grade a single essay; with EssayGrader that time is cut down to 30 seconds. That's a 95% reduction in the time it takes to grade an essay ...

  4. 5 Best Automated AI Essay Grader Software in 2024

    Project Essay Grade by Measurement Incorporated (MI), is a great automated grading software that uses AI technology to read, understand, process and give you results. By the use of the advanced statistical techniques found in this software, PEG can analyze written prose, make calculations based on more than 300 measurements (fluency, diction ...

  5. AI Essay Grader

    Once you're in, you'll save countless hours, cut the procrastination, and make grading efficient, fair, and helpful. CoGrader is the Free AI Essay Grader for teachers. Use AI to save 80% of time spent grading essays, and enhance student performance by providing instant and comprehensive feedback. CoGrader supports Narrative, Informative ...

  6. EssayGrader: AI-Powered Essay Grading and Feedback for Students

    EssayGrader uses artificial intelligence to analyze essays and provide detailed feedback. Teachers can either define their own grading rubrics or use the default rubrics in EssayGrader. When an essay is submitted, EssayGrader will: Check for grammar, spelling, and punctuation errors. Summarize the key points and ideas in the essay.

  7. EssayGrader

    EssayGrader is a tool powered by AI that provides accurate and helpful feedback based on the same rubrics used by the grading teacher. Its features include speedy grading, comprehensive feedback, estimated grades, focused feedback, organized essays, show, don't tell, and personalized approach. The tool offers an easy-to-use guide for better ...

  8. EssayGrader

    With the power of artificial intelligence at our fingertips, we crafted a solution that not only eases this load but transforms the entire grading experience. At the heart of our journey are 4 passionate individuals: Payton and Suraj, visionary software engineers, Chan, a world class product marketer, and Ashley, a dedicated English teacher.

  9. About the e-rater Scoring Engine

    About the e-rater Scoring Engine. The e-rater automated scoring engine uses AI technology and Natural Language Processing (NLP) to evaluate the writing proficiency of student essays by providing automatic scoring and feedback. The engine provides descriptive feedback on the writer's grammar, mechanics, word use and complexity, style ...

  10. Gradingly

    Stay Ahead with the Latest Technology. Utilising various AI techniques, our engine is trained on thousands of essays to provide accurate results. Trust in the speed and precision of our API. Receive a fully marked essay within 30 seconds.*. Gradingly is working with governments, businesses and education providers worldwide.

  11. Essay Grader AI

    Grade essays quickly and efficiently using our AI enabled grading tools. Our Essay Grader helps thousands of teachers grade essays in seconds and provides them with high quality, specific feedback on essays. Leverage the power of AI for teachers. Start grading for free. Go to dashboard.

  12. Explainable Automated Essay Scoring: Deep Learning Really Has

    Automated essay scoring (AES) is a compelling topic in Learning Analytics for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of ...

  13. SmartMarq: Essay marking with rubrics and AI

    SmartMarq will streamline your essay marking process. SmartMarq makes it easy to implement large-scale, professional essay scoring. Once raters are done, run the results through our AI to train a custom machine learning model for your data, obtaining a second "rater.". Note that our powerful AI scoring is customized, specific to each one of ...

  14. MyEssayGrader

    MyEssayGrader is an AI-based essay grading tool that provides a virtual thought partner to teachers to streamline the grading process. This tool aims to revolutionize the way essays are graded by providing a quick and accurate evaluation of students' essays that can be returned in just under a minute. Not only does MyEssayGrader save teachers ...

  15. AI Grader

    Our AI grader matches human scores 82% of the time*. AI scores are 100% consistent**. Standard AI / Advanced AI; deviation from real grade (10-point scale) vs. real grade. Graph: A dataset of essays was graded by professional graders on a range of 1-10 and cross-referenced against the detailed criteria within the rubric to determine their real scores.

  16. About Us

    Welcome to Essay-Grader.ai - Revolutionizing Essay Grading! At Essay-Grader.ai, we are on a mission to transform the way essays are assessed and graded. Our innovative software harnesses the power of artificial intelligence to automate the essay grading process, providing educators with a seamless and efficient tool to evaluate student work.

  17. Vexis : Your AI Grader

    Vexis is your new AI powered grading system revolutionizing the grading process for the better. Your AI-Powered Grading System. Vexis is the ultimate grading game changer for educators. We're speeding up the grading process, freeing up your time to focus on teaching. Sync with your progress, secure your students' data, and enjoy unbiased ...

  18. The e-rater Scoring Engine

    Our AI technology, such as the e-rater ® scoring engine, informs decisions and creates opportunities for learners around the world. The e-rater engine automatically: assess and nurtures key writing skills. scores essays and provides feedback on writing using a model built on the theory of writing to assess both analytical and independent ...

  19. What is Automated Essay Scoring, Marking, Grading?

    Nathan Thompson, PhD. April 25, 2023. Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment. In fact, it's been around far longer than "machine learning" and "artificial intelligence" have been buzzwords in the general public!

  20. StudyGleam: AI-Powered Grading for Handwritten Essays

    Revolutionize your grading process with StudyGleam's AI-driven platform. Convert handwritten English essays into digital text, assess with precision, and provide comprehensive feedback. Ideal for primary to junior college educators. Explore the future of edtech today!

  21. Top 7 AI Essay Grader for Smart & Fast Essay Scoring

    SmartMarq makes grading essays easier. It uses both human and AI to grade essays. This approach ensures that the results are accurate and efficient. Users can create rules for grading, manage graders, and collect grades quickly. By using AI to help grade, it makes the process faster without losing quality.

  22. EssayGrader

    The fastest way to grade essays. EssayGrader is an AI-powered grading assistant that gives high-quality, specific and accurate writing feedback for essays. ... EssayGrader analyzes essays with the power of AI. Our software is trained on massive amounts of diverse text data, including books, articles and websites. ...

  23. Teachers are using AI to grade essays. Students are using AI to write

    Teachers are turning to AI tools and platforms — such as ChatGPT, Writable, Grammarly and EssayGrader — to assist with grading papers, writing feedback, developing lesson plans and creating ...

  24. Home

    Trusted by educational institutions for surpassing human expert scoring, IntelliMetric® is the go-to essay scoring platform for colleges and universities. IntelliMetric® also aids in hiring by identifying candidates with excellent communication skills. As an assessment API, it enhances software products and increases product value.

  25. Using AI Grading Tools to Enhance the Process

    Late that night, CoGrader —a new artificial intelligence (AI)-enhanced platform—piques my interest. A notification on social media directs me to their website, boasting a compelling promise: "Reduce grading time by 80% and provide instant feedback on student drafts.". The allure is heightened by the offer of a 30-day trial, free and ...

  26. How teachers started using ChatGPT to grade assignments

    Teachers are embracing ChatGPT-powered grading. A new tool called Writable, which uses ChatGPT to help grade student writing assignments, is being offered widely to teachers in grades 3-12. Why it matters: Teachers have quietly used ChatGPT to grade papers since it first came out — but now schools are sanctioning and encouraging its use.

  27. Accurate structure prediction of biomolecular interactions with

    The introduction of AlphaFold 21 has spurred a revolution in modelling the structure of proteins and their interactions, enabling a huge range of applications in protein modelling and design2-6 ...