Automatic assignment grading for instructor use in programming courses

zmievsa/autograder


A simple, secure, and versatile way to automatically grade programming assignments

Features

  • Blazingly fast (can grade hundreds of submissions using dozens of testcases in a few minutes; seconds if grading Python)
  • Easy to grade
  • Easy-to-write testcases
  • Testcase grade can be based on student's stdout
  • Can grade C, C++, Java, and Python code in regular mode
  • Can grade any programming language in stdout-only mode
  • A file with testcase grades and details can be generated for each student
  • You can customize the total points for the assignment, maximum running time of student's program, file names to be considered for grading, formatters for checking student stdout, and much more.
  • Anti-cheating capabilities that make it nearly impossible for students to cheat
  • Grading submissions in multiple programming languages at once
  • JSON result output supported if autograder needs to be integrated as a part of a larger utility
  • Can check submissions for similarity (plagiarism)
  • Can detect and report memory leaks in C/C++ code

Installation

  • Run pip install autograder
  • gcc / clang for C/C++ support
  • Java JDK for Java support
  • make for compiled stdout-only testcase support
  • Any interpreter/compiler necessary to run stdout-only testcases. For example, testcases with ruby in their shebang lines will require the ruby interpreter

To upgrade an existing installation: pip install -U --no-cache-dir autograder

  • Run autograder guide path/to/directory/you'd/like/to/grade. The guide will create all of the necessary configurations and directories for grading and will explain how to grade.
  • Read the usage section of the docs

Supported Platforms

  • Linux is fully supported
  • OS X is fully supported
  • Windows is only partially supported: stdout-testcases that require shebang lines are not and cannot be supported there

Supported Programming Languages

  • Java, C, and C++ in regular mode (see Installation for the required toolchains)
  • CPython (3.8-3.11)
  • Any programming language if stdout-only grading is used


Automated Grading and Feedback Tools for Programming Education: A Systematic Review

We conducted a systematic literature review on automated grading and feedback tools for programming education. We analysed 121 research papers from 2017 to 2021 inclusive and categorised them based on skills assessed, approach, language paradigm, degree of automation and evaluation techniques. Most papers assess the correctness of assignments in object-oriented languages. Typically, these tools use a dynamic technique, primarily unit testing, to provide grades and feedback to the students or static analysis techniques to compare a submission with a reference solution or with a set of correct student submissions. However, these techniques’ feedback is often limited to whether the unit tests have passed or failed, the expected and actual output, or how they differ from the reference solution. Furthermore, few tools assess the maintainability, readability or documentation of the source code, with most using static analysis techniques, such as code quality metrics, in conjunction with grading correctness. Additionally, we found that most tools offered fully automated assessment to allow for near-instantaneous feedback and multiple resubmissions, which can increase student satisfaction and provide them with more opportunities to succeed. In terms of techniques used to evaluate the tools’ performance, most papers primarily use student surveys or compare the automatic assessment tools to grades or feedback provided by human graders. However, because the evaluation dataset is frequently unavailable, it is more difficult to reproduce results and compare tools to a collection of common assignments.

1. Introduction

Most computer science courses have grown significantly over the years, leading to more assignments to grade (Krusche2020, 103 ) . The time window for evaluating assignments is typically short as prompt and timely feedback increases student satisfaction (Kane2008, 41 ) , resulting in the danger of inconsistent grading and low feedback quality.

One method for providing a grade and feedback in good time is to use multiple human graders. This approach, however, increases the chance of variation in grading accuracy, consistency and feedback quality (Aziz2015, 11 ) . As a result, automatic grading tools have become widely popular, as they are able to assign grades and generate feedback consistently for large cohorts. Automatic Assessment Tools (AATs) may be used by instructors either to fully automate the marking and feedback process, or to indicate potential issues while manually assessing the submissions.

There is typically a relationship between grading and feedback when assessing student submissions for a given assignment. Formative assessment focuses on providing feedback to teachers and students to help students learn more effectively while providing an ongoing source of information about student misunderstanding (Dixson2016, 52 ) . Summative assessment, in contrast, typically intends to capture what a student has learned, judges performance against some standard, and is almost always graded (Dixson2016, 52 ) . While some AATs focus on providing only feedback for formative assessment, many provide both a grade and feedback for summative assessment. In this paper, we will refer to providing a grade and/or feedback as an assessment unless we explicitly discuss grading or feedback exclusively.

Using AATs to grade programming assignments originated in the 1960s (Hollingsworth1960, 77 ) . Traditionally, AATs have focused on grading program correctness using unit tests and pattern matching. They typically struggle, however, to grade design quality aspects and to identify student misunderstandings. Only recently have researchers begun to address these areas as well (Orr2021, 183 , 178 ) .

As part of implementing a unit test-based AAT, instructors must provide a comprehensive test suite. It typically takes considerable effort and requires students to follow a specific structure when implementing their solutions, such as replicating the exact output being tested or using predefined class and function names. The considerable effort and specific structure often lead to instructors designing short and well-defined coursework, as it is easier to implement a comprehensive test suite for such assignments. However, some instructors prefer to incorporate large-scale project-based assignments in their courses, which are typically less well-defined to allow students to control the direction of the assignment and use their creativity, both of which can motivate students to perform well in their assignments. These open-ended assignments are usually nearly impossible to automatically assess using unit test-based approaches, as the structure and functionality of the student assignments can change, making it difficult to implement unit tests.

We conducted a systematic literature review to investigate recent research into automated grading and feedback tools. Our review offers new insights into the current state of auto-grading tools and the associated research by contributing the following:

A categorisation of automatic assessment tools based on core programming skills (Messer2023, 127 ) .

A summary of the state of the art.

Detailed statistics of grading and feedback techniques, language paradigms graded, and evaluation techniques.

An in-depth discussion of the gaps and limitations of current research.

We utilised the programming skills framework and the machine learning (ML) papers from the results of this systematic literature review to further investigate ML-based AATs through a meta-analysis (Messer2023, 127 ) . In the meta-analysis, the ML papers from the current review were used as initial papers for a backward snowball search to find other ML-based AATs. After conducting the snowball search and finalising the included papers, we categorised the ML-based AATs using the core skills discussed in Section 2 and the techniques and evaluation criteria utilised.

The format of the article is as follows: Sections 2 and 3 introduce our framework for categorising AATs. Section 4 presents existing reviews and discusses how our own work relates to and expands on this prior work. Section 5 details our methodology, inclusion and exclusion criteria and introduces the research questions addressed by this work. Section 6 presents an analysis of the selected sources and presents the results, and Section 7 discusses our findings. Finally, Section 8 summarises our conclusions and presents recommendations.

2. Programming Skills

We distinguish four core criteria for assessing programming assignments based on our experience: correctness, maintainability, readability, and documentation. These criteria are manually or automatically graded by evaluating a student’s source code statically or dynamically and have been active research areas within computer science education.

There are multiple facets of correctness research, including students’ conceptions and perspectives of correctness (Kolikant2008, 100 , 171 , 148 ) , how students fix their code to improve correctness (Souza2017, 47 , 6 ) , and assessment of correctness (rayyan-354359303, 161 , 140 , 63 ) . Furthermore, there has been a multitude of research into code quality, which includes readability and maintainability (Borstler2018, 21 , 170 , 86 ) . The research has included designing rubrics for grading code quality (Stegeman2016, 172 ) and investigating code quality issues within student programs (Keuning2017, 93 ) . Finally, code documentation is considered one of the best practices in software development (Kipyegen2013, 96 , 48 , 150 ) and is widely researched within the software engineering domain, including code summarisation (McBurney2015, 122 , 74 ) , how documentation is essential to software maintenance (Souza2005, 48 ) and how to prioritize documentation effort (McBurney2018, 123 ) . We separated code assessment into these four skills to provide a comprehensive overview of which skills are automatically graded.

The four core criteria provide our frame of reference for investigating research into AATs for evaluating student submissions:

Correctness – Evaluates whether a student has understood and implemented the tasks in a manner that conforms to the coursework specification. There are two commonly evaluated areas of correctness. The most common is correctness of functionality: testing whether a student has adequately implemented the assignment’s essential features, whether that means correctly calculating the Fibonacci sequence, implementing FizzBuzz, or creating a text-based adventure game. The other is correctness of methodology: verifying that a student has used a specific language feature, such as recursion to calculate the Fibonacci sequence, a for-loop and modulo to implement FizzBuzz, or polymorphism and inheritance to create an object-oriented game. A short code sketch after these definitions illustrates the distinction.

Maintainability – Investigates how well a student has implemented maintainable or elegant code. This could include how well a student has used functions to reduce duplicated code, whether the method used to solve a problem is simple or overly complicated, or whether the student has used polymorphism and inheritance to reduce class coupling.

Readability – Analyses whether a student’s submission is easy to understand. While maintainable code contributes to the readability of the code, other aspects of source code can signify if the code is readable. These aspects include: following code style guidelines, meaningful naming of classes, functions and variables, replacing magic numbers with constants, and using whitespace to separate code blocks.

Documentation – Inspects the existence and quality of a student’s documentation, including inline comments and docstrings. While some believe code should be self-documenting, including inline comments to explain functionality and implementation is common practice. However, these are often useless (Raskin2005, 152 ) or poor quality (Steidl2013, 173 ) . Additionally, developers must include documentation in the form of docstrings to explain the purpose and the interactions for a specific class or function. These are typically used when other developers interact with or maintain the function (Steidl2013, 173 ) .
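To make the distinction between correctness of functionality and correctness of methodology concrete, the following minimal Python sketch (not taken from any reviewed tool; the submission, function and test names are hypothetical) grades functionality with a unit test and methodology with a simple AST check:

```python
import ast
import unittest

# Hypothetical student submission; a real grader would load it from a file.
STUDENT_SOURCE = """
def fizzbuzz(n):
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out
"""

namespace = {}
exec(STUDENT_SOURCE, namespace)  # load the submission into a namespace


class TestFunctionality(unittest.TestCase):
    """Correctness of functionality: does fizzbuzz produce the right output?"""

    def test_first_fifteen(self):
        expected = ["1", "2", "Fizz", "4", "Buzz", "Fizz", "7", "8",
                    "Fizz", "Buzz", "11", "Fizz", "13", "14", "FizzBuzz"]
        self.assertEqual(namespace["fizzbuzz"](15), expected)


def uses_for_loop_and_modulo(source: str) -> bool:
    """Correctness of methodology: were a for-loop and the modulo operator used?"""
    tree = ast.parse(source)
    has_for = any(isinstance(node, ast.For) for node in ast.walk(tree))
    has_mod = any(isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mod)
                  for node in ast.walk(tree))
    return has_for and has_mod


if __name__ == "__main__":
    print("methodology check passed:", uses_for_loop_and_modulo(STUDENT_SOURCE))
    unittest.main()
```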

3. Categories of Automatic Assessment Tools

AATs have numerous advantages for both educators and students. They typically reduce the time it takes to assess each assignment and can be used to enhance the traditional assessment workflow and establish a real-time feedback system. The process of automatically assessing source code can be done in various ways, such as unit testing and comparing source code to an instructor’s model solution or other students’ correct submissions. Depending on the assignment’s intended learning outcomes, an AAT may employ multiple methods to grade various core skills.

In 2005, (Mutka2005, 3 ) published “A Survey of Automated Approaches for Programming Assignments”, in which automatic assessment of different features is categorised into two main categories: dynamic analysis and static analysis.

3.1. Dynamic Analysis Automatic Assessment Tools

Dynamic analysis evaluates a running program’s attributes by determining which properties will hold for one or more executions (ball1999, 13 ) . Typical methods employed by dynamic analysis AATs include using a suite of unit tests to grade the correctness of a student’s submission by comparing either the printed output or return values of individual methods.

Additionally, dynamic analysis can be used to evaluate the student’s ability to write efficient code or to write complete test suites (Mutka2005, 3 ) . To create a comprehensive test suite, instructors typically need to be skilled at creating unit tests using complex language features such as reflection and invest significant time developing and evaluating the tests. Test suites are usually given to students in one of three ways: complete access before the final deadline (rayyan-354359326, 35 , 36 , 90 ) , no access before the final deadline (rayyan-354359314, 17 , 33 , 64 , 112 , 128 ) , or access to a subset of the test suite before the final deadline (rayyan-354359276, 106 , 134 , 189 ) . How the test suites are presented to students depends on the assessment approach. Typically, if the assignment is formative, students will be given complete access throughout; if the assignment is summative, the students will either have no access or only access to a subset of tests. Limiting students’ access to the complete set of tests or using hidden tests are typically used to limit the students’ ability to game the system by hardcoding return values.
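As a minimal illustration of unit-test-based dynamic assessment (a sketch only, not any cited tool's implementation; the submission, test cases and points-per-test scheme are hypothetical), a grader can compare both return values and printed output and award points per passing test:

```python
import io
import unittest
from contextlib import redirect_stdout


# Hypothetical student function; a real grader would import the submission instead.
def add(a, b):
    print(f"adding {a} and {b}")  # printed output can also be graded
    return a + b


class GradedTests(unittest.TestCase):
    def test_return_value(self):
        self.assertEqual(add(2, 3), 5)

    def test_printed_output(self):
        buffer = io.StringIO()
        with redirect_stdout(buffer):
            add(2, 3)
        self.assertIn("adding 2 and 3", buffer.getvalue())


def grade(points_per_test: float = 5.0) -> float:
    """Run the test suite and award points for each passing test case."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(GradedTests)
    result = unittest.TestResult()
    suite.run(result)
    passed = result.testsRun - len(result.failures) - len(result.errors)
    return passed * points_per_test


if __name__ == "__main__":
    print(f"grade: {grade()} / {2 * 5.0}")
```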

3.2. Static Analysis Automatic Assessment Tools

Static analysis tools assess software without running it or taking inputs into account (Ayewah2008, 10 ) .

3.2.1. Static Analysis Tools

Many industry-standard static analysis tools have been implemented into AATs, including linters such as pylint (https://pypi.org/project/pylint/) and cpplint (https://github.com/cpplint/cpplint), CheckStyle (a style guide enforcement utility: https://checkstyle.sourceforge.io/), FindBugs (Ayewah2008, 10 ) (since abandoned and replaced by SpotBugs: https://spotbugs.github.io/), and PMD (a static analyser focusing on finding programming flaws: https://pmd.github.io/). These static analysis tools typically focus on identifying or partially identifying issues, and checking maintainability, readability, and the existence of documentation.

3.2.2. Software Metrics

Software metrics are a technique for assessing code that is most commonly used in commercial settings and has been effectively integrated into AATs (Mutka2005, 3 ) . Metrics are typically used to evaluate maintainability; (Halstead1977, 69 ) and (Mccabe1976, 124 ) created such complexity measures. (Halstead1977, 69 ) ’s metrics include program length, comprehension difficulty, and programming effort, whereas (Mccabe1976, 124 ) ’s Cyclomatic Complexity utilises graph theory to evaluate a program’s complexity based on its control flow graph. Beyond complexity, metrics can be used to evaluate object-oriented design principles, such as class coupling and depth of inheritance (Chidamber1994, 29 ) .
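For illustration, a rough McCabe-style estimate for Python can be obtained by counting branching constructs in the abstract syntax tree; this sketch approximates the idea behind such metrics rather than reimplementing any cited tool:

```python
import ast

# AST node types that add a decision point (an approximation of McCabe's metric).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


if __name__ == "__main__":
    sample = (
        "def classify(x):\n"
        "    if x > 0 and x % 2 == 0:\n"
        "        return 'positive even'\n"
        "    elif x > 0:\n"
        "        return 'positive odd'\n"
        "    return 'non-positive'\n"
    )
    print(cyclomatic_complexity(sample))  # counts the ifs and the boolean operator
```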

3.2.3. Comparative Automatic Assessment Tools

Other than adapting industry tools, static analysis has been used to compare a student’s submission with a model solution or with a set of applicable solutions (Mutka2005, 3 ) , typically focusing on grading correctness. These tools use various methods to evaluate the similarity between the student’s submission and the model solution(s), since source code can be written differently but implement the same functionality. To improve accuracy, AATs can convert the source code to an abstract syntax tree (Wang2020, 184 ) or a control flow graph (Sendjaja2021, 159 ) to abstract away from the syntactic representation. More comprehensive AATs have multiple methods for matching or partially matching student solutions to the model solution(s). Virtual Teaching Assistant (Chou2021, 30 ) incorporates four patterns to facilitate grading partial correctness, including location-free patterns, where the order of output tokens or characters or a specific structure is not a determining factor in the awarded grade, and location-specific patterns, where the order of the output or the specific structure does impact the awarded grade.
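A minimal sketch of this kind of comparative assessment, assuming Python submissions and using only the standard ast and difflib modules (the helper names are illustrative), reduces both the submission and a model solution to their AST node sequences and scores their similarity:

```python
import ast
import difflib


def structure(source: str) -> list[str]:
    """Reduce a program to the sequence of AST node types it contains."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def similarity(submission: str, model_solution: str) -> float:
    """Similarity in [0, 1] between a submission and a model solution."""
    return difflib.SequenceMatcher(
        None, structure(submission), structure(model_solution)
    ).ratio()


if __name__ == "__main__":
    model = "def total(xs):\n    return sum(xs)\n"
    student = (
        "def total(values):\n"
        "    result = 0\n"
        "    for v in values:\n"
        "        result += v\n"
        "    return result\n"
    )
    print(f"structural similarity: {similarity(student, model):.2f}")
```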

Comparative approaches can also be used to generate feedback for the student. (Paassen2017, 137 , ) use edit-based approaches on student trace data to generate next-step hints for block-based programming languages.

3.3. Machine Learning Automatic Assessment Tools

Since (Mutka2005, 3 , ) published their review in 2005, research into applying machine learning to auto-grading has increased. Machine learning AATs use various techniques to grade or provide feedback on the correctness and maintainability of a student’s submission, typically utilising dynamic or static approaches. (Orr2021, 183 , ) trained a feed-forward neural network to grade the design quality and provide personalised feedback. They converted the source code to an abstract syntax tree and then converted it to a feature vector as the input to the regression model that predicts the score.
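A rough sketch of the AST-to-feature-vector step is shown below; the node categories chosen and any downstream regression model are illustrative assumptions rather than the cited authors' actual pipeline:

```python
import ast
from collections import Counter

# Illustrative node categories; a real model would learn which features matter.
FEATURE_NODES = ["FunctionDef", "ClassDef", "If", "For", "While",
                 "Return", "Call", "Assign", "BinOp", "Compare"]


def ast_feature_vector(source: str) -> list[int]:
    """Count selected AST node types to form a fixed-length feature vector."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    return [counts.get(name, 0) for name in FEATURE_NODES]


if __name__ == "__main__":
    submission = "def mean(xs):\n    return sum(xs) / len(xs)\n"
    # This vector would be the input to a regression model predicting the score.
    print(ast_feature_vector(submission))
```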

To grade correctness, (Dong2020, 53 , ) implemented two ML AATs into an online judge. The first model was trained using historical training data to predict what causes failed test results. The second model implemented was a knowledge-tracing model used to predict the probability of a student passing a new problem based on previous knowledge of a programming concept.

However, the approaches of both (Orr2021, 183 ) and (Dong2020, 53 ) require a large ground truth dataset to train their models. A zero-shot learning approach can remove the need for large training datasets. (Efremov2020, 57 ) implemented such an approach to provide next-step hints to students where no prior historical data for a task exists. They implemented a Long Short-Term Memory (LSTM) neural network to produce a vector representation of programs in a block-based language, followed by a reinforcement learning approach for the hint policy.

3.4. Degree of Automation

(Mutka2005, 3 ) also defined varying degrees of automation in the assessment of programming assignments. Fully automated assessment is typically used for smaller assignments where unit testing or another form of automatic grading can easily be designed and implemented, such as assignments focusing on programming language basics (Mutka2005, 3 ) . Large courses use fully automated grading to reduce the workload on the instructors. However, fully automated grading tools have difficulty grading large-scale assignments, graphical user interfaces, maintainability, readability, and documentation, which are typically manually graded.

It is common to use a semi-automated approach, a mix of manual and automated assessment, to minimise the instructors’ workload on these large-scale assignments. Automating certain aspects of the assessment process allows the instructor more time to grade and give feedback on areas that cannot be easily automated, including code design (Mutka2005, 3 ) .

4. Related Work

Multiple studies have reviewed the literature and existing automatic assessment tools for programming assignments. These survey papers focused on literature that discussed grading assignments or feedback given to students on their work.

4.1. Grading

Most recently, (Paiva2022, 139 , ) reviewed which computer science domains were automatically assessed. They reviewed proposed testing techniques, how secure code execution is, feedback generation techniques, and how practical the techniques and tools are.

(Paiva2022, 139 ) ’s review showed that automated assessment research covers multiple programming domains, including visual programming, web development, parallelism and concurrency. Additionally, they found that the tools and techniques for assessing assignments fall into one or more categories: functionality, code quality, software metrics, test development, or plagiarism. When evaluating feedback, they used (Keuning2018, 95 ) ’s feedback categories (discussed in Section 4.2 ). A novel feedback approach they found was using automated program repair to fix software bugs.

Additionally, (Paiva2022, 139 ) classified each tool into web-based platforms, Moodle plug-ins, web services, cloud-based services, toolkits and Java libraries and analysed the application of the found techniques to the tools. Finally, they discussed the previous and future trends of the key topics; the most notable previous key topics include tool development, static analysis and feedback.

We previously introduced (Mutka2005, 3 , ) ’s survey on automated approaches for programming assignments, where they surveyed the literature to define different types of automated assessment approaches. They discussed the benefits and drawbacks of automated assessment and encouraged careful assignment design to provide students with several practical programming tasks.

Furthermore, (Ullah2018, 177 ) expanded upon (Mutka2005, 3 ) ’s work by introducing a hybrid category of AATs that combine dynamic and static approaches and by investigating automated assessment tools released before the paper’s publication in 2018. They introduce a taxonomy of both static and dynamic approaches and discuss the approach, supported languages, advantages and limitations of existing tools.

(Aldriye2019, 4 , ) analysed a small set of existing grading systems in multiple areas, including usability, understandability of system feedback, and the advantages compared to other tools. Similarly, (Nayak2022, 131 , ) discussed the implementation and functionality of numerous automated assessment tools, and (Lajis2018, 104 , ) reviewed AATs which used semantic similarity to a model solution to grade submissions.

(Douce2005, 55 ) conducted a literature review of the most influential AATs, from the earliest example in 1960 by (Hollingsworth1960, 77 ) until the paper’s publication in 2005, and discussed how AATs have changed over time, from early assessment systems, to tool-oriented systems, and finally to web-based tools. (Ihantola2010, 78 ) instead investigated the approaches that tools from 2006 to 2010 utilised, from both a pedagogical and a technical point of view, including the programming languages assessed, how instructors define tests, and whether any tools specialised in specific areas such as the assessment of GUIs.

(Souza2016, 168 ) conducted a systematic literature review of AATs between 1985 and 2013, focusing on key characteristics of AATs, including supported programming languages, user interfaces and types of verification. They categorise the tools by degree of automation, whether the tool is instructor- or student-centred, and whether the tools are specialised, for example in content, testing or quizzes.

4.2. Feedback

(Keuning2018, 95 , ) conducted a systematic literature review of automated programming feedback. They aimed to review the nature of feedback generated, the techniques used, the tools’ adaptability and the quality and effectiveness of the feedback.

To categorise the feedback types, (Keuning2018, 95 ) used and built on (Narciss2008, 130 ) ’s feedback components; these include knowledge of performance for a set of tasks, knowledge of result/response, knowledge of correct results, knowledge about task constraints, and four others. They found that the most common feedback types were knowledge about mistakes and how to proceed.

To evaluate how teachers can adapt the tools, (Keuning2018, 95 ) categorised the tools by the different types of input teachers can provide, including model solutions, test data and solution templates. Finally, they investigated how authors evaluated the quality and effectiveness of the feedback or tool and found that most evaluations were technical analyses, such as comparing generated grades or feedback with an existing dataset of graded work. Other evaluation techniques include anecdotal evidence, surveys and learning outcome evaluations.

4.3. Novelty of This Review

Several of the existing reviews are small, non-systematic reviews that hand-picked tools to assess and summarise (Douce2005, 55 , 104 , 4 , 131 , 177 ) . Our review differs from these by systematically searching the literature and extracting more detail about each system.

While (Keuning2018, 95 ) investigated feedback generation, which overlaps with half of our review, their review only considered papers up to 2015 inclusive. Thus, none of their included papers overlap with ours. Nevertheless, their results form a valid, distinct comparison with ours. Similarly, (Ihantola2010, 78 ) and (Souza2016, 168 ) investigated automated grading systems only up to 2010 and 2013 respectively, and are now outdated given the recent growth in automated assessment systems.

(Paiva2022, 139 , ) carried out a state-of-the-art review in 2022 concurrently with this work. Several of their research questions (such as code execution security) have no overlap with this review. They did investigate which aspects of programs are assessed, including quality and the techniques used to generate feedback. However, they did not cover the evaluation of these tools in detail, in contrast to our detailed examination of evaluation versus human graders. Their review focuses more on the technical aspects of the automated graders. In contrast, ours takes a more pedagogical approach to consider the automated graders’ place within education, focusing on how well they grade, what they grade and what they should be grading.

To summarise, our review introduces a new method of categorising AATs by the core skills graded, whether that be correctness, maintainability, readability or documentation, and builds on (Mutka2005, 3 , ) ’s categories for approaches to assessing programming assignments, by introducing a category for machine learning AATs, which is a sub-category of both static and dynamic approaches. We also include an analysis of the evaluation techniques used and the data availability and an analysis of the techniques used to grade or give feedback on the source code. In Section 7.8 , we further compare the results of our literature review to previous literature reviews.

5. Methodology

A systematic literature review is a method for locating, assessing, and interpreting all accessible data on a specific subject. Such reviews are frequently used to summarise existing evidence, identify gaps, and set the stage for new research endeavours (Kitchenham2007, 98 ) . Conducting a systematic review involves developing a review protocol, identifying research, selecting primary studies, assessing study quality, and extracting and synthesising data.

This section discusses our review protocol, defining our research questions, inclusion and exclusion criteria, and search and screening processes. We used Rayyan (Ouzzani2016, 136 ) to aid our screening process, using their in-built tools for automated and manual de-duplication and keyword highlighting derived from our search criteria. We did not use their paper clustering feature to aid the screening process. Section 6 discusses the outcomes of selecting primary studies, quality assessment, data extraction and data synthesis.

5.1. Preregistration

We preregistered our study (Messer2022, 126 ) with the Open Science Foundation. In the preregistration, we provided a complete description of our planned methodology, including our research questions, search terms, screening, data extraction, and data synthesis.

5.2. Research Questions

To guide our review, we used the following research questions:

Which are the most common techniques for programming automated assessment tools?

Which programming languages do programming automated assessment tools target?

Which critical programming skills, such as correctness, maintainability, readability and documentation, are typically assessed by programming automatic assessment tools?

How well do the automated grading techniques perform compared to a human grader?

Is the feedback generated by these techniques comparable to human graders?

What are the most common methods for providing feedback on a student’s programming assignment, and what areas do they address?

5.3. Search Strings

We defined two search strings, one for automated grading (Listing 1 ) and the other for automated feedback (Listing 2 ). These were defined by extracting keywords from our research questions. After finding the keywords, we extended our search string to include the keywords’ synonyms. To find the variants of a keyword, we stemmed the keywords and added a wildcard character. For example, “grading” becomes “grad*”. However, the stemming method yielded numerous irrelevant results (e.g. gradient) due to the common nature of our keywords. Therefore, we decided to use specific variants, such as “grade”, “grading” and “grader”.

We decided to include a negation statement in our search strings to reduce the number of out-of-scope sources; it eliminates sources that focus on robotics, source code vulnerability, or information and communication technology, as these are out of the scope of this review.

Our search strings have minor differences to maximise the likelihood of our benchmark papers being located in our database search. In Listing 2 , ‘student code’ and ‘novice programm*’ were included so that the search would return the benchmark papers (Piech2015, 144 , 141 , 165 ) , whereas the benchmark papers chosen for Listing 1 did not include the terms ‘student code’ or ‘novice programming’ in their titles or abstracts.

5.4. Search Process

To locate a set of relevant primary studies, we used our search strings in the ACM Digital Library (ACM DL), IEEEXplore, and Scopus. We chose these databases since ACM or IEEE publishes most Computer Science Education resources. We used Scopus, the world’s largest abstract and citation database for peer-reviewed literature, to find additional sources outside of ACM and IEEE.

Initially, we investigated other databases, specifically Google Scholar, ScienceDirect, and SpringerLink. Google Scholar and SpringerLink returned almost 400,000 results, and ScienceDirect did not support the number of Boolean terms in our search strings. We could not minimise the number of results by restricting our search strings because “feedback”, “mark”, “assessment”, and “code” are terms in many other domains, including medical and genetics publications, and these databases do not allow for filtering by domain. Thus we did not use these databases.

To validate our search strings, we benchmarked them against papers identified during our initial reading, confirming that the search results contain these known papers. For automated grading we used five papers (Douce2005, 55 , 77 , 82 , 141 , 149 ) , and for automated feedback we used seven papers (Parihar2017, 141 , 165 , 66 , 45 , 95 , 188 , 144 ) . The benchmark was performed without applying our inclusion/exclusion criteria (Table 2 ), and its results can be found in Table 1 . As shown in Table 1 , IEEEXplore does not contain any of our benchmark sources, as IEEE did not publish any of them. Even though there are no benchmark sources, our search strings should still produce relevant results in IEEEXplore, as they provide relevant sources from ACM and Scopus.

5.5. Inclusion/Exclusion Criteria

We employed the inclusion and exclusion criteria listed in Table 2 to determine whether a primary study is eligible for our review. We decided only to include sources supporting textual languages, such as Java, Python and assembly. We only included sources from 2017–2021 inclusive to ensure we focused on the most up-to-date research. We omitted sources we could not access or those not written in English because we could not extract the required information. Similarly, we rejected posters and brief articles since they are unlikely to have sufficient detail. Finally, we removed non-peer-reviewed studies because the quality and veracity of these sources could not be verified.

5.6. Screening Process

We used two screeners to conduct our screening process, with any conflicts between decisions being discussed and resolved at regular intervals. We included or excluded studies based on their title and abstract for our first screening stage. In the second stage, we screened the introduction and conclusion. Finally, we screened the full-text sources and extracted data.

We changed our screening approach slightly (for the better) from our proposed approach in our preregistration (Messer2022, 126 ) . We decided that both screeners would review all sources during all stages of the review, as this would increase the reliability of the results.

Figure 1 shows the general trend of automatic assessment-related papers over time. The increasing trend of research into automated grading correlates with increasing class sizes.

Figure 2 shows an overview of the number of included and excluded papers in each stage of the review process. Our search for primary sources resulted in 1490 papers from the three databases, which, after automated and manual deduplication, left 1088 papers to screen by title and abstract. We excluded 820 publications based on our exclusion criteria during the title and abstract screening stage. Most of the exclusions were due to the paper’s lack of focus on grading or feedback on programming assignments.

To conduct our introduction and conclusion screening, we sought the full text of 268 papers, 8 of which we could not retrieve due to a lack of access to the publication. Of the 260 papers we retrieved, 112 were excluded, most for not focusing on grading or feedback, for paper length, or for focusing on visual programming languages.

In the full-text stage of our screening, we reviewed 147 articles. Out of the 147, 26 were excluded, most of which we excluded as they worked towards a grading or feedback tool rather than discussing a completed tool, leaving 121 papers to extract data from. The final included papers can be found in the supplementary material grouped by research question.

We opted to annotate and analyse the papers themselves instead of extracting each tool from the articles, as the aim of this literature review is to provide an overview of current state-of-the-art research. Very few papers discussed multiple tools, and for these papers, we annotated based on all the tools discussed, which were typically within the same domain as each other. Furthermore, few tools were discussed in multiple papers, with only Antlr (rayyan-354359280, 8 , 190 , 121 , 176 , 89 ) , VPL (rayyan-354359269, 79 , 25 , 26 ) , Coderunner (rayyan-354359282, 38 , 51 , 191 ) and Travis (rayyan-354359293, 23 , 72 , 73 ) appearing in three or more papers, with Antlr and Travis primarily being underlying technologies for AATs. While there is some duplication in terms of papers discussing multiple tools and tools being discussed in multiple papers, this duplication does not overly skew the results of the literature review.


6.1. Skills Evaluated and Utilised Techniques (RQ1, RQ3, RQ6)

In this section, we present our results for the techniques utilised to automatically assess student assignments. We categorise the tools by the previously defined core programming skills of correctness, maintainability, readability and documentation (Section 2 ), and by the categories of AATs, including the degree of automation (Section 3 ).

Figure 3 shows that most of the tools focus on assessing correctness, followed by tools that grade correctness and readability or correctness and maintainability. Minimal research within our time frame has focused on grading documentation or exclusively maintainability or readability.

Figure 4 shows the combination of skills graded for each category of AAT and year. Most tools use a dynamic or static approach to assess correctness, and typically static analysis is used to grade readability and maintainability. Few papers investigate how machine learning can be used to assess any skill.

To grade and give feedback on the core skills, different techniques were used depending on the category of the AAT. Figure 5 shows the count of techniques used by category, as defined in Section 3 . It shows that most dynamic AATs use unit testing, and most other techniques use static approaches. However, few used techniques such as machine learning or analysing the graphical output (rayyan-354359310, 128 ) .


Figure 6 shows the count of the degree of automation, with most tools implementing fully automated graders. For a few tools, the publication was unclear on their automation approach.


81% of tools utilise a fully automated approach to assessment, while others have opted for a semi-automated approach (14%). Most AATs are implemented using a fully automated approach to allow for near-instantaneous feedback and multiple resubmissions of assignments without increasing the grading workload. In contrast, semi-automated approaches are typically used to verify grades given by an AAT or to aid manual grading or feedback.

Most fully automated AATs offer bespoke solutions or implement existing grading tools, such as Virtual Programming Lab (Rodriguez2012, 155 ) , primarily to grade correctness, typically by implementing some form of unit tests. Some tools have adopted continuous integration and delivery, using industry tools such as GitHub Actions (https://github.com/features/actions) and TravisCI (https://www.travis-ci.com/) to run unit tests when students push their code to a repository. Students typically receive the test results as feedback directly from the chosen tool.

Semi-automated assessment can be implemented using a variety of techniques. (rayyan-354359291, 7 ) ’s tool involves instructor interaction at multiple stages of its grading workflow for computer graphics assignments. Instructors are expected to provide reference solutions and tests, read reports on outliers, highlight source code to produce a rubric, and finally grade the assignments based on the previous test results, extracted rubrics and highlighted code. In (rayyan-354359270, 80 ) ’s grader, by contrast, the instructors’ role is to manually grade any assignments that cannot be automatically graded and to correct any issues with the automatically generated test cases.

6.1.1. Correctness

66% of the research included in this review focused only on assessing correctness. Approximately 36% of AATs focused entirely on correctness and used only dynamic analysis in the form of unit testing. The others used static analysis (17%), a combination of dynamic and static analysis (21%), or machine learning, either combined with static or dynamic analysis or as a standalone model (6%). Most static analysis tools implement some form of comparison to a model solution or a set of existing solutions to grade or give feedback on a task, whereas most combined dynamic and static analysis tools pair unit testing with comparison to other solutions.

(rayyan-354359319, 146 , ) developed a gamified web-based automated grader for C programs. They utilise unit tests for both calculating the grade and providing specific feedback. Additionally, they award a certain amount of experience points and a badge for completing the assignment to a certain level. To evaluate their tool, they conducted a student survey and concluded that it was helpful and that users appreciated the gamification system.

Similarly, (rayyan-354359293, 23 , ) developed a unit test-based tool for automatically grading Android applications. They employ an industry-standard continuous integration tool to automatically run the test suite when students push to their version control repository. The students are awarded specific points for each test case passed, with the sum being their total grade. The authors conducted a student survey and found that most spoke highly of the system, but some students found that the grading logic is not flexible enough, a known issue with unit test-based AATs.

While many automated assessment tools utilise dynamic approaches, some use a static approach. (rayyan-354359388, 43 , ) developed a grader that uses an instructor-provided reference solution and verifications to grade a submission’s correctness. The verifications allow instructors to adjust the assignment difficulty and associate the verifications with specific language elements. When evaluating their tool, they found that students’ success rate improved, using the framework took less effort from the instructors, and they had a favourable reception from the students.

Dynamic and static approaches are most commonly used to automate the assessment of programming assignments. However, some tools have opted to use machine learning to assess the correctness of students’ programming assignments. (rayyan-354359341, 182 , ) trained a Support Vector Machine to grade assignments based on their structural similarity. They first convert the submissions to their abstract syntax trees (ASTs) and then replace all the identifiers with a standard character. Using these ASTs, they can calculate the structural similarity and produce a final score.
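The identifier-normalisation step described above can be sketched with Python's ast.NodeTransformer; the placeholder name and the idea of feeding the result into a similarity measure or classifier are assumptions for illustration, not the cited authors' exact implementation:

```python
import ast


class NormaliseIdentifiers(ast.NodeTransformer):
    """Replace every identifier with a standard placeholder so that only
    the program's structure remains."""

    def visit_Name(self, node: ast.Name) -> ast.Name:
        return ast.copy_location(ast.Name(id="_id", ctx=node.ctx), node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = "_id"
        for arg in node.args.args:
            arg.arg = "_id"
        self.generic_visit(node)
        return node


def normalised_dump(source: str) -> str:
    """Canonical string form of the identifier-free AST, ready for a
    similarity measure or a classifier such as an SVM."""
    tree = NormaliseIdentifiers().visit(ast.parse(source))
    return ast.dump(tree)


if __name__ == "__main__":
    a = "def double(x):\n    return x * 2\n"
    b = "def twice(value):\n    return value * 2\n"
    print(normalised_dump(a) == normalised_dump(b))  # True: same structure
```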

(rayyan-354359367, 1 ) and (rayyan-354359295, 169 ) adapted natural language processing techniques to assess source code. They normalised the source by removing comments and standardising the names and string literals. (rayyan-354359367, 1 ) encode the source code as a binary vector, while (rayyan-354359295, 169 ) trained a skip-gram model, a neural network that learns word vectors by predicting the context words surrounding a given word, to produce their vector encoding. Finally, they both train neural networks for a final task: (rayyan-354359367, 1 ) used a dense neural network to predict how to repair an incorrect solution, and (rayyan-354359295, 169 ) used a convolutional neural network to evaluate the quality of the assignments based on their underlying semantic structure.

6.1.2. Maintainability

While most AATs focus on correctness, only one tool in our review assesses maintainability exclusively. (rayyan-354359378, 37 ) investigated conceptual feedback versus traditional feedback using a tool called Testing Tutor. With Testing Tutor, students learn how to create higher-quality test suites and improve their testing abilities. The ability to create a high-quality test suite allows students to develop higher-quality code by finding bugs in their implementation. Additionally, creating a suite of test cases allows for future regression testing, that is, tests that can be re-run every time the program changes, which is a vital part of creating maintainable software (kaner1999testing, 87 ) .

Testing Tutor utilises a reference solution to detect missing test cases or fundamental concepts, such as testing boundary conditions or data integrity. The tool gives feedback either as a detailed coverage report or as conceptual feedback describing a core concept that has been missed in the testing.

6.1.3. Readability

Four tools focus on assessing only readability and use a static analysis approach for their grading or feedback (rayyan-354359306, 83 , 89 , 132 , 114 ) .

(rayyan-354359360, 89 , ) implemented a tool to promote learning code quality using automated feedback. Their tool recommends improvements in a student’s code and comments by providing suggestions that the student can implement.

They check that the naming has a consistent style (all names written in camelCase or snake_case ), check for misspelt subwords, and validate that names contain meaningful subwords. Meaningful subwords should contain only letters and no stop words. They implement similar suggestions for comments by checking that comments do not contain misspelt or meaningless words and are not too short. To evaluate the tool’s effectiveness, they conducted two experiments: the first used a control group and an intervention group, and the second focused on a year-one and a year-two programming course. They conclude that while the research is incomplete, the tool can be helpful, as students do not satisfy all the code quality requirements due to human error.
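A minimal sketch of this kind of naming check is given below; the regular expressions and the stop-word list are illustrative assumptions, not the cited tool's actual rules:

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
CAMEL_CASE = re.compile(r"^[a-z][a-zA-Z0-9]*$")
STOP_WORDS = {"the", "a", "an", "of"}  # illustrative stop-word list


def naming_issues(name: str) -> list[str]:
    """Return readability issues for a single identifier."""
    issues = []
    if not (SNAKE_CASE.match(name) or CAMEL_CASE.match(name)):
        issues.append(f"'{name}' is neither snake_case nor camelCase")
    # Split the identifier into subwords for further checks.
    subwords = re.split(r"_|(?<=[a-z])(?=[A-Z])", name)
    for word in subwords:
        if not word.isalpha():
            issues.append(f"subword '{word}' contains non-letter characters")
        elif word.lower() in STOP_WORDS:
            issues.append(f"subword '{word}' is a stop word")
    return issues


if __name__ == "__main__":
    for identifier in ["total_sum", "theValue2", "XML_parser"]:
        print(identifier, naming_issues(identifier))
```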

Similarly, (rayyan-354359398, 114 ) extended an existing Python static analysis tool by developing custom checks and feedback for novice programmers. They provide a textual description of each error, give an example, and explain why the error is problematic. To evaluate their tool, they compared two years of an introductory Python course, with one year using the tool and the previous year without. The cohort that used the authors’ tool showed a significant reduction in the number of repeated errors per submission and in the number of submissions required to pass the exercise.

6.1.4. Documentation

No papers focus entirely on documentation. However, (rayyan-354359305, 64 ) introduced AppGrader, an automated grader for Visual Basic applications, which graded correctness alongside documentation. They used static analysis to check whether best practices have been followed and whether comments exist for each subroutine or function. However, the tool did not analyse the quality of the documentation. In their bench tests, they assessed the tool against two scenarios to find the overall execution times: a typical homework assignment for an introductory programming course and the source code of the tool itself. The average overall execution time was 8.54 seconds for the typical homework assignments, allowing the tool to be used for near-instantaneous feedback during development.
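A comparable existence-only documentation check can be sketched for Python (rather than Visual Basic) with the standard ast module; like the cited tool, it tests only that documentation exists, not its quality:

```python
import ast


def undocumented_functions(source: str) -> list[str]:
    """Return the names of functions and classes that lack a docstring."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing


if __name__ == "__main__":
    sample = (
        "def documented():\n"
        '    """Explains what the function does."""\n'
        "    return 1\n"
        "\n"
        "def undocumented():\n"
        "    return 2\n"
    )
    print(undocumented_functions(sample))  # ['undocumented']
```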

6.1.5. Combination Graders

While some AATs only focus on grading one skill, others aim to grade a combination of skills with varying approaches. The most common is to assess readability alongside correctness (15%). These tools primarily use a dynamic approach to assess correctness and a static approach to assess readability.

Some tools assess maintainability in addition to correctness and readability (6%) or just in addition to correctness (4%). These typically use either dynamic approaches in the form of mutation tests to assess the quality of the student-created test suites or static approaches such as metrics to calculate the complexity of the code. Few tools focus exclusively on assessing maintainability and readability (3%). These AATs use a static approach to assess both these skills.

(rayyan-354359369, 42 ) introduced Annete, an intelligent tutor for Eclipse, to give feedback on correctness and readability. They use a neural network with a supervised learning algorithm to determine whether a student needs assistance with their code and what feedback should be shown. Annete can give multiple types of feedback, including feedback about how to proceed, such as flagging language structures that should not be used in a particular assignment, and practical support, such as giving positive feedback when a test case has been passed.

Similarly, (rayyan-354359351, 179 ) developed a tool that provides feedback on maintainability and readability during development and then uses unit tests to grade correctness after the final submission. For the feedback on maintainability and readability, they utilised static analysis to find antipatterns, including commonly occurring practices that reflect misunderstanding, poor design choices and code style violations in students’ submissions.

While most tools provide exclusively textual feedback, (rayyan-354359333, 56 ) developed a tool that combines spectrum-based fault localisation with visualisation to provide feedback on maintainability and uses unit testing to grade and provide feedback on correctness. They visualise the spectrum analysis as a heat map, with more suspicious code producing a higher score. To evaluate their tool, they conducted a user study over two semesters: one semester acted as a control group that received only textual feedback, and the other received both textual feedback and the heat-map visualisation. They found that having access to heat maps made it easier for students to make more incremental progress towards maximising their scores.
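A minimal sketch of spectrum-based fault localisation using the widely used Tarantula suspiciousness formula is shown below; the coverage data is hard-coded, and the cited tool's exact formula and visualisation are not specified in the paper:

```python
# Spectrum-based fault localisation: score each line by how strongly its
# execution correlates with failing tests (Tarantula formula).


def tarantula(failed_cov: int, total_failed: int,
              passed_cov: int, total_passed: int) -> float:
    """Suspiciousness in [0, 1]; higher means more likely to be faulty."""
    if failed_cov == 0:
        return 0.0
    fail_ratio = failed_cov / total_failed
    pass_ratio = passed_cov / total_passed if total_passed else 0.0
    return fail_ratio / (fail_ratio + pass_ratio)


if __name__ == "__main__":
    # line -> (times covered by failing tests, times covered by passing tests)
    coverage = {10: (3, 5), 11: (3, 0), 12: (0, 5)}
    total_failed, total_passed = 3, 5
    for line, (f, p) in coverage.items():
        score = tarantula(f, total_failed, p, total_passed)
        print(f"line {line}: suspiciousness {score:.2f}")  # heat-map input
```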

6.2. Languages Evaluated (RQ2)

As AATs can focus on grading any number of languages, we grouped the languages into their primary paradigm. Figure 7 shows the count of language paradigms among the tools. Most tools assess OOP languages, followed by functional languages and tools designed to grade any language (non-OO or imperative languages did not appear on their own in the literature and were always paired with OO languages; languages that can be used in an OO or non-OO way are included in the OO category). Some papers did not clearly specify which language the tool assesses; these have been annotated as unknown.


6.2.1. Object-Oriented, Functional and Logical Programming Languages

Most AATs grade object-oriented languages (69%), such as Java and Python. Due to object-oriented languages’ prominence in education and industry, object-oriented graders are the most developed and make practical teaching tools for the programming approaches commonly taught (Kolling1999, 101 ) .

(rayyan-354359410, 32 , ) developed a tool to generate personalised hints for Python. Their tool first finds successful submissions that have failed the same test at some point and then computes an edit path from the student’s incorrect submission to the successful submissions. They then translate the edit path to natural language hints. To evaluate their tool, they used a set of historical submissions and measured the percentage of submissions for which the tool can generate hints. They found that their tool could generate a personalised hint for most submissions while only using data from as little as ten previous submissions.
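A rough sketch of the edit-path idea using Python's difflib is given below; a real tool computes edits against many prior successful submissions and produces richer hints, so the single stored correct solution and the hint wording here are illustrative assumptions:

```python
import difflib


def next_step_hints(incorrect: str, correct: str, limit: int = 3) -> list[str]:
    """Turn line-level edits between an incorrect and a correct submission
    into simple natural-language hints."""
    hints = []
    matcher = difflib.SequenceMatcher(None, incorrect.splitlines(), correct.splitlines())
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            hints.append(f"Consider revising line(s) {i1 + 1}-{i2} of your solution.")
        elif op == "insert":
            hints.append(f"Something seems to be missing around line {i1 + 1}.")
        elif op == "delete":
            hints.append(f"Line(s) {i1 + 1}-{i2} may be unnecessary.")
    return hints[:limit]


if __name__ == "__main__":
    wrong = "def mean(xs):\n    return sum(xs)\n"
    right = "def mean(xs):\n    return sum(xs) / len(xs)\n"
    for hint in next_step_hints(wrong, right):
        print(hint)
```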

While (rayyan-354359410, 32 ) focused on generating personalised hints for Python assignments, other authors focused on automated grading of Java. (rayyan-354359383, 156 ) developed a tool to aid teaching assistants when manually reviewing submissions by generating a report assessing how well the student performed against a set of learning objectives. Similarly, (rayyan-354359337, 51 ) ’s tool aimed to aid graders with grading the correctness of methodology, specifically whether students’ submissions contain the required methods or fields, rather than the correctness of the functionality.

(rayyan-354359347, 116 , ) utilised formal semantics to develop a real-time Python AAT. They use a single reference solution to find differences in the output and execution trace of a student’s submission. The student’s submission is graded as incorrect if a difference is found. They evaluated their tool against a set of benchmarks of existing submissions graded by test suites and discovered that it revealed no false negatives, but a test suite did since it was missing some test cases.

Similarly, (rayyan-354359325, 121 , ) developed a tool that uses semantic analysis, a knowledge base of programming patterns and the instructor’s input to correlate the patterns with detailed natural language feedback to provide personalised feedback for Java assignments. They compared their work against state-of-the-art techniques, including (rayyan-354359347, 116 ) ’s tool, and concluded that their approach is based on understanding the semantics of the submissions and the original intentions of students when dealing with an assignment.

In addition to assessing object-oriented languages, some AATs assess functional languages, typically Haskell or OCaml (rayyan-354359307, 134 , 17 , 115 , 24 , 70 , 167 , 60 , 108 , 5 , 62 ) . (rayyan-354359336, 24 , ) developed an online IDE and AAT for OCaml while designing a massive open online course (MOOC). The IDE provided syntax and type error feedback as annotations and graded student submissions using unit tests.

While (rayyan-354359336, 24 ) developed an IDE as part of a MOOC, (rayyan-354359401, 108 , ) developed a program repair-based feedback tool for OCaml. They utilised test cases and correct reference solutions to find the fault and repair logical errors in students’ submissions. They conducted a benchmark and student survey to evaluate the tool’s efficiency and helpfulness. They found that the tool is powerful and capable of fixing logical errors in student submissions and that the students found it helpful.

(rayyan-354359375, 167 , ) developed an alternative program repair-based feedback tool for OCaml, utilising multiple partially matching solutions and test cases to generate a fix for logical errors. They compared their tool to (rayyan-354359401, 108 ) ’s tool and found that their approach was more effective at repairing submissions. Additionally, they conducted a user study, and students agreed that their tool was helpful.

Other tools have generated “am I on the right track” feedback for other functional languages. (rayyan-354359397, 60 , ) conducted a pilot study of a tool for Scheme that transformed a student’s partial submission into a final program with the same functionality as a desired correct solution to determine if the student is on the right track.

Similarly, (rayyan-354359412, 62 , ) developed a tool for Haskell, which used programming strategies derived from instructor-annotated model solutions to determine if a student’s submission is equivalent to a model solution. They deployed their tool into their course, and most interactions with the tool were classified as correct or incorrect. Furthermore, they conducted a student survey to evaluate the perceived usefulness of the tool. They found that students were taking larger steps than the tool could handle, that the tool was sufficient, and that there was room for improvement.

Only one paper grades a logic language: (rayyan-354359414, 105 , ) has developed a tool to assess Prolog clauses using static analysis. To grade and give feedback on Prolog clauses, the authors convert the clauses to abstract syntax trees and extract patterns that encode relations between nodes and the program’s syntax tree. These abstract syntax tree patterns are then used to predict program correctness and generate hints based on missing or incorrect patterns. They evaluated their approach on past student assignments and found that the tool helps classify Prolog programs and can be used to provide valuable hints for incorrect submissions. However, more work must be done to make the hints more understandable by annotating natural language explanations of their patterns and derived rules.

6.2.2. Specific Language Domains

While some AATs focus on the three major language paradigms, other AATs focus on grading more specific areas, such as web-based languages (rayyan-354359298, 157 , 132 , 194 ) , graphics development (rayyan-354359280, 8 , 7 , 118 , 191 ) , kernel development and assembly languages (rayyan-354359366, 117 , 40 , 142 ) , or query languages (rayyan-354359278, 185 ) .

(rayyan-354359362, 132 , ) introduced a tool to grade the quality of web-based team projects and measure students’ contributions. The authors utilise continuous integration tools to analyze the version control logs to determine the students’ contribution and use existing static analysis tools, including SonarQube and StyleLint, to evaluate the quality of the source code. While applying this tool to a course in their department, they found a weak association between lines of code modified and the final grade. Students with better grades fixed more errors but also introduced more errors.
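Measuring per-student contribution from version control logs can be sketched as follows (a minimal example that shells out to git, which is assumed to be installed, and counts lines added and removed per author; the cited tool's actual pipeline is more elaborate):

```python
import subprocess
from collections import defaultdict


def contributions(repo_path: str) -> dict[str, int]:
    """Total lines added plus removed per author, from `git log --numstat`."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--pretty=format:@%an"],
        capture_output=True, text=True, check=True,
    ).stdout
    totals: dict[str, int] = defaultdict(int)
    author = None
    for line in log.splitlines():
        if line.startswith("@"):
            author = line[1:]
        elif line.strip() and author:
            added, removed, _path = line.split("\t", 2)
            if added.isdigit() and removed.isdigit():  # skip binary files ('-')
                totals[author] += int(added) + int(removed)
    return dict(totals)


if __name__ == "__main__":
    print(contributions("."))
```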

To assess computer graphics courses, (rayyan-354359340, 118 , ) and (rayyan-354359349, 191 , ) developed AATs to grade OpenGL. Both solutions compared the students’ output to a reference solution by comparing the difference in pixels, with (rayyan-354359349, 191 ) also grading based on parameters passed to OpenGL and algorithm results, such as outputs of parametric equations. (rayyan-354359340, 118 ) implemented “visual unit testing”, allowing instructors to script keyboard and mouse input and provide detailed feedback through screenshots and videos of the execution. They both conducted student surveys to evaluate their tools’ usefulness for learning computer graphics and reported that the students found the tools helpful when learning to program computer graphics.
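
As a rough illustration of the pixel-difference idea (not the implementation of either tool), the sketch below compares a student's rendered frame to a reference render saved as same-size PNG files; the file names, per-channel tolerance and 98% pass threshold are assumptions made for this example.

```python
# Minimal sketch of pixel-difference grading for a graphics testcase.
# Assumes both renders were saved as same-size PNG files; the per-channel
# tolerance and the 98% pass threshold are illustrative values only.
import numpy as np
from PIL import Image

def pixel_match_score(student_png: str, reference_png: str, tolerance: int = 8) -> float:
    """Fraction of pixels whose RGB values are within `tolerance` of the reference."""
    student = np.asarray(Image.open(student_png).convert("RGB"), dtype=np.int16)
    reference = np.asarray(Image.open(reference_png).convert("RGB"), dtype=np.int16)
    if student.shape != reference.shape:
        return 0.0  # a render at the wrong resolution scores zero in this sketch
    close = np.all(np.abs(student - reference) <= tolerance, axis=-1)
    return float(close.mean())

# e.g. award the testcase if at least 98% of pixels match the reference render:
# passed = pixel_match_score("student_frame.png", "reference_frame.png") >= 0.98
```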

To assess operating system kernels, (rayyan-354359390, 142 , ) developed a cloud-based Linux kernel practice environment and judgement system. This uses a dynamic approach to test the students’ attempts at programming an operating system kernel by comparing their kernel outputs to the teacher’s reference solution. The authors evaluated the parallelisation of their tool by measuring the execution time when running tests and comparing these results to a baseline serial execution, with their tool taking less time to grade longer scripts. Finally, they concluded that their tool works well in the real world and reduces the effort needed to verify a student’s work.

While (rayyan-354359390, 142 ) focused on assessing kernel-level assignments, (rayyan-354359373, 40 , ) and (rayyan-354359366, 117 , ) developed tools to assess assembly code assignments. Both tools utilised a dynamic approach in the form of test cases to grade and give feedback on their assignments. Additionally, (rayyan-354359366, 117 , ) also implemented an IDE plugin to give continuous feedback to students, including errors when the code cannot assemble, warnings to indicate potential bugs and information based on the analysis of the code. To evaluate their tools, they surveyed students and found that their tools were beneficial and contributed positively to the students’ learning of assembly languages.

To grade SQL statements, (rayyan-354359278, 185 , ) investigate combining dynamic and static analysis. The dynamic analysis compares the output of the students’ statements with the expected value. The static analysis compares the syntax similarity using an abstract syntax tree and the textual similarity of the statements themselves. To evaluate their approach, they compared their hybrid approach to a dynamic analysis that executes the statements and compares the results with the expected results, a syntax-based approach that calculates the syntax similarity between a submission and a reference solution using the abstract syntax tree, and a text-based approach that calculates the textual similarity between the student’s statement and a reference statement. They found that the existing grading approaches could not yield satisfactory results and that the hybrid approach successfully identifies various correct statements submitted by students and grades other statements according to their similarities.
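
The sketch below illustrates the general shape of such a hybrid scheme, not the reviewed tool itself: a dynamically correct query earns full marks, otherwise partial credit comes from the similarity between the student's and reference statements. The SQLite fixture, the token-level proxy for syntax similarity and the equal weighting are all assumptions made for this example.

```python
# Illustrative hybrid grading of a SQL statement: compare query results
# dynamically, and fall back to similarity-based partial credit otherwise.
# The SQLite database path, the token-based "syntax" proxy and the 50/50
# weighting are assumptions made only for this sketch.
import sqlite3
from difflib import SequenceMatcher

def result_rows(db_path: str, sql: str):
    with sqlite3.connect(db_path) as conn:
        return sorted(conn.execute(sql).fetchall())

def hybrid_grade(db_path: str, student_sql: str, reference_sql: str) -> float:
    try:
        if result_rows(db_path, student_sql) == result_rows(db_path, reference_sql):
            return 1.0                      # dynamically correct: full marks
    except sqlite3.Error:
        pass                                # broken SQL falls through to similarity
    tokens = lambda sql: sql.upper().split()
    syntax_sim = SequenceMatcher(None, tokens(student_sql), tokens(reference_sql)).ratio()
    text_sim = SequenceMatcher(None, student_sql, reference_sql).ratio()
    return 0.5 * syntax_sim + 0.5 * text_sim  # partial credit from similarity
```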

6.3. Techniques Used to Evaluate the Tools (RQ4, RQ5)

Experiments were conducted using different techniques to evaluate the tools’ quality. Figure 8 shows the count of techniques used in the papers, with student surveys and comparisons against manual grading being the most common. Most of these experiments were conducted on course-specific assignments (66%). As such, Figure 9 shows that most of the data is not available to validate the results or for future research. While most tools are evaluated in some form, most do not provide the dataset on which the evaluation was conducted (84%); this might be due to most tools being evaluated against course-specific assignments (66%) or exams (10%), which are typically not distributed publicly.


6.3.1. Approaches

90% of the papers include some form of evaluation, with most evaluations conducted by the tool developer. The primary method of evaluation is to ask for student feedback on a tool in the form of surveys (24%), as they are the primary beneficiaries of AATs. Their insight provides valuable feedback on how well a tool helps them learn how to program.

Another typical evaluation approach compares the tool to either manual grading (22%) or other automated tools (9%). The evaluation is typically used to check the tool’s grade accuracy compared to manual grading. Other aspects, such as using an automated tool to improve students’ grades or allow them to receive feedback faster, are also evaluated. Though most of the included articles perform some form of evaluation, 10% of them do not; this could be due to the tools still being in development and not at a stage where a full evaluation would be beneficial. However, even an initial validation that the tool is performing as expected at the early stages of the project could be beneficial to confirm that the final tool will aid students’ learning.

6.3.2. Performance

As an exploratory analysis, we annotated a third of all papers that conducted some form of evaluation, recording how well the tools performed. We found that most tools perform well in their evaluations, though some authors report mixed results, especially when using multiple evaluation approaches. While the tool developers conducted most evaluations, some tools were evaluated by third parties, typically in papers evaluating multiple tools or papers reporting authors’ experiences using existing tools. Reporting only positive results makes it difficult to compare different tools reliably and makes it challenging for instructors to select the most effective tool for their course, especially as most AATs implement the same methodology.

Some authors have produced evaluation or experience reports discussing one or more tools to provide instructors with an external evaluation of some AATs. (rayyan-354359364, 153 , ) evaluated two AATs, one that gave personalised hints using program repair and a program visualisation tool, to determine if students can solve problems faster, understand problem-solving better and fix bugs more easily, compared to using only a test suite. Their experiment had three conditions: in the base condition, students only had access to the test suite, and in the other two, they had access to one of the tools and the test suite. The students were asked to complete a set of problems using one of the three conditions randomly assigned to the problem; after reaching a correct solution, they were presented with a post-test. This post-test was used to evaluate the students’ understanding of how to solve the problem and consisted of students being presented with four different solutions to a problem and being asked to identify which solutions were correct or incorrect without being able to execute the program. They found that the program repair tool greatly reduced student effort, with fewer attempts needed, while students using the visualisation tool showed lower post-test performance.

Similarly, (rayyan-354359309, 18 , ) evaluated two AATs, one implemented using unit tests and the other utilising reference solutions. They focused on applying these AATs in the context of a massive open online course. They found that the reference solution-based AAT typically performs as well as the unit test-based approach. However, the reference solution-based tool awards lower grades to correct solutions that are rarely implemented.

While some evaluation papers compare existing tools, (rayyan-354359316, 34 , ) investigate how different test suites can provide different grades and how the properties of the unit tests impact the awarded grades. To answer their research questions, they extended an existing set of programming assignments with artificial faulty versions and a sample of test suites from a larger pool. They generated the grades by calculating the percentage of passed tests and compared the different test suites. The authors concluded that the grades vary significantly across different test suites and that code coverage, the percentage of the source code executed while running a test suite, affects generated grades the most.
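
For reference, the grading scheme examined in that study, a grade proportional to the fraction of passing tests, is simple enough to sketch; the test module name below is a hypothetical placeholder for an instructor-written suite.

```python
# Minimal sketch of "grade = percentage of passed tests" using unittest.
# The test module name is a placeholder for an instructor-written suite.
import unittest

def grade_from_test_suite(test_module: str) -> float:
    suite = unittest.defaultTestLoader.loadTestsFromName(test_module)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    failed = len(result.failures) + len(result.errors)
    return 100.0 * (result.testsRun - failed) / result.testsRun if result.testsRun else 0.0

# e.g. grade = grade_from_test_suite("assignment1_tests")
```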

6.4. Performance Against Human Graders (RQ4, RQ5)

Only 22 tools are evaluated against human assessors, with 16 focused on grading and six focused on feedback. While some of the automated graders provide feedback to students, typically in the form of unit test results, similarity to model solutions or predefined human feedback, human assessors have not evaluated the generated feedback. To investigate how well AATs perform when compared to humans, we further annotated papers that conducted an evaluation that included some comparison to the assessment provided by a human.


Figure 10 shows the count of tools by different evaluation techniques that involved comparing the results from the AAT with a human and the authors’ sentiment of the performance of the AAT. Most evaluate the accuracy of the AAT against human-provided grades (rayyan-354359274, 160 , 185 , 9 , 147 , 50 , 190 , 176 , 182 ) , with most reporting positive results.

(rayyan-354359278, 185 ) ’s tool uses a hybrid approach based on reference solutions to automate the grading of SQL statements. It is evaluated against a benchmark of human-graded submissions and three other state-of-the-art approaches: dynamic, syntax-based and text-based analysis. The proposed approach performs better than the other state-of-the-art approaches with a clear advantage, achieving a mean average error of 8.37, compared to 26.01 for static analysis, 29.89 for text-based analysis and 21.66 for dynamic analysis.
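
The error metric reported there, the mean average error between automatically and manually awarded grades, is the mean of the absolute differences over the same submissions; the grade pairs in the sketch below are invented for illustration.

```python
# Mean average error between automatic and human grades: the mean of the
# absolute differences over the same set of submissions.
def mean_average_error(auto_grades: list[float], human_grades: list[float]) -> float:
    assert len(auto_grades) == len(human_grades)
    return sum(abs(a - h) for a, h in zip(auto_grades, human_grades)) / len(auto_grades)

# Hypothetical example: three submissions graded both ways.
assert mean_average_error([80, 65, 90], [85, 60, 88]) == 4.0
```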

While (rayyan-354359279, 9 , ) primarily discusses automated feedback in the form of code repair suggestions, the comparison to humans focuses solely on the effect the code repair has on the automated grading and not on how the feedback compares to that of a human assessor. The evaluation shows that their code repair tool increases the precision of their unit test-based automated grader by 4%, from 81% to 85%, measured by comparing the submission’s auto-graded result before and after code repair to manually graded scores. This shows that their tool improves the precision of the automated grader by repairing uncompilable code that can then be automatically graded.

Most tools that evaluate the accuracy of automated graders take the human-provided grades as the ground truth. In contrast, (rayyan-354359297, 81 , ) introduce JavAssess, a framework that automatically inspects, tests, marks and corrects Java source code. To evaluate the performance of JavAssess against human graders, they compare the accuracy of manually and automatically graded exams. However, they treat the automatically provided grades as the ground truth and conduct an ANOVA test to validate the influence of human graders on marking errors. They concluded that human graders influence the mark when marking, with the probability value associated with the F value, Pr(>F) = 0.0164, below the significance level of 0.05.
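
One plausible reading of that test is a one-way ANOVA of marking errors grouped by grader; the sketch below shows only the shape of such a test with hypothetical error samples, not the study's actual design or data.

```python
# Sketch of a one-way ANOVA asking whether marking error (grader's mark minus
# the reference mark) depends on which human grader did the marking.
# The per-grader error samples below are hypothetical.
from scipy import stats

errors_by_grader = {
    "grader_a": [0, 2, -1, 3, 1],
    "grader_b": [4, 5, 3, 6, 4],
    "grader_c": [1, 0, 2, -2, 1],
}
f_stat, p_value = stats.f_oneway(*errors_by_grader.values())
print(f"F = {f_stat:.2f}, Pr(>F) = {p_value:.4f}")  # p below 0.05 suggests a grader effect
```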

While most papers evaluate the tools’ accuracy, some investigate the correlation between human-provided grades and automated tools (rayyan-354359272, 31 , 187 , 35 , 112 , 140 ) . (rayyan-354359272, 31 , ) utilise unit testing and pattern rules to automate the grading of six programming assignments within a single undergraduate course. To evaluate their tool, they compared the human grades to the tool’s grades. They found that the tool’s grades were significantly positively correlated with human grading in all six assignments (p < 0.001), with on average 70% of the tool’s grades being identical to the human graders’.

(rayyan-354359300, 112 , ) provide a detailed statistical analysis comparing manual and automated assessment of programming assignments and report on a lecturer’s experience integrating automated assessment into their module. To evaluate their chosen AAT, they used 77 exams that were both manually and automatically graded; the grades were compared using a paired t-test, and the correlation between manual and automated assessment was measured with Pearson’s correlation coefficient.

Finally, the students were classified into three categories (failing, passing, and passing with distinction) based on their manual assessment marks, and a t-test was then performed on the categorical data. The t-test on the marks resulted in a “medium practically visible difference found”, which can be attributed to the difference in granularity between the marks given by manual assessment and those given by unit tests, while the correlation proved to be a significant relationship (r = 0.789, p < 0.001). The t-test results on the categorical data show a very large, practically significant difference for failing students (d = 1.303, p < 0.001), suggesting that automatic assessment is not reliable for this group. However, they found that automated assessment can be more trustworthy for higher-achieving students, with only a small, non-significant effect size for students who passed (d = 0.151, p = 0.409).
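
The statistical machinery used in that comparison is standard; a minimal sketch, with made-up marks standing in for the study's 77 exams, shows the paired t-test and Pearson's correlation in scipy.

```python
# Paired t-test and Pearson correlation between manual and automated marks
# for the same submissions. The mark lists are placeholders, not study data.
from scipy import stats

manual    = [55, 72, 40, 88, 63, 91]   # hypothetical manual marks
automated = [50, 75, 35, 90, 60, 93]   # hypothetical automated marks, same students

t_stat, t_p = stats.ttest_rel(manual, automated)   # paired t-test
r, r_p = stats.pearsonr(manual, automated)         # Pearson's correlation coefficient
print(f"paired t-test: t = {t_stat:.2f}, p = {t_p:.3f}; correlation: r = {r:.3f}, p = {r_p:.3f}")
```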

Another factor in automating grading is reducing the time instructors take to grade. Two papers investigate the time taken to grade assignments with and without their tool (rayyan-354359270, 80 , 128 ) . (rayyan-354359270, 80 , ) introduce a semi-automated approach to grading Java assignments by automatically generating unit tests from an instructor’s solution and only presenting the graders with submissions that cannot be graded using the unit tests. To evaluate how well their tool expedites the grading process, they asked graders to grade six Java exams manually and then, six months later, asked the same six examiners to mark the exams using the tool. While instructors invested an additional hour, on average, to prepare the exams, the automated tool reduced the average time taken to grade each submission from 6 minutes to 2.5 minutes, and the total time invested decreased from 53.5 to 22.3 hours, a net saving of 25.2 hours once the additional preparation time is taken into account.

Only six papers evaluate the feedback by comparing the automatically generated feedback to human-generated feedback (rayyan-354359367, 1 , 60 , 68 , 111 , 156 , 180 ) . Table 3 shows the count of tools by different evaluation techniques that involved comparing the results from the AAT with a human and the authors’ sentiment of the performance of the AAT.

(rayyan-354359367, 1 , ) use code repair to provide targeted examples for compilation errors. They evaluate their AAT by comparing the time taken to repair compilation errors with and without the automated code repair feedback, with both groups having access to human teaching assistants (TAs). They found that the tool helped resolve errors 25% faster on average, and the large-scale controlled nature of the empirical evaluation implies that the tools are comparable to human TAs.

(rayyan-354359397, 60 , ) utilises instructor-provided solutions to generate “am I on the right track” feedback. As part of a pilot study, the tool’s feedback was shown to a subset of experienced TAs, who were asked to replicate what they would do when interacting with incorrect student code and to provide feedback on the tool. The TAs in the study provided mixed feedback and suggested areas of improvement, with one TA saying that the fully automated feedback features left them with a lack of control over the process. Other TAs saw the potential of the tool and how “am I on the right track” style feedback can aid struggling students when they do not have access to teaching staff.

(rayyan-354359405, 180 , ) present a virtual teaching assistant to help teachers detect object-oriented errors in students’ source code by converting source code to Prolog and inferring errors from instructor-provided rules. To validate the corrections made by their tool, they used the virtual teaching assistant to check students’ coursework previously checked by human graders, both to see whether the tool overlooks corrections made by the instructors and to reduce the instructors’ workload. The authors observe that the tool detected 125 additional errors, totalling 196 instead of 71 object-oriented errors. However, they also found three types of object-oriented errors that the tool cannot detect: non-static methods that do not use any field or method of the object, non-abstracted fields in sister classes, and using Java Collections methods without implementing the equals method. They repeated this study with additional rules to handle these missing errors, and the tool found 29 instances of these three specific errors that the instructors had overlooked.

7. Discussion

7.1. Why the Focus on Correctness Assessment? (RQ3)

There are many reasons why researchers choose to focus on assessing correctness over other important skills. Assessing the correctness of functionality teaches students how to analyse and implement features based on written requirements. Understanding how the requirements can be translated to the desired features is a crucial skill and one that is often used in the industry. Additionally, assessing the correctness of methodology allows instructors to verify that students understand particular programming concepts, such as conditionals, iteration, and recursion, or other course material, such as particular algorithms they have been asked to implement.

Primarily, assessing correctness uses a dynamic approach in the form of unit tests, which have several benefits and limitations. Unit tests allow for quickly assessing large quantities of assignments and enable students to receive near-instantaneous feedback from the test suite results. Implementing a test suite to assess an assignment requires specific unit testing knowledge, and more complex implementations require reflection knowledge. However, many unit test-based tools aid instructors in creating their test suites by providing a framework to implement them or simplifying unit testing into a set of input-output tests.
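
The input-output style of testing that many of these tools expose to instructors can be reduced to a very small harness; the sketch below assumes a Python submission run from the command line, with the file name, interpreter and time limit chosen only for illustration.

```python
# Minimal input-output testcase: run the student's program with a fixed stdin
# and compare the captured stdout to the expected text. The submission file
# name, the "python" interpreter on PATH and the 10-second limit are assumptions.
import subprocess

def run_io_testcase(program: str, stdin_text: str, expected_stdout: str) -> bool:
    completed = subprocess.run(
        ["python", program],
        input=stdin_text, capture_output=True, text=True, timeout=10,
    )
    return completed.returncode == 0 and completed.stdout == expected_stdout

# e.g. run_io_testcase("submission.py", "3 4\n", "7\n")
```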

While these AATs offer near-instantaneous feedback, it is often simplistic, typically displaying whether the test passed or failed and, if it failed, the difference between the expected and actual outputs or any errors. This limited feedback only informs the student of an issue in their code, either that an error is produced or that their output does not match the desired output, and does not help them resolve the issue. Providing this form of limited feedback, where the student knows the error and its potential location but receives no hints on how to fix it, is similar to the limited and cryptic feedback of compiler messages, which often results in increased frustration and hampers progress (Becker2016, 16 ) .

Furthermore, most of these AATs cannot award partial grades for incomplete or uncompilable programs or distinguish between qualitatively different incorrect solutions, resulting in students receiving a zero grade. In contrast, if a human graded them manually, they would typically receive partial marks for source code that implements a subset of the features or has minor logical or syntax issues. Some AATs resolve this by implementing code repair to fix the broken code before running the test suite, thus providing partial grades for syntactically incorrect submissions, typically by deducting marks from the unit tests results of the repaired code (rayyan-354359270, 80 , 181 , 9 , 19 , 186 , 161 ) .
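
One simple way to express that partial-grading idea is to run the test suite on the repaired code and deduct a small penalty per repair edit; the 5% per-edit deduction below is an assumption for illustration, not a policy taken from any reviewed tool.

```python
# Partial grade after code repair: score the repaired code with the test
# suite, then deduct a penalty per repair edit. The 5% per-edit penalty is
# purely illustrative.
def partial_grade(tests_passed: int, tests_total: int, repair_edits: int,
                  penalty_per_edit: float = 0.05) -> float:
    raw = 100.0 * tests_passed / tests_total if tests_total else 0.0
    return max(0.0, raw * (1.0 - penalty_per_edit * repair_edits))

# e.g. a repaired submission passing 8/10 tests after 2 repair edits scores 72.0
```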

Another limitation of using unit testing to assess correctness is that unit tests typically require students to follow a strict pre-defined structure for their code, often with class and function names specified. Furthermore, the popularity of unit test-based AATs can influence instructors when designing their assignments. If instructors have large class sizes or want to give students near-instantaneous feedback, they often use a unit test-based AAT, typically leading to small-scale closed-ended assignments. These small-scale closed-ended assignments and the strict structure can limit students’ opportunities to be creative. Additionally, this strict structure and the small scale of the assignments can limit students’ ability to learn how to write maintainable and readable code by limiting their opportunities to learn how to name classes and functions and how to design object-oriented solutions.

Instructors can design open-ended larger-scale coursework to enable students to take responsibility for their assignments and use their creativity to implement the required features. Students who assume control over their learning experience gain knowledge more effectively than those who do not (Knowles1975, 99 ) . Furthermore, allowing students to take ownership of their projects and use their creativity can increase their motivation and learning opportunities (Sharmin2021, 162 ) .

However, these are often impractical to assess at scale, as assessing open-ended assignments is time-consuming and challenging to automate with traditional auto-grading methods. A common approach to assessing open-ended assignments is to use multiple human graders (Aziz2015, 11 ) . However, using multiple graders can lead to grade and feedback consistency issues, as readability, maintainability, and documentation are often subjective. While multiple graders reduce the workload for a single grader, the overall assessment process is still time-consuming. Additionally, they cannot offer the near-instantaneous feedback that fully-automated assessment approaches can offer. The impracticality of open-ended assignments often leads to instructors designing their assignments to work with traditional auto-grading approaches, limiting the opportunities for students to take control of their learning.

7.2. Why Assess Maintainability, Readability and Documentation? (RQ3)

While most tools focus on assessing correctness, few focus on assessing code quality aspects, such as maintainability, readability or documentation. The automated assessment of these skills is typically minimal and is often focused on static analysis to determine the quality of the source code.

While evaluating these skills using static analysis can provide a good indicator, most static analysis tools and metrics are designed to evaluate professional code. Some metrics that evaluate these areas may not be suitable for all types of novice programming assignments, especially short-form assignments typically used with unit testing. This could be due to these assignments providing the overall code design, limiting the ability to use maintainability metrics such as Depth of Inheritance Tree or Coupling Between Object Classes (Chidamber1994, 29 ) .
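
As a small illustration of why such metrics say little about short, pre-structured exercises, the sketch below computes Depth of Inheritance Tree for Python classes from the method resolution order; the classes are hypothetical, and the convention of not counting object is an assumption of this sketch.

```python
# Depth of Inheritance Tree (DIT) for Python classes, read off the method
# resolution order. Not counting the class itself or `object` is a convention
# chosen for this sketch; `Person`/`Student` are hypothetical classes.
def depth_of_inheritance(cls: type) -> int:
    return len(cls.__mro__) - 2

class Person: ...
class Student(Person): ...

assert depth_of_inheritance(Person) == 0    # no user-defined ancestors
assert depth_of_inheritance(Student) == 1   # one level of inheritance
# In a starter-code template where the hierarchy is already given, every
# submission yields the same DIT, so the metric cannot differentiate students.
```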

In the case of documentation, the static analysis tools are typically limited to detecting the presence of comments and that they follow the correct format. However, recent research has started to investigate how to implement metrics commonly used in prose that can be applied to evaluating documentation. (Eleyan2020, 58 , ) use the Flesch reading ease score and Flesch-Kincaid grade level to evaluate how understandable comments and docstrings are and how long they take to read.
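
A rough sketch of that idea applied to a Python submission is shown below; the vowel-group syllable counter and the treatment of each comment as one sentence are crude assumptions made only to keep the example short.

```python
# Flesch reading ease applied to the comments of a Python submission:
# 206.835 - 1.015*(words per sentence) - 84.6*(syllables per word).
# The syllable counter is a crude vowel-group heuristic, and each comment
# is treated as one "sentence"; both are simplifications for this sketch.
import re

def flesch_reading_ease(sentences: list[str]) -> float:
    words = [w for s in sentences for w in re.findall(r"[A-Za-z']+", s)]
    if not sentences or not words:
        return 0.0
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

def comment_lines(source: str) -> list[str]:
    # Naive extraction: anything after a '#' (would also catch '#' inside strings).
    return [line.split("#", 1)[1].strip() for line in source.splitlines() if "#" in line]

# e.g. flesch_reading_ease(comment_lines(open("submission.py").read()))
```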

Most AATs that assess these skills using static analysis typically output the result, either as a number or an indication that something is missing. They do not typically tailor the output of these professional software engineering tools to novice programmers by adapting the result to something a novice programmer can use to improve. The limited information provided to the student can be confusing, limit progress, and frustrate students (Denny2020, 44 ) .

While assessing correctness evaluates students’ ability to write working code, assessing maintainability, readability, and documentation evaluates their ability to write good code. Typically, source code that is maintainable, readable and well-documented is easier to adapt, especially when working in teams or for future development. Furthermore, evaluating students’ adherence to a consistent code style is critical in teaching students to write readable and well-documented code (Hart2023, 75 ) .

7.3. Degree of Automation (RQ1)

There are advantages and disadvantages to both fully automated and semi-automated assessment. Fully-automated assessment allows for near-instantaneous feedback but typically limits the scope of the assignment. Providing near-instantaneous feedback can encourage students to submit early (Leinonen2022, 110 ) and address underlying misconceptions (Gusukuma2018, 66 ) .

Most fully-automated assessment tools use unit testing to assess student solutions, but they share similar issues to most unit-test-based AATs. These include forcing instructors to design closed-ended or structured assignments compatible with the tool and limiting the opportunities for students to take control of their assignments. Additionally, maintainability, readability and documentation assessment is often non-existent or limited to static analysis, such as conforming to a style guide or matching specific patterns. The typical structure used in these AATs makes it highly challenging to automatically grade elements such as correct use of object-oriented concepts, class and variable naming, and documentation quality, as these are typically provided by starter code designed to work with the AAT.

While semi-automated assessment does not allow for near-instantaneous feedback, it allows feedback to be delivered faster than manual assessment. Semi-automated assessment typically aids the grader by assessing the correctness of the submission and allowing the human grader to assess the other skills. Providing feedback quickly is essential to high student satisfaction (Kane2008, 41 ) . However, the typical use of semi-automated assessment does not resolve the issues around closed-ended assignments, especially providing students with a specific structure. The human grader can manually assess documentation quality, how students have named their local variables and the code style, but they still cannot assess key skills such as the correct use of object-oriented principles, as these are typically defined in the provided starter code.

One potential solution to automatically assess open-ended assignments, at least partially, is to automate the grading of maintainability, readability and documentation and manually grade the correctness. This semi-automated approach would allow automated grading of elements commonly shared between solutions, such as documentation quality, variable naming, adherence to code style, and code design principles, while allowing human graders to focus on assessing the implemented functionality and whether the student’s solution met the requirements of the open-ended assignment. This approach would allow instructors to set open-ended assignments while offering students partial near-instantaneous feedback on skills that are rarely graded automatically. Furthermore, having the human graders assess only the correctness can reduce the overall assessment time and potentially reduce the variability in grades typically produced when multiple human graders grade the more subjective skills, including maintainability, readability and documentation.

Additionally, further research could investigate the effect of providing the results of the automatically assessed elements near-instantaneously to the student and providing the manually assessed element after the deadline. For example, students could receive continuous feedback on their maintainability, readability and documentation, and their final grade and feedback when the correctness has been manually assessed, providing students with feedback on areas that typically take time to grade and allowing instructors to set open-ended assignments.

7.4. Language Paradigms Graded (RQ2)

While surveys into popular programming languages (GitHub, The top programming languages 2022, accessed 30/01/23: https://octoverse.github.com/2022/top-programming-languages ; JetBrains, The State of Developer Ecosystem 2022, accessed 30/01/23: https://www.jetbrains.com/lp/devecosystem-2022/ ; StackOverflow, Developer Survey 2022, accessed 30/01/23: https://survey.stackoverflow.co/2022/#technology ) show that JavaScript, a web-based language, is the most commonly used programming language, OOP languages are still very prominent. Besides providing fundamental skills, the popularity of OOP languages could be why they are the most commonly automatically assessed paradigm. Additionally, most OOP languages have a framework that can be used for web development in conjunction with web-based languages, such as Spring (a framework for Java microservices and web apps: https://spring.io/ ) for Java and Django (a web framework for Python: https://www.djangoproject.com/ ) for Python.

These frameworks could provide a potential route to teaching web development. Most introductory courses teach an OOP-based language, allowing students to learn server-side-based web development without first needing to learn client-side web-based languages. Furthermore, these OOP-based frameworks often support unit tests, potentially allowing for existing AATs to be used to assess web-development assignments. However, further research should be conducted into the automated assessment of web-based languages, as these assessments are rarely automated.

7.5. Evaluation Techniques

7.5.1. Approaches

Conducting student and instructor surveys has many benefits, including providing insight into the users’ experience with the tool, how well it worked for them, and how it could be improved. However, surveys cannot validate the accuracy of AATs, as the surveys only collect the user’s opinion of how well the tool performed. To validate the accuracy of AATs, the results of the proposed tool should be compared to a benchmark, either other tools or, ideally, a human-graded dataset.

While comparing to a benchmark dataset can validate the tool’s accuracy, human graders are typically better at understanding the nuances in a student’s submission. This allows them to provide more accurate assessments, especially for partial or uncompilable submissions. However, using multiple human graders to assess large courses is common practice, which can introduce some variability in awarded grades and feedback given. Evaluating against other tools can demonstrate improved accuracy when running against the same benchmark.

Conducting a mixture of quantitative evaluation in the form of comparing against benchmarks, both against other tools and human graders, and qualitative evaluation, such as student and instructor surveys, could provide the most in-depth analysis of AATs. This allows researchers to show how their tool improves upon other tools and compares to the gold standard of human graders while also providing valuable insights into the user’s experience, both from a student and instructor’s point of view.

7.5.2. Data Availability

Publicly distributing datasets would allow instructors and researchers to compare similar and new AATs against a shared dataset, allowing them to make informed decisions on which tools to use or adapt for their purposes instead of developing another automated grader from scratch. Furthermore, releasing datasets alongside the study allows researchers to reproduce and validate experiment results, aiding instructors in choosing which AAT to use. Among the released datasets, few are utilised in multiple papers. Those that are utilised are typically from large-scale courses using smaller-scale programming exercises from online judges, such as HackerRank (rayyan-354359338, 36 , 182 ) . The lack of validated available benchmarks can make it difficult for researchers to validate the results of their tools, especially for AATs that focus on less commonly assessed skills, including maintainability, readability and documentation.

In addition to increasing reproducibility and providing data for benchmarking, releasing publicly available datasets could also allow researchers to find relevant data for their research, whether automated assessment-related or other code-based research, without requiring the researchers to develop a new dataset. This could decrease the overall time spent on a project and produce cutting-edge research faster.

7.6. Limitations of AAT Provided Feedback (RQ6)

Most AATs provide feedback in one form or another, with dynamic analysis tools typically showing whether the test cases succeeded and, if they failed, the expected and actual output or the exception/compiler message if one was thrown. Providing the test results can aid students’ learning, especially when near-instantaneous feedback allows students to check their progress. However, the lack of detailed feedback can frustrate students when they cannot figure out why certain test cases are failing, for example, when the expected and actual outputs look identical to the student but differ by an unnoticed trailing space. Furthermore, passing compiler and runtime exceptions directly to the students without post-processing is inadequate and presents a barrier to progress and a source of discouragement (Becker2016, 16 ) .
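
A small amount of output post-processing can already soften the trailing-space problem described above; the sketch below is illustrative and not drawn from any reviewed tool.

```python
# Make invisible output differences explicit instead of printing two strings
# that look identical to the student. Purely illustrative post-processing.
import difflib

def explain_output_mismatch(expected: str, actual: str) -> str:
    if expected == actual:
        return "Output matches."
    if expected.rstrip() == actual.rstrip():
        return "Output differs only in trailing whitespace (check for stray spaces or blank lines)."
    diff = difflib.unified_diff(
        expected.splitlines(keepends=True), actual.splitlines(keepends=True),
        fromfile="expected", tofile="actual",
    )
    return "".join(diff)

# e.g. explain_output_mismatch("7\n", "7 \n")
#  -> "Output differs only in trailing whitespace (check for stray spaces or blank lines)."
```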

Feedback utilising static analysis also shares similar limitations. The tools that use code repair typically suggest edits that can be made to make the code compilable but with a limited explanation of why the suggested fix makes the code compile. Tools based on software metrics or linters often provide overwhelming feedback to the student, typically highlighting each occurrence of a readability, maintainability or documentation issue. Furthermore, as these AATs are typically based on tools designed for professional software engineers, the feedback supplied to the student can be confusing and contain feedback on topics they have yet to learn about.

While there are limitations to feedback provided by AATs, human-provided feedback is also imperfect. Instructors can provide more nuanced and directed feedback for particular students; however, this takes time and is practically impossible when assessing large cohorts. AATs typically provide enough feedback to aid most students’ learning while allowing instructors more time to aid struggling students. Further research into the effects of AATs on student learning compared to assessment by human graders could be undertaken in the future.

7.7. Performance Against Human Assessors (RQ4, RQ5)

Evaluating AATs against human assessors is a common method of assessing the quality of AATs, second to conducting user studies. Most evaluate how accurate or well the AAT grades correlate with human graders, with few comparing the automatically generated and human-generated feedback. Those that evaluate automatically generated feedback against human-generated feedback primarily focus on whether the instructors agree with the generated feedback provided or if the generated feedback is accurate regarding errors detected or categorisation of issues within the code.

Further research is required to evaluate the learning effect of automatically generated grades and feedback compared to human-provided grades and feedback, mainly which elements are assessed and the quality and quantity of the feedback provided to the students.

7.8. Results Compared to Related Systematic Literature Reviews

Most of the related work investigates AATs outside of our publication window of 2017 to 2021; here, we compare our results with previous reviews in this area to discuss potential long-term trends when combined with our results. (Souza2016, 168 , ) investigated AATs between 1985 and 2013 and found that most tools were fully automated, with less than a quarter of the tools they reviewed opting for a semi-automated approach. This trend towards fully automated tools has not changed in our review: we also found that most tools were fully automated, with only 14% of tools using a semi-automated approach.

Similarly, (Keuning2018, 95 , ) focused on conducting a systematic literature review of feedback provided by AATs between 1960 and 2015. They provide a more in-depth analysis of feedback provided by AATs by utilising (Narciss2008, 130 ) ’s feedback categories and found that most AATs provide feedback on finding mistakes using test-based feedback. Our results also show this trend of providing feedback on mistakes, primarily using the output of dynamic analysis. (Keuning2018, 95 ) also found that some AATs provide feedback on how to proceed. Our review also found instances of AATs providing feedback on how to proceed, typically tools that provide feedback based on code repair.

Furthermore, (Keuning2018, 95 ) categorised the tools by language paradigm assessed and had similar results to our review, with most tools assessing object-oriented languages and few tools supporting the assessment of logic, functional or other paradigms. They also analysed the quality of the AATs they reviewed. They found that most utilised empirical approaches, such as comparing to learning objectives, conducting student and teacher surveys, or evaluating the AAT based on the time taken to complete a task, with other tools being evaluated analytically, anecdotally or not at all.

We found similar trends in assessing object-oriented languages and most tools being evaluated using empirical methods. However, we provided a more in-depth analysis of evaluation techniques and found that most tools are evaluated by surveys or by comparing the results against manual grading.

While the other systematic reviews discussed investigate papers published in years that do not overlap our review, (Paiva2022, 139 , ) ’s review investigates publications between 2010 and 2021. They investigated assessment techniques and found that whitebox dynamic analysis and static analysis are gaining more traction as methods to assess the functionality of submissions.

For code quality and software metrics, they discuss tools that utilise existing code quality tools and software metrics. While there is an overlap with (Paiva2022, 139 ) ’s review, our review provides a more detailed analysis of the techniques used to assess submissions by categorising tools by both the key skills assessed and the approaches used, whereas (Paiva2022, 139 ) primarily focus on the domains graded, such as visual programming, computer graphics, and software testing, how secure the code execution is, and the effectiveness of the tools.

7.9. Threats to Validity

Limiting the search for primary studies to IEEE Xplore, ACM Digital Library, and Scopus may have led to some relevant primary studies not published in an IEEE, ACM or Scopus-indexed publication being missed. Additionally, given that some titles and abstracts did not clearly state what their papers addressed, some papers may have been mistakenly excluded during the title and abstract screening. To mitigate incorrectly excluding papers at this stage, the screeners opted to include such papers in the title and abstract screening to allow further analysis during the introduction and conclusion screening stage. If the screeners disagreed on whether a paper should be included, they discussed their disagreements and came to a consensus on whether the paper should be included.

We opted not to conduct a snowball search, where, after the screening stage, the included papers’ references are searched for any other papers that match the inclusion criteria. We decided against conducting a snowball search as we felt that our final number of included papers was enough to provide a practical overview of the state of the art without delaying the publication of our review. However, as a result, we have most likely missed papers that do not contain the keywords of our search string in their titles or abstracts.

Commercial AATs may not have been included in our study as they do not have any associated academic literature. While many institutions use these commercial tools, the lack of literature discussing them makes them out of scope for our literature review.

Although both screeners conducted the first round of annotations during the full-text screening stage, additional annotations were needed while extracting the results. These additional annotations were compiled by a single screener, with the final list of applied annotations being validated by the second screener. Therefore, some papers may have been mislabeled or missing relevant annotations.

While we were writing the results for this paper, there was a rise in research into using large language models (LLMs) within computer science education. This rise in LLMs can affect automated assessment tools, especially for smaller, constrained assignments that are more susceptible to the code generation functionality of LLMs. One study found that if ChatGPT is given clear and straightforward instructions, it can generate effective solutions for trivial and constrained assignments (Ouh2023, 135 ) . Another study found that although the solutions obtained for non-trivial programming assignments are not sufficient for the completion of the course, the LLM can correct its solutions based on feedback from an AAT (Savelka2023, 158 ) .

8. Conclusion

This systematic review categorised state-of-the-art automated grading and feedback tools by the graded programming skills, techniques for awarding a grade or generating feedback, programming paradigms for automatically assessing programming assignments, and how these tools were evaluated. Furthermore, we investigated why automated assessment tools focus on assessing specific programming fundamental skills, discussed potential reasons for choosing a particular language and degree of automation, and investigated how researchers evaluated tools and how evaluation could be improved.

We found that most AATs in the scope of this review focus on assessing correctness using a dynamic approach, most of which used unit testing to provide the grade and feedback. Feedback is typically limited to whether the unit test has passed or failed and the expected and actual outcomes if the test has failed. This can leave students frustrated, as often the feedback does not provide enough detail to help the student to progress.

Another common approach to assess correctness is a static approach comparing a student’s submission to a reference solution or a set of correct student submissions. Static analysis is also used to assess both maintainability, readability and the presence of documentation. However, these skills are assessed less often and typically in conjunction with correctness grading.

Instructors focus on assessing correctness, which is typically seen as one of the most crucial skills. They can determine if students have used a specific language feature or algorithm and if students can understand and convert requirements into a complete code base. While correctness is often seen as one of the most crucial skills, maintainability, readability and documentation are also vital. Maintainable, readable and well-documented code is typically easier to develop further, especially when working in teams or on future releases.

Most tools offered fully automated assessment, allowing for near-instantaneous feedback and multiple resubmissions without increasing the grading workload. Receiving feedback quickly increases student satisfaction, and multiple submissions allow students more opportunities to succeed. However, fully automated tools typically limit the scope of the assignment to a smaller scale and limit opportunities to show creativity.

Some tools opt for semi-automated approaches, where human graders manually assess elements of the assignment, typically maintainability, readability and documentation, while correctness is assessed automatically. While semi-automated approaches do not allow for near-instantaneous feedback, they are faster than manual assessment. However, typical implementations do not resolve the issues around limiting the scope of assignments, as correctness is typically assessed using the same methodology as fully automated assessment.

In terms of language paradigms assessed, most assess object-oriented languages, such as Java, Python and C++. Other language paradigms, such as functional and logic languages, have AATs, but these are researched less often. Object-oriented languages are the primary focus for many automated assessments due to the prominence of these languages in education and industry. Other language paradigms are gaining more popularity, especially web-based languages like JavaScript.

Most papers evaluate an AAT, whether one they have developed or used in a course. The primary evaluation technique was to conduct student surveys about their thoughts on the tool and how using the tool aided their learning. Another common evaluation technique was to compare the AAT to human graders, most focusing on the accuracy of the assessment using the human graders’ marks as a benchmark. The dataset used for evaluation is typically not published. While most papers perform some form of evaluation, the evaluation is typically focused on a single tool, is conducted by the tool’s developer and has mostly positive results.

While evaluating tools with student and teacher surveys can provide valuable insights into the users’ experience with the tools, surveys do not evaluate the accuracy of the awarded grades and feedback. Evaluating the accuracy against a benchmark or against human graders, in conjunction with the users’ experience, allows instructors to compare similar tools when considering them for their courses. Releasing evaluation datasets would allow researchers to reproduce and validate experiment results and evaluate different tools against a common set of assignments. This would improve the evaluation by providing verifiable and comparable benchmarks for a tool’s accuracy.

8.1. Recommendations

With most research into the automated assessment of programming assignments focusing on assessing correctness for small-scale closed assignments, we encourage researchers to investigate how to automatically assess maintainability, readability and documentation, as these are key skills that are not evaluated by most automated assessment tools. Furthermore, we suggest that future research investigates how to assess open-ended assignments, including semi-automated approaches that automate the assessment of maintainability, readability and documentation while manually assessing correctness, or designing open-ended assignments so that the assessment can be fully automated.

As web-based languages become more prominent, future research could investigate the automatic assessment of web-based languages, including JavaScript and TypeScript. In addition to web-based languages, automatic assessment of web application development could be investigated further by investigating how to assess the use of popular web frameworks, user-experience design and web-testing frameworks.

While researching new methods of automatic assessment for open-ended assignments, for maintainability, readability and documentation, and for web-based languages, we suggest that authors adopt more robust evaluation practices and attempt to publish the datasets they used for their evaluation. While some tools are evaluated by comparing their results to human graders, most are only evaluated by student or teacher surveys or performance analytics, such as compute power required or run times. Evaluating tools with a survey can provide meaningful insights into the end-users’ opinions of a tool, but it cannot adequately determine the accuracy of the tool. The publication of annotated datasets would allow researchers to evaluate a tool’s accuracy against similar tools on the same dataset, providing more significant evidence of the tool’s performance. Evaluating against a benchmark dataset and using user surveys could provide a good mix of qualitative and quantitative evidence to support the performance of their tools.

8.2. Data Availability

Our final list of annotated papers and our data processing pipeline are available on GitHub: https://github.com/m-messer/Automated-Assessment-SLR-Data-Processing .

Acknowledgements.

  • (1) Umair Z. Ahmed, Renuka Sindhgatta, Nisheeth Srivastava and Amey Karkare “Targeted Example Generation for Compilation Errors” In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) , 2019, pp. 327–338 DOI: 10.1109/ASE.2019.00039
  • (2) Umair Z. Ahmed, Nisheeth Srivastava, Renuka Sindhgatta and Amey Karkare “Characterizing the Pedagogical Benefits of Adaptive Feedback for Compilation Errors by Novice Programmers” In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering Education and Training , ICSE-SEET ’20 Seoul, South Korea: Association for Computing Machinery, 2020, pp. 139–150 DOI: 10.1145/3377814.3381703
  • (3) Kirsti M Ala-Mutka “A Survey of Automated Assessment Approaches for Programming Assignments” In Computer Science Education 15.2 Routledge, 2005, pp. 83–102 DOI: 10.1080/08993400500150747
  • (4) Hussam Aldriye, Asma Alkhalaf and Muath Alkhalaf “Automated Grading Systems for Programming Assignments: A Literature Review” In International Journal of Advanced Computer Science and Applications 10.3 The Science and Information Organization, 2019 DOI: 10.14569/IJACSA.2019.0100328
  • (5) José Bacelar Almeida et al. “Teaching How to Program Using Automated Assessment and Functional Glossy Games (Experience Report)” In Proc. ACM Program. Lang. 2.ICFP New York, NY, USA: Association for Computing Machinery, 2018 DOI: 10.1145/3236777
  • (6) Basma S. Alqadi and Jonathan I. Maletic “An Empirical Study of Debugging Patterns Among Novices Programmers” In Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education , SIGCSE ’17 Seattle, Washington, USA: Association for Computing Machinery, 2017, pp. 15–20 DOI: 10.1145/3017680.3017761
  • (7) Carlos Andujar, Cristina Raluca Vijulie and Alvar Vinacua “A Parser-based Tool to Assist Instructors in Grading Computer Graphics Assignments” In Eurographics 2019 - Education Papers The Eurographics Association, 2019 DOI: 10.2312/eged.20191025
  • (8) Carlos Andujar, Cristina R. Vijulie and Àlvar Vinacua “Syntactic and Semantic Analysis for Extended Feedback on Computer Graphics Assignments” In IEEE Computer Graphics and Applications 40.3 , 2020, pp. 105–111 DOI: 10.1109/MCG.2020.2981786
  • (9) Sara Mernissi Arifi, Rachid Ben Abbou and Azeddine Zahi “Assisted learning of C programming through automated program repair and feed-back generation” In Indonesian Journal of Electrical Engineering and Computer Science 20.1 , 2020, pp. 454–464 DOI: 10.11591/ijeecs.v20.i1.pp454-464
  • (10) Nathaniel Ayewah et al. “Using Static Analysis to Find Bugs” In IEEE Software 25.5 , 2008, pp. 22–29 DOI: 10.1109/MS.2008.130
  • (11) Maha Aziz et al. “Auto-Grading for Parallel Programs” In Proceedings of the Workshop on Education for High-Performance Computing , EduHPC ’15 Austin, Texas: Association for Computing Machinery, 2015 DOI: 10.1145/2831425.2831427
  • (12) Marini Abu Bakar et al. “Auto-marking System: A Support Tool for Learning of Programming” In International Journal on Advanced Science Engineering and Information Technology 8.4 , 2018 DOI: 10.18517/ijaseit.8.4.6416
  • (13) Thomas Ball “The Concept of Dynamic Analysis” In Proceedings of the 7th European Software Engineering Conference Held Jointly with the 7th ACM SIGSOFT International Symposium on Foundations of Software Engineering , ESEC/FSE-7 Toulouse, France: Springer-Verlag, 1999, pp. 216–234
  • (14) M.Rifky I. Bariansyah, Satrio Adi Rukmono and Riza Satria Perdana “Semantic Approach for Increasing Test Case Coverage in Automated Grading of Programming Exercise” In 2021 International Conference on Data and Software Engineering (ICoDSE) , 2021, pp. 1–6 DOI: 10.1109/ICoDSE53690.2021.9648439
  • (15) Chelsea Barraball, Moeketsi Raselimo and Bernd Fischer “An Interactive Feedback System for Grammar Development (Tool Paper)” In Proceedings of the 13th ACM SIGPLAN International Conference on Software Language Engineering , SLE 2020 Virtual, USA: Association for Computing Machinery, 2020, pp. 101–107 DOI: 10.1145/3426425.3426935
  • (16) Brett A. Becker “An Effective Approach to Enhancing Compiler Error Messages” In Proceedings of the 47th ACM Technical Symposium on Computing Science Education , SIGCSE ’16 Memphis, Tennessee, USA: Association for Computing Machinery, 2016, pp. 126–131 DOI: 10.1145/2839509.2844584
  • (17) Alessandro Bertagnon and Marco Gavanelli “MAESTRO: a semi-autoMAted Evaluation SysTem for pROgramming assignments” In 2020 International Conference on Computational Science and Computational Intelligence (CSCI) , 2020, pp. 953–958 DOI: 10.1109/CSCI51800.2020.00177
  • (18) Anis Bey, Patrick Jermann and Pierre Dillenbourg “An Empirical Study Comparing Two Automatic Graders for Programming. MOOCs Context” In Data Driven Approaches in Digital Education Cham: Springer International Publishing, 2017, pp. 537–540 DOI: 10.1007/978-3-319-66610-5_57
  • (19) Geoff Birch, Bernd Fischer and Michael Poppleton “Fast test suite-driven model-based fault localisation with application to pinpointing defects in student programs” In Software & Systems Modeling 18.1 , 2019, pp. 445–471 DOI: 10.1007/s10270-017-0612-y
  • (20) Yadira Boada and Alejandro Vignoni “Automated code evaluation of computer programming sessions with MATLAB Grader” In 2021 World Engineering Education Forum/Global Engineering Deans Council (WEEF/GEDC) , 2021, pp. 500–505 DOI: 10.1109/WEEF/GEDC53299.2021.9657355
  • (21) Jürgen Börstler et al. “”I Know It When I See It” Perceptions of Code Quality: ITiCSE ’17 Working Group Report” In Proceedings of the 2017 ITiCSE Conference on Working Group Reports , ITiCSE-WGR ’17 Bologna, Italy: Association for Computing Machinery, 2018, pp. 70–85 DOI: 10.1145/3174781.3174785
  • (22) Chris Brown and Chris Parnin “Nudging Students Toward Better Software Engineering Behaviors” In 2021 IEEE/ACM Third International Workshop on Bots in Software Engineering (BotSE) , 2021, pp. 11–15 DOI: 10.1109/BotSE52550.2021.00010
  • (23) Yun-Zhan Cai and Meng-Hsun Tsai “Improving Programming Education Quality with Automatic Grading System” In Innovative Technologies and Learning Cham: Springer International Publishing, 2019, pp. 207–215 DOI: 10.1007/978-3-030-35343-8_22
  • (24) Benjamin Canou, Roberto Di Cosmo and Grégoire Henry “Scaling up Functional Programming Education: Under the Hood of the OCaml MOOC” In Proc. ACM Program. Lang. 1.ICFP New York, NY, USA: Association for Computing Machinery, 2017 DOI: 10.1145/3110248
  • (25) Marílio Cardoso, António Vieira Castro and Alvaro Rocha “Integration of virtual programming lab in a process of teaching programming EduScrum based” In 2018 13th Iberian Conference on Information Systems and Technologies (CISTI) , 2018, pp. 1–6 DOI: 10.23919/CISTI.2018.8399261
  • (26) Marílio Cardoso, Rui Marques, António Vieira Castro and Álvaro Rocha “Using Virtual Programming Lab to improve learning programming: The case of Algorithms and Programming” In Expert Systems 38.4 , 2021, pp. e12531 DOI: 10.1111/exsy.12531
  • (27) Ted Carmichael, Mary Jean Blink, John C. Stamper and Elizabeth Gieske “Linkage Objects for Generalized Instruction in Coding (LOGIC)” In Proceedings of the Thirty-First International Florida Artificial Intelligence Research Society Conference, FLAIRS 2018, Melbourne, Florida, USA. May 21-23 2018 AAAI Press, 2018, pp. 443–446 URL: https://aaai.org/ocs/index.php/FLAIRS/FLAIRS18/paper/view/17702
  • (28) Hsi-Min Chen, Wei-Han Chen and Chi-Chen Lee “An Automated Assessment System for Analysis of Coding Convention Violations in Java Programming Assignments” In Journal of Information Science and Engineering 34.5 , 2018, pp. 1203–1221 DOI: 10.6688/JISE.201809_34(5).0006
  • (29) S.R. Chidamber and C.F. Kemerer “A metrics suite for object oriented design” In IEEE Transactions on Software Engineering 20.6 , 1994, pp. 476–493 DOI: 10.1109/32.295895
  • (30) Chih-Yueh Chou and Yan-Jhih Chen “Virtual Teaching Assistant for Grading Programming Assignments: Non-dichotomous Pattern based Program Output Matching and Partial Grading Approach” In 2021 IEEE 4th International Conference on Knowledge Innovation and Invention (ICKII) , 2021, pp. 170–175 DOI: 10.1109/ICKII51822.2021.9574713
  • (31) Chih-Yueh Chou and Yan-Jhih Chen “Virtual Teaching Assistant for Grading Programming Assignments: Non-dichotomous Pattern based Program Output Matching and Partial Grading Approach” In 2021 IEEE 4th International Conference on Knowledge Innovation and Invention (ICKII) , 2021, pp. 170–175 DOI: 10.1109/ICKII51822.2021.9574713
  • (32) Sammi Chow, Kalina Yacef, Irena Koprinska and James Curran “Automated Data-Driven Hints for Computer Programming Students” In Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization , UMAP ’17 Bratislava, Slovakia: Association for Computing Machinery, 2017, pp. 5–10 DOI: 10.1145/3099023.3099065
  • (33) Benjamin Clegg, Maria-Cruz Villa-Uriol, Phil McMinn and Gordon Fraser “Gradeer: An Open-Source Modular Hybrid Grader” In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET) , 2021, pp. 60–65 DOI: 10.1109/ICSE-SEET52601.2021.00015
  • (34) Benjamin S. Clegg, Phil McMinn and Gordon Fraser “The Influence of Test Suite Properties on Automated Grading of Programming Exercises” In 2020 IEEE 32nd Conference on Software Engineering Education and Training (CSEE&T) , 2020, pp. 1–10 DOI: 10.1109/CSEET49119.2020.9206231
  • (35) Ricardo Conejo, Beatriz Barros and Manuel F. Bertoa “Automated Assessment of Complex Programming Tasks Using SIETTE” In IEEE Transactions on Learning Technologies 12.4 , 2019, pp. 470–484 DOI: 10.1109/TLT.2018.2876249
  • (36) Daniel Coore and Daniel Fokum “Facilitating Course Assessment with a Competitive Programming Platform” In Proceedings of the 50th ACM Technical Symposium on Computer Science Education , SIGCSE ’19 Minneapolis, MN, USA: Association for Computing Machinery, 2019, pp. 449–455 DOI: 10.1145/3287324.3287511
  • (37) Lucas Cordova, Jeffrey Carver, Noah Gershmel and Gursimran Walia “A Comparison of Inquiry-Based Conceptual Feedback vs. Traditional Detailed Feedback Mechanisms in Software Testing Education: An Empirical Investigation” In Proceedings of the 52nd ACM Technical Symposium on Computer Science Education , SIGCSE ’21 Virtual Event, USA: Association for Computing Machinery, 2021, pp. 87–93 DOI: 10.1145/3408877.3432417
  • (38) David Croft and Matthew England “Computing with CodeRunner at Coventry University: Automated Summative Assessment of Python and C++ Code” In Proceedings of the 4th Conference on Computing Education Practice , CEP ’20 Durham, United Kingdom: Association for Computing Machinery, 2020 DOI: 10.1145/3372356.3372357
  • (39) Gilbert Cruz et al. “An AI System for Coaching Novice Programmers” In Learning and Collaboration Technologies. Technology in Education Cham: Springer International Publishing, 2017, pp. 12–21 DOI: 10.1007/978-3-319-58515-4“˙2
  • (40) João Damas, Bruno Lima and António J. Araújo “AOCO - A Tool to Improve the Teaching of the ARM Assembly Language in Higher Education” In 2021 30th Annual Conference of the European Association for Education in Electrical and Information Engineering (EAEEIE) , 2021, pp. 1–6 DOI: 10.1109/EAEEIE50507.2021.9530951
  • (41) James Williams David Kane and Gillian Cappuccini-Ansfield “Student Satisfaction Surveys: The Value in Taking an Historical Perspective” In Quality in Higher Education 14.2 Routledge, 2008, pp. 135–155 DOI: 10.1080/13538320802278347
  • (42) Melissa Day, Manohara Rao Penumala and Javier Gonzalez-Sanchez “Annete: An Intelligent Tutoring Companion Embedded into the Eclipse IDE” In 2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI) , 2019, pp. 71–80 DOI: 10.1109/CogMI48466.2019.00018
  • (43) Pedro Delgado-Pérez and Inmaculada Medina-Bulo “Customizable and scalable automated assessment of C/C++ programming assignments” In Computer Applications in Engineering Education 28.6 , 2020, pp. 1449–1466 DOI: https://doi.org/10.1002/cae.22317
  • (44) Paul Denny, James Prather and Brett A. Becker “Error Message Readability and Novice Debugging Performance” In Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education , ITiCSE ’20 Trondheim, Norway: Association for Computing Machinery, 2020, pp. 480–486 DOI: 10.1145/3341525.3387384
  • (45) Paul Denny, Jacqueline Whalley and Juho Leinonen “Promoting Early Engagement with Programming Assignments Using Scheduled Automated Feedback” In Proceedings of the 23rd Australasian Computing Education Conference , ACE ’21 Virtual, SA, Australia: Association for Computing Machinery, 2021, pp. 88–95 DOI: 10.1145/3441636.3442309
  • (46) Paul Denny, Jacqueline Whalley and Juho Leinonen “Promoting Early Engagement with Programming Assignments Using Scheduled Automated Feedback” In Proceedings of the 23rd Australasian Computing Education Conference , ACE ’21 Virtual, SA, Australia: Association for Computing Machinery, 2021, pp. 88–95 DOI: 10.1145/3441636.3442309
  • (47) Draylson Micael Souza, Michael Kölling and Ellen Francine Barbosa “Most common fixes students use to improve the correctness of their programs” In 2017 IEEE Frontiers in Education Conference (FIE) , 2017, pp. 1–9 DOI: 10.1109/FIE.2017.8190524
  • (48) Sergio Cozzetti B. Souza, Nicolas Anquetil and Káthia M. Oliveira “A Study of the Documentation Essential to Software Maintenance” In Proceedings of the 23rd Annual International Conference on Design of Communication: Documenting & Designing for Pervasive Information , SIGDOC ’05 Coventry, United Kingdom: Association for Computing Machinery, 2005, pp. 68–75 DOI: 10.1145/1085313.1085331
  • (49) Ignacio Despujol, Leonardo Salom and Carlos Turró “Integrating the evaluation of out of the platform autoevaluated programming exercises with personalized answer in Open edX” In 2020 IEEE Learning With MOOCS (LWMOOCS) , 2020, pp. 14–18 DOI: 10.1109/LWMOOCS50143.2020.9234387
  • (50) Prasun Dewan et al. “Automating Testing of Visual Observed Concurrency” In 2021 IEEE/ACM Ninth Workshop on Education for High Performance Computing (EduHPC) , 2021, pp. 32–42 DOI: 10.1109/EduHPC54835.2021.00010
  • (51) Anton Dil and Joseph Osunde “Evaluation of a Tool for Java Structural Specification Checking” In Proceedings of the 10th International Conference on Education Technology and Computers , ICETC ’18 Tokyo, Japan: Association for Computing Machinery, 2018, pp. 99–104 DOI: 10.1145/3290511.3290528
  • (52) Dante D. Dixson and Frank C. Worrell “Formative and Summative Assessment in the Classroom” In Theory Into Practice 55.2 Routledge, 2016, pp. 153–159 DOI: 10.1080/00405841.2016.1148989
  • (53) Yu Dong, Jingyang Hou and Xuesong Lu “An Intelligent Online Judge System for Programming Training” In Database Systems for Advanced Applications Cham: Springer International Publishing, 2020, pp. 785–789 DOI: 10.1007/978-3-030-59419-0“˙57
  • (54) Yu Dong, Jingyang Hou and Xuesong Lu “An Intelligent Online Judge System for Programming Training” In Database Systems for Advanced Applications Cham: Springer International Publishing, 2020, pp. 785–789 DOI: 10.1007/978-3-030-59419-0“˙57
  • (55) Christopher Douce, David Livingstone and James Orwell “Automatic Test-Based Assessment of Programming: A Review” In Journal on Educational Resources in Computing 5.3 New York, NY, USA: Association for Computing Machinery, 2005, pp. 4–es DOI: 10.1145/1163405.1163409
  • (56) Bob Edmison and Stephen H. Edwards “Turn up the Heat! Using Heat Maps to Visualize Suspicious Code to Help Students Successfully Complete Programming Problems Faster” In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering Education and Training , ICSE-SEET ’20 Seoul, South Korea: Association for Computing Machinery, 2020, pp. 34–44 DOI: 10.1145/3377814.3381707
  • (57) Aleksandr Efremov, Ahana Ghosh and Adish Singla “Zero-shot Learning of Hint Policy via Reinforcement Learning and Program Synthesis” In roceedings of The 13th International Conference on Educational Data Mining (EDM 2020) , 2020, pp. 338–394
  • (58) Derar Eleyan, Abed Othman and Amna Eleyan “Enhancing Software Comments Readability Using Flesch Reading Ease Score” In Information 11.9 , 2020 DOI: 10.3390/info11090430
  • (59) Hans Fangohr et al. “Automatic Feedback Provision in Teaching Computational Science” In Computational Science – ICCS 2020 Cham: Springer International Publishing, 2020, pp. 608–621 DOI: 10.1007/978-3-030-50436-6“˙45
  • (60) Molly Q Feldman et al. “Towards Answering “Am I on the Right Track?” Automatically Using Program Synthesis” In Proceedings of the 2019 ACM SIGPLAN Symposium on SPLASH-E , SPLASH-E 2019 Athens, Greece: Association for Computing Machinery, 2019, pp. 13–24 DOI: 10.1145/3358711.3361626
  • (61) Nebojša Gavrilović, Aleksandra Arsić, Dragan Domazet and Alok Mishra “Algorithm for adaptive learning process and improving learners’ skills in Java programming language” In Computer Applications in Engineering Education 26.5 , 2018, pp. 1362–1382 DOI: https://doi.org/10.1002/cae.22043
  • (62) Alex Gerdes, Bastiaan Heeren, Johan Jeuring and L.Thomas Binsbergen “Ask-Elle: an Adaptable Programming Tutor for Haskell Giving Automated Feedback” In International Journal of Artificial Intelligence in Education 27.1 , 2017, pp. 65–100 DOI: 10.1007/s40593-015-0080-x
  • (63) Alex Gerdes, Johan T. Jeuring and Bastiaan J. Heeren “Using Strategies for Assessment of Programming Exercises” In Proceedings of the 41st ACM Technical Symposium on Computer Science Education , SIGCSE ’10 Milwaukee, Wisconsin, USA: Association for Computing Machinery, 2010, pp. 441–445 DOI: 10.1145/1734263.1734412
  • (64) John Gerdes “Developing Applications to Automatically Grade Introductory Visual Basic Courses” In AMCIS 2017 Proceedings , 2017
  • (65) Sumit Gulwani, Ivan Radiček and Florian Zuleger “Automated Clustering and Program Repair for Introductory Programming Assignments” In SIGPLAN Not. 53.4 New York, NY, USA: Association for Computing Machinery, 2018, pp. 465–480 DOI: 10.1145/3296979.3192387
  • (66) Luke Gusukuma, Austin Cory Bart, Dennis Kafura and Jeremy Ernst “Misconception-Driven Feedback: Results from an Experimental Study” In Proceedings of the 2018 ACM Conference on International Computing Education Research , ICER ’18 Espoo, Finland: Association for Computing Machinery, 2018, pp. 160–168 DOI: 10.1145/3230977.3231002
  • (67) Thorsten Haendler, Gustaf Neumann and Fiodor Smirnov “RefacTutor: An Interactive Tutoring System for Software Refactoring” In Computer Supported Education Cham: Springer International Publishing, 2020, pp. 236–261 DOI: 10.1007/978-3-030-58459-7“˙12
  • (68) Georgiana Haldeman et al. “Providing Meaningful Feedback for Autograding of Programming Assignments” In Proceedings of the 49th ACM Technical Symposium on Computer Science Education , SIGCSE ’18 Baltimore, Maryland, USA: Association for Computing Machinery, 2018, pp. 278–283 DOI: 10.1145/3159450.3159502
  • (69) Maurice H. Halstead “Elements of Software Science (Operating and Programming Systems Series)” USA: Elsevier Science Inc., 1977
  • (70) Aliya Hameer and Brigitte Pientka “Teaching the Art of Functional Programming Using Automated Grading (Experience Report)” In Proc. ACM Program. Lang. 3.ICFP New York, NY, USA: Association for Computing Machinery, 2019 DOI: 10.1145/3341719
  • (71) Qiang Hao and Michail Tsikerdekis “How Automated Feedback is Delivered Matters: Formative Feedback and Knowledge Transfer” In 2019 IEEE Frontiers in Education Conference (FIE) , 2019, pp. 1–6 DOI: 10.1109/FIE43999.2019.9028686
  • (72) Qiang Hao et al. “Investigating the Essential of Meaningful Automated Formative Feedback for Programming Assignments” In 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) , 2019, pp. 151–155 DOI: 10.1109/VLHCC.2019.8818922
  • (73) Qiang Hao et al. “Towards understanding the effective design of automated formative feedback for programming assignments” In Computer Science Education 32.1 Routledge, 2022, pp. 105–127 DOI: 10.1080/08993408.2020.1860408
  • (74) Sakib Haque, Zachary Eberhart, Aakash Bansal and Collin McMillan “Semantic Similarity Metrics for Evaluating Source Code Summarization” In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension , ICPC ’22 Virtual Event: Association for Computing Machinery, 2022, pp. 36–47 DOI: 10.1145/3524610.3527909
  • (75) Rowan Hart et al. “Eastwood-Tidy: C Linting for Automated Code Style Assessment in Programming Courses” In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1 , SIGCSE 2023 Toronto ON, Canada: Association for Computing Machinery, 2023, pp. 799–805 DOI: 10.1145/3545945.3569817
  • (76) Emlyn Hegarty-Kelly and Dr Aidan Mooney “Analysis of an Automatic Grading System within First Year Computer Science Programming Modules” In Proceedings of 5th Conference on Computing Education Practice , CEP ’21 Durham, United Kingdom: Association for Computing Machinery, 2021, pp. 17–20 DOI: 10.1145/3437914.3437973
  • (77) Jack Hollingsworth “Automatic graders for programming classes” In Communications of the ACM 3 ACM PUB27 New York, NY, USA, 1960, pp. 528–529 DOI: 10.1145/367415.367422
  • (78) Petri Ihantola, Tuukka Ahoniemi, Ville Karavirta and Otto Seppälä “Review of Recent Systems for Automatic Assessment of Programming Assignments” In Proceedings of the 10th Koli Calling International Conference on Computing Education Research , Koli Calling ’10 Koli, Finland: Association for Computing Machinery, 2010, pp. 86–93 DOI: 10.1145/1930464.1930480
  • (79) David Insa, Sergio Pérez, Josep Silva and Salvador Tamarit “Semiautomatic generation and assessment of Java exercises in engineering education” In Computer Applications in Engineering Education 29.5 , 2021, pp. 1034–1050 DOI: https://doi.org/10.1002/cae.22356
  • (80) David Insa, Sergio Pérez, Josep Silva and Salvador Tamarit “Semiautomatic generation and assessment of Java exercises in engineering education” In Computer Applications in Engineering Education 29.5 , 2021, pp. 1034–1050 DOI: https://doi.org/10.1002/cae.22356
  • (81) David Insa and Josep Silva “Automatic assessment of Java code” In Computer Languages, Systems & Structures 53 , 2018, pp. 59–72 DOI: https://doi.org/10.1016/j.cl.2018.01.004
  • (82) David Insa and Josep Silva “Semi-Automatic Assessment of Unrestrained Java Code: A Library, a DSL, and a Workbench to Assess Exams and Exercises” In Proceedings of the 2015 ACM Conference on Innovation and Technology in Computer Science Education , ITiCSE ’15 Vilnius, Lithuania: Association for Computing Machinery, 2015, pp. 39–44 DOI: 10.1145/2729094.2742615
  • (83) Julian Jansen, Ana Oprescu and Magiel Bruntink “The impact of automated code quality feedback in programming education” In Post-proceedings of the Tenth Seminar on Advanced Techniques and Tools for Software Evolution (SATToSE) 210 , 2017
  • (84) Gregor Jerše and Matija Lokar “Learning and Teaching Numerical Methods with a System for Automatic Assessment.” In International Journal for Technology in Mathematics Education 24.3 , 2017, pp. 121–127
  • (85) An Ju, Ben Mehne, Andrew Halle and Armando Fox “In-Class Coding-Based Summative Assessments: Tools, Challenges, and Experience” In Proceedings of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education , ITiCSE 2018 Larnaca, Cyprus: Association for Computing Machinery, 2018, pp. 75–80 DOI: 10.1145/3197091.3197094
  • (86) Yiannis Kanellopoulos et al. “Code Quality Evaluation Methodology Using The ISO/IEC 9126 Standard” In International Journal of Software Engineering and Applications 1.3 AcademyIndustry Research Collaboration Center (AIRCC), 2010, pp. 17–36 DOI: 10.5121/ijsea.2010.1302
  • (87) Cem Kaner, Jack Falk and Hung Q Nguyen “Testing computer software” John Wiley & Sons, 1999
  • (88) Angelika Kaplan et al. “Teaching Programming at Scale”, 2020 URL: https://ceur-ws.org/Vol-2531/paper01.pdf
  • (89) Oscar Karnalim and Simon “Promoting Code Quality via Automated Feedback on Student Submissions” In 2021 IEEE Frontiers in Education Conference (FIE) , 2021, pp. 1–5 DOI: 10.1109/FIE49875.2021.9637193
  • (90) Remin Kasahara, Kazunori Sakamoto, Hironori Washizaki and Yoshiaki Fukazawa “Applying Gamification to Motivate Students to Write High-Quality Code in Programming Assignments” In Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education , ITiCSE ’19 Aberdeen, Scotland Uk: Association for Computing Machinery, 2019, pp. 92–98 DOI: 10.1145/3304221.3319792
  • (91) Ayaan M. Kazerouni et al. “Fast and accurate incremental feedback for students’ software tests using selective mutation analysis” In Journal of Systems and Software 175 , 2021, pp. 110905 DOI: https://doi.org/10.1016/j.jss.2021.110905
  • (92) Hieke Keuning, Bastiaan Heeren and Johan Jeuring “A Tutoring System to Learn Code Refactoring” In Proceedings of the 52nd ACM Technical Symposium on Computer Science Education , SIGCSE ’21 Virtual Event, USA: Association for Computing Machinery, 2021, pp. 562–568 DOI: 10.1145/3408877.3432526
  • (93) Hieke Keuning, Bastiaan Heeren and Johan Jeuring “Code Quality Issues in Student Programs” In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education , ITiCSE ’17 Bologna, Italy: Association for Computing Machinery, 2017, pp. 110–115 DOI: 10.1145/3059009.3059061
  • (94) Hieke Keuning, Bastiaan Heeren and Johan Jeuring “Student Refactoring Behaviour in a Programming Tutor” In Proceedings of the 20th Koli Calling International Conference on Computing Education Research , Koli Calling ’20 Koli, Finland: Association for Computing Machinery, 2020 DOI: 10.1145/3428029.3428043
  • (95) Hieke Keuning, Johan Jeuring and Bastiaan Heeren “A Systematic Literature Review of Automated Feedback Generation for Programming Exercises” In ACM Trans. Comput. Educ. 19.1 New York, NY, USA: Association for Computing Machinery, 2018 DOI: 10.1145/3231711
  • (96) Noela J. Kipyegen and William P.K. Korir “Importance of Software Documentation” In International Journal of Computer Science Issues (IJCSI) 10.5 , 2013, pp. 223–228
  • (97) Sándor Király, Károly Nehéz and Olivér Hornyák “Some aspects of grading Java code submissions in MOOCs” In Research in Learning Technology 25 Association for Learning Technology, 2017 DOI: 10.25304/RLT.V25.1945
  • (98) B. Kitchenham and S Charters “Guidelines for performing Systematic Literature Reviews in Software Engineering”, 2007
  • (99) Malcolm S. Knowles “Self-Directed Learning: A Guide for Learners and Teachers.” Association Press, 291 Broadway, New York, New York 10007 ($4.95), 1975
  • (100) Y.Ben-David Kolikant and M. Mussai ““So my program doesn’t run!” Definition, origins, and practical expressions of students’ (mis)conceptions of correctness” In Computer Science Education 18.2 Routledge, 2008, pp. 135–151 DOI: 10.1080/08993400802156400
  • (101) Michael Kölling “The Problem of Teaching Object-Oriented Programming, Part 1: Languages” In Journal of Object-Oriented Programming , 1999, pp. 8–15
  • (102) Stephan Krusche and Andreas Seitz “ArTEMiS: An Automatic Assessment Management System for Interactive Learning” In Proceedings of the 49th ACM Technical Symposium on Computer Science Education , SIGCSE ’18 Baltimore, Maryland, USA: Association for Computing Machinery, 2018, pp. 284–289 DOI: 10.1145/3159450.3159602
  • (103) Stephan Krusche, Nadine Frankenberg, Lara Marie Reimer and Bernd Bruegge “An Interactive Learning Method to Engage Students in Modeling” In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering Education and Training , ICSE-SEET ’20 Seoul, South Korea: Association for Computing Machinery, 2020, pp. 12–22 DOI: 10.1145/3377814.3381701
  • (104) Adidah Lajis et al. “A Review of Techniques in Automatic Programming Assessment for Practical Skill Test” In Journal of Telecommunication, Electronic and Computer Engineering (JTEC) 10.2-5 , 2018, pp. 109–113 URL: https://jtec.utem.edu.my/jtec/article/view/4394
  • (105) Timotej Lazar, Martin Možina and Ivan Bratko “Automatic Extraction of AST Patterns for Debugging Student Programs” In Artificial Intelligence in Education Cham: Springer International Publishing, 2017, pp. 162–174 DOI: 10.1007/978-3-319-61425-0“˙14
  • (106) Duc Minh Le “Model-based automatic grading of object-oriented programming assignments” In Computer Applications in Engineering Education 30.2 , 2022, pp. 435–457 DOI: https://doi.org/10.1002/cae.22464
  • (107) Haden Hooyeon Lee “Effectiveness of Real-Time Feedback and Instructive Hints in Graduate CS Courses via Automated Grading System” In Proceedings of the 52nd ACM Technical Symposium on Computer Science Education , SIGCSE ’21 Virtual Event, USA: Association for Computing Machinery, 2021, pp. 101–107 DOI: 10.1145/3408877.3432463
  • (108) Junho Lee, Dowon Song, Sunbeom So and Hakjoo Oh “Automatic Diagnosis and Correction of Logical Errors for Functional Programming Assignments” In Proc. ACM Program. Lang. 2.OOPSLA New York, NY, USA: Association for Computing Machinery, 2018 DOI: 10.1145/3276528
  • (109) V.C.S. Lee et al. “ViDA: A virtual debugging advisor for supporting learning in computer programming courses” In Journal of Computer Assisted Learning 34.3 , 2018, pp. 243–258 DOI: https://doi.org/10.1111/jcal.12238
  • (110) Juho Leinonen, Paul Denny and Jacqueline Whalley “A Comparison of Immediate and Scheduled Feedback in Introductory Programming Projects” In Proceedings of the 53rd ACM Technical Symposium on Computer Science Education - Volume 1 , SIGCSE 2022 Providence, RI, USA: Association for Computing Machinery, 2022, pp. 885–891 DOI: 10.1145/3478431.3499372
  • (111) Abe Leite and Saúl A. Blanco “Effects of Human vs. Automatic Feedback on Students’ Understanding of AI Concepts and Programming Style” In Proceedings of the 51st ACM Technical Symposium on Computer Science Education , SIGCSE ’20 Portland, OR, USA: Association for Computing Machinery, 2020, pp. 44–50 DOI: 10.1145/3328778.3366921
  • (112) Janet Liebenberg and Vreda Pieterse “Investigating the Feasibility of Automatic Assessment of Programming Tasks” In Journal of Information Technology Education: Innovations in Practice 17 Informing Science Institute, 2018, pp. 201–223 DOI: 10.28945/4150
  • (113) Simon Liénardy, Laurent Leduc, Dominique Verpoorten and Benoit Donnet “Café: Automatic Correction and Feedback of Programming Challenges for a CS1 Course” In Proceedings of the Twenty-Second Australasian Computing Education Conference , ACE’20 Melbourne, VIC, Australia: Association for Computing Machinery, 2020, pp. 95–104 DOI: 10.1145/3373165.3373176
  • (114) David Liu and Andrew Petersen “Static Analyses in Python Programming Courses” In Proceedings of the 50th ACM Technical Symposium on Computer Science Education , SIGCSE ’19 Minneapolis, MN, USA: Association for Computing Machinery, 2019, pp. 666–671 DOI: 10.1145/3287324.3287503
  • (115) Xiao Liu, Yeoneo Kim, Junseok Cheon and Gyun Woo “A Partial Grading Method using Pattern Matching for Programming Assignments” In 2019 8th International Conference on Innovation, Communication and Engineering (ICICE) , 2019, pp. 157–160 DOI: 10.1109/ICICE49024.2019.9117506
  • (116) Xiao Liu, Shuai Wang, Pei Wang and Dinghao Wu “Automatic Grading of Programming Assignments: An Approach Based on Formal Semantics” In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET) , 2019, pp. 126–137 DOI: 10.1109/ICSE-SEET.2019.00022
  • (117) Zikai Liu et al. “End-to-End Automation of Feedback on Student Assembly Programs” In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) , 2021, pp. 18–29 DOI: 10.1109/ASE51524.2021.9678837
  • (118) Evan Maicus, Matthew Peveler, Andrew Aikens and Barbara Cutler “Autograding Interactive Computer Graphics Applications” In Proceedings of the 51st ACM Technical Symposium on Computer Science Education , SIGCSE ’20 Portland, OR, USA: Association for Computing Machinery, 2020, pp. 1145–1151 DOI: 10.1145/3328778.3366954
  • (119) Evan Maicus, Matthew Peveler, Stacy Patterson and Barbara Cutler “Autograding Distributed Algorithms in Networked Containers” In Proceedings of the 50th ACM Technical Symposium on Computer Science Education , SIGCSE ’19 Minneapolis, MN, USA: Association for Computing Machinery, 2019, pp. 133–138 DOI: 10.1145/3287324.3287505
  • (120) Hamza Manzoor et al. “Auto-Grading Jupyter Notebooks” In Proceedings of the 51st ACM Technical Symposium on Computer Science Education , SIGCSE ’20 Portland, OR, USA: Association for Computing Machinery, 2020, pp. 1139–1144 DOI: 10.1145/3328778.3366947
  • (121) Victor J. Marin, Tobin Pereira, Srinivas Sridharan and Carlos R. Rivero “Automated Personalized Feedback in Introductory Java Programming MOOCs” In 2017 IEEE 33rd International Conference on Data Engineering (ICDE) , 2017, pp. 1259–1270 DOI: 10.1109/ICDE.2017.169
  • (122) Paul W. McBurney “Automatic Documentation Generation via Source Code Summarization” In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering 2 , 2015, pp. 903–906 DOI: 10.1109/ICSE.2015.288
  • (123) Paul W. McBurney et al. “Towards Prioritizing Documentation Effort” In IEEE Transactions on Software Engineering 44.9 , 2018, pp. 897–913 DOI: 10.1109/TSE.2017.2716950
  • (124) T.J. McCabe “A Complexity Measure” In IEEE Transactions on Software Engineering SE-2.4 , 1976, pp. 308–320 DOI: 10.1109/TSE.1976.233837
  • (125) Igor Mekterović, Ljiljana Brkić, Boris Milašinović and Mirta Baranović “Building a Comprehensive Automated Programming Assessment System” In IEEE Access 8 , 2020, pp. 81154–81172 DOI: 10.1109/ACCESS.2020.2990980
  • (126) Marcus Messer “ \anon Automated Grading and Feedback Tools: A Systematic Review” OSF, 2022 DOI: 10.17605/OSF.IO/VXTF9
  • (127) Marcus Messer, Neil C.C. Brown, Michael Kölling and Miaojing Shi “ \anon Machine Learning-Based Automated Grading and Feedback Tools for Programming: A Meta-Analysis” In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1 , ITiCSE 2023 Turku, Finland: Association for Computing Machinery, 2023, pp. 491–497 DOI: 10.1145/3587102.3588822
  • (128) Eerik Muuli et al. “Automatic Assessment of Programming Assignments Using Image Recognition” In Data Driven Approaches in Digital Education Cham: Springer International Publishing, 2017, pp. 153–163 DOI: 10.1007/978-3-319-66610-5“˙12
  • (129) Ramez Nabil et al. “EvalSeer: An Intelligent Gamified System for Programming Assignments Assessment” In 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC) , 2021, pp. 235–242 DOI: 10.1109/MIUCC52538.2021.9447629
  • (130) Susanne Narciss “Feedback Strategies for Interactive Learning Tasks” In Handbook of Research on Educational Communications and Technology, Third Edition TaylorFrancis, 2008, pp. 125–143 DOI: 10.4324/9780203880869-13
  • (131) Sidhidatri Nayak, Reshu Agarwal and Sunil Kumar Khatri “Automated Assessment Tools for grading of programming Assignments: A review” In 2022 International Conference on Computer Communication and Informatics (ICCCI) , 2022, pp. 1–4 DOI: 10.1109/ICCCI54379.2022.9740769
  • (132) Bao-An Nguyen, Kuan-Yu Ho and Hsi-Min Chen “Measure Students’ Contribution in Web Programming Projects by Exploring Source Code Repository” In 2020 International Computer Symposium (ICS) , 2020, pp. 473–478 DOI: 10.1109/ICS51289.2020.00099
  • (133) Narges Norouzi and Ryan Hausen “Quantitative Evaluation of Student Engagement in a Large-Scale Introduction to Programming Course using a Cloud-based Automatic Grading System” In 2018 IEEE Frontiers in Education Conference (FIE) , 2018, pp. 1–5 DOI: 10.1109/FIE.2018.8658833
  • (134) Norbert Oster, Marius Kamp and Michael Philippsen “AuDoscore: Automatic Grading of Java or Scala Homework”, 2017, pp. 1
  • (135) Eng Lieh Ouh, Benjamin Kok Siew Gan, Kyong Jin Shim and Swavek Wlodkowski “ChatGPT, Can You Generate Solutions for My Coding Exercises? An Evaluation on Its Effectiveness in an Undergraduate Java Programming Course.” In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1 , ITiCSE 2023 Turku, Finland: Association for Computing Machinery, 2023, pp. 54–60 DOI: 10.1145/3587102.3588794
  • (136) Mourad Ouzzani, Hossam Hammady, Zbys Fedorowicz and Ahmed Elmagarmid “Rayyan–a web and mobile app for systematic reviews” In Systematic Reviews 5 BioMed Central Ltd., 2016 DOI: 10.1186/S13643-016-0384-4
  • (137) Benjamin Paaßen et al. “The Continuous Hint Factory - Providing Hints in Vast and Sparsely Populated Edit Distance Spaces” In Journal of Educational Data Mining 10 , 2017 DOI: https://doi.org/10.5281/zenodo.3554697
  • (138) Matthew J. Page et al. “The PRISMA 2020 statement: an updated guideline for reporting systematic reviews” In Systematic Reviews 10.1 , 2021, pp. 89 DOI: 10.1186/s13643-021-01626-4
  • (139) José Carlos Paiva, José Paulo Leal and Álvaro Figueira “Automated Assessment in Computer Science Education: A State-of-the-Art Review” In ACM Trans. Comput. Educ. 22.3 New York, NY, USA: Association for Computing Machinery, 2022 DOI: 10.1145/3513140
  • (140) Sagar Parihar et al. “Automatic Grading and Feedback Using Program Repair for Introductory Programming Courses” In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education , ITiCSE ’17 Bologna, Italy: Association for Computing Machinery, 2017, pp. 92–97 DOI: 10.1145/3059009.3059026
  • (141) Sagar Parihar et al. “Automatic Grading and Feedback Using Program Repair for Introductory Programming Courses” In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education , ITiCSE ’17 Bologna, Italy: Association for Computing Machinery, 2017, pp. 92–97 DOI: 10.1145/3059009.3059026
  • (142) Hyunchan Park and Youngpil Kim “CLIK: Cloud-based Linux kernel practice environment and judgment system” In Computer Applications in Engineering Education 28.5 , 2020, pp. 1137–1153 DOI: https://doi.org/10.1002/cae.22289
  • (143) Beatriz Pérez “Enhancing the Learning of Database Access Programming using Continuous Integration and Aspect Oriented Programming” In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET) , 2021, pp. 221–230 DOI: 10.1109/ICSE-SEET52601.2021.00032
  • (144) Chris Piech et al. “Learning Program Embeddings to Propagate Feedback on Student Code” In Proceedings of the 32nd International Conference on Machine Learning 37 , Proceedings of Machine Learning Research Lille, France: PMLR, 2015, pp. 1093–1102 URL: https://proceedings.mlr.press/v37/piech15.html
  • (145) Giuseppina Polito and Marco Temperini “A gamified web based system for computer programming learning” In Computers and Education: Artificial Intelligence 2 , 2021, pp. 100029 DOI: https://doi.org/10.1016/j.caeai.2021.100029
  • (146) Giuseppina Polito, Marco Temperini and Andrea Sterbini “2TSW: Automated Assessment of Computer Programming Assignments, in a Gamified Web Based System” In 2019 18th International Conference on Information Technology Based Higher Education and Training (ITHET) , 2019, pp. 1–9 DOI: 10.1109/ITHET46829.2019.8937377
  • (147) Chung Keung Poon et al. “Automatic Assessment via Intelligent Analysis of Students’ Program Output Patterns” In Blended Learning. Enhancing Learning Success Cham: Springer International Publishing, 2018, pp. 238–250 DOI: 10.1007/978-3-319-94505-7“˙19
  • (148) Yizhou Qian and James Lehman “Students’ Misconceptions and Other Difficulties in Introductory Programming: A Literature Review” In ACM Transactions on Computing Education 18.1 New York, NY, USA: Association for Computing Machinery, 2017 DOI: 10.1145/3077618
  • (149) Md.Mostafizer Rahman, Yutaka Watanobe and Keita Nakamura “Source Code Assessment and Classification Based on Estimated Error Probability Using Attentive LSTM Language Model and Its Application in Programming Education” In Applied Sciences 10.8 , 2020 DOI: 10.3390/app10082973
  • (150) Sawan Rai, Ramesh Chandra Belwal and Atul Gupta “A Review on Source Code Documentation” In ACM Trans. Intell. Syst. Technol. 13.5 New York, NY, USA: Association for Computing Machinery, 2022 DOI: 10.1145/3519312
  • (151) Dhananjai M. Rao “Experiences With Auto-Grading in a Systems Course” In 2019 IEEE Frontiers in Education Conference (FIE) , 2019, pp. 1–8 DOI: 10.1109/FIE43999.2019.9028450
  • (152) Jef Raskin “Comments Are More Important than Code: The Thorough Use of Internal Documentation is One of the Most-Overlooked Ways of Improving Software Quality and Speeding Implementation.” In Queue 3.2 New York, NY, USA: Association for Computing Machinery, 2005, pp. 64–65 DOI: 10.1145/1053331.1053354
  • (153) Ruan Reis, Gustavo Soares, Melina Mongiovi and Wilkerson L. Andrade “Evaluating Feedback Tools in Introductory Programming Classes” In 2019 IEEE Frontiers in Education Conference (FIE) , 2019, pp. 1–7 DOI: 10.1109/FIE43999.2019.9028418
  • (154) Felipe Restrepo-Calle, Jhon J Ramírez-Echeverry and Fabio A González “Using an interactive software tool for the formative and summative evaluation in a computer programming course: an experience report” In Global Journal of Engineering Education 22.3 , 2020, pp. 174–185
  • (155) Juan Carlos Rodríguez-del-Pino, Enrique Rubio Royo and Zenón Hernández Figueroa “A virtual programming lab for Moodle with automatic assessment and anti-plagiarism features”, 2012 URL: http://hdl.handle.net/10553/9773
  • (156) Arthur Rump, Ansgar Fehnker and Angelika Mader “Automated Assessment of Learning Objectives in Programming Assignments” In Intelligent Tutoring Systems Cham: Springer International Publishing, 2021, pp. 299–309 DOI: 10.1007/978-3-030-80421-3“˙33
  • (157) Avneesh Sarwate, Creston Brunch, Jason Freeman and Sebastian Siva “Grading at Scale in Earsketch” In Proceedings of the Fifth Annual ACM Conference on Learning at Scale , LS ’18 London, United Kingdom: Association for Computing Machinery, 2018 DOI: 10.1145/3231644.3231708
  • (158) Jaromir Savelka et al. “Can Generative Pre-Trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses?” In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1 , ITiCSE 2023 Turku, Finland: Association for Computing Machinery, 2023, pp. 117–123 DOI: 10.1145/3587102.3588792
  • (159) Kevin Sendjaja, Satrio Adi Rukmono and Riza Satria Perdana “Evaluating Control-Flow Graph Similarity for Grading Programming Exercises” In 2021 International Conference on Data and Software Engineering (ICoDSE) , 2021, pp. 1–6 DOI: 10.1109/ICoDSE53690.2021.9648464
  • (160) Kevin Sendjaja, Satrio Adi Rukmono and Riza Satria Perdana “Evaluating Control-Flow Graph Similarity for Grading Programming Exercises” In 2021 International Conference on Data and Software Engineering (ICoDSE) , 2021, pp. 1–6 DOI: 10.1109/ICoDSE53690.2021.9648464
  • (161) Saksham Sharma, Pallav Agarwal, Parv Mor and Amey Karkare “TipsC: Tips and Corrections for programming MOOCs” In Artificial Intelligence in Education Cham: Springer International Publishing, 2018, pp. 322–326 DOI: 10.1007/978-3-319-93846-2“˙60
  • (162) Sadia Sharmin “Creativity in CS1: A Literature Review” In ACM Trans. Comput. Educ. 22.2 New York, NY, USA: Association for Computing Machinery, 2021 DOI: 10.1145/3459995
  • (163) Chad Sharp et al. “An Open-Source, API-Based Framework for Assessing the Correctness of Code in CS50” In Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education , ITiCSE ’20 Trondheim, Norway: Association for Computing Machinery, 2020, pp. 487–492 DOI: 10.1145/3341525.3387417
  • (164) Gursimran Singh, Shashank Srikant and Varun Aggarwal “Question Independent Grading Using Machine Learning: The Case of Computer Program Grading” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , KDD ’16 San Francisco, California, USA: Association for Computing Machinery, 2016, pp. 263–272 DOI: 10.1145/2939672.2939696
  • (165) Rishabh Singh, Sumit Gulwani and Armando Solar-Lezama “Automated Feedback Generation for Introductory Programming Assignments” In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation , PLDI ’13 Seattle, Washington, USA: Association for Computing Machinery, 2013, pp. 15–26 DOI: 10.1145/2491956.2462195
  • (166) Marcus Soll, Melf Johannsen and Chris Biemann “Enhancing a Theory-Focused Course Through the Introduction of Automatically Assessed Programming Exercises-Lessons Learned” In Proceedings of the Impact Papers at EC-TEL 2020 (EC-TEL 2020) , 2020 URL: https://ceur-ws.org/Vol-2676/paper6.pdf
  • (167) Dowon Song, Woosuk Lee and Hakjoo Oh “Context-Aware and Data-Driven Feedback Generation for Programming Assignments” In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , ESEC/FSE 2021 Athens, Greece: Association for Computing Machinery, 2021, pp. 328–340 DOI: 10.1145/3468264.3468598
  • (168) Draylson M. Souza, Katia R. Felizardo and Ellen F. Barbosa “A Systematic Literature Review of Assessment Tools for Programming Assignments” In 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET) , 2016, pp. 147–156 DOI: 10.1109/CSEET.2016.48
  • (169) Fábio Rezende De Souza, Francisco De Assis Zampirolli and Guiou Kobayashi “Convolutional neural network applied to code assignment grading” In CSEDU 2019 - Proceedings of the 11th International Conference on Computer Supported Education 1 SciTePress, 2019, pp. 62–69 DOI: 10.5220/0007711000620069
  • (170) Ioannis Stamelos, Lefteris Angelis, Apostolos Oikonomou and Georgios L. Bleris “Code quality analysis in open source software development” In Information Systems Journal 12.1 , 2002, pp. 43–60 DOI: https://doi.org/10.1046/j.1365-2575.2002.00117.x
  • (171) Ioanna Stamouli and Meriel Huggard “Object Oriented Programming and Program Correctness: The Students’ Perspective” In Proceedings of the Second International Workshop on Computing Education Research , ICER ’06 Canterbury, United Kingdom: Association for Computing Machinery, 2006, pp. 109–118 DOI: 10.1145/1151588.1151605
  • (172) Martijn Stegeman, Erik Barendsen and Sjaak Smetsers “Designing a Rubric for Feedback on Code Quality in Programming Courses” In Proceedings of the 16th Koli Calling International Conference on Computing Education Research , Koli Calling ’16 Koli, Finland: Association for Computing Machinery, 2016, pp. 160–164 DOI: 10.1145/2999541.2999555
  • (173) Daniela Steidl, Benjamin Hummel and Elmar Juergens “Quality analysis of source code comments” In 2013 21st International Conference on Program Comprehension (ICPC) , 2013, pp. 83–92 DOI: 10.1109/ICPC.2013.6613836
  • (174) Ryo Suzuki et al. “Exploring the Design Space of Automatically Synthesized Hints for Introductory Programming Assignments” In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems , CHI EA ’17 Denver, Colorado, USA: Association for Computing Machinery, 2017, pp. 2951–2958 DOI: 10.1145/3027063.3053187
  • (175) Ryo Suzuki et al. “TraceDiff: Debugging unexpected code behavior using trace divergences” In 2017 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) , 2017, pp. 107–115 DOI: 10.1109/VLHCC.2017.8103457
  • (176) Shao Tianyi, Kuang Yulin, Huang Yihong and Quan Yujuan “PAAA: An Implementation of Programming Assignments Automatic Assessing System” In Proceedings of the 2019 4th International Conference on Distance Education and Learning , ICDEL ’19 Shanghai, China: Association for Computing Machinery, 2019, pp. 68–72 DOI: 10.1145/3338147.3338187
  • (177) Zahid Ullah et al. “The effect of automatic assessment on novice programming: Strengths and limitations of existing systems” In Computer Applications in Engineering Education 26.6 , 2018, pp. 2328–2341 DOI: 10.1002/cae.21974
  • (178) Leo C. Ureel II and Charles Wallace “Automated Critique of Early Programming Antipatterns” In Proceedings of the 50th ACM Technical Symposium on Computer Science Education , SIGCSE ’19 Minneapolis, MN, USA: Association for Computing Machinery, 2019, pp. 738–744 DOI: 10.1145/3287324.3287463
  • (179) Leo C. Ureel II and Charles Wallace “Automated Critique of Early Programming Antipatterns” In Proceedings of the 50th ACM Technical Symposium on Computer Science Education , SIGCSE ’19 Minneapolis, MN, USA: Association for Computing Machinery, 2019, pp. 738–744 DOI: 10.1145/3287324.3287463
  • (180) Sebastián Vallejos et al. “Soploon: A virtual assistant to help teachers to detect object-oriented errors in students’ source codes” In Computer Applications in Engineering Education 26.5 , 2018, pp. 1279–1292 DOI: https://doi.org/10.1002/cae.22021
  • (181) Quentin Vaneck, Thomas Colart, Benoît Frénay and Benoît Vanderose “A Tool for Evaluating Computer Programs from Students” In Proceedings of the 3rd International Workshop on Education through Advanced Software Engineering and Artificial Intelligence , EASEAI 2021 Athens, Greece: Association for Computing Machinery, 2021, pp. 23–26 DOI: 10.1145/3472673.3473961
  • (182) Arjun Verma et al. “Source-Code Similarity Measurement: Syntax Tree Fingerprinting for Automated Evaluation” In Proceedings of the First International Conference on AI-ML Systems , AIMLSystems ’21 Bangalore, India: Association for Computing Machinery, 2021 DOI: 10.1145/3486001.3486228
  • (183) Orr Walker and Nathaniel Russell “Automatic Assessment of the Design Quality of Python Programs with Personalized Feedback” In Proceedings of The 14th International Conference on Educational Data Mining , 2021, pp. 495–501
  • (184) Jinshui Wang, Yunpeng Zhao, Zhengyi Tang and Zhenchang Xing “Combining Dynamic and Static Analysis for Automated Grading SQL Statements” In Taiwan Ubiquitous Information 5 , 2020
  • (185) Jinshui Wang, Yunpeng Zhao, Zhengyi Tang and Zhenchang Xing “Combining Dynamic and Static Analysis for Automated Grading SQL Statements” In Journal of Network Intelligence 5 , 2020
  • (186) Ke Wang, Rishabh Singh and Zhendong Su “Search, Align, and Repair: Data-Driven Feedback Generation for Introductory Programming Exercises” In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation , PLDI 2018 Philadelphia, PA, USA: Association for Computing Machinery, 2018, pp. 481–495 DOI: 10.1145/3192366.3192384
  • (187) Zhikai Wang and Lei Xu “Grading Programs Based on Hybrid Analysis” In Web Information Systems and Applications Cham: Springer International Publishing, 2019, pp. 626–637 DOI: 10.1007/978-3-030-30952-7“˙63
  • (188) Christopher Watson, Frederick W.B. Li and Jamie L. Godwin “BlueFix: Using Crowd-Sourced Feedback to Support Programming Students in Error Diagnosis and Repair” In Advances in Web-Based Learning - ICWL 2012 Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 228–239
  • (189) Dee A.B. Weikle, Michael O. Lam and Michael S. Kirkpatrick “Automating Systems Course Unit and Integration Testing: Experience Report” In Proceedings of the 50th ACM Technical Symposium on Computer Science Education , SIGCSE ’19 Minneapolis, MN, USA: Association for Computing Machinery, 2019, pp. 565–570 DOI: 10.1145/3287324.3287502
  • (190) M.L. Wickramasinghe et al. “Smart Exam Evaluator for Object-Oriented Programming Modules” In 2020 2nd International Conference on Advancements in Computing (ICAC) 1 , 2020, pp. 287–292 DOI: 10.1109/ICAC51239.2020.9357320
  • (191) Burkhard C. Wünsche et al. “Automatic Assessment of OpenGL Computer Graphics Assignments” In Proceedings of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education , ITiCSE 2018 Larnaca, Cyprus: Association for Computing Machinery, 2018, pp. 81–86 DOI: 10.1145/3197091.3197112
  • (192) Yi-Xiang Yan, Jung-Pin Wu, Bao-An Nguyen and Hsi-Min Chen “The Impact of Iterative Assessment System on Programming Learning Behavior” In Proceedings of the 2020 9th International Conference on Educational and Information Technology , ICEIT 2020 Oxford, United Kingdom: Association for Computing Machinery, 2020, pp. 89–94 DOI: 10.1145/3383923.3383939
  • (193) Y.T. Yu, C.M. Tang and C.K. Poon “Enhancing an automated system for assessment of student programs using the token pattern approach” In 2017 IEEE 6th International Conference on Teaching, Assessment, and Learning for Engineering (TALE) , 2017, pp. 406–413 DOI: 10.1109/TALE.2017.8252370
  • (194) Lucas Zamprogno, Reid Holmes and Elisa Baniassad “Nudging Student Learning Strategies Using Formative Feedback in Automatically Graded Assessments” In Proceedings of the 2020 ACM SIGPLAN Symposium on SPLASH-E , SPLASH-E 2020 Virtual, USA: Association for Computing Machinery, 2020, pp. 1–11 DOI: 10.1145/3426431.3428654

Appendix A Supplementary Material

A.1. Publications by Skill Graded and Category of Automatic Assessment Tools

A.2. Publications by Technique Implemented

A.3. Publications by Degree of Automation

A.4. Publications by Language Paradigm

A.5. Publications by Evaluation Technique

A.6. Publications by Dataset Availability


Automatically Grading Programming Assignments with Web-CAT


Web-CAT is the most widely used automated grading platform in the world, and is best known for allowing instructors to grade students based on how well they test their own code.

About This Tutorial

This tutorial introduces participants to using Web-CAT, an open-source automated grading system. Web-CAT is customizable and extensible, allowing it to support a wide variety of programming languages and assessment strategies. Web-CAT is most well-known as the system that “grades students on how well they test their own code,” with experimental evidence that it offers greater learning benefits than more traditional output-comparison grading. Participants will learn how to set up and configure assignments, manage multiple sections, and allow graders to manually grade for design.
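To make the "grades students on how well they test their own code" model concrete, the sketch below shows the kind of JUnit test a student might submit alongside their own solution; Web-CAT-style grading runs such tests and assesses how thoroughly they exercise the student's code. The Counter class and its methods are invented for this illustration and are not one of the tutorial's examples.

```java
import static org.junit.Assert.assertEquals;

import org.junit.Before;
import org.junit.Test;

// Hypothetical student-written JUnit test. Under test-based grading, the grader runs
// these tests against the student's own implementation and rewards tests that
// exercise it thoroughly.
public class CounterTest {
    // Minimal stand-in implementation so the sketch is self-contained.
    static class Counter {
        private int value = 0;
        public void increment() { value++; }
        public void reset() { value = 0; }
        public int getValue() { return value; }
    }

    private Counter counter;

    @Before
    public void setUp() {
        counter = new Counter();
    }

    @Test
    public void incrementAddsOne() {
        counter.increment();
        assertEquals(1, counter.getValue());
    }

    @Test
    public void resetReturnsToZero() {
        counter.increment();
        counter.reset();
        assertEquals(0, counter.getValue());
    }
}
```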

Presentation: Automatically Grading Programming Assignments with Web-CAT (PDF, 6 pp.)

Download a single zip file containing all of the following examples, or view examples individually in your web browser.

The examples shown in the tutorial:

Example 1: DvrRecording :

  • DvrRecording.java
  • DvrRecordingTest.java

Example 2: Calculator

  • Calculator.java
  • CalculatorReferenceTest.java

Example 3: HelloWorld : Testing main programs, standard input/output, etc. (a generic sketch of this kind of test follows the file list below).

  • HelloWorld1.java
  • HelloWorld1Test.java
  • HelloWorld2.java
  • HelloWorld2Test.java
  • HelloWorld3.java
  • HelloWorld3Test.java
  • RandomNumbers.java
  • RandomNumbersTest.java
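One generic way to test a main program's console output, shown here as a hypothetical sketch rather than the tutorial's actual test code, is to redirect System.out into a buffer and assert on what was printed. The Hello class below is a stand-in, not one of the tutorial's HelloWorld examples.

```java
import static org.junit.Assert.assertEquals;

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

// Hypothetical sketch: capture System.out to check what a main program prints.
public class MainOutputTest {
    // Stand-in for a student's HelloWorld-style program.
    static class Hello {
        public static void main(String[] args) {
            System.out.println("Hello, world!");
        }
    }

    private final ByteArrayOutputStream captured = new ByteArrayOutputStream();
    private PrintStream originalOut;

    @Before
    public void redirectStdout() {
        originalOut = System.out;
        System.setOut(new PrintStream(captured));
    }

    @After
    public void restoreStdout() {
        System.setOut(originalOut);
    }

    @Test
    public void printsGreeting() {
        Hello.main(new String[0]);
        assertEquals("Hello, world!", captured.toString().trim());
    }
}
```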

Example 4: PushCounter : Testing Swing GUI applications using LIFT (a rough, library-free sketch of the same idea follows the file list below).

  • PushCounter.java
  • PushCounterTest.java
  • PushCounterPanel.java
  • PushCounterPanelTest.java
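LIFT supplies GUI-specific helpers for locating and driving components. As a rough approximation of the idea using only plain JUnit and Swing (deliberately not the LIFT API), a test can click a button programmatically and check the resulting state. The CounterPanel class is invented for this sketch and is not the tutorial's PushCounterPanel.

```java
import static org.junit.Assert.assertEquals;

import javax.swing.JButton;
import javax.swing.JLabel;

import org.junit.Test;

// Rough, library-free sketch of testing Swing behaviour; LIFT provides higher-level
// helpers for locating and driving components, which this sketch does not use.
public class PushCounterSketchTest {
    // Stand-in panel: a button that increments a counter shown in a label.
    static class CounterPanel {
        final JLabel label = new JLabel("Pushes: 0");
        final JButton button = new JButton("Push Me!");
        private int pushes = 0;

        CounterPanel() {
            button.addActionListener(e -> label.setText("Pushes: " + (++pushes)));
        }
    }

    @Test
    public void clickingButtonUpdatesLabel() {
        CounterPanel panel = new CounterPanel();
        panel.button.doClick();   // programmatically "push" the button
        panel.button.doClick();
        assertEquals("Pushes: 2", panel.label.getText());
    }
}
```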

For More Information

roster.csv : The simple course roster used as an example.

checkstyle-None.xml and pmd-None.xml : The configuration files used to disable (turn off) Checkstyle and/or PMD static analysis checks.

loose-checkstyle.xml : An alternative Checkstyle rule configuration that is more forgiving than the default.

webcat-eclipse-submitter-1.4.3.zip : The Web-CAT submission plug-in for Eclipse.

student.jar : JUnit support library used by Web-CAT and many universities, including student.TestCase, student.GUITestCase, LIFT, and more.

Source code for student.jar and LIFT is open-source, available from the Web-CAT project on SourceForge.

JUnit.org has more information on how to use JUnit.

Documentation for the asserts in JUnit: http://www.junit.org/apidocs/org/junit/Assert.html
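As a quick, generic illustration of the asserts documented at that link (not code taken from the tutorial):

```java
import static org.junit.Assert.*;

import org.junit.Test;

// Illustrative use of the most common JUnit asserts.
public class AssertExamplesTest {
    @Test
    public void commonAsserts() {
        assertEquals("expected equals actual", 4, 2 + 2);
        assertTrue("condition should hold", "grader".startsWith("gra"));
        assertFalse("condition should not hold", "".contains("x"));
        assertNull("reference should be null", null);
        assertNotNull("reference should not be null", new Object());
        assertArrayEquals(new int[] {1, 2, 3}, new int[] {1, 2, 3});
    }
}
```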

Testing GUI Programs is a Prezi presentation that gives a good introduction to LIFT and how it is used.

The LIFT website provides more details on LIFT, including links to two SIGCSE papers, downloads, examples, and discussion forums.

autograder 3.7.7

pip install autograder

Released: Apr 15, 2024

A simple, secure, and versatile way to automatically grade programming assignments


License: GNU General Public License v3 (GPLv3) (GPL-3.0)

Author: Stanislav Zmiev

Requires: Python <3.12, >=3.8

Classifiers

  • 5 - Production/Stable
  • OSI Approved :: GNU General Public License v3 (GPLv3)
  • Python :: 3
  • Python :: 3.8
  • Python :: 3.9
  • Python :: 3.10
  • Python :: 3.11
  • Education :: Testing


Project details

Release history

Apr 15, 2024

Sep 2, 2023

Nov 12, 2022

Jun 10, 2016


Automatically grading Java programming assignments via reflection, inheritance, and regular expressions
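The title above names a common implementation technique: using reflection to verify that a submission declares the required structure before behavioural tests run. The following is a minimal, hypothetical sketch of that idea, not the paper's actual grader; the inspected class and required method are stand-ins.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical sketch of reflection-based structural checking: verify that a
// submitted class exposes a required public method before behavioural tests run.
public class StructureCheck {
    public static void main(String[] args) throws Exception {
        // In a real grader the class name would come from the submission;
        // java.lang.String is used here only as a stand-in.
        Class<?> submitted = Class.forName("java.lang.String");

        boolean found = false;
        for (Method m : submitted.getDeclaredMethods()) {
            // Required signature (illustrative): public int length()
            if (m.getName().equals("length")
                    && m.getParameterCount() == 0
                    && m.getReturnType() == int.class
                    && Modifier.isPublic(m.getModifiers())) {
                found = true;
                break;
            }
        }
        System.out.println(found
                ? "Structural check passed: required method found"
                : "Structural check failed: required method missing");
    }
}
```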


Grading Programming Assignments with an Automated Grading and Feedback Assistant

  • Conference paper
  • First Online: 26 July 2022


  • Marcus Messer, ORCID: orcid.org/0000-0001-5915-9153

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13356)

Included in the following conference series:

  • International Conference on Artificial Intelligence in Education


Over the last few years, Computer Science class sizes have increased, resulting in a higher grading workload. To manage this workload, universities often use multiple graders so that grades and the associated feedback can be delivered quickly. While using multiple graders enables the required turnaround times to be achieved, it can come at the cost of consistency and feedback quality. Partially automating grading and feedback could help solve these issues. This project will look into methods to assist in grading and providing feedback on partially subjective elements of programming assignments, such as readability, maintainability, and documentation, to increase the amount of time markers have to write meaningful feedback. We will investigate machine learning and natural language processing methods to improve grade uniformity and feedback quality in these areas. Furthermore, we will investigate how these tools may allow instructors to include open-ended requirements that challenge students to use their own ideas for possible features in their assignments.
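The abstract does not prescribe an implementation, but one simple, non-learned signal such an assistant could combine with machine learning models when assessing documentation is a comment-to-code ratio. The sketch below is a toy illustration under that assumption, not the paper's method; the CommentRatio class and commentRatio helper are invented for this example.

```java
import java.util.List;

// Toy illustration (not the paper's method) of one signal an automated assistant
// could compute when assessing documentation: the ratio of comment lines to total lines.
public class CommentRatio {
    public static double commentRatio(List<String> sourceLines) {
        long commentLines = sourceLines.stream()
                .map(String::trim)
                .filter(line -> line.startsWith("//") || line.startsWith("/*") || line.startsWith("*"))
                .count();
        return sourceLines.isEmpty() ? 0.0 : (double) commentLines / sourceLines.size();
    }

    public static void main(String[] args) {
        List<String> submission = List.of(
                "// Computes the sum of two integers.",
                "public int add(int a, int b) {",
                "    return a + b;",
                "}");
        System.out.printf("Comment ratio: %.2f%n", commentRatio(submission));
    }
}
```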


Author information

Marcus Messer, King's College London, London, UK (corresponding author)

Editor information

Maria Mercedes Rodrigo (Ateneo De Manila University, Quezon, Philippines), Noburu Matsuda (Department of Computer Science, North Carolina State University, Raleigh, NC, USA), Alexandra I. Cristea (Durham University, Durham, UK), Vania Dimitrova (University of Leeds, Leeds, UK)

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Messer, M. (2022). Grading Programming Assignments with an Automated Grading and Feedback Assistant. In: Rodrigo, M.M., Matsuda, N., Cristea, A.I., Dimitrova, V. (eds) Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners' and Doctoral Consortium. AIED 2022. Lecture Notes in Computer Science, vol 13356. Springer, Cham. https://doi.org/10.1007/978-3-031-11647-6_6

DOI: https://doi.org/10.1007/978-3-031-11647-6_6

Published: 26 July 2022

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-11646-9

Online ISBN: 978-3-031-11647-6

eBook Packages: Computer Science (R0)

Automatically grading programming homework

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), working with a colleague at Microsoft Research, have developed a new software system that can automatically identify errors in students’ programming assignments and recommend corrections.

Teaching assistants at MIT have already begun using the software. But some variation on it could help solve one of the biggest problems faced by massive open online courses (MOOCs) like those offered through edX, the  online learning initiative  created by MIT and Harvard University: how to automate grading.

The system grew out of work on program synthesis — the automatic generation of computer programs that meet a programmer’s specifications — at CSAIL’s Computer-Aided Programming Group, which is led by Armando Solar-Lezama, the NBX Career Development Assistant Professor of Computer Science and Engineering. A paper describing the work will be presented this month at the Association for Computing Machinery’s Programming Language Design and Implementation conference. Joining Solar-Lezama on the paper are first author Rishabh Singh, a graduate student in his group, and Sumit Gulwani of Microsoft Research.

“One challenge, when TAs grade these assignments, is that there are many different ways to solve the same problem,” Singh says. “For a TA, it can be quite hard to figure out what type of solution the student is trying to do and what’s wrong with it.” One advantage of the new software is that it will identify the minimum number of corrections necessary to get a program working, no matter how unorthodox the programmer’s approach.

Pursuing alternatives

The new system does depend on a catalogue of the types of errors that student programmers tend to make. One such error is to begin counting from zero on one pass through a series of data items and from one in another; another is to forget to add the condition of equality to a comparison — as in, “If a is greater than  or equal to  b, do x.”

The first step for the researchers’ automated-grading algorithm is to identify all the spots in a student’s program where any of the common errors might have occurred. At each of those spots, the possible error establishes a range of variations in the program’s output: one output if counting begins at zero, for instance, another if it begins at one. Every possible combination of variations represents a different candidate for the corrected version of the student’s program.

“The search space is quite big,” Singh says. “You typically get 10^15, 10^20 possible student solutions after doing these corrections. This is where the work on synthesis that we’ve been doing comes in. We can efficiently search this space within seconds.”

One key insight from program synthesis is that the relationships between a program’s inputs and outputs can be described by a set of equations, which can be generated automatically. And solving equations is generally more efficient than running lots of candidate programs to see what answers they give.

But wherever a possible error has established a range of outputs in the original program, the corresponding equation has a “parameter” — a variable that can take on a limited range of values. Finding values for those variables that yield working programs is itself a daunting search problem.

Limiting options

The CSAIL researchers’ algorithm solves it by first selecting a single target input that should elicit a specific output from a properly working program. That requirement in itself wipes out a large number of candidate programs: Many of them will give the wrong answer even for that one input. Various candidates remain, however, and the algorithm next selects one of them at random. For that program, it then finds an input that yields an  incorrect  output. That input becomes a new target input for all the remaining candidate programs, and so forth, iterating back and forth between fixed inputs and fixed programs.

This process converges on a working program with surprising speed. “Most of the corrections that are wrong are going to be really wrong,” Solar-Lezama explains. “They’re going to be wrong for most inputs. So by getting rid of the things that are wrong on even a small number of inputs, you’ve already gotten rid of most of the wrong things. It’s actually hard to write a wrong thing that is going to be wrong only on one carefully selected input. But if that’s the case, then once you have that one carefully selected input, that’s that.”
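
As a rough illustration of this back-and-forth between fixed inputs and fixed programs, the toy sketch below enumerates candidate corrections explicitly and checks them by running them against a reference solution. All names are made up for the example; the actual system encodes candidates as equations and solves them symbolically rather than executing each one, and it derives distinguishing inputs by solving, not by random sampling.

```python
# Toy counterexample-guided search over candidate corrections (illustrative).
import itertools
import random


def reference(xs):
    """Stand-in for the instructor's correct solution: sum all items."""
    return sum(xs)


def make_candidate(start_index, include_last):
    """One corrected variant of a buggy student loop, parameterised by two
    common error sites: the starting index and the loop bound."""
    def candidate(xs):
        total = 0
        end = len(xs) if include_last else len(xs) - 1
        for i in range(start_index, end):
            total += xs[i]
        return total
    return candidate


# Every combination of choices at the correction sites is one candidate program.
candidates = [make_candidate(s, inc)
              for s, inc in itertools.product((0, 1), (True, False))]

random.seed(0)
inputs = [[3, 1, 4, 1, 5]]          # the initial target input
for _ in range(10):
    # Keep only candidates that agree with the reference on all target inputs.
    candidates = [c for c in candidates
                  if all(c(xs) == reference(xs) for xs in inputs)]
    if len(candidates) <= 1:
        break
    # Pick one survivor and look for an input on which it is wrong; that input
    # becomes a new target input for all remaining candidates.
    probe = random.choice(candidates)
    trial = [random.randint(0, 9) for _ in range(random.randint(1, 6))]
    if probe(trial) != reference(trial):
        inputs.append(trial)

print(f"{len(candidates)} candidate correction(s) remain consistent with the tests")
```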

The researchers are currently evaluating how their system might be used to grade homework assignments in programming MOOCs. In some sense, the system works too well: Currently, as a tool to help TAs grade homework assignments, it provides specific feedback, including the line numbers of specific errors and suggested corrections. But for online students, part of the learning process may very well involve discovering errors themselves. The researchers are currently experimenting with variations on the software that indicate the location and nature of errors with different degrees of specificity, and talking with the edX team about how the program could best be used as a pedagogic tool.

“I think that the programming-languages community has a lot to offer the broader world,” says David Walker, an associate professor of computer science at Princeton University. “Armando is looking at using these synthesis techniques to try to help in education, and I think that’s a fantastic application of this kind of programming-languages technology.”

“The kind of thing that they’re doing here is definitely just the beginning,” Walker cautions. “It will be a big challenge to scale this type of technology up so that you can use it not just in the context of the very small introductory programming examples that they cover in their paper, but in larger-scale second- or third-year problems. But it’s a very exciting area.”

AutoGradr - Automated Grading for Programming Assignments

AutoGradr is used in 14 countries by over 100 institutions to automate the grading of their CS courses. AutoGradr comes with a fully loaded web-based IDE. Our IDE supports project workspaces, preloaded stacks, and test case verification. Students can write their code, run, debug, and submit all in one place.

Instructors can create questions and have them graded without writing any code themselves. Write test cases with just the expected console run, or upload the I/O files. Leave the rest up to AutoGradr.
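
A generic sketch of what grading against an expected console run involves is shown below (this is not AutoGradr's implementation; the file names and the 60-second limit are assumptions): run the submission, feed it the test's input, and compare its standard output with the expected output.

```python
# Generic stdin/stdout test-case grading sketch.
import subprocess
import sys
from pathlib import Path


def run_io_test(submission: str, stdin_file: str, expected_file: str,
                timeout_s: int = 60) -> bool:
    """Run a Python submission with the given stdin and compare its stdout,
    line by line and ignoring trailing whitespace, with the expected output."""
    expected = Path(expected_file).read_text()
    try:
        result = subprocess.run(
            [sys.executable, submission],
            input=Path(stdin_file).read_text(),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    got = [line.rstrip() for line in result.stdout.strip().splitlines()]
    want = [line.rstrip() for line in expected.strip().splitlines()]
    return got == want


if __name__ == "__main__":
    print("PASS" if run_io_test("submission.py", "test1.in", "test1.expected")
          else "FAIL")
```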

AutoGradr also supports Unit tests, Web UI tests, and Database tests - so you can use AutoGradr for more types of courses.

AutoGradr then provides real-time feedback to students. AutoGradr's IDE emulates real-life coding environments and offers help as though an instructor were watching over them. The grading interface, which returns feedback in less than three seconds, helps students understand how their submission differs from what was expected.

AutoGradr supports all major programming languages, such as C, C++, Java, Python, Swift, GoLang, C#, JavaScript, R, and many more. AutoGradr also supports web frameworks (Spring, Express), build tools (Maven, Gradle, Make), unit testing frameworks (JUnit, Jest, pytest), and databases (MySQL, MongoDB), so you can automate tests for any kind of program.

AutoGradr provides the complete AP CS50 curriculum and assignments so teachers can administer the course and assign coding projects based on the syllabus and have it automatically graded.

Autograder for Programming Assignments

Save countless hours on grading programming assignments. Introducing functional and easy to use automated grading for your classroom. Seamless integration with Codequiry's plagiarism checker.

Grade Coding Assignments Automatically

Codequiry's automated grading system takes minutes to set up and saves you countless hours on grading. Dynamic, multi-language support allows you to implement scalable and powerful auto grading.

Grade Instantly

Set up rules around your coding assignment and Codequiry's autograder will do the rest: no sorting, compiling, reviewing, or inputting grades by hand. The Autograder gives immediate feedback on code improvements and mistakes.

Export Grades

Get access to grades and important data easily with export tools. Use our Gradebook or export grades into your LMS, such as Canvas.

Easy as possible

Your source code doesn't need any SDKs for grading. Codequiry will compile, run, test, and grade through self-destroying sandbox instances. Just upload your files and Codequiry will do the hard work.
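
The compile-run-clean-up cycle described above can be sketched generically as follows (this is not Codequiry's code: the gcc invocation, the 10-second limit, and the function name are assumptions, and a real sandbox would add containerisation and resource limits):

```python
# Generic sketch of a throwaway compile/run/clean-up cycle for a C submission.
import shutil
import subprocess
import tempfile
from pathlib import Path


def grade_c_submission(source_file: str, stdin_text: str, expected: str) -> bool:
    workdir = Path(tempfile.mkdtemp(prefix="grader-"))
    try:
        binary = workdir / "student_program"
        compile_step = subprocess.run(["gcc", source_file, "-o", str(binary)],
                                      capture_output=True, text=True)
        if compile_step.returncode != 0:
            return False                      # compilation failure: test fails
        run_step = subprocess.run([str(binary)], input=stdin_text,
                                  capture_output=True, text=True, timeout=10)
        return run_step.stdout.strip() == expected.strip()
    except subprocess.TimeoutExpired:
        return False
    finally:
        shutil.rmtree(workdir, ignore_errors=True)   # "self-destroying" workspace
```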

Our Mission

Codequiry aims to create a fair environment in computer science and related fields by preventing the use of unoriginal code. Preserving academic integrity and original source code starts here.

Table of Contents

  • Assignment Types and Features
  • Exams & Quizzes
  • Homework & Problem Sets
  • Bubble Sheets
  • Programming Assignments
  • Online Assignments (Beta)
  • Assignment Workflow

Assignment Types

Gradescope allows you to grade paper-based exams, quizzes, bubble sheets, and homework. In addition, Gradescope enables you to grade  programming assignments  (graded automatically or manually) and lets you create online assignments that students can answer right on Gradescope.

For paper assignments, Gradescope works well for many types of questions: paragraphs, proofs, diagrams, fill-in-the-blank, true/false, and more. Our biggest users so far have been high school and higher-ed courses in Math, Chemistry, Computer Science, Physics, Economics, and Business — but we’re confident that our tool is useful to most subject areas and grade levels. Please  reach out to us  and we can help you figure out if Gradescope will be helpful in your course.

The following table details Gradescope assignment types, default settings, and offerings.

*The file-upload question type can be used for students to upload images of their handwritten work.

**Certain question types can be auto-graded: Multiple choice, select all, and fill in the blank.

***A non-templated, variable-length submission is only available for student-uploaded Gradescope assignments (HW/Problem Set and Programming assignments), and not for instructor-uploaded assignments.

A screen capture of the Exam/Quiz assignment type selected on the Create Assignment page.

Exam/Quiz assignments are for fixed-template assessments (not variable-length). You will upload a blank copy of the exam (see Creating, editing, and deleting an assignment  for more information) and create the assignment outline that you’ll use for grading. By default, the  Exam / Quiz  assignment type is set up so that instructors or TAs can scan and submit their students’ work.

Once the assignment is created, you’ll:

  • Mark the question regions on a template PDF ( Creating an outline )
  • Create rubrics for your questions if applicable ( Creating rubrics )
  • Upload and process scans*  ( Managing scans )
  • Match student names to submissions*  ( Managing submissions )
  • Grade student work with flexible, dynamic rubrics ( Grading )

When grading is finished you can:

  • Publish grades and email students ( Reviewing grades )
  • Export grades ( Exporting Grades )
  • Manage regrade requests ( Managing regrade requests )
  • See question and rubric-level statistics to better understand what your students have learned ( Assignment Statistics )

*Not applicable if students are uploading their own work.

A screen capture of the create assignment page with the homework / problem set option selected.

The Homework / Problem Set assignment type is for variable-length (non-templated) assessments. By default, the assignment is set up for students to upload their own work. In a typical homework assignment, students will upload their work and be directed to mark where their answers are on their submissions ( Submitting an assignment ), making them even easier for you to grade. If needed, you can also submit on behalf of your students, even if you've originally set the assignment to be student-uploaded. See more on that on our Managing Submissions help page.

Next, Gradescope will prompt you to set the assignment release date and due date, choose your submission type, and set your group submission policy ( Creating, editing, and deleting an assignment ). You can then select Enforce time limit and use the Maximum Time Permitted feature to give students a set number of minutes to complete the assignment from the moment they confirm that they're ready to begin. Under Template Visibility, you can select Allow students to view and download the template to let students view and download a blank copy of the homework after the assignment release date.

Then, you will create the assignment outline ( Creating an outline ) and either create a rubric now or wait for students to submit their work. You can begin grading as soon as a single submission is uploaded (although we recommend waiting until the due date passes, since students can resubmit), and you can view all student-uploaded submissions from the  Manage Submissions  tab. The rest of the workflow is the same as exams and quizzes: you can publish grades, email students ( Reviewing grades ), export grades ( Exporting Grades ), and manage regrade requests ( Managing regrade requests ).

If your assignment is completely multiple choice, you should consider using the Bubble Sheet assignment type. With this type of assignment, you need to electronically or manually distribute and have students fill out the  Gradescope Bubble Sheet Template . You can then mark the correct answers for each question ahead of time, and all student submissions will be automatically graded.
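
Conceptually, key-based scoring reduces to comparing each student's marked answers against the answer key; the toy sketch below (illustrative only, omitting the scanned-sheet reading that Gradescope performs) shows the idea:

```python
# Toy key-based bubble-sheet scoring.
answer_key = {1: "B", 2: "D", 3: "A"}


def score_sheet(responses: dict, points_per_question: float = 1.0) -> float:
    """Award points for every question whose marked answer matches the key."""
    return sum(points_per_question
               for q, correct in answer_key.items()
               if responses.get(q) == correct)


print(score_sheet({1: "B", 2: "C", 3: "A"}))  # -> 2.0
```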

A screen capture of the create assignment page with the bubble sheet option selected.

By default, the Bubble Sheet assignment type is set up for instructors to scan and upload. However, you can change this by choosing Students under  Who will upload submissions?  in your assignment settings and following the steps in the Homework and Problem Sets section of this guide. If submissions will be student-uploaded, you can also enable  Template Visibility  in your assignment settings to let students download a blank, 200-question bubble sheet template from Gradescope when they open the assignment. If you enable template visibility on a Bubble Sheet assignment, please note that you will  not  need to upload a blank bubble sheet for students to be able to download it, and the template students can download will contain five answer bubbles per question, but no question content.

Once the assignment is created you’ll:

  • Create an answer key and set grading defaults ( Bubble Sheet specific features )
  • Upload and process scans * ( Managing scans )
  • Match student names to submissions * ( Managing submissions )
  • Review uncertain marks and optionally add more descriptive rubric items ( Bubble Sheet specific features )

And when grading is completed, you have access to the usual steps: publishing grades and emailing students ( Reviewing grades ), exporting grades ( Exporting Grades ), and managing regrade requests ( Managing regrade requests ).

However, there is also an additional analysis page for Bubble Sheet assignments: Item Analysis. We calculate a discrimination score for each question, i.e. the correlation between getting the question right and the overall assignment score.

For more information about features specific to Bubble Sheets, check out our Bubble Sheets assignment guidance.
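
A common way to compute such a discrimination statistic (not necessarily Gradescope's exact formula) is the correlation between per-student correctness on a question and their total scores, as in this small illustration:

```python
# Illustrative item-discrimination calculation: Pearson correlation between
# per-question correctness (1/0) and total assignment score.
from statistics import correlation  # available in Python 3.10+

# Rows are students, columns are questions (1 = correct, 0 = incorrect).
answers = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
totals = [sum(row) for row in answers]

for q in range(len(answers[0])):
    item = [row[q] for row in answers]
    if len(set(item)) == 1:   # correlation is undefined for a constant column
        print(f"Q{q + 1}: every student answered the same way")
        continue
    print(f"Q{q + 1}: discrimination = {correlation(item, totals):.2f}")
```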

With Programming Assignments, students submit code projects and instructors can automatically grade student code with a custom written autograder and/or manually grade using the traditional Gradescope interface.

A screen capture of the create assignment page with the programming assignment type selected.

When setting up a Programming Assignment, you'll have a few options unique to this assignment type, which you can learn about in the programming assignment documentation.

After the assignment is created , the workflow is similar to other student submitted assignments:

  • If you wish to manually grade questions, you’ll add them to the outline
  • If you wish to use an autograder, you’ll set it up next ( Autograder Specifications )
  • Wait for submissions from students
  • Optionally, manually grade student work ( Manual Grading )
  • Manage regrade requests ( Managing regrade requests ).

For more information about programming assignments and autograders, check out the  Programming Assignment documentation .
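
As a concrete illustration of the autograder step, the sketch below shows the final stage of a minimal custom autograder that writes Gradescope's documented results.json file; the single check, the solution.py file name, and the point values are placeholders rather than a recommended setup.

```python
# Sketch of the last step of a custom Gradescope autograder: run a check and
# write results.json (a list of per-test results) to the documented location.
import json
import subprocess
import sys

RESULTS_PATH = "/autograder/results/results.json"

test = {"name": "prints the answer", "score": 0, "max_score": 10, "output": ""}
try:
    proc = subprocess.run(
        [sys.executable, "/autograder/submission/solution.py"],
        capture_output=True, text=True, timeout=30)
    if proc.stdout.strip() == "42":
        test["score"] = 10
        test["output"] = "Correct output."
    else:
        test["output"] = f"Expected 42, got: {proc.stdout.strip()!r}"
except subprocess.TimeoutExpired:
    test["output"] = "Submission timed out after 30 seconds."

# Gradescope sums per-test scores when no top-level score is given.
with open(RESULTS_PATH, "w") as fh:
    json.dump({"tests": [test]}, fh)
```

In practice, autograders typically wrap an existing unit-test framework and map its results into this structure rather than hand-rolling checks like this.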

Currently in beta, an Online Assignment offers the following features:

  • Allows you to create questions directly on Gradescope.
  • Students will be able to log in and submit responses within the Gradescope interface.
  • If you’d like, you can also give students a set number of minutes to submit their work from the moment they open the assignment.
  • Additionally, you can choose to hide questions and responses once the due date passes or the time limit runs out to help prevent students who have completed the assignment from sharing questions and answers with students who have not finished working.
  • For multiple choice, select all, and short answer questions, you can indicate the correct answer ahead of time, and student submissions will be automatically graded. You can also add a File Upload field to a question so that students can complete their work on that question outside of Gradescope and then upload the files, for example a photo or PDF of their handwritten answer.

A screen capture of the create assignment page with the online assignment type selected.

After creating the assignment:

  • Enter your questions using the Assignment Editor ( Online Assignment specific features )
  • Optionally, manually grade student answers

And when grading is completed, you have access to the usual steps: publishing grades, exporting grades, and managing regrade requests.

For more information about Online Assignments, check out our Online assignments guidance .

    Teachers are embracing ChatGPT-powered grading. A new tool called Writable, which uses ChatGPT to help grade student writing assignments, is being offered widely to teachers in grades 3-12. Why it matters: Teachers have quietly used ChatGPT to grade papers since it first came out — but now schools are sanctioning and encouraging its use.