AI education

Why I think that AI-based grading in education is inevitable.

A few days ago I commented on an article that discusses the introduction of AI into education and why teachers shouldn’t worry about it. I also said that AI for grading was inevitable because it would be cheaper, and more reliable, fair and valid than human beings. I got some pushback from Ben on Twitter and realised that I was making several assumptions in my post so I’ve written this post to clarify some of what I said. I also wanted to use this post to test my assumptions around the claims I made, so it’s a bit longer than usual since I’m “thinking out loud” and trying to justify each point.

First, there are the 4 claims I make for why I think that AI-based assessment of things like essays is inevitable:

  • AI will be cheaper than human beings.
  • AI will be more reliable than human beings.
  • AI will be more fair than human beings.
  • AI will be more valid than human beings.

Cheaper: Over the past 60 years or so we’ve seen fairly consistent improvements in power, efficiency and speed at increasingly lower costs. Even if we assume that Moore’s Law is bottoming out we’ll still see continued progress in cost reduction because of improvements in programming techniques, purpose-built chips and new technologies like quantum computers. This is important because, like any industry, education works on a budget. If a university can get close to the same outcomes with a significant reduction in cost, they’ll take it. Software vendors will offer the “essay grading” module that can be integrated into the institutional LMS and the costing will be such that universities would be crazy not to at least pilot it. And my thinking is that it’ll become very clear, very quickly that a significant part of essay grading is really very simple for machines to do. Which brings me to the next claim…

More reliable: A lot of essay grading boils down to things that are relatively simple to programme into a system. For example, spelling is largely a problem that we’ve solved (barring regional inconsistencies) and can therefore express as a system of rules. These rules can be coded into an algorithm, which is why spell-checking works. Grammatical structure is also generally well-understood, with most languages having concepts like nouns, verbs, adjectives, etc., as well as conventions for how these words are best positioned relative to each other to enhance readability and understanding. Whether we use prescriptive rules (“we should do this”) or descriptive rules (“we actually do this”) matters less than knowing what set of rules we’ll use for the task at hand. It seems reasonable that physiotherapy lecturers could tune an algorithm with a slider, specifying that grammatical structure is less important for their students (i.e. lower scores with respect to prescriptive rules are OK) while an English lecturer might insist that their students must score higher on how words should be used. Reference formatting is also easy to code with a series of rules, as is the idea that knowledge claims should be supported with evidence. Related to this, machines are getting better at identifying passages of text that are simply copied from a source. And I think it’s reasonable to assert that a computer can count more quickly, and more reliably, than a person. Of course this doesn’t take into account things like creativity but I’ll get to that. For now, we should at least grant that an AI could plausibly be more reliable than a human being (i.e. it assesses the same things in the same way across multiple examples) when it comes to evaluating things like spelling, grammatical structure, essay structure, referencing, and plagiarism. And machines will do this consistently across tens of thousands of students.
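To make the “slider” idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the criterion names, the scores (imagined as coming from separate spelling, grammar and referencing checkers) and the weights are all made up, and a real grading system would be far more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    score: float  # 0.0 to 1.0, assumed to come from some upstream checker


def weighted_grade(criteria, weights):
    """Combine per-criterion scores into a percentage.

    `weights` plays the role of the lecturer's sliders: a larger weight
    means that criterion counts for more in this course.
    """
    total_weight = sum(weights.get(c.name, 0.0) for c in criteria)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    weighted_sum = sum(c.score * weights.get(c.name, 0.0) for c in criteria)
    return 100.0 * weighted_sum / total_weight


# Hypothetical scores for one essay: strong spelling, weaker grammar.
essay = [
    Criterion("spelling", 0.9),
    Criterion("grammar", 0.6),
    Criterion("referencing", 0.8),
    Criterion("argument", 0.7),
]

# Hypothetical slider settings for two departments.
physio_weights = {"spelling": 1, "grammar": 1, "referencing": 2, "argument": 4}
english_weights = {"spelling": 2, "grammar": 4, "referencing": 1, "argument": 2}

physio_grade = weighted_grade(essay, physio_weights)
english_grade = weighted_grade(essay, english_weights)
print(f"Physiotherapy: {physio_grade:.1f}%  English: {english_grade:.1f}%")
```

With these made-up numbers the same essay comes out at about 74% under the physiotherapy weighting and about 71% under the English weighting, because the weaker grammar counts for more in the second case. The point is only that the rules are applied identically to every script; the weights are where the human judgement lives.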

Fairer: Human beings are inherently unfair. Regardless of how fair we think we’re being, there are some variables that we simply can’t tune because we’re not even aware that they’re affecting us. There’s evidence that we’re more strict when we’re hungry or when we’re angry with a partner, and that we’re also influenced by the gender of the person we’re grading, the time of day, etc. We’re also affected by sequencing; my grading of the essays I read later is influenced by the earlier examples I’ve seen. This means that a student’s grade might be affected by where in the pile their script is lying, or by their surname if the submission is digital and sorted alphabetically. It may be literally true (warning: controversial opinion coming up) that a student’s mark is more strongly influenced by my current relationship with my children than by what they’ve actually written. Our cognitive biases make it almost impossible for human beings to be as fair as we think we are. And yes, I’m aware that biases are inherent to machine learning algorithms as well. The difference is that those kinds of biases can be seen and corrected, whereas human bias is – and is likely to remain – invisible and unaccountable.

More valid: And finally there’s the issue of validity; are we assessing what we say we’re assessing? For essays this is an important point. Essays are often written in response to a critical question and it’s easy for the assessor to lose sight of that during the grading process. Again, our biases can influence our perceptions without us even being aware of them. A student’s reference to a current political situation may score them points (availability bias) while another, equally valid reference to a story we’re not aware of wouldn’t have the same valence for the assessor. Students can tweak other variables to create a good impression on the reader, none of which are necessarily related to how well they answer the question. For example, even just taking a few minutes to present the essay in a way that’s aesthetically pleasing can influence an assessor, never mind the points received for simply following instructions on layout (e.g. margin size, line spacing, font selection, etc.). When you add things like the relationship between students and the assessor, you start to get a sense for how the person doing the grading can be influenced by many other factors besides the students’ ability to answer the essay question.

OK, so that’s why I think that the introduction of AI for grading – at least for grading essays – is inevitable. However, I’m aware that doesn’t really deal with the bulk of the concerns that Ben raised. I just wanted to provide some context and support for the initial claims I made. The rest of this post is in response to the specific concerns that Ben raised in his series of tweets. I’ve combined some of them below for easier reference.

Can we be sure [that AI-based grading of assessment] is not a bad thing? Is it necessarily fairer? Thinking about the last lot of essays I marked, the ones getting the highest grades varied significantly, representing different takes on a wide ranging and pretty open ended topic. As markers we could allow some weaknesses in an assignment that did other things extremely well and showed independence of thought. The ones getting very good but not excellent grades were possibly more consistent, they were polished and competent but didn’t make quite the same creative or critical jump.

I think I addressed the concern about fairness earlier in the post. I really do think that AI-based grading will be more fair to students. There’s also the argument that the examples with the highest grades tend to be quite different from each other. This is a good thing and represents the kind of open-ended response that demonstrates how students can use their imagination to construct wide-ranging, unanticipated answers to difficult questions. I think that this would be addressed by the fact that AI-based systems are trained on tens of thousands of examples, all of which are labelled by human assessors. Instead of the system being narrowly constrained by the algorithm, I think that algorithms will open up the possible space of what “good” looks like. While I’m often delighted with variation and creative responses from students, not all of my colleagues feel the same way. An AI-based grading system will ensure that, if we highlight “creativity” as an attribute that we value in our assessments, individual lecturers won’t have as much power to constrain its development. And AI systems will also be able to “acknowledge” that some areas of the students’ submissions are stronger than others, and will be able to grade across different criteria (for example, the output might look like: “student’s ability to follow instructions is ‘excellent’, language – especially grammar – can be improved, ability to develop an argument from initial premises is ‘good’, etc.”).
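As a rough illustration of that kind of per-criterion output, here is a small Python sketch. The band thresholds (0.8 and 0.6), the criterion names and the scores are all assumptions I’ve invented for the example; nothing here reflects how an actual grading product works.

```python
def band(score):
    """Map a 0.0-1.0 score to a feedback label (thresholds are assumed)."""
    if score >= 0.8:
        return "excellent"
    if score >= 0.6:
        return "good"
    return "can be improved"


def feedback_report(scores):
    """Return one feedback line per criterion, in the order given."""
    return [f"{criterion}: {band(score)}" for criterion, score in scores.items()]


# Hypothetical per-criterion scores for a single submission.
report = feedback_report(
    {
        "following instructions": 0.9,
        "language (grammar)": 0.4,
        "developing an argument": 0.7,
    }
)
for line in report:
    print(line)
```

Reporting each criterion separately, rather than folding everything into one number, is what would let a system say that a submission is strong in one area and weak in another.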

How will AI marking allow for the imaginative, creative assignments and avoid a cycle of increasingly standardised and sanitised assignments as students work out how to please the algorithm?

My first response to this is: how do we “…avoid a cycle of increasingly standardised and sanitised assignments as students work out how to please the lecturer?” And then there’s also the progress being made in “creative expressions” of AI-based systems; art (see here and here), music (see here, here, and here), and stories/poems (see here, and here). You can argue that an AI that uses human artifacts to generate new examples is simply derivative. But I’d counter by saying almost all human-generated art is similarly derivative. There are very few people who have developed unique insights that shift how we see the world at some fundamental level. You could also argue that some of these platforms aren’t yet very good. I’d suggest that they will only ever get better, and that you can’t say the same for people.

Is it fairer to aim for consistency of input/output or to allow for individual interpretations of an assignment? What, at heart is the point of assessment in higher education – consistent competence or individual critical thought?

Also who influences the algorithm? Is it on an institutional basis or wider? Is it fairer to allow for varied and localised interpretations of excellence or end up making everyone fit to one homogenous standard (we can guess which dominant cultural norms it would reflect…)

This is an excellent point and the main reason why I think it’s incumbent on lecturers to be involved in the development of AI-based systems in education. We can’t rely on software engineers in Silicon Valley to be solely responsible for the design choices that influence how artificial intelligence should be used in education. I expand on these ideas in this book chapter (slideshow summary here).

On the whole I think that Ben has raised important questions and agree that these are valid concerns. For me, there are three main issues to highlight, which I’d summarise like so:

  1. There is a tension between creating assignments that enable open-ended (and therefore creative) student responses and those that are more closed, pushing students towards more standardised submissions. Will AI-based grading systems be able to deal with this nuance?
  2. There is a risk that students might become more concerned with gaming the system and aiming to “please the algorithm”, resulting in sanitised essays rather than imaginative and creative work. How can we avoid this “gaming the system” approach?
  3. There is a bias that’s built into machine learning which is likely to reflect the dominant cultural norms of those responsible for the system. Are we happy to have these biases influence student outcomes and if not, how will we counter them?

Looking back, I think I’ve presented reasonable arguments for each of the points above. I may have misunderstood the concerns and I’ve definitely left out important points. But I think that this is enough for now. If you’re a university lecturer or high school teacher, the points raised by Ben in his tweets are great starting points for a conversation about how these systems will affect us all.

I don’t think that the introduction of AI-based essay grading will affect our ability to design open-ended assessments that enable student creativity and imagination. We’ve known for decades that rules cannot describe the complexity of human society because people – and the outcomes of interactions between people – are unknowable. And if we can’t specify in advance what these outcomes will look like, we can’t encode them in rules. But this has been the breakthrough that machine learning has brought to AI research. AI-based systems don’t attempt to have “reality” coded into them but rather learn about “reality” from massive sets of examples that are labelled by human beings. This may turn out to be the wrong approach but, for me at least, the argument for using AI in assessment is a plausible one.

AI education

Comment: Teachers, the Robots Are Coming. But That’s Not a Bad Thing.

…that’s exactly why educators should not be putting their heads in the sand and hoping they never get replaced by an AI-powered robot. They need to play a big role in the development of these technologies so that whatever is produced is ethical and unbiased, improves student learning, and helps teachers spend more time inspiring students, building strong relationships with them, and focusing on the priorities that matter most. If designed with educator input, these technologies could free up teachers to do what they do best: inspire students to learn and coach them along the way.

Bushweller, K. (2020). Teachers, the Robots Are Coming. But That’s Not a Bad Thing. Education Week.

There are a few points in the article that confuse rather than clarify (for example, the conflation of robots with software) but on the whole I think this provides a useful overview of some of the main concerns around the introduction of AI-based systems in education. Personally, I’m not at all worried about having humanoid (or animal-type) physical robots coming into the classroom to take over my job.

I think that AI will be introduced into educational settings more surreptitiously, for example via the institutional LMS in the form of grading assistance, risk identification, timetabling, etc. And we’ll welcome this because it frees us from the very labour intensive, repetitive work that we all complain about. Not only that, but grading seems to be one of the most expensive aspects (in terms of time) of a teacher’s job, and because of this we’re going to see a lot of interest in this area from governments. For example, see this project by Ofqual (the regulator of qualifications and exams in England) to explore the use of AI to grade school exams.

In fact, I think that AI-based assessment is pretty much inevitable in educational contexts, given that it’ll probably be (a lot) cheaper, more reliable, fair, and valid than human graders.

Shameless self-promotion: I wrote a book chapter about how teachers could play a role in the development of AI-based systems in education, specifically in the areas of data collection, teaching practice, research, and policy development. Here is the full-text (preprint) and here are my slides from a seminar at the University of Cape Town where I presented an overview.

AI education

We Need Transparency in Algorithms, But Too Much Can Backfire

The students had also been asked what grade they thought they would get, and it turned out that levels of trust in those students whose actual grades hit or exceeded that estimate were unaffected by transparency. But people whose expectations were violated – students who received lower scores than they expected – trusted the algorithm more when they got more of an explanation of how it worked. This was interesting for two reasons: it confirmed a human tendency to apply greater scrutiny to information when expectations are violated. And it showed that the distrust that might accompany negative or disappointing results can be alleviated if people believe that the underlying process is fair.

Source: We Need Transparency in Algorithms, But Too Much Can Backfire

This article uses the example of algorithmic grading of student work to discuss issues of trust and transparency. One of the findings I thought was a useful takeaway in this context is that full transparency may not be the goal, but that we should rather aim for medium transparency, and only in situations where students’ expectations are not met. For example, a student whose grade was lower than expected might need to be told something about how it was calculated. But when they got too much information it eroded trust in the algorithm completely. When students got the grade they expected, no transparency was needed at all, i.e. they didn’t care how the grade was calculated.
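That takeaway can be written down as a very simple rule. The Python sketch below is my own reading of the finding, not something from the article: the function name and the “none” / “medium” labels are assumptions, and it deliberately has no “full” level because too much detail was what eroded trust.

```python
def explanation_level(expected_grade, actual_grade):
    """Decide how much explanation to show a student alongside a grade.

    Assumed rule: students whose grade met or exceeded their own estimate
    get no explanation; students whose grade fell short get a medium-level
    explanation of how it was calculated.
    """
    if actual_grade >= expected_grade:
        return "none"
    return "medium"


# Hypothetical students: (expected grade, actual grade).
for expected, actual in [(65, 72), (70, 70), (75, 61)]:
    print(expected, actual, explanation_level(expected, actual))
```

Only the third student, who expected 75 and received 61, would be shown an explanation.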

For developers of algorithms, the article also provides a short summary of what explainable AI might look like. For example, without exposing the underlying source code, which in many cases is proprietary and holds commercial value for the company, explainable AI might simply identify the relationships between inputs and outcomes, highlight possible biases, and provide guidance that may help to address potential problems in the algorithm.


Public posting of marks

My university has a policy where the marks for each assessment task are posted – anonymously – on the departmental notice board. I think it goes back to a time when students were not automatically notified by email and individual notifications of grades would have been too time consuming. Now that our students get their marks as soon as they are captured in the system, I asked myself why we still bother to post the marks publicly.

I can’t think of a single reason why we should. What is the benefit of posting a list of marks where students are ranked against how others performed in the assessment? It has no value – as far as I can tell – for learning. No value for self-esteem (unless you’re performing in the higher percentiles). No value for the institution or teacher. So why do we still do it?

I conducted a short poll among my final year ethics students asking them if they wanted me to continue posting their marks in public. See below for their responses.


Moving forward, I will no longer post my students’ marks in public, nor will I publish class averages, unless specifically requested to do so. If I’m going to say that I’m assessing students against a set of criteria rather than against each other, I need to have my practice mirror this. How are students supposed to develop empathy when we constantly remind them that they’re in competition with each other?


assessment education ethics physiotherapy social media technology

Using a rubric for a blogging assignment

Earlier this year I gave my 3rd year students an assignment in which they needed to write a reflective blog post based on a clinical experience they’d had. I just thought I’d share the rubric I used to grade the assignments, as I’ve come across a few people who have had difficulty trying to assign grades to blog posts. This one below is the best that I could manage but I would love to hear if you think there’s anything I could do differently.


assessment diigo

Posted to Diigo 05/25/2010

    • Turn over grading to the students in the course
    • “It was spectacular, far exceeding my expectations,” she said. “It would take a lot to get me back to a conventional form of grading ever again.”
    • she found that it inspired students to do more work, and more creative work than she sees in courses with traditional grading
    • based on contracts and “crowdsourcing.” First she announced the standards — students had to do all of the work and attend class to earn an A. If they didn’t complete all the assignments, they could get a B or C or worse, based on how many they finished. Students signed a contract to agree to the terms. But students also determined if the assignments (in this case blog posts that were mini-essays on the week’s work) were in fact meeting standards
    • the students each ended up writing about 1,000 words a week, much more than is required for a course to be considered “writing intensive”
    • she said that students took more risks
    • “I think students were going out on a limb more and being creative and not just thinking about ‘What does the teacher want?’ ”
    • While the students are ending up with As, many of them are doing so only because they redid assignments that were judged not sufficient to the task on the first try
    • “No one wanted to get one of those messages” that an assignment needed to be redone. (But when they did receive such notes, the students didn’t complain, as many do about grades they don’t like. They reworked their essays, she said.)
    • the alternative approach to grading in the course didn’t eliminate the teacher’s role, but changed the dynamic from “a single teaching-student interaction to multiple teacher-student/student-student interactions” with students in the roles of both student and teacher
    • “peer pressure is a very influential thing.”
    • “The greatest scam ever pulled off by “vendors” was convincing management that an LMS isn’t just a database. The second biggest? That they really needed one. The third? That it is a “Learning” “Management” System.”
    • “Those organizations (and frankly public learning institutions) that are clinging to their standalone learning management systems as a way in which to serve up formal ILT course schedules and eLearning are absolutely missing the big picture. Sadly, there are too many organizations like this out there.”
    • “The traditional stand-alone learning management system (LMS) is built on an industrial age model. There are two specific problems with this model, first it is monolithic within a learning institution and second it is generic across learning institutions.”
    • there are simpler, cost-effective ways of tracking and reporting usage of content
    • the key point, as mentioned in the earlier Dan Pontefract quote, is that by focusing on an LMS, organisations are missing the big picture
    • adding social functionality into formal courses might go some way to making them more “engaging” to users, but it isn’t addressing the wider “learning” needs of the organisation
    • you simply can’t manage or formalise informal learning; it then just becomes formal, managed learning
    • “Whether you’re in a private or public organization …  start first with a ‘collaboration’ system rather than a ‘learning’ system, and build out from there.”