A few days ago I commented on an article that discusses the introduction of AI into education and why teachers shouldn’t worry about it. I also said that AI for grading was inevitable because it would be cheaper, and more reliable, fair and valid than human beings. I got some pushback from Ben on Twitter and realised that I was making several assumptions in that post, so I’ve written this one to clarify some of what I said. I also wanted to use this post to test the assumptions behind the claims I made, so it’s a bit longer than usual since I’m “thinking out loud” and trying to justify each point.
First, there are the four claims I make for why I think that AI-based assessment of things like essays is inevitable:
- AI will be cheaper than human beings.
- AI will be more reliable than human beings.
- AI will be more fair than human beings.
- AI will be more valid than human beings.
Cheaper: Over the past 60 years or so we’ve seen fairly consistent improvements in computing power, efficiency and speed at increasingly lower costs. Even if we assume that Moore’s Law is bottoming out we’ll still see continued progress in cost reduction because of improvements in programming techniques, purpose-built chips and new technologies like quantum computers. This is important because, like any industry, education works on a budget. If a university can get close to the same outcomes with a significant reduction in cost, they’ll take it. Software vendors will offer an “essay grading” module that can be integrated into the institutional LMS, and the pricing will be such that universities would be crazy not to at least pilot it. And my thinking is that it’ll become very clear, very quickly, that a significant part of essay grading is really very simple for machines to do. Which brings me to the next claim…
More reliable: A lot of essay grading boils down to things that are relatively simple to programme into a system. For example, spelling is largely a problem that we’ve solved (barring regional inconsistencies) and can therefore express as a system of rules. These rules can be coded into an algorithm, which is why spell-checking works. Grammatical structure is also generally well-understood, with most cultures having concepts like nouns, verbs, adjectives, etc., as well as an understanding of how these words are best positioned relative to each other to enhance readability and understanding. Whether we use prescriptive rules (“we should do this”) or descriptive rules (“we actually do this”) matters less than knowing what set of rules we’ll use for the task at hand. It seems reasonable that physiotherapy lecturers could tune an algorithm with a slider, specifying that grammatical structure is less important for their students (i.e. lower scores with respect to prescriptive rules are OK) while an English lecturer might insist that their students must score higher on how words should be used. Reference formatting is also easy to code with a series of rules, as is the idea that knowledge claims should be supported with evidence. Related to this, machines are getting better at identifying passages of text that are simply copied from a source. And I think it’s reasonable to assert that a computer can count more quickly, and more reliably, than a person. Of course this doesn’t take into account things like creativity but I’ll get to that. For now, we should at least grant that an AI could plausibly be more reliable than a human being (i.e. it assesses the same things in the same way across multiple examples) when it comes to evaluating things like spelling, grammatical structure, essay structure, referencing, and plagiarism. And machines will do this consistently across tens of thousands of students.
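To make the “slider” idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the criteria names, the weights, and the idea that some upstream checker has already scored each criterion between 0 and 1. It isn’t how any real grading product works; it only shows that the same essay can be weighted differently by different departments while the calculation itself stays the same every time.

```python
# Hypothetical criteria. Each score is assumed to be normalised to 0..1
# by some upstream checker (spell-checker, grammar checker, and so on).
CRITERIA = ("spelling", "grammar", "structure", "referencing")


def weighted_grade(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (0..1) into a single grade (0..100)."""
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    weighted = sum(scores[c] * weights[c] for c in CRITERIA)
    return 100 * weighted / total_weight


# Illustrative "slider" positions, not taken from any real system.
PHYSIOTHERAPY = {"spelling": 1, "grammar": 1, "structure": 3, "referencing": 3}
ENGLISH = {"spelling": 2, "grammar": 4, "structure": 2, "referencing": 2}

# One essay with weak grammar but solid structure and referencing.
essay = {"spelling": 0.9, "grammar": 0.5, "structure": 0.8, "referencing": 0.9}

physio_grade = weighted_grade(essay, PHYSIOTHERAPY)
english_grade = weighted_grade(essay, ENGLISH)
```

With these made-up weights the essay scores higher under the physiotherapy settings than under the English ones, because weak grammar counts for less. Running the function again on the same essay gives the same number, which is the narrow sense of “reliable” I’m using here.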
Fairer: Human beings are inherently unfair. Regardless of how fair we think we’re being, there are some variables that we simply can’t tune because we’re not even aware that they’re affecting us. There’s evidence that we’re more strict when we’re hungry or when we’re angry with a partner, and that we’re also influenced by the gender of the person we’re grading, the time of day, etc. We’re also affected by sequencing; my grading of the essays I read later is influenced by the earlier examples I’ve seen. This means that a student’s grade might be affected by where in the pile their script is lying, or by their surname if the submission is digital and sorted alphabetically. It may be literally true (warning: controversial opinion coming up) that a student’s mark is more strongly influenced by my current relationship with my children than by what they’ve actually written. Our cognitive biases make it almost impossible for human beings to be as fair as we think we are. And yes, I’m aware that biases are inherent to machine learning algorithms as well. The difference is that those kinds of biases can be seen and corrected, whereas human bias is, and is likely to remain, invisible and unaccountable.
More valid: And finally there’s the issue of validity; are we assessing what we say we’re assessing? For essays this is an important point. Essays are often written in response to a critical question and it’s easy for the assessor to lose sight of that during the grading process. Again, our biases can influence our perceptions without us even being aware of them. A student’s reference to a current political situation may score them points (availability bias) while another, equally valid reference to a story we’re not aware of wouldn’t have the same valence for the assessor. Students can tweak other variables to create a good impression on the reader, none of which are necessarily related to how well they answer the question. For example, even just taking a few minutes to present the essay in a way that’s aesthetically pleasing can influence an assessor, never mind the points received for simply following instructions on layout (e.g. margin size, line spacing, font selection, etc.). When you add things like the relationship between students and the assessor, you start to get a sense for how the person doing the grading can be influenced by many other factors besides the students’ ability to answer the essay question.
OK, so that’s why I think that the introduction of AI for grading – at least for grading essays – is inevitable. However, I’m aware that doesn’t really deal with the bulk of the concerns that Ben raised. I just wanted to provide some context and support for the initial claims I made. The rest of this post is in response to the specific concerns that Ben raised in his series of tweets. I’ve combined some of them below for easier reference.
Can we be sure [that AI-based grading of assessment] is not a bad thing? Is it necessarily fairer? Thinking about the last lot of essays I marked, the ones getting the highest grades varied significantly, representing different takes on a wide ranging and pretty open ended topic. As markers we could allow some weaknesses in an assignment that did other things extremely well and showed independence of thought. The ones getting very good but not excellent grades were possibly more consistent, they were polished and competent but didn’t make quite the same creative or critical jump.
I think I addressed the concern about fairness earlier in the post. I really do think that AI-based grading will be more fair to students. There’s also the argument that the examples with the highest grades tend to be quite different from each other. This is a good thing and represents the kind of open-ended response that demonstrates how students can use their imagination to construct wide-ranging, unanticipated answers to difficult questions. I think that this would be addressed by the fact that AI-based systems are trained on tens of thousands of examples, all of which are labelled by human assessors. Instead of the system being narrowly constrained by the algorithm, I think that algorithms will open up the possible space of what “good” looks like. While I’m often delighted with variation and creative responses from students, not all of my colleagues feel the same way. An AI-based grading system will ensure that, if we highlight “creativity” as an attribute that we value in our assessments, individual lecturers won’t have as much power to constrain its development. And AI systems will also be able to “acknowledge” that some areas of the students’ submissions are stronger than others, and will be able to grade across different criteria (for example, the output might look like: “student’s ability to follow instructions is ‘excellent’, language, especially grammar, can be improved, ability to develop an argument from initial premises is ‘good’, etc.”).
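As a rough illustration of that kind of per-criterion output, here is a small Python sketch. The criteria, the scores and the cut-off values for each band are all hypothetical; a real system would presumably learn or calibrate these rather than hard-code them.

```python
def band(score: float) -> str:
    """Map a 0..1 score to a qualitative band (thresholds are made up)."""
    if score >= 0.8:
        return "excellent"
    if score >= 0.6:
        return "good"
    return "can be improved"


def feedback(scores: dict[str, float]) -> list[str]:
    """Produce one line of feedback per criterion, in the order given."""
    return [f"{criterion}: {band(score)}" for criterion, score in scores.items()]


# Hypothetical scores for a single submission.
report = feedback({
    "following instructions": 0.9,
    "grammar": 0.4,
    "argument development": 0.7,
})
```

The point is only that the result is a profile across criteria, so a submission that is weak in one area and strong in another isn’t flattened into a single number.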
How will AI marking allow for the imaginative, creative assignments and avoid a cycle of increasingly standardised and sanitised assignments as students work out how to please the algorithm?
My first response to this is: how do we “…avoid a cycle of increasingly standardised and sanitised assignments as students work out how to please the lecturer?” And then there’s also the progress being made in “creative expressions” of AI-based systems; art (see here and here), music (see here, here, and here), and stories/poems (see here, and here). You can argue that an AI that uses human artifacts to generate new examples is simply derivative. But I’d counter by saying almost all human-generated art is similarly derivative. There are very few people who have developed unique insights that shift how we see the world at some fundamental level. You could also argue that some of these platforms aren’t yet very good. I’d suggest that they will only ever get better, and that you can’t say the same for people.
Is it fairer to aim for consistency of input/output or to allow for individual interpretations of an assignment? What, at heart is the point of assessment in higher education – consistent competence or individual critical thought?
Also who influences the algorithm? Is it on an institutional basis or wider? Is it fairer to allow for varied and localised interpretations of excellence or end up making everyone fit to one homogenous standard (we can guess which dominant cultural norms it would reflect…)
This is an excellent point and the main reason why I think it’s incumbent on lecturers to be involved in the development of AI-based systems in education. We can’t rely on software engineers in Silicon Valley to be solely responsible for the design choices that influence how artificial intelligence should be used in education. I expand on these ideas in this book chapter (slideshow summary here).
On the whole I think that Ben has raised important questions and agree that these are valid concerns. For me, there are three main issues to highlight, which I’d summarise like so:
- There is a tension between creating assignments that enable open-ended (and therefore creative) student responses and those that are more closed, pushing students towards more standardised submissions. Will AI-based grading systems be able to deal with this nuance?
- There is a risk that students might become more concerned with gaming the system and aiming to “please the algorithm”, resulting in sanitised essays rather than imaginative and creative work. How can we avoid this “gaming the system” approach?
- There is a bias that’s built into machine learning which is likely to reflect the dominant cultural norms of those responsible for the system. Are we happy to have these biases influence student outcomes and if not, how will we counter them?
Looking back, I believe I’ve presented reasonable arguments for each of the points above. I may have misunderstood the concerns and I’ve definitely left out important points, but this is enough for now. If you’re a university lecturer or high school teacher, the points raised by Ben in his tweets are great starting points for a conversation about how these systems will affect us all.
I don’t think that the introduction of AI-based essay grading will affect our ability to design open-ended assessments that enable student creativity and imagination. We’ve known for decades that rules cannot describe the complexity of human society because people – and the outcomes of interactions between people – are unknowable. And if we can’t specify in advance what these outcomes will look like, we can’t encode them in rules. But this has been the breakthrough that machine learning has brought to AI research. AI-based systems don’t attempt to have “reality” coded into them but rather learn about “reality” from massive sets of examples that are labelled by human beings. This may turn out to be the wrong approach but, for me at least, the argument for using AI in assessment is a plausible one.