AI education

Why I think that AI-based grading in education is inevitable.

A few days ago I commented on an article that discusses the introduction of AI into education and why teachers shouldn’t worry about it. I also said that AI for grading was inevitable because it would be cheaper, more reliable, fairer and more valid than human beings. I got some pushback from Ben on Twitter and realised that I was making several assumptions, so I’ve written this post to clarify some of what I said. I also wanted to use it to test my assumptions around the claims I made, so it’s a bit longer than usual since I’m “thinking out loud” and trying to justify each point.

First, here are the four claims I make for why I think that AI-based assessment of things like essays is inevitable:

  • AI will be cheaper than human beings.
  • AI will be more reliable than human beings.
  • AI will be more fair than human beings.
  • AI will be more valid than human beings.

Cheaper: Over the past 60 years or so we’ve seen fairly consistent improvements in power, efficiency and speed at increasingly lower costs. Even if we assume that Moore’s Law is bottoming out we’ll still see continued progress in cost reduction because of improvements in programming techniques, purpose-built chips and new technologies like quantum computers. This is important because, like any industry, education works on a budget. If a university can get close to the same outcomes with a significant reduction in cost, they’ll take it. Software vendors will offer the “essay grading” module that can be integrated into the institutional LMS and the costing will be such that universities would be crazy not to at least pilot it. And my thinking is that it’ll become very clear, very quickly that a significant part of essay grading is really very simple for machines to do. Which brings me to the next claim…

More reliable: A lot of essay grading boils down to things that are relatively simple to programme into a system. For example, spelling is largely a problem that we’ve solved (barring regional inconsistencies) and can therefore express as a system of rules. These rules can be coded into an algorithm, which is why spell-checking works. Grammatical structure is also generally well understood, with most cultures having concepts like nouns, verbs, adjectives, etc., as well as an understanding of how these words are best positioned relative to each other to enhance readability and understanding. Whether we use prescriptive rules (“we should do this”) or descriptive rules (“we actually do this”) matters less than knowing what set of rules we’ll use for the task at hand. It seems reasonable that physiotherapy lecturers could tune an algorithm with a slider, specifying that grammatical structure is less important for their students (i.e. lower scores with respect to prescriptive rules are OK) while an English lecturer might insist that their students score higher on how words should be used. Reference formatting is also easy to code with a series of rules, as is the idea that knowledge claims should be supported with evidence. Relatedly, machines are getting better at identifying passages of text that are simply copied from a source. And I think it’s reasonable to assert that a computer can count more quickly, and more reliably, than a person. Of course this doesn’t take into account things like creativity, but I’ll get to that. For now, we should at least grant that an AI could plausibly be more reliable than a human being (i.e. it assesses the same things in the same way across multiple examples) when it comes to evaluating things like spelling, grammatical structure, essay structure, referencing, and plagiarism. And machines will do this consistently across tens of thousands of students.
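The “slider” idea above can be sketched in a few lines. This is only a hypothetical illustration of how departments might weight mechanical checks differently, not a real grading system; all function names, weights and scores here are my own invented examples.

```python
# A minimal sketch of the "tunable rubric" idea: each mechanical check
# (spelling, grammar, referencing) yields a 0-1 score, and each
# department chooses its own weights. All names and numbers are
# hypothetical, for illustration only.

def weighted_rubric_score(scores, weights):
    """Combine per-criterion scores (0-1) using department-chosen weights."""
    total_weight = sum(weights.values())
    return sum(scores[c] * weights[c] for c in scores) / total_weight

# An English department might turn the grammar "slider" up...
english_weights = {"spelling": 0.3, "grammar": 0.5, "referencing": 0.2}
# ...while a physiotherapy department turns it down.
physio_weights = {"spelling": 0.3, "grammar": 0.1, "referencing": 0.6}

# The same (hypothetical) essay, scored under each department's weights.
essay_scores = {"spelling": 0.9, "grammar": 0.6, "referencing": 0.8}

print(round(weighted_rubric_score(essay_scores, english_weights), 3))
print(round(weighted_rubric_score(essay_scores, physio_weights), 3))
```

The point of the sketch is that the same essay gets a different mark depending on where each department sets its sliders, while the underlying checks stay identical, which is exactly the kind of consistency a human marker can’t guarantee.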

Fairer: Human beings are inherently unfair. Regardless of how fair we think we’re being, there are some variables that we simply can’t tune because we’re not even aware that they’re affecting us. There’s evidence that we’re more strict when we’re hungry or when we’re angry with a partner, and that we’re also influenced by the gender of the person we’re grading, the time of day, etc. We’re also affected by sequencing; my grading of the essays I read later is influenced by the earlier examples I’ve seen. This means that a student’s grade might be affected by where in the pile their script is lying, or by their surname if the submission is digital and sorted alphabetically. It may be literally true (warning: controversial opinion coming up) that a student’s mark is more strongly influenced by my current relationship with my children than by what they’ve actually written. Our cognitive biases make it almost impossible for human beings to be as fair as we think we are. And yes, I’m aware that biases are inherent to machine learning algorithms as well. The difference is that those kinds of biases can be seen and corrected, whereas human bias is – and is likely to remain – invisible and unaccountable.

More valid: And finally there’s the issue of validity; are we assessing what we say we’re assessing? For essays this is an important point. Essays are often written in response to a critical question and it’s easy for the assessor to lose sight of that during the grading process. Again, our biases can influence our perceptions without us even being aware of them. A student’s reference to a current political situation may score them points (availability bias) while another, equally valid reference to a story we’re not aware of wouldn’t have the same valence for the assessor. Students can tweak other variables to create a good impression on the reader, none of which are necessarily related to how well they answer the question. For example, even just taking a few minutes to present the essay in a way that’s aesthetically pleasing can influence an assessor, never mind the points received for simply following instructions on layout (e.g. margin size, line spacing, font selection, etc.). When you add things like the relationship between students and the assessor, you start to get a sense for how the person doing the grading can be influenced by many other factors besides the students’ ability to answer the essay question.

OK, so that’s why I think that the introduction of AI for grading – at least for grading essays – is inevitable. However, I’m aware that doesn’t really deal with the bulk of the concerns that Ben raised. I just wanted to provide some context and support for the initial claims I made. The rest of this post is in response to the specific concerns that Ben raised in his series of tweets. I’ve combined some of them below for easier reference.

Can we be sure [that AI-based grading of assessment] is not a bad thing? Is it necessarily fairer? Thinking about the last lot of essays I marked, the ones getting the highest grades varied significantly, representing different takes on a wide ranging and pretty open ended topic. As markers we could allow some weaknesses in an assignment that did other things extremely well and showed independence of thought. The ones getting very good but not excellent grades were possibly more consistent, they were polished and competent but didn’t make quite the same creative or critical jump.

I think I addressed the concern about fairness earlier in the post. I really do think that AI-based grading will be more fair to students. There’s also the argument about how the range of examples with the highest grades tends to be quite different. This is a good thing and represents the kind of open-ended responses that demonstrate how students can use their imagination to construct wide-ranging, unanticipated answers to difficult questions. I think that this would be addressed by the fact that AI-based systems are trained on tens of thousands of examples, all of which are labelled by human assessors. Instead of the system being narrowly constrained by the algorithm, I think that algorithms will open up the possible space of what “good” looks like. While I’m often delighted with variation and creative responses from students, not all of my colleagues feel the same way. An AI-based grading system will ensure that, if we highlight “creativity” as an attribute that we value in our assessments, individual lecturers won’t have as much power to constrain its development. And AI systems will also be able to “acknowledge” that some areas of a student’s submission are stronger than others, and will be able to grade across different criteria (for example, the output might look like: “ability to follow instructions is excellent; language – especially grammar – can be improved; ability to develop an argument from initial premises is good”).
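The per-criterion feedback described above could be sketched as a simple mapping from scores to descriptor bands. This is a speculative illustration of the *shape* of such output, not any actual system; the band thresholds and criterion names are assumptions I’ve made up.

```python
# A hypothetical sketch of per-criterion feedback: instead of one mark,
# the system reports a descriptor for each criterion. The score bands
# and criteria below are illustrative assumptions, not a real rubric.

def descriptor(score):
    """Map a 0-1 criterion score to a feedback band."""
    if score >= 0.8:
        return "excellent"
    if score >= 0.6:
        return "good"
    return "can be improved"

criterion_scores = {
    "following instructions": 0.92,
    "grammar": 0.55,
    "argument development": 0.70,
}

for criterion, score in criterion_scores.items():
    print(f"{criterion}: {descriptor(score)}")
```

Even this toy version shows the key property: strengths and weaknesses are reported separately rather than being collapsed into a single number.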

How will AI marking allow for the imaginative, creative assignments and avoid a cycle of increasingly standardised and sanitised assignments as students work out how to please the algorithm?

My first response to this is: how do we “…avoid a cycle of increasingly standardised and sanitised assignments as students work out how to please the lecturer?” And then there’s also the progress being made in “creative expressions” of AI-based systems; art (see here and here), music (see here, here, and here), and stories/poems (see here, and here). You can argue that an AI that uses human artifacts to generate new examples is simply derivative. But I’d counter by saying almost all human-generated art is similarly derivative. There are very few people who have developed unique insights that shift how we see the world at some fundamental level. You could also argue that some of these platforms aren’t yet very good. I’d suggest that they will only ever get better, and that you can’t say the same for people.

Is it fairer to aim for consistency of input/output or to allow for individual interpretations of an assignment? What, at heart is the point of assessment in higher education – consistent competence or individual critical thought?

Also who influences the algorithm? Is it on an institutional basis or wider? Is it fairer to allow for varied and localised interpretations of excellence or end up making everyone fit to one homogenous standard (we can guess which dominant cultural norms it would reflect…)

This is an excellent point and the main reason for why I think it’s incumbent on lecturers to be involved in the development of AI-based systems in education. We can’t rely on software engineers in Silicon Valley to be solely responsible for the design choices that influence how artificial intelligence should be used in education. I expand on these ideas in this book chapter (slideshow summary here).

On the whole I think that Ben has raised important questions and agree that these are valid concerns. For me, there are three main issues to highlight, which I’d summarise like so:

  1. There is a tension between creating assignments that enable open-ended (and therefore creative) student responses and those that are more closed, pushing students towards more standardised submissions. Will AI-based grading systems be able to deal with this nuance?
  2. There is a risk that students might become more concerned with gaming the system and aiming to “please the algorithm”, resulting in sanitised essays rather than imaginative and creative work. How can we avoid this “gaming the system” approach?
  3. There is a bias that’s built into machine learning which is likely to reflect the dominant cultural norms of those responsible for the system. Are we happy to have these biases influence student outcomes and if not, how will we counter them?

Looking back, I think that I’ve presented what I think are reasonable arguments for each of the points above. I may have misunderstood the concerns and I’ve definitely left out important points. But I think that this is enough for now. If you’re a university lecturer or high school teacher I think that the points raised by Ben in his tweets are great starting points for a conversation about how these systems will affect us all.

I don’t think that the introduction of AI-based essay grading will affect our ability to design open-ended assessments that enable student creativity and imagination. We’ve known for decades that rules cannot describe the complexity of human society because people – and the outcomes of interactions between people – are unknowable. And if we can’t specify in advance what these outcomes will look like, we can’t encode them in rules. But this has been the breakthrough that machine learning has brought to AI research. AI-based systems don’t attempt to have “reality” coded into them but rather learn about “reality” from massive sets of examples that are labelled by human beings. This may turn out to be the wrong approach but, for me at least, the argument for using AI in assessment is a plausible one.

AI education

Comment: Teachers, the Robots Are Coming. But That’s Not a Bad Thing.

…that’s exactly why educators should not be putting their heads in the sand and hoping they never get replaced by an AI-powered robot. They need to play a big role in the development of these technologies so that whatever is produced is ethical and unbiased, improves student learning, and helps teachers spend more time inspiring students, building strong relationships with them, and focusing on the priorities that matter most. If designed with educator input, these technologies could free up teachers to do what they do best: inspire students to learn and coach them along the way.

Bushweller, K. (2020). Teachers, the Robots Are Coming. But That’s Not a Bad Thing. Education Week.

There are a few points in the article that confuse rather than clarify (for example, the conflation of robots with software) but on the whole I think this provides a useful overview of some of the main concerns around the introduction of AI-based systems in education. Personally, I’m not at all worried about having humanoid (or animal-type) physical robots coming into the classroom to take over my job.

I think that AI will be introduced into educational settings more surreptitiously, for example via the institutional LMS in the form of grading assistance, risk identification, timetabling, etc. And we’ll welcome this because it frees us from the very labour-intensive, repetitive work that we all complain about. Not only that, but grading seems to be one of the most expensive aspects (in terms of time) of a teacher’s job, and because of this we’re going to see a lot of interest in this area from governments. For example, see this project by Ofqual (the qualifications and exams regulator in England) to explore the use of AI to grade school exams.

In fact, I think that AI-based assessment is pretty much inevitable in educational contexts, given that it’ll probably be (a lot) cheaper, more reliable, fair, and valid than human graders.

Shameless self-promotion: I wrote a book chapter about how teachers could play a role in the development of AI-based systems in education, specifically in the areas of data collection, teaching practice, research, and policy development. Here is the full-text (preprint) and here are my slides from a seminar at the University of Cape Town where I presented an overview.

conference education physiotherapy

Comment: Science conferences are stuck in the dark ages

…for decades the room has been the same: four walls, a podium, and a projector. PowerPoints today mimic the effect of a centuries-old continuous-slide lantern. Even when time is occasionally left for questions at the end of lectures, it’s still a distinctly one-way flow of information. Scientific posters are similarly archaic.

Ngumbi, E. & Lovett, B. (2019). Science Conferences Are Stuck in the Dark Ages. Wired magazine.

Anyone who’s gone to an academic conference and reflected on it for more than a moment usually arrives at the conclusion that the experience is distinctly underwhelming. I’m not going to go into the details of why since Ben and I discussed it at length in our reflection on WCPT and the Unposter on the podcast, but the general idea is that most conferences suck because of the format.

And this is why you really need to think about coming to the second In Beta unconference on physiotherapy education at HAN in the Netherlands on the 14th and 15th of September 2020. The unconference will take place soon after the ENPHE/ER-WCPT conference, so if you’re attending that meeting then it’s a no-brainer to stay on for a few days and come to Nijmegen for something quite different. Click on the image below for more information.

education technology

Podcast: Are the kids alright?

In this anxious era of bullying, teen depression, and school shootings, tech companies are selling software to schools and parents that make big promises about keeping kids secure by monitoring what they say and write online. But these apps demand disturbing trade-offs in the name of safety.

This is a great episode of the Rework podcast looking at the dangers of using increasingly sophisticated technology in schools as part of programmes to “protect” children. What they really amount to are very superficial surveillance systems that can do a lot less than what the venture-backed companies say they can. If you’re a teacher or if you have kids at a school using these systems, this is a topic worth learning more about.

The show notes include a ton of links to excellent resources and also a complete transcript of the episode.

education learning

Comment: The game of school.

Schools are about learning, but it’s mostly learning how to play the game. At some level, even though we like to talk about schools as though they are about learning in some pure, liberal-arts sense, on a pragmatic level we know that what we’re really teaching students is to get done the things that they are asked to do, to get them done on time, and to get them done with as few mistakes as possible.

I think the danger comes from believing that those who by chance, genetics, temperament, family support, or cultural background find the game easier to play are actually somehow inherently better or have more human value than the other students.

The students who aren’t succeeding usually don’t have any idea that school is a game. Since we tell them it’s about learning, when they fail they then internalize the belief that they themselves are actual failures–that they are not good learners. And we tell ourselves some things to feel OK about this taking place: that some kids are smart and some are not, that the top students will always rise to the top, that their behavior is not the result of the system but that is their own fault.

Hargadon, S. (2019). The game of school. Steve Hargadon blog: The learning revolution has begun.

I thought that this was an interesting post with a few ideas that helped me to think more carefully about my own teaching. I’ve pulled out a few of the sentences from the post that really resonated with me but there are plenty more. Once you accept the idea that school (and university) is a game, it all makes a lot more sense; ranking students in leaderboards, passing and failing (as in quests or missions), levelling up, etc.

The author then goes on to present 4 hierarchical “levels” of learning that really describe frameworks or paradigms rather than any real description of learning (i.e. the categories and names of the levels in the hierarchy are, to some extent, arbitrary; it’s the descriptions in each level that count).

If I think about our own physiotherapy programme, we use all 4 “levels” interchangeably and have varying degrees of each of them scattered throughout the curriculum. However, I’d say that the bulk of our approach happens at the lowest level of Schooling, some at Training, a little at Education, and almost none at Self-regulated learning. While we pay lip service to the fact that we “offer opportunities for self-regulated learning”, what it really boils down to is that we give students reading to do outside of class time.

AI education

Resource: Elements of AI course.

The Elements of AI is a series of free online courses created by Reaktor and the University of Helsinki. We want to encourage as broad a group of people as possible to learn what AI is, what can (and can’t) be done with AI, and how to start creating AI methods. The courses combine theory with practical exercises and can be completed at your own pace.

  1. Finland created a course on AI for its citizens because the government believes that the technology is going to fundamentally change society.
  2. They made the course free and available to anyone in the world who wanted to take it.
  3. They’re in the process of translating the course into every EU language because they want to ensure that at least 1% of EU citizens have a basic understanding of AI. You can sign up here to be notified when the course is available in your home (EU) language.

Firstly, it’s amazing that Finland is doing this.

Secondly, if you’re even vaguely interested in AI then you should consider completing the course. I went through it earlier this year and found it interesting/useful just to read the notes (I skipped the exercises). I’m thinking that I might do it again in the new year but this time make an effort to also complete the exercises now that I’m a bit more comfortable with the topic.

You can find out more about the course here and here, and sign up here.

education technology

Training students for jobs that don’t exist yet. Or not.

The top 10 in demand jobs in 2010 did not exist in 2004. We are currently preparing students for jobs that don’t exist yet, using technologies that haven’t been invented, in order to solve problems we don’t even know are problems yet.

It takes some work to find out that the claim is not true.

Doxtdator, A. (2017). A field guide to ‘jobs that don’t exist yet’. Long View on Education.

If you’ve spent any time in education there’s a good chance you’ve seen the Shift Happens video below (this is the original version that came out in 2009 or thereabouts…there are updated versions for 2018 and 2019). It’s very inspiring (the music helps) and for the longest time I’d recommend it to anyone who’d listen. If you haven’t seen the video then watch it now before we move on.

The kind of complex thinking we deserve about education won’t come in factoids or bullet-point lists of skills of the future.

Doxtdator, A. (2017). A field guide to ‘jobs that don’t exist yet’. Long View on Education.

I’ve watched this video a lot, mainly in the first few years after starting as an academic, because the narrative was perfectly aligned with the way I was thinking and the work I was doing. But as I’ve spent more time in education and research, I’ve become increasingly skeptical of “sound bite” solutions to pedagogical problems that are nuanced and complex. Having said that, until earlier this year I would still have been sympathetic to the main arguments in the video:

  • The rate of social and technical change is accelerating;
  • Because of the Internet and other emerging technologies;
  • Higher education is not adapting quickly enough;
  • But we need to future-proof our students;
  • So we’d better start changing soon.

In this More or Less BBC podcast, Tim Harford asks what the statistical likelihood is that 65% of future jobs haven’t been invented yet, and it seems fairly obvious straight away that it’s not a reasonable prediction. We might argue that the specific numbers are less important than the spirit of the claim, which is that the world is changing more quickly than ever before (probably true), that this matters at a fundamental level (maybe true), and that how we respond in higher education has grave consequences for the students we train (little or no evidence that this is true). Consider the following quote from a presentation given in 1957:

We are too much inclined to think of careers and opportunities as if the oncoming generations were growing up to fill the jobs that are now held by their seniors. This is not true. Our young people will fill many jobs that do not now exist. They will invent products that will need new skills. Old-fashioned mercantilism and the nineteenth-century theory in which one man’s gain was another man’s loss, are being replaced by a dynamism in which the new ideas of a lot of people become the gains for many, many more.

Josephs, D. (1957). Oral presentation at the Conference on the American High School.

Notice 1) this statement is from a keynote given about 60 years ago, and 2) how closely the narrative mirrors the concerns raised about how contemporary education doesn’t prepare students for jobs that don’t yet exist. While it may be fair to say that the narrative might still be true, just on a longer timescale, it’s almost certainly not a result of the Internet, mobile phones or any other technology that’s emerged in the past few decades.

This is why I was delighted to come across the article I opened with. It’s a reminder that it’s essential that we take critical positions on the things we care most about.

education learning scholarship

When a metric becomes a target it fails to be a good metric.

Lately I’ve been thinking about metrics and all the ways that they can be misleading. Don’t get me wrong; I think that measuring is important. Measuring is the reason that our buildings and bridges don’t collapse. Measurements help tell us when a drug is working. GPS would be impossible without precise measurements of time. My Fitbit tells me when I’m exercising close to my maximum heart rate. So I’m definitely a fan of measuring things.

The problem is when we try to use measurements for things that aren’t easy to measure. For example, it’s hard to know when an article we publish has had an impact, so we look at the number of times that other researchers have used our articles as proxy indicators for their influence on the thinking of others. But this ignores the number of times that the articles are used to change a programme or trigger a new line of thinking in someone who isn’t publishing themselves. Or we use the number of articles being published in a department as a measure of “how much” science that department is doing. But this prioritises quantity over quality and ignores the fact that what we really want is a better understanding of the world, not “more publications”.

It sometimes feels like academia is just a weird version of Klout where we’re all trying to get better at increasing our “engagement” scores and we’ve forgotten the purpose of the exercise. We’ve confused achieving better scores on the metric with working to move the larger project forward. We publish articles because articles are evidence that we’re doing research, and we use article citations and journal impact factors as evidence that our work is influential. But when a metric becomes a target it fails to be a good metric.

We see similar things happening all around us in higher education. We use percentages and scores to measure learning, even though we know that these numbers in themselves are subjective and sometimes arbitrary. We set targets in departments that ostensibly help us know when we’ve achieved an objective but we’re only mildly confident that the behaviours we’re measuring will help achieve the objective. For example, you have to be in the office for a certain number of hours each week so that we know that you’re working. But I don’t really care how often you’re in your office; I only really care about the quality of the work you do. But it’s hard to measure the quality of the work you do so I measure the thing that’s easy to measure.

This isn’t to say that we shouldn’t try to measure what we value, only that measurement is hard and that the metrics we choose will influence our behaviour. If I notice that people at work don’t seem to like each other very much, I might start using some kind of likeability index that aims to score everyone. But then we’d see people trying to increase their scores on the index rather than simply being kinder to each other. What I care about is that we treat each other well, not how well we each score on a metric.

We’ve set up the system so that students – and teachers – care more about the score achieved on the assessment rather than learning or critical thinking or collaborating. We give students page limits for writing tasks because we don’t want them to write everything in the hope that some of what they write is what we’re looking for. But then they play around with different variables (margin and font sizes, line spacing, title pages, etc.) in order to hit the page limit. What we really care about are other things, for example the ability to answer a question clearly and concisely, from a novel perspective, and to support claims about the world with good arguments.

I don’t have any solutions to the problem of measurement in higher education and academia. It’s a hard problem. I’m just thinking out loud about the fact that our behaviours are driven by what we’ve chosen to measure, and I’m wondering if maybe it’s time to start using different metrics as a way to be more intentional about achieving what we say we care about. Maybe it doesn’t even matter what the metrics are. Maybe what matters is how the choice of metrics can change certain kinds of behaviours.

AI education

UCT seminar: Shaping our algorithms

Tomorrow I’ll be presenting a short seminar at the University of Cape Town on a book chapter that was published earlier this year, called Shaping our algorithms before they shape us. Here are the slides I’ll be using, which I think are a useful summary of the chapter itself.


AI education

Book chapter published: Shaping our algorithms before they shape us

I’ve just had a chapter published in an edited collection entitled: Artificial Intelligence and Inclusive Education: Speculative Futures and Emerging Practices. The book is edited by Jeremy Knox, Yuchen Wang and Michael Gallagher and is available here.

Here’s the citation: Rowe M. (2019) Shaping Our Algorithms Before They Shape Us. In: Knox J., Wang Y., Gallagher M. (eds) Artificial Intelligence and Inclusive Education. Perspectives on Rethinking and Reforming Education. Springer, Singapore.

And here’s my abstract:

A common refrain among teachers is that they cannot be replaced by intelligent machines because of the essential human element that lies at the centre of teaching and learning. While it is true that there are some aspects of the teacher-student relationship that may ultimately present insurmountable obstacles to the complete automation of teaching, there are important gaps in practice where artificial intelligence (AI) will inevitably find room to move. Machine learning is the branch of AI research that uses algorithms to find statistical correlations between variables that may or may not be known to the researchers. The implications of this are profound and are leading to significant progress being made in natural language processing, computer vision, navigation and planning. But machine learning is not all-powerful, and there are important technical limitations that will constrain the extent of its use and promotion in education, provided that teachers are aware of these limitations and are included in the process of shepherding the technology into practice. This has always been important but when a technology has the potential of AI we would do well to ensure that teachers are intentionally included in the design, development, implementation and evaluation of AI-based systems in education.