The Future of Artificial Intelligence Depends on Trust

To open up the AI black box and facilitate trust, companies must develop AI systems that perform reliably — that is, make correct decisions — time after time. The machine-learning models on which the systems are based must also be transparent, explainable, and able to achieve repeatable results.

Source: Rao, A. & Cameron, E. (2018). The Future of Artificial Intelligence Depends on Trust.

It still bothers me that we insist on explainability for AI systems while we’re quite happy for the decisions of clinicians to remain opaque, inaccurate, and unreliable. We need to move past the idea that there’s anything special about human intuition and that algorithms must satisfy a set of criteria that we would never dream of applying to ourselves.

Separating the Art of Medicine from Artificial Intelligence

Writing a radiology report is an extreme form of data compression — you are converting around 2 megabytes of data into a few bytes, in effect performing lossy compression with a huge compressive ratio.

Source: Separating the Art of Medicine from Artificial Intelligence

For me, there were a few useful takeaways from this article. The first is that data analysis and interpretation is essentially a data compression problem. The trick is to find a balance between throwing out information that isn’t useful and preserving the relevant message during processing. Consider the patient interview, where you take 15-20 minutes of audio data (about 10-15 MB using mp3 compression) and convert it to about a page of text (a few kilobytes at most). The subjective decisions we make about what information to discard and what to highlight have a real impact on our final conclusions and management plans.
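As a rough back-of-the-envelope check on those compression ratios (the figures below are illustrative assumptions, not measurements):

```python
# Rough compression ratios for the two examples above (illustrative figures only).

def ratio(input_bytes, output_bytes):
    """How many times smaller the output is than the input."""
    return input_bytes / output_bytes

# Radiology: ~2 MB of image data -> a short written report (a few hundred bytes)
xray_image = 2 * 1024 * 1024
xray_report = 500
print(f"Chest X-ray report: ~{ratio(xray_image, xray_report):,.0f}:1")

# Interview: ~15 minutes of mp3 audio (~12 MB) -> ~1 page of notes (~3 KB)
interview_audio = 12 * 1024 * 1024
interview_notes = 3 * 1024
print(f"Interview notes: ~{ratio(interview_audio, interview_notes):,.0f}:1")
```

Either way, the ratio is in the thousands to one, which is what makes the choice of what to keep so consequential.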

Human radiologists are so bad at interpreting chest X-rays and/or agreeing on what findings they can see that the ‘report’ that comes with the digital image is often entirely wrong, partially wrong, or missing information.

This is not just a problem in radiology. I haven’t looked for any evidence of this, but from personal experience I have little doubt that the inter- and intra-rater reliability of physiotherapy assessment is similarly low. And even in cases where the diagnosis and interventions are the same, there would likely be a lot of variation in the description and formulation of the report. And this links to the last thing that I found thought-provoking:

…chest X-ray reports were never intended to be used for the development of radiology artificial intelligence. They were only ever supposed to be an opinion, an interpretation, a creative educated guess…A chest X-ray is neither the final diagnostic test nor the first, it is just one part of a suite of diagnostic steps in order to get to a clinical end-point.

We’re using unstructured medical data, captured in a variety of contexts, to train AI-based systems, but the data were never obtained, captured or stored in a system that was designed for that purpose. The implication is that the data we’re using to train medical AI simply isn’t fit for purpose. As long as we don’t collect the metadata (i.e. the contextual information “around” a condition), and continue using poorly labelled information and non-standardised language, we’re going to have problems training machine learning algorithms. If we want AI-based systems to be anything more than basic triage, then these are important problems to address.
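To make that concrete, here is a purely hypothetical sketch of the kind of structured record (all field names are my own invention) that would capture the context “around” a finding, rather than just the free-text report:

```python
# Hypothetical example of a finding captured with its surrounding context,
# rather than as free text alone. All field names here are illustrative.
structured_finding = {
    "finding": "consolidation, right lower lobe",
    "certainty": "probable",                   # an opinion, not a final diagnosis
    "clinical_question": "exclude pneumonia",  # why the study was requested
    "prior_imaging_reviewed": True,
    "standardised_label": None,                # would hold a code from an agreed vocabulary
    "intended_use": "clinical communication",  # not "training data for an algorithm"
}
```

Even this much context, captured at the time of reporting, would go a long way towards the kind of metadata the article is asking for.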

Digital literacy survey: Outcome of reliability testing

Earlier this year we started the International Ethics Project, a collaboration between physiotherapy departments from several countries that intend to offer an online course in professional ethics to their undergraduate students. You can read more about the project here.

In June we started the process of developing a questionnaire that we can use to establish some baseline data on students’ levels of digital literacy. It’s taken a bit longer than expected, but we’ve finally managed to complete the reliability testing of the questionnaire as part of a pilot study. Before we can begin planning the module and how it will be implemented, we need to get a better understanding of how our population – drawn as they are from several countries around the world – uses digital tools in the context of their learning practices. The results of the reliability study showed that most of the survey items had kappa values of 0.5–0.6 (indicating moderate agreement), 0.7–0.8 (indicating strong agreement), or above 0.8 (indicating almost perfect agreement). See this post on the project blog for more details on how the reliability testing was conducted.
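For anyone wanting to run a similar check, here is a minimal sketch of computing Cohen’s kappa for a single item across two administrations of a questionnaire (the responses below are invented, and I’m assuming the two sets have already been paired by participant):

```python
# Minimal sketch: Cohen's kappa for one questionnaire item, test vs retest.
# The responses below are invented purely for illustration.
from sklearn.metrics import cohen_kappa_score

# Paired responses: same participants, same item, two administrations
test   = ["agree", "agree", "neutral", "disagree", "agree", "neutral"]
retest = ["agree", "neutral", "neutral", "disagree", "agree", "agree"]

kappa = cohen_kappa_score(test, retest)
print(f"Cohen's kappa: {kappa:.2f}")  # interpret against the bands mentioned above
```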

Now that we have put the questionnaire through quite a rigorous pilot, we hope that it might be useful for other health professional educators who are considering the use of digital tools in their classrooms. To this end we would like to report on the pilot, along with some preliminary results, at the ER-WCPT conference on 11-12 November, 2016 in Liverpool. We will therefore be submitting an abstract for the conference in the coming months.

SAFRI 2011 (session 2) – day 4

Reliability and validity

Validity

Important for assessment, not only for research

It’s the scores that are valid and reliable, not the instrument

Sometimes the whole is greater than the sum of the parts, e.g. a student gets all the check marks but doesn’t perform competently overall: the examiner can tick every competency being assessed, yet the student never establishes rapport with the patient. This is difficult to address

What does the score mean?

Students are efficient in the use of their time i.e. they will study what is being assessed because the inference is that we’re assessing what is important

Validity can be framed as an “argument / defense” proposition

Our Ethics exam is a problem of validity. Written tests measure knowledge, not behaviour e.g. students can know and report exactly what informed consent is and how to go about getting it, but may not pay it any attention in practice. How do we make the Ethics assessment more valid?

“Face” validity doesn’t exist; it’s more accurately termed “content” validity. “Face” validity basically amounts to saying that something looks OK

What are the important things to score? Who determines what is important?

There are some things that standardised patients can’t do well e.g. trauma

Assessment should sample more broadly from a domain. This improves validity, and also means students don’t feel like they’ve wasted their time studying things that aren’t assessed. The more assessment items we include, the more valid the results

Scores drop if the timing of the assessment is inappropriate, e.g. too much or too little time → lower scores, as students either rush or try to “fill” the time with something that isn’t appropriate for the assessment

First-round scores in OSCEs are often lower than those in later rounds

Even though the assessment is meant to indicate competence, there’s no way to predict whether practitioners are actually competent

Students really do want to learn!

Reliability

We want to ensure that a student’s observed score is a reasonable reflection of their “true ability”

In reliability assessments, how do you reduce the learning that occurs between assessments?

In OSCEs, use as many cases / stations as you can, and have a different assessor for each station. This is the most effective rating design
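One way to see why more stations help: the Spearman-Brown prophecy formula predicts how reliability changes as an assessment is lengthened. A quick sketch (the 0.6 starting reliability is an assumed figure, purely for illustration):

```python
# Spearman-Brown prophecy formula: predicted reliability when an assessment
# is lengthened by a factor n (e.g. doubling the number of OSCE stations).
def spearman_brown(reliability, n):
    return (n * reliability) / (1 + (n - 1) * reliability)

baseline = 0.6  # assumed reliability of the current set of stations (illustrative)
for factor in (1, 2, 3):
    print(f"{factor}x stations -> predicted reliability {spearman_brown(baseline, factor):.2f}")
```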

We did a long session on standard setting, which was fascinating, especially when it came to having to defend the cut-scores of exams, i.e. what criteria do we use to say that 50% (or 60, or 70) is the pass mark? What data do we have to defend that standard?

I didn’t even realise that this was something to be considered; it’s good to know that methods exist (e.g. the Angoff method) to use data to substantiate the decisions made about the standards that are set
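To make the Angoff method concrete: each judge estimates, for every item, the probability that a borderline (minimally competent) candidate would get it right, and the cut score is the average of those estimates. A small sketch with made-up judge ratings:

```python
# Minimal Angoff sketch. Each value is a judge's estimate of the probability
# that a borderline candidate answers that item correctly (all values made up).
judge_ratings = [
    [0.6, 0.8, 0.5, 0.7],  # judge 1, items 1-4
    [0.5, 0.7, 0.6, 0.8],  # judge 2
    [0.7, 0.9, 0.5, 0.6],  # judge 3
]

n_judges = len(judge_ratings)
n_items = len(judge_ratings[0])

# Average across judges for each item, then across items for the cut score
item_means = [sum(judge[i] for judge in judge_ratings) / n_judges for i in range(n_items)]
cut_score = sum(item_means) / n_items

print(f"Angoff cut score: {cut_score:.0%}")  # the data-backed pass mark for this exam
```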

Should students be able to compensate for poor scores in one area with good scores in another? Should they have to pass every section that we identify as being important? If it’s not important, why is it being assessed?

Norm-referenced criteria are not particularly useful for determining competence. Standards should be set according to competence, not according to the performance of others

Standard setting panels shouldn’t give input on the quality of the assessment items

You can use standard setting to lower the pass mark in a difficult assessment, and to raise the pass mark in an easier exam

Alignment of expectations with actual performance

Setting up an OSCE

  • Design
  • Evaluate
  • Logistics

Standardised, compartmentalised (i.e. not holistic), variables removed / controlled, predetermined standards, variety of methods

Competencies broken into components

Is at the “shows how” part of Miller’s pyramid (Miller, 1990, The assessment of clinical skills, Academic Medicine, 65: S63–S67)

Design an OSCE, using the following guidelines:

  • Summative assessment for undergraduate students
  • Communication skill
  • Objective
  • Instructions (student, examiner, standardised patient)
  • Score sheet
  • Equipment list

Criticise the OSCE stations of another group


Assessing clinical performance

Looked at using mini-CEX (clinical evaluation exercise)

Useful for formative assessment

Avoid making judgements too soon → your impression may change over time


Twitter Weekly Updates for 2010-04-12

  • @sbestbier enjoyed it too, been thinking about ways to break away from the linear presentation, looking forward to your thoughts #
  • @clivesimpkins Good idea, I’ll bring it up with him & ask about opening the platform to other students for editing #
  • Never really had much use for mindmapping, so when I played with #xmind before, it didn’t really impress me. Boy, have I changed my tune #
  • @clivesimpkins …but, I take your point and might bring it up with him later #
  • @clivesimpkins As it was initiated by the student & is a great eg of social responsibility, I thought I’d only encourage at this early stage #
  • The Youth issues of South Africa: Current issues that are tearing us apart! Beginnings of a blog by one of our students http://bit.ly/9LbZoq #
  • Hot for Teachers w/ Megan Fox and Brian Austin Green ~ Stephen’s Web ~ by Stephen Downes http://bit.ly/bUrXby #
  • The 2009 Chronic Awards | Very funny, a good read on a Saturday morning http://bit.ly/cfQixe #
  • The Chronic | Bringing you the Ed Tech Buzz http://bit.ly/aSMNkZ #
  • South African scientist Uses Google Earth to Find Ancient Ancestor http://tinyurl.com/y92thbz #
  • Can You Get an Education in Spite of School? http://tinyurl.com/ybgdbzh #
  • Resistance is Futile. Interesting thoughts in the iPad in education, by David Warlick http://tinyurl.com/ydgpjnm #
  • Thinking is hard… #
  • Busy capturing data for test-retest reliability analysis of my questionnaire…behind the scenes of being a research rock-star #
  • Personalizing Learning – The Important Role of Technology http://tinyurl.com/yajdgl7 #
  • “Isn’t it enough to see that a garden is beautiful without having to believe that there are fairies at the bottom of it too?” Douglas Adams #

Test-retest reliability analysis

A few thoughts on conducting test-retest reliability analysis on questionnaires, based on my own recent experiences:
– DO pay attention to your coding sheet before doing the test; it will influence your questionnaire design
– DO make sure you pilot your questionnaire for ambiguity and understanding before doing the test; it may not be essential but it is logical
– DO capture the data yourself; it will give you insight and a deeper appreciation of the process
– DO make sure you have a way to uniquely identify each questionnaire; simple codes are better than complex ones
– DO make sure you ask participants to uniquely identify each form they complete, but take care to preserve anonymity (these codes are what you’ll match the two rounds on; see the sketch below)

– DO NOT rely on handwriting recognition to achieve the last point if you forgot to do it; it will waste your time and take you into a valley of despair
– DO NOT rush the process; you will make mistakes if you do
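And once both rounds are captured, the analysis itself can be fairly mechanical. A rough sketch of what that step might look like (the file names, column names and the choice of Cohen’s kappa are all my assumptions): match each participant’s two forms on their unique code, then compute agreement per item.

```python
# Sketch of a test-retest analysis once both rounds have been captured.
# File names, column names and the use of Cohen's kappa are assumptions.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

test = pd.read_csv("questionnaire_test.csv")      # one row per participant
retest = pd.read_csv("questionnaire_retest.csv")

# Match each participant's two forms on their unique (anonymous) code
merged = test.merge(retest, on="participant_code", suffixes=("_t1", "_t2"))

# Compute agreement for every item that appears in both rounds
items = [col[:-3] for col in merged.columns if col.endswith("_t1")]
for item in items:
    kappa = cohen_kappa_score(merged[f"{item}_t1"], merged[f"{item}_t2"])
    print(f"{item}: kappa = {kappa:.2f}")
```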