Training an LLM to become a medical assistant

Google are launching the second iteration of their Med PaLM project...
25 July 2023

Interview with Karan Singhal, Google

What can these chatbots really do for us? If the chatbots we’re free to use on the internet are no more than glorified tech demos, how can we actually deploy them in properly useful ways? Karan Singhal is a staff research engineer at Google. As he explains to Chris Smith, he’s working on Med PaLM 2, an AI he hopes will serve as an assistant for medical professionals in diagnostics and therapeutics…

Karan - We've been looking at the space of medical AI and all the advances over the last few years and noticed a few things. The first was that there were large advances in accuracy on narrow tasks, things like the ability for models to make diagnoses in radiology. But the second was that we saw limited uptake of these technologies. And I think part of it was a lack of flexibility and interactivity. And so if you have a model that makes predictions of whether or not some chest x-ray is normal or abnormal, it might be less usable than a system that you can truly interact with, engage in dialogue with, give feedback to, and get an explanation from, instead of just a classification. And so when we started this work, it was really thinking about that problem and bridging that gap from all the advances in AI to the things that are actually useful in real world clinical practice.

Chris - Arguably, if I ask an AI to tell me how many presidents of America there have been, if it gets it wrong and makes a few up, it's much less of a consequence than if I ask, 'does this chest x-ray contain a lung cancer?'

Karan - I think for me personally, that is almost the entire motivation for working in this setting in the first place. Foundation models are tricky to apply here because the setting is so safety critical. And so I really came into this thinking about the problem of building more steerable and safer AI systems. There's a lot of nuance to performing well in this setting across all the different axes that we care about: preventing harm, producing equitable outcomes, and making sure you're aligned with scientific consensus.

Chris - Is that down to what you train it on? Because these AIs are a product of the information they ingest and see the connections between. Or is it also more nuanced than that in how you actually instruct it to work? Or is it both?

Karan - Definitely it's both. If you take a base model that's been pre-trained on web-scale data - the PaLM model, for example - not specifically adapted for any medical setting, and then apply it to tasks like long-form consumer medical question answering, as we evaluated in the Med PaLM paper, it does not perform especially well on axes like 'alignment with scientific consensus', because training data on the internet often has the potential for harm. And so if all we do is train on that data and not instruct these models on how to produce safer outputs, then we won't be in a good place. But when we take this extra step of providing explicit human feedback in various ways, that's a way we can guide these models. And so for the Med PaLM paper, what we did is we worked with a panel of physicians to craft expert demonstrations of good model behaviour across all these axes that we care about, and then used that to instruct the Med PaLM model about how to behave, using a technique called instruction prompt tuning that we introduced.
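To make the prompt-tuning idea concrete, here is a minimal, hypothetical sketch in Python/PyTorch: a small set of learnable soft-prompt embeddings is prepended to the input while the pretrained model stays frozen, so only those prompt vectors are updated by the expert demonstrations. The tiny transformer, sizes, and names below are illustrative stand-ins, not the actual PaLM model or Google's training pipeline; the full technique described in the paper also combines these learned soft prompts with hard, human-written instructions.

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_SOFT = 1000, 64, 8   # toy sizes, chosen arbitrarily

class PromptTunedLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for a large pretrained language model (PaLM in the real work).
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)
        # Freeze everything that represents the pretrained model.
        for p in self.parameters():
            p.requires_grad = False
        # The only trainable parameters: a handful of learnable "soft prompt" vectors.
        self.soft_prompt = nn.Parameter(torch.randn(N_SOFT, D_MODEL) * 0.02)

    def forward(self, token_ids):
        tok = self.embed(token_ids)                                 # (batch, seq, dim)
        soft = self.soft_prompt.expand(token_ids.size(0), -1, -1)   # (batch, N_SOFT, dim)
        hidden = self.backbone(torch.cat([soft, tok], dim=1))
        return self.lm_head(hidden[:, N_SOFT:])                     # logits for the real tokens only

model = PromptTunedLM()
optimizer = torch.optim.Adam([model.soft_prompt], lr=1e-3)  # only the prompt vectors get updated

# Dummy "demonstration" batch with a toy reconstruction loss,
# just to show which parameters the gradient reaches.
token_ids = torch.randint(0, VOCAB, (2, 16))
logits = model(token_ids)
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), token_ids.reshape(-1))
loss.backward()
optimizer.step()
```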

Chris - When you did this, how good was it?

Karan - There were two broad tasks that we put the model through. One was multiple choice question answering on medical exams and medical research questions. What we noticed is that these models achieved state-of-the-art accuracy across all the datasets that we studied in this work. The second was consumer medical question answering: asking these models to produce a long-form, open-ended response to a consumer medical question. Baseline models didn't really perform well on this task: physicians rated only 61.9% of their answers as aligned with scientific consensus. So then what we did was apply that human feedback aspect to the training of the model, and what we saw was that 92.6% of Med PaLM answers were aligned with scientific consensus, compared to 92.9% for clinicians. So it was now in the same ballpark, compared to that big difference earlier with the baseline model.

Chris - In other words, if I pick a physician off the shelf and ask them to answer the same question your platform is answering, it's going to give an answer that a third party would rate about the same as the physician's answer, give or take?

Karan - It depends how you do that measurement. After the Med PaLM work, we've expanded on that measurement: we actually ask people to do pairwise comparisons between the model output and the physician output. What we have observed is that - at least with the physician populations that we're using and the specific ways we're collecting the data, I want to caveat with all that - Med PaLM 2 responses were preferred on eight of the nine axes that we were studying in that medical question answering task. Another caveat is that this evaluation is not grounded in a real-world clinical setting, it's not done with the largest panel of physicians, and it's not fully done with the most representative sample of questions that we might ask. And so there's still a lot of work to take this early, promising technology and bring it to the settings in which it can have the most impact.

Chris - When you ask people 'how does it work?' they'll say, 'well, it's not an explainable technology.' They don't mean they're not allowed to explain it; they mean they can't explain how it works because nobody really knows. So how do you instruct it to 'think' a certain way like that?

Karan - There are a couple of different notions of explainability that can be useful here. One is asking a model to produce an explanation of its own behaviour before it produces a final answer. One version of this is called chain-of-thought prompting, which is something that we explored in this Med PaLM work as well. If you're asking the model to provide a diagnosis given a clinical vignette, you ask the model to work step by step towards an answer to the question. That could be viewed as a form of explainability, but at the end of the day these models are still relatively a black box. There's also work going on around mechanistic interpretability of models, to better understand the nuts and bolts of how they work, and that's work that we're excited about too.
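As a concrete illustration of chain-of-thought prompting in this kind of setting, here is a minimal sketch in Python. The worked example, the clinical content, and the build_cot_prompt helper are illustrative assumptions, not prompts taken from the Med PaLM papers; the idea is simply that showing the model a worked, step-by-step answer encourages it to reason step by step on the next question.

```python
# Hypothetical chain-of-thought prompt for a clinical vignette (illustrative only).
COT_PROMPT = """\
Question: A 58-year-old man presents with crushing chest pain radiating to the
left arm, sweating, and ST-segment elevation in ECG leads II, III and aVF.
What is the most likely diagnosis?
Answer: Let's work through this step by step.
1. Crushing chest pain with radiation and sweating suggests an acute coronary event.
2. ST elevation in leads II, III and aVF localises the injury to the inferior wall.
3. The most likely diagnosis is therefore an inferior ST-elevation myocardial infarction.
Final answer: Inferior STEMI.

Question: {vignette}
Answer: Let's work through this step by step.
"""

def build_cot_prompt(vignette: str) -> str:
    """Prepend a worked, step-by-step example so the model imitates that reasoning style."""
    return COT_PROMPT.format(vignette=vignette)

# The completed prompt would then be passed to whatever text-generation model is available.
print(build_cot_prompt("A 24-year-old woman presents with fever, neck stiffness and photophobia."))
```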

Chris - Say you are successful and you get something which the FDA approves: were this to go into clinical practice, where would you see it sitting in the consulting chain that goes from patient through to some kind of medical outcome?

Karan - The first things that I think we're going to see are use cases around reducing the burden of clinical documentation on doctors. There's a lot of work recently on taking transcripts of medical interactions and producing summaries and notes that can be sent to patients and are useful for payers and things like that. Right now, many doctors report spending two hours a day, after dinner with their loved ones, writing clinical documentation to avoid liability or other issues. That is a real cost, and it's something where we can give that time back. In the medium or longer term, there are higher stakes but also potentially very impactful use cases that are worth exploring: things like clinical decision support - in the case of a radiologist, for example, whether or not this model can double-check a report or produce a more accurate one. I think there are a lot of use cases there that we're not quite ready for, but that will be quite impactful over the next five to ten years.
