AI and Radiology: Lessons for Clinical Implementation in GI
Video Transcription
We can move into the next talk by Chuck Kahn, and it's my pleasure to have Dr. Kahn join us. He is professor and vice chair of radiology at UPenn, and he's going to give us some very important insights about what he personally, as well as the radiology societies, have learned from applying artificial intelligence in their field and how it will impact gastroenterologists and endoscopy as we move forward. So Dr. Kahn, please.

Hello, I'm Chuck Kahn from the University of Pennsylvania. It's a real pleasure to join you today to talk about AI in radiology and the lessons it holds for GI endoscopy. I have no commercial relationships to disclose. What I'd like to do as a radiologist is share some of the lessons we've learned from working with AI deep learning systems that might be applicable to your work in GI endoscopy. I've grouped them into four major points: test, test, and test again; seek the truth; set standards; and challenge yourselves. The cartoon here on the right says, "It's time we face reality, my friends. We're not exactly rocket scientists."

So let's talk about testing. We know that AI and deep learning are tremendously powerful, but it's often the case that we don't understand exactly what these systems have learned. This is an example I show frequently: a system built to detect pneumonia on chest radiographs, trained at two hospitals. At one, mostly inpatients, about 30% of the patients had pneumonia; at the other, mostly outpatients, only 1% had pneumonia. Getting a variety of cases from different settings is good practice, since it should make a system more robust, and in its initial testing the system performed quite well.

One of the things the developers did, though, was use a technique called heat maps, or saliency maps. It's a way to interrogate the system and ask, in this case, what is the key feature that tells you whether or not the patient has pneumonia? And the computer's answer was: the letter L. The hospital with a large number of pneumonia cases placed the "L" marker in its proper orientation on the black background, whereas at the hospital with very few pneumonia cases the letter L was reversed. The computer had found the feature that most clearly let it separate pneumonia from not pneumonia. Although this looks laughable, unfortunately we've seen the same thing in a variety of other instances: a system trained to detect TB on chest radiographs looked great until people realized it was detecting the words "TB clinic" at the bottom of the films from the site with the greater number of TB cases; a system meant to find pneumothorax was not actually detecting the pneumothorax, but rather the chest tube that had been placed to treat it.

So it's very important, as you look at these systems, to make sure you understand what they are actually doing. AI systems can be a bit of a black box, but we have to test them rigorously to assure appropriate performance. When you evaluate these systems, you should understand where the data came from, how the variables in the model were defined, what the criteria were to include or exclude patients or cases from training and testing, how much data was used and how that amount was determined, and, particularly, how well the training data match the intended clinical use of the model. A minimal sketch of the saliency-map idea follows below.
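The saliency-map technique mentioned above can be approximated with a simple gradient-based visualization. This is a minimal sketch, assuming a trained PyTorch image classifier (`model`) and a preprocessed radiograph tensor; it is an illustration of the general idea, not the specific tooling used in the studies described in the talk.

```python
import torch

def gradient_saliency(model, image, target_class):
    """Per-pixel saliency: |d(class score) / d(pixel)|.

    model: a trained torch.nn.Module classifier (hypothetical here)
    image: a preprocessed tensor of shape (1, C, H, W)
    target_class: index of the class whose evidence we want to inspect
    """
    model.eval()
    image = image.detach().clone().requires_grad_(True)  # track gradients w.r.t. pixels
    scores = model(image)                                 # shape (1, num_classes)
    scores[0, target_class].backward()                    # backpropagate the class score
    saliency = image.grad.abs().max(dim=1)[0]             # strongest gradient per pixel
    return saliency.squeeze(0).detach()                   # shape (H, W)

# Usage sketch: overlay the saliency map on the radiograph and check whether the
# highlighted pixels are lung parenchyma -- or a laterality marker.
```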
I've seen a model that was used to detect lung nodules on chest CT. Unfortunately, the way it was trained, it detected calcified nodules, like calcified granulomata. If you practice in the Midwest (I'm from Chicago), the great majority of people in the Mississippi-Ohio River Valley have had histoplasmosis and will have calcified granulomata in their lungs, their liver, and their spleen. These are not abnormalities that you want an AI system to be picking up.

There are a variety of metrics used to quantify the performance of AI systems. Of these, the most familiar and frequently used is the Dice similarity coefficient. It measures the overlap between a prediction and the reference, usually as the fraction of pixels that are labeled correctly. But there are some challenges with it. Imagine for a moment that we're looking at a CT of the liver, where we want to pick up metastases. The reference image on the left shows a large lesion and two tiny satellite lesions. In the middle image, prediction one, the AI system has done quite a good job of finding most of the large lesion but has missed the tiny lesions. Prediction two, on the right, has found all three lesions and done a reasonably good job of outlining the large one. By the Dice similarity coefficient, prediction one looks best, but clinically you might really want something that behaves more like prediction two, because the number of lesions, and the presence of those small ones, may have greater impact on treatment.

As well, although many of us are familiar with ROC (receiver operating characteristic) analysis, calibration is a useful property to examine and is not as frequently reported. For calibration, we compare the actual prevalence of disease with the predicted probability. If a group of patients has an estimated probability of malignancy of 40 percent, say, then we want to see that about 40 percent of that group actually has malignancy. One wants to see the calibration curve follow the dotted line along the diagonal; it gives us a measure of how well the system is actually estimating the probability of disease. A short sketch of both measures follows below.

It's also critically important that developers of AI systems show you examples not only of where their system works brilliantly, but also of where it may fall down. People are often less interested in doing that; it's certainly not something you expect to see in a sales pitch. But here, this was actually from the first manuscript we published in Radiology: Artificial Intelligence, and hats off to these authors. I think this is absolutely essential, because as a physician, if I'm going to use an AI system, I want to know what its weaknesses are. I want to know, for example, that it may generate false positives based on orthopedic hardware or IV tubing overlying the radius, or where the radius and ulna overlap, or that it missed a fracture, probably because of the overlying cast. That's important information for us to know as we use these systems in clinical practice.

Let's talk about ground truth. This image is from an article in the New York Times, and it shows something that looks like a call center.
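To make the Dice and calibration points concrete, here is a minimal sketch in Python using NumPy and scikit-learn. The lesion masks and predicted probabilities are invented toy data, not the cases shown in the talk.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def dice(pred, ref):
    """Dice similarity coefficient between two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    overlap = np.logical_and(pred, ref).sum()
    return 2.0 * overlap / (pred.sum() + ref.sum())

# Toy reference: one large lesion (100 px) plus two tiny satellites (2 px each).
ref = np.zeros(1000, dtype=bool)
ref[:100] = True          # large lesion
ref[200:202] = True       # satellite 1
ref[300:302] = True       # satellite 2

# Prediction 1: covers 95% of the large lesion, misses both satellites.
pred1 = np.zeros(1000, dtype=bool)
pred1[:95] = True

# Prediction 2: rougher outline of the large lesion, some spill-over,
# but it finds both satellites.
pred2 = np.zeros(1000, dtype=bool)
pred2[:80] = True
pred2[100:110] = True     # false-positive pixels outside the lesion
pred2[200:202] = True
pred2[300:302] = True

print(f"Dice, prediction 1: {dice(pred1, ref):.2f}")   # ~0.95: higher Dice, misses satellites
print(f"Dice, prediction 2: {dice(pred2, ref):.2f}")   # ~0.85: lower Dice, finds all 3 lesions

# Calibration: compare predicted probabilities with observed frequencies.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 500)                          # model's predicted risk
y_true = (rng.uniform(0, 1, 500) < y_prob).astype(int)   # outcomes drawn to match the risk
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
# A well-calibrated model keeps frac_pos close to mean_pred (the diagonal).
```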
But in fact, these individuals are identifying polyps on video images from colonoscopy. That raises questions: what training did these people have? I don't believe annotators absolutely have to be physicians, or necessarily in the health professions. But it's really important to understand how the labels that define the ground truth in a system are established. Whatever that process is, it should be well defined. You should know who or what annotated the data, what their qualifications or training were, and what instructions they were given. And if more than one person was involved in annotation, did the developers or manuscript authors measure the inter-rater and intra-rater variability? If there were discrepancies, how were those adjudicated?

This is an example from an AI competition that we held at RSNA, one of our big radiology societies. Here, the goal was to identify pneumonia. We took a publicly available data set, and three expert chest radiologists each drew bounding boxes, these rectangles, around the areas they identified as pneumonia. The ground truth that we took as the consensus was the intersection of the boxes that had been drawn (a sketch of this idea follows below). But again, it's critically important to understand how the ground truth was determined in any particular setting.

I mention these challenges because, for those who are interested in furthering the science of AI and its application in medicine, competitions can go a long way. This is an example from a group at Stanford that developed a tool early on using deep learning to assess bone age, skeletal maturity, which is traditionally done by taking a left-hand radiograph of a child and comparing it to a printed atlas of hand radiographs. Their system did very well; I think the mean error was on the order of five to six months. They contributed it, and we formed a competition around it that drew more than 300 teams, everything from industry labs to university groups to high school kids in their basements. The amazing thing was that the 12 top finishers out of those 300 all came in within a hair's breadth of each other. And they all did better than the Stanford group, whose paper had come out at just about the same time the meeting was going on and the competition was being finalized. The important lesson is that these competitions, these challenges, can produce remarkable gains by bringing a whole range of different ideas to the fore about how to solve these problems. The following year, we stood up a pneumonia detection challenge and had more than 1,400 teams participate. We've gone on to look at intracranial hemorrhage on head CTs, and this year we're looking at glioblastoma, at progression versus pseudoprogression. We also did one in chest radiology for pulmonary embolism detection.

Standards are another area that's really important for assuring that people understand the work that's being done and where all this is coming from. This is work we've done in radiology, creating a vocabulary, in fact an ontology, called RadLex that provides terminology used in radiology. It supplements more general resources like SNOMED, which is a broader vocabulary of medicine. Along with that, we created a set of common data elements. A common data element is effectively a question and its allowed set of answers.
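As a sketch of the consensus approach described above, where the ground truth was taken as the intersection of the expert-drawn bounding boxes, the following assumes boxes are given as (x1, y1, x2, y2) tuples. It is an illustration of the idea, not the actual RSNA challenge code.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2

def intersect_boxes(boxes: Sequence[Box]) -> Optional[Box]:
    """Consensus region covered by every reader's box, or None if they don't all overlap."""
    x1 = max(b[0] for b in boxes)
    y1 = max(b[1] for b in boxes)
    x2 = min(b[2] for b in boxes)
    y2 = min(b[3] for b in boxes)
    if x1 >= x2 or y1 >= y2:
        return None            # at least one pair of readers does not overlap at all
    return (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union: a simple measure of agreement between two readers."""
    inter = intersect_boxes([a, b])
    if inter is None:
        return 0.0
    inter_area = (inter[2] - inter[0]) * (inter[3] - inter[1])
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter_area / (area_a + area_b - inter_area)

# Three hypothetical readers annotating the same opacity:
readers = [(100, 120, 220, 260), (110, 115, 230, 250), (105, 125, 215, 255)]
print(intersect_boxes(readers))      # consensus ground-truth box
print(iou(readers[0], readers[1]))   # pairwise inter-reader agreement
```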
So for liver volume, you know it's going to be measured in milliliters and that it has to be a numerical value. For liver lesions, the answer could be none, single, or multiple, for example. Whatever it is, it has a standardized value, and that makes the data more interoperable.

One of the things we crafted for Radiology: Artificial Intelligence is CLAIM, a checklist for AI in medical imaging. It is not designed for radiology specifically; it is designed for any medical imaging domain and applies across the board. As for publication checklists like STARD, the Standards for Reporting of Diagnostic Accuracy Studies, a STARD-AI guideline is being developed. I'm one of the participants in that process, and it's forthcoming. Some of it actually incorporates criteria that are in the CLAIM checklist. So if you're looking at manuscripts that involve AI in medical imaging, whether as an author, a reviewer, or a reader of scientific articles, I encourage you to take a look at rsna.org/CLAIM for that standard.

As practice pearls, in conclusion: test, and test rigorously. Make sure you understand the ground truth and the measurements you're using to judge the performance of a system. Seek to develop standards that allow you to exchange and compare information across systems. And where you can, use AI competitions and challenges to further the science of AI. With that, I thank you very much, and if you have questions, please don't hesitate to contact me.
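A common data element, as described in the talk, can be represented as a named question with a constrained value set. Here is a minimal sketch in Python; the element names and constraints are illustrative, not drawn from the actual RadLex or RSNA common data element definitions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union

@dataclass(frozen=True)
class CommonDataElement:
    """A question together with its allowed set of answers."""
    name: str
    value_type: type                                   # e.g. float or str
    unit: Optional[str] = None                         # e.g. "mL" for a numeric element
    allowed_values: Optional[Sequence[str]] = None     # e.g. categorical choices

    def validate(self, value: Union[float, str]) -> Union[float, str]:
        if not isinstance(value, self.value_type):
            raise TypeError(f"{self.name}: expected {self.value_type.__name__}")
        if self.allowed_values is not None and value not in self.allowed_values:
            raise ValueError(f"{self.name}: {value!r} not in {list(self.allowed_values)}")
        return value

# Illustrative elements modeled on the examples in the talk:
liver_volume = CommonDataElement("liver_volume", float, unit="mL")
liver_lesions = CommonDataElement("liver_lesions", str,
                                  allowed_values=("none", "single", "multiple"))

liver_volume.validate(1520.0)        # OK: numeric value in milliliters
liver_lesions.validate("multiple")   # OK: one of the allowed answers
```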
Video Summary
In this video, Dr. Chuck Kahn, professor and vice chair of radiology at UPenn, discusses the lessons learned from applying artificial intelligence (AI) in radiology and how they can be applied by gastroenterologists and in endoscopy. He emphasizes the importance of rigorous testing and of understanding what AI systems have actually learned. Dr. Kahn presents examples of systems that mistakenly focused on non-relevant features, such as the letter "L" marker instead of pneumonia, or that generated false positives from orthopedic hardware and overlying tubing in fracture detection. He explains the challenges of measuring the performance of AI systems, including the limitations of the Dice similarity coefficient and the value of calibration. Dr. Kahn also highlights the significance of defining ground truth and the need for standardized terminology and common data elements. He encourages the use of AI competitions and challenges to advance the field. Dr. Kahn concludes by urging practitioners to test rigorously, understand ground truth, develop standards, and participate in AI competitions.
Asset Subtitle
Charles Kahn, MD
Keywords
artificial intelligence
radiology
gastroenterologists
rigorous testing
ground truth