How to Evaluate AI in the GI Literature and Clinical Trials
Video Transcription
We are ready to start session number four. Going down the list and getting into clinical implementation, trial designs, how to evaluate the GI literature, and the role of large datasets of images: that is the goal of this session. So it's my pleasure to introduce Alessandro Repici from Milan, Italy, and Mike Wallace, the course co-director, to lead this session. Ale and Mike, over to you.

Thanks to the ASGE for the invitation; it is also a pleasure to co-moderate together with Mike Wallace, who is my Editor-in-Chief at GIE. The first speaker is Cesare Hassan. Cesare is an outstanding endoscopist and researcher who has been working tirelessly to produce innovation in the space of GI endoscopy. He currently serves as a board member of the European Society of Gastrointestinal Endoscopy, and I'm pleased to announce that he recently joined the Humanitas University medical school and the Humanitas Research Hospital as a professor of gastroenterology. So he's going to be part of our team, and I'm glad to introduce Cesare for his talk. Thanks, Cesare.

Welcome, everybody, and thank you for this important invitation to the third summit of the American Society on breakthrough innovation in GI endoscopy. Let's talk about what is available in the literature and how much relevance we should give to it. This is my disclosure.

Every time there is a new innovation in GI endoscopy, we need to look at three different phases of the innovation cycle. The first is whether the innovation does or does not work; in artificial intelligence this is referred to as standalone performance. The second is somewhat more familiar to us, because it is whether the innovation works in an ideal setting, that is, the controlled study. And the third is its value in real-life studies, although this may take a bit more time.

There is no doubt that standalone performance somewhat dominated the first years of the artificial intelligence literature. AI standalone studies answer a very simple question: is artificial intelligence as good as the human mind in detecting or characterizing lesions? What we basically do is expose AI to images that have been extracted by experts, in order to understand whether AI can identify them or not (CADe), or characterize them with the same or even higher accuracy than pathologists (CADx). What is good about the standalone performance study is that, differently from the human mind, artificial intelligence does not suffer from any psychological bias. AI is not interested in showing that it is better or worse than us. For this reason, there is no need for randomization to minimize operator-related bias.
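As a rough illustration of what such a standalone (per-image) evaluation reduces to, the minimal Python sketch below compares a system's binary outputs against expert reference labels and reports sensitivity, specificity, and accuracy. The labels, predictions, and function names are hypothetical placeholders, not data from any study mentioned in the talk.

```python
# Minimal sketch of a standalone (per-frame) CADe evaluation.
# All labels and predictions below are hypothetical placeholders.

def confusion_counts(expert_labels, cade_predictions):
    """Count TP/FP/TN/FN for binary per-frame labels (1 = lesion present)."""
    tp = sum(1 for y, p in zip(expert_labels, cade_predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(expert_labels, cade_predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(expert_labels, cade_predictions) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(expert_labels, cade_predictions) if y == 1 and p == 0)
    return tp, fp, tn, fn

def standalone_metrics(expert_labels, cade_predictions):
    tp, fp, tn, fn = confusion_counts(expert_labels, cade_predictions)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(expert_labels)
    return {"sensitivity": sensitivity, "specificity": specificity, "accuracy": accuracy}

if __name__ == "__main__":
    # 1 = frame contains a lesion according to the expert annotation
    expert_labels    = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
    cade_predictions = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]  # hypothetical CADe output
    print(standalone_metrics(expert_labels, cade_predictions))
```

Because there is no human in the loop at this stage, such a comparison needs no randomization; the open questions are instead about where the images came from, which is what the next part of the talk addresses.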
However, these studies are not fully informative. Indeed, differently from any other innovation we have tried in the last 20 years, what is characteristic of AI is that we don't know how it works: we are testing something that we don't fully know. Indeed, as you heard in the previous presentation, when dealing with deep learning we introduce a new bias, named black box bias. The fact that we don't know the exact algorithm doesn't mean that we don't know anything. As clinicians, we may ask the authors for additional information on how the deep learning system was trained and how it was tested.

For instance, any time I review a paper on AI, I want to know everything about the patient population that was included in the training dataset, because this determines generalizability and reproducibility in community, real-life endoscopy. How many centers participated in collecting the cases? Were these centers tertiary (a possible selection bias)? Were the cases consecutive or not? Was there any disease enrichment, for instance with flat lesions or high-grade dysplasia? Were all the cases histologically verified? This is extremely time consuming, and a lot of AI systems have not been trained with histologically verified disease. But we also need to know about the endoscopy setting: what was the skill of the endoscopists who collected the data, and what technology was used? Of course, AI is robust, but if it was trained with advanced imaging, maybe I should use advanced imaging as well. And then we need to be somewhat aware of the technical setting. Who selected the frames? Maybe they selected only the best frames. How was the annotation process performed? Were the annotators experts, non-expert endoscopists, or even non-endoscopists, as is frequently the case?

This is a table that we prepared for our review, showing that in training, different devices present different characteristics in terms of number of cases, number of centers included, number of flat lesions, and, within the flat lesions, non-granular LSTs. Thus, you need to know all of this information about the training when you evaluate a CADe system.

These are the possible biases that can affect the training: selection bias, if not all consecutive cases were included in the dataset; operator bias, if it is always the same endoscopist presenting the same lesion at the same position, with the same focus and the same magnification; and a new bias, named overfitting bias. This is when the system memorizes instead of learning to recognize lesions, so it is very proficient when tested with the same cases used in training, but does not recognize new lesions in real-life endoscopy.

When coming to the testing dataset, everything we said about training also applies here: we need to know the population, the setting, and the technical characteristics. However, as a reviewer, I am always extremely careful about two points. First, were the cases used in the testing the same as those in the training dataset, or did they come from the same patients, the same centers, or the same endoscopists? This may facilitate an overfitting bias, and the results may look much more brilliant than they actually are. And second, were the cases generalizable to those of a community-based endoscopy practice? Because, again, if the endoscopists in the testing were the best endoscopists in the world, we cannot be confident that everyone can achieve the same performance.
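To make the first of these two concerns concrete: a common safeguard against this kind of train/test overlap is to split the data at the patient (or center) level rather than at the frame level. Here is a minimal sketch, assuming scikit-learn is available; the frame records and identifiers are invented for illustration only.

```python
# Minimal sketch of a patient-level split to avoid train/test leakage.
# Assumes scikit-learn is installed; the frame records are hypothetical.
from sklearn.model_selection import GroupShuffleSplit

frames = [
    # (frame_id, patient_id, center_id, label)
    ("f001", "pt01", "centerA", 1),
    ("f002", "pt01", "centerA", 1),
    ("f003", "pt02", "centerA", 0),
    ("f004", "pt03", "centerB", 1),
    ("f005", "pt03", "centerB", 0),
    ("f006", "pt04", "centerB", 0),
]

groups = [patient for _, patient, _, _ in frames]  # group frames by patient_id
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(frames, groups=groups))

train_patients = {frames[i][1] for i in train_idx}
test_patients = {frames[i][1] for i in test_idx}
# No patient contributes frames to both sets, which is the point of the split.
assert train_patients.isdisjoint(test_patients)
print("train patients:", sorted(train_patients), "| test patients:", sorted(test_patients))
```

Grouping by center instead of patient would be the analogous safeguard against center-level overlap; either way, the reported performance is less likely to be inflated by the system having effectively "seen" the test cases before.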
What is new in the testing setting is that the same images extracted by experts may be given to AI on one side and to non-expert endoscopists on the other, in order to benchmark AI against community endoscopists and not only against the experts. We don't want AI merely to be better than the expert; we want AI to improve the average level of the community endoscopist. And, extremely important, a lot of software has been tested against frames, but we want software to be tested against video, and possibly prospectively in real-life endoscopy.

These are the two meta-analyses that show the standalone performance of all the AI devices. As you may see, these systems are extremely good, very similar to face recognition software, and there is also quite small variability in their performance. So these systems can work. And this is a table from our meta-analysis that shows how different and variable the testing datasets are: for instance, we can have from 6,000 to 30,000 images for training, and the testing dataset may range from 17 to nearly 1,000.

The second question is effectiveness, the benefit in the real world. Indeed, the fact that a machine detects a lesion doesn't mean everything, because maybe the endoscopist is unable to recognize the high-grade dysplasia that the machine highlighted. But the opposite is also true: maybe I can see a polyp that my AI system missed. And what about false positives, unnecessary resections, unnecessary biopsies, patient anxiety, and so on? This is why we need randomized trials, which may be performed with parallel arms or in a tandem design, as with any other innovation. On the other hand, for characterization, usually we resect the lesion and compare AI with the pathologist, and this is a bit stronger because the pathologist is blinded to the machine. This is, for instance, the meta-analysis we performed for CADe studies. Of note, most of these studies come from China, where probably the regulatory process and the IRB are much faster than in Western countries.

What I like about these randomized clinical trials is that they are very similar to what we did with any other innovation: a randomized trial with AI in colonoscopy is very similar to a randomized trial with any cuff, ring, cap, advanced imaging, and so on. What I also like about randomized trials is that they give us information about the consequences of a false positive; for instance, in colonoscopy, we have 30 seconds more of withdrawal time. But what is bad about these trials is that they are at much higher risk of bias. For detection, there is the same bias as with any other innovation, which is that the endoscopist is not blinded. However, this may sometimes be compensated with some intelligent artifices. For instance, we can have a sham AI sending random activations, or, what is more clever, we can rerun all the non-AI cases with AI retrospectively in order to be sure that in the control arm there was no operator-related bias. Similarly, with CADx there is always the doubt of whether we need to look at the performance of the machine or the performance of the operator assisted by the machine; there is too much emphasis on the machine, while it is very unlikely that we will accept an automatic diagnosis from the machine.

Finally, what we need to pursue in the near future is to measure the value of this innovation that is spreading out in Western countries. I really want to pay tribute to my friend Yuichi Mori, not only for having opened the chapter of AI, but for having delivered the first paper in which the value of AI for colonoscopy was measured retrospectively, showing the same 30% increase in ADR that was shown in randomized trials. Similarly, Yuichi and I did this study on the impact of AI on post-polypectomy surveillance, to measure the effect of AI on expenditure. All of this fits within this algorithm: it is not to be expected that everything about AI is good, because AI will increase costs in terms of polypectomies, pathology, surveillance, and patient anxiety. But what we need to show is that all of these costs are reabsorbed by the savings in cancer prevention, including cancer-related expenditure, through an increased standard of care.
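As a purely illustrative back-of-the-envelope sketch of that cost-offset argument, the snippet below tabulates hypothetical incremental costs of higher detection against hypothetical savings from prevented cancers. Every number and parameter name is a made-up placeholder, not a figure from the talk or from the cited studies.

```python
# Back-of-the-envelope cost-offset sketch for AI-assisted colonoscopy.
# Every number is a hypothetical placeholder, for illustration only.

def net_value_per_1000_colonoscopies(
    extra_polypectomies=120,      # additional resections driven by higher detection
    cost_per_polypectomy=150.0,   # resection plus pathology, arbitrary currency units
    extra_surveillance_exams=40,  # patients moved to closer surveillance intervals
    cost_per_surveillance=800.0,
    cancers_prevented=3,          # downstream cancers avoided per 1,000 exams
    cost_per_cancer_case=60000.0, # treatment cost avoided per prevented cancer
):
    added_costs = (extra_polypectomies * cost_per_polypectomy
                   + extra_surveillance_exams * cost_per_surveillance)
    savings = cancers_prevented * cost_per_cancer_case
    return savings - added_costs  # positive => added costs are reabsorbed by prevention

if __name__ == "__main__":
    print(f"net value per 1,000 colonoscopies: {net_value_per_1000_colonoscopies():,.0f}")
```

The point of such a calculation is not the specific numbers but the structure: the extra polypectomy, pathology, and surveillance burden sits on one side of the ledger, and the prevented-cancer expenditure sits on the other.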
So let me conclude. First, any endoscopist, not only when reading a paper but also when using a system, must have all the details on the training and the testing databases, because from this perspective not all the systems are the same. The second is that randomized trials do not give us so much the confirmation that the systems are good, because these systems are good, but they allow us to explore the interaction between the machine and the humans. And finally, we need to be extremely careful, because these systems must prove to have value in order to be reimbursed and to be accepted by all the stakeholders: health systems, patients, health policy makers, and so on. We also need to ask ourselves about the de-skilling of endoscopists, especially trainees who are likely to be trained with AI. Thanks again so much for this invitation. I'm ready to discuss all of this with you. Bye bye.
Video Summary
The video transcript is a presentation on the topic of artificial intelligence (AI) in gastrointestinal (GI) endoscopy. The speaker discusses the different phases of the innovation cycle for AI in GI endoscopy and the importance of evaluating its performance in real-life settings. The speaker emphasizes the need to consider various factors when evaluating AI systems, such as the training data set (including patient population, centers involved, and disease verification), the testing data set (including generalizability to community-based endoscopy and comparison with non-expert endoscopists), and the potential biases that can affect the results. The speaker also discusses the need for randomized trials to assess the effectiveness of AI in real-world situations and highlights the importance of measuring the value of AI innovation in terms of costs, cancer prevention, and increased standard of care. The presentation concludes with a reminder about the importance of considering the details of training and testing databases, the interaction between machines and humans, and the need for value-based assessment of AI in GI endoscopy. The speaker expresses gratitude for the invitation and welcomes further discussion on the topic.
Asset Subtitle
Cesare Hassan, MD
Keywords
artificial intelligence
gastrointestinal endoscopy
evaluation
biases
effectiveness