Winning on Detection, Classification and Navigation Using AI - Is it Enough?
Video Transcription
We want to continue with our next topic, presented by Ehud Rivlin: Winning on detection, classification and navigation using AI, is it enough? Ehud Rivlin received a PhD degree in computer science in 1993 from the University of Maryland. He is currently a professor in the computer science department of the Technion, Israel Institute of Technology, in Haifa. Dr. Rivlin is also a research scientist at Google Health. His research interests include visual recognition, event perception, biologically motivated vision, and sensory-based motion planning. Welcome, Ehud.

Thank you for joining us today, and thank you Prateek and Sravanthi for inviting me to give this talk. What I want to discuss today is the following question. If we can win on this trio, detection, classification and navigation, the basic trio that every endoscopist wants to have, the trio that answers where is it, what is it, and where am I, are we good? What else is there to ask for?

I wanted to start with some motivation. When we act, mistakes happen; we know that to err is human. But I was surprised to find that medical errors are the third leading cause of death in the US, and the numbers are staggering. I wanted to understand this better, and I found this list: ten medical errors that can kill you in the hospital. Number one: diagnostic errors. We are focused on diagnostics. From a personal point of view, every time I was asked what I do and answered medical imaging, diagnostics, I got a strange look and a follow-up: but what about treatment? I didn't feel very motivated until I found this list, with diagnostics at number one. Looking at our friends in radiology, it is interesting to note that their miss rate can be as high as 30%. In GI, there is the issue of missed polyps. We thought CRC was a valid target and wanted to understand the sources of these misses and how to drive them down.

We found three sources of error. First, the target was in the field of view but missed: it was there, we just didn't notice it. Second, missed coverage, which itself has two causes. Either the traversal was incomplete, for example we didn't get to the cecum, or we didn't find the target because we never looked there: we looked to the right while the target was on the left, so the visual inspection of the space was incomplete at the points that were visited. The third source is simply misclassification of the polyp.

For detection, the target is in the field of view and we need help to detect it. Detection is here. With Fuji, Medtronic, and Olympus all offering CADe systems, detection is basically here; the proof now is in the pudding. We still have work to do, improving performance, bringing the false-alarm rate down, working in challenging conditions, handling inadequate bowel preparation, and so on, but we are there or very, very close.

Classification. On classification, the ability to do optical biopsy and follow up accordingly, we are getting there. You see the papers and the talks we just heard, great work. I would say that we are on track.

Navigation. As to navigation, in my humble opinion we are not there yet. We do not have clean, stable answers to the relevant questions: Where am I, qualitatively and quantitatively? Have I been there? Can I map the colon, or navigate back to a polyp we saw just 10 seconds ago? We are not there yet. So what is needed to answer those navigation-related questions? For some of them, large datasets and good image-matching abilities might be good enough.
Note, by the way, that today we have this ability in our pocket. Using Google Lens, for example, one can recognize landmarks. The good question is: when are we going to see these abilities we carry in our pocket arrive and execute on the tip of the endoscope? We have some results on landmark recognition, but today I want to focus on a more challenging navigation-related task.

To answer some of the more challenging questions we mentioned, we might need more building blocks. Building blocks like ego-motion computation, the estimation of the camera's motion relative to the rigid scene; visual odometry, the process of determining position and orientation by analyzing imagery; depth extraction; and scene reconstruction. The plan is to show how, using those, we can answer the following navigation-related question: what part of the colon have I visually covered? We want to solve coverage, and we will do it segment by segment, so that we can first answer what fraction of the current segment has been covered while traversing the colon. If we can do that in real time, we will be able to react in real time, which is good. Summing over all segments then provides a global measure of colon coverage.

So what is coverage, and how do we compute it? What is the problem definition? Given a segment, a trajectory, and the camera poses, we take all the actually visible points and divide them by all the points that could have been seen, the maximally visible points, to get the coverage. On the left you can see a segment of the colon in 2D, with all the points belonging to the maximal set colored green. On the right, for a given trajectory, illustrated by the red camera locations and viewing angles, the set of actually visible points is shown in orange. The coverage is then simply the ratio of the actually visible points to the maximally visible points: coverage = |actually visible| / |maximally visible|. That gives a number representing the coverage for the trajectory going from P0 to P1 with the viewing angles you see on the right: the points in orange divided by the points in green.
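To make that definition concrete, here is a minimal sketch of the geometric ratio, assuming we already have an explicit point set for the segment and a list of camera poses. The function name, the conical field-of-view visibility test, the range threshold, and the omission of occlusion handling are all simplifications for illustration; the actual system described in this talk (C2D2) estimates coverage with neural networks rather than from an explicit point cloud.

```python
import numpy as np

def segment_coverage(points, poses, fov_deg=70.0, max_range=10.0):
    """Coverage of one segment = actually visible points / maximally visible points.

    points : (N, 3) array, the maximal set of visible surface points of the segment
    poses  : list of 4x4 camera-to-world matrices along the trajectory
    A point counts as covered if it falls inside the (conical) field of view
    and range of at least one camera pose. Occlusion testing is omitted.
    """
    half_fov = np.deg2rad(fov_deg) / 2.0
    seen = np.zeros(len(points), dtype=bool)
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates

    for pose in poses:
        # Bring the segment points into this camera's frame.
        cam = (np.linalg.inv(pose) @ pts_h.T).T[:, :3]
        z = cam[:, 2]
        # Angle between each viewing ray and the optical (z) axis.
        angle = np.arccos(np.clip(z / (np.linalg.norm(cam, axis=1) + 1e-9), -1.0, 1.0))
        seen |= (z > 0) & (z < max_range) & (angle < half_fov)

    return seen.sum() / len(points)
```

During a real procedure, of course, there is no explicit point cloud and no ground-truth trajectory to plug into such a function, which is exactly why the method needs learned depth and pose, described next.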
In order for us to compute visibility, we need depth. So we have a two-phase process: the first phase is depth computation and reconstruction, and the second is the actual coverage computation. The process, by the way, is described in our paper, Detecting Deficient Coverage in Colonoscopies, published in IEEE Transactions on Medical Imaging.

Let's start with the first phase, depth computation, depth extraction. We want to reconstruct the colon and the trajectory, and we will do it via learning. There is an issue, though: to learn, we need tagged examples, we need ground truth, and in this case we don't have it. Interestingly enough, it is possible to learn to reconstruct depth in an unsupervised fashion. Let's see how.

To extract depth, we estimate both depth and pose, the position and orientation of the camera, simultaneously. Solving a harder problem seems counterintuitive, but it affords us an extra benefit: we can tie the depth and the pose together, expressing them both in the loss. So we have two images, the frames at time T and at T-1. We take the RGB frame at T, pass it through a depth network, and get a proposed depth. We take the frame at T together with the frame at T-1, pass them through a pose network, and get a proposed pose. With the pose and the depth in hand, we can transform the current depth values back in time to T-1 and render a proposed image at T-1 using the proposed depth and pose for T. Comparing the real RGB image at T-1 with the one we produced gives a view-synthesis loss, and this is what we minimize during training (a rough sketch of this loss in code appears after this passage). This is how we get depth and pose.

But how do we do validation? Training can be unsupervised, but evaluation needs supervision. How did we evaluate our computation? We used synthetic data that we got from Simbionix, 3D Systems, and validated our results against it. You can see that the mean relative error is relatively small, and you can compare the estimated depth to the ground truth shown in the center. Let's also look at some results on real data. This is depth estimation on real data; it is for your eyes only, as we have no ground truth here. The bottom row shows the estimated depth maps, where yellow is deeper and blue is shallower, and you can get an impression of the estimates we obtain.

So we have depth, and now we want to compute coverage, per segment. To do that, we take a two-step approach. First we compute coverage per frame; this is the first network you see on the top right. For a single frame, the trajectory is just a single pose, and coverage is computed in exactly the same manner: the ratio of the visible points to the maximal set of points that could be visible. We train this network so that, given the depth image, it outputs the coverage for that frame. Then we train another network that gives us the per-segment coverage, taking as input the collection of results from the first network for each frame and producing the segment coverage as output. Note that what we actually take from the first network is not its final output: we cut it slightly before the end and take a set of features for each frame. Those features, F1 to Fn, are the input to the second stage, which runs them through the network and produces the coverage per segment. Here we do supervised learning: we train on synthetic videos and then perform domain adaptation from synthetic to real.

Note also, as a small remark, that the 3D reconstruction is useful beyond coverage: we can do flat polyp detection, estimate polyp size, and do visualization. But let's go back to coverage and look at our results.

Okay, results. We took about 550 video segments of synthetic data, 10 seconds each, had them evaluated both by gastroenterologists and by our algorithm, C2D2, and compared the two. The left plot shows C2D2's performance, while the right one shows the physicians' performance; the ideal would be the diagonal. As you can see, C2D2's performance is considerably better than that of the physicians. Now, evaluation on real data: we took 300 clips of segments, 10 seconds each, and ran them past gastroenterologists. Here we have no ground truth, so we measured agreement, and in total the physicians agreed with C2D2's prediction on 93% of the clips, which I think speaks to the accuracy of the algorithm.
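As a concrete illustration of that view-synthesis loss, here is a minimal PyTorch-style sketch, assuming a pinhole camera with known intrinsics and a dense depth prediction for the current frame. The function name, the tensor shapes, and the inverse-warping direction (reconstructing the current frame by sampling the previous one) are assumptions made for brevity, and practical details such as masking invalid projections, multi-scale losses, or structural-similarity terms are omitted; this is not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def view_synthesis_loss(frame_prev, frame_curr, depth_curr, pose_curr_to_prev, K):
    """Photometric (view-synthesis) loss for unsupervised depth + pose training.

    frame_prev, frame_curr : (B, 3, H, W) RGB images at t-1 and t
    depth_curr             : (B, 1, H, W) predicted depth for frame t
    pose_curr_to_prev      : (B, 4, 4) predicted camera motion from t to t-1
    K                      : (B, 3, 3) camera intrinsics
    """
    B, _, H, W = frame_curr.shape
    device = frame_curr.device

    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().reshape(3, -1)

    # Back-project to camera coordinates at time t using the predicted depth.
    cam = torch.inverse(K) @ pix.unsqueeze(0)                 # (B, 3, H*W) rays
    cam = cam * depth_curr.reshape(B, 1, -1)                  # scale rays by depth

    # Move the points into the previous camera and re-project with K.
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_prev = (pose_curr_to_prev @ cam_h)[:, :3]
    proj = K @ cam_prev
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)           # pixel coords in frame t-1

    # Normalize to [-1, 1] and sample the previous frame at those locations.
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    synthesized = F.grid_sample(frame_prev, grid, align_corners=True)

    # Photometric error between the synthesized view and the real current frame.
    return (synthesized - frame_curr).abs().mean()
```

Because the warp depends on both the predicted depth and the predicted relative pose, minimizing this photometric error pushes the two networks toward geometrically consistent estimates, which is the benefit of tying depth and pose together in a single loss.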
I hope this gives a good coverage of the coverage-deficiency algorithm. Before time is up, which is really soon, I want to return to the question I started with: is it enough? We still have work to do to win on the trio, but let's assume for a minute that we have won. The question remains: is it enough? Can we do more with this disruptive technology? Due to time limitations I cannot go deep into this topic, but let me briefly mention some points for the two main players, the patient and the GI. As to the patient, we want to be personalized: we can do comparative analysis, help in decision making, adjust performance to the patient, drive slowly for an elderly patient, and so on. For the GI, we can do skill assessment, and I think we can help a lot here; we can support decision making, help with complications, and more. And I have still said nothing about bringing the technology that is in our pocket to the tip of the endoscope: computational photography, in-situ and multispectral imaging, and so on. So from my end the answer is a clear no, and we can now go back to work. Thank you.
Video Summary
In this video, presented by Ehud Rivlin, the topic of discussion is whether winning on detection, classification, and navigation using AI is enough in medical imaging. Dr. Rivlin starts by highlighting the importance of these three aspects and their relevance to medical errors, which are the third leading cause of death in the US. He focuses on colonoscopy and the sources of error in detecting polyps, including missed targets, incomplete coverage, and misclassification. He states that detection is almost at a satisfactory level with the availability of computer-aided detection, but classification and navigation still need improvement. Dr. Rivlin then explains the concept of colon coverage and proposes a method for its computation using depth extraction and trajectory reconstruction. He discusses the unsupervised learning approach used to estimate both depth and pose, and showcases results on real data and evaluation by gastroenterologists. In conclusion, Dr. Rivlin suggests that winning on the detection, classification, and navigation trio is not enough, and there is a need for personalized, comparative, and decision-making capabilities, as well as the implementation of pocket technology in endoscopy.
Asset Subtitle
Ehud Rivlin, PhD, MSc
Keywords
medical imaging
AI
detection
classification
navigation