Computer Vision in Medicine: State of the Art
Video Transcription
Moving on to our second talk, it's Computer Vision in Medicine: State of the Art, to be presented by Dr. Rahul Bhotika. Rahul joined Amazon in 2014 and is currently the Director of Computer Vision at Amazon Web Services. He is part of the team responsible for many services that all of you may have heard about, such as Amazon Rekognition, Textract, Lookout for Vision, et cetera, and he and his team continue to publish research in areas of computer vision related to Amazon Web Services. So it's my pleasure to introduce Rahul, and we will start off with Rahul's presentation. Rahul.

So thank you again to the organizers for giving me the opportunity. It's always exciting and encouraging to see this much interest in the use of AI, especially in healthcare. Moving quickly, today I'll talk about three sections, but focus on the middle one the most: first, a brief introduction to the services we offer today in AI and machine learning, specifically in computer vision and health AI; then the majority of the presentation will focus on how anyone, specifically domain experts, can build models for their own tasks even without ML expertise; and then briefly, toward the end, an aspirational section on what we call multimodal representations and what that could mean for healthcare in the future.

This is a very busy slide, but I don't need to explain everything. The point is that you can think of it as a stack or a layer cake, and we are trying to build the broadest and deepest set of machine learning capabilities, because different users want different things. The bottom layer is frameworks and infrastructure, which goes all the way down to hardware we are innovating on, such as new hardware architectures for deep learning like the Trainium and Inferentia chips, plus optimized frameworks so you can code up your machine learning algorithms in your favorite framework. Moving up, you have ML services, which are really for a developer who knows machine learning but needs tools to be efficient. That starts with annotation of data and being able to set up and monitor training; if you're satisfied with model performance and need to deploy it into production and monitor it, that's all handled in the ML services layer, Amazon SageMaker. And then at the top you have AI services. Just as Dr. Bersin talked about, there's confusion between these terms, but hopefully what he said earlier is mirrored here: deep learning is a machine learning technique, while AI refers more to tasks like natural language understanding, computer vision, or speech, where the goal is to replicate or exceed human performance.

Specifically in computer vision, we have Rekognition, Textract, and Lookout for Vision, and we're just getting started on services in healthcare. Comprehend Medical does natural language understanding, so you can use it to convert unstructured text into structured data; Transcribe does speech to text; and HealthLake is there to store your data in a data lake in a health-compliant manner and run analytics and other operations on it. So, very briefly, Amazon Textract: some time back, before deep learning, all of this used to be optical character recognition, or OCR.
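As a rough point of reference for that plain-OCR baseline, a minimal call to Textract's text-detection API from the Python SDK might look like the sketch below; the bucket and document names are made up for illustration. The richer form understanding described next uses the related AnalyzeDocument API with a FORMS feature type, which additionally returns key-value relationships.

```python
# Minimal sketch: plain text detection ("classic OCR") with Amazon Textract via boto3.
# The S3 bucket and document names below are placeholders, not real resources.
import boto3

textract = boto3.client("textract", region_name="us-east-1")

response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "example-bucket", "Name": "scanned-form.png"}}
)

# The response is a flat list of blocks; LINE blocks carry the recognized text.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```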
Today's Textract is what you would call OCR++, because deep learning, through computer vision, enables understanding structure and extracting relationships. Take even simple things that are easy for humans, like looking at the form on the right: a date of birth field, which has structure under it (month, day, and year), and then radio buttons. Earlier you would get the raw text, but you had to build a template to pull all of that information out. Now, without that configuration, Textract can extract this information for you through deep learning methods.

The second area is Amazon Rekognition, which is more about natural images, the images you find in the world. I won't go into detail, but we have several offerings here, depending on the aggregate, dominant use cases of our customers. A couple of things to point out: while we are a cloud-based service, we are also looking at what we can do at the edge, on premises, or in hybrid setups, which might be pertinent in this field, where you want real-time support for diagnosis as well as privacy, where you may not want to upload data if you are not comfortable with where it goes in the cloud. So for latency, privacy, and bandwidth reasons, hopefully very soon a lot of these cloud services will also be available in a hybrid or edge setup.

I want to focus here on Custom Labels, which I'll talk about in more detail in the next few slides. One issue, and again this was mentioned in the previous talk, is that when people are interested in detecting things like cars, dogs, or humans in images, there are many customers who want the same thing. We can build those services and offer them so that the user only has to send an image to the service and gets back an answer with all the predictions. But typically the use case is really specialized, and it may also be that only the user has access to the data pertaining to that task. So it's not possible for us as a service provider to build what we call a pre-trained model ahead of time for all of these use cases. That's what we call the long tail, and therefore we need a way for the user to bring their data and build their own model.

The long-tail concept also extends to the data itself, or data imbalance. In this example, the iNaturalist data set, the table on the left shows that the number of plant images is far greater than the number of arachnid images. How do you balance for that? And if you go out in the world to capture data, or in healthcare, some diseases or abnormalities are more prevalent and others are rare. It's easy to find 10,000 images of monarch butterflies, but for some rare species there might be only five images available worldwide, and you still have to work with those to recognize them.

So for that we built the Custom Labels service. First, there is a workflow for you to label your own images. Then you can basically say "train the model," and it uses those images to build a deep learning model. If you're satisfied with the performance on a test set, you can deploy it for research or for production use cases. Again, no ML expertise is required, because all of that is fully managed under the hood.
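For a sense of what "fully managed" looks like from the user's side, here is a minimal, hedged sketch of calling an already trained Custom Labels model with the Python SDK; the project version ARN, bucket, and image names are placeholders, not real resources.

```python
# Minimal sketch: running inference against a trained Rekognition Custom Labels model.
# The ARN, bucket, and image names are placeholders for illustration only.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

MODEL_ARN = (
    "arn:aws:rekognition:us-east-1:123456789012:project/demo/"
    "version/demo.2021-01-01T00.00.00/1"
)

# The model has to be started (deployed) before it can serve predictions;
# in practice you would wait for its status to become RUNNING.
rekognition.start_project_version(ProjectVersionArn=MODEL_ARN, MinInferenceUnits=1)

response = rekognition.detect_custom_labels(
    ProjectVersionArn=MODEL_ARN,
    Image={"S3Object": {"Bucket": "example-bucket", "Name": "test-image.jpg"}},
    MinConfidence=70,
)

for label in response["CustomLabels"]:
    print(label["Name"], round(label["Confidence"], 1))
```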
And we are hoping that you can do this with tens to hundreds of images to get started, not necessarily for the final model, rather than having to label tens of thousands of images first and then build your model. One of the benefits of working at Amazon Web Services is the different types of users across Amazon: Custom Labels is being used by media companies, industrial organizations, agriculture, even for document classification; by drones detecting damage; and, at the bottom right, even for finding animals during the hunting season. Some use cases in healthcare have also started to emerge. Here I'm showing a study, still on the research side, done by Edward Korot, Pearse Keane, and others from Moorfields Eye Hospital in the UK, published this year in Nature Machine Intelligence. They are looking at diabetic retinopathy using fundus photography and optical coherence tomography, which you see on the top left and bottom left. I'm happy to report that they evaluated a number of different services that do this kind of AutoML, including Custom Labels on Amazon Web Services, and even though we had not trained a model specifically for healthcare or for diabetic retinopathy, they were able to get good performance out of it within a few iterations. So if you're interested, please try it out and give us feedback.

Now, if your model doesn't work, what do you do? Typically the problem is that you have some labeled data, and if you want a high-performing algorithm you may have to label a lot more, as mentioned previously, but that's painful and you don't actually know what to label. So we've also done research on how to reduce this data-labeling burden. You start with a few labeled samples, train a model, and use it to generate predictions. Then the domain expert goes in and says, okay, these predictions are wrong, so they are providing feedback. We sample that feedback to add more labeled data, and we repeat this in cycles (a rough schematic of such a loop is sketched below). The benefit is that the cognitive load of deciding which images to label, and how many, is transferred from the user to the algorithm; the user just watches the model's performance and keeps giving feedback.

This feedback can be a painful process by itself, and interactive feedback is a way to accelerate it. Today, you would provide data and then, depending on the models you're training, wait a couple of hours or even three days before coming back to get a model. If you then label a hundred more images and have to wait three more days, that's not going to work. We have done a study showing that if you label just a few images at a time and provide online updates to the model, then in most cases the amount of data you have to label to improve the model is 30 to 50 percent less. The user only looks at the most significant mistakes and doesn't have to keep repeating labels on something that one example would have been enough to correct. We have a publication on this called linear quadratic fine-tuning, which is just a fancy way of saying that we have methods to update models in real time. So I have a quick demo video here that hopefully will work; it's with beer, since someone somewhere on this call must be at happy hour.
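For readers who want to see the shape of the loop just described, the snippet below is a generic uncertainty-sampling sketch on synthetic data with scikit-learn; it is only an illustration of the label-train-correct cycle, not Amazon's implementation and not the linear quadratic fine-tuning method.

```python
# Generic active-learning loop: train on a few labels, ask the model where it is
# least certain, have the "expert" label those, and retrain. Synthetic data and
# uncertainty sampling stand in for the real images and feedback described in the talk.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = list(range(20))                                  # start with 20 labeled examples
pool = [i for i in range(len(X)) if i not in labeled]      # unlabeled pool

for cycle in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # Predicted probabilities near 0.5 are the most ambiguous examples.
    proba = model.predict_proba(X[pool])[:, 1]
    most_uncertain = np.argsort(np.abs(proba - 0.5))[:25]
    ask = [pool[i] for i in most_uncertain]
    labeled += ask                                         # simulate the expert labeling them
    pool = [i for i in pool if i not in ask]
    print(f"cycle {cycle}: {len(labeled)} labels, accuracy {model.score(X, y):.3f}")
```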
I'll accelerate through this rather than play all of it, but here the task is to identify different beer manufacturers. The user starts by providing some examples in different settings: it could be a glass with a beer label, it could be a box of beer. Eventually they get to the point where they can train a model and start checking, okay, here are seven beer bottles, are they being classified correctly? If it's not working, they start editing, and on the bottom left you actually see the error in real time. This is a recorded real-time demo: as you correct mistakes, the model improves, you see the other images being relabeled as you go, and the test error keeps dropping until you're satisfied. That is a much better experience for improving performance than having to wait hours for each cycle.

One other service I want to mention quickly, for the last part of the talk, is Amazon Lookout for Vision, which also allows users to build models for their specific tasks. The difference here is that it's for anomaly detection, so you can get started with just normal images. Typically, in a classification task, you would say this is class A, dogs, this is class B, cats, and so on, and build a classifier that discriminates between them. Here we are talking about normal versus anything that is an anomaly. For testing you still need anomalous examples, but for training you can start with only normal images; the model learns what normal means, starts predicting abnormal, and then you correct it, again using feedback to improve performance (a minimal sketch of calling such a model appears below).

We have looked at this in terms of just classifying the whole image. In manufacturing scenarios, products are typically moving on a conveyor belt and you basically want to say, something is defective, move it out of the workflow, and keep going. But in a lot of cases, as in healthcare, it's also important to find out where in the image the anomaly is and then start characterizing it: how big is it, what properties does it have, and so on. So these are fresh research results, only a couple of weeks old and not yet published, where we have started looking at how we can classify the image and at the same time segment important structures inside it, basically labeling, at each pixel, which object that pixel belongs to. This again requires the ability to deal with examples that are few and far between, because we don't want users to painfully label thousands of images and draw boundaries or bounding boxes around them. So we do a lot of heavy lifting on the algorithm side to augment the data, and we design the model so it can handle imbalance: you may give a hundred examples of one type of polyp and only two of another, and we have to handle that.

Using this, in the last week we did some experiments with the public Kvasir data set, which we can use for research but not for production. We took that same model, with the same approach I mentioned before: we don't want to build a model specifically for one task, we want to provide a capability that users can take and adapt to their task. In the left panel are results that look good, and again, these might be easy detections to make; that's what's in the data set.
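As an aside before going through those results: for the Lookout for Vision service described above, checking a single image against a trained anomaly-detection model is one API call. A minimal, hedged sketch with the Python SDK follows; the project name, model version, and image file are placeholders.

```python
# Minimal sketch: scoring one image against a trained Amazon Lookout for Vision model.
# The project name, model version, and image path are placeholders for illustration.
import boto3

lookout = boto3.client("lookoutvision", region_name="us-east-1")

with open("frame_0001.jpg", "rb") as image_file:
    response = lookout.detect_anomalies(
        ProjectName="example-anomaly-project",
        ModelVersion="1",
        ContentType="image/jpeg",
        Body=image_file.read(),
    )

result = response["DetectAnomalyResult"]
print("anomalous:", result["IsAnomalous"], "confidence:", round(result["Confidence"], 3))
```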
Back to the results: on this side, the left, we do really well. The first column shows the actual image and the next column shows the detections, with the boundary in green. Toward the right are cases where the model doesn't yet do very well: it gets only part of the boundary (it's hard to see, but the ground-truth boundary is in red and the detection in green), and here we are completely missing one. What we showed is that on this data set, taking just 10% of the training data, we can already get to an F1 score of 83% (F1 is the harmonic mean of precision and recall), and using 50% of the training data we get to 88%, which actually exceeds the currently published state of the art on this data set. So we are encouraged because, one, we did not train on this data, we built models for adaptation, and two, with less training data than would normally be required, we were able to reach the state of the art. And this generalizes to other applications.

I'll skip this video in the interest of time, but the idea is that the model is frame-based: we sample frames without motion blur to train the model, but then you can run it on every frame, because the model runs in real time, and get the results as video. The same model also applies to other use cases. Going back to the diabetic retinopathy example, that's a case where there are different structures in the same image; they're all listed here. We use the same model with a different training set, and it can adapt to this task.

Very quickly, to end, I wanted to talk about what could be coming next, which I believe is going to be even more useful in healthcare. As you all know, data in healthcare is inherently multimodal. You could have a report: you can use Textract to convert a scanned report, and Comprehend or any other natural language service to turn it into structured text, but it's still isolated. Then you have images, possibly different types of images for the same patient in a given study, plus historical exams, and a lot of other data coming in as reports and other metadata. Treating these in isolation, or fusing the different modalities very late, gives minimal advantage. So what we have done is build base models, and you see this a lot in the AI field today, with even OpenAI publishing models that learn joint multimodal representations. We have trained a model that not only combines the text and the image data, but lets you go in and run queries that can be text or images, because there is an alignment between the text representation and the image representation of all of your data that you can use jointly. As early applications, we feel you can use this to search your database, form patient cohorts, and run experiments with those. It also allows searches based on images: show me an image like this from my historical data, and you can then look at those cases and the diagnoses that came with them (a toy sketch of this kind of joint-embedding search follows below). Eventually, though, we think the models I talked about earlier in the presentation will be replaced by these multimodal models. So, to end, I think the democratization of AI is very well underway.
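For illustration only, the snippet below shows what similarity search over a shared text-image embedding space reduces to once such a model exists; the vectors here are random placeholders standing in for embeddings from a trained multimodal encoder, and this is not Amazon's model.

```python
# Toy sketch of retrieval in a joint text-image embedding space: a query vector
# (from a text description or a query image) is compared to stored image vectors
# by cosine similarity. Random vectors stand in for a real multimodal encoder.
import numpy as np

rng = np.random.default_rng(0)
image_index = rng.normal(size=(1000, 512))   # placeholder embeddings for 1000 stored images
query = rng.normal(size=512)                 # placeholder embedding for a text or image query

# Normalize so that a dot product equals cosine similarity.
image_index /= np.linalg.norm(image_index, axis=1, keepdims=True)
query /= np.linalg.norm(query)

scores = image_index @ query                 # similarity of the query to every stored image
top5 = np.argsort(scores)[::-1][:5]          # indices of the five closest images
print("closest matches:", top5, scores[top5].round(3))
```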
The role of scientists like me is to continue to build models that can be used by others, but we are no longer trying to build models for specific tasks all the time. This requires starting with few labeled samples and using techniques that can deal with unlabeled data, take feedback, improve the model, and so on. And again, I feel that multimodal data and those representations are potentially going to enable higher performance in the future. Acknowledgments to my many colleagues whose work I'm representing, and a special shout-out to Dr. Sravanthi Parasa for engaging me and giving me the opportunity to present here. All right. Thank you.
Video Summary
In this video, Dr. Rahul Bhotika, Director of Computer Vision at Amazon Web Services, discusses the state of the art in computer vision and medicine. He introduces various AI and machine learning services offered by Amazon, specifically in the areas of computer vision and health AI. Dr. Bhotika focuses on the concept of custom labels, which allows users to build their own deep learning models without requiring ML expertise. He also discusses the challenges of data labeling and proposes a method to reduce the burden by using feedback loops between the user and the model. Dr. Bhotika further presents two services, Amazon Textract and Amazon Lookout for Vision, that utilize computer vision techniques for tasks such as optical character recognition and anomaly detection. He concludes by highlighting the potential of multimodal representations in healthcare and the democratization of AI.
Asset Subtitle
Rahul Bhotika, PhD
Keywords
computer vision
medicine
AI
custom labels
data labeling
anomaly detection