ML in Healthcare Digital
We were able to train a deep learning neural network to recapitulate the majority decision of 7 or 8 US
board-certified ophthalmologists in the task of grading for diabetic retinopathy.
The type of deep learning algorithm used to detect diabetic retinopathy in that study is called a
Convolutional Neural Network, or CNN. CNNs enable computer systems to analyze and classify
data. When applied to images, CNNs can recognize that an image shows a dog rather than a cat.
They can recognize the dog whether it's a small part or a large part of the picture - size doesn't
matter for this technique. They can also classify the dog by breed. CNN systems have also been
developed to help clinicians do their work including selecting cellular elements on pathological
slides, correctly identifying the spatial orientation of chest radiographs, and, as Dr. Peng mentioned,
automatically grading retinal images for diabetic retinopathy.
So let's open the deep learning black box to understand how this works.
First, a CNN is not one process. It's actually a complex network of interconnected processes,
organized in layers. With each layer, the CNN can detect higher-level, more abstract features.
When the CNN is identifying these features, it uses something called a filter. Here's how Larry
Carin, one of the authors of a JAMA Guide to Statistics and Methods article on CNNs, describes a
filter: So, we think about a medical image, a medical image in radiology or ophthalmology or
dermatology is characterized by local structure, could be textures, it could be edges, it could be
curves, corners, etc. And what these filters are doing are constituting little miniature versions of
each of these little building blocks. And the way that the CNN looks for these building blocks is the
C in CNN, and it stands for convolution. It's a mathematical operation that looks pretty complex.
But, actually, it's very simple. It's a very simple concept. It's kind of like you've got this filter, and
you're walking to every part of the image, and you're just asking the question, how much does this
image look like that filter?
Think of it like this: you have a drawing, that's the image, and you have a stencil, that's the filter.
You take that stencil and pass that stencil over that drawing that you have, and as you do that you
will see that some parts of the drawing become more visible than others as you do that, right?
And that process of sliding that stencil across this drawing is essentially the process of convolution.
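The stencil idea above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the method used in the study: the 5x5 "drawing", the plus-shaped "stencil", and the function name slide_filter are all made up for the example. Strictly speaking the loop computes a cross-correlation, which is the operation most deep learning libraries call convolution.

```python
import numpy as np

def slide_filter(image, filt):
    """Slide filt over image and score how much each patch looks like it."""
    ih, iw = image.shape
    fh, fw = filt.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + fh, c:c + fw]
            # Multiply patch and filter element by element, then add up.
            out[r, c] = np.sum(patch * filt)
    return out

# Hypothetical 5x5 "drawing" containing a plus-shaped mark.
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

# Hypothetical 3x3 "stencil" with the same plus shape.
stencil = np.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
], dtype=float)

response = slide_filter(image, stencil)
# Position of the strongest response, as plain Python integers.
best = tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))
print(response)
print("strongest match at", best)
```

The response is largest where the stencil sits exactly over the mark and smaller everywhere else, which is the "how much does this image look like that filter?" question asked at every position.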
Now that we've explained what a filter is and introduced the concept of convolution, let's use an
analogy of written language to understand the relationship between the filters and the hierarchical
structure of the layers in a CNN.
The analogy is a written document. In order to communicate through writing, we organize it as a
series of paragraphs, which are composed of sentences, those sentences are composed of words, and
the words of letters.
So reading a document requires assessing the relationship of letters to one another in increasing
layers of complexity, which is a kind of "deep" hierarchy, like the hierarchy in image analysis.
Continuing with our analogy, let's say we're looking for the phrase Ada Lovelace in a paragraph.
Ada Lovelace was a mathematician and writer who lived in the 19th century.
And she holds the honor of having published the very first algorithm intended to be used by a
machine to perform calculations, which makes her the first ever computer programmer.
In the first layer of the network, a CNN looks for the basic building blocks of an image. The basic
building blocks of written language are letters. So in this analogy, the filters the CNN uses in the
first layer would be letters.
Let's zoom in on the word "Ada."
Here is what the convolution process would look like for the letter A. When the "A" filter overlies
the letter "A" in the original image, the convolution output would generate a strong signal.
This signal would then be mapped onto something called a feature map. The feature map represents
how well elements in the image align with the filter. If something is there, the signal outputs white.
If nothing is there, the signal outputs black. CNNs generate a feature map for every filter. So in our
analogy, there would be a feature map for every letter. These feature maps would then become the
input for the second layer.
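As a rough sketch of that first layer in the letter analogy, the Python below builds one feature map per letter. It assumes a simplification that is not in the original explanation: each character is stored as a one-hot row (a single 1 in the column for that letter), so a letter filter gives a signal of 1 ("white") where the letter occurs and 0 ("black") elsewhere. ALPHABET, one_hot, and letter_feature_map are hypothetical names.

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def one_hot(text):
    """Encode text as a (positions x letters) grid with a single 1 per row."""
    grid = np.zeros((len(text), len(ALPHABET)))
    for pos, ch in enumerate(text):
        grid[pos, ALPHABET.index(ch)] = 1.0
    return grid

def letter_feature_map(text, letter):
    """Layer 1: signal of 1 where `letter` occurs in text, 0 elsewhere."""
    filt = one_hot(letter)[0]      # the filter for a single letter
    return one_hot(text) @ filt    # compare the filter with every position

text = "ADA LOVELACE"
letters = sorted(set(text) - {" "})
# One feature map per letter filter, as in the analogy.
feature_maps = {letter: letter_feature_map(text, letter) for letter in letters}
a_map = feature_maps["A"]
print(a_map)
```

Each entry in feature_maps has the same length as the text, which is what allows the next layer to line the maps up and stack them.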
In this layer, the CNN would spatially align and "stack" all those maps from the previous layer.
This would allow the CNN to then look for short, specific sequences of letters in all the feature
maps simultaneously. So the CNN would use a new set of filters to look for specific letters that are
adjacent to one another in particular sequences.
In our analogy, the second layer would look for places where the letters A, D, and A are in sequence
together making the word "ADA". It would also look for places where letters A, C, E, L, O and V
are adjacent to one another using filters for LOVE and LACE.
The output of the second layer would be the feature maps for those three sequences of letters.
In other words, in those feature maps, strong signals would be present where the sequences ADA,
LOVE and LACE are located in the original paragraph.
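The second layer of the analogy can be sketched the same way. In this hedged Python example, the stacked letter maps are a (positions x letters) grid, and each second-layer filter covers several adjacent positions across all of the maps at once. The sample sentence, the function names, and the rule "keep a signal only when every letter lines up" are assumptions made for illustration; a trained CNN would learn its filter values and thresholds from data.

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def first_layer(text):
    """Stack one feature map per letter: shape (positions, letters)."""
    stacked = np.zeros((len(text), len(ALPHABET)))
    for pos, ch in enumerate(text):
        stacked[pos, ALPHABET.index(ch)] = 1.0
    return stacked

def sequence_feature_map(stacked, sequence):
    """Layer 2: slide a multi-letter filter across all stacked maps at once."""
    filt = first_layer(sequence)   # spans len(sequence) positions x all letter maps
    width = len(sequence)
    n_out = stacked.shape[0] - width + 1
    scores = np.array([np.sum(stacked[i:i + width] * filt) for i in range(n_out)])
    # Simplifying assumption: strong signal only where every letter matches.
    return (scores == width).astype(float)

text = "ADA LOVELACE WAS A MATHEMATICIAN"
stacked = first_layer(text)
second_layer = {seq: sequence_feature_map(stacked, seq) for seq in ("ADA", "LOVE", "LACE")}
# Positions where each sequence starts in the original text.
hits = {seq: np.flatnonzero(fmap).tolist() for seq, fmap in second_layer.items()}
print(hits)
```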
In the third layer, the CNN would stack and align these three new maps and perform more
convolutions, this time identifying where longer words and groups of words are located.
So the CNN could at this point identify where in the original paragraph the sequences of letters and
words making the phrase "ADA LOVELACE" are located. In our analogy, we were looking for a
phrase consisting of only two words. Had we been looking for a longer sentence or even a
paragraph, the CNN would deal with the greater complexity by having more layers.
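To make the third layer concrete, here is a small Python sketch under stated assumptions. The helper second_layer_maps stands in for the first two layers by marking where ADA, LOVE, and LACE start; the third-layer filter then looks across the three stacked maps for ADA, followed by LOVE four positions later and LACE eight positions later. The test sentence and all names are hypothetical.

```python
import numpy as np

SEQUENCES = ("ADA", "LOVE", "LACE")

def second_layer_maps(text):
    """Stand-in for layers 1 and 2: one map per sequence, 1.0 where it starts."""
    maps = np.zeros((len(text), len(SEQUENCES)))
    for k, seq in enumerate(SEQUENCES):
        for i in range(len(text) - len(seq) + 1):
            if text[i:i + len(seq)] == seq:
                maps[i, k] = 1.0
    return maps

# Layer-3 filter over the stacked maps: ADA at offset 0,
# LOVE at offset 4, and LACE at offset 8.
phrase_filter = np.zeros((9, len(SEQUENCES)))
phrase_filter[0, 0] = 1.0
phrase_filter[4, 1] = 1.0
phrase_filter[8, 2] = 1.0

def third_layer(maps, filt):
    """Slide the phrase filter over the stacked second-layer maps."""
    span = filt.shape[0]
    n_out = maps.shape[0] - span + 1
    scores = np.array([np.sum(maps[i:i + span] * filt) for i in range(n_out)])
    # Simplifying assumption: signal only where all three parts line up.
    return (scores == filt.sum()).astype(float)

text = "LOVE AND LACE ALONE DO NOT MATCH ADA LOVELACE"
phrase_map = third_layer(second_layer_maps(text), phrase_filter)
hits = np.flatnonzero(phrase_map).tolist()
print(hits, text[hits[0]:hits[0] + 12])
```

The isolated LOVE and LACE near the start of the sentence produce second-layer signals but no third-layer signal, because the filter only responds when the three parts appear in the right relative positions.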
We've omitted quite a few details about CNNs for simplicity, but this captures the essence of the
model. But what does this look like for actual images, like identifying diabetic retinopathy from an
ocular photograph? Images are made out of pixels rather than letters. In a digital context, a pixel is
the smallest, controllable unit of an image represented on a display. Each pixel is a representation of
a tiny portion of the original image. Think about pixels like creating a drawing with dots where
every dot has a color value and an intensity.
The more dots used, the clearer the image becomes. The filters a CNN uses in that first layer are
small squares of pixels that correspond to things like textures, contrast between two colors, or
edges. These are the image analysis-equivalents of the letters used in our analogy.
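For pixels, a first-layer filter might look like the sketch below. The 6x6 grayscale image (dark left half, bright right half) and the 3x3 edge filter are invented for illustration; in a trained CNN the filter values are learned rather than written by hand.

```python
import numpy as np

# Hypothetical 6x6 grayscale image: dark left half (0.0), bright right half (1.0).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hypothetical 3x3 filter that responds to a dark-to-bright vertical edge.
edge_filter = np.array([
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
])

def feature_map(image, filt):
    """Slide the filter over the image and record the response at each position."""
    fh, fw = filt.shape
    rows = image.shape[0] - fh + 1
    cols = image.shape[1] - fw + 1
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

edges = feature_map(image, edge_filter)
print(edges)
```

The feature map is zero over the uniform dark and bright regions and positive only in the columns where the filter straddles the boundary, so it marks where the edge is.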
And as a CNN goes up in the hierarchy, it looks for combinations of these filters, getting more and
more complex with each layer. As the complexity increases, the CNN gets closer to identifying what
it's looking for. So the specific features analyzed at each layer help put the whole thing together.
So, for example, some of the earlier work showed that some layers tend to be better at extracting,
sort of like, edge-like information. Meaning that, for example, if you combine different kinds of
horizontal edges, we might get a continuous line that resembles the retinal blood vessels.
And as you combine more of those and start to encode more higher-level concepts such as, you
know, is there a micro-aneurysm here, is there bleeding over here, is there other lesions in the
image? And right at the very end is where these, after these multiple layers, the network will try to
then condense all of that information down into a final prediction.
In this case, severe diabetic retinopathy. Developing a CNN to help identify diabetic retinopathy
was motivated because many patients with diabetes are not getting screened frequently enough.
We have to screen diabetic patients once a year or we should, and there are some barriers to getting
that done. Some of it is just, you know, not having enough trained professionals to do that task.
It's also not having that expertise available where the patient is. It's not that, you know, there aren't
retina specialists in a metropolitan city four hours away, it's that there isn't a retina specialist at your
grocery store. And CNNs could facilitate the integration of diabetic retinopathy and other screening
programs into primary care. But before that happens, more research, especially prospective clinical
trials, is needed. The way we do approach these things is really the way that medicine usually
works, which is to say, "let's do validations of the method again and again and again until we're
sure, we're reasonably confident that it really works on many kinds of images, in many settings for,
you know, many different patient populations." And so from my perspective that's really at the end
of the day what's most important: does it work on real patients and is it reliable?
The excitement generated by early results has already spurred several research groups to look into
the efficacy of CNNs in clinical practice, which could potentially finally get CNNs from the bench
to the bedside. I think we're on the third or fourth technological revolution where neural networks
are coming to the forefront, and I really hope that this time we'll get it right.
But there were failures in the past where people used the technology in suboptimal ways and we
don't want it to happen again. One has to make sure that we have appropriate and sufficient data for
development, validation and testing, and that we're solving actual clinical problems.
At the end of the day, one thing to take away is that even if, as a clinician, it can be hard to
understand exactly how a CNN arrives at its diagnosis, it can still be a useful tool.
And this is similar to how many clinicians use other widely adopted technologies. Consider
antibodies: You know, as a clinician I may not know exactly where that part of an antibody kind of
binds to, but I'm comfortable after looking at some of this clinical validation of using Lucentis, for
example, for an injection, right.
This is kind of like any new breakthrough technology: needs validation and needs transparency, but
I think, you know, the medical community in general responds very well to new technologies that
have been validated.