From the ¹Image Sciences Institute, University Medical Center Utrecht, Utrecht, The Netherlands; the ²Retina Service, Department of Ophthalmology and Visual Sciences, University of Iowa Hospitals and Clinics, Iowa City, Iowa; the ⁴Department of Veterans Affairs, Iowa City VA Medical Center, Iowa City, Iowa; the ⁵Ophthalmology Service, OLVG, Amsterdam, The Netherlands; and the ³Department of Electrical and Computer Engineering, University of Iowa, Iowa City, Iowa.
| Abstract |
|---|
METHODS. Three hundred retinal images from one eye of 300 patients with diabetes were selected from a diabetic retinopathy telediagnosis database (nonmydriatic camera, two-field photography): 100 with previously diagnosed bright lesions and 200 without. A machine learning computer program was developed that can identify and differentiate among drusen, (hard) exudates, and cotton-wool spots. A human expert standard for the 300 images was obtained by consensus annotation by two retinal specialists. Sensitivities and specificities of the annotations on the 300 images by the automated system and a third retinal specialist were determined.
RESULTS. The system achieved an area under the receiver operating characteristic (ROC) curve of 0.95 and sensitivity/specificity pairs of 0.95/0.88 for the detection of bright lesions of any type, and 0.95/0.86, 0.70/0.93, and 0.77/0.88 for the detection of exudates, cotton-wool spots, and drusen, respectively. The third retinal specialist achieved pairs of 0.95/0.74 for bright lesions and 0.90/0.98, 0.87/0.98, and 0.92/0.79 per lesion type.
CONCLUSIONS. A machine learning-based, automated system capable of detecting exudates and cotton-wool spots and differentiating them from drusen in color images obtained in community-based diabetic patients has been developed and approaches the performance level of retinal experts. If the machine learning can be improved with additional training data sets, it may be useful for detecting clinically important bright lesions, enhancing early diagnosis, and reducing visual loss in patients with diabetes.
We and others have described machine learning computer systems capable of detecting red lesions and blood vessels in retinal color photographs with high accuracy (Abràmoff MD et al. IOVS 2006;47:ARVO E-Abstract 3635).8 9 10 11 12 Because retinal exudates can represent the only visible sign of early diabetic retinopathy in some patients, computer-based systems that can detect exudates have been proposed.13 14 15 16 17 However, to diagnose the bright lesions associated with diabetic retinopathy (namely, exudates and cotton-wool spots), the lesions must be differentiated from drusen, the bright lesions associated especially with age-related macular degeneration (AMD), which can have similar appearance,18 as well as from posterior hyaloid reflexes and flash artifacts, which can sometimes mimic bright lesions in appearance. A computer-based system that can detect bright lesions must therefore be capable of differentiating among lesion types, as they have different diagnostic importance and management implications. Extending our previous work on machine learning automated detection of blood vessels and red lesions, we have developed a machine learning algorithm that can detect bright lesions in retinal color photographs and can differentiate among exudates, cotton-wool spots, and drusen.
The purpose of this study was to describe and evaluate this machine learning-based computer algorithm to detect exudates and cotton-wool spots in digital color fundus photographs and differentiate them from drusen. The evaluation was performed on a representative sample of images of patients with diabetes drawn from a telediagnosis project in The Netherlands, with 100 color fundus images containing bright lesions and 200 images with no abnormalities. Because the purpose was to compare the performance of the machine learning algorithm to that of human experts, a human expert reference standard was created by having three masked retinal specialists annotate the sample photographs.
| Methods |
|---|
Fundus Photography
Images were obtained at multiple sites with three different nonmydriatic cameras: the Topcon NW 100, the Topcon NW 200 (Topcon, Tokyo, Japan), and the Canon CR5-45NM (Canon, Tokyo, Japan). The imaging protocol has been published.19 Briefly, digital color photographs were obtained with natural dilation in a dark room, and if natural dilation did not suffice, the pupil was dilated pharmacologically with one drop of tropicamide 0.5% per eye, as per protocol. One disc-centered and one fovea-centered image were obtained for each eye, both at 45° field of view. For the Topcon cameras, spatial resolution was approximately 8 x 8 µm per pixel (2048 x 1536), whereas for the Canon camera, it was approximately 15 x 15 µm (1024 x 768 pixels). All cameras had 1-layer CCD RGB sensors, and images were JPEG compressed at the lowest loss compression setting, resulting in image files of approximately 0.15 to 0.5 MB per image.
Machine Learning Training Images
The machine learning algorithm is a so-called supervised algorithm, and therefore needs a set of annotated lesions to learn how to detect bright lesions and differentiate among them. For this purpose, 130 anonymous images originally read as containing bright lesions were selected. All pixels in all these images were segmented by retinal specialist A as to whether they were (part of) an exudate, cotton-wool spot, drusen or background retina. Vessels, disc, and red lesions, if present, were treated as background retina. The images used to create the training set were not included in the test set (described later). The training set contained 1113 exudates (93,067 pixels), 45 cotton-wool spots (33,959 pixels), and 2030 drusen (287,186 pixels).
Image Data Set for Testing Performance
Three hundred anonymous images were selected as the testing set. One hundred images were selected at random from all images that were originally read clinically as containing one or more bright lesions, and 200 images were selected from all images originally read as containing no bright lesions (no exudates, cotton-wool spots, or drusen). Three masked retinal specialists (designated A, B, and C) annotated all images in random order, indicating whether one or more exudates, cotton-wool spots, or drusen, or any combination thereof, were present. A consensus annotation was then obtained, using a teleconference discussion format, by asking two of the retinal specialists (A, B) to reach consensus on all images where their independent annotations had differed. The consensus annotation was used as the reference standard, and contained 105 images with bright lesions and 195 without. In other words, some images originally thought to be without bright lesions did contain one or more bright lesions according to the consensus reference standard and vice versa. There were 42 images with exudates, 30 with cotton-wool spots, and 52 with drusen in this test set. Some images also contained red lesions (microaneurysms, hemorrhages, microvascular abnormalities, or neovascularizations), and the distribution of the presence of lesions is given in Table 1.
Sensitivity, specificity, and κ of the annotation compared with the consensus standard were calculated at each threshold setting. Sensitivity, specificity, and κ of retinal specialist C compared with the consensus standard were also determined. For the machine learning algorithm, these sensitivity-specificity pairs can be used to create receiver operating characteristic (ROC) curves for all bright lesions, and for the differentiation of exudates, cotton-wool spots, and drusen. The ROC curve shows the sensitivity and specificity at various thresholds, and the system can be set for a specific sensitivity/specificity by selecting the corresponding threshold. For a screening system, sensitivity is more important than specificity, so a threshold could be chosen that maximizes sensitivity while still maintaining enough specificity. Because human experts cannot (consciously) adjust their lesion-detection threshold, only a single sensitivity-specificity pair was obtained for each human grader and plotted in the ROC curve as a point (Fig. 2). The area under the ROC curve is regarded as the most accurate and comprehensive measure of system performance: an area of 1.0 means that a threshold exists at which sensitivity = specificity = 1 and represents perfect detection, whereas an area of 0.5 is the performance of a system that essentially performs a coin toss. The number of images in which the original annotations of all three retinal specialists (designated A, B, and C) and the machine learning algorithm agreed was also determined for all bright lesions as a group and for the three classes of lesions individually.
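The relationship between thresholds, sensitivity-specificity pairs, and the area under the ROC curve can be illustrated with a minimal sketch. The scores and labels below are hypothetical and the function names are ours; this is not the code used in the study, only one common way to compute these quantities (trapezoidal integration over the empirical ROC points).

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive.

    Assumes at least one positive and one negative label is present.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, labels):
    """Return sorted ROC points (1 - specificity, sensitivity), one per distinct score."""
    points = {(0.0, 0.0)}  # threshold above every score: nothing is called positive
    for t in set(scores):
        sens, spec = sensitivity_specificity(scores, labels, t)
        points.add((1.0 - spec, sens))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical per-image scores and reference labels (1 = bright lesion present).
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
demo_labels = [1, 1, 0, 1, 1, 0, 0, 0]
demo_curve = roc_points(demo_scores, demo_labels)
demo_auc = auc(demo_curve)
```

A human grader, who supplies a single yes/no reading per image, corresponds to one call of `sensitivity_specificity` and therefore to one point in the same plot rather than a curve.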
| Results |
|---|
Sensitivity, specificity, and κ of the machine learning algorithm and retinal specialist C, compared with the consensus standard.
The automated system achieved sensitivity/specificity of 0.95/0.88 for the detection of all bright lesions, and 0.95/0.86, 0.70/0.93, and 0.77/0.88 for the detection of exudates, cotton-wool spots, and drusen, respectively. The third retinal specialist achieved 0.95/0.74 for bright lesions, and 0.90/0.98, 0.87/0.98, and 0.92/0.79 for the detection of exudates, cotton-wool spots, and drusen, respectively.
In total, 1739 bright lesions were detected at this optimal threshold setting. Of these 1739 lesions, 1513 were classified correctly, and 226 were classified incorrectly. Of the latter, 124 drusen were misclassified as exudates, 49 exudates were misclassified as drusen, 5 drusen were misclassified as cotton-wool spots, 2 exudates were misclassified as cotton-wool spots, and there were 45 other confused classifications, either misclassifications or nonlesions. Because the consensus standard was determined by two independent retinal specialists, it was useful to determine the κ of their (independent) annotations before the consensus process, and these were 0.80, 0.65, 0.73, and 0.65, respectively.
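Agreement between two graders beyond what chance would produce is commonly summarized with Cohen's kappa. Assuming the agreement values reported above are of that kind, the following sketch shows how such a value is computed from two sets of per-image annotations. The annotations are hypothetical and the function name is ours.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of categorical annotations."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of images where both raters gave the same label.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of each rater's marginal frequency, summed over labels.
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    # Undefined (division by zero) if both raters always use the same single label.
    return (observed - expected) / (1.0 - expected)


# Hypothetical readings for 10 images (1 = bright lesion present, 0 = absent).
demo_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
demo_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
demo_kappa = cohens_kappa(demo_a, demo_b)
```

In this toy case the raters agree on 8 of 10 images, but half of that agreement would be expected by chance, which is why kappa is noticeably lower than the raw agreement of 0.8.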
In 225 of 300 images, the consensus standard, the automated system, and retinal specialist C were all in full agreement on the presence of bright lesions. In 167 of 300 cases, the consensus standard, the automated system, and retinal specialist C were in full agreement on the type(s) of bright lesion. Examples are shown in Figures 3A and 3B. Two examples of cases where human experts did not agree among themselves and the automated system also did not agree with the human experts are shown in Figures 3C and 3D.
| Discussion |
|---|
Differentiating among the three types of bright lesions from color photographs can be challenging, as illustrated by Figures 3A to 3D. The lesions in these images are subtle, and correct differentiation may be improved with knowledge of patient age, consideration of contextual lesions, and the size of lesion classes. Despite the complexity of this task, in 1513 (87%) of 1739 cases, the automated system and the three retinal specialists agreed on the presence and type of lesion. As might be expected, drusen and exudates were easily confused because they are often similar in size, whereas cotton-wool spots are less often confused with the other two types of lesions.
The existing literature has focused almost exclusively on the automatic detection of exudates; only a single study included the detection of cotton-wool spots, and no studies took account of drusen in this context.13 14 15 17 18 20 Our results indicate that, of the three bright lesion types, exudates are the easiest to detect for both the automated system and retinal specialist C. The capacity of our algorithm to detect and differentiate all three types of lesions distinguishes it from prior studies.
Despite the encouraging performance achieved by the present system, several major issues remain. One disadvantage of this study is its simple application. We compare the performance of the machine learning system and human experts on digital fundus photographs obtained in a telediagnosis setting with nonmydriatic cameras. The accepted standard for detection of diabetic retinopathy is seven-field stereo fundus photography by certified photographers and read by certified readers.21 However, the reference standard used here is the consensus reading of the photographs by retinal specialists. This study, and the trained algorithm, may be biased, and some bright lesions may have been missed by both human experts and the automated system because of the limitations of digital two-field nonstereo photography.6 Because we set out to compare the performance of the system to human experts on the same photographs, our results are only valid in that context. To achieve a more comprehensive evaluation of this or a similar system, a comparison of the machine learning algorithm on nonmydriatic nonstereo images, to seven-field nonmydriatic images evaluated by human experts, would be necessary.
A second factor that might limit the performance of the machine learning algorithm is the quality of the annotations of the training images. In other words, the better this training set, the better the theoretical performance of the algorithm. Improving the quality of the training data by using multiple experienced clinicians and a larger number of training pixels may improve system performance and also lessen the problem of clinician interobserver variability, as evidenced in Figures 3C and 3D.
Third, both human readers and algorithms have difficulty detecting retinal thickening, especially in the absence of context lesions such as exudates, beading, or hemorrhages. Judging retinal thickening is especially difficult when using nonstereo fundus photography for early diagnosis.2 6 21 Retinal thickening without context lesions, if present, will be missed by the algorithm. This is a downside of nonstereo fundus photography, whether read by human experts or an algorithm, though the relative frequency of retinal thickening without context signs may be low.22 23
Fourth, the system has been tested on a small number of patients. If tested on a larger prospective data set, performance may not be comparable to these results.
The ROCs in Figure 2 are interesting because they suggest that professional experience or training differences may affect the performance of human experts. Retinal specialist C is an expert on AMD research and drug trials, in addition to being an expert in diabetic retinopathy, while retinal specialists A and B are primarily diabetic retinopathy specialists. As Table 2 depicts, the performance of C compared with the consensus standard on cotton-wool spots and exudates is comparable, but C is much more sensitive to drusen: one possible explanation is C's heightened awareness of subtle AMD lesions.
An important question that cannot be answered by a preliminary study such as the present one is how the current algorithm, capable of detecting and differentiating exudates, cotton-wool spots and drusen, might fit into a complete system for diabetic retinopathy classification and healthcare delivery. We envision several different approaches, one of which would be a system for early detection of diabetic retinopathy in patients with diabetes who currently are not receiving the recommended regular dilated eye examinations. Such a system may require the following subcomponents, many of which we and others have presented previously:
In conclusion, we have developed and tested a machine learning system capable of detecting exudates, cotton-wool spots, and drusen, and differentiating among these, on color images of the retina obtained in a population of patients with diabetes. The sensitivity and specificity of this proof-of-concept system approach those of retinal experts. If such a system can be improved by better quality training data and tested on larger data sets, it has the potential to help prevent visual loss and blindness in patients with diabetes.
| Appendix 1 |
|---|
Potential Bright Lesion Pixel Clusters
By setting the threshold at 60% (pixels with a probability higher than 60% are considered part of a bright lesion and retained) and grouping connected pixels above this threshold, a set of bright lesion pixel clusters is obtained (example: third row of Fig. 3). Because their output is required for further processing, algorithms that perform red lesion classification, optic disc segmentation, and vessel segmentation were applied to the images as we have previously described.8 9 10 26 The part of a bright lesion that overlaps with the optic disc is removed at this step, using the segmented optic disc. From here on, such a bright lesion pixel cluster will be termed a potential lesion.
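The thresholding and grouping step can be sketched as follows. This is a minimal illustration, not the published implementation: the probability map is hypothetical, the function names are ours, and 8-connectivity is an assumption, since the text does not state which neighborhood was used to decide that pixels are connected.

```python
from collections import deque


def pixel_clusters(prob_map, threshold=0.60):
    """Group connected pixels with probability above threshold into clusters.

    prob_map is a list of rows of per-pixel lesion probabilities.
    Connectivity is 8-neighborhood (an assumption, not stated in the text).
    """
    rows, cols = len(prob_map), len(prob_map[0])
    seen = [[False] * cols for _ in range(rows)]
    clusters = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or prob_map[r][c] <= threshold:
                continue
            # Breadth-first flood fill starting from this unvisited bright pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            cluster = []
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx]
                                and prob_map[ny][nx] > threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            clusters.append(cluster)
    return clusters


def remove_disc_overlap(cluster, disc_mask):
    """Drop the pixels of a cluster that fall inside the segmented optic disc."""
    return [(y, x) for (y, x) in cluster if not disc_mask[y][x]]


# Hypothetical 4 x 4 probability map with two bright regions.
demo_map = [
    [0.90, 0.70, 0.10, 0.00],
    [0.20, 0.10, 0.10, 0.00],
    [0.00, 0.10, 0.10, 0.80],
    [0.00, 0.00, 0.65, 0.30],
]
demo_clusters = pixel_clusters(demo_map)
```

With 4-connectivity instead, the two diagonally adjacent pixels in the lower right of `demo_map` would form separate clusters, so the choice of neighborhood affects how many potential lesions are passed on.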
Bright Lesion Detection
The potential lesions include spurious responses, most of which occur along the major vessels and some of which are from posterior hyaloid reflexes. A second kNN classifier was trained by using a set of sample potential lesions extracted from the training set to suppress spurious bright lesion clusters (see Table A1 for the features used). The contrast features measure the contrast of the cluster in multiple RGB color planes of the image, and other features provide information about the size, shape, and contrast of a potential lesion; its proximity to the closest vessel; and its proximity to the closest red lesion, as potential lesions close to a red lesion are more likely to be true bright lesions. Each potential lesion is thereby assigned a probability indicating the likelihood that it is a true bright lesion. The probability that the image contains any type of bright lesion is then given by the maximum probability assigned to any of the potential lesions in the image. For the final classification step, all potential lesions with a probability below 70% were discarded. The value for this threshold was determined on the training set, and the results were not very sensitive to small variations in this threshold (see the bottom row of Fig. 3).
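The three steps in this paragraph (a kNN probability per potential lesion, an image-level probability taken as the maximum, and a 70% cut-off before classification) can be sketched as below. The sketch assumes that the kNN probability is the fraction of the k nearest training samples labeled as true lesions, which is a common convention but is not stated in the text. The value of k, the two-element feature vectors, and all names are hypothetical.

```python
import math


def knn_probability(features, training_set, k=3):
    """Fraction of the k nearest training samples labeled as true lesions.

    training_set is a list of (feature_vector, label) pairs, label 1 = true
    bright lesion, 0 = spurious response. Distance is Euclidean.
    """
    nearest = sorted(
        training_set, key=lambda item: math.dist(features, item[0])
    )[:k]
    return sum(label for _, label in nearest) / k


def image_probability(lesion_probs):
    """Image-level probability: the maximum over all potential lesions."""
    return max(lesion_probs, default=0.0)


def keep_for_classification(lesion_probs, threshold=0.70):
    """Indices of potential lesions that are not discarded (probability >= threshold)."""
    return [i for i, p in enumerate(lesion_probs) if p >= threshold]


# Hypothetical training samples with two made-up features per potential lesion.
demo_training = [
    ((0.90, 0.80), 1), ((0.80, 0.90), 1), ((0.85, 0.70), 1),
    ((0.10, 0.20), 0), ((0.20, 0.10), 0), ((0.15, 0.30), 0),
]
demo_candidates = [(0.88, 0.82), (0.12, 0.18), (0.50, 0.45)]
demo_probs = [knn_probability(f, demo_training) for f in demo_candidates]
demo_image_prob = image_probability(demo_probs)
demo_kept = keep_for_classification(demo_probs)
```

Taking the maximum means a single confident potential lesion is enough to flag the whole image, which suits a screening setting where sensitivity matters more than specificity.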
Bright Lesion Classification
A third classifier was trained using the bright lesion types from the training set. The features described in Table A1 were used, together with the following additional features:
A linear discriminant analysis classifier labeled all found lesions in the test set as to whether they were exudates, cotton-wool spots, or drusen.16
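To illustrate how a linear discriminant analysis classifier assigns one of three labels, the sketch below reduces the problem to a single hypothetical scalar feature. The real system used many features, so this is a deliberately simplified instance: class means, one pooled variance shared by all classes, and class priors estimated from the sample counts. All values and names are made up for illustration.

```python
import math


def fit_lda_1d(samples):
    """Estimate per-class means, a pooled variance, and priors.

    samples maps a class name to a list of scalar feature values.
    """
    n_total = sum(len(values) for values in samples.values())
    means = {k: sum(v) / len(v) for k, v in samples.items()}
    # Pooled within-class variance: LDA assumes all classes share it.
    pooled = sum((x - means[k]) ** 2 for k, v in samples.items() for x in v)
    variance = pooled / (n_total - len(samples))
    priors = {k: len(v) / n_total for k, v in samples.items()}
    return means, variance, priors


def classify_lda_1d(x, means, variance, priors):
    """Return the class with the largest linear discriminant score for x."""
    def score(k):
        return (x * means[k] / variance
                - means[k] ** 2 / (2.0 * variance)
                + math.log(priors[k]))
    return max(means, key=score)


# Hypothetical training values of one made-up feature per lesion type.
demo_samples = {
    "exudate": [0.9, 1.0, 1.1],
    "cotton-wool spot": [2.9, 3.0, 3.1],
    "drusen": [4.9, 5.0, 5.1],
}
demo_means, demo_var, demo_priors = fit_lda_1d(demo_samples)
demo_label = classify_lda_1d(3.4, demo_means, demo_var, demo_priors)
```

With equal priors and a shared variance, the decision boundaries fall midway between neighboring class means, so lesion types whose feature values overlap (as drusen and exudates often do in size) are the ones most easily confused.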
| Footnotes |
|---|
Submitted for publication August 22, 2006; revised December 29, 2006; accepted March 7, 2007.
Disclosure: M. Niemeijer (P); B. van Ginneken (P); S.R. Russell, None; M.S.A. Suttorp-Schulten, None; M.D. Abràmoff (P)
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.
Corresponding author: Michael D. Abràmoff, Retina Service, Department of Ophthalmology and Visual Sciences, University of Iowa Hospitals and Clinics, 200 Hawkins Drive, Iowa City, IA 52242; michael-abramoff{at}uiowa.edu.