Showing posts with label image-processing. Show all posts

Saturday, June 24, 2023

BMI 702 Review Part IV -- Biomedical Imaging

Here is Part IV of my ongoing review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundation of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous reviews in this series, they are listed below.

This review covers Module 5 of the course (weeks 10 and 11) and is devoted to the use of Computer Vision techniques to address Biomedical Imaging use cases. There are 9 papers and 2 book chapters, 6 in the first week and 5 in the second. I have some interest in Computer Vision models, having built an Image Classifier by fine-tuning a ResNet pre-trained on ImageNet to predict the type of medical image (radiography, pathology, etc) referenced in medical text, and more recently, having fine-tuned an OpenAI CLIP model on medical image and caption pairs to provide text-to-image and image-to-image search capabilities. However, all of these papers have a distinctly medical flavor, i.e. they directly address the needs of doctors, radiologists and pathologists in their day to day work, using data that is typically only found in hospital settings. While a large number of these papers deal with supervised learning, some use semi-supervised or weakly-supervised strategies that require some adaptation of already available data, which in turn requires you to know about the existence of said data to come up with the idea in the first place. But I thought they were very interesting in a "broaden my horizons" kind of way.

Module 5 Week 1

Dermatologist-level classification of skin cancer with deep neural networks (Esteva et al, 2017)

This is one of many landmark events where a neural network achieves superhuman performance at a particular task – in this case, classifying a variety of skin cancers from smart phone photos of lesions. It is also covered in the What-Why-How video for this week. The paper itself is paywalled, and Google Scholar only finds presentation slides by the primary author for a GPU Tech 2017 conference. The paper describes an experiment where a GoogleNet Inception V3 CNN, pre-trained on ImageNet data, was further fine-tuned on 129,450 clinical images of skin lesions spanning 2,032 different diseases. The diseases were further organized into a hierarchy via a taxonomy. Classifiers were constructed to predict one of 3 disease classes (first level nodes of the taxonomy – benign, malignant and non-neoplastic) and one of 9 disease classes (second level nodes), and their outputs compared to those of human experts on a sample of the dataset. In both cases, the trained classifier out-performed the humans. Later experiments with a larger number of disease classes and biopsy-proven labels performed even better: the AUC for the sensitivity-specificity curve was 0.96. The performance of the CNN at predicting Melanoma (with photos and dermoscopy) and Carcinoma was then compared with the predictions of 21 board certified dermatologists, and was found to beat their performance on average. Finally, to test the classifier encodings, the last hidden layer of the CNN was reduced to two dimensions using t-SNE and found to cluster well across four disease categories, as well as for individual diseases within each category. In addition to the good results obtained, the paper is important in that it demonstrates an approach to detect skin cancer cheaply and effectively compared to previous approaches (dermoscopy and biopsy), thereby saving many people from death and suffering.

Toward robust mammography based models for breast cancer risk (Yala et al, 2021)

This paper describes the Mirai model to predict the risk of breast cancer at multiple timepoints (1-5 years), using mammogram images (4 standard perspectives) and optionally, additional non-image risk factors such as age and hormonal factors. If the additional risk factors are not provided, Mirai predicts them from the aggregated vector representation of the mammograms. The risk factors (predicted or actual) are then combined with the mammogram vector to predict the risk of breast cancer. Mirai used data collected by Massachusetts General Hospital (MGH), representing approximately 26k exams, split 80/10/10 for training, validation and testing. The resulting model was tested against established risk models such as Tyrer-Cuzick v8 (TCv8) and other SOTA image based neural models, with and without additional risk factors. The latter models were also trained on the MGH data. Mirai was found to outperform them using the C-index (a measure of concordance between label and prediction) and AUC at 1-5 year intervals as evaluation metrics. The model was then evaluated against 19k and 13k exams from the Karolinska Institute (Sweden) and CGMH (Taiwan) respectively and had comparable performance on both. It was also tested on ethnic subgroups and was found to perform equally well across all groups. It also outperformed the industry standard risk models at identifying high risk cohorts. The paper concludes by saying that Mirai could be used to provide more sensitive screening and achieve earlier detection for patients who will develop breast cancer, while reducing unnecessary screening and over-treatment for the rest.

Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning (Tiu et al, 2022)

This paper describes training a multi-modal CLIP model CheXzero, that learns an embedding using 377k chest X-rays and their corresponding raw radiology reports from the MIMIC-CXR dataset, which is then used to predict pathologies (indications of different diseases) of the lung for unseen chest X-rays. This is done by generating positive and negative prompts for each pathology of interest. The model uses the positive and negative scores to compute the probability of the presence of the pathology in the chest X-ray. The performance of CheXzero is comparable to that of a team of 3 board-certified radiologists across 10 different pathologies. CheXzero also outperforms previous label efficient methods, all of which require a small fraction of the dataset to be manually labeled to enable pathology classification. CheXzero can also perform auxiliary tasks such as patient gender detection that it was not explicitly trained for. The trained CheXzero model (trained on MIMIC-CXR) also performed well on other chest X-ray datasets such as PadChest, showing that the self-supervised approach can generalize well.
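The positive/negative prompt scoring can be sketched as a two-way softmax over CLIP-style similarities. The function below is a minimal illustration of the idea, not CheXzero's actual code, and the embeddings are toy values:

```python
import numpy as np

def pathology_probability(img_emb, pos_prompt_emb, neg_prompt_emb):
    """Probability that a pathology is present, from the image embedding's
    similarity to a positive prompt (e.g. "consolidation") vs a negative
    prompt ("no consolidation"). Embeddings are assumed L2-normalized,
    as CLIP-style encoders produce."""
    pos = float(np.dot(img_emb, pos_prompt_emb))
    neg = float(np.dot(img_emb, neg_prompt_emb))
    # A two-way softmax turns the similarity pair into a probability.
    return np.exp(pos) / (np.exp(pos) + np.exp(neg))

# Toy 4-d embeddings just to exercise the function.
img = np.array([1.0, 0.0, 0.0, 0.0])
pos = np.array([0.9, 0.1, 0.0, 0.0]); pos /= np.linalg.norm(pos)
neg = np.array([0.0, 1.0, 0.0, 0.0])
p = pathology_probability(img, pos, neg)
print(p > 0.5)  # image is closer to the positive prompt
```

The same trick generalizes to any label the model was never trained on, which is what makes the zero-shot setup label efficient.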

International Evaluation of an AI System for Breast Cancer Screening (McKinney et al, 2020)

The paper describes a Deep Learning pipeline which is fed mammogram X-rays taken from 4 standard perspectives and which predicts if the patient would get breast cancer in 2-3 years. Two datasets were used, a larger one from the UK consisting of mammograms from 25k women used for training the model, and a smaller test set from the US for 3k women. The system (for which no code is shared nor any technical information provided) is claimed to achieve better performance at breast cancer detection than a team of 6 human radiologists. The model was found to generalize across datasets, since it was trained on UK data and evaluated on US data. When the system was used to screen out initial mammograms before manual verification by a human radiologist (a double-reading scenario), it achieved an 88% increase in throughput. Thus such a system could be useful for providing automated immediate feedback for breast cancer screening, as well as a first step in the double reading scenario, as an assistive tool for human radiologists.

The new era of quantitative cell imaging – challenges and opportunities (Bagheri et al, 2021)

The paper compares the evolving popularity of optical microscopy with the enormous success of genomics a few years earlier, and argues that quantitative optical microscopy has the potential to make similar contributions to the biomedical community. While the origins of optical microscopy are rooted in the 19th century, recent breakthroughs in this technology (notably high resolution and high throughput light microscopy, among others), along with advances in deep learning that facilitate analysis of images at greater scale, indicate that there is significant convergence of approaches positioning optical microscopy as a viable candidate for biomedical data science. The idea is that rather than have optical microscopy contribute a small volume of highly curated images to a research project, it would be treated as a computational science where a large quantity of standardized images is generated over time, which could then provide insights based on statistical analysis and machine learning. The article then goes on to describe the challenges that the field must overcome, namely standardization of techniques to enable reproducibility within and across different labs, and the storage of, and FAIR (findable, accessible, interoperable and reusable) access to, the potentially terabytes of image data generated. It also describes several initiatives under way within the biomedical community to address these challenges.

Data-analysis strategies for image-based cell profiling (Caicedo et al, 2017)

This paper highlights strategies and methods to do high throughput quantification of phenotypic differences in cell populations. It can be seen as an extension to the previous paper that outlined the challenges and opportunities in this field. It proposes a workflow composed of the following steps – image analysis, image quality control, preprocessing extracted features, dimensionality reduction, single-cell data aggregation, measuring profile similarity, assay quality assessment and downstream analysis. Image Analysis transforms a population of digital cell images into a matrix of measurements, where each image corresponds to a row in the matrix. This stage often includes illumination correction, segmentation and feature extraction. The Quality Control step consists of computing metrics to detect cell quality at both the field-of-view and cell levels. The Preprocessing step consists of removing outlier features or cells, or imputing values for features based on the rest of the population. A notable operation in this stage is plate-level effect correction, which involves addressing edge effects and gradient artifacts across different plates of assays. We also do feature transformation and normalization in this step, such that the features have an approximately normal distribution. The next step is Dimensionality Reduction, where the aim is to retain or consolidate features that provide the most value in answering the biological question being studied. The Single Cell Data Aggregation step consists of using various statistical measures (mean, median, Kolmogorov-Smirnov (KS)) on the feature distribution to create an "average" cell. Clustering or Classification techniques are used to identify sub-populations of cells. The next step, Measuring Profile Similarity, measures and reveals similarities across the different profiles identified.
At this point we are ready for the Assay Quality Assessment step, where we evaluate the quality of the morphological profiling done during the previous steps. The final step is Downstream Analysis, where the morphological patterns found are interpreted and validated. The paper is extraordinarily detailed and contains many techniques that are suitable not only for image based cell profiling, but for feature engineering in general. Data used for illustrating the workflow comes from the BBBC021 (Broad Bioimage Benchmark Collection) image collection of 39.6k image files of 113 small molecules, and the authors provide example code in the github repo cytomining/cytominer.
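As an example of the feature transformation and normalization step, a common robust choice (one option among several the paper discusses, not necessarily its recommendation) is the median/MAD z-score, which is resistant to the outlier cells the workflow tries to handle:

```python
import numpy as np

def robust_zscore(features: np.ndarray) -> np.ndarray:
    """Normalize a cells-x-features matrix per feature using the median and
    median absolute deviation (MAD) instead of mean/std, so that a handful
    of outlier cells does not distort the scaling."""
    med = np.median(features, axis=0)
    mad = np.median(np.abs(features - med), axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # guard against constant features
    # 1.4826 makes MAD a consistent estimator of std under normality.
    return (features - med) / (1.4826 * mad)

# 4 cells x 2 features; the last cell is an outlier in feature 0.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]])
Z = robust_zscore(X)
print(Z.shape)  # (4, 2)
```

The outlier cell ends up with a very large score in the normalized matrix, which is exactly what the preprocessing step's outlier removal would then pick up.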

Module 5 Week 2

Chapter 10 of Artificial Intelligence in Medical Imaging (Imaging Biomarkers and Imaging Biobanks) (Alberich-Bayarri et al, 2019)

The chapter discusses challenges to the adoption of image analytics into clinical routine. Although efforts are under way to standardize the production of imaging biomarkers, they still have a long way to go. In addition, the biomarkers have to show efficacy in treatment response, which in turn should be confirmed via medical theory, through correlation with disease hallmarks. This allows imaging biomarkers to serve as surrogate indicators of relevant clinical outcomes. Finally, acquiring image biomarkers needs to be cost efficient. The chapter covers the general methodology for development, validation and implementation of imaging biomarkers. In order to be effective, such data would then need to be stored in imaging biobanks, either population or disease focused, so that it can be effectively shared within the community and thus provide maximum value.

Deep Learning-based Computational Pathology Predicts for Cancers of Unknown Primary (Lu et al, 2020)

This paper addresses the problem of predicting the primary site for Cancers of Unknown Primary (CUP), which cannot be determined easily for some patients. Addressing the cancer with generic therapies without determining the source results in low survival. It is possible to find the primary site using extensive diagnostic work-up spanning pathology, radiology, endoscopy, genomics, etc, but such diagnostic procedures are not possible for patients in low resource settings. The paper describes the Tumor Origin Assessment via Deep learning (TOAD) system that predicts whether the cancer is primary or metastatic, and the primary site, based on whole-slide images (WSIs) of histopathology slides. TOAD was trained on 17.5k WSIs and achieved impressive results for top-3 and top-5 accuracy on the test set, and generalizes well, with comparable results on WSIs from a different hospital. TOAD uses a CNN architecture which is trained jointly to predict both whether the cancer is primary or metastatic, and the primary site of the cancer (14 classes). For explainability, TOAD can generate attention heatmaps to indicate which parts of the slides are indicative of the predicted cancer. TOAD was also tested against WSIs for which the labels were not known initially but were found later, during autopsy. The high accuracies of the top-3 and top-5 predictions mean that physicians can narrow the scope of their diagnostic tests and treatments, thus resulting in more efficient use of medical resources. This paper is also covered in the What-Why-How video for the week.

Chapter 13 from Artificial Intelligence in Medical Imaging (Cardiovascular Diseases) (Verjans et al, 2019)

This chapter covers the use and applicability of various medical imaging techniques to diagnose and treat Cardiovascular diseases, in specialty areas such as Echocardiography, Computed Tomography (CT), Magnetic Resonance Imaging (MRI) and Nuclear Imaging (PET). It also discusses predictive applications that can combine information from multiple sources, including imaging. The impact of AI in Cardiovascular imaging has so far been mainly in image interpretation and prognosis, but it has the potential to impact the entire imaging pipeline – choosing a test per the guidelines, patient scheduling, image acquisition, reconstruction, interpretation and prognosis. Deep Learning techniques have been applied in the MRI space to reconstruct accelerated MR images, as an alternative to compressed sensing, and research efforts show reconstruction of high quality CT images from low radiation noisy images. Deep Learning techniques have also been applied during image post-processing, such as automatically computing ejection fractions or cardiac volumes from CTs. In the near future, we expect that ML applications will generate diagnostics from images. In terms of prognosis, DL/ML approaches using medical imaging are expected to increase the quality of healthcare by detecting problems faster and more cheaply. There is also scope for combining insights from medical imaging with other sources of information, such as genetic or social factors, to make better medical decisions. The chapter continues with a discussion of specific practical uses of AI in different cardiovascular imaging scenarios in each of the specialty areas listed above. The chapter also discusses the Vendor Neutral AI Platform (VNAP) to help with rapid adoption of AI based solutions in Medical Imaging.

Artificial Intelligence in Digital Pathology – new tools for diagnosis and precision oncology (Bera et al, 2019)

The paper describes how the digitizing of whole-slide images (WSI) of tissue has led to the rise of AI / ML tools in digital pathology that can assist pathologists and oncologists in providing better and more timely treatment. The rise of Deep Learning and computation power over the last two decades has given rise to many different applications in these areas. For pathologists, the primary application is the identification of dominant morphological patterns that are indicative of certain diseases, and for oncologists, it is the identification of biomarkers that are indicative of a type of cancer and the stage it is in. These are both complex tasks with high variability, so it usually takes years of specialization to do them effectively. AI based approaches are robust and reproducible, and achieve a similar level of accuracy as human experts. When used in tandem with a human expert, they can significantly cut down the expert's workload and make them more efficient, or serve as a confirmation (like a second opinion). These AI applications have been used in diagnostic applications such as differentiating between WSIs of malignant vs benign breast cancer tissue, and prognostic applications such as the ability to detect tumor infiltrating lymphocytes, which are indicative of 13 different cancers, or the ability to predict recurrence of lung cancer by the arrangement of cells in WSIs. They have also been used in drug discovery and development, by identifying patients who are more likely to respond to certain treatments using WSIs of their nuclear or peri-nuclear features. DL architectures typically used in these applications are CNNs, FCNs (sparse features, e.g. detecting cancerous regions in histopathology images), RNNs (to predict risk of disease recurrence over time), and GANs (segmenting out specific features from histopathology images, converting one form of tissue staining to another, etc).
Challenges to clinical adoption of these techniques include regulatory roadblocks, quality and availability of training data, the interpretability of these AI models, and the need to validate these models sufficiently before use.

Data-efficient and weakly supervised computational pathology on whole-slide images (Lu et al, 2021)

The paper describes an attention mechanism called Clustering-constrained Attention Multiple Instance learning (CLAM) which is used to identify regions of interest (ROI) in whole slide images (WSI). WSIs are plentiful but are labeled with slide level labels, which are not as effective for classification tasks as manually labeled ROIs. CLAM applies an attention mechanism across the whole slide and is very effective at finding ROIs, which can then be extracted and used for various tasks; this has proven to be more effective than treating all pixels in the slide as having the same label. CLAM has been applied to the tasks of detecting renal cell carcinoma, non-small-cell lung cancer and lymph node metastasis, and has been shown to achieve high performance with a systematically decreasing number of training labels. CLAM can also produce interpretable heatmaps that allow the pathologist to visualize the regions of tissue that contributed to a positive prediction. CLAM can also be used to compute slide level feature representations that are more predictive than raw pixel values. CLAM has been tested with independent test cohorts and found to generalize across data specific variants, including smartphone microscopy images. Weakly supervised approaches such as CLAM are important because they leverage abundant weak WSI labels to provide labeled ROIs of slide subregions, which in turn can produce more accurate predictive models of computational pathology.

That's all I have for today. I hope you found this useful. In my next review, I will review the paper readings for Module 6 (Therapeutic Science).

Sunday, October 17, 2021

Fine-tuning OpenAI CLIP for different domains

In July this year, a group of us on the TWIML Slack Channel came together and participated in the Flax/JAX Community Week organized by Hugging Face and Google Cloud. Our project was about fine-tuning the CLIP Model from OpenAI with the RSICD (Remote Sensing Image Captioning Dataset), and ended up placing third.

The code for the project is available on github at arampacha/CLIP-rsicd if you are curious about how we went about doing this, or if you want to replicate our efforts. Our fine-tuned model is available on the Hugging Face model repository at flax-community/clip-rsicd-v2, where you can find instructions on how to use it for inference on your own remote-sensing / satellite data. We also have a Streamlit based demo that shows its application in image search and finding features in images using text descriptions. Finally, we also have a blog post on the Hugging Face blog titled Fine tuning CLIP with Remote Sensing (Satellite) images and captions. Hope you find these useful, do check them out.

Even before this project, I had been considering learning a joint embedding for medical images and their captions as described in the Contrastive Learning of Medical Visual Representations from Paired Images and Text (CONVIRT) paper by Zhang et al (2020), and using it to power a text-to-image image search application. Based on the RSICD project, however, CLIP looked like a better and more modern alternative.

Elsevier has a Dev-10 program for their engineers, by which they are given 10 working days (2 weeks) to build something that does not necessarily have to align with company objectives, but which is somewhat work-related. When my Dev-10 days came up in early September, I used them to fine-tune the same OpenAI CLIP baseline as we did for the Flax/JAX community week, but with the ImageCLEF 2017 Image Captioning dataset. Happily, the results were just as encouraging as fine-tuning it with RSICD; if anything, the improvement was even more dramatic.

During the RSICD fine-tuning exercise, the fine-tuning work was done by other members of the team. My contribution to that project was the evaluation framework, the image augmentation piece, the demo, and later the blog post. On the ImageCLEF exercise, I was the only developer, so while a lot of the code in the second case was borrowed or adapted from the first, there were some important differences as well, apart from the dataset.

First, in the RSICD fine-tuning case we used JAX/Flax with a TPU enabled instance on Google Cloud, and in the second I used Pytorch on a single-GPU EC2 instance on AWS (with the Deep Learning AMI). I found that the Hugging Face wrapper for CLIP provides a lot of the support that we had previously implemented explicitly, so I tried to leverage the provided functionality as much as possible, resulting in slightly cleaner and more readable code (even if I say so myself :-)).

Second, I didn't do any image or text augmentation like we did with the RSICD fine-tuning effort. RSICD had a total of 10k images with approximately 5 captions per image, of which we were using about 7k for training. On the other hand, ImageCLEF was about 160k images and captions, of which we were using 140k for training. In addition, RSICD was training on a TPU with 4 parallel devices, and ImageCLEF was training on a single GPU. Because of this, I ended up using subsampling from the training set as a form of regularization instead, and using early stopping to terminate the training process once no improvements in validation accuracy were detected.

Third, with the benefit of hindsight, I settled on a more industry-standard metric for evaluation, the Mean Reciprocal Rank (MRR@k) compared to the less strict and somewhat ad-hoc Hits@k metric I had used for the first exercise.
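For reference, MRR@k can be computed in a few lines of plain Python. This is a generic sketch of the metric, not the project's actual evaluation code:

```python
def mrr_at_k(ranked_results, relevant, k=10):
    """Mean Reciprocal Rank at k: for each query, take 1/rank of the first
    relevant hit within the top k results (0 if none appear), then average
    over all queries. ranked_results is a list of ranked id lists, one per
    query; relevant is a list of sets of relevant ids, one per query."""
    total = 0.0
    for hits, rel in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(hits[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Query 1 finds its relevant doc at rank 1, query 2 at rank 2.
score = mrr_at_k([["a", "b"], ["c", "d"]], [{"a"}, {"d"}], k=10)
print(score)  # (1.0 + 0.5) / 2 = 0.75
```

Unlike Hits@k, which only asks whether a relevant result appeared in the top k at all, MRR@k rewards ranking the relevant result higher, which is why it is the stricter metric.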

And fourth, because the data volume for my second Image Search demo was much larger (200k images instead of 10k), I switched from using NMSLib to using Vespa, the open source hybrid vector + text search engine from Yahoo!. Using it, I was able to provide image search results based on lexical matches between query and caption text, vector space matches between the CLIP query vector and CLIP image vectors, and hybrid search results ranked by combining the relevance of the two approaches.

Unfortunately I am not able to share the code, since the work was done on company time with company resources, and the code rightfully belongs to the company. I am also hopeful that the work could be used to power image search (or related) functionality in some production application. But in general, it is similar (with the differences enumerated above) to the RSICD version.

However, just to give some idea of the kind of results you can expect from a fine-tuned CLIP model, here are a couple of screenshots. The results are for the queries "computed tomography" and "computed tomography deep vein thrombosis". Both results are from doing vector matching, i.e. ranked by cosine similarity between the CLIP encoding of the query text and the CLIP encoding of each image.
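The vector matching step amounts to ranking pre-computed image embeddings by cosine similarity against the query embedding. Here is a generic numpy sketch, with random toy vectors standing in for actual CLIP encodings:

```python
import numpy as np

def rank_by_cosine(query_vec, image_vecs, top_k=5):
    """Rank images by cosine similarity to the query embedding.
    query_vec: shape (d,); image_vecs: shape (n, d), e.g. CLIP text
    and image encodings. Returns (indices, similarities), best first."""
    q = query_vec / np.linalg.norm(query_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = imgs @ q                      # cosine similarity per image
    order = np.argsort(-sims)[:top_k]    # highest similarity first
    return order, sims[order]

rng = np.random.default_rng(42)
images = rng.normal(size=(100, 8))               # 100 fake image vectors
query = images[7] + 0.01 * rng.normal(size=8)    # query near image 7
order, sims = rank_by_cosine(query, images)
print(order[0])  # 7
```

A search engine like Vespa does essentially this (plus an approximate-nearest-neighbor index so it scales past brute force), and its hybrid mode blends these scores with lexical relevance over the captions.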

As you can see, CLIP returns relevant images for both high level and detailed queries, indicating how rich the embedding is. My main takeaways from this series of exercises are twofold -- first, CLIP's joint image-text encoding is a seriously powerful idea and is super-effective, and second, transformer models trained on general data (natural images and text in this case) can be fine-tuned effectively for specialized domains using relatively small amounts of data.

Friday, March 29, 2019

The Amazing Composability of Convolutional Networks for Computer Vision Tasks


I last watched the videos for Stanford's CS231n: Convolutional Neural Networks for Visual Recognition almost two years ago, along with a lot of other videos, trying to scramble up the Deep Learning learning curve before I got too far behind. Recently, however, I was looking for some specific information around object detection, so I rewatched the Lecture 11: Detection and Segmentation video on Youtube. This lecture was from their 2017 class, and was taught by Justin Johnson. The lecture covers various popular approaches to object detection and segmentation, and can be summarized by the figure (taken from Darian Frajberg's presentation on Introduction to the Artificial Intelligence and Computer Vision Revolution at Politecnico di Milano) below.


What struck me again in the lecture was the incremental nature of the approaches, with each approach building on the one before it. But underneath it all, it's the same convolutional and pooling layers with classification and regression heads, just reconfigured in different ways. That is what I will write about in this post today, and I hope by the end of the post, you will agree with me as to how super cool this is. Of necessity, I am going to describe the different networks at a somewhat high level, so if you are looking for more information, I would advise you to watch the video yourself. Alternatively, Leonardo Araujo dos Santos has an excellent tutorial on Object Localization and Detection that goes into more detail than I am going to. But on to the post.

The first computer vision task is Image Classification. Here you train a network with images which can belong to one of a fixed number of classes, and use the trained model to predict the class. The Convolutional Neural Network (CNN) is the tool of choice when it comes to image classification. CNNs have consistently blown away classification benchmarks against the ImageNet dataset starting with AlexNet in 2012. Over the years, networks have grown deeper and some novel extensions to the basic CNN architecture have been developed, and today, image classification against the ImageNet dataset is considered a mostly solved problem. CNN models pre-trained against ImageNet are available for many frameworks, and people often use them to build their own classifiers using Transfer Learning.

Semantic Segmentation


The goal of semantic segmentation is to label each pixel of the image with its corresponding class label. Naively, this would be done by taking small crops obtained by sliding a window across and down the input image, then predicting the class of the central pixel in each crop. This is expensive, since an image H pixels high and W pixels wide requires H x W forward passes for a single image.

A less expensive approach might be to run your image through a CNN which increases the number of channels without changing the width and height. At the end, each pixel is represented by a vector of size equal to the number of target channels. Each of these vectors is then used to predict the class of the corresponding pixel. While better than the previous approach, this approach is also prohibitively expensive and is not used.

A third way is to use a CNN encoder-decoder pair, where the encoder will decrease the size of the image but increase its depth using Convolution and Pooling operations, and the decoder will use Transposed Convolutions to increase its size and decrease its depth. The input to this network is the image, and the output is the segmentation map. The U-Net, originally developed for biomedical image segmentation, contains additional skip-connections between corresponding layers of the encoder and decoder.
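The size bookkeeping between the encoder's strided convolutions and the decoder's transposed convolutions follows the standard output-size formulas, sketched below:

```python
def conv_out(size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Spatial output size of a convolution / pooling layer."""
    return (size - kernel + 2 * pad) // stride + 1

def tconv_out(size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Spatial output size of a transposed convolution, the decoder's
    upsampling op: it inverts the conv formula above."""
    return (size - 1) * stride - 2 * pad + kernel

# A stride-2, 3x3 conv with padding 1 halves a 224-pixel side...
down = conv_out(224, kernel=3, stride=2, pad=1)   # 112
# ...and a stride-2, 2x2 transposed conv doubles it back.
up = tconv_out(down, kernel=2, stride=2, pad=0)   # 224
print(down, up)
```

Stacking a few of each gives the hourglass shape of the encoder-decoder, and in a U-Net the skip-connections link the encoder and decoder layers whose spatial sizes match.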

Classification + Localization


In the Classification + Localization problem, not only do you have to report the class of the object in the image, you have to report the location. The assumption in a localization problem is that you have a single object in the image.

The only addition to a standard CNN pipeline is that this network will have two heads, a classification head and a regression head that reports the bounding box coordinates (x, y, w, h). The convolutional part of the network outputs a vector for the input image that is fed to a pair of dense networks jointly -- the classification head uses some sort of categorical loss function and the regression head uses a continuous loss function, the two losses being combined using a scalar weighting hyperparameter. You can train the networks separately and fine-tune them jointly. Depending on where you place the regression head -- if placed before the fully connected layers, it is called the OverFeat or VGG style, and if placed after, it is called the R-CNN style.
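The joint training objective can be sketched as a weighted sum of the two head losses. This is a minimal numpy illustration; the choice of cross-entropy plus mean-squared error, and the weight alpha, are generic stand-ins rather than any specific paper's recipe:

```python
import numpy as np

def joint_loss(class_logits, true_class, box_pred, box_true, alpha=1.0):
    """Combined loss for the two-headed network: softmax cross-entropy for
    the classification head plus alpha times mean-squared error for the
    (x, y, w, h) regression head. alpha is the scalar weighting
    hyperparameter that balances the two objectives."""
    # Numerically stable softmax cross-entropy.
    z = class_logits - class_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[true_class]
    # Mean squared error over the four box coordinates.
    mse = float(np.mean((box_pred - box_true) ** 2))
    return ce + alpha * mse

loss = joint_loss(
    class_logits=np.array([2.0, 0.5, -1.0]), true_class=0,
    box_pred=np.array([0.5, 0.5, 0.2, 0.3]),
    box_true=np.array([0.4, 0.5, 0.2, 0.3]),
)
print(loss > 0)  # True
```

In practice, frameworks backpropagate this single scalar through both heads and the shared convolutional trunk at once, which is what "training jointly" means.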

Object Detection


The Object Detection task is the same as the Classification + Localization problem, except we now have multiple objects in the image, for each of which we have to find the class and bounding box. We don't know the number or the sizes of the objects in the image. This is a hard and almost intractable problem if we have to compute random crops.

The first solution is to use an external tool from computer vision, such as Selective Search, to compute "blobby" areas of the image as possible areas where objects might be found. These areas are called Region Proposals. Proposals are resized and used to fine-tune an Image classification network, which is then used to vectorize the regions. Multiple binary SVMs (one per class) were used to classify each region between object and background, and a linear regression network was used to correct the bounding boxes proposed in the region proposals. This was the R-CNN (Regions with CNN features) network.

The first improvement, called the Fast R-CNN, still gets the region proposals from an external mechanism, but projects these proposals onto the output of the CNN. An ROI Pooling layer resizes all the regions to a fixed size and passes the vector for each region to a classification and a regression head, which predict the class and bounding box coordinates respectively.

The next improvement, called the Faster R-CNN, gets rid of the external Region Proposal mechanism and replaces it with a Region Proposal Network (RPN) between the Deep CNN and the ROI Pooling layer. The output of the RPN and the output of the CNN are fed into the ROI Pooling Layer from which it is fed into the fully connected part of the CNN with a classification and regression head to predict the class and bounding box for each proposal respectively.

The speedup of a Fast R-CNN over the R-CNN is about 25x, and speedup of the Faster R-CNN over a Fast-RCNN is 10x, thus a Faster R-CNN is about 250x faster than a R-CNN.

A slightly different approach to the Object Detection task was taken by the class of networks called Single Shot Detectors (SSD), one of which is the YOLO (You Only Look Once) network. The YOLO network breaks up the image into a 7x7 grid, then applies a set of B pre-determined bounding boxes with different aspect ratios to each grid cell. For each bounding box, it predicts the coordinates (x, y, w, h) and a confidence score, and for each grid cell it predicts the class probabilities for each of C classes. Thus the image can be represented by a tensor of size 7 x 7 x (5B + C). The YOLO network is a CNN that converts the image to this tensor.
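With the values from the original YOLO paper (S = 7, B = 2 boxes per cell, C = 20 Pascal VOC classes), the size of this output tensor works out as follows:

```python
# Each of the B boxes per grid cell carries 5 numbers (x, y, w, h, confidence);
# each cell additionally carries C class probabilities.
def yolo_output_size(S=7, B=2, C=20):
    return S * S * (5 * B + C)

print(yolo_output_size())  # -> 1470
```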

Instance Segmentation


Instance Segmentation is similar to Semantic Segmentation, with the difference that it distinguishes between individual objects of the same class, and it does not aim to label every pixel in the image. In some ways Instance Segmentation could also be considered similar to Object Detection, but instead of a bounding box, we want to find a binary mask that contains each object. The network that is used for this task is the Mask R-CNN.

Mask R-CNN adds an additional mask-predicting branch (another small CNN) alongside the classification and regression heads of a Faster R-CNN. This is applied to every region proposed by the RPN. The mask-predicting branch converts the bounding box for each proposal into a binary mask.

This concludes my post on Object Detection and Segmentation networks. As you can see, while the networks have grown progressively more complex as the tasks got harder, they are composed of (mostly) the same basic building blocks. One trend worth noting is the gradual move towards end-to-end learning, with work that was previously done by external tools being moved inside the network.

I hope you enjoyed my high level overview of the composability of CNNs for different computer vision tasks. I have deliberately tried to keep the descriptions at a fairly high level. For more depth, consider watching the lecture on which this post is based, or consult one of the many tutorials on the Internet that go into each network in greater detail, including the tutorial I mentioned earlier.


Monday, April 24, 2017

Predicting Image Similarity using Siamese Networks


In my previous post, I mentioned that I want to use Siamese Networks to predict image similarity from the INRIA Holidays Dataset. The Keras project on Github has an example Siamese network that can recognize MNIST handwritten digits that represent the same number as similar and different numbers as different. This got me all excited and eager to try this out on the Holidays dataset, which contains 1491 photos from 500 different vacations.

My Siamese network is somewhat loosely based on the architecture in the Keras example. The main idea behind a Siamese network is that it takes two inputs which need to be compared to each other, so each input is reduced to a denser and hopefully more "semantic" vector representation that can be compared using standard vector arithmetic. Each input undergoes a dimensionality reduction transformation implemented as a neural network. Since we want the two images to be transformed in the same way, we train the two networks using shared weights. The output of the dimensionality reduction is a pair of vectors, which are compared in some way to yield a metric that can be used to predict similarity between the inputs.
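The shared-weights idea can be sketched in a few lines of numpy -- here a single random projection matrix W (standing in for the trained sub-network) transforms both inputs identically before comparison; all names and values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(4, 8))   # one weight matrix, shared by BOTH branches

def embed(x):
    # the same transformation is applied to either input -- this is the
    # "shared weights" property of a Siamese network
    return np.tanh(W @ x)

x_left, x_right = rng.normal(size=8), rng.normal(size=8)
similarity = float(np.dot(embed(x_left), embed(x_right)))
```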

The Siamese network I built is shown in the diagram below. It differs from the Keras example in two major ways. First, the Keras example uses Fully Connected Networks (FCNs) as the dimensionality reduction component, whereas I use a Convolutional Neural Network (CNN). Second, the example computes the Euclidean distance between the two output vectors and attempts to minimize the contrastive loss between them, producing a number in the [0,1] range that is thresholded to return a binary similar/dissimilar prediction. In my case, I use an FCN that combines the output vectors using element-wise dot product, use cross-entropy as my loss function, and predict 0/1 to indicate similar/dissimilar.


For the CNN, I tried various different configurations. Unfortunately, I started running out of memory on the g2.2xlarge instance when I started trying large CNNs, and ended up migrating to a p2.xlarge. Even then, I had to either cut down the size of the input image or the network complexity, and eventually settled on a LeNet configuration for my CNN, which seemed a bit underpowered for the data. For the current configuration, shown in 02-holidays-siamese-network notebook, the network pretty much refused to learn anything. In other tries, the best test set accuracy I was able to get was about 60%, but all of them involved compromising on the input size or the complexity of the CNN, so I gave up and started looking at other approaches.

I have had success with transfer learning in the past, where you take a large network pre-trained on some external corpus such as ImageNet, chop off the classification head, and expose the vector from the layer prior to the head layer(s). The pre-trained network thus acts as the vectorizer or dimensionality reducer component. I used the following pre-trained networks, available in Keras applications, to generate vectors. The code to do this can be found in the 03-pretrained-nets-vectorizers notebook.

  • VGG-16
  • VGG-19
  • ResNet
  • InceptionV3
  • Xception


The diagram above shows the general setup of this approach. The first step is to just run the predict method on the pre-trained models to generate the vectors for each image. These vectors then need to be combined and fed to another classifier component. Some merge strategies I tried were element-wise dot product, absolute difference and squared (Euclidean) difference. With the dot product, corresponding elements of the two vectors that are both high become higher, while elements that differ become smaller. With the absolute and squared differences, elements that differ become larger, and the squared difference highlights large differences more strongly than small ones.
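Here is a tiny numpy illustration of the three merge strategies on made-up vectors:

```python
import numpy as np

v1 = np.array([0.9, 0.1, 0.5])   # vector for image 1 (made-up values)
v2 = np.array([0.8, 0.7, 0.5])   # vector for image 2

dot = v1 * v2          # element-wise dot product: both-high elements stay high
l1 = np.abs(v1 - v2)   # absolute difference: differing elements grow
l2 = (v1 - v2) ** 2    # squared difference: large gaps emphasized over small
```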

The classifier component (shown as FCN in my previous diagram) can be any kind of classifier, including non neural network based ones. As a baseline, I tried several common classifiers from the Scikit-Learn and XGBoost packages. You can see the code in the 04-pretrained-vec-dot-classifier, 05-pretrained-vec-l1-classifier, and 06-pretrained-vec-l2-classifier notebooks. The resulting accuracies for each (vectorizer, merge strategy, classifier) combination on the held out test set are summarized below.








Generally speaking, XGBoost seems to do the best across all merge strategies and vectorization schemes. Among these, Inception and ResNet vectors seem to be the best overall. We also now have a pretty high baseline for accuracy, about 96.5% for Inception vectors merged using dot product and classified with XGBoost. The code for this can be found in the 07-pretrained-vec-nn-classifier notebook. The figure below shows the accuracies for different merge strategies for ResNet and Inception.


The next step was to see if I could get even better performance by replacing the classifier head with a neural network. I ended up using a simple 3 layer FCN that gave a 95.7% accuracy with Inception vectors and using dot product for a merge strategy. Not quite as good as the XGBoost classifier, but quite close.

Finally, I decided to merge the two approaches. For the vectorization, I chose a pre-trained Inception network with its classification head removed. Input to this network would be images, and I would use the Keras ImageDataGenerator to augment my dataset, using the mechanism I described in my previous post. I decided to keep all the pre-trained weights fixed. For the classification head, I decided to start with the FCN I trained in the previous step and fine tune its weights during training. The code for that is in the 08-holidays-siamese-finetune notebook.


Unfortunately, this did not give me the stellar results I was hoping for; my best result was about 88% accuracy in similarity prediction. In retrospect, it may make sense to experiment with a simpler pre-trained model such as VGG and fine tune some of the later layer weights instead of keeping them all frozen. There is also a possibility that my final network is not getting the benefits of a fine tuned model from the previous steps. One symptom is that the accuracy after the first epoch is only around 0.6 - I would have expected it to be higher with a well trained model. In another project where a similar thing happened, a colleague discovered that I was doing extra normalization with ImageDataGenerator that I hadn't been doing during the vectorization step - but that doesn't seem to be the case here.

Overall, I got the best results from the transfer learning approach, with Inception vectors, the dot product merge strategy and the XGBoost classifier. A nice thing about transfer learning is that it is relatively cheap in terms of resources compared to fine tuning, or to training from scratch. While XGBoost does take some time to train, you can do the whole thing on your laptop. This is also true if you replace the XGBoost classifier with an FCN. You can also do inline image augmentation (i.e., without augmenting and saving) using the Keras ImageDataGenerator if you use the random_transform call.

Edit 2017-08-09: - seovchinnikov on Github has run some further experiments on his own datasets, where he has achieved 98% accuracy using feature fusion (code). See here for the full discussion.

Edit 2018-07-10: - The "Siamese Networks" in the title of the post is misleading and incorrect. Siamese networks train a function (implemented by a single set of NN weights) that returns the similarity between two inputs. In this case, we are using a pre-trained network to create vectors from images, then training a classifier to take these vectors and predict similarity (similar/dissimilar) between them. In case of a Siamese network, we would train the image to vector generating network against a loss function that minimizes for similar images and maximizes for dissimilar images (or vice versa). At the time I wrote this post, I did not know this. My apologies for the confusion, and thanks to Priya Arora for pointing this out.

Saturday, February 18, 2017

Using the Keras ImageDataGenerator with a Siamese Network


I have been looking at training a Siamese network to predict if two images are similar or different. Siamese networks are a type of Neural network that contain a pair of identical sub-networks that share the same parameters and weights. During training, the parameters are updated identically across both subnetworks. Siamese networks were first proposed in 1993 by Bromley, et al in their paper Signature Verification using a Siamese Time Delay Neural Network. Keras provides an example of a Siamese network as part of the distribution.

My dataset is the INRIA Holidays Dataset, a set of 1491 photos from 500 different vacations. The photos follow a naming convention from which the groups can be derived. Each photo is numbered with six digits - the first four refer to the vacation and the last two are a unique sequence number within the vacation. For example, a photo named 100301.jpg is from vacation 1003 and is the first photo in that group.
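The naming convention can be captured by a small helper (parse_photo_name is my own illustrative name, not code from the original notebooks):

```python
def parse_photo_name(filename):
    # first four digits = vacation group, last two = sequence within group
    stem = filename.split(".")[0]
    return stem[:4], stem[4:]

group, seq = parse_photo_name("100301.jpg")
print(group, seq)  # -> 1003 01
```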

The input to my network consists of image pairs and the output is either 1 (similar) or 0 (different). Similar image pairs are from the same vacation group. For example, the code snippet below displays three photos - the first two are from the same group and the last one is different.

from __future__ import division, print_function
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import np_utils
from scipy.misc import imresize
import itertools
import matplotlib.pyplot as plt
import numpy as np
import random
import os

DATA_DIR = "../data"
IMAGE_DIR = os.path.join(DATA_DIR, "holiday-photos")

ref_image = plt.imread(os.path.join(IMAGE_DIR, "100301.jpg"))
sim_image = plt.imread(os.path.join(IMAGE_DIR, "100302.jpg"))
dif_image = plt.imread(os.path.join(IMAGE_DIR, "127202.jpg"))

def draw_image(subplot, image, title):
    plt.subplot(subplot)
    plt.imshow(image)
    plt.title(title)
    plt.xticks([])
    plt.yticks([])
    
draw_image(131, ref_image, "reference")
draw_image(132, sim_image, "similar")
draw_image(133, dif_image, "different")
plt.tight_layout()
plt.show()


The following code snippet loops through the image directory and uses the file naming convention to create all pairs of similar images and a corresponding pair of different images. Similar image pairs are generated by considering all combination of image pairs within a group. Dissimilar image pairs are generated by pairing the left hand image of the similar pair with a random image from some other group. This gives us 2072 similar image pairs and 2072 different image pairs, ie, a total of 4144 image pairs for our training data.
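A hypothetical sketch of this pair-generation logic (make_pairs and the group-dictionary shape are my own, not the notebook's actual code) might look like:

```python
import itertools
import random

def make_pairs(groups, seed=42):
    # groups: dict mapping vacation id -> list of photo names
    random.seed(seed)
    pairs = []
    for gid, photos in groups.items():
        for left, right in itertools.combinations(photos, 2):
            pairs.append((left, right, 1))                    # similar pair
            other = random.choice([g for g in groups if g != gid])
            pairs.append((left, random.choice(groups[other]), 0))  # dissimilar
    return pairs
```

By construction this yields one dissimilar pair per similar pair, matching the balanced 2072/2072 split described above.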

Fearing that this might not be nearly enough to train my network adequately, I decided to use the Keras ImageDataGenerator to augment the dataset. Before Keras, when I was working with Caffe, I would manually augment my input with a fixed number of standard transformations, such as rotation, flipping, zooming and affine transforms (these are all just matrix transforms). The Keras ImageDataGenerator is much more sophisticated - you instantiate it with the range of transformations you will allow on your dataset, and it returns a generator that yields transformed versions of your input images from a directory.

I have used the ImageDataGenerator previously to augment my dataset to train a simple classification CNN, where the input was an image and the output was a label. This is the default case the component is built to handle, so it's actually very simple to use. My problem this time was a little different - my input is a pair of image names from a triple, and I wanted the identical transformation to be applied to both images. (This is not strictly necessary in my case, but it can't hurt, and in any case I wanted to learn how to do this for another upcoming project.)

It seems to be something that others have been looking for as well, and there is some discussion in Keras Issue 3059. In addition, the ImageDataGenerator documentation covers some cases where this can be done, using a pair of ImageDataGenerator instances that are instantiated with the same parameters. However, all these seem to require that you either enumerate the LHS and RHS images in the pair as 4-dimensional tensors (using flow()) or store them in two parallel directories with identical names (using flow_from_directory()). The first seems a bit wasteful, and the second seems incredibly complicated for my use case.

So I went digging into the code and found a private (in the sense of undocumented) method called random_transform(). It applies a random sequence of the transformations you have specified in the ImageDataGenerator constructor to your input image. In this post, I will describe an image generator that I built for my Siamese network using the random_transform() method.

We start with a basic generator that returns a batch of image triples per invocation. The generator reshuffles the data at the start of each epoch, and next() is called on it to get each successive batch of triples.

def image_triple_generator(image_triples, batch_size):
    while True:
        # loop once per epoch
        num_recs = len(image_triples)
        indices = np.random.permutation(np.arange(num_recs))
        num_batches = num_recs // batch_size
        for bid in range(num_batches):
            # loop once per batch
            batch_indices = indices[bid * batch_size : (bid + 1) * batch_size]
            yield [image_triples[i] for i in batch_indices]
            
triples_batch_gen = image_triple_generator(image_triples, 4)
next(triples_batch_gen)

This gives us a batch of 4 triples as shown:

[('149601.jpg', '149604.jpg', 1),
 ('144700.jpg', '106201.jpg', 0),
 ('103304.jpg', '111701.jpg', 0),
 ('133200.jpg', '128100.jpg', 0)]

Calling next() again returns the next 4 triples; the same thing happens for every subsequent batch.

next(triples_batch_gen)

[('135104.jpg', '122601.jpg', 0),
 ('137700.jpg', '137701.jpg', 1),
 ('136005.jpg', '105501.jpg', 0),
 ('132500.jpg', '132511.jpg', 1)]

Next, we apply the ImageDataGenerator.random_transform() method to a single image to see if it does indeed do what I think it does. My fear was that there needs to be some upstream initialization before I could call the random_transform() method. As you can see from the output, random_transform() augments the original image into variants that are quite close to it and could legitimately have been real photos.

datagen_args = dict(rotation_range=10,
                    width_shift_range=0.2,
                    height_shift_range=0.2,
                    shear_range=0.2,
                    zoom_range=0.2,
                    horizontal_flip=True)
datagen = ImageDataGenerator(**datagen_args)

sid = 150
np.random.seed(42)
image = plt.imread(os.path.join(IMAGE_DIR, "115201.jpg"))
sid += 1
draw_image(sid, image, "orig")
for j in range(4):
    augmented = datagen.random_transform(image)
    sid += 1
    draw_image(sid, augmented, "aug#{:d}".format(j + 1))

plt.tight_layout()
plt.show()


Next I wanted to see if I could take two images and apply the same transformation to both. I now take a pair of ImageDataGenerators configured the same way. The individual transformations applied to the image in the random_transform() method are all driven by numpy random number generators, so one way to make them do the same thing is to initialize the random number seed to the same value for each ImageDataGenerator at the start of each batch. As you can see from the photos below, this strategy seems to be working.

image_pair = ["108103.jpg", "112003.jpg"]

datagens = [ImageDataGenerator(**datagen_args),
            ImageDataGenerator(**datagen_args)]

sid = 240
for i, image_name in enumerate(image_pair):
    image = plt.imread(os.path.join(IMAGE_DIR, image_name))
    sid += 1
    draw_image(sid, image, "orig")
    # make sure the two image data generators generate same transformations
    np.random.seed(42)
    for j in range(3):
        augmented = datagens[i].random_transform(image)
        sid += 1
        draw_image(sid, augmented, "aug#{:d}".format(j + 1))

plt.tight_layout()
plt.show()


Finally, we are ready to build our final generator that can be plugged into the Siamese network. I haven't built that network yet, so there might be some changes once I try to integrate it, but here is the first cut. The caching is there because I noticed that it takes a while to generate the batches, so caching will hopefully speed things up.

RESIZE_WIDTH = 300
RESIZE_HEIGHT = 300

def cached_imread(image_path, image_cache):
    if image_path not in image_cache:
        image = plt.imread(image_path)
        image = imresize(image, (RESIZE_WIDTH, RESIZE_HEIGHT))
        image_cache[image_path] = image
    return image_cache[image_path]

def preprocess_images(image_names, seed, datagen, image_cache):
    np.random.seed(seed)
    X = np.zeros((len(image_names), RESIZE_WIDTH, RESIZE_HEIGHT, 3))
    for i, image_name in enumerate(image_names):
        image = cached_imread(os.path.join(IMAGE_DIR, image_name), image_cache)
        X[i] = datagen.random_transform(image)
    return X

def image_triple_generator(image_triples, batch_size):
    datagen_args = dict(rotation_range=10,
                        width_shift_range=0.2,
                        height_shift_range=0.2,
                        shear_range=0.2,
                        zoom_range=0.2,
                        horizontal_flip=True)
    datagen_left = ImageDataGenerator(**datagen_args)
    datagen_right = ImageDataGenerator(**datagen_args)
    image_cache = {}
    
    while True:
        # loop once per epoch
        num_recs = len(image_triples)
        indices = np.random.permutation(np.arange(num_recs))
        num_batches = num_recs // batch_size
        for bid in range(num_batches):
            # loop once per batch
            batch_indices = indices[bid * batch_size : (bid + 1) * batch_size]
            batch = [image_triples[i] for i in batch_indices]
            # make sure image data generators generate same transformations
            seed = np.random.randint(low=0, high=1000, size=1)[0]
            Xleft = preprocess_images([b[0] for b in batch], seed, 
                                      datagen_left, image_cache)
            Xright = preprocess_images([b[1] for b in batch], seed,
                                       datagen_right, image_cache)
            Y = np_utils.to_categorical(np.array([b[2] for b in batch]))
            yield Xleft, Xright, Y

Here is a little snippet to call my data generator and verify that it returns the right shaped data.

triples_batch_gen = image_triple_generator(image_triples, 32)
Xleft, Xright, Y = next(triples_batch_gen)
print(Xleft.shape, Xright.shape, Y.shape)

which returns the expected shapes.

(32, 300, 300, 3) (32, 300, 300, 3) (32, 2)

So anyway, this is all I have so far. Once I have my Siamese network coded up and running, I will talk about it in a subsequent post. I haven't heard of anyone using the ImageDataGenerator.random_transform() method directly before, so I thought it might be interesting to describe my experience. Currently the enhancements seem to be aimed at continuing to allow folks to use the flow() and flow_from_directory() methods. I am not sure if more specialized requirements will come up in the future, but I think using the random_transform() method instead might be a good choice for many situations. Of course, it is quite likely that I may be missing something, so if you know of problems with this approach, please let me know.


Monday, November 28, 2016

Image Preprocessing with OpenCV


In my last post, I mentioned that I presented at the Demystifying Deep Learning and Artificial Intelligence event at Oakland. My talk was about transfer learning from, and fine tuning of, a Deep Convolutional Neural Network (DCNN) trained on ImageNet, in order to classify images in a different domain. The domain I chose was images of the retina, to detect varying stages of Diabetic Retinopathy (DR). The images came from the Diabetic Retinopathy competition on Kaggle.

In order to demonstrate the ideas mentioned in the presentation, I trained a few simple networks on a sample (1,000/35,000) of the data provided. My results were nowhere close to the competition winner's, who achieved a Kappa score of 0.85 (a metric indicating agreement of predictions with labels), which is better than human performance (0.83 between a General Physician and an Ophthalmologist, and 0.72 between an Optometrist and an Ophthalmologist, according to this forum post). My best model did achieve a Kappa score of 0.75 on my validation set, though, which would put me at around position 25-26 on the public leaderboard.
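For reference, (unweighted) Cohen's Kappa can be computed from a confusion matrix as below; note the competition actually scored with a quadratic-weighted variant, so this sketch is just for intuition about the metric:

```python
import numpy as np

def cohens_kappa(confusion):
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p_observed = np.trace(confusion) / total
    p_chance = np.sum(confusion.sum(axis=0) * confusion.sum(axis=1)) / total ** 2
    return (p_observed - p_chance) / (1.0 - p_chance)

print(round(cohens_kappa([[20, 5], [10, 15]]), 4))  # -> 0.4
```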

The competition winner Benjamin Graham (min-pooling) posted a description of his algorithm after the competition. One of the things he did was to preprocess the images so they had more uniformity in terms of brightness and shape. This made sense, since the images vary quite a bit along these dimensions, as you can see below.


I have been recently playing around with OpenCV, so I figured it would be interesting to apply some of these techniques to preprocess the images so they were more similar to each other. This post describes what I did.

I first tried to standardize on the size. As you can see, some images are more rectangular, with more empty space on the left and right, and some are more square. In fact, if you group loosely by aspect ratio, it turns out that there are three major size groups.








My first attempt at standardization was to find the edge of the circle representing the retina, then crop on the vertical tangent to the edge. I ended up not using this approach, but I include it here because I think it is interesting and maybe if I had more time and patience I might have figured out a way to use this approach instead of what I did.








The code to do so is shown below. The image is first read in as a grayscale image and converted to a matrix, then vertical and horizontal Sobel filters are applied to extract edges. Finally, we find the edge farthest from the center (approximated by the vertical center of the image) and crop vertically along this.

import cv2
import numpy as np
def compute_edges(image):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    image = cv2.GaussianBlur(image, (11, 11), 0)
    sobel_x = cv2.Sobel(image, cv2.CV_64F, 1, 0)
    sobel_x = np.uint8(np.absolute(sobel_x))
    sobel_y = cv2.Sobel(image, cv2.CV_64F, 0, 1)
    sobel_y = np.uint8(np.absolute(sobel_y))
    edged = cv2.bitwise_or(sobel_x, sobel_y)
    return edged    

def crop_image_to_edge(edged, threshold=10, margin=0.2):
    # find edge along center and crop
    mid_y = edged.shape[0] // 2
    notblack_x = np.where(edged[mid_y, :] >= threshold)[0]
    if notblack_x.shape[0] == 0:
        lb_x = 0
        ub_x = edged.shape[1]
    else:
        lb_x = notblack_x[0]
        ub_x = notblack_x[-1]
    if lb_x > margin * edged.shape[1]:
        lb_x = 0
    if (edged.shape[1] - ub_x) > margin * edged.shape[1]:
        ub_x = edged.shape[1]        
    mid_x = edged.shape[1] // 2
    notblack_y = np.where(edged[:, mid_x] >= threshold)[0]
    if notblack_y.shape[0] == 0:
        lb_y = 0
        ub_y = edged.shape[0]
    else:
        lb_y = notblack_y[0]
        ub_y = notblack_y[-1]
    if lb_y > margin * edged.shape[0]:
        lb_y = 0
    if (edged.shape[0] - ub_y) > margin * edged.shape[0]:
        ub_y = edged.shape[0]
    cropped = edged[lb_y:ub_y, lb_x:ub_x]
    return cropped

image = cv2.imread(image_name)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # left image
edged = compute_edges(image)                    # middle image
cropped = crop_image_to_edge(edged)             # right image

Although this approach gave me good results in this (and many other) cases, it failed on images that were so dark that the edge could not be detected. Also, as you can see from the histogram on the left below, aspect ratios of the original uncropped images fell into two distinct clusters, but after the cropping operation the distribution is all over the place. Our objective was to standardize the aspect ratio after the cropping operation - the kind of scenario shown in the histogram on the right.






The approach I came up with was to eyeball the aspect ratios. Most of them were around 1.3 and 1.5, so I decided based on some manual cropping that the best aspect ratio is around 1.2. The resulting histogram of aspect ratios is the one on the right above.

def crop_image_to_aspect(image, tar=1.2):
    # load image
    image_bw = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    # compute aspect ratio
    h, w = image_bw.shape[0], image_bw.shape[1]
    sar = h / w if h > w else w / h
    if sar < tar:
        return image
    else:
        k = 0.5 * (1.0 - (tar / sar))
        if h > w:
            lb = int(k * h)
            ub = h - lb
            cropped = image[lb:ub, :, :]
        else:
            lb = int(k * w)
            ub = w - lb
            cropped = image[:, lb:ub, :]
        return cropped

cropped = crop_image_to_aspect(image)

This is what the random sample of 9 retina images looks like after the cropping operation.


Next I tried standardizing the brightness. Benjamin Graham's report suggests just subtracting the mean pixel value from each RGB channel, but I decided to do something a little fancier. First I converted each image to the HSV (Hue, Saturation, Value) color space and computed the mean value of V across all images in my sample; the value of V is a measure of the brightness of the image. I also computed the mean V per image. Then, for each pixel's V value, I subtracted the image's own mean V and added the global mean V, and converted the result back to RGB.

def brighten_image_hsv(image, global_mean_v):
    image_hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
    h, s, v = cv2.split(image_hsv)
    mean_v = int(np.mean(v))
    v = v - mean_v + global_mean_v
    image_hsv = cv2.merge((h, s, v))
    image_bright = cv2.cvtColor(image_hsv, cv2.COLOR_HSV2RGB)
    return image_bright

vs = []
for image_dir, image_name in get_next_image_loc(DATA_DIR):
    image = cv2.imread(os.path.join(DATA_DIR, image_dir, image_name))
    image_hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(image_hsv)
    vs.append(np.mean(v))
global_mean_v = int(np.mean(np.array(vs)))

brightened = brighten_image_hsv(resized, global_mean_v)

As expected, this mean centering operation converts a somewhat skewed distribution of brightnesses to a more balanced one.






After mean centering the brightness values and converting back to RGB, our sample of 9 retina images looks like this. The resulting images are not as clean as the examples shown in the winner's competition report, where he mean centered directly on RGB. But the brightness does look roughly equal now.


In order to mean center by RGB, we compute the global mean of R, G and B channels across all the images, then subtract the individual R, G, and B channel means from the image. Code to do this is shown below:

def brighten_image_rgb(image, global_mean_rgb):
    r, g, b = cv2.split(image)
    m = np.array([np.mean(r), np.mean(g), np.mean(b)])
    brightened = image + global_mean_rgb - m
    return brightened
    
mean_rgbs = []
for image_dir, image_name in get_next_image_loc(DATA_DIR):
    image = cv2.imread(os.path.join(DATA_DIR, image_dir, image_name))
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    r, g, b = cv2.split(image_rgb)
    mean_rgbs.append(np.array([np.mean(r), np.mean(g), np.mean(b)]))
global_mean_rgbs = np.mean(mean_rgbs, axis=0)

brightened = brighten_image_rgb(resized, global_mean_rgbs)

The set of sample images, brightened by RGB channel, looks like this:


Sadly, the preprocessing does not actually translate to higher accuracy or Kappa scores. In fact, resizing and brightening the images using HSV results in a Kappa score of 0.68, and resizing and brightening using RGB results in a Kappa score of 0.61, whereas the Kappa score without pre-processing was 0.75. So preprocessing actually had a negative effect in my case. However, it is good to know how to do this for the future, so I think it was time well spent.

The entire code for preprocessing the sample images, as well as printing a random sample of 9 images at each step, is available here in my project on GitHub.