Sunday, October 06, 2019

Efficient Reverse Image Search on Lucene using Approximate Nearest Neighbors


I wrote last week about our Search Summit, an internal conferences where engineers and product managers from Elsevier, LexisNexis, and various other organizations that make up the RELX Group, get together to share ideas and best practices on search. The Search Summit is in its third year, and given that its a single track conference and runs for two days, there are limited speaking slots available, and competition for these slots is quite fierce. Our acceptance rate this year was around the 40-50% mark, which I guess puts us on par with some big external conferences in terms of exclusivity. In any case, I guess that was my long-winded way of saying that this year I didn't get accepted for a speaking slot, and was offered the opportunity to present a poster instead, which I accepted.

This was the first time I made a poster for work, and quite honestly, I didn't know where to begin. My only experience with posters was back when I was at school, where I remember that we used crayons a lot, and then helping my kids build posters for their school, which involved a lot of cutting out pictures and glueing them on to a posterboard. Fortunately, as I learned from some googling, technology has progressed since then, and now one can build the poster as a single Microsoft Powerpoint (or Apple Keynote or Google slides) slide (I used Powerpoint). Quite a few sites provide free templates for various standard poster sizes as well, I ended up choosing one from Genigraphics. One you are done designing the poster, you save it as a PDF, and upload the PDF to sites such as Staples Print Services for printing. And that's pretty much all there is to it. Here is the poster I ended up presenting at the search summit.


Basically, the poster describes a Proof of Concept (POC) application that attempts to do reverse image search, i.e., given an image, it returns a ranked list of images, similar to the given image, in the corpus. My dataset for my experiments was a set of approximately 20,000 images from the ImageCLEF 2016 competition. The notion of similarity that I use is the one learned by a RESNET network pre-trained on ImageNet. The idea is that this sort of network, trained on a large generic image corpus such as ImageNet, learns how to recognize colors, edges, shapes, and other more complex features in images, and these features can be used to provide a notion of similarity across images, much like tokens in a document.

RESNET (and other pre-trained networks like it) generate dense and (comparatively) low-dimensional vectors of features. For example, the last layer of a trained RESNET model will generate feature vectors of size 2048. On the other hand, text is typically vectorized for Information Retrieval as a vector over all the tokens (and sometimes token combinations as well) in its vocabulary, so feature vector sizes of 50,000 or 100,000 are not uncommon. In addition, such vectors tend to be quite sparse as well, since a given token is relatively unlikely to occur in many documents. Lucene (and probably other search engines) leverage the high-dimensionality and sparseness of text feature vectors using its Inverted Index data structure (aka Postings List), where the key is the token and the value is a priority queue of document ids that the token occurs in.

In order to present a ranked list of similar images, a naive (and exhaustive) solution would be to compute some measure of vector similarity (such as Cosine similarity or Euclidean similarity) between the input image and every other image in the corpus, a O(N) operation. A better solution might be to partition the similarity space using something like a KD-tree such that comparison is done against a subset of most similar images, and that is a O(log N) operation. However, the Lucene inverted index makes the search an almost O(1) operation -- first looking up by token, and then navigating down a priority queue of a subset of document IDs containing that token. My reasoning went as follows -- if I could somehow project the dense, low-dimensional vectors from RESNET back into a sparse, high-dimensional space similar to that used for text vectors without losing too much information, I would be able to do image search as efficiently as text search.

I experimented with two such projection techniques -- Random K-Means and Random Projections.

  • Random K-Means involves running a bunch of K-Means clusterings over the image set, typically 50-200 of them, each with some random value for k. Each image (point in similarity space) is represented by a sequence of its cluster memberships across all the clusterings. The intuition here is that similar images, represented by two points that are close in the similarity space, will share cluster memberships across more clusterings than dissimilar images, represented by points that are further apart. A 2D schematic on the bottom center of the poster illustrates the idea.
  • Random Projections is similar, but instead of partitioning the space into a random number of circular clusters, we partition the space itself using a (D-1) dimensional hyperplane, where D is the size of the dense feature vector (2048). The number of random projections used is typically much higher than the number of K-Means, typically 1,000-10,000. This effectively results in binary clusters where points behind the hyperplane get assigned a 0 and points in front get assigned a 1. As before, each image is represented by a sequence of its membership across all the random projections. The intuition here is similar to random K-Means. The 2D schematic on the bottom right of the poster illustrates how random projections work.

My baseline for experiments was caption based text search, i.e., similar images were found by doing text search on their captions. I took a random set of 20 images, and manually evaluated the relevance for top 10 similar images for each on a 4-point scale. Its only one person, so its not very objective, but I needed a place to start. The guideline I followed was to give it the highest rating (4) if the image is identical, or almost identical, to the source image -- possibly either the same image, or a composite containing the same image, or a minor modification. The second highest rating (3) was for images that were basically the same thing -- for example, if the source image was an X-ray of the left lung, I would rate an X-ray of the right as very similar. The third rating (2) were for images that I couldn't rate 4 or 3, but which I would expect to see in the set of similar images. Finally, the lowest rating (1) is for images that I don't expect to see in the results, i.e., irrelevant images.

I used the ratings to compute Discounted Cumulative Gain (DCG) for k=1, 3, 5, and 10, where k is the number of top results taken from the ranked list of similar images for each image, and then averaged them across the 20 images. I then computed the Ideal DCG (IDCG) for k=10 by sorting the ranked list based on my ratings, and then divided the computed DCG by the IDCG to get the Normalized Discounted Cumulative Gain (NDCG). These numbers are shown in the table under the Results section in the poster. The bolded numbers indicate the best performers. Results for k=1 serve as a sanity check, since all the searches in my evaluation returned the source image as the top result in the ranked list of similar images. Overall, the numbers seem quite good, and Random K-Means with 250 clusters produced the best results.

Random K-Means provided the best results in qualitative terms as well. The block of images at the center of the poster show a brain MRI as the source image, and the top 3 results from the baseline, the best random K-Means (250 clusters), and the best Random Projections (10000 projections). A green box indicates relevant results, and a red box indicates irrelevant results. As you can see, the caption text search based baseline produces more diverse results, but doesn't do very well otherwise. Random K-Means and Random Projections produce similar images that are more visually similar to the source image, and between the two, Random K-Means tends to produce better results.

In the future, I hope to extend this approach to text vectors as well. Specifically, I want to capitalize on the increase in recall that I noticed with caption based text search, by building embeddings (word2vec, ELMo, BERT, etc.) from the caption and treat these embeddings as additional feature vectors for the image. On the other hand, I do notice that the trend has been to use special purpose backends for vector search, such as the Non-Metric Space Library (NMSLib) based on Numpy, or the GPU-based FAISS Library for efficient similarity search, or hybrid search engines such as Vespa. So while it might be intellectually satisfying to try to make a single platform serve all your search needs, it might more worthwhile to see if one of these other toolkits might provide better results.

I also wanted to acknowledge the work of Simon Hughes. The approaches used in this poster are based heavily on ideas originally proposed by him on his DiceTechJobs/VectorsInSearch project, as well as ideas and suggestions proposed by him and others on the Relevancy and Matching Technology #vectors-in-search slack channel.



Be the first to comment. Comments are moderated to prevent spam.