My previous post hit #5 on Hacker News last Friday. Thanks to the person who thought it was good enough to post there, and also to the folks who contacted me personally to congratulate me. I also want to thank the commenters for their constructive suggestions. While I was planning to do (and write) a bit more around this subject anyway, part of this post is based on the comments in the Hacker News thread.
In that post, I described a setup for image search using intensity histograms represented as vectors of "words" and associated payloads. Each histogram was created by binning the RGB pixel intensities into 75 equally spaced buckets (25 per channel). The idea is to make images searchable with a text search engine (I used Elasticsearch for my example). From a text search viewpoint, each image is a single "sentence" of up to 75 "words". Instead of using Lucene's standard payload-based scoring for ranking hits, I use a custom Function Query to compute the Cosine Similarity between the query image and each result.
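As a refresher, the scoring boils down to the computation below. This is just a minimal client-side sketch using plain Python dicts; in the actual setup the scoring happens inside the custom Elasticsearch Function Query described in the previous post.

# A minimal sketch of cosine similarity between two sparse {word: payload}
# vectors. Illustrative only - the real scoring runs inside a custom
# Elasticsearch Function Query, not client-side Python.
import math

def cosine_similarity(vec_a, vec_b):
    common = set(vec_a.keys()) & set(vec_b.keys())
    dot = sum(vec_a[w] * vec_b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)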
In this post, I will describe some other vector spaces into which I tried mapping my images. The vector spaces I tried are as follows:
- Binning on RGB space (baseline).
- Binning on HSV (Hue, Saturation, Value) space.
- Hand-crafted Image Features (Intensity Histogram, Edge Histogram, Entropy).
- Transfer Learning using a Deep Learning model (TL6 and TL7).
For evaluating the different approaches, I reused the evaluation framework I built for the RGB vectors. The evaluation is done using a test set of 50 random images from the corpus. For each image, I find its rank in the results when it is used as the query. Ideally, it should be ranked #1, so my success criterion is how close the Mean Reciprocal Rank (MRR) metric gets to 1. Further, I vary the number of features that I use in the query, where the features are arranged in order of their importance to the image. The best approach is the one that reaches the highest MRR with the fewest features. The chart below shows the results.
Note that since the size of the feature space differs across approaches, the X-axis represents about 50% of the feature space available to each approach. For example, for RGB and HSV the maximum number of features is 75, so the range shown represents roughly 1-40 features. On the other hand, for Transfer Learning 6 and 7 (TL6 and TL7) the number of features is 4096, so the visible range for these two is about 1-2000.
Clearly, the winner here is HSV (green), followed by TL7 (magenta) and TL6 (cyan) respectively. However, both TL6 and TL7 show better performance with a smaller number of features. The hand-crafted Image Features (IMF) approach (red) fares the worst, very likely because I don't know much about image features. The RGB approach (blue) plateaus at around 0.8 with 35 features, the same as I noted in my last post.
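For readers unfamiliar with the metric, MRR is simply the average of 1/rank over all test queries. Here is a minimal sketch; the function name and the example rank list are illustrative, not part of the evaluation code shown later.

def mean_reciprocal_rank(ranks):
    # ranks are the 1-based positions of each query image in its own result
    # list; None means the image did not come back at all
    recip_ranks = [1.0 / r if r is not None else 0.0 for r in ranks]
    return sum(recip_ranks) / len(recip_ranks)

print(mean_reciprocal_rank([1, 1, 4, 1]))   # 0.8125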
Below, I describe each approach in more detail.
The Hue, Saturation, Value space (HSV) approach
I selected this color space based on the suggestion in Adrian Rosebrock's Complete Guide to Building an Image Search Engine that while RGB is easy to understand, it fails to mimic how humans perceive color. An image in HSV space has the same shape as one in RGB, so I can reuse most of my earlier code to vectorize the image.
# Source: src/es_build_hsv.py
# -*- coding: utf-8 -*-
import collections
import cv2
import numpy as np
import operator
import os

IMAGE_DIR = "/path/to/images/of/butterflies/"
OUTPUT_FILE = "/path/to/butterflies_hsv.txt"

def get_vector_hsv(img):
    """ Bin each HSV channel into equally spaced buckets and return
        (word, payload) pairs, scaled so the most frequent bucket is 100. """
    axes = ["h", "s", "v"]
    bins = np.linspace(0, 256, 25)
    means = 0.5 * (bins[1:] + bins[:-1]).astype("uint8")
    words = []
    for i in range(len(axes)):
        px_orig = img[:, :, i].flatten()
        labels = np.searchsorted(bins, px_orig)
        px_reduced = np.choose(labels.ravel(), means,
                               mode="clip").astype("uint8").tolist()
        counter = collections.Counter(px_reduced)
        words.extend([(axes[i] + str(x[0]), x[1]) for x in counter.items()])
    words_sorted = sorted(words, key=operator.itemgetter(1), reverse=True)
    max_freq = words_sorted[0][1]
    words_sorted = [(x[0], 100.0 * x[1] / max_freq) for x in words_sorted]
    return words_sorted

fout = open(OUTPUT_FILE, 'wb')
for img_file in os.listdir(IMAGE_DIR):
    # OpenCV reads images as BGR, so convert to HSV before vectorizing
    image = cv2.imread(os.path.join(IMAGE_DIR, img_file))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    image_vec = get_vector_hsv(image)
    words = [e[0] + "|" + ("%.5f" % (e[1])) for e in image_vec]
    fout.write("%s\t%s\n" % (img_file, " ".join(words)))
fout.close()
The output of this block of code is a flat file mapping each image filename to its HSV image vector. Each line consists of the image file name followed by the vector, which is a sequence of space-separated elements. Each element is a pseudo "word" (the bucket index) and its payload (the magnitude), separated by a pipe. Here is a sample of the HSV output. The general format of this file is identical for all approaches.
1132868439-1210.jpg v122|100.00000 h16|86.25048 s16|85.04542 ...
1132868439-12100.jpg v122|100.00000 s16|92.02476 h37|90.57395 ...
1132868439-12101.jpg v122|100.00000 s16|96.92287 h37|85.75985 ...
...
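Since the format is shared by all five approaches, a single parser can turn any line back into (word, payload) pairs. Here is a minimal sketch; the helper name is mine, and the actual index-loading code is reused from the previous post.

def parse_vector_line(line):
    # one line = filename, TAB, space-separated word|payload elements
    filename, vector_str = line.rstrip("\n").split("\t")
    pairs = [(w, float(p)) for w, p in
             (elem.split("|") for elem in vector_str.split(" "))]
    return filename, pairs

fname, vec = parse_vector_line(
    "1132868439-1210.jpg\tv122|100.00000 h16|86.25048 s16|85.04542")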
Hand Crafted Image Features (IMF)
I wasn't planning on doing this one, since I don't know much about Image Processing techniques. However, some comments on the Hacker News thread, as well as a helpful comment on my post that came with a paper reference, convinced me that I should try it out. I was already using OpenCV for the HSV work, so I figured it wouldn't be that hard. At the very least it would be a chance to learn something new.
So anyway, I looked at the suggested paper, Dicoogle, a PACS Featuring Profiled Content Based Image Retrieval, and found some features that the author used successfully to classify radiology images. These were:
- Intensity Histogram
- Edge Histogram
- Entropy
- Segmentations and respective area and center of mass
- Image moments
Of these, I was able to compute everything except the segmentations, although I am not sure I am using them correctly. If you think I should be doing this differently, please let me know.
Intensity histograms are not new; the RGB and HSV approaches are based solely on them. In this case, however, I decided to compute black and white histograms only, since I needed B/W images for the other metrics anyway.
The Edge Histogram is interesting. Essentially, you compute gradients along the horizontal and vertical directions using the Sobel filter, compute the angle of the edge at each pixel (a value between -Pi and +Pi radians), and then bin the angles into a histogram. This mailing list discussion has more details. The diagram below shows the original image, the gradients in the two directions, and the resulting edge histogram for a single image.
The entropy is based on Shannon's formula, following the guidelines in this blog post. For moments, I used OpenCV's built-in function to compute moments from the contours, and aggregated them over all contours in the image. The code for this is shown below.
# Source: src/es_build_imf.py
# -*- coding: utf-8 -*-
import cv2
import math
import numpy as np
import operator
import os

IMAGE_DIR = "/path/to/images/of/butterflies/"
OUTPUT_FILE = "/path/to/butterflies_imf.txt"

def get_intensity_histogram(img):
    """ 25 histogram entries per image (gray scale) """
    int_hist = np.histogram(img.flatten(), bins=25, range=(0, 255))[0]
    row_sum = np.sum(int_hist)
    int_hist = int_hist * 1.0 / row_sum
    return int_hist

def get_entropy(img):
    """ Adapted from:
        http://areshopencv.blogspot.com/2011/12/computing-entropy-of-image.html
        and
        https://en.wikipedia.org/wiki/Entropy
    """
    freqs = cv2.calcHist([img], [0], None, [255], [0, 255]).T[0]
    row_sum = np.sum(freqs)
    freqs = freqs * 1.0 / row_sum
    return sum([-x * math.log(x, 2) for x in freqs.tolist() if x > 0])

def get_moments(img):
    """ Compute aggregate moments across all contours in the image. """
    ret, thresh = cv2.threshold(img, 127, 255, 0)
    img2, contours, hierarchy = cv2.findContours(thresh, 1, 2)
    moments = {}
    for contour in contours:
        mom = cv2.moments(contour)
        for k in mom.keys():
            if moments.has_key(k):
                moments[k] += mom[k]
            else:
                moments[k] = mom[k]
    # normalize so the moment values sum to 1
    val_sum = sum([moments[k] for k in moments.keys()])
    for k in moments.keys():
        moments[k] = moments[k] * 1.0 / val_sum
    return moments

def get_edge_histogram(img):
    """ Adapted from this discussion:
        http://opencv-users.1802565.n2.nabble.com/Edge-orientation-td3146691.html
    """
    dx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=5)
    dy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=5)
    # edge orientation (in radians) at each pixel, binned into 25 buckets
    grad = np.zeros((dx.shape[0], dx.shape[1]))
    for i in range(dx.shape[0]):
        for j in range(dx.shape[1]):
            grad[i, j] = math.atan2(dx[i, j], dy[i, j])
    edge_hist = np.histogram(grad, bins=25, range=(-math.pi, math.pi))[0]
    edge_sum = np.sum(edge_hist)
    edge_hist = edge_hist * 1.0 / edge_sum
    return edge_hist

def get_image_vector_imf(img):
    img_vec = []
    int_hist = get_intensity_histogram(img)
    img_vec.extend([("x" + str(i), x)
                    for i, x in enumerate(int_hist.tolist())
                    if x > 0.0])
    edge_hists = get_edge_histogram(img)
    img_vec.extend([("e" + str(i), x)
                    for i, x in enumerate(edge_hists.tolist())
                    if x != 0.0])
    moments = get_moments(img)
    for k in moments.keys():
        if moments[k] != 0.0:
            img_vec.append((k, moments[k]))
    entropy = get_entropy(img)
    img_vec.append(("ent", entropy))
    words = ["%s|%.5f" % (k, v) for k, v in
             sorted(img_vec, key=operator.itemgetter(1), reverse=True)]
    return words

fout = open(OUTPUT_FILE, 'wb')
for img_file in os.listdir(IMAGE_DIR):
    print img_file
    # read image directly as grayscale (flag 0)
    img = cv2.imread(os.path.join(IMAGE_DIR, img_file), 0)
    words = get_image_vector_imf(img)
    fout.write("%s\t%s\n" % (img_file, " ".join(words)))
fout.close()
Transfer Learning using a Deep Learning Model (TL6 and TL7)
I used the reference Convolutional Neural Network (CNN) model built with Caffe and pre-trained on the ImageNet dataset of approximately 14 million images to generate image vectors for my corpus of 200 butterfly images. Since the CNN has been trained to recognize many different kinds of objects, it has developed the ability to extract good features from images. So essentially I use it as a feature extractor for my corpus, even though none of my butterfly images probably exist in ImageNet and the CNN has never seen them before. The article Recycling Deep Learning Models with Transfer Learning by Zachary Lipton has a good explanation of what Transfer Learning is and why it works so well.
Conceptually, the Caffe reference CNN looks like the listing below. Each line represents a layer in the CNN, and the second column shows the dimensions of the layer. The first element of the dimension tuple is the batch size, and the rest represent the actual dimensions of the layer (for each input record). Input comes in through the "data" layer at the top, and the class prediction probabilities are generated by the "prob" layer at the bottom. CNNs consist of alternating layers of convolutions and pooling, ending in one or more fully connected (fc) layers. In this case, there are 3 fully connected layers ("fc6", "fc7" and "fc8"). Of these, "fc8" is not very interesting because it has 1000 elements, the same as the "prob" layer, and is therefore too closely tied to the ImageNet classes to be useful here.
data (50, 3, 227, 227)
conv1 (50, 96, 55, 55)
pool1 (50, 96, 27, 27)
norm1 (50, 96, 27, 27)
conv2 (50, 256, 27, 27)
pool2 (50, 256, 13, 13)
norm2 (50, 256, 13, 13)
conv3 (50, 384, 13, 13)
conv4 (50, 384, 13, 13)
conv5 (50, 256, 13, 13)
pool5 (50, 256, 6, 6)
fc6 (50, 4096)
fc7 (50, 4096)
fc8 (50, 1000)
prob (50, 1000)
The way the transfer learning setup works is that I take the trained model weights, chop off the "fc8" and "prob" layers from the network, and run my butterfly images through the truncated network. This results in a vector (the output of "fc7") of size (4096,) that represents the image for TL7.
For TL6, I chop off the "fc7" layer as well, and run my images through the resulting truncated network. Once again, the result is a vector of size (4096,) from "fc6" for each image.
The code to extract image vectors has been adapted from this Caffe classification example notebook, and is similar for TL6 and TL7. The only things that change are the protobuf file name (MODEL_DEF) for the network deployment descriptor and the layer name from which the vector is pulled at the end (the output["fc7"] line in the code below). You will need to install Caffe and pycaffe for this to work. Also, to import pycaffe as I have done in the code below, I had to add the location of pycaffe to my PYTHONPATH (ie, export PYTHONPATH=$PYTHONPATH:$CAFFE_HOME/python). Installation was mostly painless with Ubuntu 15.10 and Anaconda Python; the two issues I ran into were solved using solutions from here and here. Also, since I am just running the model (as opposed to training it), a GPU box is not needed.
The original model definition for the MODEL_DEF assignment in the code is in this deploy.prototxt file. For running with TL7, I needed to remove lines 184-213 (ie, everything below the layer named "fc7"), and for TL6, lines 160-213 (ie, everything below the layer named "fc6").
My inputs were of size (640, 480, 3), but the CNN takes images of size (3, 227, 227), so I had to resize my images and swap the axes to make them compatible with the CNN. I also had to mean-center the images. Caffe provides a "mean image", which looks a bit like this if you visualize it (unlike my intuition that it's just a matrix of 128s).
The whole process of resizing, swapping axes, mean-centering and feeding the resulting matrix to the CNN is done using a Caffe Transformer. Once we define the transformer, we just loop through our images and run the CNN over each image, finally collecting the image vector at the other end.
# Source: src/es_build_tl.py
# -*- coding: utf-8 -*-
import caffe
import matplotlib.pyplot as plt
import numpy as np
import operator
import os
import scipy.misc

IMAGE_DIR = "/path/to/butterflies/"
RESIZED_IMAGE_DIR = "/path/to/butterflies_227x227"
OUTPUT_FILE = "/path/to/butterflies_tl7.txt"

CAFFE_HOME = "../"  # in caffe/examples
MODEL_DEF = CAFFE_HOME + "models/bvlc_reference_caffenet/deploy_fc7.prototxt"
MODEL_WEIGHTS = CAFFE_HOME + "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel"
MEAN_IMAGE = CAFFE_HOME + "python/caffe/imagenet/ilsvrc_2012_mean.npy"

# resize input files to 227x227, since the Caffe network expects these dimensions
for img_name in os.listdir(IMAGE_DIR):
    img_src = plt.imread(os.path.join(IMAGE_DIR, img_name))
    img_res = scipy.misc.imresize(img_src, (227, 227))
    plt.imsave(os.path.join(RESIZED_IMAGE_DIR, img_name), img_res)

# Caffe model setup - I modify the supplied deploy.prototxt file to remove
# the last two layers "fc8" and "prob". Output of "fc7" is a (4096,) vector
caffe.set_mode_cpu()
net = caffe.Net(MODEL_DEF, MODEL_WEIGHTS, caffe.TEST)

# print the network configuration (layer names and blob shapes)
for layer_name, blob in net.blobs.iteritems():
    print layer_name, str(blob.data.shape)

# input preprocessing (subtract mean, align axes, etc)
mu = np.load(MEAN_IMAGE)
mu = mu.mean(1).mean(1)
transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2, 0, 1))
transformer.set_mean("data", mu)
transformer.set_raw_scale("data", 255)
transformer.set_channel_swap("data", (2, 1, 0))

# set size of input (batch size is 50, even though I just classify 1 by 1)
net.blobs["data"].reshape(50, 3, 227, 227)

# run image through truncated network and write (4096,) image vector
fout = open(OUTPUT_FILE, 'wb')
for img_name in os.listdir(RESIZED_IMAGE_DIR):
    print img_name
    image = caffe.io.load_image(os.path.join(RESIZED_IMAGE_DIR, img_name))
    transformed_image = transformer.preprocess("data", image)
    net.blobs["data"].data[...] = transformed_image
    output = net.forward()
    img_vec = output["fc7"][0]
    # keep only non-zero activations, sorted by descending magnitude
    words = [(i, x) for i, x in enumerate(img_vec.tolist()) if x > 0.0]
    sorted_words = sorted(words, key=operator.itemgetter(1), reverse=True)
    words = ["%s|%.5f" % (i, x) for i, x in sorted_words]
    fout.write("%s\t%s\n" % (img_name, " ".join(words)))
fout.close()
The output for both the TL6 and TL7 runs is a file of image names and their associated vectors extracted from the respective truncated CNNs. In both cases, image vectors have shape (4096,).
Loading into the index and Evaluation
Since the data formats are identical, I can reuse the code for loading the index across all the approaches. I decided to store the data for all approaches in the same index, under different document types. So I needed to update my mapping to explicitly list each type and its associated field definitions.
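As an illustration, here is a minimal sketch of what that mapping update might look like with the elasticsearch-py client. The field names filename and imgsig match what the query code expects, but the field settings (and the payload-aware analyzer for imgsig) are assumptions on my part; the actual index creation code is in the previous post.

# A minimal sketch (not the actual loader): one mapping per document type,
# all in the same "butterflies" index. The "payloads" analyzer for imgsig is
# assumed to be defined in the index settings (see the previous post).
import elasticsearch

es = elasticsearch.Elasticsearch(hosts=[{"host": "127.0.0.1", "port": "9200"}])

mapping = {
    "properties": {
        "filename": {"type": "string", "index": "not_analyzed"},
        "imgsig": {"type": "string", "analyzer": "payloads"}
    }
}
for doc_type in ["rgb", "hsv", "imf", "tl6", "tl7"]:
    es.indices.put_mapping(index="butterflies", doc_type=doc_type, body=mapping)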
Once the data was in Elasticsearch, the following code was used to select 50 random images, run them against each document type with varying numbers of features, and finally draw the chart shown earlier in the post.
# Source: src/eval_feats.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
import elasticsearch
import matplotlib.pyplot as plt
import numpy as np
import os

import image_search_utils

IMAGE_DIR = "/path/to/data/for/butterflies"
OUTPUT_DIR = "/path/to/output"

ES_HOST = "127.0.0.1"
ES_PORT = "9200"
ES_INDEXNAME = "butterflies"

# gather 50 random unique images from collection
np.random.seed(42)
test_image_ids = set(np.random.choice(200, 50, replace=False).tolist())
test_images = []
curr_idx = 0
for fname in os.listdir(IMAGE_DIR):
    if curr_idx in test_image_ids:
        test_images.append(fname)
    curr_idx += 1

# Run the evaluation
# for each doctype, I find the max number of non-zero features per doctype
# for our corpus and then cap the number of features. Then for each doc type
# I query with each test image varying the number of features and record the
# MRR for each result. MRR is averaged across the test set for each doc type
# and number of features combination and plotted
es = elasticsearch.Elasticsearch(hosts=[{
    "host": ES_HOST,
    "port": ES_PORT
}])
feature_sizes = {  # approx number of max non-zero entries per doc_type
    "rgb": 40,
    "hsv": 40,
    "imf": 100,
    "tl6": 1000,
    "tl7": 1000
}
for doc_type in feature_sizes.keys():
    print("Evaluating %s" % (doc_type))
    fout = open(os.path.join(OUTPUT_DIR, "eval-%s.txt" % (doc_type)), 'wb')
    num_feats_list = np.linspace(0, feature_sizes[doc_type], 20).tolist()
    for num_feats in num_feats_list[1:]:
        num_feats = int(num_feats)
        print("... #-features: %d for doc_type: %s" % (num_feats, doc_type))
        result_mrr = 0
        for test_image in test_images:
            print("...... Querying with %s (#-feats: %d, doc_type: %s)" %
                  (test_image, num_feats, doc_type))
            qdata = image_search_utils.search_by_id(test_image, es,
                                                    ES_INDEXNAME, doc_type)
            # build vector for query image
            q_vec = []
            for ws in qdata[0]["imgsig"].split(" "):
                word, score = ws.split("|")
                q_vec.append((word, score))
            results = image_search_utils.search(q_vec, es, ES_INDEXNAME,
                                                doc_type, num_feats)
            result_filenames = [result[0] for result in results]
            for rank, result_image in enumerate(result_filenames):
                if result_image == test_image:
                    result_mrr += (1.0 / (rank + 1))
                    break
        fout.write("%d\t%.5f\n" % (num_feats, result_mrr / len(test_images)))
    fout.close()

# Plot the results
doc_types = ["rgb", "hsv", "imf", "tl6", "tl7"]
for doc_type in doc_types:
    feval = open(os.path.join(OUTPUT_DIR, "eval-%s.txt" % (doc_type)), 'rb')
    xs, ys = [], []
    for line in feval:
        x, y = line.strip().split("\t")
        xs.append(int(x))
        ys.append(float(y))
    feval.close()
    plt.plot(range(len(xs)), ys, label=doc_type)
plt.xticks([])
plt.xlabel("Number of features")
plt.ylabel("Mean Reciprocal Rank (MRR)")
plt.grid(True)
plt.legend()
plt.show()
Finally, I needed to add one more function (to image_search_utils) to look up an image by ID from the index, instead of generating the vector for the query image in code. Here is the additional method.
def search_by_id(filename, es, index_name, doc_type):
    query = """
    {
        "from": 0,
        "size": 1,
        "query": {
            "match": {
                "filename": "%s"
            }
        }
    }
    """ % (filename)
    resp = es.search(index=index_name, doc_type=doc_type, body=query)
    hits = resp["hits"]["hits"]
    return [hit["_source"] for hit in hits]
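For example, to pull back the stored HSV vector for one of the test images (the index name, document type and field name below match the ones used in the evaluation code):

qdata = image_search_utils.search_by_id("1132868439-1210.jpg", es,
                                        "butterflies", "hsv")
print(qdata[0]["imgsig"][:40])   # v122|100.00000 h16|86.25048 ...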
That's all I have for today; I hope you enjoyed reading it. If you have suggestions about other vector spaces that have worked well for you for image search, I would love to hear about them.