Sunday, October 20, 2019

Trip Report: Graphorum 2019


I just got back from Graphorum 2019, organized by the folks at Dataversity. The conference was held at The Westin Chicago River North, and was colocated with the Data Architecture Summit (DAS), also organized by Dataversity. Attendees were not restricted to one conference or the other; they were allowed, even encouraged, to attend talks from both, perhaps in an effort to popularize Graph concepts among Data Architects, and Data Architecture best practices among Graph practitioners. Both conferences were heavily multi-track -- at any point, you had around 3-5 choices if you restricted yourself to talks in a single track. I attended only the Graphorum talks, so this trip report represents one of 322 million possible unique trip reports (and one of 142 billion possible unique reports if I had not restricted myself), under the naive assumption that all talks within each conference were independent, and that at any point I was equally likely to select any one of the talks offered.

The conference was four days long, starting on Monday last week (October 14, 2019) and ending on Thursday (October 17). As expected, many of the talks (at least among the ones I attended) were centered around Knowledge Graphs (KGs). Some of these focused on techniques and advice on how to build them from unstructured text, and some focused on making them more effective. Many of the presentations ended up covering the various Linked Data standards such as Resource Description Framework (RDF) for specifying semantic triples, and Web Ontology Language (OWL) for doing inference on them. More than one talk mentioned the new Shapes Constraint Language (SHACL) for validating such RDF graphs. On the other hand, it looks like there is strong industry support for Labeled Property Graphs (LPG) as well, both among database vendors and users. Regardless of the graph flavor, there was also quite a lot of interest in using Machine Learning (ML) and Natural Language Processing (NLP) to leverage the signal inherent in graph structure. In the rest of this post, I will cover the talks that I liked along my 1-in-322M path through the conference. I will also cover some interesting discussions I had with vendors at the exhibition, and some overall feedback about the conference as a whole.

I arrived late in Chicago on Sunday night, so I was unable to take advantage of the early bird registration. At breakfast the next morning, I found myself at the end of a fairly long line while the others were relatively empty. In an attempt to understand why, and notwithstanding Murphy's Law, I noticed that the lines were split by the first character of the last name, and unevenly sized (A-E, F-J, K-O, and P-Z). It is possible that the sizing was based on actual attendee last names, in which case I guess people with last names in P-Z tend to be relatively last-minute types.

Day 1


On the first day, I attended a presentation on Applications of Knowledge Graphs in Enterprise Data Management and AI by Juergen Jakobitsch of Semantic Web Company and Andreas Blumauer of PoolParty. It was very comprehensive, and included lots of background material on Knowledge Graphs. Among the standard applications covered were semantic search, info-box style results, product recommendations, virtual assistants, etc. One interesting application mentioned was text data augmentation, and (in the same vein) filtering chatbot results for illogical answers. They also talked about the PoolParty pipeline for converting inputs (structured and unstructured) into Knowledge Graphs, which includes entity resolution and entity linking. This presentation, like several others throughout the conference, also covered the various W3C standards such as RDF, SPARQL, OWL2, and SHACL.

I also attended Graphs Transform Chemical Discovery in Pharmaceutical Research presented by Liana Kiff of Tom Sawyer Software. I had initially thought that it would cover the actual process of Chemical Discovery using graphs, but it turned out to be about an (admittedly very powerful) visualization of a graph of chemical entities from the ChEMBL database, including features that allow for manual chemical discovery.

The final presentation of the day was You Must Be This Tall: Machine Learning from Graphs, Tables, Trees, and Documents by Brian Sletten of Bosatsu Consulting, Inc. As with the previous talk, I came in hoping to learn about ML with graphs, but the talk turned out to be about what you need to do before you can do ML with your graphs. Nevertheless, the presenter was very knowledgeable, so the talk ended up being pretty interesting and educational, even though it covered a lot of the RDF/SPARQL/OWL ground covered by earlier talks. One good insight here was the distinction between data, knowledge, and wisdom as points in a Data Understanding versus Connectedness space, coupled with capturing relationships between entities, the context in which these relations exist, and the exploitation of this information using RDF technologies. Another was the need for data lakes to be available as a prerequisite to effective ML in the organization. Yet another thing I liked about his presentation was his example SPARQL queries against Wikidata. He also talked about JSON-LD and how it can be a good substitute for RDF.
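For readers who have not tried this, the sketch below is not one of the presenter's queries, just a minimal example of hitting the public Wikidata SPARQL endpoint from Python. The entity wd:Q146 (house cat) and property wdt:P31 (instance of) are standard Wikidata identifiers; the User-Agent string is my own placeholder.

```python
import requests

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Find 10 house cats (wd:Q146) via the "instance of" property (wdt:P31),
# with English labels supplied by the label service.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""

response = requests.get(
    WIKIDATA_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "graphorum-trip-report-example/0.1"},  # placeholder UA
)
for binding in response.json()["results"]["bindings"]:
    print(binding["item"]["value"], binding["itemLabel"]["value"])
```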

Day 2


The first presentation on Tuesday was Ontology Engineering for Knowledge Graphs by Elisa Kendall and Deborah McGuinness. The talk focused on the role of the Ontology Engineer, the technical person who talks to the Knowledge Engineer or Domain Expert; a main component of this role is being effective at interviewing the Knowledge Engineer. The talk also covered various aspects of modeling the domain, with many examples drawn from the presenters' own experiences.

The next presentation I attended was Merging Company Data from Multiple Sources with GraphDB and Ontotext Platform by Atanas Kiryakov (their CEO). As expected, the presentation was more about how Ontotext does things. However, Ontotext is one of the first graph database companies I know of to pursue a strong NLP + Graph strategy, so there was lots of interesting content. Among the functionality covered were Entity Recognition from text and Entity Disambiguation (from structured data as well as text), the use of different embedding technologies (1-hot, 2-hot, BERT, etc.), and the use of techniques such as Random Forests and SVMs for entity and document classification, BiLSTM and CRF for Named Entity Recognition, and Inductive Logic Programming (ILP) for rules engines built around the discovered entities and relations.

The final presentation on Tuesday for me was Reveal Predictive Patterns with Neo4j Graph Algorithms (Hands-on) by Amy Hodler and William Lyon. I had already learned about Graph Algorithms in Neo4j (and also used some of them for my own talk later in the conference) from the O'Reilly book Graph Algorithms: Practical Examples in Apache Spark and Neo4j (Amy Hodler, one of the presenters, is also one of its authors), so some of it was old material for me. The talk started off with the need for graph algorithms as tools that exploit the additional information implicit in the graph structure, then covered the various classes of graph algorithms (Pathfinding, Centrality, Community Detection, Link Prediction, and Similarity), with deep dives into specific algorithms and demos of running them on the Neo4j Desktop product (a proprietary product with a 30-day free trial, though all the features covered are also available in the free community edition). I ended up learning a few new things, such as how to use virtual graphs (generated as the result of a Cypher query, sort of like views in the RDBMS world), and how to use the Strongly Connected Components algorithm as a debugging tool. They also showed off their NEuler product, which allows forms-based invocation of various algorithms, as well as some very good visualizations. Speaking of visualization, William Lyon also mentioned the neo4j-contrib/neovis.js project, which seems interesting as well. Overall, lots of useful information about Neo4j and graphs.
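The talk itself demonstrated the procedures from Neo4j's Graph Algorithms library, which I won't try to reproduce from memory here. As a minimal flavor of driving Neo4j from Python, though, here is a plain-Cypher degree-centrality style query, run with the official neo4j driver against a hypothetical graph of (:Person)-[:KNOWS]-(:Person); the connection URI, credentials, labels, and property names are all assumptions for illustration.

```python
from neo4j import GraphDatabase

# Hypothetical local instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Degree centrality expressed directly in Cypher: count relationships per node.
DEGREE_QUERY = """
MATCH (p:Person)-[:KNOWS]-()
RETURN p.name AS name, count(*) AS degree
ORDER BY degree DESC LIMIT 10
"""

with driver.session() as session:
    for record in session.run(DEGREE_QUERY):
        print(record["name"], record["degree"])

driver.close()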

I also learned about the Bridges of Chicago, thanks to a challenge from the presenters to use Cypher (the Neo4j query language) to find an Eulerian path, similar to the Bridges of Königsberg problem. I guess I was the only one who responded; the problem is much simpler than it appears at first glance.

Exhibitors started setting up their booths today, so I spent some of the coffee breaks and most of the evening talking to various exhibitors. Both graph database vendors and consultants were well represented among the exhibitors (considering it was a graph + data architecture conference). Graph vendors I already knew of included Neo4j, Ontotext, TigerGraph, DataStax, and Stardog. Among those I learned about at this conference were PoolParty, Semantic Web Company, and Cambridge Semantics. Having attended the presentations from PoolParty / Semantic Web Company and Ontotext, I spent a lot of time talking with them. I also met up with the folks at TigerGraph, and let them know how helpful their Graph Gurus webinar series has been to me. I also took the opportunity to meet up with the folks at Stardog, whom I had met through a referral at another graph conference a few years earlier. Since I was a speaker here, the conversation also drifted occasionally to the subject of my talk, and to which graph database I was using (Neo4j).

Day 3


Wednesday was quite a heavy day in terms of presentations, comparatively speaking. It started with two keynote presentations. The first one was Knowledge Graphs and AI: The Future of Enterprise Data by David Newman from Wells Fargo. He spoke of the progression from Strings to Things to Predicted Things to Vectors, which resonated with me, since we are progressing along a very similar path ourselves. He led us through multiple examples involving harmonizing an entity across multiple Knowledge Graphs in the enterprise, the need for classifying entities into a taxonomy, using Knowledge Graphs to predict new relationships, using graph relations for creating digital fingerprints for ML algorithms, etc. His examples referenced the Financial Industry Business Ontology (FIBO), which provides a standard schema for the financial services industry.

The second keynote was Graph Stories: How Four Metaphors can help you decide if Graphs are right for you by Dan McCreary of Optum. While David Newman's presentation was based on RDF-style graphs, Dan McCreary is a big proponent of Labeled Property Graphs (LPG), a choice backed by several very pragmatic reasons. The four metaphors he described are the Neighborhood Walk, the Knowledge Triangle, the Open World Assumption, and the Jenga Tower. Specifically, the first indicates the importance of relationship traversal in your applications, the second indicates where your application is (or wants to be) on the Data / Information / Knowledge triangle, the third indicates the ease with which new information can be incorporated into your system, and the fourth indicates the resilience of your query system to small changes in your backend. The keynote also covered the importance of graph structure ("Structure is the new gold in data mining"), and the inter-relationship of graphs with Deep Learning techniques such as Graph Convolutional Networks (GCNN) and Structured Learning with Graphs.

The next presentation I attended was Knowledge Graphs and Model Driven Classification by Scott Henninger of SmartLogic, where he showed off the capabilities of the SmartLogic platform, which center around metadata tagging, document classification (based on the metadata tagging and external taxonomies), and Search Enhancement Services (SES). The bulk of the capabilities seem to be rule based, which can be good for explainability purposes. SmartLogic's KG backend is based on RDF Schema, OWL, and SHACL. An interesting piece of SmartLogic functionality is that it allows the user to manually fine-tune the (term) weights from their classifier. I got quite excited at this, thinking that perhaps this functionality could be leveraged to produce explainable Deep Learning models by perturbing the inputs, but then realized that the intuition is similar to the idea behind LIME (Local Interpretable Model-Agnostic Explanations).
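To make that intuition concrete, here is a rough sketch of the LIME idea as I understand it -- perturb the input, query the black-box model on the perturbations, and fit a simple proximity-weighted linear surrogate whose coefficients act as local feature explanations. This is neither the LIME library's API nor anything SmartLogic showed; the function and parameter names are my own, and the noise scale and kernel width are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain_locally(black_box_predict, x, num_samples=1000, kernel_width=0.75):
    """Rough sketch of the LIME intuition: local surrogate model around x."""
    rng = np.random.RandomState(42)
    # perturb the input by adding gaussian noise to each feature
    perturbations = x + rng.normal(scale=0.1, size=(num_samples, x.shape[0]))
    preds = black_box_predict(perturbations)          # black-box outputs
    # weight perturbed samples by their proximity to the original input
    distances = np.linalg.norm(perturbations - x, axis=1)
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))
    # fit a weighted linear (ridge) surrogate; coefficients = local importances
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbations, preds, sample_weight=weights)
    return surrogate.coef_
```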

Next up was a talk on How Do You Read Millions of Documents for Meaning using Graph? by Ryan Chandler of Caterpillar, Inc. He described a system he built at Caterpillar that allows customer support technicians to query a large collection of internal support tickets created by other technicians; the end result is a query-able knowledge base. The text in the support tickets is tokenized and segmented into sentences, and tagged (classified) as cause, complaint, solution, note, or correction. The document is decomposed into semantic frames, and the document and its associated semantic frames, along with its metadata, are stored in a Neo4j graph database. On the query side, the natural language (NL) query is converted into a graph using a dependency parse, and re-composed into a Cypher query against specific semantic frames (as indicated by the metadata). The Cypher query produces a ranked list of support tickets that best satisfy the NL query. I thought this was quite an interesting technique, although it may be somewhat dependent on the structure of the input data.
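The real pipeline is Caterpillar-internal, but a toy sketch of the general NL-to-Cypher idea might look like the following, using spaCy's dependency parse to pull out search terms and dropping them into a Cypher template. Everything here is assumed for illustration: the (:Ticket {complaint, score}) schema, the use of noun chunks as search terms, and the CONTAINS-based matching (the actual system mapped queries onto semantic frames).

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def nl_to_cypher(question):
    """Toy sketch: noun chunks (derived from the dependency parse) become
    search terms against a hypothetical (:Ticket) graph."""
    doc = nlp(question)
    terms = [chunk.root.lemma_.lower() for chunk in doc.noun_chunks]
    where_clause = " AND ".join(
        "toLower(t.complaint) CONTAINS '%s'" % term for term in terms
    )
    return (
        "MATCH (t:Ticket) WHERE " + where_clause +
        " RETURN t ORDER BY t.score DESC LIMIT 10"
    )

print(nl_to_cypher("Why does the engine overheat at high idle?"))
```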

The next presentation I attended was Graph Analytics for Enterprise Applications by Melliyal Annamalai, Souripriya Das, and Matthew Perry from Oracle. I came in a few minutes late so I missed the first part, but from what I gathered, it covered Oracle's foray into graph databases -- it turns out that Oracle customers can now start working with SPARQL using SQL Developer, seamlessly against Oracle's new hybrid Property and RDF graph. The functionality is nice, but probably only useful for current and future Oracle customers.

My next presentation was Insufficient Facts always invite Danger: Combat them with a Logical Model by Michael Grove of Stardog, where he described how important it is to have a Logical Model to ensure the completeness of your model, and how it can help you avoid problems later.

The evening before, I had spent some time at the DataStax booth, mainly for nostalgic reasons since I worked with Cassandra (the Apache version, not the DataStax version) at my previous job, and I was curious about their graph product based on Cassandra (initially called Titan, then Janus). So I attended the presentation Graph Innovations for Distributed Data at Scale by Jonathan Lacefield. The presentation covered the evolution of their graph product, and also answered a nagging question I had about how they implement the graph in a column-family database under the covers -- it turns out that each row is basically the star graph around a node. Other interesting items in this presentation were their use of Gremlin, and Spark support through their DataStax extension.

The last presentation of the day was Knowledge Graphs and GraphQL in Action: A Practical Approach to using RDF and Semantic Models for Web Applications by Irene Polikoff and Ralph Hodgson of TopQuadrant. They described their Semantic GraphQL interface, which provides the user with a GraphQL interface and converts it down to RDF, OWL, and SHACL queries against an RDF triple store.

Finally, the last event of the day was a session about Open Source Knowledge Graph Tooling, which really turned out to be a group of banking folks trying to collaborate around the FIBO Ontology, although it is likely that they might expand to other industries in the future. There was also talk of compiling a current (non-deprecated) list of open source ontologies in various industries, and of applying unit tests to ontologies so they don't become stale and irrelevant, both of which were interesting to me.

The exhibitors were still around, and so I hung around for some more conversations with vendors and fellow attendees for a couple more hours after that. Among them were Cambridge Semantics, who have a fast analytics graph database called AnzoDB.

Day 4


The first presentation of the day was Unsupervised and Supervised ML on Big Graph: Case Studies by Victor Lee. He described various case studies using TigerGraph. The first one was finding influential healthcare providers in various local networks from a specialty network, and finding their influence networks. Another case study had to do with detecting spam phone calls in the China Mobile network, the training data for which consisted of 180+ graph features; the model was a Random Forest classifier. At prediction time, an incoming phone call would be placed in the call graph, and the 180+ features computed and fed into the Random Forest model to predict (in under 20 milliseconds) whether or not the call was spam. The third case study was for bank fraud, based on some synthetic data from a Kaggle competition, where TigerGraph engineers built some compound relationships based on edges discovered in the feature graph, which ended up giving good results, showing that the structure of the data provides useful signal. The talk ended with an introduction to Node2Vec, a graph embedding scheme.
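The 180+ China Mobile features themselves were not shared, but the general pattern -- derive per-node graph features, then train a standard classifier on them -- is easy to sketch. The following toy example (my own, not TigerGraph's) uses a random networkx graph as a stand-in call graph, three illustrative features, and random labels purely to make the snippet runnable; real labels would come from historical spam reports.

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for a call graph: nodes are phone numbers, edges are calls.
G = nx.gnm_random_graph(1000, 5000, seed=42)

# A few per-node graph features (the real system reportedly used 180+).
degree = dict(G.degree())
pagerank = nx.pagerank(G)
clustering = nx.clustering(G)

X = np.array([[degree[n], pagerank[n], clustering[n]] for n in G.nodes()])
# Hypothetical spam / not-spam labels -- random here just to run the example.
y = np.random.RandomState(42).randint(0, 2, size=X.shape[0])

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
print(clf.predict(X[:5]))   # predictions for five "incoming calls"
```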

The next presentation in my list was my own (Graph Techniques for Natural Language Processing). My presentation was about using Graph techniques (mostly a combination of common third party algorithms) to solve Natural Language Processing (NLP) problems. I covered four case studies that attempted to replicate academic papers (referenced from the Bibliography of Graph-Based Natural Language Processing and Information Retrieval) around document summarization, clustering using language model based vectors, word sense disambiguation, and topic finding. Graph techniques used included various graph centrality metrics (some built-in and some computed using Cypher and built-in algorithms), random walk techniques, Louvain Community Detection, Label Propagation, and Personalized PageRank. Compared to the other presentations, mine was probably a bit unusual, since it focused on NLP more than on graphs, so while I had only about 15-20 attendees, there seemed to be lots of interest, and some very good questions at the end. For those of you who weren't able to make it to the presentation but would like more details, you can find the link to my slides and code (in Jupyter notebooks, with a lot of verbose commentary) at my sujitpal/nlp-graph-examples repository.
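The notebooks in that repository contain the actual case studies; purely as a flavor of the kind of technique involved, here is a minimal TextRank-style extractive summarizer (not taken from the talk's notebooks), which builds a sentence similarity graph and ranks sentences with networkx PageRank. The use of TF-IDF cosine similarity here is an assumption for brevity; the talk used other vectorizations and centrality measures as well.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, num_sentences=3):
    """Build a sentence similarity graph and rank sentences by PageRank."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(sim)          # nodes = sentences, edge weights = similarity
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]  # keep original sentence order

sentences = [
    "Graphs capture relationships between entities.",
    "PageRank scores nodes by the structure of their links.",
    "Summarization picks the most central sentences.",
    "Lunch at the conference was excellent.",
]
print(textrank_summary(sentences, num_sentences=2))
```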

I hung around a bit after my presentation answering questions, so I ended up being a bit late to the next presentation, even with the generous coffee break in between. This was When Time Meets Relationships: Understanding an Immutable Graph Database by Brian Platz of Fluree. He made the case that a Knowledge Graph is a snapshot at a point in time. A time-aware Knowledge Graph can be thought of as an immutable linked list, where facts are added to an append-only log, and made tamper-proof with hashing techniques, much like a private blockchain. The strategy assumes the Knowledge Graph is a triple-store of (Subject, Predicate, Object) facts. As time passes, facts are either added or retracted, so a time-aware tuple would be (Subject, Predicate, Object, Time, Add/Retract). In addition, labeled properties, such as a scheduled expiration date, can be accommodated with an additional Metadata attribute. He also covered some indexing strategies that can make it efficient to query such a time-aware tuple-store.
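The snippet below is not Fluree's implementation, just a toy illustration of the idea as I understood it: an append-only log of (Subject, Predicate, Object, Time, Add/Retract) facts, each chained to the previous one with a hash, and a replay function that reconstructs the graph as of any point in time. All class and method names are my own.

```python
import hashlib
import time
from collections import namedtuple

Fact = namedtuple("Fact", "subject predicate obj timestamp op prev_hash")

class FactLog:
    """Toy append-only, hash-chained fact log (illustration only)."""

    def __init__(self):
        self.log = []

    def _hash(self, fact):
        payload = "|".join(str(f) for f in fact)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def assert_fact(self, s, p, o, op="add"):
        prev_hash = self._hash(self.log[-1]) if self.log else "genesis"
        fact = Fact(s, p, o, time.time(), op, prev_hash)
        self.log.append(fact)     # append-only: existing facts are never modified
        return fact

    def retract_fact(self, s, p, o):
        return self.assert_fact(s, p, o, op="retract")

    def current_facts(self, as_of=None):
        """Replay the log up to `as_of` to get the graph at that point in time."""
        state = set()
        for f in self.log:
            if as_of is not None and f.timestamp > as_of:
                break
            triple = (f.subject, f.predicate, f.obj)
            if f.op == "add":
                state.add(triple)
            else:
                state.discard(triple)
        return state

log = FactLog()
log.assert_fact("acct:123", "hasOwner", "Alice")
log.retract_fact("acct:123", "hasOwner", "Alice")
log.assert_fact("acct:123", "hasOwner", "Bob")
print(log.current_facts())   # {('acct:123', 'hasOwner', 'Bob')}
```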

After this, there were two keynotes. The first one was News and Graphs by Peter Olson of NBC News Digital, which covered the graph structure of the NBC News Publishing pipeline, and how NBC leverages graphs to provide news with high velocity and scale. The second keynote was Knowledge Graph Pilot improves Data Quality While Providing a Customer 360 View by Bethany Swhon and Patricia Branum of Capital One, where they described how they improved the quality of their Knowledge Graph to provide a better Customer view across the enterprise.

The final presentation of the day and conference for me was Automated Encoding of Knowledge from Unstructured Text into a Graph Database by Chris Davis of Lymba. The presentation described the Lymba pipeline for converting text into a Knowledge Graph. It included the usual preprocessing, tokenizing, POS tagging, and segmentation steps other presentations covered (and which in some ways seem to be standard knowledge in the text-to-KG NLP sub-community), but this presentation went one step further and talked about the need for Word Sense Disambiguation, Concept extraction (using gazetteers and NER models), and Syntactic (constituency) and Semantic (dependency) parses for relation extraction. It also included Coreference Resolution, which is quite important but usually omitted from pipelines because of its complexity. The Lymba product provides a turnkey solution plus consulting for various industries.

I had to catch a flight back, and having heard about the Chicago traffic and having faced the zero tolerance for lateness in large airports such as LAX, I didn't want to miss it. So I ended up skipping the last panel discussion on Graphs vs Tables. Turns out I didn't need to, but better safe than sorry.

Conclusions


As conferences go, this was quite luxurious -- attendees were treated to a sumptuous buffet breakfast every day, and a 3-course sit-down lunch on 3 of the 4 days (one of the days was build-your-own sandwiches, but even that was quite nice). One observation is that sit-down lunches can foster really good and insightful conversations. In addition, there was coffee and snacks throughout the day, and (limited) free drinks on 2 of the 3 evenings. Swag included a Dataversity-branded backpack to hold your conference materials, wool hats with the Dataversity logo, stickers, and a pen containing a USB drive with all the presentation slides, as well as the swag vendors give out at their stalls (to potential clients).

Of course, the nicest thing about conferences (after the presentations) is the conversations with fellow attendees, and the chance to learn from their insights and from what they are doing with the tech under consideration (in this case graphs). I met people from aviation, health (insurance), finance, consulting, the government (from both the covert and the overt branches), as well as scientific publishing. In addition, it was a chance to interact with people from the vendor companies, and bounce ideas off them about specific things they do well. Two insights, both gained in lunch table conversations -- first, RDF has better interoperability and tooling, but LPGs are easier to work with; second, certain Asian cultures believe that you can never define an object fully, which seems to warrant more of a triple-store structure than the more efficient but constrained graph structure.

Overall, it was good to see Graphs being treated with so much importance. The previous Graph conferences I have attended were much smaller affairs, rarely lasting more than a day. I suppose this might partly be because of the focus on explainable AI, advances in Knowledge Graphs, Graph CNNs and embeddings, as well as the realization that graph structure provides useful exploitable signal, all of which are causing graphs to become more and more important, and graph conferences to become more popular.

If I had to suggest an improvement, it would be to streamline the evaluation process. I don't know how many feedback forms were returned (I returned all four that were provided in my conference materials, but not the final overall one). Each form takes approximately 5 minutes to complete, so it is tempting to skip it and head to the next session instead. And by the evening, it is harder to do, since you have to refer to your notes instead of relying on short-term memory. On the other hand, someone at the door with an iPad who scans your badge and asks you to tap a smiley versus a frowney icon would get much better coverage (although you would have to interpret the meaning of the feedback). I guess it's the tension between explicit and implicit feedback; there are tradeoffs either way.


Sunday, October 06, 2019

Efficient Reverse Image Search on Lucene using Approximate Nearest Neighbors


I wrote last week about our Search Summit, an internal conference where engineers and product managers from Elsevier, LexisNexis, and the various other organizations that make up the RELX Group get together to share ideas and best practices on search. The Search Summit is in its third year, and given that it's a single-track conference that runs for two days, there are limited speaking slots available, and competition for these slots is quite fierce. Our acceptance rate this year was around the 40-50% mark, which I guess puts us on par with some big external conferences in terms of exclusivity. In any case, that was my long-winded way of saying that this year I didn't get accepted for a speaking slot, and was offered the opportunity to present a poster instead, which I accepted.

This was the first time I made a poster for work, and quite honestly, I didn't know where to begin. My only experience with posters was back when I was at school, where I remember that we used crayons a lot, and then helping my kids build posters for their school, which involved a lot of cutting out pictures and gluing them on to a posterboard. Fortunately, as I learned from some googling, technology has progressed since then, and now one can build the poster as a single Microsoft Powerpoint (or Apple Keynote or Google Slides) slide (I used Powerpoint). Quite a few sites provide free templates for various standard poster sizes as well; I ended up choosing one from Genigraphics. Once you are done designing the poster, you save it as a PDF, and upload the PDF to a site such as Staples Print Services for printing. And that's pretty much all there is to it. Here is the poster I ended up presenting at the Search Summit.


Basically, the poster describes a Proof of Concept (POC) application that attempts to do reverse image search, i.e., given an image, it returns a ranked list of images from the corpus that are similar to the given image. My dataset for the experiments was a set of approximately 20,000 images from the ImageCLEF 2016 competition. The notion of similarity that I use is the one learned by a RESNET network pre-trained on ImageNet. The idea is that this sort of network, trained on a large generic image corpus such as ImageNet, learns how to recognize colors, edges, shapes, and other more complex features in images, and these features can be used to provide a notion of similarity across images, much like tokens in a document.
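The exact framework and layer used in the POC may differ from what is shown below, but as a minimal sketch, extracting such 2048-dimensional feature vectors with a pretrained ResNet-50 in Keras might look like this (the image path is a placeholder):

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# Pretrained ResNet-50 without the classification head; global average
# pooling yields one 2048-dimensional feature vector per image.
model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def image_vector(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]            # shape: (2048,)

vec = image_vector("some_image.jpg")      # hypothetical path
print(vec.shape)
```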

RESNET (and other pre-trained networks like it) generate dense and (comparatively) low-dimensional feature vectors. For example, the last layer of a trained RESNET model will generate feature vectors of size 2048. On the other hand, text is typically vectorized for Information Retrieval as a vector over all the tokens (and sometimes token combinations as well) in its vocabulary, so feature vector sizes of 50,000 or 100,000 are not uncommon. In addition, such vectors tend to be quite sparse, since a given token is relatively unlikely to occur in many documents. Lucene (and probably other search engines) leverages the high dimensionality and sparseness of text feature vectors through its Inverted Index data structure (aka Postings List), where the key is a token and the value is a priority queue of the document ids in which the token occurs.
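For readers unfamiliar with the data structure, here is a toy illustration of the inverted index idea (Lucene's actual implementation is, of course, far more sophisticated -- compressed postings, scoring, skip lists, and so on):

```python
from collections import defaultdict

# Toy inverted index: token -> postings list of document ids.
docs = {
    1: "graph databases store nodes and edges",
    2: "lucene stores sparse text vectors in an inverted index",
    3: "dense image vectors come from a pretrained network",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for token in set(text.split()):
        index[token].append(doc_id)

print(index["vectors"])   # -> [2, 3]: the documents containing the token
```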

In order to present a ranked list of similar images, a naive (and exhaustive) solution would be to compute some measure of vector similarity (such as Cosine similarity or Euclidean similarity) between the input image and every other image in the corpus, an O(N) operation. A better solution might be to partition the similarity space using something like a KD-tree so that comparison is done against a subset of the most similar images, an O(log N) operation. However, the Lucene inverted index makes the search an almost O(1) operation -- first looking up by token, and then navigating down a priority queue of a subset of document IDs containing that token. My reasoning went as follows -- if I could somehow project the dense, low-dimensional vectors from RESNET back into a sparse, high-dimensional space similar to that used for text vectors without losing too much information, I would be able to do image search as efficiently as text search.

I experimented with two such projection techniques -- Random K-Means and Random Projections.

  • Random K-Means involves running a bunch of K-Means clusterings over the image set, typically 50-200 of them, each with some random value for k. Each image (point in similarity space) is represented by a sequence of its cluster memberships across all the clusterings. The intuition here is that similar images, represented by two points that are close in the similarity space, will share cluster memberships across more clusterings than dissimilar images, represented by points that are further apart. A 2D schematic on the bottom center of the poster illustrates the idea.
  • Random Projections is similar, but instead of partitioning the space into a random number of circular clusters, we partition the space itself using (D-1)-dimensional hyperplanes, where D is the size of the dense feature vector (2048). The number of random projections used is typically much higher than the number of K-Means clusterings, usually 1,000-10,000. Each projection effectively results in a binary clustering, where points behind the hyperplane get assigned a 0 and points in front get assigned a 1. As before, each image is represented by the sequence of its memberships across all the random projections. The intuition here is similar to that of Random K-Means. The 2D schematic on the bottom right of the poster illustrates how random projections work. (A minimal code sketch of both schemes appears after this list.)
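The sketch below shows how dense vectors become sparse "token" sets under the two schemes; it is a simplification of the actual POC pipeline, with much smaller parameter values than those quoted above, random data standing in for the RESNET vectors, and a token naming scheme ("c3_17", "p5_1") that is purely my own.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
vectors = rng.normal(size=(1000, 2048))      # stand-in for RESNET features

def random_kmeans_tokens(vectors, num_clusterings=10, max_k=50):
    """Each image is represented by its cluster id in each of several
    K-Means clusterings with random k, e.g. 'c3_17' = cluster 17 of
    clustering 3. These strings can then be indexed like text tokens."""
    tokens = [[] for _ in range(vectors.shape[0])]
    for c in range(num_clusterings):
        k = rng.randint(2, max_k)
        labels = KMeans(n_clusters=k, random_state=c).fit_predict(vectors)
        for i, label in enumerate(labels):
            tokens[i].append("c%d_%d" % (c, label))
    return tokens

def random_projection_tokens(vectors, num_planes=64):
    """Each random hyperplane contributes one bit per image, depending on
    which side of the hyperplane the image vector falls."""
    planes = rng.normal(size=(num_planes, vectors.shape[1]))
    bits = (vectors @ planes.T) > 0           # shape: (num_images, num_planes)
    return [
        ["p%d_%d" % (p, int(b)) for p, b in enumerate(row)]
        for row in bits
    ]

print(random_kmeans_tokens(vectors[:100])[0])
print(random_projection_tokens(vectors[:5])[0][:8])
```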

My baseline for experiments was caption based text search, i.e., similar images were found by doing text search on their captions. I took a random set of 20 images, and manually evaluated the relevance of the top 10 similar images for each on a 4-point scale. It's only one person, so it's not very objective, but I needed a place to start. The guideline I followed was to give the highest rating (4) if the image is identical, or almost identical, to the source image -- possibly either the same image, or a composite containing the same image, or a minor modification. The second highest rating (3) was for images that were basically the same thing -- for example, if the source image was an X-ray of the left lung, I would rate an X-ray of the right lung as very similar. The third rating (2) was for images that I couldn't rate 4 or 3, but which I would expect to see in the set of similar images. Finally, the lowest rating (1) was for images that I don't expect to see in the results, i.e., irrelevant images.

I used the ratings to compute Discounted Cumulative Gain (DCG) for k=1, 3, 5, and 10, where k is the number of top results taken from the ranked list of similar images for each image, and then averaged them across the 20 images. I then computed the Ideal DCG (IDCG) for k=10 by sorting the ranked list based on my ratings, and then divided the computed DCG by the IDCG to get the Normalized Discounted Cumulative Gain (NDCG). These numbers are shown in the table under the Results section in the poster. The bolded numbers indicate the best performers. Results for k=1 serve as a sanity check, since all the searches in my evaluation returned the source image as the top result in the ranked list of similar images. Overall, the numbers seem quite good, and Random K-Means with 250 clusters produced the best results.
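For reference, the snippet below is a standard DCG/NDCG implementation rather than the exact code behind the poster's numbers (which followed the computation described above); the ratings list is made up for illustration.

```python
import numpy as np

def dcg_at_k(ratings, k):
    """Discounted Cumulative Gain for the top k results, in ranked order."""
    ratings = np.asarray(ratings, dtype=float)[:k]
    discounts = np.log2(np.arange(2, ratings.size + 2))   # log2(2), log2(3), ...
    return float(np.sum(ratings / discounts))

def ndcg_at_k(ratings, k):
    """Normalize by the DCG of the ideal (rating-sorted) ranking."""
    ideal = sorted(ratings, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ratings, k) / idcg if idcg > 0 else 0.0

# Hypothetical ratings (1-4 scale) for the top 10 results of one query.
ratings = [4, 3, 4, 2, 1, 3, 2, 1, 1, 2]
print([round(ndcg_at_k(ratings, k), 3) for k in (1, 3, 5, 10)])
```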

Random K-Means provided the best results in qualitative terms as well. The block of images at the center of the poster shows a brain MRI as the source image, and the top 3 results from the baseline, the best Random K-Means (250 clusters), and the best Random Projections (10,000 projections). A green box indicates a relevant result, and a red box indicates an irrelevant result. As you can see, the caption text search baseline produces more diverse results, but doesn't do very well otherwise. Random K-Means and Random Projections produce results that are more visually similar to the source image, and between the two, Random K-Means tends to produce better results.

In the future, I hope to extend this approach to text vectors as well. Specifically, I want to capitalize on the increase in recall that I noticed with caption based text search, by building embeddings (word2vec, ELMo, BERT, etc.) from the caption and treating these embeddings as additional feature vectors for the image. On the other hand, I do notice that the trend has been to use special purpose backends for vector search, such as the Non-Metric Space Library (NMSLib), the GPU-based FAISS library for efficient similarity search, or hybrid search engines such as Vespa. So while it might be intellectually satisfying to try to make a single platform serve all your search needs, it might be more worthwhile to see if one of these other toolkits provides better results.

I also wanted to acknowledge the work of Simon Hughes. The approaches used in this poster are based heavily on ideas originally proposed by him in his DiceTechJobs/VectorsInSearch project, as well as on ideas and suggestions proposed by him and others on the Relevancy and Matching Technology #vectors-in-search Slack channel.