Sunday, October 20, 2019

Trip Report: Graphorum 2019


I just got back from Graphorum 2019, organized by the folks at Dataversity. The conference was held at The Westin Chicago River North, and co-located with the Data Architecture Summit (DAS), also organized by Dataversity. Attendees were not restricted to their own conference; they were allowed, even encouraged, to attend talks from the other one, perhaps in an effort to popularize Graph concepts among Data Architects, and Data Architecture best practices among Graph practitioners. Both conferences were very heavily multi-track -- at any point, you had around 3-5 choices if you restricted yourself to talks in one conference. I attended only the Graphorum talks, so this trip report represents one of 322 million possible unique trip reports (and one of 142 billion possible unique reports if I had not restricted myself), under the naive assumption that all talks within each conference were independent, and that at any point I was equally likely to select any one of the talks offered.

The conference was four days long, starting on Monday last week (October 14, 2019) and ending on Thursday (October 17). As expected, many of the talks (at least among the ones I attended) were centered around Knowledge Graphs (KGs). Some of these focused on techniques and advice on how to build them from unstructured text, and some focused on making them more effective. Many of the presentations ended up covering the various Linked Data standards, such as the Resource Description Framework (RDF) for specifying semantic triples and the Web Ontology Language (OWL) for doing inference on them. More than one talk mentioned the new Shapes Constraint Language (SHACL) for validating such RDF graphs. On the other hand, it looks like there is strong industry support for Labeled Property Graphs (LPG) as well, both among database vendors and users. Regardless of the graph flavor, there was also quite a lot of interest in using Machine Learning (ML) and Natural Language Processing (NLP) to leverage the signal inherent in graph structure. In the rest of this post, I will cover the talks that I liked along my 1-in-322M path through the conference. I will also cover some interesting discussions I had with vendors at the exhibition, and give some overall feedback about the conference as a whole.

I arrived in Chicago late on Sunday night, so I was unable to take advantage of the early-bird registration. At breakfast the next morning, I found myself at the end of a fairly long line while the other lines were relatively empty. In an attempt to understand why (and notwithstanding Murphy's Law), I noticed that the lines were organized by the first letter of the last name, and were unevenly sized (A-E, F-J, K-O, and P-Z). It is possible that the sizing was based on actual attendee last names; in that case, I guess people with last names in P-Z tend to be relatively last-minute types.

Day 1


On the first day, I attended a presentation about Applications of Knowledge Graphs in Enterprise Data Management and AI by Juergen Jakobitsch from Semantic Web Company and Andreas Blumauer from Pool Party. It was very comprehensive, and included lots of background material on Knowledge Graphs. Among some standard applications were semantic search, info-box style results, product recommendations, virtual assistants, etc. One interesting application mentioned was text data augmentation, and (in the same vein) filtering results from chatbots for illogical answers. They also talked about the PoolParty pipeline for converting inputs (structured and unstructured) to Knowledge Graphs, which includes entity resolution and entity linking. This presentation, as well as others throughout the conference, also focused on the various W3C standards such as RDF, SPARQL, OWL2, and SHACL.

I also attended Graphs Transform Chemical Discovery in Pharmaceutical Research, presented by Liana Kiff of Tom Sawyer Software. I had initially thought that it would talk about the actual process of Chemical Discovery using graphs, but it turned out to be an (admittedly very powerful) visualization of a graph of chemical entities from the ChEMBL database, including features that allow for manual chemical discovery.

The final presentation of the day was You Must Be This Tall: Machine Learning from Graphs, Tables, Trees, and Documents by Brian Sletten of Bosatsu Consulting, Inc. As with the previous talk, I came in hoping to learn about ML with graphs, but the talk turned out to be about what you need to do before you can do ML with your graphs. Nevertheless, the presenter was very knowledgeable, so the talk ended up being pretty interesting and educational, in spite of it covering a lot of the RDF/SPARQL/OWL ground covered by earlier talks. One good insight here was the distinction between data, knowledge, and wisdom as points in a Data Understanding vs. Connectedness space, coupled with capturing relationships between entities, the context in which these relations exist, and the exploitation of this information using RDF technologies. Another was the need for data lakes to be available as a prerequisite to effective ML in the organization. Yet another thing I liked about his presentation was his example SPARQL queries against Wikidata. He also talked about JSON-LD and how it can be a good substitute for RDF.
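As a small illustration of the kind of query he showed (this is my own example rather than one from the talk, using the SPARQLWrapper Python library against the public Wikidata endpoint):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Ten items that are an instance of (P31) "house cat" (Q146), with English labels.
query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"])
```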

Day 2


The first presentation on Tuesday was Ontology Engineering for Knowledge Graphs by Elisa Kendall and Deborah McGuinness. The talk focused on the role of the Ontology Engineer, who is the technical person who talks to the Knowledge Engineer or Domain Expert. The main component of this role is being effective at interviewing the Knowledge Engineer. The talk also covered various aspects of modeling the domain, with many examples drawn from the presenters' own experiences.

The next presentation I attended was Merging Company Data from Multiple Sources with GraphDB and Ontotext Platform by Atanas Kiryakov (their CEO). As expected, the presentation was more about how Ontotext does things. However, Ontotext is one of the first graph database companies I know of that pursued a strong NLP + Graph strategy, so there was lots of interesting content. Among the functionality covered were features around Entity Recognition from text and Entity Disambiguation (from structured data as well as text), the use of different embedding technologies (1-hot, 2-hot, BERT, etc.), and the use of techniques such as Random Forests and SVMs for entity and document classification, BiLSTMs and CRFs for Named Entity Recognition, and Inductive Logic Programming (ILP) for rules engines built around the discovered entities and relations.

The final presentation on Tuesday for me was Reveal Predictive Patterns with Neo4j Graph Algorithms (Hands-on) by Amy Hodler and William Lyon. I had already learned about graph algorithms in Neo4j (and also used some of them for my own talk later in the conference) from the O'Reilly book Graph Algorithms: Practical Examples in Apache Spark and Neo4j (Amy Hodler, one of the presenters, is also one of its authors), so some of it was old material for me. The talk started off with the need for graph algorithms as tools that exploit the additional information implicit in the graph structure, then covered the various classes of graph algorithms (Pathfinding, Centrality, Community Detection, Link Prediction, and Similarity), with deep dives into specific algorithms and runs of them on their Neo4j Desktop product (a proprietary product with a 30-day free trial, but all the features covered are also available in the free community edition). I ended up learning a few new things, such as how to use virtual graphs (generated as a result of a Cypher query, sort of like views in the RDBMS world), and how to use the Strongly Connected Components algorithm as a debugging tool. They also showed off their NEuler product, which allows forms-based invocation of various algorithms, as well as some very good visualizations. Speaking of visualization, William Lyon also mentioned the neo4j-contrib/neovis.js project, which seems interesting as well. Overall, lots of useful information about Neo4j and graphs.
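As an aside, here is a minimal sketch of what calling one of these algorithms from Python looks like. It assumes a local Neo4j instance with the (since superseded) Graph Algorithms plugin installed; the Page label, LINKS relationship type, and credentials are hypothetical, and newer Neo4j releases expose the same algorithms through the Graph Data Science (gds.*) procedures instead of algo.*:

```python
from neo4j import GraphDatabase

# Hypothetical local instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# Stream PageRank scores and join them back to node properties with plain Cypher.
PAGERANK = """
CALL algo.pageRank.stream('Page', 'LINKS', {iterations: 20, dampingFactor: 0.85})
YIELD nodeId, score
MATCH (n) WHERE id(n) = nodeId
RETURN n.name AS name, score
ORDER BY score DESC LIMIT 10
"""

with driver.session() as session:
    for record in session.run(PAGERANK):
        print(record["name"], record["score"])
driver.close()
```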

I also learned about the Bridges of Chicago, based on a challenge from the presenters about using Cypher (the Neo4j query language) to find an Eulerian path, similar to the Bridges of Königsberg problem. I guess I was the only one who responded; the problem is much simpler than it appears to be at first glance, since by Euler's criterion you only need to count the odd-degree nodes.
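Here is a minimal sketch of that check, with a made-up bridge list standing in for the actual Chicago bridges:

```python
from collections import Counter

# Hypothetical bridge list: each tuple connects two banks/islands.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("B", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd = [node for node, deg in degree.items() if deg % 2 == 1]
# Euler's criterion: a connected multigraph has an Eulerian path
# if and only if it has exactly 0 or 2 vertices of odd degree.
print("Eulerian path exists:", len(odd) in (0, 2))
```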

Exhibitors started setting up their booths today, so I spent some of the coffee breaks and most of the evening talking to various exhibitors. Both graph database vendors and consultants were well represented among the exhibitors, as one would expect at a graph + data architecture conference. Graph vendors I knew of included Neo4j, Ontotext, TigerGraph, DataStax, and Stardog. Among those I learned about at this conference were PoolParty, Semantic Web Company, and Cambridge Semantics. Having attended the presentations from PoolParty / Semantic Web Company and Ontotext, I spent a lot of time talking with them. I also met up with the folks at TigerGraph, and let them know how helpful their Graph Gurus webinar series has been to me. I also took the opportunity to meet up with the folks at Stardog, whom I had met through a reference at another graph conference a few years earlier. Since I was a speaker here, the conversation also drifted occasionally to the subject of my talk, and to the graph database I was using (Neo4j).

Day 3


Wednesday was quite a heavy day in terms of presentations, comparatively speaking. It started with two keynote presentations. The first one was Knowledge Graphs and AI: The Future of Enterprise Data by David Newman from Wells Fargo. He spoke of the progression from Strings to Things to Predicted Things to Vectors, which resonated with me, since we are progressing along a very similar path ourselves. He led us through multiple examples involving harmonizing an entity across multiple Knowledge Graphs in the enterprise, the need for classifying entities into a taxonomy, using Knowledge Graphs to predict new relationships, using graph relations to create digital fingerprints for ML algorithms, etc. His examples referenced the Financial Industry Business Ontology (FIBO), which provides a standard schema for the financial services industry.

The second keynote was Graph Stories: How Four Metaphors can help you decide if Graphs are right for you by Dan McCreary of Optum. While David Newman's presentation was based on RDF-style graphs, Dan McCreary is a big proponent of Labeled Property Graphs (LPG), and his choice is based on several very pragmatic reasons. The four metaphors he described are the Neighborhood Walk, the Knowledge Triangle, the Open World Assumption, and the Jenga Tower. Specifically, the first indicates the importance of relationship traversal in your applications, the second indicates where your application is (or wants to be) on the Data / Information / Knowledge triangle, the third indicates the ease with which new information can be incorporated into your system, and the fourth indicates the resilience of your query system to small changes in your backend. The keynote also covered the importance of graph structure ("structure is the new gold in data mining") and the interplay of graphs with Deep Learning techniques such as Graph Convolutional Networks (GCN) and structured learning with graphs.

The next presentation I attended was Knowledge Graphs and Model Driven Classification by Scott Henninger of SmartLogic, where he showed off the capabilities of the SmartLogic platform, which center around metadata tagging, document classification (based on the metadata tagging and external taxonomies), and Search Enhancement Services (SES). The bulk of the capabilities seem to be rule-based, which can be good for explainability purposes. SmartLogic's KG backend is based on RDF Schema, OWL, and SHACL. An interesting feature of SmartLogic is that it allows the user to manually fine-tune the (term) weights of its classifier. I got quite excited at this, thinking that perhaps this functionality could be leveraged to produce explainable Deep Learning models by perturbing the inputs, but then realized that the intuition is similar to the idea behind LIME (Local Interpretable Model-Agnostic Explanations).

Next up was a talk on How Do You Read Millions of Documents for Meaning using Graph? by Ryan Chandler of Caterpillar, Inc. He described a system he built at Caterpillar that allows customer support technicians to query a large collection of internal support tickets created by other technicians. The end result is a query-able knowledge base. The text in the support tickets is tokenized and segmented into sentences, and tagged with cause, complaint, solution, note, and correction labels (classification). Each document is decomposed into semantic frames, and the document and the associated semantic frames, along with their metadata, are stored in a Neo4j graph database. On the query side, the natural language (NL) query is converted into a graph using a dependency parse, and recomposed into a Cypher query against specific semantic frames (as indicated by the metadata). The Cypher query produces a ranked list of support tickets that best satisfy the NL query. I thought this was quite an interesting technique, although it may be somewhat dependent on the structure of the input data.
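I have not seen Caterpillar's code, but a toy version of the query side might look like the sketch below, which uses spaCy for the dependency parse and generates a naive Cypher query; the Ticket/Frame labels and properties are hypothetical, not their actual schema:

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def question_to_cypher(question, frame_label="Frame"):
    """Toy translation of an NL question into Cypher by keying on
    content words from the parse (hypothetical schema, not Caterpillar's)."""
    doc = nlp(question)
    keywords = [t.lemma_.lower() for t in doc
                if t.pos_ in ("NOUN", "PROPN", "VERB") and not t.is_stop]
    where = " OR ".join("f.text CONTAINS '%s'" % kw for kw in keywords)
    return ("MATCH (t:Ticket)-[:HAS_FRAME]->(f:%s) WHERE %s "
            "RETURN t.id, count(f) AS hits ORDER BY hits DESC LIMIT 10"
            % (frame_label, where))

print(question_to_cypher("Why does the engine overheat when idling?"))
```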

The next presentation I attended was Graph Analytics for Enterprise Applications by Melliyal Annamalai, Souripriya Das, and Matthew Perry from Oracle. I came in a few minutes late, so I missed the first part, but from what I gathered, it covered Oracle's foray into graph databases -- it turns out that Oracle customers can now start working with SPARQL using SQL Developer, seamlessly against Oracle's new hybrid Property and RDF graph. The functionality is nice, but probably only useful for current and future Oracle customers.

My next presentation was Insufficient Facts always invite Danger: Combat them with a Logical Model by Michael Grove of Stardog, where he described how important it is to have a Logical Model to ensure the completeness of your model, and how it can help you avoid problems later.

The evening before, I had spent some time at the DataStax booth, mainly for nostalgic reasons since I worked with Cassandra (the Apache version, not the DataStax version) at my previous job, and I was curious about their graph product based on Cassandra (initially called Titan, then Janus). So I attended the presentation Graph Innovations for Distributed Data at Scale by Jonathan Lacefield. The presentation covered the evolution of their graph product, and also answered a nagging question I had about how they implemented the graph in a column-family database under the covers -- it turns out that each row is basically the star graph around a node. Other interesting things in this presentation were their use of Gremlin and their Spark support through the DataStax extension.

The last presentation of the day was Knowledge Graphs and GraphQL in Action: A Practical Approach to using RDF and Semantic Models for Web Applications by Irene Polikoff and Ralph Hodgson of TopQuadrant. They described their Semantic GraphQL interface, which provides the user with a GraphQL interface and converts it down to queries using RDF, OWL, and SHACL against an RDF triple store.

Finally, the last event of the day was a session about Open Source Knowledge Graph Tooling, which really turned out to be a group of banking folks trying to collaborate around the FIBO Ontology, although it is likely that they might expand to other industries in the future. There was also talk of compiling a current (non-deprecated) list of open source ontologies in various industries, and of applying unit tests to ontologies so they don't become stale and irrelevant, both of which were interesting to me.

The exhibitors were still there, so I hung around for some more conversations with vendors and fellow attendees for a couple more hours after that. Among them were Cambridge Semantics, who have a fast analytics graph database called AnzoGraph.

Day 4


The first presentation of the day was Unsupervised and Supervised ML on Big Graph: Case Studies by Victor Lee. He described various case studies using TigerGraph. The first one was finding influential healthcare providers in various local networks from a specialty network, and finding their influence networks. Another case study had to do with detecting spam phone calls in the China Mobile network, the training data for which consisted of 180+ graph features. The model was a Random Forest classifier. At prediction time, an incoming phone call would be placed in the call graph, and the 180+ features computed and fed into the Random Forest model to predict (in under 20 milliseconds) whether the call was spam or not. The third case study was about bank fraud, based on some synthetic data from a Kaggle competition, where TigerGraph engineers built some compound relationships based on edges discovered in the feature graph; these ended up giving good results, showing that the structure of the data provides useful signal. The talk ended with an introduction to Node2Vec, a graph embedding scheme.
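Since Node2Vec came up here (and graph embeddings came up repeatedly during the conference), here is a minimal sketch of the random-walk half of the idea. It uses uniform, DeepWalk-style walks rather than Node2Vec's p/q-biased walks, and the walks would then be fed to a skip-gram model (for example, gensim's Word2Vec) to produce an embedding per node:

```python
import random
import networkx as nx

def random_walks(G, num_walks=10, walk_length=20, seed=42):
    """Uniform random walks over G (a DeepWalk-style simplification of
    Node2Vec, which instead biases walks with p/q parameters)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(G.nodes())
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])
    return walks

G = nx.karate_club_graph()
walks = random_walks(G)
# Each walk is a "sentence" of node ids; training a skip-gram model on
# these sentences yields one embedding vector per node.
print(walks[0])
```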

The next presentation in my list was my own (Graph Techniques for Natural Language Processing). My presentation was about using Graph techniques (mostly a combination of common third party algorithms) to solve Natural Language Processing (NLP) problems. I covered four case studies that attempted to replicate academic papers (referenced from the Bibliography of Graph-Based Natural Language Processing and Information Retrieval) around document summarization, clustering using language model based vectors, word sense disambiguation, and topic finding. Graph techniques used included various graph centrality metrics (some built-in and some computed using Cypher and built-in algorithms), random walk techniques, Louvain Community Detection, Label Propagation, and Personalized PageRank. Compared to the other presentations, mine was probably a bit unusual, since it focused on NLP more than on graphs, so while I had only about 15-20 attendees, there seemed to be lots of interest, and some very good questions at the end. For those of you who weren't able to make it to the presentation but would like more details, you can find the link to my slides and code (in Jupyter notebooks, with a lot of verbose commentary) at my sujitpal/nlp-graph-examples repository.
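To give a flavor of the summarization case study, here is a minimal, self-contained TextRank-style sketch (using networkx and scikit-learn rather than Neo4j, and not the actual notebook code from the repository): build a sentence similarity graph, run PageRank over it, and keep the top-ranked sentences as the extractive summary.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Graphs capture relationships between entities.",
    "PageRank scores nodes using the structure of the graph.",
    "Sentence graphs built from similarity scores support summarization.",
    "Completely unrelated filler text about the weather.",
]

# Build a weighted sentence-similarity graph and rank sentences with PageRank.
tfidf = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(tfidf)
G = nx.from_numpy_array(sim)
scores = nx.pagerank(G)

# The top-ranked sentences form the extractive summary.
ranked = sorted(scores, key=scores.get, reverse=True)
for idx in ranked[:2]:
    print(sentences[idx])
```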

I hung around a bit after my presentation answering questions, so I ended up being a bit late to the next presentation, even with the generous coffee break in between. This was When Time Meets Relationships: Understanding an Immutable Graph Database by Brian Platz of Fluree. He made the case that a Knowledge Graph is a snapshot at a point in time. A time-aware Knowledge Graph can be thought of as an immutable linked list, where facts are added to an append-only log and made tamper-proof with hashing techniques, much like a private blockchain. The strategy assumes the Knowledge Graph is a triple store of (Subject, Predicate, Object). As time passes, facts are either retracted or added, so a time-aware tuple would be (Subject, Predicate, Object, Time, Add/Retract). In addition, labeled properties, such as a scheduled expiration date, can be accommodated with an additional Metadata attribute. He also covered some indexing strategies that can make it efficient to query such a time-aware tuple store.
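A toy illustration of the idea (my own sketch, not Fluree's implementation) is an append-only, hash-chained log of such tuples that can be replayed to reconstruct the graph at any point in time:

```python
import hashlib
import time

class TimeAwareTripleStore:
    """Toy illustration: an append-only, hash-chained log of
    (subject, predicate, object, time, add/retract) facts."""

    def __init__(self):
        self.log = []
        self.prev_hash = "0" * 64

    def assert_fact(self, s, p, o, retract=False):
        entry = (s, p, o, time.time(), "retract" if retract else "add")
        # Chain each entry to the previous one, blockchain-style,
        # so tampering with history is detectable.
        digest = hashlib.sha256(
            (self.prev_hash + repr(entry)).encode("utf-8")).hexdigest()
        self.log.append((entry, digest))
        self.prev_hash = digest

    def snapshot(self, as_of):
        """Replay the log to reconstruct the triples valid at time `as_of`."""
        facts = set()
        for (s, p, o, t, op), _ in self.log:
            if t > as_of:
                break
            facts.discard((s, p, o))
            if op == "add":
                facts.add((s, p, o))
        return facts

store = TimeAwareTripleStore()
store.assert_fact("acme", "hasCEO", "alice")
store.assert_fact("acme", "hasCEO", "alice", retract=True)
store.assert_fact("acme", "hasCEO", "bob")
print(store.snapshot(time.time()))
```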

After this, there were two keynotes. The first one was News and Graphs by Peter Olson of NBC News Digital, which covered the graph structure of the NBC News Publishing pipeline, and how NBC leverages graphs to provide news with high velocity and scale. The second keynote was Knowledge Graph Pilot improves Data Quality While Providing a Customer 360 View by Bethany Swhon and Patricia Branum of Capital One, where they described how they improved the quality of their Knowledge Graph to provide a better Customer view across the enterprise.

The final presentation of the day and the conference for me was Automated Encoding of Knowledge from Unstructured Text into a Graph Database by Chris Davis of Lymba. The presentation described the Lymba pipeline for converting text into a Knowledge Graph. It includes the usual preprocessing, tokenizing, POS tagging, and segmentation steps other presentations covered (which in some ways seem to be standard knowledge in the text-to-KG NLP sub-community), but this presentation went one step further and talked about the need for Word Sense Disambiguation, Concept extraction (using gazetteers and NER models), and syntactic (constituency) and semantic (dependency) parses for relation extraction. It also includes Coreference Resolution, which is quite important but usually omitted from pipelines because of its complexity. The Lymba product provides a turnkey solution plus consulting for various industries.

I had to catch a flight back, and having heard about the Chicago traffic and having faced the zero tolerance for lateness at large airports such as LAX, I didn't want to miss it. So I ended up skipping the last panel discussion, on Graphs vs Tables. It turns out I didn't need to, but better safe than sorry.

Conclusions


As conferences go, this was quite luxurious -- attendees were treated to a sumptuous buffet breakfast every day, and a three-course sit-down lunch on three of the four days (one day was build-your-own sandwiches, but even that was quite nice). One observation is that sit-down lunches can foster really good and insightful conversations. In addition, there was coffee and snacks throughout the day, and (limited) free drinks on two of the three evenings. Swag included a Dataversity-branded backpack to hold your conference materials, wool hats with the Dataversity logo, stickers, and a pen containing a USB drive with all the presentation slides, as well as the swag vendors give out at their stalls (to potential clients).

Of course, the nicest thing about conferences (after the presentations) is the conversations with fellow attendees, and the chance to learn from their insights and from what they are doing with the tech under consideration (in this case graphs). I met people from aviation, health (insurance), finance, consulting, the government (from both the covert and the overt branches), as well as scientific publishing. In addition, it was a chance to interact with people from the vendor companies, and bounce ideas off them about specific things they do well. Two insights, both gained in lunch-table conversations: first, RDF has better interoperability and tooling, but LPGs are easier to work with; second, certain Asian cultures believe that you can never define an object fully, which seems to warrant more of a triple-store structure than the more efficient but constrained graph structure.

Overall, it was good to see graphs being treated with so much importance. The previous graph conferences I have attended were much smaller affairs, rarely lasting more than a day. I suppose this might partly be because of the focus on explainable AI, advances in Knowledge Graphs, graph CNNs and embeddings, and the realization that graph structure provides useful, exploitable signal, all of which are making graphs more and more important, and graph conferences more popular.

If I had to suggest an improvement, it would be to streamline the feedback process. I don't know how many feedback forms were returned (I returned all four that were provided in my conference materials, but not the last global one). Each form takes approximately 5 minutes to complete, so it is tempting to skip it and go to the next session instead. And by the evening, it is harder to do, since you have to refer to your notes instead of relying on short-term memory. On the other hand, someone at the door with an iPad who scans your badge and asks you to tap on a smiley versus a frowny icon provides much better coverage (although you would have to interpret the meaning of the feedback). I guess it's the tension between explicit and implicit feedback; there are tradeoffs either way.


Sunday, October 06, 2019

Efficient Reverse Image Search on Lucene using Approximate Nearest Neighbors


I wrote last week about our Search Summit, an internal conference where engineers and product managers from Elsevier, LexisNexis, and the various other organizations that make up the RELX Group get together to share ideas and best practices on search. The Search Summit is in its third year, and given that it's a single-track conference that runs for two days, there are limited speaking slots available, and competition for these slots is quite fierce. Our acceptance rate this year was around the 40-50% mark, which I guess puts us on par with some big external conferences in terms of exclusivity. In any case, that was my long-winded way of saying that this year I didn't get accepted for a speaking slot, and was offered the opportunity to present a poster instead, which I accepted.

This was the first time I made a poster for work, and quite honestly, I didn't know where to begin. My only experience with posters was back when I was at school, where I remember that we used crayons a lot, and then later helping my kids build posters for their school, which involved a lot of cutting out pictures and gluing them on to a posterboard. Fortunately, as I learned from some googling, technology has progressed since then, and now one can build the poster as a single Microsoft Powerpoint (or Apple Keynote or Google Slides) slide (I used Powerpoint). Quite a few sites provide free templates for various standard poster sizes as well; I ended up choosing one from Genigraphics. Once you are done designing the poster, you save it as a PDF, and upload the PDF to sites such as Staples Print Services for printing. And that's pretty much all there is to it. Here is the poster I ended up presenting at the search summit.


Basically, the poster describes a Proof of Concept (POC) application that attempts to do reverse image search, i.e., given an image, it returns a ranked list of images in the corpus that are similar to the given image. The dataset for my experiments was a set of approximately 20,000 images from the ImageCLEF 2016 competition. The notion of similarity that I use is the one learned by a RESNET network pre-trained on ImageNet. The idea is that this sort of network, trained on a large generic image corpus such as ImageNet, learns how to recognize colors, edges, shapes, and other more complex features in images, and these features can be used to provide a notion of similarity across images, much like tokens in a document.
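The poster does not prescribe a specific implementation for the feature extraction, but a minimal sketch using the pre-trained ResNet50 model that ships with tf.keras (the image path below is hypothetical) would look something like this:

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# ResNet50 without its classification head, with global average pooling,
# yields a 2048-dimensional feature vector per image.
model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def vectorize(image_path):
    img = load_img(image_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return model.predict(x)[0]   # shape: (2048,)

vec = vectorize("example.jpg")   # hypothetical image file
print(vec.shape)
```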

RESNET (and other pre-trained networks like it) generate dense and (comparatively) low-dimensional vectors of features. For example, the last layer of a trained RESNET model will generate feature vectors of size 2048. On the other hand, text is typically vectorized for Information Retrieval as a vector over all the tokens (and sometimes token combinations as well) in its vocabulary, so feature vector sizes of 50,000 or 100,000 are not uncommon. In addition, such vectors tend to be quite sparse as well, since a given token is relatively unlikely to occur in many documents. Lucene (and probably other search engines) leverage the high-dimensionality and sparseness of text feature vectors using its Inverted Index data structure (aka Postings List), where the key is the token and the value is a priority queue of document ids that the token occurs in.

In order to present a ranked list of similar images, a naive (and exhaustive) solution would be to compute some measure of vector similarity (such as Cosine similarity or Euclidean similarity) between the input image and every other image in the corpus, an O(N) operation. A better solution might be to partition the similarity space using something like a KD-tree, such that the comparison is done against a subset of the most similar images, an O(log N) operation. However, the Lucene inverted index makes the search an almost O(1) operation -- first looking up by token, and then navigating down a priority queue of a subset of document IDs containing that token. My reasoning went as follows -- if I could somehow project the dense, low-dimensional vectors from RESNET back into a sparse, high-dimensional space similar to that used for text vectors without losing too much information, I would be able to do image search as efficiently as text search.

I experimented with two such projection techniques -- Random K-Means and Random Projections.

  • Random K-Means involves running a bunch of K-Means clusterings over the image set, typically 50-200 of them, each with some random value for k. Each image (point in similarity space) is represented by a sequence of its cluster memberships across all the clusterings. The intuition here is that similar images, represented by two points that are close in the similarity space, will share cluster memberships across more clusterings than dissimilar images, represented by points that are further apart. A 2D schematic on the bottom center of the poster illustrates the idea.
  • Random Projections is similar, but instead of partitioning the space into a random number of circular clusters, we partition the space itself using random (D-1)-dimensional hyperplanes, where D is the size of the dense feature vector (2048). The number of random projections used is much higher than the number of K-Means clusterings, typically 1,000-10,000. This effectively results in binary clusters, where points behind the hyperplane get assigned a 0 and points in front get assigned a 1. As before, each image is represented by the sequence of its memberships across all the random projections. The intuition here is similar to that for random K-Means. The 2D schematic on the bottom right of the poster illustrates how random projections work, and the sketch after this list shows toy implementations of both techniques.
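Here is a minimal sketch of both projection schemes, using scikit-learn's KMeans and plain NumPy over random stand-in vectors; the cluster counts, plane counts, and token naming are my own illustration rather than the exact parameters from the poster:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2048))      # stand-in for the dense image vectors

def random_kmeans_tokens(X, num_clusterings=50, k_range=(5, 50), seed=42):
    """One synthetic token per clustering: cXX_YY = clustering XX, cluster YY."""
    tokens = [[] for _ in range(X.shape[0])]
    rs = np.random.RandomState(seed)
    for i in range(num_clusterings):
        k = rs.randint(*k_range)
        labels = KMeans(n_clusters=k, random_state=rs).fit_predict(X)
        for doc_id, label in enumerate(labels):
            tokens[doc_id].append("c%d_%d" % (i, label))
    return tokens

def random_projection_tokens(X, num_planes=1000, seed=42):
    """One binary token per hyperplane, based on which side the point falls."""
    rs = np.random.RandomState(seed)
    planes = rs.normal(size=(num_planes, X.shape[1]))
    bits = (X @ planes.T) > 0
    return [["p%d_%d" % (j, int(b)) for j, b in enumerate(row)] for row in bits]

# Each image is now a "document" of tokens that can be indexed in Lucene
# and searched with an OR query of the query image's tokens.
print(random_kmeans_tokens(X[:100], num_clusterings=5)[0][:5])
print(random_projection_tokens(X[:5], num_planes=8)[0])
```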

My baseline for the experiments was caption-based text search, i.e., similar images were found by doing text search on their captions. I took a random set of 20 images, and manually evaluated the relevance of the top 10 similar images for each on a 4-point scale. It's only one person, so it's not very objective, but I needed a place to start. The guideline I followed was to give the highest rating (4) if the image is identical, or almost identical, to the source image -- possibly either the same image, or a composite containing the same image, or a minor modification. The second highest rating (3) was for images that were basically the same thing -- for example, if the source image was an X-ray of the left lung, I would rate an X-ray of the right lung as very similar. The third rating (2) was for images that I couldn't rate 4 or 3, but which I would expect to see in the set of similar images. Finally, the lowest rating (1) was for images that I didn't expect to see in the results, i.e., irrelevant images.

I used the ratings to compute Discounted Cumulative Gain (DCG) for k=1, 3, 5, and 10, where k is the number of top results taken from the ranked list of similar images for each image, and then averaged them across the 20 images. I then computed the Ideal DCG (IDCG) for k=10 by sorting the ranked list based on my ratings, and then divided the computed DCG by the IDCG to get the Normalized Discounted Cumulative Gain (NDCG). These numbers are shown in the table under the Results section in the poster. The bolded numbers indicate the best performers. Results for k=1 serve as a sanity check, since all the searches in my evaluation returned the source image as the top result in the ranked list of similar images. Overall, the numbers seem quite good, and Random K-Means with 250 clusters produced the best results.
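For reference, here is a small sketch of the DCG/NDCG computation using the common graded-relevance formulation; note that the poster normalizes each DCG@k by the IDCG computed at k=10, while the helper below shows the more usual per-k normalization, and the ratings are made up:

```python
import numpy as np

def dcg_at_k(ratings, k):
    """DCG with the common graded-relevance form: rel_i / log2(rank + 1)."""
    ratings = np.asarray(ratings[:k], dtype=float)
    positions = np.arange(2, ratings.size + 2)   # ranks 1..k -> log2(2..k+1)
    return float(np.sum(ratings / np.log2(positions)))

def ndcg_at_k(ratings, k):
    ideal = sorted(ratings, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ratings, k) / idcg if idcg > 0 else 0.0

# Hypothetical ratings (4 = near-identical .. 1 = irrelevant) for one query's top 10.
ratings = [4, 3, 3, 2, 1, 2, 3, 1, 1, 2]
for k in (1, 3, 5, 10):
    print(k, round(ndcg_at_k(ratings, k), 4))
```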

Random K-Means provided the best results in qualitative terms as well. The block of images at the center of the poster shows a brain MRI as the source image, and the top 3 results from the baseline, the best Random K-Means variant (250 clusters), and the best Random Projections variant (10,000 projections). A green box indicates a relevant result, and a red box indicates an irrelevant result. As you can see, the caption-based text search baseline produces more diverse results, but doesn't do very well otherwise. Random K-Means and Random Projections produce results that are more visually similar to the source image, and between the two, Random K-Means tends to produce better results.

In the future, I hope to extend this approach to text vectors as well. Specifically, I want to capitalize on the increase in recall that I noticed with caption-based text search, by building embeddings (word2vec, ELMo, BERT, etc.) from the captions and treating these embeddings as additional feature vectors for the image. On the other hand, I do notice that the trend has been to use special-purpose backends for vector search, such as the Non-Metric Space Library (NMSLib) based on Numpy, the GPU-based FAISS library for efficient similarity search, or hybrid search engines such as Vespa. So while it might be intellectually satisfying to try to make a single platform serve all your search needs, it might be more worthwhile to see if one of these other toolkits provides better results.

I also wanted to acknowledge the work of Simon Hughes. The approaches used in this poster are based heavily on ideas originally proposed by him on his DiceTechJobs/VectorsInSearch project, as well as ideas and suggestions proposed by him and others on the Relevancy and Matching Technology #vectors-in-search slack channel.



Sunday, September 29, 2019

Searching for Answer Candidate Passages with Solr and Anserini


I just got back from our company's internal Search Summit at our offices in Raleigh, NC -- the conference is in its third year, and it has grown quite a bit from its humble beginnings. We even have our own conference sticker! The conference was one day of multi-track workshops, and two days of single-track presentations. Our Labs team conducted a workshop on Bidirectional Encoder Representations from Transformers (BERT), and I presented my results on BERT-based Open Domain Question Answering.

Our BERT-based Question Answering pipeline is inspired by the End-to-end Open-Domain Question Answering with BERTSerini paper from Prof Jimmy Lin's team at the University of Waterloo. We are using our own content from ScienceDirect, and we have been trying BERT, pre-trained variants such as BioBERT and SciBERT, and other models such as XLNet and AllenNLP BiDAF, fine-tuned with the SQuAD 1.1 and 2.0 datasets, and we are using the pipeline variants to answer our own set of questions in the scientific domain. Overall, we have gotten the best results from SciBERT + SQuAD 1.1, but we are looking at fine-tuning with SQuAD 2.0 to see if we can get additional signal from when the model abstains from answering.

The figure below shows the BERTSerini pipeline (as described in the BERTSerini paper). In this post, I want to describe the Anserini Retriever component and our implementation of it as a Solr plugin. Anserini is an open source IR toolkit described in Anserini: Enabling the Use of Lucene for Information Retrieval Research. It was originally built as a way to experiment with running things like TREC benchmarks, as described in Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. It implements a pluggable strategy for handling question-style queries against a Lucene index. The code for the Anserini project is available at castorini/anserini on Github.


Functionally, the Anserini retriever component takes as input a string representing a question, and returns a set of results that can be used as candidate passages by the Question Answering module. The pipeline consists of two steps -- query expansion and result reranking. Anserini offers multiple ways to do each step, allowing the caller to mix and match these strategies to create customized search pipelines. It also offers multiple pluggable similarity strategies, the most commonly used of which seem to be BM25 (nowadays the default similarity for Lucene and its derivative platforms) and QL (Query Likelihood). The question is parsed by the query expansion step into a query, which I will call query A, and sent to the index. Results from query A are then reranked -- the reranking is really another (filtered) query to the index, which I will call query B.

Query expansion strategies include Bag of Words (BoW) and the Sequential Dependence Model (SDM). Bag of Words is fairly self-explanatory: it is just an OR query of all the tokens in the query, after stopword removal and, optionally, synonym expansion. SDM is only slightly more complex; it is a weighted query with three main clauses. The first clause is the Bag of Words. The second clause is an OR query of neighboring bigram tokens where proximity and order are both important, and the third clause is an OR query of neighboring bigram tokens where proximity is relaxed and order is not important. The three clauses are combined into a weighted compound OR query; the default weights are (0.85, 0.1, 0.05).
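As a rough illustration (not Anserini's actual code, which assembles these clauses programmatically from Lucene query objects), here is a sketch that builds an SDM-style query string in Lucene query-parser syntax, approximating the ordered-window clause with a tight sloppy phrase and the unordered-window clause with a looser one:

```python
def sdm_query(tokens, weights=(0.85, 0.1, 0.05), ordered_slop=1, unordered_slop=8):
    """Build an SDM-style weighted query in Lucene query-parser syntax.
    Approximation only: Anserini builds the ordered/unordered window
    clauses with span queries rather than sloppy phrases."""
    bow = " OR ".join(tokens)
    bigrams = list(zip(tokens, tokens[1:]))
    ordered = " OR ".join('"%s %s"~%d' % (a, b, ordered_slop) for a, b in bigrams)
    unordered = " OR ".join('"%s %s"~%d' % (a, b, unordered_slop) for a, b in bigrams)
    return "(%s)^%.2f (%s)^%.2f (%s)^%.2f" % (
        bow, weights[0], ordered, weights[1], unordered, weights[2])

print(sdm_query(["black", "bear", "attacks"]))
```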

The query (query A) is sent to the index, which returns results. We take the top K results (configurable; we use K=50 as our default) and send them to the result reranking step. Anserini provides three pluggable reranking algorithms: RM3 (Relevance Model 3), Axiomatic, and Identity. RM3 computes feature vectors for the query terms and for the results from query A. Feature vectors for the results come from the top fbTerms (default 10) of each of the top fbDocs (default 10) documents in the result set. Query vectors and result vectors are interpolated using a multiplier alpha (default 0.5), and the resulting top-scoring terms are used to construct query B as a weighted OR query, where the weight for each term is the score computed for it. The Axiomatic strategy is similar, except that it uses a mix of the top rerankCutoff results from query A and a random set of non-results to improve recall. It uses Mutual Information (MI) between the query terms and terms in the results to select the top terms. As with RM3, query B for Axiomatic is a weighted OR query consisting of the terms with the highest MI, where the weight for each term is its MI value. The Identity strategy, as the name suggests, is a no-op passthrough, which passes the output of query A unchanged. It can be useful for debugging (in a sense, "turning off" reranking), or when the results of query A produce sufficiently good candidates for question answering. Finally, since query B is its own separate query, in order to ensure that it behaves as a reranker, we want to restrict the documents returned to the top rerankCutoff documents from query A. In the Solr plugin, we have implemented that as a docID filter built from the top results of query A and added to query B.
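The following sketch shows the spirit of the RM3 interpolation using raw term counts; real RM3 works with relevance-model probabilities and Lucene term statistics, so treat this only as an illustration of how query B gets its weighted terms:

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=10, fb_docs=10, alpha=0.5):
    """Simplified RM3-style expansion: mix original query term weights with
    weights of frequent terms from the top feedback documents. Real RM3 uses
    relevance-model probabilities, not raw counts."""
    counts = Counter()
    for doc in feedback_docs[:fb_docs]:
        counts.update(doc.lower().split())
    top = dict(counts.most_common(fb_terms))
    total = float(sum(top.values())) or 1.0

    weights = {t: alpha * (1.0 / len(query_terms)) for t in query_terms}
    for term, count in top.items():
        weights[term] = weights.get(term, 0.0) + (1 - alpha) * (count / total)
    # Query B: a weighted OR over the interpolated terms, e.g. "term^0.312 ..."
    return " ".join("%s^%.3f" % (t, w) for t, w in
                    sorted(weights.items(), key=lambda kv: -kv[1]))

docs = ["market basket analysis finds items bought together",
        "collaborative filtering recommends items from user behavior"]
print(rm3_expand(["market", "basket"], docs))
```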

Pluggable similarities is probably a bit of a misnomer. Way back, Lucene offered a single similarity implementation -- a variant of TF-IDF. Later they started offering BM25 as an alternative, and since Lucene 7.x (I believe), BM25 has become the default Similarity implementation. However, probably as a result of the changes needed to accommodate BM25, it became easier to add newer similarity implementations, and recent versions of Lucene offer a large variety of them, as you can see from the Javadocs for Lucene 8 Similarity. However, similarities are associated with index fields, so a pluggable similarity will only work if you indexed your field with the appropriate similarity in the first place. Anserini offers quite a few similarity implementations, corresponding to the different similarities available in Lucene. However, we noticed that in our case, we just needed BM25 and QL (Query Likelihood, corresponding to Lucene's LMDirichletSimilarity), so our Solr plugin just offers these two.

When I set out to implement the BERTSerini pipeline, my original thought was to use the Anserini code and its Lucene indexes directly. However, I decided against it for a number of reasons. First, the scripts I saw in their repository suggested that the primary use case is running large benchmarks with different parameters in batch mode, whereas my use case (at least initially) was more interactive. Second, our index is fairly large, consisting of 4,000 books from ScienceDirect, which translates to approximately 42 million records (paragraphs) and takes up approximately 150 GB of disk space, so we are constrained to build it on a cloud provider's machine (AWS in our case). With a plain Lucene index, the only way to "look inside" is Luke, which is harder to forward to your local machine over SSH compared to forwarding HTTP. For these reasons I decided on using Solr as my indexing platform, and implementing the necessary search functionality as a Solr plugin.

Once I understood the functionality Anserini offered, it took just 2-3 days to implement the plugin and access it from inside a rudimentary web application. The figure below shows the candidate passages for a question that should be familiar to many readers of this blog -- How is market basket analysis related to collaborative filtering? If you look at the top 3 (visible) paragraphs returned, they seem like pretty good candidate passages. Overall, the (BM25 + BoW + RM3) strategy seems to return good passages for question answering.


While the plugin is currently usable as-is, i.e., it is responsive and produces good results, the code relies exclusively on copying functionality (and sometimes chunks of code) from the Anserini codebase rather than using Anserini as a library. In fact, the initial implementation (in the "master" branch) does not have any dependencies on the Anserini JAR. For long-term viability, it makes sense to have the plugin depend on Anserini. I am currently working with Prof Lin to make that happen, and some partially working code to do this is available in the branch "b_anserini_deps".

The code for the Solr plugin (and documentation on how to install and use it) can be found in the elsevierlabs-os/anserini-solr-plugin repository on Github. My employer (Elsevier, Inc.) open sourced the software so we could (a) make it more robust as described above, in consultation with Prof Lin's team, and (b) provide a tool for Solr users interested in the awesome candidate-passage generation functionality for question answering that Anserini provides.

If you are working in this space and are looking for a good tool to extract candidate passages from questions, I think you will find the Solr plugin very useful. If you end up using it, please let us know what you think, including how it could be improved.

Friday, August 09, 2019

KDD 2019: Trip Report


I had the good fortune last week to attend KDD 2019, or more formally, the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, which was held in downtown Anchorage, AK from August 4-8, 2019. Approximately 3,000 participants descended on the city (boosting its population by 1%, as Mayor Berkowitz pointed out in his keynote). This was my first KDD, and I found it somewhat different from other conferences I have attended in the past. First, there seems to be a more equitable representation of academia and industry. To some extent this is understandable, since data is a big part of KDD, and typically it is industry that collects, and has access to, large and interesting datasets. A side effect is that there is as much emphasis on tools and frameworks as on methods, so by judiciously selecting the sessions you attend, you can load up on the combination that works best for you. Second, there is a surfeit of choice -- at any point in time during the conference, there were 5-8 parallel tracks spread across two conference buildings. While there were some difficult choices to make as to which track to attend and which to drop, I felt I got a lot of value out of the conference. Of course, this does mean that each attendee's experience of the conference is likely to be somewhat personalized by their background and their choices. In this post, I will write about my experience at KDD 2019.

Day 1 (Sunday, August 4) -- Lecture Style Tutorials


My choices here were slightly non-obvious, since I haven't worked directly with hospital data (unless you count the i2b2 and MIMIC datasets), and don't foresee doing so any time soon. However, I had just finished reading the book Deep Medicine by Dr. Eric Topol, and was quite excited about all the cool ideas in the book around using machine learning to assist medical practitioners in hospital settings. There were tutorials that were closer to my work interests, but I figured that it might be good to mix in some exploration with the exploitation, so I decided to attend the tutorials on Mining and model understanding on medical data, and Data mining methods for Drug Discovery and Development.

The first tutorial provided a very broad coverage of Medical Data Mining, starting with sources and coding schemes (ICD-10, ATC, etc.) and various interesting strategies for extracting temporal features from Electronic Health Records (EHR), such as the use of Allen's temporal logic and itemset disproportionality. Also covered were learning from cohorts and the use of Randomized Controlled Trials, and the application of the Dawid-Skene algorithm to Data Fusion, i.e., a process of creating clean features from multiple noisy features, which reminded me of the Snorkel generative model. I also learned that the Dawid-Skene algorithm is equivalent to a one-layer Restricted Boltzmann Machine (RBM) (example code in Tensorflow). Another interesting insight provided by one of the presenters was the timeline of ML techniques -- starting with rules, then simple matrix-based techniques (logistic regression, decision trees, SVM, XGBoost, etc.), then, very briefly, probabilistic statistical techniques, rapidly supplanted by Deep Learning techniques. There is of course a move to merge the last two nowadays, so probabilistic techniques are coming back to the forefront. The tutorial was run by Prof. Myra Spiliopoulou and her team from the Otto-von-Guericke-Universität Magdeburg.

The second tutorial provided a broad overview of in-silico drug development. Primary tasks here are Molecular Representation Learning (for example, mol2vec), which allows molecules to be represented in a semantic vector space similar to word embeddings; Molecular Property Prediction, which takes a drug molecule and outputs its properties; Drug Repositioning, which takes a (molecule, protein) pair and outputs an affinity score to indicate how the molecule will interact with the (disease) protein; Adverse Drug Interaction, which takes a (molecule, molecule) pair and predicts their interaction; and finally De Novo drug development, which takes a chemical property and outputs a drug molecule. We also learned about various schemes for encoding molecules as text, such as 1D, 2D, and 3D encodings, circular fingerprints (ECFPx), SMILES, and adjacency matrices (bond adjacency). The main takeaways for me were the ways in which molecules can be encoded and embedded, and the use of Variational Autoencoders (VAE) and grammar constraints to restrict generated drugs to valid ones. The tutorial was run by Cao Xiao and Prof. Jimeng Sun, of IQVIA and Georgia Tech respectively.

Day 2 (Monday, August 5) -- Workshops


The workshop I attended was the Mining and Learning with Graphs Workshop (MLG 2019). This was an all-day event, with 5 keynote speakers. The first keynote speaker was Prof. Lisa Getoor of the University of California, Santa Cruz; she gave a shorter version of her RecSys 2018 keynote and mentioned the Probabilistic Soft Logic (PSL) Framework, a framework for developing (specifying rules and optimizing) probabilistic models. Prof. Austin Benson from Cornell spoke about Higher Order Link Prediction (slides). Lada Adamic spoke about interesting social network findings based on crunching county-level US demographics data from Facebook. She works for Facebook now, but I remember her as the University of Michigan professor who re-introduced me to Graph Theory after college, through her (now no longer active) course on Social Network Analysis on Coursera. Prof. Vagelis Papalexakis, from the University of California, Riverside, talked about Tensor Decomposition for Multi-aspect Graph Analytics, a talk that even a fellow workshop attendee and PhD student who had a poster accepted thought was heavy. Finally, Prof. Huan Liu of Arizona State University, the author of the (free to download) book Social Media Mining, spoke very entertainingly about the challenges in mining big social media data, mostly related to feature sparsity and privacy, and possible solutions to these. He also pointed the audience to an open source feature selection library called scikit-feature.

There were quite a few papers (as well as posters) in the workshop that I found interesting. The paper Graph-Based Recommendation with Personalized Diffusions uses random walks to generate personalized diffusion features for an item-based recommender. The Sparse + Low Rank trick for Matrix Factorization-based Graph Algorithms, based on Halko's randomized algorithm, describes a simple way to make matrix factorization more scalable by decomposing the matrix into a sparse and a low-rank component. Graph Embeddings at Scale proposes a distributed infrastructure to build graph embeddings that avoids graph partitioning. Temporal Link Prediction in Dynamic Networks (a poster) uses a Siamese LSTM network to compare pairs of sequences of node embeddings over time. When to Remember where you came from: Node Representation Learning in Higher-Order Networks uses historical links to predict future links.

Finally, I also went around looking at posters from other workshops. Of these, I found Automatic Construction and Natural-Language Description of Nonparametric Regression Models interesting; it attempts to classify time series trends against a library of reference patterns, and then creates a vector that can be used to generate a set of explanations for the trend.

This was followed by the KDD opening session, where, after various speeches by committee members and invited dignitaries, awards for various activities were given out. Of note were the ACM SIGKDD Innovation Award, awarded to Charu Aggarwal; the ACM SIGKDD Service Award, awarded to Balaji Krishnapuram; and the SIGKDD Test of Time Award, awarded to Christos Faloutsos, Natalie Glance, Carlos Guestrin, Andreas Krause, Jure Leskovec, and Jeanne VanBriesen.

There was another poster session that evening, where I had the chance to see quite a few more posters. Some of the ones I found interesting are as follows. Revisiting kd-tree for Nearest Neighbor Search uses randomized partition trees and Fast Fourier Transforms (FFT) to build kd-trees more efficiently with the same level of query accuracy. It caught my interest because I saw something about randomized partition trees, and I ended up learning something interesting. Another one was Riker: Mining Rich Keyword Representations for Interpretable Product Question Answering, which involves creating word vectors for questions and using attention maps to predict the importance of each of these words for a given product.

Day 3 (Tuesday, August 6) -- Oral Presentations


The day started with a keynote presentation titled The Unreasonable Effectiveness and Difficulty of Data in Healthcare by Dr Peter Lee of Microsoft Research. To a large extent, his points mirror those made by Dr. Eric Topol in Deep Medicine in terms of what is possible in medicine with the help of ML/AI, but he also looks at the challenges that must be overcome before this vision becomes reality.

Following that, I attended two sessions of Applied Data Science Oral Presentations, one on Auto-ML and Development Frameworks, and the other on Language Models and Text Mining, and then one session of Research Track Oral Presentation on Neural Networks.

I found the following papers interesting in the first Applied Data Science session. Auto-Keras: An Efficient Neural Architecture Search System uses Bayesian Optimization to find the most efficient Dense Keras network for your application; to the user, calling it is a one-liner. Currently this works on legacy Keras, but the authors are working with Google to have it ported to tf.keras as well. A more interesting framework, keras-tuner, currently works with tf.keras, and while invoking keras-tuner involves more lines of code, it does seem to be more flexible as well. TF-Ranking: Scalable Tensorflow Library for Learning-to-Rank is another Learning to Rank (LTR) framework, meant to be used instead of libraries like RankLib or LambdaMART. It provides pointwise, pairwise, and listwise ranking functions. FDML: A Collaborative Machine Learning Framework for Distributed Learning is meant to be used where learning needs to happen across platforms that are unable to share data, either for volume or privacy reasons. The idea is to learn local models with diverse local features, which output local results, and then combine the local results to get the final prediction. In addition, there was a talk on Pythia, an AI-assisted code completion system used in the VSCode editor, and one on Shrinkage Estimators in Online Experiments, which mentions the Pytorch-based Adaptive Experimentation Platform for Bayesian Parameter Optimization.
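For reference, the one-liner (plus surrounding boilerplate) looks roughly like the sketch below; this uses the AutoKeras 1.x style API rather than the legacy-Keras version described in the talk, so the exact signatures may differ:

```python
import autokeras as ak
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# The architecture search itself is essentially a one-liner
# (max_trials controls how many candidate networks are evaluated).
clf = ak.ImageClassifier(max_trials=3, overwrite=True)
clf.fit(x_train, y_train, epochs=5)

print(clf.evaluate(x_test, y_test))
```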

The second Applied Data Science session was on Language Models. The papers I found interesting in this session are as follows. Unsupervised Clinical Language Translation (paper) uses an unsupervised technique to induce a dictionary between clinical phrases and corresponding layman phrases, then uses a standard Machine Translation (MT) pipeline to translate one to the other. A reverse pipeline is also constructed, which can be used to generate more training data for the MT pipeline. GMail Smart Compose: Real-Time Assisted Writing underlies the phrase completion feature most GMail users are familiar with. It is achieved by interpolating predictions from a large global language model and a smaller per-user language model. As part of this work, they have open sourced Lingvo, a Tensorflow-based framework for building sequence models. Naranjo Question Answering using End-to-End Multi-task Learning Model attempts to infer adverse drug reactions (ADR) from EHRs by answering the Naranjo questionnaire using automated question answering. Finally, Automatic Dialog Summary Generation for Customer Service uses key point sequences to guide the summarization process, and a novel Leader-Writer network for the purpose.

The final oral presentation session for the day was the Research Track on Neural Networks. Unfortunately, I did not find any of the papers useful in terms of techniques I could borrow for my own work. I did get the impression that graph-based neural networks were the new thing, since almost every paper used some form of graph network. Apart from graph embeddings derived from node properties or from random walks on graphs, there is the graph convolutional network (GCN), which uses graph-local features instead of spatially local features. GCN-MF: Disease-Gene Association Identification by Graph Convolutional Networks and Matrix Factorization uses this kind of architecture to detect associations between diseases and genes. Similarly, Origin-destination Matrix prediction via Graph Convolution: A new perspective of Passenger Demand Modeling uses GCNs to predict demand for ride-hailing services.

The exhibition hall had also opened earlier that day, so I spent some time wandering the stalls, meeting a few people and asking questions about their products. There were a couple of publishers, MIT Press and Springer, selling Data Science books. There were some graph database companies, TigerGraph and Neo4j. Microsoft and Amazon were the two cloud providers with booths, but Google wasn't present (not my observation, it was pointed out to me by someone else). Indeed and LinkedIn were also there. NVIDIA was promoting its RAPIDS GPU-based acceleration framework, along with its GPUs. There were a bunch of smaller data science / analytics companies as well. I picked up a couple of stickers and some literature from the National Security Agency (NSA) and NVIDIA booths.

I then wandered over to the poster area. I managed to talk to a few people and listen to a few presentations. Notable among them was the poster on Chainer: a Deep Learning Framework for Accelerating the Research Cycle. I haven't used Chainer, but looking at the code in the poster, it looked a bit like Pytorch (or more correctly perhaps, Pytorch looks a bit like Chainer). Another framework to pick up when time permits, hopefully.

Day 4 (Wednesday, August 7) -- Hands-on Tutorials


I came in bright and early, hoping to attend the day's keynote presentation, but ended up having a great conversation with a data scientist from Minnesota instead, as we watched the sun rise across the Alaska Range from the third floor terrace of the conference building. In any case, the keynote I planned on attending ended up getting cancelled, so it was all good. For my activity that day, I had decided on attending two hands-on tutorials, one about Deep Learning for Natural Language Processing with Tensorflow, and the other about Deep Learning at Scale on Databricks.

The Deep Learning for NLP with Tensorflow tutorial was taught by a team from Google. It uses the Tensorflow 2.x style of eager execution and tf.keras. It covers the basics, then rapidly moves on to sequence models (RNN, LSTM), embeddings, sequence to sequence models, attention, and transformers. As far as teachability goes, I have spent a fair bit of time trying to figure this stuff out myself, then trying to express it in the cleanest possible way to others, and I thought this was the most intuitive explanation of attention I have seen so far. The slide deck is here; it contains links to various Colab notebooks. The Colab notebooks can also be found at this github link. The tutorial then covers the transformer architecture, and students (in an ideal world with enough internet bandwidth and time) are taught how to construct a transformer encoder-decoder architecture from scratch. They also teach you how to use the pre-trained BERT model from TF-Hub and optionally fine-tune it. Because we were not in an ideal world, after the initial few Colab notebooks, it was basically a lecture, where we were encouraged to run the notebooks on our own time.

The Deep Learning at Scale on Databricks tutorial was taught by a team from Databricks, and was apparently Part II of a two-part session. But quite a few of us showed up based on the session being marked as a COPY of the morning session, so the instructor was kind enough to run through the material again for our benefit. The slide deck can be found here. Unfortunately, I can no longer locate the URL for the notebook file archive to be imported into Databricks, but I am guessing these notebooks will soon be available as a Databricks tutorial. We used the Databricks platform provided by Microsoft Azure. In any case, the class was scheduled to cover Keras basics, MLFlow, Horovod for distributed model training, and HyperOpt for training models with different hyperparameters on different workers simultaneously. We ended up running through the Keras basics very fast, then spending some time on MLFlow, and finally running distributed training with Horovod on Spark. Most people had come to get some hands-on with Horovod anyway, so not being able to cover HyperOpt was not a big deal for most of us.
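For reference, the core Horovod-with-tf.keras wiring that the exercise builds on looks roughly like the sketch below. This follows the pattern in Horovod's documentation rather than the Databricks notebooks themselves (which wrap this pattern in their HorovodRunner utility), so treat it as an approximation:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# initialize Horovod and pin each worker process to one GPU
hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,))
])

# scale the learning rate by the number of workers and wrap the optimizer
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(loss="sparse_categorical_crossentropy", optimizer=opt,
              metrics=["accuracy"])

callbacks = [
    # ensure all workers start from the same initial weights
    hvd.callbacks.BroadcastGlobalVariablesCallback(0)
]
# model.fit(train_dataset, epochs=5, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)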

That evening was also the KDD dinner. I guess a lot of people (including me, based on past ACL conferences) had expected something more formal, but it turned out to be a standup with drinks and hors d'oeuvres. To be fair, the stand-up model does give you more opportunities to network. However, it was also quite crowded, so after a fairly long time spent in lines with correspondingly little profit, I decided to hit the nearby Gumbo House, where I enjoyed a bowl of gumbo and some excellent conversation with a couple of AWS engineers, also KDD attendees, who had decided to eat out rather than braving the lines. Talking of food, other good places to eat in downtown Anchorage are Orso, Pangea, and Fletcher's (good pizza). I am sure there are others, but these are the ones I went to and can recommend.

Day 5 (Thursday, August 8) -- More Hands-on Tutorial


This was the last day of the conference. I had a slightly early flight (3 pm), which meant that I would be able to attend only the sessions in the first half. In the morning keynote, Prof. Cynthia Rudin of Duke University spoke about her experience with smaller, simpler models versus large complex ones, and made the point that it is harder to come up with a simple model because the additional constraints are harder to satisfy. She then showed that it is possible to empirically test whether one or more simple models exist by looking at accuracies from multiple ML models. Overall, a very thought-provoking and useful talk.

For the rest of the day, I chose another hands-on tutorial titled From Graph to Knowledge Graph: Mining Large-scale Heterogeneous Networks using Spark taught by a team from Microsoft. As with the previous hands-on, we used Databricks provided by Azure. The objective was to learn to operate on subsets of the Microsoft Academic Graph, using Databricks notebooks available on this github site. However, since we were all sharing a cluster, there wasn't enough capacity for the students to do any hands-on, so we ended up watching the instructor run through the notebooks on the projector. The initial notebooks (run before lunch) seemed fairly basic, with standard DataFrame operators being used. I am guessing the fun stuff happened in the afternoon after I left, but in any case, Microsoft also offers a longer course From Graph to Knowledge Graph - Algorithms and Applications on edX, which I plan on auditing.

Closing Thoughts


There were some logistical issues that, in hindsight, could perhaps have been avoided. While Anchorage is a beautiful city and I thoroughly enjoyed my time there, for some attendees it was perhaps not as great an experience. One particularly scary problem was that some people's hotel bookings got cancelled due to a mixup with their online travel agents, which meant that they had no place to sleep when they arrived. Apparently some people had to sleep on park benches -- I thought that was particularly scary, at least until the University of Alaska opened up their dormitory to accommodate the attendees who had nowhere to go. I didn't get accommodation at the "KDD approved" hotels listed on their site either, but I did end up getting a place to stay that was only a 7 minute walk from the conference venue, so I count myself as one of the lucky ones. However, apart from this one major mishap, I think the conference went mostly smoothly.

At RecSys 2018, which I attended last year, one of the people in the group I found myself in said that he had finally "found his people". While my conference experience has been improving steadily over time with respect to the social aspect, and I did end up making a lot more friends at RecSys 2018 than I did here (partly due to the network effect of my colleague and his friends being die-hard extroverts), I do think I have finally found mine at KDD.



Monday, June 24, 2019

Understanding LR Finder and Cyclic Learning Rates using Tensorflow 2


A group of us at work are following Jeremy Howard's Practical Deep Learning for Coders, v3. Basically we watch the videos on our own and come together once a fortnight or so to discuss things that seemed interesting and useful, or questions that others might be able to answer. One thing that's covered fairly early in the course is how to use the Learning Rate Finder (LR Finder) tool that comes built in with the fast.ai library (fastai/fastai on github). The fast.ai library builds on top of the Pytorch framework and provides convenience functions that make deep learning development simpler -- a lot like what Keras did for Tensorflow, which incidentally is also the deep learning framework I started with and confess to being somewhat partial to, although nowadays I use the tf.keras (Tensorflow) port exclusively. So I figured it would be interesting to see how to do LR finding with Keras.

The LR Finder is the brainchild of Leslie Smith, who has published two papers on it -- A disciplined approach to neural network hyperparameters: Part 1 -- Learning Rate, Batch Size, Momentum, and Weight Decay, and another jointly with Nicholay Topin, Super-Convergence: Very Fast Training of Neural Networks using Large Learning Rates. The LR Finder approach is able to predict, with only a few iterations of training, a range of learning rates that would be optimal for a given model/dataset combination. It does this by varying the learning rate across training iterations and observing the loss at each learning rate. The shape of the plot of loss against learning rate provides clues about the optimal range of learning rates, much like stock charts provide clues to future prices of the stock.

This in itself is a pretty big time saving, compared to running limited epochs of training with different learning rates to find the "best" one. But in addition to that, Smith also proposes a learning rate schedule he calls the Cyclic Learning Rate (CLR). In its simplest incarnation, the learning rate schedule for CLR looks a bit like an isosceles triangle. Assuming N epochs of training, and an optimum learning rate range predicted by the LR Finder plot (min_lr, max_lr), for the first N/2 epochs the learning rate rises uniformly from min_lr to max_lr; for about 90% of the next N/2 epochs, it falls uniformly from max_lr to min_lr; and for the last 10%, it falls uniformly from min_lr towards 0. According to his experiments on a wide variety of standard model/dataset combinations, using the CLR schedule to train networks results in higher classification accuracy, often with fewer epochs of training, compared to using static learning rates.
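To make the shape concrete, here is a small sketch of that triangular schedule as a function of the (0-indexed) epoch; the exact split points are my reading of the description above, not code from Smith's papers:

# triangular CLR schedule as described above; split points are my own reading
def triangular_clr(epoch, num_epochs, min_lr, max_lr):
    rise_end = num_epochs // 2          # first half: rise min_lr -> max_lr
    fall_end = int(num_epochs * 0.9)    # next ~40%: fall max_lr -> min_lr
    if epoch < rise_end:
        return min_lr + (max_lr - min_lr) * epoch / rise_end
    elif epoch < fall_end:
        return max_lr - (max_lr - min_lr) * (epoch - rise_end) / (fall_end - rise_end)
    else:
        # final ~10%: anneal from min_lr down towards zero
        num_final = num_epochs - fall_end
        return min_lr * (1.0 - (epoch - fall_end + 1) / (num_final + 1))

# e.g. [triangular_clr(e, 10, 1e-6, 3e-1) for e in range(10)]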

There is already an LR Finder and CLR schedule implementation for Keras, thanks to Somshubhra Majumdar (titu1994/keras-one-cycle), where both the LR Finder and the CLR schedule (called OneCycleLR here) are implemented as Keras callbacks. There is a decidedly Keras flavor to the implementation. For example, the LR Finder always runs for one epoch of training. However, there are advantages to using this implementation compared to rolling your own. For example, unlike the LearningRateScheduler callback built into Keras, the OneCycleLR callback also optionally allows the caller to specify a momentum schedule along with the learning rate schedule. Overall, it would take some effort to convert it over to tf.keras, but probably not a whole lot.

At the other end of the spectrum is a Pytorch implementation from David Silva (davidtvs/pytorch-lr-finder), where the LR Finder is more of a standalone utility, which can predict the optimum range of learning rates given a model/dataset combination. I felt this approach was a bit cleaner in the sense that one can focus on what the LR finder does rather than try to think in terms of callback events.
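For reference, and going off my memory of its README (so treat the exact arguments as assumptions), using pytorch-lr-finder looks roughly like this:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch_lr_finder import LRFinder

# tiny synthetic setup, just to show the call pattern
x = torch.randn(512, 10)
y = torch.randint(0, 2, (512,))
train_loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(10, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)

lr_finder = LRFinder(model, optimizer, criterion, device="cpu")
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)
lr_finder.plot()   # loss vs learning rate curve
lr_finder.reset()  # restore the original model and optimizer state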

So I decided to use the pytorch-lr-finder as a model to build a basic LR Finder of my own that works against tf.keras on Tensorflow 2, and try it out against a small standard network to see how it works. For the CLR scheduler, I decided to pass a homegrown version of the CLR schedule into the built in tf.keras LearningRateScheduler. This post will describe that experience. However, since this was mainly a learning exercise, the code has not been tested beyond what I describe here, so for your own work, you should probably stick to using the more robust implementations I referenced above.

The network I decided on was the LeNet network, proposed by Yann LeCun in 1995. The actual implementation is based on Wei Li's Keras implementation, available in the BIGBALLON/cifar-10-cnn repository. The class is defined as follows, using the new imperative Chainer-like syntax adopted by Pytorch and now Tensorflow 2. I had originally assumed, like many others, that this syntax was one of the features that Tensorflow 2 was adopting from Pytorch, but it turns out that they are both adopting it from Chainer, as this Twitter thread from François Chollet indicates. In any case, convergence is a good thing for framework users like me. Talking of tweets from François Chollet, if you are comfortable with Keras already, here is another Twitter thread which tells you pretty much everything you need to know to get started with Tensorflow 2.

import tensorflow as tf

class LeNetModel(tf.keras.Model):

    def __init__(self, **kwargs):
        super(LeNetModel, self).__init__(**kwargs)
        self.conv1 = tf.keras.layers.Conv2D(
            filters=6,
            kernel_size=(5, 5),
            padding="valid",
            activation="relu",
            kernel_initializer="he_normal",
            input_shape=(32, 32, 3))
        self.pool1 = tf.keras.layers.MaxPooling2D(
            pool_size=(2, 2))
        self.conv2 = tf.keras.layers.Conv2D(
            filters=16,
            kernel_size=(5, 5),
            padding="valid",
            activation="relu",
            kernel_initializer="he_normal")
        self.pool2 = tf.keras.layers.MaxPooling2D(
            pool_size=(2, 2),
            strides=(2, 2))
        self.flat = tf.keras.layers.Flatten()
        self.dense1 = tf.keras.layers.Dense(
            units=120,
            activation="relu",
            kernel_initializer="he_normal")
        self.dense2 = tf.keras.layers.Dense(
            units=84,
            activation="relu", 
            kernel_initializer="he_normal")
        self.dense3 = tf.keras.layers.Dense(
            units=10,
            activation="softmax",
            kernel_initializer="he_normal")


    def call(self, x):
        x = self.conv1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.pool2(x)
        x = self.flat(x)
        x = self.dense1(x)
        x = self.dense2(x)
        x = self.dense3(x)
        return x        

Here is the Keras summary view for those of you who prefer something more visual. If you were wondering how I got actual values in the Output Shape column with the code above, I didn't. As Tensorflow Issue #25036 indicates, the call() method creates a non-static graph, so model.summary() is unable to compute the output shapes. To generate the summary below, I rebuilt the model as a static graph using tf.keras.models.Sequential(). The code is fairly trivial; a rough sketch of it is included below.
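Here is a rough sketch of that Sequential version (the layer configuration mirrors LeNetModel above):

lenet_static = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(6, (5, 5), padding="valid", activation="relu",
                           kernel_initializer="he_normal",
                           input_shape=(32, 32, 3), name="conv1"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), name="pool1"),
    tf.keras.layers.Conv2D(16, (5, 5), padding="valid", activation="relu",
                           kernel_initializer="he_normal", name="conv2"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2), name="pool2"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu",
                          kernel_initializer="he_normal", name="dense1"),
    tf.keras.layers.Dense(84, activation="relu",
                          kernel_initializer="he_normal", name="dense2"),
    tf.keras.layers.Dense(10, activation="softmax",
                          kernel_initializer="he_normal", name="dense3")
], name="le_net_model")
lenet_static.summary()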

Model: "le_net_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1 (Conv2D)               (None, 28, 28, 6)         456       
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 14, 14, 6)         0         
_________________________________________________________________
conv2 (Conv2D)               (None, 10, 10, 16)        2416      
_________________________________________________________________
pool2 (MaxPooling2D)         (None, 5, 5, 16)          0         
_________________________________________________________________
flatten (Flatten)            (None, 400)               0         
_________________________________________________________________
dense1 (Dense)               (None, 120)               48120     
_________________________________________________________________
dense2 (Dense)               (None, 84)                10164     
_________________________________________________________________
dense3 (Dense)               (None, 10)                850       
=================================================================
Total params: 62,006
Trainable params: 62,006
Non-trainable params: 0
_________________________________________________________________

The dataset I used for the experiment was the CIFAR-10 dataset, a collection of 60K (32, 32, 3) color images (tiny images) in 10 different classes. The CIFAR-10 dataset is available via the tf.keras.datasets package. The function below downloads the data, preprocesses it appropriately for use by the network, and converts it into the tf.data.Dataset format that Tensorflow 2 likes. It will return datasets for training, validation, and test, with size 45K, 5K, and 10K images respectively.

def load_cifar10_data(batch_size):
    (xtrain, ytrain), (xtest, ytest) = tf.keras.datasets.cifar10.load_data()

    # scale data using MaxScaling
    xtrain = xtrain.astype(np.float32) / 255
    xtest = xtest.astype(np.float32) / 255

    # convert labels to categorical    
    ytrain = tf.keras.utils.to_categorical(ytrain)
    ytest = tf.keras.utils.to_categorical(ytest)

    train_dataset = tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
    test_dataset = tf.data.Dataset.from_tensor_slices((xtest, ytest))

    # take out 10% of train data as validation data, shuffle, and batch
    # (shuffle once with reshuffle_each_iteration=False so the take/skip
    # split stays consistent across epochs, then reshuffle the training part)
    val_size = xtrain.shape[0] // 10
    train_dataset = train_dataset.shuffle(
        10000, reshuffle_each_iteration=False)
    val_dataset = train_dataset.take(val_size).batch(
        batch_size, drop_remainder=True)
    train_dataset = train_dataset.skip(val_size).shuffle(10000).batch(
        batch_size, drop_remainder=True)
    test_dataset = test_dataset.shuffle(10000).batch(
        batch_size, drop_remainder=True)
    
    return train_dataset, val_dataset, test_dataset

I trained the model first using a learning rate of 0.001, which I picked up from the blog post CIFAR-10 Image Classification in Tensorflow by Park Chansung. The training code is just 5-6 lines that will be very familiar to Keras developers: declare the model, compile it with a loss function and optimizer, train it for a fixed number of epochs (10), and finally evaluate it against the held out test dataset.

model = LeNetModel()
model.build(input_shape=(None, 32, 32, 3))

learning_rate = 0.001
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
loss_fn = tf.keras.losses.CategoricalCrossentropy()

model.compile(loss=loss_fn, optimizer=optimizer, metrics=["accuracy"])
model.fit(train_dataset, epochs=10, validation_data=val_dataset)
model.evaluate(test_dataset)

The output of this run is shown below. The first block is the output from the training run (model.fit()), and the last line is the output of the model.evaluate() call. As you can see, while the final accuracy values are not stellar, they are steadily increasing, so presumably we could expect good results given enough epochs of training. The objective of this run, though, was to create a baseline against which to measure training runs with learning rates inferred from the LR Finder, described below.

Epoch 1/10
351/351 [==============================] - 12s 35ms/step - loss: 2.3043 - accuracy: 0.1105 - val_loss: 2.2936 - val_accuracy: 0.1170
Epoch 2/10
351/351 [==============================] - 12s 34ms/step - loss: 2.2816 - accuracy: 0.1276 - val_loss: 2.2801 - val_accuracy: 0.1330
Epoch 3/10
351/351 [==============================] - 12s 33ms/step - loss: 2.2682 - accuracy: 0.1464 - val_loss: 2.2668 - val_accuracy: 0.1442
Epoch 4/10
351/351 [==============================] - 11s 33ms/step - loss: 2.2517 - accuracy: 0.1620 - val_loss: 2.2474 - val_accuracy: 0.1621
Epoch 5/10
351/351 [==============================] - 12s 33ms/step - loss: 2.2254 - accuracy: 0.1856 - val_loss: 2.2141 - val_accuracy: 0.1893
Epoch 6/10
351/351 [==============================] - 12s 34ms/step - loss: 2.1810 - accuracy: 0.2117 - val_loss: 2.1601 - val_accuracy: 0.2226
Epoch 7/10
351/351 [==============================] - 12s 34ms/step - loss: 2.1144 - accuracy: 0.2421 - val_loss: 2.0856 - val_accuracy: 0.2526
Epoch 8/10
351/351 [==============================] - 12s 35ms/step - loss: 2.0363 - accuracy: 0.2641 - val_loss: 2.0116 - val_accuracy: 0.2714
Epoch 9/10
351/351 [==============================] - 12s 35ms/step - loss: 1.9704 - accuracy: 0.2841 - val_loss: 1.9583 - val_accuracy: 0.2901
Epoch 10/10
351/351 [==============================] - 13s 36ms/step - loss: 1.9243 - accuracy: 0.2991 - val_loss: 1.9219 - val_accuracy: 0.2985

78/78 [==============================] - 1s 13ms/step - loss: 1.9079 - accuracy: 0.3112

My version of the LR Finder presents an API similar to the pytorch-lr-finder, where you pass in the model, optimizer, loss function, and dataset to create an instance of LRFinder. You then call range_test() on the LRFinder with the minimum and maximum boundaries for the learning rate, and the number of iterations. This step is similar to the Learner.lr_find() call in fast.ai. The range_test() function splits the learning rate range into the number of steps given by num_iters, trains the model on one batch at each learning rate, and records the loss. Finally, the plot() method plots the losses against the learning rates. Since we are training at the batch level, we need to calculate losses and gradients ourselves, as seen in the train_step() function. The code for the LRFinder class is as follows. The main section (under if __name__ == "__main__") contains calling code using the LeNet model, the CIFAR-10 dataset, the SGD optimizer, and the categorical cross-entropy loss function.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

class LRFinder(object):
    def __init__(self, model, optimizer, loss_fn, dataset):
        super(LRFinder, self).__init__()
        self.model = model
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.dataset = dataset
        # placeholders
        self.lrs = None
        self.loss_values = None
        self.min_lr = None
        self.max_lr = None
        self.num_iters = None


    @tf.function
    def train_step(self, x, y, curr_lr):
        tf.keras.backend.set_value(self.optimizer.lr, curr_lr)
        with tf.GradientTape() as tape:
            # forward pass
            y_ = self.model(x)
            # external loss value for this batch
            loss = self.loss_fn(y, y_)
            # add any losses created during forward pass
            loss += sum(self.model.losses)
            # get gradients of weights wrt loss
            grads = tape.gradient(loss, self.model.trainable_weights)
        # update weights
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
        return loss


    def range_test(self, min_lr, max_lr, num_iters, debug):
        # create learning rate schedule
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.num_iters = num_iters
        self.lrs =  np.linspace(
            self.min_lr, self.max_lr, num=self.num_iters)
        # initialize loss_values
        self.loss_values = []
        for step, (x, y) in enumerate(self.dataset):
            if step >= self.num_iters:
                break
            # use the learning rate scheduled for this step, so the recorded
            # losses line up with self.lrs in the plot
            curr_lr = self.lrs[step]
            loss = self.train_step(x, y, curr_lr)
            self.loss_values.append(loss.numpy())
            if debug:
                print("[DEBUG] Step {:d}, Loss {:.5f}, LR {:.5f}".format(
                    step, loss.numpy(), self.optimizer.learning_rate.numpy()))


    def plot(self):
        plt.plot(self.lrs, self.loss_values)
        plt.xlabel("learning rate")
        plt.ylabel("loss")
        plt.title("Learning Rate vs Loss ({:.2e}, {:.2e}, {:d})"
            .format(self.min_lr, self.max_lr, self.num_iters))
        plt.xscale("log")
        plt.grid()
        plt.show()


if __name__ == "__main__":

    tf.random.set_seed(42)

    # model
    model = LeNetModel()
    model.build(input_shape=(None, 32, 32, 3))
    # optimizer
    optimizer = tf.keras.optimizers.SGD()
    # loss_fn
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    # dataset
    batch_size = 128
    dataset, _, _ = load_cifar10_data(batch_size)
    # min_lr, max_lr
#    min_lr = 1e-6
#    max_lr = 3
    min_lr = 1e-2
    max_lr = 1
    # compute num_iters (Keras fit-one-cycle used 1 epoch by default)
    dataset_len = 45000
    batch_size = 128
    num_iters = dataset_len // batch_size
    # declare LR Finder
    lr_finder = LRFinder(model, optimizer, loss_fn, dataset)
    lr_finder.range_test(min_lr, max_lr, num_iters, debug=True)
    lr_finder.plot()

We first ran the LRFinder for a relatively large learning rate range from 1e-6 to 3. This gives us the chart on the left below. For the CLR schedule, the minimum LR for our range is where the loss starts descending, and the maximum LR is where the loss stops descending or becomes ragged. My charts are not as clean as those shown in the two projects referenced, but we can still infer that these boundaries are 1e-6 and about 3e-1. The chart on the right below is the plot of LR vs Loss on a smaller range (1e-2 and 1) to help see the chart in greater detail. We also see that the LR with minimum loss is about 3e-1.
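Reading these plots is a bit of an eyeball exercise. A crude way to automate it, assuming the lrs and loss_values arrays collected by the LRFinder above, is to smooth the losses and look for the point of steepest descent, something like the sketch below:

import numpy as np

# assumes lr_finder.lrs and lr_finder.loss_values from the range test above
losses = np.array(lr_finder.loss_values)
lrs = np.array(lr_finder.lrs)

# simple moving average to reduce raggedness
window = 5
smoothed = np.convolve(losses, np.ones(window) / window, mode="valid")
lrs_valid = lrs[:smoothed.shape[0]]   # rough alignment with the smoothed curve

# steepest descent of (smoothed) loss w.r.t. log(lr) as a candidate "best" LR
grads = np.gradient(smoothed, np.log(lrs_valid))
best_lr = lrs_valid[np.argmin(grads)]
print("steepest descent at LR = {:.2e}".format(best_lr))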






Based on this, my first experiment is to train the network with the best (largest useful) learning rate found by the LR Finder (3e-1), and see if it trains better over 10 epochs than my previous attempt. The only thing changed from the previous training code block above is the learning_rate, from 0.001 to 0.3. Here are the results of 10 epochs of training, followed by evaluation on the held out test set.

Epoch 1/10
351/351 [==============================] - 12s 36ms/step - loss: 2.1572 - accuracy: 0.1664 - val_loss: 1.9966 - val_accuracy: 0.2590
Epoch 2/10
351/351 [==============================] - 12s 34ms/step - loss: 1.8960 - accuracy: 0.2979 - val_loss: 1.7568 - val_accuracy: 0.3732
Epoch 3/10
351/351 [==============================] - 12s 35ms/step - loss: 1.7456 - accuracy: 0.3642 - val_loss: 1.6556 - val_accuracy: 0.3984
Epoch 4/10
351/351 [==============================] - 12s 35ms/step - loss: 1.6634 - accuracy: 0.4021 - val_loss: 1.6050 - val_accuracy: 0.4331
Epoch 5/10
351/351 [==============================] - 12s 35ms/step - loss: 1.5993 - accuracy: 0.4213 - val_loss: 1.6906 - val_accuracy: 0.3858
Epoch 6/10
351/351 [==============================] - 12s 36ms/step - loss: 1.5244 - accuracy: 0.4484 - val_loss: 1.5754 - val_accuracy: 0.4399
Epoch 7/10
351/351 [==============================] - 13s 36ms/step - loss: 1.4568 - accuracy: 0.4749 - val_loss: 1.4996 - val_accuracy: 0.4712
Epoch 8/10
351/351 [==============================] - 13s 36ms/step - loss: 1.3894 - accuracy: 0.4971 - val_loss: 1.4854 - val_accuracy: 0.4786
Epoch 9/10
351/351 [==============================] - 12s 35ms/step - loss: 1.3323 - accuracy: 0.5207 - val_loss: 1.4527 - val_accuracy: 0.4950
Epoch 10/10
351/351 [==============================] - 12s 36ms/step - loss: 1.2817 - accuracy: 0.5411 - val_loss: 1.4320 - val_accuracy: 0.5068

78/78 [==============================] - 1s 12ms/step - loss: 1.4477 - accuracy: 0.4920

Clearly, the larger learning rate is helping the network achieve better performance, although it does seem (at least around epoch 3) that it may be slightly too large. Accuracy on the held out test set jumped from 0.3112 to 0.4920, so overall it seems to be helping. Even if we only use the LR Finder to find the "best" static learning rate, this is still cheaper than doing multiple training runs of a few epochs each.

Finally, we will try a Cyclic Learning Rate (CLR) schedule using the learning rate boundaries (1e-6, 3e-1). The code for this is shown below. The clr_schedule() function produces a triangular learning rate schedule: the learning rate rises from the specified minimum to the maximum over roughly the first half of the 10 epochs, falls back toward the minimum over most of the second half, and is intended to drop to half the minimum for the final stretch. This is analogous to the Learner.fit_one_cycle() call in fast.ai. The clr_schedule function is passed to the LearningRateScheduler callback, which is then invoked by the model training loop via the callbacks parameter of the fit() call. Here is the code.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

def clr_schedule(epoch):
    num_epochs = 10
    mid_pt = int(num_epochs * 0.5)
    last_pt = int(num_epochs * 0.9)
    min_lr, max_lr = 1e-6, 3e-1
    if epoch <= mid_pt:
        return min_lr + epoch * (max_lr - min_lr) / mid_pt
    elif epoch > mid_pt and epoch <= last_pt:
        return max_lr - ((epoch - mid_pt) * (max_lr - min_lr)) / mid_pt
    else:
        return min_lr / 2

# plot the points
# epochs = [x+1 for x in np.arange(10)]
# clrs = [clr_schedule(x) for x in epochs]
# plt.plot(epochs, clrs)
# plt.show()

batch_size = 128
train_dataset, val_dataset, test_dataset = load_cifar10_data(batch_size)

model = LeNetModel()

min_lr = 1e-6
max_lr = 3e-1
optimizer = tf.keras.optimizers.SGD(learning_rate=min_lr)
loss_fn = tf.keras.losses.CategoricalCrossentropy()

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(clr_schedule)

model.compile(loss=loss_fn, optimizer=optimizer, metrics=["accuracy"])
model.fit(train_dataset, epochs=10, validation_data=val_dataset,
    callbacks=[lr_scheduler])
model.evaluate(test_dataset)

And here are the results of training the LeNet model with the CIFAR-10 dataset for 10 epochs, and evaluating on the held out test set. As you can see, the evaluation accuracy on the held out test set has jumped further from 0.4920 to 0.5469.

Epoch 1/10
351/351 [==============================] - 13s 36ms/step - loss: 2.7753 - accuracy: 0.0993 - val_loss: 2.7627 - val_accuracy: 0.1060
Epoch 2/10
351/351 [==============================] - 12s 34ms/step - loss: 2.0131 - accuracy: 0.2097 - val_loss: 2.0829 - val_accuracy: 0.2634
Epoch 3/10
351/351 [==============================] - 12s 34ms/step - loss: 1.8321 - accuracy: 0.3106 - val_loss: 1.7187 - val_accuracy: 0.3718
Epoch 4/10
351/351 [==============================] - 12s 35ms/step - loss: 1.7100 - accuracy: 0.3648 - val_loss: 1.7484 - val_accuracy: 0.3928
Epoch 5/10
351/351 [==============================] - 12s 36ms/step - loss: 1.5779 - accuracy: 0.4209 - val_loss: 1.6188 - val_accuracy: 0.4087
Epoch 6/10
351/351 [==============================] - 13s 36ms/step - loss: 1.5451 - accuracy: 0.4300 - val_loss: 1.5704 - val_accuracy: 0.4377
Epoch 7/10
351/351 [==============================] - 12s 35ms/step - loss: 1.3597 - accuracy: 0.5063 - val_loss: 1.3742 - val_accuracy: 0.5014
Epoch 8/10
351/351 [==============================] - 12s 36ms/step - loss: 1.2383 - accuracy: 0.5484 - val_loss: 1.3620 - val_accuracy: 0.5204
Epoch 9/10
351/351 [==============================] - 12s 35ms/step - loss: 1.1379 - accuracy: 0.5856 - val_loss: 1.3336 - val_accuracy: 0.5391
Epoch 10/10
351/351 [==============================] - 12s 35ms/step - loss: 1.0564 - accuracy: 0.6130 - val_loss: 1.3008 - val_accuracy: 0.5557

78/78 [==============================] - 1s 13ms/step - loss: 1.3043 - accuracy: 0.5469

This indicates that the LR Finder and CLR schedule seem like good ideas to try when training your models, especially when using non-adaptive optimizers such as SGD.

I tried the same sequence of experiments with the Adam optimizer, and I got better results (test accuracy: 0.5908) using a fixed learning rate of 1e-3 for the first training run. Thereafter, based on the LR Finder reporting a learning rate range of (1e-6, 1e-1), the next two experiments, using the best single learning rate and the CLR schedule, both produced accuracies of about 0.1 (i.e., close to random for a 10-class classifier). I wasn't too surprised, since I figured that the CLR schedule probably interfered with Adam's own adaptive learning rate mechanism. However, according to this tweet from Jeremy Howard, the LR Finder can be used with the Adam optimizer as well. Given that he has probably conducted many more experiments around this than I have, and the fast.ai Learner.lr_find() code is more robust and heavily tested than my homegrown implementation, he is very likely right, and my results are an anomaly.

That's all I have for today. Thanks for staying with me so far. I learned a lot from implementing this code, and hopefully you learned a few things from reading this as well. Hopefully, it gives you some ideas for building an LR Finder for Tensorflow 2 that can be used easily by end users -- if you do end up building one, please let me know, and I will be happy to link to your site/repository and recommend it to other readers.