Monday, October 01, 2018

Trip Report (sort of): RELX Search Summit 2018


Last week, I was at our London office attending the RELX Search Summit. The RELX Group is the parent company that includes my employer (Elsevier) as well as LexisNexis, LexisNexis Risk Solutions and Reed Exhibitions, among others. The event was organized by our Search Guild, an unofficial special interest group of search professionals from all these companies. As you can imagine, search is something of a big deal at both LexisNexis and Elsevier, given several large, well-known search platforms such as Lexis Advance, ScienceDirect, Scopus and Clinical Key.

There were quite a few interesting presentations at the Search Summit, some of which I thought were quite groundbreaking from an applied research point of view. I was going to write up a trip report when I realized that at least some of the talks probably represented competitive information that would not be appropriate for a public forum. Besides, it is very likely that it would be of limited interest anyway. So I decided to write this post only about the two presentations that I did at the Summit. Neither of them is groundbreaking, but I think they might be interesting to most people. Hopefully you think so too.

The first presentation was a 3-hour tutorial session on Content Engineering. The theme I wanted to explore was how to identify keywords in text using various unsupervised and supervised techniques, and how this can improve search. As you know, my last job revolved around search driven by a medical ontology, where the ontology was painstakingly hand-curated by a team of doctors, nurses and pharmacists over a span of several years.

Having an ontology makes quite a few things much easier. (It also makes several things much harder, but we won't dwell on that here). However, the use case I was trying to replicate was where you have the content to search, but no ontology to help you with it. Could we, using a variety of rule-based, statistical and machine learning techniques, identify phrases that represent key ideas in the text, similar to concepts in our ontology? And how does this help with search?

The dataset I used was the NIPS (Neural Information Processing Systems) conference papers from 1987 to 2017. The hope was that I would learn something about the cool techniques and algorithms that show up at NIPS just by having to look at the text to debug problems, although it didn't quite work out the way I had hoped. I demonstrate a variety of techniques such as LLR (statistical), RAKE (rule-based) and MAUI (machine learning based), as well as the use of Stanford and SpaCy NER models, and duplicate keyword detection and removal using SimHash and Dedupe. I also demonstrate how to do dimensionality reduction using various techniques (PCA, Topic Modeling, NMF, word vectors) and how they can be used to enhance the search experience. Another point I was trying to make was that there are plenty of third-party open source tools available to do this job without significant investment in coding.
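To give a flavor of the statistical approach, here is a minimal sketch of LLR-style keyphrase extraction using NLTK's collocation finder. The toy text, frequency cutoff and number of phrases printed are purely illustrative; the notebooks in the Github repository linked below run this kind of analysis over the full NIPS corpus.

```python
# A minimal sketch of the statistical (LLR) flavor of keyphrase extraction,
# using NLTK's collocation finder on a toy snippet. Thresholds are illustrative.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = """we train a convolutional neural network with stochastic gradient
descent and evaluate the neural network on a held out test set"""

tokens = [t for t in text.lower().split() if t.isalpha()]
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(1)   # raise this cutoff on a real corpus

# Rank candidate bigrams by log-likelihood ratio: high scores indicate word
# pairs that co-occur more often than chance, i.e. keyphrase candidates.
scored = finder.score_ngrams(BigramAssocMeasures.likelihood_ratio)
for phrase, score in scored[:10]:
    print(" ".join(phrase), round(score, 2))
```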

All the techniques listed above are demonstrated using Jupyter notebooks. In addition, I built a little Flask-based web application that shows these techniques in action against a Solr 7.3 index containing the NIPS papers. The web application demonstrates techniques on the query parsing side, where we rewrite queries in various ways to exploit the information available, as well as on the content side, where the additional information is used to suggest documents similar to the one being viewed, or to make personalized reading recommendations based on the collection of documents read so far.
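As an illustration of the query-rewriting idea, here is a bare-bones sketch of expanding a query with related keyphrases before sending it to Solr. The Solr URL, core name, field names and the related_phrases() lookup are placeholders I made up for this sketch; the actual implementation is in the Flask application in the repository.

```python
# A bare-bones sketch of query rewriting: expand the user's query with
# related keyphrases, boost the original, and send the result to Solr.
import requests

SOLR_SELECT = "http://localhost:8983/solr/nips/select"   # assumed local core

def related_phrases(query):
    # Stand-in for the keyphrase / word-vector expansions built in the notebooks.
    lookup = {"gradient descent": ["stochastic optimization", "backpropagation"]}
    return lookup.get(query, [])

def search(query, rows=10):
    expansions = related_phrases(query)
    # Keep the original query at a higher boost, OR in the expansions.
    rewritten = '"{}"^2.0 {}'.format(
        query, " ".join('"{}"'.format(p) for p in expansions))
    params = {"q": rewritten, "df": "text", "rows": rows, "wt": "json"}
    return requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]

for doc in search("gradient descent"):
    print(doc.get("id"), doc.get("title"))
```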

The presentation slides, the notebooks and the web application can all be found in my sujitpal/content-engineering-tutorial project on Github. Since many of the participants had already been looking at similar ideas, several new ones were suggested during the tutorial, and it morphed into a nice interactive workshop-style discussion. I hope to add them in as I find time.

My second presentation was on Learning to Rank (LTR) basics. I had recently become interested in LTR following my visit to the Haystack Search Relevancy conference earlier this year, coupled with the chance discovery that a content-based recommender system I was helping to improve had around 40,000 labeled query-document pairs, which could be used to improve the quality of its recommendations.

The dataset I chose for this presentation was The Movie DataBase (TMDB), a collection of 45,000 movies, 20 genres and 31,000 unique keywords. The idea was to see if I could teach a RankLib LambdaMART model the ordering given by the rating field, which is on a 10-point continuous scale. In a sense, the approach is similar to this LTR article using scikit-learn by Alfredo Motta. Most LTR datasets just give you the feature dataset in LETOR format to train your ML models, so you can't actually do the full end-to-end pipeline.

In any case, the presentation starts off with a little bit of historical context, then covers the different kinds of LTR models (pointwise, pairwise and listwise), some common algorithms that people tend to use, some advice to keep in mind when considering building an LTR model, ideas for features, the LETOR data format, etc. Most of the meat of the presentation is the creation of an LTR model using the Solr 7.4 and Elasticsearch 6.3.1 plugins, as well as for a hypothetical indexer with no LTR support (I used Solr, but did the feature generation outside the indexer). I was hoping to cover at least one of the case studies live but ran into technical difficulties (my fault, I should have listened to the organizers when they said to put everything in the slides).

Essentially, the methodology is similar for all 3 case studies; the main differences are in syntax (Solr vs Elasticsearch). First we need a set of queries, each with a ranked list of documents. I used the ratings, which are on a 10-point continuous scale, to create categorical query-document labels on a 5-point scale, along the lines of the sketch below.
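For concreteness, here is a minimal sketch of that binning step. The even binning into 5 grades is an assumption for illustration; the scripts in the repository define the actual mapping used.

```python
# A small sketch of turning a continuous 0-10 rating into the integer 0-4
# graded relevance labels that an LTR trainer such as RankLib expects.
def rating_to_label(rating, num_grades=5, max_rating=10.0):
    """Map a continuous 0..10 rating to an integer relevance grade 0..4."""
    grade = int(rating / max_rating * num_grades)
    return min(grade, num_grades - 1)   # clamp rating == 10.0 into the top grade

for r in [1.3, 4.9, 6.2, 8.7, 10.0]:
    print(r, "->", rating_to_label(r))
```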

Once the data is loaded, the first step is to define the features in Solr and Elasticsearch - features are specified as function queries. We then generate the feature values by running our queries against these function queries and writing the results out to a file in LETOR format. The reason we use the index is mostly to generate the query-document similarity features; for a system without LTR support, this can also be done (less efficiently) outside the index.
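To make this concrete, here is roughly what the Solr side of this step looks like with the LTR contrib module enabled: register a couple of features in a feature store, then use feature logging to pull back per-document feature values. The store name, feature names, fields and core name are placeholders of my own, not the ones used in the talk.

```python
# A hedged sketch of defining LTR features in Solr and logging their values.
import requests

SOLR = "http://localhost:8983/solr/tmdb"   # assumed local core

features = [
    {"store": "movieFeatures", "name": "titleScore",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params": {"q": "{!dismax qf=title}${query}"}},
    {"store": "movieFeatures", "name": "popularity",
     "class": "org.apache.solr.ltr.feature.FieldValueFeature",
     "params": {"field": "popularity"}},
]
# Register the features in the feature store.
requests.put(SOLR + "/schema/feature-store", json=features)

# Feature logging: ask Solr to return the computed feature values per document,
# which we can then write out line by line in LETOR format.
params = {
    "q": "star wars",
    "defType": "dismax", "qf": "title overview",
    "fl": "id,score,[features store=movieFeatures efi.query='star wars']",
    "rows": 10, "wt": "json",
}
docs = requests.get(SOLR + "/select", params=params).json()["response"]["docs"]
for doc in docs:
    print(doc["id"], doc["[features]"])   # e.g. titleScore=3.1,popularity=42.0
```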

The LETOR format was originally used by the LTR model suite RankLib (which provides 8 different LTR models), and has since been adopted by most other third-party LTR implementations. Model training has to happen using third-party libraries, of which RankLib is one; I trained a RankLib LambdaMART model for all 3 cases. The output of RankLib is an XML file whose format varies depending on what kind of model it represents. For a linear model, it is just a set of coefficients, one for each of the features defined. For RankNet, a neural network, it is a weight matrix that transforms the incoming features into a set of rank probabilities. For LambdaMART, which is a forest of decision trees, it is a set of trees, each with splits defined for various levels.
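Here is a small sketch of what that looks like in practice: a few LETOR-formatted training lines and a call out to RankLib to train a LambdaMART model (ranker type 6). The jar path, feature values and metric are placeholders rather than the exact settings used for the talk.

```python
# A sketch of the training step: write LETOR-formatted lines and invoke RankLib.
import subprocess

# LETOR format: <grade> qid:<query id> <featureid>:<value> ... # <comment>
letor_lines = [
    "4 qid:1 1:3.10 2:42.0 3:0.0 # docid=1893",
    "2 qid:1 1:0.85 2:17.5 3:1.0 # docid=2240",
    "0 qid:2 1:0.00 2:3.2 3:0.0 # docid=311",
]
with open("train.txt", "w") as f:
    f.write("\n".join(letor_lines) + "\n")

subprocess.run([
    "java", "-jar", "RankLib.jar",   # path to the RankLib jar (placeholder)
    "-train", "train.txt",
    "-ranker", "6",                  # 6 = LambdaMART in RankLib
    "-metric2t", "NDCG@10",
    "-save", "lambdamart_model.txt",
], check=True)
```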

Once the model is trained, it has to be uploaded to Solr or Elasticsearch. Solr needs the model to be specified in JSON format, so you need to write some code to convert the XML to JSON, while Elasticsearch will accept the XML definition of the trained model without any conversion. You can now use the rerank functionality in Solr or Elasticsearch to rerank the top slice of a base query. For indexers that don't have LTR support, you will have to generate the search results, extract and rerank the top slice using your trained model, and add it back into the search results yourself.
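Here is a sketch of what the reranking call looks like on the Solr side once the converted JSON model has been uploaded to Solr's model store. The model name, core name and the size of the reranked slice are placeholders.

```python
# A sketch of reranking the top slice of a base query with the trained model,
# using Solr's {!ltr} rerank query parser.
import requests

SOLR = "http://localhost:8983/solr/tmdb"   # assumed local core

params = {
    "q": "star wars",
    "defType": "dismax", "qf": "title overview",
    # Rerank the top 100 hits of the base query with the trained LTR model.
    "rq": "{!ltr model=lambdamartModel reRankDocs=100 efi.query='star wars'}",
    "fl": "id,title,score",
    "rows": 10, "wt": "json",
}
docs = requests.get(SOLR + "/select", params=params).json()["response"]["docs"]
for doc in docs:
    print(doc["title"], doc["score"])
```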

Notebooks and scripts describing each of the case studies, as well as the presentation slides, can be found in my sujitpal/ltr-examples repository on Github. My main objective here was to understand how to "do" LTR using Solr and Elasticsearch, so I didn't spend much time trying to improve results. Perhaps if I can find a bigger labeled dataset to play with, I might revisit this project and try to evaluate each platform in more detail; I would appreciate suggestions if you know of any such datasets. Note that standard LTR datasets such as MSLR-WEB10K just provide the features in LETOR format, so they only exercise the part where you train and evaluate the LTR model, and have nothing to do with showing the results on a search index. What I am looking for is a dataset with labeled query-document pairs.

That's all I have for today. This week I am at RecSys 2018 in Vancouver, hoping to learn more about recommender systems from the very best in the field, and to meet others working in the recommendation space. Do ping me if you are here as well; it would be nice to meet face to face.


2 comments (moderated to prevent spam):

Anonymous said...

How did I miss this, Sujit? :)

Amazing, it consumed a good portion of my Sunday today... totally worth it!

It's you to thank!

/sim

Sujit Pal said...

Sorry for the delay in replying, and glad you enjoyed it, Sim!