Sunday, April 15, 2018

Trip Report: Haystack 2018


Earlier this week (April 10 and 11), I was at the Haystack Search Relevance conference in Charlottesville, VA. The conference was organized by Doug Turnbull and Eric Pugh from OpenSource Connections (o19s). Doug Turnbull (and his co-author John Berryman, who was also at the conference) introduced me many years ago to the world of principled search relevance evaluation through their book Relevant Search. The stated objective of the conference was to bring together people interested in search relevance, so that we can collectively advance the state of the art in this area. To that end, a Slack channel was created in advance of the conference, allowing people to form mental alliances even before they arrived, and to build upon ideas that came out of actual meetings during the conference.

Although I knew of Doug through his book before I started at my current job, it was during my outreach efforts at our Search Guild (a loose internal group of search engineers, product managers, information retrieval specialists, UI/UX professionals, etc., interested in search) that I got to know him personally. He contacted me to check if I might be interested in presenting at Haystack, and if so, to submit an abstract. Now, even though my background is in search, lately the work I am doing has more to do with Machine Learning (ML) than Search. And while I wasn't looking, Lucene/Solr/Elasticsearch (the search tools I am familiar with) have all jumped a couple of major versions or more, so my skills in Search are dated as well. Almost on a whim, and without much hope of it being accepted, I submitted a presentation proposal around Image Search, describing some work I had done over the last year, even though I wasn't sure it would be a good fit for this conference.

In retrospect, I am glad I did. The proposal was accepted and the resulting talk was quite well received. It also wasn't the complete outlier I had feared it would be. There was at least one other talk about image search (granted, it was more about image recognition and had better outcomes than mine), one talk about word embeddings (tangentially related, if you consider the vector representation I was using to be an image embedding), and a talk about a platform that supports tensor based search (arguably a better platform for Image Search than text based indexes such as Lucene).

Most importantly, I got to meet many people here whom I had previously known only through their work in various open source projects. Hopefully, this will lead to me being more active on the Slack channel and actually learning something from other participants. As a home-based employee, I also got the accidental benefit of meeting several Elsevier and LexisNexis colleagues who were attending or presenting as well. Also, I learned that Salmon Run (this blog) is far more popular in the search community than I had imagined, which was very gratifying; I am happy so many people find it useful.

In this post, I provide summaries of the talks I attended. There were two parallel tracks, so there were also a few talks (happening concurrently in the other track) that I wish I could have attended, and I call those out here as well. Links to presentation slides were made available for most talks, and I have linked to the ones that were available by the time I published this post.

DAY 1 (Tuesday, April 10)


Keynote - by Eric Pugh and Doug Turnbull
Eric started off the keynote by welcoming everyone, and talked about the motivation for organizing the conference. Doug then spoke of how, as a search engineer, he felt that conferences on search were either too academic or too focused on a single company or product. He wanted something that would be more focused on search relevance techniques regardless of what product you use. It was also a call to arms for the target audience, i.e., search experts involved in open source, to come together and build components that can be composed by others, making it easier and faster to build high quality search applications. He compared this to the standardization of tools such as the plunger in the plumbing industry. Interestingly, there is a golden plunger prominently displayed in the o19s lobby, so I guess he has used the analogy before.

Facets and Similarity - Exploring the Meta-Informational Hyperspace - by Ted Sullivan
Ted spoke about using facets not only to filter records, as they are normally used, but also as real metadata. He brought up how historically facets were called parameters by Verity, dimensions by Endeca, navigators by FAST and refiners by Microsoft FAST, showing how the notion of dimensionality is built into facets. He proposed the facet similarity theorem, which states that facets can be used to find similar things. Thus, a collection of facets could be used to compose feature vectors, which could then be compared to other feature vectors using notions of similarity in vector space. In addition, facets can be used as metadata to construct knowledge bases using entity and fact extraction, find paths in category space using pivot facets, build multi-dimensional query suggesters, provide a way to increase precision as the user supplies more data, allow applications such as dynamic boosting of suggestions based on previous queries, and allow much more precise clustering than standard LDA. He also proposed Facet Ratios as a way to find keyword clusters. Topic maps and clusters built from keywords generated through facet ratios (also known as Facet based clustering) end up being cleaner than ones derived from raw TF/IDF. Facet based clustering is going to be incorporated into the LucidWorks Fusion product.
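To make the facet similarity idea concrete, here is a tiny sketch of my own (the facet values and counts are made up): treat each document's facet value counts as a feature vector and compare documents with cosine similarity.

import numpy as np

# hypothetical facet value counts for two documents, over a shared vocabulary
# of facet values (e.g. subjects, authors, publication years)
facet_values = ["ml", "search", "nlp", "2017", "2018", "smith", "jones"]
doc_a = np.array([3, 5, 0, 1, 2, 1, 0], dtype=float)
doc_b = np.array([2, 4, 1, 0, 3, 0, 1], dtype=float)

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(doc_a, doc_b))   # higher means more "facet similar"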

Algorithmic Extraction of Keywords, Concepts and Vocabularies - by Max Irwin
Max described various approaches he used to extract keywords, some of which I knew about, but others were new to me (at least I hadn't tried them myself). Some tools for keyword extraction are gensim's implementation of Latent Dirichlet Allocation (LDA), the RAKE algorithm and Maui 2.0. He also described concept extraction techniques using POS tagging and edge labeling using SpaCy, Topia TermExtract and his own SkipChunk system. He also mentioned a system called TAXI (Taxonomy Induction) for taxonomy generation. He applied these techniques to the o19s blog corpus and produced results from each of these approaches for comparison. Lots of good information about tools here, which was especially interesting to me since I am currently doing something quite similar.
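Since I am playing with similar things myself, here is a minimal gensim LDA sketch along the lines Max described (toy corpus, my own code, not from the talk):

from gensim import corpora, models

# toy corpus: each document is a list of tokens (real pipelines would
# lowercase, remove stopwords, lemmatize, etc.)
docs = [
    ["search", "relevance", "ranking", "query"],
    ["learning", "to", "rank", "query", "features"],
    ["image", "vector", "similarity", "search"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])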

From clicks to models, the Wikimedia LTR pipeline - by Eric Bernhardson
Eric talked about the Wikimedia Learning To Rank (LTR) pipeline. Most of the talk was focused on the engineering aspects of the pipeline. He talked about MjoLniR, a library written in PySpark and Scala that transforms click logs into ML models for ranking in ElasticSearch (ES). For a baseline, they developed a model that learns the existing ranking function. They use click models, a principled way to translate implicit preferences into unbiased labels, using DbnModel from the Python clickmodels library. These operate on groups of sessions with the same intent. This allows them to optimize query results by adding specific rewrites based on predicted intent.

Expert Customers: A Hybrid Approach to Clickstream Analytics - by Elizabeth Haubert
Elizabeth talked about her experience building a ClickStream Analytics platform for search. She talked about the features that are normally considered - query features such as click position, number of clicks and query length, session features such as number of queries per session, number of no-click queries, session time, number of reformulations, and the URLs visited during the session, as well as user features such as number of clicks, number of queries, the user's dwell time, and the similarity of this user to other users in the system. She talked about the importance of a labeled test set in addition to these captured features, since once the test set becomes available, we can, depending on the amount of human judgement information available, go from simple set differences to increasingly richer metrics such as Precision/Recall, Mean/Expected Reciprocal Rank (MRR/ERR), and Discounted Cumulative Gain (DCG and nDCG). She talked about using TREC data (Cranfield model) to validate her results. She also covered some common-sense techniques to increase your chances of getting good data from human testers, such as reducing task ambiguity by building stories and guidelines, and using a scale that does not overwhelm users. This was also an interesting talk for me since I am looking at some of these ideas myself.
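To make the metric progression concrete, here is a small sketch (my own, following the standard definitions) of reciprocal rank and nDCG for a single query, given graded judgments of the ranked results:

import math

def reciprocal_rank(ranked_relevances):
    # reciprocal rank of the first relevant (non-zero) result
    for i, rel in enumerate(ranked_relevances, start=1):
        if rel > 0:
            return 1.0 / i
    return 0.0

def ndcg(ranked_relevances, k=10):
    # nDCG@k given graded relevance judgments in ranked order
    def dcg(rels):
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_relevances[:k]) / idcg if idcg > 0 else 0.0

# graded judgments (0-3) for the top 5 results of one query
judgments = [0, 3, 2, 0, 1]
print(reciprocal_rank(judgments), ndcg(judgments, k=5))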

In addition, one talk I wish I could have attended was Embracing Diversity: Searching over multiple languages by Jeff Zemerick and Suneel Marthi. Both speakers are seasoned Apache committers on various projects, including ones I have benefited from in the past, such as Mahout and OpenNLP. Their talk covered the need for multi-lingual search, the basics of Machine Translation (MT) such as alignment, phrase models, etc., and the evaluation of MT using BiLingual Evaluation Understudy (BLEU) scores. They also introduced Apache Joshua, an Apache Incubator project written in Java that supports statistical MT for phrase-based, hierarchical and syntax based translation.
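BLEU itself is easy to experiment with; here is a minimal example using NLTK (my own toy sentences, not from their talk):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "is", "on", "the", "mat"]
candidate = ["the", "cat", "sat", "on", "the", "mat"]

# BLEU compares n-gram overlap between the candidate translation and one
# or more references; smoothing avoids zero scores on very short sentences
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)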

The first day ended with a series of 5 minute lightning talks from participants. Lightning talks (as well as all Track 2 talks) were held in Random Row, a brewery and pub in the same compound as the o19s offices. The beer was awesome (definitely recommended if you happen to be in the area), but next time I will remember to take notes as well, because it is so hard to remember all the great ideas that came up in these talks. Here are a few that I do remember.

  • Representative Nouns by David Smiley - this is in the context of product search. The idea is that there is usually a noun that uniquely represents the product, and that we should identify it and make it available as a searchable facet.
  • Concept Indexing by Shyamsundar Mutcha - my colleague Shyam gave a high level overview of our concept indexing and search algorithm and explained its benefits.
  • Solr Concordancer plugin by Tim Allison - Tim described a Solr plugin to generate concordances. This is useful for exploring your data. Tim and I got to talking after his talk about some use cases I had handled with a concordancer of my own earlier, and it turned out that my concordancer code served as inspiration for his plugin (which he maintains and keeps aligned with various Lucene versions).
  • Solr explain plan visualizer - a Solr based plugin by Tom Burgmans that parses the JSON output from Solr's explain plan and produces a nice visualization that is easier to understand and use than the text version.
  • Querqy Solr Query Rewriter plugin by Rene Kriegler - a very useful Solr plugin that allows you to set up pattern matches to queries and associated rewrite rules.
  • Search Metrics - by Doug Rosenoff - my colleague Doug described a family of search metrics that provides greater expressivity as more and more training data is provided. Doug also did a full-length presentation LexisNexis Learning to Rank Case Study with Doug Heitkamp and Tito Sierra, which I did not attend as I had already attended an internal presentation they did on the same subject.


DAY 2 (Wednesday, April 11)


Learning to Rank in an Hourly Job marketplace - by Xun Wang and Jason Kowaleswski
Xun and Jason are from Snag, the largest online marketplace for hourly workers. Their objective is to match a job seeker up with multiple jobs, and to recommend the best candidate for a given job. In that sense, the problem is one of limited supply and unlimited access. A peculiarity of the hourly job marketplace is that schedule and location are often more important than the actual job content, and the queries reflect that, so the challenge is often to determine intent and context. They set out to migrate from their legacy rule based search system to a more modern one using the ES Learning to Rank (LTR) plugin built by Doug Turnbull and the team at o19s. Relevancy signals are collected from multiple levels of interaction, such as clicks, intent classification, completed applications, interviews and actual hires, so the problem is as much recommendation as it is search. The LTR model generates features for the ES LTR plugin by taking these relevancy features and composing them using LambdaMART. Their migration is through a new parallel system which will gradually take on more and more of the existing workload as its relevancy performance gets closer to that of the existing system over time.

A picture is worth a thousand words - approaches to search relevance scoring based on product data, including image recognition - by Rene Kriegler
Rene described the various stages of eCommerce search (problem recognition, information search, alternative evaluation, purchase and post-purchase) and how each stage informs search intent. He also pointed out that in eCommerce search, each document is a proxy for the thing being searched for, and that consumer interest becomes part of the relevance criteria. Other parts of the relevance criteria are the seller's perspective and personalization and individualization (i.e., topicality). This last part is where image recognition comes in. The idea is that you can compute the likelihood of a query given an image recognition vector subspace, i.e., relevancy is a function of the Jaccard similarity between image vectors for products that match a query. Image vectors in this model are generated from a pre-trained Inception network similar to this one. Clustering is done using Random Projections to produce binary vectors for each image, which can then be clustered. Results corresponding to larger clusters are boosted and rescored. He humorously concluded that while a picture may or may not be worth a thousand words, it is definitely worth a language model. One other interesting point he brought up is that the binary vectors resulting from the random projections can be stored in Lucene bit vectors.
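Here is my own back-of-the-envelope sketch of the random projection step as I understood it (the dimensions and number of hyperplanes are arbitrary): project each image vector onto a set of random hyperplanes, keep the sign bits as a binary code, and compare codes with Jaccard similarity.

import numpy as np

rng = np.random.RandomState(42)

# toy "image vectors", e.g. 2048-d outputs of a pre-trained CNN
image_vecs = rng.randn(5, 2048)

# random projection: each of the 64 random hyperplanes contributes one bit
projection = rng.randn(2048, 64)
bits = (image_vecs.dot(projection) > 0).astype(np.uint8)   # shape (5, 64)

def jaccard(a, b):
    # similarity between two binary codes: |intersection| / |union| of set bits
    inter = np.sum((a == 1) & (b == 1))
    union = np.sum((a == 1) | (b == 1))
    return inter / union if union > 0 else 0.0

print(jaccard(bits[0], bits[1]))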

A Vespa Tour - by Matt Overstreet
Vespa bills itself as an open big data serving engine. It provides a hybrid indexing platform that supports both text and similarity searches in vector space. You can use it to build standard text based search applications, personalized recommendations, as well as machine learning oriented similarity engines. You can also build navigation pages computed on demand, and it provides realtime data displays such as tag clouds, maps and graphs. Matt took us through configuring Vespa, using it for linguistics use cases (mainly text search), and talked about the flexibility of Vespa's ranking. You can provide a middleware component to implement some notion of similarity and Vespa will support that similarity. Tensorflow (TF) models can also be embedded as part of this middleware, which means that you can dynamically compare records based on the notion of similarity that is embodied by the TF model. Vespa provides a query language called YQL (Yahoo Query Language). Both Python and Java are supported for developing middleware. I had been putting off looking at Vespa because of the complexity and breadth of the software, but it looks like it might be worth checking out in connection with some of the work I have been doing around image similarity.
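I have not used Vespa myself yet, but based on the talk and the public documentation, querying it over HTTP looks roughly like the sketch below (the endpoint, port and field names depend on your application package, so treat this as illustrative only):

import requests

# YQL query against a locally running Vespa instance; "default" is the usual
# catch-all fieldset, but the fields available depend on your schema
yql = 'select * from sources * where default contains "relevance";'
resp = requests.get("http://localhost:8080/search/", params={"yql": yql, "hits": 5})
for hit in resp.json().get("root", {}).get("children", []):
    print(hit.get("relevance"), hit.get("fields", {}).get("title"))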

The Solr Synonyms Maze: Pros, Cons and Pitfalls of Various Synonym Usage Patterns - by Bertrand Rigaldies
Bertrand talked about the difference in the way Lucene handles multi-term synonyms on the query side versus the index side, and the problems that can arise as a result. Specifically, the index side handles offsets for multi-term synonyms incorrectly, leading to flattening and weird results. Mike McCandless called this behavior sausagization and wrote this blog post in 2012 describing the problem. Bertrand used the Solr JSON API and the Python networkx library to build visualizations of the query paths to demonstrate the problem, and suggested several solutions. The best solution, if it is possible for your installation, is to reduce all synonyms to single-term semantic tokens. However, in many instances it is not feasible to do so, and the talk described what strategies to take to prevent problems with synonym generation in those cases.

Evolving a Medical Image Similarity Search - by Sujit Pal
This was my talk. I talked about our (still incomplete) Medical Image Similarity project. I covered the various strategies for feature extraction, starting with somewhat naive features such as color and texture, moving to local features such as edges and corners, and ending up with deep learning features such as vectors generated from pre-trained image classification models. I then covered the various indexing strategies I had considered, some of which allow you to represent image vectors as text based postings lists, and others which depend on platforms that compare image vectors natively. I also covered how we evaluated the various search algorithms and indexing strategies using human ratings on a 4-point scale; while the results so far are not impressive in absolute terms, they do represent some progress compared to our baselines, showing that we are at least on the right track. Finally, I talked about various ideas that we would like to try, which will hopefully give us better results. Interestingly, some of these ideas were also mentioned by other speakers, so it was good to get corroboration. I was quite impressed by the quality of the questions and suggestions I got; they were very well thought out and insightful.
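For readers curious what the deep learning feature extraction step looks like, here is a schematic version using a pre-trained network available in tf.keras (a simplified sketch, not our production pipeline):

import numpy as np
import tensorflow as tf

# load a pre-trained classifier without its classification head; the pooled
# activations serve as a fixed-length feature vector for each image
model = tf.keras.applications.inception_v3.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

def image_vector(image_path):
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)
    x = tf.keras.applications.inception_v3.preprocess_input(np.expand_dims(x, axis=0))
    return model.predict(x)[0]   # 2048-d feature vector

def similarity(v1, v2):
    # cosine similarity between two image vectors
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))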

The Relevance of Solr's Semantic Knowledge Graph - by Trey Grainger
Trey described the Semantic Knowledge Graph plugin for Solr, contributed by LucidWorks and based upon his research at CareerBuilder. It can be used to discover and rank relationships between arbitrary queries and terms in the index. Other uses are to discover related terms and concepts, disambiguate different meanings of terms given the context, clean up noise in datasets, discover unknown relationships between documents and fields, summarize documents, etc. It does this by maintaining a so-called forward index in addition to Lucene's existing inverted index. The forward index allows you to map from documents to terms. Traversals happen by alternately traversing the forward and the inverted index. Weights for each node are assigned as a ratio of the foreground vs background weights, which are in turn derived from other metrics. The code is open source and available on github at careerbuilder/semantic-knowledge-graph. Trey also covered various applications of the Knowledge Graph such as data cleaning, predictive analytics, intelligent search expansion, and document summarization and enrichment.
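To make the foreground/background idea concrete, here is a toy version of such a ratio (my own simplification, not the plugin's exact scoring function):

def fg_bg_ratio(term_fg_docs, fg_docs, term_bg_docs, bg_docs):
    # score a term by how much more frequent it is in the foreground set
    # (documents matching the current traversal) than in the background corpus
    fg_prob = term_fg_docs / fg_docs
    bg_prob = term_bg_docs / bg_docs
    return fg_prob / bg_prob if bg_prob > 0 else 0.0

# e.g. "solr" appears in 40 of 100 foreground docs, but only 500 of 100000 overall
print(fg_bg_ratio(40, 100, 500, 100000))   # >> 1 means strongly related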

Catch my Drift? Building bridges with Word Embeddings - by Peter Dixon-Moses
Peter described an interesting strategy to increase recall using word2vec. Essentially, you look up neighbors for your query word using the word2vec.find_synonyms() call and then rewrite your query to look for them as well. Another approach could be to use the neighbors to generate a thesaurus. He also pointed out an ElasticSearch plugin called elasticsearch-vector that allows you to do vector arithmetic. However, the problem with this approach is that the similarity is applied to the entire corpus, so the strategy when using vector matching should be to use it as a rescorer. Other applications could be to use embeddings representing history to rescore current results. He also suggested using analogies to do queries; for example, in a real estate search scenario, you could do cityname -professionals to find similar cities but without so many professionals (so hopefully cheaper). An interesting insight Peter provided was about training word2vec models with your own data: just as with fine tuning pre-trained image models, it should be possible to start with a pre-trained embedding model and fine tune it on your own data.
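The neighbor-lookup-and-expand trick is easy to prototype; here is a hedged sketch of my own using gensim and a pre-trained GloVe model (not the exact code or model from the talk):

import gensim.downloader as api

# load pre-trained word vectors (downloads the model the first time)
model = api.load("glove-wiki-gigaword-100")

def expand_query(terms, topn=3):
    # expand a list of query terms with their nearest neighbors in vector space
    expanded = list(terms)
    for term in terms:
        if term in model:
            expanded.extend(w for w, _ in model.most_similar(term, topn=topn))
    return expanded

print(expand_query(["laptop", "cheap"]))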

Day 2 had many interesting talks and once again I was forced to make choices. The talks I missed because there was an equally (or, at least to me, more) interesting talk on the other track are listed below. Very short summaries are provided based on a quick read of the slides, where they were available.

  • Understanding Queries with NER - by Ryan Pedela
  • The gentle art of incorporating "business rules" into results - by Scott Stults. Slides not available at the moment.
  • Realtime Entity Resolution with ElasticSearch - by Dave Moore. Dave talks about an ES plugin from Zentity that uses facets to identify and extract features for named entities in a search index.
  • Interleaving: from evaluation to self learning - by John T Kane. This is a very nice introduction to the idea of LTR. LTR models are usually trained using a batch process at the moment. The idea is to transform this into an online learning setup using Reinforcement Learning (RL) and continuous competition, by interleaving results from competing engines instead of running A/B tests.
  • Bad Text, Bad Search: Evaluating Text Extraction with Apache Tika's tika-eval Module - by Tim Allison. Tim Allison is a longtime committer on Apache Tika, a toolkit to extract text and metadata from over a thousand different file types. He describes at a high level what Tika can do for you, and then focuses on the tika-eval module, which allows you to compare extraction results.

Overall, I thought this conference was very useful. Even though search is no longer my primary focus, most applications I help build rely on search to some extent. In addition, search itself is expanding to become more than just efficient information retrieval. Many innovations in search depend on Natural Language Processing (NLP), ML, and increasingly RL techniques. In a sense, search relevance is tied to all these innovations, since each of them serves to push the relevance envelope a bit further, leading to better and more relatable results for human users. In that sense, I think Haystack has placed itself in a very interesting position. I learned of many such innovations in the two days I was at Haystack, and look forward to applying these ideas in my own work.


Saturday, April 07, 2018

AWS ML Week and adventures with SageMaker


I attended the AWS ML Week in San Francisco a couple of weeks ago. It was held over 2 days and consisted of presentations and workshops, presented and run by Amazon Web Services (AWS) architects. The event was meant to showcase the ML capabilities of AWS and was targeted at Data Scientists and Engineers, as well as innovators who want to include Machine Learning (ML) capabilities in their applications.

The AWS ML stack at the time of writing is as shown below. This image comes from one of the presentation slides. The top layer (Application Services) is a set of canned ML models exposed through an API, aimed at people who want to exploit ML capabilities in their applications without having to go through the hassle of building them themselves. The middle layer (Platform Services) is aimed at the Data Scientist / Engineer types who are training and consuming their own ML models. The bottom layer (Frameworks and Interfaces) is the infrastructure layer, based upon the Amazon Deep Learning AMIs that were released some time back.


The first day of talks covered the Application Services (top) layer, and the second day covered the Platform Services (middle) layer. The Frameworks and Interfaces (bottom) layer was not covered at all, but those of us who've trained Deep Learning (DL) models on AWS have probably used the Amazon Deep Learning AMIs and know enough about them already.

My reasons for attending the event were twofold. First, some colleagues were talking about the cool canned ML algorithms that AWS was coming out with, and I thought attending this kind of event would be a way to quickly learn about them all at a high level. Second, a colleague and I had evaluated SageMaker earlier for our own use, and I had concluded that while it was a good managed notebook environment for development, I wasn't too impressed with its stated goal of being a unified platform for distributed model training and deployment, and I was hoping that I would learn new information here that would change my mind.

In this post, I will focus on these two aspects in depth.

Application Services


This is just a list of the application services with brief descriptions. All of these can be consumed through an API, and provide very general services such as emotion detection in faces, keyword extraction from text, etc. However, it is often possible to compose these undifferentiated services to produce unique functionality. Applications that do so can just use the AWS Application Services rather than build this functionality themselves, saving some time and wheel reinvention effort in the process.

  • Rekognition - a group of Computer Vision (CV) services. There is a Rekognition for Images and a Rekognition for Video. Rekognition for Images provides functionality for Object and Scene Detection, Facial Analysis (sentiment, gender, facial features), Face Recognition, Unsafe (NSFW) detection, Celebrity recognition, Text in Images and Face Similarity comparison. Rekognition for Video has all the services of Rekognition for Images plus Person tracking.
  • Transcribe - speech to text conversion. This was in preview at the time but has since become generally available. Unlike other services, the API is asynchronous only.
  • Translate - language translation. It supported only English and Spanish at the time, but more languages are being added, so this might have changed as well. As with Transcribe, it was in preview but has since become generally available.
  • Comprehend - a set of language services that work on text to detect sentiment, entities, language and key phrases. It also has functionality to build topic models out of a corpus of text.
  • Polly - text to speech conversion. Multiple voices and accents available for customization.
  • Lex - conversational interface for text and voice based applications, it is the API underlying the Amazon Echo and Alexa family of devices.

Some examples of applications that could be composed using these components are listed below. Some of these are covered in more depth in the presentation slides (see list at bottom).

  • Using Comprehend to detect non-English tweets, Translate to translate them into English, and Comprehend again to extract key phrases from them (see the sketch after this list).
  • Using Comprehend to generate sentiment on incoming customer service requests.
  • Extracting entities and keyphrases from a text corpus to generate knowledge graphs.
  • Video captioning in different languages simultaneously using Translate.
  • Pollexy project (video) - an application to remind an autistic child to do specific things at different times of the day.
  • Finding missing persons by comparing images on social media with reference image - this was our workshop example from day 1.
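The first example above can be prototyped in a few lines of boto3; the sketch below is mine (not from the workshops), but the calls and response fields follow the public boto3 API for Comprehend and Translate:

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
translate = boto3.client("translate", region_name="us-east-1")

text = "Este producto es excelente, lo recomiendo."   # e.g. a non-English tweet

# detect the dominant language, translate to English, then extract key phrases
lang = comprehend.detect_dominant_language(Text=text)["Languages"][0]["LanguageCode"]
english = translate.translate_text(Text=text,
                                    SourceLanguageCode=lang,
                                    TargetLanguageCode="en")["TranslatedText"]
phrases = comprehend.detect_key_phrases(Text=english, LanguageCode="en")
print(english)
print([p["Text"] for p in phrases["KeyPhrases"]])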

SageMaker


SageMaker bills itself as a fully-managed platform to build, train and deploy ML models at scale. As a user, I see two main use cases - a managed notebook platform for development, and a unified platform for training and deploying ML models.

For the first use case, as long as you are using Keras, Tensorflow (TF) or MXNet with either Python2 or Python3, you could simply choose the appropriate notebook type and use it. You could also install other frameworks such as Pytorch using pip on the notebook's virtual terminal and use that instead. It's not very different from running Jupyter notebooks on your Deep Learning AMI and possibly a little less flexible, but it can be convenient for enterprise customers with their own Virtual Private Clouds (VPCs) since the SageMaker notebook is available within your Amazon console without having to do complex network finagling. Strangely enough, Amazon does not emphasize this use case at all.

The other use case is as a unified platform for large scale (possibly distributed) model training and model deployment. In this mode, SageMaker acts as a wrapper that calls into user provided functionality at different points in its life cycle. This allows you to run the SageMaker notebook on a relatively low end EC2 instance, because a separate high performance EC2 box (possibly even a GPU box if needed) is spun up just for the duration of the training. Similarly, the trained model is deployed to a different EC2 instance as well. In Java object oriented terms, SageMaker does this by exposing an Estimator interface that various ML models must implement. In this mode, SageMaker supports a wide variety of ML algorithms (Deep Learning and traditional), as listed below.

  • Built-in ML algorithms - the following algorithms are provided as part of SageMaker - Linear Learner, Factorization Machines, XGBoost, Image Classification, Sequence2Sequence, KMeans, Principal Components Analysis (PCA), Latent Dirichlet Allocation (LDA), Neural Topic Models (NTM), DeepAR Forecasting (Time Series) and BlazingText (word2vec implementation). The built-in algorithms are all exposed via a common Estimator interface that uses Docker registry paths to identify a specific algorithm (see the sketch after this list).
  • MXNet and TF Estimators - these SageMaker Estimators allow wrapping of the user's MXNet and TF models (as well as Keras models built using Keras embedded in TF, also known as tf.keras). The user has to provide implementations of certain functions, and SageMaker calls them at different points in its lifecycle. Since TF comes with its own Estimators, which are pre-built DL and ML networks, this opens up even more possibilities. So overall, this path allows wrapping of plain TF models, tf.keras models, and pre-built TF Estimators.
  • Bring Your Own Model (BYOM) Estimators - you set up a Docker container in a specific way to expose training and serving functionality via scripts, and the SageMaker Estimator uses these scripts to train and deploy the model. This is the same Estimator that exposes SageMaker's built-in ML functionality.
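For contrast with the TF/Keras route I walk through below, invoking one of the built-in algorithms through the common Estimator looks roughly like this (the registry path and bucket names are placeholders I have not actually run, so treat this as a sketch):

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam:..."   # IAM role, as in the training example below

# the Docker registry path selects the built-in algorithm; the actual path
# depends on your region and the algorithm (this one is only illustrative)
container = "811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest"

xgb = Estimator(container,
                role,
                train_instance_count=1,
                train_instance_type="ml.m4.xlarge",
                output_path="s3://my-bucket/output",
                sagemaker_session=session)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)
xgb.fit({"train": "s3://my-bucket/train"})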

Examples of each of these use cases can be found in the awslabs/amazon-sagemaker-examples repository. We did run through some of these in one of the workshops, but I decided to expose one of my own recent Keras models through SageMaker to figure out the steps involved.

The documentation about wrapping TF/Keras models on the aws/sagemaker-python-sdk repository says that the training script must contain the following function overrides.

  • Exactly one of model_fn, keras_model_fn or estimator_fn - defines the model to be trained; the options correspond to a plain TF model, a tf.keras model, or a TF Estimator respectively
  • train_input_fn - to preprocess and load training data
  • eval_input_fn - to preprocess and load evaluation data
  • serving_input_fn - required for deploying endpoint through SageMaker

My model takes as input dense vectors of size (2048,) and predicts one of two classes. Since the model was built using tf.keras I started by defining a keras_model_fn and the other function overrides as follows:

import boto3
import numpy as np
import os
import tensorflow as tf
from tensorflow.python.estimator.export.export import build_raw_serving_input_receiver_fn

INPUT_TENSOR_NAME = 'inputs_input' # needs to match the name of the first layer + "_input"
hyperparams = {
    "learning_rate": 1e-3
}


def keras_model_fn(hyperparams):
    # build model
    model = tf.keras.models.Sequential()
    
    model.add(tf.keras.layers.Dense(2048, input_shape=(2048,), name="inputs"))
    model.add(tf.keras.layers.Activation("relu"))
    model.add(tf.keras.layers.Dropout(0.5))
    
    model.add(tf.keras.layers.Dense(512))
    model.add(tf.keras.layers.Activation("relu"))
    model.add(tf.keras.layers.Dropout(0.5))

    model.add(tf.keras.layers.Dense(128))
    model.add(tf.keras.layers.Activation("relu"))
    model.add(tf.keras.layers.Dropout(0.5))

    model.add(tf.keras.layers.Dense(2))
    model.add(tf.keras.layers.Activation("softmax", name="output"))

    # compile model
    optim = tf.keras.optimizers.Adam(lr=hyperparams["learning_rate"])
    model.compile(optimizer=optim, loss="categorical_crossentropy", 
                  metrics=["accuracy"])
    return model


def train_input_fn(training_dir, hyperparams):
    return _train_eval_input_fn(training_dir, "train_file.csv")


def eval_input_fn(training_dir, hyperparams):
    return _train_eval_input_fn(training_dir, "eval_file.csv")


def _train_eval_input_fn(training_dir, training_file):
    xs, ys = [], []
    ftest = open(os.path.join(training_dir, training_file), "rb")
    for line in ftest:
        _, label, vec_str = line.strip().split("\t")
        xs.append(np.array([float(e) for e in vec_str.split(",")]))
        ys.append(int(label))
    ftest.close()
    X = np.array(xs, dtype=np.float32)
    Y = tf.keras.utils.to_categorical(
        np.array(ys), num_classes=2).astype(np.int)
    return tf.estimator.inputs.numpy_input_fn(
        x={INPUT_TENSOR_NAME: X},
        y=Y,
        num_epochs=None,
        shuffle=True)()


def serving_input_fn(hyperparams):
    tensor = tf.placeholder(tf.float32, [1, 2048])
    return build_raw_serving_input_receiver_fn({INPUT_TENSOR_NAME: tensor})()

Later, however, I found that the deployed model was not able to parse request vectors serialized by TF's make_tensor_proto mechanism, so I had to switch to treating it as a TF model instead. This just meant that I had to replace the keras_model_fn with a model_fn function. The other change was that my INPUT_TENSOR_NAME was no longer under the control of the Keras API, so I could rename it to the more readable "inputs". In addition, since I now have to explicitly provide EstimatorSpec objects for each of my operation modes (train, eval, predict), I need an additional import (PredictOutput) and an additional prediction signature key given by SIGNATURE_NAME. Also notice how the model definition has changed from the Sequential model to a functional form, where the input to the network comes from the features parameter. The other functions remain unchanged.

from tensorflow.python.estimator.export.export_output import PredictOutput

INPUT_TENSOR_NAME = "inputs"
SIGNATURE_NAME = "serving_default"

def model_fn(features, labels, mode, hyperparams):
    # build model (notice no input layer, fed from features parameter)
    hidden_1 = tf.keras.layers.Dense(2048, activation="relu")(features[INPUT_TENSOR_NAME])
    hidden_1 = tf.keras.layers.Dropout(0.5)(hidden_1)
    hidden_2 = tf.keras.layers.Dense(512, activation="relu")(hidden_1)
    hidden_2 = tf.keras.layers.Dropout(0.5)(hidden_2)
    hidden_3 = tf.keras.layers.Dense(128, activation="relu")(hidden_2)
    hidden_3 = tf.keras.layers.Dropout(0.5)(hidden_3)
    predictions = tf.keras.layers.Dense(2, activation="softmax", name="output")(hidden_3)
    
    # estimator for predictions
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={"output": predictions},
            export_outputs={SIGNATURE_NAME: PredictOutput({"output": predictions})})

    # define loss function (using TF)
    loss = tf.losses.softmax_cross_entropy(labels, predictions)
    
    # define training op (using TF)
    train_op = tf.contrib.layers.optimize_loss(
        loss=loss,
        global_step=tf.train.get_global_step(),
        learning_rate=hyperparams["learning_rate"],
        optimizer="Adam")
    
    # generate predictions as TF tensors
    predictions_dict = {"output": predictions}
    
    # generate eval_metric ops
    eval_metric_ops = {
        "accuracy": tf.metrics.accuracy(
            tf.cast(labels, tf.float32), predictions)
    }
    
    # estimator for train and eval
    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,
        train_op=train_op,
        eval_metric_ops=eval_metric_ops)

The above functions are written out into a script file and passed into the SageMaker Estimator. I tested each individual function separately to verify that they work by themselves. I did note that the SageMaker Estimator code (as well as TF code) is much more picky about data types - for the features, it expects a matrix of np.float32 and for the labels it expects a vector of np.int. Here is the code to train this model using SageMaker.

import json
import numpy as np
import os
import sagemaker
import tensorflow as tf

from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

# (1) set up
sagemaker_session = sagemaker.Session()
# role = get_execution_role()
role = "arn:aws:iam:..." # copy-paste the IAM role from your SageMaker console

# (2) upload data to S3
inputs = sagemaker_session.upload_data(path="data/", 
                                       key_prefix="data")

# (3) train model
compdetect_estimator = TensorFlow(entry_point="composite-detector-tf.py",
                                  role=role,
                                  training_steps=100,
                                  evaluation_steps=100,
                                  hyperparameters={"learning_rate": 1e-3},
                                  train_instance_count=1,
                                  train_instance_type="ml.p2.xlarge")
compdetect_estimator.fit(inputs)

# (4) deploy model
compdetect_predictor = compdetect_estimator.deploy(initial_instance_count=1, 
                                                   instance_type='ml.m4.xlarge')

# (5) load data for evaluation and run predictions
Xeval = load_eval_data()
for i in range(Xeval.shape[0]):
    data = Xeval[i]
    tensor_proto = tf.make_tensor_proto(values=np.asarray(data), 
                                        shape=[1, len(data)], 
                                        dtype=tf.float32)
    pred = compdetect_predictor.predict(tensor_proto)
    Y_ = pred["outputs"]["output"]["floatVal"]
    y = np.argmax(Y_)
    print(i, y)

# (6) delete the endpoint when done
sagemaker.Session().delete_endpoint(compdetect_predictor.endpoint)

Below I explain each of these steps in detail.

  1. The first step is to open a SageMaker session and extract the IAM role from it. There seems to be a bug in this call, so I found (and others on the Internet give the same advice) that just using the IAM role value from the SageMaker notebook console works just as well.
  2. Lately, I usually have code in my notebooks to copy any data I need (and don't already have locally) from S3, and to write my models and output datasets back to S3, using the boto3 package. Here, the upload_data call expects to see the data locally, so I used awscli to copy it down from S3 to a local data subfolder, then invoked the command. The upload_data command places the data in a well-known (within the session) S3 bucket accessible to the training instance as well.
  3. The next step is to train the model. The entry_point parameter to the TensorFlow Estimator points to the script file with the functions we set up above. The only hyperparameter we are passing in is the learning rate. The training will be done on a single ml.p2.xlarge instance (as indicated by the train_instance_count and train_instance_type parameters). We could have used distributed training by setting train_instance_count to a value larger than 1. Note that tf.keras models exposed through the keras_model_fn cannot be trained in distributed mode. The model trains for 100 iterations and is evaluated for 100 iterations.
  4. Once trained, the model is deployed to yet another ml.m4.xlarge instance, called the endpoint, with the estimator.deploy call. The endpoint can auto-scale, meaning that SageMaker can automatically spin up additional instances of the endpoint in case the usage goes too high.
  5. We can now run predictions against the model by hitting the endpoint. Our endpoint is set up to consume input one record at a time, but we could also set it up to consume fixed size batches if desired. The data has to be serialized using tf.make_tensor_proto and then passed to the predictor.predict call. This was the part that was failing for my Estimator using the keras_model_fn function; I suspect it has to do with a mismatch between the way TF serializes the data and the way Keras expects it. A sample output from the endpoint is shown below.

  6. Finally, if we no longer need to use the endpoint, we can just destroy it using the delete_endpoint call. We can also delete it from the console.

So that's what it took to wrap my model into a SageMaker estimator and train and deploy it. It's not a lot of code, but documentation and Stack Overflow style support is still scarce, so the going is not very smooth. However, having trained and deployed one network through SageMaker, I feel more confident about being able to do the same with others. So even though it is somewhat of a pain to work with, it does provide a lot of benefits, and I have reconsidered my original skepticism towards it.

Other stuff: IoT and DeepLens


The last two talks focused on ML on IoT devices using MXNet. Models can either live on-board the IoT device or be accessed from the cloud over a SageMaker endpoint or using AWS Greengrass. In line with the focus on IoT, the AWS DeepLens is a device that can host ML algorithms, either canned ones from the AWS ML Application Services layer, or ones you build yourself. Similar to the Echo/Alexa family of devices, I think DeepLens is meant to catalyze development of novel ML and CV applications for the consumer market. It is expected to ship in June 2018 and is available for pre-order on Amazon.

Links to Slides


Links to presentation slides were provided after the event and are all publicly available on Slideshare. It would be awesome (wink wink nudge nudge, AWS guys) if these links were also added to the original event page for AWS ML Week.


So that's all I had for the AWS ML Week. I think I ended up getting what I went there for. First, I now have a good idea of the different services available in the Application Services layer. Second, I have a much better understanding and appreciation of SageMaker. I hope you found my writeup useful as well.