Saturday, January 27, 2018

Cleaning up Noisy Labels using Snorkel's Generative Model


According to its creators at the HazyResearch group at Stanford, Snorkel is a system for rapidly creating, modeling and managing training data. I first heard of it when attending Prof Christopher Ré's talk on his DeepDive project at the Data Science Summit in San Francisco almost 2 years ago. The DeepDive project has since morphed into a lighter-weight (and arguably more user-friendly) project called Snorkel. Snorkel was under active development at the time, but since then the Snorkel team has simplified its API and added many features, including, recently, the ability to work with Apache Spark. They also provide many more case studies with code than when I last looked at it, making it easier to get started with.

Snorkel is a system for data programming, described in the NIPS 2016 paper Data Programming: Creating Large Training Sets, Quickly by Ratner, De Sa, Wu, Selsam & Ré. It is built around labeling functions, which allow the user to quickly label large amounts of data in an unsupervised manner by exploiting regularities of language, a technique called weak supervision. For example, finding a word suffixed by "land" or "shire" in a sentence could be evidence for a labeling function to mark up the sentence as a GEOGRAPHICAL class. You could also use distant supervision, where you use an ontology or dictionary to generate such labels based on matches between phrases in the sentence and entries in the dictionary. Other techniques could be (non-expert) crowdsourcing or the use of unsupervised statistical methods. Multiple labels for the same data point are generated in this way. These labels are plentiful, since they are generated programmatically, but generally noisy, i.e., they may conflict with each other.
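
To make this concrete, a labeling function is just a small piece of code that votes for a label or abstains. The sketch below is mine, not taken from the Snorkel codebase -- the label constants are hypothetical, and it follows the convention that a value of 0 means the function abstains.

import re

ABSTAIN = 0        # hypothetical constant: no vote
GEOGRAPHICAL = 1   # hypothetical constant: our class of interest

def lf_geo_suffix(sentence):
    # vote GEOGRAPHICAL if any word ends in "land" or "shire", else abstain
    if re.search(r"\b\w+(land|shire)\b", sentence.lower()):
        return GEOGRAPHICAL
    return ABSTAIN

print(lf_geo_suffix("She grew up in Yorkshire before moving to Finland."))  # 1
print(lf_geo_suffix("The meeting was moved to Tuesday."))                   # 0

Applying a set of such labeling functions to a corpus yields a (num_candidates x num_labeling_functions) matrix of noisy labels, which is exactly the kind of matrix the generative model described below consumes.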

Snorkel provides a generative model that takes these noisy labels and returns a clean label per data point, i.e., the probability that the data point belongs to each of our classes given the distribution of the labels. This clean label can then be used as input to train a discriminative model such as a classifier.

In this post, I describe a little proof-of-concept that I built in order to understand the Snorkel API better. My example uses data from their Crowdsourced Sentiment Analysis using Snorkel example, and is an analog for a use case I have in mind, where these crowdsourced noisy labels are replaced by outputs from unsupervised (or weakly supervised) sources.

The data for the crowdsourced sentiment analysis example has two files - the first is a raw data file of 1,000 weather tweets, each annotated for "emotion" by 20 crowdsourced workers, for a total of 20,000 records. The second file contains the same 1,000 tweets manually annotated (presumably by a domain expert on weather emotions), making this the gold set. The original example uses the high-confidence subset of the gold data because its authors want to be able to evaluate the generative model -- since I wanted to evaluate the generative model too, I did the same. The example uses Snorkel on Spark, but since the data is not that large, I decided to use Pandas on my local machine.

from scipy.sparse import csr_matrix
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from snorkel import SnorkelSession
from snorkel.learning.gen_learning import GenerativeModel
import operator
import os
import numpy as np
import pandas as pd

DATA_DIR = "data"
RAW_FILE = os.path.join(DATA_DIR, "weather-non-agg-DFE.csv")
GOLD_FILE = os.path.join(DATA_DIR, "weather-evaluated-agg-DFE.csv")

# raw file to dataframe
raw_df = pd.read_csv(RAW_FILE)

# gold file to dataframe
gold_df = pd.read_csv(GOLD_FILE)
# select high confidence gold data
gold_df = gold_df[gold_df["correct_category"]=="Yes"]
gold_df = gold_df[gold_df["correct_category_conf"]== 1]

# get the crowdsourced "sentiment" from raw_df
candidate_df = raw_df.join(gold_df.set_index("tweet_id"), 
                           how="inner", on="tweet_id", rsuffix="_r")
candidate_df = candidate_df.loc[:, ["tweet_id", "worker_id", 
                                    "emotion", "sentiment"]]
candidate_df.head()


The filtering to retain only high confidence gold data results in 632 records. Joining with the raw data results in 12,640 (20 * 632) records. We now transform this to the standard X and y matrices as follows:

# build lookup tables
values = candidate_df.emotion.unique().tolist()

idx2label = {(i+1):x for i, x in enumerate(values)}
label2idx = {x:(i+1) for i, x in enumerate(values)}

# reduce by tweet_id: collect the 20 noisy worker labels and the single
# gold sentiment label for each tweet
labels, predictions = {}, {}
for row in candidate_df.itertuples():
    noisy_label = label2idx[row.emotion]
    pred_label = label2idx[row.sentiment]
    if row.tweet_id in labels:
        labels[row.tweet_id].append((row.worker_id, noisy_label))
    else:
        labels[row.tweet_id] = [(row.worker_id, noisy_label)]
    # the gold sentiment is identical for every row of a tweet
    predictions[row.tweet_id] = pred_label

# sort noisy labels for each tweet_id by worker_id
for tweet_id in labels.keys():
    sorted_labels = sorted(labels[tweet_id], key=operator.itemgetter(0))
    sorted_labels = [l for w, l in sorted_labels]
    labels[tweet_id] = sorted_labels

num_features = len(labels[tweet_id])
num_tweets = len(labels)

X = np.zeros((num_tweets, num_features), dtype=np.int64)
y = np.zeros((num_tweets, 1), dtype=np.int64)
i = 0
for tweet_id in sorted(list(labels.keys())):
    X[i] = np.array(labels[tweet_id], dtype=np.int64)
    y[i] = predictions[tweet_id]
    i += 1

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.7, 
                                                test_size=0.3)

The lookup tables convert the list of unique emotion/sentiment labels into equivalent numbers. There are 5 unique emotions -- "Neutral / author is just sharing information", "Tweet not related to weather condition", "Positive", "Negative", "I can't tell" -- and they map to the numbers [1, 2, 3, 4, 5]. Note the 1-based mapping -- a 0-based mapping results in terrible accuracy numbers, most likely because Snorkel treats a label value of 0 as an abstention. You need to do a similar transformation when converting the predictions from the generative model back to these class IDs, but we will get to that later.

The code gives us an X matrix of shape (632, 20) and a y vector of labels (derived from the gold_df.sentiment column) of shape (632, 1). We will use the label vector only for evaluation. Splitting the X and y tensors into a 70/30 training/test set gives us Xtrain, ytrain, Xtest and ytest of shapes (442, 20), (442, 1), (190, 20), (190, 1) respectively.

As a baseline for our evaluation, we compute the metrics we would get if we just took the majority vote from the crowdsourced workers.

def majority_preds(inputs):
    preds = []
    for i in range(inputs.shape[0]):
        votes = inputs[i]
        counts = np.bincount(votes)
        preds.append(np.argmax(counts))
    return np.array(preds, dtype=np.int64)

def report_results(title, preds, labels):
    print("\n\n**** {:s} ****".format(title.upper()))
    acc = accuracy_score(preds, labels)
    cm = confusion_matrix(preds, labels)
    cr = classification_report(preds, labels)
    print("accuracy: {:.3f}".format(acc))
    print("\nconfusion matrix")
    print(cm)
    print("\nclassification report")
    print(cr)
    
preds_train = majority_preds(Xtrain)
report_results("train", preds_train, ytrain)

preds_test = majority_preds(Xtest)
report_results("test", preds_test, ytest)

As you can see, the accuracies for majority voting are quite high, 0.98 for the training set and 1.0 for the test set.

Next, we define Snorkel's generative model and train it for 30 epochs. I specify the cardinality explicitly here, but the model can also figure it out by itself. The marginals are the per-class probabilities for each of the 5 classes; their shapes are (442, 5) and (190, 5) respectively.

session = SnorkelSession()

gen_model = GenerativeModel(lf_propensity=True)
gen_model.train(
    Xtrain,
    reg_type=2,
    reg_param=0.1,
    epochs=30,
    cardinality=5
)

train_marginals = gen_model.marginals(csr_matrix(Xtrain))
test_marginals = gen_model.marginals(csr_matrix(Xtest))

Next we evaluate the generative model and see how it stacks up against our simple-minded majority vote model. The best label is just the class with the maximum probability, but since we are mapping back to 1-based class IDs, we need to add one to the argmax value to get the correct label.

train_predictions = np.argmax(train_marginals, axis=1)+1
test_predictions = np.argmax(test_marginals, axis=1)+1

report_results("train (gen model)", train_predictions, ytrain)
report_results("test (gen model)", test_predictions, ytest)

Results for Snorkel's generative model don't seem to be as good as our majority vote model, but they are quite good nevertheless. The next step would be to try building a more realistic generative model using the full raw dataset and evaluate it against the full gold set, and then build a discriminative model to do the classification on that. I will talk more about this in my next post.

Obviously there is a lot more to Snorkel than I just covered. It is a complete package that provides most of the tools you are likely to need to do data programming. To learn more about Snorkel, check out the Snorkel website on github. If you have lots of raw data but are struggling to build large labeled datasets to feed your machine learning algorithms, I think you will find it well worth the effort.


Saturday, January 13, 2018

Crowdsourcing a Labeling task using Amazon Mechanical Turk


Happy New Year! My New Year's resolution for 2018 is, perhaps unsurprisingly, to blog more frequently than I have in 2017.

Despite the recent advances in unsupervised and reinforcement learning, supervised learning remains the most time-tested and reliable method to build Machine Learning (ML) models today, as long as you have enough training data. Among ML models, Deep Learning (DL) has proven to be more effective in many cases. DL's greatest advantage is its capability to consider all sorts of non-linear feature interactions automatically. In return, all it asks for is more processing power and more training data.

With the ubiquity of the computer and the Internet in our everyday lives, it is not surprising that our very act of collective living generates vast amounts of data. In many cases it is possible, with a little bit of ingenuity, to discover implicit labels in this data, making the data usable for training supervised DL models. In most other cases, we are not so lucky and must take explicit steps to generate these labels. Traditionally, people have engaged human experts to do the labeling from scratch, but this is usually very expensive and time-consuming. More recently, the trend is to generate noisy labels using unsupervised techniques and validate them using human feedback.

Which brings me to the subject of my current post. One way to get this human feedback is through Amazon's Mechanical Turk (AMT or MTurk), where you can post a Human Intelligence Task (HIT) and have people do these HITs in return for micropayments made through the MTurk network. In this post, I describe the process of creating a collection of HITs and making them available for MTurk workers (aka turkers), then collecting the resulting labels.

Problem Description


I was trying to generate tags for snippets of text. These tags are intended to be keywords that are self-contained and describe some aspect of the text. And yes, I realize that this looks like something plain old search could do as well, but bear with me here -- this data is a first step of a larger pipeline and I do need these multi-word labels.

So each record consists of a snippet of text and 5 multi-word candidate labels. The labels are generated using various unsupervised techniques, some rule based and some that exploit statistical features of language. Because the scores are not comparable across the various techniques, we select the top 10 percent (by score) from each set, then randomly choose 5 labels for each snippet from the merged label pool.
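
As a rough sketch of that selection step -- the dataframe layout, column names and function names here are my own assumptions, not the actual pipeline code -- assuming each technique produces a table of (snippet_id, label, score) rows:

import random
import pandas as pd

def top_decile(df):
    # keep only candidates whose score is in the top 10 percent for this technique
    cutoff = df["score"].quantile(0.9)
    return df[df["score"] >= cutoff]

def candidate_labels(dfs_by_technique, num_labels=5):
    # merge the top-decile candidates from each technique, then sample
    # up to num_labels distinct labels per snippet from the merged pool
    merged = pd.concat([top_decile(df) for df in dfs_by_technique])
    selected = {}
    for snippet_id, group in merged.groupby("snippet_id"):
        pool = group["label"].unique().tolist()
        selected[snippet_id] = random.sample(pool, min(num_labels, len(pool)))
    return selected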

The first step is to pre-pay for the HITs and push them to the MTurk site, where they become visible to turkers, some of whom will take them on and complete them. After the specified number of turkers have completed the HITs to assign their crowdsourced labels and we accept their work, they get paid through MTurk, and we can download their work. MTurk provides an API that allows you to upload the HITs and retrieve the crowdsourced labels, which I will talk about here. My coverage is more from a programming standpoint, so I have done this against the MTurk sandbox site, which is free to use.

In terms of required software, I recently upgraded to Anaconda Python3. The other libraries used are boto3 to handle the network connections, the jinja2 templating engine included with Anaconda for generating the XML for the HIT in the MTurk request, and xmltodict to parse XML payloads in the MTurk response to Python data structures. Both boto3 and xmltodict can be installed using pip install. I also had a lot of help from this post Tutorial: A beginner's guide to crowdsourcing ML training data with Python and MTurk on the MTurk blog.

Creating HITs and uploading to MTurk


The unsupervised algorithms are run, and the top results from each are merged, on our Apache Spark based analytics platform. A sample of these merged results is downloaded and used as input for creating the HITs. The input data looks like this:


The first step is to establish a connection to the MTurk (sandbox) server. For this, you need to have an AWS account, an MTurk development/requester account, and also link your AWS account to the MTurk account. This AWS Documentation page covers these steps in more detail. Once you are done, you should be able to establish a connection to the sandbox and see how much pretend money you have in the sandbox to pay your pretend workers.

from jinja2 import Template
import boto3
import os

# constants
MTURK_SANDBOX = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"
MTURK_REGION = "us-east-1"
MTURK_PREVIEW_URL = "https://workersandbox.mturk.com/mturk/preview?groupId={:s}"

DATA_DIR = "../data"
HIT_ID_FILE = os.path.join(DATA_DIR, "best-keywords-hitids.txt")

NUM_QUESTIONS_PER_HIT = 10

# extract AWS credentials from local file
creds = []
CREDENTIALS_FILE = "/path/to/amazon-credentials.txt"
with open(CREDENTIALS_FILE, "r") as f:
    for line in f:
        if line.startswith("#"):
            continue
        _, cred = line.strip().split("=")
        creds.append(cred)

# verify that we can access MTurk sandbox server
mturk = boto3.client('mturk',
   aws_access_key_id=creds[0],
   aws_secret_access_key=creds[1],
   region_name=MTURK_REGION,
   endpoint_url=MTURK_SANDBOX
)
print("Sandbox account pretend balance: ${:s}".format(
    mturk.get_account_balance()["AvailableBalance"]))

We have (in our example) just 24 snippets with associated keywords. I want to group them into 10 snippets per HIT, so I have 3 HITs with 10, 10 and 4 snippets respectively. In reality you want a larger number for labeling, but since I was in development mode, I was the person doing the HIT each time, and I wanted to minimize my effort. At the same time, I wanted to make sure I could group my input into HITs of 10 snippets each, hence the choice of 24 snippets.

Each HIT needs to be formatted as an HTML form, which is then embedded inside an HTMLQuestion tag that is part of the XML syntax MTurk understands. Since we wanted to put multiple snippets into a single HIT, it was more convenient to use the looping capabilities of the Jinja2 templating engine than to rely on Python's native templating through format() calls. Here is the template for our HIT.

full_xml = Template("""
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
    <HTMLContent><![CDATA[
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
        <script type='text/javascript' src='https://s3.amazonaws.com/mturk-public/externalHIT_v1.js'></script>
    </head>
    <body>
        <form name="mturk_form" method="post" id="mturk_form" 
              action="https://www.mturk.com/mturk/externalSubmit">
        <input type="hidden" value="" name="assignmentId" id="assignmentId" />
        <ol>
        {% for row in rows %}
            <input type="hidden" name="iid_{{ row.id }}" value="{{ row.iid }}"/>
            <li>
                <b>Select all keywords appropriate for the snippet below:</b><br/>
                {{ row.snippet }}
                <p>
                <input type="checkbox" name="k_{{ row.id }}_1">{{ row.keyword_1 }}<br/>
                <input type="checkbox" name="k_{{ row.id }}_2">{{ row.keyword_2 }}<br/>
                <input type="checkbox" name="k_{{ row.id }}_3">{{ row.keyword_3 }}<br/>
                <input type="checkbox" name="k_{{ row.id }}_4">{{ row.keyword_4 }}<br/>
                <input type="checkbox" name="k_{{ row.id }}_5">{{ row.keyword_5 }}<br/>
                </p>
            </li>
            <hr/>
        {% endfor %}
        </ol>

            <p><input type="submit" id="submitButton" value="Submit"/>
            </p>
        </form>
        <script language='Javascript'>turkSetAssignmentID();</script>
    </body>
</html>
]]>
    </HTMLContent>
    <FrameHeight>600</FrameHeight>
</HTMLQuestion>
""")

We then group our data into batches of 10 rows, build a data structure rows whose entries are dictionaries of field names and values, and render the template above for each batch. The resulting XML is fed to the MTurk sandbox server using boto3. Each call corresponds to a single HIT, and the server returns a corresponding HIT Id, which we save for later use. It also returns a HIT group ID, which we can use to generate a set of preview URLs.

We have modeled each group of 10 snippets as a completely separate HIT, with its own unique title (trailing #n). We could also have run multiple create_hit calls using the same title, in which case a group of HITs is created under that title. However, I noticed that I was sometimes getting back duplicate HIT Ids in that case, so I went with the separate-HIT-per-10-snippets strategy.

I also found a good use for the Keywords parameter -- if you put some oddball term in there, you can share it with your team so they can search for exactly the HITs you want them to look at.

def create_hit(mturk, question, hit_seq):
    hit = mturk.create_hit(
        Title="Best Keywords in Caption #{:d}".format(hit_seq),
        Description="Find best keywords in caption text",
        Keywords="aardvaark",
        Reward="0.10",
        MaxAssignments=1,
        LifetimeInSeconds=172800,
        AssignmentDurationInSeconds=600,
        AutoApprovalDelayInSeconds=14400,
        Question=question
    )
    group_id = hit["HIT"]["HITGroupId"]
    hit_id = hit["HIT"]["HITId"]
    return group_id, hit_id


rows = []
hit_group_ids, hit_ids = [], []
hit_seq = 1
with open(os.path.join(DATA_DIR, "best-keywords.tsv"), "r") as f:
    for lid, line in enumerate(f):
        if lid > 0 and lid % NUM_QUESTIONS_PER_HIT == 0:
            question = full_xml.render(rows=rows)
            hit_group_id, hit_id = create_hit(mturk, question, hit_seq)
            hit_group_ids.append(hit_group_id)
            hit_ids.append(hit_id)
            rows = []
            hit_seq += 1
        iid, snippet, key_1, key_2, key_3, key_4, key_5 = line.strip().split("\t")
        row = {
            "id": (lid + 1),
            "iid": iid, 
            "snippet": snippet,
            "keyword_1": key_1,
            "keyword_2": key_2,
            "keyword_3": key_3,
            "keyword_4": key_4,
            "keyword_5": key_5,
        }
        rows.append(row)
        
# create a final HIT for any leftover rows
if len(rows) > 0:
    question = full_xml.render(rows=rows)
    hit_group_id, hit_id = create_hit(mturk, question, hit_seq)
    hit_group_ids.append(hit_group_id)
    hit_ids.append(hit_id)

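One thing the listing above does not show is the write of the collected HIT Ids to HIT_ID_FILE; a minimal version of that step (one HIT Id per line, matching how the retrieval code later reads the file back) could look like this:

# persist the HIT Ids so the retrieval script can look up results later
with open(HIT_ID_FILE, "w") as fhit:
    for hit_id in hit_ids:
        fhit.write("{:s}\n".format(hit_id))
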
The code above results in a flat file of HIT Ids that I can use to recall results for these HITs later. You can also see your HITs appear as shown below:


As you might expect, this is a giant form consisting of text snippets, each followed by a checkbox group of 5 candidate keywords, terminated by a single Submit button. I am not sure if you can add Javascript support for more sophisticated use cases, but you can do a lot with HTML5 nowadays. Here is what (part of) the form looks like, marked up by the dev turker (me :-)).



Retrieving crowdsourced labels on HITs from MTurk


In a real-life scenario, the HITs would be on the MTurk production server and real humans would (hopefully) find my micro-payment of 10 cents per HIT adequate and do the marking up for me. I have configured my HIT to have MaxAssignments=1, which means I want only 1 worker to work on the HIT -- in reality, you want at least 3 people to work on each HIT so you can do a majority vote (or something more sophisticated) on their labels. In any case, once all your HITs have been handled by the required number of turkers, it is time to download the results.
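
With 3 or more assignments per HIT, a simple way to combine the answers is to keep a keyword only if a majority of the workers checked it. This is not part of the pipeline in this post (MaxAssignments is 1 here); the sketch below assumes you have grouped the comma-separated keyword ID strings written out by the retrieval code, one string per worker, for a given snippet.

from collections import Counter

def majority_keywords(worker_selections, min_votes=2):
    # worker_selections: e.g. ["1,3", "3,4", "3"], one entry per worker
    # returns the keyword IDs that at least min_votes workers selected
    votes = Counter()
    for selection in worker_selections:
        for kid in selection.split(","):
            if kid:
                votes[kid] += 1
    return sorted(kid for kid, count in votes.items() if count >= min_votes)

print(majority_keywords(["1,3", "3,4", "3"]))  # ['3']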

Results for a HIT can be retrieved using the list_assignments_for_hit() method of the MTurk client -- you need the HIT Id that was returned during HIT creation and that we stored away earlier. The response from the MTurk server is JSON, with the actual Answer value packaged as an XML payload. We use the xmltodict.parse() method to parse this payload into a Python data structure, which we then pick apart to write out the output.

import boto3
import os
import xmltodict

# constants
MTURK_SANDBOX = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"
MTURK_REGION = "us-east-1"

DATA_DIR = "../data"
HIT_ID_FILE = os.path.join(DATA_DIR, "best-keywords-hitids.txt")
RESULTS_FILE = os.path.join(DATA_DIR, "best-keywords-results.txt")

# extract AWS credentials from local file
creds = []
CREDENTIALS_FILE = "/path/to/amazon-credentials.txt"
with open(CREDENTIALS_FILE, "r") as f:
    for line in f:
        if line.startswith("#"):
            continue
        _, cred = line.strip().split("=")
        creds.append(cred)

# verify access to MTurk
mturk = boto3.client('mturk',
   aws_access_key_id=creds[0],
   aws_secret_access_key=creds[1],
   region_name=MTURK_REGION,
   endpoint_url=MTURK_SANDBOX
)
print("Sandbox account pretend balance: ${:s}".format(
    mturk.get_account_balance()["AvailableBalance"]))

# get HIT Ids stored from during HIT creation
hit_ids = []
with open(HIT_ID_FILE, "r") as f:
    for line in f:
        hit_ids.append(line.strip())

# retrieve MTurk results
fres = open(RESULTS_FILE, "w")
for hit_id in hit_ids:
    snippet_ids, keyword_ids = {}, {}
    results = mturk.list_assignments_for_hit(HITId=hit_id, 
        AssignmentStatuses=['Submitted'])
    if results["NumResults"] > 0:
        for assignment in results["Assignments"]:
            worker_id = assignment["WorkerId"]
            answer_dict = xmltodict.parse(assignment["Answer"])
            answer_dict_2 = answer_dict["QuestionFormAnswers"]["Answer"]
            for answer_pair in answer_dict_2:
                field_name = answer_pair["QuestionIdentifier"]
                field_value = answer_pair["FreeText"]
                if field_name.startswith("iid_"):
                    id = field_name.split("_")[1]
                    snippet_ids[id] = field_value
                    keyword_ids[id] = []
                else:
                    _, iid, kid = field_name.split("_")
                    keyword_ids[iid].append(kid)
    for id, iid in snippet_ids.items():
        selected_kids = ",".join(keyword_ids[id])
        fres.write("{:s}\t{:s}\t{:s}\n".format(worker_id, iid, selected_kids))

fres.close()

The output of this step is a TSV file that contains the worker ID, the snippet ID, and a comma-separated list of the keyword IDs that the turker(s) found meaningful. This can now be joined with the original input file to find the preferred labels, as sketched after the sample output below.

A2AQYARTZTL5EE S0735109710021418-gr5 2,3
A2AQYARTZTL5EE S0894731707005962-gr1 2,4
A2AQYARTZTL5EE S1740677311000118-gr2 1,5
A2AQYARTZTL5EE S0031938414005393-gr2 1,3,4,5
A2AQYARTZTL5EE S1542356515000415-gr2 3
A2AQYARTZTL5EE S1521661616300158-gr8 2
A2AQYARTZTL5EE S0091743514001212-gr2 1,2,3
A2AQYARTZTL5EE S0735109712023662-gr2 1,2,3
A2AQYARTZTL5EE S0026049509000456-gr1 
A2AQYARTZTL5EE S0079610715000103-gr3 1,3
...

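A minimal sketch of that join in Pandas -- the column names for best-keywords.tsv are my own assumptions, matching the order in which the HIT creation code read the file, and the constants come from the scripts above:

import os
import pandas as pd

DATA_DIR = "../data"
RESULTS_FILE = os.path.join(DATA_DIR, "best-keywords-results.txt")

# crowdsourced selections: worker ID, snippet ID, comma-separated keyword IDs
results_df = pd.read_csv(RESULTS_FILE, sep="\t", header=None, dtype=str,
                         names=["worker_id", "iid", "selected_kids"])

# original input: snippet ID, snippet text, and the five candidate keywords
input_df = pd.read_csv(os.path.join(DATA_DIR, "best-keywords.tsv"), sep="\t",
                       header=None, dtype=str,
                       names=["iid", "snippet", "keyword_1", "keyword_2",
                              "keyword_3", "keyword_4", "keyword_5"])

joined_df = results_df.merge(input_df, on="iid")

def approved_keywords(row):
    # expand a selection like "2,4" into the actual keyword strings
    kids = str(row["selected_kids"]).split(",")
    return [row["keyword_{:s}".format(k)] for k in kids if k and k != "nan"]

joined_df["approved_keywords"] = joined_df.apply(approved_keywords, axis=1)
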
This is all I have for today. I hope you enjoyed the post and found it useful. I believe crowdsourcing will become more important as people begin to realize the benefits of weak supervision, and the MTurk API makes it quite easy to set up this kind of job.