Saturday, August 11, 2018

Keyword deduplication using the Python dedupe library


I have been experimenting with keyword extraction techniques against the NIPS Papers dataset, contributed by Ben Hamner, which consists of titles, abstracts and full text of all papers from the Neural Information Processing Systems (NIPS) conference from 1987-2017. The collection has 7239 papers written by 9785 authors. I preferred this dataset to others such as Reuters or Medline because it is smaller, because I can be both programmer and domain expert, and because I might learn interesting things while combing through the text of the papers looking for patterns to exploit.

In order to generate keywords, I used Ted Dunning's Log Likelihood Ratio (LLR) method (blog post, paper (PDF)) and the RAKE algorithm (described in Text Mining: Applications and Theory, edited by Michael W. Berry and Jacob Kogan, and adapted from code at aneesha/RAKE). I then selected the top scoring keywords from both methods, merged them to remove duplicate suggestions, and manually removed what I considered spurious keywords (things like "et al", "better proposals", "usually set conservatively", etc). I then trained a MAUI model (following this blog post by Alyona Medelyan, creator of MAUI) using these keywords and the text, and turned around and used the model to predict keywords against the same text, hoping to recover keywords that were missed in the training set. Finally, I merged the predicted keywords with the training keywords.

The next step was to remove obviously duplicate keywords, such as singular and plural versions of the same thing ("absolute value", "absolute values"), words that were spelled differently but meant the same ("datasets", "data sets"), words that were close enough in meaning to appear as duplicates ("local linear", "locally linear"), etc. At this point I had only about 2281 keywords, so I could have just done brute-force matching (i.e., run 2281x2281 comparisons across all keyword pairs). However, I had recently heard about the Python Dedupe library from Anne Moshyedi and Taylor Kramer, University of Virginia (UVA) Computer Science students interning at SWIFT Innovation Labs, with whom I was collaborating around my open source project Solr Dictionary Annotator (SoDA), and I wanted to try it out. They were using dedupe as a step in their pipeline to resolve entities in structured data, which, going by the examples at the dedupeio/dedupe-examples repository, looks like the suggested use case for the library. This post is about using dedupe against this set of keywords.
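Just to put the brute-force option in context, here is a minimal sketch of what such a pairwise comparison might look like, using NLTK's edit distance. The toy keyword list and the cutoff of 2 are illustrative assumptions, not values from my pipeline. For 2281 keywords this would be about 2.6 million comparisons, which is feasible at this scale but wasteful.

import itertools
import nltk

# toy stand-in for the curated keyword list
keywords = ["absolute value", "absolute values", "data sets", "datasets"]

# compare every pair: n*(n-1)/2 comparisons, ~2.6M for 2281 keywords
duplicates = []
for kw_left, kw_right in itertools.combinations(keywords, 2):
    if nltk.edit_distance(kw_left, kw_right) <= 2:
        duplicates.append((kw_left, kw_right))

print(duplicates)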

Since keywords are just text data, the first step was to convert them into some kind of structured data. For that, I chose to generate 3-character shingles from each keyword, then run them through a feature hasher to hash these shingles down to a fixed length vector. Here is the code to do that.

import nltk
import numpy as np
import os

from sklearn.feature_extraction import FeatureHasher
from sklearn.metrics import jaccard_similarity_score

hasher = FeatureHasher(input_type="string", n_features=25, dtype=np.int32)
keywords = ["absolute value", "absolute values"]
hashes = []
for keyword in keywords:
    # character trigram shingles for the keyword, e.g. "abs", "bso", "sol", ...
    shingles = ["".join(trigram) for trigram in nltk.trigrams(list(keyword))]
    keyword_hash = hasher.transform([shingles]).toarray()
    hashes.append(keyword_hash)
    print(keyword, keyword_hash)

print("jaccard similarity:", jaccard_similarity_score(hashes[0][0], hashes[1][0]))

The output shows that we don't lose much information in the feature hashing: the Jaccard similarity between the two fixed length (size 25) hashes for the strings "absolute value" and "absolute values" is 0.96.

absolute value : [0, 0, 1, 0, 0, 0, 0, -1, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, -2, -1, 0, 1, 0, 0]
absolute values: [0, 0, 1, 0, 0, 0, 0, -2, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, -2, -1, 0, 1, 0, 0]
jaccard similarity: 0.96
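As an aside, for integer-valued (multiclass) inputs like these, scikit-learn's jaccard_similarity_score behaves like accuracy_score, i.e., it reduces to the fraction of positions where the two vectors agree. So the 0.96 above is just 24 of the 25 positions matching, which we can verify directly (a quick check, with the vectors copied from the output above):

import numpy as np

# the two hashed vectors, copied from the output above
h1 = np.array([0, 0, 1, 0, 0, 0, 0, -1, -1, 0, 0, -1, 0,
               0, 0, 0, 0, 0, 0, -2, -1, 0, 1, 0, 0])
h2 = np.array([0, 0, 1, 0, 0, 0, 0, -2, -1, 0, 0, -1, 0,
               0, 0, 0, 0, 0, 0, -2, -1, 0, 1, 0, 0])

# fraction of matching positions: 24/25 = 0.96
print(np.mean(h1 == h2))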

We then apply this feature hashing procedure to all our keywords and write the hashes out to a CSV file along with the original keyword. Originally I included the keyword only because there is an active learning component in the dedupe pipeline where the user is asked to say whether a pair of records are duplicates, and having the original keyword show up is the easiest way for the human trainer to decide. Later I realized that the keyword string is a useful feature in its own right, because dedupe can match against strings inside structured input as well.

import nltk
import numpy as np
import os

from sklearn.feature_extraction import FeatureHasher

DATA_DIR = "../data"

CURATED_KEYWORDS = os.path.join(DATA_DIR, "raw_keywords.txt")
CURATED_KEYWORD_HASHES = os.path.join(DATA_DIR, "curated_keywords_hash.csv")

fcurated = open(CURATED_KEYWORDS, "r")
fhashed = open(CURATED_KEYWORD_HASHES, "w")

# header
cols = ["id"]
cols.extend(["col_{:d}".format(i+1) for i in range(25)])
cols.append("keyword")
fhashed.write("{:s}\n".format(",".join(cols)))

# shingle each word into 3-char trigrams, then hash to 25 features
hasher = FeatureHasher(input_type="string", n_features=25, dtype=np.int32)
for rowid, keyword in enumerate(fcurated):
    keyword = keyword.strip()
    shingles = ["".join(trigram) for trigram 
        in nltk.trigrams([c for c in keyword])]
    keyword_hash = hasher.transform([shingles]).todense().tolist()[0]
    cols = [str(rowid)]
    cols.append(",".join([str(h) for h in keyword_hash]))
    cols.append(keyword)
    fhashed.write("{:s}\n".format(",".join(cols)))

fhashed.close()
fcurated.close()
print("num keywords: {:d}".format(rowid))

Next we take this set of 2281 curated keywords and run it through the dedupe pipeline. The code below is adapted from, and closely mimics, the CSV example on the dedupe-examples site. Dedupe is a machine learning pipeline that uses a combination of blocking (grouping by fields), hierarchical clustering and logistic regression, along with active learning, to come up with its results.
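To get an intuition for what blocking buys us over the brute-force approach, here is a toy sketch of the idea. This is just the concept, not how dedupe implements blocking internally, and the blocking key (the first 3 characters of the keyword) is an illustrative choice; dedupe learns its own blocking rules from the labeled data.

import itertools
from collections import defaultdict

# toy records: (id, keyword)
records = [(0, "absolute value"), (1, "absolute values"), (2, "data sets"),
           (3, "datasets"), (4, "dual variable"), (5, "dual variables")]

# group records by a cheap blocking key, here the first 3 characters
blocks = defaultdict(list)
for rec_id, keyword in records:
    blocks[keyword[:3]].append(rec_id)

# expensive pairwise comparisons happen only within each block
candidate_pairs = []
for block in blocks.values():
    candidate_pairs.extend(itertools.combinations(block, 2))

print(candidate_pairs)  # 3 candidate pairs instead of all 15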

The first step in training is to tell dedupe which fields to pay attention to during the deduping process. In our case, that is every element of the feature hasher output (represented by col_1 through col_25) plus the keyword string. Using this information, dedupe will identify potential duplicate pair candidates and ask the user for confirmation, a process it calls active labeling. During this phase, dedupe writes out a settings file (dedupe_keywords_learned_settings) and a training file (dedupe_keywords_training.json). The code below checks for the presence of these files and, if found, skips the active labeling step (so for retraining, you should remove these files).

Once the active labeling, blocking and clustering are done, the predicted duplicate pairs can be retrieved from the dedupe classifier by supplying a threshold, which is itself computed from a recall weight that says how much more you value recall over precision. The last block of code below retrieves these candidate pairs and writes them out in tab separated format to the file keyword_dedupe_mappings.tsv. I sort each pair so that the longer keyword comes first; this is to accommodate the next step in this pipeline, which I will not go into in this post.

import csv
import os
import dedupe

DATA_DIR = "../data"
MODEL_DIR = "../models"

RAW_INPUT = os.path.join(DATA_DIR, "curated_keywords_hash.csv")
SETTINGS_FILE = os.path.join(MODEL_DIR, "dedupe_keywords_learned_settings")
TRAINING_FILE = os.path.join(MODEL_DIR, "dedupe_keywords_training.json")
OUTPUT_FILE = os.path.join(DATA_DIR, "keyword_dedupe_mappings.tsv")

# read training file (written using 09-keyword-dedupe) and transform
# into format suitable for use by dedupe
data = {}
with open(RAW_INPUT, "rb") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        row_id = int(row["id"])
        data[row_id] = dict(row.items())
        
# training
if os.path.exists(SETTINGS_FILE):
    print("reading from settings file {:s}".format(SETTINGS_FILE))
    with open(SETTINGS_FILE, "rb") as f:
        deduper = dedupe.StaticDedupe(f)
else:
    # define fields for deduper to pay attention to, here we
    # will ask for all fields except id
    field_names = ["col_{:d}".format(i+1) for i in range(25)]
    fields = [{"field": field_name, "type": "Exact"} 
              for field_name in field_names]
    fields.append({"field": "keyword", "type": "String"})
    deduper = dedupe.Dedupe(fields)
    # feed the deduper a sample of records for training
    deduper.sample(data, 15000)
    # use training data for previous run if available
    if os.path.exists(TRAINING_FILE):
        print("reading labeled examples from training file: {:s}"
              .format(TRAINING_FILE))
        with open(TRAINING_FILE, "rb") as f:
            deduper.readTraining(f)

    # active learning
    print("starting active labeling...")
    dedupe.consoleLabel(deduper)

    deduper.train()            
    
    # save training data to disk
    with open(TRAINING_FILE, "w") as f:
        deduper.writeTraining(f)
    with open(SETTINGS_FILE, "w") as f:
        deduper.writeSettings(f)
        
# blocking
print("blocking...")
threshold = deduper.threshold(data, recall_weight=1.5)

# clustering
print("clustering...")
clustered_dupes = deduper.match(data, threshold)

# write results
with open(OUTPUT_FILE, "w") as f:
    for cluster_id, cluster in enumerate(clustered_dupes):
        id_set, scores = cluster
        keywords = sorted([data[id]["keyword"] for id in id_set],
                          key=lambda x: len(x), reverse=True)
        f.write("{:s}\t{:s}\t{:.3f}\n".format(
            keywords[0], keywords[1], scores[0]))

The active learning part shows a pair of records on the console and asks the user to say yes or no to whether the records refer to the same thing. The user can exit the training loop at any point by choosing finished, or skip a record by choosing unsure. This is done by the consoleLabel convenience method. The recommendation is to label at least 10 pairs, but I ended up labeling close to 40 because I wanted to give it an even split of positive and negative examples. Here is what a single query looks like.

col_1 : 0
col_2 : 1
col_3 : 0
col_4 : 0
col_5 : 1
col_6 : 0
col_7 : -1
col_8 : 1
col_9 : 0
col_10 : 1
col_11 : 0
col_12 : 0
col_13 : 0
col_14 : 0
col_15 : 0
col_16 : 0
col_17 : 0
col_18 : 0
col_19 : 1
col_20 : -1
col_21 : 0
col_22 : 0
col_23 : 1
col_24 : 2
col_25 : 0
keyword : learning model

col_1 : 0
col_2 : 1
col_3 : 0
col_4 : 0
col_5 : 1
col_6 : 0
col_7 : -1
col_8 : 1
col_9 : 0
col_10 : 1
col_11 : 0
col_12 : 0
col_13 : 0
col_14 : 0
col_15 : 1
col_16 : 0
col_17 : 0
col_18 : 0
col_19 : 1
col_20 : -1
col_21 : 0
col_22 : 0
col_23 : 1
col_24 : 2
col_25 : 0
keyword : learning models

11/10 positive, 19/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious

Our final step is evaluating the quality of the predictions that dedupe gave us. Since I am looking for words that are almost but not quite duplicates, edit distance is a good way to measure this. Since dedupe reports a confidence score for each predicted duplicate pair, we will treat a pair as a predicted duplicate when its confidence is over 0.75, and we will treat a pair as a true duplicate if the edit distance between the keywords is up to 2 characters. We codify this below and use the metrics provided by Scikit-Learn to generate evaluation scores.

import nltk
import os

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

DATA_DIR = "../data"
KEYWORD_DEDUPE_MAPPINGS = os.path.join(DATA_DIR, "keyword_dedupe_mappings.tsv")

i = 0
labels, preds = [], []
f = open(KEYWORD_DEDUPE_MAPPINGS, "r")
for line in f:
    keyword_left, keyword_right, score = line.strip().split("\t")
    score = float(score)
    preds.append(1 if score > 0.75 else 0)
    edit_dist = nltk.edit_distance(keyword_left, keyword_right)
    labels.append(1 if edit_dist <= 2 else 0)
    if i <= 10:
        print("{:25s}\t{:25s}\t{:.3f}\t{:.3f}".format(keyword_left, keyword_right, 
                                                      score, edit_dist))
    i += 1
f.close()

acc = accuracy_score(labels, preds)
cm = confusion_matrix(labels, preds)
cr = classification_report(labels, preds)

print("---")
print("accuracy: {:.3f}".format(acc))
print("---")
print("confusion matrix")
print(cm)
print("---")
print("classification report")
print(cr)

The output of this code block is shown below. The first block shows the first 11 candidate pairs predicted by dedupe, along with the reported confidence and the computed edit distance. The accuracy of the deduper, given these thresholds, is 0.889 (almost 89%), and the weighted average precision, recall and f1-score are 0.92, 0.89 and 0.90 respectively, all very good, especially given the small effort spent in training dedupe.

learning approaches       learning approach         0.776 2.000
absolute values           absolute value            0.796 1.000
dual variables            dual variable             0.878 1.000
synaptic weights          synaptic weight           0.816 1.000
performance measures      performance measure       0.818 1.000
synthetic dataset         synthetic data            0.684 3.000
dynamical systems         dynamical system          0.836 1.000
action pairs              action pair               0.877 1.000
action potentials         action potential          0.853 1.000
learning models           learning model            0.816 1.000
action spaces             action space              0.816 1.000

---
accuracy: 0.889

---
confusion matrix
[[ 52   5]
 [ 31 235]]

---
classification report
             precision    recall  f1-score   support

          0       0.63      0.91      0.74        57
          1       0.98      0.88      0.93       266

avg / total       0.92      0.89      0.90       323
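As a sanity check, the summary numbers can be recomputed by hand from the confusion matrix above, here for the positive (duplicate) class:

# confusion matrix from above: rows are true labels, columns are predictions
tn, fp, fn, tp = 52, 5, 31, 235

accuracy = (tp + tn) / float(tp + tn + fp + fn)      # 287/323 = 0.889
precision = tp / float(tp + fp)                      # 235/240 = 0.98
recall = tp / float(tp + fn)                         # 235/266 = 0.88
f1 = 2 * precision * recall / (precision + recall)   # 0.93

print(accuracy, precision, recall, f1)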

One small caveat for those of you who are using Anaconda Python 3.x. I am, and I initially tried installing dedupe using pip, but it failed because of some incompatibility between the Anaconda libraries and dedupe-hcluster (a dependency of dedupe). I then installed it with conda from the derickl channel on my Mac, but that downgraded my Python version from 3 to 2 and broke lots of other code, so I had to reinstall Anaconda (fortunately or unfortunately I have had to do this before and have the process down, so I was back up and running, with all additional libraries except dedupe, within an afternoon). I ended up installing dedupe on my spare Linux box on top of Anaconda Python 2.x, using conda with the riipl-org channel for the linux-64 version of dedupe. According to this issue, dedupe itself and its dependencies are now Python 3 compatible, so hopefully the dedupe developers will soon push the latest version to conda-forge for both OSX and Linux.

That's all I have for today. As far as I can see, the dedupe library has so far been applied to structured data; I haven't seen it applied to unstructured data before. So I hope the idea of shingling the text and feature hashing the shingles, to convert unstructured data into structured tabular data, is a useful one for those of you with this use case. I cannot share the code for this at the moment since it is part of a bigger effort that I haven't yet put in a public repository. Once I publish the project, I will update this post with links.

