Salmon Run: December 2014

Saturday, December 13, 2014

Computing a Comorbidity Network with Spark

I've been meaning to learn about Apache Spark since I read Fast Data processing with Spark and attended this meetup earlier this year. I have also been experimenting with some anonymized Claims data published by the Centers for Medicare and Medicaid Services (CMS), consisting of 1.3M inpatient and 15.8M outpatient records for 2.3M (unique) members. In order to get familiar with the Spark API, I came upon the idea of creating a network of member comorbidities, similar to the idea described in the paper Flavor Network and the Principles of Food Pairing by Ahn, et al. The Flavor Network analyzes food pairings based on shared flavor compounds, whereas my Comorbidity Network analyzes disease pairings based on shared procedure codes. While it may be intuitively simpler to use the co-occurrence frequency directly from the member data, this pairing will hopefully produce different insights. This post describes the approach and the code to create this network.

The member data is annotated with the member's comorbidities, ie, the diseases (s)he has. Each member record is annotated with up to 11 diseases using a binary indicator variable (1=Yes, 2=No). The published data is for the period 2008-2010, and there are multiple member entries for the same member, probably corresponding to data at different time points in this period. Each member can have zero or more inpatient and outpatient claims. Each claim has one or more procedure codes - in case of inpatient records, these are ICD-9 procedure codes and in case of outpatient records these are HCPCS codes. The diagram below shows the interaction between the member and claims data, with the uninteresting fields elided.

In order to transform this data into a Comorbidity Network, we need to run it through a series of transformations. The Spark API exposes a data structure called an Resilient Distributed Dataset (RDD) with methods on it that allow you to treat it like a Scala collection, very similar to Scalding. The set of transformations is summarized in the diagram below.

The implementation code is housed in a project built from the Spark Example Project template from Snowplow Analytics on GitHub, updated to use Spark 1.1 and Scala 2.10 and JUnit4 instead of Spec2 for tests. The example demonstrates use of the Spark API to implement the Word Count example, and is instrumented to generate artifacts to run the job on a Spark cluster. Here is the code for the pipeline. For ease of testing, I have modeled individual operations as independent functions. The execute() function ties these operations into the pipeline described in the diagram above.

// Source: src/main/scala/com/mycompany/diseasegraph/GraphDataGenerator.scala
package com.mycompany.diseasegraph

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import SparkContext._

object GraphDataGenerator {

  private val AppName = "GraphDataGeneratorJob"

  def execute(master: String, args: List[String], 
      jars: Seq[String]=Nil): Unit = {
    
    val sc = new SparkContext(master, AppName, null, jars)
    
    // dedup member so the record with the highest number
    // of comorbidities is retained
    val membersDeduped = dedupMemberInfo(sc.textFile(args(0)))
    // normalize member to (member_id,(disease,weight))
    val membersNormalized = normalizeMemberInfo(membersDeduped)

    // normalize inpatient and outpatient claims to 
    // (member_id,(code,weight))
    // This involves grouping by (member_id, claim_id) and
    // computing the code weights, then removing claim_id.
    // Codes used in inpatient claims are ICD-9 and the
    // ones used for outpatient claims are HCPCS
    val claimsNormalized = normalizeClaimInfo(
      sc.textFile(args(1)), (30, 35)) ++
      normalizeClaimInfo(sc.textFile(args(2)), (31, 75))
    
    // join the membersNormalized and claimsNormalized RDDs
    // on memberId to get mapping of disease to code
    // membersNormalized: (member_id, (disease, d_weight))
    // claimsNormalized: (member_id, (code, c_weight))
    // diseaseCodes: (disease, code, weight)
    val diseaseCodes = joinMembersWithClaims(
      membersNormalized, claimsNormalized)
    diseaseCodes.map(t => format(t)).saveAsTextFile(args(3))
    
    // finally do a self join with the diseaseCodes RDD joining
    // on code to compute a measure of disease-disease similarity
    // by the weight of the shared procedure codes to produce
    // (disease_A, disease_B, weight)
    val diseaseAllPairs = selfJoinDisease(diseaseCodes)
    diseaseAllPairs.map(t => format(t)).saveAsTextFile(args(4))
  }
  
  def format(t: (String,String,Double)): String = {
    "%s,%s,%.3f".format(t._1, t._2, t._3)
  }
  
  val columnDiseaseMap = Map(
    12 -> "ALZM",
    13 -> "CHF",
    14 -> "CKD",
    15 -> "CNCR",
    16 -> "COPD",
    17 -> "DEP",
    18 -> "DIAB",
    19 -> "IHD",
    20 -> "OSTR",
    21 -> "ARTH",
    22 -> "TIA"
  )

  def dedupMemberInfo(input: RDD[String]): 
      RDD[(String,List[String])] = {
    input.map(line => {
      val cols = line.split(",")
      val memberId = cols(0)
      val comorbs = columnDiseaseMap.keys
        .toList.sorted
        .map(e => cols(e))
      (memberId, comorbs.toList)
    })
    .reduceByKey((v1, v2) => {
      // 1 == Yes, 2 == No
      val v1fsize = v1.filter(_ == 1).size
      val v2fsize = v2.filter(_ == 1).size
      if (v1fsize > v2fsize) v1 else v2
    })
  }
  
  def normalizeMemberInfo(input: RDD[(String,List[String])]): 
      RDD[(String,(String,Double))]= {
    input.flatMap(elem => {
      val diseases = elem._2.zipWithIndex
        .map(di => if (di._1.toInt == 1) columnDiseaseMap(di._2 + 12) else "")
        .filter(d => ! d.isEmpty())
      val weight = 1.0 / diseases.size
      diseases.map(disease => (elem._1, (disease, weight)))
    })
  }
  
  def normalizeClaimInfo(input: RDD[String], 
      pcodeIndex: (Int,Int)): RDD[(String,(String,Double))] = {
    input.flatMap(line => {
      val cols = line.split(",")
      val memberId = cols(0)
      val claimId = cols(1)
      val procCodes = cols.slice(pcodeIndex._1, pcodeIndex._2)
      procCodes.filter(pcode => ! pcode.isEmpty)
        .map(pcode => ("%s:%s".format(memberId, claimId), pcode))
    })
    .groupByKey()
    .flatMap(grouped => {
      val memberId = grouped._1.split(":")(0)
      val weight = 1.0 / grouped._2.size
      val codes = grouped._2.toList
      codes.map(code => {
        (memberId, (code, weight))
      })
    })
  }
  
  def joinMembersWithClaims(members: RDD[(String,(String,Double))],
      claims: RDD[(String,(String,Double))]): 
      RDD[(String,String,Double)] = {
    members.join(claims)
      .map(rec => {
        val disease = rec._2._1._1
        val code = rec._2._2._1
        val weight = rec._2._1._2 * rec._2._2._2
        (List(disease, code).mkString(":"), weight)
      })
      .reduceByKey(_ + _)
      .sortByKey(true)
      .map(kv => {
        val Array(k,v) = kv._1.split(":")
        (k, v, kv._2)
      })
  }
  
  def selfJoinDisease(dcs: RDD[(String,String,Double)]): 
      RDD[(String,String,Double)] = {
    val dcsKeyed = dcs.map(t => (t._2, (t._1, t._3)))
    dcsKeyed.join(dcsKeyed)
      // join on code and compute edge weight 
      // between disease pairs
      .map(rec => {
        val ldis = rec._2._1._1
        val rdis = rec._2._2._1
        val diseases = Array(ldis, rdis).sorted
        val weight = rec._2._1._2 * rec._2._2._2
        (diseases(0), diseases(1), weight)
      })
      // filter out cases where LHS == RHS
      .filter(t => ! t._1.equals(t._2))
      // group on (LHS + RHS)
      .map(t => (List(t._1, t._2).mkString(":"), t._3))
      .reduceByKey(_ + _)
      .sortByKey(true)
      // convert back to triple format
      .map(p => {
        val Array(lhs, rhs) = p._1.split(":")
        (lhs, rhs, p._2)
      })
  }
}

On the member side, we start with the input files of denormalized, duplicate member data that looks like this:

0001448457F2ED81,19750401,,2,2,0,15,150,12,12,12,12,1,1,1,2,1,1,1,1,2,2,2,21000.00,4096.00,0.00,670.00,430.00,0.00,2070.00,450.00,70.00
0001448457F2ED81,19750401,,2,2,0,15,150,12,12,12,12,1,1,1,2,2,1,1,1,2,2,2,21000.00,4096.00,0.00,670.00,430.00,0.00,2070.00,450.00,70.00
000188A3402777A5,19220401,,2,5,0,31,270,12,12,0,12,1,1,1,1,1,1,1,1,2,2,2,0.00,0.00,0.00,2100.00,890.00,0.00,2330.00,690.00,30.00
0001A1CB9A542107,19220201,,2,1,0,45,610,12,12,0,08,2,2,2,2,2,2,1,1,2,2,2,0.00,0.00,0.00,4890.00,1460.00,0.00,1060.00,290.00,300.00
0001EB1229306825,19540501,,1,3,0,22,090,12,12,0,12,2,1,2,1,2,2,2,1,2,1,2,0.00,0.00,0.00,660.00,410.00,0.00,4470.00,1220.00,20.00
000238CD55A321A2,19110501,,2,1,0,19,350,12,12,0,12,1,2,2,2,2,1,1,2,1,2,2,0.00,0.00,0.00,0.00,0.00,0.00,750.00,180.00,0.00
...

Our first operation filters out the member_id and comorbidity indicators and groups by member_id to remove all duplicate occurrences except the one with the maximum number of comorbidities for the member. The member data is transformed to something like this:

0001EB1229306825,2,1,2,1,2,2,2,1,2,1,2
000188A3402777A5,1,1,1,1,1,1,1,1,2,2,2
0001A1CB9A542107,2,2,2,2,2,2,1,1,2,2,2
000238CD55A321A2,1,2,2,2,2,1,1,2,1,2,2
0001448457F2ED81,1,1,1,2,2,1,1,1,2,2,2
...

Our next step normalizes the data by member_id and disease, assigning a weight for the disease as the reciprocal of the number of diseases a member has. For example, the first member in the data above has 4 diseases, so each of his diseases will be assigned a weight of 0.25. At this point, the data looks like this:

0001EB1229306825,CHF,0.250
0001EB1229306825,CNCR,0.250
0001EB1229306825,IHD,0.250
0001EB1229306825,ARTH,0.250
000188A3402777A5,ALZM,0.125
...

On the claims side, each claim is keyed by member_id and claim_id. Each claim lists a bunch of procedure codes. ICD-9 codes are used for Inpatient claims and HCPCS codes are used for Outpatient claims. Processing for both claims is similar, the only difference is the code to capture the claims from the input data. Here is what outpatient claims data looks like.

00000B48BCF4AD29,391362254696220,1,20080315,20080315,10026U,70.00,0.00,2296429757,,,0.00,2721,V1250,2330,,,,,,,,,,,,,,0.00,0.00,,G0103,80076,80053,80053,85610,87086,82248,82565,84443,76700,80053,80061,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
00000B48BCF4AD29,391612254357344,1,20080405,20080405,0300YM,50.00,0.00,5824170688,,,0.00,72999,72981,,,,,,,,,,,,,,,0.00,10.00,7295,93971,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
00000B48BCF4AD29,391812254100253,1,20080604,20080621,1002GD,300.00,0.00,0631751348,,,0.00,7216,,,,,,,,,,,,,,,,0.00,40.00,7242,97110,97001,97010,97110,97110,97110,97110,97110,97110,97010,97140,97035,97110,97124,97530,97110,97012,97026,97001,97010,97140,97110,,,,,,,,,,,,,,,,,,,,,,,
00000B48BCF4AD29,391012254328356,1,20081024,20081024,1000GD,1100.00,0.00,7611012531,7611012531,,0.00,99671,4019,V1046,53081,V5861,3441,2859,,,,,,,,,,0.00,30.00,,,84484,V2632,J1170,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
00000B48BCF4AD29,391602254471197,1,20090414,20090414,1000UU,60.00,0.00,7190755841,,,0.00,7916,V7644,,,,,,,,,,,,,,,0.00,0.00,,36415,86800,85027,82043,84460,85025,82435,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...

The first step is to filter out the member_id, claim_id and procedure codes from the data and group by (member_id+claim_id) to find the number of codes per claim and to weight them as a reciprocal of how many codes are present in the claim, similar to how we weighted the diseases. We then aggregate code weights for individual codes across claims for each member, arriving at normalized member data that looks like this:

00009C897C3D8372,82043,1.000
0000525AB30E4DEF,78315,0.333
0000525AB30E4DEF,83036,0.333
0000525AB30E4DEF,96372,0.333
0000525AB30E4DEF,99281,1.000
...

We now join the normalized member data and normalized claim data by member_id and combining the weights, to get (disease, code, weight) triples that look like this:

ALZM,7935,0.167
ALZM,9427,0.167
CHF,7935,0.167
CHF,9427,0.167
CKD,7935,0.167
...

Joining this data with itself on the code gives us the disease pairs with a score that indicates the degree of similarity between then, in other words our Comorbidity Network.

ALZM,ARTH,10.206
ALZM,CHF,22.205
ALZM,CKD,14.718
ALZM,CNCR,4.790
ALZM,COPD,11.838
...

When running this on Spark via the command line, the pipeline described would be invoked through a job class like this one. However, during development, I incrementally developed each of the operations individually and tested them using JUnit tests. I also used a JUnit test to run the full pipeline against my "local" Spark cluster on my laptop with a small subset of data consisting of 1,000 members and 5,000 each of inpatient and outpatient records. The (final) JUnit test to run the full pipeline is shown below. The full test suite is here.

// Source: src/test/scala/com/mycompany/diseasegraph/GraphDataGeneratorTest.scala
package com.mycompany.diseasegraph

class GraphDataGeneratorTest {

  val TestName = "GraphDataGeneratorTest"
    
  ...

  @Test
  def testExecute(): Unit = {
    val args = List(
      "data/benefit_summary_1000.csv",
      "data/inpatient_claims_1000.csv",
      "data/outpatient_claims_1000.csv",
      "src/test/resources/final_outputs/disease_codes",
      "src/test/resources/final_outputs/diseases"
    )
    forceDeleteIfExists(args(3))
    forceDeleteIfExists(args(4))
    GraphDataGenerator.execute(master="local", args=args)
  }
}

I then used the generated data to construct a graph using the Python script shown below, adapted from this example on the Udacity wiki. The code reads from the output of the Spark job, and constructs a NetworkX graph object and plots it using the Matplotlib library. The edge weights are normalized (so maximum edge weight is 1), and only edges with a weight greater than 0.3 are shown to reduce clutter.

# Source: src/main/python/disease_graph.py
import networkx as nx
import matplotlib.pyplot as plt
import os

def draw_graph(G, labels=None, graph_layout='shell',
               node_size=1600, node_color='blue', node_alpha=0.3,
               node_text_size=12,
               edge_color='blue', edge_alpha=0.3, edge_tickness=1,
               edge_text_pos=0.3,
               text_font='sans-serif'):

    # these are different layouts for the network you may try
    # shell seems to work best
    if graph_layout == 'spring':
        graph_pos=nx.spring_layout(G)
    elif graph_layout == 'spectral':
        graph_pos=nx.spectral_layout(G)
    elif graph_layout == 'random':
        graph_pos=nx.random_layout(G)
    else:
        graph_pos=nx.shell_layout(G)
    # draw graph
    nx.draw_networkx_nodes(G,graph_pos,node_size=node_size, 
                           alpha=node_alpha, node_color=node_color)
    nx.draw_networkx_edges(G,graph_pos,width=edge_tickness,
                           alpha=edge_alpha,edge_color=edge_color)
    nx.draw_networkx_labels(G, graph_pos,font_size=node_text_size,
                            font_family=text_font)
    nx.draw_networkx_edge_labels(G, graph_pos, edge_labels=labels, 
                                 label_pos=edge_text_pos)
    # show graph
    frame = plt.gca()
    frame.axes.get_xaxis().set_visible(False)
    frame.axes.get_yaxis().set_visible(False)

    plt.show()

def add_node_to_graph(G, node, node_labels):
    if node not in node_labels:
        G.add_node(node)
    node_labels.add(node)

datadir = "../../../src/test/resources/final_outputs/diseases"
lines = []
for datafile in os.listdir(datadir):
    if datafile.startswith("part-"):
        fin = open(os.path.join(datadir, datafile), 'rb')
        for line in fin:
            disease_1, disease_2, weight = line.strip().split(",")
            lines.append((disease_1, disease_2, float(weight)))
        fin.close()

max_weight = max([x[2] for x in lines])
norm_lines = map(lambda x: (x[0], x[1], x[2] / max_weight), lines)

G = nx.Graph()
edge_labels = dict()
node_labels = set()
for line in norm_lines:
    add_node_to_graph(G, line[0], node_labels)
    add_node_to_graph(G, line[1], node_labels)
    if line[2] > 0.3:
        G.add_edge(line[0], line[1], weight=line[2])
        edge_labels[(line[0], line[1])] = "%.2f" % (line[2])
draw_graph(G, labels=edge_labels, graph_layout="shell")

This code produces the Comorbidity Network shown below.

The network is obviously inaccurate because it was generated off a subset of the input data, so we can't conclude much from it. In this initial part of the task, I deliberately concentrated on just learning the Spark API without thinking too much about how I would process the full dataset. From the little I have read on the subject, one can use either Amazon's EC2 or EMR to run Spark jobs, but neither seemed very straightforward - I plan on spending some time figuring this out later and write about it when successful.

That's all I have for today. My take on the Spark API is that it is really well-designed and has almost zero learning curve if you are already developing using Scala's Higher Order Functions. In some respects, Scalding's API is similar, but I found Spark's API to be more intuitive. If you would like to try the code out yourself, you can find it here on GitHub.

Thursday, December 04, 2014

Computing Semantic Similarity for Short Sentences

A reader recently recommended a paper for me to read - Sentence Similarity Based on Semantic Nets and Corpus Statistics. I found the algorithm quite interesting and I ended up implementing it. While I was not able to replicate the results exactly, my results did agree with results you would intuitively expect. I describe the algorithm and my implementation in this post.

My implementation is built with Python and Natural Language Tool Kit (NLTK). The Semantic Net referred to in the paper is Wordnet and the Corpus Statistics are from the Brown Corpus, both of which are available using NLTK's corpus API. Here is the complete code, which I explain below.

from __future__ import division
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import brown
import math
import numpy as np
import sys

# Parameters to the algorithm. Currently set to values that was reported
# in the paper to produce "best" results.
ALPHA = 0.2
BETA = 0.45
ETA = 0.4
PHI = 0.2
DELTA = 0.85

brown_freqs = dict()
N = 0

######################### word similarity ##########################

def get_best_synset_pair(word_1, word_2):
    """ 
    Choose the pair with highest path similarity among all pairs. 
    Mimics pattern-seeking behavior of humans.
    """
    max_sim = -1.0
    synsets_1 = wn.synsets(word_1)
    synsets_2 = wn.synsets(word_2)
    if len(synsets_1) == 0 or len(synsets_2) == 0:
        return None, None
    else:
        max_sim = -1.0
        best_pair = None, None
        for synset_1 in synsets_1:
            for synset_2 in synsets_2:
               sim = wn.path_similarity(synset_1, synset_2)
               if sim > max_sim:
                   max_sim = sim
                   best_pair = synset_1, synset_2
        return best_pair

def length_dist(synset_1, synset_2):
    """
    Return a measure of the length of the shortest path in the semantic 
    ontology (Wordnet in our case as well as the paper's) between two 
    synsets.
    """
    l_dist = sys.maxint
    if synset_1 is None or synset_2 is None: 
        return 0.0
    if synset_1 == synset_2:
        # if synset_1 and synset_2 are the same synset return 0
        l_dist = 0.0
    else:
        wset_1 = set([str(x.name()) for x in synset_1.lemmas()])        
        wset_2 = set([str(x.name()) for x in synset_2.lemmas()])
        if len(wset_1.intersection(wset_2)) > 0:
            # if synset_1 != synset_2 but there is word overlap, return 1.0
            l_dist = 1.0
        else:
            # just compute the shortest path between the two
            l_dist = synset_1.shortest_path_distance(synset_2)
            if l_dist is None:
                l_dist = 0.0
    # normalize path length to the range [0,1]
    return math.exp(-ALPHA * l_dist)

def hierarchy_dist(synset_1, synset_2):
    """
    Return a measure of depth in the ontology to model the fact that 
    nodes closer to the root are broader and have less semantic similarity
    than nodes further away from the root.
    """
    h_dist = sys.maxint
    if synset_1 is None or synset_2 is None: 
        return h_dist
    if synset_1 == synset_2:
        # return the depth of one of synset_1 or synset_2
        h_dist = max([x[1] for x in synset_1.hypernym_distances()])
    else:
        # find the max depth of least common subsumer
        hypernyms_1 = {x[0]:x[1] for x in synset_1.hypernym_distances()}
        hypernyms_2 = {x[0]:x[1] for x in synset_2.hypernym_distances()}
        lcs_candidates = set(hypernyms_1.keys()).intersection(
            set(hypernyms_2.keys()))
        if len(lcs_candidates) > 0:
            lcs_dists = []
            for lcs_candidate in lcs_candidates:
                lcs_d1 = 0
                if hypernyms_1.has_key(lcs_candidate):
                    lcs_d1 = hypernyms_1[lcs_candidate]
                lcs_d2 = 0
                if hypernyms_2.has_key(lcs_candidate):
                    lcs_d2 = hypernyms_2[lcs_candidate]
                lcs_dists.append(max([lcs_d1, lcs_d2]))
            h_dist = max(lcs_dists)
        else:
            h_dist = 0
    return ((math.exp(BETA * h_dist) - math.exp(-BETA * h_dist)) / 
        (math.exp(BETA * h_dist) + math.exp(-BETA * h_dist)))
    
def word_similarity(word_1, word_2):
    synset_pair = get_best_synset_pair(word_1, word_2)
    return (length_dist(synset_pair[0], synset_pair[1]) * 
        hierarchy_dist(synset_pair[0], synset_pair[1]))

######################### sentence similarity ##########################

def most_similar_word(word, word_set):
    """
    Find the word in the joint word set that is most similar to the word
    passed in. We use the algorithm above to compute word similarity between
    the word and each word in the joint word set, and return the most similar
    word and the actual similarity value.
    """
    max_sim = -1.0
    sim_word = ""
    for ref_word in word_set:
      sim = word_similarity(word, ref_word)
      if sim > max_sim:
          max_sim = sim
          sim_word = ref_word
    return sim_word, max_sim
    
def info_content(lookup_word):
    """
    Uses the Brown corpus available in NLTK to calculate a Laplace
    smoothed frequency distribution of words, then uses this information
    to compute the information content of the lookup_word.
    """
    global N
    if N == 0:
        # poor man's lazy evaluation
        for sent in brown.sents():
            for word in sent:
                word = word.lower()
                if not brown_freqs.has_key(word):
                    brown_freqs[word] = 0
                brown_freqs[word] = brown_freqs[word] + 1
                N = N + 1
    lookup_word = lookup_word.lower()
    n = 0 if not brown_freqs.has_key(lookup_word) else brown_freqs[lookup_word]
    return 1.0 - (math.log(n + 1) / math.log(N + 1))
    
def semantic_vector(words, joint_words, info_content_norm):
    """
    Computes the semantic vector of a sentence. The sentence is passed in as
    a collection of words. The size of the semantic vector is the same as the
    size of the joint word set. The elements are 1 if a word in the sentence
    already exists in the joint word set, or the similarity of the word to the
    most similar word in the joint word set if it doesn't. Both values are 
    further normalized by the word's (and similar word's) information content
    if info_content_norm is True.
    """
    sent_set = set(words)
    semvec = np.zeros(len(joint_words))
    i = 0
    for joint_word in joint_words:
        if joint_word in sent_set:
            # if word in union exists in the sentence, s(i) = 1 (unnormalized)
            semvec[i] = 1.0
            if info_content_norm:
                semvec[i] = semvec[i] * math.pow(info_content(joint_word), 2)
        else:
            # find the most similar word in the joint set and set the sim value
            sim_word, max_sim = most_similar_word(joint_word, sent_set)
            semvec[i] = max_sim if max_sim > PHI else 0.0
            if info_content_norm:
                semvec[i] = semvec[i] * info_content(joint_word) * info_content(sim_word)
        i = i + 1
    return semvec                
            
def semantic_similarity(sentence_1, sentence_2, info_content_norm):
    """
    Computes the semantic similarity between two sentences as the cosine
    similarity between the semantic vectors computed for each sentence.
    """
    words_1 = nltk.word_tokenize(sentence_1)
    words_2 = nltk.word_tokenize(sentence_2)
    joint_words = set(words_1).union(set(words_2))
    vec_1 = semantic_vector(words_1, joint_words, info_content_norm)
    vec_2 = semantic_vector(words_2, joint_words, info_content_norm)
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))

######################### word order similarity ##########################

def word_order_vector(words, joint_words, windex):
    """
    Computes the word order vector for a sentence. The sentence is passed
    in as a collection of words. The size of the word order vector is the
    same as the size of the joint word set. The elements of the word order
    vector are the position mapping (from the windex dictionary) of the 
    word in the joint set if the word exists in the sentence. If the word
    does not exist in the sentence, then the value of the element is the 
    position of the most similar word in the sentence as long as the similarity
    is above the threshold ETA.
    """
    wovec = np.zeros(len(joint_words))
    i = 0
    wordset = set(words)
    for joint_word in joint_words:
        if joint_word in wordset:
            # word in joint_words found in sentence, just populate the index
            wovec[i] = windex[joint_word]
        else:
            # word not in joint_words, find most similar word and populate
            # word_vector with the thresholded similarity
            sim_word, max_sim = most_similar_word(joint_word, wordset)
            if max_sim > ETA:
                wovec[i] = windex[sim_word]
            else:
                wovec[i] = 0
        i = i + 1
    return wovec

def word_order_similarity(sentence_1, sentence_2):
    """
    Computes the word-order similarity between two sentences as the normalized
    difference of word order between the two sentences.
    """
    words_1 = nltk.word_tokenize(sentence_1)
    words_2 = nltk.word_tokenize(sentence_2)
    joint_words = list(set(words_1).union(set(words_2)))
    windex = {x[1]: x[0] for x in enumerate(joint_words)}
    r1 = word_order_vector(words_1, joint_words, windex)
    r2 = word_order_vector(words_2, joint_words, windex)
    return 1.0 - (np.linalg.norm(r1 - r2) / np.linalg.norm(r1 + r2))

######################### overall similarity ##########################

def similarity(sentence_1, sentence_2, info_content_norm):
    """
    Calculate the semantic similarity between two sentences. The last 
    parameter is True or False depending on whether information content
    normalization is desired or not.
    """
    return DELTA * semantic_similarity(sentence_1, sentence_2, info_content_norm) + \
        (1.0 - DELTA) * word_order_similarity(sentence_1, sentence_2)
        
######################### main / test ##########################

# the results of the algorithm are largely dependent on the results of 
# the word similarities, so we should test this first...
word_pairs = [
  ["asylum", "fruit", 0.21],
  ["autograph", "shore", 0.29],
  ["autograph", "signature", 0.55],
  ["automobile", "car", 0.64],
  ["bird", "woodland", 0.33],
  ["boy", "rooster", 0.53],
  ["boy", "lad", 0.66],
  ["boy", "sage", 0.51],
  ["cemetery", "graveyard", 0.73],
  ["coast", "forest", 0.36],
  ["coast", "shore", 0.76],
  ["cock", "rooster", 1.00],
  ["cord", "smile", 0.33],
  ["cord", "string", 0.68],
  ["cushion", "pillow", 0.66],
  ["forest", "graveyard", 0.55],
  ["forest", "woodland", 0.70],
  ["furnace", "stove", 0.72],
  ["glass", "tumbler", 0.65],
  ["grin", "smile", 0.49],
  ["gem", "jewel", 0.83],
  ["hill", "woodland", 0.59],
  ["hill", "mound", 0.74],
  ["implement", "tool", 0.75],
  ["journey", "voyage", 0.52],
  ["magician", "oracle", 0.44],
  ["magician", "wizard", 0.65],
  ["midday", "noon", 1.0],
  ["oracle", "sage", 0.43],
  ["serf", "slave", 0.39]
]
for word_pair in word_pairs:
    print "%s\t%s\t%.2f\t%.2f" % (word_pair[0], word_pair[1], word_pair[2], 
                                  word_similarity(word_pair[0], word_pair[1]))

sentence_pairs = [
    ["I like that bachelor.", "I like that unmarried man.", 0.561],
    ["John is very nice.", "Is John very nice?", 0.977],
    ["Red alcoholic drink.", "A bottle of wine.", 0.585],
    ["Red alcoholic drink.", "Fresh orange juice.", 0.611],
    ["Red alcoholic drink.", "An English dictionary.", 0.0],
    ["Red alcoholic drink.", "Fresh apple juice.", 0.420],
    ["A glass of cider.", "A full cup of apple juice.", 0.678],
    ["It is a dog.", "That must be your dog.", 0.739],
    ["It is a dog.", "It is a log.", 0.623],
    ["It is a dog.", "It is a pig.", 0.790],
    ["Dogs are animals.", "They are common pets.", 0.738],
    ["Canis familiaris are animals.", "Dogs are common pets.", 0.362],
    ["I have a pen.", "Where do you live?", 0.0],
    ["I have a pen.", "Where is ink?", 0.129],
    ["I have a hammer.", "Take some nails.", 0.508],
    ["I have a hammer.", "Take some apples.", 0.121]
]
for sent_pair in sentence_pairs:
    print "%s\t%s\t%.3f\t%.3f\t%.3f" % (sent_pair[0], sent_pair[1], sent_pair[2], 
        similarity(sent_pair[0], sent_pair[1], False),
        similarity(sent_pair[0], sent_pair[1], True))

Proceeding from the top-down (wrt the code, or bottom-up wrt the algorithm), the lowest unit of the algorithm is the semantic similarity between a pair of words. The word similarity is a combination of two functions f(l) and f(h), where l is the shortest path between the two words in Wordnet (our Semantic Network) and h the height of their Lowest Common Subsumer (LCS) from the root of the Semantic Network. The intuition behind these is that l is a proxy for how similar the words are, and d is a proxy for the specificity of the LCS, ie, LCS nodes closer to the root indicate broader/more abstract concepts and less similarity. The functions f(l) and f(h) serve to normalize these values to the range [0,1]. In formulas, then:

sim(w₁, w₂) = f(l).f(h)

    where:

    f(l) = e^-αl
            e^βh - e^-βh
    f(h) = ------------
            e^βh + e^-βh

Word similarities between a set of word pairs were reported in the paper. As a test, I computed the similarities between the same word pairs with my code above. The similarity values reported in the paper are shown under Exp.Sim and the ones returned by my code are shown under Act.Sim. As you can see, they are close but not identical - however, note that the computed similarites seem to line up with intuition. For example, sim(autograph, signature) is higher than sim(autograph, shore), sim(magician, wizard) is higher than sim(magician, oracle), etc.

Word #1	Word #2	Exp. Sim	Act. Sim
asylum	fruit	0.21	0.30
autograph	shore	0.29	0.16
autograph	signature	0.55	0.82
automobile	car	0.64	1.00
bird	woodland	0.33	0.20
boy	rooster	0.53	0.11
boy	lad	0.66	0.82
boy	sage	0.51	0.37
cemetery	graveyard	0.73	1.00
coast	forest	0.36	0.36
coast	shore	0.76	0.80
cock	rooster	1.00	1.00
cord	smile	0.33	0.13
cord	string	0.68	0.82
cushion	pillow	0.66	0.82
forest	graveyard	0.55	0.20
forest	woodland	0.70	0.98
furnace	stove	0.72	0.17
glass	tumbler	0.65	0.82
grin	smile	0.49	0.99
gem	jewel	0.83	1.00
hill	woodland	0.59	0.36
hill	mound	0.74	0.99
implement	tool	0.75	0.82
journey	voyage	0.52	0.82
magician	oracle	0.44	0.30
magician	wizard	0.65	1.00
midday	noon	1.00	1.00
oracle	sage	0.43	0.37
serf	slave	0.39	0.55

One thing I did differently from the paper is to select the most similar pair of synsets instead of just picking the first noun synset for each word (see the function get_best_synset_pair). This is because a word can map to multiple synsets, and finding the most similar pair mimics the human tendency to maximize pattern seeking (ie see patterns where there are none).

Sentence similarity is computed as a linear combination of semantic similarity and word order similarity. Semantic Similarity is computed as the Cosine Similarity between the semantic vectors for the two sentences. To build the semantic vector, the union of words in the two sentences is treated as the vocabulary. If the word occurs in the sentence, its value for that position is 1. If it doesn't, the similarity for the word is computed against all the other words in the sentence. If it happens to be above a threshold φ, then the value of the element is φ, else it is 0. This value is further attenuated by the information content for the word as found in the Brown corpus. In equations:

s₁ • s₂
    S_s = -----------------
          ||s₁|| * ||s₂||

    where:

    s_i = s * I(w_i) * I(w_j)

                log(n + 1)
    I(w) = 1 - ------------
                log(N + 1)

    where:
      n = number of times word w occurs in corpus
      N = number of words in the corpus

The word order similarity attempts to correct for the fact that sentences with the same words can have radically different meanings. This is done by computing the word order vector for each sentence and computing a normalized similarity measure between them. The word order vector, like the semantic vector is based on the joint word set. If the word occurs in the sentence, its position in the joint word set is recorded. If not, the similarity to the most similar word in the sentence is recorded if it crosses a threshold η else it is 0. In equations:

||r₁ - r₂||
    S_r = 1 - --------------
              ||r₁ + r₂||

    where:
      r₁ = word position vector for sentence 1
      r₂ = word position vector for sentence 2

The similarity between two sentences are modeled as a linear combination of their semantic similarity and word order similarity, ie:

S = δS_s + (1 - δ)S_r

Similar to word similarities, the paper also lists some sentence similarities computed with their algorithm. I tried these same sentences through my code, and as expected, got slightly different results (since the sentence similarity is dependent on word similarities). Here they are. Exp.Sim are the values reported in the paper, Act.Sim (w/o IC) are computed similarities without Information Content normalization and Act.Sim (w/IC) are computed similarities with Information Content normalization. As you can see the Exp.Sim and Act.Sim (w/IC) values are quite consistent.

Sentence #1	Sentence #2	Exp.Sim	Act.Sim (w/o IC)	Act.Sim (w/IC)
I like that bachelor.	I like that unmarried man.	0.561	0.801	0.345
John is very nice.	Is John very nice?	0.977	0.592	0.881
Red alcoholic drink.	A bottle of wine.	0.585	0.477	0.307
Red alcoholic drink.	Fresh orange juice.	0.611	0.467	0.274
Red alcoholic drink.	An English dictionary.	0.000	0.237	0.028
Red alcoholic drink.	Fresh apple juice.	0.420	0.389	0.215
A glass of cider.	A full cup of apple juice.	0.678	0.659	0.347
It is a dog.	That must be your dog.	0.739	0.452	0.709
It is a dog.	It is a log.	0.623	0.858	0.497
It is a dog.	It is a pig.	0.790	0.863	0.500
Dogs are animals.	They are common pets.	0.738	0.550	0.377
Canis familiaris are animals.	Dogs are common pets.	0.362	0.458	0.151
I have a pen.	Where do you live?	0.000	0.134	0.158
I have a pen.	Where is ink?	0.129	0.112	0.077
I have a hammer.	Take some nails.	0.508	0.431	0.288
I have a hammer.	Take some apples.	0.121	0.344	0.147

Once more, while the numbers don't match exactly, the results seem intuitively correct. For example, "Red alcoholic drink" is more similar to "A bottle of wine" than "Fresh apple juice", which is more similar than "An English dictionary", etc.

Thats all I have for today. Hope you found this paper (and my implementation) interesting. The (latest) code for this post is available on GitHub here.

Update 2017-03-13: Many thanks to Mathieu Chrétien for updating the code to use Python3 and contributing it back, you can find it on this github gist.

Salmon Run

Saturday, December 13, 2014

Computing a Comorbidity Network with Spark

Thursday, December 04, 2014

Computing Semantic Similarity for Short Sentences

Posts

Labels

Blogs I Read

About me

My Nerd Rating

Visitor Map

Contact Me