According to Wikipedia, a homonym is one of a group of words that share the same spelling and pronunciation but have different meanings. Homonym disambiguation, from my (admittedly limited) point of view, is the process by which a program that extracts concepts from text (such as my TGNI application) infers the correct sense of such a word when it encounters one.
In my world, some examples of homonyms are medical abbreviations that have non-medical analogs (e.g., AA: Aortic Aneurysm and American Airlines), medical words and abbreviations that collide (e.g., COLD: the common cold and Chronic Obstructive Lung Disease), and identical abbreviations for different medical concepts (e.g., ARF: Acute Renal Failure, Acute Respiratory Failure, and Activation Resorption Formation). We do have a pretty reliable algorithm for detecting and disambiguating homonyms in our in-house mapper, which relies on hand-crafted lists of reinforcer and dereinforcer words, but it is high-maintenance, and I wanted to see if I could figure out something simpler.
Of course, disambiguation is a bit of a misnomer, since given enough context there is no ambiguity. In my case, the context comes from the document itself. So in a document that has one or more occurrences of ARF, the intuition is that there will be other words or phrases that make it clear "which" ARF is being referred to. Further, the concepts representing these words or phrases will occur "closer to" one of the concepts representing ARF - the sense for which this closeness is highest is the correct sense of the homonym ARF.
To codify this intuition, I came up with a metric that defines the distance of a concept from a set of other concepts in concept (taxonomy) space. Basically, it is the inverse of the square root of the sum of squares of the path weights, taken over the paths from the homonym concept to the top concepts in the document. Each edge on a path contributes the weight (indicating relevance) assigned to that edge, multiplied by the number of occurrences of the top concept in the document. As the path becomes longer, each successive edge is damped by the square of its distance from the original homonym concept. Here is the formula:
$$d = \frac{1}{\sqrt{\sum_i p_i^2}}, \qquad p_i = \sum_j \frac{r_{j-1,j} \cdot n_i}{j^2}$$

where:
  d = the distance between the homonym concept C_h and the group of top concepts {C_T}
  p_i = the path weight of the path between C_h and the i-th top concept
  r_{j-1,j} = the rank assigned to the edge connecting the (j-1)-th and the j-th node in path p_i
  n_i = the number of occurrences of the i-th top concept in the document
Notice that our weights are more an indicator of likelihood than of distance, i.e., higher values indicate lower cost or distance. This is why we take the inverse of the root of the sum of squares of the path weights to get a distance metric.
The sum-of-squares form came from the notion of Pythagorean distance in concept space, and the damping factor for consecutive edges of a path comes from the behavior of gravitational attraction, which falls off with the square of the distance. Multiplying by the number of occurrences seemed to make sense because more occurrences of a particular related concept should weight the result towards that concept's sense.
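To make the damping and occurrence weighting concrete, here is a minimal standalone sketch that computes the metric for two hypothetical homonym senses. The class name, edge ranks and occurrence counts below are made up for illustration; they are not taken from our taxonomy or from the JUnit test further down.

public class ConceptDistanceSketch {

  // pathRanks[i] holds the edge ranks (mrank) along the path from the
  // homonym sense to the i-th top concept; occurrences[i] is how many
  // times that top concept appears in the document.
  static double distance(long[][] pathRanks, int[] occurrences) {
    double sumOfSquares = 0.0D;
    for (int i = 0; i < pathRanks.length; i++) {
      double pathWeight = 0.0D;
      for (int j = 0; j < pathRanks[i].length; j++) {
        // each successive edge is damped by the square of its hop position
        pathWeight += (occurrences[i] * pathRanks[i][j]) / Math.pow(j + 1, 2);
      }
      sumOfSquares += Math.pow(pathWeight, 2);
    }
    // weights measure likelihood (higher = more relevant), so the inverse
    // of the Pythagorean length turns them into a distance
    return 1.0D / Math.sqrt(sumOfSquares);
  }

  public static void main(String[] args) {
    int[] occurrences = {5, 3};                 // counts of the two top concepts
    long[][] senseA = {{8L, 6L}, {7L}};         // short, strongly ranked paths
    long[][] senseB = {{2L, 1L, 1L}, {3L, 2L}}; // longer, weaker paths
    System.out.println("d(sense A) = " + distance(senseA, occurrences));
    System.out.println("d(sense B) = " + distance(senseB, occurrences));
    // sense A comes out with the smaller distance, i.e. the better fit
  }
}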
Of course, computing the shortest path between any two nodes in a graph is likely to be compute-intensive, and worse, there may be no such path at all. So we limit the algorithm in two ways for reasonable performance: first, we consider only the distances from the homonym concept to the top 3 concepts in the document, and second, we limit the search for the shortest path to a maximum path length of 5. We may need to tinker with these values a bit, but they seem to work fine for my initial tests.
So anyway, to test out my idea, I found two documents, one about Acute Renal Failure and another about Acute Respiratory Failure, concept-mapped them, and wrote up a little JUnit test to see if the intuition outlined above is correct. Here it is:
// Source: src/main/java/com/mycompany/tgni/neo4j/ConceptDistanceHomonymDisambiguatorTest.java
package com.mycompany.tgni.neo4j;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.junit.AfterClass;
import org.junit.Assert;
import org.junit.BeforeClass;
import org.junit.Test;
import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.PropertyContainer;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipExpander;
import org.neo4j.kernel.Traversal;

/**
 * Tests for calculating the shortest path between two nodes
 * for homonym disambiguation.
 */
public class ConceptDistanceHomonymDisambiguatorTest {

  private static final String[][] CONCEPTS_FOR_ACUTE_RENAL_FAILURE_DOC =
    new String[][] { ... };
  private static final String[][] CONCEPTS_FOR_ACUTE_RESPIRATORY_FAILURE_DOC =
    new String[][] { ... };

  private static NodeService nodeService;

  @BeforeClass
  public static void setupBeforeClass() throws Exception {
    nodeService = NodeServiceFactory.getInstance();
  }

  @AfterClass
  public static void teardownAfterClass() throws Exception {
    NodeServiceFactory.destroy();
  }

  @Test
  public void testPathBetweenTwoNodes() throws Exception {
    // Acute renal failure: 3814964
    // Acute respiratory failure: 5355868
    double d11 = findDistance(3814964, CONCEPTS_FOR_ACUTE_RENAL_FAILURE_DOC);
    double d12 = findDistance(5355868, CONCEPTS_FOR_ACUTE_RENAL_FAILURE_DOC);
    double d21 = findDistance(3814964, CONCEPTS_FOR_ACUTE_RESPIRATORY_FAILURE_DOC);
    double d22 = findDistance(5355868, CONCEPTS_FOR_ACUTE_RESPIRATORY_FAILURE_DOC);
    System.out.println("d11=" + d11);
    System.out.println("d12=" + d12);
    Assert.assertTrue(d11 < d12);
    System.out.println("d21=" + d21);
    System.out.println("d22=" + d22);
    Assert.assertTrue(d22 < d21);
  }

  private double findDistance(int oid, String[][] docterms)
      throws Exception {
    double pythDist = 0.0D;
    List<String[]> topTerms = getTopTerms(oid, docterms);
    for (String[] topTerm : topTerms) {
      Integer topTermOid = Integer.valueOf(topTerm[0]);
      Integer occurrences = Integer.valueOf(topTerm[2]);
      Path shortestPath = findShortestPath(oid, topTermOid);
      System.out.println(showPath(shortestPath));
      if (shortestPath == null) continue;
      double distance = 0.0D;
      int hops = 1;
      for (Iterator<PropertyContainer> it = shortestPath.iterator();
          it.hasNext(); ) {
        PropertyContainer pc = it.next();
        if (pc instanceof Relationship) {
          Long strength = (Long) ((Relationship) pc).getProperty("mrank");
          // each successive edge is damped by the square of its hop
          // distance from the homonym concept
          distance += (occurrences * strength) / Math.pow(hops, 2);
          hops++;
        }
      }
      pythDist += Math.pow(distance, 2);
    }
    // if no path was found to any top concept, pythDist stays 0 and the
    // distance comes out as infinity, i.e. the sense is unrelated
    return (1.0D / Math.sqrt(pythDist));
  }

  private List<String[]> getTopTerms(int oid, String[][] docterms) {
    // sort the document's concepts by occurrence count (descending)
    // and keep the top 3
    List<String[]> topTerms = new ArrayList<String[]>();
    for (String[] docterm : docterms) {
      topTerms.add(docterm);
    }
    Collections.sort(topTerms, new Comparator<String[]>() {
      @Override
      public int compare(String[] term1, String[] term2) {
        Integer count1 = Integer.valueOf(term1[2]);
        Integer count2 = Integer.valueOf(term2[2]);
        return count2.compareTo(count1);
      }});
    if (topTerms.size() > 3) {
      for (String[] topterm : topTerms.subList(0, 3)) {
        System.out.println(StringUtils.join(topterm, ";"));
      }
      return topTerms.subList(0, 3);
    } else {
      for (String[] topterm : topTerms) {
        System.out.println(StringUtils.join(topterm, ";"));
      }
      return topTerms;
    }
  }

  private Path findShortestPath(int oid1, int oid2) throws Exception {
    long nid1 = nodeService.indexService.getNid(oid1);
    long nid2 = nodeService.indexService.getNid(oid2);
    Node node1 = nodeService.graphService.getNodeById(nid1);
    Node node2 = nodeService.graphService.getNodeById(nid2);
    RelationshipExpander expander = Traversal.expanderForAllTypes();
    PathFinder<Path> finder = GraphAlgoFactory.shortestPath(expander, 5);
    Iterable<Path> paths = finder.findAllPaths(node1, node2);
    // these are the shortest path(s) in terms of number of hops;
    // now we need to find the most likely path based on the
    // sum of the ranks of its relationships
    Path bestPath = null;
    Long maxStrength = 0L;
    for (Path path : paths) {
      Long strength = 0L;
      for (Iterator<PropertyContainer> it = path.iterator(); it.hasNext(); ) {
        PropertyContainer pc = it.next();
        if (pc instanceof Relationship) {
          strength += (Long) ((Relationship) pc).getProperty("mrank");
        }
      }
      if (strength > maxStrength) {
        maxStrength = strength;
        bestPath = path;
      }
    }
    return bestPath;
  }

  private String showPath(Path path) {
    if (path == null) return "NONE";
    StringBuilder buf = new StringBuilder();
    for (Iterator<PropertyContainer> it = path.iterator(); it.hasNext(); ) {
      PropertyContainer pc = it.next();
      if (pc instanceof Node) {
        Node npc = (Node) pc;
        buf.append((String) npc.getProperty("pname")).
          append("(").
          append((Integer) npc.getProperty("oid")).
          append(")");
      } else if (pc instanceof Relationship) {
        Relationship rpc = (Relationship) pc;
        buf.append("--(").
          append(rpc.getType().name()).
          append("[").
          append((Long) rpc.getProperty("mrank")).
          append("])-->");
      }
    }
    return buf.toString();
  }
}
Based on the results of the test, shown below, the approach appears to be sound. As you can see, the distance for the Acute Renal Failure sense of ARF is lower than that for the Acute Respiratory Failure sense in the document about Acute Renal Failure, and vice versa. So the metric provides a good indication of the correct sense of a homonym term from its context. Of course, a lot depends on the quality of the mapping and of the taxonomy.
                        Doc (Acute Renal Failure)   Doc (Acute Resp Failure)
-----------------------------------------------------------------------------
Acute Renal Failure     0.003245729458981506        0.013424935407337267
Acute Resp. Failure     0.003333949158063146        0.007470294606214862
Conclusion
The nice thing about this approach is that it can be applied to all the types of homonyms we encounter, and it needs no extra data to be maintained apart from the taxonomy relations, which we maintain already. While I focus mainly on text mapping here, the approach could be adapted for the search side as well. In the case of search, the general approach is to provide the user with a list of possibilities and ask them to refine the search, while showing results for the "most likely" possibility. If the search application "knew" the user's search history (like Google+ does now), it could instead use the same Pythagorean distance approach against that history to decide which possibility is "most likely".