Salmon Run: Extracting SVO Triples from Wikipedia

I recently came across this discussion (login required) on LinkedIn about extracting (subject, verb, object) (SVO) triples from text. Jack Park, owner of the SolrSherlock project, suggested using ReVerb to do this. I remembered an entertaining Programming Assignment from when I did the Natural Language Processing Course on Coursera, that involved finding spouse names from a small subset of Wikipedia, so I figured I it would be interesting to try using ReVerb against this data.

This post describes that work. As before, given the difference between this and the "preferred" approach that the automatic grader expects, results are likely to be wildly off the mark. BTW, I highly recommend taking the course if you haven't already, there are lots of great ideas in there. One of the ideas deals with generating "raw" triples, then filtering them using known (subject, object) pairs to find candidate verbs, then turning around and using the verbs to find unknown (subject, object) pairs.

So in order to find the known (subject, object) pairs, I decided to parse the Infobox content (the "semi-structured" part of Wikipedia pages). Wikipedia markup is a mini programming language in itself, so I went looking for some pointers on how to parse it (third party parsers or just ideas) on StackOverflow. Someone suggested using DBPedia instead, since they have already done the Infobox extraction for you. I tried both, and somewhat surprisingly, manually parsing Infobox gave me better results in some cases, so I describe both approaches below.

Data and Initial Setup

The data is a small (tiny) XML dump of 24 Wikipedia pages about famous people, US Presidents, authors, actors and other historical figures. The markup is in MediaWiki format. I used Scala's native XML support and some character counting logic to extract and parse the Infoboxes from Wikipedia to a Map of name-value pairs.

To extract the text (ie the non-Infobox) portion from the Wikipedia data, I used the Bliki engine to convert the MediaWiki markup to HTML, then used the Jericho HTML parser to convert it to plain text. I needed to do this because of the richness of the Wiki format - direct conversion to text leaves some of the markup behind.

In order to access data on DBPedia, I used Apache Jena, the Java framework for all things Semantic. As I learnt at the Knowledge Engineering with Semantic Web Technologies course on OpenHPI, the Semantic Web landscape is full of great ideas and implementations which don't mesh together very well. Jena provides a common (albeit complex) API that attempts to unify all of these. It is composed of multiple sub-projects, but fortunately sbt (and friends) allow you to declare it with a single call to an uber-dependency (see below). In any case, all I used Jena for was to build a client to query DBPedia's SPARQL endpoint. As an aside, the call is an HTTP GET and the result can be streamed back as XML, so I could just have used plain Scala to do this, as this EBook by Mark Watson demonstrates.

For reference, here are the additions to my build.sbt that I had to do for this post.

resolvers ++= Seq(
  ...
  "Bliki Repo" at "http://gwtwiki.googlecode.com/svn/maven-repository/",
  ...
)

libraryDependencies ++= Seq(
  ...
  "info.bliki.wiki" % "bliki-core" % "3.0.19",
  "net.htmlparser.jericho" % "jericho-html" % "3.3",
  "edu.washington.cs.knowitall" % "reverb-core" % "1.4.0",
  "org.apache.jena" % "apache-jena-libs" % "2.12.1",
  ...
)

Parsing Infoboxes

As mentioned above, to understand which triples are interesting in the text, we need some indication of what/who the subject and object are in the triples, so we can filter on them and find interesting verbs. So the first step was to parse out the Infobox. We do this by isolating, for each page in the dump, the section of text bounded by the string "{{Infobox" and the matching "}}". Here is the code to parse the various components we care about - the titles, Infoboxes and texts for each page. The Infoboxes are returned as a List of Map of name-value pairs. Also, as mentioned above, the text is extracted in two steps - first to HTML then to plain text.

// Source: src/main/scala/com/mycompany/scalcium/utils/MediaWikiParser.scala
package com.mycompany.scalcium.utils

import java.io.File

import scala.collection.mutable.ArrayBuffer
import scala.util.control.Breaks._
import scala.xml.XML

import info.bliki.wiki.dump.WikiPatternMatcher
import info.bliki.wiki.model.WikiModel
import net.htmlparser.jericho.Source

class MediaWikiParser(xmlfile: File) {

  val InfoboxStartPattern = "{{Infobox"
    
  val titles = ArrayBuffer[String]()
  val texts = ArrayBuffer[String]()
  val infoboxes = ArrayBuffer[Map[String,String]]()
  
  def parse(): Unit = {
    val mediaWikiElement = XML.loadFile(xmlfile)
    (mediaWikiElement \ "page").map(pageElement => {
      val title = (pageElement \ "title").text
      val text = (pageElement \ "revision" \ "text").text
      // parse out Infobox
      val infoboxStart = text.indexOf(InfoboxStartPattern)
      var bcount = 2
      var infoboxEnd = infoboxStart + InfoboxStartPattern.length()
      breakable {
        (infoboxEnd until text.length()).foreach(i => { 
          val c = text.charAt(i)
          var binc = 0
          if (c == '}') binc = -1
          else if (c == '{') binc = 1
          else binc = 0
          bcount += binc
          infoboxEnd = i
          if (bcount == 0) break
        })
      }
      if (infoboxStart >= 0) {
        addTitle(title)
        addInfobox(text.substring(infoboxStart, infoboxEnd))
        addText(text.substring(infoboxEnd + 1))
      }
    })
  }
  
  def addTitle(title: String): Unit = titles += title

  def addInfobox(ibtext: String): Unit = {
    val infobox = ibtext.split("\n")
      .map(line => {
        val pipePos = line.indexOf('|')
        val nvp = line.substring(pipePos + 1).split("=")
        if (nvp.length == 2) {
          val wpm = new WikiPatternMatcher(nvp(1).trim())
          (nvp(0).trim(), wpm.getPlainText())
        } 
        else ("None", "")
      })
      .filter(nvp => ! "None".equals(nvp._1))
      .toMap
    infoboxes += infobox
  }
  
  def addText(text: String): Unit = {
    // convert wiki text to HTML, then to plain text
    val htmltext = WikiModel.toHtml(text)
    val plaintext = new Source(htmltext).getTextExtractor().toString()
    texts += plaintext
  }
  
  def getTitles(): List[String] = titles.toList
  
  def getInfoboxes(): List[Map[String,String]] = infoboxes.toList
  
  def getTexts(): List[String] = texts.toList
}

For the DBPedia client, we query the spouse relationship for each entry using the code below. Note that the actual code is mostly boilerplate, similar to JDBC code.

// Source: src/main/scala/com/mycompany/scalcium/triples/DBPediaClient.scala
package com.mycompany.scalcium.triples

import scala.collection.mutable.ArrayBuffer

import com.hp.hpl.jena.query.Query
import com.hp.hpl.jena.query.QueryExecution
import com.hp.hpl.jena.query.QueryExecutionFactory
import com.hp.hpl.jena.query.QueryFactory
import com.hp.hpl.jena.rdf.model.Literal

class DBPediaClient(url: String = "http://dbpedia.org/sparql") {

  val sparqlQueryTemplate = """
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    PREFIX onto: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    
    SELECT ?label  WHERE {
      dbpedia:%s onto:%s ?o .
      ?o rdfs:label ?label .
    } LIMIT 100
  """

  def getObject(subj: String, verb: String): String = {
    val sparqlQuery = sparqlQueryTemplate
      .format(subj.replace(" ", "_"), verb)
    val query: Query = QueryFactory.create(sparqlQuery)
    val qexec: QueryExecution = QueryExecutionFactory.sparqlService(url, query)
    val results = qexec.execSelect()
    val objs = ArrayBuffer[String]()
    while (results.hasNext()) {
      val qsol = results.next()
      val literal = qsol.get("label").asInstanceOf[Literal]
      if ("en".equals(literal.getLanguage())) 
        objs += literal.getLexicalForm()
    }
    if (objs.isEmpty) null else objs.head
  }
}

The JUnit test below demonstrates the use of my MediaWikiParser and DBPediaClient to extract information directly from Infoboxes and from DBPedia resource pages respectively. The people list is the list of titles that were extracted from the Wikipedia XML dump.

// Source: src/test/scala/com/mycompany/scalcium/triples/DBPediaClientTest.scala
package com.mycompany.scalcium.triples

import java.io.File

import scala.collection.mutable.ArrayBuffer
import scala.util.matching.Regex

import org.junit.Test

import com.mycompany.scalcium.utils.MediaWikiParser

class DBPediaClientTest {

  val people = List("Andrew Johnson", "Edgar Allan Poe", 
      "Desi Arnaz", "Elvis Presley", "Albert Camus", 
      "Arthur Miller", "Boris Yeltsin", "Ernest Hemingway", 
      "Benjamin Franklin", "Bill Oddie", "Abraham Lincoln", 
      "Billy Crystal", "Bill Clinton", "Alfonso V of Aragon", 
      "Dwight D. Eisenhower", "Colin Powell", "Cary Elwes", 
      "Alexander II of Russia", "Arnold Schwarzenegger", 
      "Christopher Columbus", "Barry Bonds", "Bill Gates", 
      "Elizabeth Garrett Anderson")
  
  @Test
  def testExtractSpouseFromInfobox(): Unit = {
    val inParens = new Regex("""\(.*?\)""")
    val xmlfile = new File("src/main/resources/wiki/small_wiki.xml")
    val parser = new MediaWikiParser(xmlfile)
    parser.parse()
    val triples = ArrayBuffer[(String,String,String)]()
    parser.getTitles().zip(parser.getInfoboxes())
      .map(ti => {
        val spouse = if (ti._2.contains("spouse")) ti._2("spouse") else null
        // clean up data received (situation dependent)
        if (spouse != null) {
          val spouses = inParens.replaceAllIn(spouse, "")
            .split("\\s{2,}")
            .map(_.trim)
          spouses.foreach(spouse => 
            triples += ((ti._1, "spouse", spouse)))
        } else triples += ((ti._1, "spouse", "NOTFOUND"))
      })
    triples.foreach(Console.println(_))
  }
  
  @Test
  def testExtractSpouseFromDBPedia(): Unit = {
 val triples = ArrayBuffer[(String,String,String)]()
    val client = new DBPediaClient()
    people.map(person => {
      val spouse = client.getObject(person, "spouse")
      if (spouse != null) {
        // clean up data received (situation dependent)
        val spouses = spouse.replace(',', ' ')
          .replace(')', ' ')
          .split('(')
          .map(s => s.trim())
          .foreach(s => triples += ((person, "spouse", s)))
      } else triples += ((person, "spouse", "NOTFOUND"))
    })
    triples.foreach(Console.println(_))
  }
}

Parsing Text

Finally, I use the ReVerb project to parse triples out of the text portion of each Wikipedia entry. The classical approach to parsing out triples is to do deep parsing of sentences (ie convert it to a tree) and then find structures that are rooted at verbs. From what I could glean from the the paper backing ReVerb, the approach used here is to match regular expressions of POS tags of input sentences. The authors claim that this results in higher quality triples being extracted, since it misses out "incoherent extractions". It is also faster since (time-consuming) deep parsing is not involved, so is more suitable for large amounts of data. Here is the code for my ReVerbClient. It takes in a block of text and uses its own tokenizer to break it into sentences.

// Source: src/main/scala/com/mycompany/scalcium/triples/ReVerbClient.scala
package com.mycompany.scalcium.triples

import java.io.StringReader

import scala.collection.JavaConversions._

import edu.washington.cs.knowitall.extractor.ReVerbExtractor
import edu.washington.cs.knowitall.normalization.BinaryExtractionNormalizer
import edu.washington.cs.knowitall.util.DefaultObjects

class ReVerbClient {

  def parse(text: String): List[(String,String,String)] = {
    DefaultObjects.initializeNlpTools()
    val reader = DefaultObjects.getDefaultSentenceReader(
      new StringReader(text))
    val extractor = new ReVerbExtractor()
    val normalizer = new BinaryExtractionNormalizer()
    reader.iterator()
      .flatMap(sent => extractor.extract(sent))
      .map(extract => {
        val normExtract = normalizer.normalize(extract)
        val subj = normExtract.getArgument1().toString()
        val verb = normExtract.getRelation().toString()
        val obj = normExtract.getArgument2().toString()
        (subj, verb, obj)
      })
      .toList
  }
}

This extracts a large number of raw triples. Since our objective is to return spouse triples, we do a considerable amount of filtering and post-processing on the raw triples to return triples that are easily consumable by downstream code. This custom logic is in the JUnit test below.

// Source: src/test/scala/com/mycompany/scalcium/triples/ReVerbClientTest.scala
package com.mycompany.scalcium.triples

import java.io.File

import scala.collection.mutable.ArrayBuffer
import scala.util.matching.Regex

import org.junit.Test

import com.mycompany.scalcium.utils.MediaWikiParser

class ReVerbClientTest {

  val SpouseWords = Set("wife", "husband")
  val BeWords = Set("is", "was", "be")
  val StopWords = Set("of")

  @Test
  def testExtractTriplesFromXml(): Unit = {
    val reverb = new ReVerbClient()
    val infile = new File("src/main/resources/wiki/small_wiki.xml")
    val parser = new MediaWikiParser(infile)
    parser.parse()
    parser.getTitles.zip(parser.getTexts())
      .map(tt => {
        val title = tt._1
        val triples = reverb.parse(tt._2)
        Console.println(">>> " + title)
        // clean up triple
        val resolvedTriples = triples.map(triple => {
          // resolve pronouns in subj, replace with title
          if (isPronoun(triple._1)) (title, triple._2, triple._3)
          else triple
        })
        val tripleBuf = ArrayBuffer[(String,String,String)]()
        // filter out where verb is (married, divorced)
        tripleBuf ++= resolvedTriples.filter(triple => {
          (triple._2.indexOf("married") > -1 || 
            triple._2.indexOf("divorced") > -1)
        })
        // filter out where subj or obj has (wife, husband)
        // and the verb is (is, was, be)
        tripleBuf ++= resolvedTriples.filter(triple => {
          val wordsInSubj = triple._1.split("\\s+").map(_.toLowerCase).toSet
          val wordsInVerb = triple._2.split("\\s+").map(_.toLowerCase).toSet
          val wordsInObj = triple._3.split("\\s+").map(_.toLowerCase).toSet
          (wordsInSubj.intersect(SpouseWords).size > 0 &&
            wordsInVerb.intersect(BeWords).size > 0 &&
            isProperNoun(triple._3)) ||
          (isProperNoun(triple._1) &&
            wordsInVerb.intersect(BeWords).size > 0 &&
            wordsInObj.intersect(SpouseWords).size > 0)
        })
        // extract patterns like "Bill and Hillary Clinton" from either
        // subj or obj
        tripleBuf ++= resolvedTriples.map(triple => {
            val names = title.split("\\s+")
            val pattern = new Regex("""%s (and|&) (\w+) %s"""
              .format(names(0), names(names.size - 1)))
            val sfName = pattern.findAllIn(triple._1).matchData
                .map(m => m.group(2)).toList ++
              pattern.findAllIn(triple._3).matchData
                .map(m => m.group(2)).toList
            if (sfName.size == 1)
              (title, "spouse", List(sfName.head, 
                names(names.size - 1)).mkString(" "))
            else ("x", "x", "x")
          })
          .filter(triple => "spouse".equals(triple._2))
        // post-process the triples
        val spouseTriples = tripleBuf.map(triple => {
            // fix incomplete name in subj and obj from title
            val subjHasTitle = containsTitle(triple._1, title)
            val objHasTitle = containsTitle(triple._3, title)
            val subj = if (subjHasTitle) title else triple._1
            val obj = if (objHasTitle && !subjHasTitle) title else triple._3
            val verb = if (subjHasTitle || objHasTitle) "spouse" else triple._2
            (subj, verb, obj)
          })
          .filter(triple => 
            // make sure both subj and obj are proper nouns
            (isProperNoun(triple._1) && 
              "spouse".equals(triple._2) &&
              isProperNoun(triple._3)))
          
        spouseTriples.foreach(Console.println(_))
      })
  }
  
  def isPronoun(s: String): Boolean =
    "she".equalsIgnoreCase(s) || "he".equalsIgnoreCase(s)
    
  def isProperNoun(s: String): Boolean = {
    val words = s.split("\\s+").filter(w => !StopWords.contains(w))
    val initCapWords = words.filter(w => w.charAt(0).isUpper == true)
    words.length == initCapWords.length
  }
  
  def containsTitle(s: String, title: String): Boolean = {
    val words = Set() ++ s.split("\\s+") 
    val names = Set() ++ title.split("\\s+")
    words.intersect(names).size > 0
  }
}

Results

Each of the three approaches result in overlapping sets of spouse triples, as you can see from the summary below. Note that the "spouse" relationship is bidirectional, so the number of triples extracted are actually double what is shown. As you can see from the results below, none of the three sources seem to be authoritative in the sense that you can depend on them absolutely, and at least a few of them seem to be incorrect (although that could be very likely a bad text parse on my part).

Title	Infobox	DBPedia	Text
Andrew Johnson	(Andrew Johnson, spouse, Eliza McCardle Johnson)	(Andrew Johnson, spouse, Eliza McCardle Johnson)	(Andrew Johnson, spouse, Eliza McCardle)
Edgar Allan Poe	(Edgar Allan Poe, spouse, Virginia Eliza Clemm Poe)	(Edgar Allan Poe, spouse, Virginia Eliza Clemm Poe)	(Edgar Allan Poe, spouse, Virginia Clemm) (Edgar Allan Poe, spouse, Virginia)
Desi Arnaz		(Desi Arnaz, spouse, Lucille Ball) (Desi Arnaz, spouse, Edith Mack Hirsch)	(Desi Arnaz, spouse, Lucille Ball)
Elvis Presley
Albert Camus			(Albert Camus, spouse, Simone Hie) (Albert Camus, spouse, Francine Faure)
Arthur Miller		(Arthur Miller, spouse, Mary Slattery) (Arthur Miller, spouse, Marilyn Monroe) (Arthur Miller, spouse, Inge Morath)	(Arthur Miller, spouse, Mary Slattery) (Arthur Miller, spouse, Marilyn Monroe)
Boris Yeltsin	(Boris Yeltsin, spouse, Naina Yeltsina)	(Boris Yeltsin, spouse, Naina Yeltsina)
Ernest Hemingway	(Ernest Hemingway, spouse, Pauline Pfeiffer)	(Ernest Hemingway, spouse, Elizabeth Hadley Richardson) (Ernest Hemingway, spouse, Pauline Pfeiffer) (Ernest Hemingway, spouse, Martha Gellhorn) (Ernest Hemingway, spouse, Mary Welsh Hemingway)	(Ernest Hemingway, spouse, Hadley Richardson)
Benjamin Franklin	(Benjamin Franklin, spouse, Deborah Read)	(Benjamin Franklin, spouse, Deborah Read)	(Benjamin Franklin, spouse, Richard Bache) (Benjamin Franklin, spouse, Deborah Franklin)
Bill Oddie		(Bill Oddie, spouse, Jean Hart) (Bill Oddie, spouse, Laura Beaumont)	(Bill Oddie, spouse, Laura Beaumont)
Abraham Lincoln	(Abraham Lincoln, spouse, Mary Todd Lincoln)	(Abraham Lincoln, spouse, Mary Todd Lincoln)
Billy Crystal		(Billy Crystal, spouse, Janice Goldfinger)	(Billy Crystal, spouse, Janice Goldfinger)
Bill Clinton		(Bill Clinton, spouse, Hillary Rodham Clinton)	(Bill Clinton, spouse, Hillary Clinton)
Alfonso V of Aragon	(Alfonso V of Aragon, spouse, Maria of Castile Queen of Aragon)	(Alfonso V of Aragon, spouse, Maria of Castile)
Dwight D. Eisenhower	(Dwight D. Eisenhower, spouse, Mamie Eisenhower)	(Dwight D. Eisenhower, spouse, Mamie Mamie Doud Eisenhower)	(Dwight D. Eisenhower, spouse, Mamie Geneva Doud of Denver)
Colin Powell		(Colin Powell, spouse, Alma Vivian Johnson Powell)	(Colin Powell, spouse, Alma) (Colin Powell, spouse, Alma Powell)
Cary Elwes		(Cary Elwes, spouse, Lisa Marie Kurbikoff) (Cary Elwes, spouse, 1 child)
Alexander II of Russia	(Alexander II of Russia, spouse, Maria Alexandrovna) (Alexander II of Russia, spouse, Marie of Hesse and by Rhine)	(Alexander II of Russia, spouse, Marie of Hesse and by Rhine)	(Alexander II of Russia, spouse, Princess Marie of Hesse)
Arnold Schwarzenegger			(Arnold Schwarzenegger, spouse, Maria Shriver)
Christopher Columbus	(Christopher Columbus, spouse, Filipa Moniz Perestrelo)	(Christopher Columbus, spouse, Filipa Moniz)	(Christopher Columbus, spouse, Filipa Moniz Perestrello)
Barry Bonds
Bill Gates		(Bill Gates, spouse, Melinda Gates)	(Bill Gates, spouse, Melinda Gates)
Elizabeth Garrett Anderson			(Elizabeth Garrett Anderson, spouse, Richard Garrett III)

This is what I have for this week. Hope you found it interesting. I am hoping to do something with SVO triples on a larger scale (not on Wikipedia) so this was a good way for me to check out the toolkit. It also introduced me to Apache Jena and SPARQL, something I've been meaning to do ever since I attended the OpenHPI course on Knowledge Engineering.

2 comments (moderated to prevent spam):

amit said...: This is really interesting. Thanks for the explanation. I will try to do this myself. I see there are a lot of things which can be solved through this.; 8/19/2016 12:08 AM
Sujit Pal said...: You are welcome Amit, glad you found it interesting, and indeed, there are many applications that would benefit from this kind of info.; 8/25/2016 11:26 AM

Salmon Run

Saturday, October 25, 2014

Extracting SVO Triples from Wikipedia

Data and Initial Setup

Parsing Infoboxes

Parsing Text

Results

2 comments (moderated to prevent spam):

Posts

Labels

Blogs I Read

About me

My Nerd Rating

Visitor Map

Contact Me