Monday, July 22, 2013

Dictionary Backed Named Entity Recognition with Lucene and LingPipe


Domain-specific Concept Search (such as ours) typically involves recognizing entities in the query and matching them up to entities that make sense in the particular domain - in our case, the entities correspond to concepts in our medical taxonomy. This does mean that we fall short when we try to solve for a slightly different use case such as doctor search.

By doctor search, I mean the search interface that health insurance sites provide for their members to find a doctor near them. Typical search patterns are by zip code, city, provider name (doctor or hospital), specialty, symptom, or some combination thereof. Our Named Entity Recognition (NER) system is very good at extracting medical concepts, such as specialties and symptoms, from queries, and we can draw useful inferences based on relationship graphs, but we don't have a good way of recognizing names and addresses. So the idea is to pass the query through an additional pre-processing step, a chain of entity extractors that identify and extract the different name and address fields.

This has been on my to-do list for a while. The doctor search project came about while I was busy with another one, and someone else built the product using different techniques, so a doctor search product already exists. This post is a proof of concept for the additional pre-processing NER idea mentioned above, and it remains to be seen whether its performance compares favorably with the existing product.

NERs can be regex-based, dictionary-based, or model-based. Regex-based NERs match incoming text against one or more predefined regular expressions. Dictionary-based NERs, also known as gazetteer-based NERs, match text against a dictionary of (term, category) pairs. Model-based NERs use a training set of (term, category) pairs to train a model, and then use the model to predict the category of new (potentially previously unseen) terms.

Since a doctor search application is closed, ie, the list of doctors is finite and all relevant attributes about them are known, and a search for an unknown doctor or doctor attribute is expected (and desired) to return no results, we only use regex-based and dictionary-based NERs. In this post, I describe three NERs built using the LingPipe API: one regex-based and two dictionary-based, one in-memory and one using Lucene.

LingPipe provides a RegExChunker, which takes a regular expression and a category string; we use it to build a Zip Code NER. For the dictionary NERs, we use LingPipe's ExactDictionaryChunker, which matches text against a MapDictionary of (term, category) entries. In the first case the MapDictionary is in-memory, and in the second we extend MapDictionary to use a pre-populated Lucene index (via Solr). The ExactDictionaryChunker implements the Aho-Corasick algorithm for scalable (O(n)) string matching. There is also an ApproximateDictionaryChunker, which returns matches within a given edit distance, but I haven't used it.


Since most of the classes are already provided by LingPipe, the only code we need to write is the interface by which our application (currently a JUnit test, but perhaps a custom QParser in the future) calls the NERs, and the custom MapDictionary subclass that uses a Solr index. These are shown in the code below:

package com.mycompany.solr4extras.ner

import scala.collection.JavaConversions.{asJavaIterator, asScalaIterator, asScalaSet, mapAsJavaMap}

import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.common.params.{CommonParams, MapSolrParams}

import com.aliasi.chunk.{Chunk, Chunker}
import com.aliasi.dict.{DictionaryEntry, MapDictionary}

/**
 * Pass in a list of Chunkers during construction, and then
 * call the chunk method with the text to be chunked. Returns
 * a set of Chunk objects of various types in the text.
 */
class Chunkers(val chunkers: List[Chunker]) {

  def chunk(text: String): Set[Chunk] = chunkers.
    map(chunker => chunker.chunk(text).chunkSet.toList).
    flatten.
    toSet[Chunk]

  def mkString(text: String, chunk: Chunk): String = {
    val pair = mkPair(text, chunk)
    pair._1 + "/" + pair._2
  }
  
  def mkPair(text: String, chunk: Chunk): (String,String) = 
    (text.substring(chunk.start(), chunk.end()), 
      chunk.`type`())
}

/**
 * Custom MapDictionary backed by a Solr index. This is 
 * used by our Dictionary based NER (ExactMatchDictionaryChunker)
 * for large dictionaries of entity names. Dictionary entries
 * are stored as (category, value) pairs in Solr fields
 * (nercat, nerval).
 */
class SolrMapDictionary(
    val solr: SolrServer, val nrows: Int, val category: String) 
    extends MapDictionary[String] {

  override def addEntry(entry: DictionaryEntry[String]) = {} 
  
  override def iterator(): 
      java.util.Iterator[DictionaryEntry[String]] = {
    phraseEntryIt("*:*")
  }
  
  override def phraseEntryIt(phrase: String): 
      java.util.Iterator[DictionaryEntry[String]] = {
    val params = new MapSolrParams(Map(
      CommonParams.Q -> phrase,
      CommonParams.FQ -> ("nercat:" + category),
      CommonParams.FL -> "nerval",
      CommonParams.START -> "0", 
      CommonParams.ROWS -> String.valueOf(nrows)))
    val rsp = solr.query(params)
    rsp.getResults().iterator().
      toList.
      map(doc =>  new DictionaryEntry[String](
        doc.getFieldValue("nerval").asInstanceOf[String], 
        category, 1.0D)).
      iterator
  }
}

The SolrMapDictionary needs two fields, nercat and nerval, which hold the NER category and the NER value respectively. These need to be defined in the schema.xml file like so:

   <field name="nercat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="nerval" type="text_general" indexed="true" stored="true"/>

The Chunkers class is constructed with a List of Chunkers, one for each kind of entity we want to recognize, and its chunk() method is called with the text to be chunked. The JUnit test below shows example calls for three of our NERs.

package com.mycompany.solr4extras.ner

import scala.Array.canBuildFrom

import org.apache.solr.client.solrj.impl.{ConcurrentUpdateSolrServer, HttpSolrServer}
import org.junit.{After, Assert, Test}

import com.aliasi.chunk.RegExChunker
import com.aliasi.dict.{DictionaryEntry, ExactDictionaryChunker, MapDictionary}
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory

class ChunkersTest {

  val solrUrl = "http://localhost:8983/solr/"
  val solrWriter = new ConcurrentUpdateSolrServer(solrUrl, 10, 1)
  val solrReader = new HttpSolrServer(solrUrl)
  
  val texts = Array[String]( 
    "Cardiologist Berkeley 94701 Dr Chen",
    "Herpetologist 94015-1234",
    "Cost of care $1000 wtf",
    "San Francisco points of interest",
    "Liberty Island, New York"
  )
  val cities = Array[String]("Berkeley", "San Francisco", "New York")
  val specialists = Array[String]("cardiologist", "herpetologist")
  
  @After def teardown(): Unit = {
    solrWriter.shutdown()
    solrReader.shutdown()
  }

  @Test def testRegexChunking(): Unit = {
    val zipCodeChunker = new RegExChunker(
      "\\d{5}-\\d{4}|\\d{5}", "zipcode", 1.0D)
    val chunkers = new Chunkers(List(zipCodeChunker))
    val expected = Array[Int](1, 1, 0, 0, 0)
    val actuals = texts.map(text => {
      val chunkSet = chunkers.chunk(text)
      chunkSet.foreach(chunk => 
        Console.println(chunkers.mkString(text, chunk)))
      chunkSet.size
    })
    Assert.assertArrayEquals(expected, actuals)
  }
  
  @Test def testInMemoryDictChunking(): Unit = {
    val dict = new MapDictionary[String]()
    specialists.foreach(specialist => 
      dict.addEntry(
      new DictionaryEntry[String](specialist, "specialist", 1.0D)))
    val specialistChunker = new ExactDictionaryChunker(
      dict, IndoEuropeanTokenizerFactory.INSTANCE, false, false)   
    val chunkers = new Chunkers(List(specialistChunker))
    val expected = Array[Int](1, 1, 0, 0, 0)
    val actuals = texts.map(text => {
      val chunkSet = chunkers.chunk(text)
      chunkSet.foreach(chunk => 
        Console.println(chunkers.mkString(text, chunk)))
      chunkSet.size
    })
    Assert.assertArrayEquals(expected, actuals)
  }
  
  @Test def testSolrDictChunking(): Unit = {
    val dict = new SolrMapDictionary(solrReader, 18000, "city")
    val cityChunker = new ExactDictionaryChunker(
      dict, IndoEuropeanTokenizerFactory.INSTANCE, false, false)
    val chunkers = new Chunkers(List(cityChunker))
    val expected = Array[Int](1, 0, 0, 1, 2)
    val actuals = texts.map(text => {
      val chunkSet = chunkers.chunk(text)
      chunkSet.foreach(chunk => 
        Console.println(chunkers.mkString(text, chunk)))
      chunkSet.size
    })
    Assert.assertArrayEquals(expected, actuals)
  }
}

The results from the run are as follows (edited slightly for readability). They show the snippets of the input text that were matched (the entities or chunks) and the entity category each was matched to. As you can see, there is a false positive - Liberty is being considered a city.

94701/zipcode
94015-1234/zipcode

Cardiologist/specialist
Herpetologist/specialist

Berkeley/city
San Francisco/city
Liberty/city
New York/city

While the SolrMapDictionary returns the expected results, I found that it calls the iterator() method once for the entire run, and thus needs to have the full dataset of 17k+ cities loaded in one shot - presumably because ExactDictionaryChunker compiles the whole dictionary into its Aho-Corasick trie up front. This kind of defeats the purpose of having an external store - I would have expected the phraseEntryIt() method to be called with specific phrases instead. I haven't had a chance to investigate this in depth yet; I will update the post when I find out more.

Wednesday, July 17, 2013

Porting Payloads to Solr4


This post discusses porting our Payload code, originally written against Solr/Lucene 3.2.0, to Solr/Lucene 4.3.0. The original code is described here and here. It also discusses implementing support for query-time boosting of Payload queries, which we were unable to implement earlier because the Solr/Lucene 3.x API was too restrictive.

Preparation


I had already built an application using Solr 4.0 some months before, so it was simply a matter of downloading the 4.3.0 distribution from here and building the solr.war. To do this, you need to execute the following commands:

sujit@cyclone:opt$ # expand the distribution tarball locally
sujit@cyclone:opt$ tar xvzf ~/Downloads/solr-4.3.0.tar.gz
sujit@cyclone:opt$ cd solr-4.3.0
sujit@cyclone:solr-4.3.0$ # download ivy (if you haven't already)
sujit@cyclone:solr-4.3.0$ ant ivy-bootstrap jar
sujit@cyclone:solr-4.3.0$ # build the solr.war and copy to example/webapps
sujit@cyclone:solr-4.3.0$ cd solr
sujit@cyclone:solr$ ant dist-war
sujit@cyclone:solr$ cp dist/solr-4.3-SNAPSHOT.war example/webapps/solr.war
sujit@cyclone:solr$ # start SOLR
sujit@cyclone:solr$ cd example
sujit@cyclone:example$ java -jar start.jar

On the application side, I needed to update the references to the solr-core and solr-solrj libraries from 4.1 to 4.3.0. Additionally, I had to add the Maven Restlet repository, because that's apparently the only repository that has the restlet-parent POM correctly referenced (I tried a few other repositories and this is the only one that worked).

// Source: build.sbt
resolvers ++= Seq(
  "Maven Restlet" at "http://maven.restlet.org",
  ...
)

libraryDependencies ++= Seq(
  "org.apache.solr" % "solr-core" % "4.3.0",
  "org.apache.solr" % "solr-solrj" % "4.3.0",
  ...
  )

At this point, the 4.0 code compiled and ran fine with the updated build.sbt, so I was ready to start porting my old 3.2.0 Payload code over to this project.

Background


Before going further, it may help to understand our usage of Payloads. Payloads are a generic Lucene feature that allows you to store little bits of data (the payload) with each occurrence of a term in the index, and (obviously) different applications use it in different ways. In our case, part of our indexing pipeline consists of annotating documents with concept IDs (node IDs) from our taxonomy graph representing relationships between different kinds of medical entities, such as diseases, drugs, treatments, symptoms, etc. One view of the concepts associated with a document is the "concept map", ie a list of (conceptID, score) pairs - for example, a (hypothetical) concept map {123456: 94.0, 234567: 87.0} would be indexed into a payload field as "123456|94.0 234567|87.0".

A similar annotation process happens on the query string, and one class of our search applications works (very well, if I may add) just by making boolean queries on combinations of concept IDs.

Scoring for such queries is very simple. The score for a document matching a single concept is that concept's score in the document. The score for an AND query is the sum of the constituent concept scores in the document, and the score for an OR query is the sum of the scores of the query concepts found in the document. For example, a document with concept scores {A: 94, B: 99} scores 193 for both A AND B and A OR B.

Initial Port


The initial port is just a straight port of the old code described in the two blog posts, from Java to Scala. There is not much to say here, so I just show the code and describe the basic configuration. However, this code does not allow you to boost parts of such queries, so further down I describe the changes I needed to support that.

First, the PayloadSimilarity, which just extends DefaultSimilarity and returns the payload score if the field is a Payload field, and 1 otherwise. So this PayloadSimilarity can be used for both regular and Payload fields. The one major change here is that the payload argument to scorePayload() is now a BytesRef instead of a byte array, and the change to handle it is described in Tech Collage: Using Payloads with Solr (4.x).

// Source: src/main/scala/com/mycompany/solr4extras/payloads/PayloadSimilarity.scala
package com.mycompany.solr4extras.payloads

import org.apache.lucene.analysis.payloads.PayloadHelper
import org.apache.lucene.search.similarities.DefaultSimilarity
import org.apache.lucene.util.BytesRef

class PayloadSimilarity extends DefaultSimilarity {

  override def scorePayload(doc: Int, start: Int, end: Int, 
      payload: BytesRef): Float = {
    if (payload == null) 1.0F
    else PayloadHelper.decodeFloat(payload.bytes, payload.offset)
  }
}

We add our Payload field, named "cscores" (for concept scores), to the schema.xml file (in example/solr/collection1/conf). We also configure our new Similarity implementation.

....
   <field name="cscores" type="payloads" indexed="true" stored="true"/>
   ....
   <similarity class="com.mycompany.solr4extras.payloads.PayloadSimilarity"/>
   ....
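
The "payloads" field type referenced above is the one shipped with the Solr example schema.xml. If your schema does not already define it, a definition along these lines (whitespace tokens, each optionally followed by a "|"-delimited float payload) should work:

<!-- field type for delimited payloads, eg "123456|94.0 234567|87.0" -->
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldtype>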

We then need a new query type that takes queries written using standard Lucene query syntax (ie field:value) and converts them to PayloadTermQuery objects, which use the concept scores for scoring. To do this, we use the (relatively) new flexible Lucene query parser. Again, there is no significant difference from the 3.2.0 Java version, except for the addition of the setBuilder call mapping BoostQueryNode to BoostQueryNodeBuilder in PayloadQueryTreeBuilder. Adding this mapping enables the boost to be recognized.

// Source: src/main/scala/com/mycompany/solr4extras/payloads/PayloadQParserPlugin.scala
package com.mycompany.solr4extras.payloads

import org.apache.lucene.index.Term
import org.apache.lucene.queryparser.flexible.core.QueryParserHelper
import org.apache.lucene.queryparser.flexible.core.nodes.{BoostQueryNode, FieldQueryNode, QueryNode}
import org.apache.lucene.queryparser.flexible.standard.builders.{BoostQueryNodeBuilder, StandardQueryBuilder, StandardQueryTreeBuilder}
import org.apache.lucene.queryparser.flexible.standard.config.StandardQueryConfigHandler
import org.apache.lucene.queryparser.flexible.standard.parser.StandardSyntaxParser
import org.apache.lucene.queryparser.flexible.standard.processors.StandardQueryNodeProcessorPipeline
import org.apache.lucene.search.Query
import org.apache.lucene.search.payloads.{AveragePayloadFunction, PayloadTermQuery}
import org.apache.solr.common.params.SolrParams
import org.apache.solr.common.util.NamedList
import org.apache.solr.request.SolrQueryRequest
import org.apache.solr.search.{QParser, QParserPlugin}

class PayloadQParserPlugin extends QParserPlugin {

  override def init(args: NamedList[_]): Unit = {}
  
  override def createParser(qstr: String, localParams: SolrParams, 
      params: SolrParams, req: SolrQueryRequest): QParser =  
    new PayloadQParser(qstr, localParams, params, req)
}

class PayloadQParser(qstr: String, localParams: SolrParams,
    params: SolrParams, req: SolrQueryRequest) 
    extends QParser(qstr, localParams, params, req) {
  
  req.getSearcher().setSimilarity(new PayloadSimilarity())
  
  override def parse(): Query = {
    val parser = new PayloadQueryParser()
    parser.parse(qstr, null).asInstanceOf[Query]
  }
}

class PayloadQueryParser extends QueryParserHelper(
    new StandardQueryConfigHandler(), 
    new StandardSyntaxParser(), 
    new StandardQueryNodeProcessorPipeline(null), 
    new PayloadQueryTreeBuilder()) {
}

class PayloadQueryTreeBuilder() extends StandardQueryTreeBuilder {
  
  setBuilder(classOf[FieldQueryNode], new PayloadQueryNodeBuilder())
  setBuilder(classOf[BoostQueryNode], new BoostQueryNodeBuilder())
}

class PayloadQueryNodeBuilder extends StandardQueryBuilder {
  
  override def build(queryNode: QueryNode): PayloadTermQuery = {
    val node = queryNode.asInstanceOf[FieldQueryNode]
    val fieldName = node.getFieldAsString()
    val payloadQuery = new PayloadTermQuery(
      new Term(fieldName, node.getTextAsString()), 
      new AveragePayloadFunction(), false)
    payloadQuery
  }
}

We configure this in our solrconfig.xml file (also in example/solr/collection1/conf) with the following XML snippet.

...
  <queryParser name="payloadQueryParser" 
    class="com.mycompany.solr4extras.payloads.PayloadQParserPlugin"/>
  <requestHandler name="/cselect" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">payloadQueryParser</str>
    </lst>
  </requestHandler>
  ...

To test the code changes so far, we build our code and drop the application JAR file and the scala-compiler and scala-library JARs into the collection's lib directory (example/solr/collection1/lib). Note that you will have to create the lib directory the first time.

sujit@cyclone:solr4-extras$ sbt package
sujit@cyclone:solr4-extras$ cp target/scala-2.9.2/solr4-extras_2.9.2-1.0.jar \
> /opt/solr-4.3.0/solr/example/solr/collection1/lib
sujit@cyclone:solr4-extras$ cp /opt/scala-2.9.2/lib/scala-compiler.jar \
> /opt/solr-4.3.0/solr/example/solr/collection1/lib
sujit@cyclone:solr4-extras$ cp /opt/scala-2.9.2/lib/scala-library.jar \
> /opt/solr-4.3.0/solr/example/solr/collection1/lib

Next, we write a simple JUnit test to populate the index with 10,000 documents, each of which contains up to ten concepts named "A" to "Z", with random scores between 0 and 99. The titles reflect the concept-score makeup of the document. Once the index is populated, the test hits it with various representative payload queries via SolrJ.

// Source: src/test/scala/com/mycompany/solr4extras/payloads/PayloadTest.scala
package com.mycompany.solr4extras.payloads

import scala.collection.JavaConversions.mapAsJavaMap
import scala.collection.mutable.{ArrayBuffer, Set}
import scala.util.Random

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer
import org.apache.solr.common.SolrInputDocument
import org.apache.solr.common.params.{CommonParams, MapSolrParams}
import org.junit.{After, Assert, Before, Test}

class PayloadTest {
  
  val solr = new ConcurrentUpdateSolrServer(
    "http://localhost:8983/solr/", 10, 1)

  @Before def setup(): Unit = {
    solr.deleteByQuery("*:*")
    solr.commit()
    val randomizer = new Random(42)
    val concepts = ('A' to 'Z').map(_.toString)
    val ndocs = 10000
    for (i <- 0 until ndocs) {
      val nconcepts = randomizer.nextInt(10)
      val alreadySeen = Set[String]()
      val cscores = ArrayBuffer[(String,Float)]()
      for (j <- 0 until nconcepts) {
        val concept = concepts(randomizer.nextInt(26))
        if (! alreadySeen.contains(concept)) {
          cscores += ((concept, randomizer.nextInt(100)))
          alreadySeen += concept
        }
      }
      val title = "{" + 
        cscores.sortWith((a, b) => a._1 < b._1).
          map(cs => cs._1 + ":" + cs._2).
          mkString(", ") + 
          "}"
      val payloads = cscores.map(cs => cs._1 + "|" + cs._2).
        mkString(" ")
      Console.println(payloads)
      val doc = new SolrInputDocument()
      doc.addField("id", i)
      doc.addField("title", title)
      doc.addField("cscores", payloads)
      solr.add(doc)
    }
    solr.commit()
  }
  
  @After def teardown(): Unit = solr.shutdown()

  @Test def testAllDocsQuery(): Unit = {
    val params = new MapSolrParams(Map(
      (CommonParams.Q -> "*:*")))   
    val rsp = solr.query(params)
    Assert.assertEquals(10000, rsp.getResults().getNumFound()) 
  }
  
  @Test def testSingleConceptQuery(): Unit = {
    runQuery("cscores:A")
  }
  
  @Test def testAndConceptQuery(): Unit = {
    runQuery("cscores:A AND cscores:B")
  }
  
  @Test def testOrConceptQuery(): Unit = {
    runQuery("cscores:A OR cscores:B")
  }

  @Test def testBoostedOrConceptQuery(): Unit = {
    runQuery("cscores:A^10.0 OR cscores:B")
  }

  @Test def testBoostedAndConceptQuery(): Unit = {
    runQuery("cscores:A^10.0 AND cscores:B")
  }

  def runQuery(q: String): Unit = {
    val params = new MapSolrParams(Map(
        CommonParams.QT -> "/cselect",
        CommonParams.Q -> q,
        CommonParams.FL -> "*,score"))
    val rsp = solr.query(params)
    val dociter = rsp.getResults().iterator()
    Console.println("==== Query %s ====".format(q))
    while (dociter.hasNext()) {
      val doc = dociter.next()
      Console.println("%f: (%s) %s".format(
        doc.getFieldValue("score"), 
        doc.getFieldValue("id"), 
        doc.getFieldValue("title")))
    }
  }
}

Running this JUnit test against the initial port of the code results in the output below. Each block corresponds to the first 10 documents returned for the query in the header. The first field is the score of the document for the query, the number in parentheses is the docID, and the string is the title, which is built up from the payload (concept, score) pairs in the cscores field. You can see from the output that:
  • Ordering is by descending order of payload score, as expected.
  • Scores for A OR B and A AND B are also as expected.
  • Boosting does not seem to have any effect.

==== Query cscores:A ====
99.000000: (549) [{A:99.0, C:77.0, G:12.0, K:46.0, L:49.0, T:37.0, V:59.0, W:53.0, X:11.0}]
99.000000: (905) [{A:99.0, J:3.0}]
99.000000: (1171) [{A:99.0, G:87.0, K:61.0, O:83.0, U:1.0, W:23.0}]
99.000000: (1756) [{A:99.0, L:35.0, Q:56.0, X:65.0}]
99.000000: (1818) [{A:99.0}]
99.000000: (1884) [{A:99.0, E:32.0, J:5.0, L:28.0, P:89.0, S:90.0}]
99.000000: (2212) [{A:99.0, D:68.0, J:9.0, O:51.0, R:16.0, T:26.0, W:60.0, Z:68.0}]
99.000000: (2634) [{A:99.0, D:74.0, G:73.0, I:11.0, Q:52.0, S:55.0}]
99.000000: (4131) [{A:99.0, C:22.0, D:27.0, J:7.0, L:37.0, N:81.0, O:50.0, Q:27.0, X:4.0}]
99.000000: (4552) [{A:99.0, B:21.0, L:81.0, M:34.0, P:25.0, S:40.0}]

==== Query cscores:A AND cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A OR cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A^10.0 OR cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A^10.0 AND cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

Looking at the explanation for the first result of the last query (from the form on http://localhost:8983/solr/#/collection1/query) gives this, which confirms that boosting has absolutely no effect.

193.0 = (MATCH) sum of:
  94.0 = (MATCH) btq(includeSpanScore=false), result of:
    94.0 = AveragePayloadFunction.docScore()
  99.0 = (MATCH) btq(includeSpanScore=false), result of:
    99.0 = AveragePayloadFunction.docScore()

Changes to Accommodate Boosting


This had me stumped for a while until, thanks to Google, I chanced upon a conversation I had with myself on the Lucene mailing list. That was my previous attempt at understanding Payload scoring, where I found that boosting (I was looking at index-time boosting at that time) could be enabled by setting the includeSpanScore parameter of the PayloadTermQuery constructor to true.
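
The change itself is a one-liner in the PayloadQueryNodeBuilder shown earlier; everything else stays the same:

// In PayloadQueryNodeBuilder.build(): set includeSpanScore to true so
// that query-time boosts (via the span score) are factored in.
val payloadQuery = new PayloadTermQuery(
  new Term(fieldName, node.getTextAsString()),
  new AveragePayloadFunction(), true)  // was false in the initial port

Here is the output of the JUnit test for an unboosted AND query versus a boosted AND query.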

==== Query cscores:A AND cscores:B ====
120.944977: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
117.237335: (2856) [{A:98.0, B:66.0, C:57.0, G:89.0}]
114.721512: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
110.805031: (8931) [{A:74.0, B:50.0}]
108.161667: (9212) [{A:72.0, B:79.0, Y:6.0}]
107.296494: (5011) [{A:84.0, B:66.0, H:91.0, L:33.0}]
106.956329: (633) [{A:52.0, B:97.0, L:41.0, X:72.0}]
103.960968: (1654) [{A:61.0, B:84.0, Y:51.0}]
103.083862: (6237) [{A:74.0, B:70.0, J:73.0, W:64.0}]
102.593079: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]

==== Query cscores:A^10.0 AND cscores:B ====
108.256470: (7481) [{A:86.0, B:28.0}]
105.056259: (2856) [{A:98.0, B:66.0, C:57.0, G:89.0}]
99.181076: (8931) [{A:74.0, B:50.0}]
91.358391: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
91.010086: (5011) [{A:84.0, B:66.0, H:91.0, L:33.0}]
89.514427: (1757) [{A:88.0, B:12.0, H:60.0, K:69.0}]
89.097290: (6001) [{A:96.0, B:54.0, E:51.0, Q:55.0, Y:29.0}]
83.368156: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
81.385170: (6237) [{A:74.0, B:70.0, J:73.0, W:64.0}]
80.518898: (5436) [{A:64.0, B:2.0}]

As we can see, boosting does seem to have an effect now. However, the scores are no longer the sum of the individual payload scores as before. To understand the numbers, we once again look at the explanation for the first document in the last query, ie for "cscores:A AND cscores:B^10.0".

108.25647 = (MATCH) sum of:
  0.74635303 = (MATCH) btq, product of:
    0.12439217 = weight(cscores:A in 8860) [PayloadSimilarity], result of:
      0.12439217 = score(doc=8860,freq=0.5 = phraseFreq=0.5), product of:
        0.09868614 = queryWeight, product of:
          2.8521466 = idf(docFreq=1568, maxDocs=10000)
          0.034600653 = queryNorm
        1.2604827 = fieldWeight in 8860, product of:
          0.70710677 = tf(freq=0.5), with freq of:
            0.5 = phraseFreq=0.5
          2.8521466 = idf(docFreq=1568, maxDocs=10000)
          0.625 = fieldNorm(doc=8860)
    6.0 = AveragePayloadFunction.docScore()
  107.51012 = (MATCH) btq, product of:
    1.2648249 = weight(cscores:B^10.0 in 8860) [PayloadSimilarity], result of:
      1.2648249 = score(doc=8860,freq=0.5 = phraseFreq=0.5), product of:
        0.9951186 = queryWeight, product of:
          10.0 = boost
          2.8760111 = idf(docFreq=1531, maxDocs=10000)
          0.034600653 = queryNorm
        1.2710292 = fieldWeight in 8860, product of:
          0.70710677 = tf(freq=0.5), with freq of:
            0.5 = phraseFreq=0.5
          2.8760111 = idf(docFreq=1531, maxDocs=10000)
          0.625 = fieldNorm(doc=8860)
    85.0 = AveragePayloadFunction.docScore()

The last time, I had attempted to handle this by overriding most of the methods to return 1.0F, but was unable to override the value of the fieldNorm. This time, owing to some heavy-duty refactoring of the Similarity subsystem by the Solr/Lucene developers, I was able to override the fieldNorm by overriding the decodeNormValue() method from TFIDFSimilarity.

package com.mycompany.solr4extras.payloads

import org.apache.lucene.analysis.payloads.PayloadHelper
import org.apache.lucene.index.FieldInvertState
import org.apache.lucene.search.similarities.DefaultSimilarity
import org.apache.lucene.util.BytesRef

class PayloadSimilarity extends DefaultSimilarity {

  override def coord(overlap: Int, maxOverlap: Int) = 1.0F
  
  override def queryNorm(sumOfSquaredWeights: Float) = 1.0F
  
  override def lengthNorm(state: FieldInvertState) = state.getBoost()
  
  override def tf(freq: Float) = 1.0F
  
  override def sloppyFreq(distance: Int) = 1.0F
  
  override def scorePayload(doc: Int, start: Int, end: Int, 
      payload: BytesRef): Float = {
    if (payload == null) 1.0F
    else PayloadHelper.decodeFloat(payload.bytes, payload.offset)
  }
  
  override def idf(docFreq: Long, numDocs: Long) = 1.0F

  override def decodeNormValue(b: Byte) = 1.0F
}

Once this was done, the explanation for the first result in our last query looks like this:

1070.0 = (MATCH) sum of:
  980.0 = (MATCH) btq, product of:
    10.0 = weight(cscores:A^10.0 in 2154) [PayloadSimilarity], result of:
      10.0 = score(doc=2154,freq=1.0 = phraseFreq=1.0), product of:
        10.0 = queryWeight, product of:
          10.0 = boost
          1.0 = idf(docFreq=1568, maxDocs=10000)
          1.0 = queryNorm
        1.0 = fieldWeight in 2154, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = phraseFreq=1.0
          1.0 = idf(docFreq=1568, maxDocs=10000)
          1.0 = fieldNorm(doc=2154)
    98.0 = AveragePayloadFunction.docScore()
  90.0 = (MATCH) btq, product of:
    1.0 = weight(cscores:B in 2154) [PayloadSimilarity], result of:
      1.0 = fieldWeight in 2154, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = phraseFreq=1.0
        1.0 = idf(docFreq=1531, maxDocs=10000)
        1.0 = fieldNorm(doc=2154)
    90.0 = AveragePayloadFunction.docScore()

And here is the output for our entire run:

==== Query cscores:A ====
99.000000: (549) [{A:99.0, C:77.0, G:12.0, K:46.0, L:49.0, T:37.0, V:59.0, W:53.0, X:11.0}]
99.000000: (905) [{A:99.0, J:3.0}]
99.000000: (1171) [{A:99.0, G:87.0, K:61.0, O:83.0, U:1.0, W:23.0}]
99.000000: (1756) [{A:99.0, L:35.0, Q:56.0, X:65.0}]
99.000000: (1818) [{A:99.0}]
99.000000: (1884) [{A:99.0, E:32.0, J:5.0, L:28.0, P:89.0, S:90.0}]
99.000000: (2212) [{A:99.0, D:68.0, J:9.0, O:51.0, R:16.0, T:26.0, W:60.0, Z:68.0}]
99.000000: (2634) [{A:99.0, D:74.0, G:73.0, I:11.0, Q:52.0, S:55.0}]
99.000000: (4131) [{A:99.0, C:22.0, D:27.0, J:7.0, L:37.0, N:81.0, O:50.0, Q:27.0, X:4.0}]
99.000000: (4552) [{A:99.0, B:21.0, L:81.0, M:34.0, P:25.0, S:40.0}]

==== Query cscores:A AND cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A OR cscores:B ====
193.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
191.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
188.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
188.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]
184.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
183.000000: (6046) [{A:88.0, B:95.0, G:45.0, I:83.0, L:15.0, Y:93.0}]
183.000000: (8069) [{A:85.0, B:98.0, C:47.0, M:32.0, P:57.0}]
171.000000: (4146) [{A:80.0, B:91.0, D:60.0, H:49.0, U:20.0, Z:2.0}]
168.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
167.000000: (2568) [{A:72.0, B:95.0, E:22.0, F:98.0, M:64.0, N:95.0, V:72.0}]

==== Query cscores:A^10.0 OR cscores:B ====
1070.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
1057.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
1051.000000: (9290) [{A:99.0, B:61.0, F:16.0, J:27.0, M:34.0, P:36.0}]
1046.000000: (2856) [{A:98.0, B:66.0, C:57.0, G:89.0}]
1045.000000: (2417) [{A:98.0, B:65.0, F:23.0, I:80.0, J:67.0, R:86.0, T:7.0, W:59.0}]
1041.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
1039.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
1037.000000: (2548) [{A:97.0, B:67.0, K:63.0, Q:36.0, V:12.0, W:39.0, Z:50.0}]
1028.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
1025.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]

==== Query cscores:A^10.0 AND cscores:B ====
1070.000000: (2154) [{A:98.0, B:90.0, C:50.0, E:24.0, L:65.0, U:92.0, W:36.0, Y:56.0}]
1057.000000: (2891) [{A:97.0, B:87.0, D:20.0, L:25.0, P:50.0, Q:33.0, V:71.0, X:57.0}]
1051.000000: (9290) [{A:99.0, B:61.0, F:16.0, J:27.0, M:34.0, P:36.0}]
1046.000000: (2856) [{A:98.0, B:66.0, C:57.0, G:89.0}]
1045.000000: (2417) [{A:98.0, B:65.0, F:23.0, I:80.0, J:67.0, R:86.0, T:7.0, W:59.0}]
1041.000000: (7961) [{A:97.0, B:71.0, I:58.0, Q:45.0, T:20.0, U:26.0, W:20.0, X:72.0}]
1039.000000: (8297) [{A:94.0, B:99.0, F:30.0, M:26.0, Q:3.0}]
1037.000000: (2548) [{A:97.0, B:67.0, K:63.0, Q:36.0, V:12.0, W:39.0, Z:50.0}]
1028.000000: (2769) [{A:93.0, B:98.0, L:61.0, P:14.0, S:88.0, Y:4.0, Z:66.0}]
1025.000000: (6309) [{A:93.0, B:95.0, H:89.0, K:98.0, M:46.0, X:67.0}]

As you can see, we have achieved both our objectives, ie, an intuitive scoring system based directly on the concept scores, as well as support for boosting. The boosting is also quite intuitive: boosting a query term by n multiplies its concept score by n. In the first result above, for example, cscores:A^10.0 contributes 10 x 98 = 980 and cscores:B contributes 90, for a total score of 1070.

At this point, though, our PayloadSimilarity is completely customized for Payload queries. It can no longer do double duty for non-Payload queries as it was doing before. Fortunately, the Lucene/Solr team now provides a PerFieldSimilarityWrapper class, with which you can configure Similarity implementations for individual fields (similar to the PerFieldAnalyzerWrapper in earlier versions). My Similarity wrapper looks like this:

// Source: src/main/scala/com/mycompany/solr4extras/payloads/MyCompanySimilarityWrapper.scala
package com.mycompany.solr4extras.payloads

import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper
import org.apache.lucene.search.similarities.Similarity
import org.apache.lucene.search.similarities.DefaultSimilarity

class MyCompanySimilarityWrapper extends PerFieldSimilarityWrapper {

  override def get(fieldName: String): Similarity = fieldName match {
    case "payloads"|"cscores" => new PayloadSimilarity()
    case _ => new DefaultSimilarity()
  }
}

And this Similarity implementation replaces the previous one we configured in schema.xml.

...
  <similarity 
    class="com.mycompany.solr4extras.payloads.MyCompanySimilarityWrapper"/>
  ...

And that's pretty much it. It's been a long post; thanks for staying with me this far. The current setup will use the PayloadSimilarity for payload fields and the DefaultSimilarity for non-payload fields. Payload queries are requested via the /cselect request handler (which sets defType to payloadQueryParser), and are ranked by the sum of payload scores for matched concepts. Boosting is equally intuitive and follows a simple multiplicative model. Lastly, the Scala code is terse, but hopefully you find it no harder to read than the Java equivalent.

Of course, as I mentioned before, this is just one implementation of Lucene/Solr Payloads. Your implementation is very likely to be different, but perhaps my post can provide you with some pointers. In any case, all the code for this example can be found in my project on GitHub.

Friday, July 05, 2013

Bayesian Network Inference with R and bnlearn


The Web Intelligence and Big Data course at Coursera had a section on Bayesian Networks. The associated programming assignment was to answer a couple of questions about a fairly well-known (in retrospect) Bayesian network called "asia" or "chest clinic". The approach illustrated in the course was to use SQL, which worked great, but I wanted to see if I could also do it using Python or R.

I did look at Python first, but I was looking for a package that was mature and well-documented, and I couldn't find one (suggestions welcome). In the R world, I found gRain and bnlearn. gRain looked promising initially, although the installation was somewhat dodgy. However, I found later that the version I had wouldn't let me apply evidence, ie setFinding() or setEvidence() seemed to have no effect, so I ditched it in favor of bnlearn.

The network is shown below. Each node in the network corresponds to a particular event and has probabilities associated with it. This is an example of a Bayesian network that has been built based on probabilities assigned by domain experts. You can read more about the asia network and Bayesian networks in general here.

[Figure: the asia network - A (visit to Asia) points to T (tuberculosis); S (smoking) points to L (lung cancer) and B (bronchitis); T and L point to E (tuberculosis or lung cancer); E points to X (positive x-ray); B and E point to D (dyspnoea).]
The code below defines this network in bnlearn, and then applies constraints (or evidence) to the network to get the probabilities of the three diseases: Tuberculosis (T), Lung Cancer (L) and Bronchitis (B).

library(bnlearn)

set.seed(42)

# a BN using expert knowledge
net <- model2network("[A][S][T|A][L|S][B|S][E|T:L][X|E][D|B:E]")
yn <- c("yes", "no")
cptA <- matrix(c(0.01, 0.99), ncol=2, dimnames=list(NULL, yn))
cptS <- matrix(c(0.5, 0.5), ncol=2, dimnames=list(NULL, yn))
cptT <- matrix(c(0.05, 0.95, 0.01, 0.99), 
               ncol=2, dimnames=list("T"=yn, "A"=yn))
cptL <- matrix(c(0.1, 0.9, 0.01, 0.99), 
               ncol=2, dimnames=list("L"=yn, "S"=yn))
cptB <- matrix(c(0.6, 0.4, 0.3, 0.7), 
               ncol=2, dimnames=list("B"=yn, "S"=yn))
# cptE and cptD are 3-d matrices, which don't exist in R, so
# need to build these manually as below.
cptE <- c(1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0)
dim(cptE) <- c(2, 2, 2)
dimnames(cptE) <- list("E"=yn, "L"=yn, "T"=yn)
cptX <- matrix(c(0.98, 0.02, 0.05, 0.95), 
               ncol=2, dimnames=list("X"=yn, "E"=yn))
cptD <- c(0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.1, 0.9)
dim(cptD) <- c(2, 2, 2)
dimnames(cptD) <- list("D"=yn, "E"=yn, "B"=yn)
net.disc <- custom.fit(net, dist=list(A=cptA, S=cptS, T=cptT, L=cptL, 
                                      B=cptB, E=cptE, X=cptX, D=cptD))

# Unit test: given no evidence, the chance of tuberculosis is about 1%
cpquery(net.disc, (T=="yes"), TRUE)
cpquery(net.disc, (L=="yes"), TRUE)
cpquery(net.disc, (B=="yes"), TRUE)
# [1] 0.01084444
# [1] 0.05428889
# [1] 0.4501667

# Question 1:
# Patient has recently visited Asia and does not smoke. Which is most
# likely?
# (a) the patient is more likely to have tuberculosis than anything else.
# (b) the chance that the patient has lung cancer is higher than he/she 
#     having tuberculosis
# (c) the patient is more likely to have bronchitis than anything else
# (d) the chance that the patient has tuberculosis is higher than he/she 
#     having bronchitis
cpquery(net.disc, (T=="yes"), (A=="yes" & S=="no"))
cpquery(net.disc, (L=="yes"), (A=="yes" & S=="no"))
cpquery(net.disc, (B=="yes"), (A=="yes" & S=="no"))
# [1] 0.04988124
# [1] 0.00462963
# [1] 0.316092
# shows that (c) is correct.

# Question 2
# The patient has recently visited Asia, does not smoke, is not 
# complaining of dyspnoea, but his/her x-ray shows a positive shadow
# (a) the patient most likely has tuberculosis, but lung cancer is 
#     almost equally likely
# (b) the patient most likely has tuberculosis as compared to any of 
#     the other choices
# (c) the patient most likely has bronchitis, and tuberculosis is 
#     almost equally likely
# (d) the patient most likely has tuberculosis, but bronchitis is 
#     almost equally likely
cpquery(net.disc, (T=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
cpquery(net.disc, (L=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
cpquery(net.disc, (B=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
# [1] 0.2307692
# [1] 0.04166667
# [1] 0.2105263
# shows that (d) is correct

We then try to build the network from data, which is probably going to be the more common case. The bnlearn package contains the "asia" dataset, which we load as follows; we then learn the network structure with hill-climbing (hc), fit its parameters with bn.fit, and ask the same questions of it. It turns out that the answers are the same as with the network built by domain experts.

# same BN using data
data(asia)
head(asia)
#    A   S   T  L   B   E   X   D
# 1 no yes  no no yes  no  no yes
# 2 no yes  no no  no  no  no  no
# 3 no  no yes no  no yes yes yes
# 4 no  no  no no yes  no  no yes
# 5 no  no  no no  no  no  no yes
# 6 no yes  no no  no  no  no yes

net.data <- bn.fit(hc(asia), asia)

# unit test
cpquery(net.data, (T=="yes"), TRUE)
cpquery(net.data, (L=="yes"), TRUE)
cpquery(net.data, (B=="yes"), TRUE)
# [1] 0.008564706
# [1] 0.066
# [1] 0.5077882

# question 1
cpquery(net.data, (T=="yes"), (A=="yes" & S=="no"))
cpquery(net.data, (L=="yes"), (A=="yes" & S=="no"))
cpquery(net.data, (B=="yes"), (A=="yes" & S=="no"))
# [1] 0.01630435
# [1] 0.02752294
# [1] 0.2978723
# still shows (c) is correct.

cpquery(net.data, (T=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
cpquery(net.data, (L=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
cpquery(net.data, (B=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
# [1] 0.1
# [1] 0
# [1] 0.1666667
# still shows (d) is correct.

One thing I noticed is that if you make the same cpquery call twice in a row, you get different answers. From what I understand about Bayesian networks, I don't think this should happen. I am not very familiar with the R ecosystem, and I couldn't find the appropriate R mailing list to ask about this; very likely I don't know where to look (if you do, please let me know).
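
My best guess (which I haven't verified) is that this happens because cpquery does approximate inference by sampling - the default method is logic sampling - so each call draws a fresh set of random samples. If that is right, fixing the seed and increasing the sample size via the n argument should make the estimates much more stable, along these lines:

# Sketch: cpquery estimates are Monte Carlo, so consecutive calls vary.
# Fixing the seed makes a run reproducible, and a larger n shrinks the
# run-to-run variance, so the two calls below should agree closely.
set.seed(42)
cpquery(net.disc, (T=="yes"), (A=="yes" & S=="no"), n=10^6)
cpquery(net.disc, (T=="yes"), (A=="yes" & S=="no"), n=10^6)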

In case you need it, all the code in this post, as well as the SQL+Python based one I wrote for the original assignment, is available on my github page for this project.